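How to scrape nested divs that have no ID or class name with Selenium
I'm trying to use Selenium to get product names and quantities from a nested HTML table. My problem is that some of the divs have no ID or class name. The table I'm trying to reach is the critical products list. Here is what I have so far, but I'm lost on how to get to the nested divs. The site URL is in the code.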
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument('start-maximized')
driver = webdriver.Chrome(chrome_options=options,executable_path=r'/usr/local/bin/chromedriver/')
url = 'https://www.rrpcanada.org/#/' # site I'm scraping
driver.get(url)
time.sleep(150)  # fixed wait for the page to render
page = driver.page_source
driver.quit()
html_soup = BeautifulSoup(page,'html.parser')
item_containers = html_soup.find_all('div',class_='critical-products-title hide-mobile')
if item_containers:
    for item in item_containers:
        # need to loop the inner divs to reach the href and then get to the left and right classes to get title and quantity
        for link in item.findAll('a'):
            print(item)
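Here is a screenshot of the inspector. I want to be able to loop over all the divs and get the title and quantity.
Solution
You don't need Beautiful Soup, and you don't need to save page_source. I select all the target rows in the table with a CSS selector, then use a list comprehension to pick out the left and right side of each row. The result is a list of tuples.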
options = Options()
options.add_argument('start-maximized')
driver = webdriver.Chrome(chrome_options=options,executable_path=r'/usr/local/bin/chromedriver/')
url = 'https://www.rrpcanada.org/#/' # site I'm scraping
driver.get(url)
time.sleep(150)
elements = driver.find_elements_by_css_selector('#app > div:nth-child(1) > div.header-wrapper > div.header-right > div.critical-product-table-container > div.table.shorten.hide-mobile > div > div > div > a > div')
targetted_values = [(element.find_element_by_css_selector('.line-item-left').text,element.find_element_by_css_selector('.line-item-right').text) for element in elements]
driver.quit()
Sample output of targetted_values:
[('Surgical & Reusable Masks','376,713,363 available'),('Disposable Gloves','66,962,093 available'),('Gowns and Coveralls','40,502,145 available'),('Respirators','22,189,273 available'),('Surface Wipes','20,650,831 available'),('Face Shields','16,535,686 available'),('Hand Sanitizer','11,152,890 L available'),('Thermometers','8,457,993 available'),('Testing Kits','2,110,815 available'),('Surface Solutions','107,452 L available'),('Protective Barriers','10,833 available'),('Ventilators','410 available')]
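If the quantities are needed as numbers rather than strings, the tuples above can be post-processed. The following is a minimal sketch, assuming every quantity string looks like the samples shown (digits with thousands separators, an optional unit such as "L", then the word "available"). The names QUANTITY_RE and parse_quantity are hypothetical helpers, not part of Selenium.

```python
import re

# A few rows copied from the sample output above: (product name, quantity text).
rows = [
    ('Surgical & Reusable Masks', '376,713,363 available'),
    ('Hand Sanitizer', '11,152,890 L available'),
    ('Ventilators', '410 available'),
]

# Assumed format: "<digits with commas> [unit] available".
QUANTITY_RE = re.compile(r'^([\d,]+)\s*([A-Za-z]*)\s+available$')


def parse_quantity(text):
    """Return (amount, unit) for strings like '11,152,890 L available'."""
    match = QUANTITY_RE.match(text.strip())
    if match is None:
        raise ValueError(f'unrecognised quantity: {text!r}')
    amount = int(match.group(1).replace(',', ''))
    unit = match.group(2) or None  # None when no unit is given
    return amount, unit


parsed = {name: parse_quantity(quantity) for name, quantity in rows}
for name, (amount, unit) in parsed.items():
    print(name, amount, unit)
```

Raising on an unrecognised string, instead of silently skipping it, makes it obvious when the page layout changes.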
To print the product names and quantities, you need to use WebDriverWait with visibility_of_all_elements_located(), and you can use either of the following Locator Strategies:
- Using CSS_SELECTOR and the text attribute:
driver.get('https://www.rrpcanada.org/#/')
items = [my_elem.text for my_elem in WebDriverWait(driver,20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR,"div.table.shorten.hide-mobile > div div.line-item-title")))]
quantities = [my_elem.text for my_elem in WebDriverWait(driver,20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR,"div.table.shorten.hide-mobile > div div.line-item-bold.available")))]
for i,j in zip(items,quantities):
    print(i,j)
- Using XPATH and get_attribute("innerHTML"):
driver.get('https://www.rrpcanada.org/#/')
items = [my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver,20).until(EC.visibility_of_all_elements_located((By.XPATH,"//div[@class='table shorten hide-mobile']/div//div[@class='line-item-title']")))]
quantities = [my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver,20).until(EC.visibility_of_all_elements_located((By.XPATH,"//div[@class='table shorten hide-mobile']/div//div[@class='line-item-bold available']")))]
for i,j in zip(items,quantities):
    print(i,j)
- Console output:
Surgical & Reusable Masks 376,713,363 available
Disposable Gloves 66,962,093 available
Gowns and Coveralls 40,502,145 available
Respirators 22,189,273 available
Surface Wipes 20,650,831 available
Face Shields 16,535,686 available
Hand Sanitizer 11,152,890 L available
Thermometers 8,457,993 available
Testing Kits 2,110,815 available
Surface Solutions 107,452 L available
Protective Barriers 10,833 available
Ventilators 410 available
- Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
You can find a relevant discussion in How to retrieve the text of a WebElement using Selenium - Python.
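For readers on recent Selenium 4 releases, where the find_element(s)_by_* helpers used in the first answer are no longer available, the row-based approach can be written with find_element(By..., ...) instead. This is only a sketch: the selectors are copied from the answers above and may no longer match the live site, and rows_to_pairs and scrape are hypothetical names chosen for illustration.

```python
def rows_to_pairs(rows, by_css):
    """Turn row elements into (name, quantity) tuples."""
    return [
        (row.find_element(by_css, '.line-item-left').text,
         row.find_element(by_css, '.line-item-right').text)
        for row in rows
    ]


def scrape(url='https://www.rrpcanada.org/#/', timeout=20):
    # Imported lazily so rows_to_pairs can be exercised without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumes a Chrome driver can be located on this machine
    try:
        driver.get(url)
        # Row selector taken from the first answer; an assumption about the page layout.
        rows = WebDriverWait(driver, timeout).until(
            EC.visibility_of_all_elements_located(
                (By.CSS_SELECTOR,
                 'div.table.shorten.hide-mobile > div > div > div > a > div')))
        return rows_to_pairs(rows, By.CSS_SELECTOR)
    finally:
        driver.quit()  # always close the browser, even if the wait times out
```

Calling scrape() would open a browser and return the list of tuples. Reading both cells from the same row element keeps each name paired with its own quantity, which avoids relying on two separate lists staying in the same order.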
Outro
Links to useful documentation:
- The get_attribute() method: Gets the given attribute or property of the element.
- The text attribute: returns the text of the element.
- Difference between text and innerHTML using Selenium
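One practical consequence of the text versus innerHTML distinction: innerHTML is serialized markup, so a name such as "Surgical & Reusable Masks" would typically come back with the ampersand escaped as &amp;amp;, and any child tags included. The sketch below shows one rough way to normalise such a string with the standard library. inner_html_to_text is a hypothetical helper, and the regex tag stripping is a simplification that is only reasonable for small, simple fragments.

```python
import html
import re


def inner_html_to_text(markup):
    """Roughly convert an innerHTML string to plain text."""
    without_tags = re.sub(r'<[^>]+>', '', markup)  # drop any child tags
    return html.unescape(without_tags).strip()     # decode entities such as &amp;


raw = 'Surgical &amp; Reusable Masks'  # what innerHTML would typically return
print(inner_html_to_text(raw))
```

When only the visible label is needed, reading the text attribute directly avoids this extra step.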
You have to use a relative XPath to find the elements with class="line-item-left" for the name of each item, and the elements with class="line-item-right" for the number of items available.
driver.find_elements_by_class_name("line-item-left")  # item names
driver.find_elements_by_class_name("line-item-right")  # number of items available
Note the "s" in find_elements.
This is the selector for the product name:
div.critical-product-table-container div.line-item-left
And for the total:
div.critical-product-table-container div.line-item-right
The approach below does not use BeautifulSoup.
time.sleep(...) is bad practice; use WebDriverWait instead.
To combine the two lists above and loop over them in parallel, I used the zip() function:
url = 'https://www.rrpcanada.org/#/' # site I'm scraping
driver.get(url)
wait = WebDriverWait(driver,150)
product_names = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR,'div.critical-product-table-container div.line-item-left')))
totals = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR,'div.critical-product-table-container div.line-item-right')))
for product_name,total in zip(product_names,totals):
    print(product_name.text +'--' +total.text)
driver.quit()
You need the following imports:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
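To see why an explicit wait is preferable to a fixed time.sleep(150), it helps to look at the idea behind it: poll a condition and return as soon as it holds, failing only when the timeout elapses. The following is a simplified illustration of that polling idea in plain Python, not Selenium's actual implementation. wait_until and ready_on_third_call are hypothetical names.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it is truthy or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # return immediately once the condition holds
        if time.monotonic() >= deadline:
            raise TimeoutError(f'condition not met within {timeout} s')
        time.sleep(poll_interval)


# Stand-in for "the rows are present": becomes truthy on the third poll.
calls = []


def ready_on_third_call():
    calls.append(1)
    return ['row'] if len(calls) >= 3 else []


rows = wait_until(ready_on_third_call, timeout=2.0, poll_interval=0.01)
print(rows, len(calls))
```

With this pattern the 150 passed to WebDriverWait in the answer above acts as an upper bound rather than a fixed delay, so the script continues as soon as the elements are found.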