Selenium web scraping nested divs with no IDs or class names

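I'm trying to use Selenium to get the product names and quantities from a nested HTML table. My problem is that some of the divs don't have any ID or class name. The table I'm trying to access is the Critical Products list. Here is what I've done so far, but I'm lost on how to reach the nested divs. The site URL is in the code.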

import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('start-maximized')

driver = webdriver.Chrome(chrome_options=options, executable_path=r'/usr/local/bin/chromedriver/')
url = 'https://www.rrpcanada.org/#/'  # site I'm scraping
driver.get(url)
time.sleep(150)
page = driver.page_source
driver.quit()

html_soup = BeautifulSoup(page, 'html.parser')
item_containers = html_soup.find_all('div', class_='critical-products-title hide-mobile')

if item_containers:
    for item in item_containers:
        # need to loop the inner divs to reach the href and then get to the
        # left and right classes to get title and quantity
        for link in item.find_all('a'):
            print(item)

Here is an image of the inspected markup. I'd like to be able to loop over all the divs and get the title and quantity.

Solution

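You don't need BeautifulSoup, and you don't need to save page_source. I use a CSS selector to select all the target rows in the table, then apply a list comprehension to pick out the left and right side of each row. The results are collected into a list of tuples.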

import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('start-maximized')

driver = webdriver.Chrome(chrome_options=options, executable_path=r'/usr/local/bin/chromedriver/')
url = 'https://www.rrpcanada.org/#/'  # site I'm scraping
driver.get(url)
time.sleep(150)

# One element per row of the table
elements = driver.find_elements_by_css_selector('#app > div:nth-child(1) > div.header-wrapper > div.header-right > div.critical-product-table-container > div.table.shorten.hide-mobile > div > div > div > a > div')

# Build a (name, quantity) tuple from the left and right side of each row
targetted_values = [
    (element.find_element_by_css_selector('.line-item-left').text,
     element.find_element_by_css_selector('.line-item-right').text)
    for element in elements
]

driver.quit()

Sample output of targetted_values:

[('Surgical & Reusable Masks','376,713,363 available'),('Disposable Gloves','66,962,093 available'),('Gowns and Coveralls','40,502,145 available'),('Respirators','22,189,273 available'),('Surface Wipes','20,650,831 available'),('Face Shields','16,535,686 available'),('Hand Sanitizer','11,152,890 L available'),('Thermometers','8,457,993 available'),('Testing Kits','2,110,815 available'),('Surface Solutions','107,452 L available'),('Protective Barriers','10,833 available'),('Ventilators','410 available')]
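The quantities in this list are still display strings. If numeric values are needed, something like the following sketch could convert them. The helper name parse_quantity and the regular expression are illustrative assumptions based only on the sample strings shown here (an optional unit such as "L" followed by the word "available"); they are not part of the original answer.

```python
import re

# Hypothetical helper, not part of the original answer. The pattern is an
# assumption derived from sample strings such as '11,152,890 L available'.
_QUANTITY = re.compile(r'^([\d,]+)\s*([A-Za-z]*)\s+available$')


def parse_quantity(text):
    """Split a display string into an integer count and an optional unit."""
    match = _QUANTITY.match(text.strip())
    if match is None:
        raise ValueError(f'unrecognised quantity: {text!r}')
    number, unit = match.groups()
    # Drop the thousands separators before converting to int
    return int(number.replace(',', '')), unit or None


sample = [
    ('Hand Sanitizer', '11,152,890 L available'),
    ('Ventilators', '410 available'),
]
for name, quantity in sample:
    count, unit = parse_quantity(quantity)
    print(name, count, unit)
```

Strings that do not match the assumed shape raise ValueError rather than being silently skipped, so a change in the site's wording would be noticed.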

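To print the product names and quantities, you need to induce WebDriverWait for visibility_of_all_elements_located(), and you can use either of the following Locator Strategies:

Using CSS_SELECTOR and the text attribute: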

driver.get('https://www.rrpcanada.org/#/')
# Wait up to 20 seconds for the product titles to become visible, then read their text
items = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.table.shorten.hide-mobile > div div.line-item-title")))]
# Same wait for the quantities
quantities = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.table.shorten.hide-mobile > div div.line-item-bold.available")))]
for i, j in zip(items, quantities):
    print(i, j)

Using XPATH and get_attribute("innerHTML"):

driver.get('https://www.rrpcanada.org/#/')
# Wait up to 20 seconds for the product titles to become visible, then read their innerHTML
items = [my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='table shorten hide-mobile']/div//div[@class='line-item-title']")))]
# Same wait for the quantities
quantities = [my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='table shorten hide-mobile']/div//div[@class='line-item-bold available']")))]
for i, j in zip(items, quantities):
    print(i, j)

Console output:

Surgical &amp; Reusable Masks  376,363 available
Disposable Gloves  66,093 available
Gowns and Coveralls  40,145 available
Respirators  22,273 available
Surface Wipes  20,831 available
Face Shields  16,686 available
Hand Sanitizer  11,890 L available
Thermometers  8,993 available
Testing Kits  2,815 available
Surface Solutions  107,452 L available
Protective Barriers  10,833 available
Ventilators  410 available
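The first line of this output contains &amp; because innerHTML returns the raw markup, where the ampersand is escaped, whereas the text attribute returns the rendered text. If the innerHTML variant is used and plain text is wanted, the standard library's html.unescape can decode such entities. This is a minimal sketch using a string copied from the output above:

```python
from html import unescape

# Raw value as returned by get_attribute("innerHTML") in the output above
raw_name = 'Surgical &amp; Reusable Masks'

# Decode HTML character references back to plain characters
clean_name = unescape(raw_name)
print(clean_name)  # Surgical & Reusable Masks
```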

Note: You have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

You can find a relevant discussion in How to retrieve the text of a WebElement using Selenium - Python

Outro

Links to useful documentation:

The get_attribute() method gets the given attribute or property of the element.
The text attribute returns the text of the element.
Difference between text and innerHTML using Selenium

You have to use a relative XPath to find the elements with class="line-item-left" for the name of each item, and the elements with class="line-item-right" for the number of items available.

driver.find_elements_by_class_name("line-item-left")   # Item names
driver.find_elements_by_class_name("line-item-right")  # Number of items available

Note the "s" in "elements".
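The plural form matters because find_elements returns a list of every match, while the singular find_element returns only the first one. Newer Selenium 4 releases no longer provide the find_elements_by_* helpers and use find_elements(by, value) instead. The sketch below pairs the two columns using that call shape. So that it runs without a browser, it uses the literal string "class name" (the value of Selenium's By.CLASS_NAME) and two stub classes, StubDriver and StubElement, which are hypothetical stand-ins and not part of Selenium; pair_columns is likewise an illustrative name.

```python
# The string value of selenium.webdriver.common.by.By.CLASS_NAME
CLASS_NAME = "class name"


def pair_columns(driver):
    """Return (name, quantity) text pairs from the left and right columns."""
    names = driver.find_elements(CLASS_NAME, "line-item-left")
    totals = driver.find_elements(CLASS_NAME, "line-item-right")
    return [(name.text, total.text) for name, total in zip(names, totals)]


class StubElement:
    """Hypothetical stand-in for a WebElement; only exposes .text."""

    def __init__(self, text):
        self.text = text


class StubDriver:
    """Hypothetical stand-in for a WebDriver, for demonstration only."""

    def __init__(self, columns):
        self._columns = columns

    def find_elements(self, by, value):
        # Like the real call, return a list (empty if nothing matches)
        return [StubElement(text) for text in self._columns.get(value, [])]


driver = StubDriver({
    "line-item-left": ["Respirators", "Ventilators"],
    "line-item-right": ["22,189,273 available", "410 available"],
})
pairs = pair_columns(driver)
print(pairs)
```

With a real Selenium 4 driver, the same function could be called with the driver object directly, passing By.CLASS_NAME in place of the literal string.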

Here is the selector for the product name:

div.critical-product-table-container div.line-item-left

And for the total:

div.critical-product-table-container div.line-item-right

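The approach below, however, does not use BeautifulSoup.

time.sleep(...) is bad practice; use WebDriverWait instead.

To combine the two variables above and loop over them in parallel, I used the zip() function: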

url = 'https://www.rrpcanada.org/#/' # site I'm scraping
driver.get(url)
wait = WebDriverWait(driver,150)
product_names = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR,'div.critical-product-table-container div.line-item-left')))
totals = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR,'div.critical-product-table-container div.line-item-right')))

for product_name,total in zip(product_names,totals):
    print(product_name.text +'--' +total.text)
    
driver.quit()

You need the following imports:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
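To illustrate why an explicit wait is preferable to a fixed time.sleep(150), here is a simplified polling loop in plain Python. It is only a conceptual sketch of the idea behind WebDriverWait, not Selenium's actual implementation; the function name wait_until and its parameters are assumptions made for this example. The point is that the loop returns as soon as the condition holds, and only waits the full timeout in the failure case.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once timeout seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f'condition not met within {timeout} seconds')
        time.sleep(poll_interval)


# Demonstration with a condition that succeeds on its third call
attempts = []


def ready_after_three_calls():
    attempts.append(1)
    return ['row'] if len(attempts) >= 3 else []


rows = wait_until(ready_after_three_calls, timeout=2.0, poll_interval=0.01)
print(rows, len(attempts))
```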
