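Python - Selenium - Unable to scrape specific text content from an HTML web page

How to solve: Python - Selenium - Unable to scrape specific text content from an HTML web page

I am trying to web-scrape this part of the HTML: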

<td class="zebraTable__td zebraTable__td--companyName"><a href="/unternehmen/8116602/schneider-electric-holding-germany-gmbh" data-gtm="companySearch__searchResult--76">
                        Schneider Electric Holding Germany GmbH
                    </a></td>

from this site:

https://de.statista.com/companydb/suche?idCountry=276&idBranch=0&revenueFrom=-1000000000000000000&revenueTo=1000000000000000000&employeesFrom=0&employeesTo=100000000&sortMethod=revenueDesc&p=4

using this code:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
import pandas as pd
import time 

driver = webdriver.Chrome('/Users/rieder/Anaconda3/chromedriver_win32/chromedriver.exe')

driver.get('https://de.statista.com/companydb/suche?idCountry=276&idBranch=0&revenueFrom=-1000000000000000000&revenueTo=1000000000000000000&employeesFrom=500&employeesTo=100000000&sortMethod=revenueDesc&p=1')

driver.find_element_by_id("cookiesNotificationConfirm").click();

company_name = driver.find_element_by_class_name('zebraTable__td zebraTable__td--companyName')

print(company_name)

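I have been trying for 4 hours but cannot get it to work. I also tried XPath, link text and other locator methods, but all I get back is an empty company name, e.g. "[]".

Does anyone know how Selenium can find the exact text, for example "Liebherr-Hausgeräte Ochsenhausen GmbH"?

Thank you very much.

Solution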

The content you are looking for is part of the page's source code, so you do not need Selenium to get it. Simply fetch the page with requests, then use Beautiful Soup to find the data.
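The requests and Beautiful Soup approach described above could look roughly like the following. This is a minimal sketch, not the answerer's code: the function names, the CSS selector (derived from the HTML snippet in the question) and the browser-like User-Agent header are assumptions, and whether the site serves the table to a non-browser client has not been verified here.

```python
import requests
from bs4 import BeautifulSoup


def parse_company_names(html):
    """Return the text of every company-name link found in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    # Selector is an assumption based on the <td> shown in the question.
    links = soup.select("td.zebraTable__td--companyName > a")
    # strip=True drops the newlines and indentation around the name.
    return [link.get_text(strip=True) for link in links]


def fetch_company_names(url, timeout=10):
    """Download a result page and extract the company names from it."""
    # The User-Agent value is an assumption; some sites reject the default one.
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=timeout)
    response.raise_for_status()
    return parse_company_names(response.text)


# Offline check against the markup from the question (no network access needed).
sample = """
<table><tr>
<td class="zebraTable__td zebraTable__td--companyName"><a href="/unternehmen/8116602/schneider-electric-holding-germany-gmbh">
                        Schneider Electric Holding Germany GmbH
                    </a></td>
</tr></table>
"""
print(parse_company_names(sample))
```

To run it against the live site, pass the search URL from the question to fetch_company_names().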

To print the text Schneider Electric Holding Germany GmbH with Selenium, you have to induce WebDriverWait for visibility_of_element_located(), and you can use either of the following Locator Strategies:

Using CSS_SELECTOR and the text attribute:

driver.get('https://de.statista.com/companydb/suche?idCountry=276&idBranch=0&revenueFrom=-1000000000000000000&revenueTo=1000000000000000000&employeesFrom=0&employeesTo=100000000&sortMethod=revenueDesc&p=4')
WebDriverWait(driver,20).until(EC.element_to_be_clickable((By.CSS_SELECTOR,"button#cookiesNotificationConfirm"))).click()
print(WebDriverWait(driver,20).until(EC.visibility_of_element_located((By.CSS_SELECTOR,"table.zebraTable.zebraTable--companies tr:nth-child(2)>td.zebraTable__td.zebraTable__td--companyName>a"))).text)

Using XPATH and get_attribute("innerHTML"):

driver.get('https://de.statista.com/companydb/suche?idCountry=276&idBranch=0&revenueFrom=-1000000000000000000&revenueTo=1000000000000000000&employeesFrom=0&employeesTo=100000000&sortMethod=revenueDesc&p=4')
WebDriverWait(driver,20).until(EC.element_to_be_clickable((By.XPATH,"//button[@id='cookiesNotificationConfirm']"))).click()
print(WebDriverWait(driver,20).until(EC.visibility_of_element_located((By.XPATH,"//table[@class='zebraTable zebraTable--companies']//following::tr[2]/td[@class='zebraTable__td zebraTable__td--companyName']/a"))).get_attribute("innerHTML"))

Console output:

Schneider Electric Holding Germany GmbH
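The locators above pick out a single row. As a hedged sketch of how the same idea might be extended to every company name on the page: By.CLASS_NAME expects one class name, while the class attribute in the question holds two separated by a space, so the names are chained into a CSS selector instead. The helper names compound_class_to_css and company_names are hypothetical, and the code assumes Selenium 4, where the find_element_by_* helpers used in the question are no longer available.

```python
def compound_class_to_css(tag, class_attr):
    """Turn a space-separated class attribute into a chained CSS selector."""
    return tag + "".join("." + name for name in class_attr.split())


def company_names(driver, timeout=20):
    """Wait for the company-name links to become visible and return their texts."""
    # Selenium is imported here so the helper above works without it installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    selector = compound_class_to_css(
        "td", "zebraTable__td zebraTable__td--companyName"
    ) + " > a"
    links = WebDriverWait(driver, timeout).until(
        EC.visibility_of_all_elements_located((By.CSS_SELECTOR, selector))
    )
    return [link.text.strip() for link in links]


print(compound_class_to_css("td", "zebraTable__td zebraTable__td--companyName"))
```

With a driver that has already loaded the result page and dismissed the cookie banner, company_names(driver) should return a list of strings rather than a WebElement object.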

Note: You have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

You can find a relevant discussion in How to retrieve the text of a WebElement using Selenium - Python

Outro

Links to useful documentation:

The get_attribute() method: Gets the given attribute or property of the element.
The text attribute returns: The text of the element.
Difference between text and innerHTML using Selenium
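One practical consequence of the difference between text and innerHTML: innerHTML is the raw markup inside the element, so for the anchor in the question it may include the surrounding newlines and indentation, and characters such as & arrive as entities. A minimal sketch of tidying such a value follows; the helper name clean_inner_html is hypothetical, and it assumes the element contains only text, since nested tags would be left in place.

```python
import html


def clean_inner_html(raw):
    """Decode HTML entities and collapse runs of whitespace into single spaces."""
    return " ".join(html.unescape(raw).split())


# Value shaped like the inner markup of the <a> element in the question.
raw = "\n                        Schneider Electric Holding Germany GmbH\n                    "
print(clean_inner_html(raw))
```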
