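How to fix web scraping an ASPX page with Python and BeautifulSoup when the parsed HTML is missing information from the original HTML
I am trying to scrape some data from an ASPX website: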
https://firmen.berlin/sites/fitber/search/showDetails.aspx
The problem is that the soup I produce does not contain the information I need. I used the following code:
from bs4 import BeautifulSoup
import requests
url = 'https://firmen.berlin/sites/fitber/search/defaultSearch.aspx'
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content,'lxml')
print(soup)
I want to build a result list from all the links in the original HTML, for example:
<a class="link-for-details" href="defaultSearch.aspx?SearchResult$Index=0">Züblin Spezialtiefbau GmbH Niederlassung Nord</a>
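If the parsed HTML does contain anchors like the one above, collecting them with BeautifulSoup is short. A minimal sketch, assuming the result links carry the class `link-for-details` as in the example; the function name `extract_company_names` and the sample markup are made up for illustration, and `html.parser` is used here only to avoid depending on `lxml`:

```python
from bs4 import BeautifulSoup

def extract_company_names(html):
    """Return the text of every <a class="link-for-details"> in html."""
    soup = BeautifulSoup(html, "html.parser")
    return [a.get_text(strip=True)
            for a in soup.find_all("a", class_="link-for-details")]

# Illustrative sample modelled on the anchor shown above
sample = '''
<table>
  <tr><td><a class="link-for-details" href="defaultSearch.aspx?SearchResult$Index=0">Züblin Spezialtiefbau GmbH Niederlassung Nord</a></td></tr>
  <tr><td><a href="other.aspx">unrelated link</a></td></tr>
</table>
'''
names = extract_company_names(sample)
print(names)
```

An empty list from this function means the anchors are simply not in the HTML that was fetched, which points at the request rather than at the parsing.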
In my soup I cannot even see this information, so I am struggling to extract it. My soup looks like this:
</tr>
</table></td><td><img alt="" src="/WebResource.axd?d=PAq-a1as6t-LReK0Ct4W-a-FZXy55jP40uRx7Q6LRhJW2XWPBaE5o5LkeHDfHMhcfRQjpBE01XueKWdcLlg1A_aQI6me1x6xrA18XieG9iOnaJs-0&t=637103382965614113"/></td><td nowrap="nowrap"><input id="ctl00_BodyPanel_ContentPanel_FitContent_SearchPanel_PanelLegalForm_DetailSearchLegalFormSwitcher_DetailSearchLegalForm2_TreeSelector_ctl12n58CheckBox" name="ctl00_BodyPanel_ContentPanel_FitContent_SearchPanel_PanelLegalForm_DetailSearchLegalFormSwitcher_DetailSearchLegalForm2_TreeSelector_ctl12n58CheckBox" type="checkbox"/><span id="ctl00_BodyPanel_ContentPanel_FitContent_SearchPanel_PanelLegalForm_DetailSearchLegalFormSwitcher_DetailSearchLegalForm2_TreeSelector_ctl12t58">UG (haftungsbeschränkt)</span></td>
</tr>
</table><table cellpadding="0" cellspacing="0">
<tr>
<td><table width="20">
<tr>
Maybe someone knows how to turn the ASPX page into a proper HTML soup so that I can extract the links.
Many thanks.
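For context, the soup excerpt above (control IDs starting with `ctl00_`, a `WebResource.axd` resource) looks like ASP.NET Web Forms. On such pages the result list is usually produced only after the form is posted back together with hidden state fields such as `__VIEWSTATE`, so a plain GET returns just the search form. The following is a sketch of how those hidden fields could be collected with BeautifulSoup; the function name `collect_hidden_fields` and the sample form are hypothetical, and whether this particular site exposes these fields or accepts such a request without a browser has not been verified:

```python
from bs4 import BeautifulSoup

def collect_hidden_fields(html):
    """Map name -> value for every named <input type="hidden"> in html."""
    soup = BeautifulSoup(html, "html.parser")
    fields = {}
    for tag in soup.find_all("input", attrs={"type": "hidden"}):
        name = tag.get("name")
        if name:
            fields[name] = tag.get("value", "")
    return fields

# Hypothetical form; the values are dummies, not taken from the real site
sample_form = '''
<form method="post" action="defaultSearch.aspx">
  <input type="hidden" name="__VIEWSTATE" value="dummy-viewstate" />
  <input type="hidden" name="__EVENTVALIDATION" value="dummy-validation" />
  <input type="checkbox" name="someCheckBox" />
</form>
'''
fields = collect_hidden_fields(sample_form)
print(fields)
```

A dictionary like this would be the starting point for a `requests.Session().post(url, data=...)` call, extended with the names of the controls being submitted. Driving a real browser avoids having to reconstruct that payload by hand.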
Solution
I solved it with Selenium. Here is the code:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
import time
# Selenium 4 takes the chromedriver path through a Service object
driver = webdriver.Chrome(service=Service('/Users/username/Anaconda3/chromedriver_win32/chromedriver.exe'))
driver.get('https://firmen.berlin/sites/fitber/search/defaultSearch.aspx')
# Tick both terms-and-conditions (AGB) checkboxes and confirm
driver.find_element(By.ID, "ctl00_BodyPanel_ContentPanel_FitContent_SearchPanel_CheckBoxPanel_CheckBoxAGB1").click()
driver.find_element(By.ID, "ctl00_BodyPanel_ContentPanel_FitContent_SearchPanel_CheckBoxPanel_CheckBoxAGB2").click()
driver.find_element(By.ID, "ctl00_BodyPanel_ContentPanel_FitContent_SearchPanel_CheckBoxPanel_ButtonConfirm").click()
# Open the employee filter panel and tick the wanted checkboxes in its tree selector
driver.find_element(By.ID, "ctl00_BodyPanel_ContentPanel_FitContent_SearchPanel_PanelTitleEmployee_Title").click()
driver.find_element(By.ID, "ctl00_BodyPanel_ContentPanel_FitContent_SearchPanel_DetailSearchEmployeeSwitcher_DetailSearchEmployee2_TreeSelector_ctl06n9CheckBox").click()
driver.find_element(By.ID, "ctl00_BodyPanel_ContentPanel_FitContent_SearchPanel_DetailSearchEmployeeSwitcher_DetailSearchEmployee2_TreeSelector_ctl06n10CheckBox").click()
driver.find_element(By.ID, "ctl00_BodyPanel_ContentPanel_FitContent_SearchPanel_DetailSearchEmployeeSwitcher_DetailSearchEmployee2_TreeSelector_ctl06n11CheckBox").click()
driver.find_element(By.ID, "ctl00_BodyPanel_ContentPanel_FitContent_SearchPanel_DetailSearchEmployeeSwitcher_DetailSearchEmployee2_TreeSelector_ctl06n12CheckBox").click()
driver.find_element(By.ID, "ctl00_BodyPanel_ContentPanel_FitContent_SearchPanel_DetailSearchEmployeeSwitcher_DetailSearchEmployee2_TreeSelector_ctl06n13CheckBox").click()
# Click the tree selector's ctl10 control, then give the page a moment to update
driver.find_element(By.NAME, "ctl00$BodyPanel$ContentPanel$FitContent$SearchPanel$DetailSearchEmployeeSwitcher$DetailSearchEmployee2$TreeSelector$ctl10").click()
time.sleep(3)
# Start the search
driver.find_element(By.ID, "ctl00_BodyPanel_ContentPanel_FitContent_SearchPanel_ButtonPanel_ButtonSearch").click()
# Wait up to 10 seconds for the result links to appear, then collect their text
company_links = WebDriverWait(driver, 10).until(
    lambda d: d.find_elements(By.CLASS_NAME, "link-for-details"))
company_list = [link.text for link in company_links]
print(company_list)
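Since the original goal was a list of links rather than names only, the rendered page can also be handed back to BeautifulSoup. A minimal sketch, assuming the results page still uses the `link-for-details` class with relative `href` values as in the question; the helper name `links_from_page_source` and the sample markup are illustrative only:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def links_from_page_source(page_source, base_url):
    """Return name and absolute URL for every result link in page_source."""
    soup = BeautifulSoup(page_source, "html.parser")
    return [
        {"name": a.get_text(strip=True), "url": urljoin(base_url, a["href"])}
        for a in soup.find_all("a", class_="link-for-details", href=True)
    ]

# Illustrative sample modelled on the anchor from the question
sample_page = (
    '<a class="link-for-details" href="defaultSearch.aspx?SearchResult$Index=0">'
    'Züblin Spezialtiefbau GmbH Niederlassung Nord</a>'
)
base = "https://firmen.berlin/sites/fitber/search/defaultSearch.aspx"
results = links_from_page_source(sample_page, base)
print(results)
```

With the Selenium session above, the call would be `links_from_page_source(driver.page_source, driver.current_url)` once the results have loaded.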