Python / BeautifulSoup: finding the previous row with a specific attribute
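I am scraping an HTML file that contains the table below. Getting the values out of the tr class="highlight1" rows and writing them to a CSV is no problem (the code I use is shown further down):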
<table>
<tr valign="bottom"><td colspan="3"><p> <br/><b>PASSERIFORMES: Cardinalidae</b></p></td></tr>
<tr class="highlight1"><td>Summer Tanager</td><td><a href="species.jsp?avibaseid=891798D9EFFE1F8D"><i>Piranga rubra</i></a></td><td>Piranga vermillon</td><td>Rare/Accidental </td></tr>
<tr class="highlight1"><td>Scarlet Tanager</td><td><a href="species.jsp?avibaseid=4210163221C2E458"><i>Piranga olivacea</i></a></td><td>Piranga écarlate</td><td>Rare/Accidental </td></tr>
<tr class="highlight1"><td>Rose-breasted Grosbeak</td><td><a href="species.jsp?avibaseid=7C2FCB13BAA660EE"><i>Pheucticus ludovicianus</i></a></td><td>Cardinal à poitrine rose</td><td>Rare/Accidental </td></tr>
<tr valign="bottom"><td colspan="3"><p> <br/><b>PASSERIFORMES: Buntings</b></p></td></tr>
<tr class="highlight1"><td>Indigo Bunting</td><td><a href="species.jsp?avibaseid=043F337AA25E7D97"><i>Passerina cyanea</i></a></td><td>Passerin indigo</td><td>Rare/Accidental </td></tr>
<tr class="highlight1"><td>Dickcissel</td><td><a href="species.jsp?avibaseid=592E58CE67D092DA"><i>Spiza americana</i></a></td><td>Dickcissel d'Amérique</td><td>Rare/Accidental </td></tr>
</table>
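What I want to do is also get the value in each tr valign="bottom" row. I know how to move forward and drill down with CSS selectors in BeautifulSoup, but I can't figure out how to go backward and select the tr valign="bottom" row that precedes each tr class="highlight1" row.
Each line of my CSV should carry that value alongside the species fields. This is the code I currently use for the highlight1 rows: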
import csv
from bs4 import BeautifulSoup

soup = BeautifulSoup(open(r"/Users/user/Downloads/birds.html"), 'lxml')

# One list per column, taken from the rows that carry a class attribute.
english = [item.text for item in soup.select('tr[class] td:nth-of-type(1)')]
latin = [item.text for item in soup.select('tr[class] td:nth-of-type(2)')]
french = [item.text for item in soup.select('tr[class] td:nth-of-type(3)')]
status = [item.text for item in soup.select('tr[class] td:nth-of-type(4)')]
link = [item['href'] for item in soup.select('tr[class] a[href]')]
test = zip(english, latin, french, status, link)

# newline='' keeps csv.writer from emitting blank lines on Windows.
with open('birdfile.csv', 'wt', newline='') as csvfile:
    csv_out = csv.writer(csvfile)
    csv_out.writerows(test)
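I couldn't find an example like this, so any help is much appreciated!
Solution
You can simply read the table into pandas and then slice and dice it however you see fit: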
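from io import StringIO
import pandas as pd

langs = """your html above"""
# read_html returns a list of DataFrames, one per <table>;
# recent pandas versions expect a file-like object rather than a raw HTML string.
df = pd.read_html(StringIO(langs))
df[0]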
Output (pardon the formatting):
   0                            1                            2                            3
0  PASSERIFORMES: Cardinalidae  PASSERIFORMES: Cardinalidae  PASSERIFORMES: Cardinalidae  NaN
1  Summer Tanager               Piranga rubra                Piranga vermillon            Rare/Accidental
2  Scarlet Tanager              Piranga olivacea             Piranga écarlate             Rare/Accidental
3  Rose-breasted Grosbeak       Pheucticus ludovicianus      Cardinal à poitrine rose     Rare/Accidental
4  PASSERIFORMES: Buntings      PASSERIFORMES: Buntings      PASSERIFORMES: Buntings      NaN
5  Indigo Bunting               Passerina cyanea             Passerin indigo              Rare/Accidental
6  Dickcissel                   Spiza americana              Dickcissel d'Amérique        Rare/Accidental
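The frame above still has the heading rows mixed in with the species rows. One way to turn each heading into a column is to forward-fill it, sketched below. The frame here is a hand-built, shortened stand-in for df[0], and the rule "a row is a heading when column 3 is empty" is an assumption that happens to hold for this table; the names is_heading, family and birds are illustrative only.

```python
import pandas as pd

# Hand-built, shortened stand-in for df[0] (assumption: same layout as the
# frame that read_html produced for the table in the question).
df = pd.DataFrame([
    ["PASSERIFORMES: Cardinalidae"] * 3 + [None],
    ["Summer Tanager", "Piranga rubra", "Piranga vermillon", "Rare/Accidental"],
    ["Scarlet Tanager", "Piranga olivacea", "Piranga écarlate", "Rare/Accidental"],
    ["PASSERIFORMES: Buntings"] * 3 + [None],
    ["Indigo Bunting", "Passerina cyanea", "Passerin indigo", "Rare/Accidental"],
])

# Assumption: heading rows are exactly the rows with no value in column 3.
is_heading = df[3].isna()

# Keep the heading text only on heading rows, then carry it down to the
# species rows that follow.
family = df[0].where(is_heading).ffill()

# Drop the heading rows and put the carried-down heading in front.
birds = df[~is_heading].copy()
birds.insert(0, "family", family[~is_heading])
```

From here, birds.to_csv("birdfile.csv", index=False, header=False) would write the rows out. Note that the link in each row's href is not part of the frame in this sketch.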
If you want a solution without pandas, you can use the following script:
from bs4 import BeautifulSoup
txt = '''
<table>
<tr valign="bottom"><td colspan="3"><p> <br/><b>PASSERIFORMES: Cardinalidae</b></p></td></tr>
<tr class="highlight1"><td>Summer Tanager</td><td><a href="species.jsp?avibaseid=891798D9EFFE1F8D"><i>Piranga rubra</i></a></td><td>Piranga vermillon</td><td>Rare/Accidental </td></tr>
<tr class="highlight1"><td>Scarlet Tanager</td><td><a href="species.jsp?avibaseid=4210163221C2E458"><i>Piranga olivacea</i></a></td><td>Piranga écarlate</td><td>Rare/Accidental </td></tr>
<tr class="highlight1"><td>Rose-breasted Grosbeak</td><td><a href="species.jsp?avibaseid=7C2FCB13BAA660EE"><i>Pheucticus ludovicianus</i></a></td><td>Cardinal à poitrine rose</td><td>Rare/Accidental </td></tr>
<tr valign="bottom"><td colspan="3"><p> <br/><b>PASSERIFORMES: Buntings</b></p></td></tr>
<tr class="highlight1"><td>Indigo Bunting</td><td><a href="species.jsp?avibaseid=043F337AA25E7D97"><i>Passerina cyanea</i></a></td><td>Passerin indigo</td><td>Rare/Accidental </td></tr>
<tr class="highlight1"><td>Dickcissel</td><td><a href="species.jsp?avibaseid=592E58CE67D092DA"><i>Spiza americana</i></a></td><td>Dickcissel d'Amérique</td><td>Rare/Accidental </td></tr>
</table>'''
soup = BeautifulSoup(txt, 'html.parser')

all_data = []
# Select every row that is not a heading row (heading rows contain a td with colspan).
for tr in soup.select('tr:not(:has(td[colspan]))'):
    all_data.append([
        # Text of the nearest preceding heading cell.
        tr.find_previous('td', {'colspan': True}).get_text(strip=True),
        *[td.get_text(strip=True) for td in tr.select('td')],
    ])

# print data to screen:
for row in all_data:
    print(*row, sep=',')
Prints:
PASSERIFORMES: Cardinalidae,Summer Tanager,Piranga rubra,Piranga vermillon,Rare/Accidental
PASSERIFORMES: Cardinalidae,Scarlet Tanager,Piranga olivacea,Piranga écarlate,Rare/Accidental
PASSERIFORMES: Cardinalidae,Rose-breasted Grosbeak,Pheucticus ludovicianus,Cardinal à poitrine rose,Rare/Accidental
PASSERIFORMES: Buntings,Indigo Bunting,Passerina cyanea,Passerin indigo,Rare/Accidental
PASSERIFORMES: Buntings,Dickcissel,Spiza americana,Dickcissel d'Amérique,Rare/Accidental
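Since the question also asks for the link and for CSV output, here is a rough sketch of a different approach: instead of looking backward from each species row, walk the rows once in document order and remember the last heading seen. The helper name rows_with_heading, the shortened html sample and the in-memory buffer are all illustrative assumptions, not part of the original answer.

```python
import csv
import io

from bs4 import BeautifulSoup

# Shortened copy of the table from the question.
html = """
<table>
<tr valign="bottom"><td colspan="3"><p> <br/><b>PASSERIFORMES: Cardinalidae</b></p></td></tr>
<tr class="highlight1"><td>Summer Tanager</td><td><a href="species.jsp?avibaseid=891798D9EFFE1F8D"><i>Piranga rubra</i></a></td><td>Piranga vermillon</td><td>Rare/Accidental </td></tr>
<tr valign="bottom"><td colspan="3"><p> <br/><b>PASSERIFORMES: Buntings</b></p></td></tr>
<tr class="highlight1"><td>Indigo Bunting</td><td><a href="species.jsp?avibaseid=043F337AA25E7D97"><i>Passerina cyanea</i></a></td><td>Passerin indigo</td><td>Rare/Accidental </td></tr>
</table>
"""


def rows_with_heading(markup):
    """Yield [heading, english, latin, french, status, link] per species row."""
    soup = BeautifulSoup(markup, "html.parser")
    heading = ""  # text of the most recent heading row seen so far
    for tr in soup.find_all("tr"):
        # Assumption: heading rows are the ones marked valign="bottom".
        if tr.get("valign") == "bottom":
            heading = tr.get_text(strip=True)
            continue
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        a = tr.find("a", href=True)
        yield [heading, *cells, a["href"] if a else ""]


rows = list(rows_with_heading(html))

# Written to an in-memory buffer to keep the sketch self-contained; for a real
# file, open("birdfile.csv", "w", newline="") can take the buffer's place.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
```

Reading every field from the same tr keeps the columns aligned even if a row is missing a cell or a link, which separately selected column lists zipped together would not guarantee.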