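How to scrape multiple data types into the same DataFrame
I am trying to scrape this site: https://www.basketball-reference.com/players/a/
My end goal is to build a DataFrame of that table, plus a new column containing each player's index. For example, for the first player in the table, the index is abdelal01.
My current attempt: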
from urllib.request import urlopen

import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.basketball-reference.com/players/a"
# Fetch the HTML from the given URL
html = urlopen(url)
soup = BeautifulSoup(html, "html.parser")
# Column names come from the <th> cells of the first row
headers = [th.getText() for th in soup.findAll('tr')[0].findAll('th')]
rows = soup.findAll('tr')
# Player names are stored in <th> cells
player_names = [[th.getText() for th in rows[i].findAll('th')]
                for i in range(len(rows))]
names = pd.DataFrame(player_names, columns=headers)
names.head(10)
# The remaining columns are stored in <td> cells
player_stats = [[td.getText() for td in rows[i].findAll('td')]
                for i in range(len(rows))]
stats = pd.DataFrame(player_stats, columns=headers[1:])
stats['Player'] = names['Player']
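Basically, this rebuilds the table completely, but without the players' URLs. Given that the names and the stats sit in different elements in the HTML (<th> versus <td>), is there a simpler way than building two DataFrames?
What is the best way to collect the index for each player?
Solution
The simplest way to extract table data is with the pandas package; the result can then be manipulated easily.
The read_html() function reads every table on the page and returns them as a list of DataFrames.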
import pandas as pd
df = pd.read_html('https://www.basketball-reference.com/players/a/')[0]
df
Output
                   Player  From    To  Pos    Ht   Wt         Birth Date                  Colleges
0          Alaa Abdelnaby  1991  1995  F-C  6-10  240      June 24, 1968                      Duke
1         Zaid Abdul-Aziz  1969  1978  C-F   6-9  235      April 7, 1946                Iowa State
2    Kareem Abdul-Jabbar*  1970  1989    C   7-2  225     April 16, 1947                      UCLA
3      Mahmoud Abdul-Rauf  1991  2001    G   6-1  162      March 9, 1969                       LSU
4       Tariq Abdul-Wahad  1998  2003    F   6-6  223   November 3, 1974  Michigan, San Jose State
..                    ...   ...   ...  ...   ...  ...                ...                       ...
161         Dennis Awtrey  1971  1982    C  6-10  235  February 22, 1948               Santa Clara
162          Gustavo Ayón  2012  2014    C  6-10  250      April 1, 1985                       NaN
163            Jeff Ayres  2010  2016    F   6-9  240     April 29, 1987             Arizona State
164         Deandre Ayton  2019  2020    C  6-11  250      July 23, 1998                   Arizona
165      Kelenna Azubuike  2007  2012    G   6-5  220  December 16, 1983                  Kentucky
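Note that read_html() keeps only the cell text by default, so the player links (and with them the index such as abdelal01) are not in this DataFrame. One possible way to keep them, sketched here under a few assumptions: pandas 1.5 or later (where read_html() accepts the extract_links argument), an installed HTML parser such as lxml, and player links of the form /players/a/abdelal01.html. SAMPLE_HTML is a made-up fragment modelled on that pattern rather than a copy of the real page, and the player_id column name is introduced here only for illustration.

```python
from io import StringIO
import posixpath

import pandas as pd

# Hypothetical fragment modelled on the players table; the real markup may differ.
SAMPLE_HTML = """
<table>
  <thead>
    <tr><th>Player</th><th>From</th><th>To</th></tr>
  </thead>
  <tbody>
    <tr><th><a href="/players/a/abdelal01.html">Alaa Abdelnaby</a></th><td>1991</td><td>1995</td></tr>
    <tr><th><a href="/players/a/abdulza01.html">Zaid Abdul-Aziz</a></th><td>1969</td><td>1978</td></tr>
  </tbody>
</table>
"""


def player_id_from_href(href):
    """Turn '/players/a/abdelal01.html' into 'abdelal01'; pass None through."""
    if not href:
        return None
    return posixpath.splitext(posixpath.basename(href))[0]


# extract_links="body" makes every body cell a (text, href) tuple.
df = pd.read_html(StringIO(SAMPLE_HTML), extract_links="body")[0]

# Remember the href of each Player cell before flattening the tuples.
hrefs = df["Player"].map(lambda cell: cell[1])

# Reduce every cell back to its text part.
df = df.apply(lambda col: col.map(lambda cell: cell[0]))

# Assumed column name for the player index.
df["player_id"] = hrefs.map(player_id_from_href)

print(df)
```

Against the live page, the same steps should apply with the URL passed to read_html() in place of StringIO(SAMPLE_HTML), provided the links follow the assumed pattern.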
Player column
df['Player']
Output
0            Alaa Abdelnaby
1           Zaid Abdul-Aziz
2      Kareem Abdul-Jabbar*
3        Mahmoud Abdul-Rauf
4         Tariq Abdul-Wahad
                        ...
161           Dennis Awtrey
162            Gustavo Ayón
163              Jeff Ayres
164           Deandre Ayton
165        Kelenna Azubuike
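If you would rather stay with BeautifulSoup, as in the question, the two DataFrames can be avoided by walking each row once and reading the <th> and <td> cells together. The following is only a sketch under stated assumptions: player links look like /players/a/abdelal01.html, the table has <thead> and <tbody> sections, and SAMPLE_HTML is a made-up fragment rather than the real page. The names scrape_players, player_id_from_href and player_id are hypothetical helpers introduced for this example.

```python
import posixpath

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the players table; the real markup may differ.
SAMPLE_HTML = """
<table>
  <thead>
    <tr><th>Player</th><th>From</th><th>To</th></tr>
  </thead>
  <tbody>
    <tr><th><a href="/players/a/abdelal01.html">Alaa Abdelnaby</a></th><td>1991</td><td>1995</td></tr>
    <tr><th><a href="/players/a/abdulza01.html">Zaid Abdul-Aziz</a></th><td>1969</td><td>1978</td></tr>
  </tbody>
</table>
"""


def player_id_from_href(href):
    """Turn '/players/a/abdelal01.html' into 'abdelal01'."""
    return posixpath.splitext(posixpath.basename(href))[0]


def scrape_players(html):
    """Build one DataFrame holding the table cells plus a player_id column."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    headers = [th.get_text(strip=True) for th in table.find("thead").find_all("th")]

    records = []
    for row in table.find("tbody").find_all("tr"):
        # Collect <th> and <td> cells together, in document order.
        cells = row.find_all(["th", "td"])
        if len(cells) != len(headers):
            continue  # skip rows that do not match the header layout
        link = cells[0].find("a", href=True)
        if link is None:
            continue  # skip rows without a player link
        record = {name: cell.get_text(strip=True) for name, cell in zip(headers, cells)}
        record["player_id"] = player_id_from_href(link["href"])
        records.append(record)
    return pd.DataFrame(records)


players = scrape_players(SAMPLE_HTML)
print(players)
```

For the live page, the HTML returned by urlopen(url) could be passed to scrape_players() instead of SAMPLE_HTML. All values come back as strings, so numeric columns such as From and To would still need converting, for example with pd.to_numeric.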