Scraping - pandas read_html and bs4 return multiple empty rows


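I am trying to extract some climate data from a table with pandas.read_html(), but it returns entire rows as empty. I suspect the site's webmaster did something to prevent web scraping, but I may be wrong. I also tried bs4, with the same result.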

pandas:

import pandas as pd

dfs = pd.read_html('https://www.tutiempo.net/clima/03-2000/ws-879380.html',match='.+',flavor='bs4')

df = dfs[2]
df

Output

    Día T   TM  Tm  SLP H   PP  VV  V   VM  VG  RA  SN  TS  FG
0   1   9.9 15  6   1007.4  55  0.76    16.9    11.1    18.3    -   NaN NaN NaN NaN
1   2   13.5    19  8.4 1006.9  45  0   17.9    13.3    24.1    51.9    NaN NaN NaN NaN
2   3   9.6 18.9    7   1004.8  77  0.76    16.4    17.4    37  50  o   NaN NaN NaN
3   4   NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN o   NaN NaN NaN
4   5   NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5   6   NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6   7   16.6    21  12.6    1000.0  67  0   16.9    20  64.8    85.2    NaN NaN NaN NaN
7   8   12.9    21.2    7.8 1001.7  74  -   16.6    19.1    44.3    72.2    o   NaN NaN NaN
8   9   11.3    19  8.4 1005.4  83  1.02    15.9    12  29.4    -   o   NaN NaN NaN
9   10  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
10  11  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN o   NaN NaN NaN
11  12  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN o   NaN NaN NaN
12  13  7.5 12  4   1007.5  85  0.25    17.9    9.4 22.2    -   NaN NaN NaN NaN
13  14  7.8 12  4.8 995.4   91  0   15.1    16.5    27.8    -   o   NaN NaN NaN
14  15  6.5 8   5   984.9   79  2.03    16.6    38.2    48.2    63  NaN NaN NaN NaN

bs4:

import bs4 as bs
import urllib.request

sauce = urllib.request.urlopen('https://www.tutiempo.net/clima/01-2000/ws-879380.html').read()
soup = bs.BeautifulSoup(sauce,'lxml')

table = soup.find("table",{"class": "medias mensuales numspan"})

table_rows = table.find_all('tr')

for tr in table_rows:
  td = tr.find_all('td') 
  row = [i.text for i in td]
  print(row)

Output

['1','8.6','13.3','4.8','996.5','64','0','18.3','39.6','59.1','-','\xa0','\xa0']
['2','9.4','13.8','5.8','999.4','69','0.76','20.6','17','55.4','o','\xa0']
['3','8','12.4','6','1001.1','79','1.27','21.1','31.3','\xa0']
['4','','\xa0']
['5','\xa0']
['6','\xa0']
['7','8.3','16.8','4','984.2','5.08','20.9','24.8','74.1','\xa0']
['8','7.3','13.2','3.5','986.3','65','0.51','15','32.6','\xa0']
['9','4.4','0.6','988.4','81','4.06','14.3','28.2','51.9','\xa0']
['10','\xa0']
['11','\xa0']
['12','\xa0']
['13','8.8','10.3','1001.9','78','0.25','18','57','70.2','\xa0']
['14','9.3','11','7.8','1003.8','76','58.2','64.8','\xa0']

If you check the website, the rows are complete. Any help is appreciated.

Best regards, Sir Ernest Shackleton

Solution

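They are using <span> tags with styles attached. The attached styles have a content property, which is used to fill in the values of the cells that show up as empty in bs4.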

The data is already present in the HTML, but you have to process the styles to get at it.

A quick and dirty workaround is to assume the styles do not change and write preprocessing replacements like the following:

# html holds the page source as a string (renamed from str to avoid shadowing the builtin)
html = html.replace('<span class="ntlm">', '1')
html = html.replace('<span class="ntzb">', '5')

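A better solution is to process the styles with a CSS engine or a regex, build a map on every page load, and then apply that mapping to replace the text.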
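The regex variant of that idea might look roughly like the sketch below. It is only an illustration: the sample HTML, the shape of the CSS rules (`span.<class>::after { content: "..." }`), the extra `ntxy` class and the helper names `build_class_map` and `fill_spans` are all assumptions, not taken from the real page, whose markup may differ.

```python
import re

# Hypothetical sample modelled on the description above; the real page's
# selectors and class names may differ and may change on every load.
SAMPLE_HTML = """
<style>
.numspan span.ntlm::after { content: "1"; }
.numspan span.ntzb::after { content: "5"; }
.numspan span.ntxy::after { content: "."; }
</style>
<table class="medias mensuales numspan">
<tr><td>4</td><td><span class="ntlm"></span><span class="ntzb"></span><span class="ntxy"></span><span class="ntzb"></span></td></tr>
</table>
"""

# Assumed rule shape: span.<class>::after { content: "<text>" }
RULE_RE = re.compile(r'span\.(\w+)::?after\s*\{\s*content\s*:\s*"([^"]*)"')
# Empty spans that rely on a CSS rule for their visible text
SPAN_RE = re.compile(r'<span class="(\w+)">\s*</span>')


def build_class_map(html):
    """Map each span class to the text its CSS rule injects."""
    return dict(RULE_RE.findall(html))


def fill_spans(html):
    """Replace each empty styled span with the text from its CSS rule."""
    class_map = build_class_map(html)
    # Spans whose class has no matching rule are left untouched.
    return SPAN_RE.sub(lambda m: class_map.get(m.group(1), m.group(0)), html)


fixed_html = fill_spans(SAMPLE_HTML)
print(fixed_html)
```

Because the map is rebuilt from whatever styles arrive with each page, it does not depend on the class names staying stable. The repaired string could then be handed to pandas.read_html (wrapped in io.StringIO on recent pandas versions) or to BeautifulSoup as before.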


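The problem lies in how the data is displayed on the website. If you inspect the elements, you can see that some of the data you need is stored together with other data. I am not sure I am explaining this correctly, but I think it is best to see for yourself.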
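A small script can show the same thing as the browser inspector. The sketch below uses only the standard library and lists, for each table cell, its visible text and the classes of any nested spans. The row markup and the `CellSpanCollector` class are hypothetical, written to resemble the structure described above rather than copied from the site.

```python
from html.parser import HTMLParser


class CellSpanCollector(HTMLParser):
    """Collect, per <td>, its visible text and the classes of nested <span> tags."""

    def __init__(self):
        super().__init__()
        self.cells = []       # one [text, span_classes] entry per <td>
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self.cells.append(["", []])
        elif tag == "span" and self._in_td:
            self.cells[-1][1].append(dict(attrs).get("class"))

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self.cells[-1][0] += data


# Hypothetical row: the second cell has no text of its own, only styled spans.
row_html = '<tr><td>4</td><td><span class="ntlm"></span><span class="ntzb"></span></td></tr>'

parser = CellSpanCollector()
parser.feed(row_html)
for text, span_classes in parser.cells:
    print(repr(text), span_classes)
```

A cell that prints an empty string alongside a non-empty list of span classes is one whose value is drawn by CSS, which is why both read_html and bs4 report it as empty.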
