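Web scraping a table: filtering the results

How to fix filtering the results of a scraped table

I am using Python to web-scrape the data table found here. Specifically, I want to extract the company name, URL, owner name, street, city and phone. After running the page through Beautiful Soup and splitting it, the data to filter looks like this: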

['\\\',\\\'href =“?listingid = 9758&profileid = 217Y3Q544Y&action = uweb&url = http%3a%2f%2f www.jpspa.com ” target =“ _ BLANK ”,“ Johnson Price Sprinkle PA ”,“ / a”,“”,“ / b”,“”,“ / td”,“”,“ / tr”,“”,“ / table“,”','/ td“,”','/ tr“,”,“ tr class =” GeneralBody“”,“,” td bgcolor =“#808080” height =“ 1”“, '','img border =“ 0” height =“ 1” src =“ images / dot_clear.gif” width =“ 1” /',“','/ td”,“','/ tr”,“' ,'/ table“,”','/ td“,”','/ tr“,”,“ tr class =” GeneralBody“”,“,” td align =“ left” valign =“ top” width =“ 90%”',' Maria Pilos ',“','',' 79 Woodfin Place,Suite 300 ”,“','',' Asheville,NC 28801 ”,“”,“”,“”,“ b”,“Phone:”,“ / b”,“ ** (828)254-2374 **',“,”,“,”,“ b”,“Fax:”,'/ b“,”(828)252-9994“,” \“,\'”,“ \\\”, \\\'href =“ DirectoryEmailForm.aspx?listingid = 9758”',“Send Email”,'/ a“,”','/ td“,'','td align =” right“ rowspan =” 3“ valign =“ top” width =“ 10%”','','span style =“ font-size:8pt”','\\\',\\ \'href =“ ?,'!-.. End Listing--”,'',“ / td']

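I bolded the values I want returned and worked out their positions in the list. The filtering code is below. temp_array is the list above that needs filtering, temp_count is the current position in the list, and business_listings is the list I append each value to when I find it. Basically, when temp_count equals the position of a wanted value, that value is appended to the list.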

temp_count = 0
for i in temp_array:
    if temp_count == 0:
        business_listings.append(i)
        temp_count += 1
    elif temp_count == 2:
        business_listings.append(i)
        temp_count += 1
    elif temp_count == 19:
        business_listings.append(i)
        temp_count += 1
    elif temp_count == 20:
        business_listings.append(i)
        temp_count += 1
    elif temp_count == 23:
        business_listings.append(i)
        temp_count += 1
    elif temp_count == 27:
        business_listings.append(i)
        temp_count += 1
    elif temp_count == 42:
        business_listings.append(i)
        temp_count += 1
    else:
        count += 1

The output is: ['\\\',\\\'href =“?listingid = 9758&profileid = 2B713K5Z48&action = uweb&url = http%3a%2f%2fwww.jpspa.com” target =“ _ BLANK”'] and it filters only the first two values, or nothing at all.

Solution
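
A note on the loop in the question first. As posted, `temp_count` only advances inside a branch that matches, and the final `else` touches a different name (`count`). After the first append the counter would sit at 1 and no later branch could fire, which would match the truncated output in the question. The following is a minimal sketch of the same index-based idea using `enumerate`, assuming the positions from the question are correct. `pick_positions` and the sample list are hypothetical and only illustrate the technique.

```python
def pick_positions(items, wanted):
    """Return the elements of items whose index is in wanted, in list order."""
    wanted = set(wanted)  # fast membership test
    return [item for index, item in enumerate(items) if index in wanted]


# Hypothetical stand-in for temp_array; the real list comes from the split HTML.
temp_array = [f"token{n}" for n in range(50)]

# Positions taken from the if/elif chain in the question.
business_listings = pick_positions(temp_array, [0, 2, 19, 20, 23, 27, 42])
print(business_listings)
```

Fixed positions stop lining up as soon as one listing has an extra address line or no website link, so navigating by tags is usually the sturdier choice.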

This script prints information about each business:

import requests
from bs4 import BeautifulSoup


url = 'https://web.ashevillechamber.org/cwt/external/wcpages/wcdirectory/Directory.aspx?CategoryID=1242&Title=Accounting++and++Bookkeeping&AdKeyword=Accounting++and++Bookkeeping'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')


# Each listing name is the bold text inside a cell with bgcolor #E6E6E6.
for b in soup.select('td[bgcolor="#E6E6E6"] b'):
    business_name = b.text
    business_url = b.a['href'] if b.a else '-'  # some listings have no link
    # The owner is the first child of the next cell with width="90%".
    owner = b.find_next('td', width="90%").contents[0]

    # Collect the text nodes after the owner until one sits inside a <b> tag.
    addr, current = [], owner.find_next(string=True)
    while not current.find_parent('b'):
        addr.append(current.strip())
        current = current.find_next(string=True)

    addr = '\n'.join(addr)
    # The phone number is the text node right after that bold label.
    phone = current.find_next(string=True).strip()

    print('Business Name :', business_name)
    print('Business URL  :', business_url)
    print('Owner         :', owner)
    print('Phone         :', phone)
    print('Address:')
    print(addr)
    print('-' * 80)

Prints:

Business Name : Johnson Price Sprinkle PA
Business URL  : ?listingid=9758&profileid=2D7R3B5E4N&action=uweb&url=http%3a%2f%2fwww.jpspa.com
Owner         : Maria Pilos
Phone         : (828) 254-2374
Address:
79 Woodfin Place,Suite 300
Asheville,NC  28801
--------------------------------------------------------------------------------
Business Name : Leah B. Noel,CPA,PC
Business URL  : ?listingid=9656&profileid=549S620J3J&action=uweb&url=http%3a%2f%2fwww.lbnoelcpa.com%2f
Owner         : Ms. Leah Noel
Phone         : 828-333-4529
Address:
14 S. Pack Square #503
Asheville,NC  28801
--------------------------------------------------------------------------------
Business Name : Worley,Woodbery,& Associates,PA
Business URL  : ?listingid=9661&profileid=3L7R304J8X&action=uweb&url=http%3a%2f%2fwww.worleycpa.com%2f
Owner         : Mr. David Worley
Phone         : (828) 271-7997
Address:
7 Orchard Street,Ste. 202
Asheville,NC  28801
--------------------------------------------------------------------------------
Business Name : Peridot Consulting,Inc.
Business URL  : ?listingid=14005&profileid=7L724E5W7E&action=uweb&url=http%3a%2f%2fwww.PeridotConsultingInc.com
Owner         : John Michael  Kledis
Phone         : (828) 242-6971
Address:
PO Box 8904
Asheville,NC  28804
--------------------------------------------------------------------------------
Business Name : DHG
Business URL  : ?listingid=9579&profileid=25711D625I&action=uweb&url=http%3a%2f%2fwww.dhgllp.com%2f
Owner         : Adrienne Bernardi
Phone         : (828) 254-2254
Address:
PO Box 3049
Asheville,NC  28802
--------------------------------------------------------------------------------
Business Name : Gould Killian CPA Group,P.A.
Business URL  : ?listingid=9659&profileid=2P7X216Y66&action=uweb&url=http%3a%2f%2fwww.gk-cpa.com
Owner         : Ed Towson
Phone         : (828) 258-0363
Address:
100 Coxe Avenue
Asheville,NC  28801
--------------------------------------------------------------------------------
Business Name : Michelle Tracz CPA,CFE,PLLC
Business URL  : ?listingid=12921&profileid=610C8H3I7N&action=uweb&url=http%3a%2f%2fwww.michelletraczcpa.com
Owner         : Michelle Tracz
Phone         : (828) 280-2530
Address:
1238 Hendersonville Rd.
Asheville,NC  28803
--------------------------------------------------------------------------------
Business Name : Burleson & Earley,P.A.
Business URL  : ?listingid=10436&profileid=57132N5P9C&action=uweb&url=http%3a%2f%2fwww.burlesonearley.com%2f
Owner         : Bronwyn Burleson,CPA
Phone         : (828) 251-2846
Address:
902 Sand Hill Road
Asheville,NC  28806
--------------------------------------------------------------------------------
Business Name : Carol L. King & Associates,P.A.
Business URL  : ?listingid=10439&profileid=2Z8C7I0B4X&action=uweb&url=http%3a%2f%2fwww.clkcpa.com
Owner         : Carol King
Phone         : (828) 258-2323
Address:
40 North French Broad Avenue
Asheville,NC  28801
--------------------------------------------------------------------------------
Business Name : Goldsmith Molis & Gray
Business URL  : ?listingid=12638&profileid=6C8D2C7F55&action=uweb&url=http%3a%2f%2fwww.gmg-cpa.com
Owner         : Allen Gray
Phone         : (828) 281-3161
Address:
32 Orange St.
Asheville,NC  28801
--------------------------------------------------------------------------------
Business Name : Corliss & Solomon,PLLC
Business URL  : ?listingid=12407&profileid=6T7Y798S1R&action=uweb&url=http%3a%2f%2fwww.candspllc.com
Owner         : Slater Solomon
Phone         : (828) 236-0206
Address:
242 Charlotte St.,Suite 1
Asheville,NC  28801
--------------------------------------------------------------------------------
Business Name : Mountain BizWorks
Business URL  : ?listingid=12733&profileid=2L9E9G6A1S&action=uweb&url=http%3a%2f%2fwww.mountainbizworks.org
Owner         : Matthew Raker
Phone         : (828) 253-2834
Address:
153 South Lexington Ave.
Asheville,NC  28801
--------------------------------------------------------------------------------
Business Name : LeBlanc CPA Limited
Business URL  : -
Owner         : Leslie LeBlanc
Phone         : (828) 225-4940
Address:
218 Broadway
Asheville,NC  28801-2347
--------------------------------------------------------------------------------
Business Name : Bolick & Associates,PA,CPA's
Business URL  : -
Owner         : Alan E Bolick,CPA
Phone         : (828) 253-4692
Address:
Central Office Park   Suite 104
56 Central Avenue
Asheville,NC  28801
--------------------------------------------------------------------------------
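
The question also asks for street and city as separate fields. The following is a rough sketch, assuming every address ends with a `City,ST  ZIP` line as in the output above. `split_address` and `CITY_LINE` are hypothetical names and are not part of the answer's script.

```python
import re

# Last line looks like "Asheville,NC  28801" or "Asheville,NC  28801-2347".
CITY_LINE = re.compile(
    r'^(?P<city>[^,]+),\s*(?P<state>[A-Z]{2})\s+(?P<zip>\d{5}(?:-\d{4})?)$'
)


def split_address(addr):
    """Split a newline-joined address into street, city, state and ZIP."""
    lines = [line.strip() for line in addr.splitlines() if line.strip()]
    if not lines:
        return {'street': '', 'city': '', 'state': '', 'zip': ''}
    match = CITY_LINE.match(lines[-1])
    if match is None:
        # Unknown layout: keep everything as the street rather than guess.
        return {'street': ', '.join(lines), 'city': '', 'state': '', 'zip': ''}
    # All lines before the city line are treated as the street part.
    return {'street': ', '.join(lines[:-1]), **match.groupdict()}


print(split_address('79 Woodfin Place,Suite 300\nAsheville,NC  28801'))
```

Addresses that do not end in the assumed pattern are returned with the whole text in `street`, so nothing is silently dropped.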

Edit: parsing the URL:

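import requests
from bs4 import BeautifulSoup
from urllib.parse import unquote
url = 'https://web.ashevillechamber.org/cwt/external/wcpages/wcdirectory/Directory.aspx?CategoryID=1242&Title=Accounting++and++Bookkeeping&AdKeyword=Accounting++and++Bookkeeping'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for b in soup.select('td[bgcolor="#E6E6E6"] b'):
    business_name = b.text
    business_url = b.a['href'] if b.a else '-'
    owner = b.find_next('td', width="90%").contents[0]
    addr, current = [], owner.find_next(string=True)
    while not current.find_parent('b'):
        addr.append(current.strip())
        current = current.find_next(string=True)
    addr = '\n'.join(addr)
    phone = current.find_next(string=True).strip()
    print('Business Name :', business_name)
    print('Business URL  :', unquote(business_url).rsplit('=', maxsplit=1)[-1])  # decoded target URL
    print('Owner         :', owner)
    print('Phone         :', phone)
    print('Address:')
    print(addr)
    print('-' * 80)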

Prints:

Business Name : Johnson Price Sprinkle PA
Business URL  : http://www.jpspa.com
Owner         : Maria Pilos
Phone         : (828) 254-2374
Address:
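79 Woodfin Place,Suite 300
Asheville,NC  28801
--------------------------------------------------------------------------------
Business Name : Leah B. Noel,CPA,PC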
Business URL  : http://www.lbnoelcpa.com/
Owner         : Ms. Leah Noel
Phone         : 828-333-4529
Address:
14 S. Pack Square #503
Asheville,NC  28801
--------------------------------------------------------------------------------

...and so on.
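
Splitting on the last `=` works for these links because `url` happens to be the final query parameter. A slightly more defensive sketch, assuming the target always sits in a `url` parameter as in the hrefs above, is to read it with `urllib.parse.parse_qs`, which also percent-decodes the value. `target_url` is a hypothetical helper name.

```python
from urllib.parse import parse_qs


def target_url(href, default='-'):
    """Return the decoded `url` query parameter of a directory link."""
    query = href.partition('?')[2]  # everything after the first '?'
    values = parse_qs(query).get('url')
    return values[0] if values else default


print(target_url('?listingid=9758&profileid=2D7R3B5E4N&action=uweb&url=http%3a%2f%2fwww.jpspa.com'))
```

Listings without a link (the `'-'` placeholder in the script) fall through to `default` unchanged.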
