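Scrapy: removing HTML from a list returned to an items.py dict field
First of all, thanks for any help!
... I'm new to Stack Overflow (and Python), so apologies if I use the wrong terminology :)
I am using Scrapy to extract data from an HTML source. A Scrapy selector populates a dict field defined in items.py: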
def parse_item(self, response):
    item = SiennaautoItem()  # instantiate the dict-like item
    item['attributes'] = response.css('p.attrgroup').extract()
    yield item
This returns a dict whose value is a list with multiple entries:
> ['<p class="attrgroup">\n\n\n\n <span><b>2014 honda odyssey
> touring elite</b></span>\n <br>\n\n </p>','<p
> class="attrgroup">\n\n\n\n <span>VIN:
> <b>5FNRL5H66EB107700</b></span>\n <br>\n\n\n\n\n
> <span>condition: <b>like new</b></span>\n <br>\n\n\n\n\n
> <span>cylinders: <b>6 cylinders</b></span>\n <br>\n\n\n\n\n
> <span>drive: <b>fwd</b></span>\n <br>\n\n\n\n\n
> <span>fuel: <b>gas</b></span>\n <br>\n\n\n\n\n
> <span>odometer: <b>99000</b></span>\n <br>\n\n\n\n\n
> <span>paint color: <b>white</b></span>\n <br>\n\n\n\n\n
> <span>size: <b>full-size</b></span>\n <br>\n\n\n\n\n
> <span>title status: <b>clean</b></span>\n <br>\n\n\n\n\n
> <span>transmission: <b>automatic</b></span>\n
> <br>\n\n\n\n\n <span>type: <b>mini-van</b></span>\n
> <br>\n\n </p>']
This is how the HTML renders:
['\n\n\n\n 2014 honda odyssey touring elite \n
','\n\n\n\n VIN: 5FNRL5H66EB107700 \n
\n\n
\n\n\n\n\n
condition: like new \n
\n\n\n\n\n
cylinders: 6 cylinders \n
\n\n\n\n\n drive: fwd \n
\n\n\n\n\n
fuel: gas \n
\n\n\n\n\n
odometer: 99000 \n
\n\n\n\n\n
paint color: white \n
\n\n\n\n\n
size: full-size \n
\n\n\n\n\n
title status: clean \n
\n\n\n\n\n
transmission: automatic \n
\n\n\n\n\n type: mini-van \n
\n\n ']
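My question is: how do I strip the HTML tags, and how do I create keys from the span tags
(condition, drive, odometer, etc.)?
I would like to build my own dict values from what item['attributes'] returns, for example:
item['odometer'], item['condition'], and so on.
Any help is greatly appreciated, as I have been stuck on this for a while!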
Solution
My XPath is a bit rusty, but here is a way to do it without XPath, using only the w3lib library:
from w3lib.html import remove_tags

# Sample of what response.css('p.attrgroup').extract() returns
html_array = [
    '<p class="attrgroup">\n\n\n\n <span><b>2014 honda odyssey touring elite</b></span>\n <br>\n\n </p>',
    '<p class="attrgroup">\n\n\n\n <span>VIN: <b>5FNRL5H66EB107700</b></span>\n <br>\n\n\n\n\n <span>condition: <b>like new</b></span>\n <br>\n\n\n\n\n <span>cylinders: <b>6 cylinders</b></span>\n <br>\n\n\n\n\n <span>drive: <b>fwd</b></span>\n <br>\n\n\n\n\n <span>fuel: <b>gas</b></span>\n <br>\n\n\n\n\n <span>odometer: <b>99000</b></span>\n <br>\n\n\n\n\n <span>paint color: <b>white</b></span>\n <br>\n\n\n\n\n <span>size: <b>full-size</b></span>\n <br>\n\n\n\n\n <span>title status: <b>clean</b></span>\n <br>\n\n\n\n\n <span>transmission: <b>automatic</b></span>\n <br>\n\n\n\n\n <span>type: <b>mini-van</b></span>\n <br>\n\n </p>',
]
wanted = {'condition', 'cylinders', 'drive', 'fuel'}  # put in this set the elements you need
data = {}
for fragment in html_array:
    # Once the tags are stripped, each attribute sits on its own line
    for line in remove_tags(fragment).splitlines():
        key, sep, value = line.partition(':')
        if sep and key.strip() in wanted:
            data[key.strip()] = value.strip()
print(data)
Output:
{'condition': 'like new','cylinders': '6 cylinders','drive': 'fwd','fuel': 'gas'}
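If every attribute is wanted rather than a fixed subset, the same idea can be sketched with only the standard library. This is a swapped-in technique (Python's html.parser instead of w3lib), not the method from the answer above. The names SpanTextParser and parse_attributes are hypothetical, the sample is a shortened version of the markup from the question, and replacing spaces in keys with underscores is an assumption about how the field names should look.

```python
from html.parser import HTMLParser


class SpanTextParser(HTMLParser):
    """Collect the text content of every <span> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.spans = []    # finished span texts
        self._depth = 0    # greater than 0 while inside a <span>
        self._buffer = []  # text pieces of the span being read

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "span" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.spans.append("".join(self._buffer).strip())
                self._buffer = []

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def parse_attributes(fragments):
    """Turn 'key: value' spans into a dict; spans without a colon are skipped."""
    parser = SpanTextParser()
    for fragment in fragments:
        parser.feed(fragment)
    parser.close()
    attributes = {}
    for text in parser.spans:
        key, sep, value = text.partition(":")
        if sep:
            # Assumption: 'paint color' becomes 'paint_color' so it can serve as a field name
            attributes[key.strip().replace(" ", "_")] = value.strip()
    return attributes


# Shortened sample based on the markup in the question
sample = [
    '<p class="attrgroup">\n <span><b>2014 honda odyssey touring elite</b></span>\n <br>\n </p>',
    '<p class="attrgroup">\n <span>odometer: <b>99000</b></span>\n <br>\n'
    ' <span>paint color: <b>white</b></span>\n <br>\n </p>',
]
print(parse_attributes(sample))
```

The resulting keys could then be copied into the item, for example item['odometer'] = attributes.get('odometer'), provided the item class in items.py declares a field for each key that is assigned.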