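How to remove duplicate values and append the remaining row values
I am using the following code to crawl multiple links on a page and extract a list of data from each corresponding link: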
carspider.py:
def parse_item(self, response):
    sel = Selector(response)
    item = CarscrapeItem()
    item['carType'] = sel.xpath('//div[@class="listing__section listing__section--key-details listing__key-details portable-one-whole push--bottom"]//span[@itemprop="manufacturer"]//text()').get()
    item['model'] = sel.xpath('//div[@class="listing__section listing__section--key-details listing__key-details portable-one-whole push--bottom"]//span[@itemprop="model"]//text()').get()
    item['variant'] = sel.xpath('//div[@class="listing__section listing__section--key-details listing__key-details portable-one-whole push--bottom"]//span[@class="float--right"]//text()')[3].get()
    item['year'] = sel.xpath('//div[@class="listing__section listing__section--key-details listing__key-details portable-one-whole push--bottom"]//span[@class="float--right"]//text()')[4].get()
    item['engineCapacity'] = sel.xpath('//div[@class="listing__section listing__section--key-details listing__key-details portable-one-whole push--bottom"]//span[@class="float--right"]//text()')[5].get()
    item['transmission'] = sel.xpath('//div[@class="listing__section listing__section--key-details listing__key-details portable-one-whole push--bottom"]//span[@class="float--right"]//text()')[6].get()
    item['seatCapacity'] = sel.xpath('//div[@class="listing__section listing__section--key-details listing__key-details portable-one-whole push--bottom"]//span[@class="float--right"]//text()')[7].get()
    yield item
pipelines.py:
def __init__(self):
    dispatcher.connect(self.spider_opened, signals.spider_opened)
    dispatcher.connect(self.spider_closed, signals.spider_closed)
    self.files = {}

def spider_opened(self, spider):
    # Register the handle per spider so spider_closed can find and close it.
    file = open('%s_dataset.json' % spider.name, 'w+b')
    self.files[spider] = file
    self.exporter = JsonLinesItemExporter(file)
    self.exporter.start_exporting()

def spider_closed(self, spider):
    self.exporter.finish_exporting()
    file = self.files.pop(spider)
    file.close()

def process_item(self, item, spider):
    self.exporter.export_item(item)
    return item
I export the items to a JSON file, and the output looks like this:
{"carType": "Honda","model": "Civic","variant": "TC VTEC Premium","year": "2020","engineCapacity": "1498 cc","transmission": "Automatic","seatCapacity": "5"}
{"carType": "Honda","model": "Accord","variant": "TC","seatCapacity": "5"}
This is the output I am trying to get:
{"carType": "Honda","seatCapacity": "5"
"model": "Accord","seatCapacity": "5"}
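I want to remove the duplicate car types and append the remaining row values to the existing car type. I think building a recommendation system would be easier this way. Can this be done with Scrapy? I searched for answers about duplicate values, but they are usually about the duplicates filter, and the other filtering approaches did not work for me.
Edit:
Since my desired output is not achievable, I tried the suggestion from Akshay Jain, which comes close to what I expected. I finally got this output: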
{
  "BMW" : [
    {
      "colour" : "White","engineCapacity" : "1998 cc","model" : "530e","seatCapacity" : "5","transmission" : "Automatic","variant" : "M Sport","warranty" : "5 years","year" : "2020"
    }
  ],"Subaru" : [
    {
      "colour" : "Silver","model" : "WRX","variant" : "EyeSight","year" : "2020"
    },{
      "colour" : "Blue","engineCapacity" : "1995 cc","model" : "XV","variant" : "GT Edition","year" : "2019"
    },{
      "colour" : "Grey"
    },{
      "colour" : "Silver","model" : "Forester","variant" : "S EyeSight","year" : "2019"
    }
  ]
}
I added a Python file with the following code to produce this structure:
import json

with open("dataset.json", "r+") as json_data:
    car = {}
    # Each line of the export is one JSON object (JSON Lines).
    for line in json_data:
        element = json.loads(line)
        brand = element.get("carType")
        if brand not in car:
            car[brand] = [element]
        else:
            car[brand].append(element)
    # Rewind and overwrite the file with the grouped structure.
    json_data.seek(0)
    json.dump(car, json_data, sort_keys=True, indent=2, separators=(",", " : "))
    json_data.truncate()
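On the question of whether this can be done with Scrapy itself: the same grouping could probably be moved into an item pipeline, so no separate post-processing script is needed. The following is only a sketch under assumptions. The class name `GroupByBrandPipeline` and the output file name are hypothetical, and the pipeline would still have to be enabled in `ITEM_PIPELINES`. The method names `open_spider`, `process_item` and `close_spider` follow Scrapy's item pipeline interface.

```python
import json
from collections import defaultdict


class GroupByBrandPipeline:
    """Hypothetical pipeline: collect items per carType, write one JSON file on close."""

    def open_spider(self, spider):
        # One list of records per brand, filled while the spider runs.
        self.cars = defaultdict(list)

    def process_item(self, item, spider):
        record = dict(item)  # works for plain dicts and scrapy.Item objects
        brand = record.pop("carType", None)  # drop the repeated brand from the record
        self.cars[brand].append(record)
        return item

    def close_spider(self, spider):
        # Assumed file name; written once, after all items have been seen.
        path = "%s_grouped.json" % spider.name
        with open(path, "w", encoding="utf-8") as f:
            json.dump(self.cars, f, indent=2)
```

The trade-off is that all items are held in memory until the spider closes, which is fine for a small crawl but may not suit a very large one.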
I referred to some documentation and tutorials, including https://www.w3schools.com/python/python_json.asp and http://www.compciv.org/guides/python/fundamentals/dictionaries-overview/
Hope it helps someone!
Solution
- For your information, dictionary keys must be unique in Python, so your expected output is not possible.
- Suggestion: you can store the data in the following way:
car = {
    "Honda": [
        {
            "model": "Civic","variant": "TC VTEC Premium","year": "2020","engineCapacity": "1498cc","transmission": "Automatic","seatCapacity": "5"
        },{
            "model": "Accord","variant": "TC","engineCapacity": "1498 cc","seatCapacity": "5"
        }
    ],"BMW": [
        {
            "model": "XYZ"
        },{
            "model": "ABC","seatCapacity": "5"
        }
    ]
}
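The point about unique keys can be checked directly. In this small sketch the input string is made up for illustration: it repeats the "model" key inside one object, as the originally desired output would, and Python's `json` module keeps only the last value.

```python
import json

# Hypothetical sample: one object that repeats the "model" key.
raw = '{"carType": "Honda", "model": "Civic", "model": "Accord"}'

parsed = json.loads(raw)
print(parsed)  # {'carType': 'Honda', 'model': 'Accord'}
```

The "Civic" entry is silently lost, which is why a list of records per brand is the safer layout.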
You can use the partial code below to read the data from the file line by line, or write your own code to store the data in the format above.
import json

with open('PATH_TO_FILE/FILE_NAME.json') as f:
    for line in f:
        line = json.loads(line)
        # YOUR CODE HERE
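One possible way to fill in the placeholder is sketched below. The helper name `group_by_brand` and the sample lines are assumptions made for illustration, not part of the original answer. It removes "carType" from each record and uses it as the dictionary key, which matches the suggested structure.

```python
import json
from collections import defaultdict


def group_by_brand(lines, key="carType"):
    """Group JSON Lines records by `key`, dropping `key` from each record."""
    cars = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        record = json.loads(line)
        brand = record.pop(key, None)  # records without the key end up under None
        cars[brand].append(record)
    return dict(cars)


# Hypothetical sample input standing in for the exported file.
sample = [
    '{"carType": "Honda", "model": "Civic", "year": "2020"}',
    '{"carType": "Honda", "model": "Accord"}',
    '{"carType": "BMW", "model": "530e"}',
]
grouped = group_by_brand(sample)
print(json.dumps(grouped, indent=2))
```

With a real export, the open file object can be passed in place of `sample`, since iterating over a file yields its lines.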