Remove duplicate values, then append the remaining row values

I'm using the following code to crawl multiple links on a page and extract a list of data from each linked page:

carspider.py:

def parse_item(self, response):
    # Every field lives in the "key details" section, so select it once.
    details = response.xpath('//div[@class="listing__section  listing__section--key-details  listing__key-details  portable-one-whole  push--bottom"]')
    # Fields without an itemprop are read by position from the right-floated spans.
    values = details.xpath('.//span[@class="float--right"]//text()')

    item = CarscrapeItem()

    item['carType'] = details.xpath('.//span[@itemprop="manufacturer"]//text()').get()
    item['model'] = details.xpath('.//span[@itemprop="model"]//text()').get()
    item['variant'] = values[3].get()
    item['year'] = values[4].get()
    item['engineCapacity'] = values[5].get()
    item['transmission'] = values[6].get()
    item['seatCapacity'] = values[7].get()

    yield item

pipelines.py:

def __init__(self):
    dispatcher.connect(self.spider_opened, signals.spider_opened)
    dispatcher.connect(self.spider_closed, signals.spider_closed)
    self.files = {}

def spider_opened(self, spider):
    file = open('%s_dataset.json' % spider.name, 'w+b')
    # Remember the handle so spider_closed can find and close it.
    self.files[spider] = file
    self.exporter = JsonLinesItemExporter(file)
    self.exporter.start_exporting()

def spider_closed(self, spider):
    self.exporter.finish_exporting()
    file = self.files.pop(spider)
    file.close()

def process_item(self, item, spider):
    self.exporter.export_item(item)
    return item

I export the items to a JSON file, and the output looks like this:

{"carType": "Honda","model": "Civic","variant": "TC VTEC Premium","year": "2020","engineCapacity": "1498 cc","transmission": "Automatic","seatCapacity": "5"}
{"carType": "Honda","model": "Accord","variant": "TC","seatCapacity": "5"}

I'm trying to get output like this instead:

{"carType": "Honda","seatCapacity": "5"
                     "model": "Accord","seatCapacity": "5"}

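I want to remove the duplicated car type and append the remaining row values to the existing car type. I think building a recommendation system would be easier with the data in this shape. Can this be done with Scrapy? I searched for answers about duplicate values, but they are mostly about the duplicates filter, and the other approaches did not work for me.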

Edit:

Since the output I wanted isn't achievable, I tried the suggestion from Akshay Jain, which comes close to what I was after. I ended up with this output:

{
"BMW" : [
{ 
  "colour" : "White","engineCapacity" : "1998 cc","model" : "530e","seatCapacity" : "5","transmission" : "Automatic","variant" : "M Sport","warranty" : "5 years","year" : "2020"
}
],"Subaru" : [
{ 
  "colour" : "Silver","model" : "WRX","variant" : "EyeSight","year" : "2020"
},{ 
  "colour" : "Blue","engineCapacity" : "1995 cc","model" : "XV","variant" : "GT Edition","year" : "2019"
},{ 
  "colour" : "Grey",{ 
  "colour" : "Silver","model" : "Forester","variant" : "S EyeSight","year" : "2019"
}
]
}

I added a Python file with the following code to produce this structure:

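import json

# Read the JSON Lines file, group the records by brand, then overwrite the file in place.
with open("dataset.json", "r+") as json_data:
    car = {}
    for line in json_data:
        element = json.loads(line)
        brand = element.get("carType")
        if brand not in car:
            car[brand] = [element]
        else:
            car[brand].append(element)

    json_data.seek(0)  # rewind so the dump starts at the beginning of the file
    json.dump(car, json_data, sort_keys=True, indent=2, separators=(",", " : "))
    json_data.truncate()  # cut off any leftover bytes from the old content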

I referred to some documentation and tutorials, including https://www.w3schools.com/python/python_json.asp and http://www.compciv.org/guides/python/fundamentals/dictionaries-overview/

Hope this helps someone!

Solution

For your information, dictionary keys must be unique in Python, so the output you expect is not possible.
Suggestion: you could store the data in the following way:

car = {
  "Honda": [
    {
      "model": "Civic","variant": "TC VTEC Premium","year": "2020","engineCapacity": "1498cc","transmission": "Automatic","seatCapacity": "5"
    },{
      "model": "Accord","variant": "TC","engineCapacity": "1498 cc","seatCapacity": "5"
    }
  ],"BMW": [
    {
      "model": "XYZ",{
      "model": "ABC","seatCapacity": "5"
    }
  ]
}
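
To see why the layout from the question cannot work, here is a small sketch using only the standard `json` module. The `raw` string is a made-up sample that repeats keys the way the desired output does; it is not taken from the real dataset.

```python
import json

# Hypothetical sample: one object that repeats "model" and "seatCapacity".
raw = ('{"carType": "Honda", "model": "Civic", "seatCapacity": "5", '
       '"model": "Accord", "seatCapacity": "5"}')

flat = json.loads(raw)
# Python keeps only the last value for a repeated key, so the Civic is lost.
print(flat["model"])  # Accord

# Grouping under the brand keeps every car as its own dictionary.
grouped = {
    "Honda": [
        {"model": "Civic", "seatCapacity": "5"},
        {"model": "Accord", "seatCapacity": "5"},
    ]
}
print(len(grouped["Honda"]))  # 2
```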

You can use the partial code below to read the data from the file line by line, then add your own code to store it in the format above.

import json

with open('PATH_TO_FILE/FILE_NAME.json') as f:
  for line in f:
    # each line of the file holds one JSON object
    line = json.loads(line)
    # YOUR CODE HERE