scrapyd: how do I post downloaded image paths to a REST API?
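I want to post a batch of crawled items to a REST API once the number of buffered items reaches BATCH_SIZE.
After the images have been downloaded, where should I get each image's absolute path so that I can include it when posting the crawled items to the REST API?
I deploy the project with scrapyd.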
items.py
from scrapy import Field, Item


class MyItem(Item):
    name = Field()
    images = Field()
    image_urls = Field()
    image_paths = Field()
pipelines.py
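import scrapy
from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline


class MyImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Schedule one download request per image URL on the item.
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        # Each successful result carries a path relative to IMAGES_STORE.
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        adapter = ItemAdapter(item)
        adapter['image_paths'] = image_paths
        return item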
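The path values collected in item_completed are relative to the IMAGES_STORE setting, so one way to obtain absolute paths is to join the two. The following is a minimal sketch, assuming IMAGES_STORE points at a local directory rather than a remote store; the helper name to_absolute_paths and the example directory and file name are hypothetical and not part of Scrapy.

```python
import os


def to_absolute_paths(images_store, relative_paths):
    """Join each pipeline-relative path onto the images store directory.

    images_store: the value of the IMAGES_STORE setting (assumed here to be
        a local directory, not a remote storage URI).
    relative_paths: the 'path' values collected in item_completed.
    """
    base = os.path.abspath(images_store)
    return [os.path.join(base, p) for p in relative_paths]


if __name__ == "__main__":
    # Hypothetical store directory and file name, for illustration only.
    print(to_absolute_paths("images", ["full/0a1b2c.jpg"]))
```

Inside item_completed, the result of such a helper could be stored in image_paths instead of the relative values, provided the IMAGES_STORE value is passed in from the project settings.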
middlewares.py
from scrapy import Request, signals

BATCH_SIZE = 100  # placeholder; the question does not state the actual value


class FooSpiderMiddleware(object):
    def __init__(self):
        # Per-instance buffer of items waiting to be posted.
        self.bulk_items = []

    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        return None

    def process_spider_output(self, response, result, spider):
        # Buffer scraped items (not follow-up requests) for batching.
        result_list = list(result)
        self.bulk_items.extend(
            r for r in result_list if not isinstance(r, Request))
        if len(self.bulk_items) >= BATCH_SIZE:
            # post here
            self.bulk_items = []
        for i in result_list:
            yield i

    def process_spider_exception(self, response, exception, spider):
        pass

    def process_start_requests(self, start_requests, spider):
        # Pass the start requests through unchanged.
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
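A spider middleware like the one above sees items before the item pipelines run, so image_paths is not yet set at that point. One possible alternative is to do the batching in an item pipeline ordered after the images pipeline. The following is a minimal sketch of just the buffering logic, kept independent of Scrapy so it can be run on its own; BatchPoster and post_json are hypothetical names, items are assumed to be dict-like, and no real endpoint is assumed.

```python
import json
import urllib.request


def post_json(url, payload):
    """POST payload as JSON using only the standard library (blocking)."""
    data = json.dumps(payload).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status


class BatchPoster:
    """Buffer items and hand them to `send` once batch_size is reached."""

    def __init__(self, batch_size, send):
        self.batch_size = batch_size
        self.send = send  # callable that takes a list of dicts
        self.buffer = []

    def add(self, item):
        # Copy the item into a plain dict so it can be serialized later.
        self.buffer.append(dict(item))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # Sends whatever is buffered; does nothing when the buffer is empty.
        if self.buffer:
            batch, self.buffer = self.buffer, []
            self.send(batch)


if __name__ == "__main__":
    sent = []  # stands in for the REST API in this demonstration
    poster = BatchPoster(batch_size=2, send=sent.append)
    for n in range(3):
        poster.add({"name": "item-%d" % n})
    poster.flush()  # send the final partial batch
    print(sent)
```

In a Scrapy item pipeline, add would be called from process_item and flush from close_spider, with send wrapping post_json and the real API URL. Note that urlopen blocks while the request is in flight, so a non-blocking client may be preferable for large crawls.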