How do I fix my scrapy shell loading forever and raising twisted.internet.error.TimeoutError?
My spiders are running into problems that are not caused by imports, so I took their start_urls and tried using scrapy shell to track down the error. Unfortunately, even scrapy shell loads forever and returns twisted.internet.error.TimeoutError. How can I fix this? My scrapy shell command and the error are below:
```
root@cf59900d79a8:/workspace# scrapy shell "https:www.mystart_url.com"
2020-08-28 04:37:53 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: crawler)
2020-08-28 04:37:53 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.5.9 (default, Jul 22 2020, 13:58:49) - [GCC 8.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 3.0, Platform Linux-4.19.76-linuxkit-x86_64-with-debian-10.4
2020-08-28 04:37:53 [scrapy.crawler] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter', 'DOWNLOAD_TIMEOUT': 40, 'NEWSPIDER_MODULE': 'crawler.spiders', 'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36', 'FEED_FORMAT': 'json', 'FEED_URI': 'http://azurite:10000/devstoreaccount1/%(feed_name)s.json', 'SPIDER_MODULES': ['crawler.spiders'], 'MAIL_FROM': 'scraping.info@yahoo.com', 'MAIL_USER': 'scraping.info@yahoo.com', 'CONCURRENT_REQUESTS': 60, 'AUTOTHROTTLE_MAX_DELAY': 5.0, 'LOG_FORMATTER': 'crawler.middlewares.PoliteLogFormatter', '*****': 'hgyTvvty43q', 'LOGSTATS_INTERVAL': 0, 'COOKIES_ENABLED': False, 'REACTOR_THREADPOOL_MAXSIZE': 60, 'AUTOTHROTTLE_START_DELAY': 0.1, 'MAIL_HOST': 'smtp.mail.yahoo.com', 'DOWNLOAD_DELAY': 0.25, 'BOT_NAME': 'crawler', 'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage', 'CONCURRENT_REQUESTS_PER_DOMAIN': 6, 'RETRY_HTTP_CODES': [500, 502, 503, 504, 522, 524, 408, 429, 403], 'AUTOTHROTTLE_ENABLED': True, 'AUTOTHROTTLE_TARGET_CONCURRENCY': 5.0, 'RETRY_TIMES': 10, 'LOG_LEVEL': 'INFO'}
2020-08-28 04:37:54 [scrapy.extensions.feedexport] ERROR: Unknown feed storage scheme: http
2020-08-28 04:37:54 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.throttle.AutoThrottle', 'scrapy.extensions.telnet.TelnetConsole', 'crawler.extensions.ItemLogStats', 'crawler.extensions.StatsMailer']
2020-08-28 04:37:54 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy_splash.SplashCookiesMiddleware', 'scrapy_splash.SplashMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-08-28 04:37:54 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy_splash.SplashDeduplicateArgsMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-08-28 04:37:54 [scrapy.middleware] INFO: Enabled item pipelines:
['crawler.pipelines.DuplicatesPipeline', 'crawler.pipelines.DateWorker', 'crawler.pipelines.CustomImagesPipeline', 'scrapy_jsonschema.JsonSchemaValidatePipeline', 'crawler.pipelines.OutAttachmentProcessing', 'crawler.pipelines.IgnoreNullValues', 'crawler.pipelines.ItemLogStats']
2020-08-28 04:37:54 [scrapy.core.engine] INFO: Spider opened
Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 8, in <module>
    sys.exit(execute())
  File "/usr/local/lib/python3.5/site-packages/scrapy/cmdline.py", line 150, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/usr/local/lib/python3.5/site-packages/scrapy/cmdline.py", line 90, in _run_print_help
    func(*a, **kw)
  File "/usr/local/lib/python3.5/site-packages/scrapy/cmdline.py", line 157, in _run_command
    cmd.run(args, opts)
  File "/usr/local/lib/python3.5/site-packages/scrapy/commands/shell.py", line 73, in run
    shell.start(url=url, redirect=not opts.no_redirect)
  File "/usr/local/lib/python3.5/site-packages/scrapy/shell.py", line 48, in start
    self.fetch(url, spider, redirect=redirect)
  File "/usr/local/lib/python3.5/site-packages/scrapy/shell.py", line 115, in fetch
    reactor, self._schedule, request, spider)
  File "/usr/local/lib/python3.5/site-packages/twisted/internet/threads.py", line 122, in blockingCallFromThread
    result.raiseException()
  File "/usr/local/lib/python3.5/site-packages/twisted/python/failure.py", line 488, in raiseException
    raise self.value.with_traceback(self.tb)
twisted.internet.error.TimeoutError: User timeout caused connection failure: Getting https:www.mystart_url.com took longer than 40.0 seconds..
root@cf59900d79a8:/workspace#
```
Solution
Setting a delay between consecutive page downloads from the same site can help resolve timeout errors caused by requesting too frequently. This is done with the DOWNLOAD_DELAY setting in the project's settings.py file; a minimal sketch follows the documentation quote below.
The Scrapy documentation describes it as follows:
The amount of time (in secs) that the downloader should wait before downloading consecutive pages from the same website. This can be used to throttle the crawling speed to avoid hitting servers too hard. Decimal numbers are supported.
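Here is a minimal sketch of the relevant part of settings.py, assuming the crawler project shown in the log above; the value 2.0 is an assumption to tune for the target site, since the log shows the current DOWNLOAD_DELAY is only 0.25 seconds:

```
# crawler/settings.py -- minimal sketch; 2.0 is an assumed value, tune per site.

# Seconds to wait between consecutive requests to the same website.
# Decimal numbers are supported; the log above shows 0.25, which may be
# too aggressive for this site.
DOWNLOAD_DELAY = 2.0

# AutoThrottle (already enabled in this project, per the log) adjusts the
# delay at run time and never goes below DOWNLOAD_DELAY, so raising the
# value above also raises the throttled floor.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2.0
```

To experiment without editing the project, any setting can also be overridden for a single session with Scrapy's global -s option, for example:

```
scrapy shell -s DOWNLOAD_DELAY=2 "https://www.mystart_url.com"
```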