How do I fix my scrapy shell loading forever and raising twisted.internet.error.TimeoutError?
My spiders are running into problems that are not caused by imports, so I took their start_urls and tried using scrapy shell to track down the error. Unfortunately, even scrapy shell loads forever and returns twisted.internet.error.TimeoutError. How can I fix this? My scrapy shell command and the error are below:
```
root@cf59900d79a8:/workspace# scrapy shell "https:www.mystart_url.com"
2020-08-28 04:37:53 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: crawler)
2020-08-28 04:37:53 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.5.9 (default, Jul 22 2020, 13:58:49) - [GCC 8.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 3.0, Platform Linux-4.19.76-linuxkit-x86_64-with-debian-10.4
2020-08-28 04:37:53 [scrapy.crawler] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter', 'DOWNLOAD_TIMEOUT': 40, 'NEWSPIDER_MODULE': 'crawler.spiders', 'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36', 'FEED_FORMAT': 'json', 'FEED_URI': 'http://azurite:10000/devstoreaccount1/%(feed_name)s.json', 'SPIDER_MODULES': ['crawler.spiders'], 'MAIL_FROM': 'scraping.info@yahoo.com', 'MAIL_USER': 'scraping.info@yahoo.com', 'CONCURRENT_REQUESTS': 60, 'AUTOTHROTTLE_MAX_DELAY': 5.0, 'LOG_FORMATTER': 'crawler.middlewares.PoliteLogFormatter', '*****': 'hgyTvvty43q', 'LOGSTATS_INTERVAL': 0, 'COOKIES_ENABLED': False, 'REACTOR_THREADPOOL_MAXSIZE': 60, 'AUTOTHROTTLE_START_DELAY': 0.1, 'MAIL_HOST': 'smtp.mail.yahoo.com', 'DOWNLOAD_DELAY': 0.25, 'BOT_NAME': 'crawler', 'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage', 'CONCURRENT_REQUESTS_PER_DOMAIN': 6, 'RETRY_HTTP_CODES': [500, 502, 503, 504, 522, 524, 408, 429, 403], 'AUTOTHROTTLE_ENABLED': True, 'AUTOTHROTTLE_TARGET_CONCURRENCY': 5.0, 'RETRY_TIMES': 10, 'LOG_LEVEL': 'INFO'}
2020-08-28 04:37:54 [scrapy.extensions.feedexport] ERROR: Unknown feed storage scheme: http
2020-08-28 04:37:54 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.throttle.AutoThrottle', 'scrapy.extensions.telnet.TelnetConsole', 'crawler.extensions.ItemLogStats', 'crawler.extensions.StatsMailer']
2020-08-28 04:37:54 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy_splash.SplashCookiesMiddleware', 'scrapy_splash.SplashMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-08-28 04:37:54 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy_splash.SplashDeduplicateArgsMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-08-28 04:37:54 [scrapy.middleware] INFO: Enabled item pipelines:
['crawler.pipelines.DuplicatesPipeline', 'crawler.pipelines.DateWorker', 'crawler.pipelines.CustomImagesPipeline', 'scrapy_jsonschema.JsonSchemaValidatePipeline', 'crawler.pipelines.OutAttachmentProcessing', 'crawler.pipelines.IgnoreNullValues', 'crawler.pipelines.ItemLogStats']
2020-08-28 04:37:54 [scrapy.core.engine] INFO: Spider opened
Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 8, in <module>
    sys.exit(execute())
  File "/usr/local/lib/python3.5/site-packages/scrapy/cmdline.py", line 150, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/usr/local/lib/python3.5/site-packages/scrapy/cmdline.py", line 90, in _run_print_help
    func(*a, **kw)
  File "/usr/local/lib/python3.5/site-packages/scrapy/cmdline.py", line 157, in _run_command
    cmd.run(args, opts)
  File "/usr/local/lib/python3.5/site-packages/scrapy/commands/shell.py", line 73, in run
    shell.start(url=url, redirect=not opts.no_redirect)
  File "/usr/local/lib/python3.5/site-packages/scrapy/shell.py", line 48, in start
    self.fetch(url, spider, redirect=redirect)
  File "/usr/local/lib/python3.5/site-packages/scrapy/shell.py", line 115, in fetch
    reactor, self._schedule, request, spider)
  File "/usr/local/lib/python3.5/site-packages/twisted/internet/threads.py", line 122, in blockingCallFromThread
    result.raiseException()
  File "/usr/local/lib/python3.5/site-packages/twisted/python/failure.py", line 488, in raiseException
    raise self.value.with_traceback(self.tb)
twisted.internet.error.TimeoutError: User timeout caused connection failure: Getting https:www.mystart_url.com took longer than 40.0 seconds..
root@cf59900d79a8:/workspace#
```
Solution
Setting a delay between consecutive page downloads from the same site can help resolve timeout errors caused by requesting too frequently. This is done with the DOWNLOAD_DELAY setting in the project's settings.py file; a minimal sketch follows the documentation quote below.
The Scrapy documentation describes it as follows:
The amount of time (in secs) that the downloader should wait before downloading consecutive pages from the same website. This can be used to throttle the crawling speed to avoid hitting servers too hard. Decimal numbers are supported.
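Here is a minimal sketch of the relevant part of settings.py, assuming the crawler project shown in the log above; the value 2.0 is an assumption to tune for the target site, since the log shows the current DOWNLOAD_DELAY is only 0.25 seconds:

```
# crawler/settings.py -- minimal sketch; 2.0 is an assumed value, tune per site.

# Seconds to wait between consecutive requests to the same website.
# Decimal numbers are supported; the log above shows 0.25, which may be
# too aggressive for this site.
DOWNLOAD_DELAY = 2.0

# AutoThrottle (already enabled in this project, per the log) adjusts the
# delay at run time and never goes below DOWNLOAD_DELAY, so raising the
# value above also raises the throttled floor.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2.0
```

To experiment without editing the project, any setting can also be overridden for a single session with Scrapy's global -s option, for example:

```
scrapy shell -s DOWNLOAD_DELAY=2 "https://www.mystart_url.com"
```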