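How to solve the Forbidden error
I am using scrapy 1.12 to crawl a classifieds website.
The crawler runs fine on localhost, but it does not work on the server (CentOS).
I am using randomuseragent and randomproxy.
My settings.py file: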
BOT_NAME = 'xx(https://xx.com)'
SPIDER_MODULES = ['xyz_crawler.spiders']
NEWSPIDER_MODULE = 'xyz_crawler.spiders'
ITEM_PIPELINES = {
    'xyz_crawler.pipelines.XmlWriterPipeline': 800,
}
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'null'
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 8
# Retry many times since proxies often fail
RETRY_TIMES = 1
# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [500,503,504,400,403,404,408]
DOWNLOADER_MIDDLEWARES = {
    # 'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': None,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}
RANDOM_UA_FILE = "xyz_crawler/useragents.txt"
# Proxy list containing entries like
# http://host1:port
# http://username:password@host2:port
# http://host3:port
# ...
PROXY_LIST = 'xyz_crawler/proxies.txt'
# Proxy mode
# 0 = Every request has a different proxy
# 1 = Take only one proxy from the list and assign it to every request
# 2 = Put a custom proxy to use in the settings
PROXY_MODE = 0
# If proxy mode is 2, uncomment this line:
#CUSTOM_PROXY = "http://host1:port"
DOWNLOADER_CLIENTCONTEXTFACTORY = 'xyz_crawler.contextfactory.CustomContextFactory'
# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY=3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN=16
#CONCURRENT_REQUESTS_PER_IP=16
# Disable cookies (enabled by default)
#COOKIES_ENABLED=False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED=False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'xyz_crawler.middlewares.MyCustomSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
#    'xyz_crawler.middlewares.TestDownloader': 100,
#}
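The proxy IP addresses and user agents are the same in both places.
I also tried setting COOKIES_ENABLED to False, but it still does not work.
Why is this not working?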
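One way to check that the proxy and User-Agent really are the same on both machines is to log what each request actually carried next to the status code it received. The following is only a minimal sketch, not part of the project above: `describe_exchange` and `RequestAuditMiddleware` are hypothetical names, and it assumes the request exposes `url`, `headers` and `meta` and the response exposes `status`, as Scrapy's request and response objects do.

```python
import logging

logger = logging.getLogger(__name__)


def describe_exchange(request, response):
    """Summarise which proxy and User-Agent produced which HTTP status."""
    user_agent = request.headers.get("User-Agent")
    if isinstance(user_agent, bytes):  # Scrapy stores header values as bytes
        user_agent = user_agent.decode("latin-1")
    return {
        "url": request.url,
        "status": response.status,
        "proxy": request.meta.get("proxy"),  # None means no proxy was set
        "user_agent": user_agent,
    }


class RequestAuditMiddleware:
    """Hypothetical downloader middleware that logs what was actually sent."""

    def process_response(self, request, response, spider):
        logger.info("request audit: %s", describe_exchange(request, response))
        return response  # pass the response through unchanged
```

If the class were saved in a middlewares module of the project (a hypothetical location), it could be registered in DOWNLOADER_MIDDLEWARES with a priority such as 410. That number is an assumption, chosen so the middleware sits after the proxy (100) and user-agent (400) entries in the configuration above. Running the spider on localhost and on the server and comparing the logged lines would then show whether the two environments really send identical requests.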