How to fix mismatched and missing results from chained FormRequests
I am chaining a series of form requests, jumping from one result page to the next, and within each page from one row to the next, scraping the document number and names from every row.
However, when the scraped data is written to CSV, the results look mismatched and some rows are missing. Any idea why this happens?
formdata_pre = {
    'ScriptManager1': 'SearchFormEx1$UpdatePanel|SearchFormEx1$btnSearch', 'ScriptManager1_HiddenField': '',
    'Navigator1$SearchOptions1$SavePrintCriteriaCheck': 'on', 'Navigator1$SearchOptions1$SaveOrderCriteriaCheck': 'on',
    'SearchCriteriaOffice1$DDL_OfficeName': 'Recorded Land', 'SearchCriteriaName1$DDL_SearchName': 'Recorded Land Name Search',
    'SearchFormEx1$ACSTextBox_LastName1': 'mortgage electronic', 'SearchFormEx1$ACSTextBox_FirstName1': '',
    'SearchFormEx1$ACSRadioButtonList_PartyType1': '', 'SearchFormEx1$ACSTextBox_LastName2': '',
    'SearchFormEx1$ACSTextBox_FirstName2': '', 'SearchFormEx1$ACSRadioButtonList_PartyType2': '',
    'SearchFormEx1$ACSRadioButtonList_Search': '3', 'SearchFormEx1$ACSDropDownList_DocumentType': '29',
    'SearchFormEx1$ACSDropDownList_Towns': '-2',
    'SearchFormEx1$ACSTextBox_DateFrom': '1/1/1753', 'SearchFormEx1$ACSTextBox_DateTo': '10/14/2000',
    'ImageViewer1$ScrollPos': '', 'ImageViewer1$ScrollPosChange': '',
    'ImageViewer1$_imgContainerWidth': '0', 'ImageViewer1$_imgContainerHeight': '0',
    'ImageViewer1$isImageViewerVisible': 'true', 'ImageViewer1$hdnWidgetSize': '',
    'ImageViewer1$DragResizeExtender_ClientState': '',
    'CertificateViewer1$ScrollPos': '', 'CertificateViewer1$ScrollPosChange': '',
    'CertificateViewer1$_imgContainerWidth': '0', 'CertificateViewer1$_imgContainerHeight': '0',
    'CertificateViewer1$isImageViewerVisible': 'true', 'CertificateViewer1$hdnWidgetSize': '',
    'CertificateViewer1$DragResizeExtender_ClientState': '',
    'PTAXViewer1$ScrollPos': '', 'PTAXViewer1$ScrollPosChange': '',
    'PTAXViewer1$_imgContainerWidth': '0', 'PTAXViewer1$_imgContainerHeight': '0',
    'PTAXViewer1$isImageViewerVisible': 'true', 'PTAXViewer1$hdnWidgetSize': '',
    'PTAXViewer1$DragResizeExtender_ClientState': '',
    'DocList1$ctl12': '', 'DocList1$ctl14': '',
    'RefinementCtrl1$ctl01': '', 'RefinementCtrl1$ctl03': '',
    'NameList1$ScrollPos': '', 'NameList1$ScrollPosChange': '', 'NameList1$_SortExpression': '',
    'NameList1$ctl03': '', 'NameList1$ctl05': '',
    'DocDetails1$PageSize': '', 'DocDetails1$PageIndex': '', 'DocDetails1$SortExpression': '',
    'BasketCtrl1$ctl01': '', 'BasketCtrl1$ctl03': '',
    'OrderList1$ctl01': '', 'OrderList1$ctl03': '',
    '__EVENTTARGET': '', '__EVENTARGUMENT': '', '__LASTFOCUS': '', '__VIEWSTATE': '',
    '__ASYNCPOST': 'true',
    'SearchFormEx1$btnSearch': 'Search',
}
The formdata_pre above is what gets passed as formdata in the code below (Stack Overflow renders this dict oddly when pasted inline, hence showing it separately).
import scrapy
from scrapy import FormRequest
from scrapy.shell import inspect_response
import re


class FormySpider(scrapy.Spider):
    name = 'formy'
    allowed_domains = ['i2a.uslandrecords.com']
    start_urls = ['https://i2a.uslandrecords.com/ME/Cumberland/D/Default.aspx']

    def parse(self, response):
        URL = 'https://i2a.uslandrecords.com/ME/Cumberland/D/Default.aspx'
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'}
        # Tick the "document images" checkbox to initialise the session.
        yield FormRequest(url=URL, method='POST', headers=headers, formdata={
            'ScriptManager1': 'Navigator1$SearchOptions1$UpdatePanel|Navigator1$SearchOptions1$DocImagesCheck',
            '__EVENTTARGET': 'Navigator1$SearchOptions1$DocImagesCheck',
            '__ASYNCPOST': 'true'},
            dont_filter=True, callback=self.after_login_1)

    def after_login_1(self, response):
        URL = 'https://i2a.uslandrecords.com/ME/Cumberland/D/Default.aspx'
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'}
        # Submit the actual search form (formdata_pre is defined above).
        yield FormRequest(url=URL, headers=headers, formdata=formdata_pre,
                          callback=self.after_login_2)

    def after_login_2(self, response):
        URL = 'https://i2a.uslandrecords.com/ME/Cumberland/D/Default.aspx'
        if response.xpath("(//a[contains(@id,'ButtonRow_Doc')])[1]").get():
            print(response.xpath("//a[contains(@id,'LinkButton')]/text()").getall())
            print(response.xpath("//a[contains(@id,'ButtonRow_Doc')]/text()").getall())
            # Visit every document row on the current result page.
            for row in response.xpath("//a[contains(@id,'ButtonRow_Doc')]/@href").getall():
                event = re.findall(r"doPostBack\('([^()].*)'\,''\)", row)[0]
                scriptmanager = 'DocList1$UpdatePanel|' + event
                ndoc = re.findall(r"\#\_(.*)'\,", row)[0]
                nbook = response.xpath("(//a[contains(@href,\"_{}',\")])[3]/text()".format(ndoc)).get()
                npage = response.xpath("(//a[contains(@href,\"_{}',\")])[4]/text()".format(ndoc)).get()
                yield FormRequest(url=URL,
                                  formdata={'ScriptManager1': scriptmanager,
                                            '__EVENTTARGET': event},
                                  meta={'nbook': nbook, 'npage': npage},
                                  callback=self.after_login_3)
            # Paginate; dont_filter is needed because the pagination POST
            # body is identical each time and would be dupe-filtered.
            if response.xpath("//a[@id='DocList1_LinkButtonNext']").get():
                yield FormRequest(url=URL,
                                  formdata={'ScriptManager1': 'DocList1$UpdatePanel|DocList1$LinkButtonNext',
                                            '__EVENTTARGET': 'DocList1$LinkButtonNext'},
                                  dont_filter=True,
                                  callback=self.after_login_2)
        else:
            print("empty week")

    def after_login_3(self, response):
        names = response.xpath("//a[contains(@id,'GrantorGrantee')]/text()").getall()
        nbook = response.request.meta['nbook']
        npage = response.request.meta['npage']
        yield {'nbook': nbook, 'npage': npage, 'names': names}
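For reference, the regex in after_login_2 expects the row links to be ASP.NET __doPostBack hrefs; the sample href below is hypothetical (the real control ids on the site may differ), but it shows what the extraction yields:

```python
import re

# Hypothetical href as rendered by an ASP.NET grid row link.
href = "javascript:__doPostBack('DocList1$GridView_Document$ctl02$ButtonRow_DocNum_0','')"

# Pull the postback target out of the href, as the spider does.
event = re.findall(r"doPostBack\('([^()].*)'\,''\)", href)[0]
scriptmanager = 'DocList1$UpdatePanel|' + event

print(event)          # -> DocList1$GridView_Document$ctl02$ButtonRow_DocNum_0
print(scriptmanager)  # -> DocList1$UpdatePanel|DocList1$GridView_Document$ctl02$ButtonRow_DocNum_0
```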
On the output side, here is an example of a mismatch:
15643 || 20 || {FRANTZ RICHARD C, MORTGAGE ELECTRONIC REGISTRATION SYSTEMS, FLEET MORTGAGE}
In the row above, the names do not belong with that book and page number, and the name "HUGHES ANDREW" is missing entirely. This is just one example.
My guess is that the requests are being fired faster than the responses come back, and that this lag causes the disconnect and the lost items.
Solution
You didn't provide code or output, so it is hard to help. It sounds like the page renders results based on your session, and because Scrapy requests run concurrently, the sessions may overlap.
You can give each request its own session with the meta={'cookiejar': different_values_here}
parameter; you can read more here.