How do I get hrefs from the whole page with the correct CSS selector in Scrapy?
I am trying to scrape a real-estate site: https://www.nepremicnine.net/oglasi-prodaja/slovenija/hisa/. I want to get the href hidden in the tag of each house picture.
I want to do this across the whole page (and across the other pages as well). Here is the code I wrote, which returns nothing (e.g. an empty dictionary):
import scrapy
from ..items import RealEstateSloItem
import time

# first get all the URLs that have more info on the houses
# next crawl those URLs to get the desired information
class RealestateSpider(scrapy.Spider):
    # allowed_domains = ['nepremicnine.net']
    name = 'realestate'
    page_number = 2
    # page 1 url
    start_urls = ['https://www.nepremicnine.net/oglasi-prodaja/slovenija/hisa/1/']

    def parse(self, response):
        items = RealEstateSloItem()  # create it from the items class --> need to store it down
        all_links = response.css('a.slika a::attr(href)').extract()
        items['house_links'] = all_links
        yield items
        next_page = 'https://www.nepremicnine.net/oglasi-prodaja/slovenija/hisa/' + str(RealestateSpider.page_number) + '/'
        # print(next_page)
        # if next_page is not None:  # for buttons
        if RealestateSpider.page_number < 180:  # then only make sure to go to the next page
            # if yes then increase it --> for pagination
            time.sleep(1)
            RealestateSpider.page_number += 1
            # parse automatically checks for response.follow if it's there when it's done with this page
            # this is a recursive function
            # follow the next page, and decide where to go after following
            yield response.follow(next_page, self.parse)  # want it to go back to parse
Can you tell me what I am doing wrong with my CSS selector?
Solution
Your selector is looking for an a element nested inside an a.slika element, which does not match anything. This should fix your problem:
all_links = response.css('a.slika ::attr(href)').extract()
These will be relative URLs, which you can turn into absolute URLs with response.urljoin(), using the response URL as the base.
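Since response.urljoin() resolves links the same way as the standard library's urllib.parse.urljoin(), you can check how a relative listing link would resolve against the page URL without running the spider. Note that the relative href below is a made-up example path, not one taken from the actual site:

```python
from urllib.parse import urljoin

# the page being crawled (same base URL the spider starts from)
base = 'https://www.nepremicnine.net/oglasi-prodaja/slovenija/hisa/1/'

# a hypothetical root-relative href as the selector might extract it
relative_href = '/oglasi-prodaja/primer-oglasa/'

# resolve it against the response URL, as response.urljoin() would
absolute = urljoin(base, relative_href)
print(absolute)  # https://www.nepremicnine.net/oglasi-prodaja/primer-oglasa/
```

Inside the spider itself, that would look like items['house_links'] = [response.urljoin(link) for link in all_links].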