XPath cannot access certain divs

How to solve XPath being unable to reach certain divs

Introduction

I have to add the "Others also bought" items of certain product links to my crawler. It is really strange to me, because there are divs like "open-on-mobile" and "inner generated"; what does that mean for me?

Goal

Apart from "Others also bought", I already have all the important information I need. After hours of trying I decided to ask here before wasting more time and getting even more frustrated.

HTML structure

click me

My code

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ..items import DuifcsvItem
import csv

class DuifSpider(scrapy.Spider):
    name = "duif"
    allowed_domains = ['duif.nl']
    # Note: the Scrapy setting is FEED_EXPORT_FIELDS, not FIELD_EXPORT_FIELDS
    custom_settings = {'FEED_EXPORT_FIELDS': ['SKU', 'Title', 'Title_small', 'NL_PL_PC', 'Description']}
    with open("duifonlylinks.csv", "r") as f:
        reader = csv.DictReader(f)
        start_urls = [items['Link'] for items in reader]

    # Note: rules are only honored by CrawlSpider; plain scrapy.Spider ignores them
    rules = (
        Rule(LinkExtractor(), callback='parse'),
    )

    def parse(self, response):
        card = response.xpath('//div[@class="heading"]')

        if not card:
            print('No productlink', response.url)

        items = DuifcsvItem()
        items['Link'] = response.url
        items['SKU'] = response.xpath('//p[@class="desc"]/text()').get().strip()
        items['Title'] = response.xpath('//h1[@class="product-title"]/text()').get()
        items['Title_small'] = response.xpath('//div[@class="left"]/p/text()').get()
        items['NL_PL_PC'] = response.xpath('//div[@class="desc"]/ul/li/em/text()').getall()
        items['Description'] = response.xpath('//div[@class="item"]/p/text()').getall()
        yield items

The actual page: https://www.duif.nl/product/pot-seal-matt-finish-light-pink-large

It would be great if I could reach this href with XPath.

XPaths I have already tried

>>> response.xpath('//div[@class="title"]/h3/text()').get()
>>> response.xpath('//div[@class="inner generated"]/div//h3/text()').get()
>>> response.xpath('//div[@class="wrap-products"]/div/div/a/@href').get()
>>> response.xpath('/div[@class="description"]/div/h3/text()').get()
>>> response.xpath('//div[@class="open-on-mobile"]/div/div/div/a/@href').get()
>>> response.xpath('//div[@class="product cross-square white"]/a/@href').get()
>>> response.xpath('//a[@class="product-link"]').get()
>>> response.xpath('//a[@class="product-link"]').getall()

Solution

You can find the "Others also bought" product IDs in this part of the HTML (see the createCrossSellItems call). These blocks are built in the browser by JavaScript, which is why none of the XPaths above match anything in the HTML that Scrapy actually downloads:

<script>
    $(function () {
            createUpsellItems("885034747 | 885034800 | 885034900 |")
                        createCrossSellItems("885034347 | 480010600 | 480010700 | 010046700 | 500061967 | 480011000 |")
    })
</script>
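Those IDs can be pulled out of the script text with a plain regular expression. A minimal standalone sketch, using the sample script above as input:

```python
import re

# Inline <script> content as served by the page (sample from above)
script_text = '''$(function () {
        createUpsellItems("885034747 | 885034800 | 885034900 |")
        createCrossSellItems("885034347 | 480010600 | 480010700 | 010046700 | 500061967 | 480011000 |")
})'''

# Grab the argument of createCrossSellItems, then split out the digit runs
raw = re.search(r'createCrossSellItems\("([^"]+)', script_text).group(1)
ids = re.findall(r"\d+", raw)
print(ids)
# ['885034347', '480010600', '480010700', '010046700', '500061967', '480011000']
```

Keeping the IDs as strings (rather than converting to int) preserves leading zeros such as 010046700.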

But adding the details of all these products to your main product is a bit tricky. First you need to understand how you want to save this one-to-many information. It could be a single field OtherAlsoBought where you save, for example, a JSON-like structure. Or you could use many fields like OtherAlsoBought_Product_1_Title, OtherAlsoBought_Product_1_Link, OtherAlsoBought_Product_2_Title, OtherAlsoBought_Product_2_Link, and so on.
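For illustration only, the two shapes might look like this (the related-product titles and links here are made-up placeholders, not real duif.nl data):

```python
# Option 1: a single nested field, keeping the one-to-many relation as a JSON-like list
item_nested = {
    "Title": "Pot Seal matt finish light pink large",
    "OtherAlsoBought": [
        {"title": "Placeholder product A", "link": "https://www.duif.nl/product/a"},
        {"title": "Placeholder product B", "link": "https://www.duif.nl/product/b"},
    ],
}

# Option 2: flattened into numbered fields, one title/link pair per related product
item_flat = {
    "Title": "Pot Seal matt finish light pink large",
    "OtherAlsoBought_Product_1_Title": "Placeholder product A",
    "OtherAlsoBought_Product_1_Link": "https://www.duif.nl/product/a",
    "OtherAlsoBought_Product_2_Title": "Placeholder product B",
    "OtherAlsoBought_Product_2_Link": "https://www.duif.nl/product/b",
}
```

The nested shape is easier to build in code (just append to a list) but needs serialization before CSV export; the flat shape exports to CSV directly but forces a fixed maximum number of related products.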

One possible way to collect these details is to save all the product IDs into an array and then yield them one at a time (a simple GET to https://www.duif.nl/api/v2/catalog/product?itemcode=885034347_Parent works fine with a Referer header), passing the array of remaining IDs along (using meta or cb_kwargs) to fetch the next ID. Of course, you also need to pass the main item with each request (adding the current product's details to it and yielding everything at the end).

Update: you need to add the fields you want to the following code:

import scrapy
import json
import re

class DuifSpider(scrapy.Spider):
    name = "duif"
    start_urls = ['https://www.duif.nl/product/pot-seal-matt-finish-light-pink-large']

    def parse(self, response):
        item = {}
        item['title'] = response.xpath('//h1[@class="product-title"]/text()').get()
        item['url'] = response.url
        item['cross_sell'] = []

        # Pull the cross-sell product IDs out of the inline <script> block
        cross_sell_items_raw = response.xpath('//script[contains(.,"createCrossSellItems(")]/text()').re_first(r'createCrossSellItems\("([^"]+)')
        cross_sell_items = re.findall(r"\d+", cross_sell_items_raw)

        if cross_sell_items:
            cross_sell_item_id = cross_sell_items.pop(0)
            yield scrapy.Request(
                f"https://www.duif.nl/api/v2/catalog/product?itemcode={cross_sell_item_id}_Parent",
                headers={
                    'referer': response.url,
                    'Content-type': 'application/json',
                    'Authorization': 'bearer null',
                    'Accept': '*/*',
                },
                callback=self.parse_cross_sell,
                meta={
                    'item': item,
                    'referer': response.url,
                    'cross_sell_items': cross_sell_items,
                },
            )
        else:
            # There are no "Others also bought" items for this page, just save the main item
            yield item

    def parse_cross_sell(self, response):
        main_item = response.meta["item"]
        cross_sell_items = response.meta["cross_sell_items"]

        data = json.loads(response.text)
        current_cross_sell_item = {}
        current_cross_sell_item['title'] = data["_embedded"]["products"][0]["name"]
        current_cross_sell_item['url'] = data["_embedded"]["products"][0]["url"]
        current_cross_sell_item['description'] = data["_embedded"]["products"][0]["description"]

        main_item['cross_sell'].append(current_cross_sell_item)
        if cross_sell_items:
            cross_sell_item_id = cross_sell_items.pop(0)
            yield scrapy.Request(
                f"https://www.duif.nl/api/v2/catalog/product?itemcode={cross_sell_item_id}_Parent",
                headers={
                    'referer': response.meta['referer'],
                    'Content-type': 'application/json',
                    'Authorization': 'bearer null',
                    'Accept': '*/*',
                },
                callback=self.parse_cross_sell,
                meta={
                    'item': main_item,
                    'referer': response.meta['referer'],
                    'cross_sell_items': cross_sell_items,
                },
            )
        else:
            # no more cross-sell items to process, save output
            yield main_item
