How do I scrape a website protected by ColdFusion?

Extracting the PDF URL from the following web page is easy.

https://www.osapublishing.org/boe/abstract.cfm?uri=boe-11-5-2745

But when I fetch that URL, the output contains something like the following instead of the PDF file.

<p>OSA has implemented a process that requires you to enter the letters and/or numbers below before you can download this article.</p>

Since the site sets the cookie cfid, it is presumably protected using ColdFusion. Does anyone know how to scrape a page like this? Thanks.

https://cookiepedia.co.uk/cookies/CFID

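Edit: The wget solution offered by Sev Roberts does not work. I checked Chrome DevTools (in a new incognito window): after the first request to https://www.osapublishing.org/boe/abstract.cfm?uri=boe-11-5-2745, many further requests are sent. I guess that because wget does not send these requests, the subsequent wget (with cookies) of https://www.osapublishing.org/boe/viewmedia.cfm?uri=boe-11-5-2745&seq=0 does not work. Can anyone tell which of these extra requests are essential? Thanks.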

Solution

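Websites have several ways of guarding against this kind of scraping, and against direct linking or embedding. The basic, older methods include:

  1. Checking the user's cookies: at a minimum, checking that the user already has a session from a previous page on the site; some sites go further and look for specific cookies or session variables that verify a genuine path through the site.
  2. Checking the cgi.http_referer variable to see whether the user arrived from the expected source.
  3. Checking whether cgi.http_user_agent looks like a known human browser, or checking that the user agent does not look like a known bot.

Smarter methods exist, of course, but in my experience, if you need more than the above, you are into the territory of requiring a captcha and/or requiring users to register and log in.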

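Obviously (2) and (3) are easily spoofed by setting the headers manually. For (1), if you are using cfhttp or its equivalent in another language, you need to make sure that the cookies returned in the Set-Cookie headers of the site's response are sent back in the headers of subsequent requests, using cfhttpparam. Various cfhttp wrappers and alternative libraries (for example, Java wrappers that bypass the cfhttp layer) can do this for you. If you want a simple example to understand the mechanism, Ben Nadel has an old but good one here: https://www.bennadel.com/blog/725-maintaining-sessions-across-multiple-coldfusion-cfhttp-requests.htm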
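For readers working outside ColdFusion, setting the headers behind checks (2) and (3) by hand might look like the Python sketch below. The function name `build_browser_like_request` and the User-Agent string are assumptions made up for illustration; the sketch only builds the request object and does not contact any site.

```python
import urllib.request

# Assumed value for illustration: any realistic browser User-Agent string would do.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def build_browser_like_request(url, referer=None, user_agent=BROWSER_USER_AGENT):
    """Build a request whose Referer and User-Agent headers are set manually."""
    request = urllib.request.Request(url)
    # Check (3): present a User-Agent that looks like an ordinary browser.
    request.add_header("User-Agent", user_agent)
    # Check (2): claim to have arrived from the expected previous page, if given.
    if referer is not None:
        request.add_header("Referer", referer)
    return request
```

The returned object can be passed to `urllib.request.urlopen` or to an opener that also handles cookies, which covers check (1).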

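Using the PDF URL from the link in your question, a few minutes of tinkering in Chrome showed that if I dropped the cookies from the previous page and kept the http_referer, I got the captcha challenge; but if I kept the cookies and dropped the http_referer, I got straight through to the PDF. This suggests that they care about the cookies and not the referer.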

A copy of Ben's example, for SO completeness:

<cffunction
    name="GetResponseCookies"
    access="public"
    returntype="struct"
    output="false"
    hint="This parses the response of a CFHttp call and puts the cookies into a struct.">
 
    <!--- Define arguments. --->
    <cfargument
        name="Response"
        type="struct"
        required="true"
        hint="The response of a CFHttp call."
        />
    <!---
        Create the default struct in which we will hold
        the response cookies. This struct will contain structs
        and will be keyed on the name of the cookie to be set.
    --->
    <cfset LOCAL.Cookies = StructNew() />
    <!---
        Get a reference to the cookies that were returned
        from the page request. This will give us a numerically
        indexed struct of cookie strings (which we will have
        to parse out for values). BUT, check to make sure
        that cookies were even sent in the response. If they
        were not, then there is no work to be done.
    --->
    <cfif NOT StructKeyExists(
        ARGUMENTS.Response.ResponseHeader,"Set-Cookie"
        )>
        <!---
            No cookies were sent back in the response. Just
            return the empty cookies structure.
        --->
        <cfreturn LOCAL.Cookies />
    </cfif>
    <!---
        ASSERT: We know that cookies were returned in the page
        response and that they are available at the key
        "Set-Cookie" of the response header.
    --->
    <!---
        Now that we know that the cookies were returned,get
        a reference to the struct as described above.
    --->
    <!---
        The cookies might be coming back as a struct or they
        might be coming back as a string. If there is only
        ONE cookie being returned, then it comes back as a
        string. If that is the case, then re-store it as a
        struct.
    --->
    <cfif IsSimpleValue(ARGUMENTS.Response.ResponseHeader[ "Set-Cookie" ])>
        <cfset LOCAL.ReturnedCookies = {} />
        <cfset LOCAL.ReturnedCookies[1] = ARGUMENTS.Response.ResponseHeader[ "Set-Cookie" ] />
    <cfelse>
        <cfset LOCAL.ReturnedCookies = ARGUMENTS.Response.ResponseHeader[ "Set-Cookie" ] />
    </cfif>
    <!--- Loop over the returned cookies struct. --->
    <cfloop
        item="LOCAL.CookieIndex"
        collection="#LOCAL.ReturnedCookies#">
        <!---
            As we loop through the cookie struct,get
            the cookie string we want to parse.
        --->
        <cfset LOCAL.CookieString = LOCAL.ReturnedCookies[ LOCAL.CookieIndex ] />
        <!---
            For each of these cookie strings, we are going to
            need to parse out the values. We can treat the
            cookie string as a semi-colon delimited list.
        --->
        <cfloop
            index="LOCAL.Index"
            from="1"
            to="#ListLen( LOCAL.CookieString,';' )#"
            step="1">
            <!--- Get the name-value pair. --->
            <cfset LOCAL.Pair = ListGetAt(
                LOCAL.CookieString,LOCAL.Index,";"
                ) />
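            <!---
                Get the name as the first part of the pair
                separated by the equals sign. Trim it, because
                attributes after a semi-colon usually carry a
                leading space (e.g. " Path=/").
            --->
            <cfset LOCAL.Name = Trim( ListFirst( LOCAL.Pair,"=" ) ) />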
            <!---
                Check to see if we have a value part. Not all
                cookies are going to send values of length,which can throw off ColdFusion.
            --->
            <cfif (ListLen( LOCAL.Pair,"=" ) GT 1)>
                <!--- Grab the rest of the list. --->
                <cfset LOCAL.Value = ListRest( LOCAL.Pair,"=" ) />
            <cfelse>
                <!---
                    Since ColdFusion did not find more than one
                    value in the list,just get the empty string
                    as the value.
                --->
                <cfset LOCAL.Value = "" />
            </cfif>
            <!---
                Now that we have the name-value data values, we have to store them in the struct. If we are
                looking at the first part of the cookie string, this is going to be the name of the cookie and
                its struct key.
            --->
            <cfif (LOCAL.Index EQ 1)>
                <!---
                    Create a new struct with this cookie's name
                    as the key in the return cookie struct.
                --->
                <cfset LOCAL.Cookies[ LOCAL.Name ] = StructNew() />
                <!---
                    Now that we have the struct in place, let's
                    get a reference to it so that we can refer
                    to it in subsequent loops.
                --->
                <cfset LOCAL.Cookie = LOCAL.Cookies[ LOCAL.Name ] />
                <!--- Store the value of this cookie. --->
                <cfset LOCAL.Cookie.Value = LOCAL.Value />
                <!---
                    Now,this cookie might have more than just
                    the first name-value pair. Let's create an
                    additional attributes struct to hold those
                    values.
                --->
                <cfset LOCAL.Cookie.Attributes = StructNew() />
            <cfelse>
                <!---
                    For all subsequent calls, just store the
                    name-value pair into the established
                    cookie's attributes struct.
                --->
                <cfset LOCAL.Cookie.Attributes[ LOCAL.Name ] = LOCAL.Value />
            </cfif>
        </cfloop>
    </cfloop>
    <!--- Return the cookies. --->
    <cfreturn LOCAL.Cookies />
</cffunction>
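For readers not using ColdFusion, the same parsing idea can be sketched in Python. This is a rough port under stated assumptions, not a translation of Ben's code line for line: the function name `parse_set_cookie_headers` and the sample values in the usage note are hypothetical, and the standard library's `http.cookies.SimpleCookie` is an alternative for real use.

```python
def parse_set_cookie_headers(set_cookie_headers):
    """Parse Set-Cookie header strings into
    {name: {"value": ..., "attributes": {...}}}."""
    # A single header may arrive as a plain string; normalise it to a list.
    if isinstance(set_cookie_headers, str):
        set_cookie_headers = [set_cookie_headers]

    cookies = {}
    for header in set_cookie_headers:
        # Treat the header as a semicolon-delimited list, skipping empty parts.
        pairs = [part.strip() for part in header.split(";") if part.strip()]
        if not pairs:
            continue
        # The first pair is the cookie's own name and value.
        name, _, value = pairs[0].partition("=")
        cookie = {"value": value, "attributes": {}}
        # The remaining pairs are attributes such as Path, or HttpOnly,
        # which has no value and is stored as an empty string.
        for pair in pairs[1:]:
            attr_name, _, attr_value = pair.partition("=")
            cookie["attributes"][attr_name.strip()] = attr_value
        cookies[name] = cookie
    return cookies
```

With a made-up header such as `"CFID=12345; Path=/; HttpOnly"`, the result maps `CFID` to a value of `12345` plus an attributes dict holding `Path` and `HttpOnly`.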

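Assuming you get a cfhttp response from the first page, https://www.osapublishing.org/boe/abstract.cfm?uri=boe-11-5-2745, pass that response into the function above, and hold its result in a variable named cookieStruct, you can then use this inside subsequent cfhttp requests: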

<cfloop item="strCookie" collection="#cookieStruct#">
    <cfhttpparam type="COOKIE" name="#strCookie#" value="#cookieStruct[strCookie].Value#" />
</cfloop>
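Outside ColdFusion, the equivalent step is to join the stored cookie values into a single Cookie request header. The Python sketch below assumes a dict shaped like the struct returned by GetResponseCookies (each cookie name mapped to a value and its attributes); the function names are hypothetical, and only the request object is built, with nothing sent.

```python
import urllib.request


def build_cookie_header(cookie_struct):
    """Join parsed cookies into the value of one Cookie request header."""
    # Only name=value pairs are sent back; attributes such as Path are not.
    return "; ".join(
        f"{name}={cookie['value']}" for name, cookie in cookie_struct.items()
    )


def request_with_cookies(url, cookie_struct):
    """Build a request for url that carries the given cookies."""
    request = urllib.request.Request(url)
    header = build_cookie_header(cookie_struct)
    if header:
        request.add_header("Cookie", header)
    return request
```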

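Edit: if you are using wget rather than cfhttp, you could try the approach from the answer to the question below, but without posting a username and password, since you don't actually need the login form: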

How to get past the login page with Wget?

For example:

# Get a session.
wget --save-cookies cookies.txt \
     --keep-session-cookies \
     --delete-after \
     'https://www.osapublishing.org/boe/abstract.cfm?uri=boe-11-5-2745'

# Now grab the page or pages we care about.
# You may also need to add valid http_referer or http_user_agent headers.
# The URL is quoted so the shell does not treat "&" as a background operator.
wget --load-cookies cookies.txt \
     'https://www.osapublishing.org/boe/viewmedia.cfm?uri=boe-11-5-2745&seq=0'

...although, as others have pointed out, you may be violating the source's terms of service, so I can't recommend actually doing this.
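The same two-step idea (visit the first page to obtain session cookies, then request the target with those cookies) can be sketched with Python's standard cookie jar. The function names here are hypothetical, and whether two requests are enough for this particular site is not established: the question's edit reports that the wget version did not work.

```python
import http.cookiejar
import urllib.request


def make_session_opener():
    """Return (opener, jar); the opener stores and resends cookies via jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def fetch_in_session(opener, first_url, second_url):
    """Visit first_url for its cookies, then return the body of second_url."""
    # Step 1: comparable to `wget --save-cookies ... --delete-after`;
    # the body is read and discarded, only the cookies matter.
    with opener.open(first_url) as response:
        response.read()
    # Step 2: comparable to `wget --load-cookies`; the same opener
    # sends back any cookies it received in step 1.
    with opener.open(second_url) as response:
        return response.read()
```

A caller would create the opener with `make_session_opener()` and pass it, together with the abstract.cfm and viewmedia.cfm URLs, to `fetch_in_session`. A PDF body starts with the bytes `%PDF`, so checking `body.startswith(b"%PDF")` distinguishes the file from a challenge page.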
