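Selenium: what is the best way to imitate a real user?
I have been using Selenium on Google Colab to download seller data from an auction site. For the past few weeks I have been unable to download the site's content. I added a fake user agent (fake_useragent), but the result is the same. How else can I look like a real user so that the page downloads?
My code: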
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from fake_useragent import UserAgent
options = webdriver.ChromeOptions()
ua = UserAgent(use_cache_server=False)
userAgent = ua.random
print(userAgent)
options.add_argument("window-size=1280,800")
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--disable-blink-features=AutomationControlled')
options.add_argument(f'user-agent={userAgent}')
driver = webdriver.Chrome(options=options)
driver.get("https://allegro.pl/oferta/zageszczarka-6-5km-90kg-higher-briggs-gratisy-9003885105#aboutSeller")
print(driver.page_source)
Result:
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36
<html><head><title>allegro.pl</title><style>#cmsg{animation: A 1.5s;}@keyframes A{0%{opacity:0;}99%{opacity:0;}100%{opacity:1;}}</style><meta name="viewport" content="width=device-width,initial-scale=1.0"></head><body style="margin:0"><script>var dd={'cid':'AHrlqAAAAAMAOIflZgDZm2IAI-ywFA==','hsh':'77DC0FFBAA0B77570F6B414F8E5BDB','t':'fe','s':29560,'host':'geo.captcha-delivery.com'}</script><script src="https://ct.captcha-delivery.com/c.js"></script><script>if("string"==typeof navigator.userAgent&&navigator.userAgent.indexOf("Firefox")>-1){var isIframeLoaded=!1,maxTimeoutMs=5e3;function iframeOnload(e){isIframeLoaded=!0;var a=document.getElementById("noiframe");a&&a.parentNode.removeChild(a)}var initialTime=(new Date).getTime();setTimeout(function(){isIframeLoaded||(new Date).getTime()-initialTime>maxTimeoutMs&&(document.body.innerHTML='<div id="noiframe">Please enable JS and disable any ad blocker</div>'+document.body.innerHTML)},maxTimeoutMs)}else function iframeOnload(){}</script><iframe src="https://geo.captcha-delivery.com/captcha/?initialCid=AHrlqAAAAAMAOIflZgDZm2IAI-ywFA%3D%3D&hash=77DC0FFBAA0B77570F6B414F8E5BDB&cid=ak0Wk_5LBEPLw9rTmErZ~211JLk9IruT-DV3pn2r.NzAZ_JOOcDsOjFjoiO8O88Uty8imz7f4IXqYdOqun_vy9SJOl7y7x-cu4m.D1jxOt&t=fe&referer=https%3A%2F%2Fallegro.pl%2Foferta%2Fzageszczarka-6-5km-90kg-higher-briggs-gratisy-9003885105%23aboutSeller&s=29560" width="100%" height="100%" style="height:100vh;" frameborder="0" border="0" scrolling="yes" onload="iframeOnload()"></iframe>
</body></html>
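Solution
I looked at the site, and it appears to blacklist your IP when you visit it with a Selenium-driven Chrome browser.
This should work (in headed mode; headless mode is not guaranteed): https://github.com/ultrafunkamsterdam/undetected-chromedriver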
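A minimal sketch of how the question's script might look with undetected-chromedriver, assuming the package is installed (pip install undetected-chromedriver) and a matching Chrome is available. The helper names chrome_arguments and fetch_page are made up for illustration. Headed mode needs a display, so on a server such as Colab a virtual display would likely be required; that part is not shown. This has not been verified against the site.

```python
def chrome_arguments(headless=False):
    """Chrome flags carried over from the question (hypothetical helper)."""
    args = [
        "--window-size=1280,800",
        "--no-sandbox",
        "--disable-dev-shm-usage",
    ]
    if headless:
        # Headless mode is more likely to be detected; off by default.
        args.append("--headless=new")
    return args


def fetch_page(url, headless=False):
    """Load url with undetected-chromedriver and return its HTML."""
    # Imported here so the module still loads when the package is missing.
    import undetected_chromedriver as uc

    options = uc.ChromeOptions()
    for arg in chrome_arguments(headless):
        options.add_argument(arg)

    driver = uc.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always close the browser, even if the page load fails.
        driver.quit()


# Example call (requires Chrome and a display):
# html = fetch_page("https://allegro.pl/oferta/zageszczarka-6-5km-90kg-higher-briggs-gratisy-9003885105#aboutSeller")
```

The random user agent from the question is left out on purpose here: a very old user agent string (Chrome 41 in the output above) that does not match the real browser version may itself look suspicious.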
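Also, the server that runs your Google Colab session must not have a blacklisted IP. If it does, that is too bad: there is nothing you can do about it from within Colab.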
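To tell whether a given run was blocked, one option is to look for the captcha host in the returned HTML. The sketch below assumes the block page always references captcha-delivery.com, as it does in the output shown in the question; the function name looks_blocked is hypothetical.

```python
# Host that serves the captcha in the question's output (assumed stable).
CAPTCHA_HOST = "captcha-delivery.com"


def looks_blocked(page_source):
    """Return True if the HTML looks like the captcha page, not the offer."""
    return CAPTCHA_HOST in page_source.lower()


# Example with a shortened version of the blocked response:
blocked_html = '<script src="https://ct.captcha-delivery.com/c.js"></script>'
normal_html = "<html><body><h1>Some offer title</h1></body></html>"
print(looks_blocked(blocked_html))
print(looks_blocked(normal_html))
```

If this returns True from Colab but the same script works from another network, the IP is the likely cause rather than the browser fingerprint.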
Edit: you can read more about how sites detect the Selenium driver here: https://stackoverflow.com/a/56529616/8068153