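How to fix the Nutch Selenium Interactive plugin ignoring the chromedriver configuration
I configured nutch-site.xml for a local crawl, including the protocol-interactiveselenium plugin.
I only configured the basics, so the configuration is very simple (these are the properties in conf/nutch-site.xml).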
<property>
<name>plugin.includes</name>
<value>protocol-interactiveselenium|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
By default Nutch includes plugins to crawl HTML and various other
document formats via HTTP/HTTPS and indexing the crawled content
into Solr. More plugins are available to support more indexing
backends, to fetch ftp:// and file:// URLs, for focused crawling, and many other use cases.
</description>
</property>
<property>
<name>selenium.driver</name>
<value>chrome</value>
<description>
A String value representing the flavour of Selenium
WebDriver() to use. Currently the following options
exist - 'firefox', 'chrome', 'safari', 'opera' and 'remote'.
If 'remote' is used it is essential to also set correct properties for
'selenium.hub.port', 'selenium.hub.path', 'selenium.hub.host', 'selenium.hub.protocol', 'selenium.grid.driver', 'selenium.grid.binary'
and 'selenium.enable.headless'.
</description>
</property>
<property>
<name>webdriver.chrome.driver</name>
<value>/Users/theo/DISKS/Work/PNR/chromedriver</value>
<description>The path to the ChromeDriver binary</description>
</property>
This is from the Nutch log:
2020-08-17 23:40:57,427 ERROR interactiveselenium.Http - Failed to get protocol output
java.lang.RuntimeException: java.lang.IllegalStateException: The driver executable does not exist: /root/chromedriver
at org.apache.nutch.protocol.selenium.HttpWebClient.getDriverForPage(HttpWebClient.java:153)
at org.apache.nutch.protocol.interactiveselenium.HttpResponse.readPlainContent(HttpResponse.java:401)
at org.apache.nutch.protocol.interactiveselenium.HttpResponse.<init>(HttpResponse.java:280)
at org.apache.nutch.protocol.interactiveselenium.Http.getResponse(Http.java:57)
at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:383)
at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:352)
Caused by: java.lang.IllegalStateException: The driver executable does not exist: /root/chromedriver
at com.google.common.base.Preconditions.checkState(Preconditions.java:585)
at org.openqa.selenium.remote.service.DriverService.checkExecutable(DriverService.java:146)
at org.openqa.selenium.remote.service.DriverService.findExecutable(DriverService.java:141)
at org.openqa.selenium.chrome.ChromeDriverService.access$000(ChromeDriverService.java:35)
at org.openqa.selenium.chrome.ChromeDriverService$Builder.findDefaultExecutable(ChromeDriverService.java:159)
at org.openqa.selenium.remote.service.DriverService$Builder.build(DriverService.java:355)
at org.openqa.selenium.chrome.ChromeDriverService.createDefaultService(ChromeDriverService.java:94)
at org.openqa.selenium.chrome.ChromeDriver.<init>(ChromeDriver.java:157)
at org.apache.nutch.protocol.selenium.HttpWebClient.createChromeWebDriver(HttpWebClient.java:182)
at org.apache.nutch.protocol.selenium.HttpWebClient.getDriverForPage(HttpWebClient.java:89)
... 5 more
2020-08-17 23:40:57,430 INFO fetcher.FetcherThread - FetcherThread 46 fetch of https://www.amazon.in/ failed with: java.lang.RuntimeException: java.lang.IllegalStateException: The driver executable does not exist: /root/chromedriver
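Why is it looking in the wrong place?
In fact, it picks up the other settings in nutch-site.xml correctly. Once protocol-interactiveselenium was included in plugin.includes, it started fetching with Selenium.
Also, earlier it was looking for /root/geckodriver, which is the Firefox driver. After I changed selenium.driver to chrome, it started looking for /root/chromedriver.
So far so good. I then changed the webdriver.chrome.driver property, but that does not seem to be taken into account.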
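Solution
Looking at the code of HttpWebClient, the property webdriver.chrome.driver is overwritten by the value of selenium.grid.binary. Pointing the latter at your chromedriver should work. Please open an issue at https://issues.apache.org/jira/projects/NUTCH. It is not clear whether this is a bug or a documentation problem, but it should be fixed either way.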
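A minimal sketch of the change this answer suggests, reusing the chromedriver path from the question. It assumes that selenium.grid.binary is the property the plugin actually reads for the local Chrome driver, and that the /root/chromedriver path in the log is only the fallback used when that property is unset. Both should be verified against the HttpWebClient source of the Nutch version in use.

```xml
<!-- Assumption: protocol-interactiveselenium takes the driver binary path
     from selenium.grid.binary, not from webdriver.chrome.driver. -->
<property>
  <name>selenium.grid.binary</name>
  <!-- Path taken from the question; replace with your own chromedriver location. -->
  <value>/Users/theo/DISKS/Work/PNR/chromedriver</value>
  <description>Path to the driver binary used by the Selenium plugin</description>
</property>
```

If this works, the webdriver.chrome.driver property in nutch-site.xml is probably redundant, since the plugin appears to set it itself from this value.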