Unable to download files from a website using Python

How to fix being unable to download files from a website using Python

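I'm new to web scraping, and I'm running into problems downloading some Excel files from a website with every .get approach I can think of. I have been able to parse the HTML easily and get the URL of every link on the page, but I'm not experienced enough to understand why I can't actually download the files (cookies, sessions, etc.; I don't know).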

Here is the website:

https://mlcu.org.eg/ar/3118/%D9%82%D9%88%D8%A7%D8%A6%D9%85-%D9%85%D8%AC%D9%84%D8%B3-%D8%A7%D9%84%D8%A7%D9%85%D9%86-%D8%B0%D8%A7%D8%AA-%D8%A7%D9%84%D8%B5%D9%84%D8%A9

If you scroll down, you will find 5 Excel file links, none of which I can download. (Just search for id="AutoDownload".)

When I try the requests .get method and save the file using

import requests
requests.Session()
res = requests.get(url).content
with open(filename) as f:
   f.write(res.content)

I get an error saying that res is a bytes object, and when I inspect res as a variable, the output is:

b'<html><head><title>Request Rejected</title></head><body>The requested URL was rejected. 
Please consult with your administrator.<br><br>Your support ID is: 11190392837244519859</body></html>

I've been trying for a while now and would really appreciate any help. Thanks a lot.

Solutions

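If you don't have enough experience to manually set all the right parameters in the HTTP request to avoid the "Request Rejected" error (I wouldn't be able to do it myself), I suggest using a higher-level approach such as Selenium.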

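Selenium automates actions performed by a browser installed on your computer, such as downloading files (so it can be used both for automated testing of web applications and for web scraping). The idea is that the HTTP requests generated by a browser are better than the ones you could write by hand.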

Here is a tutorial that can help you get started with Selenium.

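To download the files, you need to set the "User-Agent" field in the headers of the Python request. This can be done by passing a dictionary to the get function: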

 file = session.get(url,headers=my_headers)

Apparently, this host does not respond to requests from Python that carry the following User-Agent:

'User-Agent': 'python-requests/2.24.0'

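Keep in mind that if you pass a different value for this field in the request headers (for example, one from Firefox, see below), the host will think the request comes from a Firefox user and will respond with the actual file.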
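The effect of the User-Agent header can be reproduced without touching the real site. The sketch below is only an illustration: it starts a toy local server that rejects any client whose User-Agent does not start with "Mozilla/" (an assumed rule; the real host's filtering logic is unknown), and it uses the standard library's urllib instead of requests so that it runs without third-party packages or network access. The names PickyHandler and fetch are hypothetical.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class PickyHandler(BaseHTTPRequestHandler):
    """Toy stand-in for a firewall that rejects non-browser User-Agents."""

    def do_GET(self):
        agent = self.headers.get('User-Agent', '')
        if agent.startswith('Mozilla/'):  # assumed rule, for illustration only
            status, body = 200, b'file contents'
        else:
            status, body = 403, b'Request Rejected'
        self.send_response(status)
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


# Ignore any proxy settings so the request really goes to the local server.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def fetch(url, headers=None):
    """Return (status, body) for a GET request, even for error statuses."""
    request = urllib.request.Request(url, headers=headers or {})
    try:
        with opener.open(request) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as error:
        return error.code, error.read()


server = HTTPServer(('127.0.0.1', 0), PickyHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:%d/' % server.server_port

firefox_agent = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:82.0) '
                 'Gecko/20100101 Firefox/82.0')
default_result = fetch(url)  # sent with urllib's default Python User-Agent
browser_result = fetch(url, {'User-Agent': firefox_agent})

server.shutdown()
server.server_close()

print(default_result)
print(browser_result)
```

The first request is rejected and the second one receives the file body, even though the only difference between them is the User-Agent header.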

Here is the complete version of the requests code:

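import requests

# Browser-like headers; the host rejects the default python-requests User-Agent
my_headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:82.0) Gecko/20100101 Firefox/82.0',
    'Accept-Encoding': 'gzip, deflate',
    'Accept': '*/*',
    'Connection': 'keep-alive',
}

# url (link to one Excel file) and filename (local path) are assumed to be defined already
session = requests.Session()
file = session.get(url, headers=my_headers)

# Open in binary mode, since file.content is a bytes object
with open(filename, 'wb') as f:
    f.write(file.content)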

The latest Firefox user agent worked for me, but you can find more possible values for this field here.
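Whichever headers are used, it may be worth checking that the response is really a file before writing it to disk, because the rejection page arrives as ordinary HTML bytes. The helper below is a rough heuristic, not a complete check: looks_like_rejection is a hypothetical name, and the "Request Rejected" marker is taken from the error page quoted in the question.

```python
def looks_like_rejection(body, content_type=''):
    """Heuristic: True if the response looks like an HTML page, not a file."""
    if 'text/html' in content_type.lower():
        return True
    head = body[:512].lower()
    return b'request rejected' in head or head.lstrip().startswith(b'<html')


rejected = (b'<html><head><title>Request Rejected</title></head>'
            b'<body>The requested URL was rejected.</body></html>')
# Legacy .xls files start with the OLE2 compound-file signature.
excel_like = b'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1' + b'\x00' * 16

print(looks_like_rejection(rejected, 'text/html; charset=utf-8'))
print(looks_like_rejection(rejected))
print(looks_like_rejection(excel_like, 'application/vnd.ms-excel'))
```

With requests, it could be called as looks_like_rejection(file.content, file.headers.get('Content-Type', '')) right before opening the output file.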

So I finally came up with a solution that uses only requests and the standard Python HTML parser.

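From what I found, the Request Rejected error is generally hard to trace back to an exact cause. In this case, it was due to the HTTP request lacking a browser-like User-Agent.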

import requests
from html.parser import HTMLParser

# Custom parser that collects the href of every <a id="AutoDownload"> link
link_urls = []
class AutoDownloadLinksHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'a' and ('id', 'AutoDownload') in attrs:
            href = [value for name, value in attrs if name == 'href'][0]
            link_urls.append(href)

# Get the links to the files
url = 'https://mlcu.org.eg/ar/3118/%D9%82%D9%88%D8%A7%D8%A6%D9%85-%D9%85%D8%AC%D9%84%D8%B3-%D8%A7%D9%84%D8%A7%D9%85%D9%86-%D8%B0%D8%A7%D8%AA-%D8%A7%D9%84%D8%B5%D9%84%D8%A9'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'}
links_page = requests.get(url, headers=headers)
AutoDownloadLinksHTMLParser().feed(links_page.content.decode('utf-8'))

# Download the files
host = 'https://mlcu.org.eg'
for i, link_url in enumerate(link_urls):
    file_content = requests.get(host + link_url, headers=headers).content
    with open('file' + str(i) + '.xls', 'wb') as f:
        f.write(file_content)
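The link-extraction step can be tried offline on a small piece of markup. The sketch below is a variation on the parser above, not the code from the answer: it stores the links on the parser instance instead of in a global list, and it builds absolute URLs with urllib.parse.urljoin, which also copes with hrefs that are already absolute. The LinkCollector name, the sample HTML and the shortened page URL are all made up for illustration; the real page's markup may differ.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags whose id matches wanted_id."""

    def __init__(self, wanted_id):
        super().__init__()
        self.wanted_id = wanted_id
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if (tag == 'a' and attributes.get('id') == self.wanted_id
                and attributes.get('href')):
            self.hrefs.append(attributes['href'])


# Made-up markup for illustration only.
sample_html = '''
<a id="AutoDownload" href="/files/list1.xls">List 1</a>
<a id="Other" href="/about">About</a>
<a id="AutoDownload" href="https://example.com/files/list2.xls">List 2</a>
'''
page_url = 'https://mlcu.org.eg/ar/3118/page'  # hypothetical, shortened URL

collector = LinkCollector('AutoDownload')
collector.feed(sample_html)

# urljoin resolves relative hrefs against the page URL and leaves
# absolute hrefs untouched.
file_urls = [urljoin(page_url, href) for href in collector.hrefs]
print(file_urls)
```

Each entry of file_urls could then be passed to requests.get together with the same headers dictionary used for the page itself.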
