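How to fix being unable to scrape company reviews from Glassdoor with BeautifulSoup
I am trying to scrape company reviews from Glassdoor using BeautifulSoup, but I cannot extract anything from the site. This is the code I am using: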
from requests import get
from bs4 import BeautifulSoup
url = "https://www.glassdoor.com/Reviews/The-Wonderful-Company-Reviews-E1005987_P2.htm?sort.sortType=RD&sort.ascending=false"
response = get(url)
html_soup = BeautifulSoup(response.text,'html.parser')
html_soup
The code above extracts nothing from the page; the response only says "Bots not allowed". The output is shown below.
<!DOCTYPE html>
<html><head><title></title><style type="text/css">H1 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:22px;} H2 {font-family:Tahoma,sans-serif;color:white;background-color:#525D76;font-size:16px;} H3 {font-family:Tahoma,sans-serif;color:white;background-color:#525D76;font-size:14px;} BODY {font-family:Tahoma,sans-serif;color:black;background-color:white;} B {font-family:Tahoma,sans-serif;color:white;background-color:#525D76;} P {font-family:Tahoma,sans-serif;background:white;color:black;font-size:12px;}A {color : black;}A.name {color : black;}.line {height: 1px; background-color: #525D76; border: none;}</style> </head><body><h1>HTTP Status 403 - Bots not allowed</h1><div class="line"></div><p><b>type</b> Status report</p><p><b>message</b> <u>Bots not allowed</u></p><p><b>description</b> <u>Access to the specified resource has been forbidden.</u></p><hr class="line"/><h3>Apache Tomcat</h3></body></html>
I am new to web scraping. Can someone guide me on how to extract reviews from Glassdoor?
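Solution
To get a proper response from the server, set the User-Agent HTTP header. By default, requests identifies itself as python-requests/<version>, which this server appears to reject as a bot: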
from requests import get
from bs4 import BeautifulSoup
# Send a browser-like User-Agent instead of the requests default
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:79.0) Gecko/20100101 Firefox/79.0'}
url = "https://www.glassdoor.com/Reviews/The-Wonderful-Company-Reviews-E1005987_P2.htm?sort.sortType=RD&sort.ascending=false"
response = get(url,headers=headers)
html_soup = BeautifulSoup(response.text,'html.parser')
print(html_soup)
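Before parsing, it may help to check whether the server still returned the block page, since a 403 response has a body that BeautifulSoup will parse without complaint. The following is only a minimal sketch: the names BROWSER_HEADERS, looks_blocked and fetch_html are hypothetical helpers introduced here, the 10 second timeout is an arbitrary assumption, and the "Bots not allowed" marker is taken from the error page shown in the question. A browser-like User-Agent may not be sufficient if the site changes its checks.

```python
from typing import Dict, Optional

# Same browser-like User-Agent as in the answer above.
BROWSER_HEADERS: Dict[str, str] = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:79.0) "
        "Gecko/20100101 Firefox/79.0"
    )
}


def looks_blocked(status_code: int, body: str) -> bool:
    """Return True if the response looks like the 403 block page from the question."""
    return status_code == 403 or "Bots not allowed" in body


def fetch_html(url: str, timeout: float = 10.0) -> Optional[str]:
    """Fetch a page with browser-like headers; return None if the request was blocked."""
    # Imported inside the function so looks_blocked() can be used without requests installed.
    import requests

    response = requests.get(url, headers=BROWSER_HEADERS, timeout=timeout)
    if looks_blocked(response.status_code, response.text):
        return None
    # Raise an exception for any other 4xx/5xx status.
    response.raise_for_status()
    return response.text
```

If fetch_html returns a string, it can be passed to BeautifulSoup(html, 'html.parser') as in the answer; a None result means the server still treated the request as a bot.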