How to solve problems faced while web scraping
I am trying to extract reviews from Glassdoor, but I am facing problems. My code is below:
import requests
from bs4 import BeautifulSoup

headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',})
url = requests.get("https://www.glassdoor.co.in/Reviews/The-Wonderful-Company-Reviews-E1005987.htm?sort.sortType=RD&sort.ascending=false&countryRedirect=true", headers=headers)
urlContent = BeautifulSoup(url.content, "lxml")
print(urlContent)

review = urlContent.find_all('a', class_='reviewLink')
title = []
for i in range(0, len(review)):
    title.append(review[i].get_text())

rating = urlContent.find_all('div', class_='v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small')
score = []
for i in range(0, len(rating)):
    score.append(rating[i].get_text())

rev_pros = urlContent.find_all("span", {"data-test": "pros"})
pros = []
for i in range(0, len(rev_pros)):
    pros.append(rev_pros[i].get_text())

rev_cons = urlContent.find_all("span", {"data-test": "cons"})
cons = []
for i in range(0, len(rev_cons)):
    cons.append(rev_cons[i].get_text())

advse = urlContent.find_all("span", {"data-test": "advice-management"})
advise = []
for i in range(0, len(advse)):
    advise.append(advse[i].get_text())

location = urlContent.find_all('span', class_='authorLocation')
job_location = []
for i in range(0, len(location)):
    job_location.append(location[i].get_text())

import pandas as pd
df = pd.DataFrame()
df['Review Title'] = title
df['Overall Score'] = score
df['Pros'] = pros
df['Cons'] = cons
df['Jobs_Location'] = job_location
df['Advise to Mgmt'] = advise
I am facing two challenges here:

- Unable to extract anything for "advse" (used for "Advice to Management").
- Getting an error when using "job_location" as a column in the data frame:
  ValueError: Length of values does not match length of index

My finding for this error is that there are ten rows in the other columns, but "job_location" has fewer, because some reviews do not disclose the location.
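The mismatch can be reproduced with made-up data (the values below are hypothetical, and the fix assumes pandas' default RangeIndex):

```python
import pandas as pd

# Hypothetical data: three reviews, but only two locations were disclosed.
df = pd.DataFrame({'Review Title': ['A', 'B', 'C']})

try:
    df['Jobs_Location'] = ['City X', 'City Y']   # plain list: lengths must match
except ValueError as e:
    print(e)  # raises: Length of values does not match length of index

# Assigning a Series instead aligns on the index and fills the gap with NaN:
df['Jobs_Location'] = pd.Series(['City X', 'City Y'])
print(df)
```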
Can anyone help me with this? Thanks in advance.
Solution
A better approach is to find the <div> that encloses each whole review, and extract all of the required information from it before moving on to the next review. That makes it much easier to deal with cases where some reviews have missing information.
For example:
import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',})
url = requests.get("https://www.glassdoor.co.in/Reviews/The-Wonderful-Company-Reviews-E1005987.htm?sort.sortType=RD&sort.ascending=false&countryRedirect=true", headers=headers)
urlContent = BeautifulSoup(url.content, "lxml")

# Return an empty string when a tag was not found (find() returned None)
get_text = lambda x: x.get_text(strip=True) if x else ""

entries = []

for entry in urlContent.find_all('div', class_='row mt'):
    review = entry.find('a', class_="reviewLink")
    rating = entry.find('div', class_='v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small')
    rev_pros = entry.find("span", {"data-test": "pros"})
    rev_cons = entry.find("span", {"data-test": "cons"})
    location = entry.find('span', class_='authorLocation')
    advice = entry.find("span", {"data-test": "advice-management"})

    entries.append([
        get_text(review), get_text(rating), get_text(rev_pros), get_text(rev_cons), get_text(location), get_text(advice)
    ])

columns = ['Review Title', 'Overall Score', 'Pros', 'Cons', 'Jobs_Location', 'Advise to Mgmt']
df = pd.DataFrame(entries, columns=columns)
print(df)
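The behaviour of the get_text guard can be seen in isolation. FakeTag below is a made-up stand-in for a BeautifulSoup tag, used only so the demonstration does not need to fetch a real page:

```python
# A tiny stand-in for a BeautifulSoup tag (hypothetical class, for
# illustration only), mimicking the get_text(strip=...) interface.
class FakeTag:
    def __init__(self, text):
        self._text = text
    def get_text(self, strip=False):
        return self._text.strip() if strip else self._text

# The guard returns "" when find() yielded None, instead of raising
# AttributeError on None.get_text().
get_text = lambda x: x.get_text(strip=True) if x else ""

print(get_text(FakeTag("  Delhi  ")))  # "Delhi"
print(repr(get_text(None)))            # '' -- a missing tag no longer crashes
```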
The get_text() function ensures that an empty string is returned if nothing was found (i.e. the tag is None).
You will need to improve the logic for extracting the advice. The information for the whole page is held inside <script> tags, one of which holds JSON data. The advice information is not moved into the HTML until the user clicks on it, so it needs to be extracted from the JSON. If you take this approach, all of the other information can also be extracted directly from the JSON.
To do this, find all of the <script> tags and determine which one contains the reviews. Convert the JSON into a Python data structure (using the json library) and then locate the reviews, for example:
import requests
from bs4 import BeautifulSoup
import json
import pandas as pd

headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',})
url = requests.get("https://www.glassdoor.co.in/Reviews/The-Wonderful-Company-Reviews-E1005987.htm?sort.sortType=RD&sort.ascending=false&countryRedirect=true", headers=headers)
urlContent = BeautifulSoup(url.content, "lxml")

entries = []

for script in urlContent.find_all('script'):
    text = script.text
    if "appCache" in text:
        # extract the JSON from the script tag
        data = json.loads(text[text.find('{') : text.rfind('}') + 1])
        # Go through all keys in the dictionary and pick those containing reviews
        for key, value in data['apolloState'].items():
            if ".reviews." in key and "links" not in key:
                location = value['location']
                city = location['id'] if location else None
                entries.append([
                    value['summary'], value['ratingOverall'], value['pros'], value['cons'], city, value['advice']
                ])

columns = ['Review Title', 'Overall Score', 'Pros', 'Cons', 'Jobs_Location', 'Advise to Mgmt']
df = pd.DataFrame(entries, columns=columns)
print(df)
This would give you a dataframe like:
Review Title Overall Score Pros Cons Jobs_Location Advise to Mgmt
0 Upper management n... 3 Great benefits,lo... Career advancement... City:1146821 Listen to your emp...
1 Sales 2 Good atmosphere lo... Drive was very far... None None
2 As an organization... 2 Free water and goo... Not a lot of diver... None None
3 Great place to grow 4 If your direct man... Owners are heavily... City:1146821 None
4 Great Company 5 Great leadership,... To grow and move u... City:1146821 None
5 Lots of opportunit... 5 This is a fast pac... There's a sense of... City:1146821 Continue listening...
6 Interesting work i... 3 Working with great... High workload and ... None None
7 Wonderful 5 This company care... The drive,but we ... City:1146577 Continue growing y...
8 Horrendous 1 The pay was fairly... Culture of abuse a... City:1146821 Upper management l...
9 Upper Leadership a... 1 Strong Company,fu... You don't have a f... City:1146577 You get rid of fol...
It would help if you add print(data) to see the whole structure of the data that is returned. The only catch with this approach is that a further lookup is needed to convert the city ID into an actual location; that information is also contained in the JSON.
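That lookup could be sketched as follows. The exact layout of data['apolloState'] must be confirmed with print(data); the sketch assumes each city appears under a key like "City:1146821" with a 'name' field, and the city names below are made up for illustration:

```python
# Assumed shape of the apolloState dictionary -- verify with print(data).
# The names here are invented placeholders, not real lookups.
apolloState = {
    "City:1146821": {"id": "City:1146821", "name": "Some City"},
    "City:1146577": {"id": "City:1146577", "name": "Another City"},
}

def city_name(apollo_state, city_id):
    # Fall back to the raw ID if the city entry or its name is missing.
    entry = apollo_state.get(city_id) or {}
    return entry.get('name', city_id)

print(city_name(apolloState, "City:1146821"))  # Some City
print(city_name(apolloState, "City:9999999"))  # City:9999999 (unknown id)
```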