How to fix a web scraper function that fails to collect new data
I'm running into some problems with a web scraping function I built.
First, I wrote some helper functions to keep the code tidy:
def get_reviews(string):
    index = string.find('Reviews')
    review = string[index - 10: index + 10]
    reviews.append(review.strip())

def append_institute(institute):
    if institute is not None:
        institutes.append(institute.text.strip())
    else:
        institutes.append(-1)

def append_provider(provider):
    if provider is not None:
        providers.append(provider.text.strip())
    else:
        providers.append(-1)

def append_date(date):
    if date is not None:
        dates.append(date.text.strip())
    else:
        dates.append('Self Paced')

def append_rating(rating):
    if rating is not None:
        ratings.append(rating.text.strip())
    else:
        ratings.append(-1)

def append_name(name):
    names.append(name)
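Note that these helpers all append to lists (`reviews`, `institutes`, and so on) that are not defined inside them. In Python, a function body can see its own locals, enclosing closures, module globals, and builtins, but never another function's locals. A minimal sketch of that rule, with hypothetical names unrelated to the scraper:

```python
def append_item(item):
    items.append(item)   # 'items' must exist as a global when this runs

def collect():
    items = []           # local to collect(); invisible to append_item
    append_item('x')     # raises NameError

try:
    collect()
except NameError as e:
    print(e)             # prints: name 'items' is not defined
```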
Then I wrote the scraper itself:
def get_data(pages):
    names = []
    institutes = []
    providers = []
    dates = []
    reviews = []
    ratings = []
    for page in pages:
        r = requests.get(page)
        soup = BeautifulSoup(r.content, 'html.parser')
        rows = soup.select('tbody tr')
        for row in rows:
            # name
            name = row.select_one('span', {'class': 'text-1 line-tight'}).text.strip()
            append_name(name)
            # institute
            institute = row.find('a', {'class': 'color-charcoal small-down-text-2 text-3'})
            append_institute(institute)
            # provider
            provider = row.find('span', {'class': 'hidden medium-up-inline-block'})
            append_provider(provider)
            # date
            date = row.find('td', {'itemprop': 'startDate'})
            append_date(date)
            # reviews
            rev = row.find('span', {'class': 'large-down-hidden block line-tight text-4 color-gray'})
            string = str(rev)
            get_reviews(string)
            # rating
            rating = row.find('span', attrs={'class': 'xlarge-up-hidden color-charcoal text-center'})
            append_rating(rating)
    df = pd.DataFrame({'name': names, 'institute': institutes, 'provider': providers,
                       'date': dates, 'review': reviews, 'rating': ratings})
    return df
However, when I call the get_data function, I get the error: name 'names' is not defined. I tried declaring the empty lists before the functions, and that worked, but it only lets me run the function once, because the scraped values accumulate in the lists across calls. Any help would be appreciated.
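One way to restructure this, sketched here with simplified stand-in names rather than the original requests/BeautifulSoup code: make each helper take its target list as a parameter. Then get_data can create fresh lists on every call, and no global state survives between runs. The `FakeTag` class below is a hypothetical stand-in for a bs4 tag, used only so the sketch runs on its own:

```python
def append_value(target, element, default=-1):
    """Append the element's stripped text, or `default` when the tag is missing."""
    if element is not None:
        target.append(element.text.strip())
    else:
        target.append(default)

class FakeTag:
    """Stand-in for a bs4 Tag, just for demonstration."""
    def __init__(self, text):
        self.text = text

def get_data(rows):
    # Fresh lists on every call, so repeated calls do not accumulate state.
    names, ratings = [], []
    for row in rows:
        append_value(names, row.get('name'))
        append_value(ratings, row.get('rating'))
    return list(zip(names, ratings))

rows = [{'name': FakeTag(' Course A '), 'rating': FakeTag('4.5')},
        {'name': FakeTag('Course B'), 'rating': None}]
print(get_data(rows))  # [('Course A', '4.5'), ('Course B', -1)]
```

The same pattern would apply to the real helpers: pass `names`, `institutes`, etc. into them from `get_data` (or have each helper return a value and let `get_data` do the appending), instead of relying on module-level lists.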