The following code snippet was collected from the web and compiled by 编程之家 (jb51.cc). It is shared here for reference.
from html.parser import HTMLParser
from urllib.request import urlopen
from urllib import parse


class LinkParser(HTMLParser):
    """Collects the absolute URL of every <a href="..."> on a page."""

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for (key, value) in attrs:
                if key == 'href':
                    # Resolve relative links against the page URL.
                    newUrl = parse.urljoin(self.baseUrl, value)
                    self.links = self.links + [newUrl]

    def getLinks(self, url):
        self.links = []
        self.baseUrl = url
        response = urlopen(url)
        # Only pages served with exactly this Content-Type are parsed.
        if response.getheader('Content-Type') == 'text/html; charset=UTF-8':
            htmlBytes = response.read()
            htmlString = htmlBytes.decode("utf-8")
            self.feed(htmlString)
            return htmlString, self.links
        else:
            return "", []


def spider(url, word, maxPages):
    pagesToVisit = [url]
    numberVisited = 0
    foundWord = False
    # Breadth-first crawl: stop at the page limit, when no pages are left,
    # or as soon as the word is found.
    while numberVisited < maxPages and pagesToVisit != [] and not foundWord:
        numberVisited = numberVisited + 1
        url = pagesToVisit[0]
        pagesToVisit = pagesToVisit[1:]
        try:
            print(numberVisited, "Visiting:", url)
            parser = LinkParser()
            data, links = parser.getLinks(url)
            # Queue the links found on this page for later visits.
            pagesToVisit = pagesToVisit + links
            if data.find(word) > -1:
                foundWord = True
                print(" **Success!**")
        except Exception:
            print(" **Failed!**")
    if foundWord:
        print("The word", word, "was found at", url)
    else:
        print("No page containing the word was found")


spider("http://yuedu.fm/", "夏洛特", 100)
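The spider is essentially a breadth-first search over links. As a rough offline sketch of that idea, the block below runs the same kind of crawl against an in-memory dictionary of pages instead of the network, and adds a set of already-seen URLs so that no page is queued twice (something the snippet above does not do). `FAKE_SITE`, `extract_links`, `crawl`, and the `example.test` URLs are all hypothetical names made up for illustration; they are not part of the original snippet.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "site": URL -> HTML body (stands in for urlopen).
FAKE_SITE = {
    "http://example.test/": '<a href="/a">A</a> <a href="b">B</a>',
    "http://example.test/a": '<a href="/">home</a> <a href="/b">B</a>',
    "http://example.test/b": "<p>target word here</p>",
}


class _LinkCollector(HTMLParser):
    """Gathers absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                if key == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    collector = _LinkCollector(base_url)
    collector.feed(html)
    return collector.links


def crawl(start_url, word, max_pages, fetch=FAKE_SITE.get):
    """Breadth-first crawl; returns (matching URL or None, pages visited)."""
    queue = deque([start_url])
    seen = {start_url}  # URLs already queued, to avoid revisiting
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # page not available in the fake site
        visited.append(url)
        if word in html:
            return url, visited
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return None, visited


found_url, order = crawl("http://example.test/", "target", 10)
print(found_url)
print(order)
```

Passing a different `fetch` callable (for example, one that wraps `urlopen` and decodes the response) would let the same loop run against a real site; that wiring is left out here to keep the sketch runnable without network access.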