I am scraping a web page and have an HREF of the form "[base URL]/?filter=results", but this link only retrieves "[base URL]"?

Background: I'm a newbie, trying to scrape some information about football fixtures from the BBC Sport website, using Python 3 and Beautiful Soup in a Jupyter notebook.

However, I don't think the problem is in the code at all (although it certainly could be!). Rather, I think it is my understanding of URLs / HTML...

To best explain my problem, I have to refer to the website:

https://www.bbc.com/sport/football/teams/arsenal/scores-fixtures

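This site shows, month by month, the results and upcoming fixtures of a selected football team (Arsenal in this example). For the current month (currently August), it splits the fixtures into "AUG RESULTS" / "TODAY" (only if the team has a match today) / "AUG FIXTURES". The default is "TODAY" if it exists, otherwise "FIXTURES".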

Problem: if you click "AUG RESULTS" (or extract its HREF via web scraping), the URL becomes: https://www.bbc.com/sport/football/teams/arsenal/scores-fixtures/2020-08?filter=results

However, if you follow this link (or download it via requests.get(url)), it still takes you to the same default page.
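To rule out a malformed URL on my side, here is a minimal sketch, using only the standard library, that splits the link above into its parts and confirms the filter is carried as an ordinary query parameter. The variable names are my own and purely illustrative; nothing here contacts the site.

```python
from urllib.parse import parse_qs, urlsplit, urlunsplit

url = ("https://www.bbc.com/sport/football/teams/arsenal/"
       "scores-fixtures/2020-08?filter=results")

# Split the URL into scheme, host, path, query and fragment.
parts = urlsplit(url)

# Decode the query string into a dict of parameter name -> list of values.
query = parse_qs(parts.query)

# Rebuild the same URL without its query string, i.e. the "base URL".
base_url = urlunsplit(parts._replace(query=""))

print(parts.path)
print(query)
print(base_url)
```

If the query prints as a "filter" parameter with the value "results", the link itself is well formed, so the question becomes what the server (or the page's JavaScript) does with that parameter.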

Only when you press the "AUG RESULTS" element in the browser does the information on the page update (and the URL does not change when you do so).

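Key question: is there a way to "apply" the "?filter=results" so that a direct link to the page can be used with Beautiful Soup? Or is there a better way to scrape this information?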

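Below is part of my code, to show what I am trying to scrape. As mentioned, I have no problem with the code itself, but I need to find the right website input to pass to the function. Currently, websites ending in "?filter=XXX" do not give me the information I actually want; they only return the content of the base URL (as if no filter had been applied).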

import bs4
import requests

# For a given website (i.e. one month's page), print all the information
# we're after from the matches of that month.
def print_matches(website):

    # download the page for the team and month
    res = requests.get(website)

    # raise an exception immediately if the request failed (4xx/5xx status)
    res.raise_for_status()

    # parse the downloaded HTML into a BeautifulSoup object
    soup = bs4.BeautifulSoup(res.text, features="lxml")

    # each match is within an element with class "qa-match-block"
    matches = soup.select(".qa-match-block")

    for match in matches:

        # the h3 and h4 tags have the fixture info (date, league, round)
        h_tags = match.find_all(["h3", "h4"])

        for h in h_tags:
            print(h.text)

        # we expect this list to contain 2 elements, i.e. the 2 teams
        fixture = match.find_all("abbr")

        # we expect this list to have 1 element, the time
        match_time = match.find_all("span", "sp-c-fixture__number")

        print(fixture[0].text, "vs", fixture[1].text, "(" + match_time[0].text + ", UK time)")
        print()

    # team separator
    print()
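To check whether the server really ignores the filter, one option is to compare the match blocks in the two downloaded pages. The following is a rough sketch under my own assumptions: it uses only the standard library's html.parser, the class and function names (MatchBlockText, match_block_text) are hypothetical, and the "qa-match-block" class is simply the one my code above selects on.

```python
from html.parser import HTMLParser


class MatchBlockText(HTMLParser):
    """Collect the visible text inside each element with class 'qa-match-block'."""

    # Tags that never have a closing tag, so they must not affect nesting depth.
    VOID_TAGS = {"br", "img", "input", "hr", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.depth = 0    # nesting depth inside the current block (0 = outside)
        self.blocks = []  # one list of text fragments per match block

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1          # a nested element inside a match block
        elif "qa-match-block" in classes:
            self.depth = 1           # entering a new match block
            self.blocks.append([])

    def handle_endtag(self, tag):
        if tag not in self.VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.blocks[-1].append(data.strip())


def match_block_text(html):
    """Return one space-joined text string per match block found in html."""
    parser = MatchBlockText()
    parser.feed(html)
    return [" ".join(block) for block in parser.blocks]
```

Calling match_block_text(requests.get(url).text) once for the base URL and once for the "?filter=results" URL and comparing the two lists would show whether the HTML differs at all. If the lists are equal, that would be consistent with the filter being applied in the browser by JavaScript rather than by the server, in which case Beautiful Soup alone would never see the filtered content.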
