Querying HTML with LINQ to XML

Oftentimes we need to parse HTML for data. In a perfect world everything would have a nice service or API wrapped around it, but as we all know this is not always the case. Many times we're left with parsing files or "screen scraping" to get the data we need from other applications. Sure, this is brittle, but sometimes it's the best we can do. And sometimes you're just trying to get the data once, so "good enough" really is good enough.

I was faced with that challenge myself this week. Yes, even here not all systems expose services, or if they do, finding the documentation or the right person to consult would take longer than writing a simple program. ;-) At the core, all I needed to do was query a couple of pieces of data from a bunch of web pages. This seemed like the perfect opportunity to use LINQ to XML because the pages were pretty well-formed HTML. However, there were a couple of tricks to figure out, mainly because LINQ to XML doesn't support HTML entities. It only supports character entities (numeric references such as &#160;) and the five built-in XML entities (&lt; &gt; &quot; &amp; &apos;).

Working with simple HTML in an XElement is very straightforward, as long as it's well-formed and doesn't contain any HTML entity references:

Dim html = <html>
               <head>
                   <title>
                        Test Page
                    </title>
               </head>
               <body>
                    <a id="link1" href="http://mydownloads1.com">This is a link 1</a>
                    <a id="link2" href="http://mydownloads2.com">This is a link 2</a>
                    <a id="link3" href="http://mydownloads3.com">This is a link 3</a>
                    <a id="link4" href="http://mydownloads4.com">This is a link 4</a>
               </body>
           </html>


Dim links = From link In html...<a>

For Each link In links
    Console.WriteLine(link.@href)
Next

But as we all know, HTML almost always contains entity references all over the place (like &nbsp; for the non-breaking space), and XElement rejects them. You hit the same problem if any of your hrefs contain querystring parameters separated by a bare &, because the parser treats the & as the start of an entity reference. Additionally, if you paste such a literal into the VB editor, it tries to interpret the text after the & as an entity and inserts a semicolon into the querystring where you don't want one.
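
To see the failure concretely, here is a minimal sketch (the module and function names are made up for illustration) that feeds three fragments to XElement.Parse. The first two mirror the &nbsp; header cell and the bare & in the href from the sample page; the third is the same link with the ampersand escaped:

```vb
Imports System
Imports System.Xml
Imports System.Xml.Linq

Module EntityFailureDemo

    ' Hypothetical helper: returns True when the markup loads as XML,
    ' False when the parser rejects it with an XmlException.
    Function CanParse(ByVal markup As String) As Boolean
        Try
            XElement.Parse(markup)
            Return True
        Catch ex As XmlException
            Console.WriteLine("Rejected: " & ex.Message)
            Return False
        End Try
    End Function

    Sub Main()
        ' &nbsp; is an HTML entity that XML does not declare
        Console.WriteLine(CanParse("<th>Rating&nbsp;:</th>"))
        ' A bare & in a querystring looks like the start of an entity reference
        Console.WriteLine(CanParse("<a href=""default.aspx?ID=12345&Name=Beth"">View</a>"))
        ' The same link with the ampersand escaped is fine
        Console.WriteLine(CanParse("<a href=""default.aspx?ID=12345&amp;Name=Beth"">View</a>"))
    End Sub

End Module
```

Only the last fragment should load; the first two are rejected before any query can run.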

So to fix this you need to remove all the unsupported HTML entity references and replace the bare & characters with &amp;. Luckily the pages I was loading were not that complicated and only contained &nbsp; and the problematic querystrings. This is an example of the page I was trying to load:

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>
      Sample Page
    </title>
    <link href="css/page.css" rel="StyleSheet"/>
   </head>
  <body >
     <!--begin form -->
    <form name="form1" method="post" action="page.aspx?Product=Cool&amp;Id=12345" id="form1">
  
      <!--begin main table -->
      <table class="tblMain" cellspacing="0" cellpadding="0">
    
        <!--Properties -->
        <tr>
          <td class="tdHead">Properties</td>
        </tr>

        <tr>
          <td class="tdGrid">
            <div>
              <table class="grid" cellspacing="0" cellpadding="3" 
                     border="1" id="dgPage" style="border-collapse:collapse;">
                <tr class="grid_row">
                  <td class="grid_item" style="font-weight:bold;width:100px;">ID</td>
                  <td class="grid_item" style="width:480px;">12345</td>
                </tr>
                <tr class="grid_row">
                  <td class="grid_item" style="font-weight:bold;width:100px;">Published</td>
                  <td class="grid_item" style="width:480px;">05/04/2007</td>
                </tr>
              </table>
            </div>
          </td>
        </tr>

        <!--Details -->
        <tr>
          <td id="tdHeadDetails" class="tdHead">Statistics</td>
        </tr>

        <tr>
          <td class="tdGrid">
            <div>
              <table class="grid" cellspacing="0" cellpadding="3" rules="all" border="1" 
                     id="dgDetails" style="border-collapse:collapse;">
                <tr class="grid_header">
                  <th scope="col">Rating&nbsp;:</th>
                  <th scope="col">Raters&nbsp;:</th>
                  <th scope="col">Pageviews&nbsp;:</th>
                  <th scope="col">Printed&nbsp;:</th>
                  <th scope="col">Saved&nbsp;:</th>
                  <th scope="col">Emailed&nbsp;:</th>
                  <th scope="col">Linked&nbsp;:</th>
                  <th scope="col"></th>
                </tr>
                <tr class="grid_row">
                  <td class="grid_item" style="width:60px;">5.00</td>
                  <td class="grid_item" style="width:60px;">100</td>
                  <td class="grid_item" style="width:80px;">1000000</td>
                  <td class="grid_item" style="width:60px;">150</td>
                  <td class="grid_item" style="width:60px;">1000</td>
                  <td class="grid_item" style="width:60px;">100</td>
                  <td class="grid_item" style="width:280px;">40</td>
                  <td class="grid_item">
                    <a href="http://www.somewhere.com/default.aspx?ID=12345&Name=Beth" target="_blank">View</a>
                  </td>
                </tr>
              </table>
            </div>
          </td>
        </tr>
      </table>
     </form>
  </body>
</html>

So here's what I did to load this programmatically and fix up the HTML. Also notice that I need an Imports statement for the default XML namespace declared in the HTML document; otherwise the queries later will not return any results.

Imports <xmlns="http://www.w3.org/1999/xhtml">
Imports System.Net
Imports System.IO

Public Class SimpleScreenScrape

    Function GetHtmlPage(ByVal strURL As String) As String
        Try

            Dim strResult As String
            Dim objRequest As WebRequest = WebRequest.Create(strURL)
            'Send the current user's credentials with the request
            objRequest.UseDefaultCredentials = True

            'Using disposes the response and the reader, so no explicit Close is needed
            Using objResponse As WebResponse = objRequest.GetResponse()
                Using sr As New StreamReader(objResponse.GetResponseStream())
                    strResult = sr.ReadToEnd()
                End Using
            End Using

            'Replace HTML entity references so that we can load into XElement.
            'Note: the second Replace also re-escapes any & already written as &amp;
            strResult = Replace(strResult, "&nbsp;", "")
            strResult = Replace(strResult, "&", "&amp;")

            Return strResult

        Catch ex As Exception
            'Any failure yields an empty string, which XElement.Load in the caller rejects
            Return ""
        End Try
    End Function

    Sub QueryData()
        Dim html As XElement
        Try
            Dim p = GetHtmlPage("http://www.somewhere.com/default.aspx")

            Using sr As New StringReader(p)
                html = XElement.Load(sr)
            End Using

        Catch ex As Exception
            MsgBox("Page could not be loaded.")
            Exit Sub
        End Try
.
. 'Now we can write the queries... 
.
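
The blanket Replace of & with &amp; in GetHtmlPage also rewrites ampersands that are already escaped, such as the &amp; in the form's action attribute on the sample page, which ends up as &amp;amp;. That still loads, and it didn't matter for the data I needed, but if you care about attribute values a more selective fix-up is possible. The following is only a sketch: the module and function names are hypothetical, and it uses a regular expression instead of the plain Replace calls above.

```vb
Imports System
Imports System.Text.RegularExpressions

Module HtmlFixup

    ' Hypothetical helper, not part of the original utility.
    Function EscapeForXml(ByVal html As String) As String
        ' Drop the one HTML entity these pages use, as GetHtmlPage does
        Dim result As String = html.Replace("&nbsp;", "")

        ' Escape an ampersand only when it does NOT already start one of the
        ' five built-in XML entities or a numeric character reference
        Return Regex.Replace(result, "&(?!(?:amp|lt|gt|quot|apos|#\d+|#x[0-9A-Fa-f]+);)", "&amp;")
    End Function

    Sub Main()
        ' The bare & is escaped
        Console.WriteLine(EscapeForXml("default.aspx?ID=12345&Name=Beth"))
        ' The existing &amp; is left alone
        Console.WriteLine(EscapeForXml("page.aspx?Product=Cool&amp;Id=12345"))
    End Sub

End Module
```

Any other HTML entity (&copy;, &mdash; and so on) would be escaped to literal text by this sketch rather than decoded, so it only suits pages as simple as the ones described here.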

Now for the fun part, the actual querying! With the document loaded into the XElement, querying it becomes a snap. I needed to grab the publish date and then all the statistics from the page. This is easily done with a couple of LINQ to XML queries, one for each of the HTML tables where the data is located:

'I'm using FirstOrDefault here because I know my page 
' only has one of these tables
Dim stats = (From stat In html...<table> _
            Where stat.@id = "dgDetails" _
            Select fields = stat.<tr>.<th>,values = stat.<tr>.<td>).FirstOrDefault()

'Same here. FirstOrDefault because there's only one "Published" 
' html row (<tr>) on the page that I'm looking for.
Dim lastPublished = (From prop In html...<tr> _
                    Where prop.<td>.Value = "Published" _
                    Select prop.<td>(1).Value).FirstOrDefault()

Console.WriteLine(lastPublished)

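'FirstOrDefault returns Nothing when no table with id "dgDetails" is found
If stats IsNot Nothing Then
    For i = 0 To stats.fields.Count - 1
        Console.WriteLine(stats.fields(i).Value & " = " & stats.values(i).Value)
    Next
End If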
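
The queries above only find anything because of the Imports <xmlns="http://www.w3.org/1999/xhtml"> line at the top of the file. As a rough illustration of why, here is a small sketch that uses the Descendants method instead of the XML axis properties; the module and function names are made up, and the one-line page is a stand-in for the real one.

```vb
Imports System
Imports System.Linq
Imports System.Xml.Linq

Module NamespaceDemo

    Private Const XhtmlUri As String = "http://www.w3.org/1999/xhtml"

    ' Looks for <table> elements that are in no namespace
    Function CountUnqualified(ByVal page As XElement) As Integer
        Return page.Descendants("table").Count()
    End Function

    ' Looks for <table> elements in the XHTML namespace
    Function CountQualified(ByVal page As XElement) As Integer
        Dim xhtml As XNamespace = XNamespace.Get(XhtmlUri)
        Return page.Descendants(xhtml.GetName("table")).Count()
    End Function

    Sub Main()
        Dim page = XElement.Parse(
            "<html xmlns=""" & XhtmlUri & """><body><table id=""dgDetails"" /></body></html>")

        Console.WriteLine(CountUnqualified(page)) ' prints 0
        Console.WriteLine(CountQualified(page))   ' prints 1
    End Sub

End Module
```

Every element under <html> inherits the default namespace, so an unqualified name never matches. The Imports statement makes the axis properties such as html...<table> use the qualified name for you.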

And that's it. For this simple utility this is good enough for me, and it took about 15 minutes to program using LINQ. The trick to loading the HTML document into an XElement is to remove or escape all the unsupported HTML entity references first.
