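How to scrape data between two tags with Html Agility Pack in C#
Using Html Agility Pack, I need to scrape the InnerText of every //dd tag located between two //h2 tags (in this case, between the h2 tags named "Applicant" and "Agent"). How can I do this?
Below is just the part of the HTML that I need to scrape data from: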
<!-- Applicants section -->
<h2 class="GridTitle">Applicant</h2>
<h3 class="DataTitle">1</h3>
<dl class="Grid LeftCol">
<dt>Name:</dt>
<dd>Some name here</dd>
<dt>Legal Form:</dt>
<dd></dd>
<dt>From:</dt>
<dd>06/08/2020</dd>
</dl>
<dl class="Grid RightCol">
<dt>Address:</dt>
<dd>Some address here</dd>
<dt>To:</dt>
<dd></dd>
</dl>
<h3 class="DataTitle">2</h3>
<dl class="Grid LeftCol">
<dt>Name:</dt>
<dd>Some name here1</dd>
<dt>Legal Form:</dt>
<dd></dd>
<dt>From:</dt>
<dd>04/08/2010</dd>
</dl>
<dl class="Grid RightCol">
<dt>Address:</dt>
<dd>Some address here1</dd>
<dt>To:</dt>
<dd>06/08/2020</dd>
</dl>
<!-- Agents section -->
<h2 class="GridTitle">Agent</h2>
Here is what I tried, but it takes the first //dd above //h2 (Agent):
var h2Tags = doc.DocumentNode.SelectNodes("//h2[text() = 'Applicant']");
var h2Tags1 = doc.DocumentNode.SelectNodes("//h2[text() = 'Agent']");
var lineNum = h2Tags[0].Line;
var lineNum1 = h2Tags1[0].Line;
var Applicants = doc.DocumentNode.SelectNodes("//dd").Where(x => x.Line > lineNum).Where(x => x.Line < lineNum1);
foreach (HtmlNode g in Applicants)
{
    TMOwner = g.InnerText;
}
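Solution
You can do this entirely with XPath queries, as shown below. You already have the XPath queries that select the start and end h2 nodes. You can then select all the dd nodes between them like this: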
var startnode = doc.DocumentNode.SelectSingleNode("//h2[text() = 'Applicant']"); // Select the start node.
// TODO: Handle the case where startnode is null/missing here.
var endnode = startnode.SelectSingleNode("./following::h2"); // And select the following end node using whatever criteria you need.
// TODO: Handle the case where endnode is null/missing here.
var followingXPath = "./following::dd"; // Select nodes following the current node, which will be startnode.
var precedingXPath = $"{endnode.XPath}/preceding::dd"; // Select nodes preceding the end node explicitly.
var intersectedXPath = $"{followingXPath}[count(. | {precedingXPath}) = count({precedingXPath})]";
var query = startnode.SelectNodes(intersectedXPath);
var innerTexts = query.Select(n => n.InnerText).ToList();
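The two TODO comments above could be filled in roughly as follows. This is only a sketch under some assumptions: the class name DdBetweenHeadings and the method GetTexts are hypothetical helpers (not part of Html Agility Pack), the heading text is assumed to contain no single quote, and when no later h2 exists the sketch simply falls back to every following dd.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

public static class DdBetweenHeadings
{
    // Hypothetical helper: returns the InnerText of every <dd> that comes after the
    // <h2> whose text equals startTitle and before the next <h2> in document order.
    // Assumption: startTitle contains no single quote (it is pasted into the XPath).
    public static List<string> GetTexts(HtmlDocument doc, string startTitle)
    {
        var startnode = doc.DocumentNode.SelectSingleNode($"//h2[text() = '{startTitle}']");
        if (startnode == null)
            return new List<string>(); // No heading with that text.

        var endnode = startnode.SelectSingleNode("./following::h2");

        // Assumption: with no later <h2>, take every <dd> after the start node.
        var xpath = "./following::dd";
        if (endnode != null)
        {
            var precedingXPath = $"{endnode.XPath}/preceding::dd";
            xpath = $"{xpath}[count(. | {precedingXPath}) = count({precedingXPath})]";
        }

        // By default SelectNodes returns null, not an empty collection, when nothing matches.
        var nodes = startnode.SelectNodes(xpath);
        return nodes == null
            ? new List<string>()
            : nodes.Select(n => n.InnerText).ToList();
    }
}
```

After loading the page with doc.LoadHtml(html), calling DdBetweenHeadings.GetTexts(doc, "Applicant") would return the dd texts of the Applicant section, including empty strings for empty dd elements.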
Alternatively, you can combine a simpler XPath query with LINQ's TakeWhile() like this:
var startnode = doc.DocumentNode.SelectSingleNode("//h2[text() = 'Applicant']"); // Select the start node.
var endnode = startnode.SelectSingleNode("./following::h2"); // And select the following end node using whatever criteria you need.
var query = startnode.SelectNodes("./following::node()") // Select all nodes following startnode
    .TakeWhile(n => n != endnode) // Until endnode is reached
    .Where(n => n.Name == "dd"); // With name "dd".
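A slightly more defensive version of the LINQ approach might look like the sketch below. The class and method names are hypothetical, the end node is selected by its text 'Agent', and the DeEntitize/Trim cleanup of each value is an optional extra that the snippet above does not do.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

public static class DdBetweenHeadingsLinq
{
    // Hypothetical helper: collects the text of every <dd> between the "Applicant"
    // heading and the following "Agent" heading.
    public static List<string> GetApplicantTexts(HtmlDocument doc)
    {
        var startnode = doc.DocumentNode.SelectSingleNode("//h2[text() = 'Applicant']");
        if (startnode == null)
            return new List<string>(); // Start heading not found.

        // May be null; TakeWhile below then never stops and all following <dd> are kept.
        var endnode = startnode.SelectSingleNode("./following::h2[text() = 'Agent']");

        // By default SelectNodes returns null when nothing follows the start node.
        var following = startnode.SelectNodes("./following::node()");
        if (following == null)
            return new List<string>();

        return following
            .TakeWhile(n => n != endnode)   // Reference comparison against the end node.
            .Where(n => n.Name == "dd")     // Keep only <dd> elements.
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim()) // Optional cleanup.
            .ToList();
    }
}
```

The comparison n != endnode relies on both queries returning the same HtmlNode instances from the parsed document, which is the assumption the original snippet also makes.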
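Notes:
- ./following::dd, ./following::h2 and /preceding::dd are examples of location steps that use axes. The following axis selects the nodes in the same document as the context node that come after it in document order (excluding its descendants), while the preceding axis selects the nodes in the same document that come before the context node in document order (excluding its ancestors). If you want to select the next <h2> node with a specific text value, say "Agent", you can do it like this: var endnode = startnode.SelectSingleNode("./following::h2[text() = 'Agent']");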
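- The formula for intersectedXPath comes from this answer by Dimitre Novatchev to "How would you find all nodes between two H3's using XPATH?". The situation there is similar, except that your question does not restrict the elements to be selected to siblings.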
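To make the intersection formula easier to reuse, here is a small sketch that only builds the XPath string. XPathSets.Intersect is a hypothetical helper name. The idea behind the predicate: a union in XPath 1.0 contains each node only once, so adding the current node to the second node-set leaves its count unchanged exactly when the node is already in that set.

```csharp
public static class XPathSets
{
    // Hypothetical helper: builds an XPath 1.0 expression that keeps only the nodes
    // of ns1 that also belong to ns2. Both arguments are XPath expressions as strings.
    public static string Intersect(string ns1, string ns2) =>
        $"{ns1}[count(. | {ns2}) = count({ns2})]";
}
```

With this, intersectedXPath above would be XPathSets.Intersect(followingXPath, precedingXPath).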
Demo fiddle for XPath here; here for XPath + LINQ; https://bpp.economie.fgov.be/fo-eregister-view/search/details/721770937_EPV/0/0/1/10/0/0/0/null/null?locale=en and here.