Welcome to Vigges Developer Community: Open, Learning, Share
Welcome To Ask or Share your Answers For Others


html agility pack - C# HTMLAGILITYPACK scrape data between two tags

Using Html Agility Pack, I have to scrape the innerText from all //dd tags which are set between //h2 tags (in this case between h2 tags named "Applicant" and "Agent"). How can this be done?

The following is just a piece of HTML code from which I have to scrape data:

<!-- Applicants section  -->

    <h2 class="GridTitle">Applicant</h2>
    
        
            <h3 class="DataTitle">1</h3>
        
        <dl class="Grid LeftCol">
            <dt>Name:</dt>
            <dd>Some name here</dd>
            <dt>Legal Form:</dt>
            <dd></dd>              
            <dt>From:</dt>
            <dd>06/08/2020</dd>
        </dl>
        <dl class="Grid RightCol">
            <dt>Address:</dt>
            <dd>Some address here</dd>
            <dt>To:</dt>
            <dd></dd>
        </dl>
    
        
            <h3 class="DataTitle">2</h3>
        
        <dl class="Grid LeftCol">
            <dt>Name:</dt>
            <dd>Some name here1</dd>
            <dt>Legal Form:</dt>
            <dd></dd>              
            <dt>From:</dt>
            <dd>04/08/2010</dd>
        </dl>
        <dl class="Grid RightCol">
            <dt>Address:</dt>
            <dd>Some address here1</dd>
            <dt>To:</dt>
            <dd>06/08/2020</dd>
        </dl>
    



<!-- Agents section  -->

    <h2 class="GridTitle">Agent</h2>

This is what I have tried, but it only gives me the first //dd above the "Agent" //h2:

var h2Tags = doc.DocumentNode.SelectNodes("//h2[text() = 'Applicant']");
var h2Tags1 = doc.DocumentNode.SelectNodes("//h2[text() = 'Agent']");
var lineNum = h2Tags[0].Line;
var lineNum1 = h2Tags1[0].Line;
var Applicants = doc.DocumentNode.SelectNodes("//dd").Where(x => x.Line > lineNum).Where(x => x.Line < lineNum1);

foreach (HtmlNode g in Applicants)
{
      TMOwner = g.InnerText;
}


1 Answer


You can do this entirely with XPath queries. You already have XPath queries to select your start and end h2 nodes, so you can select all dd nodes between them as follows:

var startnode = doc.DocumentNode.SelectSingleNode("//h2[text() = 'Applicant']"); // Select the start node.
// TODO: Handle the case where startnode is null/missing here.

var endnode = startnode.SelectSingleNode("./following::h2");                     // And select the following end node using whatever criteria you need.
// TODO: Handle the case where endnode is null/missing here.

var followingXPath = $"./following::dd";                                         // Select nodes following the current node, which will be startnode
var precedingXPath = $"{endnode.XPath}/preceding::dd";                           // Select nodes preceding the end node explicitly.
var intersectedXPath = $"{followingXPath}[count(. | {precedingXPath}) = count({precedingXPath})]";

var query = startnode.SelectNodes(intersectedXPath);

var innerTexts = query.Select(n => n.InnerText).ToList();

Or you could combine a simpler XPath query with LINQ's TakeWhile(), like so:

var startnode = doc.DocumentNode.SelectSingleNode("//h2[text() = 'Applicant']"); // Select the start node.
var endnode = startnode.SelectSingleNode("./following::h2");                     // And select the following end node using whatever criteria you need.

var query = startnode.SelectNodes("./following::node()") // Select all nodes following startnode
    .TakeWhile(n => n != endnode)                        // Until endnode is reached
    .Where(n => n.Name == "dd");                         // With name "dd".
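Both snippets leave the missing-node cases as TODOs. One possible way to fill them in for the TakeWhile() variant is sketched below. SectionScraper and DdTextsAfterHeading are hypothetical names, not part of Html Agility Pack, and the sketch assumes the heading text contains no single quote, because it is embedded in an XPath string literal.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class SectionScraper
{
    // Hypothetical helper: returns the text of every <dd> that comes after the
    // <h2> whose text equals startTitle and before the next <h2> in the document.
    public static List<string> DdTextsAfterHeading(HtmlDocument doc, string startTitle)
    {
        // Assumption: startTitle contains no single quote.
        var startNode = doc.DocumentNode.SelectSingleNode($"//h2[text() = '{startTitle}']");
        if (startNode == null)
            return new List<string>(); // Heading not found.

        // Null when the start heading is the last <h2>; in that case
        // everything after the start heading is scanned.
        var endNode = startNode.SelectSingleNode("./following::h2");

        // SelectNodes returns null, not an empty collection, when nothing matches.
        var following = startNode.SelectNodes("./following::node()");
        if (following == null)
            return new List<string>();

        return following
            .TakeWhile(n => n != endNode)                            // Stop at the end heading.
            .Where(n => n.Name == "dd")                              // Keep only <dd> elements.
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())  // Decode entities, trim whitespace.
            .ToList();
    }
}
```

Calling SectionScraper.DdTextsAfterHeading(doc, "Applicant") on the HTML from the question should return one entry per dd in the Applicant section, with empty strings for the empty dd elements.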

Notes:

  • ./following::dd, ./following::h2 and /preceding::dd are location steps that use the following and preceding axes. The following axis selects nodes in the same document as the context node that come after it in document order (excluding its descendants), while the preceding axis selects nodes that come before it in document order (excluding its ancestors).

    If you wanted to select the next following <h2> node with a specific text value, say "Agent", you could do:

    var endnode = startnode.SelectSingleNode("./following::h2[text() = 'Agent']");
    
  • The formula for intersectedXPath is taken from an answer by Dimitre Novatchev to the question "How would you find all nodes between two H3's using XPATH?". The situation there is similar, but your question does not require the selected elements to be siblings.
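
To see why the intersection formula works, it may help to try it on a tiny well-formed document with the standard System.Xml classes, which implement XPath 1.0. This is only an illustration of the formula, not the Html Agility Pack code above. IntersectionDemo and DdBetween are hypothetical names, and the heading titles are assumed to contain no single quotes.

```csharp
using System;
using System.Linq;
using System.Xml;

public static class IntersectionDemo
{
    // Hypothetical helper: returns the text of every <dd> located between the
    // <h2> whose text is startTitle and the <h2> whose text is endTitle.
    public static string[] DdBetween(string xml, string startTitle, string endTitle)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        // ns1: every <dd> after the start heading.
        // ns2: every <dd> before the end heading.
        string ns1 = $"//h2[. = '{startTitle}']/following::dd";
        string ns2 = $"//h2[. = '{endTitle}']/preceding::dd";

        // A node from ns1 also belongs to ns2 exactly when adding it to ns2
        // leaves the size of ns2 unchanged.
        string intersection = $"{ns1}[count(. | {ns2}) = count({ns2})]";

        return doc.SelectNodes(intersection)
            .Cast<XmlNode>()
            .Select(n => n.InnerText)
            .ToArray();
    }
}
```

For the input `<root><dd>before</dd><h2>Applicant</h2><dl><dd>one</dd><dd>two</dd></dl><h2>Agent</h2><dd>after</dd></root>`, ns1 contains one, two and after, ns2 contains before, one and two, and only one and two pass the count test.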

Demo fiddles: one for XPath, one for XPath + LINQ, and one for https://bpp.economie.fgov.be/fo-eregister-view/search/details/721770937_EPV/0/0/1/10/0/0/0/null/null?locale=en.

