This article is the second part of our series on scraping websites with Html Agility Pack. Having tackled the cover and contents page in the previous article, we’re now ready to move on to the main content. Let’s start off with:
HTML Selectors
Selectors allow you to select HTML nodes from an HTML document.
HTML Selector Methods
SelectNodes(String): Selects a list of nodes matching the XPath expression.
SelectSingleNode(String): Selects the first HtmlNode matching the XPath expression.
SelectNodes Method
Selects a list of nodes matching the HtmlAgilityPack.HtmlNode.XPath expression.
Parameters: xpath (the XPath expression).
Returns: An HtmlAgilityPack.HtmlNodeCollection containing the nodes matching the HtmlAgilityPack.HtmlNode.XPath query, or null if no node matched the XPath expression.
Examples:
(i) The following example selects the first node matching the XPath expression using the SelectNodes method. First() comes from the System.Linq namespace.
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
string name = htmlDoc.DocumentNode.SelectNodes("//td/input").First().Attributes["value"].Value;
(ii) The following example selects all nodes matching the XPath expression.
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
var htmlNodes = htmlDoc.DocumentNode.SelectNodes("//td/input");
SelectSingleNode Method
Selects the first HtmlNode matching the HtmlAgilityPack.HtmlNode.XPath expression.
Parameters: xpath (the XPath expression; may not be null).
Returns: The first node that matches the XPath query, or a null reference if no matching node was found.
Example:
The following example selects the first node matching the XPath expression using the SelectSingleNode method.
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
string name = htmlDoc.DocumentNode.SelectSingleNode("//td/input").Attributes["value"].Value;
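Because both methods return null when nothing matches, chaining calls directly onto the result can throw a NullReferenceException. The following is a minimal sketch of a more defensive pattern; the sample markup and the variable names are assumptions made for illustration only.

```csharp
using System;
using HtmlAgilityPack;

public class Program
{
    public static void Main()
    {
        // Hypothetical sample markup, chosen only for illustration.
        var html = "<table><tr><td><input value=\"Alice\"></td></tr></table>";

        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var inputs = htmlDoc.DocumentNode.SelectNodes("//td/input");
        if (inputs != null)
        {
            foreach (var input in inputs)
            {
                // GetAttributeValue returns the supplied default if the attribute is absent.
                Console.WriteLine(input.GetAttributeValue("value", ""));
            }
        }

        // SelectSingleNode also returns null when there is no match.
        var missing = htmlDoc.DocumentNode.SelectSingleNode("//td/select");
        Console.WriteLine(missing == null ? "no <select> found" : missing.OuterHtml);
    }
}
```

Checking for null before iterating keeps the scraper from crashing when a page layout changes and the XPath no longer matches.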
HTML Manipulation
Manipulation allows you to read and modify HTML nodes.
HTML Manipulation Properties
InnerHtml: Gets or sets the HTML between the start and end tags of the object.
InnerText: Gets the text between the start and end tags of the object.
OuterHtml: Gets the object and its content in HTML.
ParentNode: Gets the parent of this node (for nodes that can have parents).
InnerHtml
public virtual string InnerHtml { get; set; } :-
Gets or Sets the HTML between the start and end tags of the object. InnerHtml is a member of HtmlAgilityPack.HtmlNode.
Example:
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
var htmlNodes = htmlDoc.DocumentNode.SelectNodes("//body/h1");
foreach (var node in htmlNodes)
{
    Console.WriteLine(node.InnerHtml);
}
InnerText
public virtual string InnerText { get; } :-
Gets the text between the start and end tags of the object. InnerText is a member of HtmlAgilityPack.HtmlNode.
Example:
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
var htmlNodes = htmlDoc.DocumentNode.SelectNodes("//body/h1");
foreach (var node in htmlNodes)
{
    Console.WriteLine(node.InnerText);
}
OuterHtml
public virtual string OuterHtml { get; } :-
Gets the object and its content in HTML. OuterHtml is a member of HtmlAgilityPack.HtmlNode.
Example:
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
var htmlNodes = htmlDoc.DocumentNode.SelectNodes("//body/h1");
foreach (var node in htmlNodes)
{
    Console.WriteLine(node.OuterHtml);
}
HTML Manipulation Methods
AppendChild(): Adds the specified node to the end of the list of children of this node.
AppendChildren(): Adds the specified node list to the end of the list of children of this node.
Clone(): Creates a duplicate of the node.
CloneNode(Boolean): Creates a duplicate of the node; the flag controls whether the subtree under the node is cloned as well.
CloneNode(String): Creates a duplicate of the node and changes its name at the same time.
CloneNode(String, Boolean): Creates a duplicate of the node and changes its name at the same time; the flag controls whether the subtree is cloned as well.
CopyFrom(HtmlNode): Creates a duplicate of the node and the subtree under it.
CopyFrom(HtmlNode, Boolean): Creates a duplicate of the node; the flag controls whether the subtree is copied as well.
CreateNode(): Creates an HTML node from a string representing literal HTML.
InsertAfter(): Inserts the specified node immediately after the specified reference node.
InsertBefore(): Inserts the specified node immediately before the specified reference node.
PrependChild(): Adds the specified node to the beginning of the list of children of this node.
PrependChildren(): Adds the specified node list to the beginning of the list of children of this node.
Remove(): Removes the node from its parent collection.
RemoveAll(): Removes all the children and/or attributes of the current node.
RemoveAllChildren(): Removes all the children of the current node.
RemoveChild(HtmlNode): Removes the specified child node.
RemoveChild(HtmlNode, Boolean): Removes the specified child node; the flag controls whether the removed node's own children are kept.
ReplaceChild(): Replaces the child node oldChild with the newChild node.
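To show how a few of these methods fit together, here is a small sketch that creates, appends, prepends, replaces, and removes nodes. The markup and variable names are assumptions for illustration; only CreateNode, AppendChild, PrependChild, ReplaceChild, and Remove from the list above are used.

```csharp
using System;
using HtmlAgilityPack;

public class Program
{
    public static void Main()
    {
        // Hypothetical sample markup, chosen only for illustration.
        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml("<body><h1>Title</h1><p>Old paragraph</p></body>");

        var body = htmlDoc.DocumentNode.SelectSingleNode("//body");
        var heading = htmlDoc.DocumentNode.SelectSingleNode("//h1");
        var oldParagraph = htmlDoc.DocumentNode.SelectSingleNode("//p");

        // CreateNode is a static factory on HtmlNode that parses literal HTML.
        body.AppendChild(HtmlNode.CreateNode("<p>Appended paragraph</p>"));
        body.PrependChild(HtmlNode.CreateNode("<p>Prepended paragraph</p>"));

        // ReplaceChild takes the new node first and the node to replace second.
        body.ReplaceChild(HtmlNode.CreateNode("<h2>Title</h2>"), heading);

        // Remove detaches the node from its parent.
        oldParagraph.Remove();

        Console.WriteLine(htmlDoc.DocumentNode.OuterHtml);
    }
}
```

The nodes to replace and remove are selected before any new nodes are added, so the XPath queries are not affected by the inserted paragraphs.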
HTML Traversing
Traversing allows you to move through the nodes of an HTML document.
HTML Traversing Properties
ChildNodes: Gets all the children of the node.
FirstChild: Gets the first child of the node.
LastChild: Gets the last child of the node.
NextSibling: Gets the node immediately following this element.
ParentNode: Gets the parent of this node (for nodes that can have parents).
ChildNodes
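Example:
using System;
using System.Xml;
using HtmlAgilityPack;
public class Program
{
    public static void Main()
    {
        var html = @"<body><h1>This is <b>bold</b> heading</h1><p>This is <u>underline</u> paragraph</p></body>";
        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);
        var body = htmlDoc.DocumentNode.SelectSingleNode("//body");
        // Print every direct child of <body>.
        foreach (var node in body.ChildNodes)
        {
            Console.WriteLine(node.OuterHtml);
        }
    }
}
Output:
<h1>This is <b>bold</b> heading</h1>
<p>This is <u>underline</u> paragraph</p>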
FirstChild
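Example:
using System;
using System.Xml;
using HtmlAgilityPack;
public class Program
{
    public static void Main()
    {
        var html = @"<body><h1>This is <b>bold</b> heading</h1><p>This is <u>underline</u> paragraph</p></body>";
        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);
        var body = htmlDoc.DocumentNode.SelectSingleNode("//body");
        // Print only the first direct child of <body>.
        Console.WriteLine(body.FirstChild.OuterHtml);
    }
}
Output:
<h1>This is <b>bold</b> heading</h1>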
LastChild
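As a rough sketch of how LastChild can be used, the snippet below prints the last direct child of the body element. The markup is an assumption made for illustration and is deliberately written without line breaks between tags.

```csharp
using System;
using HtmlAgilityPack;

public class Program
{
    public static void Main()
    {
        // Hypothetical sample markup with no whitespace between the tags.
        var html = "<body><h1>This is <b>bold</b> heading</h1><p>This is <u>underline</u> paragraph</p></body>";

        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        var body = htmlDoc.DocumentNode.SelectSingleNode("//body");

        // LastChild is null when the node has no children at all.
        HtmlNode lastChild = body.LastChild;
        if (lastChild != null)
        {
            Console.WriteLine(lastChild.OuterHtml);
        }
    }
}
```

If the markup contains line breaks or indentation between tags, the last child may be a whitespace text node rather than an element, so checking NodeType before using the result is often worthwhile.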
NextSibling
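A minimal sketch of NextSibling, under the same assumptions as before (illustrative markup, no whitespace between tags): starting from the heading, it steps to the node that follows it under the same parent.

```csharp
using System;
using HtmlAgilityPack;

public class Program
{
    public static void Main()
    {
        // Hypothetical sample markup with no whitespace between the tags.
        var html = "<body><h1>This is <b>bold</b> heading</h1><p>This is <u>underline</u> paragraph</p></body>";

        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        var heading = htmlDoc.DocumentNode.SelectSingleNode("//h1");

        // NextSibling is null when the node is the last child of its parent.
        HtmlNode next = heading.NextSibling;
        Console.WriteLine(next == null ? "no following sibling" : next.OuterHtml);
    }
}
```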
HTML Traversing Methods
Ancestors(): Gets all the ancestors of the node.
Ancestors(String): Gets the ancestors with a matching name.
AncestorsAndSelf(): Gets all ancestor nodes and the current node.
AncestorsAndSelf(String): Gets the ancestor nodes and the current node with a matching name.
DescendantNodes(): Gets all descendant nodes of this node and of each of its child nodes.
Descendants(): Gets all descendant nodes as an enumerable list.
Descendants(String): Gets all descendant nodes with a matching name.
DescendantsAndSelf(): Returns this node and all of its descendant nodes, in document order.
DescendantsAndSelf(String): Gets all descendant nodes with a matching name, including this node.
Element(String): Gets the first direct child node with a matching name.
Elements(String): Gets the direct child nodes with a matching name.
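The traversal methods can be combined with LINQ, since they return enumerable sequences. The sketch below uses Elements, Descendants, and Ancestors on a small document; the markup and names are assumptions chosen for illustration.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public class Program
{
    public static void Main()
    {
        // Hypothetical sample markup, chosen only for illustration.
        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml("<body><div><p>First <b>bold</b></p><p>Second</p></div></body>");

        var div = htmlDoc.DocumentNode.SelectSingleNode("//div");

        // Elements(name): direct children of the div named "p".
        foreach (var paragraph in div.Elements("p"))
        {
            Console.WriteLine(paragraph.InnerText);
        }

        // Descendants(name): matching nodes at any depth below the document node.
        var bold = htmlDoc.DocumentNode.Descendants("b").FirstOrDefault();
        if (bold != null)
        {
            // Ancestors(): walks upward from the parent towards the document root.
            foreach (var ancestor in bold.Ancestors())
            {
                Console.WriteLine(ancestor.Name);
            }
        }
    }
}
```

Unlike SelectNodes, these methods yield an empty sequence rather than null when nothing matches, which makes them convenient in foreach loops and LINQ queries.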
HTML Writer
The writer methods save an HtmlDocument or write out an HtmlNode.
HTML Writer Methods (HtmlDocument)
Save(Stream): Saves the HTML document to the specified stream.
Save(StreamWriter): Saves the HTML document to the specified StreamWriter.
Save(TextWriter): Saves the HTML document to the specified TextWriter.
Save(String): Saves the document to the specified file.
Save(XmlWriter): Saves the HTML document to the specified XmlWriter.
Save(Stream, Encoding): Saves the HTML document to the specified stream using the given encoding.
Save(String, Encoding): Saves the document to the specified file using the given encoding.
HTML Writer Methods (HtmlNode)
WriteContentTo(): Saves all the children of the node to a string.
WriteContentTo(TextWriter): Saves all the children of the node to the specified TextWriter.
WriteTo(): Saves the current node to a string.
WriteTo(TextWriter): Saves the current node to the specified TextWriter.
WriteTo(XmlWriter): Saves the current node to the specified XmlWriter.
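As an illustration of the difference between saving a document and writing a node, here is a short sketch that keeps everything in memory with a StringWriter instead of touching the file system. The markup is an assumption for the example.

```csharp
using System;
using System.IO;
using HtmlAgilityPack;

public class Program
{
    public static void Main()
    {
        // Hypothetical sample markup, chosen only for illustration.
        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml("<body><h1>Title</h1><p>Some <b>bold</b> text</p></body>");

        // Save(TextWriter): write the whole document into an in-memory writer.
        using (var writer = new StringWriter())
        {
            htmlDoc.Save(writer);
            Console.WriteLine(writer.ToString());
        }

        var paragraph = htmlDoc.DocumentNode.SelectSingleNode("//p");

        // WriteTo(): the node itself, including its own start and end tags.
        Console.WriteLine(paragraph.WriteTo());

        // WriteContentTo(): only the children of the node, without its own tags.
        Console.WriteLine(paragraph.WriteContentTo());
    }
}
```

Swapping the StringWriter for a StreamWriter, or calling Save with a file path, follows the same pattern when the result needs to be persisted.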
HTML Utilities
HTML Utilities Methods (HtmlDocument)
DetectEncoding(Stream): Detects the encoding of an HTML stream.
DetectEncoding(TextReader): Detects the encoding of HTML text provided on a TextReader.
DetectEncoding(String): Detects the encoding of an HTML file.
DetectEncodingAndLoad(String): Detects the encoding of an HTML document from a file first, and then loads the file.
DetectEncodingAndLoad(String, Boolean): Detects the encoding of an HTML document from a file first, and then loads the file; the flag controls whether encoding detection is performed.
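The following sketch shows one way DetectEncoding(Stream) might be called. It uses an in-memory stream so that it does not depend on a file; the markup, including the meta charset declaration, is an assumption for illustration, and the result is null-checked rather than assumed.

```csharp
using System;
using System.IO;
using System.Text;
using HtmlAgilityPack;

public class Program
{
    public static void Main()
    {
        // Hypothetical sample markup that declares its charset in a meta tag.
        var html = "<html><head><meta charset=\"utf-8\"></head><body><p>Hello</p></body></html>";
        byte[] bytes = Encoding.UTF8.GetBytes(html);

        var htmlDoc = new HtmlDocument();

        using (var stream = new MemoryStream(bytes))
        {
            // DetectEncoding inspects the stream and reports the encoding it finds.
            Encoding detected = htmlDoc.DetectEncoding(stream);
            Console.WriteLine(detected == null ? "(not detected)" : detected.WebName);
        }
    }
}
```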
HTML Attributes
HTML Attributes Methods
Add(HtmlAttribute): Adds the supplied item to the collection.
Add(String, String): Adds a new attribute to the collection with the given values.
Append(String): Creates and inserts a new attribute as the last attribute in the collection.
Append(HtmlAttribute): Inserts the specified attribute as the last attribute in the collection.
Append(String, String): Creates and inserts a new attribute as the last attribute in the collection.
Remove(): Removes all attributes from the collection.
Remove(String): Removes an attribute from the list, using its name. If there is more than one attribute with this name, they will all be removed.
Remove(HtmlAttribute): Removes a given attribute from the list.
RemoveAll(): Removes all attributes in the list.
RemoveAt(): Removes the attribute at the specified index.
SetAttributeValue(): Sets the value of an attribute on a node. If the attribute is not found, it will be created automatically.
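A brief sketch of working with attributes follows. It assumes an illustrative anchor element and uses Add and Remove on the node's Attributes collection together with SetAttributeValue and GetAttributeValue on the node itself.

```csharp
using System;
using HtmlAgilityPack;

public class Program
{
    public static void Main()
    {
        // Hypothetical sample element, chosen only for illustration.
        HtmlNode link = HtmlNode.CreateNode("<a id=\"old\" href=\"/home\">Home</a>");

        // Add(name, value): appends a new attribute to the collection.
        link.Attributes.Add("class", "nav-link");

        // SetAttributeValue: updates an existing attribute...
        link.SetAttributeValue("href", "/index");
        // ...or creates it when it does not exist yet.
        link.SetAttributeValue("title", "Go home");

        // Remove(name): removes every attribute with this name.
        link.Attributes.Remove("id");

        // GetAttributeValue returns the supplied default if the attribute is absent.
        Console.WriteLine(link.GetAttributeValue("href", ""));
        Console.WriteLine(link.GetAttributeValue("id", "(no id)"));
        Console.WriteLine(link.OuterHtml);
    }
}
```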
I have covered a lot of ground in very little code, and I hope this post shows you how to parse HTML documents effectively in C# and impresses on you the power of this library!