COMP4332/RMBI4310

COMP4332/RMBI4310 XPath
Prepared by Raymond Wong
Presented by Raymond Wong

Outline
• Overview
• Example 1 (Simple Webpage)
• Example 2 (Table)

Overview
• XPath is a query language for navigating and selecting nodes in an XML (Extensible Markup Language) document. It can also be applied to an HTML file.

• In a web browser, the “search” function matches what we type in the “search” input box against the “display” form of the HTML file, word by word.
• With XPath, we match an XPath expression against the “raw” form of the HTML file, based on its structure.
• XPath can be used in some operations of “Data Crawling”.

• In Google Chrome, we can install a plug-in called “XPath Helper” to help us explore XPath.
• Besides, we can right-click a part of the webpage and choose “Inspect” to see the corresponding part of the “raw” form of the HTML file.

HTML
<!DOCTYPE html>
<html>
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
  <title>A Webpage Title</title>
</head>
<body>
  <h1>Webpage Heading (H1)</h1>
  <p>We could link to a webpage outside this webpage.<br>
  This is a <a href="http://home.cse.ust.hk/~raywong/">link (No. 1)</a> linking to Raymond's homepage. This link is an absolute link (linking to a webpage outside this webpage).<br>
  This is a <a href="Table.html">link (No. 2)</a> linking to the next webpage to be taught. This link is a relative link (linking to a webpage outside this webpage).</p>
  <h2><a name="ImageHeading">Image Heading (H2)</a></h2>
  <img border="1" src="http://home.cse.ust.hk/~raywong/photo/raymond2.JPG" width="166" height="221">
</body>
</html>

Tree structure of the HTML file (figure):
html
  head
    meta
    title
  body
    h1
    p
      a
      a
      br
    h2
      a
    img

Node element
• E.g., <html>…</html> (root node element)
• E.g., <head>…</head> (non-root node element)
• E.g., <title>…</title> (non-root node element)
• E.g., <br> (non-root node element)
• E.g., <a href="Table.html"> … </a> (non-root node element; “href” is an attribute of this node element)

Parent
• Parent of “body” = ?
Children
• Children of “body” = ?
Siblings
• Sibling of “body” = ?

Ancestor
• Ancestors of “br” = ?
Descendant
• Descendants of “body” = ?
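The relationships above can be explored programmatically. The following is a minimal sketch, assuming Python's standard `xml.etree.ElementTree` module and a trimmed, well-formed copy of the example page (the self-closed `<br/>` and `<img/>` tags, the dropped attributes, and the names `PAGE`, `parent_of` and `ancestors` are assumptions made for this illustration, not part of the slides).

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the example page. Assumption: <br/> and <img/>
# are self-closed and most attributes are dropped so an XML parser accepts it.
PAGE = """<html>
  <head><meta/><title>A Webpage Title</title></head>
  <body>
    <h1>Webpage Heading (H1)</h1>
    <p>Text<br/><a href="http://home.cse.ust.hk/~raywong/">link (No. 1)</a>
       more text<br/><a href="Table.html">link (No. 2)</a></p>
    <h2><a name="ImageHeading">Image Heading (H2)</a></h2>
    <img/>
  </body>
</html>"""

root = ET.fromstring(PAGE)  # root is the <html> element

# ElementTree elements do not store their parent, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}


def ancestors(node):
    """Return the tags of all ancestors of node, nearest ancestor first."""
    tags = []
    while node in parent_of:
        node = parent_of[node]
        tags.append(node.tag)
    return tags


body = root.find("body")
first_br = root.find(".//br")

parent_tag = parent_of[body].tag                                   # parent of body
child_tags = [child.tag for child in body]                         # children of body
sibling_tags = [n.tag for n in parent_of[body] if n is not body]   # siblings of body
br_ancestors = ancestors(first_br)                                 # ancestors of br
descendant_tags = [n.tag for n in body.iter() if n is not body]    # descendants of body

print(parent_tag)        # html
print(child_tags)        # ['h1', 'p', 'h2', 'img']
print(sibling_tags)      # ['head']
print(br_ancestors)      # ['p', 'body', 'html']
print(descendant_tags)
```

The parent map is needed only because this particular library does not keep parent pointers; in XPath itself these relationships are available directly.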

When we type an XPath expression, it matches a list of nodes.

Select a list of “html” nodes, starting directly from the document root.
XPath: /html
Output:
<html>
<head> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> <title>A Webpage Title</title> </head>
<body> … </body>
</html>

Select a list of “head” nodes whose parent nodes are “html” nodes (starting directly from the document root).
XPath: /html/head
Output:
<head> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> <title>A Webpage Title</title> </head>

Select a list of “title” nodes whose parent nodes are “head” nodes and whose grandparent nodes are “html” nodes (starting directly from the document root).
XPath: /html/head/title
Output:
<title>A Webpage Title</title>

Select a list of “title” nodes anywhere in the document (directly or indirectly under the document root).
XPath: //title
Output:
<title>A Webpage Title</title>

Select a list of the text content of “title” nodes anywhere in the document (directly or indirectly under the document root).
XPath: //title/text()
Output:
A Webpage Title

Select a list of “title” nodes that have an “html” ancestor, where the “html” node is matched directly from the document root.
XPath: /html//title
Output:
<title>A Webpage Title</title>
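The difference between `/` (direct child) and `//` (any descendant) can be tried out in Python. This is a minimal sketch, assuming the standard `xml.etree.ElementTree` module, which supports only a limited subset of XPath: paths are evaluated relative to an element, so `/html/head/title` is written as `./head/title` from the `<html>` element, and `text()` is replaced by reading the `.text` attribute. The trimmed `PAGE` string is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the example page (assumption for illustration).
PAGE = """<html>
  <head><meta/><title>A Webpage Title</title></head>
  <body><h1>Webpage Heading (H1)</h1></body>
</html>"""

root = ET.fromstring(PAGE)  # root is the <html> element

# /html/head/title  ->  relative to <html>: ./head/title
direct = root.findall("./head/title")

# //title and /html//title  ->  relative to <html>: .//title
anywhere = root.findall(".//title")

# /html/title would find nothing: title is not a direct child of html.
not_a_child = root.findall("./title")

# //title/text()  ->  read the text of each matched element
titles = [t.text for t in anywhere]

print(len(direct), len(anywhere), len(not_a_child))  # 1 1 0
print(titles)                                        # ['A Webpage Title']
```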

Select a list of “head” nodes which have “title” child nodes and whose parent nodes are “html” nodes (starting directly from the document root).
XPath: /html/head[title]
Output:
<head> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> <title>A Webpage Title</title> </head>

Select a list of “head” nodes which have a “title” child node whose text is “A Webpage Title” and whose parent nodes are “html” nodes (starting directly from the document root).
XPath: /html/head[title="A Webpage Title"]
Output:
<head> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> <title>A Webpage Title</title> </head>
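Predicates on child nodes, such as `head[title]`, can be sketched as follows. The example assumes Python's standard `xml.etree.ElementTree` module (a limited XPath subset, with paths written relative to the `<html>` element) and a trimmed `PAGE` string invented for illustration. The title “Another Title” is a hypothetical value used only to show a non-matching predicate.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the example page (assumption for illustration).
PAGE = """<html>
  <head><meta/><title>A Webpage Title</title></head>
  <body><h1>Webpage Heading (H1)</h1></body>
</html>"""

root = ET.fromstring(PAGE)  # root is the <html> element

# /html/head[title]: head nodes that have at least one title child
has_title = root.findall("./head[title]")

# /html/head[title="A Webpage Title"]: head nodes with a title child of that text
exact_title = root.findall("./head[title='A Webpage Title']")

# A predicate that no node satisfies yields an empty list.
other_title = root.findall("./head[title='Another Title']")

print([e.tag for e in has_title])    # ['head']
print([e.tag for e in exact_title])  # ['head']
print(other_title)                   # []
```

Note that the predicate filters the “head” nodes; the result is still the “head” node, not its “title” child.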

Select a list of “a” nodes anywhere in the document (directly or indirectly under the document root).
XPath: //a
Output:
<a href="http://home.cse.ust.hk/~raywong/">link (No. 1)</a>
<a href="Table.html">link (No. 2)</a>
<a name="ImageHeading">Image Heading (H2)</a>

Select a list of “a” nodes that have an “href” attribute, anywhere in the document.
XPath: //a[@href]
Output:
<a href="http://home.cse.ust.hk/~raywong/">link (No. 1)</a>
<a href="Table.html">link (No. 2)</a>

Select a list of the “href” attribute values of “a” nodes, anywhere in the document.
XPath: //a/@href
Output:
http://home.cse.ust.hk/~raywong/
Table.html

Select a list of the “name” attribute values of “a” nodes, anywhere in the document.
XPath: //a/@name
Output:
ImageHeading

Select a list of the “name” attribute values of any node, anywhere in the document.
XPath: //@name
Output:
ImageHeading

Select a list of “a” nodes whose “href” attribute value is exactly “Table.html”, anywhere in the document.
XPath: //a[@href="Table.html"]
Output:
<a href="Table.html">link (No. 2)</a>

Select a list of “a” nodes whose “href” attribute value is exactly “html”, anywhere in the document. No node matches, because “=” compares the whole attribute value.
XPath: //a[@href="html"]
Output:
(no match)
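The attribute queries above can be sketched in Python. This is a minimal example, assuming the standard `xml.etree.ElementTree` module, which supports attribute predicates such as `[@href]` and `[@href='...']` but cannot return attribute values directly; here `//a/@href` is emulated by calling `.get("href")` on the matched elements. The trimmed `PAGE` string is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the example page (assumption for illustration).
PAGE = """<html><body>
  <p><a href="http://home.cse.ust.hk/~raywong/">link (No. 1)</a>
     <a href="Table.html">link (No. 2)</a></p>
  <h2><a name="ImageHeading">Image Heading (H2)</a></h2>
</body></html>"""

root = ET.fromstring(PAGE)  # root is the <html> element

all_links = root.findall(".//a")                          # //a
with_href = root.findall(".//a[@href]")                   # //a[@href]
table_link = root.findall(".//a[@href='Table.html']")     # //a[@href="Table.html"]
no_match = root.findall(".//a[@href='html']")             # //a[@href="html"]

# //a/@href: take the attribute value of each matched element
hrefs = [a.get("href") for a in with_href]

# //@name: the "name" attribute of any node that has one
names = [e.get("name") for e in root.iter() if e.get("name") is not None]

print(len(all_links), len(with_href))   # 3 2
print(hrefs)                            # ['http://home.cse.ust.hk/~raywong/', 'Table.html']
print(names)                            # ['ImageHeading']
print([a.text for a in table_link])     # ['link (No. 2)']
print(no_match)                         # []
```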

Select a list of “a” nodes whose “href” attribute values contain “html”, anywhere in the document.
XPath: //a[contains(@href, "html")]
Output:
<a href="Table.html">link (No. 2)</a>

Select a list of “a” nodes whose “href” attribute values start with “Table”, anywhere in the document.
XPath: //a[starts-with(@href, "Table")]
Output:
<a href="Table.html">link (No. 2)</a>
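Python's standard `xml.etree.ElementTree` module does not implement the XPath functions `contains()` and `starts-with()`, so the sketch below emulates them with ordinary string operations after selecting the candidate nodes with `.//a[@href]`. It is an approximation of the two queries above under that assumption, not an XPath evaluation; the trimmed `PAGE` string is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the example page (assumption for illustration).
PAGE = """<html><body>
  <p><a href="http://home.cse.ust.hk/~raywong/">link (No. 1)</a>
     <a href="Table.html">link (No. 2)</a></p>
  <h2><a name="ImageHeading">Image Heading (H2)</a></h2>
</body></html>"""

root = ET.fromstring(PAGE)

# Only nodes that have an href attribute can satisfy either condition.
links = root.findall(".//a[@href]")

# Emulates //a[contains(@href, "html")] with a substring test
contains_html = [a for a in links if "html" in a.get("href")]

# Emulates //a[starts-with(@href, "Table")] with a prefix test
starts_with_table = [a for a in links if a.get("href").startswith("Table")]

print([a.text for a in contains_html])      # ['link (No. 2)']
print([a.text for a in starts_with_table])  # ['link (No. 2)']
```

A full XPath engine (for example in the browser) evaluates both functions directly inside the predicate.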

Select the union of a list of “title” nodes that have an “html” ancestor and a list of “a” nodes anywhere in the document.
XPath: /html//title | //a
Output:
<title>A Webpage Title</title>
<a href="http://home.cse.ust.hk/~raywong/">link (No. 1)</a>
<a href="Table.html">link (No. 2)</a>
<a name="ImageHeading">Image Heading (H2)</a>

HTML
<html>
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
  <title>Table Title</title>
</head>
<body>
  <table width="800" border="1">
    <tr> <th scope="col">Student ID</th> <th scope="col">Student Name</th> <th scope="col">Birth Year</th> </tr>
    <tr> <td>12345678</td> <td>Raymond</td> <td>1998</td> </tr>
    <tr> <td>87654321</td> <td>Peter Chan</td> <td>1997</td> </tr>
    <tr> <td>12341234</td> <td>Mary Lau</td> <td>1999</td> </tr>
    <tr> <td>56785678</td> <td>David Lee</td> <td>1998</td> </tr>
    <tr> <td>88888888</td> <td>Test Test</td> <td>1998</td> </tr>
  </table>
</body>
</html>

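Select a list of “tr” nodes that are descendants of “table” nodes, anywhere in the document.
XPath: //table//tr
Note: “//table/tr” does not work in the browser, because the browser's HTML parser inserts a “tbody” node between “table” and “tr”, so “tr” is no longer a direct child of “table”.
Output:
<tr> <th scope="col">Student ID</th> <th scope="col">Student Name</th> <th scope="col">Birth Year</th> </tr>
<tr> <td>12345678</td> <td>Raymond</td> <td>1998</td> </tr>
…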

Select a list of “tr” nodes anywhere in the document.
XPath: //tr
Output:
<tr> <th scope="col">Student ID</th> <th scope="col">Student Name</th> <th scope="col">Birth Year</th> </tr>
<tr> <td>12345678</td> <td>Raymond</td> <td>1998</td> </tr>
…
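Why `//table//tr` is the safer choice can be illustrated with two small documents: one shaped like the raw file, and one shaped like the tree a browser typically builds, where the HTML parser places an implied “tbody” node between “table” and “tr”. This is a minimal sketch, assuming Python's standard `xml.etree.ElementTree` module; the strings `RAW` and `BROWSER_DOM` are hand-written stand-ins for illustration, not the output of a real browser.

```python
import xml.etree.ElementTree as ET

# Shape of the raw file: tr is a direct child of table.
RAW = """<html><body><table>
  <tr><td>12345678</td></tr>
  <tr><td>87654321</td></tr>
</table></body></html>"""

# Assumed shape of the browser's tree: an implied tbody wraps the rows.
BROWSER_DOM = """<html><body><table><tbody>
  <tr><td>12345678</td></tr>
  <tr><td>87654321</td></tr>
</tbody></table></body></html>"""

raw_root = ET.fromstring(RAW)
dom_root = ET.fromstring(BROWSER_DOM)

# //table/tr: only tr nodes that are direct children of table
raw_child = raw_root.findall(".//table/tr")
dom_child = dom_root.findall(".//table/tr")

# //table//tr: tr nodes at any depth below table
raw_desc = raw_root.findall(".//table//tr")
dom_desc = dom_root.findall(".//table//tr")

print(len(raw_child), len(dom_child))  # 2 0
print(len(raw_desc), len(dom_desc))    # 2 2
```

The descendant form gives the same rows in both shapes, which is why it works whether or not a “tbody” node is present.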

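Select the first entry in the list of “tr” nodes anywhere in the document.
XPath: //tr[1]
Note that XPath positions start at 1, NOT at 0. Strictly, //tr[1] selects every “tr” node that is the first “tr” child of its parent; here all rows share one parent, so a single row is returned. To take the first “tr” node of the whole document, write (//tr)[1].
Output:
<tr> <th scope="col">Student ID</th> <th scope="col">Student Name</th> <th scope="col">Birth Year</th> </tr>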

Select the second entry in the list of “tr” nodes anywhere in the document.
XPath: //tr[2]
Output:
<tr> <td>12345678</td> <td>Raymond</td> <td>1998</td> </tr>

Select the last entry in the list of “tr” nodes anywhere in the document.
XPath: //tr[last()]
Output:
<tr> <td>88888888</td> <td>Test Test</td> <td>1998</td> </tr>

Select the first two entries in the list of “tr” nodes anywhere in the document.
XPath: //tr[position()<=2]
Output:
<tr> <th scope="col">Student ID</th> <th scope="col">Student Name</th> <th scope="col">Birth Year</th> </tr>
<tr> <td>12345678</td> <td>Raymond</td> <td>1998</td> </tr>
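The positional predicates can be tried out as follows. This is a minimal sketch, assuming Python's standard `xml.etree.ElementTree` module, which supports `[1]`, `[2]` and `[last()]` but not `position()`; the query `//tr[position()<=2]` is therefore emulated with a Python slice. The `TABLE` string is a trimmed copy of the example table, and the helper `cells` is a hypothetical name introduced for this illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example table (attributes dropped for brevity).
TABLE = """<table>
  <tr><th>Student ID</th><th>Student Name</th><th>Birth Year</th></tr>
  <tr><td>12345678</td><td>Raymond</td><td>1998</td></tr>
  <tr><td>87654321</td><td>Peter Chan</td><td>1997</td></tr>
  <tr><td>12341234</td><td>Mary Lau</td><td>1999</td></tr>
  <tr><td>56785678</td><td>David Lee</td><td>1998</td></tr>
  <tr><td>88888888</td><td>Test Test</td><td>1998</td></tr>
</table>"""

table = ET.fromstring(TABLE)  # root is the <table> element


def cells(row):
    """Return the text of every cell (th or td) in a row."""
    return [cell.text for cell in row]


first = table.findall("./tr[1]")       # //tr[1]: positions start at 1
second = table.findall("./tr[2]")      # //tr[2]
last = table.findall("./tr[last()]")   # //tr[last()]

# Emulates //tr[position()<=2]; note that the Python slice is 0-based.
first_two = table.findall("./tr")[:2]

print(cells(first[0]))                 # ['Student ID', 'Student Name', 'Birth Year']
print(cells(second[0]))                # ['12345678', 'Raymond', '1998']
print(cells(last[0]))                  # ['88888888', 'Test Test', '1998']
print([cells(r)[0] for r in first_two])  # ['Student ID', '12345678']
```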

Select a list of “tr” nodes whose 3rd “td” child node has the value 1998, anywhere in the document.
XPath: //tr[td[3] = 1998]
Output:
<tr> <td>12345678</td> <td>Raymond</td> <td>1998</td> </tr>
<tr> <td>56785678</td> <td>David Lee</td> <td>1998</td> </tr>
…

Select a list of “td” nodes which are the 2nd “td” child nodes of the “tr” nodes whose 3rd “td” child node has the value 1998, anywhere in the document.
XPath: //tr[td[3] = 1998]/td[2]
Output:
<td>Raymond</td>
<td>David Lee</td>
<td>Test Test</td>
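The query above first filters rows by their 3rd cell and then steps down to the 2nd cell of each remaining row. Python's standard `xml.etree.ElementTree` module does not support nested predicates such as `td[3] = 1998`, so the sketch below emulates the two steps with a plain loop. It illustrates the logic under that assumption and is not an XPath evaluation; the `TABLE` string is a trimmed copy of the example table.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example table (attributes dropped for brevity).
TABLE = """<table>
  <tr><th>Student ID</th><th>Student Name</th><th>Birth Year</th></tr>
  <tr><td>12345678</td><td>Raymond</td><td>1998</td></tr>
  <tr><td>87654321</td><td>Peter Chan</td><td>1997</td></tr>
  <tr><td>12341234</td><td>Mary Lau</td><td>1999</td></tr>
  <tr><td>56785678</td><td>David Lee</td><td>1998</td></tr>
  <tr><td>88888888</td><td>Test Test</td><td>1998</td></tr>
</table>"""

table = ET.fromstring(TABLE)

names = []
for row in table.findall("./tr"):
    tds = row.findall("td")  # the header row has only th cells, so tds == []
    # Step 1, td[3] = 1998: XPath counts from 1, Python from 0, hence tds[2].
    # XPath compares numerically here, so the text is converted to a number.
    if len(tds) >= 3 and float(tds[2].text) == 1998:
        # Step 2, /td[2]: take the 2nd cell of the matching row.
        names.append(tds[1].text)

print(names)  # ['Raymond', 'David Lee', 'Test Test']
```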

Select a list of “tr” nodes whose 2nd “td” child node is “Raymond” or whose 3rd “td” child node has the value 1998, anywhere in the document.
XPath: //tr[(td[2] = "Raymond") or (td[3] = 1998)]
Output:
<tr> <td>12345678</td> <td>Raymond</td> <td>1998</td> </tr>
<tr> <td>56785678</td> <td>David Lee</td> <td>1998</td> </tr>
…

Select a list of “tr” nodes whose 2nd “td” child node is “Raymond” and whose 3rd “td” child node has the value 1998, anywhere in the document.
XPath: //tr[(td[2] = "Raymond") and (td[3] = 1998)]
Output:
<tr> <td>12345678</td> <td>Raymond</td> <td>1998</td> </tr>

Select a list of “tr” nodes whose 2nd “td” child node is NOT “Raymond” and whose 3rd “td” child node has the value 1998, anywhere in the document.
XPath: //tr[not(td[2] = "Raymond") and (td[3] = 1998)]
Output:
<tr> <td>56785678</td> <td>David Lee</td> <td>1998</td> </tr>
<tr> <td>88888888</td> <td>Test Test</td> <td>1998</td> </tr>
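The “or”, “and” and “not” conditions of the last three queries can be compared side by side. Python's standard `xml.etree.ElementTree` module has no boolean operators in its XPath subset, so this sketch extracts the (name, year) pair of each data row and applies the same conditions in Python. It is an emulation under that assumption; the `TABLE` string is a trimmed copy of the example table and `records` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example table (attributes dropped for brevity).
TABLE = """<table>
  <tr><th>Student ID</th><th>Student Name</th><th>Birth Year</th></tr>
  <tr><td>12345678</td><td>Raymond</td><td>1998</td></tr>
  <tr><td>87654321</td><td>Peter Chan</td><td>1997</td></tr>
  <tr><td>12341234</td><td>Mary Lau</td><td>1999</td></tr>
  <tr><td>56785678</td><td>David Lee</td><td>1998</td></tr>
  <tr><td>88888888</td><td>Test Test</td><td>1998</td></tr>
</table>"""

table = ET.fromstring(TABLE)

# Collect (td[2], td[3]) for every row that has td cells; the header row is skipped.
records = []
for row in table.findall("./tr"):
    tds = row.findall("td")
    if len(tds) >= 3:
        records.append((tds[1].text, int(tds[2].text)))

# (td[2] = "Raymond") or (td[3] = 1998)
either = [name for name, year in records if (name == "Raymond") or (year == 1998)]

# (td[2] = "Raymond") and (td[3] = 1998)
both = [name for name, year in records if (name == "Raymond") and (year == 1998)]

# not(td[2] = "Raymond") and (td[3] = 1998)
not_raymond = [name for name, year in records
               if (not (name == "Raymond")) and (year == 1998)]

print(either)       # ['Raymond', 'David Lee', 'Test Test']
print(both)         # ['Raymond']
print(not_raymond)  # ['David Lee', 'Test Test']
```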
