
COMP4332/RMBI4310

Learn how to use XPath to extract data from XML and HTML documents, with examples and practical tips.

rjoann




Presentation Transcript


  1. COMP4332/RMBI4310 XPath Prepared by Raymond Wong Presented by Raymond Wong

  2. Outline • Overview • Example 1 (Simple Webpage) • Example 2 (Table)

  3. Overview • XPath is a query language for selecting nodes in an XML (Extensible Markup Language) document. It can also be applied to an HTML file.

  4. In a web browser, the “search” function matches what we type in the “search” input box against the “display” form of the HTML file, word by word. • With XPath, we write an XPath expression that is matched against the “raw” form of the HTML file according to its structure. • XPath can be used in some “Data Crawling” operations.

  5. In Google Chrome, we can install a plug-in called “XPath Helper” to help us understand XPath better. • We can also right-click a part of the webpage and choose “Inspect” to see the corresponding part of the “raw” form of the HTML file.

  6. Outline • Overview • Example 1 (Simple Webpage) • Example 2 (Table)

  7. HTML
     <!DOCTYPE html>
     <html>
     <head>
     <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
     <title>A Webpage Title</title>
     </head>
     <body>
     <h1>Webpage Heading (H1)</h1>
     <p>We could link to a webpage outside this webpage.<br>
     This is a <a href="http://home.cse.ust.hk/~raywong/">link (No. 1)</a> linking to Raymond's homepage. This link is an absolute link (linking to a webpage outside this webpage).<br>
     This is a <a href="Table.html">link (No. 2)</a> linking to the next webpage to be taught. This link is a relative link (linking to a webpage outside this webpage).</p>
     <h2><a name="ImageHeading">Image Heading (H2)</a></h2>
     <img border="1" src="http://home.cse.ust.hk/~raywong/photo/raymond2.JPG" width="166" height="221">
     </body>
     </html>

  8. Node tree of the HTML above (children are indented under their parent, in document order)
     html
       head
         meta
         title
       body
         h1
         p
           br
           a
           br
           a
         h2
           a
         img

  9. Node element • E.g., <html>…</html>: root node element • E.g., <head>…</head>: non-root node element • E.g., <title>…</title>: non-root node element • E.g., <br>: non-root node element • E.g., <a href="Table.html"> … </a>: non-root node element; “href” is an attribute of this node element

  10. Parent • Parent of “body” = ? • Children • Children of “body” = ? • Siblings • Sibling of “body” = ?

  11. Ancestor • Ancestors of “br” = ? • Descendant • Descendant of “body” = ?
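The parent, children, sibling, ancestor and descendant relations asked about above can be checked programmatically. The following is a minimal sketch, assuming Python's standard-library `xml.etree.ElementTree`; because that parser needs well-formed XML, the Example 1 page is adapted here (void tags self-closed, text shortened), and names such as `PAGE` and `parent_of` are illustrative only.

```python
import xml.etree.ElementTree as ET

# Well-formed adaptation of the Example 1 page (assumption: <br> and <img>
# are self-closed and the paragraph text is shortened).
PAGE = """<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    <title>A Webpage Title</title>
  </head>
  <body>
    <h1>Webpage Heading (H1)</h1>
    <p>First line.<br />A <a href="http://home.cse.ust.hk/~raywong/">link (No. 1)</a>.<br />
       A <a href="Table.html">link (No. 2)</a>.</p>
    <h2><a name="ImageHeading">Image Heading (H2)</a></h2>
    <img src="http://home.cse.ust.hk/~raywong/photo/raymond2.JPG" />
  </body>
</html>"""
root = ET.fromstring(PAGE)

# ElementTree keeps no parent pointers, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}

body = root.find("body")
first_br = root.find(".//br")

parent = parent_of[body].tag
children = [c.tag for c in body]
siblings = [c.tag for c in parent_of[body] if c is not body]

# Walk upwards from the first <br> until the root is reached.
ancestors = []
node = first_br
while node in parent_of:
    node = parent_of[node]
    ancestors.append(node.tag)

# Every node below <body>, in document order.
descendants = [d.tag for d in body.iter() if d is not body]

print("parent of body:", parent)
print("children of body:", children)
print("siblings of body:", siblings)
print("ancestors of br:", ancestors)
print("descendants of body:", descendants)
```

For this page the sketch reports “html” as the parent of “body”, “head” as its only sibling, and “p”, “body”, “html” as the ancestors of “br”.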

  12. When we evaluate an XPath expression, it matches a list of nodes.

  13. Select a list of “html” nodes starting from the beginning directly XPath /html Output <html> <head> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> <title>A Webpage Title</title> </head> <body> … </body> </html>

  14. Select a list of “head” nodes whose parent nodes are “html” nodes (starting from the beginning directly) XPath /html/head Output <head> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> <title>A Webpage Title</title> </head>

  15. Select a list of “title” nodes whose parent nodes are “head” nodes and whose grandparent nodes are “html” nodes (starting from the beginning directly) XPath /html/head/title Output <title>A Webpage Title</title>
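The three child-step queries above can be tried outside the browser. This is a minimal sketch, assuming Python's standard-library `xml.etree.ElementTree`, which supports only a subset of XPath and evaluates paths relative to an element rather than from the document start. The mapping in the comments (for example `/html/head` written as `./head` from the root element) is therefore an adaptation, and `PAGE` is a shortened, well-formed version of the slide's HTML.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed version of the Example 1 page (assumption).
PAGE = """<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    <title>A Webpage Title</title>
  </head>
  <body><h1>Webpage Heading (H1)</h1></body>
</html>"""
root = ET.fromstring(PAGE)  # root is the <html> element

# /html            -> the root element itself
html_nodes = [root] if root.tag == "html" else []
# /html/head       -> "head" children of the root
head_nodes = root.findall("./head")
# /html/head/title -> "title" children of those "head" nodes
title_nodes = root.findall("./head/title")

print([n.tag for n in html_nodes + head_nodes + title_nodes])
title_markup = ET.tostring(title_nodes[0], encoding="unicode").strip()
print(title_markup)
```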

  16. Select a list of “title” nodes starting from the beginning (directly or indirectly) XPath //title Output <title>A Webpage Title</title>

  17. Select a list of the text content of “title” nodes starting from the beginning (directly or indirectly) XPath //title/text() Output A Webpage Title

  18. Select a list of “title” nodes whose ancestor nodes are “html” nodes (starting from the beginning directly) XPath /html//title Output <title>A Webpage Title</title>
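The difference between a direct step (`/`) and a descendant step (`//`) can be sketched as follows, again assuming Python's standard-library `xml.etree.ElementTree` and a shortened, well-formed copy of the page. ElementTree has no `text()` step, so the text is read from each node's `.text` attribute instead; that substitution is this sketch's, not the slide's.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed version of the Example 1 page (assumption).
PAGE = """<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    <title>A Webpage Title</title>
  </head>
  <body><h1>Webpage Heading (H1)</h1></body>
</html>"""
root = ET.fromstring(PAGE)  # root is the <html> element

# //title and /html//title: "title" nodes anywhere below the root.
titles = root.findall(".//title")

# /html/title: "title" nodes that are direct children of the root (none here,
# because "title" sits inside "head").
direct_titles = root.findall("./title")

# //title/text(): ElementTree exposes the text as an attribute of the node.
title_texts = [t.text for t in titles]

print(len(titles), len(direct_titles), title_texts)
```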

  19. Select a list of “head” nodes which have “title” child nodes and whose parent nodes are “html” nodes (starting from the beginning directly) XPath /html/head[title] Output <head> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> <title>A Webpage Title</title> </head>

  20. Select a list of “head” nodes whose child “title” nodes are “A Webpage Title” and whose parent nodes are “html” nodes (starting from the beginning directly) XPath /html/head[title="A Webpage Title"] Output <head> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> <title>A Webpage Title</title> </head>
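The two predicate forms above, `[title]` and `[title="..."]`, are both within the XPath subset of Python's standard-library `xml.etree.ElementTree`, so they can be sketched directly. The page below is a shortened, well-formed copy, and the title “Another Title” is a made-up value used only to show a non-matching predicate.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed version of the Example 1 page (assumption).
PAGE = """<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    <title>A Webpage Title</title>
  </head>
  <body><h1>Webpage Heading (H1)</h1></body>
</html>"""
root = ET.fromstring(PAGE)  # root is the <html> element

# /html/head[title]: "head" nodes that have a "title" child.
heads_with_title = root.findall("./head[title]")

# /html/head[title="A Webpage Title"]: the child's text must match exactly.
heads_exact = root.findall("./head[title='A Webpage Title']")

# A predicate that matches nothing yields an empty list (hypothetical value).
heads_other = root.findall("./head[title='Another Title']")

print(len(heads_with_title), len(heads_exact), len(heads_other))
```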

  21. Select a list of “a” nodes starting from the beginning (directly or indirectly) XPath //a Output <a href="http://home.cse.ust.hk/~raywong/">link (No. 1)</a> <a href="Table.html">link (No. 2)</a> <a name="ImageHeading">Image Heading (H2)</a>

  22. Select a list of “a” nodes with attribute “href” starting from the beginning (directly or indirectly) XPath //a[@href] Output <a href="http://home.cse.ust.hk/~raywong/">link (No. 1)</a> <a href="Table.html">link (No. 2)</a>
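Selecting all “a” nodes versus only those carrying an “href” attribute can be sketched as below. This assumes Python's standard-library `xml.etree.ElementTree` and a shortened, well-formed copy of the page body; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed version of the Example 1 body (assumption).
PAGE = """<html><body>
  <p>A <a href="http://home.cse.ust.hk/~raywong/">link (No. 1)</a>.<br />
     A <a href="Table.html">link (No. 2)</a>.</p>
  <h2><a name="ImageHeading">Image Heading (H2)</a></h2>
</body></html>"""
root = ET.fromstring(PAGE)

# //a: every "a" node, in document order.
all_links = root.findall(".//a")

# //a[@href]: only "a" nodes that have an "href" attribute.
with_href = root.findall(".//a[@href]")

all_texts = [a.text for a in all_links]
href_texts = [a.text for a in with_href]
print(all_texts)
print(href_texts)
```

The third “a” node only has a “name” attribute, which is why it drops out of the second list.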

  23. Select a list of “href” attribute values of “a” nodes (starting from the beginning) XPath //a/@href Output http://home.cse.ust.hk/~raywong/ Table.html

  24. Select a list of “name” attribute values of “a” nodes (starting from the beginning) XPath //a/@name Output ImageHeading

  25. Select a list of “name” attribute values of any node (starting from the beginning) XPath //@name Output ImageHeading

  26. Select a list of “a” nodes whose “href” attribute values are “Table.html” (starting from the beginning, directly or indirectly) XPath //a[@href="Table.html"] Output <a href="Table.html">link (No. 2)</a>
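The attribute queries above can be approximated as follows. Python's standard-library `xml.etree.ElementTree` (assumed here) cannot return attribute values from a path such as `//a/@href`, so this sketch selects the nodes with an attribute predicate and then reads the value with `.get()`. The page is a shortened, well-formed copy.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed version of the Example 1 body (assumption).
PAGE = """<html><body>
  <p>A <a href="http://home.cse.ust.hk/~raywong/">link (No. 1)</a>.<br />
     A <a href="Table.html">link (No. 2)</a>.</p>
  <h2><a name="ImageHeading">Image Heading (H2)</a></h2>
</body></html>"""
root = ET.fromstring(PAGE)

# //a/@href: "href" values of "a" nodes.
hrefs = [a.get("href") for a in root.findall(".//a[@href]")]

# //a/@name: "name" values of "a" nodes.
a_names = [a.get("name") for a in root.findall(".//a[@name]")]

# //@name: "name" values of any node.
any_names = [e.get("name") for e in root.findall(".//*[@name]")]

# //a[@href="Table.html"]: "a" nodes whose "href" equals the given value.
table_links = root.findall(".//a[@href='Table.html']")
table_link_texts = [a.text for a in table_links]

print(hrefs, a_names, any_names, table_link_texts)
```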

  27. Select a list of “a” nodes whose “href” attribute values are “html” (starting from the beginning) XPath //a[@href="html"] Output -

  28. Select a list of “a” nodes whose “href” attribute values contain “html” (starting from the beginning) XPath //a[contains(@href, "html")] Output <a href="Table.html">link (No. 2)</a>

  29. Select a list of “a” nodes whose “href” attribute values start with “Table” (starting from the beginning) XPath //a[starts-with(@href, "Table")] Output <a href="Table.html">link (No. 2)</a>
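The contrast between an exact match, `contains()` and `starts-with()` can be illustrated as below. Note the swap: Python's standard-library `xml.etree.ElementTree` (assumed here) has no `contains()` or `starts-with()` functions, so this sketch selects `//a[@href]` and then applies the equivalent Python string tests. A full XPath 1.0 engine would evaluate the slide's expressions directly.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed version of the Example 1 body (assumption).
PAGE = """<html><body>
  <p>A <a href="http://home.cse.ust.hk/~raywong/">link (No. 1)</a>.<br />
     A <a href="Table.html">link (No. 2)</a>.</p>
  <h2><a name="ImageHeading">Image Heading (H2)</a></h2>
</body></html>"""
root = ET.fromstring(PAGE)

links = root.findall(".//a[@href]")

# //a[@href="html"]: the whole value must equal "html" (no match here).
exact = [a.text for a in links if a.get("href") == "html"]

# //a[contains(@href, "html")]: substring test.
containing = [a.text for a in links if "html" in a.get("href")]

# //a[starts-with(@href, "Table")]: prefix test.
starting = [a.text for a in links if a.get("href").startswith("Table")]

print(exact, containing, starting)
```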

  30. Select the union of a list of “title” nodes whose ancestor nodes are “html” nodes and a list of “a” nodes starting from the beginning XPath /html//title | //a Output (union) <title>A Webpage Title</title> <a href="http://home.cse.ust.hk/~raywong/">link (No. 1)</a> <a href="Table.html">link (No. 2)</a> <a name="ImageHeading">Image Heading (H2)</a>
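The union operator `|` merges two node lists, removes duplicates and returns the result in document order. Python's standard-library `xml.etree.ElementTree` (assumed here) does not support `|`, so the sketch below runs the two queries separately and merges them by hand. The `order` dictionary is a helper introduced for this illustration.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed version of the Example 1 page (assumption).
PAGE = """<html>
  <head><title>A Webpage Title</title></head>
  <body>
    <p>A <a href="http://home.cse.ust.hk/~raywong/">link (No. 1)</a>.<br />
       A <a href="Table.html">link (No. 2)</a>.</p>
    <h2><a name="ImageHeading">Image Heading (H2)</a></h2>
  </body>
</html>"""
root = ET.fromstring(PAGE)

titles = root.findall(".//title")   # /html//title
anchors = root.findall(".//a")      # //a

# Position of every node in document order, used to sort the merged set.
order = {el: i for i, el in enumerate(root.iter())}

# Set union removes duplicates; sorting restores document order.
union = sorted(set(titles) | set(anchors), key=order.__getitem__)

union_tags = [el.tag for el in union]
union_texts = [el.text for el in union]
print(union_tags)
print(union_texts)
```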

  31. Outline • Overview • Example 1 (Simple Webpage) • Example 2 (Table)

  32. HTML
     <html>
     <head>
     <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
     <title>Table Title</title>
     </head>
     <body>
     <table width="800" border="1">
     <tr> <th scope="col">Student ID</th> <th scope="col">Student Name</th> <th scope="col">Birth Year</th> </tr>
     <tr> <td>12345678</td> <td>Raymond</td> <td>1998</td> </tr>
     <tr> <td>87654321</td> <td>Peter Chan</td> <td>1997</td> </tr>
     <tr> <td>12341234</td> <td>Mary Lau</td> <td>1999</td> </tr>
     <tr> <td>56785678</td> <td>David Lee</td> <td>1998</td> </tr>
     <tr> <td>88888888</td> <td>Test Test</td> <td>1998</td> </tr>
     </table>
     </body>
     </html>

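  33. Select a list of “tr” nodes under “table” nodes starting from the beginning XPath //table//tr Note that “//table/tr” does not work in the browser: when the browser parses the page, it inserts a “tbody” node between “table” and “tr”, so each “tr” is a descendant of “table” but not a child. Output <tr> <th scope="col">Student ID</th> <th scope="col">Student Name</th> <th scope="col">Birth Year</th> </tr> <tr> <td>12345678</td> <td>Raymond</td> <td>1998</td> </tr> …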
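Why `//table/tr` can come back empty while `//table//tr` succeeds is easiest to see by comparing the table as written with the table as a browser represents it, where an implicit “tbody” node sits between “table” and “tr”. The sketch below assumes Python's standard-library `xml.etree.ElementTree`; the second document adds the “tbody” by hand to mimic the browser, and `count` is a helper introduced for this illustration.

```python
import xml.etree.ElementTree as ET

# Two rows are enough to show the effect (shortened from the slide's table).
ROWS = "<tr><th>Student ID</th></tr><tr><td>12345678</td></tr>"

# The table exactly as written in the HTML source.
as_written = ET.fromstring(f"<html><body><table>{ROWS}</table></body></html>")

# The table as a browser's DOM holds it: <tbody> inserted by hand (assumption
# made to mimic the browser's HTML parser).
as_browser = ET.fromstring(
    f"<html><body><table><tbody>{ROWS}</tbody></table></body></html>"
)

def count(root, path):
    """Number of nodes matched by path, evaluated from the root element."""
    return len(root.findall(path))

child_written = count(as_written, ".//table/tr")       # tr is a child of table
child_browser = count(as_browser, ".//table/tr")       # tr is a grandchild
descendant_browser = count(as_browser, ".//table//tr")
descendant_written = count(as_written, ".//table//tr")

print(child_written, child_browser, descendant_browser, descendant_written)
```

The descendant step `//table//tr` works on both forms, which is one reason to prefer it when the same expression has to run in the browser and in a crawler.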

  34. Select a list of “tr” nodes starting from the beginning XPath //tr Output <tr> <th scope="col">Student ID</th> <th scope="col">Student Name</th> <th scope="col">Birth Year</th> </tr> <tr> <td>12345678</td> <td>Raymond</td> <td>1998</td> </tr> …

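  35. Select the first entry in the list of “tr” nodes starting from the beginning XPath //tr[1] Note that positions in XPath start at 1, NOT 0. Strictly, “//tr[1]” selects every “tr” node that is the first “tr” child of its parent; here all rows share one parent, so a single row is returned. Output <tr> <th scope="col">Student ID</th> <th scope="col">Student Name</th> <th scope="col">Birth Year</th> </tr>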

  36. Select the second entry in the list of “tr” nodes starting from the beginning XPath //tr[2] Output <tr> <td>12345678</td> <td>Raymond</td> <td>1998</td> </tr>

  37. Select the last entry in the list of “tr” nodes starting from the beginning XPath //tr[last()] Output <tr> <td>88888888</td> <td>Test Test</td> <td>1998</td> </tr>

  38. Select the first two entries in the list of “tr” nodes starting from the beginning XPath //tr[position()<=2] Output <tr> <th scope="col">Student ID</th> <th scope="col">Student Name</th> <th scope="col">Birth Year</th> </tr> <tr> <td>12345678</td> <td>Raymond</td> <td>1998</td> </tr>
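The positional predicates above can be sketched with Python's standard-library `xml.etree.ElementTree` (assumed here), which supports `[1]`, `[2]` and `[last()]` but not `position()<=2`. For the last case the sketch slices the full row list instead, which gives the same rows only because every “tr” in this table has the same parent. The table is copied from the slide with its attributes omitted, and `cells` is a helper introduced for this illustration.

```python
import xml.etree.ElementTree as ET

# The slide's table with attributes omitted (assumption for brevity).
TABLE = """<table>
<tr><th>Student ID</th><th>Student Name</th><th>Birth Year</th></tr>
<tr><td>12345678</td><td>Raymond</td><td>1998</td></tr>
<tr><td>87654321</td><td>Peter Chan</td><td>1997</td></tr>
<tr><td>12341234</td><td>Mary Lau</td><td>1999</td></tr>
<tr><td>56785678</td><td>David Lee</td><td>1998</td></tr>
<tr><td>88888888</td><td>Test Test</td><td>1998</td></tr>
</table>"""
root = ET.fromstring(f"<html><body>{TABLE}</body></html>")

def cells(tr):
    """Text of every cell (th or td) in a row."""
    return [c.text for c in tr]

first = [cells(tr) for tr in root.findall(".//tr[1]")]       # //tr[1]
second = [cells(tr) for tr in root.findall(".//tr[2]")]      # //tr[2]
last = [cells(tr) for tr in root.findall(".//tr[last()]")]   # //tr[last()]

# //tr[position()<=2]: not in ElementTree's subset, so slice in Python.
first_two = [cells(tr) for tr in root.findall(".//tr")[:2]]

print(first, second, last, first_two, sep="\n")
```

Note the off-by-one between the two notations: XPath's `[1]` is the first row, while Python's `[:2]` counts from index 0.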

  39. Select a list of “tr” nodes whose 3rd “td” child nodes are 1998 (starting from the beginning) XPath //tr[td[3] = 1998] Output <tr> <td>12345678</td> <td>Raymond</td> <td>1998</td> </tr> <tr> <td>56785678</td> <td>David Lee</td> <td>1998</td> </tr> …

  40. Select a list of “td” nodes which are the 2nd “td” child nodes of the “tr” nodes whose 3rd “td” child nodes are 1998 (starting from the beginning) XPath //tr[td[3] = 1998]/td[2] Output <td>Raymond</td> <td>David Lee</td> <td>Test Test</td>
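Filtering rows by one column and then returning another column, as in `//tr[td[3] = 1998]/td[2]`, can be imitated as follows. Python's standard-library `xml.etree.ElementTree` (assumed here) does not accept a positional predicate nested inside another predicate, so the filter is written in Python. The helper `cell` is hypothetical, and it compares the cell text as a string, whereas XPath would compare it as a number.

```python
import xml.etree.ElementTree as ET

# The slide's table with attributes omitted (assumption for brevity).
TABLE = """<table>
<tr><th>Student ID</th><th>Student Name</th><th>Birth Year</th></tr>
<tr><td>12345678</td><td>Raymond</td><td>1998</td></tr>
<tr><td>87654321</td><td>Peter Chan</td><td>1997</td></tr>
<tr><td>12341234</td><td>Mary Lau</td><td>1999</td></tr>
<tr><td>56785678</td><td>David Lee</td><td>1998</td></tr>
<tr><td>88888888</td><td>Test Test</td><td>1998</td></tr>
</table>"""
root = ET.fromstring(f"<html><body>{TABLE}</body></html>")

def cell(tr, n):
    """Text of the n-th td child of tr (1-based, like XPath), or None."""
    tds = tr.findall("td")
    return tds[n - 1].text if len(tds) >= n else None

# //tr[td[3] = 1998]: rows whose third "td" is 1998.
rows_1998 = [tr for tr in root.iter("tr") if cell(tr, 3) == "1998"]

# //tr[td[3] = 1998]/td[2]: the second "td" of each of those rows.
names = [cell(tr, 2) for tr in rows_1998]

print(names)
```

The header row has no “td” children at all, so `cell` returns `None` for it and the row is skipped, just as the XPath predicate skips it.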

  41. Select a list of “tr” nodes whose 2nd “td” child nodes are “Raymond” or 3rd “td” child nodes are 1998 (starting from the beginning) XPath //tr[(td[2] = "Raymond") or (td[3] = 1998)] Output <tr> <td>12345678</td> <td>Raymond</td> <td>1998</td> </tr> <tr> <td>56785678</td> <td>David Lee</td> <td>1998</td> </tr> …

  42. Select a list of “tr” nodes whose 2nd “td” child nodes are “Raymond” and 3rd “td” child nodes are 1998 (starting from the beginning) XPath //tr[(td[2] = "Raymond") and (td[3] = 1998)] Output <tr> <td>12345678</td> <td>Raymond</td> <td>1998</td> </tr>

  43. Select a list of “tr” nodes whose 2nd “td” child nodes are NOT “Raymond” and 3rd “td” child nodes are 1998 (starting from the beginning) XPath //tr[not(td[2] = "Raymond") and (td[3] = 1998)] Output <tr> <td>56785678</td> <td>David Lee</td> <td>1998</td> </tr> <tr> <td>88888888</td> <td>Test Test</td> <td>1998</td> </tr>
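The `or`, `and` and `not()` combinations above can be mirrored with Python's boolean operators. As in the slide, the conditions test the 2nd and 3rd “td” of each row. This is only a sketch: Python's standard-library `xml.etree.ElementTree` (assumed here) does not support boolean operators in predicates, so the tests run in Python, and `cell` is a hypothetical helper that compares text as strings.

```python
import xml.etree.ElementTree as ET

# The slide's table with attributes omitted (assumption for brevity).
TABLE = """<table>
<tr><th>Student ID</th><th>Student Name</th><th>Birth Year</th></tr>
<tr><td>12345678</td><td>Raymond</td><td>1998</td></tr>
<tr><td>87654321</td><td>Peter Chan</td><td>1997</td></tr>
<tr><td>12341234</td><td>Mary Lau</td><td>1999</td></tr>
<tr><td>56785678</td><td>David Lee</td><td>1998</td></tr>
<tr><td>88888888</td><td>Test Test</td><td>1998</td></tr>
</table>"""
root = ET.fromstring(f"<html><body>{TABLE}</body></html>")
rows = root.findall(".//tr")

def cell(tr, n):
    """Text of the n-th td child of tr (1-based, like XPath), or None."""
    tds = tr.findall("td")
    return tds[n - 1].text if len(tds) >= n else None

def names(selected):
    """Student names (2nd td) of the selected rows."""
    return [cell(tr, 2) for tr in selected]

# //tr[(td[2] = "Raymond") or (td[3] = 1998)]
either = names(tr for tr in rows
               if cell(tr, 2) == "Raymond" or cell(tr, 3) == "1998")

# //tr[(td[2] = "Raymond") and (td[3] = 1998)]
both = names(tr for tr in rows
             if cell(tr, 2) == "Raymond" and cell(tr, 3) == "1998")

# //tr[not(td[2] = "Raymond") and (td[3] = 1998)]
not_raymond = names(tr for tr in rows
                    if (not cell(tr, 2) == "Raymond") and cell(tr, 3) == "1998")

print(either, both, not_raymond, sep="\n")
```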
