
Chapter 10: XML



  1. Chapter 10: XML The world of XML

  2. The Data
     • Semistructured data instance = a large graph

  3. The indexing problem
     • The storage problem
       • Store the graph in a relational DBMS
       • Develop a new database storage structure
     • The indexing problem:
       • Input: large, irregular data graph
       • Output: index structure for evaluating (regular) path expressions, e.g. bib.paper.author.firstname

  4. XSet: a simple index for XML • Part of the Ninja project at Berkeley • Example XML data:

  5. XSet: a simple index for XML
     • Each node = a hashtable
     • Each entry = list of pointers to data nodes (not shown)
     • SELECT X FROM part.name X - yes
     • SELECT X FROM part.supplier.name X - yes
     • SELECT X FROM part.*.subpart.name X - maybe
     • SELECT X FROM *.supplier.name X - maybe
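The "yes" cases above are plain label paths, which a tree of hashtables can answer by one lookup per step. A minimal Python sketch of that idea follows; the class `IndexNode`, the helpers `build_index` and `lookup`, and the sample document are illustrative assumptions, not XSet's actual implementation.

```python
import xml.etree.ElementTree as ET


class IndexNode:
    """One index node: a hashtable of child labels plus pointers to data nodes."""

    def __init__(self):
        self.children = {}  # label -> IndexNode
        self.extent = []    # data nodes reached by the label path ending here


def build_index(root_elem):
    """Build the path index for an XML tree in one traversal."""
    top = IndexNode()

    def visit(elem, parent_idx):
        idx = parent_idx.children.setdefault(elem.tag, IndexNode())
        idx.extent.append(elem)
        for child in elem:
            visit(child, idx)

    visit(root_elem, top)
    return top


def lookup(index, path):
    """Evaluate a simple dotted path such as 'part.supplier.name'."""
    node = index
    for label in path.split("."):
        node = node.children.get(label)
        if node is None:
            return []
    return node.extent


# Hypothetical sample data (the slide's example document is not reproduced here).
doc = ET.fromstring(
    "<part><name>bolt</name>"
    "<supplier><name>Acme</name></supplier>"
    "<subpart><name>thread</name></subpart></part>"
)
index = build_index(doc)
part_names = [e.text for e in lookup(index, "part.name")]
supplier_names = [e.text for e in lookup(index, "part.supplier.name")]
```

Paths containing `*` need a search over several index nodes rather than a single chain of lookups, which is one way to read the "maybe" answers.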

  6. Region Algebras
     • structured text = text with tags (like XML)
     • data = sequence of characters [c1 c2 c3 …]
     • region = interval in the text
       • representation (x,y) = [cx, cx+1, …, cy]
       • example: <section> … </section>
     • region set = a set of regions
       • example: all <section> regions (may be nested)
     • region algebra = operators on region sets, s1 op s2
       • s1 intersect s2 = {r | r ∈ s1, r ∈ s2}
       • s1 included s2 = {r | r ∈ s1, ∃ r' ∈ s2, r ⊆ r'}
       • s1 including s2 = {r | r ∈ s1, ∃ r' ∈ s2, r ⊇ r'}
       • s1 parent s2 = {r | r ∈ s1, ∃ r' ∈ s2, r is a parent of r'}
       • s1 child s2 = {r | r ∈ s1, ∃ r' ∈ s2, r is a child of r'}
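The set definitions of the first three operators translate almost literally into code. The following Python sketch is a naive (quadratic) rendering for illustration only; regions are assumed to be `(x, y)` tuples with `x <= y`, and the function names and sample region sets are hypothetical. The parent and child operators are left out because they additionally need the nesting structure of the document.

```python
def contains(outer, inner):
    """True if region `inner` lies within region `outer` (bounds inclusive)."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def intersect(s1, s2):
    """Regions that occur in both sets."""
    return {r for r in s1 if r in s2}


def included(s1, s2):
    """Regions of s1 that lie inside some region of s2."""
    return {r for r in s1 if any(contains(r2, r) for r2 in s2)}


def including(s1, s2):
    """Regions of s1 that enclose some region of s2."""
    return {r for r in s1 if any(contains(r, r2) for r2 in s2)}


# Hypothetical character offsets: two nested sections and two titles.
sections = {(0, 100), (10, 40)}
titles = {(12, 20), (150, 160)}

titles_in_sections = included(titles, sections)
sections_with_title = including(sections, titles)
```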

  7. Region Algebras
     • Region expressions correspond to simple XPath expressions
     • s1 child s2 = {r | r ∈ s1, ∃ r' ∈ s2, r is a child of r'}
     • Path expression → region expression:
       • part.name → name child (part child root)
       • part.supplier.name → name child (supplier child (part child root))
       • *.supplier.name → name child supplier
       • part.*.subpart.name → name child (subpart included (part child root))

  8. Efficient computation of Region Algebra Operators
     • Example: s1 included s2
       s1 = {(x1,x1'), (x2,x2'), …}
       s2 = {(y1,y1'), (y2,y2'), …}
       (i.e. assume each set consists of disjoint regions, listed in increasing order of position)
     • Algorithm:
       if xi < yj then i := i + 1
       else if xi' > yj' then j := j + 1
       otherwise: print (xi,xi'), do i := i + 1
     • Can be done in sub-linear time when one of the region sets is very small
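The merge step for `included` can be sketched in Python as follows. This is an illustration under the slide's assumption that both inputs hold disjoint regions, and it additionally assumes they are given as lists sorted by start position; the name `included_merge` and the sample data are hypothetical.

```python
def included_merge(s1, s2):
    """Return the regions of s1 contained in some region of s2.

    Both inputs are lists of (start, end) pairs, each list disjoint and
    sorted by start. Runs in O(len(s1) + len(s2)).
    """
    result = []
    i = j = 0
    while i < len(s1) and j < len(s2):
        x, x_end = s1[i]
        y, y_end = s2[j]
        if x < y:
            # s1[i] starts before s2[j]; no current or later s2 region can contain it.
            i += 1
        elif x_end > y_end:
            # s1[i] extends past s2[j]; a later s2 region might still contain it.
            j += 1
        else:
            # y <= x and x_end <= y_end: s1[i] lies inside s2[j].
            result.append(s1[i])
            i += 1
    return result


s1 = [(1, 2), (5, 6), (12, 13), (20, 25)]
s2 = [(4, 8), (10, 15), (22, 30)]
inside = included_merge(s1, s2)
```

The linear scan shown here does not by itself give the sub-linear bound mentioned on the slide; that would require skipping ahead in the larger list, for example by binary search.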

  9. Storage structures for region algebras
     • Every node is characterised by an integer pair (x,y)
     • This means we have a 2-d space
     • Any 2-d space data structure can be used
     • If you use a (pre-order, post-order) numbering you get a triangular filling of the 2-d space (to be discussed later)
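As a rough illustration of the (pre-order, post-order) pair, the Python sketch below numbers the nodes of a small tree and uses the pair for a descendant test. The helper names `number_tree` and `is_descendant` and the sample document are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def number_tree(root):
    """Assign pre-order and post-order ranks to every element of the tree."""
    pre, post = {}, {}
    counters = {"pre": 0, "post": 0}

    def visit(elem):
        pre[elem] = counters["pre"]
        counters["pre"] += 1
        for child in elem:
            visit(child)
        post[elem] = counters["post"]
        counters["post"] += 1

    visit(root)
    return pre, post


def is_descendant(pre, post, d, a):
    """d is a descendant of a iff it comes later in pre-order and earlier in post-order."""
    return pre[d] > pre[a] and post[d] < post[a]


doc = ET.fromstring("<a><b><c/></b><d/></a>")
b, d = doc.find("b"), doc.find("d")
c = b.find("c")
pre, post = number_tree(doc)
```

With this numbering, the descendants of a node occupy one quadrant of the (pre, post) plane relative to that node, which is why a 2-d structure can serve as an index.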

  10. Alternative mappings • Mapping the structure to the relational world • The Edge approach • The Attribute approach • The Universal Table approach • The Normalized Universal approach • The Monet/XML approach • The Dataguide approach • Mapping values • Separate value tables • Inlining • Shredding

  11. Dataguide approach
      • Developed in the context of Lore, Lorel (Stanford Univ)
      • Predecessor of the Monet/XML model
      • Observation:
        • queries in the graph representation take a limited form
        • they are partial walks from the root to an object of interest
        • this behaviour was stressed by the query language Lorel, i.e. an SQL-based query language based on processing regular expressions
      • SELECT X FROM (Bib.*.author).(lastname|firstname).Abiteboul X

  12. DataGuides
      • Definition: given a semistructured data instance DB, a DataGuide for DB is a graph G s.t.:
        - every path in DB also occurs in G
        - every path in G occurs in DB
        - every path in G is unique

  13. Dataguides Example:

  14. DataGuides • Multiple DataGuides for the same data:

  15. DataGuides
      • Definition: Let w, w' be two words (i.e. word queries) and G a graph. Then w ≡G w' if w(G) = w'(G)
      • Definition: G is a strong dataguide for a database DB if ≡G is the same as ≡DB
      • Example:
        - G1 is a strong dataguide
        - G2 is not strong: person.project ≢DB dept.project, but person.project ≡G2 dept.project

  16. DataGuides
      • Constructing the strong DataGuide G:
        Nodes(G) = {{root}}
        Edges(G) = ∅
        while changes do
          choose s in Nodes(G), a in Labels
          add s' = {y | x in s, (x -a-> y) in Edges(DB)} to Nodes(G)
          add (s -a-> s') to Edges(G)
      • Use a hash table for Nodes(G)
      • This is precisely the powerset automaton construction.
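The construction can be sketched in Python with a worklist in place of the "while changes" loop. This is an illustrative sketch, not Lore's implementation: the data graph is assumed to be a list of `(source, label, target)` triples, empty target sets are skipped, and the function name and sample graph are hypothetical.

```python
from collections import defaultdict


def strong_dataguide(root, db_edges):
    """Powerset-style construction: every guide node is a set of data nodes."""
    out = defaultdict(lambda: defaultdict(set))  # x -> label -> set of targets
    for x, a, y in db_edges:
        out[x][a].add(y)

    start = frozenset([root])
    nodes = {start}          # hash table for Nodes(G)
    edges = set()            # Edges(G) as (s, label, s') triples
    worklist = [start]
    while worklist:
        s = worklist.pop()
        targets = defaultdict(set)
        for x in s:
            for a, ys in out[x].items():
                targets[a] |= ys
        for a, ys in targets.items():
            s_next = frozenset(ys)
            edges.add((s, a, s_next))
            if s_next not in nodes:
                nodes.add(s_next)
                worklist.append(s_next)
    return nodes, edges


# Hypothetical data graph: two persons and one department sharing projects.
db = [
    ("root", "person", "p1"), ("root", "person", "p2"), ("root", "dept", "d1"),
    ("p1", "project", "j1"), ("p2", "project", "j2"), ("d1", "project", "j1"),
]
guide_nodes, guide_edges = strong_dataguide("root", db)
```

In this sample, `person.project` and `dept.project` reach different sets of data nodes, so the guide keeps them as two distinct nodes.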

  17. DataGuides
      • How large are the dataguides?
        • if DB is a tree, then size(G) <= size(DB)
        • why? answer: every node is in exactly one extent of G
        • here: dataguide = XSet
      • How many nodes does the strong dataguide have for this DB?
        • 20 nodes (least common multiple of 4 and 5)
      • Dataguides usually fail on data with cyclic schemas, like:

  18. Monet XML approach

  19. Monet XML approach

  20. Monet XML approach

  21. Monet XML approach

  22. Monet XML approach

  23. Querying the XML world

  24. Querying and Transforming XML Data
      • Standard XML querying/translation languages
        • XPath
          • Simple language consisting of path expressions
        • XSLT
          • Simple language designed for translation from XML to XML and XML to HTML
        • XQuery
          • An XML query language with a rich set of features
      • A wide variety of other languages have been proposed, and some served as a basis for the XQuery standard
        • XML-QL, Quilt, XQL, …

  25. XPath
      • XPath is used to address (select) parts of documents using path expressions
      • A path expression is a sequence of steps separated by “/”
        • Think of file names in a directory hierarchy
      • Result of path expression: set of values that along with their containing elements/attributes match the specified path
      • E.g. /bank-2/customer/name evaluated on the bank-2 data we saw earlier returns
        <name>Joe</name>
        <name>Mary</name>
      • E.g. /bank-2/customer/name/text() returns the same names, but without the enclosing tags

  26. XPath (Cont.)
      • The initial “/” denotes the root of the document (above the top-level tag)
      • Path expressions are evaluated left to right
        • Each step operates on the set of instances produced by the previous step
      • Selection predicates may follow any step in a path, in [ ]
        • E.g. /bank-2/account[balance > 400] returns account elements with a balance value greater than 400
        • /bank-2/account[balance] returns account elements containing a balance subelement
      • Attributes are accessed using “@”
        • E.g. /bank-2/account[balance > 400]/@account-number returns the account numbers of those accounts with balance > 400
        • IDREF attributes are not dereferenced automatically (more on this later)
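These path steps and predicates can be tried out with Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath. In the sketch below the `bank-2` document is a hypothetical stand-in for the data referred to on the slides; paths are written relative to the root element, and the numeric comparison is done in Python because ElementTree predicates do not support `>`.

```python
import xml.etree.ElementTree as ET

# Hypothetical bank-2 instance; only the element and attribute names come from the slides.
root = ET.fromstring("""
<bank-2>
  <account account-number="A-401"><balance>500</balance></account>
  <account account-number="A-402"><balance>300</balance></account>
  <account account-number="A-403"/>
  <customer><name>Joe</name></customer>
  <customer><name>Mary</name></customer>
</bank-2>
""")

# /bank-2/customer/name/text(): each step narrows the node set of the previous one.
names = [n.text for n in root.findall("customer/name")]

# /bank-2/account[balance]: accounts that contain a balance subelement.
with_balance = root.findall("account[balance]")

# /bank-2/account[balance > 400]/@account-number, with the comparison done in Python.
rich_accounts = [
    a.get("account-number")
    for a in with_balance
    if float(a.findtext("balance")) > 400
]
```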

  27. Functions in XPath
      • XPath provides several functions
        • The function count() takes a node set as its argument and returns the number of nodes in it
          • E.g. /bank-2/account[count(./customer) > 2]
          • Returns accounts with > 2 customers
        • There is also a function for testing the position (1, 2, ..) of a node w.r.t. its siblings
      • Boolean connectives and and or and the function not() can be used in predicates
      • IDREFs can be referenced using the function id()
        • id() can also be applied to sets of references such as IDREFS and even to strings containing multiple references separated by blanks
        • E.g. /bank-2/account/id(@owner)
          • returns all customers referred to from the owner attribute of account elements.
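ElementTree has no `id()` function, so the sketch below emulates the dereferencing step by hand to show what it does. The document, the `customer-id` attribute name, and the helper `deref` are assumptions for illustration; counting is shown over the referenced owners rather than over customer subelements.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: owner holds blank-separated references to customers.
root = ET.fromstring("""
<bank-2>
  <account account-number="A-401" owner="C100 C102"/>
  <account account-number="A-402" owner="C102"/>
  <customer customer-id="C100"><name>Joe</name></customer>
  <customer customer-id="C102"><name>Mary</name></customer>
</bank-2>
""")

# Lookup table from identifier to element, playing the role of the ID index.
by_id = {c.get("customer-id"): c for c in root.iter("customer")}


def deref(idrefs):
    """Resolve a blank-separated string of references to elements."""
    return [by_id[ref] for ref in idrefs.split() if ref in by_id]


# Emulates /bank-2/account/id(@owner); dict.fromkeys drops duplicates, keeps order.
owners = list(dict.fromkeys(
    c for a in root.findall("account") for c in deref(a.get("owner", ""))
))
owner_names = [c.findtext("name") for c in owners]

# A count-style predicate: accounts that reference more than one owner.
shared = [
    a.get("account-number")
    for a in root.findall("account")
    if len(deref(a.get("owner", ""))) > 1
]

# Position test: the first customer among its siblings.
first_customer = root.find("customer[1]/name").text
```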

  28. More XPath Features
      • Operator “|” is used to implement union
        • E.g. /bank-2/account/id(@owner) | /bank-2/loan/id(@borrower) gives customers with either accounts or loans
        • However, “|” cannot be nested inside other operators.
      • “//” can be used to skip multiple levels of nodes
        • E.g. /bank-2//name finds any name element anywhere under the /bank-2 element, regardless of the element in which it is contained.
      • A step in the path can go to the parents, siblings, ancestors and descendants of the nodes generated by the previous step, not just to the children (13 variations in the standard)
        • “//”, described above, is a short form for specifying “all descendants”
        • “..” specifies the parent.
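The descendant and parent steps are among the XPath features that Python's `xml.etree.ElementTree` does support, so they can be illustrated directly. The document below is hypothetical (the `branch` element is an assumption), and the union is emulated by concatenating two result lists, since ElementTree has no `|` operator.

```python
import xml.etree.ElementTree as ET

# Hypothetical bank-2 instance with name elements at different places.
root = ET.fromstring("""
<bank-2>
  <customer><name>Joe</name></customer>
  <customer><name>Mary</name></customer>
  <branch><name>Downtown</name></branch>
</bank-2>
""")

# /bank-2//name: every name element below the root, at any depth.
all_names = [n.text for n in root.findall(".//name")]

# //name/..: the parent of each name element.
parent_tags = [p.tag for p in root.findall(".//name/..")]

# Union of two paths, emulated by merging results and dropping duplicates.
union = list(dict.fromkeys(
    root.findall("customer/name") + root.findall("branch/name")
))
union_names = [n.text for n in union]
```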

  29. Pathfinder
      • XPath is essential for the implementation of an XQuery processor. It is strongly tied to the underlying data structures and their primitives.
      • A state-of-the-art implementation is MonetDB/Pathfinder, developed by the University of Konstanz, the University of Twente, and CWI

  30. Pathfinder (University of Konstanz)

  31. Pathfinder

  32. Pathfinder

  33. Pathfinder

  34. Pathfinder

  35. Pathfinder

  36. Pathfinder

  37. Pathfinder

  38. Pathfinder

  39. Staircase join

  40. Staircase join

  41. Pathfinder

  42. Pathfinder

  43. Pathfinder

  44. Pathfinder

  45. XQuery

  46. XQuery
      • XQuery is a general purpose query language for XML data
      • Currently being standardized by the World Wide Web Consortium (W3C)
        • The textbook description is based on a March 2001 draft of the standard. The final version may differ, but major features are likely to stay unchanged.
      • Alpha versions of XQuery engines
        • Galax http://db.bell-labs.com/galax/
        • IPSI-IQ
        • XPath visualized http://www.vbxml.com/xpathvisualizer/
        • MonetDB/Pathfinder
        • Xhive
      • XQuery is derived from the Quilt query language, which itself borrows from SQL, XQL and XML-QL

  47. XQuery
      • XQuery uses a for … let … where … return … syntax
        • for ⇔ SQL from
        • where ⇔ SQL where
        • return ⇔ SQL select
        • let allows temporary variables, and has no equivalent in SQL
      • Variables make it possible to keep the state of processing around, which severely complicates optimization
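To make the correspondence concrete, the sketch below mimics a for/let/where/return query with a Python list comprehension over a hypothetical `bank-2` document. The XQuery text in the comments is an illustrative query written for this example, not one taken from the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical data; element and attribute names follow the bank-2 examples.
root = ET.fromstring("""
<bank-2>
  <account account-number="A-401"><balance>500</balance></account>
  <account account-number="A-402"><balance>300</balance></account>
  <account account-number="A-405"><balance>900</balance></account>
</bank-2>
""")

# Roughly:
#   for   $a in /bank-2/account
#   let   $b := $a/balance
#   where $b > 400
#   return $a/@account-number
result = [
    a.get("account-number")                       # return ~ SQL select
    for a in root.findall("account")              # for    ~ SQL from
    for b in [float(a.findtext("balance"))]       # let: bind a temporary variable
    if b > 400                                    # where  ~ SQL where
]
```

The one-element list in the second `for` is just a way to bind a name inside a comprehension, which plays the role of `let` here.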

  48. FLWR Syntax in XQuery
      • The for clause uses XPath expressions, and the variable in the for clause ranges over the values in the set returned by the XPath expression

