Storage and Query Methods for XML Data: Region Algebras, DataGuides, and Efficient Computation

Chapter 10: XML The world of XML

The Data Semistructured data instance = a large graph

The indexing problem
• The storage problem:
  • Store the graph in a relational DBMS
  • Develop a new database storage structure
• The indexing problem:
  • Input: large, irregular data graph
  • Output: index structure for evaluating (regular) path expressions, e.g. bib.paper.author.firstname

XSet: a simple index for XML • Part of the Ninja project at Berkeley • Example XML data:

XSet: a simple index for XML
• Each node = a hashtable
• Each entry = list of pointers to data nodes (not shown)
• SELECT X FROM part.name X - yes
• SELECT X FROM part.supplier.name X - yes
• SELECT X FROM part.*.subpart.name X - maybe
• SELECT X FROM *.supplier.name X - maybe

Region Algebras
• structured text = text with tags (like XML)
• data = sequence of characters [c1 c2 c3 …]
• region = interval in the text
  • representation (x,y) = [cx, cx+1, …, cy]
  • example: <section> … </section>
• region set = a set of regions
  • example: all <section> regions (may be nested)
• region algebra = operators on region sets, s1 op s2
  • s1 intersect s2 = {r | r ∈ s1, r ∈ s2}
  • s1 included s2 = {r | r ∈ s1, ∃ r' ∈ s2, r ⊆ r'}
  • s1 including s2 = {r | r ∈ s1, ∃ r' ∈ s2, r ⊇ r'}
  • s1 parent s2 = {r | r ∈ s1, ∃ r' ∈ s2, r is a parent of r'}
  • s1 child s2 = {r | r ∈ s1, ∃ r' ∈ s2, r is a child of r'}
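The set-based operators can be illustrated with a minimal Python sketch. It assumes regions are (start, end) pairs of inclusive character offsets; the function names and the sample offsets are hypothetical and chosen only for illustration. The parent and child operators are left out because they also need the nesting structure of the document.

```python
from typing import Set, Tuple

Region = Tuple[int, int]  # (start, end) character offsets, both inclusive


def intersect(s1: Set[Region], s2: Set[Region]) -> Set[Region]:
    """Regions that occur in both region sets."""
    return s1 & s2


def included(s1: Set[Region], s2: Set[Region]) -> Set[Region]:
    """Regions of s1 that lie inside some region of s2."""
    return {(x, y) for (x, y) in s1
            if any(a <= x and y <= b for (a, b) in s2)}


def including(s1: Set[Region], s2: Set[Region]) -> Set[Region]:
    """Regions of s1 that contain some region of s2."""
    return {(x, y) for (x, y) in s1
            if any(x <= a and b <= y for (a, b) in s2)}


# Hypothetical offsets: two nested <section> regions and two <title> regions.
sections = {(0, 99), (10, 40)}
titles = {(12, 20), (120, 130)}

titles_in_sections = included(titles, sections)
sections_with_titles = including(sections, titles)
```

This naive version compares every pair of regions, so it is quadratic; it is meant only to pin down the semantics of the operators.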

Region Algebras
• Region expressions correspond to simple XPath expressions
• s1 child s2 = {r | r ∈ s1, ∃ r' ∈ s2, r is a child of r'}
• Path expression: region expression
  • part.name: name child (part child root)
  • part.supplier.name: name child (supplier child (part child root))
  • *.supplier.name: name child supplier
  • part.*.subpart.name: name child (subpart included (part child root))

Efficient computation of Region Algebra Operators
• Example: s1 included s2
  • s1 = {(x1,x1'), (x2,x2'), …}
  • s2 = {(y1,y1'), (y2,y2'), …}
  • (i.e. assume each consists of disjoint regions, sorted by start position)
• Algorithm (repeat while both sets have regions left):
  • if xi < yj then i := i + 1
  • else if xi' > yj' then j := j + 1
  • otherwise: print (xi,xi'), do i := i + 1
• Can do in sub-linear time when one region set is very small
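The merge-style algorithm for s1 included s2 can be sketched in Python as follows. The sketch assumes both inputs are lists of (start, end) pairs that are pairwise disjoint and sorted by start position; the function name and sample data are hypothetical.

```python
def included_merge(s1, s2):
    """Return the regions of s1 that lie inside some region of s2.

    Assumes s1 and s2 are lists of (start, end) pairs, each list sorted by
    start position and consisting of pairwise disjoint regions.
    """
    result = []
    i = j = 0
    while i < len(s1) and j < len(s2):
        x, x_end = s1[i]
        y, y_end = s2[j]
        if x < y:
            # s1[i] starts before s2[j], so it cannot lie inside s2[j]
            # or inside any later region of s2.
            i += 1
        elif x_end > y_end:
            # s1[i] reaches past the end of s2[j]; try the next region of s2.
            j += 1
        else:
            # y <= x and x_end <= y_end: s1[i] is included in s2[j].
            result.append(s1[i])
            i += 1
    return result


s1 = [(1, 2), (5, 8), (12, 13), (20, 30)]
s2 = [(4, 9), (11, 15), (25, 40)]
matches = included_merge(s1, s2)
```

Each step advances i or j, so the loop runs at most len(s1) + len(s2) times. This sketch does not implement the sub-linear variant, which would need an index or binary search over the larger set.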

Storage structures for region algebras
• Every node is characterised by an integer pair (x,y)
• This means we have a 2-d space
• Any 2-d space data structure can be used
• If you use a (pre-order, post-order) numbering you get a triangular filling of the 2-d space (to be discussed later)
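A (pre-order, post-order) numbering can be sketched with Python's standard xml.etree.ElementTree module. With such a numbering, d is a descendant of a exactly when pre(a) < pre(d) and post(d) < post(a). The helper names and the sample document below are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def number_nodes(root):
    """Map every element under root to its (pre-order, post-order) rank pair."""
    ranks = {}
    counters = {"pre": 0, "post": 0}

    def visit(node):
        pre = counters["pre"]          # rank assigned when the node is entered
        counters["pre"] += 1
        for child in node:
            visit(child)
        ranks[node] = (pre, counters["post"])  # post rank assigned on exit
        counters["post"] += 1

    visit(root)
    return ranks


def is_descendant(ranks, d, a):
    """True if element d is a proper descendant of element a."""
    return ranks[a][0] < ranks[d][0] and ranks[d][1] < ranks[a][1]


# Hypothetical sample document.
doc = ET.fromstring(
    "<part><name>bolt</name><supplier><name>Acme</name></supplier></part>"
)
ranks = number_nodes(doc)
part_name = doc.find("name")
supplier = doc.find("supplier")
supplier_name = supplier.find("name")
```

In this example the pairs are part (0,3), name (1,0), supplier (2,2) and the supplier's name (3,1), so a descendant test becomes a pair of integer comparisons that a 2-d index can answer.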

Alternative mappings • Mapping the structure to the relational world • The Edge approach • The Attribute approach • The Universal Table approach • The Normalized Universal approach • The Monet/XML approach • The Dataguide approach • Mapping values • Separate value tables • Inlining • Shredding

Dataguide approach
• Developed in the context of Lore, Lorel (Stanford Univ)
• Predecessor of the Monet/XML model
• Observation:
  • queries in the graph representation take a limited form
  • they are partial walks from the root to an object of interest
  • this behaviour was stressed by the query language Lorel, an SQL-like query language built on regular path expressions, e.g.
    SELECT X FROM (Bib.*.author).(lastname|firstname).Abiteboul X

DataGuides
• Definition: given a semistructured data instance DB, a DataGuide for DB is a graph G s.t.:
  - every path in DB also occurs in G
  - every path in G occurs in DB
  - every path in G is unique

Dataguides Example:

DataGuides • Multiple DataGuides for the same data:

DataGuides
• Definition: let w, w' be two words (i.e. word queries) and G a graph; w ≡G w' if w(G) = w'(G)
• Definition: G is a strong DataGuide for a database DB if ≡G is the same as ≡DB
• Example:
  - G1 is a strong DataGuide
  - G2 is not strong: person.project ≡DB dept.project, but person.project ≢G2 dept.project

DataGuides
• Constructing the strong DataGuide G:
  Nodes(G) = {{root}}
  Edges(G) = ∅
  while changes do
    choose s in Nodes(G), a in Labels
    add s' = {y | x in s, (x -a-> y) in Edges(DB)} to Nodes(G)
    add (s -a-> s') to Edges(G)
• Use hash table for Nodes(G)
• This is precisely the powerset automaton construction.
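The construction can be sketched in Python with a worklist in place of the "while changes" loop. The sketch assumes the database graph is given as a dictionary from node to a list of (label, target) pairs; that representation, the function names, and the sample data are hypothetical. Each DataGuide node is a frozenset of database nodes (its extent), and Python's built-in hashing of frozensets plays the role of the hash table for Nodes(G).

```python
from collections import defaultdict


def strong_dataguide(edges, root):
    """Build the strong DataGuide of a labeled graph.

    edges: {node: [(label, target), ...]} describing the database graph.
    Returns (nodes, guide_edges): nodes is a set of extents (frozensets of
    database nodes); guide_edges maps (extent, label) to the target extent.
    """
    start = frozenset([root])
    nodes = {start}
    guide_edges = {}
    worklist = [start]
    while worklist:
        s = worklist.pop()
        # Group the successors of all nodes in s by edge label.
        by_label = defaultdict(set)
        for x in s:
            for label, y in edges.get(x, []):
                by_label[label].add(y)
        for label, targets in by_label.items():
            t = frozenset(targets)
            guide_edges[(s, label)] = t
            if t not in nodes:       # new extent: explore it later
                nodes.add(t)
                worklist.append(t)
    return nodes, guide_edges


def extent(guide_edges, root, path):
    """Follow a label path from the root; return the reached extent."""
    current = frozenset([root])
    for label in path:
        current = guide_edges.get((current, label), frozenset())
    return current


# Hypothetical tree-shaped database.
db = {
    "root": [("part", "p1"), ("part", "p2")],
    "p1": [("name", "n1"), ("supplier", "s1")],
    "p2": [("name", "n2")],
    "s1": [("name", "n3")],
}
nodes, guide_edges = strong_dataguide(db, "root")
```

Evaluating a path such as part.name then only walks the guide, for example extent(guide_edges, "root", ["part", "name"]), instead of traversing the data graph.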

DataGuides
• How large are the DataGuides?
  • if DB is a tree, then size(G) <= size(DB)
  • why? answer: every node is in exactly one extent of G
  • here: DataGuide = XSet
• How many nodes does the strong DataGuide have for this DB? 20 nodes (least common multiple of 4 and 5)
• DataGuides usually fail on data with cyclic schemas, like:

Monet XML approach

Querying the XML world

Querying and Transforming XML Data
• Standard XML querying/translation languages
  • XPath: simple language consisting of path expressions
  • XSLT: simple language designed for translation from XML to XML and XML to HTML
  • XQuery: an XML query language with a rich set of features
• A wide variety of other languages have been proposed, and some served as the basis for the XQuery standard
  • XML-QL, Quilt, XQL, …

XPath
• XPath is used to address (select) parts of documents using path expressions
• A path expression is a sequence of steps separated by “/”
  • Think of file names in a directory hierarchy
• Result of path expression: set of values that along with their containing elements/attributes match the specified path
• E.g. /bank-2/customer/name evaluated on the bank-2 data we saw earlier returns
  <name>Joe</name>
  <name>Mary</name>
• E.g. /bank-2/customer/name/text( ) returns the same names, but without the enclosing tags

XPath (Cont.)
• The initial “/” denotes the root of the document (above the top-level tag)
• Path expressions are evaluated left to right
  • Each step operates on the set of instances produced by the previous step
• Selection predicates may follow any step in a path, in [ ]
  • E.g. /bank-2/account[balance > 400] returns account elements with a balance value greater than 400
  • /bank-2/account[balance] returns account elements containing a balance subelement
• Attributes are accessed using “@”
  • E.g. /bank-2/account[balance > 400]/@account-number returns the account numbers of those accounts with balance > 400
  • IDREF attributes are not dereferenced automatically (more on this later)
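These path expressions can be tried out with Python's standard xml.etree.ElementTree module, which supports only a subset of XPath. The sketch below assumes a small made-up bank-2 document (the account numbers and balances are hypothetical). ElementTree has no numeric comparison predicates, so the balance > 400 test is done in Python rather than inside the path.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; only the customer names come from the slides.
BANK = """
<bank-2>
  <account account-number="A-401"><balance>500</balance></account>
  <account account-number="A-402"><balance>300</balance></account>
  <account account-number="A-403"/>
  <customer><name>Joe</name></customer>
  <customer><name>Mary</name></customer>
</bank-2>
"""

root = ET.fromstring(BANK)  # root is the <bank-2> element itself

# Analogue of /bank-2/customer/name/text(): paths are relative to <bank-2>.
names = [n.text for n in root.findall("customer/name")]

# Analogue of /bank-2/account[balance]: accounts with a balance subelement.
with_balance = root.findall("account[balance]")

# Analogue of /bank-2/account[balance > 400]/@account-number; the numeric
# comparison is applied in Python because ElementTree lacks such predicates.
rich_accounts = [
    a.get("account-number")
    for a in with_balance
    if float(a.findtext("balance")) > 400
]

# Analogue of /bank-2//name: name elements at any depth below <bank-2>.
all_names = [n.text for n in root.findall(".//name")]
```

A full XPath engine would evaluate the predicate and the attribute step directly; the split used here is a limitation of the standard library module, not of XPath.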

Functions in XPath
• XPath provides several functions
  • The function count() counts the number of elements in the set generated by a path
    • E.g. /bank-2/account[count(./customer) > 2]
    • Returns accounts with > 2 customers
  • Also function for testing position (1, 2, ..) of node w.r.t. siblings
• Boolean connectives "and" and "or" and the function not() can be used in predicates
• IDREFs can be referenced using function id()
  • id() can also be applied to sets of references such as IDREFS and even to strings containing multiple references separated by blanks
  • E.g. /bank-2/account/id(@owner)
  • returns all customers referred to from the owners attribute of account elements.

More XPath Features
• Operator “|” used to implement union
  • E.g. /bank-2/account/id(@owner) | /bank-2/loan/id(@borrower) gives customers with either accounts or loans
  • However, “|” cannot be nested inside other operators.
• “//” can be used to skip multiple levels of nodes
  • E.g. /bank-2//name finds any name element anywhere under the /bank-2 element, regardless of the element in which it is contained.
• A step in the path can go to parents, siblings, ancestors and descendants of the nodes generated by the previous step, not just to the children (13 variations in the standard)
  • “//”, described above, is a short form for specifying “all descendants”
  • “..” specifies the parent.

Pathfinder
• XPath is essential for the implementation of an XQuery processor. It is closely tied to the underlying data structures and their primitives.
• A state-of-the-art implementation is MonetDB/Pathfinder, developed by the University of Konstanz, the University of Twente, and CWI

Staircase join

XQuery

XQuery
• XQuery is a general purpose query language for XML data
• Currently being standardized by the World Wide Web Consortium (W3C)
  • The textbook description is based on a March 2001 draft of the standard. The final version may differ, but the major features are likely to stay unchanged.
• Alpha version of XQuery engine:
  • Galax http://db.bell-labs.com/galax/
  • IPSI-IQ
  • XPath visualized http://www.vbxml.com/xpathvisualizer/
  • MonetDB/Pathfinder
  • Xhive
• XQuery is derived from the Quilt query language, which itself borrows from SQL, XQL and XML-QL

XQuery
• XQuery uses a for … let … where … return … syntax
  • for ⇔ SQL from
  • where ⇔ SQL where
  • return ⇔ SQL select
  • let allows temporary variables, and has no equivalent in SQL
• Variables make it possible to keep the state of processing around, which severely complicates optimization

FLWR Syntax in XQuery
• The for clause uses XPath expressions, and the variable in the for clause ranges over the values in the set returned by the XPath expression
