Explore indexing and storage solutions for XML data through region algebras and data guides. Learn efficient computation methods and mapping approaches. Dive into querying in the XML world with XPath, XSLT, and XQuery languages.
Chapter 10: XML The world of XML
The Data Semistructured data instance = a large graph
The indexing problem
• The storage problem
  - Store the graph in a relational DBMS
  - Develop a new database storage structure
• The indexing problem:
  - Input: large, irregular data graph
  - Output: index structure for evaluating (regular) path expressions, e.g. bib.paper.author.firstname
XSet: a simple index for XML • Part of the Ninja project at Berkeley • Example XML data:
XSet: a simple index for XML
Each node = a hashtable
Each entry = list of pointers to data nodes (not shown)
• SELECT X FROM part.name X - yes
• SELECT X FROM part.supplier.name X - yes
• SELECT X FROM part.*.subpart.name X - maybe
• SELECT X FROM *.supplier.name X - maybe
Region Algebras
• structured text = text with tags (like XML)
• data = sequence of characters [c_1 c_2 c_3 …]
• region = interval in the text
  - representation (x,y) = [c_x, c_{x+1}, …, c_y]
  - example: <section> … </section>
• region set = a set of regions
  - example: all <section> regions (may be nested)
• region algebra = operators on region sets, s1 op s2
  - s1 intersect s2 = {r | r ∈ s1, r ∈ s2}
  - s1 included s2 = {r | r ∈ s1, ∃ r' ∈ s2, r ⊆ r'}
  - s1 including s2 = {r | r ∈ s1, ∃ r' ∈ s2, r ⊇ r'}
  - s1 parent s2 = {r | r ∈ s1, ∃ r' ∈ s2, r is a parent of r'}
  - s1 child s2 = {r | r ∈ s1, ∃ r' ∈ s2, r is a child of r'}
Region Algebras
• Region expressions correspond to simple XPath expressions
• s1 child s2 = {r | r ∈ s1, ∃ r' ∈ s2, r is a child of r'}

  path expression          region expression
  part.name                name child (part child root)
  part.supplier.name       name child (supplier child (part child root))
  *.supplier.name          name child supplier
  part.*.subpart.name      name child (subpart included (part child root))
Efficient computation of Region Algebra Operators
Example: s1 included s2
  s1 = {(x1,x1'), (x2,x2'), …}
  s2 = {(y1,y1'), (y2,y2'), …}
  (i.e. assume each set consists of disjoint regions, sorted by start position)
Algorithm:
  if xi < yj then i := i + 1
  else if xi' > yj' then j := j + 1
  otherwise: print (xi,xi'), do i := i + 1
Can be done in sub-linear time when one of the region sets is very small
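The merge step above can be sketched in Python. This is only an illustration under the slide's assumptions (each region set is disjoint and sorted by start position); the function name `included` and the sample regions are made up for the example.

```python
def included(s1, s2):
    """Return the regions of s1 that lie inside some region of s2.

    Both inputs are lists of (start, end) pairs, sorted by start; the
    regions within each list are assumed to be pairwise disjoint.
    """
    result = []
    i = j = 0
    while i < len(s1) and j < len(s2):
        x_start, x_end = s1[i]
        y_start, y_end = s2[j]
        if x_start < y_start:
            # x begins before y, so no current or later y can contain it.
            i += 1
        elif x_end > y_end:
            # x ends after y; a later y might still contain x.
            j += 1
        else:
            # y_start <= x_start and x_end <= y_end: x is included in y.
            result.append(s1[i])
            i += 1
    return result


# Hypothetical region sets, e.g. <name> regions and <section> regions.
names = [(1, 2), (4, 6), (8, 9), (12, 15)]
sections = [(0, 3), (7, 10), (11, 13)]
print(included(names, sections))  # [(1, 2), (8, 9)]
```

Each iteration advances one of the two cursors, so this plain version is linear in the total size of the inputs. The sub-linear behaviour mentioned on the slide would need an index to skip ahead in the larger set, which this sketch does not attempt.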
Storage structures for region algebras
• Every node is characterised by an integer pair (x,y)
• This means we have a 2-d space
• Any 2-d space data structure can be used
• If you use a (pre-order, post-order) numbering you get a triangular filling of the 2-d space (to be discussed later)
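As a rough illustration of the (pre-order, post-order) numbering, the sketch below numbers a small tree and tests the ancestor relationship by comparing the two coordinates. The dictionary-based tree representation, the function names and the sample tree are assumptions made for this example only.

```python
def number_tree(tree, root):
    """Assign a (pre, post) pair to every node of a tree.

    `tree` maps a node name to the list of its children (a hypothetical
    in-memory representation chosen for this sketch).
    """
    numbering = {}
    counters = {"pre": 0, "post": 0}

    def visit(node):
        pre = counters["pre"]          # rank in pre-order
        counters["pre"] += 1
        for child in tree.get(node, []):
            visit(child)
        numbering[node] = (pre, counters["post"])  # rank in post-order
        counters["post"] += 1

    visit(root)
    return numbering


def is_ancestor(numbering, a, d):
    """a is a proper ancestor of d iff a comes first in pre-order
    and last in post-order."""
    pre_a, post_a = numbering[a]
    pre_d, post_d = numbering[d]
    return pre_a < pre_d and post_a > post_d


tree = {"part": ["name", "supplier"], "supplier": ["sname"]}
numbering = number_tree(tree, "part")
print(numbering)
print(is_ancestor(numbering, "part", "sname"))   # True
print(is_ancestor(numbering, "name", "sname"))   # False
```

With this numbering, the descendants of a node are the points with a larger pre-order rank and a smaller post-order rank, so a descendant query becomes a range query in the 2-d space.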
Alternative mappings
• Mapping the structure to the relational world
  - The Edge approach
  - The Attribute approach
  - The Universal Table approach
  - The Normalized Universal approach
  - The Monet/XML approach
  - The Dataguide approach
• Mapping values
  - Separate value tables
  - Inlining
• Shredding
Dataguide approach
• Developed in the context of Lore and Lorel (Stanford University)
• Predecessor of the Monet/XML model
• Observation:
  - queries in the graph representation take a limited form
  - they are partial walks from the root to an object of interest
  - this behaviour was stressed by the query language Lorel, an SQL-based query language built on processing regular path expressions
      SELECT X FROM (Bib.*.author).(lastname|firstname).Abiteboul X
DataGuides
Definition: given a semistructured data instance DB, a DataGuide for DB is a graph G s.t.:
  - every path in DB also occurs in G
  - every path in G occurs in DB
  - every path in G is unique
Dataguides Example:
DataGuides • Multiple DataGuides for the same data:
DataGuides
Definition: Let w, w' be two words (i.e. word queries) and G a graph.
  w ≡_G w' if w(G) = w'(G)
Definition: G is a strong dataguide for a database DB if ≡_G is the same as ≡_DB
Example:
  - G1 is a strong dataguide
  - G2 is not strong:
      person.project ≡_DB dept.project
      person.project ≢_G2 dept.project
DataGuides
• Constructing the strong DataGuide G:
    Nodes(G) = {{root}}
    Edges(G) = ∅
    while changes do
      choose s in Nodes(G), a in Labels
      add s' = {y | x in s, (x -a-> y) in Edges(DB)} to Nodes(G)
      add (s -a-> s') to Edges(G)
• Use a hash table for Nodes(G)
• This is precisely the powerset automaton construction.
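The construction above can be sketched in Python as a worklist version of the powerset construction. The edge-triple representation of DB, the function name `strong_dataguide` and the sample data are assumptions for this illustration; each DataGuide node is represented by its extent (a frozenset of DB nodes), and a Python set plays the role of the hash table for Nodes(G).

```python
from collections import defaultdict


def strong_dataguide(edges, root):
    """Build a strong DataGuide by the powerset (subset) construction.

    `edges` is a list of (source, label, target) triples describing DB.
    Returns (nodes, guide_edges): nodes are frozensets of DB nodes,
    guide_edges are (extent, label, extent) triples.
    """
    out = defaultdict(lambda: defaultdict(set))
    for src, label, dst in edges:
        out[src][label].add(dst)

    start = frozenset([root])
    nodes = {start}            # extents created so far (hash lookup)
    guide_edges = set()
    worklist = [start]
    while worklist:
        s = worklist.pop()
        # For each label a, collect s' = {y | x in s, x -a-> y in DB}.
        targets = defaultdict(set)
        for x in s:
            for label, dsts in out[x].items():
                targets[label] |= dsts
        for label, dsts in targets.items():
            s_next = frozenset(dsts)
            guide_edges.add((s, label, s_next))
            if s_next not in nodes:
                nodes.add(s_next)
                worklist.append(s_next)
    return nodes, guide_edges


# Hypothetical DB in which person.project and dept.project reach the same objects.
db = [
    ("root", "person", "p1"), ("root", "person", "p2"), ("root", "dept", "d1"),
    ("p1", "project", "j1"), ("p2", "project", "j2"),
    ("d1", "project", "j1"), ("d1", "project", "j2"),
]
nodes, guide_edges = strong_dataguide(db, "root")
print(len(nodes), len(guide_edges))  # 4 4
```

In this sample, person.project and dept.project both lead to the single DataGuide node with extent {j1, j2}, which is what the strong DataGuide definition asks for. Only non-empty target sets are turned into nodes in this sketch.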
DataGuides
• How large are the dataguides?
  - if DB is a tree, then size(G) <= size(DB)
  - why? answer: every node is in exactly one extent of G
  - here: dataguide = XSet
• How many nodes does the strong dataguide have for this DB?
  - 20 nodes (least common multiple of 4 and 5)
• Dataguides usually fail on data with cyclic schemas, like:
Monet XML approach
Querying and Transforming XML Data
• Standard XML querying/translation languages
  - XPath: simple language consisting of path expressions
  - XSLT: simple language designed for translation from XML to XML and XML to HTML
  - XQuery: an XML query language with a rich set of features
• A wide variety of other languages have been proposed, and some served as the basis for the XQuery standard
  - XML-QL, Quilt, XQL, …
XPath
• XPath is used to address (select) parts of documents using path expressions
• A path expression is a sequence of steps separated by “/”
  - Think of file names in a directory hierarchy
• Result of path expression: set of values that along with their containing elements/attributes match the specified path
• E.g. /bank-2/customer/name evaluated on the bank-2 data we saw earlier returns
    <name>Joe</name>
    <name>Mary</name>
• E.g. /bank-2/customer/name/text( ) returns the same names, but without the enclosing tags
XPath (Cont.)
• The initial “/” denotes the root of the document (above the top-level tag)
• Path expressions are evaluated left to right
  - Each step operates on the set of instances produced by the previous step
• Selection predicates may follow any step in a path, in [ ]
  - E.g. /bank-2/account[balance > 400] returns account elements with a balance value greater than 400
  - /bank-2/account[balance] returns account elements containing a balance subelement
• Attributes are accessed using “@”
  - E.g. /bank-2/account[balance > 400]/@account-number returns the account numbers of those accounts with balance > 400
  - IDREF attributes are not dereferenced automatically (more on this later)
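A few of these path expressions can be tried with Python's standard `xml.etree.ElementTree` module, as sketched below. The bank-2 fragment and its account numbers are made up for the example. ElementTree implements only a subset of XPath: paths are relative to the parsed root element, `text()` is not available, and numeric predicates such as `[balance > 400]` are not supported, so that comparison is done in Python here.

```python
import xml.etree.ElementTree as ET

# Hypothetical bank-2 fragment, used only for this illustration.
BANK = """
<bank-2>
  <account account-number="A-401"><balance>500</balance></account>
  <account account-number="A-402"><balance>300</balance></account>
  <account account-number="A-403"/>
  <customer><name>Joe</name></customer>
  <customer><name>Mary</name></customer>
</bank-2>
"""
root = ET.fromstring(BANK)  # root is the <bank-2> element

# /bank-2/customer/name/text(): take .text of each selected element.
names = [n.text for n in root.findall("./customer/name")]

# /bank-2/account[balance]: accounts that contain a balance subelement.
with_balance = root.findall("./account[balance]")

# /bank-2/account[balance > 400]/@account-number, with the numeric
# comparison evaluated in Python rather than in the path expression.
rich = [a.get("account-number")
        for a in with_balance
        if float(a.findtext("balance")) > 400]

print(names)              # ['Joe', 'Mary']
print(len(with_balance))  # 2
print(rich)               # ['A-401']
```

A full XPath processor would evaluate all three expressions directly; the split between path selection and Python filtering is a limitation of the library chosen for the sketch.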
Functions in XPath
• XPath provides several functions
• The function count() counts the number of elements in the set generated by the path given as its argument
  - E.g. /bank-2/account[count(./customer) > 2]
  - Returns accounts with > 2 customers
• There is also a function for testing the position (1, 2, ..) of a node w.r.t. its siblings
• Boolean connectives and and or, and the function not(), can be used in predicates
• IDREFs can be referenced using the function id()
  - id() can also be applied to sets of references such as IDREFS and even to strings containing multiple references separated by blanks
  - E.g. /bank-2/account/id(@owner) returns all customers referred to from the owner attribute of account elements.
More XPath Features
• Operator “|” is used to implement union
  - E.g. /bank-2/account/id(@owner) | /bank-2/loan/id(@borrower) gives customers with either accounts or loans
  - However, “|” cannot be nested inside other operators.
• “//” can be used to skip multiple levels of nodes
  - E.g. /bank-2//name finds any name element anywhere under the /bank-2 element, regardless of the element in which it is contained.
• A step in the path can go to (13 variations in the standard): parents, siblings, ancestors and descendants of the nodes generated by the previous step, not just to the children
  - “//”, described above, is a short form for specifying “all descendants”
  - “..” specifies the parent.
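The descendant and parent steps can be illustrated with Python's `xml.etree.ElementTree`, which supports `//` and `..` in its XPath subset (it does not support the `|` union operator). The document fragment, including the `branch` element, is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment with name elements at different places.
doc = ET.fromstring(
    "<bank-2>"
    "<customer><name>Joe</name></customer>"
    "<branch><name>Downtown</name></branch>"
    "</bank-2>"
)

# /bank-2//name: every name element anywhere below the bank-2 element.
all_names = [n.text for n in doc.findall(".//name")]

# /bank-2//name/..: the parent element of each name element.
parents = [p.tag for p in doc.findall(".//name/..")]

print(all_names)  # ['Joe', 'Downtown']
print(parents)    # ['customer', 'branch']
```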
Pathfinder
• XPath is essential for the implementation of an XQuery processor. Its evaluation is strongly tied to the underlying data structures and their primitives.
• A state-of-the-art implementation is MonetDB/Pathfinder, developed by the University of Konstanz, Twente University and CWI
XQuery
• XQuery is a general purpose query language for XML data
• Currently being standardized by the World Wide Web Consortium (W3C)
  - The textbook description is based on a March 2001 draft of the standard. The final version may differ, but major features are likely to stay unchanged.
• Alpha versions of XQuery engines
  - Galax http://db.bell-labs.com/galax/
  - IPSI-IQ
  - XPath visualized http://www.vbxml.com/xpathvisualizer/
  - MonetDB/Pathfinder
  - Xhive
• XQuery is derived from the Quilt query language, which itself borrows from SQL, XQL and XML-QL
XQuery
• XQuery uses a for … let … where … return … syntax
  - for ⇔ SQL from
  - where ⇔ SQL where
  - return ⇔ SQL select
  - let allows temporary variables, and has no equivalent in SQL
• Variables make it possible to keep the state of processing around, which severely complicates optimization
FLWR Syntax in XQuery
• The for clause uses XPath expressions, and the variable in the for clause ranges over the values in the set returned by the XPath expression
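To make the roles of the four clauses concrete, the sketch below mimics a for … let … where … return query with a Python generator over `xml.etree.ElementTree`. It is an analogy, not XQuery itself; the bank-2 fragment, the account numbers and the function name `flwr` are assumptions for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical bank-2 fragment, used only for this illustration.
doc = ET.fromstring(
    "<bank-2>"
    '<account account-number="A-401"><balance>500</balance></account>'
    '<account account-number="A-402"><balance>300</balance></account>'
    "</bank-2>"
)


def flwr(root):
    """Python analogue of:
       for $a in /bank-2/account
       let $b := $a/balance
       where $b > 400
       return the account-number of $a
    """
    for account in root.findall("./account"):          # for: range over a path result
        balance = float(account.findtext("balance"))   # let: bind a temporary variable
        if balance > 400:                              # where: filter the bindings
            yield account.get("account-number")        # return: build the output


print(list(flwr(doc)))  # ['A-401']
```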