TAX: A Tree Algebra for XML

TAX: A Tree Algebra for XML Reference: Jagadish et al. DBPL 2001.

Overview • Why an algebra for XML? • Main challenges • Data model • Patterns & Witnesses • Tree Value Functions • Some Example Operators • Translation Example – XQuery

Overview (contd.) • Main Results • Optimization Examples • Implementation • Summary & Future Work

Why an Algebra (for XML)? (aka Related Work) • Bulk algebra for tree manipulation – efficient implementation of XML queries • Algebra for manipulating trees (has been attempted before) • Feature algebras – linguistics; efficient implementation? • Grammar-based algebra for trees [Tompa+ 87, Gyssens+ 89] • Aqua project [Zdonik+95]

Why XML algebra? [Related work] (contd.) • GraphLog, Hy+ [Consens+90], GOOD [Paredaens+92] – cannot exploit special properties of trees (e.g., support for arbitrary recursion vs. descendants, order) • Semistructured data – Lorel [Abiteboul+ 96], UnQL [Buneman+ 96]. • XML algebras – [Beech+ 99], [Fernandez+ 00] (mainly type system issues), [Christophides+ 00] (trees → tuples), [Ludascher+ 00] (nodes, not trees), SAL [Beeri+ 99] (ordered lists of nodes)

Why? (contd.) • be close to relational model, but • direct support for (collections of) trees • express at least RA + aggregation • capture substantial fragment of XQuery • admit efficient implementation and effective query optimization • e.g., satisfy “natural” identities.

Main Challenges • Capture rich variety of manipulations in a simple algebra • Handle heterogeneity in tree collections • structure • “schema” of nodes of the same “type” • Handle order (documents are ordered) • sometimes important (e.g., author list, whether anesthesia preceded incision) • sometimes not (e.g., publisher vs. authors)

Data Model • Data tree = rooted ordered tree • Data in node = set of attr-val pairs • Special attribute: pedigree – where did I come from? • Node representation = (docId, startPos:endPos, level) • preserved for (copies of) original nodes through manipulations. • plays an important role in grouping, sorting, etc. • null for new nodes. • Collections (of trees) – unordered. • IDREF(S) treated like other attr’s. • Possible alternative: treat them as pointers. • One position: express pointer dereferencing as IDREF=ID join (but implement as you will).
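The node representation above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the class names `Pedigree` and `Node`, the helper `assign_pedigrees`, and the exact numbering scheme (one counter tick on entering and one on leaving each node) are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Pedigree:
    doc_id: str
    start: int
    end: int
    level: int


@dataclass
class Node:
    attrs: dict                                    # attribute-value pairs, e.g. {"tag": "book"}
    children: list = field(default_factory=list)   # ordered list of child nodes
    pedigree: Optional[Pedigree] = None            # None (null) for nodes created by an operator


def assign_pedigrees(root, doc_id):
    """Number nodes in one depth-first pass (assumed numbering scheme)."""
    counter = 0

    def visit(node, level):
        nonlocal counter
        counter += 1
        start = counter
        for child in node.children:
            visit(child, level + 1)
        counter += 1
        node.pedigree = Pedigree(doc_id, start, counter, level)

    visit(root, 0)
    return root


def is_ancestor(a, d):
    # Interval containment within the same document.
    return a.doc_id == d.doc_id and a.start < d.start and d.end < a.end


def is_parent(a, d):
    return is_ancestor(a, d) and d.level == a.level + 1


title = Node({"tag": "title", "content": "Principia Mathematica"})
year = Node({"tag": "year", "content": "1910"})
book = Node({"tag": "book"}, [title, year])
bib = assign_pedigrees(Node({"tag": "bib"}, [book]), "bib.xml")
```

With this encoding, ancestor and parent tests need only the two pedigrees, not the tree itself, which is one reason a positional pedigree is convenient for joins and grouping.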

Patterns & Witnesses • first challenge: how do you get at nodes and/or attributes? • Notion of selection parameter – considerably more complex • our solution: patterns – enable specification of parameters for most operations • only show parts of interest: • Need not know/care about entire structure of trees in collection • Analogy: in SchemaLog, you only specify what you care about.

Patterns & Witnesses (contd.) • Example P1: • Structural part: $1 with a pc edge to $2 and an ad edge to $3 (pc = direct, ad = transitive) • Condition part: $1.tag = book & $2.tag = year & $2.content < 2000 & $3.tag = author • Additional parameters possible: e.g., selection/projection lists, grouping, ordering, etc.

Patterns & Witnesses (contd.) • What does a pattern do for you? • generate witnesses against i/p collection • one for each matching of pattern against i/p • conditions must be respected • (sub)structure preserved in o/p • e.g., witness trees for pattern P1 – • one tree for each author of each book published before 2000, showing year & author • book-author link may be transitive in i/p but is necessarily direct in o/p • source trees = trees witnesses “came from”
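A small Python sketch of how pattern P1 could generate witness trees. It is a naive enumeration under stated assumptions, not the matching algorithm of the paper: `Node`, `embeddings`, and `witness` are hypothetical names, author names are flattened to a single string, and the year 560 BC is encoded as the integer -560 so that the comparison with 2000 works.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    content: object = None
    children: list = field(default_factory=list)


def descendants(node):
    for child in node.children:
        yield child
        yield from descendants(child)


def embeddings(tree, root, edges, cond):
    """Yield every binding of pattern labels to data nodes that satisfies cond.

    edges lists (parent, child, kind) in top-down order; kind is "pc" or "ad".
    """
    def extend(binding, remaining):
        if not remaining:
            if cond(binding):
                yield dict(binding)
            return
        parent, child, kind = remaining[0]
        pool = binding[parent].children if kind == "pc" else descendants(binding[parent])
        for node in pool:
            binding[child] = node
            yield from extend(binding, remaining[1:])
            del binding[child]

    for start in [tree, *descendants(tree)]:
        yield from extend({root: start}, edges)


def witness(binding, root, edges):
    """Copy only the matched nodes; every pattern edge becomes a direct link.

    Children are attached in pattern order; restoring document order is omitted.
    """
    copies = {label: Node(n.tag, n.content) for label, n in binding.items()}
    for parent, child, _ in edges:
        copies[parent].children.append(copies[child])
    return copies[root]


P1_EDGES = [("$1", "$2", "pc"), ("$1", "$3", "ad")]


def p1_cond(b):
    return (b["$1"].tag == "book" and b["$2"].tag == "year"
            and b["$2"].content < 2000 and b["$3"].tag == "author")


bib = Node("bib", None, [
    Node("book", None, [Node("title", "Principia Mathematica"), Node("year", 1910),
                        Node("author", "Whitehead"), Node("author", "Russell")]),
    Node("book", None, [Node("title", "Ashtadhyayi"), Node("year", -560),
                        Node("author", "Panini")]),
])
witnesses = [witness(b, "$1", P1_EDGES) for b in embeddings(bib, "$1", P1_EDGES, p1_cond)]
```

On this data the sketch produces one witness per (book, author) pair, each a book node with a year child and an author child.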

Example Database • [Figure: data tree rooted at bib (startPos 1) with two book children (startPos 2 and 19). One book has title “Principia Mathematica”, year 1910, and authors Alfred North Whitehead and Bertrand Russell, each with a name (first, mid, last parts) and a deg (degrees shown: M.A., FRS and Sc.D., FRS). The other book has title “Ashtadhyayi (First book on Sanskrit Grammar)”, year 560 BC, and author Panini.] • Only startPos shown.

What should selection return? • trees where a match occurred? • poor granularity when DB = one big document tree; e.g., select books authored by John Grisham → the whole bib tree! • only distinguished nodes (as XPath)? • don’t get all info. that you want. • witness trees – right level of abstraction and info. extraction. • may enhance: e.g., relatives of selected nodes might be of interest too. • Descendants – most useful case.

Example Operators – Selection • Input: collection; parameters: pattern, selection list (pattern nodes) • Example • pattern P1 and empty SL: same witness trees as before • pattern P1 with SL = {$1}: whole book subtrees (i.e. retain $1’s descendants) • In general, each i/p tree yields zero or more o/p trees • Could retain other “relatives” instead (e.g., siblings)

Selection with P1 (empty SL) • [Figure: three witness trees, each a book node with one author child and one year child; two come from the book with year 1910 (one per author) and one from the book with year 560 BC.] • Whole author subtree included when SL = {$3}.

What should projection do? • Unlike relational model, selection is not purely “horizontal” (so, can’t expect pure “vertical” for project). • Can one op serve both roles? • Select finds match witnesses (localized) • Want project to retain all (named) nodes satisfying some predicates in a given source tree no matter how you match the pattern • The two ops are still orthogonal

Example operators – Projection • Input: collection; parameters: pattern, projection list • Example • Pattern P1 w/ PL = {$1, $2, $3}: one tree for each book published before 2000, showing year and author(s) • Pattern P1 w/ PL = {$3}: one tree for each author of aforementioned books • ‘*’ in PL causes descendants to be retained • In general, each i/p tree yields zero or more o/p trees (for reasons diff. from select)

Projection: P1 w/ PL = {$1,$2,$3} • [Figure: two output trees; the book at startPos 2 with its two author nodes and year 1910, and the book at startPos 19 with its author and year 560 BC.] • With $3*, we can include whole author subtrees.

Selection vs. Projection • Example • Selection: FOR $b IN document(“doc.xml”)//book FOR $y IN $b/year[data() < 2000] FOR $a IN $b//author RETURN <book> {$y} {$a} </book> • versus • Projection: FOR $b IN document(“doc.xml”)//book[/year/data() < 2000 & author] RETURN <book> {$b/year} {$b/author} </book>
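The contrast between the two queries can be mimicked in plain Python. This sketch hard-codes pattern P1 instead of using a general pattern matcher; `select_p1`, `project_p1`, the flattened author names, the integer encoding of years, and the third (post-2000) book are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    content: object = None
    children: list = field(default_factory=list)


def descendants(node):
    for child in node.children:
        yield child
        yield from descendants(child)


def flat_copy(node):
    return Node(node.tag, node.content)


def select_p1(books):
    """One output tree per match of the pattern (mirrors the nested FOR loops)."""
    out = []
    for b in books:
        for y in (c for c in b.children if c.tag == "year" and c.content < 2000):
            for a in (d for d in descendants(b) if d.tag == "author"):
                out.append(Node("book", None, [flat_copy(y), flat_copy(a)]))
    return out


def project_p1(books):
    """One output tree per qualifying book, keeping every matched year and author."""
    out = []
    for b in books:
        years = [c for c in b.children if c.tag == "year" and c.content < 2000]
        authors = [d for d in descendants(b) if d.tag == "author"]
        if years and authors:
            keep = {id(n) for n in years + authors}
            kept = [flat_copy(n) for n in descendants(b) if id(n) in keep]
            out.append(Node("book", None, kept))
    return out


books = [
    Node("book", None, [Node("title", "Principia Mathematica"), Node("year", 1910),
                        Node("author", "Whitehead"), Node("author", "Russell")]),
    Node("book", None, [Node("title", "Ashtadhyayi"), Node("year", -560),
                        Node("author", "Panini")]),
    Node("book", None, [Node("title", "Hypothetical Later Book"), Node("year", 2005),
                        Node("author", "Someone")]),
]
selected = select_p1(books)
projected = project_p1(books)
```

Selection yields three trees (one per book-author match), while projection yields two (one per qualifying book); the book dated 2005 contributes to neither.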

Tree Value Functions (TVF) • What are they? • Primitive recursive functions on structure of source trees • Codomain must be ordered • Where are they used? • grouping, ordering, aggregation, etc. • Here is an example: • f: T → value of author, number of authors, tuple of authors, {author tuple, title}, etc. • Complete example coming up …
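A few tree value functions written as ordinary Python functions, as a rough illustration. The function names and the flattened data are assumptions; the point is only that each function maps a tree to a value from an ordered domain, so it can serve as a sort or grouping key.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    content: object = None
    children: list = field(default_factory=list)


def descendants(node):
    for child in node.children:
        yield child
        yield from descendants(child)


def author_tuple(tree):
    """TVF: tuple of author values, in document order."""
    return tuple(n.content for n in descendants(tree) if n.tag == "author")


def author_count(tree):
    """TVF: number of authors."""
    return len(author_tuple(tree))


def title_of(tree):
    """TVF: content of the first title child."""
    return next(n.content for n in tree.children if n.tag == "title")


principia = Node("book", None, [Node("title", "Principia Mathematica"),
                                Node("author", "Whitehead"), Node("author", "Russell")])
ashtadhyayi = Node("book", None, [Node("title", "Ashtadhyayi"), Node("author", "Panini")])

# Integers and tuples of strings are both ordered, so either TVF can order a collection.
by_count = sorted([principia, ashtadhyayi], key=author_count)
by_authors = sorted([principia, ashtadhyayi], key=author_tuple)
```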

Example operators – grouping • Input: collection; parameters: pattern, grouping TVF, ordering TVF. • Example • input: collection of books • pattern: $1 with a pc edge to $2 and an ad edge to $3; $3 with a pc edge to $4; $1.tag = book & $2.tag = title & $3.tag = author & $4.tag = name • f_g(T) = “$4.content” • f_o(T) = “$2.content”

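Grouping (contd.) • Here is what the o/p looks like: • [Figure: one tree per group, rooted at tax_group_root, with a tax_group_basis child (holding author and name) and a tax_group_subroot child (holding the book trees of the group).] • books ordered by title in each group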
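The grouping example can be approximated as follows. This is a sketch under assumptions: the function `group_by_author_name` is hypothetical, the exact shape of the subtree under tax_group_basis is read off the slide labels rather than taken from a specification, and the second book is illustrative data added so that one group has two members.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    content: object = None
    children: list = field(default_factory=list)


def descendants(node):
    for child in node.children:
        yield child
        yield from descendants(child)


def names_of(book):
    """Grouping TVF f_g: values of $4.content (name under author)."""
    return [nm.content for a in descendants(book) if a.tag == "author"
            for nm in a.children if nm.tag == "name"]


def title_of(book):
    """Ordering TVF f_o: value of $2.content (title child of book)."""
    return next(c.content for c in book.children if c.tag == "title")


def group_by_author_name(books):
    groups = {}
    for book in books:
        for name in dict.fromkeys(names_of(book)):   # one membership per distinct value
            groups.setdefault(name, []).append(book)
    out = []
    for name in sorted(groups):   # sorted only to make the output deterministic
        basis = Node("tax_group_basis", None, [Node("author", None, [Node("name", name)])])
        subroot = Node("tax_group_subroot", None, sorted(groups[name], key=title_of))
        out.append(Node("tax_group_root", None, [basis, subroot]))
    return out


def author(name):
    return Node("author", None, [Node("name", name)])


problems = Node("book", None, [Node("title", "The Problems of Philosophy"), author("Russell")])
principia = Node("book", None, [Node("title", "Principia Mathematica"),
                                author("Whitehead"), author("Russell")])
grouped = group_by_author_name([problems, principia])
```

A book with several authors appears in several groups, which is why grouping is not a partition of the input collection.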

Other operators • Derived operators – various joins. • Set operations: • When are two data trees the “same”? • Equality (shallow/deep) vs. isomorphism (include pedigree or not?) • Multiset versions of operators • Aggregation, Reordering, Renaming.
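One possible reading of the equality question, sketched in Python. The definitions here are assumptions, not the paper's: "shallow" compares only the two root nodes, "deep" also compares children pairwise in order, and the `use_pedigree` flag decides whether a copy of an original node and a freshly built node with the same values count as the same.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    content: object = None
    pedigree: object = None            # None for nodes created by an operator
    children: list = field(default_factory=list)


def shallow_equal(a, b, use_pedigree=False):
    same_values = a.tag == b.tag and a.content == b.content
    return same_values and (not use_pedigree or a.pedigree == b.pedigree)


def deep_equal(a, b, use_pedigree=False):
    return (shallow_equal(a, b, use_pedigree)
            and len(a.children) == len(b.children)
            and all(deep_equal(x, y, use_pedigree)
                    for x, y in zip(a.children, b.children)))


# An original node (with a hypothetical pedigree) and a new node with the same values.
original = Node("author", None, ("bib.xml", 3), [Node("name", "Panini", ("bib.xml", 4))])
fresh = Node("author", None, None, [Node("name", "Panini")])
other = Node("author", None, None, [Node("name", "Russell")])
```

Under this reading, `original` and `fresh` are equal by value but differ once pedigree is included, which matters for set operations and duplicate elimination.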

Joins • E ← SELECT w/ pattern $1 with $2 below it; $1.tag=book & $2.tag=publisher; SL=$2 • F ← SELECT w/ pattern $3 with an ad edge to $4; $3.tag=book & $4.tag=author; SL=$4 • G ← (F x E) • H ← SELECT w/ pattern $5 with children $6 and $7; $5.tag=tax_prod_root & $6.tag=book & $7.tag=book & $6.pedigree=$7.pedigree; SL=$6, $7 • we joined on pedigrees. • could have joined on publisher city = author city instead, if desired. • can express a variety of outerjoins easily.
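The join above is a product followed by a selection, which can be sketched directly. The helper names `tax_product` and `join_on_pedigree` and the tuple-valued pedigrees are assumptions for illustration; only the tag tax_prod_root and the pedigree-equality condition come from the slide.

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass
class Node:
    tag: str
    pedigree: object = None            # hypothetical (docId, startPos) pair
    children: list = field(default_factory=list)


def tax_product(left, right):
    """Pair every left tree with every right tree under a new tax_prod_root."""
    return [Node("tax_prod_root", None, [l, r]) for l, r in product(left, right)]


def join_on_pedigree(trees):
    """Keep product trees whose two book children are copies of the same source node."""
    return [t for t in trees
            if t.children[0].tag == "book" and t.children[1].tag == "book"
            and t.children[0].pedigree is not None
            and t.children[0].pedigree == t.children[1].pedigree]


# E: books with their publisher; F: books with an author (results of the two SELECTs).
E = [Node("book", ("bib.xml", 2), [Node("publisher", ("bib.xml", 9))]),
     Node("book", ("bib.xml", 19), [Node("publisher", ("bib.xml", 25))])]
F = [Node("book", ("bib.xml", 2), [Node("author", ("bib.xml", 3))]),
     Node("book", ("bib.xml", 19), [Node("author", ("bib.xml", 22))])]

G = tax_product(F, E)
H = join_on_pedigree(G)
```

Replacing the pedigree test with a comparison on other attributes (for example a city value) gives a value-based join with the same structure.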

XQuery Translation Example 1 • FOR $b IN document(“doc.xml”)//book[//author/@hobby=tennis] RETURN <sportydiveshbook> $b/title IF SOME $a IN $b//author SATISFIES $a/data() = “divesh” THEN $b//author </sportydiveshbook>

Example 1 (contd.) • outer pattern tree: $1 with an ad edge to $2; $1.tag=book & $2.tag=author & $2.hobby=tennis • inner pattern tree: $3 with an ad edge to $4; $3.tag=book & $4.tag=author & $4.content=divesh

Example 1 (contd.) • SELECT input DB w/ outer pattern and empty SL; • Take Cartesian product with entire input DB; • SELECT result w/ combined inner+outer pattern and join condition: $5.tag=tax_prod_root & $6.tag=book & $7.tag=author & $8.tag=book & $8.pedigree=$6.pedigree & $9.tag=author & $9.content=divesh & $10.tag=title (pattern tree nodes, top to bottom: $5; $6, $8; $7, $9, $10) • What is wrong with this translation?

Example 1 (contd.) • Pre-IF part E: select w/ pattern $1 with an ad edge to $2; $1.tag=book & $2.tag=author & $2.hobby=tennis; SL = $1* • then project w/ pattern $3 with $4 below it; $3.tag=book & $4.tag=title; PL = $3, $4 • Additional duplicate elimination needed if we don’t know title is unique per book.

Example 1 (contd.) • IF part F: select w/ pattern $5 with an ad edge to $6; $5.tag=book & $6.tag=author & $6.content = divesh; SL = $5* • then project w/ pattern $7 with an ad edge to $8; $7.tag=book & $8.tag=author; PL = $7, $8

Example 1 (contd.) • Do a left outerjoin of E with F w/ the condition $3 = $7 (What does this really entail?) • [Figure: result tree rooted at tax_prod_root with two book children; one book holds the title, the other holds the author nodes.] • Project w/ pattern $9; $9.tag != book; PL = $9 • Rename tax_prod_root → sportydiveshbook.

Example 2 FOR $a IN distinct(document(“bib.xml”))//author RETURN <authorpubs> {$a} {FOR $b IN document(“bib.xml”)//article WHERE $a = $b/author RETURN $b/title } </authorpubs>

Example 2 (contd.) • select/project authors and dup-elim. • join with books based on (pedigree-) equality of book nodes. (So, what should the selection pattern look like?) • Group by author pedigree. • Do a project, retaining only author and title. • Do a final renaming, if needed.

Main Results • Duplicate elimination by value can be expressed in TAX. • The operators in TAX are independent. • TAX is complete for relational algebra w/ aggregation. • TAX can capture the fragment of XQuery FLWR expressions w/o function calls, recursion, w/ all path expressions using only constants, wildcards, and / & //, when no new ancestor-descendant relationships are created.

Optimization Examples • Revisit translation example 1: • E can be simplified to – project w/ pattern $1 with $2 and $3 below it; $1.tag=book & $2.tag=author & $2.hobby=tennis & $3.tag=title; PL = $1, $3 • Similar simplification applies to F • Self-join can sometimes be eliminated • Associativity, commutativity issues

Implementation • TIMBER system at Univ. of Michigan • Find pattern tree matches via • Index scans • Full scans • Twig joins • Joins implemented on streams • Pedigree – implemented as position of element within document • Pedigrees similar to RID at impl. level
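To illustrate why positional pedigrees help with stream-based joins, here is a stack-based ancestor-descendant join over (startPos, endPos) intervals. It is a generic sketch of a structural join, not TIMBER's code; the function name and the end positions in the sample data are assumptions (the example database shows only start positions).

```python
def structural_join(ancestors, descendants):
    """Return (ancestor, descendant) pairs.

    Both inputs are lists of (start, end) intervals from one document,
    sorted by start; intervals are assumed to be properly nested or disjoint.
    """
    pairs = []
    stack = []   # chain of nested ancestor candidates, outermost first
    i = 0
    for d in descendants:
        # Push every ancestor candidate that starts before d.
        while i < len(ancestors) and ancestors[i][0] < d[0]:
            a = ancestors[i]
            while stack and stack[-1][1] < a[0]:
                stack.pop()          # candidate ended before a starts
            stack.append(a)
            i += 1
        # Drop candidates that ended before d starts.
        while stack and stack[-1][1] < d[0]:
            stack.pop()
        # Everything left on the stack contains d.
        pairs.extend((a, d) for a in stack)
    return pairs


books = [(2, 18), (19, 30)]                 # hypothetical end positions
authors = [(3, 11), (12, 17), (22, 29)]
book_author_pairs = structural_join(books, authors)
nested_pairs = structural_join([(1, 40), (2, 18)], [(3, 11), (20, 21)])
```

Each input list is scanned once, so the join works on sorted streams such as those produced by index scans.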

Summary & Future Work • TAX – extension of RA for handling heterogeneous collections of ordered labeled trees • Simplicity; few more operators • Recognize selective importance of order and handle elegantly • Bulk algebra for efficient implementation of XML querying • Stay tuned for TIMBER release(s) • Future • Arbitrary restructuring: copy-and-paste • Updates: principled via operators

More translation examples – ex3 FOR $b IN document(“mybib.xml”)//book, $a IN $b/author WHERE $b/year < 1990 AND $a/@hobby=“tennis” RETURN <result> {$b//publisher} {$a/affiliation} </result> What’s a generic way to translate such queries into TAX?

More translation examples – ex3 (contd.) • Eforwhere: FOR $b IN document(“mybib.xml”)//book, $a IN $b/author WHERE $b/year < 1990 AND $a/@hobby=“tennis” • Ereturn1: {$b//publisher} • Ereturn2: {$a/affiliation} • Efinal: the complete RETURN <result> element • Identify major components in query statement and associate expressions with each. • Expressions developed in cascade. • Each uses its own pattern (tree).

More translation examples – ex3 (contd.) • Pattern Pforwhere used for creating Eforwhere: $b with $a and $y below it; $b.tag=book & $a.tag=author & $y.tag=year & $a.hobby=“tennis” & $y.content<1990 • P’forwhere – same as Pforwhere, except $y is dropped. • E0 = SELECT_{Pforwhere, {}}(mybib.xml); • E1 = PROJECT_{P’forwhere, {$b,$a}}(E0); • Eforwhere = DE_{P’forwhere, {$b,$a}}(E1); • Why need project? Why need DE?

More translation examples – ex3 (contd.) • Pattern used for creating Ereturn1: nodes $tpr, $b, $b’, $a’, $a, $p (one edge labeled ad); $tpr.tag=tax_prod_root & $b.tag=book & $b’.tag=book & $a.tag=author & $p.tag=publisher & $b’.pedigree=$b.pedigree & $a’.tag=author & $a’.pedigree=$a.pedigree • Why did we impose pedigree equality?

More translation examples – ex3 (contd.) • Ereturn1 is created via left outer-join, project, DE, followed by GROUP-BY.

More translation examples – ex3 (contd.) • E2 = LOJ_{P_LG,{$p}}(Eforwhere, mybib.xml); • E3 = PD_{P’_LG, {$b,$a,$p*}}(E2); • Ereturn1 = GP_{P’’_LG, {$b,$a}, {$p*}}(E3);

More translation examples – ex3 (contd.) • Efinal = PJ_{P_PJ, {$r, $p*, $l*}}(Ereturn1, Ereturn2);

General translation remarks • LET clause handled as correlated subquery; E_LET left outer-joined with E_FORWHERE (just like E_RETURNi). • Ordering by pedigree (i.e., as in original input) already captured. • Ordering by other means doable. • Aggregation – straightforward. • Nested queries (with correlated subqueries) – handled by rewriting them so the query conforms to: (FOR LET)*RETURN where WHERE clause and ORDER-BY are implicit.
