XML Query Reformulation Val Tannen University of Pennsylvania

XML Query Reformulation • Val Tannen • University of Pennsylvania • Joint work with Alin Deutsch, UC San Diego • and in part with Lucian Popa, IBM Almaden

Data Exchange Between Businesses Using XML • [Diagram: a pharmaceutical company, an insurance company, and a hospital each hold proprietary data and exchange published data with one another as XML.]

XML? • <drug> <name>aspirin</name> <price>$4</price> <notes> <side-effects>upset stomach</side-effects> <maker>Bayer</maker> </notes> </drug> • Callouts mark an opening tag, its matching closing tag, and text. • [Tree view: drug has children name (“aspirin”), price (“$4”), and notes; notes has children side-effects (“upset stomach”) and maker (“Bayer”).]

A Simple Publishing Scenario • Proprietary data (relational): prescription(usage, drug, name) = { (2/day, aspirin, John), (3/day, cortisone, Jane) }; patient(name, diagnosis) = { (John, migraine), (Jane, allergy) } • Published data (virtual XML; the patient name is hidden): <study> <case> <diag>migraine</diag> <drug>aspirin</drug> <usage>2/day</usage> </case> <case> <diag>allergy</diag> <drug>cortisone</drug> <usage>3/day</usage> </case> </study> • The correspondence between proprietary and published data is expressed by a publishing query (view). • The client poses a query in XQuery (XML query language standard, draft) against the published data; the reformulation is an SQL query against the proprietary data. • How to express the view? View = query which, if executed, would produce the virtual data. • How to “compose” the client query with the view, obtaining the reformulation?
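The publishing view of this scenario can be sketched in Python. This is only an illustration: the function name `publish_study` is hypothetical, and in the talk the view would be written in XQuery and never actually executed. The sketch joins the two proprietary tables on the patient name and then leaves the name out of the published XML.

```python
import xml.etree.ElementTree as ET

# Proprietary relational data, copied from the slide.
prescription = [  # (usage, drug, name)
    ("2/day", "aspirin", "John"),
    ("3/day", "cortisone", "Jane"),
]
patient = [  # (name, diagnosis)
    ("John", "migraine"),
    ("Jane", "allergy"),
]

def publish_study(prescription, patient):
    """Hypothetical publishing view: join on name, then hide the name."""
    study = ET.Element("study")
    for usage, drug, p_name in prescription:
        for q_name, diagnosis in patient:
            if p_name == q_name:
                case = ET.SubElement(study, "case")
                ET.SubElement(case, "diag").text = diagnosis
                ET.SubElement(case, "drug").text = drug
                ET.SubElement(case, "usage").text = usage
    return study

study = publish_study(prescription, patient)
xml_text = ET.tostring(study, encoding="unicode")
print(xml_text)
```

Executing the view materializes the virtual data; reformulation aims to answer client queries without doing so.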

The General Problem of Query Reformulation • Setting: a client query Q(P) over schema P, a schema S, and a schema correspondence between P and S; sought: a reformulated query X(S). • Given query Q(P), find query(ies) X(S) returning the same answer (soundness), whenever such X(S) exists (completeness).

Applications of Query Reformulation • data publishing: P = public schema, S = storage schema (we just saw it) • data integration: P = global schema, S = local schema • schema evolution: P = old schema, S = new schema • data security (illustrated next)

An Application: Data Security • Public schema P and proprietary schema S are related by a schema correspondence. • A client query E(S) exposes a secret data correlation. • Want to be sure that there is no intrusive query I(P) returning the same answer as E(S). • Only possible if the Completeness Property holds! • [Diagram labels: (patient, ailment); (patient, physician) + (physician, ailment).]

More Complicated Data Publishing: Mixed And Redundant Storage (MARS) • Public schema: published XML (virtual), a view of the proprietary data that may hide information. • Storage schema, initial configuration: proprietary relational data and proprietary XML data. • Storage schema, after tuning: redundant data as well, namely cached queries, materialized views, indexes, and partial relational storage of XML. • A schema correspondence relates the public schema to the storage schema.

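An Example With Tuning • [Diagram: the stores are XML documents and relational DBs; the attribute sets shown are (drug, usage, diagnosis), (drug, usage, name), (name, diagnosis), (drug, price, notes), (diagnosis, drug), and (drug, price); the views shown are a simple publishing view, an identity view, a cached query, and a relational view.]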

Redundancy Enables Multiple Reformulations • Client query: “find how much each treatment costs” • [Diagram: the tuned configuration with its simple publishing view, identity view, cached query, and relational view; three items are labeled R1, R2, R3.] • Some reformulations are potentially cheaper to execute than others. Want to find an “optimal” one!

Schema Correspondence Expressible in XQuery • The DB administrator must be able to specify the correspondence. • Can use XQuery, fixing any of the common encodings of relational tables in XML. • [Diagram: relational DBs are encoded as XML; XQuery views relate the XML documents.]

XQuery? • for $d in document/drug, $m in $d//maker return <producedBy>$m/text()</producedBy> • The for clause is the binding part; the return clause is the tagging template. • // (descendant) is the transitive closure of / (child). • Result should contain <producedBy>Bayer</producedBy> • [Tree view: drug has children name (“aspirin”), price (“$4”), and notes; notes has children side-effects (“upset stomach”) and maker (“Bayer”).]

Approach: XQuery Reformulation Reduced to Relational Reformulation • Compilation: client XQuery → relational queries; schema correspondence (mappings as XQueries) and XML integrity constraints → relational constraints. • GReX (Generic Relational encoding of XML): built-in relational constraints that capture the XML data model. • C&B takes the relational queries and relational constraints and produces reformulated queries (multiple solutions).

XQuery Semantics • XML data model is a tagged tree: <drug> <name>aspirin</name> <price>$4</price> <notes> <side-effects>upset stomach</side-effects> <maker>Bayer</maker> </notes> </drug> • XQueries compute in two stages: • - variable binding stage: navigation in the XML tree binds variables to nodes, text, tags, etc. • - tagging stage: output of new XML, by filling variable bindings into a tagging template • for $d in document/drug, $m in $d//maker return <producedBy>$m/text()</producedBy> • [Tree view: $d is bound to the drug node and $m to the maker node.]
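The two stages can be imitated in Python with the standard library's ElementTree. This is a rough sketch and not an XQuery engine: the helper names `bind` and `tag_bindings` are made up for illustration, and the parsed root element stands in for the child of the document node.

```python
import xml.etree.ElementTree as ET

DOC = (
    "<drug><name>aspirin</name><price>$4</price>"
    "<notes><side-effects>upset stomach</side-effects>"
    "<maker>Bayer</maker></notes></drug>"
)

def bind(root):
    """Binding stage: all ($d, $m) with $d in document/drug, $m in $d//maker."""
    # The parsed root element plays the role of the child of the document node.
    drugs = [root] if root.tag == "drug" else []
    # iter("maker") walks d and everything below it; d itself is a drug,
    # so only proper descendants tagged maker are returned.
    return [(d, m) for d in drugs for m in d.iter("maker")]

def tag_bindings(bindings):
    """Tagging stage: fill each binding into the <producedBy> template."""
    results = []
    for _d, m in bindings:
        produced_by = ET.Element("producedBy")
        produced_by.text = m.text  # corresponds to $m/text()
        results.append(ET.tostring(produced_by, encoding="unicode"))
    return results

bindings = bind(ET.fromstring(DOC))
results = tag_bindings(bindings)
print(results)
```

Only the binding stage is compiled to a relational query in this approach; the tagging stage is handled separately.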

Compiling the Binding Part of XQueries to Relational Queries • XBind query = binding part of XQuery (returns a relation: tuples of variable bindings) • It compiles to a relational “conjunctive” query over child(x,y), tag(x,t), desc(x,y), Root(r), etc. • Example: for $d in document(“drugs.xml”)/drug, $m in $d//maker return “$d” “$m” compiles to P($d,$m) :- Root(r), child(r,$d), tag($d,“drug”), desc($d,x), child(x,$m), tag($m,“maker”) • But not all models of this schema correspond to the intended model; need GReX!
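To make the compilation concrete, here is a small Python sketch that encodes an XML tree into child, desc, and tag relations and then evaluates the conjunctive query P($d,$m) over them. The integer node ids, the function names `encode` and `xbind`, and the use of a dict for tag are illustrative assumptions, not the encoding used by the actual system.

```python
import xml.etree.ElementTree as ET
from itertools import count

DOC = (
    "<drug><name>aspirin</name><price>$4</price>"
    "<notes><side-effects>upset stomach</side-effects>"
    "<maker>Bayer</maker></notes></drug>"
)

def encode(xml_text):
    """Encode an XML tree as relations root, child, desc, tag (node ids are ints)."""
    ids = count(1)
    root_id = 0                      # id 0 stands for the document node
    child, tag = set(), {}
    def walk(elem, parent):
        n = next(ids)                # ids are assigned in document order
        child.add((parent, n))
        tag[n] = elem.tag
        for c in elem:
            walk(c, n)
    walk(ET.fromstring(xml_text), root_id)
    nodes = {root_id} | set(tag)
    # desc is the reflexive-transitive closure of child (the intended model).
    desc = {(n, n) for n in nodes} | set(child)
    changed = True
    while changed:
        new = {(x, z) for (x, y) in desc for (y2, z) in child if y == y2}
        changed = not new <= desc
        desc |= new
    return root_id, child, desc, tag

def xbind(root_id, child, desc, tag):
    """P($d,$m) :- Root(r), child(r,$d), tag($d,"drug"),
                   desc($d,x), child(x,$m), tag($m,"maker")"""
    return {(d, m)
            for (r, d) in child if r == root_id and tag.get(d) == "drug"
            for (d2, x) in desc if d2 == d
            for (x2, m) in child if x2 == x and tag.get(m) == "maker"}

root_id, child, desc, tag = encode(DOC)
answers = xbind(root_id, child, desc, tag)
print(answers)
```

With ids assigned in document order, drug is node 1 and maker is node 6, so the single binding is the pair (1, 6).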

Sample Constraints from GReX • Relationship between child and descendant navigation: • - ∀x∀y [ child(x,y) → desc(x,y) ] (desc contains child) • - ∀x [ el(x) → desc(x,x) ] (desc is reflexive) • - ∀x∀y∀z [ desc(x,y) ∧ desc(y,z) → desc(x,z) ] (desc is transitive) • Tagged tree structure of XML: • - ∀r∀x [ root(r) ∧ desc(x,r) → x = r ] (root has no ancestors) • - ∀x∀y∀z [ child(x,z) ∧ child(y,z) → x = y ] (at most one parent) • These do not capture transitive closure completely, nor is it possible to do it in first-order logic; STILL...

More Constraints from GReX • (someTag) ∀x [ el(x) → ∃t tag(x,t) ] (every element has a tag) • (oneTag) ∀x∀t1∀t2 [ tag(x,t1) ∧ tag(x,t2) → t1 = t2 ] (one tag per element) • (noLoop) ∀x∀y [ desc(x,y) ∧ desc(y,x) → x = y ] (no non-trivial cycles) • (noShare) ∀x∀y∀u∀v [ child(x,u) ∧ child(x,v) ∧ desc(u,y) ∧ desc(v,y) → u = v ] (unique path between elements) • (inLine) ∀x∀y∀u [ desc(x,u) ∧ desc(y,u) → x = y ∨ desc(x,y) ∨ desc(y,x) ] (ancestors of an element are collinear)
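A few of these constraints can be checked mechanically on a finite instance. The Python sketch below is an illustration under simplifying assumptions: the function names are hypothetical, only five constraints are covered, and every node is treated as an element, so el(x) is implicit.

```python
def closure(nodes, child):
    """Reflexive-transitive closure of child, i.e. the intended desc relation."""
    desc = {(n, n) for n in nodes} | set(child)
    changed = True
    while changed:
        new = {(x, z) for (x, y) in desc for (y2, z) in child if y == y2}
        changed = not new <= desc
        desc |= new
    return desc

def check_grex(nodes, child, desc):
    """Return the names of the checked GReX constraints that (child, desc) violates."""
    violated = []
    if not child <= desc:
        violated.append("desc contains child")
    if any((x, x) not in desc for x in nodes):
        violated.append("desc is reflexive")
    if any((x, z) not in desc for (x, y) in desc for (y2, z) in desc if y == y2):
        violated.append("desc is transitive")
    if any(x != y for (x, z) in child for (y, z2) in child if z == z2):
        violated.append("oneParent")
    if any(x != y and (y, x) in desc for (x, y) in desc):
        violated.append("noLoop")
    return violated

nodes = {0, 1, 2, 3}
tree_child = {(0, 1), (1, 2), (1, 3)}        # a proper tree
dag_child = tree_child | {(3, 2)}            # node 2 now has two parents

good = check_grex(nodes, tree_child, closure(nodes, tree_child))
bad = check_grex(nodes, dag_child, closure(nodes, dag_child))
print(good, bad)
```

The second instance is an acyclic graph that is not a tree; it passes the child/descendant constraints and is rejected only by the tree-structure constraint on parents.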

Which Reformulations Do We Find This Way? • Compilation: client XQuery → relational queries; schema correspondence (mappings as XQueries) and XML integrity constraints → relational constraints. • GReX built-in constraints capture the XML data model. • C&B produces reformulated queries (multiple solutions): all of them?

Restrictions on XQuery • Main restriction: no aggregates (to be investigated). Leaving out aggregates, most common queries can be processed. • Minor restrictions: • - no user-defined functions (of course!) • - limited use of negation (or else the problem becomes undecidable) • - limited use of document order (to be investigated) • - no navigation to parent or wildcard child (of unspecified tag) (unintuitive, but we can show that this needs another algorithm, unless NP = Π₂ᵖ)

The Reduction is Sound and Complete • For the restricted XQuery fragment, given: • - XBind query B compiled to a relational query c(B) • - schema correspondence C given by XQueries, compiled to a set of constraints c(C) • Relative Completeness Theorem: R is a minimal reformulation of B under C iff c(R) is a minimal reformulation of c(B) under c(C) and GReX. • R can be computed from c(R). • All of them (the minimal reformulations of c(B)) are found by C&B.

A Glimpse at the Chase: Transforming Queries Using Constraints • A query: ‘find data satisfying condition “A”’, written Q: A • A constraint: ‘whenever the data satisfies condition “A”, it also satisfies “B”’, written A → B • A chase step: Q: A becomes Q1: A ∧ B • In general, Q and Q1 are not equivalent, but in all DBs satisfying the constraint, they are! • The chase: repeatedly applying chase steps until no new conditions can be added • Theory of the chase: 20 years old, deep and rich, due to Beeri, Maier, Mendelzon, Sagiv, Vardi, Yannakakis and others!

How Do We Use the Chase? Capturing Relational Views With Constraints • Let the schema correspondence be the view V: ‘retrieve the data satisfying conditions “A” and “B”’, written V: A ∧ B • As a condition, V stands for “data appears in result of V”. • Capture the definition with constraints (first-order logic statements): • - A ∧ B → V: all data satisfying “A” and “B” “appears in result of V” • - V → A ∧ B: all data “appearing in V” satisfies “A” and “B”

Chase & Backchase • Constraints: A → B, A ∧ B → V, V → A ∧ B • First chase: Q: A, then Q1: A ∧ B, then Q2: A ∧ B ∧ V • Next inspect all subqueries (“syntactic pieces”) of the chase result Q2, for example SQ: V • The equivalence is checked again using the chase (backwards): chasing SQ: V with V → A ∧ B yields A ∧ B ∧ V, which is Q2 • It turns out that SQ is equivalent to Q • Presence of constraint A → B allows reformulation

General C&B Algorithm (joint work with Lucian Popa, IBM Almaden) • (public) schema P, (proprietary) schema S • Let C be a set of constraints (e.g., on P and/or P & S). Assume some terminating chasing sequence. • Chase: chase Q(P) with C to obtain the universal plan U(P + S). • Backchase: inspect the subqueries of U; solutions X(S) = subqueries of U, posed against S, equivalent to Q. • Completeness Theorem [Deutsch & T.]: Any scan-minimal reformulation of Q under C is a subquery of U.
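The chase and backchase phases can be illustrated with a deliberately simplified Python toy. It is an assumption-laden sketch, not the algorithm as implemented: conditions are treated as propositional atoms (single letters), so there are no variables, homomorphisms, or joins, and the function names are made up. The example reuses the conditions A and B and the view V from the earlier slides.

```python
from itertools import combinations

def chase(atoms, constraints):
    """Add conclusions of applicable constraints until nothing new can be added."""
    result = set(atoms)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in constraints:
            if premise <= result and not conclusion <= result:
                result |= conclusion
                changed = True
    return frozenset(result)

def backchase(query, constraints, target):
    """Minimal subqueries of the universal plan, over `target`, equivalent to `query`."""
    universal = chase(query, constraints)        # chase phase: universal plan
    candidates = sorted(universal & target)      # atoms of the target schema only
    found = []
    for size in range(1, len(candidates) + 1):   # bottom-up, smallest first
        for sub in map(frozenset, combinations(candidates, size)):
            if any(f <= sub for f in found):
                continue                         # contains a smaller solution
            # sub is a piece of the universal plan, so query implies sub;
            # chasing sub back must recover the query for equivalence.
            if query <= chase(sub, constraints):
                found.append(sub)
    return universal, found

constraints = [
    (frozenset("A"), frozenset("B")),     # A -> B
    (frozenset("AB"), frozenset("V")),    # A and B -> V
    (frozenset("V"), frozenset("AB")),    # V -> A and B
]
query = frozenset("A")
universal, solutions = backchase(query, constraints, target=frozenset("V"))
print(sorted(universal), [sorted(s) for s in solutions])

# Without A -> B the chase of Q never reaches V, so no reformulation exists.
_, no_solutions = backchase(query, constraints[1:], target=frozenset("V"))
```

Exploring candidates by increasing size means every solution reported is minimal, at the price of an exponential search space in the size of the universal plan.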

Two Sets of Experiments • Synthetic queries • - reformulation time as a function of query “complexity” • - XML analog of relational “star” queries, increasing number of joins • - can very complex queries still be reformulated in a practical amount of time? • “Realistic” queries from the XML Benchmark Project [http://monetdb.cwi.nl/xml] • - The queries: 20 queries designed to exercise interesting features of XQuery • - The schema correspondence: views in both directions; it compiles to about 200 constraints! Much more than in typical relational schemas!

Experiments with Synthetic Queries • [Plot: x-axis is the number of joins (number of corners in the star).]

Experiments with Benchmark Queries • Reformulation times must be understood in conjunction with execution times (e.g., tens of seconds for Q10)

Summary of Contributions • MARS, a system for XQuery reformulation, • - with mixed and redundant storage, under integrity constraints • - complex schema correspondence (views in both directions) • Showed practical relevance of C&B method (feasible and worthwhile) • A completeness result for a significant fragment of XQuery and a large class of schema correspondences. The method remains sound for the full language. • A reduction between minimal reformulation and query equivalence, and we gave matching lower bounds showing our chase-based decision procedure is asymptotically optimal for the fragment considered.

The End

Why XML? • The relational data model is still the dominant concept in databases. • All data can be coded into tables. (For that matter, into Gödel numbers too!) • Artificial coding makes life harder for query programmers. Result: less productivity, more bugs. • XML is much more flexible. It is also “self-describing”, i.e., no a priori need for types/schemas (but this is sometimes a bad idea). • It came from the document community (tagged text) and was cheered by industry gurus. So we have to live with it. (Although one can imagine better data models…)

Making It Work • Chase: each chase step is similar to evaluation of a recursive Datalog rule on a symbolic database built from the query, so we borrowed classical query processing techniques: • - compiling constraints to join tree • - joins implemented as hash-joins • - pushing selections into joins • Backchase: size of search space is O(2^u), u = size of universal plan. We found criteria for pruning this space (typical size reduction: 2^100 → 300). • Cost-independent: prune subqueries that • - do not correspond to legal XML queries (legal queries perform contiguous navigation steps starting from the root) • - contain redundant descendant navigation steps (x child-of y, y child-of z, x descendant-of z) • Bottom-up exploration of subqueries: first all performing 1 navigation step, next all performing 2 navigation steps, etc. • A cost-based pruning strategy parameterized by costing model • - finds optimal reformulation for any monotonic cost model • - cost models for XML are still under research • - heuristic cost model: cost is number of table scans/XML navigation steps performed • - amenable to experimenting with other cost models

Benefit of Reformulation For Execution Time • [Plot: x-axis is the no. of elements in document.] • Benefit increases with increasing complexity of query and increasing database size

More Results for Benchmark Queries • [Chart series: time to first reformulation, delta to best reformulation, delta to finish search.] • For redundancy: materialized the XBind query for each query (particular case of Access Support Relation). • Time to find the first reformulation is essentially the same as in the absence of redundancy. Additional time is spent only for finding the optimal one.

Related Work: Data Integration As Particular Case of MARS Applications • Global As View (GAV): reformulation by composition-with-views, X = Q o CR, with Q over P (global schema) and CR relating P to S (local schema); TSIMMIS, SilkRoute, XPeranto • Local As View (LAV): rewriting-with-views, Q = X o CR; Information Manifold, STORED [with Fernandez and Suciu in SIGMOD’99], Agora • MARS: combined effect of rewriting + composition

Future Work Directions • Short-Term: • - tuning of C&B implementation for further speedup • - XML-specific strategies for pruning the backchase stage • - in particular, finding a good cost model to perform cost-based pruning • Medium-Term: • - Applying C&B to Data Security • - Applications to Adaptive Distributed Query Optimization • Long-Term: • - a unified framework for integrating data from various, heterogeneous sources going beyond classical databases (XML/relational/LDAP + web forms + web services)

Application 3: Schema Evolution (e.g. Caching) • Goal: support existing client applications even after changing the schema. • Old schema O and new schema N are related by a schema correspondence; N could be O extended with cached results. • Given the client's old query Q(O), find a reformulated query X(N) returning the same answer as Q(O).

A Source of Redundancy: Relational Storage of XML • Highly unstructured public data (XML): catalog contains drug elements, each with name, price, and notes; the names are “aspirin” and “cortisone”, the prices “$4” and “$50”. • Relational view (lossy), kept as redundant storage: Drugs(name, price) = { (aspirin, $4), (cortisone, $50) }
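As a rough illustration of such a lossy relational view, the Python sketch below shreds a catalog document into Drugs(name, price) rows. The function name `shred_drugs` and the empty notes elements are assumptions made for the example; the slide does not show the contents of notes here.

```python
import xml.etree.ElementTree as ET

CATALOG = """
<catalog>
  <drug><name>aspirin</name><price>$4</price><notes/></drug>
  <drug><name>cortisone</name><price>$50</price><notes/></drug>
</catalog>
"""

def shred_drugs(catalog_xml):
    """Hypothetical relational view Drugs(name, price) over the XML catalog."""
    root = ET.fromstring(catalog_xml.strip())
    # The view is lossy: whatever is stored under <notes> is not kept.
    return [(d.findtext("name"), d.findtext("price")) for d in root.findall("drug")]

drugs_table = shred_drugs(CATALOG)
print(drugs_table)
```

A query that needs only names and prices can be answered from the table, while one that touches notes must still go to the XML.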

Containment Under Integrity Constraints • Decision procedure for containment is based on chasing with constraints from GReX. • Natural extension to XML integrity constraints. • Some results: • Containment of well-behaved XPath/XBind queries under bounded simple XML integrity constraints (SXICs) is decidable (used in relative completeness theorem). • Even modest use of unboundedness makes the problem undecidable. • Corollary: containment under bounded SXICs and DTDs is undecidable. • Containment under DTDs only is an open problem, but we have a PSPACE lower bound. • See proposal for details.

LDAP

The Very End

The Architecture of Our Solution • Compilation: the client XQuery is split into XBind queries and a tagging template; XBind queries → relational queries; schema correspondence (mappings as XQueries, rel/XML encodings) and XML integrity constraints → relational constraints. • GReX: Generic Relational encoding of XML, used internally to partially capture the intended model (built-in XML data model constraints). • C&B produces reformulated queries (multiple solutions). • [Diagram annotations: “defined next”, “not shown here”.]

Problem: XML/MARS XQuery Reformulation • - schema correspondence given by views in both directions • - multiple solutions • Tool: algorithm for reformulation of relational queries under relational constraints, Chase & Backchase (C&B) • - introduced in [VLDB’99 with L. Popa and V. Tannen] • - evaluated in [SIGMOD’00 with L. Popa, A. Sahuguet and V. Tannen]

Capturing Relational Views With Constraints • Let the schema correspondence be a view defined as the relational conjunctive query V(x,z) :- A(x,y), B(y,z) • Capture the definition with constraints: • - (cV) ∀x∀y∀z [ A(x,y) ∧ B(y,z) → V(x,z) ] (result of query defining the view is included in V) • - (bV) ∀x∀z [ V(x,z) → ∃y A(x,y) ∧ B(y,z) ] (V is included in result of query defining view)
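The two inclusion constraints can be tested on small relation instances. The Python sketch below uses made-up sample tuples and hypothetical function names; it shows that the materialized view satisfies both (cV) and (bV), that dropping a tuple from V breaks (cV), and that adding a spurious tuple breaks (bV).

```python
# Sample instances; the tuples are invented for illustration.
A = {(1, "a"), (2, "b")}
B = {("a", "x"), ("b", "y"), ("c", "z")}

def view(A, B):
    """V(x,z) :- A(x,y), B(y,z)"""
    return {(x, z) for (x, y) in A for (y2, z) in B if y == y2}

def holds_cV(A, B, V):
    """(cV): every tuple produced by the defining query is in V."""
    return all((x, z) in V for (x, y) in A for (y2, z) in B if y == y2)

def holds_bV(A, B, V):
    """(bV): every tuple of V has a witness y with A(x,y) and B(y,z)."""
    return all(any((x, y) in A and (y, z) in B for (_, y) in A) for (x, z) in V)

V = view(A, B)
too_small = {(1, "x")}            # misses (2, "y")
too_big = V | {(3, "z")}          # (3, "z") has no witness

print(V, holds_cV(A, B, V), holds_bV(A, B, V))
```

Together the two constraints pin V down to exactly the result of its defining query, which is what lets the chase move between V and the atoms A, B in both directions.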

Partially capturing the XML model • Partially, because some features cannot fully be captured with constraints: • - descendant is the transitive closure of child, but this is not FO-definable • - neither is the “treeness” property • Our solution: • - add a set of constraints GReX to approximate intended models • - it turns out that capturing descendant helps in capturing treeness • - then, we define a significant XQuery fragment (we call it well-behaved) that cannot distinguish between intended and approximate models

Constraints in GReX (2): the tagged tree structure of XML • (topRoot) ∀r∀x [ root(r) ∧ desc(x,r) → x = r ] (root has no ancestors) • (oneTag) ∀x∀t1∀t2 [ tag(x,t1) ∧ tag(x,t2) → t1 = t2 ] (one tag per element) • (noLoop) ∀x∀y [ desc(x,y) ∧ desc(y,x) → x = y ] (no non-trivial cycles) • (oneParent) ∀x∀y∀z [ child(x,z) ∧ child(y,z) → x = y ] (at most one parent) • (noShare) ∀x∀y∀u∀v [ child(x,u) ∧ child(x,v) ∧ desc(u,y) ∧ desc(v,y) → u = v ] (unique path between elements) • (inLine) ∀x∀y∀u [ desc(x,u) ∧ desc(y,u) → x = y ∨ desc(x,y) ∨ desc(y,x) ] (ancestors of an element are collinear)

XQuery Restrictions • What it allows: • composition of navigation steps, • navigation axes: self, (named)child, descendant, ancestor, idrefs • qualifiers: path, string  path, “and”, “or”, path equality/inequality • where clause: disjunction, path equality/inequality, • existential quantification • What it rules out: • user-defined functions, • range, before predicates, • aggregates, arbitrary negation, universal quantification, • concatenation (,) • navigation to parent (..) or to child of unspecified name (*)

C&B Completeness • Let C be a set of constraints (relating public schema P and proprietary schema S). • Chase Q(P) with C to obtain the universal plan U(P + S); backchase over the subqueries of U; solutions X(S) = subqueries of U, posed against S, equivalent to Q. • C-minimal query: removing any of its relational atoms produces a non-equivalent query under C. • Q1 is a subquery of Q2: Q1 is isomorphic to a “piece” of Q2. • Completeness Theorem: Any C-minimal reformulation of Q is a subquery of U.

A Completeness Result for Our Solution • Given: • - well-behaved XBind query B, compiled to a relational query c(B) • - schema correspondence M given by well-behaved XQueries (in both directions), compiled to a set of relational constraints c(M) • - bounded XML integrity constraints XIC (a class of XML integrity constraints, see [KRDB’01]), compiled to a set of relational constraints c(XIC) • Relative Completeness Theorem: for any R, R is a (M + XIC)-minimal reformulation of B iff c(R) is a (GReX ∪ c(M) ∪ c(XIC))-minimal reformulation of c(B). • All of them (the minimal reformulations of c(B)) are found by C&B, and R can be computed from c(R). • Corollary: completeness of reformulation algorithm for XBind queries.

Capturing XML Semantics • Compilation: client XQuery → relational queries; schema correspondence (mappings as XQueries) and XML integrity constraints → relational constraints. • GReX built-in constraints capture the XML data model. • C&B produces reformulated queries (multiple solutions).

Summary of Constraints Used in C&B Phase • Built-in constraints in GReX • Relational views compile to inclusion constraints • XQuery views • their XBind queries compile to inclusion constraints as for relational views • their return clause compiles to several decorrelated queries, each captured with constraints • the XML template in the return clause compiles to several Skolem and copy functions, each compiled to constraints • Integrity constraints • XML constraints compile to relational constraints • relational schema constraints
