
Approximate XML Query Answers

Alkis Polyzotis (UC Santa Cruz), Minos Garofalakis (Bell Labs), Yannis Ioannidis (U. of Athens, Hellas)




  1. Approximate XML Query Answers. Alkis Polyzotis (UC Santa Cruz), Minos Garofalakis (Bell Labs), Yannis Ioannidis (U. of Athens, Hellas)

  2. Motivation
     • XML: de facto standard for data exchange
     • Development of the “XML Warehouse”
     • Conflict between “on-line” use and query execution cost
     • Increased query response times
     • Users might wait for uninteresting results
     [Figure: query Q is sent to the XML data warehouse, which returns result R]

  3. Approximate Query Answers
     • Evaluate query over a concise data synopsis and obtain an approximation R’ of the true result
     • Use approximate result as timely feedback
     • User can assess the “value” of the query
     • Goal: reduce number of evaluated queries
     [Figure: query Q evaluated over a synopsis of the XML data returns R’; the warehouse returns the true result R]

  4. Contributions
     • TreeSketch Synopses
       • Structural summaries for XML data
       • Approximate answers for complex twig queries
       • Summarization model <--> Structural clustering of elements
       • Efficient processing and construction
     • Element Simulation Distance
       • Novel distance metric for XML data
       • Captures “approximate” similarity between two XML trees
     • Experimental Results
       • Accurate approximate answers for low space budgets
       • Low-error selectivity estimates
       • Efficient construction algorithm

  5. Outline
     • Preliminaries
     • TreeSketches
       • Synopsis model
       • Computing approximate answers
       • Summary construction
     • Element Simulation Distance
     • Experimental Study
     • Conclusions

  6. Data and Query Model
     • XML Document: elements r, p1, s2, s3, f4, f5, f6, f7, e8, c9, e10, c11, c12, c13 [tree figure not recoverable]
     • Twig Query: q0 (root); q1: //section; q2: ./figure; q3: .//equation
     • Binding Tuples (q0, q1, q2, q3): (r, s2, f4, e8), (r, s2, f4, e10), (r, s2, f5, e8), (r, s2, f5, e10)
     • Nesting Tree [figure not recoverable]
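The binding tuples on this slide can be reproduced with a small sketch. The tree shape below is an assumed reading of the slide's figure (which did not survive extraction), and the helper names `tag`, `descendants`, and `binding_tuples` are hypothetical. The query is taken to be q1 = //section with two branches, q2 = ./figure and q3 = .//equation.

```python
# Toy document as parent -> children lists.
# ASSUMPTION: this shape is inferred from the slide's labels, not given explicitly.
CHILDREN = {
    "r": ["p1"],
    "p1": ["s2", "s3"],
    "s2": ["f4", "f5"],
    "s3": ["f6", "f7"],
    "f4": ["e8", "c9"],
    "f5": ["e10", "c11"],
    "f6": ["c12"],
    "f7": ["c13"],
}

def tag(elem):
    """Tag is the leading letter of the name (s: section, f: figure, e: equation)."""
    return elem[0]

def descendants(elem):
    """Yield all proper descendants of elem in document (pre-)order."""
    for kid in CHILDREN.get(elem, []):
        yield kid
        yield from descendants(kid)

def binding_tuples(root):
    """Bindings for q0 = root, q1 = //section, q2 = ./figure, q3 = .//equation."""
    result = []
    for s in descendants(root):
        if tag(s) != "s":
            continue
        figures = [f for f in CHILDREN.get(s, []) if tag(f) == "f"]
        equations = [e for e in descendants(s) if tag(e) == "e"]
        # Each section contributes the cross product of its two branches.
        result.extend((root, s, f, e) for f in figures for e in equations)
    return result

tuples = binding_tuples("r")
for t in tuples:
    print(t)
```

Under this reading, section s3 contributes no tuples because it has figures but no equation descendants, which is consistent with the four tuples listed on the slide.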

  7. Problem Definition
     • Process twig query over a synopsis
     • Compute approximation of nesting tree
     [Figure: the twig query (q0; q1: //section; q2: ./figure; q3: .//equation) evaluated over the Synopsis gives the Approximate Nesting Tree; evaluated over the XML Data it gives the True Nesting Tree]

  8. TreeSketch Model

  9. Graph Synopsis
     • Synopsis node <--> Set of elements of the same tag
     • Synopsis edge <--> Document edge(s)
     [Figure: XML Document and its Graph Synopsis with nodes R(1), P(1), S(2), F(2), F(2), E(2), C(4)]

  10. TreeSketch Synopsis
     • Augment graph-synopsis with edge counts
     • count[u,v]: mean number of children in v per element in u
     [Figure: XML Document and TreeSketch with nodes R(1), P(1), S(2), F(2), F(2), E(2), C(4) and edge counts 1, 2, 1, 1, 1, 1, 1]
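A minimal sketch of how the edge counts could be computed from a document and a given grouping of elements into synopsis nodes. The tree shape and the node names `F1`/`F2` are assumptions made for illustration (the figure itself is lost); `edge_counts` is a hypothetical helper, not a function from the presentation.

```python
from collections import defaultdict

# Toy document as child -> parent. ASSUMPTION: inferred from the slide's labels.
PARENT = {
    "p1": "r", "s2": "p1", "s3": "p1",
    "f4": "s2", "f5": "s2", "f6": "s3", "f7": "s3",
    "e8": "f4", "c9": "f4", "e10": "f5", "c11": "f5",
    "c12": "f6", "c13": "f7",
}

# Element -> synopsis node. F1 and F2 are hypothetical names for the two F(2) nodes.
CLUSTER = {
    "r": "R", "p1": "P", "s2": "S", "s3": "S",
    "f4": "F1", "f5": "F1", "f6": "F2", "f7": "F2",
    "e8": "E", "e10": "E",
    "c9": "C", "c11": "C", "c12": "C", "c13": "C",
}

def edge_counts(parent, cluster):
    """count[u, v] = (number of document edges from u to v) / (number of elements in u)."""
    size = defaultdict(int)
    for node in cluster.values():
        size[node] += 1
    edges = defaultdict(int)
    for child, par in parent.items():
        edges[(cluster[par], cluster[child])] += 1
    return {(u, v): n / size[u] for (u, v), n in edges.items()}

counts = edge_counts(PARENT, CLUSTER)
for edge, c in sorted(counts.items()):
    print(edge, c)
```

With this grouping the sketch yields the same multiset of counts as the slide's figure: one edge with count 2 (P to S) and six edges with count 1.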

  11. TreeSketch Synopsis
     • Is there a lossless synopsis?
     • What is the quality of a lossy synopsis?
     [Figure: XML Document and TreeSketch, as on the previous slide]

  12. Count Stability
     • (u,v) count-stable: all elements in u have the same child-count in v
     [Figure: XML Document and TreeSketch with nodes R(1), P(1), S(2), F(2), F(2), E(2), C(4) and edge counts 1, 2, 1, 1, 1, 1, 1]
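The definition can be checked mechanically. The sketch below reports every synopsis edge (u, v) that is not count-stable. The tree shape, the node names, and the helper `unstable_edges` are illustrative assumptions, not part of the presentation.

```python
from collections import Counter, defaultdict

# Toy document as child -> parent. ASSUMPTION: inferred from the slide's labels.
PARENT = {
    "p1": "r", "s2": "p1", "s3": "p1",
    "f4": "s2", "f5": "s2", "f6": "s3", "f7": "s3",
    "e8": "f4", "c9": "f4", "e10": "f5", "c11": "f5",
    "c12": "f6", "c13": "f7",
}

# Grouping with both sections in one node S (hypothetical node names).
LOSSY = {
    "r": "R", "p1": "P", "s2": "S", "s3": "S",
    "f4": "F1", "f5": "F1", "f6": "F2", "f7": "F2",
    "e8": "E", "e10": "E",
    "c9": "C", "c11": "C", "c12": "C", "c13": "C",
}
# Same grouping, but with the two sections kept apart.
STABLE = dict(LOSSY, s2="S1", s3="S2")

def unstable_edges(parent, cluster):
    """Return the synopsis edges (u, v) whose elements in u disagree on child-count in v."""
    child_counts = {e: Counter() for e in cluster}
    for child, par in parent.items():
        child_counts[par][cluster[child]] += 1
    members = defaultdict(list)
    for elem, node in cluster.items():
        members[node].append(elem)
    bad = set()
    for u, elems in members.items():
        targets = set().union(*(child_counts[e] for e in elems))
        for v in targets:
            # A Counter returns 0 for a missing key, so absent children count as 0.
            if len({child_counts[e][v] for e in elems}) > 1:
                bad.add((u, v))
    return bad

print(unstable_edges(PARENT, LOSSY))
print(unstable_edges(PARENT, STABLE))
```

In this toy reading, s2 has two children in F1 and none in F2 while s3 has the opposite, so both edges leaving S are unstable until the sections are split.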

  13. Count-Stable TreeSketch
     • A count-stable synopsis can recover the input tree
     • Efficient one-pass construction
     • Stable summary can be too large for practical use!
     [Figure: XML Document and count-stable TreeSketch with nodes R(1), P(1), S(1), S(1), F(2), F(2), E(2), C(4) and edge counts 1, 1, 1, 2, 2, 1, 1, 1]
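One plausible way to obtain a count-stable grouping in a single post-order pass is to give each element a signature made of its tag and the multiset of its children's groups, and to put elements with equal signatures together. This is a sketch under that assumption and is not claimed to be the construction used by the authors. The tree shape and the names `stable_partition` and `TAG` are hypothetical.

```python
from collections import Counter

# Toy document as parent -> children lists. ASSUMPTION: inferred from the slide's labels.
CHILDREN = {
    "r": ["p1"],
    "p1": ["s2", "s3"],
    "s2": ["f4", "f5"],
    "s3": ["f6", "f7"],
    "f4": ["e8", "c9"],
    "f5": ["e10", "c11"],
    "f6": ["c12"],
    "f7": ["c13"],
}
ALL = ["r", "p1", "s2", "s3", "f4", "f5", "f6", "f7",
       "e8", "c9", "e10", "c11", "c12", "c13"]
TAG = {name: name[0] for name in ALL}  # tag = leading letter of the element name

def stable_partition(children, tag, root):
    """Group elements by (tag, multiset of child groups), computed bottom-up."""
    group_of_signature = {}
    group = {}

    def visit(elem):
        kid_groups = Counter(visit(k) for k in children.get(elem, []))
        signature = (tag[elem], tuple(sorted(kid_groups.items())))
        # A new signature gets the next unused integer id.
        group[elem] = group_of_signature.setdefault(signature, len(group_of_signature))
        return group[elem]

    visit(root)
    return group

group = stable_partition(CHILDREN, TAG, "r")
print(len(set(group.values())), "synopsis nodes")
```

Elements in the same group have identical child counts into every group by construction, so every edge is count-stable. On the toy tree this gives eight nodes, matching the node list in the slide's figure.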

  14. Lossy TreeSketch
     [Figure: XML Document and TreeSketch with nodes R(1), P(1), S(2), F(2), F(2), E(2), C(4); an inset with #F axes accompanies node S(2)]

  15. TreeSketches and Clustering
     • TreeSketch <--> Element clustering
     • All elements in a node are mapped to a “centroid”
     • Tight clusters <--> Accurate synopsis
     • Synopsis quality <--> Clustering error
     • Options: Manhattan Distance, Squared Error, …
     • Quality can be measured independent of a workload
     • Key for effective construction
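As an illustration of the squared-error option, the sketch below treats each element as a vector of child counts (one coordinate per synopsis node), takes the centroid of a node to be the mean vector of its elements, and sums the squared distances. The exact error definition used in the work may differ; the tree shape, node names, and `squared_error` helper are assumptions.

```python
from collections import Counter, defaultdict

# Toy document as child -> parent. ASSUMPTION: inferred from the slide's labels.
PARENT = {
    "p1": "r", "s2": "p1", "s3": "p1",
    "f4": "s2", "f5": "s2", "f6": "s3", "f7": "s3",
    "e8": "f4", "c9": "f4", "e10": "f5", "c11": "f5",
    "c12": "f6", "c13": "f7",
}
LOSSY = {
    "r": "R", "p1": "P", "s2": "S", "s3": "S",
    "f4": "F1", "f5": "F1", "f6": "F2", "f7": "F2",
    "e8": "E", "e10": "E",
    "c9": "C", "c11": "C", "c12": "C", "c13": "C",
}
STABLE = dict(LOSSY, s2="S1", s3="S2")

def squared_error(parent, cluster):
    """Sum over elements of the squared distance to their node's centroid."""
    vec = {e: Counter() for e in cluster}
    for child, par in parent.items():
        vec[par][cluster[child]] += 1
    members = defaultdict(list)
    for elem, node in cluster.items():
        members[node].append(elem)
    err = 0.0
    for elems in members.values():
        targets = set().union(*(vec[e] for e in elems))
        for v in targets:
            # The mean over the node's elements is exactly the edge count count[u, v].
            mean = sum(vec[e][v] for e in elems) / len(elems)
            err += sum((vec[e][v] - mean) ** 2 for e in elems)
    return err

print(squared_error(PARENT, LOSSY))   # both sections share one node
print(squared_error(PARENT, STABLE))  # sections kept apart
```

Note that this measure depends only on the document and the grouping, not on any query workload, which is the property the slide highlights.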

  16. Computing Approximate Answers
     • Compute TreeSketch of approximate answer
     • Accuracy depends on quality of clustering
     [Figure: Query (q0; q1: //section; q2: .//caption; q3: .//equation), TreeSketch with nodes R(1), P(1), S(2), F(2), F(2), E(2), C(4), and Approximate Nesting Tree with nodes R, S, C, E; one result edge is annotated 1+1=2]
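The "1+1=2" annotation can be read as summing, over all synopsis paths, the product of edge counts along each path. The sketch below estimates the expected number of descendants with a given tag per element of a synopsis node. It is an illustration under that reading and assumes an acyclic synopsis; the edge counts are copied from the TreeSketch figure, while `F1`/`F2`, `NODE_TAG`, and `expected_descendants` are hypothetical names.

```python
# Edge counts of the TreeSketch; F1 and F2 are hypothetical names for the two F(2) nodes.
COUNTS = {
    ("R", "P"): 1.0,
    ("P", "S"): 2.0,
    ("S", "F1"): 1.0,
    ("S", "F2"): 1.0,
    ("F1", "E"): 1.0,
    ("F1", "C"): 1.0,
    ("F2", "C"): 1.0,
}
NODE_TAG = {
    "R": "root", "P": "paper", "S": "section",
    "F1": "figure", "F2": "figure", "E": "equation", "C": "caption",
}

def expected_descendants(u, tag):
    """Expected number of descendants with the given tag per element of node u.

    ASSUMPTION: the synopsis graph is acyclic, so the recursion terminates.
    """
    total = 0.0
    for (src, dst), count in COUNTS.items():
        if src != u:
            continue
        hit = 1.0 if NODE_TAG[dst] == tag else 0.0
        total += count * (hit + expected_descendants(dst, tag))
    return total

captions_per_section = expected_descendants("S", "caption")    # 1*1 + 1*1
equations_per_section = expected_descendants("S", "equation")  # 1*1
print(captions_per_section, equations_per_section)
```

The caption estimate is exact for both sections in the toy reading of the document, whereas the equation estimate is only an average, which illustrates why accuracy depends on the quality of the clustering.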

  17. TreeSketch Construction
     • Given an XML tree T, build a TreeSketch of size B
     • Difficult clustering problem
       • Space dimensionality depends on the clustering itself
     • Construction based on bottom-up clustering
       • Compress perfect synopsis by merging clusters
       • Best merge determined by marginal gains
       • Heuristic to reduce number of candidate merges
     [Figure: sequence of synopses from “Perfect” down to “Space Budget”]
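A brute-force sketch of the bottom-up idea: start from a count-stable grouping and repeatedly apply the same-tag merge that yields the lowest squared error, until the number of synopsis nodes fits the budget. This is a simplification, not the authors' algorithm: it measures the budget in nodes rather than bytes, scores a merge by resulting error rather than by marginal gain, and tries every candidate pair. The tree shape and all names are assumptions.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Toy document as child -> parent. ASSUMPTION: inferred from the slide's labels.
PARENT = {
    "p1": "r", "s2": "p1", "s3": "p1",
    "f4": "s2", "f5": "s2", "f6": "s3", "f7": "s3",
    "e8": "f4", "c9": "f4", "e10": "f5", "c11": "f5",
    "c12": "f6", "c13": "f7",
}
# Count-stable starting point (hypothetical node names; first letter = tag).
STABLE = {
    "r": "R", "p1": "P", "s2": "S1", "s3": "S2",
    "f4": "F1", "f5": "F1", "f6": "F2", "f7": "F2",
    "e8": "E", "e10": "E",
    "c9": "C", "c11": "C", "c12": "C", "c13": "C",
}

def squared_error(parent, cluster):
    """Sum of squared distances between element child-count vectors and centroids."""
    vec = {e: Counter() for e in cluster}
    for child, par in parent.items():
        vec[par][cluster[child]] += 1
    members = defaultdict(list)
    for elem, node in cluster.items():
        members[node].append(elem)
    err = 0.0
    for elems in members.values():
        for v in set().union(*(vec[e] for e in elems)):
            mean = sum(vec[e][v] for e in elems) / len(elems)
            err += sum((vec[e][v] - mean) ** 2 for e in elems)
    return err

def greedy_merge(parent, cluster, budget):
    """Merge same-tag nodes greedily until at most `budget` nodes remain."""
    cluster = dict(cluster)
    while len(set(cluster.values())) > budget:
        best = None
        for a, b in combinations(sorted(set(cluster.values())), 2):
            if a[0] != b[0]:  # only nodes with the same tag may merge
                continue
            merged = {e: (a if n == b else n) for e, n in cluster.items()}
            err = squared_error(parent, merged)
            if best is None or err < best[0]:
                best = (err, merged)
        if best is None:  # no legal merge left
            break
        cluster = best[1]
    return cluster

seven = greedy_merge(PARENT, STABLE, 7)
six = greedy_merge(PARENT, STABLE, 6)
print(sorted(set(six.values())), squared_error(PARENT, six))
```

On the toy tree the first merge joins the two figure nodes, since that costs less error than joining the two sections; the second merge then joins the sections at no extra cost.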

  18. Element Simulation Distance

  19. Error of Approximation
     • Error <--> Distance between R’ and R
     • Popular metric: Tree-edit distance
       • Min-cost sequence of operations that transform R’ to R
       • Measures syntactic differences between R and R’
       • Not intuitive for approximate answers!
     [Figure: trees T1, T, and T2 with differing child counts; one comparison is annotated “Same counts, Opposite Trait”, the other “Different counts, Similar Trait”]

  20. Element Simulation Distance
     • Capture approximate similarity between R and R’
     • u simulates v: u and v have identical structure
     • ESD(u,v): “degree” of simulation between u,v
       • How well the structure of u matches the structure of v
       • Modeled as the distance between multi-sets
     • Efficient computation using perfect summaries
     [Figure: trees T and T2 compared by “Recursive application of ESD”]

  21. Experimental Results

  22. Methodology
     • Data Sets: XMark, DBLP, IMDB, SwissProt
     • Workload: 1000 random twig queries
     • Evaluation metrics:
       • Average ESD for approximate answers
       • Mean absolute relative error for selectivity estimation
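For reference, mean absolute relative error over a workload is the average of |estimate - true| / true across queries. The sketch below is a plain implementation of that formula; the function name and the sample numbers are made up for illustration and are not results from the study. It assumes every true selectivity is positive.

```python
def mean_abs_relative_error(true_values, estimates):
    """Average of |estimate - true| / true over all queries.

    ASSUMPTION: every true value is positive (no guard against zero results).
    """
    if len(true_values) != len(estimates):
        raise ValueError("need one estimate per true value")
    errors = [abs(est - true) / true for true, est in zip(true_values, estimates)]
    return sum(errors) / len(errors)

# Hypothetical numbers, chosen only to exercise the function.
mare = mean_abs_relative_error([100, 50], [110, 40])
print(f"{mare:.2f}")
```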

  23. Approximate Answers - IMDB • IMDB (~102K Elements) • Avg. Result Size: 3,477 tuples [Plot not recoverable]

  24. Selectivity Estimation - SwissProt • SwissProt (~182K Elements) • Avg. Result Size: 104,592 tuples [Plot not recoverable]

  25. Selectivity Estimation - ALL

  26. Conclusions
     • Approximate query answering for XML databases
     • TreeSketch Synopses
       • Structural summaries for tree-structured XML
       • Approximate answers for twig queries
       • Model: Graph Synopsis + Edge-counts
       • Efficient processing and construction
     • Element Simulation Distance
       • Capture approximate similarity between XML trees
     • Experimental Results
       • High accuracy for low space budgets
       • Efficient construction

  27. Questions?

  28. TreeSketch Model (2/2)
     • Average number of children <--> Edge count
     [Figure: XML Document and TreeSketch with nodes R, P(1), S(2), F(2), F(2), E(2), C(4) and edge counts 1, 2, 1, 1, 1, 1, 1; an inset with #C and #E axes]

  29. XML Document
     • Legend: p: paper, s: section, c: caption, t: title, f: figure, e: equation
     [Figure: document tree rooted at r, with paper p1 and sections s2 and s3 above figure, equation, and caption elements]

  30. TreeSketch Synopsis
     • Augment graph-synopsis with edge counts
     • count[u,v]: mean number of children in v per element in u
     [Figure: XML Document and TreeSketch with nodes R(1), P(1), S(2), F(4), E(2), C(4) and edge counts 1, 2, 2, 1, 0.5; an inset with a #F axis]

  31. Depth-Guided Merging
     • Key observation: Two elements have similar structure if their children have similar structure
     • Bottom-up merging, based on depth
       • Depth: distance from the leaves of the tree
       • Build a pool of candidate merges by increasing depth
       • Replenish the pool when it falls below a given threshold
     • Reduced construction time, accurate synopses
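A sketch of how a depth-ordered candidate pool could be built. It assumes that depth means the longest distance to a leaf (the slide does not say whether shortest or longest is meant), that a synopsis node takes the maximum depth of its elements, and that only same-tag nodes are merge candidates. The tree shape and the names `heights` and `candidate_pool` are hypothetical.

```python
from itertools import combinations

# Toy document as parent -> children lists. ASSUMPTION: inferred from the slides' labels.
CHILDREN = {
    "r": ["p1"],
    "p1": ["s2", "s3"],
    "s2": ["f4", "f5"],
    "s3": ["f6", "f7"],
    "f4": ["e8", "c9"],
    "f5": ["e10", "c11"],
    "f6": ["c12"],
    "f7": ["c13"],
}
# Count-stable grouping (hypothetical node names; first letter = tag).
CLUSTER = {
    "r": "R", "p1": "P", "s2": "S1", "s3": "S2",
    "f4": "F1", "f5": "F1", "f6": "F2", "f7": "F2",
    "e8": "E", "e10": "E",
    "c9": "C", "c11": "C", "c12": "C", "c13": "C",
}

def heights(children, root):
    """Longest distance from each element down to a leaf (leaves have height 0)."""
    h = {}

    def visit(elem):
        h[elem] = 1 + max((visit(k) for k in children.get(elem, [])), default=-1)
        return h[elem]

    visit(root)
    return h

def candidate_pool(children, root, cluster, pool_size):
    """Return up to pool_size same-tag node pairs, shallowest pairs first."""
    h = heights(children, root)
    depth = {}
    for elem, node in cluster.items():
        depth[node] = max(depth.get(node, 0), h[elem])
    pairs = [
        (max(depth[a], depth[b]), a, b)
        for a, b in combinations(sorted(depth), 2)
        if a[0] == b[0]
    ]
    pairs.sort()
    return [(a, b) for _, a, b in pairs[:pool_size]]

print(candidate_pool(CHILDREN, "r", CLUSTER, 1))
print(candidate_pool(CHILDREN, "r", CLUSTER, 5))
```

A construction loop would draw merges from this pool and call `candidate_pool` again once the pool shrinks below a threshold, so that deeper nodes are only considered after their children have had a chance to merge.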

  32. Depth-Guided Merging
     • Observation: Two elements have similar structure if their children have similar structure
     • Heuristic: If a merge of two clusters is good, then merges of the child clusters are likely to have been good as well
     • Bottom-up merging strategy
     • Savings in construction time, accurate synopses
