
Benefits of Path Summaries in an XML Query Optimizer Supporting Multiple Access Methods



Presentation Transcript


  1. Benefits of Path Summaries in an XML Query Optimizer Supporting Multiple Access Methods
     Attila Barta, Mariano P. Consens, Alberto O. Mendelzon
     University of Toronto

  2. Motivation
     • Growing importance of XML query processing
     • Plethora of implementations:
       • native XML DBMSs (e.g. Timber, Niagara, BEA/XQRL, Natix, ToX)
       • XQuery systems (e.g. Galax, IPSI-SQ, XSM, MS-XQuery)
       • XPath processors (e.g. XSQ, SPEX, XPush, Xalan, PathStack)
       • publish/subscribe (e.g. Y-Filter, IndexFilter, WebFilter, NiagaraCQ)
       • twig query processors (e.g. TwigStack, PRIX, TurboXPath)
     • Our contribution:
       • Apply novel cost-based optimization techniques that exploit path summaries to XML query processing
     Benefits of Path Summaries in an XML Query Optimizer Supporting Multiple Access Methods

  3. Example XQuery and Pattern Tree
       for $x in document("catalog.xml")//item,
           $y in document("parts.xml")//part,
           $z in document("supplier.xml")//supplier
       where $x/part_no = $y/part_no
         and $z/supplier_no = $x/supplier_no
         and $z/city = "Toronto"
         and $z/province = "Ontario"
       return <result> {$x/part_no} {$x/price} {$y/description} </result>
     [Figure: Pattern Tree (PT) or Twig Query]

  4. Example XQuery Processing
       for $x in document("catalog.xml")//item,
           $y in document("parts.xml")//part,
           $z in document("supplier.xml")//supplier
       where $x/part_no = $y/part_no
         and $z/supplier_no = $x/supplier_no
         and $z/city = "Toronto"
         and $z/province = "Ontario"
       return <result> {$x/part_no} {$x/price} {$y/description} </result>
     [Figure labels: $x = $y, $z = $x]

  5. Contributions
     • Holistic Path Summary Pruning
     • Access Order Selection
     • Path Summaries as Catalogs

  6. Outline
     • Introduction
     • Path Summaries in the Optimizer
     • Holistic Path Summary Pruning
     • Experimental Evaluation
     • Access Order Selection
     • Experimental Evaluation
     • Future Work

  7. ToXin Path Summary
     • For each distinct path in the document there is a path in ToXin: it is an exact path summary that reflects the structure of the document [RM01]
     • Initially proposed as a back-end: it can answer any pattern query
     Example document:
       <suppliers>
         <supplier>
           <supplier_no> 1001 </supplier_no>
           <name> Magna </name>
           <city> Toronto </city>
           <province> ON </province>
         </supplier>
         <supplier>
           <supplier_no> 1002 </supplier_no>
           <name> MEC </name>
           <city> Vancouver </city>
           <province> BC </province>
         </supplier>
       </suppliers>
     [Figure labels: TT, TI]
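
The idea of an exact path summary can be illustrated with a minimal Python sketch. It is not the ToXin implementation: the function name `build_path_summary` and the dictionary representation are assumptions made for illustration. Each distinct root-to-element label path becomes one summary entry, and the element instances reached by that path are grouped under it.

```python
import xml.etree.ElementTree as ET

# The example document from the slide.
DOC = """<suppliers>
  <supplier>
    <supplier_no> 1001 </supplier_no> <name> Magna </name>
    <city> Toronto </city> <province> ON </province>
  </supplier>
  <supplier>
    <supplier_no> 1002 </supplier_no> <name> MEC </name>
    <city> Vancouver </city> <province> BC </province>
  </supplier>
</suppliers>"""


def build_path_summary(root):
    """Map every distinct label path to the list of element instances on it.

    Hypothetical helper: the keys play the role of the summary's structure,
    the value lists play the role of the node instances.
    """
    summary = {}

    def visit(elem, prefix):
        path = prefix + "/" + elem.tag
        summary.setdefault(path, []).append(elem)
        for child in elem:
            visit(child, path)

    visit(root, "")
    return summary


summary = build_path_summary(ET.fromstring(DOC))
for path, instances in summary.items():
    print(path, len(instances))
```

The summary is built in a single traversal of the document, and its size depends on the number of distinct paths rather than on the number of elements.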

  8. Augmented ToXin Trees
     • System catalog: schema + data statistics
     • DTD and XML Schema are used for validation; they do not describe the actual schema of the instances
     • ToXin is an exact path summary → actual schema
     • ToXin augmented with statistics → system catalog
     • ToXop statistics:
       • NCARD: number of instances of an element
       • ICARD: number of distinct values of an element
       • Fan-out: average number of sub-element instances for each sub-element
     • Augmented ToXin Tree:
       • existing schema (TT) + statistics + node instances (TI)
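
As a rough illustration of the three statistics, the sketch below computes them per label path in one pass over a small document. It rests on stated assumptions: fan-out is interpreted here as NCARD of a path divided by NCARD of its parent path, ICARD counts distinct non-empty text values, and the name `collect_statistics` is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

DOC = """<suppliers>
  <supplier><supplier_no>1001</supplier_no><city>Toronto</city></supplier>
  <supplier><supplier_no>1002</supplier_no><city>Vancouver</city></supplier>
</suppliers>"""


def collect_statistics(root):
    """Return {path: {"NCARD", "ICARD", "fan_out"}} for every label path."""
    ncard = defaultdict(int)   # number of element instances per path
    values = defaultdict(set)  # distinct text values per path

    def visit(elem, prefix):
        path = prefix + "/" + elem.tag
        ncard[path] += 1
        text = (elem.text or "").strip()
        if text:
            values[path].add(text)
        for child in elem:
            visit(child, path)

    visit(root, "")
    counts = dict(ncard)
    stats = {}
    for path, n in counts.items():
        parent = path.rsplit("/", 1)[0]
        # Assumed definition: average instances of this path per parent instance.
        fan_out = n / counts[parent] if parent else None
        stats[path] = {"NCARD": n, "ICARD": len(values[path]), "fan_out": fan_out}
    return stats


stats = collect_statistics(ET.fromstring(DOC))
for path, s in stats.items():
    print(path, s)
```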

  9. Outline
     • Introduction
     • Path Summaries in the Optimizer
     • Holistic Path Summary Pruning
     • Experimental Evaluation
     • Access Order Selection
     • Experimental Evaluation
     • Future Work

  10. Holistic Path Summary Pruning
      • All path-summary-based query processors perform some path summary pruning specific to the processor
      • Idea: separate path pruning from the processor and the encoding
      • Holistic Path Summary Pruning (HPSP):
        • evaluate the pattern tree on the actual schema (TT tree)
        • compute the twig query using an appropriate algorithm for the particular element encoding
      • TwigStackScan is one possible HPSP-based access method
      [Figure labels: TT, TI]
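
The first HPSP step, evaluating the pattern tree on the summary tree, might be sketched as follows. This is an illustrative sketch under assumptions, not the authors' algorithm: the classes `TTNode` and `PNode`, the function names, and the small XMark-like summary are all hypothetical, and pattern nodes are keyed by label, so a pattern with two nodes of the same label would need distinct keys.

```python
from dataclasses import dataclass, field


@dataclass
class TTNode:  # one node of the summary tree (hypothetical representation)
    node_id: int
    label: str
    children: list = field(default_factory=list)


@dataclass
class PNode:  # one node of the pattern tree
    label: str
    axis: str = "child"  # edge from the parent: "child" or "descendant"
    children: list = field(default_factory=list)


def descendants(t):
    for c in t.children:
        yield c
        yield from descendants(c)


def candidates(pc, t):
    return t.children if pc.axis == "child" else descendants(t)


def embeds(p, t):
    """True if the pattern subtree rooted at p can be embedded at summary node t."""
    return p.label == t.label and all(
        any(embeds(pc, tc) for tc in candidates(pc, t)) for pc in p.children
    )


def prune(p, t, keep):
    """Record the summary node ids that take part in an embedding rooted at t."""
    if not embeds(p, t):
        return
    keep.setdefault(p.label, set()).add(t.node_id)
    for pc in p.children:
        for tc in candidates(pc, t):
            prune(pc, tc, keep)


def hpsp(pattern_root, tt_root):
    keep = {}
    starts = [tt_root]
    if pattern_root.axis == "descendant":
        starts += list(descendants(tt_root))
    for t in starts:
        prune(pattern_root, t, keep)
    return keep


# Simplified, made-up summary: @id occurs under both person and category.
tt = TTNode(0, "site", [
    TTNode(1, "people", [TTNode(2, "person", [TTNode(3, "@id"), TTNode(4, "name")])]),
    TTNode(5, "categories", [TTNode(6, "category", [TTNode(7, "@id"), TTNode(8, "name")])]),
])
# Pattern for //site/people/person[@id]/name
pattern = PNode("site", "descendant", [
    PNode("people", children=[PNode("person", children=[PNode("@id"), PNode("name")])]),
])
keep = hpsp(pattern, tt)
print(keep)
```

Only summary nodes 3 and 4 survive for `@id` and `name`; the instances under `category` never need to be read by the twig algorithm that runs afterwards.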

  11. Stack Algorithms
      • Stack algorithms: PathStack, TwigStack, TwigStackXB [BSK02]
      • Use region algebra encoding:
          Telement: [DocID, Term, StartPos, EndPos, LevelNum] (elements)
          Ttext: [DocID, Term, TextValue, StartPos, LevelNum] (string values)
      • Build a stream (denoted T) for all elements having the same label, e.g. Tauthor encompasses all author elements in the document
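
A small sketch may help show why the region encoding is convenient: structural relationships reduce to interval comparisons. The code below assigns positions with a simple counter (one position per start tag, text value, and end tag), which is an assumption about the numbering scheme, and it emits only element entries, not the Ttext stream. The names `ElementEntry`, `encode`, `is_ancestor`, and `is_parent` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict, namedtuple

ElementEntry = namedtuple("ElementEntry", "doc_id term start end level")

DOC = ("<suppliers><supplier><city>Toronto</city></supplier>"
       "<supplier><city>Vancouver</city></supplier></suppliers>")


def encode(root, doc_id=1):
    """Return one stream per label, each sorted by start position."""
    streams = defaultdict(list)
    counter = 0

    def visit(elem, level):
        nonlocal counter
        counter += 1                      # position of the start tag
        start = counter
        if (elem.text or "").strip():
            counter += 1                  # a text value occupies one position
        for child in elem:
            visit(child, level + 1)
        counter += 1                      # position of the end tag
        streams[elem.tag].append(ElementEntry(doc_id, elem.tag, start, counter, level))

    visit(root, 1)
    for stream in streams.values():
        stream.sort(key=lambda e: e.start)
    return streams


def is_ancestor(a, d):
    """a contains d if d's region lies strictly inside a's region."""
    return a.doc_id == d.doc_id and a.start < d.start and d.end < a.end


def is_parent(a, d):
    return is_ancestor(a, d) and d.level == a.level + 1


streams = encode(ET.fromstring(DOC))
for term, stream in streams.items():
    print(term, [(e.start, e.end, e.level) for e in stream])
```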

  12. TwigStackScan Access Method
      • Extended region algebra encoding:
          Telement: [DocID, Term, StartPos, EndPos, LevelNum, TTnodeID] (elements)
          Ttext: [DocID, Term, TextValue, StartPos, LevelNum, TTnodeID] (string values)
      • TwigStackScan = HPSP + TwigStack
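
One way to read the extra TTnodeID field is that it lets the streams be filtered before the twig join runs. The sketch below shows only that filtering step, with made-up stream entries and a hypothetical `prune_stream` helper; the TwigStack join itself is not implemented here.

```python
from typing import NamedTuple


class ElementEntry(NamedTuple):
    doc_id: int
    term: str
    start: int
    end: int
    level: int
    tt_node_id: int  # summary node this element instance belongs to


def prune_stream(stream, allowed_tt_nodes):
    """Keep only entries whose summary node survived path summary pruning."""
    return [e for e in stream if e.tt_node_id in allowed_tt_nodes]


# Made-up "@id" stream: entries under person (summary node 3)
# and under category (summary node 7).
id_stream = [
    ElementEntry(1, "@id", 4, 5, 4, 3),
    ElementEntry(1, "@id", 12, 13, 4, 3),
    ElementEntry(1, "@id", 40, 41, 4, 7),
    ElementEntry(1, "@id", 52, 53, 4, 7),
]

# Suppose pruning kept only summary node 3 for the "@id" pattern node.
pruned = prune_stream(id_stream, {3})
print(len(id_stream), "->", len(pruned))
```

The pruned stream stays sorted by start position, so it can be passed to a stack-based twig algorithm unchanged.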

  13. Experimental Datasets
      • DBLP, SWISSPROT: University of Washington XML Repository
        • Both are large (millions of nodes) and shallow
        • DBLP: regular in structure (5 structures that repeat)
        • SWISSPROT: irregular in structure (many one-of-a-kind structures)
      • XMARK:
        • simulates an on-line auction site
        • xmlgen scaling factors from 0.01 (0.6 MB) to 2.8 (165.9 MB)
        • removed the content of 'Text' elements → 30% reduction in size

  14. TwigStackScan Scale-Up
      [Figures: Q7 scale-up with (XMARK) file size; TwigStackScan speedup with (XMARK) file size]
      • Q7: //site/people/person[@id = "person0"]/name – 1 twig match (@id occurs in person, category, item, open_auction)
      • Q8: //site/people/person/name – 38,760 twig matches
      • When applicable, TwigStackScan yields improvements of one order of magnitude

  15. TwigStackScan vs. TwigStack
      • High selectivity twig queries (Q1, Q4, Q6, Q7): speedup 0.97 to 5.87
      • Low selectivity twig queries (Q8, Q11): speedup 1.43 to 1.78
      • Scattered twig matches (Q2, Q3, Q5, Q9), grouped twig matches (Q10): speedup 8.96 to 75.38

  16. Outline
      • Introduction
      • Path Summaries in the Optimizer
      • Holistic Path Summary Pruning
      • Experimental Evaluation
      • Access Order Selection
      • Experimental Evaluation
      • Future Work

  17. Order Selection in Pattern Trees
      • Order Selection: the order in which to evaluate the branches
      • Direction Selection: how to evaluate a branch, top-down or bottom-up
      • Choosing between top-down and bottom-up is computationally very expensive: in the LORE optimizer [McW99], a document with 7 levels yields millions of possible plans

  18. ToXinScan Access Method
      • Relational optimizers compute a GOOD plan, not THE BEST plan
      • Similarly, we use data statistics and heuristics to compute a good plan
      • The access-order selection strategy:
        • Sort the children according to parent selectivity
        • Evaluate the path with the highest selectivity using a bottom-up evaluation
        • Evaluate all other paths, in selectivity order, using a top-down evaluation
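
The three-step strategy above can be sketched as a small planning function. This is a hedged illustration, not the ToXinScan optimizer: "highest selectivity" is taken to mean the branch with the fewest estimated matches, and the `Branch` class, the function `plan_access_order`, and the estimates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Branch:
    path: str
    estimated_matches: int  # made-up estimate, e.g. derived from summary statistics


def plan_access_order(branches):
    """Return (path, direction) pairs in evaluation order.

    The most selective branch (fewest estimated matches) is evaluated
    bottom-up; the rest follow top-down, in selectivity order.
    """
    ordered = sorted(branches, key=lambda b: b.estimated_matches)
    return [
        (b.path, "bottom-up" if i == 0 else "top-down")
        for i, b in enumerate(ordered)
    ]


# Branches below a supplier pattern node, with illustrative estimates.
branches = [
    Branch("supplier_no", 2000),
    Branch('province = "Ontario"', 400),
    Branch('city = "Toronto"', 50),
]
plan = plan_access_order(branches)
for path, direction in plan:
    print(direction, path)
```

Because the plan comes from one sort over the branches, its cost grows with the number of branches instead of with the number of possible direction assignments.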

  19. ToXinScan Scale-Up
      [Figure: speedup of ToXinScan vs. TwigStack with (XMARK) file size]
      • Q8: //site/people/person[@id = "person0"]/name – 1 twig match
      • Q9: //site/people/person/name – 38,760 twig matches
      • Q10: //regions/samerica/item[./location = "United States" AND ./@id AND ./quantity AND ./payment]/name – 8 twig matches
      • Improvements of two orders of magnitude over TwigStack

  20. ToXinScan vs. TwigStack
      • High selectivity twig queries (Q3, Q4, Q5, Q8): speedup 2.16 to 9.32
      • Grouped twig matches (Q11, Q12, Q13): speedup 12.97 to 28.80
      • Low selectivity (Q2, Q9, Q10, Q14), scattered twig matches (Q1, Q6, Q7): speedup 48.31 to 122.44

  21. ToXinScan vs. Heavier Indexes
      • Pattern indexes (such as PRIX [RM04], ViST [WPF+03]) are the best twig-query processors
      • Indexes are expensive to build (three passes over the document) and require extensive space
        • ViST uses O(SH) space, where S is the number of sequences and H is the height of the tree
      • Indexes outperform TwigStack by two orders of magnitude
      • Good news:
        • using path summaries and the presented optimization strategy, we achieve the same performance improvements as node indexes
        • path summaries are inexpensive to build (one pass over the document)

  22. Outline
      • Introduction
      • Path Summaries in the Optimizer
      • Holistic Path Summary Pruning
      • Experimental Evaluation
      • Access Order Selection
      • Experimental Evaluation
      • Future Work

  23. Future Work
      • Generalize based on the strategy derived from the TwigStackScan access method
      • Holistic Path Summary Pruning (HPSP) can be used in conjunction with any twig query evaluation method
      • Can be used with path summaries other than ToXin
      • ToXinScan
      • Add a generalized cost model for access methods
      • Enhance the XML statistics used
      • Propose benchmarks for XML access methods

  24. Thank you for your attention!
      Attila Barta, Mariano P. Consens, Alberto O. Mendelzon
      { atibarta, consens, mendel }@cs.toronto.edu
      University of Toronto

  25. ToXinScan vs. PRIX
      • Good news: node indexes (e.g. PRIX) are computationally expensive to build (three passes over the document), while path summaries are inexpensive to build (one pass over the document)
      [RMo04] Praveen Rao, Bongki Moon, “PRIX: Indexing and Querying XML Using Prufer Sequences”, Proceedings of the 2004 International Conference on Data Engineering, Boston, MA, 2004
      [RMo03] Praveen Rao, Bongki Moon, “PRIX: Indexing and Querying XML Using Prufer Sequences”, Technical Report TR-03-06, Univ. of Arizona, Tucson, 2003
