Master Thesis Presentation 22nd April 2004 Venkatesh Raghavan

VAMANA A High Performance, Scalable and Cost Driven XPath Engine Master Thesis Presentation 22nd April 2004 Venkatesh Raghavan Advisor: Prof. Elke Rundensteiner Reader : Prof. Micha Hofri

Outline • Motivation • Related Work • Background for VAMANA Approach • Our Physical Algebra • Query Execution Engine • Query Optimization • Experimental Evaluation • Conclusions

Motivation Many applications are migrating to native XML databases. • Need for an XML query engine • High Performance • Support queries that emphasize the structural semantics of XML query languages. • Efficient querying engine and database management system tailored for XML data. • Scalable • To support large XML documents. • Support all 13 XPath axes. • Cost Based • Schema independent cost model provides dynamically calculated heuristics. • Intelligent cost-based transformations, further improving performance.

Outline • Motivation • Related Work • Relational Solutions • DOM Solutions • Current Index Based Solutions • Background for VAMANA Approach • Our Physical Algebra • Query Execution Engine • Query Optimization • Experimental Evaluation • Conclusions

Relational Solution • Mature data management tools • Query processing • Crash recovery • Concurrency control • Shredding XML documents • XPeranto [6] and Rainbow [7] • Many mapping algorithms • From XML Schema to Relations: A Cost-based Approach to XML Storage [24] • Workload based query mapping algorithm

Flip-side • Data model mismatch • Tables vs. XML semi-structured data model • Semantic mismatch • SQL vs. XQuery • Data fragmentation • Increases query execution cost (more joins) • High update cost • Overhead • Relational mapping adds overhead • Managing data • Handling order.

DOM Solution • The W3C Document Object Model is a language-independent API for accessing the various parts of an XML document. • Traditional top-down tree traversal. • Disadvantages • Very main-memory intensive. • On average 4-5 times the file size [11]. • Most DOM-based engines do not support all XPath axes. • Even if they do: imagine Pixar rendering “Finding Nemo” on a Windows machine. • Requires complex recursive traversal even for a few XPath axes.

… • Galax [9] (developed by Bell and AT&T labs) • Does not support all XPath axes. • Performs very poorly against large XML documents • Non-cost driven logical level optimization • Jaxen [22] • Java API to support different XML APIs (JDOM, DOM, ElectricXML, dom4j) • Can handle documents having file sizes 10Mb • Intel Celeron PC with 512MB of RAM • IPSI, Pathan, etc.

Current Index Solutions • Apache Xindice [13] • User-defined pattern indexes • Capable of indexing small to medium size documents < 5Mb • Natix [23] • The XML data tree is partitioned into small sub-trees and each sub-tree is stored in a data page. • ToX [25] (University of Toronto) • The ToX storage engine stores the XML documents in either a relational database or an object oriented database.

Contd. • TIMBER [14] (University of Michigan, University of British Columbia and AT&T labs) • TAX Algebra • Pattern trees • Query execution • Structural joins • Query Optimization • Estimating costs of all promising sets of evaluation plans. • The problem is the exponential increase in possibilities for complex queries. • They claim to select only from an elite set of possible evaluation plans. • Cost Estimation • Primary Histograms • Expensive to maintain for frequent updates • Counting Twigs – Frequently occurring correlated sub-path queries.

Outline • Motivation • Related Work • Background for VAMANA Approach • Multi-Axis Storage Structure • Running Example • Our Physical Algebra • Query Execution Engine • Query Optimization • Experimental Evaluation • Conclusions

VAMANA architecture (figure) • Loading: XML Documents → Loader → MASS Storage Structure • Querying: XPath Expression → XPath Compiler → Default Query Plan → Optimizer (Cost Estimator, Transformation Library) → Optimized Query Plan → Query Execution Engine → Resultant Tuples • The Query Execution Engine issues axis or value based queries against the MASS Storage Structure.

Multi-Axis Storage Structure • Efficient storage and access structure for XML documents. • XPath axes, node tests, range position predicates. • Provides statistics for costing. • Number of tuples per page. • Count. • Fast Lexicographical keys • Four Clusters + Value-based index • Clustering and axes • CL1 and CL3: self, child, following-sibling, preceding-sibling, attribute, namespace • CL2 and CL4: self, parent, ancestor, ancestor-or-self, descendant, descendant-or-self, preceding, following

Running Examples E.g. 1: descendant::name/parent::*/self::person/address <person id="person0"> <name>Krishna Merle</name> <emailaddress>mailto:Merle@mitre.org</emailaddress> … E.g. 2: //province[text() = “Vermont” ]/ancestor::person • <person id="person41"> • <name>Muneo Yemenis</name> • ... • <phone>+0 (807) 6372999</phone> • … • <province>Vermont</province> • …

Outline • Motivation • Related Work • Background for VAMANA Approach • Our Physical Algebra • VAMANA Approach • Operators • Context Node • Query Execution Engine • Query Optimization • Experimental Evaluation • Conclusions

VAMANA Approach • Data-flow style of querying. • Data flows • Control flows • Pipelined-iterative fashion • Avoids temporary copies of intermediate results whenever possible. • Facilitates the reduction of I/O operations • All tuples for a particular context node are clustered together. • Sequential traversal over node sets. • Hence minimal I/O and key comparisons. • Used in most commercial relational database systems. • Structural joins • Best case: when the maximum number of tuples take part in join pairs. • Worst case: when few tuples take part in join pairs.

Operators • An operator is written op id cond (for example Φ2 child::phone), where op is the symbol of the operator type, cond represents the set of conditions applied by the operator, and id is an identifier that uniquely identifies the operator in a given plan P. • Root Operator R • Step Operator Φ • Literal Operator L • Node Base Exist Operator ξ • Binary Operator β • Join Operator J

Context Node • “..context node is defined as the current node being processed…” – XPath (1.0 & 2.0) • We extend the idea: the context node of any given VAMANA operator op id cond uniquely defines the position of an XML node in the index structure. The position is obtained from the structural path information encoded in the context node.

Concepts (figure): query plan for //a/b[/c] • Context side: operator /b has the context child //a. • Predicate side: operator /b has, through EXIST, the predicate child /c.

Dynamic Context Set (figure): operators root, /b and //a over the MASS index (node labels shown: a,a,a,a,a,b)

XPath Compiler (figure): default query plans • Q1: descendant::name/parent::*/self::person/address • Q2: //province[text() = “Vermont”]/ancestor::person • a. Q1: R1, Φ2 child::phone, Φ3 self::person, Φ4 parent::*, Φ5 descendant::name • b. Q2: R1, Φ2 ancestor::person, Φ6 //::province, β3 EQ, Φ5 child::text, L4 ‘Vermont’

Outline • Motivation • Related Work • Background for VAMANA Approach • Our Physical Algebra • Query Execution Engine • Query Optimization • Experimental Evaluation • Conclusions

Execution States • An operator can be in one of three states: • INITIAL • It has not yet started fetching tuples. • FETCHING • The operator has not yet exhausted all nodes from MASS that meet its condition(s). • The operator is waiting for its context child to return tuples. • The operator is waiting for the predicate condition to process the nodes. • OUT_OF_NODE • The operator has exhausted all the nodes from MASS that satisfy the condition(s) specified by the operator. • The context child has no further tuples to return to the operator.
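The three states can be illustrated with a small pull-based iterator. This is only a sketch: the class `StepOperator`, its `next_tuple` method and the list standing in for MASS (or for a context child) are hypothetical names, not the thesis implementation.

```python
from enum import Enum, auto
from typing import Iterable, Iterator, Optional


class State(Enum):
    INITIAL = auto()
    FETCHING = auto()
    OUT_OF_NODE = auto()


class StepOperator:
    """Hypothetical pull-based operator; `source` stands in for MASS or a context child."""

    def __init__(self, source: Iterable[str]) -> None:
        self._source: Iterator[str] = iter(source)
        self.state = State.INITIAL  # nothing has been fetched yet

    def next_tuple(self) -> Optional[str]:
        """Return the next node key, or None once the source is exhausted."""
        if self.state is State.OUT_OF_NODE:
            return None
        self.state = State.FETCHING
        try:
            return next(self._source)
        except StopIteration:
            self.state = State.OUT_OF_NODE
            return None


# Record the state after each request made by a parent operator.
op = StepOperator(["a.d.y.a", "a.d.y.b"])
states = [op.state]
while op.next_tuple() is not None:
    states.append(op.state)
states.append(op.state)
```

Because each request pulls exactly one tuple, no intermediate result list is materialized, which matches the pipelined-iterative style described for the engine.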

Step 1: Setting the context for the leaf operators on the context path (figure) • Query: //province[text()=“Vermont”]/ancestor::person • Plan: R1, Φ2 ancestor::person, Φ6 //::province, β3 EQ, L4 ‘Vermont’, Φ5 child::text • State legend: INITIAL, FETCHING, OUT_OF_NODE

Step 2: Ask the root node for tuples (figure sequence) • R1 asks Φ2 ancestor::person, which asks its context child Φ6 //::province; Φ6 returns a.d.y.a. • Predicate on a.d.y.a: Φ5 child::text returns a.d.y.a.a “Massachusetts”; β3 EQ against L4 ‘Vermont’ does not match. • Φ6 moves on and returns a.d.y.b. • Predicate on a.d.y.b: Φ5 child::text returns a.d.y.b.a “Vermont”; β3 EQ against L4 ‘Vermont’ matches. • Φ2 ancestor::person receives a.d.y.b and returns a.d.y to R1.
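The walk-through for //province[text() = "Vermont"]/ancestor::person can be replayed on a toy index. The sketch below assumes dotted keys in which every ancestor key is a prefix of its descendants' keys. The dictionary `INDEX`, the node names attached to the keys a and a.d, and all function names are illustrative assumptions, not the MASS API.

```python
from typing import Iterator, Optional

# Hypothetical miniature index: dotted key -> (node name, text value).
INDEX: dict[str, tuple[str, Optional[str]]] = {
    "a": ("site", None),
    "a.d": ("people", None),
    "a.d.y": ("person", None),
    "a.d.y.a": ("province", None),
    "a.d.y.a.a": ("#text", "Massachusetts"),
    "a.d.y.b": ("province", None),
    "a.d.y.b.a": ("#text", "Vermont"),
}


def descendants_named(name: str) -> Iterator[str]:
    """//::name. Sorted key order equals document order for these one-letter components."""
    for key in sorted(INDEX):
        if INDEX[key][0] == name:
            yield key


def child_text(key: str) -> Iterator[str]:
    """child::text of the node stored under `key`."""
    prefix = key + "."
    for k in sorted(INDEX):
        is_child = k.startswith(prefix) and "." not in k[len(prefix):]
        if is_child and INDEX[k][0] == "#text":
            yield INDEX[k][1]


def ancestors_named(key: str, name: str) -> Iterator[str]:
    """ancestor::name, read directly off the key prefixes."""
    parts = key.split(".")
    for i in range(len(parts) - 1, 0, -1):
        ancestor = ".".join(parts[:i])
        if INDEX[ancestor][0] == name:
            yield ancestor


def q2(value: str) -> Iterator[str]:
    """Default plan: scan provinces, test the text predicate, then step to ancestor::person."""
    for province in descendants_named("province"):
        if any(text == value for text in child_text(province)):
            yield from ancestors_named(province, "person")


result = list(q2("Vermont"))
```

With this data, a.d.y.a fails the predicate, a.d.y.b passes it, and the only person ancestor returned is a.d.y, as in the slides.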

Outline • Motivation • Related Work • Background for VAMANA Approach • Our Physical Algebra • Query Execution Engine • Query Optimization • Query Clean-Up • Cost Model • Transformation • Experimental Evaluation • Conclusions

Optimization (figure) • Default Query Plan (P) → Clean Up → Query Plan (P′) → Cost Estimator → Query Plan (P″) + Heuristics (L(P″)) → Transformation → Transformed Query Plan (P t) → Optimal Query Plan (P opt)

Clean Up (figure) • a. Default Query Plan: Φ2 child::phone, Φ3 self::person, Φ4 parent::*, Φ5 descendant::name • b. Cleaned Query Plan: Φ2 child::phone, Φ3 self::person, Φ5 descendant::name

VAMANA Cost Model • The cost is usually calculated with respect to the root of the XML document or a node specified by the user. • Query costs are obtained from the actual data rather than a data dictionary and thus are always up to date. • Our cost model does not suffer the overhead of parsing the entire document, as is the case for histogram-based costing like StaTiX [14]. • Starting from the leaf operators, we propagate the cost upwards towards the root operator.

VAMANA Cost Model • COUNT(op id cond) • This heuristic is calculated only for step operators (Φ id axis::nodetest). It is the number of XML nodes in the underlying index structure that satisfy the node test of the step operator axis::nodetest. • MASS provides an API to efficiently gather the count of a particular node test in its storage structure. • TC(op id cond) • For a literal operator (L id value), the text count is the number of occurrences of a particular literal value in the index structure.

Contd.. IN(opi): the maximum number of tuples that the operator opi will receive in total from its context child. • Case 1: • For a leaf step operator on the context path of the query plan, the total number of tuples received is equal to the number of tuples available in the underlying index structure, i.e. IN(opi) = COUNT(opi). • Case 2: • For all non-leaf operators, IN(opi) = OUT(opj), where opj is the context child of opi. • Case 3: • For all leaf step operators on the predicate path of the query plan, the total number of tuples received is equal to the number of tuples received by its predicate operator.

Contd.. OUT(opi): the maximum number of tuples that the current operator opi returns. • Case 1: • A leaf step operator on the context path of the query plan returns all the tuples that occur in the underlying index structure with respect to the context of the leaf operator, i.e., OUT(opi) = COUNT(opi). • Case 2: • A literal operator returns the same value every time a request for tuples is received. To facilitate the optimization of literal operators by using a value index, we define its output as OUT(opi) = TC(opi). • Case 3: • For binary predicate operators that have value-based equivalence, OUT(opi) is calculated as Minimum(# tuples from parent operator, TC of the literal value).
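The IN and OUT cases that are spelled out in these slides can be written as small functions. This is a sketch under stated assumptions: the function names are hypothetical, and the inputs 1256 and 13 are taken from the cost annotations of Running Example 2, read here as the COUNT of //::province and the TC of 'Vermont'.

```python
def out_leaf_context_step(count: int) -> int:
    """OUT, Case 1: a leaf step on the context path returns every matching node."""
    return count


def in_non_leaf(out_of_context_child: int) -> int:
    """IN, Case 2: a non-leaf operator receives what its context child returns."""
    return out_of_context_child


def out_literal(text_count: int) -> int:
    """OUT, Case 2: a literal operator is costed by its text count TC."""
    return text_count


def out_binary_eq(tuples_from_parent: int, text_count: int) -> int:
    """OUT, Case 3: value-based equality keeps at most min(parent tuples, TC)."""
    return min(tuples_from_parent, text_count)


# Costs propagate from the leaf operators upwards.
province_out = out_leaf_context_step(1256)              # leaf step //::province
person_in = in_non_leaf(province_out)                   # ancestor::person above it
eq_out = out_binary_eq(province_out, out_literal(13))   # EQ predicate against 'Vermont'
```

Only the cases stated on the slides are covered; the OUT rule for non-leaf step operators is not modeled here.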

Contd.. • Non-leaf operators • Context path • Predicate path • Leaf operators • predicate path

Contd.. “Cost determined with respect to context node provider”

Example 1: • Φ2 child::address: COUNT = 1256, IN = 4825, OUT = 1256 • Φ3 parent::person: COUNT = 2550, IN = 4825

Example 2: • Φ3 parent::person: COUNT = 2550, IN = 4825, OUT = 4825 • Φ6 descendant::name: COUNT = 4825, OUT = 4825

Heuristics • Ratio = IN/OUT • The higher the ratio, the better the selectivity. • δi = scale0..1(IN/OUT), the ratio scaled to the range 0..1. • Inverted index of entries <scaled(IN/OUT), opi>.
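A minimal sketch of the ratio heuristic follows. The slides do not say how the ratio is scaled to 0..1, so min-max normalization is an assumption here. The function name and the ASCII operator labels are hypothetical as well. The IN and OUT values are the ones shown for Q1.

```python
from typing import Dict, List, Tuple


def scaled_selectivity(stats: Dict[str, Tuple[int, int]]) -> List[Tuple[float, str]]:
    """Map operator -> (IN, OUT) to (scaled ratio, operator) pairs, most selective first.

    Assumes OUT > 0 for every operator and uses min-max scaling (an assumption).
    """
    ratios = {op: tuples_in / tuples_out for op, (tuples_in, tuples_out) in stats.items()}
    lo, hi = min(ratios.values()), max(ratios.values())
    span = hi - lo
    scaled = [((ratio - lo) / span if span else 0.0, op) for op, ratio in ratios.items()]
    return sorted(scaled, key=lambda pair: pair[0], reverse=True)


ranked = scaled_selectivity({
    "phi2 child::address": (4825, 1256),
    "phi3 parent::person": (4825, 4825),
    "phi5 descendant::name": (4825, 4825),
})
```

The sorted list plays the role of the inverted index: the optimizer can read the most selective operator, here child::address, from the front.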

Transformation Library • XPath equivalence rules [5] extended for the VAMANA physical algebra. • Rule 1: /descendant::n/parent::m/.. ≡ //m[child::n]/.. • Rule 2: /descendant-or-self::n/child::m/.. ≡ //m[parent::n]/.. • Rule 3: p/following-sibling::n/parent::m ≡ p[following-sibling::n]/parent::m • Rule 4: /child::m/preceding-sibling::n ≡ /child::n[following-sibling::m] • Binary predicate • Value based equivalence. • Value-index optimization
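Rule 1 can be checked on a small document with Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath. In this sketch the parent step is written as `..` and the node test m is applied in Python. The sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Illustrative document: one person with a name, one without, and a non-person with a name.
DOC = """
<site>
  <people>
    <person id="person0"><name>Krishna Merle</name><address/></person>
    <person id="person1"><phone/></person>
  </people>
  <item id="item0"><name>thing</name></item>
</site>
"""
root = ET.fromstring(DOC)

# Left-hand side, /descendant::name/parent::person:
# parents of all name descendants, then filtered on the tag "person".
lhs = {e.get("id") for e in root.findall(".//name/..") if e.tag == "person"}

# Right-hand side, //person[child::name]:
# person elements that have a name child.
rhs = {e.get("id") for e in root.findall(".//person[name]")}
```

Both forms select the same person. The second lets the engine start from the person nodes and treat the name step as an existence predicate.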

Q1 (figure): cost annotations • R1 • Φ2 child::address: COUNT = 1256, IN = 4825, OUT = 1256 • Φ3 parent::person: COUNT = 2550, IN = 4825, OUT = 4825 • Φ5 descendant::name: COUNT = 4825, IN = 4825, OUT = 4825

Transformation with Rule 1, /descendant::n/parent::m/.. ≡ //m[child::n]/.. (figure) • a. Initial Query Plan: R1, Φ2 child::address, Φ3 parent::person, Φ5 descendant::name • b. Transformed Query Plan: R1, Φ2 child::address, Φ3 //::person, ξ6, Φ5 child::name

Optimal Query Plan (figure), after Rule 2, /descendant-or-self::n/child::m/.. ≡ //m[parent::n]/.. • R1 • Φ2 //::address: COUNT = 1256, IN = 1256, OUT = 1256 • ξ7: IN = 1256, OUT = 1256 • Φ3 parent::person: COUNT = 2550, IN = 1256, OUT = 1256 • ξ6: IN = 4825, OUT = 1256 • Φ5 child::name: COUNT = 4825, IN = 1256, OUT = 4825

Running Example 2 (figure): cost annotations • R1 • Φ2 ancestor::person: COUNT = 2550, IN = 1256, OUT = 1256 • Φ6 //::province: COUNT = 1256, IN = 1256, OUT = 1256 • β3 EQ: IN = 1256, OUT = 13 • L4 ‘Vermont’: TC = 13 • Φ5 child::text: COUNT = 304819, IN = OUT = 304819

Running Example 2 transformation (figure) • a. Default Query Plan: R1, Φ2 ancestor::person, Φ6 //::province, β3 EQ, Φ5 child::text, L4 ‘Vermont’ • b. Transformed Query Plan: R1, Φ2 ancestor::person, Φ6 parent::province, Φ5 value:: ‘Vermont’
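The transformed plan starts from the literal instead of scanning every province. A minimal sketch, assuming dotted keys whose prefixes are the ancestor keys, and a value index built as a plain dictionary. `NODES`, `VALUE_INDEX` and the function names are illustrative, not the MASS API.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical node table: dotted key -> (node name, text value).
NODES: Dict[str, Tuple[str, Optional[str]]] = {
    "a": ("site", None),
    "a.d": ("people", None),
    "a.d.y": ("person", None),
    "a.d.y.a": ("province", None),
    "a.d.y.a.a": ("#text", "Massachusetts"),
    "a.d.y.b": ("province", None),
    "a.d.y.b.a": ("#text", "Vermont"),
}

# Value index: text value -> keys of the text nodes that carry it.
VALUE_INDEX: Dict[str, List[str]] = {}
for key, (_, text) in NODES.items():
    if text is not None:
        VALUE_INDEX.setdefault(text, []).append(key)


def parent_key(key: str) -> Optional[str]:
    """The parent key is the key without its last component."""
    return key.rsplit(".", 1)[0] if "." in key else None


def transformed_q2(value: str) -> List[str]:
    """value::'...' -> parent::province -> ancestor::person."""
    persons: List[str] = []
    for text_key in VALUE_INDEX.get(value, []):
        province = parent_key(text_key)
        if province is None or NODES[province][0] != "province":
            continue
        ancestor = parent_key(province)
        while ancestor is not None:
            if NODES[ancestor][0] == "person":
                persons.append(ancestor)
            ancestor = parent_key(ancestor)
    return persons


result = transformed_q2("Vermont")
```

Only the text nodes equal to the literal are touched, which is why the value-index form pays off when TC is small relative to the number of province nodes.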

Can we produce BAD queries? • No! • Our optimization aims to reduce the number of tuples the parent operator receives. • Hence we try to push the most selective operators downward. • An operator is considered for transformation only if: • It is selective. • There exists an equivalent transformation rule, i.e. its parent is not affected by the transformation. • The number of tuples generated by the operator is reduced. • Whether the transformation process terminates is another question. • To avoid running forever, we forcibly stop the optimization process after a specific number of iterations.
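The stopping rule can be sketched as a bounded improvement loop. Everything here is a hypothetical stand-in: `optimize`, `cost` and `rewrites` are illustrative names, and plans are represented by plain integers so that the example stays self-contained.

```python
from typing import Callable, Iterable, TypeVar

Plan = TypeVar("Plan")


def optimize(
    plan: Plan,
    cost: Callable[[Plan], float],
    rewrites: Callable[[Plan], Iterable[Plan]],
    max_iterations: int = 10,
) -> Plan:
    """Apply cost-reducing rewrites until none is left or the iteration cap is hit."""
    for _ in range(max_iterations):
        cheaper = [p for p in rewrites(plan) if cost(p) < cost(plan)]
        if not cheaper:
            break  # no rewrite reduces the cost, so the plan is kept
        plan = min(cheaper, key=cost)
    return plan


# Toy plans are integers and the cost is the integer itself.
# This rewrite set always offers a cheaper plan, so only the cap stops the loop.
capped = optimize(100, cost=lambda p: p, rewrites=lambda p: [p - 1, p + 5], max_iterations=3)

# Here the rewrites run out at 0, so the loop stops before reaching the cap.
settled = optimize(2, cost=lambda p: p, rewrites=lambda p: [q for q in (p - 1,) if q >= 0])
```

Accepting only rewrites that lower the cost means the returned plan is never worse than the input plan, and the cap bounds the optimization time.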

Outline • Motivation • Related Work • Background for VAMANA Approach • Our Physical Algebra • Query Execution Engine • Query Optimization • Experimental Evaluation • Criteria • Queries • Results • Conclusions

Experimental Evaluation • XMark [17] auction database. • Compared the CPU execution time of the test queries over different XPath engines • Galax [9] • Jaxen [22] • Others • IPSI [10] • Pathan [11] • Xindice [14] • Variable factors • Document size • Factor (100Kb, 1Mb, 5Mb, 10Mb, 20Mb, 30Mb, 40Mb, 50Mb) • Queries
