Master Thesis Presentation, 22nd April 2004, Venkatesh Raghavan


VAMANA

A High Performance, Scalable and Cost Driven XPath Engine

Master Thesis Presentation

22nd April 2004

Venkatesh Raghavan

Advisor: Prof. Elke Rundensteiner

Reader : Prof. Micha Hofri

Outline
  • Motivation
  • Related Work
  • Background for VAMANA Approach
  • Our Physical Algebra
  • Query Execution Engine
  • Query Optimization
  • Experimental Evaluation
  • Conclusions
Motivation

Many applications are migrating to native XML databases.

  • Need for an XML query engine
    • High Performance
      • Support queries that emphasize the structural semantics of XML query languages.
      • Efficient querying engine and database management system tailored for XML data.
    • Scalable
      • Support large XML documents.
      • Support all 13 XPath axes.
    • Cost Based
      • Schema-independent cost model provides dynamically calculated heuristics.
      • Intelligent cost-based transformations further improve performance.

Outline
  • Motivation
  • Related Work
    • Relational Solutions
    • DOM Solutions
    • Current Index Based Solutions
  • Background for VAMANA Approach
  • Our Physical Algebra
  • Query Execution Engine
  • Query Optimization
  • Experimental Evaluation
  • Conclusions
Relational Solution
  • Mature data management tools
    • Query processing
    • Crash recovery
    • Concurrency control
  • Shredding XML documents
  • XPeranto [6] and Rainbow [7]
  • Many mapping algorithms
    • From XML Schema to Relations: A Cost-based Approach to XML Storage [24]
    • Workload based query mapping algorithm
Flip-side
  • Data Model mismatch
    • Tables vs. XML semi-structured data model
  • Semantic mismatch
    • SQL vs. XQuery
  • Data Fragmentation
    • Increases query execution cost (more joins)
    • High update cost
  • Overhead
    • Relational Mapping adds overhead
      • Managing data
      • Handling order.
DOM Solution
  • The W3C Document Object Model (DOM) is a language-independent API for accessing the various parts of an XML document.
  • Traditional top-down tree traversal.
  • Disadvantages
    • Very main-memory intensive.
    • On average, 4-5 times the file size [11].
    • Most DOM-based engines do not support all XPath axes.
    • Even if they do, they require complex recursive traversal for even a few XPath axes.

Imagine Pixar rendering “Finding Nemo” on a Windows machine.
  • Galax [9] (developed by Bell Labs and AT&T Labs)
    • Does not support all XPath axes.
    • Performs very poorly against large XML documents.
    • Non-cost-driven logical-level optimization.
  • Jaxen [22]
    • Java API supporting different XML APIs (JDOM, DOM, ElectricXML, dom4j).
    • Can handle document having file sizes  10Mb
      • Intel Celeron PC with 512MB of RAM
  • IPSI, Pathan, etc.
Current Index Solutions
  • Apache Xindice [13]
    • User-defined pattern indexes
    • Can index small to medium-sized documents (< 5 MB)
  • Natix [23]
    • The XML data tree is partitioned into small sub-trees, and each sub-tree is stored in a data page.
  • ToX [25] (University of Toronto)
    • The ToX storage engine stores XML documents in either a relational database or an object-oriented database.
Contd.
  • TIMBER [14] (University of Michigan, University of British Columbia and AT&T Labs)
    • TAX Algebra
      • Pattern trees
    • Query execution
      • Structural joins
    • Query Optimization
      • Estimates the costs of all promising sets of evaluation plans.
      • The problem is the exponential growth in the number of candidate plans for complex queries.
      • They claim to select only from an elite set of possible evaluation plans.
    • Cost Estimation
      • Primary Histograms
        • Expensive to maintain under frequent updates
      • Counting Twigs: frequently occurring correlated sub-path queries.

Outline
  • Motivation
  • Related Work
  • Background for VAMANA Approach
    • Multi-Axis Storage Structure
    • Running Example
  • Our Physical Algebra
  • Query Execution Engine
  • Query Optimization
  • Experimental Evaluation
  • Conclusions
[Figure: VAMANA architecture. The Loader stores XML Documents in the MASS Storage Structure. The XPath Compiler turns an XPath Expression into a Default Query Plan; the Optimizer, using the Transformation Library and the Cost Estimator, produces an Optimized Query Plan; the Query Execution Engine runs it by issuing Axis or Value Based Queries against MASS and returns the Resultant Tuples.]

Multi-Axis Storage Structure
  • Efficient storage and access structure for XML documents.
      • XPath axes, node tests, range position predicates.
  • Provides statistics for costing.
      • Number of tuples per page.
      • Count.
  • Fast Lexicographical keys
  • Four Clusters + Value-based index

Clustering: Axes
  • CL1 and CL3: self, child, following-sibling, preceding-sibling, attribute, namespace
  • CL2 and CL4: self, parent, ancestor, ancestor-or-self, descendant, descendant-or-self, preceding, following
Running Examples

E.g. 1: descendant::name/parent::*/self::person/address

<person id="person0">

<name>Krishna Merle</name>

<emailaddress>mailto:[email protected]</emailaddress>

E.g. 2: //province[text()="Vermont"]/ancestor::person

<person id="person41">

<name>Muneo Yemenis</name>

...

<phone>+0 (807) 6372999</phone>

<province>Vermont</province>
Outline
  • Motivation
  • Related Work
  • Background for VAMANA Approach
  • Our Physical Algebra
    • VAMANA Approach
    • Operators
    • Context Node
  • Query Execution Engine
  • Query Optimization
  • Experimental Evaluation
  • Conclusions
VAMANA Approach
  • Data-Flow style of querying.
    • Data flows
    • Control flows
  • Pipelined-iterative fashion
    • Avoids temporary copies of intermediate results whenever possible.
    • Facilitates the reduction of I/O operations
      • All tuples for a particular context node are clustered together.
      • Sequential traversal over node sets.
      • Hence minimal I/O and key comparisons.
    • Used in most commercial relational database systems.
  • Structural joins
    • Best case: when the maximum number of tuples take part in the join pairs.
    • Worst case: when few tuples take part in the join pairs.

Operators

Each operator is written op_id cond (for example, Φ2 child::phone), where

op is the symbol of the operator type,

cond represents the set of conditions applied by the operator,

id is an identifier that uniquely identifies the operator in a given plan P.

  • Root Operator: R
  • Step Operator: Φ
  • Literal Operator: L
  • Exist Operator: ξ
  • Binary Operator: β
  • Join Operator: J

Node Base

Context Node

“..context node is defined as the current node being processed…”

– XPath (1.0 & 2.0)

We extend the idea:

  • The context node of any given VAMANA operator op_id cond uniquely defines the position of an XML node in the index structure. The position is obtained from the structural path information encoded in the context node.

Concepts

[Figure: operator tree for //a/b[/c], built from the operators //a, /b, EXIST and /c. The tree is annotated with the labels Context Side, Context Child, Predicate Side and Predicate Child.]

Dynamic Context Set

[Figure: the operators root, //a and /b drawing their context sets from the MASS index, shown with the node sequence a, a, a, a, a, b.]

XPath Compiler

Q1: descendant::name/parent::*/self::person/address

Q2: //province[text()="Vermont"]/ancestor::person

[Figure: default query plans produced by the compiler.
a. Q1: R1, Φ2 child::phone, Φ3 self::person, Φ4 parent::*, Φ5 descendant::name.
b. Q2: R1, Φ2 ancestor::person, Φ6 //::province, β3 EQ, Φ5 child::text, L4 ‘Vermont’.]

Outline
  • Motivation
  • Related Work
  • Background for VAMANA Approach
  • Our Physical Algebra
  • Query Execution Engine
  • Query Optimization
  • Experimental Evaluation
  • Conclusions
Execution States
  • An operator can be in one of three states:
    • INITIAL
      • The operator has not yet started fetching tuples.
    • FETCHING
      • The operator has not yet exhausted all nodes from MASS that meet its condition(s).
      • The operator is waiting for its context-child to return tuples.
      • The operator is waiting for the predicate condition to process the nodes.
    • OUT_OF_NODE
      • The operator has exhausted all the nodes from MASS that satisfy the condition(s) specified by the operator.
      • The context-child has no further tuples to return to the operator.
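
The three states can be pictured as a small state machine around a pull-based iterator. The sketch below is illustrative only: `StepOperator`, `next_tuple` and the in-memory list standing in for MASS are hypothetical names, not VAMANA's actual interfaces.

```python
from enum import Enum, auto
from typing import Iterator, List, Optional


class State(Enum):
    INITIAL = auto()
    FETCHING = auto()
    OUT_OF_NODE = auto()


class StepOperator:
    """Hypothetical pull-based operator; a plain list stands in for MASS."""

    def __init__(self, matching_nodes: List[str]) -> None:
        self._source: Iterator[str] = iter(matching_nodes)
        self.state = State.INITIAL  # nothing has been fetched yet

    def next_tuple(self) -> Optional[str]:
        """Return the next matching node key, or None once exhausted."""
        if self.state is State.OUT_OF_NODE:
            return None
        self.state = State.FETCHING
        node = next(self._source, None)
        if node is None:
            # No node left that satisfies the operator's condition.
            self.state = State.OUT_OF_NODE
        return node


op = StepOperator(["a.d.y.a", "a.d.y.b"])
trace = [op.state.name]
while op.next_tuple() is not None:
    trace.append(op.state.name)
trace.append(op.state.name)
print(trace)
```

Each call pulls one tuple on demand, so the parent operator never needs a materialized copy of the child's full result.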
Step 1: Setting context for the “leaf operators on context-path”

//province[text()="Vermont"]/ancestor::person

[Figure: query plan R1, Φ2 ancestor::person, Φ6 //::province, β3 EQ, L4 ‘Vermont’, Φ5 child::text, with each operator marked with its state: INITIAL, FETCHING or OUT_OF_NODE.]
Step 2: Ask the root node for tuples.

[Figure sequence: the plan R1, Φ2 ancestor::person, Φ6 //::province, β3 EQ, L4 ‘Vermont’, Φ5 child::text executing step by step. The trace appears to run as follows:
  • Φ6 //::province returns the node with key a.d.y.a; Φ5 child::text returns a.d.y.a.a with the value “Massachusetts”, which β3 EQ compares against L4 ‘Vermont’.
  • Φ6 //::province then returns a.d.y.b; Φ5 child::text returns a.d.y.b.a with the value “Vermont”, so a.d.y.b is passed on.
  • Φ2 ancestor::person receives a.d.y.b and returns a.d.y to R1.]
Outline
  • Motivation
  • Related Work
  • Background for VAMANA Approach
  • Our Physical Algebra
  • Query Execution Engine
  • Query Optimization
    • Query Clean-Up
    • Cost Model
    • Transformation
  • Experimental Evaluation
  • Conclusions
Optimization

[Figure: optimization pipeline. Default Query Plan (P) → Clean Up → Query Plan (P′) → Cost Estimator → Query Plan (P″) + Heuristics (L(P″)) → Transformation → Transformed Query Plan (P_t) → Optimal Query Plan (P_opt).]

Clean Up

[Figure: a. Default Query Plan: Φ2 child::phone, Φ3 self::person, Φ4 parent::*, Φ5 descendant::name. b. Cleaned Query Plan: Φ2 child::phone, Φ3 self::person, Φ5 descendant::name.]
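
The cleaned plan drops Φ4 parent::*, and the later cost examples work with Φ3 parent::person. One rewrite consistent with that (an assumption, not necessarily VAMANA's actual clean-up rule) folds a wildcard step followed by self::T into a single step. The sketch below represents a path as hypothetical (axis, node test) pairs.

```python
from typing import List, Tuple

Step = Tuple[str, str]  # (axis, node test); a hypothetical plan representation

# self::T never matches attribute or namespace nodes, so folding is
# only safe when the wildcard step uses an element-selecting axis.
NON_ELEMENT_AXES = {"attribute", "namespace"}


def clean_up(steps: List[Step]) -> List[Step]:
    """Fold axis::*/self::T into axis::T."""
    cleaned: List[Step] = []
    for axis, test in steps:
        if (
            cleaned
            and axis == "self"
            and cleaned[-1][1] == "*"
            and cleaned[-1][0] not in NON_ELEMENT_AXES
        ):
            prev_axis, _ = cleaned.pop()
            cleaned.append((prev_axis, test))
        else:
            cleaned.append((axis, test))
    return cleaned


q1 = [("descendant", "name"), ("parent", "*"), ("self", "person"), ("child", "address")]
cleaned_q1 = clean_up(q1)
print(cleaned_q1)
```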

VAMANA Cost Model
  • The cost is usually calculated with respect to the root of the XML document or a node specified by the user.
  • Query costs are obtained from the actual data rather than from a data dictionary and thus are always up to date.
  • Our cost model does not suffer the overhead of parsing the entire document, as histogram-based costing such as StatiX [14] does.
  • Starting from the leaf operators, we propagate the cost upwards towards the root operator.
VAMANA Cost Model

COUNT(op_id cond)

  • This heuristic is calculated only for step operators (Φ_id axis::nodetest). It is the number of XML nodes in the underlying index structure that satisfy the node test of the step operator, axis::nodetest.
  • MASS provides an API to efficiently gather the count of a particular node test in its storage structure.

TC(op_id cond)

  • For a literal operator (L_id value), the text count is the number of occurrences of a particular literal value in the index structure.
Contd..

IN(op_i)

The maximum number of tuples that the operator op_i will receive in total from its context child.

  • Case 1:
    • For a leaf step operator on the context path of the query plan, the total number of tuples received equals the number of tuples available in the underlying index structure, i.e. IN(op_i) = COUNT(op_i).
  • Case 2:
    • For all non-leaf operators, IN(op_i) = OUT(op_j), where op_j is the context child of op_i.
  • Case 3:
    • For all leaf step operators on the predicate path of the query plan, the total number of tuples received equals the number of tuples received by its predicate operator.

Contd..

OUT(op_i)

The maximum number of tuples that the current operator op_i returns.

  • Case 1:
    • A leaf step operator on the context path of the query plan returns all the tuples that occur in the underlying index structure with respect to the context of the leaf operator, i.e. OUT(op_i) = COUNT(op_i).
  • Case 2:
    • A literal operator returns the same value every time a request for tuples is received. To facilitate the optimization of literal operators by using a value-index, we define the output as OUT(op_i) = TC(op_i).
  • Case 3:
    • For binary predicate operators that have value-based equivalence, OUT(op_i) is calculated as follows:

OUT(op_i) = Minimum(# tuples from parent operator, TC of the literal value)
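
As a worked instance of these cases, the sketch below plugs in the figures quoted later for the Vermont query (COUNT = 1256 for province, TC = 13 for ‘Vermont’). The function names are hypothetical; only the three OUT cases and IN case 2 stated above are modelled.

```python
def out_leaf_step(count: int) -> int:
    """OUT case 1: a leaf step on the context path returns every matching node."""
    return count


def out_literal(text_count: int) -> int:
    """OUT case 2: a literal operator is costed by its text count TC."""
    return text_count


def out_value_equality(tuples_from_parent: int, text_count: int) -> int:
    """OUT case 3: value-based equality yields at most min(parent tuples, TC)."""
    return min(tuples_from_parent, text_count)


def in_non_leaf(out_of_context_child: int) -> int:
    """IN case 2: a non-leaf operator receives what its context child returns."""
    return out_of_context_child


province_out = out_leaf_step(1256)   # COUNT of province nodes
literal_out = out_literal(13)        # TC of the literal 'Vermont'
eq_out = out_value_equality(province_out, literal_out)
print(province_out, literal_out, eq_out)
```

The equality predicate is bounded by the smaller of the two inputs, which is why a rare literal makes the predicate highly selective.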

Contd..
  • Non-leaf operators
    • Context path
    • Predicate path
  • Leaf operators
    • predicate path
Contd..

“Cost determined with respect to context node provider”

Example 1:
  • Φ2 child::address: COUNT = 1256, IN = 4825, OUT = 1256
  • Φ3 parent::person: COUNT = 2550, IN = 4825

Example 2:
  • Φ3 parent::person: COUNT = 2550, IN = 4825, OUT = 4825
  • Φ6 descendant::name: COUNT = 4825, OUT = 4825

Heuristics

Ratio = IN/OUT

  • The higher the ratio, the better the selectivity.

δ_i = scale_0..1(IN/OUT), the ratio scaled to the range 0..1

  • Inverted index <scaled(IN/OUT), op_i>.
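
A minimal sketch of how such an index might be built. The slides do not say how the ratio is scaled, so min-max normalization is assumed here; `selectivity_index` is a hypothetical name, OUT is assumed to be non-zero, and the IN/OUT figures are the ones quoted for the Vermont query plan.

```python
from typing import Dict, List, Tuple


def selectivity_index(costs: Dict[str, Tuple[int, int]]) -> List[Tuple[float, str]]:
    """Return (scaled IN/OUT, operator) pairs, most selective first.

    `costs` maps an operator label to (IN, OUT); OUT must be non-zero.
    Min-max scaling to 0..1 is an assumption, not the documented method.
    """
    ratios = {op: c_in / c_out for op, (c_in, c_out) in costs.items()}
    low, high = min(ratios.values()), max(ratios.values())
    span = high - low
    scaled = {op: (r - low) / span if span else 0.0 for op, r in ratios.items()}
    return sorted(((delta, op) for op, delta in scaled.items()), reverse=True)


costs = {
    "Φ2 ancestor::person": (1256, 1256),
    "Φ6 //::province": (1256, 1256),
    "β3 EQ": (1256, 13),
}
index = selectivity_index(costs)
print(index[0])
```

Reading the index from the front gives the optimizer the operators that filter the most tuples, which are the first candidates for transformation.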
Transformation Library
  • XPath equivalence rules [5], extended for the VAMANA physical algebra.

Rule 1 : /descendant::n/parent::m/.. ≡ //m[child::n]/..

Rule 2 : /descendant-or-self::n/child::m/.. ≡ //m[parent::n]/..

Rule 3 : p/following-sibling::n/parent::m ≡ p[following-sibling::n]/parent::m

Rule 4 : /child::m/preceding-sibling::n ≡ /child::n[following-sibling::m]

  • Binary predicate
    • Value based equivalence.
      • Value-index optimization
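
To make Rule 1 concrete, here is a small sketch that applies it to a path held as a list of steps. The `Step` class and `apply_rule1` are hypothetical illustrations, not VAMANA's operator classes. The rewrite is applied only when neither step carries a predicate, and //m is written as /descendant::m, which selects the same elements for an absolute path.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    axis: str
    test: str
    predicates: List[str] = field(default_factory=list)

    def __str__(self) -> str:
        preds = "".join(f"[{p}]" for p in self.predicates)
        return f"{self.axis}::{self.test}{preds}"


def apply_rule1(steps: List[Step]) -> List[Step]:
    """Rewrite /descendant::n/parent::m/... as /descendant::m[child::n]/..."""
    if (
        len(steps) >= 2
        and steps[0].axis == "descendant"
        and steps[1].axis == "parent"
        and not steps[0].predicates
        and not steps[1].predicates
    ):
        n, m = steps[0], steps[1]
        return [Step("descendant", m.test, [f"child::{n.test}"])] + steps[2:]
    return steps  # rule not applicable; leave the path unchanged


path = [Step("descendant", "name"), Step("parent", "person"), Step("child", "address")]
rewritten = "/" + "/".join(str(s) for s in apply_rule1(path))
print(rewritten)
```

The reverse parent axis disappears from the context path and becomes an existence predicate.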
[Figure: query plan for Q1 annotated with costs.
  • R1
  • Φ2 child::address: COUNT = 1256, IN = 4825, OUT = 1256
  • Φ3 parent::person: COUNT = 2550, IN = 4825, OUT = 4825
  • Φ5 descendant::name: COUNT = 4825, IN = 4825, OUT = 4825]

[Figure: Rule 1 applied to Q1: /descendant::n/parent::m/.. ≡ //m[child::n]/..
a. Initial Query Plan
  • R1
  • Φ2 child::address: COUNT = 1256, IN = 4825, OUT = 2550
  • Φ3 parent::person: COUNT = 2550, IN = 4825, OUT = 4825
  • Φ5 descendant::name: COUNT = 4825, IN = 4825, OUT = 4825
b. Transformed Query Plan
  • R1
  • Φ2 child::address: COUNT = 1256, IN = 2550, OUT = 2550
  • Φ3 //::person: COUNT = 2550, IN = 2550, OUT = 2550
  • ξ6: IN = 4825, OUT = 2550
  • Φ5 child::name: COUNT = 4825, IN = 2550, OUT = 4825]

[Figure: Rule 2 applied: /descendant-or-self::n/child::m/.. ≡ //m[parent::n]/..
Optimal Query Plan
  • R1
  • Φ2 //::address: COUNT = 1256, IN = 1256, OUT = 1256
  • ξ7: IN = 1256, OUT = 1256
  • Φ3 parent::person: COUNT = 2550, IN = 1256, OUT = 1256
  • ξ6: IN = 4825, OUT = 1256
  • Φ5 child::name: COUNT = 4825, IN = 1256, OUT = 4825]

Running Example 2:

[Figure: default query plan for running example 2 annotated with costs.
  • R1
  • Φ2 ancestor::person: COUNT = 2550, IN = 1256, OUT = 1256
  • Φ6 //::province: COUNT = 1256, IN = 1256, OUT = 1256
  • β3 EQ: IN = 1256, OUT = 13
  • L4 ‘Vermont’: TC = 13
  • Φ5 child::text: COUNT = 304819, IN = OUT = 304819]

[Figure: value-index optimization of running example 2.
a. Default Query Plan: R1, Φ2 ancestor::person, Φ6 //::province, β3 EQ, Φ5 child::text, L4 ‘Vermont’.
b. Transformed Query Plan: R1, Φ2 ancestor::person, Φ6 parent::province, Φ5 value::‘Vermont’.]

Can we produce BAD queries?
  • No!
  • Our optimization aims to reduce the number of tuples the parent operator receives.
  • Hence we try to push the most selective operators downward.
  • An operator is considered for transformation only if:
    • It is selective.
    • There exists an applicable equivalence rule, i.e. its parent is not affected by the transformation.
    • The number of tuples generated by the operator is reduced.
  • Whether the transformation process terminates is another question.
    • To avoid running indefinitely, we forcibly stop the optimization process after a fixed number of iterations.
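
The two safeguards described here, accepting only cheaper plans and capping the number of iterations, can be sketched as a simple loop. Everything below is a toy: `optimize` is a hypothetical name, a "plan" is just an integer standing in for its estimated cost, and the two rules are invented to exercise the loop.

```python
from typing import Callable, List, TypeVar

P = TypeVar("P")


def optimize(
    plan: P,
    cost: Callable[[P], float],
    rules: List[Callable[[P], P]],
    max_iterations: int = 10,
) -> P:
    """Greedy rewrite loop: keep a rewrite only if it lowers the cost,
    and stop after max_iterations even if rewrites are still possible."""
    best = plan
    for _ in range(max_iterations):
        candidates = [rule(best) for rule in rules]
        cheaper = [c for c in candidates if cost(c) < cost(best)]
        if not cheaper:
            break  # no rule improves the plan any further
        best = min(cheaper, key=cost)
    return best


# Toy rules over integer "plans": one shrinks the cost, one makes it worse.
shrink = lambda p: p - 1 if p > 3 else p
grow = lambda p: p + 5

capped = optimize(10, cost=lambda p: p, rules=[shrink, grow], max_iterations=4)
converged = optimize(10, cost=lambda p: p, rules=[shrink, grow], max_iterations=100)
print(capped, converged)
```

Because a candidate is kept only when its cost is strictly lower, the returned plan is never more expensive than the input, whatever the cap.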

Outline
  • Motivation
  • Related Work
  • Background for VAMANA Approach
  • Our Physical Algebra
  • Query Execution Engine
  • Query Optimization
  • Experimental Evaluation
    • Criteria
    • Queries
    • Results
  • Conclusions
Experimental Evaluation
  • XMark [17] auction database.
  • Compared the CPU execution time of the test queries over different XPath engines
    • Galax [9]
    • Jaxen [22]
    • Others
      • IPSI [10]
      • Pathan [11]
      • Xindice [14]
  • Variable factors
    • Document size
      • Factor (100Kb, 1Mb, 5Mb, 10Mb, 20Mb, 30Mb, 40Mb, 50Mb)
    • Queries
XPath Queries
  • XMark [17] queries are not very useful for evaluating XPath queries.
  • Criteria for selecting the queries
    • Cover the major forward and reverse XPath axes.
    • Experimentally prove the correctness of the execution strategy.
    • Empirically prove that optimization does not make the query more expensive.
  • Query List
    • Q1 : //person/address
    • Q2 : //watches/watch/ancestor::person
    • Q3 : /descendant::name/parent::*/parent::person/address
    • Q4 : //itemref/following-sibling::price/parent::*
    • Q5 : //province[text()="Vermont"]/ancestor::person
  • DNF-2hrs : Did Not Finish – In 2 Hours
  • DNF-MO : Did Not Finish – Memory Overflow
  • NS : Not Supported
Improved Performance

Q1: //person/address

Q2: //watches/watch/ancestor::person

Correctness of Query Optimization

Q3: /descendant::name/parent::*/parent::person/address


Q4: //itemref/following-sibling::price/parent::*

Q5: //province[text()="Vermont"]/ancestor::person

Outline
  • Motivation
  • Related Work
  • Background for VAMANA Approach
  • Our Physical Algebra
  • Query Execution Engine
  • Query Optimization
  • Experimental Evaluation
  • Conclusions
Conclusion
  • Efficient, cost-driven XML query execution (XPath)
  • Execution Engine
    • VAMANA's index-only pipelined execution is novel to XML query evaluation, as most existing engines use structural joins for query evaluation.
    • A physical algebra that supports such execution for any given valid XPath expression.
  • Cost Model
    • The VAMANA cost model is simple but very effective for identifying highly selective nodes. The costing can be specific to a particular XML document or even to a specific point in the XML document.
    • The dynamic cost estimation keeps costing up to date in environments with frequent XML updates.
  • The model is well supported by a set of transformation rules used for operator optimization.
  • Our experiments show the correctness, effectiveness and feasibility of our model.
Future Work
  • VAMANA is the building block for the next phase: developing an XQuery engine.
  • VAMANA currently implements only a subset of the XPath equivalence rules stated in [5].
  • The current cost model can be exploited for XQuery optimization.
Acknowledgement
  • Advisor : Prof. Elke Rundensteiner
  • Reader : Prof. Micha Hofri
  • MASS-VAMANA team:
    • Kurt W. Deschler
      • Developing MASS
      • Helping me in the design and development of the VAMANA architecture
      • Unix and systems related issues
    • Tammy L Worthington
      • XPath Compiler
  • Faculty : Inputs from Professor Dan Dougherty
  • Research Group : DSRG
  • Systems : Jesse Banning & Michael Voorhis
References
  • T. Bray, J. Paoli, C. Sperberg-McQueen, E. Maler and F. Yergeau. Extensible Markup Language (XML) 1.0, W3C Recommendation. Available at http://www.w3.org/TR/2004/REC-xml-20040204/, Feb 2004.
  • J. Clark and S. DeRose. XML Path Language (XPath) 1.0, W3C Recommendation. Available at http://www.w3.org/TR/xpath, 2002.
  • A. Berglund, S. Boag, D. Chamberlin, M. F. Fernandez, M. Kay, J. Robie and J. Siméon. XML Path Language (XPath) 2.0, W3C Working Draft. Available at http://www.w3.org/TR/xpath20/, Nov 2003.
  • K. Deschler and E. Rundensteiner. MASS - Multi Axis Storage Structure, CIKM, New Orleans, USA, Nov 2003.
  • D. Olteanu, H. Meuss, T. Furche, and F. Bry. Symmetry in XPath. Proceedings of Seminar on Rule Markup Techniques, no. 02061, Schloss Dagstuhl, Germany, Feb 2002.
  • M. Carey, J. Kiernan, J. Shanmugasundaram, E. Shekita and S. N. Subramanian. XPERANTO: Middleware for Publishing Object-Relational Data as XML Documents, The VLDB Journal, pages 646-648, 2000.
  • X. Zhang, M. Mulchandani, S. Christ, B. Murphy and E. Rundensteiner. Rainbow: Mapping-Driven XQuery Processing System, SIGMOD, page 614, 2002.
  • Kweelt. Available at http://kweelt.sourceforge.net.
  • Galax system. Available at http://db.bell-labs.com/galax/.
  • IPSI-XQ - XQuery demonstrator. Available at http://www.ipsi.fraunhofer.de
  • Pathan. Available at http://software.decisionsoft.com
  • A. Marian and J. Siméon. Projecting XML Documents, VLDB 2003, Berlin, Germany, September 2003.
  • W. Meier. eXist: An Open Source Native XML Database. Web and Database-Related Workshops, Erfurt, Germany, October 2002.
  • Apache Xindice. Available at http://xml.apache.org/xindice/
  • H. Jagadish, S. Al-Khalifa, A. Chapman, L. Lakshmanan, A. Nierman, S. Paparizos, J. Patel, D. Srivastava, N. Wiwatwattanan, Y. Wu and C. Yu. TIMBER: A native XML database, The VLDB Journal, 11(4): 274-291, 2002.
  • J. Freire, J. Haritsa, M. Ramanath, P. Roy, and J. Siméon. StatiX: Making XML Count. In Proc. of SIGMOD, pages 181-191, 2002.
  • XMark - The XML Benchmark project. www.xml-benchmark.org
  • R. Goldman and J. Widom. DataGuides: Enabling Query Formulation and Optimization in Semi-structured Databases, the 23rd Int. Conf. on Very Large Databases (VLDB), Athens, Greece, pages 436-445, 1997.
  • T. Milo and D. Suciu. Index structure for path expression, In Proceedings of 7th International Conference on Database Theory, pages 277-295, 1999.
  • Q. Li and B. Moon. Indexing and Querying XML Data for Regular Path Expressions, Proceedings of 27th International Conference on Very Large Database (VLDB 2001), Rome, Italy, pages 361-370, September 2001.
  • S. Boag, D. Chamberlin, M. Fernández, D. Florescu, J. Robie and J. Siméon. An XML Query Language (XQuery) 1.0, W3C Working Draft. Available at http://www.w3.org/TR/xquery/, Nov 2003.
  • B. McWhirter and J. Strachan. Jaxen: Universal XPath Engine. Available at http://jaxen.org/.
  • C.C. Kanne and G. Moerkotte. Efficient storage of XML data. In Proc. ICDE Conf., poster abstract, page 198, San Diego, USA, 2000.
  • P. Bohannon, J. Freire, P. Roy and J. Siméon. From (XML) Schema to Relations: A Cost-based Approach to XML Storage, ICDE, 2002.
  • T. Worthington and E. Rundensteiner. XPath Processor, WPI MQP Report, April 2002.
  • Flavio Rizzolo, Alberto Mendelzon. Indexing XML Data with ToXin, WebDB, pages 49-54, Santa Barbara, USA, 2001.
  • Z. Chen, H. Jagadish, F. Korn, N. Koudas, S. Muthukrishnan, R. Ng and D. Srivastava. Counting Twig Matches in a Tree, ICDE, pages 595-604, 2001.
  • Document Object Model (DOM). Available at http://www.w3.org/DOM/.