
### BLAS: An Efficient XPath Processing System

Zhimin Song

Advanced Database System

Professor: Dr. Mengchi Liu

Outline

- Introduction
- BLAS System
- Experimental Results
- Conclusions

    <ProteinDatabase>
      <ProteinEntry>
        <protein>
          <name>cytochrome c [validated]</name>
          <classification>
            <superfamily>cytochrome c</superfamily>
          </classification>…
        </protein>
        <reference>
          <refinfo>
            <authors>
              <author>Evans, M.J.</author>…
            </authors>
            <year>2001</year>
            <title>The human somatic cytochrome c gene</title> …
          </refinfo>…
        </reference>…
      </ProteinEntry> …
    </ProteinDatabase>

Figure 1: Sample XML protein repository

Introduction

- XML has a complex, tree-like structure (nodes).
- Languages for querying XML are based on path navigation (XPath [1]).
- Given node → child node (child axis)
- Given node → descendant node (descendant axis)

Introduction (cont.)

- Several techniques have already been proposed to improve XPath processing. For example, D-labeling efficiently handles descendant axis traversal.
- What about complex queries that include child axes and branches?
- For this case, the paper proposes P-labeling, which optimizes an important class of queries called suffix path queries.

BLAS(Bi-LAbeling based System)

- Basic definitions
- The labeling scheme(Index generator)
- Query translator

- Basic definitions:
- BLAS: a system for efficiently processing complex queries, based on D-labeling and P-labeling.
- BLAS handles a subset of XPath queries consisting of:
- Child axis navigation ( / )
- Descendant axis navigation ( // )
- Branches ( […] )

- The evaluation of a path expression P, written [P], returns the set of nodes in an XML tree T that are reachable by P starting from the root of T.
- Since P can be evaluated to retrieve a set of XML nodes, we use “path expression” and “query” interchangeably.
- $P \subseteq Q$ if and only if $[P] \subseteq [Q]$.
- $P \cap Q = \emptyset$ if and only if $[P] \cap [Q] = \emptyset$.

- Basic definitions (cont.):
- Suffix path expression: a path expression P that optionally begins with a descendant axis step (//), followed by zero or more child axis steps (/).
- Example: //protein/name
- Another example: /proteinDatabase/proteinEntry/protein/name

- SP(n): the unique simple path P from the root to the node n.
- Evaluating a suffix path expression Q therefore means finding all the nodes n such that $SP(n) \subseteq Q$.


[Figure: Query translator. Boxes: XPath query, query decomposition, suffix path queries, subquery generator (based on P-labeling), subqueries, subquery composition (based on D-labeling), query. The decomposition also passes on the ancestor-descendant relationship between the results of the suffix path queries.]

[Figure: Architecture of BLAS. Components: XML, SAX parser, events, P-labeling generator (P-labelings), D-labeling generator (D-labelings), data loader (data values), storage, query translator, query engine, query result.]

- The labeling scheme (Index generator)
- D-labeling scheme: each XML node is labeled with a triplet <d1,d2,d3> with d1 <= d2. For two nodes n and m:
- m is a descendant of n if and only if n.d1 < m.d1 and n.d2 > m.d2.
- m is a child of n if and only if m is a descendant of n and n.d3 + 1 = m.d3.
- For a node n, d1 and d2 are the positions of its start tag and end tag.
- d3 is the level of n in the XML tree, that is, the length of the path from the root to n.
The D-label is therefore represented as <start,end,level>.
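
The D-labeling rules above can be tried out with a minimal Python sketch. It assumes that "position" means a running count of start and end tags and that the root sits at level 1; the function names `d_labels`, `is_descendant`, and `is_child` are illustrative and do not come from the slides.

```python
import xml.etree.ElementTree as ET

def d_labels(xml_text):
    """Return (tag, start, end, level) tuples in document order."""
    labels = []
    counter = 0  # counts start and end tags as they are encountered

    def visit(elem, level):
        nonlocal counter
        counter += 1                 # position of the start tag
        start = counter
        slot = len(labels)
        labels.append(None)          # reserve the slot to keep document order
        for child in elem:
            visit(child, level + 1)
        counter += 1                 # position of the end tag
        labels[slot] = (elem.tag, start, counter, level)

    visit(ET.fromstring(xml_text), 1)  # assumption: the root is at level 1
    return labels

def is_descendant(m, n):
    """m is a descendant of n iff n.start < m.start and n.end > m.end."""
    return n[1] < m[1] and n[2] > m[2]

def is_child(m, n):
    """m is a child of n iff it is a descendant one level below n."""
    return is_descendant(m, n) and n[3] + 1 == m[3]

doc = "<refinfo><authors><author/></authors><year/></refinfo>"
refinfo, authors, author, year = d_labels(doc)
print(refinfo, authors, author, year)
print(is_descendant(author, refinfo), is_child(author, refinfo), is_child(author, authors))
```

Both tests need only the two labels being compared, so no tree traversal is required at query time.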


- Example: using D-labeling

Query: //proteinDatabase//refinfo

First retrieve all the nodes reachable by refinfo and by proteinDatabase. Let pDB and refinfo be two relations that store these nodes, then D-join them:

    SELECT pDB.start, pDB.end, refinfo.start, refinfo.end
    FROM pDB, refinfo
    WHERE pDB.start < refinfo.start AND pDB.end > refinfo.end

[Figure: query tree with nodes proteinDatabase, proteinEntry, protein, reference, superfamily = “cytochrome c”, refinfo, author = “Evans, M.J.”, title, and year = “2001”, including two // edges.]
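
The D-join can be run as written against an in-memory SQLite database. The sketch below is only an illustration: the relations `pDB` and `refinfo` follow the slide, the row values are made up, and the column name `end` is quoted because END is an SQL keyword.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical D-label relations; only start and end are needed for the D-join.
conn.execute('CREATE TABLE pDB (start INTEGER, "end" INTEGER)')
conn.execute('CREATE TABLE refinfo (start INTEGER, "end" INTEGER)')
conn.execute("INSERT INTO pDB VALUES (1, 100)")
conn.executemany("INSERT INTO refinfo VALUES (?, ?)",
                 [(10, 20), (30, 40), (150, 160)])  # the last row lies outside pDB

rows = conn.execute('''
    SELECT pDB.start, pDB."end", refinfo.start, refinfo."end"
    FROM pDB, refinfo
    WHERE pDB.start < refinfo.start AND pDB."end" > refinfo."end"
    ORDER BY refinfo.start
''').fetchall()
print(rows)
```

Only the refinfo rows whose interval is nested inside the proteinDatabase interval survive the join.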

- P-labeling Scheme
- It is also important to implement child axis navigation efficiently.
- e.g. /proteinDatabase/proteinEntry/protein/name
- Target: improve “/” evaluation
- Focus on suffix path queries, e.g. //protein/name

- Assign each node a number <p1> and each suffix path an interval <p1,p2> such that:
- For any two suffix paths Q1 and Q2, Q1 is contained in Q2 if the interval of Q1 lies inside the interval of Q2: Q2.p1 <= Q1.p1 and Q2.p2 >= Q1.p2.

- A node n is contained in the suffix path Q if Q.p1 <= SP(n).p1 <= Q.p2.

- Let Q be a suffix path query. Then [Q] = {n | Q.p1 <= n.plabel <= Q.p2}, where n.plabel = SP(n).p1.
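
A small Python sketch shows how these properties turn a suffix path query into a range check. The intervals and node labels below are made-up toy values, and `contains` and `evaluate` are hypothetical helper names.

```python
def contains(outer, inner):
    """True if the suffix path with interval `inner` is contained in `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

def evaluate(query, node_plabels):
    """Return the nodes whose P-label falls inside the query interval."""
    p1, p2 = query
    return [node for node, plabel in node_plabels.items() if p1 <= plabel <= p2]

# Toy intervals (assumed values): //name and the more specific //protein/name.
q_name = (400, 499)
q_protein_name = (430, 439)

# Toy node P-labels, each standing for SP(n).p1 of some node n.
node_plabels = {"n1": 430, "n2": 410, "n3": 120}

print(contains(q_name, q_protein_name))        # //protein/name is inside //name
print(evaluate(q_name, node_plabels))
print(evaluate(q_protein_name, node_plabels))
```

Because the check is a single range comparison on one number per node, a suffix path query needs no join at all.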

- P-labeling Construction (algorithm)
- Suppose that there are n distinct tags (t1, t2, …, tn).
- Assign “/” a ratio r0 and each tag ti a ratio ri such that $r_0 + r_1 + r_2 + \dots + r_n = 1$.

- Let ri = 1/(n+1).
- Define the domain of the numbers in a P-label to be the integers in [0, m-1], where m is chosen such that $m \ge (n+1)^h$ and h is the length of the longest path in the XML tree.

- The algorithm is as follows:
- Path // is assigned the interval (P-label) <0, m-1>.
- Partition the interval <0, m-1> in tag order, in proportion to the ratio r0 of child axis navigation “/” and the ratio ri of each path //ti.
- That is, allocate the interval <0, m*r0 - 1> to “/” and $\langle p_i,\ p_{i+1} - 1 \rangle$ to each //ti such that $(p_{i+1} - p_i)/m = r_i$ and $p_1/m = r_0$.

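- P-labeling Construction (Example)

Query: //protein/name, with m = $10^{12}$, 99 tags, and ri = 0.01.

[Figure: three nested partitions of the P-label domain. Top: the range from 0 to $10^{12}$, divided at $10^{10}$, $2 \cdot 10^{10}$, $3 \cdot 10^{10}$, $4 \cdot 10^{10}$, $5 \cdot 10^{10}$, … among /, //proteinDatabase, //proteinEntry, //protein, //name, … Middle: the //name range from $4 \cdot 10^{10}$ to $5 \cdot 10^{10}$, divided at $4.01 \cdot 10^{10}$, $4.02 \cdot 10^{10}$, $4.03 \cdot 10^{10}$, $4.04 \cdot 10^{10}$, … among /name, //proteinDatabase/name, //proteinEntry/name, //protein/name, … Bottom: the //protein/name range from $4.03 \cdot 10^{10}$ to $4.04 \cdot 10^{10}$, with its first division at $4.0301 \cdot 10^{10}$.]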
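
The construction can be reproduced with a short Python sketch, assuming uniform ratios ri = 1/(n+1), integer arithmetic, and a fixed tag order in which slot 0 of every interval belongs to “/”. The function name `p_label` and the 95 filler tags are hypothetical; only the four named tags and the values m = 10^12 and 99 tags come from the example.

```python
def p_label(path, tags, m):
    """Return the interval (p1, p2) of a suffix path such as '//protein/name'.

    Assumes uniform ratios 1/(n+1): every interval is cut into n+1 equal
    slots, slot 0 for '/' and slot i for the i-th tag (1-based).
    """
    absolute = not path.startswith("//")      # '/a/b' is a simple path from the root
    steps = [s for s in path.split("/") if s]
    slots = len(tags) + 1
    lo, width = 0, m                          # '//' owns the whole domain <0, m-1>
    for tag in reversed(steps):               # refine from the last step backwards
        width //= slots
        lo += (tags.index(tag) + 1) * width
    if absolute:
        width //= slots                       # take slot 0, the share of '/'
    return lo, lo + width - 1

# Four tags from the example plus 95 hypothetical fillers, giving 99 tags.
tags = ["proteinDatabase", "proteinEntry", "protein", "name"]
tags += [f"tag{i}" for i in range(5, 100)]
m = 10**12

print(p_label("//name", tags, m))
print(p_label("//protein/name", tags, m))
print(p_label("/protein/name", tags, m))
```

Under these assumptions the intervals start at 4*10^10 for //name and at 4.03*10^10 for //protein/name, and /protein/name ends just below 4.0301*10^10, matching the division points in the example.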

- Query translator: translates an input XPath query into standard SQL.
- Query decomposition
- Splits the query into a set of suffix path queries and records the ancestor-descendant relationships among them.

- SQL generation
- Computes each suffix path query's P-label and generates a corresponding subquery in SQL.

- SQL composition
- Combines the subqueries into a single SQL query based on D-labeling and the ancestor-descendant relationships.
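
The SQL generation and SQL composition steps might look roughly like the following Python sketch. It is not the translator from the paper: the table `nodes(plabel, start, "end")`, the helper names, and the P-label intervals are all assumptions made for illustration.

```python
import sqlite3

def suffix_subquery(p1, p2):
    """SQL generation: a suffix path query becomes a range selection on P-labels."""
    return f'SELECT start, "end" FROM nodes WHERE plabel >= {p1} AND plabel <= {p2}'

def compose(ancestor_sql, descendant_sql):
    """SQL composition: join two subqueries on the D-label containment test."""
    return (f'SELECT d.start, d."end" FROM ({ancestor_sql}) AS a, ({descendant_sql}) AS d '
            f'WHERE a.start < d.start AND a."end" > d."end"')

# Toy P-label intervals for an ancestor and a descendant suffix path query.
sql = compose(suffix_subquery(100, 199), suffix_subquery(400, 499))

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE nodes (plabel INTEGER, start INTEGER, "end" INTEGER)')
conn.executemany("INSERT INTO nodes VALUES (?, ?, ?)",
                 [(150, 1, 10),    # matches the ancestor query
                  (450, 2, 3),     # matches the descendant query, nested in (1, 10)
                  (450, 20, 21),   # matches the descendant query, not nested
                  (300, 4, 5)])    # matches neither query
result = conn.execute(sql).fetchall()
print(result)
```

The integers are interpolated directly only to keep the generated SQL readable; production code would bind them as parameters.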

- Query decomposition
- Split algorithm:
- D-elimination (query tree Q)
- Traverse the query tree depth-first.
- Split p//q into p and //q: $p//q \rightarrow p,\ //q$
- Invoke B-elimination if there are branches in Q; otherwise, evaluate Q using P-labels.
- Join the intermediate results by their D-labels.

[Figure: the example query tree (proteinDatabase, proteinEntry, protein, reference, refinfo, superfamily = “cytochrome c”, year = “2001”, title, author = “Evans, M.J.”) split at its // edges into subqueries Q1, Q2, and Q3.]
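
For a branch-free path, the split of p//q into p and //q can be sketched in a few lines of Python. The function name `d_eliminate` and the sample path are illustrative; branches and value predicates are deliberately left out.

```python
def d_eliminate(path):
    """Split a branch-free path at every interior '//' into suffix path queries."""
    pieces = []
    start = 0
    cut = path.find("//", 1)          # a leading '//' belongs to the first piece
    while cut != -1:
        pieces.append(path[start:cut])
        start = cut
        cut = path.find("//", cut + 2)
    pieces.append(path[start:])
    return pieces

print(d_eliminate("/proteinDatabase/proteinEntry//refinfo/year"))
print(d_eliminate("//proteinDatabase//refinfo"))
```

Each piece is a suffix path query that can be answered with P-labels alone; consecutive pieces are then related by an ancestor-descendant D-join.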

- B-elimination (query tree Q1)
- Split a branching query at its branch point: $p[q_1, q_2, \dots, q_i]/r \rightarrow p,\ //q_1,\ //q_2,\ \dots,\ //q_i,\ //r$

[Figure: B-elimination applied to Q1, giving subqueries Q4, Q5, and Q6 over proteinDatabase, proteinEntry, protein, reference, refinfo, year = “2001”, and title.]

B-elimination (cont.):

[Figure: further B-elimination, giving subqueries Q4, Q5, Q7, Q8, and Q9 over proteinDatabase, proteinEntry, reference, refinfo, year = “2001”, and title.]

- Push up algorithm: optimizes branch elimination (B-elimination).
- Since p/qi and p/r are more specific than //qi and //r, split $p[q_1, q_2, \dots, q_i]/r \rightarrow p,\ p/q_1,\ p/q_2,\ \dots,\ p/q_i,\ p/r$.

[Figure: push-up applied to subqueries such as Q4 and Q5; the resulting subqueries repeat the path prefix proteinDatabase/proteinEntry/reference/refinfo in front of year = “2001” and title.]
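
The difference between plain B-elimination and the push-up variant is easy to see in a short Python sketch. The function `b_eliminate` and its arguments are hypothetical, and value predicates such as “2001” are ignored here.

```python
def b_eliminate(p, branches, r, push_up=False):
    """Split p[q1,...,qi]/r into suffix path queries.

    Without push-up: p, //q1, ..., //qi, //r.
    With push-up:    p, p/q1, ..., p/qi, p/r (more specific, hence more selective).
    """
    prefix = p if push_up else "/"
    return [p] + [f"{prefix}/{q}" for q in branches + [r]]

p = "/proteinDatabase/proteinEntry/reference/refinfo"
print(b_eliminate(p, ["year"], "title"))
print(b_eliminate(p, ["year"], "title", push_up=True))
```

With push-up, each subquery carries the full prefix, so its P-label interval is narrower and selects fewer candidate nodes before the join.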

- Unfold algorithm: a further optimization of descendant-axis elimination (D-elimination): $p//q \rightarrow p/r_1/q,\ p/r_2/q,\ \dots,\ p/r_i/q$
- Example:

Q2 = /ProteinDatabase/ProteinEntry/protein//superfamily=“cytochrome c”

Q21 = /ProteinDatabase/ProteinEntry/protein/classification/superfamily=“cytochrome c”
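
Unfolding needs schema information about which tags can appear under which. The Python sketch below assumes a non-recursive schema given as a parent-to-children dictionary; the dictionary is inferred from the Figure 1 sample and the function name `unfold` is hypothetical. It also allows more than one intermediate step between p and q.

```python
def unfold(p, q, schema):
    """Expand p//q into child-only paths p/.../q using a parent -> children map.

    Assumes the schema is non-recursive; a cyclic schema would not terminate.
    """
    start = p.rstrip("/").split("/")[-1]      # last tag of the prefix p
    results = []

    def walk(tag, trail):
        for child in schema.get(tag, []):
            if child == q:
                results.append("/".join([p] + trail + [q]))
            walk(child, trail + [child])

    walk(start, [])
    return results

# Hypothetical schema fragment, inferred from the sample protein repository.
schema = {"protein": ["name", "classification"],
          "classification": ["superfamily"]}

print(unfold("/ProteinDatabase/ProteinEntry/protein", "superfamily", schema))
```

The unfolded query contains no descendant axis, so it is a single suffix path query that needs no D-join.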

Experimental Results

- Query engine: RDBMS or file system
- Data sets
- Query sets
- Suffix path queries
- Path queries
- XPath queries

Query Execution Time

[Figure: Query time for Shakespeare, Protein and Auction data sets. Legend: 1 = suffix path query, 2 = path query, 3 = XPath query; A = Auction, P = Protein, S = Shakespeare.]

Scalability

[Figure: The performance of D-labeling, Split and Push up for the suffix path query.]

Conclusion

- P-labeling scheme is proposed to evaluate suffix path queries efficiently.
- BLAS combines P-labeling and D-labeling to evaluate XPath queries.
- BLAS is more efficient because the queries translated from XPath queries require:
- fewer disk accesses
- fewer joins

- Experiments show the effectiveness of BLAS

- [1] J. Clark and S. DeRose. XML Path Language (XPath), November 1999. http://www.w3.org/TR/xpath.
- [13] D. DeHaan, D. Toman, M. Consens, and M. T. Ozsu. A comprehensive XQuery to SQL translation using dynamic interval encoding. In Proceedings of SIGMOD, 2001.
- [26] J.-K. Min, M.-J. Park, and C.-W. Chung. XPRESS: A queriable compression for XML data. In Proceedings of SIGMOD, 2003.

Thank you!

Questions?
