Indexing and Querying XML Data for Regular Path Expressions

A Paper by Quanzhong Li and Bongki Moon

Presented by Amnon Shochot


Our Objective

  • Developing a system that will enable us to perform XML data queries efficiently.


XML Query Languages

  • Used for retrieving data from XML files.

  • Use a regular path expression syntax.

  • e.g. XPath, XQuery.


Queries Today - Inefficient

  • Usually XML tree traversals – Inefficient.

    • Top-Down Approach

    • Bottom-Up Approach

    • An example:

      the query:

      /chapter/_*/figure

      (finding all figures in all chapters.)


Our Objective - Refined

  • Developing a system that will enable us to perform XML data queries efficiently

  • Developing such a system consists of:

    • Developing a way to efficiently store XML data.

    • Developing efficient algorithms for processing regular path expressions (e.g. XQuery expressions).


Storing XML Documents

  • Question: What would we need from a data structure to be able to perform an efficient query?

  • Answer: A mechanism for:

    • Efficiently finding all elements/attributes with a given name.

    • Efficiently finding all values with a given name.

    • Efficiently resolving ancestor-descendant relationship.


Storing XML Documents - XISS

  • XISS - XML Indexing and Storage System.

  • Provides us with ways to:

    • efficiently find all elements or attributes with the same name string, grouped by the document they belong to.

    • quickly determine the ancestor-descendant relationship between elements and/or attributes in the hierarchy of XML data.


Determining Ancestor-Descendant Relationship

  • According to Dietz's numbering scheme: for two given nodes x and y of a tree T, x is an ancestor of y iff x occurs before y in the preorder traversal and after y in the postorder traversal.

  • Example:
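
As a small illustration of Dietz's condition, the sketch below numbers a tiny tree in preorder and postorder and then tests the ancestor relationship by comparing ranks. The `Node` class, the helper names and the sample tree (chapter, title, section, figure) are assumptions made for this example; they do not come from the paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)
    pre: int = 0   # rank in the preorder traversal
    post: int = 0  # rank in the postorder traversal


def number(root: Node) -> None:
    """Assign preorder and postorder ranks in one depth-first walk."""
    counters = {"pre": 0, "post": 0}

    def visit(node: Node) -> None:
        counters["pre"] += 1
        node.pre = counters["pre"]
        for child in node.children:
            visit(child)
        counters["post"] += 1
        node.post = counters["post"]

    visit(root)


def is_ancestor(x: Node, y: Node) -> bool:
    """Dietz's test: x comes before y in preorder and after y in postorder."""
    return x.pre < y.pre and x.post > y.post


# Hypothetical sample tree: chapter -> (title, section -> figure)
figure = Node("figure")
section = Node("section", [figure])
title = Node("title")
chapter = Node("chapter", [title, section])
number(chapter)

print(is_ancestor(chapter, figure))  # True
print(is_ancestor(title, figure))    # False: title precedes figure in both orders
```

The test itself is two integer comparisons. The cost lies in `number`, which has to be rerun over much of the tree whenever a node is inserted.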


Determining Ancestor-Descendant Relationship – cont.

  • Advantage: the ancestor-descendant relationship can be determined in constant time.

  • Disadvantage: a lack of flexibility.

    • e.g. inserting a new node requires recomputation of many tree nodes.


Determining Ancestor-Descendant Relationship – cont.

  • A new numbering scheme:

    • Each node is associated with an <order, size> pair:

      • For a tree node y and its parent x:

        [order(y), order(y) + size(y)] ⊂ (order(x), order(x) + size(x)]

      • For two sibling nodes x and y, if x is the predecessor of y in the preorder traversal, then:

        order(x) + size(x) < order(y).
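
One way to produce labels that satisfy both conditions is a depth-first walk that leaves some unused numbers after each subtree. This is only a sketch: the `slack` parameter, the `Node` class and the helper names are assumptions for this example, and the paper may choose sizes differently.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)
    order: int = 0
    size: int = 0


def assign(node: Node, order: int = 1, slack: int = 2) -> int:
    """Label the subtree rooted at `node`; return the last number it covers.

    `slack` (hypothetical) is the count of spare numbers reserved at the end
    of every node's interval so that later insertions can reuse them.
    """
    node.order = order
    last = order
    for child in node.children:
        # Each child starts strictly after the previous sibling's interval.
        last = assign(child, last + 1, slack)
    node.size = (last - order) + slack
    return node.order + node.size


def labels(node: Node) -> List[Tuple[str, int, int]]:
    """Return (name, order, size) triples in preorder."""
    out = [(node.name, node.order, node.size)]
    for child in node.children:
        out.extend(labels(child))
    return out


# An element with one attribute and one nested element that has an attribute.
inner = Node("Ele", [Node("@Att")])
outer = Node("Ele", [Node("@Att"), inner])

assign(outer, slack=0)
print(labels(outer))
# [('Ele', 1, 3), ('@Att', 2, 0), ('Ele', 3, 1), ('@Att', 4, 0)]
```

With `slack=0` the size of a node equals its number of descendants, which leaves no room for insertions. A larger `slack` trades number space for fewer relabelings.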


Determining Ancestor-Descendant Relationship – cont.

  • Fact: for two given nodes x and y of a tree T, x is an ancestor of y iff:

    order(x) < order(y) ≤ order(x) + size(x)
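
The condition translates directly into a constant-time check. In the sketch below the `Label` type and the function name are assumptions; the sample labels <1,3>, <2,0>, <3,1> and <4,0> are the ones used in the EA-Join example of this presentation.

```python
from typing import NamedTuple


class Label(NamedTuple):
    order: int
    size: int


def is_ancestor(x: Label, y: Label) -> bool:
    """x is an ancestor of y iff order(x) < order(y) <= order(x) + size(x)."""
    return x.order < y.order <= x.order + x.size


outer, outer_att = Label(1, 3), Label(2, 0)
inner, inner_att = Label(3, 1), Label(4, 0)

print(is_ancestor(outer, inner))      # True
print(is_ancestor(inner, inner_att))  # True
print(is_ancestor(outer_att, inner))  # False
```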


Determining Ancestor-Descendant Relationship – cont.

  • Properties:

    • the ancestor-descendant relationship can be determined in constant time.

    • flexibility – node insertion usually doesn’t require recomputation of tree nodes.

    • an element can be uniquely identified in a document by its order value.


XISS System Overview

  • How the system works:

    • XML documents are loaded into the XISS system.

    • These documents are added to the XISS data structures.

      • Each document is assigned a document id (did).

      • Index structures are organized as paged files for efficient disk I/O.

    • When a query is performed, the query processor interacts with XISS to obtain the information required for the query.


XISS - cont.

  • XISS consists of 5 components:

    • Name Index

    • Value Table

    • Element Index

    • Attribute Index

    • Structure Index


Name Index and Value Table

  • Objective: minimizing the storage and computation overhead by eliminating replicated strings and string comparisons.

  • Name Index - maps distinct name strings to unique name identifiers (nid).

  • Value Table - maps distinct value strings (i.e. attribute values and text values) to unique value identifiers (vid).

  • Both are implemented as B+-trees.
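
The idea of replacing repeated strings with identifiers can be sketched with an in-memory table. Note the substitution: this example uses a Python dict and list instead of the disk-based B+-tree described above, and the class and method names are assumptions for illustration.

```python
from typing import Dict, List


class StringIndex:
    """Maps each distinct string to a small integer id and back.

    One instance could play the role of the Name Index (ids are nids),
    another the role of the Value Table (ids are vids).
    """

    def __init__(self) -> None:
        self._ids: Dict[str, int] = {}
        self._strings: List[str] = []

    def intern(self, text: str) -> int:
        """Return the id of `text`, assigning a new one on first sight."""
        ident = self._ids.get(text)
        if ident is None:
            ident = len(self._strings)
            self._ids[text] = ident
            self._strings.append(text)
        return ident

    def lookup(self, ident: int) -> str:
        """Return the string that was assigned `ident`."""
        return self._strings[ident]


names = StringIndex()
chapter_nid = names.intern("chapter")
figure_nid = names.intern("figure")
again = names.intern("chapter")  # same id as chapter_nid, no second copy stored
```

After interning, comparing two names is an integer comparison rather than a string comparison.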


The Element Index

  • Objective: quickly finding all elements with the same name string.

  • Structure:


The Element Index – cont.

  • Structure:

    • B+-tree using nid as a key.

    • Leaf nodes: pointers to a set of records for elements (or attributes) having an identical name string, grouped by the document they belong to.

    • Element Record = {<order,size>, Depth, Parent ID}

      • where Depth is the depth of the element in the XML tree.

    • Element Records are ordered by <order,size>.
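
The layout above can be sketched in memory as a two-level mapping, nid to did to a sorted list of records. This is an illustration only: nested dicts stand in for the B+-tree, the "Parent ID" is assumed to be the parent's order value with 0 meaning "no parent", and all class and method names are hypothetical.

```python
from bisect import insort
from collections import defaultdict
from typing import Dict, List, NamedTuple


class ElementRecord(NamedTuple):
    order: int
    size: int
    depth: int
    parent_order: int  # assumption: parent's order value, 0 for the root


class ElementIndex:
    def __init__(self) -> None:
        # nid -> did -> records kept sorted by (order, size)
        self._by_name: Dict[int, Dict[int, List[ElementRecord]]] = defaultdict(
            lambda: defaultdict(list)
        )

    def add(self, nid: int, did: int, record: ElementRecord) -> None:
        """Insert a record, keeping the per-document list sorted."""
        insort(self._by_name[nid][did], record)

    def find(self, nid: int) -> Dict[int, List[ElementRecord]]:
        """Return all records for one name, grouped by document id."""
        groups = self._by_name.get(nid, {})
        return {did: list(records) for did, records in sorted(groups.items())}


index = ElementIndex()
ELE = 0  # hypothetical nid for the name "Ele"
index.add(ELE, 1, ElementRecord(3, 1, 2, 1))
index.add(ELE, 1, ElementRecord(1, 3, 1, 0))
index.add(ELE, 2, ElementRecord(1, 0, 1, 0))
result = index.find(ELE)
```

Because a `NamedTuple` compares field by field, listing `order` and `size` first makes the natural tuple ordering coincide with the <order,size> ordering required by the slide.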


The Attribute Index

  • Objective: quickly finding all attributes with the same name string.

  • Structure:

    • Same structure as the Element Index, except that each record in the Attribute Index also has a value identifier (vid), a key used to obtain the attribute value from the Value Table.


The Structure Index

  • Objectives:

    • Finding the parent element and child elements (or attributes) for a given element.

    • Finding the parent element for a given attribute.

  • Structure:


The Structure Index – cont.

  • Structure:

    • B+-tree using document identifier (did) as a key.

    • Leaf nodes: linear arrays with records for all elements and attributes from an XML document.

    • Each record: {nid, <order,size>, Parent order, Child order, Sibling order, Attribute order}.

    • Records are ordered by order value.
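
A possible in-memory reading of this structure is sketched below. Several points are assumptions rather than facts from the slides: a dict replaces the B+-tree, "Child order" is taken to mean the first child element, "Sibling order" the next sibling, "Attribute order" the first attribute, and `None` marks a missing link. The `order` field is placed first only so that records sort by order value.

```python
from bisect import bisect_left
from typing import Dict, List, NamedTuple, Optional


class StructRecord(NamedTuple):
    order: int
    size: int
    nid: int
    parent: Optional[int]     # order of the parent element
    child: Optional[int]      # assumed: order of the first child element
    sibling: Optional[int]    # assumed: order of the next sibling
    attribute: Optional[int]  # assumed: order of the first attribute


class StructureIndex:
    def __init__(self) -> None:
        self._docs: Dict[int, List[StructRecord]] = {}

    def load(self, did: int, records: List[StructRecord]) -> None:
        self._docs[did] = sorted(records)  # sorted by order value

    def get(self, did: int, order: int) -> Optional[StructRecord]:
        """Binary-search the document's array for the record with `order`."""
        records = self._docs[did]
        pos = bisect_left(records, (order,))
        if pos < len(records) and records[pos].order == order:
            return records[pos]
        return None

    def children(self, did: int, order: int) -> List[StructRecord]:
        """Follow the first-child link, then the sibling chain."""
        found: List[StructRecord] = []
        record = self.get(did, order)
        nxt = record.child if record else None
        while nxt is not None:
            child = self.get(did, nxt)
            if child is None:
                break
            found.append(child)
            nxt = child.sibling
        return found


ELE, ATT = 0, 1  # hypothetical nids
index = StructureIndex()
index.load(7, [
    StructRecord(1, 3, ELE, None, 3, None, 2),
    StructRecord(2, 0, ATT, 1, None, None, None),
    StructRecord(3, 1, ELE, 1, None, None, 4),
    StructRecord(4, 0, ATT, 3, None, None, None),
])
```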


Querying Method

  • Decomposing path expressions into simple path expressions.

  • Applying algorithms on simple path expressions and their intermediate results.


Decomposition of Path Expressions

  • The main idea:

    • A complex path expression is decomposed into several simple path expressions.

    • Each simple path expression produces an intermediate result that can be used in the subsequent stage of processing.

    • The results of the simple path expressions are then combined or joined together to obtain the final result of the given query.


Basic Subexpressions - Example

  • Decomposition of (E1/E2)*/E3/((E4[@a=V]) | (E5/_*/E6)):

    [Figure: expression tree whose nodes are labeled with the subexpression kinds listed below. Each single element or attribute is labeled (1), "[ ]" is labeled (2), "/" and "/_*/" are labeled (3), "*" is labeled (4), and "|" is labeled (5).]

    (1) Single Element/Attribute

    (2) Element-Attribute

    (3) Element-Element

    (4) Kleene Closure

    (5) Union

Basic Subexpressions

5 basic subexpressions:

(1) A subexpression with a single element or a single attribute.

(2) A subexpression with an element and an attribute.

  • e.g. figure[@caption = "Tree Frogs"]

(3) A subexpression with two elements.

  • e.g. chapter/_*/figure where '_' denotes any kind of node.


Basic Subexpressions - cont.

5 basic subexpressions - cont.:

(4) A subexpression that is a Kleene closure (+,*) of another subexpression.

(5) A subexpression that is a union of two other subexpressions.
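
The five kinds can be modeled as a small expression tree. The sketch below is one hypothetical encoding (all class and field names are assumptions) that builds the expression (E1/E2)*/E3/((E4[@a=V]) | (E5/_*/E6)) by hand and counts how many nodes of each kind the decomposition contains. It does not parse path expressions.

```python
from collections import Counter
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class Single:                # kind (1): one element or one attribute
    name: str


@dataclass(frozen=True)
class ElementAttribute:      # kind (2): element[@attribute = value]
    element: object
    attribute: object
    value: Optional[str] = None


@dataclass(frozen=True)
class ElementElement:        # kind (3): left/right or left/_*/right
    left: object
    right: object
    any_depth: bool = False  # True for the "/_*/" form


@dataclass(frozen=True)
class Closure:               # kind (4): Kleene closure of a subexpression
    inner: object
    at_least_one: bool = False  # True for "+", False for "*"


@dataclass(frozen=True)
class Alternation:           # kind (5): union of two subexpressions
    left: object
    right: object


KINDS = (Single, ElementAttribute, ElementElement, Closure, Alternation)


def count_kinds(expr) -> Counter:
    """Count the nodes of each kind in an expression tree."""
    counts = Counter([type(expr).__name__])
    for f in fields(expr):
        child = getattr(expr, f.name)
        if isinstance(child, KINDS):
            counts += count_kinds(child)
    return counts


example = ElementElement(
    ElementElement(
        Closure(ElementElement(Single("E1"), Single("E2"))),
        Single("E3"),
    ),
    Alternation(
        ElementAttribute(Single("E4"), Single("@a"), "V"),
        ElementElement(Single("E5"), Single("E6"), any_depth=True),
    ),
)
counts = count_kinds(example)
```

Each node of the tree corresponds to one intermediate result; evaluating the tree bottom-up gives the order in which the joins would be applied.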


3 Algorithms

  • 3 Algorithms:

    • EA-Join: Element and Attribute Join.

    • EE-Join: Element and Element Join

    • Kleene Closure


EA-Join: Element and Attribute Join

Input:

{E1,…,Em}: Ei is a set of elements having a common document identifier (did);

{A1,…,An}: Aj is a set of attributes having a common document identifier (did);

Output:

A set of (e,a) pairs such that the element e is the parent of the attribute a.


EA-Join: Element and Attribute Join

The Algorithm:

// Sort-merge {Ei} and {Aj} by did.

(1) foreach Ei and Aj with the same did do:

    // Sort-merge Ei and Aj by
    // PARENT-CHILD relationship

    (2) foreach e ∈ Ei and a ∈ Aj do

        (3) if (e is a parent of a) then output (e,a)

    end

end
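
The pseudocode above can be sketched as a single-scan merge in Python. The sketch rests on stated assumptions: inputs are dicts from did to lists already sorted by order value, each attribute record carries its parent's order value, and attributes are numbered directly after their parent element, before any child elements. The record types and the function name are hypothetical.

```python
from typing import Dict, Iterator, List, NamedTuple, Tuple


class Element(NamedTuple):
    order: int
    size: int


class Attribute(NamedTuple):
    order: int
    parent_order: int
    vid: int  # value identifier, as in the Attribute Index


def ea_join(
    elements: Dict[int, List[Element]],
    attributes: Dict[int, List[Attribute]],
) -> Iterator[Tuple[int, Element, Attribute]]:
    """Yield (did, e, a) for every attribute a whose parent is element e."""
    # First merge: only documents present on both sides can contribute.
    for did in sorted(elements.keys() & attributes.keys()):
        elems, attrs = elements[did], attributes[did]
        i = j = 0
        # Second merge: one forward pass over both sorted lists.
        while i < len(elems) and j < len(attrs):
            e, a = elems[i], attrs[j]
            if a.parent_order == e.order:
                yield did, e, a
                j += 1  # e may own further attributes
            elif a.order < e.order:
                j += 1  # a's parent is not among the input elements
            else:
                i += 1  # e has no more attributes in this list


A1, A2 = 10, 11  # hypothetical vids for the values "A1" and "A2"
elements = {1: [Element(1, 3), Element(3, 1)], 2: [Element(1, 0)]}
attributes = {1: [Attribute(2, 1, A1), Attribute(4, 3, A2)],
              3: [Attribute(2, 1, A1)]}

pairs = list(ea_join(elements, attributes))
matches = [e for _, e, a in pairs if a.vid == A1]  # /Ele[@Att="A1"]
print(matches)  # [Element(order=1, size=3)]
```

The final filter on `vid` compares integers, which is where the Value Table pays off.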


EA-Join – Example

  • Consider the XML document:

    <Ele Att="A1">

    <Ele Att="A2"> </Ele>

    </Ele>

  • Its nodes are numbered with <order,size> pairs: outer Ele <1,3>, its Att <2,0>, inner Ele <3,1>, its Att <4,0>.

  • And the query: /Ele[@Att="A1"]


EA-Join – Querying /Ele[@Att="A1"]

<Ele Att="A1">

<Ele Att="A2"> </Ele>

</Ele>

  • Sort-merging the "Ele"s and "Att"s by parent-child relationship gives the list:

    <1,3>, <2,0>, <3,1>, <4,0>

  • Finding the "Ele" elements that have a child attribute "Att" with the value "A1" in the resulting list is easy using the information in the Element Record.


EA-Join – Comments

  • Only a two-stage sort-merge operation without additional cost of sorting:

    • First merge: by did.

    • Second merge: by examining parent-child relationship.

      • This merge is based on the order values of the element and attribute as defined by the numbering scheme.

  • Attributes should be placed before their sibling elements in the order of the numbering scheme.

    • This guarantees that elements and attributes with the same did can be merged in a single scan.


EE-Join: Element and Element Join

Input:

{E1,…,Em} and {F1,…,Fn}: each Ei or Fj is a set of elements having a common document identifier (did).

Output:

A set of (e,f) pairs such that element e is an ancestor of element f.


EE-Join: Element and Element Join

The Algorithm:

// Sort-merge {Ei} and {Fj} by did.

(1) foreach Ei and Fj with the same did do:

    // Sort-merge Ei and Fj by the
    // ANCESTOR-DESCENDANT relationship.

    (2) foreach e ∈ Ei and f ∈ Fj do

        (3) if (e is an ancestor of f) then output (e,f);

    end

end
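
A compact way to obtain the same output in Python is sketched below. It is not the merge procedure from the paper: instead of a sort-merge it uses binary search, relying on the fact that in a list sorted by order value the descendants of e are exactly the entries with order in (order(e), order(e) + size(e)]. The record type, the function name and the sample labels are assumptions.

```python
from bisect import bisect_right
from typing import Dict, Iterator, List, NamedTuple, Tuple


class Element(NamedTuple):
    order: int
    size: int


def ee_join(
    ancestors: Dict[int, List[Element]],
    descendants: Dict[int, List[Element]],
) -> Iterator[Tuple[int, Element, Element]]:
    """Yield (did, e, f) for every pair where e is an ancestor of f.

    Both inputs map a did to a list of elements sorted by order value.
    """
    for did in sorted(ancestors.keys() & descendants.keys()):
        fs = descendants[did]
        orders = [f.order for f in fs]
        for e in ancestors[did]:
            # Descendants of e form one contiguous run of the sorted list.
            lo = bisect_right(orders, e.order)
            hi = bisect_right(orders, e.order + e.size)
            for f in fs[lo:hi]:
                yield did, e, f


# Hypothetical labels: two chapters and four figures in document 7.
chapters = {7: [Element(1, 10), Element(20, 4)]}
figures = {7: [Element(3, 0), Element(5, 0), Element(22, 0), Element(30, 0)]}

pairs = list(ee_join(chapters, figures))
# The figure at order 30 lies outside both chapters and is not reported.
```

When the ancestor list contains nested elements, the same descendant is reported once per enclosing ancestor, so the descendant list is visited more than once.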


EE-Join – Comments

  • Only a two-stage sort-merge operation without the additional cost of sorting:

    • First merge: by did.

    • Second merge: by examining ancestor-descendant relationship.

  • The sets of elements with a matching did cannot be merged in a single scan.


Kleene Closure

Input:

{E1,…,Em}, where Ei is a group of elements from an XML document.

Output:

A Kleene closure of {E1,…,Em}.


Kleene Closure

The Algorithm:

(1) set i ← 1;

(2) set K_i^C ← {E1,…,Em};

(3) repeat

    (4) set i ← i + 1;

    (5) set K_i^C ← EE-Join(K_{i-1}^C, K_1^C);

    until (K_i^C is empty);

(6) output the union of K_1^C, K_2^C, …, K_i^C;
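
The iteration can be sketched for a single document as follows. The representation is an assumption: each K_i is modeled as a set of (first, last) pairs describing a chain of i nested elements, and a plain nested loop with the ancestor test stands in for EE-Join. The names and sample labels are hypothetical.

```python
from typing import Callable, List, NamedTuple, Set, Tuple


class Element(NamedTuple):
    order: int
    size: int


def is_ancestor(x: Element, y: Element) -> bool:
    return x.order < y.order <= x.order + x.size


Pair = Tuple[Element, Element]


def kleene_closure(
    k1: List[Element],
    related: Callable[[Element, Element], bool] = is_ancestor,
) -> Set[Pair]:
    """Union of K_1, K_2, ... where K_i joins K_{i-1} with K_1.

    A pair (first, last) in K_i stands for a chain of i elements from k1
    in which each element is `related` to the next one.
    """
    levels: List[Set[Pair]] = [{(e, e) for e in k1}]  # K_1
    while True:
        previous = levels[-1]
        # Stand-in for EE-Join(K_{i-1}, K_1): extend every chain by one step.
        current = {
            (first, e)
            for (first, last) in previous
            for e in k1
            if related(last, e)
        }
        if not current:  # K_i is empty: no longer chains exist
            break
        levels.append(current)
    return set().union(*levels)


a, b, c, d = Element(1, 10), Element(2, 5), Element(3, 0), Element(20, 0)
closure = kleene_closure([a, b, c, d])
```

The loop terminates because every step moves strictly down the tree, so a chain can never be longer than the number of input elements.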


Performance Experiments

  • EE-Join:

  • Results:

    • Real World: an order of magnitude faster.

    • Synthetic Data: 6 to 10 times faster.


Performance Experiments

  • EA-Join:

  • Results:

    • Compared to Top-Down: a better performance.

    • Compared to Bottom-Up: no winner - close results.


Performance Results - Conclusions

  • The proposed algorithms can achieve performance improvement over the conventional methods (top-down and bottom-up tree traversals) by up to an order of magnitude.

