Indexing and Querying XML Data for Regular Path Expressions

A Paper by Quanzhong Li and Bongki Moon

Presented by Amnon Shochot


Our Objective

  • Developing a system that will enable us to perform XML data queries efficiently.


XML Query Languages

  • Used for retrieving data from XML files.

  • Use a regular path expression syntax.

  • e.g. XPath, XQuery.


Queries Today - Inefficient

  • Queries are usually evaluated by traversing the XML tree, which is inefficient.

    • Top-Down Approach

    • Bottom-Up Approach

    • An example:

      the query:

      /chapter/_*/figure

      (finding all figures in all chapters.)
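As a rough illustration of the top-down approach, the sketch below evaluates chapter/_*/figure by walking the tree with Python's standard xml.etree.ElementTree module. The sample document, the id attributes, and the function name figures_in_chapters are assumptions made up for this example. The point is only that every subtree has to be visited.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, not taken from the paper.
DOC = """
<book>
  <chapter>
    <figure id="f1"/>
    <section><figure id="f2"/></section>
  </chapter>
  <appendix><figure id="f3"/></appendix>
</book>
"""

def figures_in_chapters(root):
    """Top-down evaluation of chapter/_*/figure by tree traversal."""
    found = []
    # First pass: walk the whole tree to locate chapter elements.
    for chapter in root.iter("chapter"):
        # Second pass: walk every descendant of each chapter.
        for node in chapter.iter("figure"):
            found.append(node.get("id"))
    return found

root = ET.fromstring(DOC)
print(figures_in_chapters(root))  # ['f1', 'f2']
```

The figure inside appendix is skipped, but only after the traversal has looked at it, which is the cost the indexing approach tries to avoid.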


Our Objective - Refined

  • Developing a system that will enable us to perform XML data queries efficiently

  • Developing such a system consists of:

    • Developing a way to efficiently store XML data.

    • Developing efficient algorithms for processing regular path expressions (e.g. XQuery expressions).


Storing XML Documents

  • Question: What would we need from a data structure to be able to perform an efficient query?

  • Answer: A mechanism for:

    • Efficiently finding all elements/attributes with a given name.

    • Efficiently finding all values with a given name.

    • Efficiently resolving ancestor-descendant relationship.


Storing XML Documents - XISS

  • XISS - XML Indexing and Storage System.

  • Provides us with ways to:

    • efficiently find all elements or attributes with the same name string, grouped by the document they belong to.

    • quickly determine the ancestor-descendant relationship between elements and/or attributes in the hierarchy of XML data.


Determining Ancestor-Descendant Relationship

  • According to Dietz’s numbering scheme: for two given nodes x and y of a tree T, x is an ancestor of y iff x occurs before y in the preorder traversal and after y in the postorder traversal.

  • Example:
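A minimal sketch of Dietz's test, assuming a small hypothetical tree stored as child lists (the node names a to e and the helper names are illustrative only):

```python
# Hypothetical tree as child lists; node names are illustrative.
TREE = {"a": ["b", "e"], "b": ["c", "d"], "c": [], "d": [], "e": []}

def number(tree, root):
    """Return (pre, post): each node's rank in preorder and postorder."""
    pre, post = {}, {}

    def visit(node):
        pre[node] = len(pre) + 1      # numbered on the way down
        for child in tree[node]:
            visit(child)
        post[node] = len(post) + 1    # numbered on the way back up

    visit(root)
    return pre, post

def is_ancestor(x, y, pre, post):
    """Dietz's test: x precedes y in preorder and follows y in postorder."""
    return pre[x] < pre[y] and post[x] > post[y]

pre, post = number(TREE, "a")
print(is_ancestor("a", "d", pre, post))  # True
print(is_ancestor("b", "e", pre, post))  # False
```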


Determining Ancestor-Descendant Relationship – cont.

  • Advantage: the ancestor-descendant relationship can be determined in constant time.

  • Disadvantage: a lack of flexibility.

    • e.g. inserting a new node requires recomputing the preorder and postorder numbers of many tree nodes.



Determining Ancestor-Descendant Relationship – cont.

  • A new numbering scheme:

    • Each node is associated with an <order, size> pair:

      • For a tree node y and its parent x:

        [order(y), order(y) + size(y)] ⊂ (order(x), order(x) + size(x)]

      • For two sibling nodes x and y, if x is the predecessor of y in the preorder traversal, then:

        order(x) + size(x) < order(y).


Determining Ancestor-Descendant Relationship – cont.

  • Fact: for two given nodes x and y of a tree T, x is an ancestor of y iff:

    order(x) < order(y) ≤ order(x) + size(x)
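The <order, size> scheme can be sketched as follows. The tree, the helper names, and in particular the slack parameter (how many spare numbers to reserve in each interval) are assumptions for illustration; the source does not say how sizes are chosen.

```python
# Hypothetical tree as child lists; node names are illustrative.
TREE = {"a": ["b", "e"], "b": ["c", "d"], "c": [], "d": [], "e": []}

def assign(tree, root, slack=2):
    """Label every node with (order, size). `slack` extra numbers are
    reserved in each node's interval (an assumed policy) for insertions."""
    labels = {}

    def visit(node, order):
        next_free = order + 1
        for child in tree[node]:
            next_free = visit(child, next_free) + 1
        size = (next_free - 1 - order) + slack
        labels[node] = (order, size)
        return order + size  # last number covered by this node's interval

    visit(root, 1)
    return labels

def is_ancestor(x, y, labels):
    """x is an ancestor of y iff order(x) < order(y) <= order(x) + size(x)."""
    (ox, sx), (oy, _) = labels[x], labels[y]
    return ox < oy <= ox + sx

labels = assign(TREE, "a")
print(labels["b"])                     # (2, 8)
print(is_ancestor("a", "d", labels))   # True
print(is_ancestor("b", "e", labels))   # False

# Insert a new child "f" under "b" using spare numbers; nothing else changes.
labels["f"] = (9, 1)
print(is_ancestor("b", "f", labels), is_ancestor("d", "f", labels))  # True False
```

With slack set to 0 the intervals are tight and an insertion would force renumbering, which is the situation the spare numbers are meant to avoid.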


Determining Ancestor-Descendant Relationship – cont.

  • Properties:

    • the ancestor-descendant relationship can be determined in constant time.

    • flexibility – node insertion usually doesn’t require renumbering other tree nodes, because size(x) may be chosen larger than the current number of descendants of x, leaving room for new nodes.

    • an element can be uniquely identified in a document by its order value.



XISS System Overview

  • How the system works:

    • XML documents are loaded into the XISS system.

    • These documents are added to the XISS data structures.

      • Each document is assigned a document id (did).

      • Index structures are organized as paged files for efficient disk I/O.

    • When a query is performed, the query processor interacts with XISS to obtain the information required for the query.


XISS - cont.

  • XISS consists of 5 components:

    • Name Index

    • Value Table

    • Element Index

    • Attribute Index

    • Structure Index


Name Index and Value Table

  • Objective: minimizing the storage and computation overhead by eliminating replicated strings and string comparisons.

  • Name Index: maps distinct name strings to unique name identifiers (nid).

  • Value Table: maps distinct value strings (i.e. attribute values and text values) to unique value identifiers (vid).

  • Both are implemented as B+-trees.
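The idea of replacing strings by identifiers can be sketched with an in-memory table. This is only an illustration: a Python dict stands in for the B+-tree, and the class name StringTable and its methods are hypothetical.

```python
class StringTable:
    """Maps each distinct string to a small integer id.
    A dict stands in for the B+-tree used by XISS (assumption)."""

    def __init__(self):
        self._ids = {}       # string -> id
        self._strings = []   # id -> string

    def intern(self, text):
        """Return the id of `text`, assigning a new one on first sight."""
        if text not in self._ids:
            self._ids[text] = len(self._strings)
            self._strings.append(text)
        return self._ids[text]

    def lookup(self, ident):
        """Return the string stored under `ident`."""
        return self._strings[ident]

names = StringTable()    # plays the role of the Name Index (string -> nid)
values = StringTable()   # plays the role of the Value Table (string -> vid)

nid_chapter = names.intern("chapter")
nid_figure = names.intern("figure")
vid_frogs = values.intern("Tree Frogs")

# A repeated name costs no extra storage and compares as an integer.
print(names.intern("chapter") == nid_chapter)  # True
print(values.lookup(vid_frogs))                # Tree Frogs
```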


The Element Index

  • Objective: quickly finding all elements with the same name string.

  • Structure:


The Element Index – cont.

  • Structure:

    • B+-tree using nid as a key.

    • Leaf nodes: pointers to a set of records for elements (or attributes) having an identical name string, grouped by the document they belong to.

    • Element Record = {<order,size>, Depth, Parent ID}

      • where Depth is the depth of the element in the XML tree.

    • Element Records are ordered by <order,size>.
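A rough in-memory picture of the Element Index, assuming nested dicts in place of the B+-tree and its leaf pointers. The field name parent (holding the parent's order value, 0 for the root) and the sample nid and did values are assumptions.

```python
from bisect import insort
from collections import defaultdict
from typing import NamedTuple

class ElementRecord(NamedTuple):
    order: int
    size: int
    depth: int
    parent: int   # assumed: order value of the parent element, 0 for the root

# nid -> did -> records kept sorted by (order, size)
element_index = defaultdict(lambda: defaultdict(list))

def add_element(nid, did, record):
    """Insert a record, keeping each per-document list sorted."""
    insort(element_index[nid][did], record)

def elements_named(nid):
    """All records with the given name id, grouped by document id."""
    return {did: list(recs) for did, recs in sorted(element_index[nid].items())}

NID_ELE = 7  # hypothetical name identifier
add_element(NID_ELE, 1, ElementRecord(order=3, size=1, depth=2, parent=1))
add_element(NID_ELE, 1, ElementRecord(order=1, size=3, depth=1, parent=0))
add_element(NID_ELE, 2, ElementRecord(order=1, size=0, depth=1, parent=0))

for did, recs in elements_named(NID_ELE).items():
    print(did, [(r.order, r.size) for r in recs])
```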


The Attribute Index

  • Objective: quickly finding all attributes with the same name string.

  • Structure:

    • Same structure as the Element Index, except that each record in the Attribute Index also has a value identifier (vid), which is the key used to obtain the attribute value from the Value Table.


The Structure Index

  • Objectives:

    • Finding the parent element and child elements (or attributes) for a given element.

    • Finding the parent element for a given attribute.

  • Structure:


The Structure Index – cont.

  • Structure:

    • B+-tree using document identifier (did) as a key.

    • Leaf nodes: linear arrays with records for all elements and attributes from an XML document.

    • Each record: {nid, <order,size>, Parent order, Child order, Sibling order, Attribute order}.

    • Records are ordered by order value.
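The sketch below shows how such records could be used to walk from an element to its children. It assumes that Child order points to the first child element, Sibling order to the next sibling element, Attribute order to the first attribute, and that 0 means "none"; the slides do not spell these conventions out, and a dict keyed by order stands in for the linear array.

```python
from typing import NamedTuple

class StructRecord(NamedTuple):
    nid: int
    order: int
    size: int
    parent: int     # order of the parent, 0 if none (assumed convention)
    child: int      # order of the first child element, 0 if none (assumed)
    sibling: int    # order of the next sibling element, 0 if none (assumed)
    attribute: int  # order of the first attribute, 0 if none (assumed)

ELE, ATT = 0, 1  # hypothetical name identifiers

# did -> {order: record}, for a document with two nested Ele elements,
# each carrying one Att attribute.
structure_index = {
    1: {r.order: r for r in [
        StructRecord(ELE, 1, 3, parent=0, child=3, sibling=0, attribute=2),
        StructRecord(ATT, 2, 0, parent=1, child=0, sibling=0, attribute=0),
        StructRecord(ELE, 3, 1, parent=1, child=0, sibling=0, attribute=4),
        StructRecord(ATT, 4, 0, parent=3, child=0, sibling=0, attribute=0),
    ]}
}

def children(did, order):
    """Order values of the child elements of the element at `order`."""
    recs = structure_index[did]
    out, nxt = [], recs[order].child
    while nxt:                      # follow the sibling chain
        out.append(nxt)
        nxt = recs[nxt].sibling
    return out

print(children(1, 1))                 # [3]
print(structure_index[1][4].parent)   # 3
```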


Querying Method

  • Decomposing path expressions into simple path expressions.

  • Applying algorithms on simple path expressions and their intermediate results.


Decomposition of Path Expressions

  • The main idea:

    • A complex path expression is decomposed into several simple path expressions.

    • Each simple path expression produces an intermediate result that can be used in the subsequent stage of processing.

    • The results of the simple path expressions are then combined or joined together to obtain the final result of the given query.


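Basic Subexpressions - Example

Decomposition of (E1/E2)*/E3/((E4[@a=V]) | (E5/_*/E6)):

  • (1) Single Element/Attribute: E1, E2, E3, E4, @a, E5, E6

  • (3) Element-Element: E1/E2

  • (4) Kleene Closure: (E1/E2)*

  • (3) Element-Element: (E1/E2)*/E3

  • (2) Element-Attribute: E4[@a=V]

  • (3) Element-Element: E5/_*/E6

  • (5) Union: (E4[@a=V]) | (E5/_*/E6)

  • (3) Element-Element: the join of (E1/E2)*/E3 with the union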


Basic Subexpressions

5 basic subexpressions:

(1) A subexpression with a single element or a single attribute.

(2) A subexpression with an element and an attribute.

  • e.g. figure[@caption = “Tree Frogs”]

(3) A subexpression with two elements.

  • e.g. chapter/_*/figure, where ‘_’ denotes any kind of node.


Basic Subexpressions - cont.

5 basic subexpressions - cont.:

(4) A subexpression that is a Kleene closure (+,*) of another subexpression.

(5) A subexpression that is a union of two other subexpressions.
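One way to picture the five subexpression types is as node classes of an expression tree. The class names, fields, and the kinds helper below are all hypothetical; the tree built at the end is one plausible parse of (E1/E2)*/E3/((E4[@a=V]) | (E5/_*/E6)).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Single:            # type (1): a single element or attribute
    name: str

@dataclass(frozen=True)
class ElemAttr:          # type (2): an element and an attribute
    elem: object
    attr: object
    value: str

@dataclass(frozen=True)
class ElemElem:          # type (3): two elements
    left: object
    right: object
    any_depth: bool = False   # True for the E/_*/F form

@dataclass(frozen=True)
class Closure:           # type (4): Kleene closure of a subexpression
    inner: object

@dataclass(frozen=True)
class UnionOf:           # type (5): union of two subexpressions
    left: object
    right: object

KIND = {Single: 1, ElemAttr: 2, ElemElem: 3, Closure: 4, UnionOf: 5}

def kinds(expr):
    """Subexpression types in post-order, i.e. operands before operators."""
    out = []
    for field in ("elem", "attr", "left", "right", "inner"):
        if hasattr(expr, field):
            out += kinds(getattr(expr, field))
    return out + [KIND[type(expr)]]

EXPR = ElemElem(
    ElemElem(Closure(ElemElem(Single("E1"), Single("E2"))), Single("E3")),
    UnionOf(
        ElemAttr(Single("E4"), Single("a"), "V"),
        ElemElem(Single("E5"), Single("E6"), any_depth=True),
    ),
)
print(kinds(EXPR))
```

Evaluating the leaves first and combining results upward mirrors the idea that each simple subexpression yields an intermediate result for the next stage.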


3 Algorithms

  • 3 Algorithms:

    • EA-Join: Element and Attribute Join.

    • EE-Join: Element and Element Join

    • Kleene Closure


EA-Join: Element and Attribute Join

Input:

{E1,…,Em}: Ei is a set of elements having a common document identifier (did);

{A1,…,An}: Aj is a set of attributes having a common document identifier (did);

Output:

A set of (e,a) pairs such that the element e is the parent of the attribute a.


EA-Join: Element and Attribute Join

The Algorithm:

    // Sort-merge {Ei} and {Aj} by did.
(1) foreach Ei and Aj with the same did do
      // Sort-merge Ei and Aj by PARENT-CHILD relationship.
(2)   foreach e ∈ Ei and a ∈ Aj do
(3)     if (e is a parent of a) then output (e,a);
      end
    end
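A possible Python rendering of the EA-Join, under several assumptions: records are plain tuples, each attribute record stores its parent's order value in a field called parent, both inputs are dicts keyed by did with lists already sorted by order, and the did merge is done with a sorted key intersection rather than a true sort-merge.

```python
from typing import NamedTuple

class Elem(NamedTuple):
    order: int
    size: int

class Attr(NamedTuple):
    order: int
    parent: int   # assumed: order value of the owning element
    vid: int      # value identifier (hypothetical numbering)

def ea_join(elems_by_did, attrs_by_did):
    """Return (did, e, a) triples where element e is the parent of attribute a."""
    out = []
    # Stage 1: pair up groups with the same did.
    for did in sorted(elems_by_did.keys() & attrs_by_did.keys()):
        elems, attrs = elems_by_did[did], attrs_by_did[did]
        i = 0
        # Stage 2: one forward scan over both lists, sorted by order.
        for a in attrs:
            while i < len(elems) and elems[i].order < a.parent:
                i += 1
            if i < len(elems) and elems[i].order == a.parent:
                out.append((did, elems[i], a))
    return out

# Two nested Ele elements, each with one Att attribute (vid 0 and vid 1).
elems = {1: [Elem(1, 3), Elem(3, 1)]}
attrs = {1: [Attr(2, 1, vid=0), Attr(4, 3, vid=1)]}

pairs = ea_join(elems, attrs)
print([(e.order, a.order) for _, e, a in pairs])          # [(1, 2), (3, 4)]
print([e.order for _, e, a in pairs if a.vid == 0])       # [1]
```

The element pointer i never moves backwards, which relies on attributes being numbered before their sibling elements.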


EA-Join – Example

  • Consider the XML document:

    <Ele Att="A1">
      <Ele Att="A2"> </Ele>
    </Ele>

  • Its nodes are numbered with these <order,size> pairs:

    outer Ele <1,3>, its Att <2,0>, inner Ele <3,1>, its Att <4,0>

  • And the query: /Ele[@Att="A1"]


EA-Join – Querying /Ele[@Att="A1"]

  • Sort-merging the “Ele”s and “Att”s by parent-child relationship gives the list:

    <1,3>, <2,0>, <3,1>, <4,0>

  • From this list, the “Ele” elements that have a child attribute “Att” with the value “A1” are easy to find using the information in the Element Record.


EA-Join – Comments

  • Only a two-stage sort-merge operation without additional cost of sorting:

    • First merge: by did.

    • Second merge: by examining parent-child relationship.

      • This merge is based on the order values of the element and attribute as defined by the numbering scheme.

  • Attributes should be placed before their sibling elements in the order of the numbering scheme.

    • This guarantees that elements and attributes with the same did can be merged in a single scan.


EE-Join: Element and Element Join

Input:

{E1,…,Em} and {F1,…,Fn}: each Ei or Fj is a set of elements having a common document identifier (did).

Output:

A set of (e,f) pairs such that element e is an ancestor of element f.


EE-Join: Element and Element Join

The Algorithm:

    // Sort-merge {Ei} and {Fj} by did.
(1) foreach Ei and Fj with the same did do
      // Sort-merge Ei and Fj by the ANCESTOR-DESCENDANT relationship.
(2)   foreach e ∈ Ei and f ∈ Fj do
(3)     if (e is an ancestor of f) then output (e,f);
      end
    end
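A sketch of the EE-Join in Python, assuming tuple records, dict inputs keyed by did with lists sorted by order, and a sorted key intersection in place of a true sort-merge on did. The ancestor test is the <order, size> condition order(e) < order(f) ≤ order(e) + size(e).

```python
from typing import NamedTuple

class Elem(NamedTuple):
    order: int
    size: int

def ee_join(es_by_did, fs_by_did):
    """Return (did, e, f) triples where element e is an ancestor of element f."""
    out = []
    # Stage 1: pair up groups with the same did.
    for did in sorted(es_by_did.keys() & fs_by_did.keys()):
        es, fs = es_by_did[did], fs_by_did[did]
        start = 0
        for e in es:
            # Skip candidates that cannot be descendants of e or of any later e.
            while start < len(fs) and fs[start].order <= e.order:
                start += 1
            # Stage 2: scan the candidates that fall inside e's interval.
            k = start
            while k < len(fs) and fs[k].order <= e.order + e.size:
                out.append((did, e, fs[k]))
                k += 1
    return out

# Nested ancestors: both <1,10> and <2,4> contain the element at order 3.
es = {1: [Elem(1, 10), Elem(2, 4)]}
fs = {1: [Elem(3, 0), Elem(8, 0)]}
print([(e.order, f.order) for _, e, f in ee_join(es, fs)])
# [(1, 3), (1, 8), (2, 3)]
```

The inner scan restarts from start for every e, so a descendant nested under several ancestors is visited more than once; this is why a single forward pass is not enough here.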


EE-Join – Comments

  • Only two-stage sort-merge operation without the additional cost of sorting:

    • First merge: by did.

    • Second merge: by examining ancestor-descendant relationship.

  • Unlike in EA-Join, the sets of elements with a matching did cannot be merged in a single scan.


Kleene Closure

Input:

{E1,…,Em}, where Ei is a group of elements from an XML document.

Output:

A Kleene closure of {E1,…,Em}.


Kleene Closure

The Algorithm:

(1) set i ← 1;

(2) set K_i^C ← {E1,…,Em};

(3) repeat

(4)   set i ← i + 1;

(5)   set K_i^C ← EE-Join(K_(i-1)^C, K_1^C);

    until (K_i^C is empty);

(6) output the union of K_1^C, K_2^C, …, K_i^C;
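The loop can be sketched as below for a single document. The representation is an assumption: each K_i is a set of (head, tail) pairs describing chains of i elements, and a nested-loop join stands in for the sort-merge EE-Join. The slides do not specify how intermediate results are stored.

```python
from typing import NamedTuple

class Elem(NamedTuple):
    order: int
    size: int

def is_ancestor(x, y):
    """order(x) < order(y) <= order(x) + size(x)"""
    return x.order < y.order <= x.order + x.size

def join(chains, singles):
    """Extend each (head, tail) chain by one element below its tail.
    A nested loop stands in for the sort-merge EE-Join (simplification)."""
    return {(h, f) for (h, t) in chains for (f, _) in singles if is_ancestor(t, f)}

def kleene_closure(elems):
    """Return [K_1, K_2, ...], stopping before the first empty K_i."""
    k1 = {(e, e) for e in elems}
    levels = [k1]
    while True:
        nxt = join(levels[-1], k1)
        if not nxt:
            break
        levels.append(nxt)
    return levels

# Three nested elements plus one unrelated element.
elems = [Elem(1, 5), Elem(2, 3), Elem(3, 0), Elem(10, 0)]
levels = kleene_closure(elems)
print([len(k) for k in levels])  # [4, 3, 1]
```

The loop terminates because every round needs a strictly deeper element, and the nesting depth of a document is finite.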


Performance Experiments

  • EE-Join:

  • Results:

    • Real-world data: an order of magnitude faster.

    • Synthetic data: 6 to 10 times faster.


Performance Experiments

  • EA-Join:

  • Results:

    • Compared to Top-Down: a better performance.

    • Compared to Bottom-Up: no winner - close results.


Performance Results - Conclusions

  • The proposed algorithms can achieve performance improvement over the conventional methods (top-down and bottom-up tree traversals) by up to an order of magnitude.

