Indexing and Querying XML Data for Regular Path Expressions

Indexing and Querying XML Data for Regular Path Expressions. A Paper by Quanzhong Li and Bongki Moon Presented by Amnon Shochot. Our Objective. Developing a system that will enable us to perform XML data queries efficiently. XML Queries Languages. Used for retrieving data from XML files.

A Paper by Quanzhong Li and Bongki Moon

Presented by Amnon Shochot

slide2

Our Objective

  • Developing a system that will enable us to perform XML data queries efficiently.
slide3

XML Query Languages

  • Used for retrieving data from XML files.
  • Use a regular path expression syntax.
  • e.g. XPath, XQuery.
slide4

Queries Today - Inefficient

  • Queries are usually evaluated by traversing the XML tree, which is inefficient.
    • Top-Down Approach
    • Bottom-Up Approach
  • An example: the query

/chapter/_*/figure

finds all figures in all chapters.
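To make the cost of a traversal concrete, here is a minimal sketch of a top-down evaluation of /chapter/_*/figure using Python's standard xml.etree.ElementTree. The sample document, the function name top_down, and the visited list are illustrative assumptions and do not come from the presentation. Note that every node below the chapter is visited, including subtrees that contain no figure.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for illustration.
DOC = (
    "<chapter><title>Frogs</title>"
    "<section><figure id='f1'/><para>text</para>"
    "<section><figure id='f2'/></section></section></chapter>"
)


def top_down(node, target, visited):
    """Collect all descendants of `node` named `target`, recording every node touched."""
    hits = []
    for child in node:
        visited.append(child.tag)
        if child.tag == target:
            hits.append(child)
        hits.extend(top_down(child, target, visited))
    return hits


root = ET.fromstring(DOC)
visited = []
figures = top_down(root, "figure", visited) if root.tag == "chapter" else []
print([f.get("id") for f in figures], "nodes visited:", len(visited))
```

The traversal touches all six nodes under the chapter to find two figures, which is the overhead the indexing approach aims to avoid.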

slide5

Our Objective - Refined

  • Developing a system that will enable us to perform XML data queries efficiently
  • Developing such a system consists of:
    • Developing a way to efficiently store XML data.
    • Developing efficient algorithms for processing regular path expressions (e.g. XQuery expressions).
slide6

Storing XML Documents

  • Question: What would we need from a data structure to be able to perform an efficient query?
  • Answer: A mechanism for:
    • Efficiently finding all elements/attributes with a given name.
    • Efficiently finding all values with a given name.
    • Efficiently resolving ancestor-descendant relationship.
slide7

Storing XML Documents - XISS

  • XISS - XML Indexing and Storage System.
  • Provides us with ways to:
    • efficiently find all elements or attributes with the same name string, grouped by the document they belong to.
    • quickly determine the ancestor-descendant relationship between elements and/or attributes in the hierarchy of XML data.
slide8

Determining Ancestor-Descendant Relationship

  • According to Dietz’s numbering scheme: for two given nodes x and y of a tree T, x is an ancestor of y iff x occurs before y in the preorder traversal and after y in the postorder traversal.
  • Example:
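The original example figure is not available in this transcript. As a stand-in, the following minimal sketch numbers a small, made-up tree in preorder and postorder and applies the test stated above. The Node class, the function names, and the sample tree are assumptions for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)
    pre: int = 0   # rank in the preorder traversal
    post: int = 0  # rank in the postorder traversal


def number(root):
    """Assign preorder and postorder ranks in a single depth-first walk."""
    counters = {"pre": 0, "post": 0}

    def visit(node):
        counters["pre"] += 1
        node.pre = counters["pre"]
        for child in node.children:
            visit(child)
        counters["post"] += 1
        node.post = counters["post"]

    visit(root)


def is_ancestor(x, y):
    """x is an ancestor of y iff x comes before y in preorder and after y in postorder."""
    return x.pre < y.pre and x.post > y.post


figure = Node("figure")
section = Node("section", [figure])
title = Node("title")
chapter = Node("chapter", [title, section])
number(chapter)

print(is_ancestor(chapter, figure), is_ancestor(title, figure))
```

Once the ranks are stored, the test is two integer comparisons and needs no tree walk.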
slide9

Determining Ancestor-Descendant Relationship – cont.

  • Advantage: the ancestor-descendant relationship can be determined in constant time.
  • Disadvantage: a lack of flexibility.
    • e.g. inserting a new node requires recomputing the preorder and postorder numbers of many tree nodes.
slide10

Determining Ancestor-Descendant Relationship – cont.

  • A new numbering scheme:
    • Each node is associated with an <order, size> pair:
      • For a tree node y and its parent x:

[order(y), order(y) + size(y)] ⊂ (order(x), order(x) + size(x)]

      • For two sibling nodes x and y, if x is the predecessor of y in the preorder traversal, then:

order(x) + size(x) < order(y).

slide11

Determining Ancestor-Descendant Relationship – cont.

  • Fact: for two given nodes x and y of a tree T, x is an ancestor of y iff:

order(x) < order(y) ≤ order(x) + size(x)
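The numbering scheme and the ancestor test can be sketched as follows. This is one possible way to assign the pairs, assuming a fixed amount of unused "slack" reserved after each node's descendants. The slack parameter, the Node class, and the sample tree are illustrative assumptions, not the allocation policy of the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)
    order: int = 0
    size: int = 0


def assign(node, order, slack=2):
    """Number the subtree rooted at `node`, starting at `order`; return the node's size.

    `slack` unused order values are reserved inside every node's interval
    so that later insertions can be numbered without touching other nodes.
    """
    node.order = order
    next_order = order + 1
    for child in node.children:
        # The next sibling must start after order(child) + size(child).
        next_order += assign(child, next_order, slack) + 1
    node.size = (next_order - 1 - order) + slack
    return node.size


def is_ancestor(x, y):
    """order(x) < order(y) <= order(x) + size(x)"""
    return x.order < y.order <= x.order + x.size


figure = Node("figure")
section = Node("section", [figure])
title = Node("title")
chapter = Node("chapter", [title, section])
assign(chapter, 1)

# A new child of `figure` fits into the reserved slack without renumbering.
caption = Node("caption", order=figure.order + 1, size=0)
figure.children.append(caption)

print(is_ancestor(chapter, caption), is_ancestor(title, caption))
```

The inserted caption node takes an unused order value inside the interval of figure, so no existing node is renumbered and the constant-time test still holds.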

slide12

Determining Ancestor-Descendant Relationship – cont.

  • Properties:
    • the ancestor-descendant relationship can be determined in constant time.
    • flexibility – node insertion usually doesn’t require recomputation of tree nodes.
    • an element can be uniquely identified in a document by its order value.
slide14

XISS System Overview

  • How the system works:
    • XML documents are loaded into the XISS system.
    • These documents are added to the XISS data structures.
      • Each document is assigned a document id (did).
      • Index structures are organized as paged files for efficient disk IO.
    • When a query is performed the query processor interacts with XISS in order to obtain the information required for the query.
slide15

XISS - cont.

  • XISS consists of 5 components:
    • Name Index
    • Value Table
    • Element Index
    • Attribute Index
    • Structure Index
slide16

Name Index and Value Table

  • Objective: minimizing the storage and computation overhead by eliminating replicated strings and string comparisons.
  • Name Index - maps distinct name strings to unique name identifiers (nid).
  • Value Table - maps distinct value strings (i.e. attribute values and text values) to unique value identifiers (vid).
  • Both are implemented as B+-trees.
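The idea of replacing strings with identifiers can be sketched with a small in-memory table. This is only an illustration: a Python dict stands in for the B+-tree, and the class name StringTable and its methods are assumptions, not XISS interfaces.

```python
class StringTable:
    """Maps each distinct string to a small integer identifier and back."""

    def __init__(self):
        self._ids = {}      # string -> identifier
        self._strings = []  # identifier -> string

    def intern(self, s):
        """Return the identifier for `s`, allocating a new one on first sight."""
        if s not in self._ids:
            self._ids[s] = len(self._strings)
            self._strings.append(s)
        return self._ids[s]

    def lookup(self, ident):
        return self._strings[ident]


name_index = StringTable()   # name string -> nid
value_table = StringTable()  # value string -> vid

nid_chapter = name_index.intern("chapter")
nid_figure = name_index.intern("figure")
nid_again = name_index.intern("chapter")  # same string, same nid
vid = value_table.intern("Tree Frogs")

print(nid_chapter, nid_figure, nid_again, value_table.lookup(vid))
```

After interning, comparing two names is an integer comparison, and each distinct string is stored once.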
slide17

The Element Index

  • Objective: quickly finding all elements with the same name string.
  • Structure:
slide18

The Element Index – cont.

  • Structure:
    • B+-tree using nid as a key.
    • Leaf nodes: pointers to a set of records for elements (or attributes) having an identical name string, grouped by the document they belong to.
    • Element Record = {<order,size>, Depth, Parent ID}
      • where Depth is the depth of the element in the XML tree.
    • Element Records are ordered by <order,size>.
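A minimal in-memory sketch of this layout follows. Nested dicts stand in for the B+-tree and its leaf pointers, and the field name parent (holding the order value of the parent) is an assumed encoding of "Parent ID". The nid and did values are made up.

```python
import bisect
from collections import defaultdict
from typing import NamedTuple


class ElementRecord(NamedTuple):
    order: int
    size: int
    depth: int
    parent: int  # assumed encoding of "Parent ID": order value of the parent


class ElementIndex:
    def __init__(self):
        # nid -> did -> records kept sorted by (order, size)
        self._by_nid = defaultdict(lambda: defaultdict(list))

    def add(self, nid, did, record):
        bisect.insort(self._by_nid[nid][did], record)

    def find(self, nid):
        """Return {did: [records]} for one name, grouped by document."""
        groups = self._by_nid.get(nid, {})
        return {did: list(recs) for did, recs in sorted(groups.items())}


FIGURE = 7  # hypothetical nid for the name "figure"
idx = ElementIndex()
idx.add(FIGURE, 2, ElementRecord(9, 0, 3, 5))
idx.add(FIGURE, 1, ElementRecord(6, 2, 3, 5))
idx.add(FIGURE, 1, ElementRecord(3, 1, 2, 1))

result = idx.find(FIGURE)
print(result)
```

Keeping each group sorted by order is what later allows the join algorithms to merge groups without a separate sorting step.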
slide19

The Attribute Index

  • Objective: quickly finding all attributes with the same name string.
  • Structure:
    • Same structure as the Element Index, except that each record in the Attribute Index also has a value identifier (vid), which is the key used to obtain the attribute value from the Value Table.
slide20

The Structure Index

  • Objectives:
    • Finding the parent element and child elements (or attributes) for a given element.
    • Finding the parent element for a given attribute.
  • Structure:
slide21

The Structure Index – cont.

  • Structure:
    • B+-tree using document identifier (did) as a key.
    • Leaf nodes: linear arrays with records for all elements and attributes from an XML document.
    • Each record: {nid, <order,size>, Parent order, Child order, Sibling order, Attribute order}.
    • Records are ordered by order value.
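The lookups this index supports can be sketched as below. A dict keyed by did stands in for the B+-tree, and a sorted Python list stands in for the linear array. Reading "Child order" as the first child, "Sibling order" as the next sibling, and "Attribute order" as the first attribute is an assumption. The sample records describe a two-element document with one attribute per element.

```python
import bisect
from typing import NamedTuple, Optional


class StructRecord(NamedTuple):
    nid: int
    order: int
    size: int
    parent: Optional[int]     # order of the parent, None for the root
    child: Optional[int]      # assumed: order of the first child element
    sibling: Optional[int]    # assumed: order of the next sibling element
    attribute: Optional[int]  # assumed: order of the first attribute


class StructureIndex:
    def __init__(self):
        self._docs = {}  # did -> (sorted order values, records in the same order)

    def load(self, did, records):
        recs = sorted(records, key=lambda r: r.order)
        self._docs[did] = ([r.order for r in recs], recs)

    def get(self, did, order):
        """Binary search of the document's record array by order value."""
        orders, recs = self._docs[did]
        i = bisect.bisect_left(orders, order)
        if i == len(orders) or orders[i] != order:
            raise KeyError((did, order))
        return recs[i]

    def parent(self, did, order):
        p = self.get(did, order).parent
        return None if p is None else self.get(did, p)

    def children(self, did, order):
        """Follow the first-child link, then the sibling links."""
        found, nxt = [], self.get(did, order).child
        while nxt is not None:
            rec = self.get(did, nxt)
            found.append(rec)
            nxt = rec.sibling
        return found


ELE, ATT = 0, 1  # hypothetical nids
index = StructureIndex()
index.load(1, [
    StructRecord(ELE, 1, 3, None, 3, None, 2),
    StructRecord(ATT, 2, 0, 1, None, None, None),
    StructRecord(ELE, 3, 1, 1, None, None, 4),
    StructRecord(ATT, 4, 0, 3, None, None, None),
])
print(index.parent(1, 4).order, [r.order for r in index.children(1, 1)])
```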
slide22

Querying Method

  • Decomposing path expressions into simple path expressions.
  • Applying algorithms on simple path expressions and their intermediate results.
slide23

Decomposition of Path Expressions

  • The main idea:
    • A complex path expression is decomposed into several simple path expressions.
    • Each simple path expression produces an intermediate result that can be used in the subsequent stage of processing.
    • The results of the simple path expressions are then combined or joined together to obtain the final result of the given query.
slide24

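Basic Subexpressions - Example

Decomposition of (E1/E2)*/E3/((E4[@a=V]) | (E5/_*/E6)):

[Figure: decomposition tree. Each node is labeled with the type of basic subexpression it forms.]

  • E1, E2, E3, E4, @a, E5, E6: type (1)
  • E1/E2: type (3)
  • (E1/E2)*: type (4)
  • E4[@a=V]: type (2)
  • E5/_*/E6: type (3)
  • (E4[@a=V]) | (E5/_*/E6): type (5)
  • The two remaining "/" joins: type (3)

Legend:

(1) Single Element/Attribute

(2) Element-Attribute

(3) Element-Element

(4) Kleene Closure

(5) Union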

slide25

Basic Subexpressions

5 basic subexpressions:

(1) A subexpression with a single element or a single attribute.

(2) A subexpression with an element and an attribute.

  • e.g. figure[@caption = "Tree Frogs"]

(3) A subexpression with two elements

  • e.g. chapter/_*/figure where ‘_’ denotes any kind of node.
slide26

Basic Subexpressions - cont.

5 basic subexpressions - cont.:

(4) A subexpression that is a Kleene closure (+,*) of another subexpression.

(5) A subexpression that is a union of two other subexpressions.
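The five kinds of basic subexpression can be modeled as a small expression tree. The sketch below builds the tree for (E1/E2)*/E3/((E4[@a=V]) | (E5/_*/E6)) by hand and counts how often each kind occurs. The class names and the any_depth flag are illustrative assumptions, and no parser is included.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Single:            # (1) a single element or attribute
    name: str


@dataclass
class ElemAttr:          # (2) element[@attribute=value]
    element: Single
    attribute: Single
    value: str


@dataclass
class ElemElem:          # (3) left/right, or left/_*/right when any_depth is True
    left: object
    right: object
    any_depth: bool = False


@dataclass
class Closure:           # (4) Kleene closure of another subexpression
    inner: object


@dataclass
class UnionOf:           # (5) union of two subexpressions
    left: object
    right: object


KINDS = (Single, ElemAttr, ElemElem, Closure, UnionOf)


def count_kinds(expr):
    """Count the basic subexpressions of each kind in the tree rooted at `expr`."""
    counts = Counter({type(expr).__name__: 1})
    for child in vars(expr).values():
        if isinstance(child, KINDS):
            counts += count_kinds(child)
    return counts


expr = ElemElem(
    ElemElem(Closure(ElemElem(Single("E1"), Single("E2"))), Single("E3")),
    UnionOf(
        ElemAttr(Single("E4"), Single("a"), "V"),
        ElemElem(Single("E5"), Single("E6"), any_depth=True),
    ),
)
counts = count_kinds(expr)
print(dict(counts))
```

Each inner node of such a tree corresponds to one join, closure, or union step, and each leaf corresponds to one index lookup.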

slide27

3 Algorithms

  • 3 Algorithms:
    • EA-Join: Element and Attribute Join.
    • EE-Join: Element and Element Join
    • Kleene Closure
slide28

EA-Join: Element and Attribute Join

Input:

{E1,…,Em}: Ei is a set of elements having a common document identifier (did);

{A1,…,An}: Aj is a set of attributes having a common document identifier (did);

Output:

A set of (e,a) pairs such that the element e is the parent of the attribute a.

slide29

EA-Join: Element and Attribute Join

The Algorithm:

// Sort-merge {Ei} and {Aj} by did.
(1) foreach Ei and Aj with the same did do
      // Sort-merge Ei and Aj by the PARENT-CHILD relationship.
(2)   foreach e ∈ Ei and a ∈ Aj do
(3)     if (e is a parent of a) then output (e,a);
      end
    end
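The pseudocode above can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: groups are passed as dicts keyed by did, the did merge is done with a sorted key intersection, and each attribute record is assumed to carry the order value of its parent in a field named parent. The sample data is a two-element document in which each element has one attribute.

```python
from typing import NamedTuple


class Elem(NamedTuple):
    order: int
    size: int


class Attr(NamedTuple):
    order: int
    parent: int  # assumed field: order value of the owning element
    value: str


def ea_join(elements, attributes):
    """Return (did, e, a) triples where element e is the parent of attribute a.

    Both inputs map did -> records sorted by order. Because attributes are
    numbered right after their element, their parent values are also sorted,
    so each pair of groups is merged in a single scan.
    """
    out = []
    for did in sorted(elements.keys() & attributes.keys()):
        es, attrs = elements[did], attributes[did]
        i = j = 0
        while i < len(es) and j < len(attrs):
            e, a = es[i], attrs[j]
            if a.parent < e.order:
                j += 1  # the parent of a is not in the element set
            elif a.parent > e.order:
                i += 1  # e has no further attributes in the set
            else:
                out.append((did, e, a))
                j += 1  # e may own more than one attribute
    return out


elements = {1: [Elem(1, 3), Elem(3, 1)]}
attributes = {1: [Attr(2, 1, "A1"), Attr(4, 3, "A2")]}
pairs = ea_join(elements, attributes)
matches = [e for _, e, a in pairs if a.value == "A1"]
print(pairs, matches)
```

Filtering the joined pairs on the attribute value "A1" leaves only the outer element, which corresponds to the query /Ele[@Att="A1"].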

slide30

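EA-Join – Example

  • Consider the XML document:

<Ele Att="A1">
  <Ele Att="A2"> </Ele>
</Ele>

  • Its nodes are numbered with <order,size> pairs as follows:

Ele <1,3>   (outer element)
Att <2,0>   (Att="A1")
Ele <3,1>   (inner element)
Att <4,0>   (Att="A2")

  • And the query: /Ele[@Att="A1"]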
slide31

EA-Join – Querying /Ele[@Att="A1"]

<Ele Att="A1">
  <Ele Att="A2"> </Ele>
</Ele>

Ele <1,3>, Att <2,0>, Ele <3,1>, Att <4,0>

  • Sort-merging the “Ele”s and “Att”s by the parent-child relationship gives the list:

<1,3>, <2,0>, <3,1>, <4,0>

  • Finding the “Ele” elements that have a child attribute “Att” with the value “A1” in the resulting list is easy using the information in the Element Record.
slide32

EA-Join – Comments

  • Only a two-stage sort-merge operation, without the additional cost of sorting:
    • First merge: by did.
    • Second merge: by examining the parent-child relationship.
      • This merge is based on the order values of the elements and attributes as defined by the numbering scheme.
  • Attributes should be placed before their sibling elements in the order of the numbering scheme.
    • This guarantees that elements and attributes with the same did can be merged in a single scan.
slide33

EE-Join: Element and Element Join

Input:

{E1,…,Em} and {F1,…,Fn}: each Ei or Fj is a set of elements having a common document identifier (did).

Output:

A set of (e,f) pairs such that element e is an ancestor of element f.

slide34

EE-Join: Element and Element Join

The Algorithm:

// Sort-merge {Ei} and {Fj} by did.
(1) foreach Ei and Fj with the same did do
      // Sort-merge Ei and Fj by the ANCESTOR-DESCENDANT relationship.
(2)   foreach e ∈ Ei and f ∈ Fj do
(3)     if (e is an ancestor of f) then output (e,f);
      end
    end
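A Python sketch of the same idea is shown below, under stated assumptions: groups are dicts keyed by did with records sorted by order, and the ancestor test is the <order,size> containment check from the numbering scheme. The cursor logic is one possible way to organize the merge and is not taken from the paper.

```python
from typing import NamedTuple


class Elem(NamedTuple):
    order: int
    size: int


def ee_join(ancestors, descendants):
    """Return (did, e, f) triples where element e is an ancestor of element f.

    Both inputs map did -> elements sorted by order. For each e, the matching
    f values form a contiguous run, but runs of nested e values overlap, so
    parts of the descendant list may be read more than once.
    """
    out = []
    for did in sorted(ancestors.keys() & descendants.keys()):
        es, fs = ancestors[did], descendants[did]
        start = 0
        for e in es:
            # Skip candidates that cannot descend from e or any later e.
            while start < len(fs) and fs[start].order <= e.order:
                start += 1
            k = start
            while k < len(fs) and fs[k].order <= e.order + e.size:
                out.append((did, e, fs[k]))
                k += 1
    return out


E = {1: [Elem(1, 3), Elem(3, 1)], 2: [Elem(1, 0)]}
F = {1: [Elem(1, 3), Elem(3, 1)]}
result = ee_join(E, F)
print(result)
```

Document 2 appears only on one side and is dropped by the did merge. In document 1 the only ancestor-descendant pair is the outer and inner element.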

slide35

EE-Join – Comments

  • Only a two-stage sort-merge operation, without the additional cost of sorting:
    • First merge: by did.
    • Second merge: by examining the ancestor-descendant relationship.
  • The sets of elements with a matching did cannot be merged in a single scan.
slide36

Kleene Closure

Input:

{E1,…,Em}, where Ei is a group of elements from an XML document.

Output:

A Kleene closure of {E1,…,Em}.

slide37

Kleene Closure

The Algorithm:

(1) set i ← 1;
(2) set K_i^C ← {E1,…,Em};
(3) repeat
(4)   set i ← i + 1;
(5)   set K_i^C ← EE-Join(K_(i-1)^C, K_1^C);
    until (K_i^C is empty);
(6) output the union of K_1^C, K_2^C, …, K_i^C;
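The loop above can be sketched as follows. This is a simplified illustration resting on two assumptions: each intermediate result is represented as a (head, tail) segment so that join results can be fed back into the next round, and the join is written as a plain nested loop rather than a sort-merge. Neither detail is specified in the slides.

```python
from typing import NamedTuple


class Elem(NamedTuple):
    order: int
    size: int


class Seg(NamedTuple):
    head: Elem  # first element of the matched path
    tail: Elem  # last element of the matched path


def is_ancestor(x, y):
    return x.order < y.order <= x.order + x.size


def join(left, right):
    """Extend every segment in `left` by every segment in `right` that nests below its tail."""
    return {
        Seg(a.head, b.tail)
        for a in left
        for b in right
        if is_ancestor(a.tail, b.head)
    }


def kleene_closure(first):
    """Repeat the join against the first level until it produces nothing new to extend."""
    levels = [set(first)]  # levels[0] plays the role of K_1^C
    while True:
        nxt = join(levels[-1], levels[0])
        if not nxt:
            break
        levels.append(nxt)
    return set().union(*levels)


a, b, c = Elem(1, 5), Elem(2, 3), Elem(3, 1)  # three nested elements
closure = kleene_closure({Seg(a, a), Seg(b, b), Seg(c, c)})
print(sorted(closure))
```

The loop terminates because each round produces only segments that are one nesting level deeper, and the depth of the document is finite.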

slide38

Performance Experiments

  • EE-Join:
  • Results:
    • Real World: an order of magnitude faster.
    • Synthetic Data: 6 to 10 times faster.
slide39

Performance Experiments

  • EA-Join:
  • Results:
    • Compared to Top-Down: a better performance.
    • Compared to Bottom-Up: no winner - close results.
slide40

Performance Results - Conclusions

  • The proposed algorithms can achieve performance improvement over the conventional methods (top-down and bottom-up tree traversals) by up to an order of magnitude.