
Presentation for Cmpe-521

VIST – Virtual Suffix Tree

Prepared by:

Evren CEYLAN – 2003700163

Aslı UYAR - 2003701321



VIST: A Dynamic Index Method for Querying XML Data by Tree Structures

Written by: Haixun Wang, Sanghyun Park, Wei Fan, Philip S. Yu – SIGMOD 2003

What is XML?
  • XML: Extensible Markup Language
  • XML plays an important role in data exchange.
  • As a result, much research has gone into flexible query mechanisms for extracting data from XML documents.
VIST: Virtual Suffix Tree
  • The paper proposes VIST as a way to search XML documents.
  • XML documents and XML queries are both represented as structure-encoded sequences (explained in later slides).
  • With this representation, the paper shows that querying XML data is equivalent to finding subsequence matches.
Index Methods in XML
  • Previous index methods disassemble a query into multiple sub-queries, and then join the results of these sub-queries to produce the final answer.

What does VIST do?
  • Converts both XML data and XML queries to structure-encoded sequences.
  • Uses tree structures as the basic unit of query in order to avoid highly expensive join operations.
  • In other words, uses structure-encoded sequences instead of nodes or paths.
  • Matches structured queries against structured data as a whole, without breaking the queries down into sub-queries of paths or nodes and relying on join operations.
  • Supports dynamic index update.
  • The paper shows that VIST is effective and efficient in supporting structural queries.

  • XML is increasingly important in data exchange, which makes extracting data from XML documents a central task.
  • XML provides a flexible way to define semi-structured data.
  • The paper introduces a novel index structure called VIST (Virtual Suffix Tree).
  • According to the paper, VIST offers better performance and usability than previous approaches to XML indexing.
  • In XML query language design, expressing complex structural or graph queries is one of the major concerns.
    • (Figure 2 shows four sample queries in graph form.)
In previous approaches:
  • i. Indexes are created on paths (e.g. “/P/S/I/M” in Q1). Path indexes can answer simple queries efficiently (Q1 has no branches).
  • ii. However, queries that involve branching structures (such as Q2) have to be disassembled into sub-queries, whose results are then combined by expensive join operations to produce the final results.
  • iii. So these methods are inefficient in handling queries with branching structures.
In the VIST approach:

Objective: to provide a general method so that structural XML queries need not be decomposed into sub-queries.

Result: no need to perform expensive join operations.

  • XML data and XML queries are transformed into “structure-encoded sequences”.
  • A virtual suffix tree is used to organize the structure-encoded sequences.
  • VIST also speeds up the matching process.
  • VIST’s index structure includes two parts: the D-Ancestor index and the S-Ancestor index (explained in later slides).
  • VIST unifies structural indexes and value indexes into a single index.
  • To achieve this, a method called “dynamic virtual suffix tree labeling” is proposed, so that index updates can be performed directly on B+Trees.
Structure-Encoded Sequences
  • A sequential representation of both XML data and XML queries.
  • Objective: modeling XML queries as sequence matching lets us avoid unnecessary join operations in query processing.
  • Result: structure-encoded sequences are used instead of paths or nodes.
Mapping Data and Queries to Structure-Encoded Sequences:

Stage 1:

  • Let's consider the purchase record example in Figure 3.
  • Notation: capital letters represent attribute names.
  • Lowercase letters represent attribute values.
  • To encode attribute values as integers we use a hash function h( ).
  • e.g. v1 = h(“dell”) and v2 = h(“ibm”)
  • v1 and v2 are then used to represent “dell” and “ibm” respectively.
Stage 2:
  • An XML document is represented by the preorder sequence of its tree structure.
  • e.g. preorder sequence of the tree in Figure 3 is:


Stage 3:
  • Definition: A structure-encoded sequence is a sequence of (symbol, prefix) pairs:

D = (a1,p1), (a2,p2), . . . , (an,pn)

ai: a node in the XML document tree.

pi: the path from the root node to node ai.

  • The benefit of modeling XML queries through sequence matching is that structural queries can be processed as a whole instead of being broken into smaller query units (paths or nodes of the XML document tree).
  • This matters because combining the results of sub-queries by join operations is expensive.
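
The three stages can be sketched in a few lines of Python. This is only an illustration, not the paper's code: the (symbol, children) tuple used to model a tree node and the function name encode are assumptions of the sketch, the letter e stands for the empty prefix, and prefixes are built by concatenating single-letter symbols as in the slides' examples.

```python
from typing import List, Tuple

# Hypothetical tree model: a node is (symbol, list_of_child_nodes).
Node = Tuple[str, list]

def encode(node: Node, prefix: str = "") -> List[Tuple[str, str]]:
    """Return the preorder list of (symbol, prefix) pairs for a tree."""
    symbol, children = node
    # The root has an empty path above it; write it as "e" like the slides do.
    pairs = [(symbol, prefix if prefix else "e")]
    for child in children:
        # A child's prefix is its parent's prefix followed by the parent symbol.
        pairs.extend(encode(child, prefix + symbol))
    return pairs

# A small tree: P has child S; S has children N (value v1) and L (value v2).
doc1 = ("P", [("S", [("N", [("v1", [])]),
                     ("L", [("v2", [])])])])

sequence = encode(doc1)
print(sequence)
```

For this tree the sketch yields (P,e) (S,P) (N,PS) (v1,PSN) (L,PS) (v2,PSL), which has the same form as the sequences used in the following slides.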
The VIST Approach:

Presented in 3 stages:

  • A naïve algorithm based on suffix trees
  • RIST: improves the naïve algorithm by using B+Trees to index suffix tree nodes
  • VIST: an index structure that relies only on B+Trees
  • Requirements for an XML indexing method:
    • It should support structural queries directly. This is done by “structure-encoded sequences”.
    • Instead of relying on suffix trees, it should use better indexing techniques such as B+Trees.
    • The index structure should allow dynamic data insertion and deletion, etc.
A Naïve Algorithm Based on Suffix Trees
  • The most widely used index structure for subsequence matching is the suffix tree.
  • Example: two XML documents, Doc1 and Doc2, and two XML queries, Q1 and Q2, written as structure-encoded sequences (e denotes the empty prefix):

Doc1 : (P,e) (S,P) (N,PS) (v1,PSN) (L,PS) (v2,PSL)

Doc2 : (P,e) (B,P) (L,PB) (v2,PBL)

Q1 : (P,e) (B,P) (L,PB) (v2,PBL)

Q2 : (P,e) (L,P*) (v2,P*L)

Example: (Cont’d)
  • A tree structure for Doc1 and Doc2 is shown in Figure 5.
  • As the figure shows, elements in the sequences represent nodes in the suffix tree.
  • Since the nodes are involved in 2 different trees (the XML document tree and the suffix tree), there are 2 kinds of ancestor-descendant relationships among the nodes.

i ) D-Ancestorship: the ancestor relationship in the XML document tree

e.g. (S,P) is a D-ancestor of (L,PS)

ii ) S-Ancestorship: the ancestor relationship in the suffix tree

e.g. (v1,PSN) is an S-ancestor of (L,PS)

Naïve Algorithm based on the suffix trees:
  • The NaiveSearch algorithm (Algorithm 1) is a naïve method for non-contiguous subsequence matching on the suffix tree.

For example, to match Q2:
  • Start with the root node, which matches the 1st element of Q2, that is (P,e).
  • Then search under the root for all nodes that match (L,P*), which yields (L,PS) and (L,PB).
  • Finally, search for

- (v2,PSL) under the node labeled (L,PS)

- (v2,PBL) under the node labeled (L,PB)

  • Algorithm 1 searches nodes first by S-Ancestorship, and then by D-Ancestorship.
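
A rough Python sketch of this kind of search is given below. It is not the paper's Algorithm 1: the sequences are stored in a plain trie rather than a real suffix tree, the names TrieNode, insert, collect, bind and naive_search are hypothetical, and the * wildcard is handled with fnmatch-style pattern matching as a simplifying assumption. After a wildcard element such as (L,P*) matches a concrete node such as (L,PS), later query prefixes that extend it are rewritten so that (v2,P*L) becomes (v2,PSL).

```python
from fnmatch import fnmatchcase

class TrieNode:
    def __init__(self):
        self.children = {}   # (symbol, prefix) -> TrieNode
        self.doc_ids = []    # ids of sequences that end at this node

def insert(root, seq, doc_id):
    node = root
    for pair in seq:
        node = node.children.setdefault(pair, TrieNode())
    node.doc_ids.append(doc_id)

def collect(node):
    """All document ids stored at or below node."""
    ids = list(node.doc_ids)
    for child in node.children.values():
        ids.extend(collect(child))
    return ids

def bind(rest, pat, sym, prefix):
    """Substitute the concrete prefix into later elements that extend pat+sym."""
    head = pat + sym
    return [(s, prefix + sym + p[len(head):]) if p.startswith(head) else (s, p)
            for s, p in rest]

def naive_search(node, query):
    """Match the query elements in order against descendants of node."""
    if not query:
        return set(collect(node))
    (sym, pat), rest = query[0], query[1:]
    found = set()
    stack = list(node.children.items())
    while stack:                      # walks the whole subtree: the costly part
        (s, p), child = stack.pop()
        if s == sym and fnmatchcase(p, pat):
            found |= naive_search(child, bind(rest, pat, sym, p))
        stack.extend(child.children.items())
    return found

root = TrieNode()
insert(root, [("P", "e"), ("S", "P"), ("N", "PS"), ("v1", "PSN"),
              ("L", "PS"), ("v2", "PSL")], 1)
insert(root, [("P", "e"), ("B", "P"), ("L", "PB"), ("v2", "PBL")], 2)

q1 = [("P", "e"), ("B", "P"), ("L", "PB"), ("v2", "PBL")]
q2 = [("P", "e"), ("L", "P*"), ("v2", "P*L")]
print(naive_search(root, q1), naive_search(root, q2))
```

Note that every step scans all descendants of the current node, which is exactly the cost the later slides set out to remove.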

Difficulties of the Naïve Algorithm:
  • There are difficulties in using a suffix tree to index structure-encoded sequences.
  • The major difficulty: searching for nodes that satisfy both S-Ancestorship and D-Ancestorship is extremely costly, because we need to traverse a large portion of the subtree for each match.

RIST: Indexing by Ancestor-Descendant Relationships
  • Improves the naïve algorithm by eliminating the expensive subtree traversals in the suffix tree.
  • When we reach node X after a match, we can jump directly to those nodes Y of which X is both a D-Ancestor and an S-Ancestor.
  • So we no longer need to search among the descendants of X to find the Ys one by one.
RIST Algorithm:
  • 1. Index the nodes in the suffix tree by their (Symbol,Prefix) pairs, using a B+Tree.
    • i. This lets us look up nodes by (Symbol,Prefix) pair, which is how D-Ancestorship is checked.
    • ii. This B+Tree is called the D-Ancestorship B+Tree.
  • 2. Among all the nodes satisfying D-Ancestorship, we are interested in the ones that satisfy S-Ancestorship as well.
    • i. Labels are created for suffix tree nodes so that the relationship between any 2 nodes can be determined.
    • ii. B+Trees are used to index nodes by these labels.
    • iii. Each such B+Tree is called an S-Ancestorship B+Tree.
Labeling Notation
  • Each suffix tree node x is labeled <nx, sizex>
  • nx: the prefix (preorder) traversal order of x in the suffix tree.
  • sizex: the total number of descendants of x in the suffix tree.
  • This labeling is shown in Figure 5.
  • Note: with this labeling, the S-Ancestorship between any two nodes can be decided easily:
  • If x and y are labeled <nx, sizex> and <ny, sizey>, node x is an S-Ancestor of y if and only if ny ∈ (nx, nx + sizex], because the descendants of x are exactly the sizex nodes visited right after x.
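
As a small illustration of this labeling, the sketch below numbers the nodes of a tree in preorder, counts descendants, and applies the interval test. The function names label_preorder and is_s_ancestor and the (name, children) tuple layout are assumptions made for the example; the tree is the one obtained by inserting Doc1 and Doc2 into a trie, and the numbers are whatever this sketch computes, not values copied from Figure 5.

```python
def label_preorder(tree):
    """Return {name: (n, size)}: n = preorder rank, size = number of descendants."""
    labels = {}
    counter = 0

    def visit(node):
        nonlocal counter
        name, children = node
        n = counter
        counter += 1
        # Each child contributes itself plus its own descendants.
        size = sum(visit(child) + 1 for child in children)
        labels[name] = (n, size)
        return size

    visit(tree)
    return labels

def is_s_ancestor(x, y):
    """x, y are (n, size) labels; x is an ancestor of y iff n_y is in (n_x, n_x + size_x]."""
    (nx, sizex), (ny, _) = x, y
    return nx < ny <= nx + sizex

# Trie holding Doc1 (a chain under S,P) and Doc2 (a chain under B,P).
tree = ("P,e", [
    ("S,P", [("N,PS", [("v1,PSN", [("L,PS", [("v2,PSL", [])])])])]),
    ("B,P", [("L,PB", [("v2,PBL", [])])]),
])

labels = label_preorder(tree)
print(labels["v1,PSN"], labels["L,PS"])
print(is_s_ancestor(labels["v1,PSN"], labels["L,PS"]))
```

The test needs only two integer comparisons, which is what makes it suitable as a range query on an index keyed by n.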
Constructing the B+Trees:
  • Insert all suffix tree nodes into the D-Ancestorship B+Tree using their (Symbol,Prefix) pairs as keys.
  • For all nodes x inserted with the same (Symbol,Prefix), index them by an S-Ancestorship B+Tree, using the nx values of their labels as keys.
  • Shown in Figure 6.
Building the DocID B+Tree:
  • The DocID B+Tree stores, for each node x (using nx as key), the document IDs of those XML sequences that end at node x when they are inserted into the suffix tree.
  • Shown in the DocID B+Tree figure.
In summary:
  • Unlike the naïve algorithm, RIST does not traverse the suffix tree for subsequence matching (it uses the D-Ancestorship B+Tree and the S-Ancestorship B+Trees).
  • From any node, instead of searching the entire subtree under the node, we can jump to the descendant nodes that match the next element in the query.
  • So RIST supports non-contiguous subsequence matching efficiently.
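
The lookup pattern summarized above can be sketched as follows. This is a simplified stand-in, not the paper's implementation: sorted Python lists searched with bisect play the role of the B+Trees, a dictionary keyed by (Symbol,Prefix) plays the role of the D-Ancestorship index, only exact pairs are supported (no wildcards), and the names build_index and search are hypothetical. An extra root numbered 0 sits above all sequences.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: [(symbol, prefix), ...]} -> (d_index, size, doc_ns, doc_ids)."""
    root = {"kids": {}, "docs": []}
    for doc_id, seq in docs.items():
        node = root
        for pair in seq:
            node = node["kids"].setdefault(pair, {"kids": {}, "docs": []})
        node["docs"].append(doc_id)

    d_index = defaultdict(list)   # pair -> ascending list of n (S-Ancestor keys)
    size = {}                     # n -> number of descendants
    doc_ns, doc_ids = [], []      # parallel lists, ascending n (DocID index)
    counter = 0

    def visit(node):
        nonlocal counter
        n = counter
        counter += 1
        for doc_id in node["docs"]:
            doc_ns.append(n)
            doc_ids.append(doc_id)
        total = 0
        for pair, kid in node["kids"].items():
            d_index[pair].append(counter)   # the kid receives the next number
            total += visit(kid) + 1
        size[n] = total
        return total

    visit(root)
    return d_index, size, doc_ns, doc_ids

def search(index, query):
    d_index, size, doc_ns, doc_ids = index
    scopes = [(0, size[0])]                      # start at the extra root
    for pair in query:
        ns = d_index.get(pair, [])               # D-Ancestorship: lookup by key
        nxt = []
        for n, sz in scopes:                     # S-Ancestorship: n' in (n, n+sz]
            lo, hi = bisect_right(ns, n), bisect_right(ns, n + sz)
            nxt.extend((m, size[m]) for m in ns[lo:hi])
        scopes = nxt
    result = set()
    for n, sz in scopes:                         # documents ending in [n, n+sz]
        lo, hi = bisect_left(doc_ns, n), bisect_right(doc_ns, n + sz)
        result.update(doc_ids[lo:hi])
    return result

index = build_index({
    1: [("P", "e"), ("S", "P"), ("N", "PS"), ("v1", "PSN"), ("L", "PS"), ("v2", "PSL")],
    2: [("P", "e"), ("B", "P"), ("L", "PB"), ("v2", "PBL")],
})
print(search(index, [("P", "e"), ("L", "PS"), ("v2", "PSL")]))
```

The query in the last line skips (S,P), (N,PS) and (v1,PSN), so it is matched non-contiguously without visiting those nodes.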
VIST: The Virtual Suffix Tree
  • RIST uses a static scheme to label suffix tree nodes, which prevents it from supporting dynamic insertions.
  • For any node x labeled <n,size>, later insertions can change the number of nodes that appear before x in the prefix order, as well as the size of the subtree rooted at x. So neither n nor size can be fixed.
VIST: The Virtual Suffix Tree (Cont’d)
  • The purpose of the suffix tree is to provide a labeling mechanism to encode S-Ancestorship.
  • Suppose a node x is created for element di during the insertion of the sequence

d1, … , di, … , dk.

  • If we can estimate
    • i. how many different elements may follow di in future insertions, and
    • ii. the occurrence probability of each of these elements,
  • then we can label x’s child nodes right away instead of waiting until all sequences are inserted.
  • It also means:
    • the suffix tree itself is no longer needed, because labels can now be assigned without it;
    • dynamic data insertion and deletion are supported.
Top-down scope allocation:
  • A tree structure defines nested scopes: the scope of a child node is a subscope of its parent's scope, and the root node has the maximum scope, which covers the scope of every other node.
  • In dynamic scope allocation there is a parameter λ, the expected number of child nodes of any node.
  • λ is usually assumed to be 2.
  • Without knowledge of the occurrence rate of each child node, 1/λ of the remaining scope is allocated to x’s first inserted child, then 1/λ of what remains to the next child, and so on. For a node x labeled <n,size> and λ = 2:
    • Child1 : <n+1, size/2>
    • Child2 : <n+1+size/2, size/4>
Dynamic scope of a Suffix Tree Node:
  • The dynamic scope of a node is a triple <n,size,k>,
  • where k is the number of subscopes allocated inside the current scope.
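
A minimal sketch of this allocation rule, under stated assumptions: a scope <n,size,k> is treated as owning the half-open range [n, n+size), integer division is used for the sizes, and the class Scope and function allocate_child are hypothetical names. The paper's exact formulas (for example, how the position taken by the node itself is accounted for) may differ.

```python
from dataclasses import dataclass

@dataclass
class Scope:
    n: int        # label of the node; the node owns the range [n, n + size)
    size: int     # width of the range
    k: int = 0    # number of sub-scopes already handed out to children

def allocate_child(parent: Scope, lam: int = 2) -> Scope:
    """Give the next child 1/lam of the parent's still-unallocated range."""
    # Ranges already given to the first k children: size/lam, size/lam^2, ...
    used = sum(parent.size // lam ** j for j in range(1, parent.k + 1))
    child_size = parent.size // lam ** (parent.k + 1)
    if child_size < 1:
        raise ValueError("scope exhausted (not handled in this sketch)")
    parent.k += 1
    # Children start right after the parent's own label n.
    return Scope(parent.n + 1 + used, child_size)

root = Scope(0, 20480)
child1 = allocate_child(root)   # <n+1, size/2>
child2 = allocate_child(root)   # <n+1+size/2, size/4>
print(child1, child2, root)
```

Because each child receives a fixed fraction of what is left, a label never has to change when siblings are added later, which is the property static preorder labels lack.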
Algorithm of VIST:
  • VIST uses the same sequence matching algorithm as RIST.
  • A dynamic method labels suffix tree nodes without building the suffix tree.
  • The method relies on rough estimations of the number of attribute values.
  • Since the suffix tree is never built, the labeling mechanism is based on a virtual suffix tree.

- Let's look at the index structure before and after an insertion.

Algorithm of VIST:
  • Suppose that, before the insertion, the index structure already contains the following sequence:

Doc1 = (P,e) (S,P) (N,PS) (v1,PSN) (L,PS) (v2,PSL)

  • The sequence to be inserted:

=> Doc2 = (P,e) (S,P) (L,PS) (v2,PSL)

Assumptions of the Example:
  • There are 2 assumptions for the algorithm:
    • Max = 20480 (the size of the root node's scope)
    • The dynamic scope allocation method uses the parameter λ = 2
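
Under these two assumptions, inserting Doc1 and then Doc2 can be sketched as below. Several things here are assumptions of the sketch rather than details from the paper: dictionaries stand in for the B+Trees, each entry remembers its parent's n so that an existing branch can be recognized, an extra root numbered 0 owns the whole range, and the names new_index and insert are hypothetical. The labels it produces follow from these choices and may not match the figures in the paper.

```python
MAX = 20480   # scope size of the root, as assumed in the example
LAM = 2       # expected number of children per node

def new_index():
    # scopes: n -> [size, k, parent_n]; d_index: (symbol, prefix) -> [n, ...]
    return {"scopes": {0: [MAX, 0, None]}, "d_index": {}, "doc_ids": {}}

def insert(index, seq, doc_id):
    """Insert one structure-encoded sequence; return the n of its last node."""
    scopes, d_index = index["scopes"], index["d_index"]
    cur = 0                                    # start at the extra root
    for pair in seq:
        # Follow an existing branch: same pair, allocated directly inside cur.
        hit = next((n for n in d_index.get(pair, [])
                    if scopes[n][2] == cur), None)
        if hit is None:                        # no branch to follow: create one
            size, k, _ = scopes[cur]
            used = sum(size // LAM ** j for j in range(1, k + 1))
            child_size = size // LAM ** (k + 1)
            if child_size < 1:
                raise ValueError("scope exhausted (not handled in this sketch)")
            hit = cur + 1 + used
            scopes[cur][1] = k + 1
            scopes[hit] = [child_size, 0, cur]
            d_index.setdefault(pair, []).append(hit)
        cur = hit
    index["doc_ids"].setdefault(cur, []).append(doc_id)
    return cur

index = new_index()
end1 = insert(index, [("P", "e"), ("S", "P"), ("N", "PS"), ("v1", "PSN"),
                      ("L", "PS"), ("v2", "PSL")], 1)
end2 = insert(index, [("P", "e"), ("S", "P"), ("L", "PS"), ("v2", "PSL")], 2)
print(end1, end2, index["d_index"][("L", "PS")])
```

Doc2 shares the (P,e) and (S,P) entries with Doc1; its (L,PS) is a new child of (S,P), so it receives a fresh sub-scope there and none of Doc1's labels change.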
The insertion process is much like that of inserting a sequence into a suffix tree.
  • We follow the branches, and when there is no branch to follow we create one.

  • VIST, a dynamic index method, is developed for XML documents.
  • XML data and XML queries are converted into sequences that encode their structural information.
VIST’s Pros:
  • Uses tree structures as the basic unit of query to avoid expensive join operations.
  • Supports dynamic data insertion and deletion.
  • Unlike some data structures used in other approaches, the index structure of VIST is based on B+Trees, which are well supported by DBMSs.