Presentation for Cmpe-521. VIST – Virtual Suffix Tree. Prepared by: Evren CEYLAN – 2003700163, Aslı UYAR – 2003701321. VIST: A Dynamic Index Method for Querying XML Data by Tree Structures. Written by: Haixun Wang, Sanghyun Park, Wei Fan, Philip S. Yu – SIGMOD 2003.

Presentation for Cmpe-521

VIST – Virtual Suffix Tree

Prepared by:

Evren CEYLAN – 2003700163

Aslı UYAR - 2003701321

VIST:

A Dynamic Index Method for Querying XML Data by Tree Structures

Written by: Haixun Wang, Sanghyun Park, Wei Fan, Philip S. Yu – SIGMOD 2003

What is XML?
  • XML: Extensible Markup Language
  • It plays an important role in data exchange.
  • Consequently, much research has gone into flexible query mechanisms for extracting data from XML documents.
VIST : Virtual Suffix Tree
  • The paper proposes VIST for searching XML documents.
  • XML documents and XML queries are represented as structure-encoded sequences (explained in later slides).
  • With this representation, querying XML data is shown to be equivalent to finding subsequence matches.
Index Methods in XML
  • Previous index methods:

Disassemble a query into multiple sub-queries, and then join the results of these sub-queries to provide final answers.

What does VIST do?
  • Converts both XML data and XML queries to structure-encoded sequences.
  • Uses tree structures as the basic unit of query in order to avoid highly expensive join operations.
  • In other words, it uses structure-encoded sequences instead of nodes or paths.
What does VIST do?
  • Matches structured queries against structured data as a whole, without breaking down the queries into sub-queries of paths or nodes and relying on join operations.
  • Supports dynamic index update.
What does VIST do?

The paper shows that VIST is effective and efficient in supporting structural queries.

Introduction
  • XML has a growing importance in data exchange (extracting data from XML documents).
  • XML provides a flexible way to define semi-structured data.
  • The paper introduces a novel index structure called “VIST” (Virtual Suffix Tree).
  • VIST offers better performance and usability than previous approaches to XML indexing.
In XML query language design, expressing complex structural or graphical queries is one of the major concerns.
    • (Figure 2 displays four sample queries in graph form.)
In previous approaches;
  • i. Indexes are created on paths (e.g. “/P/S/I/M” in Q1). Path indexes can answer simple queries efficiently (Q1 has no branches).
  • ii. However, queries that involve branching structures (such as Q2) have to be disassembled into sub-queries, whose results are then combined by expensive join operations to produce the final results.
  • iii. So, these methods are inefficient in handling branching queries.
In VIST approach;

Objective: to provide a general method so that structural XML queries need not be decomposed into sub-queries.

Result: no need to perform expensive join operations.

Method:
  • XML data and XML queries are transformed into “structure-encoded sequences”.
  • A virtual suffix tree is used to organize the structure-encoded sequences.
  • VIST also speeds up the matching process.
Structure:
  • VIST’s index structure includes two parts: the D-Ancestor index and the S-Ancestor index (explained in later slides).
  • VIST unifies structural indexes and value indexes into a single index.
  • To achieve this, a method called “dynamic virtual suffix tree labeling” is proposed (index updates can be performed directly on the B+Trees).
Structure-Encoded Sequences
  • Sequential representation of both XML Data and XML Queries.
Objective: Modeling XML queries as sequence matching lets us avoid unnecessary join operations in query processing.
  • Result: Structure-encoded sequences are used instead of paths or nodes.
Mapping Data and Queries to Structure-Encoded Sequences:

Stage 1:

  • Let's consider the purchase record example in Figure 3.
  • Notation: Capital letters represent names of attributes.
  • Lowercase letters represent attribute values.
  • To encode attribute values into integers we use a hash function h( ).
  • e.g. v1 = h(“dell”) and v2 = h(“ibm”)
  • v1 and v2 are used to represent “dell” and “ibm” respectively.
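The value-encoding step can be sketched in Python as follows. The function name h follows the slide; the choice of SHA-1 truncated to 32 bits is only an illustrative assumption, since the slides do not say which hash function the paper uses.

```python
import hashlib

def h(value: str, bits: int = 32) -> int:
    """Map an attribute value to a non-negative integer (assumed scheme)."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    # Keep only the leading bytes so the result fits in `bits` bits.
    return int.from_bytes(digest[: bits // 8], "big")

v1 = h("dell")
v2 = h("ibm")
```

A deterministic hash is used here rather than Python's built-in hash(), whose string results vary between interpreter runs.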
Stage 2:
  • An XML document is represented by the preorder sequence of its tree structure.
  • e.g. the preorder sequence of the tree in Figure 3 is:

PSNv1IMv2Nv3IMv4INv5Lv6BLv7Nv8

Stage 3:
  • Definition: A structure-encoded sequence is a sequence of (symbol,prefix) pairs:

D = (a1,p1), (a2,p2), . . . , (an,pn)

ai: node in the XML doc tree.

pi: path from the root node to node ai.
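The definition above can be illustrated with a short Python sketch that walks a document tree in preorder and emits (symbol, prefix) pairs. The Node class and the encode function are hypothetical names introduced for this example. It assumes single-letter element names so that a prefix can be kept as a plain string, and it writes the empty prefix (shown as e on the slides) as an empty string.

```python
from typing import List, Optional, Tuple

class Node:
    """A node of the XML document tree (element name or hashed value)."""
    def __init__(self, symbol: str, children: Optional[List["Node"]] = None):
        self.symbol = symbol
        self.children = children or []

def encode(node: Node, prefix: str = "") -> List[Tuple[str, str]]:
    """Return the structure-encoded sequence of the subtree in preorder."""
    seq = [(node.symbol, prefix)]
    for child in node.children:
        # A child's prefix is the path from the root down to its parent.
        seq.extend(encode(child, prefix + node.symbol))
    return seq

# A small tree shaped like Doc1 in the suffix tree example.
doc1 = Node("P", [Node("S", [Node("N", [Node("v1")]),
                             Node("L", [Node("v2")])])])
doc1_seq = encode(doc1)
```

Here doc1_seq holds the pairs (P,e), (S,P), (N,PS), (v1,PSN), (L,PS), (v2,PSL), with e written as the empty string.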

Benefits:
  • The benefit of modeling XML queries through sequence matching is that structural queries can be processed as a whole instead of being broken into smaller query units (paths or nodes of the XML document tree).
  • This matters because combining the results of the sub-queries by join operations is expensive.
The VIST Approach:

Presented in 3 stages:

  • Naïve algorithm based on suffix trees
  • RIST : improves the naïve algorithm by using B+Trees to index suffix tree nodes
  • VIST : an index structure that relies only on B+Trees
Requirements
  • An XML indexing method should:
    • Support structural queries directly. This is done with “structure-encoded sequences”.
    • Use well-established indexing techniques such as B+Trees instead of relying on suffix trees.
    • Allow dynamic data insertion and deletion.
A Naïve Algorithm Based on Suffix Trees
  • The most widely used index structure for subsequence matching is the suffix tree.
Example:
  • 2 XML documents called Doc1 and Doc2,
  • 2 XML queries called Q1 and Q2

as structure-encoded sequences:

Doc1 : (P,e) (S,P) (N,PS) (v1,PSN) (L,PS) (v2,PSL)

Doc2 : (P,e) (B,P) (L,PB) (v2,PBL)

Q1 : (P,e) (B,P) (L,PB) (v2,PBL)

Q2 : (P,e) (L,P*) (v2,P*L)

Example: (Cont’d)
  • A tree structure for Doc1 and Doc2 is shown in Figure 5
Example: (Cont’d)
  • As shown above, elements of the sequences become nodes in the suffix tree.
  • Since the nodes belong to 2 different trees (the XML document tree and the suffix tree), there are 2 kinds of ancestor-descendant relationships among the nodes.

i ) D-Ancestorship

e.g. (S,P) is a D-ancestor of (L,PS)

ii ) S-Ancestorship

e.g. (v1,PSN) is a S-ancestor of (L,PS)

Naïve Algorithm based on the suffix trees:
  • The NaiveSearch algorithm (Algorithm 1) works on the suffix tree.
  • It is a naïve method for non-contiguous subsequence matching.
For example to match Q2;
  • Start with the root node, which matches the 1st element of Q2, that is (P,e).
  • Then search under the root for all nodes that match (L,P*), which yields (L,PS) and (L,PB).
  • Finally, search for

- (v2,PSL) under the node labeled (L,PS)

- (v2,PBL) under the node labeled (L,PB)

  • Algorithm 1 searches nodes first by S-Ancestorship, and then by D-Ancestorship.
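The walk-through above can be reproduced with a rough Python sketch. It is not the paper's Algorithm 1: a plain trie of the document sequences stands in for the suffix tree, and all names (TrieNode, insert, naive_search, the bound dictionary) are hypothetical. The wildcard * is assumed to match any run of characters in a prefix, and once a wildcard element has been matched, later query elements below it must use the same concrete prefix.

```python
import re
from typing import Dict, Iterator, List, Optional, Set, Tuple

Element = Tuple[str, str]  # (symbol, prefix); "" is the empty prefix e

class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[Element, "TrieNode"] = {}
        self.doc_ids: List[int] = []

def insert(root: TrieNode, seq: List[Element], doc_id: int) -> None:
    node = root
    for element in seq:
        node = node.children.setdefault(element, TrieNode())
    node.doc_ids.append(doc_id)

def descendants(node: TrieNode) -> Iterator[Tuple[Element, TrieNode]]:
    """All trie nodes below `node` (the S-descendants), in preorder."""
    for element, child in node.children.items():
        yield element, child
        yield from descendants(child)

def docs_below(node: TrieNode) -> Set[int]:
    ids = set(node.doc_ids)
    for child in node.children.values():
        ids |= docs_below(child)
    return ids

def prefix_regex(bound: Dict[str, str], pattern: str) -> "re.Pattern[str]":
    """Turn a query prefix into a regex, reusing prefixes bound earlier."""
    key = max((k for k in bound if pattern.startswith(k)), key=len, default="")
    rest = pattern[len(key):]
    body = re.escape(bound.get(key, ""))
    body += ".*".join(re.escape(part) for part in rest.split("*"))
    return re.compile(body)

def naive_search(node: TrieNode, query: List[Element], i: int = 0,
                 bound: Optional[Dict[str, str]] = None) -> Set[int]:
    bound = bound or {}
    if i == len(query):
        return docs_below(node)
    symbol, pattern = query[i]
    rx = prefix_regex(bound, pattern)
    found: Set[int] = set()
    # Scan the whole subtree: this is the costly part of the naive method.
    for (sym, prefix), child in descendants(node):
        if sym == symbol and rx.fullmatch(prefix):
            new_bound = {**bound, pattern + symbol: prefix + sym}
            found |= naive_search(child, query, i + 1, new_bound)
    return found

root = TrieNode()
insert(root, [("P", ""), ("S", "P"), ("N", "PS"), ("v1", "PSN"),
              ("L", "PS"), ("v2", "PSL")], doc_id=1)
insert(root, [("P", ""), ("B", "P"), ("L", "PB"), ("v2", "PBL")], doc_id=2)
q1 = [("P", ""), ("B", "P"), ("L", "PB"), ("v2", "PBL")]
q2 = [("P", ""), ("L", "P*"), ("v2", "P*L")]
```

With these inputs, naive_search(root, q2) finds both documents and naive_search(root, q1) finds only Doc2.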

Difficulties of Naive Algorithm:
  • There are difficulties in using a suffix tree to index structure-encoded sequences.
  • The major difficulty is explained below:

Searching for nodes that satisfy both S-Ancestorship and D-Ancestorship is extremely costly, because we need to traverse a large portion of the subtree for each match.

RIST: Indexing by Ancestor-Descendant Relationships
  • Improves the naïve algorithm by eliminating the expensive subtree traversals in the suffix tree.
  • When we reach node X after a match, we can jump directly to those nodes Y of which X is both a D-Ancestor and an S-Ancestor.
  • So, we no longer need to search among the descendants of X to find the Ys one by one.
RIST Algorithm:
  • 1. Index the nodes in the suffix tree by their (Symbol,Prefix) pairs. This index is a B+Tree.
  • i. This enables us to search nodes by their (Symbol,Prefix) pairs, that is, by D-Ancestorship.
  • ii. This B+Tree is called the D-Ancestorship B+Tree.
RIST Algorithm:
  • 2. Among all the nodes satisfying D-Ancestorship, we are interested in the ones satisfying S-Ancestorship as well.
  • i. Labels are created for suffix tree nodes in order to tell the relationship between 2 nodes.
  • ii. We use B+Trees to index nodes by labels.
  • iii. This B+Tree is called the S-Ancestorship B+Tree.
Labeling Notation
  • <nx, sizex>
  • nx: prefix (preorder) traversal order of x in the suffix tree.
  • sizex: total number of descendants of x in the suffix tree.
  • This kind of labeling is shown in Figure 5.
Labeling Notation

  • Note: with this labeling, the S-Ancestorship between any two nodes can be decided easily:
  • If x and y are labeled <nx, sizex> and <ny, sizey>, node x is an S-Ancestor of y if ny ∈ (nx, nx + sizex]
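A small Python sketch of this labeling, assuming a tree given as a dictionary from node name to child names. The function names, the string node names and the artificial "root" node are illustrative choices, and the resulting numbers are not claimed to match Figure 5.

```python
from typing import Dict, List, Tuple

Label = Tuple[int, int]  # <n, size>

def label_preorder(children: Dict[str, List[str]], root: str) -> Dict[str, Label]:
    """Assign <n, size>: n is the preorder rank, size the descendant count."""
    labels: Dict[str, Label] = {}
    counter = 0

    def visit(x: str) -> None:
        nonlocal counter
        n = counter
        counter += 1
        for child in children.get(x, []):
            visit(child)
        # Every node numbered after n during this call is a descendant of x.
        labels[x] = (n, counter - n - 1)

    visit(root)
    return labels

def is_s_ancestor(x: Label, y: Label) -> bool:
    """x is an S-Ancestor of y iff ny lies in (nx, nx + sizex]."""
    nx, size_x = x
    ny, _ = y
    return nx < ny <= nx + size_x

# Trie of Doc1 and Doc2 from the example, with node names as plain strings.
tree = {
    "root": ["(P,e)"],
    "(P,e)": ["(S,P)", "(B,P)"],
    "(S,P)": ["(N,PS)"],
    "(N,PS)": ["(v1,PSN)"],
    "(v1,PSN)": ["(L,PS)"],
    "(L,PS)": ["(v2,PSL)"],
    "(B,P)": ["(L,PB)"],
    "(L,PB)": ["(v2,PBL)"],
}
labels = label_preorder(tree, "root")
```

With these labels, (v1,PSN) tests as an S-Ancestor of (L,PS), in line with the earlier example, while (S,P) is not an S-Ancestor of (L,PB).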
Constructing the B+Trees:
  • Insert all suffix tree nodes into the D-Ancestorship B+Tree using their (Symbol,Prefix) pairs as keys.
  • All nodes x inserted with the same (Symbol,Prefix) are indexed by an S-Ancestorship B+Tree, using the nx values of their labels as keys.
  • Shown in Figure 6
Building the DocID B+Tree:
  • DocID B+Tree stores for each node x ( using nx as key ), the document IDs of those XML sequences that end up at node x when they are inserted into the suffix tree.
  • Shown in DocID B+Tree
In summary;
  • Unlike the naïve algorithm, RIST does not use suffix trees for subsequence matching (it uses the D-Ancestorship B+Tree and the S-Ancestorship B+Tree).
  • From any node, instead of searching the entire subtree under the node, we can jump to the descendant nodes that match the next element in the query.
  • So, RIST supports non-contiguous subsequence matching efficiently.
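The three RIST indexes and the jump-based search can be sketched in Python under simplifying assumptions: a dictionary of sorted lists stands in for the D-Ancestorship and S-Ancestorship B+Trees, a sorted list stands in for the DocID B+Tree, and query elements are matched exactly (no wildcards). The names build_rist and rist_search are hypothetical.

```python
import bisect
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Element = Tuple[str, str]
INF = float("inf")

def build_rist(docs: Dict[int, List[Element]]):
    """Build a trie, label it <n, size> in preorder, and fill the indexes."""
    root = {"kids": {}, "docs": []}
    for doc_id, seq in docs.items():
        node = root
        for element in seq:
            node = node["kids"].setdefault(element, {"kids": {}, "docs": []})
        node["docs"].append(doc_id)

    d_index = defaultdict(list)  # (symbol, prefix) -> [(n, size)], sorted by n
    doc_index = []               # [(n, doc_id)], sorted by n
    counter = 0

    def visit(node, element) -> None:
        nonlocal counter
        n = counter
        counter += 1
        for el, kid in node["kids"].items():
            visit(kid, el)
        if element is not None:
            d_index[element].append((n, counter - n - 1))
        doc_index.extend((n, d) for d in node["docs"])

    visit(root, None)
    for entries in d_index.values():
        entries.sort()
    doc_index.sort()
    return d_index, doc_index, (0, counter - 1)

def rist_search(d_index, doc_index, root_label, query: List[Element]) -> Set[int]:
    def step(i: int, n: int, size: int) -> Set[int]:
        if i == len(query):
            # Documents ending at the matched node or anywhere below it.
            lo = bisect.bisect_left(doc_index, (n, -INF))
            hi = bisect.bisect_right(doc_index, (n + size, INF))
            return {d for _, d in doc_index[lo:hi]}
        entries = d_index.get(query[i], [])           # D-Ancestorship lookup
        lo = bisect.bisect_right(entries, (n, INF))   # S-Ancestorship range:
        hi = bisect.bisect_right(entries, (n + size, INF))  # n' in (n, n+size]
        found: Set[int] = set()
        for child_n, child_size in entries[lo:hi]:
            found |= step(i + 1, child_n, child_size)
        return found
    return step(0, *root_label)

docs = {
    1: [("P", ""), ("S", "P"), ("N", "PS"), ("v1", "PSN"), ("L", "PS"), ("v2", "PSL")],
    2: [("P", ""), ("B", "P"), ("L", "PB"), ("v2", "PBL")],
}
d_index, doc_index, root_label = build_rist(docs)
```

Each step is a key lookup followed by a range scan on n, so no subtree is traversed. A call such as rist_search(d_index, doc_index, root_label, [("P", ""), ("L", "PS"), ("v2", "PSL")]) skips the elements that sit between the matches in Doc1.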
VIST: The Virtual Suffix Tree
  • RIST uses a static scheme to label suffix tree nodes, which prevents it from supporting dynamic insertions.
  • For any node x labeled <n,size>, later insertions can change the number of nodes that appear before x in the prefix order,
  • as well as the size of the subtree rooted at x, which means neither n nor size can be fixed.
VIST: The Virtual Suffix Tree
  • The purpose of the suffix tree is to provide a labeling mechanism to encode S-Ancestorship.
  • Suppose a node x is created for element di during the insertion of the sequence

d1, … , di, … , dk.

VIST: The Virtual Suffix Tree
  • If we can estimate

i. how many different elements will possibly follow di in future insertions,

ii. the occurrence probability of each of these elements,

  • then we can label x’s child nodes right away instead of waiting until all sequences are inserted.
VIST: The Virtual Suffix Tree (Cont’d)

  • It also means:
    • the suffix tree itself is no longer needed, because its labeling mechanism is inefficient.
    • the resulting index supports dynamic data insertion and deletion.
Top down scope allocation:
  • A tree structure defines nested scopes: the scope of a child node is a subscope of its parent node, and the root node has the max scope which covers the scope of each node.
Top down scope allocation:
  • Dynamic scope allocation uses a parameter λ, the expected number of child nodes of any node.
  • λ is usually assumed to be 2.
  • Without knowledge of the occurrence rate of each child node, 1/λ of the remaining scope is allocated to x’s 1st inserted child, and likewise to each later child. For a node x with scope <n,size> and λ = 2:
    • Child1 : <n+1, size/2>
    • Child2 : <n+1+size/2, size/4>
Dynamic scope of a Suffix Tree Node:
  • The dynamic scope of a node is a triple <n,size,k>,
  • where k is the number of subscopes allocated inside the current scope.
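The allocation rule can be sketched in Python as follows. This is only one reading of the slides: each new child receives 1/λ of whatever scope is still unallocated, and a scope <n,size> is assumed to own the integers n to n+size-1. The function name allocate_child is hypothetical, and scope exhaustion is reported with an exception rather than handled.

```python
from typing import List, Tuple

def allocate_child(scope: List[int], lam: int = 2) -> Tuple[int, int]:
    """Allocate the next subscope inside scope = [n, size, k].

    Returns the child's <n, size> and increments k in place.
    """
    n, size, k = scope
    start, remaining = n + 1, size
    # Skip over the subscopes already handed to the first k children.
    for _ in range(k):
        used = remaining // lam
        start += used
        remaining -= used
    child_size = remaining // lam
    if child_size == 0:
        raise OverflowError("scope underflow: no room left for another child")
    scope[2] = k + 1
    return start, child_size

root_scope = [0, 20480, 0]            # <n, size, k> with Max = 20480
child1 = allocate_child(root_scope)   # 1/2 of the scope
child2 = allocate_child(root_scope)   # 1/2 of what remains
```

For λ = 2 this gives a first child of <n+1, size/2> and a second child of <n+1+size/2, size/4>, so sibling ranges never overlap.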
Algorithm of VIST:
  • VIST uses the same sequence matching algorithm as RIST.
  • Suffix tree nodes are labeled dynamically, without building the suffix tree.
Algorithm of VIST:
  • The method relies on rough estimates of the number of attribute values.
  • Because of that, the labeling mechanism is based on a virtual suffix tree.
Example:

- Let's look at the index structure before and after insertion

Algorithm of VIST:
  • Suppose that, before the insertion, the index structure already contains the following sequence:

Doc1 = (P,e) (S,P) (N,PS) (v1,PSN) (L,PS) (v2,PSL)

  • The sequence to be inserted:

=> Doc2 = (P,e) (S,P) (L,PS) (v2,PSL)

Assumptions of the Example:
  • There are 2 assumptions for the algorithm:
    • Max = 20480
    • Dynamic scope allocation method uses the parameter λ =2
The insertion process is much like that of inserting a sequence into a suffix tree.
  • We follow the branches, and when there is no branch to follow we create one.
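The insertion process can be sketched in Python with the example's assumptions (Max = 20480, λ = 2). Several simplifications are assumed here: a dictionary keyed by (parent n, element) stands in for the B+Tree lookups, a virtual root sits above (P,e), and each new child takes 1/λ of the scope still unallocated in its parent. The class name VirtualSuffixTree and its methods are hypothetical, and the labels it produces are not claimed to match the figure in the paper.

```python
from typing import Dict, List, Tuple

Element = Tuple[str, str]

class VirtualSuffixTree:
    def __init__(self, max_scope: int = 20480, lam: int = 2) -> None:
        self.lam = lam
        self.root = [0, max_scope, 0]  # dynamic scope <n, size, k>
        # (parent n, element) -> child scope; no suffix tree is ever built.
        self.children: Dict[Tuple[int, Element], List[int]] = {}

    def _allocate(self, scope: List[int]) -> List[int]:
        n, size, k = scope
        start, remaining = n + 1, size
        for _ in range(k):              # skip subscopes already given out
            used = remaining // self.lam
            start += used
            remaining -= used
        child_size = remaining // self.lam
        if child_size == 0:
            raise OverflowError("scope underflow")
        scope[2] = k + 1
        return [start, child_size, 0]

    def insert(self, seq: List[Element]) -> List[Tuple[int, int]]:
        """Follow existing branches; create a labeled branch when none exists."""
        scope, labels = self.root, []
        for element in seq:
            key = (scope[0], element)
            if key not in self.children:
                self.children[key] = self._allocate(scope)
            scope = self.children[key]
            labels.append((scope[0], scope[1]))
        return labels

index = VirtualSuffixTree()
doc1_labels = index.insert([("P", ""), ("S", "P"), ("N", "PS"),
                            ("v1", "PSN"), ("L", "PS"), ("v2", "PSL")])
doc2_labels = index.insert([("P", ""), ("S", "P"), ("L", "PS"), ("v2", "PSL")])
```

Inserting Doc2 reuses the labels of (P,e) and (S,P) from Doc1 and allocates fresh scopes only from (L,PS) onward, because (L,PS) has not yet appeared directly under (S,P).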
CONCLUSION:
  • VIST (a dynamic index method) is developed for XML documents.
  • XML data and XML queries are converted into sequences that encode their structural information.
VIST’s Pros:
  • Uses tree structure as the basic unit of query to avoid expensive join operations.
  • Supports dynamic data insertion and deletion.
  • Unlike the data structures used in some other approaches, VIST’s index structure is based on B+Trees, which are well supported by DBMSs.