VIST: A Dynamic Index Method for Efficient XML Data Querying

Presentation for Cmpe-521 VIST – Virtual Suffix Tree Prepared by: Evren CEYLAN – 2003700163 Aslı UYAR - 2003701321

VIST: A Dynamic Index Method for Querying XML Data by Tree Structures Written by: Haixun Wang, Sanghyun Park, Wei Fan, Philip S. Yu – SIGMOD 2003

What is XML? • XML: Extensible Markup Language • It plays an important role in data exchange. • As a result, much research has gone into flexible query mechanisms for extracting data from XML documents.

VIST: Virtual Suffix Tree • In this paper, VIST is proposed for searching XML documents. • XML documents and XML queries are represented as structure-encoded sequences (explained in the following slides). • With these sequences, it is shown that querying XML data is equivalent to finding subsequence matches.

Index Methods in XML • Previous index methods: Disassemble a query into multiple sub-queries, and then join the results of these sub-queries to provide final answers.

What does VIST do? • Converts both XML data and XML queries to structure-encoded sequences • Uses tree structures as the basic unit of query in order to avoid highly expensive join operations • In other words, uses structure-encoded sequences instead of nodes or paths

What does VIST do? • Matches structured queries against structured data as a whole, without breaking down the queries into sub-queries of paths or nodes and relying on join operations. • Supports dynamic index update.

What does VIST do? => In this paper, it is shown that VIST is effective and efficient in supporting structural queries.

Introduction • XML has a growing importance in data exchange (extracting data from XML documents) • XML provides a flexible way to define semi-structured data • In this paper a novel index structure called “VIST” (Virtual Suffix Tree) is introduced • VIST offers better performance and usability than previous approaches to XML indexing.

In XML query language design, expressing complex structural or graphical queries is one of the major concerns. • (Figure 2 displays four sample queries in graph form)

In previous approaches: • i. Indexes are created on paths (e.g. “/P/S/I/M” in Q1). Path indexes can answer simple queries efficiently (Q1 has no branches). • ii. However, queries that involve branching structures (such as Q2) have to be disassembled into sub-queries, whose results are then combined by expensive join operations to produce the final results. • iii. So, these methods are inefficient in handling branching queries.

In the VIST approach: Objective: to provide a general method so that structural XML queries need not be decomposed into sub-queries. Result: no need to perform expensive join operations.

Method: • XML data and XML queries are transformed into “structure-encoded sequences”. • A Virtual Suffix Tree is used to organize the structure-encoded sequences. • VIST also speeds up the matching process.

Structure: • VIST’s index structure includes two parts: the D-Ancestor index and the S-Ancestor index (explained in the following slides). • VIST unifies structural indexes and value indexes into a single index. • To achieve this, a method called “dynamic virtual suffix tree labeling” is proposed (index updates can be performed directly on B+Trees).

Structure-Encoded Sequences • Sequential representation of both XML Data and XML Queries.

Objective: Modeling XML queries through sequence matching lets us avoid unnecessary join operations in query processing. • Result: Structure-encoded sequences are used instead of paths or nodes.

Mapping Data and Queries to Structure-Encoded Sequences: Stage 1: • Let’s consider the purchase record example in Figure 3. • Notation: Capital letters represent names of attributes. • Lowercase letters represent attribute values. • To encode attribute values into integers we use a hash function h( ). • e.g. v1 = h(“dell”) and v2 = h(“ibm”) • v1 and v2 are used to represent “dell” and “ibm” respectively.
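The value-encoding step of Stage 1 can be sketched as follows. The slides only say that a hash function h( ) is used; the choice of CRC32 here is an assumption made for illustration (Python's built-in hash() is avoided because its string hashes change between runs).

```python
import zlib


def h(value: str) -> int:
    """Map an attribute value to an integer (CRC32 is an assumed stand-in)."""
    return zlib.crc32(value.encode("utf-8"))


v1 = h("dell")  # integer that stands for the value "dell"
v2 = h("ibm")   # integer that stands for the value "ibm"
```

Any deterministic hash would serve the same purpose: equal values must always map to the same integer so that they compare equal inside the index.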

Stage 2: • An XML document is represented by the preorder sequence of its tree structure. • e.g. the preorder sequence of the tree in Figure 3 is: PSNv1IMv2Nv3IMv4INv5Lv6BLv7Nv8

Stage 3: • Definition: A structure-encoded sequence is a sequence of (symbol,prefix) pairs: D = (a1,p1), (a2,p2), . . . , (an,pn) ai: node in the XML doc tree. pi: path from the root node to node ai.

Figure 3 can be converted into the structure-encoded sequence. • D = ... ... (Figure 4)
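Stages 2 and 3 can be sketched together: a preorder walk that emits one (symbol, prefix) pair per node. The `Node` class, the `structure_encode` function, and the small tree are hypothetical illustrations, not the tree of Figure 3. The empty prefix of the root is written as an empty string, and prefixes are built by plain concatenation, which assumes single-character element names.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    symbol: str  # element/attribute name, or an encoded value such as "v1"
    children: list["Node"] = field(default_factory=list)


def structure_encode(root: Node) -> list[tuple[str, str]]:
    """Return the preorder list of (symbol, prefix) pairs of the tree."""
    sequence = []

    def visit(node: Node, prefix: str) -> None:
        sequence.append((node.symbol, prefix))
        for child in node.children:
            # The child's prefix is the path from the root down to its parent.
            visit(child, prefix + node.symbol)

    visit(root, "")
    return sequence


# Illustrative tree: P -> S -> (N -> v1, L -> v2)
doc = Node("P", [Node("S", [Node("N", [Node("v1")]), Node("L", [Node("v2")])])])
encoded = structure_encode(doc)
```

For this tree, `encoded` holds the pairs (P,""), (S,P), (N,PS), (v1,PSN), (L,PS), (v2,PSL) in that order.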

Benefits: • The benefit of modeling XML queries through sequence matching is that structural queries can be processed as a whole instead of being broken into smaller query units (paths or nodes of the XML document tree) • This avoids combining the results of the sub-queries with join operations, which is expensive.

The VIST Approach: Presented in 3 stages: • A naïve algorithm based on suffix trees • RIST: improves the naïve algorithm by using B+Trees to index suffix tree nodes • VIST: an index structure that relies only on B+Trees

Requirements • An XML indexing method should meet the following requirements: • It should support structural queries directly. This is done with “structure-encoded sequences”. • Instead of relying on suffix trees, the index method should use better indexing techniques such as B+Trees. • The index structure should allow dynamic data insertion and deletion, etc.

A Naïve Algorithm Based on Suffix Trees • The most widely used index structure for subsequence matching is the suffix tree.

Example: • 2 XML documents called Doc1 and Doc2, • 2 XML queries called Q1 and Q2, all as structure-encoded sequences. • Doc1 : (P,e) (S,P) (N,PS) (v1,PSN) (L,PS) (v2,PSL) • Doc2 : (P,e) (B,P) (L,PB) (v2,PBL) • Q1 : (P,e) (B,P) (L,PB) (v2,PBL) • Q2 : (P,e) (L,P*) (v2,P*L)

Example: (Cont’d) • A tree structure for Doc1 and Doc2 is shown in Figure 5

Example: (Cont’d) • As shown above, elements in the sequences correspond to nodes in the suffix tree. • Since the nodes are involved in 2 different trees (the XML document tree and the suffix tree), there are 2 kinds of ancestor-descendant relationships among the nodes. i) D-Ancestorship, e.g. (S,P) is a D-ancestor of (L,PS) ii) S-Ancestorship, e.g. (v1,PSN) is an S-ancestor of (L,PS)
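At the level of the (symbol, prefix) pairs, D-Ancestorship can be read off the prefixes alone. The helper below is a hypothetical sketch of that test. It assumes single-character element names, and it compares labels only, so it cannot tell apart two distinct nodes that carry the same pair.

```python
def is_d_ancestor(x, y):
    """x and y are (symbol, prefix) pairs.

    x is a D-ancestor of y when the path ending at x (prefix + symbol)
    is a leading part of y's prefix.
    """
    x_symbol, x_prefix = x
    _, y_prefix = y
    return y_prefix.startswith(x_prefix + x_symbol)


s_over_l = is_d_ancestor(("S", "P"), ("L", "PS"))      # the slide's D-Ancestorship example
v1_over_l = is_d_ancestor(("v1", "PSN"), ("L", "PS"))  # an S-ancestor only, not a D-ancestor
```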

Naïve Algorithm based on the suffix trees: • The NaiveSearch algorithm is based on suffix trees. • It is a naïve method for non-contiguous subsequence matching.

For example, to match Q2: • Start with the root node, which matches the 1st element of Q2, that is (P,e). • Then search under the root for all nodes that match (L,P*), which yields (L,PS) and (L,PB). • Finally, search for - (v2,PSL) under the node labeled (L,PS) - (v2,PBL) under the node labeled (L,PB) • Algorithm 1 searches nodes first by S-Ancestorship, and then by D-Ancestorship.
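The walk-through above can be sketched in code. This is not the paper's Algorithm 1 verbatim but a minimal illustration under stated assumptions: the sequences are stored in a trie with a dummy root, "*" stands for exactly one single-character element name, the empty prefix is an empty string, and all names (`TrieNode`, `insert`, `naive_search`) are hypothetical.

```python
class TrieNode:
    def __init__(self, label=None):
        self.label = label    # (symbol, prefix) pair; None for the dummy root
        self.children = {}    # label -> TrieNode
        self.doc_ids = set()  # documents whose sequence ends at this node


def insert(root, sequence, doc_id):
    node = root
    for label in sequence:
        if label not in node.children:
            node.children[label] = TrieNode(label)
        node = node.children[label]
    node.doc_ids.add(doc_id)


def descendants(node):
    for child in node.children.values():
        yield child
        yield from descendants(child)


def wildcard_equal(pattern, text):
    # "*" matches exactly one element name (an assumption of this sketch).
    return len(pattern) == len(text) and all(p in ("*", t) for p, t in zip(pattern, text))


def naive_search(node, query, i=0, bound=()):
    """Ids of documents that contain `query` as a non-contiguous subsequence.

    bound[k] is the concrete path (prefix + symbol) matched by query element k.
    """
    if i == len(query):
        ids = set(node.doc_ids)
        for d in descendants(node):
            ids |= d.doc_ids
        return ids
    symbol, pattern = query[i]
    # Query parent: the earlier element whose path is the longest leading part of this pattern.
    parent, parent_path = None, ""
    for k in range(i):
        q_path = query[k][1] + query[k][0]
        if pattern.startswith(q_path) and len(q_path) >= len(parent_path):
            parent, parent_path = k, q_path
    base = bound[parent] if parent is not None else ""
    rest = pattern[len(parent_path):]
    result = set()
    for cand in descendants(node):  # S-Ancestorship: anywhere below `node`
        c_symbol, c_prefix = cand.label
        if (c_symbol == symbol and c_prefix.startswith(base)  # D-Ancestorship
                and wildcard_equal(rest, c_prefix[len(base):])):
            result |= naive_search(cand, query, i + 1, bound + (c_prefix + c_symbol,))
    return result


root = TrieNode()
insert(root, [("P", ""), ("S", "P"), ("N", "PS"), ("v1", "PSN"), ("L", "PS"), ("v2", "PSL")], 1)
insert(root, [("P", ""), ("B", "P"), ("L", "PB"), ("v2", "PBL")], 2)
q1 = [("P", ""), ("B", "P"), ("L", "PB"), ("v2", "PBL")]
q2 = [("P", ""), ("L", "P*"), ("v2", "P*L")]
q1_docs = naive_search(root, q1)
q2_docs = naive_search(root, q2)
```

The loop over `descendants(node)` is exactly the costly part: every match step scans the whole subtree below the current node.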

Difficulties of the Naïve Algorithm: • There are difficulties in using suffix trees to index structure-encoded sequences. • The major difficulty: searching for nodes satisfying both S-Ancestorship and D-Ancestorship is extremely costly (because we need to traverse a large portion of the subtree for each match).

RIST: Indexing by Ancestor-Descendant Relationships • Improves the naïve algorithm by eliminating the expensive traversal operations in the suffix tree. • When we reach node X after matching, we can jump directly to those nodes Y of which X is both a D-Ancestor and an S-Ancestor. • So, we no longer need to search among the descendants of X to find the Ys one by one.

RIST Algorithm: • 1. Index nodes in the suffix tree by their (Symbol,Prefix) pairs. This index is a B+Tree. • i. This enables us to search nodes by their (Symbol,Prefix) pairs, that is, by D-Ancestorship. • ii. This B+Tree is called the D-Ancestorship B+Tree.

RIST Algorithm: • 2. Among all the nodes satisfying D-Ancestorship, we are interested in the ones satisfying S-Ancestorship as well. • i. Labels are created for suffix tree nodes in order to tell the relationship between 2 nodes. • ii. We use B+Trees to index nodes by labels. • iii. This B+Tree is called the S-Ancestorship B+Tree.

Labeling Notation • <nx, sizex> • nx: prefix (preorder) traversal order of x in the suffix tree. • sizex: total number of descendants of x in the suffix tree. • This labeling is shown in Figure 5.

Labeling Notation • Note: with this labeling, the S-Ancestorship between any two nodes can be decided easily: • If x and y are labeled <nx, sizex> and <ny, sizey>, node x is an S-Ancestor of y if ny ∈ (nx, nx + sizex]
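A minimal sketch of this labeling, assuming preorder numbers start at 0 (the numbering in Figure 5 may differ). The function names and the dictionary-based tree are hypothetical; the tree holds the two example sequences Doc1 and Doc2 with the empty prefix written as an empty string.

```python
def assign_labels(children_of, root):
    """Return {node: (n, size)}: n is the preorder rank, size the number of descendants."""
    labels = {}
    counter = 0

    def visit(node):
        nonlocal counter
        n = counter
        counter += 1
        size = sum(1 + visit(child) for child in children_of.get(node, []))
        labels[node] = (n, size)
        return size

    visit(root)
    return labels


def is_s_ancestor(x_label, y_label):
    """x is an S-ancestor of y when n_y lies in (n_x, n_x + size_x]."""
    n_x, size_x = x_label
    n_y, _ = y_label
    return n_x < n_y <= n_x + size_x


P, S, N, V1 = ("P", ""), ("S", "P"), ("N", "PS"), ("v1", "PSN")
LS, V2S = ("L", "PS"), ("v2", "PSL")
B, LB, V2B = ("B", "P"), ("L", "PB"), ("v2", "PBL")
children_of = {P: [S, B], S: [N], N: [V1], V1: [LS], LS: [V2S], B: [LB], LB: [V2B]}
labels = assign_labels(children_of, P)
```

With these labels the ancestor test is two integer comparisons, so no tree traversal is needed to decide S-Ancestorship.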

Constructing the B+Trees: • Insert all suffix tree nodes into the D-Ancestorship B+Tree using their (Symbol,Prefix) pairs as keys. • All nodes x inserted with the same (Symbol,Prefix) are indexed by an S-Ancestorship B+Tree, using the nx values of their labels as keys. • Shown in Figure 6

Building the DocID B+Tree: • The DocID B+Tree stores, for each node x (using nx as key), the document IDs of those XML sequences that end at node x when they are inserted into the suffix tree. • Shown in DocID B+Tree

In summary: • Unlike the naïve algorithm, RIST does not traverse the suffix tree for subsequence matching (it uses the D-Ancestorship B+Tree and the S-Ancestorship B+Tree). • From any node, instead of searching the entire subtree under the node, we can jump to the descendant nodes that match the next element in the query. • So, RIST supports non-contiguous subsequence matching efficiently.
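The "jump" can be illustrated with a small in-memory stand-in for the three indexes. This is a sketch under assumptions, not the RIST implementation: a dictionary plus sorted Python lists replace the B+Trees, the class and method names are hypothetical, document IDs are assumed to be integers, and the <n, size> labels are hard-coded preorder labels (starting at 0) for the two example sequences.

```python
from bisect import bisect_left, bisect_right, insort


class RistIndex:
    def __init__(self):
        # Stand-in for the D-Ancestorship B+Tree: (symbol, prefix) -> S-Ancestorship index,
        # where each S-Ancestorship index is a sorted list of (n, size) labels.
        self.by_label = {}
        # Stand-in for the DocID B+Tree: sorted list of (n, doc_id).
        self.doc_ids = []

    def add_node(self, label, n, size):
        insort(self.by_label.setdefault(label, []), (n, size))

    def add_document_end(self, n, doc_id):
        insort(self.doc_ids, (n, doc_id))

    def jump(self, label, n_x, size_x):
        """Labels of nodes named `label` with n in (n_x, n_x + size_x]."""
        entries = self.by_label.get(label, [])
        lo = bisect_right(entries, (n_x, float("inf")))
        hi = bisect_right(entries, (n_x + size_x, float("inf")))
        return entries[lo:hi]

    def documents_under(self, n_x, size_x):
        """Ids of documents whose sequences end at n in [n_x, n_x + size_x]."""
        lo = bisect_left(self.doc_ids, (n_x,))
        hi = bisect_right(self.doc_ids, (n_x + size_x, float("inf")))
        return {doc_id for _, doc_id in self.doc_ids[lo:hi]}


index = RistIndex()
for label, n, size in [
    (("P", ""), 0, 8), (("S", "P"), 1, 4), (("N", "PS"), 2, 3),
    (("v1", "PSN"), 3, 2), (("L", "PS"), 4, 1), (("v2", "PSL"), 5, 0),
    (("B", "P"), 6, 2), (("L", "PB"), 7, 1), (("v2", "PBL"), 8, 0),
]:
    index.add_node(label, n, size)
index.add_document_end(5, 1)  # Doc1 ends at the node labeled <5, 0>
index.add_document_end(8, 2)  # Doc2 ends at the node labeled <8, 0>

l_nodes = index.jump(("L", "PB"), 0, 8)           # (L,PB) nodes below the root <0, 8>
v_nodes = index.jump(("v2", "PBL"), *l_nodes[0])  # (v2,PBL) nodes below that node
answers = index.documents_under(*v_nodes[0])
```

Each step is a key lookup followed by a range search on n, which is the kind of operation a B+Tree supports directly; wildcard handling is left out of this sketch.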

VIST: The Virtual Suffix Tree • RIST uses a static scheme to label suffix tree nodes, which prevents it from supporting dynamic insertions. • For any node x labeled <n,size>, later insertions can change the number of nodes that appear before x in the prefix order, • as well as the size of the subtree rooted at x, which means neither n nor size can be fixed.

VIST: The Virtual Suffix Tree • The purpose of the suffix tree is to provide a labeling mechanism to encode S-Ancestorship. • Suppose a node x is created for element di during the insertion of sequence d1, … , di, … , dk.

VIST: The Virtual Suffix Tree • If we can estimate i. how many different elements will possibly follow di in future insertions, and ii. the occurrence probability of each of these elements, • then we can label x’s child nodes right away instead of waiting until all sequences are inserted.

VIST: The Virtual Suffix Tree (Cont’d) • It also means: • the suffix tree itself is no longer needed, because its only purpose was to provide the labeling mechanism. • Dynamic data insertion and deletion are supported.

Top down scope allocation: • A tree structure defines nested scopes: the scope of a child node is a subscope of its parent node, and the root node has the max scope which covers the scope of each node.

Top down scope allocation: • In dynamic scope allocation there is a parameter called λ, which is the expected number of child nodes of any node. • λ is usually assumed to be 2. • Without knowledge of the occurrence rate of each child node, 1/λ of the remaining scope is allocated to x’s 1st inserted child. • Child1 : <n+1,size/2> • Child2 : <(n+1+size)/2, size/4>

Dynamic scope of a Suffix Tree Node: • The dynamic scope of a node is a triple <n,size,k>, • where k is the number of subscopes allocated inside the current scope.
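How the subscope sizes shrink can be sketched as below. The function is hypothetical and covers sizes only, not the start positions of the subscopes. It assumes that every newly inserted child, not just the first, receives 1/λ of whatever scope remains, and that divisions are integer divisions.

```python
def subscope_size(size, k, lam=2):
    """Size given to the next inserted child of a node with dynamic scope <n, size, k>.

    k subscopes are already allocated; each allocation took 1/lam of what remained.
    """
    remaining = size
    for _ in range(k):
        remaining -= remaining // lam
    return remaining // lam


# Sizes of the first three children of a node whose scope has size 20480, with lam = 2.
sizes = [subscope_size(20480, k) for k in range(3)]
```

With λ = 2 this reproduces the size/2 and size/4 pattern from the slide; a larger λ leaves more room for later children at the cost of smaller subscopes for early ones.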

Algorithm of VIST: • VIST uses the same sequence matching algorithm as RIST. • A dynamic method for labeling suffix tree nodes is presented that works without building the suffix tree.

Algorithm of VIST: • The method relies on rough estimations of the number of attribute values. • Because of that, the labeling mechanism is based on a virtual suffix tree.

Example: - Let’s look at the index structure before and after insertion

Algorithm of VIST: • Suppose that before the insertion the index structure already contains the following sequence: Doc1 = (P,e) (S,P) (N,PS) (v1,PSN) (L,PS) (v2,PSL) • The sequence to be inserted => Doc2 = (P,e) (S,P) (L,PS) (v2,PSL)

Assumptions of the Example: • There are 2 assumptions for the algorithm: • Max = 20480 • The dynamic scope allocation method uses the parameter λ = 2
