Covering Index for Branching Path Queries

Raghav Kaushik

University of Wisconsin

Philip Bohannon

Bell Laboratories

Jeffrey F Naughton

University of Wisconsin

Henry F Korth

Bell Laboratories

SIGMOD 2002

Presented by: Yu Fan

Overview

- Motivation
- Problem
- Introduction
- Background
- Covering Index Definition Scheme
- Performance Study
- Conclusion

Motivation

- Covering indexes are a well-known technique in relational database systems
- Define an index that “covers” all attributes of a table that are referenced in a query
- Evaluate the query without accessing the table
- Speed up query performance

- Can covering indexes be used to accelerate branching path queries?
- Yes

Problem

- Existing indexes are large in practice
- DataGuide
- 1-Index
- Forward and Backward Index (F&B Index)

The Labeled Graph Data Model

- Model XML or semi-structured data as a directed, node-labeled tree with an extra set of special edges called idref edges
- Directed graph

Branching Path Expressions

- Forward and backward separators
- If n_i and n_{i+1} are separated by
- /: then n_i is the parent of n_{i+1}
- //: then n_i is an ancestor of n_{i+1}
- ⇒: then n_i points to n_{i+1} through an idref edge
- \: then n_i is the child of n_{i+1}
- \\: then n_i is a descendant of n_{i+1}
- ⇐: then n_i is pointed to by n_{i+1} through an idref edge

Branching Path Expressions

- Label-path
- A sequence of labels l_1, l_2, …, l_p separated by the separators

- Node-path
- A sequence of nodes n_1, n_2, …, n_p separated by the separators

- A node-path matches a label-path if the corresponding separators are the same and label(n_i) = l_i

Branching Path Expressions

- The primary path is the path that remains when all parts between brackets “[” and “]” are removed.
- Example:
ROOT/metro/neighborhoods/neighborhood[/business⇒hotel]/cultural⇒museum
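The bracket-stripping rule can be sketched in a few lines of Python. This is only an illustration: the function name `primary_path` and the sample expressions are assumptions made for the example, and the expression is treated as a plain string rather than parsed.

```python
def primary_path(expr: str) -> str:
    """Drop every bracketed branch condition, including nested ones."""
    depth = 0          # how many "[" are currently open
    kept = []
    for ch in expr:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        elif depth == 0:
            kept.append(ch)   # keep only characters outside all brackets
    return "".join(kept)


# Hypothetical expressions, modelled loosely on the slide's example.
print(primary_path("ROOT/metro/neighborhoods/neighborhood[/business]/cultural"))
print(primary_path("a/b[/c[\\d]]/e"))
```

Nested brackets are handled by the depth counter, so a branch inside a branch is removed together with its enclosing condition.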

Index Graph

- Index Graph I(G), where G is the data graph
- For each node A in I(G), ext(A), the extent of A, is a subset of V_G, the set of data nodes
- Query result of a branching path expression P on I(G)
- The union of the extents of the index nodes that result from evaluating P on I(G)
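As a rough illustration of "evaluate on the index, then take the union of extents", the sketch below handles only a root-anchored simple path with "/" separators. The dictionaries, the index node names A to D, and the data node numbers are all hypothetical.

```python
from typing import Dict, List, Set


def evaluate_on_index(
    labels: Dict[str, str],
    children: Dict[str, List[str]],
    extent: Dict[str, Set[int]],
    root: str,
    path: List[str],
) -> Set[int]:
    """Evaluate a root-anchored simple path (only '/' separators) on an index graph."""
    # Index nodes matched so far; start at the root if its label matches.
    frontier = {root} if labels[root] == path[0] else set()
    for label in path[1:]:
        frontier = {
            child
            for node in frontier
            for child in children.get(node, [])
            if labels[child] == label
        }
    # The query result is the union of the extents of the matched index nodes.
    result: Set[int] = set()
    for node in frontier:
        result |= extent[node]
    return result


# Hypothetical index graph: four index nodes, data nodes numbered 1..6.
labels = {"A": "ROOT", "B": "metro", "C": "neighborhoods", "D": "neighborhood"}
children = {"A": ["B"], "B": ["C"], "C": ["D"]}
extent = {"A": {1}, "B": {2}, "C": {3}, "D": {4, 5, 6}}

answer = evaluate_on_index(
    labels, children, extent, "A",
    ["ROOT", "metro", "neighborhoods", "neighborhood"],
)
print(sorted(answer))  # [4, 5, 6]
```

The data graph is never touched: the answer comes entirely from the index nodes and their extents.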

Bisimilarity

- Definition: a symmetric, binary relation ≈ on V_G is called a bisimulation if, for any two data nodes u and v with u ≈ v, we have that:
- u and v have the same label
- If par_u is the parent of u and par_v is the parent of v, then par_u ≈ par_v
- If u’ points to u through an idref edge, then there is a v’ that points to v through an idref edge such that u’ ≈ v’, and vice versa

DataGuide

- Concise and accurate structural summaries of semi-structured databases

1-Index

- Index graph which is constructed on data graph G using bisimulation
- Intuition: try to group together nodes if they have the same incoming paths

Forward and Backward Index

- Construct the F&B-Index on an edge-labeled data graph
- For every (edge) label l, add a new label l⁻¹
- For every edge e labeled l from node u to node v, add an (inverse) edge e⁻¹ with label l⁻¹ from v to u
- Compute the 1-Index (or DataGuide) on this modified graph
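The graph modification described above is mechanical and can be sketched as follows. The triple representation `(u, label, v)` and the `"^-1"` suffix used to name inverse labels are assumptions made for this example.

```python
from typing import List, Tuple

Edge = Tuple[int, str, int]  # (source node, edge label, target node)


def add_inverse_edges(edges: List[Edge]) -> List[Edge]:
    """Return the original edges plus one inverse edge per original edge."""
    inverse = [(v, label + "^-1", u) for (u, label, v) in edges]
    return edges + inverse


# Hypothetical edge-labeled data graph with two edges.
graph = [(1, "metro", 2), (2, "neighborhood", 3)]
for edge in add_inverse_edges(graph):
    print(edge)
```

Any procedure that summarizes incoming label paths, run on the enlarged graph, then sees both forward and backward paths of the original graph.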

Succ-Stable and Pred-Stable

- For a set of nodes A, let Succ(A) denote the set of successors of the nodes in A.
- Given two sets of data graph nodes A and B, A is said to be succ-stable with respect to B if either A is a subset of Succ(B) or A and Succ(B) are disjoint
- Pred-stability is defined in the same way, using predecessors instead of successors
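The succ-stability test translates directly into set operations. A minimal sketch, assuming the graph is given as a list of `(source, target)` pairs; the function names and the sample edges are made up for illustration.

```python
from typing import Iterable, Set, Tuple


def succ(nodes: Set[int], edges: Iterable[Tuple[int, int]]) -> Set[int]:
    """Successors of a set of nodes: targets of edges that start inside the set."""
    return {v for (u, v) in edges if u in nodes}


def is_succ_stable(a: Set[int], b: Set[int], edges: Iterable[Tuple[int, int]]) -> bool:
    """A is succ-stable w.r.t. B if A lies entirely inside or entirely outside Succ(B)."""
    succ_b = succ(b, edges)
    return a <= succ_b or a.isdisjoint(succ_b)


# Hypothetical graph: 1 -> 2, 1 -> 3, 4 -> 5.
edges = [(1, 2), (1, 3), (4, 5)]
print(is_succ_stable({2, 3}, {1}, edges))  # True: both are successors of 1
print(is_succ_stable({3, 5}, {1}, edges))  # False: 3 is a successor of 1, 5 is not
print(is_succ_stable({5}, {1}, edges))     # True: disjoint from Succ({1})
```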

Stability

- If A is succ-stable with respect to B and there is an edge from B to A, then every node in the extent of A has a parent in the extent of B
- Important for the precision of the index graph
- Stabilize A with respect to B
- Split A into A1 and A2
- A1 = A ∩ Succ(B)
- A2 = A − Succ(B)

- 1-Index
- Initialize the partition by grouping nodes with the same label
- Keep splitting the groups until we obtain a succ-stable refinement
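The label grouping followed by repeated splitting can be sketched as a naive refinement loop. This is an unoptimized illustration, not the efficient partition-refinement algorithm a real system would use; the function name, the node numbering, and the sample tree are assumptions.

```python
from typing import Dict, FrozenSet, List, Set, Tuple


def one_index_partition(
    labels: Dict[int, str], edges: List[Tuple[int, int]]
) -> Set[FrozenSet[int]]:
    """Split the label grouping until every group is succ-stable w.r.t. every group."""
    # Initialization: one group per label.
    groups: Dict[str, Set[int]] = {}
    for node, label in labels.items():
        groups.setdefault(label, set()).add(node)
    partition = [frozenset(g) for g in groups.values()]

    changed = True
    while changed:
        changed = False
        for b in partition:
            succ_b = {v for (u, v) in edges if u in b}
            refined = []
            for a in partition:
                a1, a2 = a & succ_b, a - succ_b   # A ∩ Succ(B) and A − Succ(B)
                if a1 and a2:
                    refined += [a1, a2]           # A was not stable: split it
                    changed = True
                else:
                    refined.append(a)
            if changed:
                partition = refined               # restart with the finer partition
                break
    return set(partition)


# Hypothetical tree: ROOT(1) has children a(2) and b(3);
# c-nodes 4 and 6 are under a, c-node 5 is under b.
labels = {1: "ROOT", 2: "a", 3: "b", 4: "c", 5: "c", 6: "c"}
edges = [(1, 2), (1, 3), (2, 4), (3, 5), (2, 6)]
print(sorted(sorted(block) for block in one_index_partition(labels, edges)))
# [[1], [2], [3], [4, 6], [5]]
```

The three c-nodes start in one group and are split because only two of them are successors of the a-node.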

Another View of F&B-Index

- Another way to build F&B-Index
- Reverse all edges in G
- Compute the bisimilarity partition
- Set the current partition to what is output by the previous step
- Reverse edges in G again
- Compute the bisimilarity partition
- Set the current partition to what is output by the previous step
- Repeat the above steps till the current partition does not change

- Obtain a partition of the data nodes that is both succ-stable and pred-stable
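The alternation between the reversed and the original graph can be sketched as below. It is a naive fixed-point loop under stated assumptions: edges are unlabeled `(source, target)` pairs, tree and idref edges are treated alike, and `refine` and `fb_partition` are names invented for the example.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Partition = Set[FrozenSet[int]]


def refine(partition: Partition, edges: List[Tuple[int, int]]) -> Partition:
    """Split blocks until each is succ-stable w.r.t. every block, for the given edges."""
    blocks = set(partition)
    changed = True
    while changed:
        changed = False
        for b in list(blocks):
            succ_b = {v for (u, v) in edges if u in b}
            for a in list(blocks):
                inside, outside = a & succ_b, a - succ_b
                if inside and outside:
                    blocks.remove(a)
                    blocks.update((inside, outside))
                    changed = True
    return blocks


def fb_partition(labels: Dict[int, str], edges: List[Tuple[int, int]]) -> Partition:
    """Alternate refinement on reversed and original edges until nothing changes."""
    groups: Dict[str, Set[int]] = {}
    for node, label in labels.items():
        groups.setdefault(label, set()).add(node)
    partition: Partition = {frozenset(g) for g in groups.values()}

    reversed_edges = [(v, u) for (u, v) in edges]
    while True:
        previous = partition
        partition = refine(partition, reversed_edges)  # stabilize on the reversed graph
        partition = refine(partition, edges)           # stabilize on the original graph
        if partition == previous:
            return partition


# Hypothetical tree: ROOT(1) has two a-children, 2 and 3; only node 2 has a b-child (4).
labels = {1: "ROOT", 2: "a", 3: "a", 4: "b"}
edges = [(1, 2), (1, 3), (2, 4)]
print(sorted(sorted(block) for block in fb_partition(labels, edges)))
# [[1], [2], [3], [4]]
```

In this example the two a-nodes have identical incoming paths, so refinement on the original edges alone would keep them together; the pass over the reversed graph separates them because only one has a b-child.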

Size of the F&B-Index

- F&B-Index over a data graph G covers all branching path expressions over G
- Any index graph that covers all branching path expressions over G must be a refinement of the F&B-Index
- F&B-Index is the smallest index graph that covers all branching path expressions over G
- F&B-Index is often big. It can approach the size of the base data itself

Covering Index Definition Scheme

- Eliminate branching path expressions that are deemed less important
- A smaller index handles the remaining branching path queries more efficiently
- Four approaches towards this goal
- Tags to be indexed
- Tree edges vs idref edges
- Exploiting local similarity
- Restricting tree depth

Tags to be indexed

- Tags that are never queried
- Need not be indexed
- Replace their labels with a single special label: other
- If such a node is not on the tree path to any indexed node, it can be treated as absent

- Can have a large effect in practice
- XMark data, 100 MB (1.43M nodes)
- F&B-Index has 436,000 nodes
- Ignoring text tags such as bold and emph
- Number of nodes drops to 18,000
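The relabeling step is simple enough to show directly. In this sketch the function name, the node ids, and the tag set are assumptions; only the special label `other` comes from the slide.

```python
from typing import Dict, Set


def relabel_unindexed(labels: Dict[int, str], indexed_tags: Set[str]) -> Dict[int, str]:
    """Keep labels that are in the indexed tag set; map every other label to 'other'."""
    return {
        node: (label if label in indexed_tags else "other")
        for node, label in labels.items()
    }


# Hypothetical fragment: text-markup tags are never queried.
labels = {1: "item", 2: "description", 3: "bold", 4: "emph"}
print(relabel_unindexed(labels, {"item", "description"}))
# {1: 'item', 2: 'description', 3: 'other', 4: 'other'}
```

Once bold and emph share one label, nodes that differed only in their markup are no longer forced into separate groups, which is the intuition behind the smaller index.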

Tree Edges vs idref Edges

- Effect of idref edges
- XMark data
- F&B-Index on tree edges and idref edges has 1.35M nodes (ignoring text nodes)
- F&B-Index on only tree edges has 18,000 nodes (ignoring text nodes)

- Give tree edges priority
- Specify the set of idref edges to be indexed

Exploiting Local similarity

- Observations:
- Most queries refer to short paths and seldom ask for long paths
- Two locally similar nodes may still end up in different extents because of a variety of complex paths

- Exploiting local similarity
- Give up absolute precision and group similar pieces of data together
- A(k)-Index

k-bisimulation

- Definition: ≈^k (k-bisimilarity) is defined inductively
- For any two nodes u and v, u ≈^0 v iff u and v have the same label
- u ≈^k v iff u ≈^(k-1) v, par_u ≈^(k-1) par_v, and
- For every u’ that points to u through an idref edge, there is a v’ that points to v through an idref edge such that u’ ≈^(k-1) v’, and vice versa

A(k)-Index

- Constructed on data graph G using k-bisimulation
- Precise for any simple path expression of length less than or equal to k
- Use k to control the size of the index and the maximum area of the index graph affected
- Increasing k refines the partition until a fixed point is reached, which is the 1-Index
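The role of k can be illustrated with a signature-based sketch: after each round, two nodes stay together only if they were together before and their predecessors fall into the same groups. This is an assumption-laden illustration: tree and idref edges are merged into one list of `(source, target)` pairs, the nested-tuple signatures are not meant to scale, and `ak_partition` is a made-up name.

```python
from typing import Dict, FrozenSet, Hashable, List, Set, Tuple


def ak_partition(
    labels: Dict[int, str], edges: List[Tuple[int, int]], k: int
) -> Set[FrozenSet[int]]:
    """Group nodes that agree on their label and on incoming structure up to k steps."""
    preds: Dict[int, Set[int]] = {node: set() for node in labels}
    for u, v in edges:
        preds[v].add(u)

    # Round 0: the signature of a node is just its label.
    signature: Dict[int, Hashable] = dict(labels)
    for _ in range(k):
        # Round i: previous signature plus the set of predecessor signatures.
        signature = {
            node: (signature[node], frozenset(signature[p] for p in preds[node]))
            for node in labels
        }

    groups: Dict[Hashable, Set[int]] = {}
    for node, sig in signature.items():
        groups.setdefault(sig, set()).add(node)
    return {frozenset(g) for g in groups.values()}


# Hypothetical tree: ROOT(1) -> a(2), b(3); a -> c(4); b -> c(5); c(4) -> d(6); c(5) -> d(7).
labels = {1: "ROOT", 2: "a", 3: "b", 4: "c", 5: "c", 6: "d", 7: "d"}
edges = [(1, 2), (1, 3), (2, 4), (3, 5), (4, 6), (5, 7)]
for k in (0, 1, 2):
    print(k, sorted(sorted(block) for block in ak_partition(labels, edges, k)))
```

With k = 1 the two d-nodes still share a group because both have a c-parent; with k = 2 they are separated because their grandparents differ, showing how a larger k buys precision at the cost of a larger index.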

Restricting Tree Depth

- Tree depth
- Given a branching path expression
- All nodes on the primary path have tree-depth 0
- Nodes that do not have tree-depth 0 and have a path from some node on the primary path have tree-depth 1
- Nodes that do not have tree-depth 1 and have a path to some node of tree-depth 1 have tree-depth 2
- Nodes that do not have tree-depth 2 and have a path from some node of tree-depth 2 have tree-depth 3
- And so on…

- Tree depth of a query is the maximum tree-depth of its nodes

Tree Depth Example

- Query example
- //museums/history/museum[/featured and ⇐cultural\neighborhood[/cultural⇒museum[\art]]]
- Asks for history museums that have a featured exhibit and also have an art museum in the same neighborhood

F+B-Index

- Consider one iteration of F&B-Index Computation
- Reverse all edges in G.
- Compute the bisimilarity partition
- Reverse edges in G again
- Compute the bisimilarity partition

- Call this index graph F+B-Index
- F+B+F+B-Index: two iterations

F+B-Index

- F+B-Index is accurate for branching path expressions that have tree depth at most 1
- F+B+F+B-Index is accurate for branching path expressions that have tree depth at most 3
- Cannot handle all queries
- Meaningful queries often have small tree depth

Putting it together

- Index definition
- A set of tags T to be indexed.
- For each of the forward and backward directions
- Set of idref edges to be indexed (denote as reffwd and refback)
- The extent of local similarity desired (denote as kfwd and kback)

- Tree depth td, the number of iterations in the F&B-index computation to be performed

Example

- Tags to be indexed
- ROOT, metro, cinema-hall, neighborhoods, neighborhood, business

- Local similarity and tree depth
- kfwd = kback = ∞
- td = ∞

[Figure: index graph for this definition, with index nodes labeled ROOT, metro, neighborhoods, neighborhood, business, Cinema-halls and Cinema-hall]

Index Selection

- Given a query, an index covers it when
- Every tag in the query is indexed
- kfwd ≥ path length of the query
- kback ≥ path length of the query
- td ≥ tree depth of the query

- The more generic the index, the more queries it covers, but the worse the performance we get
- Depends heavily on the data and the queries
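The selection conditions amount to a small predicate over an index definition and a query summary. The sketch below is illustrative only: the class names, field names, and sample values are assumptions, the sets of idref edges are left out, and `math.inf` stands in for ∞.

```python
import math
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class IndexDefinition:
    tags: FrozenSet[str]   # tags that are indexed
    k_fwd: float           # local similarity, forward direction
    k_back: float          # local similarity, backward direction
    td: float              # tree depth (number of F+B iterations)


@dataclass(frozen=True)
class QueryProfile:
    tags: FrozenSet[str]   # tags mentioned in the query
    path_length: int
    tree_depth: int


def covers(index: IndexDefinition, query: QueryProfile) -> bool:
    """Check the four conditions listed above for a single query."""
    return (
        query.tags <= index.tags
        and index.k_fwd >= query.path_length
        and index.k_back >= query.path_length
        and index.td >= query.tree_depth
    )


# Hypothetical definitions: one generic index and one restricted index.
generic = IndexDefinition(
    frozenset({"ROOT", "metro", "neighborhoods", "neighborhood", "business"}),
    math.inf, math.inf, math.inf,
)
restricted = IndexDefinition(frozenset({"ROOT", "metro"}), 2, 2, 1)
query = QueryProfile(frozenset({"ROOT", "metro", "neighborhoods"}), 3, 0)

print(covers(generic, query))     # True
print(covers(restricted, query))  # False: a tag is missing and k is too small
```

Among the definitions that pass this check, the trade-off noted above suggests preferring the least generic one, although the right choice depends on the data and the workload.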

Performance study

- XMark XML benchmark dataset
- Models an auction site

Performance on Queries

- Use index definitions 5, 6 and 8, called I_all, I_almost-all and I_specific
- Use 5 different queries
- Some indexes may not cover all the queries because of the reductions
- Three scenarios
- RELSTORE: data stored in a relational system
- NSTORE: data stored using a native storage engine
- RELPUBLISH: data stored in a relational system, with queries posed over an XML view of the data

Conclusion

- Covering indexes are a promising approach to the efficient evaluation of branching path queries
- The F&B-Index is a covering index for the set of all branching path queries, but it is too big in practice
- Using the index definition scheme, we can get much smaller covering indexes that cover certain classes of queries