
ViST: a dynamic index method for querying XML data by tree structures

This paper presents ViST, a dynamic index method for querying XML data using tree structures. It introduces structure-encoded sequences, indexing techniques, and the ViST algorithm for efficient subsequence matching. Experimental results demonstrate the advantages of ViST over other methods.



Presentation Transcript


  1. ViST: a dynamic index method for querying XML data by tree structures Authors: Haixun Wang, Sanghyun Park, Wei Fan, Philip Yu Presenter: Elena Zheleva, November 2003

  2. Overview • Modeling XML Queries • Structure-encoded sequences • Indexing • ViST • Experimental Results

  3. Modeling XML Queries

  4. DTD of purchase records:
     <!ELEMENT purchases (purchase*)>
     <!ELEMENT purchase (seller, buyer)>
     <!ATTLIST seller ID ID location CDATA name CDATA>
     <!ELEMENT seller (item*)>
     <!ATTLIST buyer ID ID location CDATA name CDATA>
     <!ELEMENT item (item*)>
     <!ATTLIST item name CDATA manufacturer CDATA>

  5. Modeling XML Queries • Focus in XML query language design: ability to express complex structural or graphical queries

  6. Modeling XML Queries • Querying XML data = finding substructures of the data graph that match the sequence • Structure-encoded sequences: a sequential representation of both XML data and XML queries

  7. Structure-Encoded Sequences

  8. Structure-Encoded Sequences • Maps the data and the queries • Matches the subsequence • Purpose: to avoid as many join operations as possible • Def. Sequence of (symbol, prefix) pairs

  9. Mapping Data • Represent XML document/tree in preorder • Represent in structure-encoded seq
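The mapping on this slide can be illustrated with a minimal Python sketch. It assumes a simplified encoding in which only element tags are used as symbols (attributes and text values are ignored), and it represents each prefix as a tuple of tags. The function name `structure_encode` and the sample document are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def structure_encode(xml_text):
    """Return a list of (symbol, prefix) pairs in preorder.

    symbol: the element tag.
    prefix: tuple of tags on the path from the root down to the node's parent.
    Simplifying assumption: attributes and text values are ignored.
    """
    root = ET.fromstring(xml_text)
    sequence = []

    def visit(node, prefix):
        # Emit the node before its children (preorder traversal).
        sequence.append((node.tag, prefix))
        for child in node:
            visit(child, prefix + (node.tag,))

    visit(root, ())
    return sequence


doc = "<purchase><seller><item><item/></item></seller><buyer/></purchase>"
for pair in structure_encode(doc):
    print(pair)
```

Each printed pair carries the full path to its node, so the tree structure can be read back from the flat sequence.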

  10. Mapping Queries • Benefit of sequence matching: query gets processed as whole • Path Expression

  11. Structure-Encoded Sequences • Query • Data

  12. Querying XML • through Structure-Encoded Sequence Matching
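As a rough illustration of querying by sequence matching, the sketch below checks whether a query sequence occurs as a (not necessarily contiguous) subsequence of a data sequence of (symbol, prefix) pairs. It is a naive linear scan without any index, not the ViST algorithm. The wildcard notation inside prefixes (`"*"` for exactly one tag, `"//"` for zero or more tags) and the function names are assumptions made for this example, and the check ignores any further consistency constraints between matched elements.

```python
def prefix_matches(pattern, prefix):
    """True if the tuple `prefix` matches `pattern`.

    Assumed notation: '*' matches exactly one tag, '//' matches zero or more tags.
    """
    if not pattern:
        return not prefix
    head, rest = pattern[0], pattern[1:]
    if head == "//":
        # Let '//' absorb 0, 1, 2, ... leading tags of the prefix.
        return any(prefix_matches(rest, prefix[i:]) for i in range(len(prefix) + 1))
    if not prefix:
        return False
    return (head == "*" or head == prefix[0]) and prefix_matches(rest, prefix[1:])


def is_subsequence_match(query, data):
    """Greedy scan: every query pair must match a data pair, in order."""
    pos = 0
    for symbol, pattern in query:
        while pos < len(data):
            d_symbol, d_prefix = data[pos]
            pos += 1
            if d_symbol == symbol and prefix_matches(pattern, d_prefix):
                break  # matched this query pair, move on to the next one
        else:
            return False  # data exhausted before the pair was matched
    return True


data = [
    ("purchase", ()),
    ("seller", ("purchase",)),
    ("item", ("purchase", "seller")),
    ("item", ("purchase", "seller", "item")),
    ("buyer", ("purchase",)),
]
# Roughly /purchase/seller/item
print(is_subsequence_match(
    [("purchase", ()), ("seller", ("purchase",)), ("item", ("purchase", "seller"))], data))
# Roughly /purchase//item
print(is_subsequence_match([("purchase", ()), ("item", ("purchase", "//"))], data))
```

Because the whole query is one sequence, it is matched in a single pass rather than being split into sub-queries whose results must be joined.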

  13. Indexing

  14. Role of Indexing • To provide an algorithm to perform this sequence matching • Desired features for algorithm: • Efficient support for subsequence matching • Use well-supported DB indexing techniques such as B+ trees • Allow dynamic index insertion

  15. What is indexing useful for • Auxiliary access structures • Used to speed up the retrieval of records • In response to certain search conditions • Provide efficient support for arbitrary structured queries • Using wild-cards // and *

  16. Indexing • State-of-the-art approaches • Indexes on paths • Indexes on nodes • Indexes on both (structures) – ViST

  17. ViST

  18. Algorithms • Naïve Algorithm based on Suffix Trees • RIST: Relationships Indexed Suffix Tree • ViST: Virtual Suffix Tree

  19. Algorithm Using Suffix Trees • Suffix Tree: a compact index to all distinct, contiguous substrings of a string • D-Ancestorship – in XML doc tree • Through structure-encoded sequence • S-Ancestorship – in suffix tree

  20. Example Using Suffix Trees

  21. Algorithm Using Suffix Trees • Searches • first by S-Ancestorship: searching under suffix tree • then by D-Ancestorship: matching nodes and prefixes • Disadvantages: • Costly – traverses a large portion of the subtree • Most commercial DBMSs do not support suffix trees

  22. RIST: Indexing by Ancestor-Descendant Relationships • Jumps directly to the nodes Y to which X is both a D-Ancestor and S-Ancestor • Index Construction: uses B+ trees

  23. RIST: Indexing by Ancestor-Descendant Relationships • Subsequence Matching • Determine D-Ancestorship by prefixes • Determine S-Ancestorship by label <n_x, size_x> • x – suffix tree node (root of S-tree) • n_x – prefix traversal order • size_x – number of descendants
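The labeling idea can be sketched in Python as follows. The sketch assumes the tree is given as a plain dictionary from node to child list, that preorder numbers start at 0, and that y is a descendant of x exactly when n_y lies in the range (n_x, n_x + size_x]. The names `label_preorder` and `is_s_ancestor` are hypothetical.

```python
def label_preorder(tree, root):
    """Label every node with (n, size).

    n: position of the node in a preorder traversal (assumed to start at 0).
    size: number of descendants of the node.
    """
    labels = {}
    counter = 0

    def visit(node):
        nonlocal counter
        n = counter
        counter += 1
        for child in tree.get(node, []):
            visit(child)
        # Every number handed out since n belongs to a descendant.
        labels[node] = (n, counter - n - 1)

    visit(root)
    return labels


def is_s_ancestor(x_label, y_label):
    """True if the node labeled x_label is a proper ancestor of the node labeled y_label."""
    n_x, size_x = x_label
    n_y, _ = y_label
    return n_x < n_y <= n_x + size_x


tree = {"r": ["a", "b"], "a": ["c", "d"]}
labels = label_preorder(tree, "r")
print(labels)
print(is_s_ancestor(labels["a"], labels["d"]))
print(is_s_ancestor(labels["a"], labels["b"]))
```

With such labels, an ancestor test becomes a range comparison on integers, which is the kind of condition a B+ tree range scan can answer.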

  24. ViST: the Virtual Suffix Tree • Same sequence algorithm as RIST • BUT supports dynamic insertions • Uses dynamic method to assign labels • Once assigned, the labels are fixed and are not affected by subsequent data insertion or deletion • Labeling the suffix tree w/o building it • Relies on statistical information about the XML data

  25. ViST: the Virtual Suffix Tree Index structure contains the sequence: Sequence to be inserted: Dynamic scope of x = <n_x, size_x, k_x>
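One way to picture dynamic scopes is the Python sketch below. It is only an illustration under stated assumptions, not the allocation rule from the paper: each new child receives a fixed fraction 1/`lam` of the labels still unused in its parent's scope, where `lam` is a hypothetical tuning parameter. The field `next_free` is a convenience of this sketch and is not part of the <n_x, size_x, k_x> triple.

```python
class Scope:
    """Dynamic scope <n, size, k>: labels n+1 .. n+size are reserved for descendants."""

    def __init__(self, n, size):
        self.n = n
        self.size = size
        self.k = 0                # number of sub-scopes handed out so far
        self.next_free = n + 1    # sketch-only helper: first label not yet given away

    def allocate_child(self, lam=2):
        """Reserve a sub-scope for a new child without touching existing labels.

        Assumed rule: the child gets 1/lam of the labels that remain unused.
        """
        remaining = self.n + self.size - self.next_free + 1
        if remaining < 1:
            raise OverflowError("scope exhausted")
        share = max(1, remaining // lam)   # labels for the child and its descendants
        child = Scope(self.next_free, share - 1)
        self.next_free += share
        self.k += 1
        return child


root = Scope(0, 15)
first = root.allocate_child()
second = root.allocate_child()
print((first.n, first.size, first.k))
print((second.n, second.size, second.k))
print((root.n, root.size, root.k))
```

The point of the sketch is that inserting `second` does not change the label already given to `first`, so labels stay fixed as new sequences arrive.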

  26. ViST: the Virtual Suffix Tree

  27. Experimental Results • Datasets used • DBLP: CS bibliography DB • 289,627 records/publications • Each publication – tree of max depth 6 • Avg length of structure-encoded seq = 31 • XMARK • 1 record • Complicated tree structure • Synthetic

  28. Experimental Results • Comparison Methods • Index Fabric Algorithm – XML paths • XISS – uses nodes as basic query unit • ViST – takes approx. 1/10 of the time to perform queries, because the other methods need (multiple) join operations

  29. Experimental Results • Index Structure and Size (1/3 less than suffix tree) • DocId B+ Tree – N elements • Combined D-ancestor and S-ancestor B+ tree – N x L elements • Index Construction

  30. Conclusion • XML Queries = Subsequence Matching • Advantages of ViST – algorithm for subsequence matching • Avoids expensive join operations • Index on both content and structure of XML documents • B+ trees – supported by disk-based data • Dynamic data insertion and deletion
