
Sequence Indexing Schemes

Sequence Indexing Schemes. Roman Čížek Erasmus 2687, Nelly Vouzoukidou MET601. Introduction. Graph indexes precise Path, (twig only few methods) Sequence indexing schemes Top-down or bottom-up XML document and XML queries in structure-encoded sequences Path and twig.


Presentation Transcript


  1. Sequence Indexing Schemes Roman Čížek Erasmus 2687, Nelly Vouzoukidou MET601

  2. Introduction • Graph indexes • precise • Path queries (twig queries in only a few methods) • Sequence indexing schemes • Top-down or bottom-up • XML documents and XML queries are represented as structure-encoded sequences • Path and twig queries

  3. Top-Down Sequence Indexes: ViST

  4. ViST – Virtual Suffix Tree • Top-down sequence index • Represents XML documents and XML queries as structure-encoded sequences • Querying XML data is equivalent to finding subsequence matches • Avoids expensive join operations • Provides a unified index on both content and structure • Supports dynamic index updates • Built on B+Trees, which are well supported in DBMSs

  5. DTD of purchase records <!ELEMENT purchases (purchase*)> <!ELEMENT purchase (seller, buyer)> <!ATTLIST seller ID ID location CDATA name CDATA> <!ELEMENT seller (item*)> <!ATTLIST buyer ID ID location CDATA name CDATA> <!ELEMENT item (item*)> <!ATTLIST item name CDATA manufacturer CDATA>

  6. A Single Purchase Record

  7. Preorder Sequence of XML • Use capital letters to represent names of elements/attributes • Use a hash function h() to encode attribute values into integers • v1 = h(“dell”) • v2 = h(“ibm”) • Preorder sequence of the XML purchase record example • PSNv1IMv2Nv3IMv4INv5Lv6BLv7Nv8 • Isomorphic trees may produce different preorder sequences • A DTD schema embodies a linear order of all elements/attributes • Without a DTD, use lexicographical order

  8. Structure-Encoded Sequence Definition: A Structure-Encoded Sequence, derived from a preorder (prefix) traversal of a semi-structured XML document, is a sequence of (symbol, prefix) pairs: D = (a1,p1), (a2,p2), …, (an,pn), where ai represents a node in the XML document tree (of which a1, …, an is the preorder sequence), and pi is the path from the root node to node ai.

  9. Structure-Encoded Sequence D= (P,ϵ),(S,P),(N,PS),(v1,PSN),(I,PS),(M,PSI),(v2,PSIM),(N,PSI), (v3,PSIN),(I,PSI),(M,PSII),(v4,PSIIM),(I,PS),(N,PSI),(v5,PSIN), (L,PS),(v6,PSL),(B,P),(L,PB),(v7,PBL),(N,PB),(v8,PBN)
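The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Node` class and the `structure_encoded` function are hypothetical names, and prefixes are stored as tuples of symbols so that multi-character value symbols such as v1 stay unambiguous. The sample tree is the shorter Doc1 used later for the naïve algorithm.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    symbol: str                      # element/attribute letter or hashed value (e.g. "v1")
    children: list = field(default_factory=list)


def structure_encoded(root):
    """Return the preorder list of (symbol, prefix) pairs.

    The prefix of a node is the path of symbols from the root down to
    its parent, stored here as a tuple.
    """
    out = []

    def visit(node, prefix):
        out.append((node.symbol, prefix))
        for child in node.children:
            visit(child, prefix + (node.symbol,))

    visit(root, ())
    return out


# Doc1 = (P,ϵ)(S,P)(N,PS)(v1,PSN)(L,PS)(v2,PSL)
doc1 = Node("P", [Node("S", [Node("N", [Node("v1")]),
                             Node("L", [Node("v2")])])])
seq = structure_encoded(doc1)
for symbol, prefix in seq:
    print(symbol, "".join(prefix) or "ϵ")
```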

  10. XML Queries in Graph Form

  11. XML Queries in Path Expression and Sequence Form (each query is given as path expression, then structure-encoded sequence) • Q1: /Purchase/Seller/Item/Manufacturer → (P,ϵ)(S,P)(I,PS)(M,PSI) • Q2: /Purchase[Seller[Loc = v5]]/Buyer[Loc = v7] → (P,ϵ)(S,P)(L,PS)(v5,PSL)(B,P)(L,PB)(v7,PBL) • Q3: /Purchase/*[Loc = v5] → (P,ϵ)(L,P*)(v5,P*L) • Q4: /Purchase//Item[Manufacturer = v3] → (P,ϵ)(I,P//)(M,P//I)(v3,P//IM)

  12. Querying XML through Structure-Encoded Sequence Matching • Querying XML is equivalent to finding (non-contiguous) subsequence matches • Most structural XML queries can be performed through direct subsequence matching • Exception: a branch has multiple identical child nodes • Q5 = /A[B/C]/B/D • Two different sequences • (A,ϵ)(B,A)(C,AB)(B,A)(D,AB) • (A,ϵ)(B,A)(D,AB)(B,A)(C,AB) • Find matches for each sequence separately and take the union of the results • False matches are possible if the indexed documents contain branches with identical child nodes; in that case we issue multiple queries and compute the set difference of their results • If the query contains a large number of identical child nodes under a branch, we can choose to disassemble the query tree into multiple trees and use join operations to combine their results
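The Q5 case can be illustrated with a small sketch. It assumes exact (wildcard-free) pairs with prefixes written as plain strings; `is_subsequence`, `branch_orderings` and `matches` are hypothetical helper names, not part of ViST. The query is tried once per ordering of its identical branches, which corresponds to taking the union of the separate results.

```python
from itertools import permutations


def is_subsequence(query, doc):
    """True if the query pairs occur in doc in order, not necessarily contiguously."""
    it = iter(doc)
    # `pair in it` advances the iterator, so each match must come after the previous one.
    return all(pair in it for pair in query)


def branch_orderings(root, branches):
    """Yield one query sequence per ordering of the sibling branches."""
    for perm in permutations(branches):
        yield [root] + [pair for branch in perm for pair in branch]


def matches(root, branches, doc):
    """Union of the per-ordering results: any ordering that matches counts."""
    return any(is_subsequence(q, doc) for q in branch_orderings(root, branches))


# Q5 = /A[B/C]/B/D, split into its root and its two branches under A
root = ("A", "")
branches = [[("B", "A"), ("C", "AB")],
            [("B", "A"), ("D", "AB")]]

# A document whose B/D branch happens to come before its B/C branch
doc = [("A", ""), ("B", "A"), ("D", "AB"), ("B", "A"), ("C", "AB")]

first_ordering = next(branch_orderings(root, branches))
print(is_subsequence(first_ordering, doc))  # the C-first ordering alone misses this document
print(matches(root, branches, doc))         # trying both orderings finds it
```

The number of orderings grows factorially with the number of identical branches, which is one way to read the slide's suggestion to fall back on decomposition and joins for large cases.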

  13. Algorithms • Naïve algorithm • RIST – Relationships Indexed Suffix Tree • ViST – Virtual Suffix Tree

  14. Naïve algorithm: Suffix-Tree-Like structure • Doc1 : (P, ϵ)( S, P)(N, PS)(v1, PSN)(L, PS)(v2, PSL) • Doc2 : (P, ϵ)(B, P)(L, PB)(v2, PBL) • Q1 : (P, ϵ)(B, P)(L,PB)(v2, PBL) • Q2 : (P, ϵ)(L, P*)(v2,P*L)

  15. D-Ancestorship and S-Ancestorship • D-Ancestorship • Ancestor-descendant relationships in the original XML tree • Element (S,P) is a D-Ancestor of (L,PS) • S-Ancestorship • Ancestor-descendant relationships in the suffix tree • Element (v1,PSN) is an S-Ancestor of (L,PS)

  16. Naïve search: a naïve algorithm based on suffix trees

  17. RIST – Index Construction • Checking S-Ancestorship requires additional information • Label each suffix tree node x with a pair <n_x, size_x> • n_x is the preorder (prefix traversal) number of x in the suffix tree • size_x is the total number of descendants of x in the suffix tree • Given x labeled <n_x, size_x> and y labeled <n_y, size_y> • x is an S-Ancestor of y if n_y ∈ (n_x, n_x + size_x] • Construct the B+Trees: • Insert the suffix tree nodes into the D-Ancestorship B+Tree using (Symbol, Prefix) as keys • All nodes x inserted with the same (Symbol, Prefix) are indexed by an S-Ancestorship B+Tree, using the n_x values of their labels as keys
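The labeling rule can be tried out on a small trie built from the two documents of the naïve example. This is only a sketch under simplifying assumptions: an in-memory trie stands in for the suffix tree, no B+Trees are involved, and `TrieNode`, `insert`, `label` and `is_s_ancestor` are hypothetical names.

```python
class TrieNode:
    def __init__(self, key=None):
        self.key = key        # (symbol, prefix) pair, None for the root
        self.children = {}    # insertion-ordered: key -> TrieNode
        self.n = 0            # preorder number in the trie
        self.size = 0         # number of descendants in the trie


def insert(root, sequence):
    """Insert a structure-encoded sequence and return its last node."""
    node = root
    for pair in sequence:
        if pair not in node.children:
            node.children[pair] = TrieNode(pair)
        node = node.children[pair]
    return node


def label(node, counter=0):
    """Assign <n, size> labels in preorder; return the next unused number."""
    node.n = counter
    nxt = counter + 1
    for child in node.children.values():
        nxt = label(child, nxt)
    node.size = nxt - counter - 1
    return nxt


def is_s_ancestor(x, y):
    """x is an S-Ancestor of y iff n_y lies in (n_x, n_x + size_x]."""
    return x.n < y.n <= x.n + x.size


doc1 = [("P", ""), ("S", "P"), ("N", "PS"), ("v1", "PSN"), ("L", "PS"), ("v2", "PSL")]
doc2 = [("P", ""), ("B", "P"), ("L", "PB"), ("v2", "PBL")]

root = TrieNode()
leaf1 = insert(root, doc1)
leaf2 = insert(root, doc2)
label(root)

p = root.children[("P", "")]
s = p.children[("S", "P")]
print((s.n, s.size), is_s_ancestor(s, leaf1), is_s_ancestor(s, leaf2))
```

Because the labels are plain integers, the ancestor test becomes a range lookup, which is what makes a B+Tree keyed on n_x suitable for it.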

  18. The RIST index structure

  19. Search: non-contiguous subsequence matching using B+Trees

  20. ViST – Virtual Suffix Tree • Dynamic Virtual suffix tree labeling • Semantic and statistical clues • Dynamic scope allocation without clues

  21. Dynamic scope allocation • Assume the number of child nodes of x is λ. We allocate 1/λ of the remaining scope to x’s first child • Figure: dynamic scope allocation with λ = 2
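A rough sketch of this allocation rule follows. It assumes that every later child also receives 1/λ of whatever scope remains at the time it is inserted, uses integer division, and simply raises an error when the scope runs out; the `Scope` class and `sub_scope` function are hypothetical and are not the subScope routine from the slides.

```python
class Scope:
    def __init__(self, n, size):
        self.n = n        # label of this node
        self.size = size  # labels n+1 .. n+size are reserved for its descendants
        self.used = 0     # how many of the reserved labels were already handed out


def sub_scope(parent, lam=2):
    """Carve a child scope out of the parent's remaining scope (assumed rule)."""
    remaining = parent.size - parent.used
    share = remaining // lam          # 1/lam of what is left
    if share < 1:
        raise OverflowError("scope underflow: no room left under this node")
    # The child takes the next free label for itself and keeps the rest of
    # its share for its own descendants.
    child = Scope(parent.n + parent.used + 1, share - 1)
    parent.used += share
    return child


root = Scope(0, 16)
kids = [sub_scope(root) for _ in range(3)]
print([(k.n, k.size) for k in kids])
```

With this layout the child ranges are disjoint and nested inside the parent's range, so the S-Ancestor test n_y ∈ (n_x, n_x + size_x] keeps working without relabeling existing nodes.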

  22. Dynamic Scope of a Suffix Tree Node

  23. subScope(parent, e): create a subscope within the parent scope for e

  24. Insertion index • Doc1 = (P,ϵ)(S,P)(N,PS)(v1,PSN)(L,PS)(v2,PSL) • Doc2 = (P,ϵ)(S,P)(L,PS)(v2,PSL)

  25. Index an XML document

  26. EXPERIMENTS - Sample queries (path expression, dataset) • Q1: /inproceedings/title (DBLP) • Q2: /book/author[text=‘David’] (DBLP) • Q3: /*/author[text=‘David’] (DBLP) • Q4: //author[text=‘David’] (DBLP) • Q5: /book[key=‘books/bc/MaierW88’]/author (DBLP) • Q6: /site//item[location=‘US’]/mail/date[text=‘12/15/1999’] (XMARK) • Q7: /site//person/*/city[text=‘Pocatello’] (XMARK) • Q8: //closed_auction[*[person=‘person1’]]/date[text=‘12/15/1999’] (XMARK)

  27. Comparing indexing methods (time in seconds)

  28. Index structure • DBLP (301 MB of data) • XMARK (52MB of data)

  29. Conclusion • structure-encoded sequences • Sequence matching • Avoid expensive join operations • Top-down scope allocation method • Index structure – B+Tree

  30. PRIX: Prufer Sequences for Indexing XML

  31. PRIX: PRufer sequences for Indexing Xml • Rao & Moon (2006) proposed a new method for indexing XML documents using sequences • It uses the same idea as in ViST index: • The XML tree is transformed into a sequence and saved in the database • Each query is also transformed into a sequence • The answer of the query is acquired by performing subsequence matching

  32. PRIX: PRufer sequences for Indexing Xml

  33. PRIX: PRufer sequences for Indexing Xml

  34. Motivation: Twig Queries and Wildcards • Like ViST, PRIX tries to answer twig queries efficiently, as well as queries containing wildcards (‘*’ for any element and ‘//’ for descendant-or-self) • Figure: twig query, XPath: P/Q[T]/S • Figure: query with wildcards, XPath: P//Q/S

  35. Motivation: Problems in ViST Index • Memory requirements: • In the worst case, ViST requires O(N^2) space to index the document • Example: <A> <B> <C> <D> <E> </E> </D> </C> </B> </A> gives D = (A, ε), (B, A), (C, AB), (D, ABC), (E, ABCD) • Elements at height k appear k times
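The quadratic growth is easy to check for the chain-shaped document on this slide. In the sketch below (hypothetical helper names, prefixes as plain strings), a chain of N elements produces prefixes of length 0, 1, …, N-1, so the prefixes alone hold N(N-1)/2 symbols.

```python
def chain_sequence(labels):
    """Structure-encoded sequence of a document that is a single chain of elements."""
    return [(symbol, labels[:i]) for i, symbol in enumerate(labels)]


def total_prefix_length(seq):
    """Number of symbols stored across all prefixes."""
    return sum(len(prefix) for _, prefix in seq)


seq = chain_sequence("ABCDE")
print(seq)
print(total_prefix_length(seq))             # N = 5
print(total_prefix_length(chain_sequence("A" * 1000)))  # N = 1000
```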

  36. Motivation: Problems in ViST Index • Memory requirements: • In the worst case, ViST requires O(N^2) space to index the document • False positives • In many cases, query processing in ViST results in false alarms • Example: Doc1 = (P, ε)(Q, P)(T, PQ)(S, PQ)(R, P)(U, PR)(T, PR); Doc2 = (P, ε)(Q, P)(T, PQ)(Q, P)(S, PQ); XPath: P/Q[T]/S; Q = (P, ε)(Q, P)(T, PQ)(S, PQ) • Q is a subsequence of Doc2 even though T and S lie under different Q nodes there

  37. Motivation: Problems in ViST Index • Memory requirements: • In the worst case, ViST requires O(N^2) space to index the document • False positives • In many cases, query processing in ViST results in false alarms • False negatives • Correctly answering a twig query depends on the order in which the branches are created • Example: Doc = (P, ε)(F, P)(T, PF)(N, P)(G, PN); XPath: P[N]/F; Q = (P, ε)(N, P)(F, P) ??? • Q is not a subsequence of Doc because F precedes N in the document

  38. Motivation: Problems in ViST Index • Memory requirements: • In the worst case, ViST requires O(N^2) space to index the document • False positives • In many cases, query processing in ViST results in false alarms • False negatives • Correctly answering a twig query depends on the order in which the branches are created
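Both matching problems can be reproduced with plain subsequence matching over the example sequences from the two preceding slides. This is a simplified sketch: it ignores the B+Tree machinery, writes prefixes as plain strings, and `is_subsequence` is a hypothetical helper.

```python
def is_subsequence(query, doc):
    """True if the query pairs occur in doc in order, not necessarily contiguously."""
    it = iter(doc)
    return all(pair in it for pair in query)


# False alarm: XPath P/Q[T]/S against Doc2, where T and S hang under different Q nodes.
doc2 = [("P", ""), ("Q", "P"), ("T", "PQ"), ("Q", "P"), ("S", "PQ")]
q_twig = [("P", ""), ("Q", "P"), ("T", "PQ"), ("S", "PQ")]
false_alarm = is_subsequence(q_twig, doc2)

# False dismissal: XPath P[N]/F against a document that stores F before N.
doc = [("P", ""), ("F", "P"), ("T", "PF"), ("N", "P"), ("G", "PN")]
q_nf = [("P", ""), ("N", "P"), ("F", "P")]
false_dismissal = not is_subsequence(q_nf, doc)

print(false_alarm, false_dismissal)
```

In the first case the sequence matches although the document does not satisfy the twig; in the second the document satisfies the twig but the sequence, written in the query's branch order, does not match.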

  39. PRIX: PRufer sequences for Indexing Xml

  40. PRIX Architecture

  41. Indexing and Querying in PRIX Indexing: • The first step is to take an XML document as input and convert it into a sequence • This is achieved using Prufer Sequences • The sequence is saved in the database in a way equivalent to the one used in ViST • It is a Virtual Trie implemented as B+Trees • Figure: XML document

  42. Indexing and Querying in PRIX Querying: • Queries are also transformed to trees and then to Prufer Sequences • The query sequence is looked up in the document sequence and all matching subsequences are retrieved • After this initial filtering, three refinement phases follow • Figure: XPath query

  43. PRIX: PRufer sequences for Indexing Xml

  44. Indexing XML Documents • The first step is to transform the XML document to the equivalent XML tree • Notice that both elements and text values are represented as nodes (the same holds for attributes) • The tree is not saved in the database • Example document: <A> <B></B> <B> <C> D </C> <C> <F/> <E/> </C> </B> </A> • Figure: the corresponding tree with nodes A, B, B, C, C, D, F, E

  45. Indexing XML Documents • Then the Prufer Sequence is created from the XML tree • Prufer Sequences, proposed by Prufer (1918), establish a one-to-one correspondence between labeled trees and sequences • Example tree nodes (number, label): 8,A; 1,B; 7,B; 3,C; 6,C; 2,D; 4,F; 5,E • Resulting sequence: 8, 3, 7, 6, 6, 7, 8

  46. Indexing XML Documents • Prufer Sequences can only be created from trees with numerical labeling, with each node having a unique number • Since the XML tree contains string labels (the names of elements etc.), we add an additional numeric label to each node • We use the post-order traversal to number the nodes • The Prufer sequence can be extracted for any labeling of the tree, but post-order numbering has some properties that make the querying process easier
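Post-order numbering of the example tree can be sketched as follows. The tree is written as nested (tag, children) tuples and `postorder_number` is a hypothetical helper; it returns the tag of each numbered node together with a parent map.

```python
def postorder_number(tree):
    """Number the nodes of a (tag, children) tree in post-order.

    Returns (labels, parent): labels[i] is the tag of node i and
    parent[i] is the number of its parent (the root has no entry).
    """
    labels, parent = {}, {}
    counter = 0

    def visit(node):
        nonlocal counter
        tag, children = node
        child_numbers = [visit(child) for child in children]
        counter += 1                      # a node is numbered after all its children
        labels[counter] = tag
        for number in child_numbers:
            parent[number] = counter
        return counter

    visit(tree)
    return labels, parent


# <A> <B></B> <B> <C> D </C> <C> <F/> <E/> </C> </B> </A>
tree = ("A", [("B", []),
              ("B", [("C", [("D", [])]),
                     ("C", [("F", []), ("E", [])])])])
labels, parent = postorder_number(tree)
print(labels)
print(parent)
```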

  47. Indexing XML Documents • Initial labeling • Figure: the tree with nodes A, B, B, C, C, D, F, E becomes the post-order numbered tree with nodes 8,A; 1,B; 7,B; 3,C; 6,C; 2,D; 4,F; 5,E

  48. Indexing XML Documents Finding the Prufer Sequence • The algorithm to find the Prufer sequence is the following: • Find the leaf with the smallest value and delete it. • Add the label of its parent to the sequence • Repeat until only one node is left • In PRIX index, two sequences are held: • The actual Prufer Sequence holding the numbers of the labels called Numbered Prufer Sequence: NPS • The corresponding sequence holding the actual labels of the nodes of the XML Tree called Labeled Prufer Sequence: LPS
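The leaf-deletion steps above can be sketched with a min-heap of the current leaves. The function name `prufer` and the dictionary-based tree representation are assumptions made for this example; the input is the post-order numbered example tree, and the loop follows the slide's variant that continues until a single node is left.

```python
import heapq


def prufer(labels, parent):
    """Return (NPS, LPS) for a numbered tree given as labels[i] and parent[i]."""
    children_left = {number: 0 for number in labels}
    for p in parent.values():
        children_left[p] += 1

    leaves = [number for number, count in children_left.items() if count == 0]
    heapq.heapify(leaves)

    nps, lps = [], []
    while len(nps) < len(labels) - 1:      # stop when only one node is left
        leaf = heapq.heappop(leaves)       # leaf with the smallest number
        p = parent[leaf]
        nps.append(p)                      # numbered Prufer sequence
        lps.append(labels[p])              # labeled Prufer sequence
        children_left[p] -= 1
        if children_left[p] == 0:          # the parent has become a leaf
            heapq.heappush(leaves, p)
    return nps, lps


labels = {1: "B", 2: "D", 3: "C", 4: "F", 5: "E", 6: "C", 7: "B", 8: "A"}
parent = {1: 8, 2: 3, 3: 7, 4: 6, 5: 6, 6: 7, 7: 8}
nps, lps = prufer(labels, parent)
print("NPS:", nps)
print("LPS:", lps)
```

With post-order numbering the i-th deleted node is always node i, so the i-th entry of the NPS is simply the parent of node i.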

  49. Indexing XML Documents Finding the Prufer Sequence • The algorithm to find the Prufer sequence is the following: • Find the leaf with the smallest value and delete it. • Add the label of its parent to the sequence • Repeat until only one node is left • Figure, step 1: leaf 1,B is deleted from the tree (8,A; 1,B; 7,B; 3,C; 6,C; 2,D; 4,F; 5,E) • NPS: 8 • LPS: A

  50. Indexing XML Documents Finding the Prufer Sequence • The algorithm to find the Prufer sequence is the following: • Find the leaf with the smallest value and delete it. • Add the label of its parent to the sequence • Repeat until only one node is left • Figure, step 2: leaf 2,D is deleted next • NPS: 8, 3 • LPS: A, C
