
A(k)-index : Exploiting Local Similarity to Index Paths in Graph Data

Raghav Kaushik (UW), Pradeep Shenoy (UWash), Philip Bohannon (Bell Labs), Ehud Gudes (BGU)


Presentation Transcript


  1. A(k)-index: Exploiting Local Similarity to Index Paths in Graph Data. Raghav Kaushik (UW), Pradeep Shenoy (UWash), Philip Bohannon (Bell Labs), Ehud Gudes (BGU)

  2. Outline • Problem statement • Prior work and limitations • Background • A(k)-index • Query Evaluation • Preliminary experiments • Update • Conclusions

  3. Data Model • Rooted, node-labeled graph with unique root; root has unique label • Nodes - objects • Arcs - object-subobject relationship • In XML context • Index tag structure • No distinction between elements and attributes • No distinction between tree and idref arcs • Order ignored

  4. Problem Statement • Practical indexing schemes for large graph data (like XML data) (100K - 1M nodes) • Size ~10% of database size • Efficient construction and update • Tunable to a workload • Queries of the form R x, where R is a regular path expression • Schemaless data

  5. Flavor of Approach • Different from traditional value indices • Structural summaries for indexing paths • Both data and index are rooted graphs • Example: Dataguide

  6. Index Graph • Structural summary • Associate a set of data nodes with each index node, called its extent • Preserve data paths in index graph

  7. Example index graph • [Figure: a data graph with nodes 0 to 6 next to its index graph with nodes 0, 1, 2, {3,4}, {5,6}]

  8. Index Graph (cont’d) • Can be constructed from any partition • Node for every equivalence class C • Edge between C and C’ if there exists an edge v → v’ with v in C and v’ in C’ • Preserves data paths, no false drops • Our structures are all index graphs
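The construction on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_index_graph`, the edge-list representation, and the small seven-node data graph are hypothetical and do not come from the talk.

```python
from collections import defaultdict


def build_index_graph(edges, block_of):
    """Build an index graph from a partition of the data nodes.

    edges:    iterable of (v, w) arcs of the data graph.
    block_of: dict mapping each data node to its equivalence-class id.
    Returns (extents, index_edges), where extents maps a class id to the
    set of data nodes it summarizes.
    """
    extents = defaultdict(set)
    for node, block in block_of.items():
        extents[block].add(node)
    # One index edge C -> C' for every data edge v -> w with v in C, w in C'.
    index_edges = {(block_of[v], block_of[w]) for v, w in edges}
    return dict(extents), index_edges


# Hypothetical data graph: node 0 is the root.
edges = [(0, 1), (0, 2), (1, 3), (2, 4), (3, 5), (4, 6)]
# Hypothetical partition that groups nodes 3,4 and nodes 5,6.
block_of = {0: "r", 1: "a", 2: "b", 3: "c", 4: "c", 5: "d", 6: "d"}

extents, index_edges = build_index_graph(edges, block_of)
```

Because every data edge is mapped to an index edge, every label path in the data graph also exists in the index graph, which is the "no false drops" property stated on the slide.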

  9. Prior Schemes • Dataguide [Goldman, Widom 1997] • Deterministic automaton corresponding to data graph • Each set of data nodes that can be distinguished by a path query is summarized by a single node in the index • Can be exponential in size!

  10. Prior Schemes (cont’d) • 1-index [Milo, Suciu 1999] • NFA rather than DFA (smaller) • split graph nodes into equivalence classes based on incoming paths from the root • Computing best split is PSPACE complete • Go for refinements (approximations) • similarity • bisimilarity

  11. Limitations of Prior Work • Size • Dataguide sizes subject to exponential blow-up • 1-index size can be big too! • Update • No known update algorithm for 1-index • Designed to answer queries involving arbitrarily complex paths, but... • such paths may never show up in queries

  12. Local Similarity • [Figure: example data graph rooted at ROOT, with nodes labeled metro, cultural, neighborhoods, business, museum, hotel, nhd., nearby, attr., cult.]

  13. Main Contributions • New family of approximate index structures • Applicable to • Approximate Schema • Statistics • Query evaluation using approximate indexes • Preliminary performance study • Update algorithms

  14. Approximate Indexes • Motivation: • Smaller • More efficient query processing • Limited update cost - maintain local information • Approximate dataguide [Goldman et al.] • path merging, object matching, etc. • no formal basis (but different goal) • no study of effect on query processing

  15. Outline • Problem statement • Prior work and limitations • Background • A(k)-index • Query Evaluation • Preliminary experiments • Update • Conclusions

  16. Graph Bisimulation • A bisimulation is a symmetric relation R between nodes • If A1 R A2 then • A1 and A2 have the same labels • and ...

  17. Graph Bisimulation (cont’d) • [Figure: nodes A1 and A2 related by R, with parents B1 and B2 also related by R] • and vice-versa!

  18. Bisimilarity • Two nodes a and b are bisimilar if they are related in some bisimulation • 1-index is index graph constructed from bisimulation partition • Simulation partition: similar

  19. Bisimulation on example • [Figure: the example graph rooted at ROOT, with nodes labeled metro, cultural, neighborhoods, business, museum, hotel, nhd., nearby, attr., cult., grouped by bisimulation]

  20. k-bisimulation • Nodes A1 and A2 are 0-bisimilar iff they have the same label • A1 and A2 are k-bisimilar iff • they are (k-1)-bisimilar, and • for every edge (B1, A1) there exists an edge (B2, A2) such that B1 and B2 are (k-1)-bisimilar, and vice versa
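The inductive definition above translates directly into k rounds of partition refinement. The following is a minimal sketch, assuming the data graph is given as a label dictionary and an edge list; the function name `k_bisimulation` and the sample graph are hypothetical, and this is not claimed to be the construction algorithm used by the authors.

```python
def k_bisimulation(labels, edges, k):
    """Return a dict node -> block id for the k-bisimulation partition.

    labels: dict node -> label.
    edges:  list of (parent, child) arcs.
    """
    parents = {v: set() for v in labels}
    for u, v in edges:
        parents[v].add(u)

    # Round 0: two nodes are 0-bisimilar iff they carry the same label.
    ids = {}
    block = {v: ids.setdefault(labels[v], len(ids)) for v in labels}

    for _ in range(k):
        # Two nodes stay together iff they were together in the previous
        # round and their parents cover the same set of previous blocks.
        signature = {
            v: (block[v], frozenset(block[p] for p in parents[v]))
            for v in labels
        }
        ids = {}
        block = {v: ids.setdefault(signature[v], len(ids)) for v in labels}
    return block


# Hypothetical data graph: node 0 is the root.
labels = {0: "root", 1: "a", 2: "b", 3: "c", 4: "c", 5: "d", 6: "d"}
edges = [(0, 1), (0, 2), (1, 3), (2, 4), (3, 5), (4, 6)]

b0 = k_bisimulation(labels, edges, 0)
b1 = k_bisimulation(labels, edges, 1)
b2 = k_bisimulation(labels, edges, 2)
```

In this sample graph, nodes 3 and 4 share a block for k = 0 but are separated for k = 1 because their parents carry different labels; nodes 5 and 6 are only separated at k = 2.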

  21. Example for k-bisimulation • [Figure: three panels showing a data graph with nodes 0 to 6, its 0-bisimulation, and its 1-bisimulation]

  22. A(2) for example • [Figure: A(2)-index of the example graph rooted at ROOT, with nodes labeled metro, cultural, neighborhoods, business, museum, hotel, nhd., nearby, attr., cult.]

  23. Properties • If a and b are bisimilar • the sets of incoming paths into them are the same • If a and b are k-similar or k-bisimilar • the sets of incoming paths of length <= k are the same • If k-bisim = (k+1)-bisim then k-bisim = bisim • Size: never larger than the bisimulation-based 1-index
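The fixpoint property (if k-bisim equals (k+1)-bisim, then it equals full bisimulation) suggests a simple way to compute the bisimulation partition: refine until nothing splits. The sketch below assumes the same hypothetical graph representation as a label dictionary plus an edge list; the helper names `refine` and `bisimulation_by_fixpoint` are illustrative only.

```python
def refine(block, parents):
    """One refinement round: split blocks by the set of parent blocks."""
    signature = {
        v: (block[v], frozenset(block[p] for p in parents[v]))
        for v in block
    }
    ids = {}
    return {v: ids.setdefault(signature[v], len(ids)) for v in block}


def bisimulation_by_fixpoint(labels, edges):
    """Refine the label partition until it stops changing.

    Returns (k, block), where k is the first round at which the
    k-bisimulation partition equals the (k+1)-bisimulation partition.
    """
    parents = {v: set() for v in labels}
    for u, v in edges:
        parents[v].add(u)

    ids = {}
    block = {v: ids.setdefault(labels[v], len(ids)) for v in labels}
    k = 0
    while True:
        refined = refine(block, parents)
        # Refinement only ever splits blocks, so an unchanged block count
        # means the partition itself is unchanged.
        if len(set(refined.values())) == len(set(block.values())):
            return k, block
        block, k = refined, k + 1


# Hypothetical data graph: node 0 is the root.
labels = {0: "root", 1: "a", 2: "b", 3: "c", 4: "c", 5: "d", 6: "d"}
edges = [(0, 1), (0, 2), (1, 3), (2, 4), (3, 5), (4, 6)]

k_stable, bisim = bisimulation_by_fixpoint(labels, edges)
```

Stopping earlier, at a chosen k, yields the coarser A(k) partition, which can never have more blocks than the partition returned here.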

  24. Query Evaluation • Only queries studied are regular path queries of the form R x • Query Evaluation Approach: • Create automaton for regexp query • Run automaton on the index graph • Result is union of extents belonging to index nodes accepted by automaton
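The three steps above can be sketched as a product traversal of the automaton and the index graph. This is an illustration under assumptions: the NFA is encoded as a dictionary `delta` from (state, label) to a set of states, the key label `"_"` is a hypothetical wildcard convention, the automaton is assumed to read the root's own label first, and the small index graph is made up for the example.

```python
from collections import deque


def evaluate_on_index(index_label, index_children, extents, root,
                      delta, start, accepting):
    """Run an NFA over the index graph; return the union of the extents
    of every index node reached in an accepting state."""

    def step(state, label):
        # Hypothetical convention: (state, "_") matches any label.
        return delta.get((state, label), set()) | delta.get((state, "_"), set())

    queue = deque((root, q) for q in step(start, index_label[root]))
    seen = set(queue)
    result = set()
    while queue:
        node, q = queue.popleft()
        if q in accepting:
            result |= extents[node]
        for child in index_children.get(node, ()):
            for q_next in step(q, index_label[child]):
                if (child, q_next) not in seen:
                    seen.add((child, q_next))
                    queue.append((child, q_next))
    return result


# Hypothetical index graph in which data nodes 3,4 and 5,6 are merged.
index_label = {"r": "root", "a": "a", "b": "b", "c": "c", "d": "d"}
index_children = {"r": ["a", "b"], "a": ["c"], "b": ["c"], "c": ["d"]}
extents = {"r": {0}, "a": {1}, "b": {2}, "c": {3, 4}, "d": {5, 6}}

# NFA for the path expression root.a.c (states 0..3, state 3 accepting).
delta = {(0, "root"): {1}, (1, "a"): {2}, (2, "c"): {3}}

answer = evaluate_on_index(index_label, index_children, extents, "r",
                           delta, start=0, accepting={3})
```

The traversal touches five index nodes instead of seven data nodes. The returned extent is {3, 4} because the whole extent of an accepted index node is reported.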

  25. Example Query Evaluation • [Figure: an automaton graph and an index graph with nodes 0, 1, 2, {3,4}, {5,6}]

  26. Approximate Indexes • Caveat: False positives possible • Approach: verify each node on data graph by running reverse automaton • Prohibitive cost? • Then why use approx. indices? • In fact, frequently more efficient than data graph or precise index

  27. Improving Validation • First cut: Keep track of accepting-path-length • for accepted nodes with path length <= k, verification not required • Second step: Share traversals among verification calls • mark node-state pairs on a successful verification path as accept • similar marking for failed path
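The basic verification step that these optimizations improve on, running the reverse automaton from a candidate node up to the root of the data graph, might look like the sketch below. It is deliberately the unoptimized version: it does not skip nodes whose accepting path length is at most k, and it does not share marks between calls. The NFA encoding, the function name `verify`, and the sample graph are assumptions made for illustration, and the automaton is assumed to read the root's own label first.

```python
from collections import deque


def verify(node, label, parents, root, delta, start, accepting):
    """Return True if some root-to-node label path is accepted by the NFA.

    delta: dict (state, label) -> set of next states (exact labels only).
    """
    # Reverse transition table: (target state, label) -> source states.
    rdelta = {}
    for (q, lab), targets in delta.items():
        for t in targets:
            rdelta.setdefault((t, lab), set()).add(q)

    # A pair (v, q) means: the automaton is in state q right after
    # reading the label of v.
    queue = deque((node, q) for q in accepting)
    seen = set(queue)
    while queue:
        v, q = queue.popleft()
        sources = rdelta.get((q, label[v]), set())
        if v == root and start in sources:
            return True
        for p in parents.get(v, ()):
            for q_prev in sources:
                if (p, q_prev) not in seen:
                    seen.add((p, q_prev))
                    queue.append((p, q_prev))
    return False


# Hypothetical data graph: node 0 is the root.
label = {0: "root", 1: "a", 2: "b", 3: "c", 4: "c", 5: "d", 6: "d"}
parents = {1: [0], 2: [0], 3: [1], 4: [2], 5: [3], 6: [4]}

# NFA for root.a.c; the candidates {3, 4} are assumed to come from an
# approximate index that merges nodes 3 and 4.
delta = {(0, "root"): {1}, (1, "a"): {2}, (2, "c"): {3}}
candidates = {3, 4}
validated = {v for v in candidates
             if verify(v, label, parents, 0, delta, start=0, accepting={3})}
```

Node 4 is rejected because its only incoming path is root.b.c, so it was a false positive of the approximate index.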

  28. Improving Validation (cont’d) • Third Step: Avoid needless verification • Example: For _*.R queries, no need to verify all the way up to the root • Generalize the above!

  29. Outline • Problem statement • Prior work and limitations • Background • A(k)-index • Query Evaluation • Preliminary experiments • Update • Conclusions

  30. Preliminary Experiments • Data used: Internet Movie Database (http://www.imdb.com) • 250,000 movies & TV shows • 460,000 actors, etc. • XML version = ~1GB • We used subsets of this database ranging from 200 - 2000 movies • Whole database --> future work!

  31. Preliminary Experiments • Second source: Open Directory Project (http://www.dmoz.org) • Entire source available in RDF format • Subsets: (entire subtree under a topic, say shopping)

  32. Storage Model • Results independent of any particular storage model • In-memory rooted graph • Performance metrics are abstract • Cost = total number of nodes visited (graph + index)

  33. Bisimulation Sizes • IMDB #Nodes: 190,000 • ODP #Nodes: 143,000

  34. Query Evaluation Plans 1. Forward eval 2. Backward eval (assume a label index)

  35. Short Queries - IMDB

  36. Long Queries - IMDB

  37. Queries beginning with _*

  38. Queries containing _*

  39. Approximate Answers

  40. A(k)-index Update • Edge added from u to v • A(0)-index -> no change except possible addition of edge • A(1)-index -> index node containing v may change • determined by set of labels in v’s parents
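For the A(1) case, the slide's observation that the index node of v is determined by the set of labels of v's parents can be sketched as follows. This is a simplified illustration, not the authors' update algorithm: the key function `a1_key` and the extent bookkeeping are hypothetical, and maintenance of the index edges is omitted.

```python
def a1_key(v, label, parents):
    """A(1) class of v: its own label plus the set of its parents' labels."""
    return (label[v], frozenset(label[p] for p in parents[v]))


def add_edge_a1(u, v, label, parents, extents):
    """Insert data edge u -> v and move v to its new A(1) class if needed.

    extents: dict A(1) key -> set of data nodes; updated in place.
    Only v can change class: its children depend on v's label, which
    the insertion does not alter. Index edges are not maintained here.
    """
    old_key = a1_key(v, label, parents)
    parents[v].add(u)
    new_key = a1_key(v, label, parents)
    if new_key != old_key:
        extents[old_key].discard(v)
        if not extents[old_key]:
            del extents[old_key]
        extents.setdefault(new_key, set()).add(v)
    return old_key, new_key


# Hypothetical data graph: node 0 is the root.
label = {0: "root", 1: "a", 2: "b", 3: "c", 4: "c", 5: "d", 6: "d"}
parents = {0: set(), 1: {0}, 2: {0}, 3: {1}, 4: {2}, 5: {3}, 6: {4}}

extents = {}
for node in label:
    extents.setdefault(a1_key(node, label, parents), set()).add(node)

# Adding an edge from node 1 (label "a") to node 4 changes node 4's class.
old_key, new_key = add_edge_a1(1, 4, label, parents, extents)
```

After the insertion node 4 has parents labeled both "a" and "b", so it leaves its old class and forms a new one, while nodes 5 and 6 stay together because their parents are still labeled "c".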

  41. A(k)-index Update (cont’d) • A(k)-index • only nodes to be considered are those at distance < k from v • Maintain tree of splits • Work iteratively: • find new A(1) position of v • find new A(2) positions of v and its children • …

  42. Updating the 1-index • One way is generalization of A(k) update • R - any binary relation on the nodes that is • reflexive • transitively closed. • A refinement of R is any subset that is • reflexive • transitively closed

  43. Refinement • B - bisimulation relation • B’ - any refinement of B • B(G) - index graph built using B • B’(G) - index graph built using B’

  44. Theorem • Theorem: B(B’(G)) = B(G) • Intuition: • Similar nodes behave similarly • So, fuse them together!

  45. Lazy Update • Basic Idea: • G → G’, and meanwhile B(G) → B(G’) • Instead, “relax” the graph B(G) to B’(G’) • How? • A “stable” partitioning of G is either B(G) or its refinement. • Propagate graph update on B(G) by splitting nodes until stable.

  46. Lazy Update Performance

  47. Conclusions • Novel approximate index structures and validation techniques • Experiments demonstrate k-bisimulation index is • Efficiently constructed • Effective for query answering

  48. Future Work • Handle more query types • Branching queries • Queries with selection • Annotating A(k) with statistics for query optimization • Storage • Application of update algorithms to triggers
