
Connected Substructure Similarity Search

Connected Substructure Similarity Search. Haichuan Shang The University of New South Wales & NICTA, Australia Joint Work: Xuemin Lin (The University of New South Wales & NICTA, Australia) Ying Zhang (The University of New South Wales, Australia)


Presentation Transcript


  1. Connected Substructure Similarity Search Haichuan Shang The University of New South Wales & NICTA, Australia Joint Work: Xuemin Lin (The University of New South Wales & NICTA, Australia) Ying Zhang (The University of New South Wales, Australia) Jeffrey Xu Yu (Chinese University of Hong Kong, China) Wei Wang (The University of New South Wales & NICTA, Australia)

  2. Outline 1. Motivation 2. Similarity Measure 3. Techniques 4. Experimental Study 5. Conclusion

  3. Application 1. Chemistry 2. Bioinformatics 3. Software Engineering 4. Social Networks [Figure: Chemical Compounds]

  4. Substructure Search

  5. Substructure Similarity Search Why Similarity Search? • Input mistakes • Exploration • ...

  6. Substructure Similarity Search Why Similarity Search? • Input mistakes • Exploration • ... Existing Work: • Grafil (SIGMOD’05) • Closure-tree (ICDE’06) • GDIndex (ICDE’07) • Comparing Stars (VLDB’09)

  7. Graph Similarity / Subgraph Similarity • Similarity Measures: Maximum Common Subgraph (MCS) (# of missing edges); Edit Distance; Variants. • No enforcement of connectivity.

  8. Graph Similarity A New Similarity Measure. Maximum Connected Common Subgraph – MCCS (counting missing edges while retaining the connectivity)

  9. Graph Similarity Maximum Connected Common Subgraph – MCCS: Given two graphs g1 and g2, the maximum connected common subgraph of g1 and g2 is the largest connected subgraph of g1 which is subgraph isomorphic to g2, denoted as mccs(g1, g2)

  10. Graph Similarity Maximum Connected Common Subgraph – MCCS: Given two graphs g1 and g2, the maximum connected common subgraph of g1 and g2 is the largest connected subgraph of g1 which is subgraph isomorphic to g2, denoted as mccs(g1, g2) Subgraph Distance: Given a query graph q and a data graph g, the Subgraph Distance is defined as, dist(q, g) = |q| − |mccs(q, g)| The graph size is defined as the number of edges. (# of missing edges from the query)

  11. Graph Similarity Maximum Connected Common Subgraph – MCCS: Given two graphs g1 and g2, the maximum connected common subgraph of g1 and g2 is the largest connected subgraph of g1 which is subgraph isomorphic to g2, denoted as mccs(g1, g2) Subgraph Distance: Given a query graph q and a data graph g, the Subgraph Distance is defined as dist(q, g) = |q| − |mccs(q, g)|. The graph size is defined as the number of edges. (# of missing edges from the query) Substructure Similarity Search: Given a graph database D = {g1, g2, ..., gn}, a query graph q, and a subgraph distance threshold σ, the substructure similarity search is to retrieve all the graphs gi ∈ D with dist(q, gi) ≤ σ.
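The definitions on this slide can be illustrated with a brute-force sketch. It is not the authors' algorithm: it assumes graphs are given as edge lists plus a vertex-label dictionary, treats "subgraph isomorphic" as non-induced (edge-preserving) matching, and runs in exponential time, so it only suits tiny graphs. The names `mccs_size`, `dist`, `search` and the parameter `sigma` are hypothetical.

```python
from itertools import combinations, permutations

def is_connected(edges):
    """True if the edges form a single connected component."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    if not adj:
        return True
    start = next(iter(adj))
    seen, stack = {start}, [start]
    while stack:
        x = stack.pop()
        for y in adj[x] - seen:
            seen.add(y)
            stack.append(y)
    return len(seen) == len(adj)

def embeds(sub_edges, q_labels, g_edges, g_labels):
    """Brute force: can sub_edges be mapped injectively into g, keeping labels and edges?"""
    verts = sorted({v for e in sub_edges for v in e})
    g_edge_set = {frozenset(e) for e in g_edges}
    for image in permutations(list(g_labels), len(verts)):
        m = dict(zip(verts, image))
        if all(q_labels[v] == g_labels[m[v]] for v in verts) and \
           all(frozenset((m[u], m[v])) in g_edge_set for u, v in sub_edges):
            return True
    return False

def mccs_size(q_edges, q_labels, g_edges, g_labels):
    """Edge count of the largest connected subgraph of q that embeds in g."""
    for k in range(len(q_edges), 0, -1):
        for sub in combinations(q_edges, k):
            if is_connected(sub) and embeds(sub, q_labels, g_edges, g_labels):
                return k
    return 0

def dist(q_edges, q_labels, g_edges, g_labels):
    """dist(q, g) = |q| - |mccs(q, g)|, with size measured in edges."""
    return len(q_edges) - mccs_size(q_edges, q_labels, g_edges, g_labels)

def search(database, q_edges, q_labels, sigma):
    """Indices of the data graphs g with dist(q, g) <= sigma."""
    return [i for i, (g_edges, g_labels) in enumerate(database)
            if dist(q_edges, q_labels, g_edges, g_labels) <= sigma]

# Query: a triangle with one pendant edge, all vertices labeled "C".
q_edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
q_labels = {v: "C" for v in range(4)}
path = ([(0, 1), (1, 2), (2, 3)], {v: "C" for v in range(4)})
star = ([(0, 1), (0, 2), (0, 3)], {v: "C" for v in range(4)})
edge = ([(0, 1)], {0: "C", 1: "C"})
print(search([path, star, edge], q_edges, q_labels, sigma=1))
```

In this toy database the path and the star each miss one query edge, while the single-edge graph misses three, so only the first two are within a threshold of 1.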

  12. Framework

  13. Feature-based exact subgraph search: overview. Pruning. [Figure: Query, Feature (Index), Data]

  14. Feature-based exact subgraph search: overview. Pruning; Validation. [Figure: Query, Feature (Index), Data]
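The slide text names the two steps without spelling them out. The sketch below shows the standard filter-and-verify idea they are assumed to refer to: a data graph can be pruned if it lacks a feature that the query contains, and it can be validated without further testing if it contains a feature that itself contains the query. The inverted index layout, the feature identifiers, and the function name `exact_search_filter` are hypothetical, and the containment tests between query and features are assumed to be done beforehand.

```python
def exact_search_filter(features_in_query, features_containing_query, index, all_ids):
    """Split data graph ids into validated answers and candidates to verify.

    index maps a feature id to the set of data graph ids that contain it.
    """
    # Pruning: an answer must contain every feature that the query contains.
    candidates = set(all_ids)
    for f in features_in_query:
        candidates &= index.get(f, set())
    # Validation: a graph containing a feature that contains the query is an answer.
    validated = set()
    for f in features_containing_query:
        validated |= index.get(f, set())
    return validated, candidates - validated

index = {"f1": {1, 2, 3}, "f2": {2, 3}, "F_big": {3}}
validated, to_verify = exact_search_filter(["f1", "f2"], ["F_big"], index, {1, 2, 3, 4})
print(validated, to_verify)
```

Only the graphs left in `to_verify` need a full subgraph isomorphism test.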

  15. Similarity Search (triangular inequality): does dist(Q,F) + dist(F,D) ≥ dist(Q,D) hold? [Figure: Query, Feature (Index), Data, with distances dist(Q,F), dist(F,D), dist(Q,D)]

  16. Similarity Search (triangular inequality): does dist(Q,F) + dist(F,D) ≥ dist(Q,D) hold? dist(Q,F) = 1. [Figure: Query, Feature (Index), Data]

  17. Similarity Search (triangular inequality): does dist(Q,F) + dist(F,D) ≥ dist(Q,D) hold? dist(Q,F) = 1, dist(F,D) = 2. [Figure: Query, Feature (Index), Data]

  18. Similarity Search (triangular inequality): dist(Q,F) + dist(F,D) ≥ dist(Q,D) holds! dist(Q,F) = 1, dist(F,D) = 2, dist(Q,D) = 2, and 1 + 2 ≥ 2. [Figure: Query, Feature (Index), Data]

  19. Triangular inequality does not always hold: dist(Q,F) + dist(F,D) ≥ dist(Q,D) fails here, since the left side is 0 + 1 and dist(Q,D) = 3. [Figure: Query, Feature (Index), Data]

  20. Triangular inequality does not always hold: dist(Q,F) + dist(F,D) ≥ dist(Q,D) fails here, since the left side is 0 + 1 and dist(Q,D) = 3. [Figure: Query, Feature (Index), Data]
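A concrete instance with the same distances (0, 1, 3) can be built by hand. The graphs below are a constructed example, not the ones in the slide's figure. To keep the code short it assumes every vertex label is unique, so a common subgraph is just a set of shared edges and the MCCS is the largest connected component of that shared edge set; the helper names are hypothetical.

```python
def largest_component_edges(edges):
    """Edge count of the largest connected component of an edge list."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)
    counts = {}
    for u, v in edges:
        root = find(u)
        counts[root] = counts.get(root, 0) + 1
    return max(counts.values(), default=0)

def dist(q, g):
    """dist(q, g) = |q| - |mccs(q, g)| under the unique-label assumption."""
    q_set = {frozenset(e) for e in q}
    g_set = {frozenset(e) for e in g}
    common = [tuple(e) for e in q_set & g_set]
    return len(q_set) - largest_component_edges(common)

# Query: the path A-B-C-D-E-F-G (6 edges).
Q = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "F"), ("F", "G")]
# Feature: the query plus the edge G-A, forming a 7-cycle.
F = Q + [("G", "A")]
# Data graph: the feature without the edge C-D (still connected through G-A).
D = [e for e in F if e != ("C", "D")]

print(dist(Q, F), dist(F, D), dist(Q, D))
```

The feature loses only one edge in the data graph, yet the query is cut in two because the edge G-A that keeps the feature connected is not part of the query.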

  21. Connectivity Dominance: The connectivity of mccs(g1, g2) dominates the connectivity of g2 if there is a subgraph isomorphic mapping from mccs(g1, g2) to g2 such that, after removing all the edges of this mapping from g2, all the vertices in the embedding are disconnected from one another (i.e., the removal fully disconnects g2).

  22. Connectivity Dominance Theorem. Given three graphs g1, g2, and g3, if the connectivity of mccs(g1, g2) dominates g2 or the connectivity of mccs(g3, g2) dominates g2, then dist(g1, g3) ≤ dist(g1, g2) + dist(g2, g3).

  23. Connectivity Dominance Theorem. Given three graphs g1, g2, and g3, if the connectivity of mccs(g1, g2) dominates g2 or the connectivity of mccs(g3, g2) dominates g2, then dist(g1, g3) ≤ dist(g1, g2) + dist(g2, g3). Example 1 Example 2 g1=Query g2=Feature(Index) g3=Data

  24. Connectivity Dominance Theorem. Given three graphs g1, g2, and g3, if the connectivity of mccs(g1, g2) dominates g2 or the connectivity of mccs(g3, g2) dominates g2, then dist(g1, g3) ≤ dist(g1, g2) + dist(g2, g3). Example 1: mccs(g1,g2) does not dominate g2; mccs(g2,g3) dominates g2. Example 2. g1=Query g2=Feature(Index) g3=Data

  25. Connectivity Dominance Theorem. Given three graphs g1, g2, and g3, if the connectivity of mccs(g1, g2) dominates g2 or the connectivity of mccs(g3, g2) dominates g2, then dist(g1, g3) ≤ dist(g1, g2) + dist(g2, g3). Example 1: mccs(g1,g2) does not dominate g2; mccs(g2,g3) dominates g2. Example 2: mccs(g2,g3) does not dominate g2; mccs(g1,g2) does not dominate g2. g1=Query g2=Feature(Index) g3=Data

  26. Connectivity Dominance Theorem. Given three graphs g1, g2, and g3, if the connectivity of mccs(g1, g2) dominates g2 or the connectivity of mccs(g3, g2) dominates g2, then dist(g1, g3) ≤ dist(g1, g2) + dist(g2, g3). Example 1: mccs(g1,g2) does not dominate g2; mccs(g2,g3) dominates g2. Example 2: mccs(g2,g3) does not dominate g2; mccs(g1,g2) does not dominate g2. g1=Query g2=Feature(Index) g3=Data. Checking dominance: count the # of disconnected components (linear algorithm).
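One way to read the dominance check is sketched below: remove the embedded MCCS edges from g2, then test whether any two embedded vertices still fall in the same connected component of what remains. This is an interpretation of the slide's definition, not the authors' code. It takes the embedding as an explicit edge list inside g2, and the function name `dominates` is hypothetical. A union-find pass keeps the cost close to linear in the number of edges.

```python
def dominates(g2_edges, embedded_edges):
    """True if removing embedded_edges from g2 leaves the embedded vertices
    pairwise disconnected (each in its own component of the remainder)."""
    embedded = {frozenset(e) for e in embedded_edges}
    remaining = {frozenset(e) for e in g2_edges} - embedded
    embedded_vertices = {v for e in embedded for v in e}

    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in (tuple(e) for e in remaining):
        parent[find(u)] = find(v)

    roots = [find(v) for v in embedded_vertices]
    return len(set(roots)) == len(roots)

# g2 is a 7-cycle; the embedded MCCS is the path A-...-G (all edges except G-A).
cycle = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"),
         ("E", "F"), ("F", "G"), ("G", "A")]
print(dominates(cycle, cycle[:-1]))  # the leftover edge G-A links two embedded vertices

# g2 is the path A-B-C-X; the embedded MCCS is A-B-C.
print(dominates([("A", "B"), ("B", "C"), ("C", "X")], [("A", "B"), ("B", "C")]))
```

In the first case the leftover edge connects two embedded vertices, so dominance fails; in the second the leftover edge only reaches a vertex outside the embedding.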

  27. dist(Q,F) + dist(F,D) ≥ dist(Q,D) gives Validation Rule 1: dist(Q,F) + dist(F,D) ≤ σ => dist(Q,D) ≤ σ, if mccs(Q, F) dominates F or mccs(F, D) dominates F. dist(Q,D) + dist(D,F) ≥ dist(Q,F) gives Pruning Rule 1: dist(Q,F) − dist(D,F) > σ => dist(Q,D) > σ, if mccs(D, F) dominates D. dist(F,Q) + dist(Q,D) ≥ dist(F,D) gives Pruning Rule 2: dist(F,D) − dist(F,Q) > σ => dist(Q,D) > σ, if mccs(F, Q) dominates Q.
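The three rules can be combined into a single decision per (query, feature, data graph) triple, roughly as sketched below. The function name, the argument names, and the return strings are hypothetical; `sigma` stands for the distance threshold, the four distances are assumed to be already known, and the dominance conditions are passed in as precomputed booleans.

```python
def filter_decision(sigma, d_qf, d_fd, d_df, d_fq,
                    qf_dom_f=False, fd_dom_f=False,
                    df_dom_d=False, fq_dom_q=False):
    """Apply Validation Rule 1 and Pruning Rules 1 and 2 for one feature F.

    d_qf = dist(Q,F), d_fd = dist(F,D), d_df = dist(D,F), d_fq = dist(F,Q).
    Returns "answer", "pruned", or "verify" (no rule applies).
    """
    # Validation Rule 1: needs mccs(Q,F) or mccs(F,D) to dominate F.
    if (qf_dom_f or fd_dom_f) and d_qf + d_fd <= sigma:
        return "answer"
    # Pruning Rule 1: needs mccs(D,F) to dominate D.
    if df_dom_d and d_qf - d_df > sigma:
        return "pruned"
    # Pruning Rule 2: needs mccs(F,Q) to dominate Q.
    if fq_dom_q and d_fd - d_fq > sigma:
        return "pruned"
    return "verify"

print(filter_decision(2, d_qf=1, d_fd=1, d_df=0, d_fq=0, fd_dom_f=True))
print(filter_decision(1, d_qf=4, d_fd=0, d_df=2, d_fq=0, df_dom_d=True))
print(filter_decision(1, d_qf=4, d_fd=0, d_df=2, d_fq=0))
```

When no dominance condition holds, none of the rules may be applied, and the data graph falls through to full verification.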

  28. Verification Algorithm • Basic idea: 1. Enumerate sub-spanning trees of the query graph such that the # of missing edges ≤ σ; try to terminate the algorithm as early as possible. 2. Share the enumeration costs in two ways: a. do not enumerate everything from scratch; b. once enumerated, keep the enumerated spanning trees. • Convert the query to a QI-Sequence [VLDB08] to favour earlier termination. Prefix = induced subgraph. 1.1 Infrequent labels (in all data graphs) first. 1.2 Higher-degree vertices (in the query graph) first. 1.3 Dense induced subgraphs (in the query graph) first.
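The ordering heuristics can be sketched as a greedy vertex ordering. This is not the QI-Sequence construction from [VLDB08]; it only illustrates heuristics 1.1 and 1.2 (rarest label first, then highest degree) while keeping every prefix connected, and it leaves out heuristic 1.3. The function name and the `label_freq` table (label counts over all data graphs) are hypothetical, and the query is assumed to be connected.

```python
def order_query_vertices(q_edges, labels, label_freq):
    """Greedy matching order: rare labels first, then high degree,
    always extending the already ordered (connected) prefix."""
    adj = {}
    for u, v in q_edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    def priority(v):
        # Lower tuple sorts first: rare label, then high degree, then vertex id.
        return (label_freq.get(labels[v], 0), -len(adj[v]), v)

    order = [min(adj, key=priority)]
    placed = set(order)
    while len(order) < len(adj):
        frontier = {w for v in placed for w in adj[v]} - placed
        if not frontier:  # only reached if the query is disconnected
            break
        nxt = min(frontier, key=priority)
        order.append(nxt)
        placed.add(nxt)
    return order

q_edges = [(0, 1), (1, 2), (1, 3)]
labels = {0: "C", 1: "N", 2: "C", 3: "O"}
label_freq = {"C": 100, "N": 10, "O": 20}
print(order_query_vertices(q_edges, labels, label_freq))
```

Placing rare, high-degree vertices early means a mismatch tends to surface after only a few matching steps, which is the early-termination effect the slide aims for.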

  29. Verification Algorithm • MCCS Detection Algorithm • Compute QI-Sequence

  30. Verification Algorithm • MCCS Detection Algorithm • Compute QI-Sequence • DFS: Threshold based DFS Search(A-B-C Matched)

  31. Verification Algorithm Remove Edge B-D • MCCS Detection Algorithm • Compute QI-Sequence • DFS: Threshold based DFS Search(A-B-C Matched) • Generate new QI-Sequence from the existing one.

  32. Verification Algorithm Remove Edge B-E • MCCS Detection Algorithm • Compute QI-Sequence • DFS: Threshold based DFS Search(A-B-C Matched) • Generate new QI-Sequence from the existing one.

  33. Verification Algorithm Remove Edge B-F • MCCS Detection Algorithm • Compute QI-Sequence • DFS: Threshold based DFS Search(A-B-C Matched) • Generate new QI-Sequence from the existing one.

  34. Verification Algorithm Right Subtree • MCCS Detection Algorithm • Compute QI-Sequence • DFS: Threshold based DFS Search(A-B-C Matched) • Generate new QI-Sequence from the existing one. • DFS: Threshold based DFS Search (The second A-B Matched)

  35. Verification Algorithm Remove Edge B-C • MCCS Detection Algorithm • Compute QI-Sequence • DFS: Threshold based DFS Search(A-B-C Matched) • Generate new QI-Sequence from the existing one. • DFS: Threshold based DFS Search (The second A-B Matched) • Generate new QI-Sequence from the existing one.

  36. Verification Algorithm • MCCS Detection Algorithm • Compute QI-Sequence • DFS: Threshold based DFS Search(A-B-C Matched) • Generate new QI-Sequence from the existing one. • DFS: Threshold based DFS Search (The second A-B Matched) • Generate new QI-Sequence from the existing one. • Terminate. (dist(q,g) ≤ 3)

  37. Feature Selection • Pruning Rule 1: mccs(D, F) dominates D • Pruning Rule 2: mccs(F, Q) dominates Q => F should be dense. => Discriminative Frequent Induced Subgraphs • Validation Rule 1: mccs(F, D) dominates F or mccs(Q, F) dominates F => F nearly contains Q and F should be sparse. => Frequent Large Sparse Subgraphs • Algorithm: gSpan [ICDM02] with our on-the-fly feature selection.

  38. Experiments Settings: AIDS Antiviral dataset, a popular benchmark, 43k chemical compounds

  39. Experiments

  40. Experiments

  41. Conclusion • Connected Substructure Similarity Search • Measure: Maximum Connected Common Subgraph – MCCS • Connectivity Dominance => Triangular inequality • MCCS Detection Algorithm • (Index, Filtering & Validation, Verification Techniques) • Future Work: • Large Graphs? New Measures? Thanks
