
Connected Substructure Similarity Search

Haichuan Shang

The University of New South Wales & NICTA, Australia

Joint Work:

Xuemin Lin (The University of New South Wales & NICTA, Australia)

Ying Zhang (The University of New South Wales, Australia)

Jeffrey Xu Yu (Chinese University of Hong Kong, China)

Wei Wang (The University of New South Wales & NICTA, Australia)


Outline

1. Motivation

2. Similarity Measure

3. Techniques

4. Experimental Study

5. Conclusion


Application

1. Chemistry

2. Bioinformatics

3. Software Engineering

4. Social Network

Chemical Compounds



Substructure Similarity Search

Why Similarity Search?

Input Mistake

Exploration

......


Existing Work

SIGMOD’05 Grafil

ICDE’06 Closure-tree

ICDE’07 GDIndex

VLDB’09 Comparing Stars


Graph Similarity

Subgraph Similarity

  • Similarity Measures

  • Maximum Common Subgraph (MCS) (# of missing edges)

  • Edit Distance.

  • Variants.

  • No enforcement of connectivity.



A New Similarity Measure.

Maximum Connected Common Subgraph – MCCS

(counting missing edges while retaining the connectivity)



Maximum Connected Common Subgraph – MCCS: Given two graphs g1 and g2, the maximum connected common subgraph of g1 and g2 is the largest connected subgraph of g1 which is subgraph isomorphic to g2, denoted as mccs(g1, g2)


Subgraph Distance: Given a query graph q and a data graph g, the subgraph distance is defined as

dist(q, g) = |q| − |mccs(q, g)|

that is, the # of edges missing from the query. The size of a graph is defined as its number of edges.


Substructure Similarity Search: Given a graph database D = {g1, g2, ..., gn}, a query graph q, and a subgraph distance threshold σ, the substructure similarity search retrieves all the graphs gi ∈ D with dist(q, gi) ≤ σ.
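The definitions above can be tried on toy inputs. The following is a minimal brute-force sketch in Python, not the method of the talk: it enumerates connected edge subsets of the query from largest to smallest and tests each one for a label-preserving embedding in the data graph. The graph representation (a dict of vertex labels plus an edge list), the names `mccs_size`, `dist`, and `search`, and the sample graphs are all assumptions made for illustration. Edge labels are ignored, and the cost is exponential, so this only suits very small graphs.

```python
from itertools import combinations, permutations

def is_connected(edges):
    """Return True if the (non-empty) edge collection forms one connected graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    start = next(iter(adj))
    seen, stack = {start}, [start]
    while stack:
        for y in adj[stack.pop()]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return len(seen) == len(adj)

def embeds(sub_edges, q_labels, g):
    """Brute-force test: is the query subgraph subgraph-isomorphic to g?"""
    g_labels, g_edges = g
    g_edge_set = {frozenset(e) for e in g_edges}
    verts = sorted({v for e in sub_edges for v in e})
    # Try every injective assignment of query vertices to data vertices.
    for image in permutations(g_labels, len(verts)):
        m = dict(zip(verts, image))
        if (all(q_labels[v] == g_labels[m[v]] for v in verts)
                and all(frozenset((m[u], m[v])) in g_edge_set
                        for u, v in sub_edges)):
            return True
    return False

def mccs_size(q, g):
    """Number of edges in the maximum connected common subgraph of q and g."""
    q_labels, q_edges = q
    for k in range(len(q_edges), 0, -1):
        for subset in combinations(q_edges, k):
            if is_connected(subset) and embeds(subset, q_labels, g):
                return k
    return 0

def dist(q, g):
    """dist(q, g) = |q| - |mccs(q, g)|, with graph size counted in edges."""
    return len(q[1]) - mccs_size(q, g)

def search(database, q, sigma):
    """Indices of all data graphs whose distance from q is at most sigma."""
    return [i for i, g in enumerate(database) if dist(q, g) <= sigma]

# Hypothetical toy data: a query with three edges around a central C atom.
q = ({0: "C", 1: "C", 2: "O", 3: "N"}, [(0, 1), (1, 2), (1, 3)])
db = [
    ({0: "C", 1: "C", 2: "O"}, [(0, 1), (1, 2)]),                       # misses C-N
    ({0: "C", 1: "C", 2: "O", 3: "N", 4: "C"},
     [(0, 1), (1, 2), (1, 3), (3, 4)]),                                 # contains q
    ({0: "N", 1: "O"}, [(0, 1)]),                                       # nothing shared
]
print([dist(q, g) for g in db])  # [1, 0, 3]
print(search(db, q, 1))          # [0, 1]
```

With σ = 1 the first two graphs are returned; the third shares no edge with the query, so all three query edges are missing.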



Feature-based exact subgraph search: overview

Two steps: Pruning and Validation.

[Figure: Query, Feature (Index), Data]


Similarity Search (triangular inequality)

dist(Q,F) + dist(F,D) ≥ dist(Q,D) ?

[Figure: Query, Feature (Index), Data]

In the pictured example, dist(Q,F) = 1, dist(F,D) = 2, and dist(Q,D) = 2, so the inequality holds: 1 + 2 ≥ 2.


Triangular inequality: does not always hold

[Figure: Query, Feature (Index), Data]

In this example, dist(Q,F) = 0, dist(F,D) = 1, and dist(Q,D) = 3, so dist(Q,F) + dist(F,D) ≥ dist(Q,D) fails: 0 + 1 < 3.
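A hypothetical instance with the same three distances (0, 1, 3) can be built by hand. The sketch below assumes every vertex label is unique within each graph, so an edge can be written as a pair of labels, the subgraph-isomorphic mapping is forced, and mccs reduces to the largest connected component of the shared edges. The graphs Q, F, and D and the helper names are made up for illustration; they are not the graphs pictured on the slide.

```python
def edge_set(spec):
    """Parse 'a-b b-c' into a set of undirected edges over unique labels."""
    return {frozenset(pair.split("-")) for pair in spec.split()}

def mccs_size(e1, e2):
    """Edge count of the largest connected component of the shared edges.

    Only valid under the unique-label assumption stated above.
    """
    common = e1 & e2
    adj = {}
    for e in common:
        u, v = tuple(e)
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    best, seen = 0, set()
    for start in adj:
        if start in seen:
            continue
        comp, stack = {start}, [start]
        while stack:
            x = stack.pop()
            for y in adj[x] - comp:
                comp.add(y)
                stack.append(y)
        seen |= comp
        # Count the shared edges that lie inside this component.
        best = max(best, sum(1 for e in common if e <= comp))
    return best

def dist(q, g):
    return len(q) - mccs_size(q, g)

# Q is a path; F adds a detour a-x-g around it; D is F without the edge c-d.
Q = edge_set("a-b b-c c-d d-e e-f f-g")
F = Q | edge_set("a-x x-g")
D = F - edge_set("c-d")

print(dist(Q, F), dist(F, D), dist(Q, D))    # 0 1 3
print(dist(Q, F) + dist(F, D) >= dist(Q, D))  # False
```

Removing c-d leaves F connected through the detour, so F loses only one edge, while the path Q is split into pieces of two and three edges and loses three.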


Connectivity Dominance

Connectivity Dominance: The connectivity of mccs(g1, g2) dominates the connectivity of g2 if there is a subgraph-isomorphic mapping from mccs(g1, g2) to g2 such that, after removing all the edges of this embedding from g2, all the vertices in the embedding are disconnected (i.e., the removal fully disconnects g2).



Theorem. Given three graphs g1, g2, and g3, if the connectivity of mccs(g1, g2) dominates g2 or the connectivity of mccs(g3, g2) dominates g2, then dist(g1, g3) ≤ dist(g1, g2) + dist(g2, g3).


Example 1: mccs(g1, g2) does not dominate g2; mccs(g2, g3) dominates g2.

Example 2: mccs(g2, g3) does not dominate g2; mccs(g1, g2) does not dominate g2.

(g1 = Query, g2 = Feature (Index), g3 = Data)

Count the # of disconnected components: linear algorithm.
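One way to read the dominance test, for a single given embedding, is sketched below: delete the embedded edges from g2, then check that no two embedding vertices remain in the same connected component. This reading of the definition, the function name `dominates`, and the edge-list representation are assumptions; the sketch takes the embedding as input rather than searching for one. Each vertex and edge is visited at most once, which is in line with the linear bound mentioned on the slide.

```python
from collections import defaultdict

def dominates(g_edges, embedded_edges):
    """Check one embedding: after deleting its edges from g, are all of its
    vertices pairwise disconnected in what remains?"""
    embedded = {frozenset(e) for e in embedded_edges}
    touched = {v for e in embedded for v in e}

    # Adjacency of g with the embedded edges removed.
    adj = defaultdict(set)
    for u, v in g_edges:
        if frozenset((u, v)) not in embedded:
            adj[u].add(v)
            adj[v].add(u)

    seen = set()
    for start in touched:
        if start in seen:
            # Already reached from another embedding vertex: still connected.
            return False
        seen.add(start)
        stack = [start]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
    return True

# Star: removing two spokes leaves the centre and both leaves separated.
star = [("c", 1), ("c", 2), ("c", 3)]
print(dominates(star, [("c", 1), ("c", 2)]))      # True

# 4-cycle: removing the path a-b-c still leaves a and c joined through d.
cycle = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
print(dominates(cycle, [("a", "b"), ("b", "c")]))  # False
```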


In the rules below, σ is the subgraph distance threshold.

Validation Rule 1: if mccs(Q, F) dominates F or mccs(F, D) dominates F, then dist(Q,F) + dist(F,D) ≥ dist(Q,D), so

dist(Q,F) + dist(F,D) ≤ σ => dist(Q,D) ≤ σ

Pruning Rule 1: if mccs(D, F) dominates D, then dist(Q,D) + dist(D,F) ≥ dist(Q,F), so

dist(Q,F) − dist(D,F) > σ => dist(Q,D) > σ

Pruning Rule 2: if mccs(F, Q) dominates Q, then dist(F,Q) + dist(Q,D) ≥ dist(F,D), so

dist(F,D) − dist(F,Q) > σ => dist(Q,D) > σ
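The three rules can be read as a small decision procedure applied to one (query, feature, data graph) triple. The sketch below assumes the four distances and four dominance flags have already been computed elsewhere; the `FeatureInfo` record, its field names, the `classify` function, and the sample numbers are all hypothetical. Because dist is not symmetric, dist(F,D) and dist(D,F) are kept as separate fields.

```python
from dataclasses import dataclass

@dataclass
class FeatureInfo:
    d_qf: int        # dist(Q, F)
    d_fq: int        # dist(F, Q)
    d_fd: int        # dist(F, D)
    d_df: int        # dist(D, F)
    qf_dom_f: bool   # mccs(Q, F) dominates F
    fd_dom_f: bool   # mccs(F, D) dominates F
    df_dom_d: bool   # mccs(D, F) dominates D
    fq_dom_q: bool   # mccs(F, Q) dominates Q

def classify(info, sigma):
    """Return 'answer', 'pruned', or 'verify' for one data graph."""
    # Validation Rule 1: the upper bound on dist(Q, D) is within the threshold.
    if (info.qf_dom_f or info.fd_dom_f) and info.d_qf + info.d_fd <= sigma:
        return "answer"
    # Pruning Rule 1: the lower bound dist(Q,F) - dist(D,F) exceeds the threshold.
    if info.df_dom_d and info.d_qf - info.d_df > sigma:
        return "pruned"
    # Pruning Rule 2: the lower bound dist(F,D) - dist(F,Q) exceeds the threshold.
    if info.fq_dom_q and info.d_fd - info.d_fq > sigma:
        return "pruned"
    # No rule applies: the graph needs full verification.
    return "verify"

a = FeatureInfo(0, 2, 1, 0, True, False, False, False)
b = FeatureInfo(4, 0, 0, 1, False, False, True, False)
c = FeatureInfo(0, 2, 1, 0, False, False, False, False)
print(classify(a, 1), classify(b, 1), classify(c, 1))  # answer pruned verify
```

A rule only fires when its dominance condition holds, since without it the corresponding inequality is not guaranteed.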


Verification Algorithm

  • Basic idea:

    1. Enumerate sub-spanning trees of the query graph such that the # of missing edges ≤ σ (the distance threshold); try to terminate the algorithm as early as possible.

    2. Share the enumeration costs in two ways:

    a. Do not enumerate everything from scratch.

    b. Once enumerated, keep the enumerated spanning trees.

  • Convert the query to a QI-Sequence [VLDB08] to favour earlier termination.

    Prefix = induced subgraph

    1.1 Infrequent label (in all data graphs) first

    1.2 Higher-degree vertex (in the query graph) first

    1.3 Dense induced subgraph (in the query graph) first
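The three ordering preferences listed above can be illustrated with a simple greedy vertex ordering. This is only a sketch of the heuristics, not the QI-Sequence construction of [VLDB08]: the function name `visit_order`, the tie-breaking order, and the label frequencies are assumptions. Each prefix of the returned order induces a connected subgraph of the query, provided the query itself is connected.

```python
def visit_order(labels, edges, label_freq):
    """Greedy matching order for the query vertices.

    Preference: rarer label first, then higher degree, then more edges into
    the already-chosen prefix (a denser induced subgraph).
    """
    adj = {v: set() for v in labels}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    def key(v, prefix):
        return (label_freq[labels[v]],    # 1.1 infrequent label first
                -len(adj[v]),             # 1.2 higher degree first
                -len(adj[v] & prefix),    # 1.3 denser induced prefix first
                v)                        # deterministic tie-break

    order = [min(labels, key=lambda v: key(v, set()))]
    prefix = set(order)
    while len(order) < len(labels):
        frontier = {w for v in prefix for w in adj[v]} - prefix
        if not frontier:   # disconnected query: stop (assumed not to happen)
            break
        nxt = min(frontier, key=lambda v: key(v, prefix))
        order.append(nxt)
        prefix.add(nxt)
    return order

# Hypothetical query and label counts over the data graphs.
labels = {0: "C", 1: "C", 2: "N", 3: "O"}
edges = [(0, 1), (1, 2), (1, 3), (2, 3)]
label_freq = {"C": 1000, "N": 50, "O": 200}
print(visit_order(labels, edges, label_freq))  # [2, 3, 1, 0]
```

Starting from the rare N vertex means a mismatch is likely to be found after matching only a few vertices, which is the point of favouring early termination.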


Verification Algorithm

  • MCCS Detection Algorithm

  • Compute QI-Sequence

  • DFS: Threshold-based DFS search (A-B-C matched)

  • Generate a new QI-Sequence from the existing one (figure steps: remove edge B-D, remove edge B-E, remove edge B-F)

  • DFS: Threshold-based DFS search (right subtree; the second A-B matched)

  • Generate a new QI-Sequence from the existing one (figure step: remove edge B-C)

  • Terminate. (dist(q,g) ≤ 3)


Feature Selection

  • Pruning Rule 1: mccs(D, F) dominates D

  • Pruning Rule 2: mccs(F, Q) dominates Q

    =>F should be dense.

    =>Discriminative Frequent Induced Subgraph

  • Validation Rule 1: mccs(F, D) dominates F or mccs(Q, F) dominates F

    =>F nearly contains Q and F should be sparse.

    =>Frequent Large Sparse Subgraphs

    Algorithm: gSpan [ICDM02] with our on-the-fly feature selection.


Experiments

Settings

AIDS Antiviral dataset, a popular benchmark, 43k chemical compounds




Conclusion

  • Connected Substructure Similarity Search

  • Measure: Maximum Connected Common Subgraph – MCCS

  • Connectivity Dominance => Triangular inequality

  • MCCS Detection Algorithm

  • (Index, Filtering & Validation, Verification Techniques)

  • Future Work:

  • Large Graphs? New Measures?

Thanks

