# Mining Frequent Subgraphs


## Mining Frequent Subgraphs

COMP 790-90 Seminar

Spring 2011

### FFSM: Fast Frequent Subgraph Mining -- An Overview:

• How to solve the graph isomorphism problem? A novel graph canonical form: the Canonical Adjacency Matrix (CAM).

• How to tackle the subgraph isomorphism problem (NP-complete)? Incrementally maintained embeddings.

• How to enumerate subgraphs? An efficient data structure, the CAM tree, with two operations: CAM-join and CAM-extension.

[Figure: labeled graph (P) with vertices p1 to p5, shown with three of its adjacency matrices M1, M2, and M3.]

• Every diagonal entry of the adjacency matrix M corresponds to a distinct vertex in G and is filled with the label of this vertex.

• Every off-diagonal entry in the lower triangular part of M¹ corresponds to a pair of vertices in G and is filled with the label of the edge between the two vertices, or with zero if there is no edge.

¹ For an undirected graph, the upper triangle is always a mirror of the lower triangle.
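As a rough illustration of the two rules above, the following Python sketch builds the lower-triangular part of such a matrix. The function name `adjacency_matrix` and the input format (a list of vertex labels plus a dictionary of labeled edges) are assumptions made for this example, not something the slides prescribe. The sample graph is chosen so that the result has the entries the slides list for M1.

```python
def adjacency_matrix(vertex_labels, edges):
    """Return the lower-triangular adjacency matrix as a list of rows.

    vertex_labels: one label per vertex, in matrix order.
    edges: dict mapping frozenset({i, j}) -> edge label (assumed format).
    Row i holds M[i][0..i]: edge labels (or "0" for no edge) followed
    by the vertex label on the diagonal.
    """
    rows = []
    for i, label in enumerate(vertex_labels):
        row = [edges.get(frozenset((i, j)), "0") for j in range(i)]
        row.append(label)  # diagonal entry = vertex label
        rows.append(row)
    return rows


# Five vertices labeled a, b, b, c, d and five labeled edges.
labels = ["a", "b", "b", "c", "d"]
edges = {
    frozenset((1, 0)): "y",
    frozenset((2, 0)): "y",
    frozenset((2, 1)): "x",
    frozenset((3, 1)): "y",
    frozenset((4, 2)): "y",
}
m1 = adjacency_matrix(labels, edges)
for row in m1:
    print(" ".join(row))
```

Only the lower triangle is stored because, for an undirected graph, the upper triangle carries no extra information.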

[Figure: adjacency matrices M3 and M1.]

### Code

• The code of an n × n adjacency matrix M is defined as the sequence of lower triangular entries (including the diagonal entries) in the order:

$M_{1,1}\; M_{2,1}\; M_{2,2}\; \dots\; M_{n,1}\; M_{n,2}\; \dots\; M_{n,n-1}\; M_{n,n}$

Code(M1): aybyxb0y0c00y0d < Code(M2): aybyxb00yd0y00c < Code(M3): bxby0d0y0cyy00a
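The row-by-row reading order can be sketched in a few lines of Python. The helper name `code` and the list-of-rows representation are assumptions for illustration. The three matrices are written out from the code strings given above.

```python
def code(matrix):
    """Concatenate the lower-triangular entries row by row, diagonal included."""
    return "".join(entry for row in matrix for entry in row)


# Lower-triangular rows of the three example matrices.
M1 = [["a"], ["y", "b"], ["y", "x", "b"],
      ["0", "y", "0", "c"], ["0", "0", "y", "0", "d"]]
M2 = [["a"], ["y", "b"], ["y", "x", "b"],
      ["0", "0", "y", "d"], ["0", "y", "0", "0", "c"]]
M3 = [["b"], ["x", "b"], ["y", "0", "d"],
      ["0", "y", "0", "c"], ["y", "y", "0", "0", "a"]]

for name, m in (("M1", M1), ("M2", M2), ("M3", M3)):
    print(name, code(m))
```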

[Figure: adjacency matrix M2.]

• The Canonical Adjacency Matrix (CAM) is the adjacency matrix that produces the maximal code under lexicographic order.
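To make the definition concrete, here is a minimal brute-force sketch that tries every vertex ordering and keeps the matrix with the maximal code. This exhaustive search is only an illustration of the definition and is not how FFSM avoids the cost of isomorphism testing. The function names and the edge dictionary format are assumptions. So is the symbol order: the sketch uses Python's default string comparison, because the slides do not state how symbols are ranked.

```python
from itertools import permutations


def lower_triangle(labels, edges, order):
    """Lower-triangular matrix of the graph for the vertex ordering `order`."""
    rows = []
    for i, u in enumerate(order):
        row = [edges.get(frozenset((u, v)), "0") for v in order[:i]]
        row.append(labels[u])  # diagonal entry = vertex label
        rows.append(row)
    return rows


def code(matrix):
    """Code as a tuple of entries, compared lexicographically."""
    return tuple(entry for row in matrix for entry in row)


def canonical_adjacency_matrix(labels, edges):
    """Brute force: keep the matrix whose code is maximal (assumed symbol order)."""
    orders = permutations(range(len(labels)))
    return max((lower_triangle(labels, edges, p) for p in orders), key=code)


# The same path graph a -y- b -x- b under two different vertex numberings.
g1 = (["a", "b", "b"], {frozenset((0, 1)): "y", frozenset((1, 2)): "x"})
g2 = (["b", "a", "b"], {frozenset((1, 0)): "y", frozenset((0, 2)): "x"})
# A different graph with the same labels: here vertex a is the middle vertex.
g3 = (["a", "b", "b"], {frozenset((0, 1)): "y", frozenset((0, 2)): "x"})

cam1 = canonical_adjacency_matrix(*g1)
cam2 = canonical_adjacency_matrix(*g2)
cam3 = canonical_adjacency_matrix(*g3)
print(cam1 == cam2, cam1 == cam3)
```

Two graphs are isomorphic exactly when their canonical matrices coincide, which is why a canonical form answers the graph isomorphism question.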

[Figure: example matrices M1 to M6.]

### MP Submatrix

• For an m × m matrix A, an n × n matrix B is A’s maximal proper submatrix (MP submatrix) iff B is obtained by removing the last non-zero entry from A.

• We say a CAM is connected iff the corresponding graph is connected.

• Theorem I: A CAM’s MP submatrix is a CAM.

• Theorem II: A connected CAM’s MP submatrix is connected.
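A small Python sketch of the MP submatrix operation follows. It rests on one reading of the definition that the slides do not spell out: "the last non-zero entry" is taken to mean the last edge entry in the last row, and when that is the only edge in the last row, the whole row (the vertex) is dropped along with it. The function name and matrix representation are likewise assumptions.

```python
def mp_submatrix(matrix):
    """Return the MP submatrix of a lower-triangular matrix (list of rows).

    Assumed reading: remove the last edge entry of the last row; if it was
    the row's only edge, remove the last row (and its vertex) entirely.
    """
    rows = [list(row) for row in matrix]  # work on a copy
    last = rows[-1]
    edge_cols = [j for j, entry in enumerate(last[:-1]) if entry != "0"]
    if not edge_cols:
        raise ValueError("last row has no edge entry to remove")
    if len(edge_cols) == 1:
        return rows[:-1]          # n = m - 1: the vertex disappears too
    last[edge_cols[-1]] = "0"     # n = m: only the edge disappears
    return rows


M1 = [["a"], ["y", "b"], ["y", "x", "b"],
      ["0", "y", "0", "c"], ["0", "0", "y", "0", "d"]]
M3 = [["b"], ["x", "b"], ["y", "0", "d"],
      ["0", "y", "0", "c"], ["y", "y", "0", "0", "a"]]

print(mp_submatrix(M1))  # last row has a single edge, so it is dropped
print(mp_submatrix(M3))  # last row has two edges, so one is zeroed
```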

[Figures: a series of example matrices; graph (P) with vertices p1 to p5; graphs (S), (P), and (Q) with vertices s1 to s3, p1 to p5, and q1 to q3, annotated "= 2/3".]

### How to Enumerate Nodes in a CAM Tree?

• Two operations to explore CAM tree:

• CAM-Join

• CAM-Extension

• Augmenting CAM tree with Suboptimal CAMs

• Objectives:

• no false dismissal

• no redundancy

• Plus: We want to do this efficiently!

[Figure: tree of matrices for graph (P) with vertices p1 to p5; links between matrices are labeled j and e.]

### Suboptimal Tree

We define a suboptimal CAM as a matrix whose MP submatrix is a CAM.
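The definition can be checked mechanically on small matrices. The sketch below is a brute-force illustration under several assumptions: matrices are lists of lower-triangular rows, symbols are compared with Python's default string order (the slides do not state the order), and the MP submatrix is read as "remove the last edge entry of the last row, dropping the row if that was its only edge". All function names are hypothetical.

```python
from itertools import permutations


def code(matrix):
    return tuple(entry for row in matrix for entry in row)


def permute(matrix, order):
    """Re-index a lower-triangular matrix by the vertex ordering `order`."""
    def entry(u, v):  # symmetric lookup into the stored lower triangle
        return matrix[max(u, v)][min(u, v)]
    return [[entry(u, v) for v in order[: i + 1]] for i, u in enumerate(order)]


def is_cam(matrix):
    """Brute force: the matrix is a CAM if no reordering yields a larger code."""
    orders = permutations(range(len(matrix)))
    return code(matrix) == max(code(permute(matrix, p)) for p in orders)


def mp_submatrix(matrix):
    rows = [list(row) for row in matrix]
    edge_cols = [j for j, e in enumerate(rows[-1][:-1]) if e != "0"]
    if not edge_cols:
        raise ValueError("last row has no edge entry to remove")
    if len(edge_cols) == 1:
        return rows[:-1]
    rows[-1][edge_cols[-1]] = "0"
    return rows


def is_suboptimal_cam(matrix):
    return is_cam(mp_submatrix(matrix))


cam = [["c"], ["y", "b"], ["0", "y", "a"]]      # a CAM under the assumed order
sub = [["b"], ["y", "a"], ["y", "0", "c"]]      # not a CAM, but suboptimal
neither = [["a"], ["y", "b"], ["y", "0", "c"]]  # its MP submatrix is not a CAM
print(is_cam(cam), is_suboptimal_cam(cam))
print(is_cam(sub), is_suboptimal_cam(sub))
print(is_suboptimal_cam(neither))
```

Every CAM passes the suboptimal test (Theorem I), while the converse fails, so the suboptimal matrices form a superset of the CAMs.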


### Summary

• Theorem:

For a graph G, let $C_{K-1}$ ($C_K$) be the set of the suboptimal CAMs of all size-$(K-1)$ (size-$K$) subgraphs of G ($K \ge 2$). Every member of $C_K$ can be enumerated unambiguously either by joining two members of $C_{K-1}$ or by extending a member of $C_{K-1}$.

### Experimental Study

• Predictive Toxicology Evaluation Competition (PTE)

• Contains 337 compounds

• Each graph contains 27 nodes and 27 edges on average

• NIH DTP Anti-Viral Screen Test (DTP CA/CM)

• Chemicals are classified as Confirmed Active (CA), Confirmed Moderately Active (CM), or Confirmed Inactive (CI).

• We formed a dataset containing the CA (423) and CM (1083) compounds.

• Each graph contains 25 nodes and 27 edges on average

### Performance (PTE)

[Figure: two plots with x-axis "Support Threshold (%)".]

### Performance (DTP CACM)

[Figure: two plots with x-axis "Support Threshold (%)".]