
### NeMoFinder: Dissecting genome-wide protein-protein interactions with meso-scale network motifs

Mike Yuan

Outline of this presentation

- Introduction to PPI
- Introduction to Graph Mining
- Related work
- Problem statement
- Details of the NeMoFinder algorithm
- Summary
- References

Protein Interactions

A protein may interact with:

- Other proteins
- Nucleic Acids
- Small molecules

Finding Protein Partners

Motivation

- Important for biological functions
- To understand the function of a protein, we need to find its interacting partners

Graph Theory

[Figure: graph terminology, labelling a vertex (node), an edge, a cycle, a directed edge (arc) and weighted edges, with the example weights -5, 10 and 7]

Molecular interaction networks are mapped as graphs
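A minimal sketch of this mapping, assuming an undirected, unweighted PPI network stored as adjacency sets. The protein names and the `build_ppi_graph` helper are illustrative and do not come from the slides.

```python
from collections import defaultdict


def build_ppi_graph(interactions):
    """Build an undirected graph: protein -> set of interacting partners."""
    graph = defaultdict(set)
    for a, b in interactions:
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)


# Hypothetical interaction pairs, for illustration only.
interactions = [("P1", "P2"), ("P2", "P3"), ("P1", "P3"), ("P3", "P4")]
ppi = build_ppi_graph(interactions)

# Each undirected edge is stored twice, once per endpoint.
num_edges = sum(len(partners) for partners in ppi.values()) // 2
print(sorted(ppi["P3"]), num_edges)
```

Adjacency sets make the "find its interacting partners" query a single dictionary lookup.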

The protein-protein interaction network…

Graph mining

- Methods for Mining Frequent Subgraphs
- Mining Variant and Constrained Substructure Patterns
- Applications:
  - Graph Indexing
  - Similarity Search
  - Classification and Clustering

Why Graph Mining?

- Graphs are ubiquitous
  - Chemical compounds (Cheminformatics)
  - Protein structures, biological pathways/networks (Bioinformatics)
  - Program control flow, traffic flow, and workflow analysis
  - XML databases, Web, and social network analysis
- Graph is a general model
  - Trees, lattices, sequences, and items are degenerate graphs
- Complexity of algorithms: many problems are of high complexity

Graph, Graph, Everywhere

[Figure panels: aspirin; yeast protein interaction network (from H. Jeong et al., Nature 411, 41 (2001)); co-author network; Internet]

Graph Pattern Mining

- Frequent subgraphs
  - A (sub)graph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold
- Applications of graph pattern mining
  - Mining biochemical structures
  - Program control flow analysis
  - Mining XML structures or Web communities
  - Building blocks for graph classification, clustering, compression, comparison, and correlation analysis
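The support definition above can be sketched with a brute-force check, assuming unlabeled, undirected graphs and non-induced subgraph matching. The toy dataset, the `contains` and `support` helpers, and the threshold are all illustrative assumptions; real miners avoid this exhaustive search.

```python
from itertools import permutations


def contains(n_vertices, graph_edges, pattern_edges, p_vertices):
    """True if the pattern maps injectively into the graph so that every
    pattern edge lands on a graph edge (exhaustive search)."""
    edge_set = {frozenset(e) for e in graph_edges}
    for image in permutations(range(n_vertices), p_vertices):
        if all(frozenset((image[u], image[v])) in edge_set
               for u, v in pattern_edges):
            return True
    return False


def support(dataset, pattern_edges, p_vertices):
    """Number of dataset graphs that contain the pattern."""
    return sum(contains(n, edges, pattern_edges, p_vertices)
               for n, edges in dataset)


# Toy dataset: (vertex count, edge list) per graph.
dataset = [
    (3, [(0, 1), (1, 2), (0, 2)]),   # triangle
    (4, [(0, 1), (1, 2), (2, 3)]),   # path with three edges
    (3, [(0, 1)]),                   # single edge plus an isolated vertex
]
path2 = [(0, 1), (1, 2)]             # two-edge path on three vertices
triangle = [(0, 1), (1, 2), (0, 2)]

min_support = 2
path_support = support(dataset, path2, 3)
triangle_support = support(dataset, triangle, 3)
print(path_support >= min_support, triangle_support >= min_support)
```

With a minimum support of 2, the two-edge path is frequent here while the triangle is not.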

Example: Frequent Subgraphs

[Figure: a graph dataset of three graphs (A), (B) and (C), and the two frequent patterns (1) and (2) found with a minimum support of 2]

Frequent Subgraph Mining Approaches

- Apriori-based approach: if a graph is frequent, all of its subgraphs are frequent (the Apriori property)
  - AGM/AcGM: Inokuchi, et al. (PKDD’00)
  - FSG: Kuramochi and Karypis (ICDM’01)
  - PATH#: Vanetik and Gudes (ICDM’02, ICDM’04)
  - FFSM: Huan, et al. (ICDM’03)
- Pattern growth approach
  - MoFa: Borgelt and Berthold (ICDM’02)
  - gSpan: Yan and Han (ICDM’02)
  - Gaston: Nijssen and Kok (KDD’04)

Problem Statement

- PPI network G = (V, E)
  - Each vertex represents a unique protein
  - Each edge between vA and vB indicates an interaction between proteins A and B
- Network motif
  - A frequently occurring subgraph pattern in a network
- f(g) is the number of occurrences of a subgraph g; g is repeated if f(g) > F.
- f_rand_i(g) is the frequency of g in a randomized network G_rand_i, for 1 ≤ i ≤ N, where N is the number of randomized networks. s(g) is the number of randomized networks for which f(g) ≥ f_rand_i(g); g is unique if s(g) > S.
- Network motif discovery algorithm

Problem Statement (cont)

- Motivation for NeMoFinder: existing research has the following limitations:
  - The number of network motif candidates increases exponentially
  - Interesting network motifs are both repeated and unique, and Apriori algorithms are not applicable
  - The graph isomorphism problem is an NP problem
- NeMoFinder
  - A network motif discovery algorithm that discovers repeated and unique meso-scale network motifs in a large PPI network

Key procedures

- Example graph G
- Find repeated trees
- Use repeated trees to partition a network into a set of graphs
- Introduce graph cousins to facilitate the candidate generation and frequency counting processes.

Step 1. Discover repeated subgraphs

- Step 1.1: find repeated size-k trees
- E.g., size-2 to size-5 trees: t2, t3, t4_1, t4_2, t5_1, t5_2, t5_3

Step 1. Discover repeated subgraphs (cont)

- f(t2) = 7, f(t3) = 13, f(t4_1) = 6, f(t4_2) = 17, f(t5_1) = 1, f(t5_2) = 5, f(t5_3) = 7.
- T2 = {t2}, T3 = {t3}, T4 = {t4_1, t4_2} and T5 = {t5_2, t5_3}; t5_1 occurs only once, so it is not repeated and is left out of T5.
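The filtering step behind these sets can be sketched as follows. The frequencies are the ones listed on the slide; the threshold value is an assumption, since the slide does not state F, and any threshold that excludes a frequency of 1 reproduces the same sets. The strict test f > F follows the problem statement.

```python
from collections import defaultdict

# Frequencies as listed on the slide.
frequencies = {"t2": 7, "t3": 13, "t4_1": 6, "t4_2": 17,
               "t5_1": 1, "t5_2": 5, "t5_3": 7}
# Tree size (number of vertices) for each tree name.
sizes = {"t2": 2, "t3": 3, "t4_1": 4, "t4_2": 4,
         "t5_1": 5, "t5_2": 5, "t5_3": 5}

F = 1  # assumed frequency threshold, not given on the slide

# Group the repeated trees by size: repeated[k] plays the role of Tk.
repeated = defaultdict(list)
for name, freq in frequencies.items():
    if freq > F:
        repeated[sizes[name]].append(name)

print(dict(repeated))
```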

Step 1.2 Use repeated size-k trees to partition graph

- Occurrences of t4_1 in G.

Step 1.2 Use repeated size-k trees to partition graph (cont)

- Occurrences of t4_2 in G.

Step 1.2 Use repeated size-k trees to partition graph (cont)

- Set of graphs GD4: G4_1, G4_2, G4_3, G4_4, G4_5

Step 1.3: perform graph join operation to find repeated size-k graphs

- Generate 3-edge subgraphs from size-4 trees: t4_1 yields h1 and h2; t4_2 yields h3, h4 and h5

Step 1.3: perform graph join operation to find repeated size-k graphs (cont)

- Examples of graph join operations for subgraphs:
  - join(t4_1, h2) = g1_2
  - join(t4_2, h3) = g1_1
- f(g1_1) = 2 and f(g1_2) = 5

Step 1.3: perform graph join operation to find repeated size-k graphs (cont)

- Use the subgraphs obtained to generate further subgraphs: g1_2 yields h6 and h7
- Graph join operation for these subgraphs: join(g1_2, h6) = g2
- f(g2) < 2, so the algorithm stops

Algorithm 1 NeMoFinder

Input: G - PPI network; N - number of randomized networks; K - maximal network motif size; F - frequency threshold; S - uniqueness threshold;

Output: U - repeated and unique network motif set;

Step 1: Discover repeated subgraphs

D ← ∅;

for motif-size k from 3 to K do

T ← FindRepeatedTrees(k); (Step 1.1: find repeated size-k trees)

GDk ← GraphPartition(G, T); (Step 1.2: use repeated size-k trees to partition the graph)

D ← D ∪ T;

D’ ← T;

i ← k;

while D’ ≠ ∅ and i ≤ k × (k − 1)/2 do (Step 1.3: perform graph join operations to find repeated size-k graphs)

D’ ← FindRepeatedGraphs(k, i, D’);

D ← D ∪ D’;

i ← i + 1;

end while

end for
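The control flow of Step 1 can be sketched as below, assuming the tree finder and the graph finder are supplied as callables. Both stubs are hypothetical stand-ins that only record which (k, i) levels are visited; they are not the real FindRepeatedTrees or FindRepeatedGraphs, and the graph partition step is omitted.

```python
def discover_repeated_subgraphs(K, find_repeated_trees, find_repeated_graphs):
    """For each size k, start from repeated trees (k - 1 edges) and add one
    edge per level, up to the complete graph with k * (k - 1) / 2 edges."""
    repeated = []
    for k in range(3, K + 1):
        trees = find_repeated_trees(k)
        repeated.extend(trees)
        current, i = trees, k
        max_edges = k * (k - 1) // 2
        # Stop early as soon as a level produces no repeated subgraph.
        while current and i <= max_edges:
            current = find_repeated_graphs(k, i, current)
            repeated.extend(current)
            i += 1
    return repeated


# Hypothetical stubs, for illustration only.
visited = []


def stub_trees(k):
    return [f"tree_k{k}"]


def stub_graphs(k, i, previous):
    visited.append((k, i))
    # Pretend every level below the complete graph yields one subgraph.
    return [f"graph_k{k}_e{i}"] if i < k * (k - 1) // 2 else []


result = discover_repeated_subgraphs(4, stub_trees, stub_graphs)
print(visited)
print(result)
```

The loop bound reflects that a simple graph on k vertices has at most k(k − 1)/2 edges, so no further join is possible past that level.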

Algorithm 1 NeMoFinder (cont)

Step 2: Determine subgraph frequency in randomized networks

for counter i from 1 to N do

Grand ← RandomizedNetworkGeneration();

for each g ∈ D do

GetRandFrequency(g, Grand);

end for

end for

Step 3: Compute uniqueness of subgraphs

U ← ∅;

for each g ∈ D do

s ← GetUniquenessValue(g);

if s ≥ S then

U ← U ∪ {g};

end if

end for

return U;
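The slides do not say how RandomizedNetworkGeneration works. One common choice, shown here purely as an assumption, is degree-preserving randomization by repeated edge swaps; the `randomize_network` helper, the toy edge list and the swap count are all illustrative.

```python
import random
from collections import Counter


def randomize_network(edges, n_swaps, seed=0):
    """Degree-preserving randomization (assumed technique): repeatedly pick
    two edges (a, b) and (c, d) and rewire them to (a, d) and (c, b),
    skipping swaps that would create a self-loop or a duplicate edge."""
    rng = random.Random(seed)
    edge_list = [tuple(e) for e in edges]
    edge_set = {frozenset(e) for e in edge_list}
    done, attempts = 0, 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if len({a, b, c, d}) < 4:
            continue  # shared endpoint: the swap could create a self-loop
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in edge_set or new2 in edge_set:
            continue  # the swap would duplicate an existing edge
        edge_set -= {frozenset((a, b)), frozenset((c, d))}
        edge_set |= {new1, new2}
        edge_list[i], edge_list[j] = (a, d), (c, b)
        done += 1
    return edge_list


def degree_sequence(edges):
    """Count how many edges touch each vertex."""
    degrees = Counter()
    for a, b in edges:
        degrees[a] += 1
        degrees[b] += 1
    return degrees


# Toy network: a 6-cycle with one chord.
original = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 3)]
shuffled = randomize_network(original, n_swaps=10)
print(degree_sequence(shuffled) == degree_sequence(original))
```

Keeping every vertex degree fixed means a subgraph that is frequent only because of hub proteins is also frequent in the randomized networks, which is what the uniqueness test is meant to detect.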

Algorithm Steps (cont)

- Step 2: Determine subgraph frequency in randomized networks
  - Generate randomized networks G_rand_i (1 ≤ i ≤ N)
  - Check the frequency of the subgraphs in each randomized network G_rand_i
- Step 3: Compute uniqueness of subgraphs
  - Based on the frequencies in the input PPI network and in the randomized networks
  - f_rand_i(g) is the frequency of g in a randomized network G_rand_i, for 1 ≤ i ≤ N, where N is the number of randomized networks. s(g) is the number of randomized networks for which f(g) ≥ f_rand_i(g); g is unique if s(g) > S.
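The uniqueness count can be sketched directly from this definition. The frequencies and the threshold below are made-up numbers for illustration. The strict test s > S follows the bullet above, whereas Algorithm 1 writes s ≥ S.

```python
def uniqueness(f_real, f_rand):
    """Number of randomized networks in which g is no more frequent
    than in the real network."""
    return sum(1 for f in f_rand if f_real >= f)


f_g = 5                    # hypothetical frequency of g in the PPI network
f_rand = [3, 6, 5, 2, 7]   # hypothetical frequencies in N = 5 randomized networks
S = 2                      # hypothetical uniqueness threshold

s_g = uniqueness(f_g, f_rand)
is_unique = s_g > S
print(s_g, is_unique)
```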

Find repeated subgraphs

Algorithm 2 FindRepeatedGraphs(k, i, D’)

Input: D’ - set of repeated subgraphs with k vertices and i − 1 edges;

Output: D’’ - set of repeated subgraphs with k vertices and i edges;

C ← CandidateGeneration(k, i, D’);

D’’ ← FrequencyCounting(k, i, C);

return D’’;

Candidate generation using graph cousins

- Represent subgraphs by adjacency matrices
- Code(M): the sequence formed by concatenating the lower-triangular entries of M (including the diagonal) row by row: m1,1 m2,1 m2,2 … mn,1 mn,2 … mn,n
- Transform the adjacency matrix into the canonical adjacency matrix (CAM), which is the one with the maximal code
- Definition of the subCAM of a graph g
  - The matrix obtained by setting the last edge entry in CAM(g) to 0.
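These three definitions can be sketched with a brute-force search over vertex orderings, assuming small unlabeled graphs with 0/1 adjacency matrices and a zero diagonal. The helper names are illustrative, and practical implementations avoid enumerating all permutations.

```python
from itertools import permutations


def code(matrix):
    """Concatenate the lower-triangular entries row by row:
    m11 m21 m22 ... mn1 ... mnn."""
    n = len(matrix)
    return "".join(str(matrix[r][c]) for r in range(n) for c in range(r + 1))


def canonical_adjacency_matrix(matrix):
    """Try every vertex ordering and keep the matrix with the maximal code.
    Comparing codes as strings is valid for equal-length 0/1 codes."""
    n = len(matrix)
    best = None
    for perm in permutations(range(n)):
        m = [[matrix[perm[r]][perm[c]] for c in range(n)] for r in range(n)]
        if best is None or code(m) > code(best):
            best = m
    return best


def sub_cam(cam):
    """Copy of the CAM with the last edge entry (in code order) set to 0."""
    n = len(cam)
    m = [row[:] for row in cam]
    for r in range(n - 1, -1, -1):
        for c in range(r - 1, -1, -1):
            if m[r][c]:
                m[r][c] = m[c][r] = 0  # keep the matrix symmetric
                return m
    return m


# Two labelings of the same two-edge path on three vertices.
path_a = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
path_b = [[0, 0, 1], [0, 0, 1], [1, 1, 0]]

cam_a = canonical_adjacency_matrix(path_a)
cam_b = canonical_adjacency_matrix(path_b)
print(code(cam_a), code(cam_b), code(sub_cam(cam_a)))
```

Isomorphic graphs end up with the same CAM code, so comparing codes replaces a pairwise isomorphism test.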

Candidate generation using graph cousins (cont)

- Definition of cousin
  - Given two subgraphs g and h, if subCAM(g) = subCAM(h), then h is a cousin of g.
- Three types of cousin relationship between g and h:
  - Type I, direct cousin: h is isomorphic to a subgraph g’ that has the same number of vertices and edges as g, and g’ ≠ g;
  - Type II, twin cousin: h is isomorphic to subgraph g;
  - Type III, distant cousin: h is a disconnected subgraph.

Candidate generation using graph cousins (cont)

- Adjacency matrices for the graphs in figure 6: t4_1, h1, h2

[Figure: 0/1 adjacency matrices of t4_1, h1 and h2]

Candidate generation using graph cousins (cont)

- Adjacency matrices for the graphs in figure 6: t4_2, h3, h4, h5

Candidate generation using graph cousins (cont)

- Observations from the above example
  - h1 is a Type I direct cousin of t4_1
  - h2 is a Type III distant cousin of t4_1
  - h3 is a Type II twin cousin of t4_2
  - h4 is a Type I direct cousin of t4_2
  - h5 is a Type III distant cousin of t4_2

Candidate generation using graph cousins (cont)

Algorithm 3 CandidateGeneration(k, i, D’)

Input: D’ - set of repeated subgraphs with k vertices and i − 1 edges;

Output: C - set of candidates with k vertices and i edges;

C ← ∅;

for each g ∈ D’ do

H ← GetCousin(g); (Step 1: find the set of cousins)

for each h ∈ H do

g’ ← join(g, h); (Step 2: join g with a cousin to form a new subgraph)

C ← C ∪ {g’};

end for

end for

return C;

Frequency counting

- Leveraging properties of the different types of cousins
  - L(x): the set of graphs in GDk that embed x
  - If h is a Type I direct cousin of g and g’ is the subgraph obtained by joining g and h, then L(g’) = L(g) ∩ L(h) and f(g’) = |L(g) ∩ L(h)|
  - If h is a Type III distant cousin, then f(g’) = |L(g) ∩ L(h)|
  - If h is a Type II twin cousin, then f(g’) = CheckAllOccurances(g)
  - Example: L(t4_1) = {G4_1, G4_2, G4_3, G4_5} and L(h2) = {G4_1, G4_2, G4_3, G4_4, G4_5}, so L(g1_2) = L(t4_1) ∩ L(h2) = {G4_1, G4_2, G4_3, G4_5} and f(g1_2) = 4 > 2
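The worked example on this slide reduces to a set intersection, which can be checked in a few lines. The set contents are the ones given above; the variable names are illustrative.

```python
# Graphs of GD4 that embed t4_1 and h2, as listed on the slide.
L_t4_1 = {"G4_1", "G4_2", "G4_3", "G4_5"}
L_h2 = {"G4_1", "G4_2", "G4_3", "G4_4", "G4_5"}

# g1_2 = join(t4_1, h2) can only be embedded where both parents are.
L_g1_2 = L_t4_1 & L_h2
f_g1_2 = len(L_g1_2)

print(sorted(L_g1_2), f_g1_2)
```

Counting by intersection avoids a new subgraph isomorphism search for every candidate.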

Frequency counting (cont)

Algorithm 4 FrequencyCounting(k, i, C)

Input: GDk - set of graphs generated by partitioning G with size-k repeated trees; C - set of subgraph candidates with k vertices and i edges; F - frequency threshold;

Output: D’’ - set of repeated subgraphs with k vertices and i edges;

D’’ ← ∅;

for each g’ ∈ C do

Get the join parameters of g’: g and h;

L(g) ← set of graphs in GDk embedding g;

L(h) ← set of graphs in GDk embedding h;

if f(g) < F or f(h) < F then

f(g’) ← 0;

else if type of h = Type I direct cousin then

f(g’) ← |L(g) ∩ L(h)|;

else if type of h = Type III distant cousin then

f(g’) ← |L(g) ∩ L(h)|;

else if type of h = Type II twin cousin then

f(g’) ← CheckAllOccurances(g);

end if

if f(g’) > F then

D’’ ← D’’ ∪ {g’};

end if

end for

return D’’;

The three branches handle the cases where h is a direct cousin, a distant cousin and a twin cousin.
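The branching in Algorithm 4 can be sketched as below, assuming each candidate carries its join parents' embedding sets and frequencies in a plain dictionary. The record layout, the `check_all_occurrences` callback, its argument and the parent frequencies are hypothetical; only the embedding sets for g1_2 come from the slides.

```python
def frequency_counting(candidates, F, check_all_occurrences):
    """Return (name, frequency) for each candidate whose frequency exceeds F."""
    repeated = []
    for cand in candidates:
        if cand["f_g"] < F or cand["f_h"] < F:
            f_new = 0  # a parent is not repeated, so the join cannot be
        elif cand["cousin_type"] in ("direct", "distant"):
            f_new = len(cand["L_g"] & cand["L_h"])
        else:
            # Twin cousin: fall back to an explicit occurrence search.
            f_new = check_all_occurrences(cand["name"])
        if f_new > F:
            repeated.append((cand["name"], f_new))
    return repeated


candidates = [
    {   # g1_2 = join(t4_1, h2); h2 is a distant cousin of t4_1
        "name": "g1_2", "cousin_type": "distant",
        "L_g": {"G4_1", "G4_2", "G4_3", "G4_5"},
        "L_h": {"G4_1", "G4_2", "G4_3", "G4_4", "G4_5"},
        "f_g": 6, "f_h": 5,   # f_h is a made-up value
    },
    {   # made-up twin-cousin candidate
        "name": "x", "cousin_type": "twin",
        "L_g": set(), "L_h": set(), "f_g": 3, "f_h": 3,
    },
]

# Stub occurrence search that reports a single occurrence.
result = frequency_counting(candidates, F=2, check_all_occurrences=lambda name: 1)
print(result)
```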

Summary

- NeMoFinder: an efficient network motif discovery algorithm that discovers larger repeated and unique network motifs in PPI networks.
- Uses repeated trees to partition the network into a set of graphs
- Uses graph cousins for candidate generation and frequency counting

References (1)

- T. Asai, et al. “Efficient substructure discovery from large semi-structured data”, SDM'02
- C. Borgelt and M. R. Berthold, “Mining molecular fragments: Finding relevant substructures of molecules”, ICDM'02
- D. Cai, Z. Shao, X. He, X. Yan, and J. Han, “Community Mining from Multi-Relational Networks”, PKDD'05.
- J. Chen, W. Hsu, M. L. Lee, and S.-K. Ng, “NeMoFinder: Dissecting genome-wide protein-protein interactions with meso-scale network motifs”, KDD'06
- M. Deshpande, M. Kuramochi, and G. Karypis, “Frequent Sub-structure Based Approaches for Classifying Chemical Compounds”, ICDM 2003
- M. Deshpande, M. Kuramochi, and G. Karypis. “Automated approaches for classifying structures”, BIOKDD'02
- C. Faloutsos, K. McCurley, and A. Tomkins, “Fast Discovery of Connection Subgraphs”, KDD'04
- H. Fröhlich, J. Wegner, F. Sieker, and A. Zell, “Optimal Assignment Kernels For Attributed Molecular Graphs”, ICML’05

References (2)

- L. Holder, D. Cook, and S. Djoko. “Substructure discovery in the subdue system”, KDD'94
- J. Huan, W. Wang, D. Bandyopadhyay, J. Snoeyink, J. Prins, and A. Tropsha. “Mining spatial motifs from protein structure graphs”, RECOMB’04
- J. Huan, W. Wang, and J. Prins. “Efficient mining of frequent subgraph in the presence of isomorphism”, ICDM'03
- H. Hu, X. Yan, Y. Huang, J. Han, and X. J. Zhou, “Mining Coherent Dense Subgraphs across Massive Biological Networks for Functional Discovery”, ISMB'05
- A. Inokuchi, T. Washio, and H. Motoda. “An apriori-based algorithm for mining frequent substructures from graph data”, PKDD'00
- C. James, D. Weininger, and J. Delany. “Daylight Theory Manual Daylight Version 4.82”. Daylight Chemical Information Systems, Inc., 2003.
- G. Jeh, and J. Widom, “Mining the Space of Graph Properties”, KDD'04
- H. Kashima, K. Tsuda, and A. Inokuchi, “Marginalized Kernels Between Labeled Graphs”, ICML’03

References (3)

- M. Koyuturk, A. Grama, and W. Szpankowski. “An efficient algorithm for detecting frequent subgraphs in biological networks”, Bioinformatics, 20:I200--I207, 2004.
- T. Kudo, E. Maeda, and Y. Matsumoto, “An Application of Boosting to Graph Classification”, NIPS’04
- M. Kuramochi and G. Karypis. “Frequent subgraph discovery”, ICDM'01
- M. Kuramochi and G. Karypis, “GREW: A Scalable Frequent Subgraph Discovery Algorithm”, ICDM’04
- C. Liu, X. Yan, H. Yu, J. Han, and P. S. Yu, “Mining Behavior Graphs for ‘Backtrace’ of Noncrashing Bugs”, SDM'05
- P. Mahé, N. Ueda, T. Akutsu, J. Perret, and J. Vert, “Extensions of Marginalized Graph Kernels”, ICML’04
- S. Nijssen and J. Kok. “A quickstart in frequent structure mining can make a difference”, KDD'04
- J. Prins, J. Yang, J. Huan, and W. Wang. “Spin: Mining maximal frequent subgraphs from graph databases”. KDD'04

References (4)

- D. Shasha, J. T.-L. Wang, and R. Giugno. “Algorithmics and applications of tree and graph searching”, PODS'02
- J. R. Ullmann. “An algorithm for subgraph isomorphism”, J. ACM, 23:31--42, 1976.
- N. Vanetik, E. Gudes, and S. E. Shimony. “Computing frequent graph patterns from semistructured data”, ICDM'02
- C. Wang, W. Wang, J. Pei, Y. Zhu, and B. Shi. “Scalable mining of large disk-based graph databases”, KDD'04
- T. Washio and H. Motoda, “State of the art of graph-based data mining”, SIGKDD Explorations, 5:59-68, 2003
- X. Yan and J. Han, “gSpan: Graph-Based Substructure Pattern Mining”, ICDM'02
- X. Yan and J. Han, “CloseGraph: Mining Closed Frequent Graph Patterns”, KDD'03
- X. Yan, P. S. Yu, and J. Han, “Graph Indexing: A Frequent Structure-based Approach”, SIGMOD'04
- X. Yan, X. J. Zhou, and J. Han, “Mining Closed Relational Graphs with Connectivity Constraints”, KDD'05
- X. Yan, P. S. Yu, and J. Han, “Substructure Similarity Search in Graph Databases”, SIGMOD'05
- X. Yan, F. Zhu, J. Han, and P. S. Yu, “Searching Substructures with Superimposed Distance”, ICDE'06
- M. J. Zaki. “Efficiently mining frequent trees in a forest”, KDD'02
