
NeMoFinder: Dissecting genome-wide protein-protein interactions with meso-scale network motifs




  1. NeMoFinder: Dissecting genome-wide protein-protein interactions with meso-scale network motifs. Mike Yuan

  2. Outline of this presentation • Introduction to PPI • Introduction to Graph Mining • Related work • Problem statement • Details of the NeMoFinder algorithm • Summary • References

  3. Protein Interactions. A protein may interact with: • Other proteins • Nucleic acids • Small molecules

  4. Finding Protein Partners

  5. Motivation • Important for biological functions • To understand the function of a protein, we need to find its interacting partners

  6. Graph Theory [Figure: example graph labelling a vertex (node), an edge, a cycle, a directed edge (arc), and weighted edges] • Molecular interaction networks are mapped as graphs
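
The mapping from interactions to a graph can be sketched as follows. This is a minimal illustration, assuming an undirected, unweighted network stored as adjacency sets; the protein names `P1` to `P4` and the function `build_ppi_graph` are hypothetical and do not come from the slides.

```python
from collections import defaultdict


def build_ppi_graph(interactions):
    """Build an undirected graph (adjacency sets) from protein pairs."""
    adj = defaultdict(set)
    for a, b in interactions:
        if a == b:
            continue  # this sketch ignores self-interactions
        adj[a].add(b)
        adj[b].add(a)
    return dict(adj)


# Hypothetical interaction list: each pair is one observed interaction.
interactions = [("P1", "P2"), ("P2", "P3"), ("P3", "P1"), ("P3", "P4")]
graph = build_ppi_graph(interactions)
num_edges = sum(len(nbrs) for nbrs in graph.values()) // 2
```

Each vertex is a protein and each edge an interaction, which is the representation the later slides assume for G = (V, E).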

  7. The protein-protein interaction network…

  8. Graph mining • Methods for Mining Frequent Subgraphs • Mining Variant and Constrained Substructure Patterns • Applications: • Graph Indexing • Similarity Search • Classification and Clustering

  9. Why Graph Mining?
      • Graphs are ubiquitous
        - Chemical compounds (cheminformatics)
        - Protein structures, biological pathways/networks (bioinformatics)
        - Program control flow, traffic flow, and workflow analysis
        - XML databases, Web, and social network analysis
      • A graph is a general model
        - Trees, lattices, sequences, and items are degenerate graphs
      • Complexity of algorithms: many problems are of high complexity

  10. Graph, Graph, Everywhere [Figure panels: Aspirin; Yeast protein interaction network; Co-author network; Internet. Credit: H. Jeong et al., Nature 411, 41 (2001)]

  11. Graph Pattern Mining • Frequent subgraphs • A (sub)graph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold • Applications of graph pattern mining • Mining biochemical structures • Program control flow analysis • Mining XML structures or Web communities • Building blocks for graph classification, clustering, compression, comparison, and correlation analysis
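
The support definition above can be illustrated with a small brute-force sketch. It assumes unlabeled, undirected graphs given as edge lists, counts a pattern as present when it occurs as a (not necessarily induced) subgraph, and tries every injective vertex mapping, which is only practical for tiny graphs. The helper names and the toy dataset are hypothetical.

```python
from itertools import permutations


def nodes_of(edges):
    """Vertices that appear in at least one edge (isolated vertices are ignored)."""
    return sorted({v for e in edges for v in e})


def contains(graph_edges, pattern_edges):
    """True if the pattern maps injectively into the graph, preserving its edges."""
    g_nodes = nodes_of(graph_edges)
    p_nodes = nodes_of(pattern_edges)
    g_set = {frozenset(e) for e in graph_edges}
    for image in permutations(g_nodes, len(p_nodes)):
        mapping = dict(zip(p_nodes, image))
        if all(frozenset((mapping[u], mapping[v])) in g_set
               for u, v in pattern_edges):
            return True
    return False


def support(dataset, pattern_edges):
    """Number of graphs in the dataset that contain the pattern."""
    return sum(contains(g, pattern_edges) for g in dataset)


dataset = [
    [(1, 2), (2, 3), (3, 1)],  # triangle
    [(1, 2), (2, 3)],          # two-edge path
    [(1, 2)],                  # single edge
]
pattern = [("a", "b"), ("b", "c")]  # two-edge path
min_support = 2
is_frequent = support(dataset, pattern) >= min_support
```

Here the two-edge path occurs in the triangle and in the path but not in the single edge, so its support is 2 and it meets the minimum support threshold.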

  12. Example: Frequent Subgraphs [Figure: a graph dataset of three graphs (A), (B), (C) and the two frequent patterns (1), (2) found with minimum support 2]

  13. Frequent Subgraph Mining Approaches
      • Apriori-based approach (the Apriori property: if a graph is frequent, all of its subgraphs are frequent)
        - AGM/AcGM: Inokuchi, et al. (PKDD’00)
        - FSG: Kuramochi and Karypis (ICDM’01)
        - PATH#: Vanetik and Gudes (ICDM’02, ICDM’04)
        - FFSM: Huan, et al. (ICDM’03)
      • Pattern growth approach
        - MoFa: Borgelt and Berthold (ICDM’02)
        - gSpan: Yan and Han (ICDM’02)
        - Gaston: Nijssen and Kok (KDD’04)

  14. Problem Statement
      • PPI network G = (V, E)
        - each vertex represents a unique protein
        - each edge between vA and vB indicates an interaction between proteins A and B
      • Network motif
        - a frequently occurring subgraph pattern in a network
      • f(g) is the number of occurrences of a subgraph g; g is repeated if f(g) > F.
      • f_rand_i(g) is the frequency of g in a randomized network G_rand_i, for 1 ≤ i ≤ N, where N is the number of randomized networks. s(g) is the number of randomized networks for which f(g) ≥ f_rand_i(g); g is unique if s(g) > S.
      • Network motif discovery algorithm
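
The two tests in this problem statement can be written down directly. The sketch below assumes the frequencies have already been counted; the function names and the sample numbers are illustrative only.

```python
def uniqueness(f_g, rand_freqs):
    """s(g): number of randomized networks where g is at most as frequent as in G."""
    return sum(1 for f_rand in rand_freqs if f_g >= f_rand)


def is_network_motif(f_g, rand_freqs, F, S):
    """g qualifies when it is repeated (f(g) > F) and unique (s(g) > S)."""
    repeated = f_g > F
    unique = uniqueness(f_g, rand_freqs) > S
    return repeated and unique


# Hypothetical numbers: g occurs 5 times in G and was counted in N = 4 randomized networks.
f_g = 5
rand_freqs = [1, 5, 7, 3]
s_g = uniqueness(f_g, rand_freqs)
```

With these numbers f(g) ≥ f_rand_i(g) holds for three of the four randomized networks, so s(g) = 3.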

  15. Problem Statement (cont)
      • Motivation for NeMoFinder: existing research has the following limitations
        - The number of network motif candidates increases exponentially with motif size
        - Interesting network motifs are both repeated and unique, and Apriori algorithms are not applicable
        - The graph isomorphism problem is in NP, with no known polynomial-time algorithm
      • NeMoFinder
        - a network motif discovery algorithm that discovers repeated and unique meso-scale network motifs in a large PPI network

  16. Key procedures • Example graph G • Find repeated trees • Use repeated trees to partition a network into a set of graphs • Introduce graph cousins to facilitate the candidate generation and frequency counting processes.

  17. Step 1. Discover repeated subgraphs • Step 1.1: find repeated size-k trees • E.g., size-2 to size-5 trees: t2, t3, t4_1, t4_2, t5_1, t5_2, t5_3

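  18. Step 1. Discover repeated subgraphs (cont) • f(t2) = 7, f(t3) = 13, f(t4_1) = 6, f(t4_2) = 17, f(t5_1) = 1, f(t5_2) = 5, f(t5_3) = 7. • T2 = {t2}, T3 = {t3}, T4 = {t4_1, t4_2} and T5 = {t5_2, t5_3}; t5_1 is dropped because its frequency of 1 does not pass the frequency threshold.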
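
Counting tree occurrences can be sketched for the smallest cases. The example graph G from the slides is not reproduced in the transcript, so the code below uses a hypothetical toy graph. It assumes that "size" means the number of vertices and that occurrences are counted as non-induced subgraphs. Under those assumptions the size-2 tree is an edge, the size-3 tree is a two-edge path centred on some vertex, and one of the two size-4 trees is a star.

```python
from collections import Counter
from math import comb


def small_tree_counts(edges):
    """Count edges, two-edge paths, and three-leaf stars from vertex degrees."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return {
        "t2": len(edges),                                # single edges
        "t3": sum(comb(d, 2) for d in deg.values()),     # pick 2 neighbours of a centre
        "star4": sum(comb(d, 3) for d in deg.values()),  # pick 3 neighbours of a centre
    }


# Hypothetical toy graph: a triangle 2-3-4 with a pendant vertex 1 attached to 2.
toy_edges = [(1, 2), (2, 3), (3, 4), (2, 4)]
counts = small_tree_counts(toy_edges)

F = 2  # frequency threshold, chosen for illustration
repeated = {name for name, f in counts.items() if f > F}
```

Larger trees need explicit enumeration; the degree formulas only cover these three shapes.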

  19. Step 1.2 Use repeated size-k trees to partition graph • Occurrences of t4_1 in G.

  20. Step 1.2 Use repeated size-k trees to partition graph (cont) • Occurrences of t4_2 in G.

  21. Step 1.2 Use repeated size-k trees to partition graph (cont) • Set of graphs GD4 = {G4_1, G4_2, G4_3, G4_4, G4_5}

  22. Step 1.3: perform graph join operation to find repeated size-k graphs • Generate 3-edge subgraphs from size-4 trees [Figure: t4_1 with h1 and h2; t4_2 with h3, h4, and h5]

  23. Step 1.3: perform graph join operation to find repeated size-k graphs (cont) • Examples of graph join operations for subgraphs [Figure: t4_1 joined with h2 gives g1_2; t4_2 joined with h3 gives g1_1] • f(g1_1) = 2 and f(g1_2) = 5

  24. Step 1.3: perform graph join operation to find repeated size-k graphs (cont) • Use the subgraphs obtained to generate further subgraphs [Figure: g1_2 with h6 and h7] • Graph join operations for subgraphs [Figure: g1_2 joined with h6 gives g2] • f(g2) < 2, so the algorithm stops

  25. Algorithm 1: NeMoFinder
      Input:  G - PPI network; N - number of randomized networks; K - maximal network motif size; F - frequency threshold; S - uniqueness threshold
      Output: U - repeated and unique network motif set
      Step 1: Discover repeated subgraphs
        D ← ∅
        for motif-size k from 3 to K do
          T ← FindRepeatedTrees(k)            (Step 1.1: find repeated size-k trees)
          GDk ← GraphPartition(G, T)          (Step 1.2: use repeated size-k trees to partition graph)
          D ← D ∪ T
          D' ← T
          i ← k
          while D' ≠ ∅ and i ≤ k × (k − 1)/2 do      (Step 1.3: perform graph join operation to find repeated size-k graphs)
            D' ← FindRepeatedGraphs(k, i, D')
            D ← D ∪ D'
            i ← i + 1
          end while
        end for

  26. Algorithm 1: NeMoFinder (cont)
      Step 2: Determine subgraph frequency in randomized networks
        for counter i from 1 to N do
          Grand ← RandomizedNetworkGeneration()
          for each g ∈ D do
            GetRandFrequency(g, Grand)
          end for
        end for
      Step 3: Compute uniqueness of subgraphs
        U ← ∅
        for each g ∈ D do
          s ← GetUniquenessValue(g)
          if s ≥ S then
            U ← U ∪ {g}
          end if
        end for
        return U

  27. Algorithm Steps (cont)
      • Step 2: Determine subgraph frequency in randomized networks
        - Generate randomized networks G_rand_i (1 ≤ i ≤ N)
        - Check the frequency of the subgraphs in each randomized network G_rand_i
      • Step 3: Compute uniqueness of subgraphs
        - Based on frequencies in the input PPI network and the randomized networks
        - f_rand_i(g) is the frequency of g in a randomized network G_rand_i, for 1 ≤ i ≤ N, where N is the number of randomized networks. s(g) is the number of randomized networks for which f(g) ≥ f_rand_i(g); g is unique if s(g) > S.
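
The slides do not say how the randomized networks are generated. One common choice in motif studies is degree-preserving edge swapping, sketched below purely as an assumption: two edges (a, b) and (c, d) are replaced by (a, d) and (c, b) whenever that creates no self-loop or duplicate edge. The names `randomize_network` and `degrees` are hypothetical.

```python
import random


def degrees(edges):
    """Degree of every vertex, as a plain dict."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg


def randomize_network(edges, n_swaps, seed=None):
    """Return a rewired copy of a simple undirected graph with the same degrees."""
    rng = random.Random(seed)
    edge_list = [tuple(sorted(e)) for e in edges]
    edge_set = set(edge_list)
    if len(edge_list) < 2:
        return edge_list
    done, attempts = 0, 0
    while done < n_swaps and attempts < 100 * n_swaps:  # bounded retries
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c  # try both orientations of the second edge
        if a == d or c == b:
            continue  # would create a self-loop
        e1, e2 = tuple(sorted((a, d))), tuple(sorted((c, b)))
        if e1 in edge_set or e2 in edge_set or e1 == e2:
            continue  # would create a duplicate edge
        edge_set -= {edge_list[i], edge_list[j]}
        edge_set |= {e1, e2}
        edge_list[i], edge_list[j] = e1, e2
        done += 1
    return edge_list


# Hypothetical input: a 6-cycle with one chord.
original = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1), (1, 4)]
randomized = randomize_network(original, n_swaps=20, seed=0)
```

Calling this N times with different seeds would give the networks G_rand_i in which each subgraph's frequency is then counted.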

  28. Find repeated subgraphs
      Algorithm 2: FindRepeatedGraphs(k, i, D')
      Input:  D' - set of repeated subgraphs with k vertices and i − 1 edges
      Output: D'' - set of repeated subgraphs with k vertices and i edges
        C ← CandidateGeneration(k, i, D')
        D'' ← FrequencyCounting(k, i, C)
        return D''

  29. Candidate generation using graph cousins
      • Represent subgraphs by adjacency matrices
      • Code(M): a sequence formed by concatenating the lower triangular entries of M in the order m_{1,1} m_{2,1} m_{2,2} … m_{n,1} m_{n,2} … m_{n,n}
      • Transform the adjacency matrix into the canonical adjacency matrix (CAM), which is the one with the maximal code
      • Definition of the subCAM of a graph g
        - the matrix obtained by setting the last edge entry in CAM(g) to 0
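
These definitions can be made concrete with a brute-force sketch. It assumes unlabeled graphs with a zero diagonal, finds the CAM by trying every vertex ordering (feasible only for small motifs), and reads "last edge entry" as the last nonzero off-diagonal entry in code order. That reading is an assumption, as are the function names.

```python
from itertools import permutations


def code(M):
    """Lower-triangular entries of M, row by row, diagonal included."""
    return tuple(M[i][j] for i in range(len(M)) for j in range(i + 1))


def cam(M):
    """Adjacency matrix, over all vertex orderings, whose code is maximal."""
    n = len(M)
    best = None
    for p in permutations(range(n)):
        P = [[M[p[i]][p[j]] for j in range(n)] for i in range(n)]
        if best is None or code(P) > code(best):
            best = P
    return best


def sub_cam(C):
    """Copy of C with its last edge entry (and the symmetric entry) set to 0."""
    S = [row[:] for row in C]
    for i in range(len(S) - 1, 0, -1):
        for j in range(i - 1, -1, -1):
            if S[i][j]:
                S[i][j] = 0
                S[j][i] = 0
                return S
    return S


# Two labelings of the same two-edge path: centre in the middle, centre last.
path_mid = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
path_last = [[0, 0, 1], [0, 0, 1], [1, 1, 0]]
```

Both labelings produce the same CAM, which is what makes the code usable as a canonical form when comparing subgraphs.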

  30. Candidate generation using graph cousins (cont)
      • Definition of cousin
        - Given two subgraphs g and h, if subCAM(g) = subCAM(h), then h is a cousin of g.
      • Three types of cousin relationship between g and h:
        - Type I, direct cousin: h is isomorphic to a subgraph g' which has the same number of vertices and edges as g, and g' ≠ g
        - Type II, twin cousin: h is isomorphic to subgraph g
        - Type III, distant cousin: h is a disconnected subgraph

  31. Candidate generation using graph cousins (cont) • Adjacency matrices for the graphs in figure 6: t4_1, h1, h2 [matrix entries not recoverable from the transcript]

  32. Candidate generation using graph cousins (cont) • Adjacency matrices for the graphs in figure 6: t4_2, h3, h4, h5

  33. Candidate generation using graph cousins (cont)
      • Observations from the above example
        - h1 is a Type I direct cousin of t4_1
        - h2 is a Type III distant cousin of t4_1
        - h3 is a Type II twin cousin of t4_2
        - h4 is a Type I direct cousin of t4_2
        - h5 is a Type III distant cousin of t4_2

  34. Candidate generation using graph cousins (cont)
      Algorithm 3: CandidateGeneration(k, i, D')
      Input:  D' - set of repeated subgraphs with k vertices and i − 1 edges
      Output: C - set of candidates with k vertices and i edges
        C ← ∅
        for each g ∈ D' do
          H ← GetCousin(g)                (Step 1: find the set of cousins)
          for each h ∈ H do
            g' ← join(g, h)               (Step 2: join g with each cousin to form a new subgraph)
            C ← C ∪ {g'}
          end for
        end for
        return C

  35. Frequency counting
      • Leverages properties of the different types of cousins
        - L(x): set of graphs in GDk embedding x
        - If h is a Type I direct cousin of g and g' is the subgraph obtained by joining g and h, then L(g') = L(g) ∩ L(h) and f(g') = |L(g) ∩ L(h)|
        - If h is a Type III distant cousin, then f(g') = |L(g) ∩ L(h)|
        - If h is a Type II twin cousin, then f(g') = CheckAllOccurrences(g)
      • Example: L(t4_1) = {G4_1, G4_2, G4_3, G4_5} and L(h2) = {G4_1, G4_2, G4_3, G4_4, G4_5}, so L(g1_2) = L(t4_1) ∩ L(h2) = {G4_1, G4_2, G4_3, G4_5} and f(g1_2) = 4 > 2
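
The intersection rule for direct and distant cousins can be checked against the slide's own numbers. The sketch below stores each embedding list as a Python set of graph names; the function name `join_frequency` is hypothetical, and the twin-cousin case, which needs a full occurrence check, is left out.

```python
def join_frequency(L_g, L_h):
    """Embedding set and frequency of g' = join(g, h) for a direct or distant cousin h."""
    L_joined = L_g & L_h  # graphs in GDk that embed both g and h
    return L_joined, len(L_joined)


# Embedding sets taken from the slide's example.
L_t4_1 = {"G4_1", "G4_2", "G4_3", "G4_5"}
L_h2 = {"G4_1", "G4_2", "G4_3", "G4_4", "G4_5"}
F = 2  # frequency threshold implied by the example

L_g1_2, f_g1_2 = join_frequency(L_t4_1, L_h2)
is_repeated = f_g1_2 > F
```

The point of the design is that the frequency comes from a set intersection over the already partitioned graphs, so no new subgraph isomorphism test is needed in these two cases.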

  36. Frequency counting
      Algorithm 4: FrequencyCounting(k, i, C)
      Input:  GDk - set of graphs generated by partitioning G with size-k repeated trees; C - set of subgraph candidates with k vertices and i edges; F - frequency threshold
      Output: D'' - set of repeated subgraphs with k vertices and i edges
        D'' ← ∅
        for each g' ∈ C do
          get the join parameters of g': g and h
          L(g) ← set of graphs in GDk embedding g
          L(h) ← set of graphs in GDk embedding h
          if f(g) < F or f(h) < F then
            f(g') ← 0
          else if h is a Type I direct cousin then
            f(g') ← |L(g) ∩ L(h)|
          else if h is a Type III distant cousin then
            f(g') ← |L(g) ∩ L(h)|
          else if h is a Type II twin cousin then
            f(g') ← CheckAllOccurrences(g)
          end if
          if f(g') > F then
            D'' ← D'' ∪ {g'}
          end if
        end for
        return D''

  37. Summary • NeMoFinder: an efficient network motif discovery algorithm that discovers larger repeated and unique network motifs in PPI networks • Uses repeated trees to partition the network into graphs • Uses graph cousins for candidate generation and frequency counting

  38. References (1) • T. Asai, et al. “Efficient substructure discovery from large semi-structured data”, SDM'02 • C. Borgelt and M. R. Berthold, “Mining molecular fragments: Finding relevant substructures of molecules”, ICDM'02 • D. Cai, Z. Shao, X. He, X. Yan, and J. Han, “Community Mining from Multi-Relational Networks”, PKDD'05 • J. Chen, W. Hsu, M. L. Lee, and S.-K. Ng, “NeMoFinder: Dissecting genome-wide protein-protein interactions with meso-scale network motifs”, KDD'06 • M. Deshpande, M. Kuramochi, and G. Karypis, “Frequent Sub-structure Based Approaches for Classifying Chemical Compounds”, ICDM'03 • M. Deshpande, M. Kuramochi, and G. Karypis, “Automated approaches for classifying structures”, BIOKDD'02 • C. Faloutsos, K. McCurley, and A. Tomkins, “Fast Discovery of Connection Subgraphs”, KDD'04 • H. Fröhlich, J. Wegner, F. Sieker, and A. Zell, “Optimal Assignment Kernels For Attributed Molecular Graphs”, ICML’05

  39. References (2) • L. Holder, D. Cook, and S. Djoko, “Substructure discovery in the subdue system”, KDD'94 • J. Huan, W. Wang, D. Bandyopadhyay, J. Snoeyink, J. Prins, and A. Tropsha, “Mining spatial motifs from protein structure graphs”, RECOMB’04 • J. Huan, W. Wang, and J. Prins, “Efficient mining of frequent subgraph in the presence of isomorphism”, ICDM'03 • H. Hu, X. Yan, Y. Huang, J. Han, and X. J. Zhou, “Mining Coherent Dense Subgraphs across Massive Biological Networks for Functional Discovery”, ISMB'05 • A. Inokuchi, T. Washio, and H. Motoda, “An apriori-based algorithm for mining frequent substructures from graph data”, PKDD'00 • C. James, D. Weininger, and J. Delany, “Daylight Theory Manual Daylight Version 4.82”, Daylight Chemical Information Systems, Inc., 2003 • G. Jeh and J. Widom, “Mining the Space of Graph Properties”, KDD'04 • H. Kashima, K. Tsuda, and A. Inokuchi, “Marginalized Kernels Between Labeled Graphs”, ICML’03

  40. References (3) • M. Koyuturk, A. Grama, and W. Szpankowski, “An efficient algorithm for detecting frequent subgraphs in biological networks”, Bioinformatics, 20:I200-I207, 2004 • T. Kudo, E. Maeda, and Y. Matsumoto, “An Application of Boosting to Graph Classification”, NIPS’04 • M. Kuramochi and G. Karypis, “Frequent subgraph discovery”, ICDM'01 • M. Kuramochi and G. Karypis, “GREW: A Scalable Frequent Subgraph Discovery Algorithm”, ICDM’04 • C. Liu, X. Yan, H. Yu, J. Han, and P. S. Yu, “Mining Behavior Graphs for ‘Backtrace’ of Noncrashing Bugs”, SDM'05 • P. Mahé, N. Ueda, T. Akutsu, J. Perret, and J. Vert, “Extensions of Marginalized Graph Kernels”, ICML’04 • S. Nijssen and J. Kok, “A quickstart in frequent structure mining can make a difference”, KDD'04 • J. Prins, J. Yang, J. Huan, and W. Wang, “Spin: Mining maximal frequent subgraphs from graph databases”, KDD'04

  41. References (4) • D. Shasha, J. T.-L. Wang, and R. Giugno, “Algorithmics and applications of tree and graph searching”, PODS'02 • J. R. Ullmann, “An algorithm for subgraph isomorphism”, J. ACM, 23:31-42, 1976 • N. Vanetik, E. Gudes, and S. E. Shimony, “Computing frequent graph patterns from semistructured data”, ICDM'02 • C. Wang, W. Wang, J. Pei, Y. Zhu, and B. Shi, “Scalable mining of large disk-based graph databases”, KDD'04 • T. Washio and H. Motoda, “State of the art of graph-based data mining”, SIGKDD Explorations, 5:59-68, 2003 • X. Yan and J. Han, “gSpan: Graph-Based Substructure Pattern Mining”, ICDM'02 • X. Yan and J. Han, “CloseGraph: Mining Closed Frequent Graph Patterns”, KDD'03 • X. Yan, P. S. Yu, and J. Han, “Graph Indexing: A Frequent Structure-based Approach”, SIGMOD'04 • X. Yan, X. J. Zhou, and J. Han, “Mining Closed Relational Graphs with Connectivity Constraints”, KDD'05 • X. Yan, P. S. Yu, and J. Han, “Substructure Similarity Search in Graph Databases”, SIGMOD'05 • X. Yan, F. Zhu, J. Han, and P. S. Yu, “Searching Substructures with Superimposed Distance”, ICDE'06 • M. J. Zaki, “Efficiently mining frequent trees in a forest”, KDD'02
