
Mining Coherent Dense Subgraphs across Multiple Biological Networks

Mining Coherent Dense Subgraphs across Multiple Biological Networks. Vahid Mirjalili, CSE 891. Motivation: finding patterns across multiple networks to identify biological modules and predict function. Current algorithms are too costly. A novel algorithm was developed: CODENSE.


Presentation Transcript


  1. Mining Coherent Dense Subgraphs across Multiple Biological Networks • Vahid Mirjalili, CSE 891

  2. Motivation: • Find patterns across multiple networks to identify biological modules and predict function • Current algorithms are too costly • A novel algorithm was developed: CODENSE • Scalable in the number and size of networks • Adjustable for exact or approximate pattern mining

  3. Clustering can detect meaningful biological modules • e.g. a dense protein interaction sub-network may correspond to a protein complex • Dense co-expression sub-network may represent a co-expression cluster • Biological modules are expected to be active across multiple conditions • One idea: aggregate all the networks and identify dense sub-graphs in the aggregated network • Risk of false positive detection

  4. Aggregated graph: false positives in the aggregated graph • Adding six graphs together and deleting the edges that occur fewer than 3 times yields the summary graph

  5. Solution to the false-positive summary graph • Frequent sub-graphs • Mine the dense sub-graphs directly in each original network • A sub-graph is frequent if it occurs multiple times in a set of graphs • In biological networks, each gene occurs only once in a graph → no isomorphism problem

  6. Frequent dense sub-graph • A frequent dense sub-graph does not always give accurate information • Some edges in the frequent sub-graph shown above do not occur in the original set • It is more meaningful to divide it into two sub-graphs

  7. Coherent Dense Sub-graphs • All edges in a coherent sub-graph should have correlated occurrences in the original graph set • CODENSE reduces the networks to 2 meta-graphs and performs clustering on these two graphs only (instead of on the individual networks) • CODENSE can distinguish the two modules • Good scalability • Discovery of overlapping clusters

  8. Overlapping Sub-graphs • Partition-based clustering algorithms fail to identify overlapping sub-graphs • Mining Overlapping Dense Sub-graphs (MODES)

  9. Application • Identify frequent co-expression clusters across multiple microarray datasets • Microarray dataset: • Un-weighted, undirected graph • Each node represents a gene • Two genes are connected by an edge if they show high expression correlation • A densely connected sub-graph → tight co-expression cluster • Clusters from a single microarray dataset include spurious links and may not be homogeneous in function and regulation

  10. Problem Formulation • A relation graph dataset $D = \{G_1, G_2, \ldots, G_n\}$ contains n simple graphs, where $G_i = (V, E_i)$ • A common vertex set V is shared by the graphs • Support(G): the number of graphs in the relation graph dataset (D) that contain G • A graph is frequent if support(G) > threshold • Summary graph: an un-weighted graph extracted from D, in which an edge exists only if it occurs in more than k graphs in D
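The support and summary-graph definitions on this slide can be illustrated with a minimal Python sketch. It assumes each graph is given as a list of undirected edges over the shared vertex set; the function names `edge_key`, `summary_graph` and `support` are illustrative and do not come from the slides.

```python
from collections import Counter


def edge_key(u, v):
    """Canonical key for an undirected edge, so (a, b) equals (b, a)."""
    return (u, v) if u <= v else (v, u)


def summary_graph(graphs, k):
    """Return the edges that occur in more than k of the input graphs."""
    counts = Counter()
    for edges in graphs:
        # Deduplicate inside one graph so each graph counts at most once.
        counts.update({edge_key(u, v) for u, v in edges})
    return {edge for edge, count in counts.items() if count > k}


def support(subgraph_edges, graphs):
    """Number of graphs that contain every edge of the given sub-graph."""
    wanted = {edge_key(u, v) for u, v in subgraph_edges}
    return sum(
        wanted <= {edge_key(u, v) for u, v in edges} for edges in graphs
    )


# Toy dataset (made-up edges): three graphs over the vertices a, b, c, d.
graphs = [
    [("a", "b"), ("b", "c")],
    [("b", "a"), ("c", "d")],
    [("a", "b"), ("b", "c"), ("c", "d")],
]
print(sorted(summary_graph(graphs, k=2)))
print(support([("a", "b"), ("b", "c")], graphs))
```

Because every gene appears only once per network, matching a sub-graph reduces to a set-inclusion test on edges, which is why no isomorphism search is needed here.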

  11. Problem Formulation • Edge Support Vector: $w(e) = [w_1(e), w_2(e), \ldots, w_n(e)]$, where $w_i(e)$ is the weight of edge e in graph i (for an un-weighted graph it is 0 or 1)

  12. Second-Order Graph (S): a graph in which each node represents an edge from the relation graph dataset (D), and an edge between nodes u and v exists if w(u) and w(v) are highly correlated • For efficiency, construct S only for a sub-graph of the summary graph
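A small sketch of how edge support vectors and a second-order graph might be built for un-weighted networks. It assumes that "highly correlated" is approximated by a Euclidean-distance cutoff between 0/1 support vectors; the cutoff parameter `max_distance` and the function names are hypothetical choices for illustration.

```python
from itertools import combinations
from math import dist


def support_vectors(graphs, edges):
    """Map each edge to its 0/1 occurrence vector across the graphs."""
    edge_sets = [{frozenset(e) for e in g} for g in graphs]
    return {
        frozenset(e): [int(frozenset(e) in s) for s in edge_sets]
        for e in edges
    }


def second_order_graph(vectors, max_distance):
    """Link two first-order edges when their support vectors are close.

    Assumption: similarity is a Euclidean-distance cutoff on the vectors.
    """
    links = set()
    for e1, e2 in combinations(vectors, 2):
        if dist(vectors[e1], vectors[e2]) <= max_distance:
            links.add(frozenset((e1, e2)))
    return links


# Toy dataset (made-up edges): four graphs over the vertices a, b, c, d.
graphs = [
    [("a", "b"), ("b", "c")],
    [("a", "b"), ("b", "c")],
    [("c", "d")],
    [("c", "d"), ("a", "b")],
]
vectors = support_vectors(graphs, [("a", "b"), ("b", "c"), ("c", "d")])
links = second_order_graph(vectors, max_distance=1.0)
print(len(links))
```

In this toy data only the edges a-b and b-c end up linked, because they co-occur in the same graphs, while c-d follows a different occurrence pattern.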

  13. Coherent Graph: a sub-graph extracted from the summary graph is coherent if • All its edges have support > k • Its second-order graph is dense • Graph density: $d = \frac{2m}{n(n-1)}$, where m is the number of edges and n is the number of nodes
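The two coherence conditions can be checked with a few lines of Python. This sketch assumes the usual density of an undirected simple graph, 2m / (n(n-1)), and takes the per-edge supports and the second-order links as precomputed inputs; `is_coherent` and the threshold `min_density` are illustrative names, not part of the slides.

```python
def density(n_nodes, n_edges):
    """Fraction of the possible undirected edges that are present."""
    if n_nodes < 2:
        return 0.0
    return 2 * n_edges / (n_nodes * (n_nodes - 1))


def is_coherent(edge_support, second_order_links, k, min_density):
    """Check both coherence conditions for one candidate sub-graph.

    edge_support: dict mapping each edge to the number of graphs with it.
    second_order_links: set of linked edge pairs in the second-order graph.
    """
    if any(s <= k for s in edge_support.values()):
        return False
    # Nodes of the second-order graph are the edges of the sub-graph.
    return density(len(edge_support), len(second_order_links)) >= min_density


# Made-up example: three edges, all pairwise linked in the second-order graph.
edge_support = {"e1": 3, "e2": 3, "e3": 4}
second_order_links = {("e1", "e2"), ("e2", "e3"), ("e1", "e3")}
print(is_coherent(edge_support, second_order_links, k=2, min_density=0.8))
```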

  14. Two facts: • If a frequent sub-graph is dense, then it must be dense in the summary graph as well, but the converse does not always hold • If a sub-graph is coherent (its edges have high correlation across the dataset), then its second-order sub-graph is dense

  15. Aggregate the graphs into a summary graph • Eliminate infrequent edges

  16. MODES: Mining Overlapping DEnse Subgraphs • Developed based on HCS: Highly Connected Sub-graphs • Can efficiently identify dense sub-graphs • Can mine overlapping sub-graphs • Two approaches: • Minimum cut • Normalized cut (Shi, Malik 2000) • Apply the normalized cut in the initial steps of the HCS algorithm, then proceed with the minimum cut once the partitions are small
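To make the recursive-cut idea concrete, here is a simplified HCS-style sketch. It is not MODES itself: it uses only the minimum cut (via a plain Stoer-Wagner implementation), stops when a part reaches a density threshold, and does not handle the normalized cut or overlapping clusters. The graph is assumed to be simple (no self-loops, no duplicate edges), and the names `min_cut`, `dense_clusters`, `min_density` and `min_size` are illustrative.

```python
from math import inf


def density(n_nodes, n_edges):
    """Fraction of the possible undirected edges that are present."""
    return 0.0 if n_nodes < 2 else 2 * n_edges / (n_nodes * (n_nodes - 1))


def min_cut(nodes, edges):
    """Stoer-Wagner global minimum cut: returns (cut weight, one side)."""
    nodes = list(nodes)
    idx = {v: i for i, v in enumerate(nodes)}
    n = len(nodes)
    w = [[0] * n for _ in range(n)]
    for u, v in edges:
        w[idx[u]][idx[v]] += 1
        w[idx[v]][idx[u]] += 1
    groups = [[v] for v in nodes]  # original vertices merged into each index
    active = list(range(n))
    best = (inf, [])
    while len(active) > 1:
        prev = active[0]
        attach = {v: w[prev][v] for v in active[1:]}
        while attach:
            nxt = max(attach, key=attach.get)  # most tightly attached vertex
            cut_weight = attach.pop(nxt)
            for v in attach:
                attach[v] += w[nxt][v]
            if attach:
                prev = nxt
                continue
            # nxt was added last: its attachment is the cut of this phase.
            if cut_weight < best[0]:
                best = (cut_weight, list(groups[nxt]))
            # Merge the last vertex into the second-to-last one.
            groups[prev] += groups[nxt]
            for v in active:
                w[prev][v] += w[nxt][v]
                w[v][prev] = w[prev][v]
            active.remove(nxt)
    return best


def dense_clusters(nodes, edges, min_density, min_size=3):
    """Split recursively by minimum cut until each part is dense enough."""
    nodes = set(nodes)
    edges = [(u, v) for u, v in edges if u in nodes and v in nodes]
    if len(nodes) < min_size:
        return []
    if density(len(nodes), len(edges)) >= min_density:
        return [nodes]
    _, side = min_cut(nodes, edges)
    side = set(side)
    return (dense_clusters(side, edges, min_density, min_size)
            + dense_clusters(nodes - side, edges, min_density, min_size))


# Two triangles joined by a single bridge edge (made-up example).
edges = [(1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 6), (3, 4)]
print(sorted(sorted(c) for c in dense_clusters(range(1, 7), edges, 1.0)))
```

The example graph has density 7/15 as a whole, so it is cut at the bridge edge and each triangle is then reported as a dense cluster.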

  17. C

  18. CODENSE analysis • Simplifies the identification of coherent dense sub-graphs across n graphs into mining in two special graphs: summary graph + second-order graph • Can mine network modules • Can mine both exact and approximate patterns (by modifying the similarity threshold) • Can be extended to weighted graphs (using Pearson correlation instead of Euclidean distance)

  19. Experimental Study: co-expression network • 39 yeast microarray datasets • 6661 genes • Calculate the Pearson correlation (r) between the expression levels • Construct the relation graph (connectivity of two genes is determined by the Pearson correlation) • n: number of measurements
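A minimal sketch of turning one expression dataset into a co-expression graph. It assumes a plain cutoff on the raw Pearson correlation, which is a simplification chosen for illustration; the slide does not show the exact connectivity rule, and the gene names, values and the `min_r` parameter below are made up.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long measurement vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile carries no correlation signal
    return sxy / sqrt(sxx * syy)


def coexpression_graph(expression, min_r):
    """Connect two genes when their correlation is at least min_r.

    Assumption: a raw cutoff on r stands in for the real connectivity rule.
    """
    edges = set()
    for g1, g2 in combinations(sorted(expression), 2):
        if pearson(expression[g1], expression[g2]) >= min_r:
            edges.add((g1, g2))
    return edges


# Hypothetical expression profiles: n = 4 measurements per gene.
expression = {
    "g1": [1, 2, 3, 4],
    "g2": [2, 4, 6, 8],
    "g3": [4, 3, 2, 1],
    "g4": [1, 3, 2, 4],
}
print(coexpression_graph(expression, min_r=0.9))
```

Running this once per microarray dataset gives one graph per dataset over the same gene set, which is the input shape the relation graph dataset requires.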

  20. Create the summary graph, removing edges that occur fewer than 6 times across the 39 graphs • Apply MODES to the summary graph to identify dense sub-graphs with cutoff density d1 • For each dense sub-graph, construct the second-order graph S • Apply MODES to S to identify sub-graphs with density > d2 • Transform the edges → vertices, and apply MODES again to identify the dense sub-graphs with density > d3

  21. Functional Module Discovery: MODES vs CODENSE • A cluster is considered functionally homogeneous if: • Its functional homogeneity, modeled by the hypergeometric distribution, is significant at α=0.01 • At least 40% of its member genes belong to a specific G.O. functional category • MODES identified 366 clusters, but only 151 were functionally homogeneous (42%) • CODENSE identified 770 clusters, of which 76% were homogeneous • The improvement is due to the second-order graph, which eliminates edges that do not show co-occurrence across all networks
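The two homogeneity criteria can be sketched as an upper-tail hypergeometric test plus a fraction check. This assumes the p-value is P(X >= x) for a cluster of c genes drawn from N genes, K of which carry the category, and it ignores any multiple-testing correction; the category size of 200 in the example is made up, while 6661 is the gene count from the experimental study.

```python
from math import comb


def hypergeom_pvalue(N, K, c, x):
    """P(X >= x): chance of seeing at least x annotated genes in a cluster.

    N: genes in total, K: genes in the category, c: cluster size.
    """
    upper = min(K, c)
    hits = sum(comb(K, i) * comb(N - K, c - i) for i in range(x, upper + 1))
    return hits / comb(N, c)


def is_homogeneous(N, K, c, x, alpha=0.01, min_fraction=0.4):
    """Apply both criteria: enough member genes and a significant p-value."""
    return x / c >= min_fraction and hypergeom_pvalue(N, K, c, x) < alpha


# 6661 genes in total; the category size (200) is a hypothetical value.
print(is_homogeneous(N=6661, K=200, c=10, x=5))
print(is_homogeneous(N=6661, K=200, c=10, x=1))
```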

  22. Example of MODES false positive: MODES identified 5 genes (MSF1, PHB1, CBP4, NDI1, SCO2) which are not functionally homogeneous • Functions shown: protein biosynthesis, replicative cell aging, mitochondrial electron transfer

  23. Functional prediction: • CODENSE identified this 6-node sub-graph • 5 genes belong to the “protein biosynthesis” category • Predict: ASC1 must be involved in protein biosynthesis as well • Test with 448 known genes: 50% accuracy
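The prediction step can be illustrated as a dominant-category vote inside a cluster. This is a sketch under stated assumptions: each annotated gene has a single category, the `min_fraction` cutoff is a hypothetical parameter, and `gene1` to `gene5` are placeholders for the five annotated cluster members, which the slide does not name.

```python
from collections import Counter


def predict_function(cluster, annotations, min_fraction=0.8):
    """Assign the dominant category to the unannotated genes of a cluster.

    Returns {gene: category}, or {} when no category covers at least
    min_fraction of the cluster.
    """
    known = [annotations[g] for g in cluster if g in annotations]
    if not known:
        return {}
    category, count = Counter(known).most_common(1)[0]
    if count / len(cluster) < min_fraction:
        return {}
    return {g: category for g in cluster if g not in annotations}


# Placeholder names for the five annotated genes; ASC1 is the unknown one.
cluster = ["gene1", "gene2", "gene3", "gene4", "gene5", "ASC1"]
annotations = {g: "protein biosynthesis" for g in cluster[:5]}
print(predict_function(cluster, annotations))
```

Accuracy can then be estimated in the spirit of the slide by hiding the annotation of known genes one at a time and comparing the prediction with the hidden label.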
