
Integrated Mining of PPI Networks: A Case for Ensemble Clustering



  1. Integrated Mining of PPI Networks: A Case for Ensemble Clustering Srinivasan Parthasarathy Department of Computer Science and Engineering The Ohio State University Joint work with Sitaram Asur and Duygu Ucar

  2. I. Preliminaries and Motivation

  3. Proteins • Central component of cell machinery and life • It is the proteins dynamically generated by a cell that execute the genetic program [Kahn 1995] • Proteins work with other proteins [Von Mering et al 2002] • Form large interaction networks, typically referred to as protein-protein interaction (PPI) networks • Regulate and support each other for specific functionality or processes

  4. Protein-Protein Interaction Networks • Why analyze? • To fully understand cellular machinery, simply listing proteins is not enough – (clusters of) interactions need to be delineated as well [v. Mering 2002] • Understanding the organism • Protein function prediction • E.g., one-third of baker’s yeast proteins have no functional annotations • Drug design • Goal: To find modular clusters

  5. Challenges in analyzing PPI Networks • Noisy data • False positives [Deane 2002], false negatives [Hsu 06] • Existence of Hub Nodes • Particularly problematic for standard clustering and graph partitioning algorithms -- they lead to very large core clusters and not much else! • Proteins can be multi-faceted • Can belong to multiple functional groups – most clustering algorithms produce hard assignments – need for soft or fuzzy clustering • Data Integration Issues • Multiple Sources • 2-Hybrid, Mass Spectrometry, genetic co-occurrence • Different targets • Y2H, Mass Spec – target binding • Gene co-occurrence – target functional • Different weaknesses (each misses certain interactions) • Y2H – translation • Mass spectrometry – transport & sensing

  6. Ensemble Clustering • A useful approach to combine the results from multiple clustering arrangements into a single arrangement based on consensus [SG03] • Objective: Map the clusters obtained by different algorithms to a single clustering arrangement • Our hypothesis: ensemble clustering potentially offers a viable solution to these problems simultaneously • Given the nice theory developed in the context of classification ensembles, it is likely to be particularly useful in a noisy environment • A weak analogy: the audience vote in Who Wants to Be a Millionaire • Naturally handles arrangements produced from different sources or domain-driven segmentation

  7. Ensemble Clustering on PPI networks: Key Questions • What are the base clustering methods and arrangements to use in the context of interaction networks? • How do we handle the influence of noise and hubs? • How do we scale to problems of the size of interaction networks? • How do we address the issue of soft clustering? • How do we address the issue of data integration? • Another day, another time ☺

  8. II. Ensemble Clustering Framework

  9. Birds-eye view (coarse-grained) • Pipeline: scale-free graph → x topology-based similarity metrics → y clustering algorithms → x·y base clustering arrangements → cluster representation → (soft) consensus clustering → final clusters

  10. Similarity Metrics • Central to any clustering algorithm • Key idea: Leverage topological information to determine the similarity between two proteins in the interaction network • With the ensemble approach we are not limited to one! • Metrics: • Clustering coefficient-based (edge-oriented, local) • Edge betweenness-based (edge-oriented, global) • Neighborhood-based (local, non-edge-oriented)

  11. Clustering coefficient-based similarity • Clustering coefficient • "all-my-friends-know-each-other" property • Measures the interconnectivity of a node’s neighbors • Clustering coefficient-based similarity of two connected nodes vi and vj • Measures the contribution of the edge between the nodes towards the clustering coefficient of the nodes • [Figure: example graph with adjacent nodes vi, vj and neighbors 1–6]
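  The formulas on this slide were rendered as images; what follows is a hedged LaTeX reconstruction using the standard clustering coefficient, together with one plausible form of the edge-contribution similarity consistent with the bullet above (the combined drop in the two endpoints' coefficients when the edge is removed):

    \[
    CC(v_i) = \frac{2\,e_i}{k_i\,(k_i - 1)}, \qquad
    S_{CC}(v_i, v_j) = CC(v_i) + CC(v_j) - CC'(v_i) - CC'(v_j)
    \]

  where k_i is the degree of v_i, e_i is the number of edges among v_i's neighbors, and CC' denotes the coefficient recomputed with the edge (v_i, v_j) removed.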

  12. Edge betweenness-based similarity • Shortest path edge betweenness [Newman et al] • "I-am-between-every-pair" property • Computes the fraction of shortest paths passing through an edge • Edges that lie between communities have high values of betweenness • Edge betweenness-based similarity • [Figure: example graph with nodes 1–8]
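  As a minimal sketch, edge betweenness can be computed with networkx; the conversion from betweenness to a similarity (the reciprocal below) is an assumption for illustration, since the slide's exact formula was an image:

    import networkx as nx

    def betweenness_similarity(G):
        # Fraction of shortest paths passing through each edge.
        eb = nx.edge_betweenness_centrality(G)
        # Inter-community edges have high betweenness, so invert it:
        # high betweenness => low similarity (assumed conversion).
        return {edge: 1.0 / (1.0 + b) for edge, b in eb.items()}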

  13. Neighborhood-based similarity • "my-friends-are-your-friends" property • Based on the number of common neighbors between nodes (Czekanowski-Dice metric [Brun et al, 2004]), where Int(i) denotes the interaction neighborhood of node i • [Figure: example graph with nodes 1–6]
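  The metric's formula was an image in the original; a hedged reconstruction of the standard Czekanowski-Dice distance from Brun et al, 2004, with Int(i) taken as the set containing i and its interactors and Δ the symmetric difference, is:

    \[
    D(i, j) = \frac{|Int(i)\,\Delta\,Int(j)|}{|Int(i) \cup Int(j)| + |Int(i) \cap Int(j)|}
    \]

  Two proteins that share most of their neighbors get a small distance; similarity can then be taken as 1 - D(i, j).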

  14. Base Clustering • Base clustering algorithms: Different criteria • kMetis • Repeated bisections • Direct k-way partitioning • Topology-based similarity measures: weight the interactions • Clustering coefficient-based – local, targets FP • Edge betweenness-based – global, targets FP • Neighborhood – local, potentially targets FN & FP • 3 × 3 = 9 arrangements (variance is good!) – see the sketch below • k clusters per arrangement
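  A minimal sketch of the 3 × 3 cross-product; the metric and partitioner callables are hypothetical stand-ins for the three similarity measures and the METIS-style partitioners named above, passed in as parameters:

    def base_arrangements(G, k, metrics, partitioners):
        """Cross every similarity metric with every partitioner.

        metrics: callables G -> {edge: weight} (hypothetical, e.g. the
        three topological measures above).
        partitioners: callables (G, weights, k) -> clustering (hypothetical
        stand-ins for kMetis, repeated bisection, direct k-way).
        """
        return [partition(G, metric(G), k)
                for metric in metrics
                for partition in partitioners]  # 3 x 3 = 9 arrangements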

  15. PCA-based Consensus Technique • Pipeline: Cluster Purification → Dimensionality Reduction → Consensus Clustering

  16. Cluster Purification • Goal: Prune unreliable base clusters • Intra-cluster similarity measure based on shortest paths, where SP(i,j) represents the shortest path between i and j (see the sketch below) • Low intra-cluster distance => high reliability • Remove clusters with low reliability
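  A minimal runnable sketch of purification, assuming reliability is the average pairwise shortest-path distance within a cluster and using an illustrative threshold (the talk's exact formula and cutoff are not given in the transcript):

    import itertools
    import networkx as nx

    def intra_cluster_distance(G, cluster):
        # Average pairwise shortest-path distance SP(i, j) inside one cluster.
        pairs = list(itertools.combinations(cluster, 2))
        if not pairs:
            return 0.0
        total = 0.0
        for i, j in pairs:
            try:
                total += nx.shortest_path_length(G, i, j)
            except nx.NetworkXNoPath:
                total += len(G)  # penalize pairs with no connecting path
        return total / len(pairs)

    def purify(G, base_clusters, threshold=2.0):
        # Low intra-cluster distance => high reliability; keep those clusters.
        return [c for c in base_clusters
                if intra_cluster_distance(G, c) <= threshold]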

  17. Dimensionality Reduction • Cluster membership matrix to represent pruned base clusters • Dimensionality is likely to be high (9 × k) • Clustering is inefficient for high-dimensional data • Distance metric computations do not scale well • A lot of noise and redundancy in the matrix • Solution: Reduce the dimensions of the matrix • Apply logistic PCA • Variant of PCA for binary data (Schein et al, 2003) – see the sketch below
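  A sketch of the matrix construction; ordinary PCA from scikit-learn stands in for logistic PCA (Schein et al, 2003), an explicit substitution since scikit-learn does not ship a logistic PCA:

    import numpy as np
    from sklearn.decomposition import PCA

    def membership_matrix(proteins, clusters):
        # Binary matrix M: M[p, c] = 1 iff protein p is in base cluster c.
        row = {p: i for i, p in enumerate(proteins)}
        M = np.zeros((len(proteins), len(clusters)))
        for c, cluster in enumerate(clusters):
            for p in cluster:
                M[row[p], c] = 1.0
        return M

    def reduce_dimensions(M, n_components=20):
        # Stand-in for logistic PCA: project the binary membership profiles
        # to a low-dimensional space before the consensus step.
        return PCA(n_components=n_components).fit_transform(M)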

  18. Consensus Clustering • Agglomerative Hierarchical Clustering • Bottom-up clustering algorithm • Begin with each point in a separate cluster • Iteratively merge clusters that are similar • Recursive Bisection (RBR) algorithm • Soft Clustering Variants • Find initial clusters using agglo or RBR • Assign points to multiple clusters based on similarity • Hub nodes have high propensity for multiple membership
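  A hedged sketch of the consensus step using SciPy's hierarchical clustering; the soft-assignment rule below (join every cluster whose centroid is nearly as close as the nearest one) is an assumption for illustration, not the talk's exact criterion:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    def consensus_clusters(X, k):
        # Bottom-up (agglomerative) clustering of the reduced profiles.
        Z = linkage(X, method="average")
        return fcluster(Z, t=k, criterion="maxclust")

    def soften(X, labels, slack=1.1):
        # Assumed soft rule: a point joins every cluster whose centroid lies
        # within `slack` times the distance to its nearest centroid; hub
        # proteins near several centroids gain multiple memberships.
        centroids = {c: X[labels == c].mean(axis=0) for c in np.unique(labels)}
        memberships = []
        for x in X:
            dists = {c: np.linalg.norm(x - mu) for c, mu in centroids.items()}
            nearest = min(dists.values())
            memberships.append([c for c, d in dists.items()
                                if d <= slack * nearest])
        return memberships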

  19. Ensemble Framework (Detailed View) • Pipeline: topological metrics → weights → weighted graph → base clustering → base clustering arrangements → cluster purification (pruning) → principal component analysis → consensus clustering (agglomerative / RBR / soft) → final clusters • Variants produced: PCA-agglo, PCA-rbr, PCA-soft-variants

  20. III. Evaluation

  21. Validation Metrics: Domain Independent • Topological measure: Modularity [Newman & Girvan 04] • Measures the modularity of the clusters, where dij represents the fraction of edges linking nodes in clusters i and j • Information-theoretic measure: Normalized Mutual Information [Strehl & Ghosh 03] • Measures the shared information between the consensus and base clustering arrangements
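  The slide's formulas were images; hedged reconstructions of the standard forms, consistent with the definitions above, are:

    \[
    Q = \sum_{i} \left( d_{ii} - \Big( \sum_{j} d_{ij} \Big)^{2} \right),
    \qquad
    NMI(\lambda^{(a)}, \lambda^{(b)}) =
      \frac{I(\lambda^{(a)}; \lambda^{(b)})}
           {\sqrt{H(\lambda^{(a)})\, H(\lambda^{(b)})}}
    \]

  Here d_ij is the fraction of edges linking clusters i and j, I(·;·) is the mutual information between two clustering arrangements, and H(·) is the entropy of an arrangement.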

  22. Validation Metric: Domain Dependent • Domain-based measure: • Gene Ontology annotations for each cluster of proteins • Cellular Component • Molecular Function • Biological Process • P-value to measure the statistical significance of clusters (see the sketch below) • Computes the probability of the grouping being random • Smaller p-values represent higher biological significance • Clustering Score to measure the overall clustering arrangement
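  The talk does not name the test, but GO-enrichment p-values of this kind are conventionally computed from the hypergeometric tail; a sketch under that assumption, with illustrative numbers:

    from scipy.stats import hypergeom

    def cluster_pvalue(N, X, n, x):
        # Probability of drawing at least x annotated proteins in a cluster
        # of size n, from N proteins of which X carry the GO annotation.
        return hypergeom.sf(x - 1, N, X, n)

    # Hypothetical example: a 30-protein cluster containing 12 of the 200
    # proteins annotated with some GO term, in the 4928-protein network.
    # cluster_pvalue(4928, 200, 30, 12)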

  23. Experimental Setup • Algorithms proposed by Strehl et al, 2003 • HyperGraph Partitioning Algorithm (HGPA) • Minimal hyperedge separator using hMETIS • Meta-CLustering Algorithm (MCLA) • Groups related hyperedges to form meta-clusters • Assigns each point to the closest meta-cluster • Cluster-based Similarity Partitioning (CSPA) • Pairwise similarity matrix is partitioned with METIS • Algorithms proposed by Gionis et al, ICDE 2005 • Agglomerative algorithm (CE-agglo) • Density-based clustering algorithm (CE-balls) • Use strict thresholds and are non-parametric • Dataset: Database of Interacting Proteins (DIP) • 4928 proteins, 17194 interactions

  24. Modularity and NMI • The CSPA algorithm ran out of memory • The CE-agglo and CE-balls algorithms resulted in pairs and singleton clusters (2121 and 2783 clusters, respectively) • PCA-based consensus methods provide the best scores! • [Results chart omitted]

  25. Comparison with Ensemble Algorithms • PCA-based consensus methods outperform all other algorithms! • MCLA performs best of the other algorithms • [Comparison chart omitted]

  26. Existing Solutions to Identify Dense Regions • Molecular Complex Detection (MCODE) • Bader et al, 2003 • Uses local neighborhood density to identify seed vertices • Groups highly weighted vertices around seed vertices • Markov Cluster Algorithm (MCL) • van Dongen, 2000 • Random walks on the graph will infrequently go from one natural cluster to another • Cluster structure separates out (see the sketch below) • Fast, scalable and non-parametric
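  A minimal sketch of the MCL iteration just described, alternating expansion (matrix squaring, i.e., longer random walks) and inflation (elementwise power plus renormalization); the parameters are illustrative, and the real implementation's convergence and pruning details are omitted:

    import numpy as np

    def mcl(A, inflation=2.0, iterations=50):
        # A: symmetric adjacency matrix of the PPI network.
        M = A + np.eye(len(A))        # self-loops stabilize the walk
        M = M / M.sum(axis=0)         # make columns stochastic
        for _ in range(iterations):
            M = M @ M                 # expansion: flow spreads along walks
            M = M ** inflation        # inflation: strengthen strong edges
            M = M / M.sum(axis=0)     # renormalize columns
        # Clusters are read off the converged matrix: each row with nonzero
        # entries is an attractor; its nonzero columns form one cluster.
        return M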

  27. Comparison with MCODE and MCL • MCODE produced only 59 clusters • Not all proteins clustered (794/4928) • 10-20 clusters insignificant • MCL produced 1246 clusters • Most of the clusters insignificant (close to 75-80%)

  28. Soft Clustering: Comparison with Hub Duplication (Ucar 2006) • [Flowchart: for each hub Hi, build the hub-induced subgraph Si, find the dense components of Si, duplicate Hi into each dense component (Hi, D’i), repeat for the next hub, then apply graph partitioning]

  29. Benefits of Soft Ensemble Clustering

  30. A closer look at soft clustering performance • CKA1 (hub protein)

  31. Concluding Remarks • Clustering PPI networks is challenging • Noise • Presence of hubs • Need for soft clustering • Integration • Ensemble clustering shows promise as a unified method to handle these problems • Competes well against existing stand-alone solutions • Scalable -- straightforward parallelization for the most part • Ongoing work • General applicability • WWW applications • Social network analysis • Explicit modeling of domain knowledge • E.g. encoding directionality • Data Integration • Key is to weight edges and/or components of the ensemble • Leveraging graphical models • More robust base models • Extrinsic similarity measures • Impact of anomalies

  32. Questions? • We acknowledge the following grants for support • NSF: CAREER-IIS-0347662 • NSF: NGS-CNS-0406386 • NSF: RI-CNS-0403342 • DOE: ECPI-FG02 • Graduate Student Colleagues • S. Asur and D. Ucar • Details • http://dmrl.cse.ohio-state.edu • www.cse.ohio-state.edu/~srini/
