
Model-based clustering of gene expression data





Presentation Transcript


  1. Model-based clustering of gene expression data Ka Yee Yeung1, Chris Fraley2, Alejandro Murua3, Adrian E. Raftery2, and Walter L. Ruzzo1 1Department of Computer Science & Engineering, University of Washington 2Department of Statistics, University of Washington 3Insightful Corporation CAIMS 2001, UW CSE Computational Biology Group

  2. Model-based clustering of gene expression data Walter L. Ruzzo University of Washington Computer Science and Engineering CAIMS 2001, UW CSE Computational Biology Group

  3. Overview • Overview of gene expression data • Model-based clustering • Data sets • Results • Summary and Conclusions

  4. A gene expression data set • An n × p matrix X: n genes (rows) × p experiments (columns), with Xij the expression level of gene i in experiment j • Each chip is one experiment: • time course • drug response • tissue samples (normal/cancer) • … • Snapshot of the activities of all genes in the cell

  5. Analysis of expression data? • Hard problem • High dimensionality • Noise & artifacts • Goals loosely defined, etc. • Common exploratory analysis technique: Clustering

  6. Why Cluster? • Find biologically related genes • Tissue classification • Many algorithms • Most heuristic-based • No clear winner • How to cluster?

  7. Overview • Overview of gene expression data • Model-based clustering • Data sets • Results • Summary and Conclusions

  8. Model-based clustering • Mixture model: • Each mixture component is generated by a specified underlying probability distribution • Gaussian mixture model: • Each component k is modeled by the multivariate normal distribution with parameters: • Mean vector: μk • Covariance matrix: Σk • EM algorithm: used to estimate the parameters
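The mixture density above can be sketched numerically. A minimal illustration with toy parameters (the helper `mixture_density` is hypothetical, not part of the authors' software):

```python
import numpy as np

def mixture_density(x, weights, means, covs):
    """Density of a Gaussian mixture at point x: sum of weighted
    multivariate normal densities, one per component."""
    d = len(x)
    total = 0.0
    for w, mu, S in zip(weights, means, covs):
        diff = x - mu
        norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(S))
        total += w * np.exp(-0.5 * diff @ np.linalg.solve(S, diff)) / norm
    return total

# Toy two-component mixture in 2 dimensions
w = [0.5, 0.5]
mu = [np.zeros(2), np.array([3.0, 3.0])]
S = [np.eye(2), np.eye(2)]
print(mixture_density(np.zeros(2), w, mu, S))  # ≈ 0.0796
```

At the origin the first component contributes almost all of the density, the second nearly nothing, which matches the intuition that each point is "explained" mostly by its own cluster.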

  9. Covariance models (Banfield & Raftery 1993) • Σk = λkDkAkDkᵀ, where λk gives the volume, Dk the orientation, and Ak the shape of component k • Equal volume spherical model (EI): Σk = λI (~ k-means) • Unequal volume spherical model (VI): Σk = λkI • Diagonal model: Σk = λkBk, where Bk is diagonal with |Bk| = 1 • EEE elliptical model: Σk = λDADᵀ • Unconstrained model (VVV): Σk = λkDkAkDkᵀ • More flexible and more descriptive models require more parameters
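A quick numerical sketch of the Σk = λkDkAkDkᵀ parameterization (toy 2-D numbers, not from any data set in the talk): build a covariance from its volume, orientation, and shape factors and check that the volume is controlled by λ alone.

```python
import numpy as np

# Sigma = lam * D @ A @ D.T
lam = 2.0                                          # volume (scalar)
theta = np.pi / 6
D = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])    # orientation (orthogonal)
A = np.diag([4.0, 0.25])                           # shape (diagonal, |A| = 1)
Sigma = lam * D @ A @ D.T

assert np.isclose(np.linalg.det(A), 1.0)
# det(Sigma) = lam**d * det(A) = lam**d, so lam alone sets the volume
print(np.linalg.det(Sigma))  # ≈ lam**2 = 4.0
```

Constraining the factors recovers the named models: forcing A = I and D = I gives the spherical models (EI/VI), sharing λ, D, A across components gives EEE, and leaving everything free per component gives VVV.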

  10. Advantages of model-based clustering • Higher-quality clusters • Flexible models • Model selection: choose the right model and the right number of clusters • Bayesian Information Criterion (BIC): • Approximate Bayes factor: posterior odds for one model against another • A large BIC score indicates strong evidence for the corresponding model.
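BIC-based model selection can be sketched in a few lines; the log-likelihood and parameter-count values below are invented for illustration, not taken from the talk's results. The point is the trade-off: a richer model must buy enough extra likelihood to pay its log(n) penalty per parameter.

```python
import math

def bic(loglik, n_params, n_obs):
    # BIC = 2 * log-likelihood - (#free params) * log(n); larger is better
    return 2 * loglik - n_params * math.log(n_obs)

# Hypothetical fits on n = 200 observations: (model, #clusters) -> BIC
candidates = {
    ("EI", 3):  bic(-512.0, 10, 200),
    ("VVV", 3): bic(-480.0, 20, 200),
    ("VVV", 8): bic(-470.0, 50, 200),
}
best = max(candidates, key=candidates.get)
print(best)  # ('VVV', 3): VVV/8's extra parameters don't pay for themselves
```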

  11. Overview • Overview of gene expression data • Model-based clustering • Data sets • Gene expression data sets • Synthetic data sets • Results • Summary and Conclusions

  12. Our Goal • To show the model-based approach has superior performance with respect to: • Quality of clusters • Number of clusters and model chosen (BIC)

  13. Our Approach • Validate the results from model-based clustering using data sets with external criteria (BIC scores do not require external criteria) • To compare clusters with an external criterion: • Adjusted Rand index (Hubert and Arabie 1985) • Adjusted Rand index = 1: perfect agreement • Two random partitions have an expected index of 0 • Compare the quality of clusters with a leading heuristic-based algorithm: CAST (Ben-Dor & Yakhini 1999)
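The adjusted Rand index compares all pairs of items: pairs grouped together (or apart) in both partitions count as agreements, and the raw agreement count is corrected for chance. A small pure-Python sketch of the Hubert–Arabie formula:

```python
from collections import Counter
from math import comb

def adjusted_rand(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same items."""
    n = len(labels_a)
    pair = Counter(zip(labels_a, labels_b))      # contingency-table cells
    a = Counter(labels_a)                        # row sums
    b = Counter(labels_b)                        # column sums
    sum_pairs = sum(comb(c, 2) for c in pair.values())
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())
    expected = sum_a * sum_b / comb(n, 2)        # chance agreement
    max_index = (sum_a + sum_b) / 2
    return (sum_pairs - expected) / (max_index - expected)

print(adjusted_rand([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0: identical up to relabeling
```

Note that the index depends only on the partitions, not on the label names, which is exactly what is needed when comparing a clustering to an external class assignment.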

  14. Gene expression data sets • Ovarian cancer data set (Michel Schummer, Institute of Systems Biology) • Subset of data: 235 clones × 24 experiments (cancer/normal tissue samples) • The 235 clones correspond to 4 genes • Yeast cell cycle data (Cho et al. 1998) • 17 time points • Subset of 384 genes associated with 5 phases of the cell cycle

  15. Synthetic data sets (both based on the ovary data) • Gaussian mixture • Generate multivariate normal distributions with the sample covariance matrix and mean vector of each class in the ovary data • Randomly resampled ovary data • For each class, randomly resample the expression levels in each experiment, independently • Yields a near-diagonal covariance matrix
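Both constructions can be sketched as follows. The class matrix here is random stand-in data with the ovary data's dimensions (the real data is not reproduced in the transcript); note how independent per-column resampling destroys between-experiment correlations, which is why the diagonal model fits that data set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for one class of the ovary data: 50 genes x 24 experiments
cls = rng.normal(size=(50, 24))

# (a) Gaussian mixture: draw from N(mean, sample covariance) of the class
mu = cls.mean(axis=0)
S = np.cov(cls, rowvar=False)
gaussian_synth = rng.multivariate_normal(mu, S, size=50)

# (b) Randomly resampled: permute each experiment (column) independently,
# breaking correlations between experiments -> near-diagonal covariance
resampled = np.column_stack(
    [rng.permutation(cls[:, j]) for j in range(cls.shape[1])])

print(gaussian_synth.shape, resampled.shape)
```

The resampled matrix keeps each experiment's marginal distribution exactly (same values, reshuffled across genes), so only the dependence structure changes.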

  16. Overview • Overview of gene expression data • Model-based clustering • Data sets • Results • Synthetic data sets • Expression data sets • Summary and Conclusions

  17. Results: Gaussian mixture based on ovary data (2350 genes) • At 4 clusters, VVV, the diagonal model, and CAST achieve high adjusted Rand indices • BIC selects VVV at 4 clusters.

  18. Results: randomly resampled ovary data • Diagonal model achieves the max adjusted Rand and BIC score (higher than CAST) • BIC max at 4 clusters • Confirms expected result

  19. Results: square root ovary data • Max adjusted Rand at EEE 4 clusters (> CAST) • BIC selects VI at 8 clusters • BIC curve shows a bend at 4 clusters for EEE.

  20. Results: log yeast cell cycle data • CAST achieves much higher adjusted Rand indices than most model-based approaches (except EEE). • BIC scores of EEE much higher than the other models.

  21. Results: standardized yeast cell cycle data • EI slightly higher adjusted Rand indices than CAST at 5 clusters. • BIC selects EEE at 5 clusters.
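Standardization here means rescaling each gene's time course to mean 0 and standard deviation 1, which makes roughly elliptical classes more spherical (hence the good EI performance). A sketch on synthetic values with the yeast data's 384 × 17 shape (random numbers, not the actual measurements):

```python
import numpy as np

def standardize_rows(X):
    """Standardize each gene (row) to mean 0, std 1 across time points."""
    mu = X.mean(axis=1, keepdims=True)
    sd = X.std(axis=1, keepdims=True)
    return (X - mu) / sd

rng = np.random.default_rng(1)
X = rng.normal(loc=5.0, scale=2.0, size=(384, 17))  # 384 genes x 17 time points
Z = standardize_rows(X)
print(np.allclose(Z.mean(axis=1), 0), np.allclose(Z.std(axis=1), 1))
```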

  22. Overview • Overview of gene expression data • Model-based clustering • Data sets • Results • Summary and Conclusions

  23. Summary and Conclusions • Synthetic data sets: • With the correct model, model-based clustering performs better than a leading heuristic clustering algorithm • BIC selects the right model & the right number of clusters • Real expression data sets: • Comparable adjusted Rand indices to CAST • BIC gives a hint to the number of clusters • Time course data: • Standardization makes the classes spherical

  24. Future Work • Custom refinements to the model-based implementation: • Design models that incorporate specific information about the experiments, e.g. a block-diagonal covariance matrix • Missing data • Outliers & noise

  25. Acknowledgements • Ka Yee Yeung1, Chris Fraley2, Alejandro Murua3, Adrian E. Raftery2 1Department of Computer Science & Engineering, University of Washington 2Department of Statistics, University of Washington 3Insightful Corporation • Michèl Schummer (Institute of Systems Biology): the ovary data set • Jeremy Tantrum: help with the MBC software (diagonal model)

  26. More Info http://www.cs.washington.edu/homes/ruzzo CAIMS 2001, UW CSE Computational Biology Group

  27. Model-based clustering • Mixture model: • Assume each component is generated by an underlying probability distribution • Likelihood for the mixture model: L(θ1, …, θG; τ | X) = ∏i Σk τk fk(xi | θk) • Gaussian mixture model: • Each component k is modeled by the multivariate normal distribution with parameters: • Mean vector: μk • Covariance matrix: Σk

  28. Eigenvalue decomposition of the covariance matrix • The covariance matrix Σk determines the geometric features of component k. • Banfield and Raftery (1993): Σk = λkDkAkDkᵀ • Dk: orientation of component k (orthogonal matrix) • Ak: shape of component k (diagonal matrix) • λk: volume of component k (scalar) • Allowing some but not all of the parameters to vary → many models within this framework
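Going in the other direction, the three factors can be recovered from any covariance matrix via its eigendecomposition; a short sketch with a toy 2 × 2 matrix. Since |A| = 1 by convention, λ = det(Σ)^(1/d).

```python
import numpy as np

Sigma = np.array([[3.0, 1.0],
                  [1.0, 2.0]])
evals, D = np.linalg.eigh(Sigma)    # D: orthogonal eigenvectors (orientation)
d = Sigma.shape[0]
lam = np.prod(evals) ** (1.0 / d)   # volume: det(Sigma)**(1/d)
A = np.diag(evals / lam)            # shape: normalized so det(A) = 1

assert np.isclose(np.linalg.det(A), 1.0)
assert np.allclose(lam * D @ A @ D.T, Sigma)   # exact reconstruction
print(lam)
```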

  29. EM algorithm • General approach to maximum likelihood estimation • Iterate between E and M steps: • E step: compute the probability of each observation belonging to each cluster, using the current parameter estimates • M step: estimate the model parameters, using the current group membership probabilities
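The two steps can be made concrete with a minimal EM loop for a one-dimensional, shared-variance mixture (in the spirit of the EI model). This is a didactic sketch on synthetic data, not the MCLUST implementation used in the talk.

```python
import numpy as np

def em_1d(x, k=2, iters=50):
    """Minimal EM for a 1-D Gaussian mixture with a shared variance."""
    mu = np.array([x.min(), x.max()], dtype=float)[:k]  # crude initialization
    var, w = x.var(), np.full(k, 1.0 / k)
    for _ in range(iters):
        # E step: responsibility of each component for each point
        dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) \
                 / np.sqrt(2 * np.pi * var)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M step: re-estimate weights, means, and the shared variance
        nk = r.sum(axis=0)
        w = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum() / len(x)
    return w, mu, var

rng = np.random.default_rng(42)
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(6, 1, 300)])
w, mu, var = em_1d(x)
print(sorted(mu))  # the two estimated means, near the true centers 0 and 6
```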

  30. Definition of the BIC score • The integrated likelihood p(D | Mk) is hard to evaluate, where D is the data and Mk is the model. • BIC = 2 log p(D | θ̂k, Mk) − νk log(n) ≈ 2 log p(D | Mk) • νk: number of parameters to be estimated in model Mk

  31. Randomly resampled synthetic data set • [Figure: genes × experiments matrices by class, comparing the ovary data with the resampled synthetic data]

  32. Adjusted Rand Example • [Worked example shown as a figure in the original slides]

  33. Results: mixture of normal distributions based on ovary data (235 genes) • VVV and CAST: comparable adjusted Rand index at 4 clusters • With only 235 genes, no BIC scores available from VVV. • With bigger data sets (2350 genes), BIC selects VVV at 4 clusters.

  34. Results: log ovary data • Max adjusted Rand: VVV 4 clusters, EEE 4-6 clusters (higher than CAST) • BIC selects VI at 9 clusters • BIC curve shows a bend at 4 clusters for the diagonal model.

  35. Results: standardized ovary data • Comparable adjusted Rand indices for CAST, EI and EEE at 4 clusters. • BIC selects EEE at 7 clusters.

  36. Key advantage of the model-based approach: choose the right model and the number of clusters • Bayesian Information Criterion (BIC): • Approximate Bayes factor: posterior odds for one model against another model • A large BIC score indicates strong evidence for the corresponding model.
