
Model-based clustering of gene expression data





Presentation Transcript


  1. Model-based clustering of gene expression data Ka Yee Yeung1, Chris Fraley2, Alejandro Murua3, Adrian E. Raftery2, and Walter L. Ruzzo1 1Department of Computer Science & Engineering, University of Washington 2Department of Statistics, University of Washington 3Insightful Corporation CAIMS 2001, UW CSE Computational Biology Group

  2. Model-based clustering of gene expression data Walter L. Ruzzo University of Washington Computer Science and Engineering CAIMS 2001, UW CSE Computational Biology Group

  3. Overview • Overview of gene expression data • Model-based clustering • Data sets • Results • Summary and Conclusions

  4. A gene expression data set • An n × p matrix X: n genes (rows) × p experiments (columns), with Xij the expression level of gene i in experiment j • Each chip is one experiment: • time course • drug response • tissue samples (normal/cancer) • … • Snapshot of the activities of all genes in the cell

  5. Analysis of expression data? • Hard problem • High dimensionality • Noise & artifacts • Goals loosely defined, etc. • Common exploratory analysis technique: Clustering

  6. Why Cluster? • Find biologically related genes • Tissue classification • Many algorithms • Most heuristic-based • No clear winner • How to cluster?

  7. Overview • Overview of gene expression data • Model-based clustering • Data sets • Results • Summary and Conclusions

  8. Model-based clustering • Mixture model: • Each mixture component is generated by a specified underlying probability distribution • Gaussian mixture model: • Each component k is modeled by the multivariate normal distribution with parameters: • Mean vector: μk • Covariance matrix: Σk • EM algorithm: used to estimate the parameters
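The mixture density above can be sketched numerically. A minimal illustration with toy parameters (the helper `mixture_density` is hypothetical, not part of the authors' software):

```python
import numpy as np

def mixture_density(x, weights, means, covs):
    """Density of a Gaussian mixture at point x: sum of weighted
    multivariate normal densities, one per component."""
    d = len(x)
    total = 0.0
    for w, mu, S in zip(weights, means, covs):
        diff = x - mu
        norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(S))
        total += w * np.exp(-0.5 * diff @ np.linalg.solve(S, diff)) / norm
    return total

# Toy two-component mixture in 2 dimensions
w = [0.5, 0.5]
mu = [np.zeros(2), np.array([3.0, 3.0])]
S = [np.eye(2), np.eye(2)]
print(mixture_density(np.zeros(2), w, mu, S))  # ≈ 0.0796
```

At the origin the first component contributes almost all of the density, the second nearly nothing, which matches the intuition that each point is "explained" mostly by its own cluster.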

  9. Covariance models (Banfield & Raftery 1993) • Σk = λkDkAkDkᵀ, where λk gives the volume, Dk the orientation, and Ak the shape of component k • Equal volume spherical model (EI): Σk = λI (~ k-means) • Unequal volume spherical model (VI): Σk = λkI • Diagonal model: Σk = λkBk, where Bk is diagonal with |Bk| = 1 • EEE elliptical model: Σk = λDADᵀ • Unconstrained model (VVV): Σk = λkDkAkDkᵀ • More flexible and more descriptive models require more parameters
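A quick numerical sketch of the Σk = λkDkAkDkᵀ parameterization (toy 2-D numbers, not from any data set in the talk): build a covariance from its volume, orientation, and shape factors and check that the volume is controlled by λ alone.

```python
import numpy as np

# Sigma = lam * D @ A @ D.T
lam = 2.0                                          # volume (scalar)
theta = np.pi / 6
D = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])    # orientation (orthogonal)
A = np.diag([4.0, 0.25])                           # shape (diagonal, |A| = 1)
Sigma = lam * D @ A @ D.T

assert np.isclose(np.linalg.det(A), 1.0)
# det(Sigma) = lam**d * det(A) = lam**d, so lam alone sets the volume
print(np.linalg.det(Sigma))  # ≈ lam**2 = 4.0
```

Constraining the factors recovers the named models: forcing A = I and D = I gives the spherical models (EI/VI), sharing λ, D, A across components gives EEE, and leaving everything free per component gives VVV.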

  10. Advantages of model-based clustering • Higher-quality clusters • Flexible models • Model selection: choose the right model and the right number of clusters • Bayesian Information Criterion (BIC): • Approximate Bayes factor: posterior odds for one model against another • A large BIC score indicates strong evidence for the corresponding model.
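BIC-based model selection can be sketched in a few lines; the log-likelihood and parameter-count values below are invented for illustration, not taken from the talk's results. The point is the trade-off: a richer model must buy enough extra likelihood to pay its log(n) penalty per parameter.

```python
import math

def bic(loglik, n_params, n_obs):
    # BIC = 2 * log-likelihood - (#free params) * log(n); larger is better
    return 2 * loglik - n_params * math.log(n_obs)

# Hypothetical fits on n = 200 observations: (model, #clusters) -> BIC
candidates = {
    ("EI", 3):  bic(-512.0, 10, 200),
    ("VVV", 3): bic(-480.0, 20, 200),
    ("VVV", 8): bic(-470.0, 50, 200),
}
best = max(candidates, key=candidates.get)
print(best)  # ('VVV', 3): VVV/8's extra parameters don't pay for themselves
```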

  11. Overview • Overview of gene expression data • Model-based clustering • Data sets • Gene expression data sets • Synthetic data sets • Results • Summary and Conclusions

  12. Our Goal • To show the model-based approach has superior performance with respect to: • Quality of clusters • Number of clusters and model chosen (BIC)

  13. Our Approach • Validate the results from model-based clustering using data sets with external criteria (BIC scores do not require external criteria) • To compare clusters with an external criterion: • Adjusted Rand index (Hubert and Arabie 1985) • Adjusted Rand index = 1: perfect agreement • Two random partitions have an expected index of 0 • Compare the quality of clusters with a leading heuristic-based algorithm: CAST (Ben-Dor & Yakhini 1999)
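The adjusted Rand index compares all pairs of items: pairs grouped together (or apart) in both partitions count as agreements, and the raw agreement count is corrected for chance. A small pure-Python sketch of the Hubert–Arabie formula:

```python
from collections import Counter
from math import comb

def adjusted_rand(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same items."""
    n = len(labels_a)
    pair = Counter(zip(labels_a, labels_b))      # contingency-table cells
    a = Counter(labels_a)                        # row sums
    b = Counter(labels_b)                        # column sums
    sum_pairs = sum(comb(c, 2) for c in pair.values())
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())
    expected = sum_a * sum_b / comb(n, 2)        # chance agreement
    max_index = (sum_a + sum_b) / 2
    return (sum_pairs - expected) / (max_index - expected)

print(adjusted_rand([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0: identical up to relabeling
```

Note that the index depends only on the partitions, not on the label names, which is exactly what is needed when comparing a clustering to an external class assignment.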

  14. Gene expression data sets • Ovarian cancer data set (Michel Schummer, Institute of Systems Biology) • Subset of data: 235 clones × 24 experiments (cancer/normal tissue samples) • The 235 clones correspond to 4 genes • Yeast cell cycle data (Cho et al. 1998) • 17 time points • Subset of 384 genes associated with 5 phases of the cell cycle

  15. Synthetic data sets (both based on the ovary data) • Gaussian mixture • Generate multivariate normal distributions with the sample covariance matrix and mean vector of each class in the ovary data • Randomly resampled ovary data • For each class, randomly resample the expression levels in each experiment, independently • Yields a near-diagonal covariance matrix
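Both constructions can be sketched as follows. The class matrix here is random stand-in data with the ovary data's dimensions (the real data is not reproduced in the transcript); note how independent per-column resampling destroys between-experiment correlations, which is why the diagonal model fits that data set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for one class of the ovary data: 50 genes x 24 experiments
cls = rng.normal(size=(50, 24))

# (a) Gaussian mixture: draw from N(mean, sample covariance) of the class
mu = cls.mean(axis=0)
S = np.cov(cls, rowvar=False)
gaussian_synth = rng.multivariate_normal(mu, S, size=50)

# (b) Randomly resampled: permute each experiment (column) independently,
# breaking correlations between experiments -> near-diagonal covariance
resampled = np.column_stack(
    [rng.permutation(cls[:, j]) for j in range(cls.shape[1])])

print(gaussian_synth.shape, resampled.shape)
```

The resampled matrix keeps each experiment's marginal distribution exactly (same values, reshuffled across genes), so only the dependence structure changes.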

  16. Overview • Overview of gene expression data • Model-based clustering • Data sets • Results • Synthetic data sets • Expression data sets • Summary and Conclusions

  17. Results: Gaussian mixture based on ovary data (2350 genes) • At 4 clusters, VVV, the diagonal model, and CAST achieve high adjusted Rand indices • BIC selects VVV at 4 clusters.

  18. Results: randomly resampled ovary data • Diagonal model achieves the max adjusted Rand and BIC score (higher than CAST) • BIC max at 4 clusters • Confirms expected result

  19. Results: square root ovary data • Max adjusted Rand at EEE 4 clusters (> CAST) • BIC selects VI at 8 clusters • BIC curve shows a bend at 4 clusters for EEE.

  20. Results: log yeast cell cycle data • CAST achieves much higher adjusted Rand indices than most model-based approaches (except EEE). • BIC scores of EEE much higher than the other models.

  21. Results: standardized yeast cell cycle data • EI slightly higher adjusted Rand indices than CAST at 5 clusters. • BIC selects EEE at 5 clusters.
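Standardization here means rescaling each gene's time course to mean 0 and standard deviation 1, which makes roughly elliptical classes more spherical (hence the good EI performance). A sketch on synthetic values with the yeast data's 384 × 17 shape (random numbers, not the actual measurements):

```python
import numpy as np

def standardize_rows(X):
    """Standardize each gene (row) to mean 0, std 1 across time points."""
    mu = X.mean(axis=1, keepdims=True)
    sd = X.std(axis=1, keepdims=True)
    return (X - mu) / sd

rng = np.random.default_rng(1)
X = rng.normal(loc=5.0, scale=2.0, size=(384, 17))  # 384 genes x 17 time points
Z = standardize_rows(X)
print(np.allclose(Z.mean(axis=1), 0), np.allclose(Z.std(axis=1), 1))
```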

  22. Overview • Overview of gene expression data • Model-based clustering • Data sets • Results • Summary and Conclusions

  23. Summary and Conclusions • Synthetic data sets: • With the correct model, model-based clustering performs better than a leading heuristic clustering algorithm • BIC selects the right model & the right number of clusters • Real expression data sets: • Comparable adjusted Rand indices to CAST • BIC gives a hint to the number of clusters • Time course data: • Standardization makes the classes spherical

  24. Future Work • Custom refinements to the model-based implementation: • Design models that incorporate specific information about the experiments, e.g. a block-diagonal covariance matrix • Missing data • Outliers & noise

  25. Acknowledgements • Ka Yee Yeung1, Chris Fraley2, Alejandro Murua3, Adrian E. Raftery2 1Department of Computer Science & Engineering, University of Washington 2Department of Statistics, University of Washington 3Insightful Corporation • Michèl Schummer (Institute of Systems Biology): the ovary data set • Jeremy Tantrum: help with the MBC software (diagonal model)

  26. More Info http://www.cs.washington.edu/homes/ruzzo CAIMS 2001, UW CSE Computational Biology Group

  27. Model-based clustering • Mixture model: • Assume each component is generated by an underlying probability distribution • Likelihood for the mixture model: L(θ1, …, θG; τ | X) = ∏i Σk τk fk(xi | θk) • Gaussian mixture model: • Each component k is modeled by the multivariate normal distribution with parameters: • Mean vector: μk • Covariance matrix: Σk

  28. Eigenvalue decomposition of the covariance matrix • The covariance matrix Σk determines the geometric features of component k. • Banfield and Raftery (1993): Σk = λkDkAkDkᵀ • Dk: orientation of component k (orthogonal matrix) • Ak: shape of component k (diagonal matrix) • λk: volume of component k (scalar) • Allowing some but not all of the parameters to vary → many models within this framework
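Going in the other direction, the three factors can be recovered from any covariance matrix via its eigendecomposition; a short sketch with a toy 2 × 2 matrix. Since |A| = 1 by convention, λ = det(Σ)^(1/d).

```python
import numpy as np

Sigma = np.array([[3.0, 1.0],
                  [1.0, 2.0]])
evals, D = np.linalg.eigh(Sigma)    # D: orthogonal eigenvectors (orientation)
d = Sigma.shape[0]
lam = np.prod(evals) ** (1.0 / d)   # volume: det(Sigma)**(1/d)
A = np.diag(evals / lam)            # shape: normalized so det(A) = 1

assert np.isclose(np.linalg.det(A), 1.0)
assert np.allclose(lam * D @ A @ D.T, Sigma)   # exact reconstruction
print(lam)
```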

  29. EM algorithm • General approach to maximum likelihood estimation • Iterate between E and M steps: • E step: compute the probability of each observation belonging to each cluster, using the current parameter estimates • M step: estimate the model parameters, using the current group membership probabilities
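The two steps can be made concrete with a minimal EM loop for a one-dimensional, shared-variance mixture (in the spirit of the EI model). This is a didactic sketch on synthetic data, not the MCLUST implementation used in the talk.

```python
import numpy as np

def em_1d(x, k=2, iters=50):
    """Minimal EM for a 1-D Gaussian mixture with a shared variance."""
    mu = np.array([x.min(), x.max()], dtype=float)[:k]  # crude initialization
    var, w = x.var(), np.full(k, 1.0 / k)
    for _ in range(iters):
        # E step: responsibility of each component for each point
        dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) \
                 / np.sqrt(2 * np.pi * var)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M step: re-estimate weights, means, and the shared variance
        nk = r.sum(axis=0)
        w = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum() / len(x)
    return w, mu, var

rng = np.random.default_rng(42)
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(6, 1, 300)])
w, mu, var = em_1d(x)
print(sorted(mu))  # the two estimated means, near the true centers 0 and 6
```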

  30. Definition of the BIC score • The integrated likelihood p(D | Mk) is hard to evaluate, where D is the data and Mk is the model. • BIC = 2 log p(D | θ̂k, Mk) − νk log(n) ≈ 2 log p(D | Mk) • νk: number of parameters to be estimated in model Mk

  31. Randomly resampled synthetic data set • [Figure: genes × experiments matrices by class, comparing the ovary data with the resampled synthetic data]

  32. Adjusted Rand Example • [Worked example shown as a figure in the original slides]

  33. Results: mixture of normal distributions based on ovary data (235 genes) • VVV and CAST: comparable adjusted Rand index at 4 clusters • With only 235 genes, no BIC scores available from VVV. • With bigger data sets (2350 genes), BIC selects VVV at 4 clusters.

  34. Results: log ovary data • Max adjusted Rand: VVV 4 clusters, EEE 4-6 clusters (higher than CAST) • BIC selects VI at 9 clusters • BIC curve shows a bend at 4 clusters for the diagonal model.

  35. Results: standardized ovary data • Comparable adjusted Rand indices for CAST, EI and EEE at 4 clusters. • BIC selects EEE at 7 clusters.

  36. Key advantage of the model-based approach: choose the right model and the number of clusters • Bayesian Information Criterion (BIC): • Approximate Bayes factor: posterior odds for one model against another model • A large BIC score indicates strong evidence for the corresponding model.
