Clustering of Time Course Gene-Expression Data via Mixture Regression Models

Clustering of Time Course Gene-Expression Data via Mixture Regression Models. Geoff McLachlan (joint with Angus Ng and Sam Wang) Department of Mathematics & Institute for Molecular Bioscience University of Queensland. ARC Centre of Excellence in Bioinformatics

Presentation Transcript


  1. Clustering of Time Course Gene-Expression Data via Mixture Regression Models Geoff McLachlan (joint with Angus Ng and Sam Wang) Department of Mathematics & Institute for Molecular Bioscience University of Queensland ARC Centre of Excellence in Bioinformatics http://www.maths.uq.edu.au/~gjm

  2. Institute for Molecular Bioscience, University of Queensland

  3. Time-Course Data Time-course microarray experiments are being increasingly used to characterize dynamic biological processes. (Microarray technology provides the ability to measure the expression levels of thousands of genes at once.) In these experiments, gene-expression levels are measured at different time points, possibly in different biological conditions (e.g. treatment-control). The focus here is on the analysis of gene-expression profiles consisting of short time series of log expression ratios for each of the genes represented on the microarrays.

  4. CLUSTERING OF GENE PROFILES • Can provide new insight into the biological process of interest (coexpressed genes can contribute to our understanding of the regulatory network of gene expression). • Can also assist in assigning functions to genes that have not yet been functionally annotated. • A secondary concern is the need for imputation of missing data.

  5. The biological rationale underlying the clustering of microarray data is the fact that many coexpressed genes are coregulated. Clustering becomes a way of identifying sets of genes that are putatively coregulated, thereby generating testable hypotheses; see Boutros and Okey (2005). It assists with: • the functional annotation of uncharacterised genes • the identification of transcription factor binding sites • the elucidation of complete biological pathways

  6. Outline of Talk • 1. Mixture model-based approach to analysis of gene-expressions • 2. Normal Mixtures • 3. Modifications for high-dimensional and/or structured data • Mixtures of linear mixed models • Clustering of gene profiles

  7. Finite Mixture Models • Provide an arbitrarily accurate estimate of the underlying density with g sufficiently large • Provide a probabilistic clustering of the data into g clusters - outright clustering by assigning a data point to the component to which it has the greatest posterior probability of belonging.

  8. Definition We let $Y_1, \ldots, Y_n$ denote a random sample of size $n$, where $Y_j$ is a $p$-dimensional random vector with probability density function $f(y_j) = \sum_{i=1}^{g} \pi_i f_i(y_j)$, where the $f_i(y_j)$ are densities and the $\pi_i$ are nonnegative quantities that sum to one.

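  9. By Bayes' Theorem, $\tau_i(y_j; \Psi) = \pi_i f_i(y_j) / \sum_{h=1}^{g} \pi_h f_h(y_j)$ for $i = 1, \ldots, g$; $j = 1, \ldots, n$. The quantity $\tau_i(y_j; \Psi^{(k)})$, that is, $\tau_i$ evaluated at the parameter value $\Psi^{(k)}$, is the posterior probability that the $j$th member of the sample with observed value $y_j$ belongs to the $i$th component of the mixture.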

  10. A soft (probabilistic) clustering is given in terms of the estimated posterior probabilities of component membership. A hard (outright) clustering is given by assigning each $y_j$ to the component to which it has the highest estimated posterior probability of belonging; that is, $y_j$ is assigned to the component $i$ that maximizes the estimated posterior probability over $i = 1, \ldots, g$.
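The soft and hard clusterings on slides 9 and 10 can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the components are taken to be multivariate normal, the parameter values are made up for the example, and the function names `normal_pdf` and `posterior_probabilities` are hypothetical rather than part of EMMIX.

```python
import numpy as np

def normal_pdf(y, mean, cov):
    """Multivariate normal density evaluated at each row of y."""
    p = y.shape[1]
    diff = y - mean
    # Squared Mahalanobis distance of each row from the mean.
    maha = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
    _, logdet = np.linalg.slogdet(cov)
    return np.exp(-0.5 * (maha + logdet + p * np.log(2.0 * np.pi)))

def posterior_probabilities(y, pis, means, covs):
    """tau[j, i] = pi_i f_i(y_j) / sum_h pi_h f_h(y_j) (Bayes' theorem)."""
    weighted = np.column_stack([
        pi * normal_pdf(y, m, c) for pi, m, c in zip(pis, means, covs)
    ])
    return weighted / weighted.sum(axis=1, keepdims=True)

# Hypothetical two-component mixture in p = 2 dimensions (illustrative values).
y = np.array([[0.1, -0.2], [2.9, 3.1], [1.5, 1.5]])
pis = [0.5, 0.5]
means = [np.zeros(2), np.full(2, 3.0)]
covs = [np.eye(2), np.eye(2)]

tau = posterior_probabilities(y, pis, means, covs)  # soft clustering
hard = tau.argmax(axis=1)                           # hard clustering
```

Each row of `tau` sums to one. The third point lies midway between the two component means, so its posterior probabilities are equal and the hard assignment for it is decided only by the tie-breaking rule of `argmax`.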

  12. Multivariate Mixture Models • Day (Biometrika, 1969) • Wolfe (NORMIX, 1965, 1967, 1970) • It was the publication of the seminal paper of Dempster, Laird, and Rubin (1977) on the EM algorithm that greatly stimulated interest in the use of finite mixture distributions to model heterogeneous data. • Ganesalingam and McLachlan (Biometrika, 1978)

  14. Everitt and Hand (2001) • Titterington, Smith, and Makov (1985) • McLachlan and Basford (1988) • Lindsay (1996) • McLachlan and Peel (2000) • Böhning (2000) • Frühwirth-Schnatter (2006)

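  15. Normal Mixtures Suppose that the density of the random vector $Y_j$ has a $g$-component normal mixture form $f(y_j; \Psi) = \sum_{i=1}^{g} \pi_i \phi(y_j; \mu_i, \Sigma_i)$, where $\phi(y_j; \mu_i, \Sigma_i)$ denotes the $p$-variate normal density with mean $\mu_i$ and covariance matrix $\Sigma_i$, and where $\Psi$ is the vector containing the unknown parameters.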
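A normal mixture of this form is usually fitted by maximum likelihood with the EM algorithm mentioned earlier in the talk. The following is a minimal sketch, not the EMMIX implementation: the function names, the fixed number of iterations, the small ridge added to each covariance matrix, and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def log_normal_pdf(y, mean, cov):
    """Log of the p-variate normal density at each row of y."""
    p = y.shape[1]
    diff = y - mean
    maha = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (maha + logdet + p * np.log(2.0 * np.pi))

def fit_normal_mixture(y, init_means, n_iter=50, ridge=1e-6):
    """EM for a g-component normal mixture with unrestricted covariances."""
    n, p = y.shape
    means = np.array(init_means, dtype=float)
    g = means.shape[0]
    pis = np.full(g, 1.0 / g)
    # Start every component from the overall sample covariance.
    covs = np.array([np.cov(y.T) + ridge * np.eye(p) for _ in range(g)])
    for _ in range(n_iter):
        # E-step: posterior probabilities of component membership.
        log_w = np.column_stack([
            np.log(pis[i]) + log_normal_pdf(y, means[i], covs[i])
            for i in range(g)
        ])
        log_w -= log_w.max(axis=1, keepdims=True)  # numerical stability
        tau = np.exp(log_w)
        tau /= tau.sum(axis=1, keepdims=True)
        # M-step: weighted proportions, means, and covariance matrices.
        n_i = tau.sum(axis=0)
        pis = n_i / n
        means = (tau.T @ y) / n_i[:, None]
        for i in range(g):
            diff = y - means[i]
            covs[i] = (tau[:, i, None] * diff).T @ diff / n_i[i] + ridge * np.eye(p)
    # tau holds the posterior probabilities from the final E-step.
    return pis, means, covs, tau

# Simulated data: two well-separated groups in p = 2 dimensions.
rng = np.random.default_rng(0)
y = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 2)),
    rng.normal(6.0, 1.0, size=(100, 2)),
])
pis, means, covs, tau = fit_normal_mixture(y, init_means=[[1.0, 1.0], [5.0, 5.0]])
```

In practice EM is run from several starting values and stopped on a convergence criterion for the log likelihood; the fixed iteration count here only keeps the sketch short.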

  16. One attractive feature of adopting mixture models with elliptically symmetric components, such as the normal or t densities, is that the implied clustering is invariant under affine transformations of the data, i.e., under operations relating to changes in location, scale, and rotation of the data. Thus the clustering process does not depend on irrelevant factors such as the units of measurement or the orientation of the clusters in space.

  17. Microarray Data Represented as an N x M Matrix [Figure: matrix with rows Gene 1, Gene 2, …, Gene N and columns Sample 1, Sample 2, …, Sample M. Each row is an expression profile and each column is an expression signature. M columns (samples) ~ 10^2; N rows (genes) ~ 10^4.]

  18. Clustering of Microarray Data • Clustering of tissues on the basis of genes: a nonstandard problem in cluster analysis, since n = M << p = N. • Clustering of genes on the basis of tissues: the genes (observations) are not independent, and there is structure on the tissues (variables); here n = N >> p = M.

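  19. The component-covariance matrix $\Sigma_i$ is highly parameterized, with $p(p+1)/2$ parameters. Restricted forms: • $\Sigma_i = \sigma^2 I_p$ (equal spherical) • $\Sigma_i = \sigma_i^2 I_p$ (unequal spherical) • $\Sigma_i = D$ (equal diagonal) • $\Sigma_i = D_i$ (unequal diagonal) • $\Sigma_i = \Sigma$ (equal)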
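To see how much each restriction saves, the number of free covariance parameters can be counted directly. The sketch below is illustrative only; the function name and the structure labels are hypothetical, and the example uses p = 17 and g = 4 simply because those values appear later in the talk.

```python
def covariance_parameter_count(p, g, structure):
    """Number of free covariance parameters across all g components."""
    full = p * (p + 1) // 2  # one unrestricted p x p symmetric matrix
    counts = {
        "equal_spherical": 1,        # a single sigma^2
        "unequal_spherical": g,      # one sigma_i^2 per component
        "equal_diagonal": p,         # one common diagonal matrix D
        "unequal_diagonal": g * p,   # one diagonal matrix D_i per component
        "equal": full,               # one common matrix Sigma
        "unrestricted": g * full,    # one full matrix Sigma_i per component
    }
    return counts[structure]

# Example with p = 17 variables and g = 4 components.
for name in ("equal_spherical", "unequal_diagonal", "equal", "unrestricted"):
    print(name, covariance_parameter_count(17, 4, name))
```

With p = 17 and g = 4, the unrestricted model has 4 x 153 = 612 covariance parameters, against 68 for unequal diagonal matrices.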

  20. Banfield and Raftery (1993) introduced a parameterization of the component-covariance matrix $\Sigma_i$ based on a variant of the standard spectral decomposition of $\Sigma_i$ ($i = 1, \ldots, g$).
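The slide does not show the decomposition itself. As a hedged illustration, one common way of writing it is $\Sigma = \lambda D A D^T$, with a scalar $\lambda$ (volume), an orthogonal matrix $D$ of eigenvectors (orientation), and a diagonal matrix $A$ (shape). The normalization used below, $|A| = 1$ so that $\lambda = |\Sigma|^{1/p}$, is an assumption and may differ from the convention on the original slide; the function name is hypothetical.

```python
import numpy as np

def volume_shape_orientation(cov):
    """Split a covariance matrix as cov = lam * D @ A @ D.T with det(A) = 1."""
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1]        # largest eigenvalue first
    eigvals, D = eigvals[order], eigvecs[:, order]
    lam = np.exp(np.mean(np.log(eigvals)))   # geometric mean = det(cov)**(1/p)
    A = np.diag(eigvals / lam)               # shape matrix, determinant one
    return lam, D, A

cov = np.array([[4.0, 1.0], [1.0, 2.0]])
lam, D, A = volume_shape_orientation(cov)
reconstructed = lam * D @ A @ D.T            # recovers cov up to rounding
```

Constraining some of $\lambda$, $D$, and $A$ to be equal across components gives a family of models between the spherical and the fully unrestricted cases.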

  21. However, if p is large relative to the sample size n, it may not be possible to use this decomposition to infer an appropriate model for the component-covariance matrices. Even if it is possible, the results may not be reliable due to potential problems with near-singular estimates of the component-covariance matrices when p is large relative to n.

  22. Hence, in fitting normal mixture models to high-dimensional data, we should first consider • some form of dimension reduction and/or • some form of regularization

  23. Mixture Software: EMMIX EMMIX for UNIX McLachlan, Peel, Adams, and Basford http://www.maths.uq.edu.au/~gjm/emmix/emmix.html

  24. PROVIDES A MODEL-BASED APPROACH TO CLUSTERING McLachlan, Bean, and Peel, 2002, A Mixture Model-Based Approach to the Clustering of Microarray Expression Data, Bioinformatics 18, 413-422 http://www.bioinformatics.oupjournals.org/cgi/screenpdf/18/3/413.pdf

  25. Microarray Data Represented as an N x M Matrix [Figure: matrix with rows Gene 1, Gene 2, …, Gene N and columns Sample 1, Sample 2, …, Sample M. Each row is an expression profile and each column is an expression signature. M columns (samples) ~ 10^2; N rows (genes) ~ 10^4.]

  26. In applying the normal mixture model to cluster multivariate (continuous) data, it is assumed, as in most typical cluster analyses using any other method, that • (a) there are no replications on any particular entity specifically identified as such; • (b) all the observations on the entities are independent of one another.

  27. For example, where and where

  28. Clustering of gene expression profiles • Longitudinal data (with or without replication, for example time-course) • Cross-sectional data. EMMIX-WIRE: EM-based MIXture analysis With Random Effects. Ng, McLachlan, Wang, Ben-Tovim Jones, and Ng (2006, Bioinformatics). Supplementary information: http://www.maths.uq.edu.au/~gjm/bioinf0602_supp.pdf

  29. In the ith component of the mixture, the profile vector yj for the jth gene follows the model
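The displayed model did not survive in this transcript. As a hedged reconstruction, the linear mixed model published for EMMIX-WIRE in Ng et al. (2006), rewritten here with the slides' component index $i$, has the form below. The symbols $X$, $U$, $V$, $\beta_i$, $b_{ij}$, and $\varepsilon_{ij}$ are taken from that paper rather than from this transcript and should be read as assumptions; $c_i$ is the cluster-specific random effects term referred to on later slides.

```latex
y_j = X\beta_i + U b_{ij} + V c_i + \varepsilon_{ij}
```

In that formulation $X$, $U$, and $V$ are known design matrices, $\beta_i$ is a vector of fixed effects, $b_{ij}$ is a gene-specific random effect, $c_i$ is a random effect shared by all genes in the $i$th cluster, and $\varepsilon_{ij}$ is the measurement error. Because $c_i$ is shared, gene profiles within a cluster are correlated rather than independent.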

  30. $N(\mu_i, B_i)$, with

  31. Celeux et al. (2005). Mixtures of linear mixed models for clustering gene expression profiles from repeated microarray measurements. Statistical Modelling 5, 243-267. • Qin and Self (2006). The clustering of regression models method with applications in gene expression data. Biometrics 62, 526-533. • Booth et al. (2008). Clustering using objective functions and stochastic search. J R Statist Soc B 70, 119-139.

  32. Yeast cell cycle data of Cho et al. (1998): n = 237 genes at p = 17 time points, categorized into 4 MIPS (Munich Information Centre for Protein Sequences) functional groups. The yeast system is useful because of our ability to control and monitor the progression of cells through the cell cycle (temperature-based synchronization with temperature-sensitive genes whose product is essential for cell-cycle progression).

  33. High-density oligonucleotide arrays were used to quantitate mRNA transcript levels in synchronized yeast cells at regular intervals (10 min) during the cell cycle (genes with cell-cycle dependent periodicity). Samples of yeast cultures were taken at 17 time points after their cell cycle phase had been synchronized. The data were reduced to a short time series of log expression ratios for each of the yeast genes represented on the microarrays (expression ratios were calculated by dividing each intensity measurement by the average for that gene).

  34. Example. Clustering of yeast cell cycle time-course data: n = 237 genes, p = 17 time points, where

  35. In the ith cluster,

  36. T estimated following Booth et al. (2004). Time points: 0, 10, 20, …, 160 min. T is the period, estimated to be 73 min.
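The regression part of the model is not preserved in this transcript. A plausible sketch, assuming a first-order Fourier (cosine and sine) design with zero phase offset, builds the 17 x 2 design matrix from the time points and period given above. The function name and the choice of exactly two columns are assumptions, not a statement of the authors' design matrix.

```python
import numpy as np

def periodic_design_matrix(times, period):
    """Columns cos(2*pi*t/T) and sin(2*pi*t/T) for each time point t."""
    angle = 2.0 * np.pi * np.asarray(times, dtype=float) / period
    return np.column_stack([np.cos(angle), np.sin(angle)])

times = np.arange(0, 161, 10)                    # 0, 10, ..., 160 min
X = periodic_design_matrix(times, period=73.0)   # T = 73 min from the slide
```

With this design, each cluster's mean profile is a sinusoid of period T whose amplitude and phase are determined by the two regression coefficients for that cluster.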

  37. Table 1: Values of BIC for Various Levels of the Number of Components g

  38. Cluster-specific random effects term

  39. Table 2: Summary of Clustering Results for g = 4 Clusters

  40. The use of the cluster-specific random effects terms $c_i$ leads to a clustering that corresponds more closely to the underlying functional groups than the clustering obtained without them.

  41. Figure 1: Clusters of gene-profiles obtained by mixture of linear mixed models with cluster-specific random effects

  42. Figure 2: Clusters of gene-profiles obtained by mixture of linear mixed models without cluster-specific random effects

  43. Figure 3: Clusters of gene-profiles obtained by mixtures of linear mixed models with and without cluster-specific random effects

  44. Figure 4: Plots of gene profiles grouped according to their functional grouping

  45. Figure 5: Plots of clustered gene profiles versus functional grouping

  46. Figure 6: Clusters of gene-profiles obtained by k-means

  47. Figure 7: Plots of Clusters of gene-profiles: Model-based clustering versus k-means
