
Clustering Short Gene Expression Profiles

Ling Wang, Marco Ramoni, Paolo Sebastiani. The Problem: Input. Gene expression profiles for J genes from microarray experiments [1]. The Problem: Output. A clustering of the genes that groups functionally related genes in the same cluster.


Presentation Transcript


  1. Clustering Short Gene Expression Profiles
     Ling Wang, Marco Ramoni, Paolo Sebastiani
     Abdullah Mueen

  2. The Problem: Input
     Gene expression profiles for J genes from microarray experiments. [1]

  3. The Problem: Output
     • A clustering of the genes that groups functionally related genes in the same cluster.

  4. Previous Works
     • Hierarchical clustering (Eisen et al., 1998)
     • K-means and self-organizing maps (Tamayo et al., 1999)
     • Standard measures: Euclidean distance, correlation coefficient.
     • Problems
       • These measures ignore the sequential nature of the profiles: they do not change when the time points are reordered.
       • Different pairs of time series can therefore receive the same measure. [3]
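The point about ignoring temporal order can be checked numerically. The sketch below uses two made-up five-point profiles (not data from the slides) and shows that Euclidean distance and Pearson correlation stay the same when the time points of both profiles are scrambled in the same way.

```python
import numpy as np

# Two hypothetical five-point expression profiles (illustrative values only).
a = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
b = np.array([0.5, 1.5, 1.0, 3.5, 3.0])

# The same arbitrary reordering of the time points, applied to both profiles.
perm = np.array([3, 0, 4, 1, 2])

def euclid(x, y):
    """Euclidean distance between two profiles."""
    return float(np.linalg.norm(x - y))

def pearson(x, y):
    """Pearson correlation coefficient between two profiles."""
    return float(np.corrcoef(x, y)[0, 1])

d_orig, d_perm = euclid(a, b), euclid(a[perm], b[perm])
r_orig, r_perm = pearson(a, b), pearson(a[perm], b[perm])

print(d_orig, d_perm)  # equal up to rounding
print(r_orig, r_perm)  # equal up to rounding
```

Because both measures are sums over time points, any clustering built on them alone cannot tell a smooth trend from the same values in shuffled order.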

  5. Previous Works
     • Continuous representation of the profile using
       • Autoregressive models.
       • Hidden Markov models.
     • Advantages:
       • Account for the temporal information.
       • Good for long profiles (10 points or more).
       • Combine easily with Bayesian clustering. [3]

  6. Autoregressive Model: Definition
     • Each time point is modeled as a linear function of the p previous time points.
     • Combining the models of all the time points for a gene
     • Xj is the regression matrix of size (n-p) x (p+1) and βj is the vector of p+1 regression coefficients. [2]

  7. Autoregressive Model: Problems
     • The AR model is meant for stationary time series, and the intervals between time points are ignored.
     • For short gene expression profiles (5 time points) the regression order cannot be large.
     • For a large number of genes with short expression profiles, some profiles will show random patterns, and the AR model overfits these random patterns.
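The overfitting concern can be made concrete with a small count. With n = 5 time points and order p = 2, the AR regression has n - p = 3 equations and p + 1 = 3 coefficients, so it reproduces any profile exactly, noise included. The values below are arbitrary made-up numbers, and the intercept-plus-lags layout is an assumption.

```python
import numpy as np

# Arbitrary made-up profile with n = 5 time points (no real structure).
x = np.array([0.3, -1.2, 0.8, 0.1, -0.5])
p = 2
n = len(x)

# Rows are [1, x[t-1], x[t-2]] for t = 2, 3, 4: a 3 x 3 system.
X = np.column_stack([np.ones(n - p), x[p - 1 : n - 1], x[p - 2 : n - 2]])
y = x[p:]

# As many coefficients as equations, so the fit is exact.
beta = np.linalg.solve(X, y)
max_residual = float(np.max(np.abs(X @ beta - y)))
print(X.shape, max_residual)  # residual is zero up to rounding
```

A perfect fit here says nothing about the gene; it only reflects that the model has no degrees of freedom left.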

  8. The Algorithm
     The algorithm has three components:
     • A model describing the dynamics of gene expression temporal profiles.
     • A probabilistic metric to score different clustering models based on the posterior probability of each clustering model.
     • A heuristic to make the search for the best clustering model feasible.

  9. Polynomial Model: Definition
     • The expression value at each time point is approximated by a polynomial of degree p in time.
     • The combined model for a gene is
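One way to picture the combined model is as a linear regression on powers of time. The sketch below is an assumption about the layout, not the slide's own equation: it builds an n x (p+1) design matrix whose columns are 1, t, t^2, ..., t^p from the actual measurement times, so unequal spacing between time points is taken into account. The name `poly_design` and the time values are hypothetical.

```python
import numpy as np

def poly_design(times, p):
    """Design matrix with columns t**0, t**1, ..., t**p (shape n x (p+1))."""
    t = np.asarray(times, dtype=float)
    return np.vander(t, N=p + 1, increasing=True)

# Hypothetical, unequally spaced measurement times for 5 time points.
times = np.array([0.0, 0.5, 1.0, 3.0, 6.0])
F = poly_design(times, p=2)

# A noiseless quadratic profile, used to check that the coefficients are recovered.
profile = 1.0 + 0.5 * times - 0.1 * times**2
coef, *_ = np.linalg.lstsq(F, profile, rcond=None)
print(F.shape, coef)
```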

  10. Polynomial Model: Assumptions
      • The uncorrelated errors are normally distributed with mean 0 and variance 1/τj, where
      • The coefficients are normally distributed
      • β0, α1 and α2 are hyper-parameters of the prior distributions of the parameters.

  11. Hyper-parameters
      • Around 25-50% of the genes/probes in the microarrays are disregarded because of their low confidence level.
      • To avoid overfitting random patterns, the hyper-parameters are estimated from these disregarded profiles, which are treated as random data.
      • If σ²a is the sample variance of the disregarded genes, then the hyper-parameters are related through

  12. Scoring Method
      • The scoring function is calculated from the marginal likelihood of each gene, which is
      • For the current model, the marginal likelihood of a gene is

  13. Marginal Likelihood
      • Given the polynomial model, the assumed prior distributions and the hyper-parameters, the marginal likelihood can be computed in closed form.
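A minimal sketch of such a computation follows, under assumptions that go beyond the slide: a conjugate normal-gamma prior in which the precision τ has a Gamma prior with shape `a` and rate `b`, and the coefficients given τ are normal with mean `beta0` and precision matrix τ·`R0`. The function name and the mapping of `a`, `b`, `R0` onto the slide's α1, α2 are hypothetical; only the general closed form for this standard prior is used.

```python
import math
import numpy as np

def log_marginal_likelihood(y, F, beta0, R0, a, b):
    """Log marginal likelihood of profile y under y = F @ beta + noise.

    Assumed prior (not taken from the slides):
      tau ~ Gamma(shape=a, rate=b), beta | tau ~ N(beta0, (tau * R0)^-1).
    """
    y = np.asarray(y, dtype=float)
    F = np.asarray(F, dtype=float)
    beta0 = np.asarray(beta0, dtype=float)
    R0 = np.asarray(R0, dtype=float)
    n = y.shape[0]

    Rn = R0 + F.T @ F                                   # posterior precision factor
    beta_n = np.linalg.solve(Rn, R0 @ beta0 + F.T @ y)  # posterior mean
    a_n = a + 0.5 * n
    b_n = b + 0.5 * (y @ y + beta0 @ R0 @ beta0 - beta_n @ Rn @ beta_n)

    _, logdet_R0 = np.linalg.slogdet(R0)
    _, logdet_Rn = np.linalg.slogdet(Rn)
    return (-0.5 * n * math.log(2.0 * math.pi)
            + 0.5 * (logdet_R0 - logdet_Rn)
            + a * math.log(b) - a_n * math.log(b_n)
            + math.lgamma(a_n) - math.lgamma(a))

# Sanity check on the smallest case: one observation, intercept-only model.
check = log_marginal_likelihood([0.0], [[1.0]], [0.0], [[1.0]], a=1.0, b=1.0)
print(check)
```

In the one-observation case the marginal is a Student-t density with 2 degrees of freedom, which gives an independent value to compare against.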

  14. Scoring the Model
      • The weighted average of the marginal likelihoods of the genes is the scoring function for a clustering model.
      • The weight for each cluster varies with the size of the cluster.

  15. Agglomerative Clustering
      • The clustering phase starts with singleton clusters.
      • It computes and
      • Iteratively merges time series into clusters until the scoring function does not increase.
      • While merging, it takes the average of the cluster representatives.

  16. Heuristic Search
      • Computing the scoring function for all possible clustering models is expensive, so a heuristic is adopted.
      • Instead of evaluating all the possible merge pairs, it looks for one merge pair that increases the scoring function. Candidate pairs are tried in order of similarity, most similar first, as measured by Euclidean distance, dynamic time warping, etc.
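The agglomerative procedure with this heuristic can be sketched as a greedy loop. Everything named here is hypothetical: `greedy_cluster` takes the scoring function as an argument, and `toy_score` is a stand-in (negative within-cluster squared error minus a penalty per cluster), not the marginal-likelihood score from the slides. The sketch assumes the closest pair of cluster representatives is tried first and that the first merge that raises the score is accepted.

```python
from itertools import combinations
import numpy as np

def greedy_cluster(profiles, score):
    """Greedy agglomerative clustering; returns a list of index lists."""
    clusters = [[i] for i in range(len(profiles))]       # start from singletons
    reps = [np.asarray(p, dtype=float) for p in profiles]
    best = score(clusters)
    while len(clusters) > 1:
        # Candidate pairs, closest representatives first (Euclidean distance).
        pairs = sorted(combinations(range(len(clusters)), 2),
                       key=lambda ab: np.linalg.norm(reps[ab[0]] - reps[ab[1]]))
        for a, b in pairs:
            rest = [k for k in range(len(clusters)) if k not in (a, b)]
            merged = [clusters[k] for k in rest] + [clusters[a] + clusters[b]]
            s = score(merged)
            if s > best:                                  # accept first improving merge
                reps = [reps[k] for k in rest] + [(reps[a] + reps[b]) / 2.0]
                clusters, best = merged, s
                break
        else:
            break                                         # no merge improves the score
    return clusters

# Made-up data: two obvious groups of flat profiles.
data = np.array([[0.0, 0.0, 0.0], [0.1, 0.1, 0.1],
                 [5.0, 5.0, 5.0], [5.1, 5.1, 5.1]])

def toy_score(clusters, penalty=1.0):
    """Stand-in score: -(within-cluster squared error) - penalty per cluster."""
    sse = sum(((data[c] - data[c].mean(axis=0)) ** 2).sum() for c in clusters)
    return -sse - penalty * len(clusters)

result = greedy_cluster(data, toy_score)
print(sorted(sorted(c) for c in result))
```

Trying the closest pairs first means the loop usually finds an improving merge after very few score evaluations, which is the purpose of the heuristic.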

  17. Evaluation: Simulation

  18. Evaluation: Real Data
      • The gene expression profiles from [1] are used. Clusters are tested with a Gene Ontology enrichment test using EASE (Hosack et al., 2003).

  19. Conclusion
      • Short gene expression profiles are modeled using polynomials.
      • A clustering model is evaluated using the marginal likelihood of the genes with respect to the polynomial model.
      • Agglomerative clustering is carried out with a heuristic search strategy.
      • The output clusters are enriched for Gene Ontology terms.

  20. References
      [1] K. Guillemin, N. R. Salama, L. S. Tompkins, and S. Falkow. Cag pathogenicity island-specific responses of gastric epithelial cells to Helicobacter pylori infection. Proc. Natl. Acad. Sci. USA, 99:15136-15141, 2002.
      [2] M. Ramoni, P. Sebastiani, and I. S. Kohane. Cluster analysis of gene expression dynamics. Proc. Natl. Acad. Sci. USA, 99(14):9121-9126, 2002.
      [3] J. Ernst, G. J. Nau, and Z. Bar-Joseph. Clustering short time series gene expression data. Bioinformatics, 21 Suppl. 1:i159-i168, 2005.
