Clustering Short Gene Expression Profiles

Ling Wang, Marco Ramoni, Paolo Sebastiani

Abdullah Mueen

Gene expression profiles for J genes from microarray experiments

[1]

- A clustering of the genes that groups functionally related genes in the same cluster.

- Hierarchical clustering (Eisen et al., 1998)
- K-means and self-organizing maps (Tamayo et al., 1999)
- Standard measures: Euclidean distance, correlation coefficient.
- Problem
  - These measures ignore the sequential nature of the profiles.
  - Different pairs of time series can have the same measure: reordering the time points of both series in the same way leaves the distance and the correlation unchanged.
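
The problem with the standard measures can be checked directly. The sketch below uses two made-up five-point profiles (hypothetical values, not data from [1]) and shows that applying the same reordering of time points to both profiles changes neither the Euclidean distance nor the correlation coefficient. The function names are illustrative only.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Two hypothetical five-point expression profiles.
gene_a = [0.0, 1.0, 2.0, 3.0, 4.0]
gene_b = [0.5, 0.9, 2.4, 2.8, 4.1]

# Apply the same reordering of time points to both profiles.
order = [3, 0, 4, 1, 2]
shuffled_a = [gene_a[i] for i in order]
shuffled_b = [gene_b[i] for i in order]

d_original = euclidean(gene_a, gene_b)
d_shuffled = euclidean(shuffled_a, shuffled_b)
r_original = pearson(gene_a, gene_b)
r_shuffled = pearson(shuffled_a, shuffled_b)

# Both measures agree before and after shuffling (up to rounding),
# although the temporal pattern of each profile has been destroyed.
print(math.isclose(d_original, d_shuffled), math.isclose(r_original, r_shuffled))
```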

[3]

- Continuous representation of the profile using
  - Autoregressive models.
  - Hidden Markov models.

- Advantages:
  - Accounts for the temporal information.
  - Good for long profiles (10 points or more).
  - Combines easily with Bayesian clustering.

[3]

- Each time point is correlated with the p previous time points.
- Combining the models of all the time points for a gene
- Xj is the regression matrix of size (n-p)×(p+1) and βj is the vector of p+1 regression coefficients.
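
To make the size of the regression matrix concrete, here is a minimal sketch of how such a matrix could be assembled for one gene. It assumes the usual AR(p) layout, with a leading 1 for the intercept followed by the p previous values; the function name and the example profile are hypothetical.

```python
def ar_regression_matrix(series, p):
    """Return (X, y) for an AR(p) fit.

    Each row of X is [1, x[t-1], ..., x[t-p]] and the matching entry of y
    is x[t], for t = p, ..., n-1, so X has n-p rows and p+1 columns.
    """
    n = len(series)
    if p < 1 or p >= n:
        raise ValueError("need 1 <= p < len(series)")
    X = [[1.0] + [series[t - lag] for lag in range(1, p + 1)]
         for t in range(p, n)]
    y = [series[t] for t in range(p, n)]
    return X, y

profile = [0.2, 0.8, 1.5, 1.1, 0.4]   # hypothetical profile with n = 5
X, y = ar_regression_matrix(profile, p=2)
print(len(X), len(X[0]))              # rows = n - p, columns = p + 1
```

With n = 5 and p = 2 the matrix has three rows and three columns, so three coefficients are fitted to only three equations. This is one way to see why the regression order has to stay small for short profiles.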

[2]

- Problems
  - The AR model assumes a stationary time series, and the intervals between time points are ignored.
  - For short gene expression profiles (5 time points) the regression order cannot be large.
  - With a large number of genes and short expression profiles, some patterns arise by chance, and an AR model overfits these random patterns.

The algorithm has three components

- A model describing the dynamics of gene expression temporal profiles.
- A probabilistic metric to score different clustering models based on the posterior probability of each clustering model.
- A heuristic to make the search for the best clustering model feasible.

- The expression value at each time point is approximated by a polynomial of degree p in time.
- The combined model for a gene is
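
A polynomial model of this kind is commonly written as a linear regression on powers of the sampling time, with one row (1, t, t², ..., tᵖ) per time point. The sketch below assumes that form, which is not spelled out on the slide, and fits the coefficients by ordinary least squares rather than by the Bayesian estimation used in the method. The function names, the sampling times and the profile are all hypothetical.

```python
def design_matrix(times, degree):
    """Rows are [1, t, t**2, ..., t**degree], one row per observation time."""
    return [[t ** k for k in range(degree + 1)] for t in times]

def least_squares(F, x):
    """Solve the normal equations (F'F) b = F'x by Gauss-Jordan elimination."""
    m = len(F[0])
    # Augmented matrix [F'F | F'x].
    A = [[sum(row[i] * row[j] for row in F) for j in range(m)]
         + [sum(row[i] * xi for row, xi in zip(F, x))]
         for i in range(m)]
    for col in range(m):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(m):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][m] / A[i][i] for i in range(m)]

times = [0.0, 0.5, 3.0, 6.0, 12.0]    # hypothetical, unevenly spaced hours
profile = [1.0 + 0.5 * t - 0.03 * t ** 2 for t in times]   # exact quadratic
F = design_matrix(times, degree=2)
coeffs = least_squares(F, profile)    # close to [1.0, 0.5, -0.03]
```

Because the design matrix is built from the actual sampling times, unevenly spaced measurements are handled naturally, unlike in the AR model.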

- The uncorrelated errors are normally distributed with mean 0 and variance 1/τj, where τj is the precision of gene j.
- The coefficients are normally distributed
- β0, α1 and α2 are hyper-parameters of the prior distributions of the parameters.

- Around 25-50% of the total number of genes/probes in the microarrays are disregarded because of their low confidence level.
- To avoid overfitting random patterns, the hyper-parameters are estimated from these disregarded genes, which are treated as random data.
- If σ_a² is the sample variance of the disregarded genes, then the hyper-parameters are related through

- The scoring function is calculated from the marginal likelihood of each gene, which is
- For the current model, the marginal likelihood of a gene is
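
For reference, the marginal likelihood of a gene is by definition the likelihood of its profile averaged over the prior distribution of its parameters. The generic form is given below. The symbol θj is introduced here only for illustration and stands for the parameters of gene j, which in the polynomial model are the coefficients βj and the precision τj. The closed-form expression from the slide is not reproduced.

```latex
f(x_j) = \int f(x_j \mid \theta_j)\, f(\theta_j)\, d\theta_j ,
\qquad \theta_j = (\beta_j, \tau_j)
```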

- With the polynomial model, the assumed prior distributions and the hyper-parameters, the marginal likelihood function can be computed.

- The scoring function for a clustering model is the weighted average of the marginal likelihoods of its genes.
- The weight for each cluster varies with the size of the cluster.

- The clustering phase starts with singleton clusters.
- It computes and
- Iteratively merges time series into clusters until the scoring function no longer increases.
- When merging, it takes the average of the cluster representatives.

- Computing the scoring function for every possible clustering model is expensive, so a heuristic is adopted.
- Instead of scoring all possible merge pairs, it looks for a merge pair that increases the scoring function. Candidate pairs are tried in order of similarity, closest first, under a distance measure such as Euclidean distance or dynamic time warping.
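
The search can be sketched as a greedy loop. The code below is an illustration under stated assumptions, not the authors' implementation: the score is a simple stand-in (negative within-cluster squared error minus a per-cluster penalty) rather than the Bayesian marginal-likelihood score, candidate pairs are ordered by the Euclidean distance between cluster representatives with the closest pair tried first, and each representative is recomputed as the mean of all cluster members. All names and profiles are hypothetical.

```python
import math
from itertools import combinations

def mean_profile(profiles):
    """Point-wise average of equal-length profiles (cluster representative)."""
    return [sum(vals) / len(vals) for vals in zip(*profiles)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def stand_in_score(clusters, penalty=1.0):
    """Placeholder score: negative within-cluster squared error minus a
    per-cluster penalty. NOT the Bayesian marginal-likelihood score."""
    sse = 0.0
    for members in clusters:
        centre = mean_profile(members)
        sse += sum(euclidean(m, centre) ** 2 for m in members)
    return -sse - penalty * len(clusters)

def heuristic_agglomerate(profiles, score=stand_in_score):
    clusters = [[p] for p in profiles]           # start from singletons
    best = score(clusters)
    merged = True
    while merged and len(clusters) > 1:
        merged = False
        reps = [mean_profile(c) for c in clusters]
        # Candidate pairs, closest representatives first.
        pairs = sorted(combinations(range(len(clusters)), 2),
                       key=lambda ij: euclidean(reps[ij[0]], reps[ij[1]]))
        for i, j in pairs:
            trial = [c for k, c in enumerate(clusters) if k not in (i, j)]
            trial.append(clusters[i] + clusters[j])
            trial_score = score(trial)
            if trial_score > best:               # accept first improving merge
                clusters, best, merged = trial, trial_score, True
                break
    return clusters                              # no merge improves the score

profiles = [
    [0.0, 1.0, 2.0, 3.0], [0.1, 1.1, 2.0, 2.9],   # rising
    [3.0, 2.0, 1.0, 0.0], [2.9, 2.1, 0.9, 0.1],   # falling
]
clusters = heuristic_agglomerate(profiles)
```

On this toy input the loop merges the two rising profiles, then the two falling ones, and stops because joining the two resulting clusters would lower the score. Only a few candidate merges are scored per step instead of all of them, which is the point of the heuristic.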

- The gene expression profiles from [1] are used. The clusters are tested with a Gene Ontology enrichment test using EASE (Hosack et al., 2003).
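
As a rough illustration of what an enrichment test measures, the sketch below computes a plain one-sided hypergeometric p-value: the probability of seeing at least the observed number of annotated genes in a cluster if the cluster were drawn at random. It does not reproduce EASE's own scoring, and the function name and all counts are hypothetical.

```python
from math import comb

def enrichment_p_value(population, annotated, cluster_size, hits):
    """P(X >= hits) when cluster_size genes are drawn without replacement
    from a population in which `annotated` genes carry the GO term."""
    total = comb(population, cluster_size)
    upper = min(annotated, cluster_size)
    tail = sum(comb(annotated, k)
               * comb(population - annotated, cluster_size - k)
               for k in range(hits, upper + 1))
    return tail / total

# Hypothetical counts: 1000 genes, 50 annotated with the term,
# and a cluster of 20 genes of which 6 carry the annotation.
p = enrichment_p_value(population=1000, annotated=50, cluster_size=20, hits=6)
print(p)
```

A small p-value indicates that the cluster contains more genes with the annotation than would be expected by chance.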

- Short gene expression profiles are modeled using polynomials.
- A clustering model is evaluated using the marginal likelihood of the genes with respect to the polynomial model.
- An agglomerative clustering is performed with a heuristic search strategy.
- The output clusters are enriched for Gene Ontology terms.

- [1] K. Guillemin, N. R. Salama, L. S. Tompkins, and S. Falkow. Cag pathogenicity island-specific responses of gastric epithelial cells to Helicobacter pylori infection. Proc. Natl. Acad. Sci. USA, 99:15136–15141, 2002.
- [2] M. Ramoni, P. Sebastiani, and I. S. Kohane. Cluster analysis of gene expression dynamics. Proc. Natl. Acad. Sci. USA, 99(14):9121–9126, 2002.
- [3] J. Ernst, G. J. Nau, and Z. Bar-Joseph. Clustering short time series gene expression data. Bioinformatics, 21 Suppl. 1:i159–i168, 2005.
