Clustering Short Gene Expression Profiles

Ling Wang, Marco Ramoni, Paolo Sebastiani

Abdullah Mueen

The Problem: Input

Gene expression profiles for J genes from microarray experiments [1].


The Problem: Output

  • A clustering of the genes that groups functionally related genes in the same cluster.

Previous Works

  • Hierarchical Clustering (Eisen et al., 1998)

  • K-means and self-organizing maps (Tamayo et al., 1999)

  • Standard measures: Euclidean distance, correlation coefficient.

  • Problems

    • They ignore the sequential nature of the profiles.

    • Pairs of time series with very different temporal shapes can yield the same distance or correlation value.
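
A small numerical check makes the first problem concrete. In this sketch the profiles `a` and `b` and the permutation `perm` are made-up values, not data from the presentation. Shuffling the time points of both profiles in the same way leaves the Euclidean distance and the correlation coefficient unchanged, so neither measure uses the temporal order:

```python
import numpy as np

# Two hypothetical 5-point expression profiles (illustrative values only).
a = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
b = np.array([0.5, 1.5, 2.5, 3.5, 4.5])

# An arbitrary reordering of the time points, applied to both profiles.
perm = np.array([3, 0, 4, 1, 2])

d_original = np.linalg.norm(a - b)
d_shuffled = np.linalg.norm(a[perm] - b[perm])

r_original = np.corrcoef(a, b)[0, 1]
r_shuffled = np.corrcoef(a[perm], b[perm])[0, 1]

# Both measures are blind to the order of the time points.
print(np.isclose(d_original, d_shuffled), np.isclose(r_original, r_shuffled))
```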


Previous Works

  • Continuous representation of the profile using

    • Autoregressive Models.

    • Hidden Markov Models.

  • Advantages:

    • They account for the temporal information.

    • They work well for long profiles (10 time points or more).

    • They combine easily with Bayesian clustering.


Autoregressive Model: Definition

  • Each time point is modeled as a linear function of the p previous time points.

  • Combining the models of all the time points for gene j gives the regression xj = Xjβj + εj.

  • Xj is the regression matrix of size (n−p)×(p+1) and βj is the vector of p+1 regression coefficients.
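
To make the stated dimensions concrete, here is a minimal sketch of how such a regression matrix could be built for one profile. The function name `ar_design`, the intercept column, and the example values are assumptions for illustration; the transcript only gives the matrix size (n−p)×(p+1).

```python
import numpy as np

def ar_design(x, p):
    """Build a lagged regression matrix for an AR(p) model of one profile.

    Row t holds [1, x[t-1], ..., x[t-p]] and the matching response is x[t],
    so the matrix has n - p rows and p + 1 columns (intercept assumed).
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    if not 0 < p < n:
        raise ValueError("order p must satisfy 0 < p < len(x)")
    lags = np.column_stack([x[p - k : n - k] for k in range(1, p + 1)])
    X = np.column_stack([np.ones(n - p), lags])
    y = x[p:]
    return X, y

# Hypothetical short profile with n = 5 time points and order p = 2.
X, y = ar_design([1.0, 2.0, 4.0, 3.0, 5.0], p=2)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(X.shape)  # (n - p, p + 1)
```

With n = 5 and p = 2 there are only three equations for three coefficients, which hints at why the order cannot be large for short profiles.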


Autoregressive Model: Problems

  • Problems

    • The AR model assumes a stationary time series, and the intervals between time points are ignored.

    • For short gene expression profiles (5 time points), the regression order cannot be large.

    • A large number of genes with short expression profiles may contain random patterns, and the AR model overfits these random patterns.

The Algorithm

The algorithm has three components:

  • A model describing the dynamics of gene expression temporal profiles.

  • A probabilistic metric to score different clustering models based on the posterior probability of each clustering model.

  • A heuristic to make the search for the best clustering model feasible.

Polynomial Model: Definition

  • The expression value at each time point is approximated by a polynomial of degree p in time.

  • The combined model for a gene is
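
As an illustration of the polynomial idea, the sketch below builds a design matrix whose columns are powers of the sampling times and fits it by ordinary least squares. The name `polynomial_design`, the symbol `F`, the sampling times, and the expression values are all hypothetical, and least squares stands in for the Bayesian treatment described in the presentation. Because the actual times enter the matrix, uneven spacing between time points is taken into account.

```python
import numpy as np

def polynomial_design(times, p):
    """Return the n x (p + 1) matrix with columns 1, t, t^2, ..., t^p."""
    t = np.asarray(times, dtype=float)
    return np.vander(t, N=p + 1, increasing=True)

# Hypothetical unevenly spaced sampling times and one short profile.
times = [0, 1, 2, 4, 8]
x = np.array([0.1, 0.9, 1.7, 2.2, 1.0])

F = polynomial_design(times, p=2)
beta, *_ = np.linalg.lstsq(F, x, rcond=None)  # least-squares coefficients
fitted = F @ beta                             # smoothed profile
print(F.shape)
```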

Polynomial Model: Assumptions

  • The uncorrelated errors are normally distributed with mean 0 and variance 1/τj, where

  • The coefficients are normally distributed

  • β0, α1 and α2 are hyper-parameters of the prior distributions of the parameters.


Hyper-Parameters

  • Around 25-50% of the total number of genes/probes in the microarrays are disregarded because of their low confidence level.

  • To avoid overfitting random patterns, the hyper-parameters are estimated from these disregarded profiles, which are treated as random data.

  • If σₐ² is the sample variance of the disregarded genes, then the hyper-parameters are related through

Scoring Method

  • The scoring function is calculated using the marginal likelihood of each gene, which is

  • For the current model, the marginal likelihood of a gene is

Marginal Likelihood

  • Given the polynomial model, the assumed prior distributions and the hyper-parameters, the marginal likelihood of each gene is computed.
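
The formula itself is not preserved in this transcript. As a hedged sketch, the code below evaluates the standard closed-form log marginal likelihood of a linear model under a conjugate normal-gamma prior. The parameterization is an assumption and may differ from the slides: the coefficients are taken as N(`beta0`, (τ·`R0`)⁻¹) and the precision τ as Gamma with shape `a` and rate `b`. The names `F`, `R0`, `a` and `b` are hypothetical.

```python
import math
import numpy as np

def log_marginal_likelihood(x, F, beta0, R0, a, b):
    """Log marginal likelihood of x = F @ beta + noise under a normal-gamma prior.

    Assumed prior (not taken from the slides):
      beta | tau ~ N(beta0, inv(tau * R0)),  tau ~ Gamma(shape=a, rate=b).
    """
    x = np.asarray(x, dtype=float)
    F = np.asarray(F, dtype=float)
    beta0 = np.asarray(beta0, dtype=float)
    R0 = np.asarray(R0, dtype=float)
    n = len(x)

    Rn = R0 + F.T @ F                                  # posterior precision (scaled)
    beta_n = np.linalg.solve(Rn, R0 @ beta0 + F.T @ x) # posterior mean
    a_n = a + n / 2.0
    b_n = b + 0.5 * (x @ x + beta0 @ R0 @ beta0 - beta_n @ Rn @ beta_n)

    _, logdet0 = np.linalg.slogdet(R0)
    _, logdetn = np.linalg.slogdet(Rn)
    return (-0.5 * n * math.log(2.0 * math.pi)
            + 0.5 * (logdet0 - logdetn)
            + a * math.log(b) - a_n * math.log(b_n)
            + math.lgamma(a_n) - math.lgamma(a))

# Hypothetical example: a degree-1 polynomial model on 5 evenly spaced times.
F = np.vander(np.arange(5.0), N=2, increasing=True)
smooth = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
zigzag = np.array([0.0, 3.0, -2.0, 4.0, -1.0])
prior = dict(beta0=np.zeros(2), R0=np.eye(2), a=1.0, b=1.0)
print(log_marginal_likelihood(smooth, F, **prior) >
      log_marginal_likelihood(zigzag, F, **prior))
```

Under these assumptions a profile that the polynomial explains well receives a higher marginal likelihood than an erratic one of the same overall magnitude, which is the property the scoring function relies on.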

Scoring the Model

  • The scoring function for a clustering model is the weighted average of the marginal likelihoods of the genes.

  • The weight for each cluster varies with the size of the cluster.

Agglomerative Clustering

  • The clustering phase starts with singleton clusters.

  • It computes and

  • It iteratively merges time series into clusters until the scoring function no longer increases.

  • When merging, it averages the cluster representatives.

Heuristic Search

  • Computing the scoring function for every possible clustering model is expensive, so a heuristic is adopted.

  • Instead of scoring all possible merge pairs, the algorithm looks for one merge pair that increases the scoring function. Candidate pairs are tried in order of similarity, closest pair first, as measured by Euclidean distance, dynamic time warping, etc.
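
The search strategy can be sketched as follows. This is an illustrative outline under stated assumptions, not the presenters' implementation: the `score` function is a simple stand-in (negative within-cluster sum of squares minus a per-cluster `penalty`) rather than the Bayesian marginal-likelihood score, candidate pairs are assumed to be tried closest first by Euclidean distance between cluster means, and all names and data are hypothetical.

```python
import itertools
import numpy as np

def score(clusters, penalty=1.0):
    """Stand-in score, NOT the Bayesian score from the presentation."""
    sse = 0.0
    for members in clusters:
        block = np.asarray(members)
        sse += ((block - block.mean(axis=0)) ** 2).sum()
    return -sse - penalty * len(clusters)

def heuristic_agglomerate(profiles, penalty=1.0):
    """Greedy merging: accept the first candidate pair that raises the score."""
    clusters = [[np.asarray(p, dtype=float)] for p in profiles]  # singletons
    current = score(clusters, penalty)
    merged = True
    while merged and len(clusters) > 1:
        merged = False
        reps = [np.mean(c, axis=0) for c in clusters]  # cluster representatives
        # Order candidate pairs by distance between representatives, closest first.
        pairs = sorted(
            itertools.combinations(range(len(clusters)), 2),
            key=lambda ij: np.linalg.norm(reps[ij[0]] - reps[ij[1]]),
        )
        for i, j in pairs:
            candidate = [c for k, c in enumerate(clusters) if k not in (i, j)]
            candidate.append(clusters[i] + clusters[j])
            new = score(candidate, penalty)
            if new > current:  # first improving merge wins; stop searching
                clusters, current, merged = candidate, new, True
                break
    return clusters

# Hypothetical data: two rising and two falling 3-point profiles.
profiles = [[0, 1, 2], [0.1, 1, 2], [5, 4, 3], [5, 4, 3.1]]
result = heuristic_agglomerate(profiles)
print(len(result))
```

Stopping at the first improving merge avoids scoring every pair at every step, at the cost of possibly missing the merge with the largest gain.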

Evaluation: Simulation

Evaluation: Real Data

  • The gene expression profiles from [1] are used. The clusters are evaluated with a Gene Ontology enrichment test using EASE (Hosack et al., 2003).



  • Short gene expression profiles are modeled using polynomials.

  • A clustering model is evaluated using the marginal likelihood of the genes with respect to the polynomial model.

  • An agglomerative clustering is done with a heuristic search strategy.

  • The output clusters are enriched for Gene Ontology terms.



References

  • [1] K. Guillemin, N. R. Salama, L. S. Tompkins, and S. Falkow. Cag pathogenicity island-specific responses of gastric epithelial cells to Helicobacter pylori infection. Proc. Natl. Acad. Sci. USA, 99:15136–15141, 2002.

  • [2] M. Ramoni, P. Sebastiani, and I. S. Kohane. Cluster analysis of gene expression dynamics. Proc. Natl. Acad. Sci. USA, 99(14):9121–9126, 2002.

  • [3] J. Ernst, G. J. Nau, and Z. Bar-Joseph. Clustering short time series gene expression data. Bioinformatics, 21 Suppl. 1:i159–i168, 2005.
