Biclustering algorithms for biological data analysis

Biclustering Algorithms for Biological Data Analysis

Csci 8980: Mining Biomedical Datasets

Spring 2011

Some slides are taken from Matthew Hibbs

Introduction

  • Microarray technology provides expression levels for thousands of genes.

  • Microarray data can be viewed as an N × M matrix (N genes, M conditions):

    • Rows represent genes and columns represent conditions.

    • Each entry represents the expression level of a gene under a condition.

    • A row or column is sometimes referred to as the “expression profile” of the gene or condition.



Pooling genome-wide expression measurements from many experiments

[Figure: expression data measured under diverse conditions and under sets of specific conditions]

Clustering Microarray Data

If two genes are related (have similar functions or are co-regulated), their expression profiles should be similar (e.g. low Euclidean distance or high correlation).
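The two similarity measures named above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function names and the three gene profiles are made up for the example and do not come from the slides.

```python
import math

def euclidean_distance(p, q):
    """Euclidean distance between two equal-length expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def pearson_correlation(p, q):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(p)
    mean_p = sum(p) / n
    mean_q = sum(q) / n
    cov = sum((a - mean_p) * (b - mean_q) for a, b in zip(p, q))
    var_p = sum((a - mean_p) ** 2 for a in p)
    var_q = sum((b - mean_q) ** 2 for b in q)
    return cov / math.sqrt(var_p * var_q)

# Hypothetical profiles over four conditions (illustrative values only).
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]  # same trend as gene_a, larger amplitude
gene_c = [4.0, 3.0, 2.0, 1.0]  # opposite trend to gene_a

print(euclidean_distance(gene_a, gene_b))
print(pearson_correlation(gene_a, gene_b))
print(pearson_correlation(gene_a, gene_c))
```

Note that gene_a and gene_b are perfectly correlated even though their Euclidean distance is not zero, so the choice of measure decides whether amplitude or shape drives the grouping.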


[Figure source: Marit E. Hystad et al., “Characterization of Early Stages of Human B Cell Development by Gene Expression Profiling”, Journal of Immunology, 2007]


Traditional Clustering Analysis

  • Cluster analysis groups the data so that members of the same group are similar to each other and distinct from members of other groups.

  • Gene expression data can be grouped in two ways:

    • By condition

      • Multiple microarray experiments can be grouped according to the similarity of gene expression between experiments

    • By gene

      • The genes are grouped according to their individual expression profile across the experiments (e.g. to identify groups of co-regulated genes).


Typical Clustering Algorithms Used

  • K-means / k-median

    • Pros: Fairly simple approach and gives accurate results when discriminating unrelated classes

    • Cons: Prior specification of the number of classes the data has to be divided into; Problem of local minima; Objects are only allowed to belong to one group

  • Hierarchical clustering

    • Pros: Fairly simple approach and there is no need to know the number of clusters a priori

    • Cons: Objects are allowed to belong only to one group; All genes are clustered, although some connections might be weaker
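To make two of the k-means limitations concrete (the number of clusters is fixed in advance, and each object receives exactly one label), here is a minimal sketch in plain Python. The data, the initial centers, and the function names are hypothetical; the initial centers are fixed rather than random so that the run is reproducible.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two profiles."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(profiles, initial_centers, n_iter=10):
    """Plain k-means: k is fixed by the number of initial centers."""
    centers = [list(c) for c in initial_centers]
    labels = []
    for _ in range(n_iter):
        # Assignment step: every profile gets exactly one cluster label.
        labels = [
            min(range(len(centers)), key=lambda k: squared_distance(p, centers[k]))
            for p in profiles
        ]
        # Update step: move each center to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, lab in zip(profiles, labels) if lab == k]
            if members:
                centers[k] = [sum(vals) / len(members) for vals in zip(*members)]
    return labels, centers

# Hypothetical two-condition profiles forming two obvious groups.
profiles = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]]
labels, centers = k_means(profiles, initial_centers=[[0.0, 0.0], [6.0, 6.0]])
print(labels)
```

A different choice of initial centers can lead to a different final partition, which is the local-minima problem listed among the cons.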


Two Major Drawbacks with Traditional Clustering

  • Genes are clustered on the basis of their expression under all experimental conditions

    • However, a cellular process may be active only in a subset of conditions

  • Assignment of a single gene to only one cluster

    • However, a single gene may participate in multiple cellular processes

[Figure legend: X = set of genes; Y = set of conditions; regions = modules]


Overview of Biclustering Methods

  • To overcome these problems, many “biclustering” methods have been proposed to cluster both genes and conditions simultaneously. Some of them are:

  • Cheng Y, Church GM. Biclustering of expression data. (Proc Int Conf Intell Syst Mol Biol. 2000;8:93-103)

  • Getz G, Levine E, Domany E. Coupled two-way clustering analysis of gene microarray data. (Proc Natl Acad Sci U S A. 2000 Oct 24;97(22):12079-84)

  • Tanay A, Sharan R, Kupiec M, Shamir R. Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data. (Proc Natl Acad Sci U S A. 2004 Mar 2;101(9):2981-6)

  • Bergmann S, Ihmels J, Barkai N. Iterative signature algorithm for the analysis of large-scale gene expression data. (Phys Rev E. 2003;67:031902)

  • Gasch AP, Eisen MB. Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering. (Genome Biol. 2002 Oct 10;3(11):RESEARCH0059)

  • Hastie T, Tibshirani R, Eisen MB, Alizadeh A, Levy R, Staudt L, Chan WC, Botstein D, Brown P. 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. (Genome Biol. 2000;1(2):RESEARCH0003)

  • Other references are in the project description document on the class website

What is Biclustering?

Finding submatrices in an n × m matrix that follow a desired pattern.

Row/column order need not be consistent between different biclusters.

Types of Biclusters

Constant values

Constant values on rows

Constant values on columns

Coherent evolutions
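The four types can be illustrated with small hand-made blocks and simple checks. This is only a sketch: the example values and function names are invented, and "coherent evolutions" is read here as "every row ranks the columns in the same order", which is one common interpretation and an assumption on my part, since the slide does not define it.

```python
def is_constant(block):
    """All entries of the block share one value."""
    return len({v for row in block for v in row}) == 1

def has_constant_rows(block):
    """Each row is constant (rows may differ from each other)."""
    return all(len(set(row)) == 1 for row in block)

def has_constant_columns(block):
    """Each column is constant (columns may differ from each other)."""
    return all(len(set(col)) == 1 for col in zip(*block))

def has_coherent_evolution(block):
    """Assumed reading: every row induces the same ordering of the columns."""
    def column_order(row):
        return sorted(range(len(row)), key=lambda j: row[j])
    first = column_order(block[0])
    return all(column_order(row) == first for row in block)

# Illustrative blocks (made-up values).
constant_block = [[2, 2, 2], [2, 2, 2]]
constant_rows_block = [[1, 1, 1], [5, 5, 5]]
constant_cols_block = [[1, 4, 7], [1, 4, 7]]
coherent_block = [[1, 3, 2], [10, 50, 20], [0.1, 0.9, 0.5]]

print(is_constant(constant_block))
print(has_constant_rows(constant_rows_block))
print(has_constant_columns(constant_cols_block))
print(has_coherent_evolution(coherent_block))
```

The last block has no constant rows or columns and very different scales per row, yet all three rows rise and fall across the columns in the same order.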

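Bicluster properties

Biclustering of Expression Data: Cheng and Church, ISMB 2000

For any submatrix C_IJ, where I and J are subsets of genes and conditions, the mean squared residue score is

$$H(I, J) = \frac{1}{|I|\,|J|} \sum_{i \in I,\, j \in J} \left(a_{ij} - a_{iJ} - a_{Ij} + a_{IJ}\right)^2$$

where a_ij is an entry of the submatrix, a_iJ is the mean of row i, a_Ij is the mean of column j, and a_IJ is the mean of the whole submatrix.

A bicluster is a submatrix C_IJ that has a low mean squared residue score.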
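As a hedged illustration, the score can be computed directly from the standard Cheng and Church definition: the residue of an entry is its value minus its row mean minus its column mean plus the submatrix mean, and the score is the mean of the squared residues. The function name and the example matrices below are hypothetical.

```python
def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue of the submatrix given by row and column indices."""
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(rows), len(cols)
    row_means = [sum(r) / n_cols for r in sub]
    col_means = [sum(c) / n_rows for c in zip(*sub)]
    overall = sum(row_means) / n_rows
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            residue = sub[i][j] - row_means[i] - col_means[j] + overall
            total += residue ** 2
    return total / (n_rows * n_cols)

# An additive pattern (each row is the previous row plus a constant): score 0.
additive = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
# A checkerboard pattern: rows and columns disagree, so the score is positive.
checker = [[1, 2], [2, 1]]

print(mean_squared_residue(additive, [0, 1, 2], [0, 1, 2]))
print(mean_squared_residue(checker, [0, 1], [0, 1]))
```

The additive block scores zero because every entry is fully explained by its row and column means, which is why a low score is taken as evidence of a coherent bicluster.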

Example

[Figure: expression of Gene 1 to Gene 4 across conditions 1 to 7, with three example biclusters]

  • One bicluster has row variance but no column variance.

  • One bicluster has column variance but no row variance.

  • There could be either row variance or column variance, and that is explained by the overall variance.


Issues with Biclustering Methods

  • Computational Complexity

    • Most methods use some heuristics to limit the search space

    • Although this helps find most of the TMs in a reasonable amount of time, the greedy nature of most biclustering algorithms makes them prone to the local minima problem

  • Difficult to Compare

    • Different biclustering methods use different definitions of a bicluster, and different optimization techniques and implementations to find them (see Madeira and Oliveira 2004 and Prelic et al 2006 for details)

  • Difficult to choose appropriate biclustering algorithm

    • Different algorithms exhibit significant variations in terms of their robustness and sensitivity to noise in the gene expression data (Prelic et al 2006)

Cheng and Church’s Algorithm

Greedy approach to rapidly converge to a maximal bicluster.

In phase I, it removes rows/columns with a large contribution to the mean squared residue (MSR) score.

In phase II, it adds rows/columns that have a low contribution to the MSR, without letting the score exceed the threshold δ.

After a bicluster is identified, its values are randomized to prevent it from being found again.
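The two phases can be sketched as follows. This is a simplified illustration, not the published procedure: it deletes one row or column at a time, and in phase II it adds a row or column whenever the recomputed score stays at or below δ, which is a simplification of the original addition criterion. The randomization step is omitted. All names and the example matrix are hypothetical.

```python
def residue_scores(matrix, rows, cols):
    """Return (msr, per-row score, per-column score) for the submatrix."""
    row_mean = {i: sum(matrix[i][j] for j in cols) / len(cols) for i in rows}
    col_mean = {j: sum(matrix[i][j] for i in rows) / len(rows) for j in cols}
    overall = sum(row_mean.values()) / len(rows)
    sq = {(i, j): (matrix[i][j] - row_mean[i] - col_mean[j] + overall) ** 2
          for i in rows for j in cols}
    msr = sum(sq.values()) / (len(rows) * len(cols))
    row_score = {i: sum(sq[i, j] for j in cols) / len(cols) for i in rows}
    col_score = {j: sum(sq[i, j] for i in rows) / len(rows) for j in cols}
    return msr, row_score, col_score

def find_bicluster(matrix, delta):
    """Greedy search for one bicluster with MSR <= delta (simplified sketch)."""
    rows = list(range(len(matrix)))
    cols = list(range(len(matrix[0])))
    # Phase I: drop the row or column contributing most to the MSR.
    while len(rows) > 1 and len(cols) > 1:
        msr, row_score, col_score = residue_scores(matrix, rows, cols)
        if msr <= delta:
            break
        worst_row = max(rows, key=row_score.get)
        worst_col = max(cols, key=col_score.get)
        if row_score[worst_row] >= col_score[worst_col]:
            rows.remove(worst_row)
        else:
            cols.remove(worst_col)
    # Phase II: add back rows/columns that keep the MSR at or below delta.
    changed = True
    while changed:
        changed = False
        for j in range(len(matrix[0])):
            candidate = sorted(cols + [j])
            if j not in cols and residue_scores(matrix, rows, candidate)[0] <= delta:
                cols = candidate
                changed = True
        for i in range(len(matrix)):
            candidate = sorted(rows + [i])
            if i not in rows and residue_scores(matrix, candidate, cols)[0] <= delta:
                rows = candidate
                changed = True
    return rows, cols

# Three additive rows plus one noisy row (made-up values).
expression = [
    [1, 2, 3, 4],
    [2, 3, 4, 5],
    [4, 5, 6, 7],
    [9, 1, 8, 2],
]
bicluster_rows, bicluster_cols = find_bicluster(expression, delta=0.01)
print(bicluster_rows, bicluster_cols)
```

On this toy input the noisy fourth row has by far the largest residue contribution, so phase I removes it and phase II does not add it back.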

  • Greedy approach

  • Finds a submatrix that minimizes MSR

  • Biclusters (a) and (b) fit the definition of MSR

Order-Preserving Submatrix Problem (OPSM)

Ben-Dor et al 2003

For a set of conditions C and an ordering T = (t1, t2, …, ts), the genes for which the ordering holds are said to support T.

T' = {<t1, t2, …, ta>, <ts-b+1, …, ts>, s} is a partial model of order (a, b).

The goal is to find a set of genes G' and a set of conditions C' with a complete ordering of size s.

The algorithm starts with the best partial models of order (1,1) and extends them to generate partial models of order (2,1), (2,2) and so on.

This is repeated until the best models of order (s/2, s/2) are generated. These are simply complete models of size s.
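To make the notion of support concrete, the sketch below counts the genes whose values strictly increase along an ordering of conditions, and then finds the best complete ordering of size s by exhaustive search. The exhaustive search is a stand-in chosen for clarity and is not the partial-model growth procedure described above; it is only feasible for very small inputs. Function names and data are hypothetical.

```python
from itertools import permutations

def supports(row, ordering):
    """True if the row's values strictly increase along the column ordering."""
    values = [row[c] for c in ordering]
    return all(a < b for a, b in zip(values, values[1:]))

def supporting_genes(matrix, ordering):
    """Indices of the genes (rows) that support the ordering."""
    return [g for g, row in enumerate(matrix) if supports(row, ordering)]

def best_ordering_brute_force(matrix, s):
    """Try every ordering of s conditions; keep the one with most support."""
    n_conditions = len(matrix[0])
    return max(
        permutations(range(n_conditions), s),
        key=lambda t: len(supporting_genes(matrix, t)),
    )

# Four genes by four conditions (made-up values).
expression = [
    [0.2, 0.9, 0.5, 0.1],
    [1.0, 4.0, 2.0, 3.0],
    [5.0, 1.0, 3.0, 2.0],
    [0.3, 0.8, 0.6, 0.0],
]
best = best_ordering_brute_force(expression, s=3)
print(best, supporting_genes(expression, best))
```

The number of candidate orderings grows factorially with the number of conditions, which is the motivation for growing partial models instead of enumerating complete ones.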


SAMBA

The goal is to find biclusters that contain extreme values.

First, a bipartite graph is constructed using genes and conditions as nodes.

An edge is placed between a gene and a condition if the expression of the gene under that condition differs significantly from its normal level.

Maximum weighted bicliques are then found in this graph.
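The graph-construction step can be sketched as below. The test for "significantly different" is an assumption made for illustration: a gene's normal level is taken to be its mean across conditions, and an edge is added when the absolute z-score exceeds a threshold. Edge weights and the search for maximum weighted bicliques are not implemented; the sketch only checks whether a given gene set and condition set form a biclique. All names and values are hypothetical.

```python
import statistics

def build_bipartite_edges(matrix, threshold=1.0):
    """Return the set of (gene, condition) edges with |z-score| > threshold."""
    edges = set()
    for g, row in enumerate(matrix):
        mean = statistics.mean(row)
        sd = statistics.pstdev(row)
        if sd == 0:
            continue  # a flat profile has no extreme conditions
        for c, value in enumerate(row):
            if abs(value - mean) / sd > threshold:
                edges.add((g, c))
    return edges

def is_biclique(edges, genes, conditions):
    """True if every listed gene is connected to every listed condition."""
    return all((g, c) in edges for g in genes for c in conditions)

# Four genes by six conditions (made-up values).
expression = [
    [0, 0, 0, 0, 10, 10],
    [1, 1, 1, 1, 9, 9],
    [5, 5, 5, 5, 5, 5],
    [10, 0, 0, 0, 0, 0],
]
edges = build_bipartite_edges(expression)
print(sorted(edges))
print(is_biclique(edges, genes=[0, 1], conditions=[4, 5]))
```

Genes 0 and 1 are both extreme under conditions 4 and 5, so together they form a biclique, which corresponds to a candidate bicluster.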


ISA

  • Iterative approach

  • Considers two matrices, EG and EC

  • Gene score: average expression over the selected conditions in EC

  • Condition score: average expression over the selected genes in EG

  • Starts with a random subset of genes and iteratively computes condition and gene scores.
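The iteration can be sketched as follows. Several details are assumptions made for illustration: scores are compared against fixed thresholds, the two matrices are passed in already normalized, and the starting gene set is supplied by the caller (it could be drawn with `random.sample`) so that the run is reproducible. The function name, parameters, and data are hypothetical.

```python
def isa(e_g, e_c, seed_genes, gene_threshold, cond_threshold, max_iter=50):
    """Alternate condition and gene scoring until the selection stops changing.

    seed_genes must be non-empty. Returns (genes, conditions) as sets,
    or two empty sets if the selection dies out.
    """
    n_genes, n_conds = len(e_g), len(e_g[0])
    genes, conds = set(seed_genes), set()
    for _ in range(max_iter):
        # Condition score: average expression over the selected genes in e_g.
        cond_scores = [sum(e_g[g][c] for g in genes) / len(genes)
                       for c in range(n_conds)]
        new_conds = {c for c, s in enumerate(cond_scores) if s > cond_threshold}
        if not new_conds:
            return set(), set()
        # Gene score: average expression over the selected conditions in e_c.
        gene_scores = [sum(e_c[g][c] for c in new_conds) / len(new_conds)
                       for g in range(n_genes)]
        new_genes = {g for g, s in enumerate(gene_scores) if s > gene_threshold}
        if not new_genes:
            return set(), set()
        if new_genes == genes and new_conds == conds:
            break  # fixed point reached
        genes, conds = new_genes, new_conds
    return genes, conds

# Five genes by four conditions; genes 0-2 are high under conditions 0-1.
# The same made-up matrix is used for both arguments (normalization skipped).
expression = [
    [2.0, 2.0, 0.0, 0.0],
    [2.0, 2.0, 0.0, 0.0],
    [2.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 0.0],
]
module_genes, module_conds = isa(expression, expression, seed_genes=[0, 3],
                                 gene_threshold=0.5, cond_threshold=0.5)
print(sorted(module_genes), sorted(module_conds))
```

Starting from a seed that contains one module gene and one unrelated gene, the iteration drops the unrelated gene and recruits the remaining module genes within two rounds.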

[Figure: one ISA iteration on EG, showing the condition scores and the conditions selected]


Overview of the Biclustering Methods

[Table not recovered. Taken from Kevin Yip, 2003]
