Biclustering Algorithms for Biological Data Analysis

Csci 8980: Mining Biomedical Datasets

Spring 2011

Some slides are taken from Matthew Hibbs


Introduction

  • Microarray technology provides expression levels for thousands of genes

  • Microarray data can be viewed as an N × M matrix (N genes, M conditions):

    • Rows represent genes; columns represent conditions

    • Each entry represents the expression level of a gene under a condition.

    • A row/column is sometimes referred to as the “expression profile” of the gene/condition


Pooling genome-wide expression measurements from many experiments

[Figure: large-scale expression data (genes × diverse conditions), with sets of specific conditions such as stress and cell cycle marked. Source: http://serverdgm.unil.ch/bergmann]


Clustering Microarray data

If two genes are related (have similar functions or are co-regulated), their expression profiles should be similar (e.g. low Euclidean distance or high correlation).

[Figure: clustered heat map of genes (rows) × conditions (columns), from Marit E. Hystad et al., “Characterization of Early Stages of Human B Cell Development by Gene Expression Profiling”, Journal of Immunology, 2007]
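The two similarity measures mentioned on this slide can be sketched in a few lines of plain Python. This is only an illustration: the function names and the three toy profiles are made up for the example and do not come from the slides.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two expression profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_correlation(x, y):
    """Pearson correlation between two expression profiles of equal length."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical profiles: gene_b follows gene_a at twice the amplitude,
# gene_c moves in the opposite direction.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]
gene_c = [4.0, 3.0, 2.0, 1.0]

print(pearson_correlation(gene_a, gene_b))  # perfectly correlated
print(pearson_correlation(gene_a, gene_c))  # perfectly anti-correlated
print(euclidean_distance(gene_a, gene_b))
```

Note that gene_a and gene_b are perfectly correlated even though their Euclidean distance is not small, so the choice of measure affects which genes count as "similar".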


Traditional Clustering Analysis

  • Cluster analysis groups the data so that members of the same group are similar to each other, while different groups are distinct.

  • Gene expression data can be grouped in two ways:

    • By condition

      • Multiple microarray experiments can be grouped according to the similarity of gene expression between experiments

    • By gene

      • The genes are grouped according to their individual expression profile across the experiments (e.g. to identify groups of co-regulated genes).


Typical Clustering Algorithms Used

  • K-means / k-median

    • Pros: Fairly simple approach and gives accurate results when discriminating unrelated classes

    • Cons: The number of clusters must be specified in advance; the search can get stuck in a local minimum; each object can belong to only one group

  • Hierarchical clustering

    • Pros: Fairly simple approach and there is no need to know the number of clusters a priori

    • Cons: Each object can belong to only one group; all genes are clustered, even those whose connection to their cluster is weak


Two Major Drawbacks with Traditional Clustering

  • Genes are clustered on the basis of their expression under all experimental conditions

    • However, a cellular process may be active only in a subset of conditions

  • Assignment of a single gene to only one cluster

    • However, a single gene may participate in multiple cellular processes

[Figure: modules drawn as regions, with X = set of genes and Y = set of conditions]


Overview of Biclustering Methods

  • To overcome these problems, many “biclustering” methods have been proposed to cluster both genes and conditions simultaneously. Some of them are:

  • Cheng Y and Church GM. Biclustering of expression data. (Proc Int Conf Intell Syst Mol Biol. 2000;8:93-103)

  • Getz G, Levine E, Domany E. Coupled two-way clustering analysis of gene microarray data. (Proc Natl Acad Sci U S A. 2000 Oct 24;97(22):12079-84)

  • Tanay A, Sharan R, Kupiec M, Shamir R. Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data. (Proc Natl Acad Sci U S A. 2004 Mar 2;101(9):2981-6)

  • Bergmann S, Ihmels J, Barkai N. Iterative signature algorithm for the analysis of large-scale gene expression data. (Phys Rev E. 2003;67:031902)

  • Gasch AP and Eisen MB. Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering. (Genome Biol. 2002 Oct 10;3(11):RESEARCH0059)

  • Hastie T, Tibshirani R, Eisen MB, Alizadeh A, Levy R, Staudt L, Chan WC, Botstein D, Brown P. 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. (Genome Biol. 2000;1(2):RESEARCH0003.)

  • Other references are in the project description document on the class website


What is Biclustering?

Finding submatrices in an n × m matrix that follow a desired pattern*

* Row/column order need not be consistent between different biclusters.


Types of Biclusters

Constant values

Constant values on rows

Constant values on columns

Coherent evolutions
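A small Python sketch may help make the four types concrete. The helper names and toy blocks below are hypothetical, and "coherent evolutions" is interpreted here as "every row ranks the columns in the same order", which is one common reading rather than the only one. Ties between values are not handled specially.

```python
def is_constant(block):
    """All entries of the submatrix share a single value."""
    first = block[0][0]
    return all(v == first for row in block for v in row)

def has_constant_rows(block):
    """Each row is constant, but different rows may have different values."""
    return all(len(set(row)) == 1 for row in block)

def has_constant_columns(block):
    """Each column is constant, but different columns may differ."""
    return all(len(set(col)) == 1 for col in zip(*block))

def has_coherent_evolution(block):
    """Every row orders the columns the same way (assumed interpretation)."""
    def column_order(row):
        return sorted(range(len(row)), key=lambda j: row[j])
    reference = column_order(block[0])
    return all(column_order(row) == reference for row in block)

# Toy submatrices (rows = genes, columns = conditions).
constant_block = [[2, 2, 2], [2, 2, 2]]
constant_rows_block = [[1, 1, 1], [5, 5, 5]]
constant_columns_block = [[1, 2, 3], [1, 2, 3]]
coherent_block = [[1, 5, 3], [10, 90, 20]]

print(is_constant(constant_block))
print(has_constant_rows(constant_rows_block))
print(has_constant_columns(constant_columns_block))
print(has_coherent_evolution(coherent_block))
```

The last block has no constant rows or columns, yet both genes rise and fall across the conditions in the same order.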


Bicluster properties

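Biclustering of Expression Data: Cheng and Church, ISMB 2000

For any submatrix C_IJ, where I and J are subsets of genes and conditions, the mean squared residue score is

H(I, J) = \frac{1}{|I||J|} \sum_{i \in I, j \in J} (c_{ij} - c_{iJ} - c_{Ij} + c_{IJ})^2

where c_{iJ} is the mean of row i over the columns in J, c_{Ij} is the mean of column j over the rows in I, and c_{IJ} is the mean of the whole submatrix.

A bicluster is a submatrix C_IJ that has a low mean squared residue score.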
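As a rough illustration, the mean squared residue of Cheng and Church can be computed directly from its definition: each entry's residue is the entry minus its row mean, minus its column mean, plus the overall mean of the submatrix, and the score is the average of the squared residues. The function name and the toy matrices below are assumptions made for this sketch.

```python
def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue of the submatrix given by row and column indices."""
    n_rows, n_cols = len(rows), len(cols)
    row_mean = {i: sum(matrix[i][j] for j in cols) / n_cols for i in rows}
    col_mean = {j: sum(matrix[i][j] for i in rows) / n_rows for j in cols}
    overall = sum(row_mean[i] for i in rows) / n_rows
    total = sum(
        (matrix[i][j] - row_mean[i] - col_mean[j] + overall) ** 2
        for i in rows
        for j in cols
    )
    return total / (n_rows * n_cols)

# Hypothetical data: every row is the first row plus a constant shift,
# so row and column effects explain every entry.
additive = [
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0],
]
# Same matrix with one entry disturbed.
disturbed = [
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [5.0, 6.0, 0.0],
]

print(mean_squared_residue(additive, [0, 1, 2], [0, 1, 2]))   # essentially zero
print(mean_squared_residue(disturbed, [0, 1, 2], [0, 1, 2]))  # clearly positive
```

A submatrix with only row variance, only column variance, or an additive mix of the two scores (up to rounding) zero, which is why a low score is taken as the sign of a coherent bicluster.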


Example

[Figure: example expression matrix with Genes 1 to 4 as rows and conditions 1 to 7 as columns. Annotations on the highlighted regions:]

  • Has row variance but no column variance

  • There could be either row variance or column variance, and that is explained by the overall variance

  • Has column variance but no row variance


Issues with Biclustering Methods

  • Computational Complexity

    • Most methods use some heuristics to limit the search space

    • Although this helps find most of the TMs in a reasonable amount of time, the greedy nature of most biclustering algorithms makes them prone to the local minima problem

  • Difficult to Compare

    • Different biclustering methods use different definitions of a bicluster, and different optimization techniques and implementations to find them (see Madeira et al 2004 and Prelic et al 2006 for details)

  • Difficult to choose appropriate biclustering algorithm

    • Different algorithms exhibit significant variations in terms of their robustness and sensitivity to noise in the gene expression data (Prelic et al 2006)


Cheng and Church’s Algorithm

Greedy approach to rapidly converge to a maximal bicluster.

In phase I, it removes rows/columns that make a large contribution to the mean squared residue (MSR) score.

In phase II, it adds back rows/columns that make a low contribution to the MSR, as long as the MSR does not exceed the threshold δ.

After a bicluster is identified, its values are randomized to prevent it from being found again.
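The two phases can be sketched as follows. This is a simplified illustration, not the published algorithm: it deletes one row or column at a time, phase II simply tries each candidate and keeps it if the MSR stays at or below δ, and the randomization (masking) step is omitted. All names and the toy data are hypothetical.

```python
def msr_parts(matrix, rows, cols):
    """Return (MSR, mean squared residue per row, mean squared residue per column)."""
    row_mean = {i: sum(matrix[i][j] for j in cols) / len(cols) for i in rows}
    col_mean = {j: sum(matrix[i][j] for i in rows) / len(rows) for j in cols}
    overall = sum(row_mean.values()) / len(rows)

    def squared_residue(i, j):
        return (matrix[i][j] - row_mean[i] - col_mean[j] + overall) ** 2

    row_score = {i: sum(squared_residue(i, j) for j in cols) / len(cols) for i in rows}
    col_score = {j: sum(squared_residue(i, j) for i in rows) / len(rows) for j in cols}
    msr = sum(row_score.values()) / len(rows)
    return msr, row_score, col_score

def find_bicluster(matrix, delta):
    """Greedy deletion followed by greedy addition (simplified sketch)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    rows, cols = list(range(n_rows)), list(range(n_cols))

    # Phase I: drop the row or column with the largest contribution to the MSR.
    while True:
        msr, row_score, col_score = msr_parts(matrix, rows, cols)
        if msr <= delta or len(rows) == 1 or len(cols) == 1:
            break
        worst_row = max(rows, key=row_score.get)
        worst_col = max(cols, key=col_score.get)
        if row_score[worst_row] >= col_score[worst_col]:
            rows.remove(worst_row)
        else:
            cols.remove(worst_col)

    # Phase II: add back anything that keeps the MSR at or below delta.
    for j in range(n_cols):
        if j not in cols and msr_parts(matrix, rows, cols + [j])[0] <= delta:
            cols.append(j)
    for i in range(n_rows):
        if i not in rows and msr_parts(matrix, rows + [i], cols)[0] <= delta:
            rows.append(i)
    return sorted(rows), sorted(cols)

# Hypothetical data: an additive 3 x 3 pattern (rows 0-2, columns 0-2)
# embedded in a matrix with an unrelated extra row and column.
data = [
    [1.0, 2.0, 3.0, 9.0],
    [2.0, 3.0, 4.0, 0.0],
    [3.0, 4.0, 5.0, 7.0],
    [9.0, 0.0, 8.0, 1.0],
]
found_rows, found_cols = find_bicluster(data, delta=0.5)
print(found_rows, found_cols)
print(msr_parts(data, found_rows, found_cols)[0])
```

Because each step only looks at the single worst row or column, the result depends on the order of removals, which is the local-minimum behaviour mentioned earlier in the slides.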


CC

  • Greedy Approach

  • Finds a submatrix that minimizes MSR

  • Biclusters (a) and (b) both fit the MSR definition


Order-Preserving Submatrix Problem (OPSM)

Ben-Dor et al 2003

For a set of conditions C and an ordering T = (t1, t2, … ts), the genes whose expression values follow the ordering are said to support T.

T' = {<t1, t2 … ta>, <ts-b+1, … ts>, s} is a partial model of order (a, b): it fixes only the first a and the last b conditions of an ordering of size s.
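The notion of support can be illustrated with a short Python sketch. It assumes that "the ordering holds" means the gene's expression values are strictly increasing along the conditions listed in T. The function names and the toy matrix are made up for this example.

```python
def supports(profile, ordering):
    """True if the gene's values strictly increase along the given condition order."""
    values = [profile[c] for c in ordering]
    return all(a < b for a, b in zip(values, values[1:]))

def supporting_genes(matrix, ordering):
    """Indices of the genes (rows) that support the ordering."""
    return [g for g, profile in enumerate(matrix) if supports(profile, ordering)]

# Toy matrix: rows are genes, columns are conditions 0, 1, 2.
expression = [
    [0.1, 0.9, 0.5],
    [3.0, 8.0, 5.0],
    [5.0, 1.0, 2.0],
]

# Ordering T = (condition 0, condition 2, condition 1).
print(supporting_genes(expression, (0, 2, 1)))
```

Genes 0 and 1 have very different absolute values but rank the three conditions the same way, so both support T, while gene 2 does not.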


OPSM

The goal is to find a set of genes G' and a set of conditions C' with a complete ordering of size s.

The algorithm starts with the best partial models of order (1,1) and extends them to generate partial models of order (2,1), (2,2) and so on.

This is repeated until the best models of order (s/2, s/2) are generated. These are complete models of size s.


SAMBA

The goal is to find biclusters that contain extreme values.

First, a bipartite graph is constructed using genes and conditions as nodes.

An edge is placed between a gene and a condition if the gene's expression under that condition differs significantly from its normal level.

Maximum-weight bicliques are then searched for in this graph.
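The graph construction step can be sketched as below. This is an assumption-laden illustration: a gene's "normal level" is taken to be its mean across all conditions, "significantly different" is approximated by a z-score cut-off (the `z_threshold` parameter is hypothetical), and the search for heavy bicliques is not shown.

```python
import statistics

def build_bipartite_edges(matrix, z_threshold=1.5):
    """Return (gene, condition) edges where expression deviates strongly from the gene's mean."""
    edges = []
    for gene, profile in enumerate(matrix):
        mean = statistics.mean(profile)
        spread = statistics.pstdev(profile)
        if spread == 0:
            continue  # a flat profile has no condition that stands out
        for condition, value in enumerate(profile):
            if abs(value - mean) / spread >= z_threshold:
                edges.append((gene, condition))
    return edges

# Toy matrix: rows are genes, columns are conditions.
expression = [
    [0, 0, 0, 0, 10],   # gene 0 reacts only to condition 4
    [1, 1, 1, 1, 1],    # gene 1 never changes
    [5, -5, 0, 0, 0],   # gene 2 is up in condition 0 and down in condition 1
]

print(build_bipartite_edges(expression))
```

In the resulting graph a bicluster corresponds to a group of genes and conditions that are densely connected to each other.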


ISA

  • Iterative approach

  • Considers two matrices, EG and EC

  • Gene score – avg expression over selected conditions in EC

  • Condition score - avg expression from selected genes in EG

  • Starts with random subset of genes, iteratively computes condition and gene scores.
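The iteration described in these bullets can be sketched as follows. Several simplifications are assumed: a single matrix stands in for both EG and EC, the thresholds `t_g` and `t_c` are plain cut-offs on the average expression, and the starting gene set is passed in rather than drawn at random. The function name and toy data are hypothetical.

```python
def isa(matrix, seed_genes, t_g, t_c, max_iter=50):
    """Alternate condition scoring and gene scoring until the selection stops changing."""
    n_genes, n_conds = len(matrix), len(matrix[0])
    genes, conds = set(seed_genes), set()
    for _ in range(max_iter):
        # Condition score: average expression over the currently selected genes.
        cond_score = [
            sum(matrix[g][c] for g in genes) / len(genes) for c in range(n_conds)
        ]
        new_conds = {c for c in range(n_conds) if cond_score[c] > t_c}
        if not new_conds:
            return set(), set()
        # Gene score: average expression over the newly selected conditions.
        gene_score = [
            sum(matrix[g][c] for c in new_conds) / len(new_conds)
            for g in range(n_genes)
        ]
        new_genes = {g for g in range(n_genes) if gene_score[g] > t_g}
        if not new_genes:
            return set(), set()
        if new_genes == genes and new_conds == conds:
            break  # fixed point reached
        genes, conds = new_genes, new_conds
    return genes, conds

# Toy matrix: genes 0 and 1 are active under conditions 0 and 1 only.
expression = [
    [5, 5, 0, 0],
    [5, 5, 0, 0],
    [0, 0, 5, 5],
    [0, 0, 0, 0],
]

# Start from a seed that contains one relevant gene and one irrelevant gene.
print(isa(expression, seed_genes={0, 3}, t_g=2, t_c=2))
```

Starting from the mixed seed, the first pass selects conditions 0 and 1, the gene scores then replace gene 3 with gene 1, and the next pass leaves the selection unchanged.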


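ISA

[Figure: one ISA iteration. The selected genes give condition scores (via EG); conditions scoring above the threshold tc are selected; the selected conditions give gene scores (via EC); genes scoring above the threshold tg are selected. Link shown on the slide: http://www.genome.jp/kegg/expression/]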


Overview of the Biclustering Methods

Taken from Kevin Yip, 2003


Overview of the Biclustering Methods

Taken from Kevin Yip, 2003

