Biclustering Algorithms for Biological Data Analysis

Csci 8980: Mining Biomedical Datasets, Spring 2011
Some slides are taken from Matthew Hibbs.

Introduction
• Microarray technology provides expression levels for thousands of genes.
• Microarray data can be viewed as an N × M matrix (N genes, M conditions):
  • Rows represent genes and columns represent conditions.
  • Each entry represents the expression level of a gene under a condition.
  • A row (column) is sometimes referred to as the "expression profile" of the gene (condition).

Large-Scale Expression Data
Pooling genome-wide expression measurements from many experiments.
[Figure: genes measured across diverse conditions, grouped into sets of specific conditions such as stress and cell cycle. Source: http://serverdgm.unil.ch/bergmann]

Clustering Microarray Data
If two genes are related (have similar functions or are co-regulated), their expression profiles should be similar (e.g. low Euclidean distance or high correlation).
[Figure: clustered heat map of genes (rows) by conditions (columns), from Marit E. Hystad et al., "Characterization of Early Stages of Human B Cell Development by Gene Expression Profiling", Journal of Immunology, 2007]
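The two similarity measures named on this slide can be illustrated with a minimal sketch. The function names and the three toy profiles below are illustrative assumptions, not data from the lecture.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two expression profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def pearson(p, q):
    """Pearson correlation between two expression profiles of equal length."""
    n = len(p)
    mean_p, mean_q = sum(p) / n, sum(q) / n
    cov = sum((a - mean_p) * (b - mean_q) for a, b in zip(p, q))
    norm_p = math.sqrt(sum((a - mean_p) ** 2 for a in p))
    norm_q = math.sqrt(sum((b - mean_q) ** 2 for b in q))
    return cov / (norm_p * norm_q)

# Hypothetical profiles over four conditions.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]  # same trend as gene_a, twice the amplitude
gene_c = [4.0, 3.0, 2.0, 1.0]  # opposite trend

print(euclidean(gene_a, gene_b), pearson(gene_a, gene_b))
print(euclidean(gene_a, gene_c), pearson(gene_a, gene_c))
```

In this toy case gene_a and gene_b are perfectly correlated even though their Euclidean distance is not zero, which is why the choice of measure matters when profiles differ in amplitude.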

Traditional Clustering Analysis
• Cluster analysis groups the data so that members of the same group are similar to each other and distinct from members of other groups.
• Gene expression data can be grouped in two ways:
  • By condition: multiple microarray experiments are grouped according to the similarity of gene expression between experiments.
  • By gene: genes are grouped according to their individual expression profiles across the experiments (e.g. to identify groups of co-regulated genes).

Typical Clustering Algorithms Used
• K-means / k-median
  • Pros: fairly simple approach that gives accurate results when discriminating unrelated classes.
  • Cons: the number of classes must be specified in advance; the search can get stuck in local minima; each object can belong to only one group.
• Hierarchical clustering
  • Pros: fairly simple approach with no need to know the number of clusters a priori.
  • Cons: each object can belong to only one group; all genes are clustered, even though some connections might be weak.

Two Major Drawbacks with Traditional Clustering
• Genes are clustered on the basis of their expression under all experimental conditions.
  • However, a cellular process may be active only in a subset of conditions.
• Each gene is assigned to only one cluster.
  • However, a single gene may participate in multiple cellular processes.
[Figure: X is the set of genes, Y is the set of conditions, and the regions are modules]

Overview of Biclustering Methods
• To overcome these problems, many "biclustering" methods have been proposed to cluster both genes and conditions simultaneously. Some of them are:
  • Cheng Y, Church GM. Biclustering of expression data. Proc Int Conf Intell Syst Mol Biol. 2000;8:93-103.
  • Getz G, Levine E, Domany E. Coupled two-way clustering analysis of gene microarray data. Proc Natl Acad Sci U S A. 2000 Oct 24;97(22):12079-84.
  • Tanay A, Sharan R, Kupiec M, Shamir R. Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data. Proc Natl Acad Sci U S A. 2004 Mar 2;101(9):2981-6.
  • Bergmann S, Ihmels J, Barkai N. Iterative signature algorithm for the analysis of large-scale gene expression data. Phys Rev E. 2003;67:031902.
  • Gasch AP, Eisen MB. Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering. Genome Biol. 2002 Oct 10;3(11):RESEARCH0059.
  • Hastie T, Tibshirani R, Eisen MB, Alizadeh A, Levy R, Staudt L, Chan WC, Botstein D, Brown P. 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biol. 2000;1(2):RESEARCH0003.
• Other references are in the project description document on the class website.

What is Biclustering?
• Finding submatrices in an n × m matrix that follow a desired pattern.
• Row/column order need not be consistent between different biclusters.

Types of Biclusters
• Constant values
• Constant values on rows
• Constant values on columns
• Coherent evolutions
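The four types listed above can be sketched as simple checks on a submatrix given as a list of rows. This is an illustration only: the function names are hypothetical, exact equality is used where real data would need a tolerance, and "coherent evolution" is read here in one narrow sense, namely that every row ranks the columns in the same order.

```python
def is_constant(sub):
    """Every entry of the submatrix has the same value."""
    first = sub[0][0]
    return all(v == first for row in sub for v in row)

def has_constant_rows(sub):
    """Each row is constant, but different rows may have different values."""
    return all(len(set(row)) == 1 for row in sub)

def has_constant_columns(sub):
    """Each column is constant, but different columns may differ."""
    return all(len(set(col)) == 1 for col in zip(*sub))

def has_coherent_evolution(sub):
    """Assumed reading: all rows induce the same ordering of the columns."""
    def column_order(row):
        return tuple(sorted(range(len(row)), key=lambda j: row[j]))
    return len({column_order(row) for row in sub}) == 1

# Hypothetical toy submatrices, one per type.
constant = [[2, 2], [2, 2]]
const_rows = [[1, 1, 1], [5, 5, 5]]
const_cols = [[1, 2, 3], [1, 2, 3]]
coherent = [[1, 5, 9], [2, 3, 30]]

print(is_constant(constant), has_constant_rows(const_rows))
print(has_constant_columns(const_cols), has_coherent_evolution(coherent))
```

The last example has neither constant rows nor constant columns, yet both rows rise from the first column to the third, so only the ordering is shared.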

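Bicluster Properties
Biclustering of expression data (Cheng and Church, ISMB 2000)
For any submatrix C_IJ, where I and J are subsets of the genes and conditions, the mean squared residue score is

$H(I, J) = \frac{1}{|I|\,|J|} \sum_{i \in I,\, j \in J} \left( c_{ij} - c_{iJ} - c_{Ij} + c_{IJ} \right)^2$

where c_iJ is the mean of row i over the columns in J, c_Ij is the mean of column j over the rows in I, and c_IJ is the mean of the whole submatrix.
A bicluster is a submatrix C_IJ that has a low mean squared residue score.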
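A minimal sketch of the mean squared residue computation, assuming the expression matrix is a list of rows and the bicluster is given by row and column index lists. The function name and the toy matrices are illustrative.

```python
def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue of the submatrix selected by rows and cols."""
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n, m = len(rows), len(cols)
    row_mean = [sum(r) / m for r in sub]
    col_mean = [sum(c) / n for c in zip(*sub)]
    all_mean = sum(row_mean) / n
    total = sum(
        (sub[a][b] - row_mean[a] - col_mean[b] + all_mean) ** 2
        for a in range(n)
        for b in range(m)
    )
    return total / (n * m)

# Rows differ only by an additive shift, so the residue is zero.
additive = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
# The second row is a multiple of the first, which leaves a small residue.
scaled = [[1, 2], [2, 4]]

print(mean_squared_residue(additive, [0, 1, 2], [0, 1, 2]))
print(mean_squared_residue(scaled, [0, 1], [0, 1]))
```

The additive example scores (numerically) zero because each entry is fully explained by its row mean, column mean and the overall mean; the scaled example scores 0.0625.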

Example
[Figure: three example biclusters over Genes 1-4 (rows) and conditions 1-7 (columns), annotated as follows]
• Has row variance but no column variance.
• There could be either row variance or column variance, and that is explained by the overall variance.
• Has column variance but no row variance.

Issues with Biclustering Methods
• Computational complexity
  • Most methods use heuristics to limit the search space.
  • Although this helps find most of the transcription modules (TMs) in a reasonable amount of time, the greedy nature of most biclustering algorithms makes them prone to getting stuck in local minima.
• Difficult to compare
  • Different biclustering methods use different definitions of a bicluster, and different optimization techniques and implementations to find them (see Madeira et al. 2004 and Prelic et al. 2006 for details).
• Difficult to choose an appropriate biclustering algorithm
  • Different algorithms vary significantly in their robustness and sensitivity to noise in the gene expression data (Prelic et al. 2006).

Cheng and Church's Algorithm
• Greedy approach that rapidly converges to a maximal bicluster.
• In phase I, it removes rows/columns with a large contribution to the mean squared residue (MSR) score.
• In phase II, it adds rows/columns that have a low contribution to the MSR, without exceeding the threshold δ.
• After a bicluster is identified, its values are randomized to prevent it from being found again.
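Phase I can be illustrated with a simplified single-node deletion loop. This is a sketch under stated assumptions, not the published algorithm in full: it removes one row or column at a time, and it omits multiple-node deletion, the phase II additions and the randomization step. The names `residue_scores`, `single_node_deletion` and the toy matrix are hypothetical.

```python
def residue_scores(matrix, rows, cols):
    """Return (msr, per-row scores, per-column scores) for rows x cols."""
    n, m = len(rows), len(cols)
    row_mean = {i: sum(matrix[i][j] for j in cols) / m for i in rows}
    col_mean = {j: sum(matrix[i][j] for i in rows) / n for j in cols}
    all_mean = sum(row_mean.values()) / n
    sq = {
        (i, j): (matrix[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
        for i in rows
        for j in cols
    }
    msr = sum(sq.values()) / (n * m)
    row_scores = {i: sum(sq[i, j] for j in cols) / m for i in rows}
    col_scores = {j: sum(sq[i, j] for i in rows) / n for j in cols}
    return msr, row_scores, col_scores

def single_node_deletion(matrix, delta):
    """Drop the worst row or column until the MSR is at most delta."""
    rows = list(range(len(matrix)))
    cols = list(range(len(matrix[0])))
    while len(rows) > 1 and len(cols) > 1:
        msr, row_scores, col_scores = residue_scores(matrix, rows, cols)
        if msr <= delta:
            break
        worst_row = max(rows, key=row_scores.get)
        worst_col = max(cols, key=col_scores.get)
        if row_scores[worst_row] >= col_scores[worst_col]:
            rows.remove(worst_row)
        else:
            cols.remove(worst_col)
    return rows, cols

# Rows 0-2 differ by additive shifts; row 3 does not follow the pattern.
data = [
    [1, 2, 3, 4],
    [2, 3, 4, 5],
    [4, 5, 6, 7],
    [9, 1, 8, 2],
]
print(single_node_deletion(data, delta=0.01))
```

On this toy matrix the loop discards row 3, whose residue contribution is the largest, and stops because the remaining rows form a purely additive pattern.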

CC
• Greedy approach
• Finds a submatrix that minimizes the MSR
• Biclusters (a) and (b) both fit the MSR definition

Order-Preserving Submatrix Problem (OPSM), Ben-Dor et al. 2003
• For a set of conditions C and an ordering T = (t_1, t_2, ..., t_s), the genes for which the ordering holds are said to support T.
• T' = {<t_1, t_2, ..., t_a>, <t_(s-b+1), ..., t_s>, s} is a partial model of order (a, b): it fixes the first a and the last b conditions of an ordering of size s.
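The notion of support for a complete ordering can be sketched as follows. The sketch assumes that an ordering "holds" for a gene when its expression values strictly increase along the listed conditions; the function names and the toy matrix are illustrative.

```python
def supports(profile, ordering):
    """True if the gene's values strictly increase along the ordering."""
    values = [profile[t] for t in ordering]
    return all(x < y for x, y in zip(values, values[1:]))

def supporting_genes(matrix, ordering):
    """Indices of the genes (rows) that support the given ordering."""
    return [g for g, profile in enumerate(matrix) if supports(profile, ordering)]

# Hypothetical expression matrix: 4 genes by 4 conditions.
expr = [
    [0.1, 0.9, 0.5, 0.3],
    [1.0, 4.0, 3.0, 2.0],
    [5.0, 1.0, 2.0, 3.0],
    [0.2, 0.8, 0.6, 0.7],
]
ordering = (0, 3, 2, 1)  # condition 0 lowest, then 3, then 2, then 1

print(supporting_genes(expr, ordering))
```

Genes 0 and 1 support the ordering even though their absolute expression levels differ by an order of magnitude, which is the point of an order-preserving pattern.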

OPSM
• The goal is to find a set of genes G' and a set of conditions C' with a complete ordering of size s.
• The algorithm starts with the best partial models of order (1, 1) and extends them to generate partial models of order (2, 1), (2, 2) and so on.
• This is repeated until the best models of order (s/2, s/2) are generated; these are complete models of size s.

SAMBA
• The goal is to find biclusters that contain extreme values.
• First, a bipartite graph is constructed using genes and conditions as nodes.
• An edge is placed between a gene and a condition if the expression of the gene under that condition differs significantly from its normal level.
• Maximum weighted bicliques are then found in this graph.
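The graph construction step can be sketched as below. Two assumptions are made for illustration: a gene's "normal level" is taken to be its mean across all conditions, and "significantly different" is taken to mean an absolute z-score at or above a hypothetical `threshold`. The sketch builds unweighted edges only and checks a candidate biclique; it does not implement the weighted biclique search itself.

```python
import statistics

def build_bipartite_edges(matrix, threshold=1.5):
    """Return the set of (gene, condition) edges with extreme expression."""
    edges = set()
    for g, profile in enumerate(matrix):
        mu = statistics.mean(profile)
        sigma = statistics.pstdev(profile)
        if sigma == 0:
            continue  # a flat profile has no extreme values
        for c, value in enumerate(profile):
            if abs(value - mu) / sigma >= threshold:
                edges.add((g, c))
    return edges

def is_biclique(edges, genes, conditions):
    """True if every selected gene is connected to every selected condition."""
    return all((g, c) in edges for g in genes for c in conditions)

# Hypothetical data: genes 0 and 1 respond strongly to condition 4.
expr = [
    [0.0, 0.0, 0.0, 0.0, 10.0],
    [1.0, 1.0, 1.0, 1.0, 9.0],
    [5.0, 5.0, 5.0, 5.0, 5.0],
]
edges = build_bipartite_edges(expr)
print(sorted(edges), is_biclique(edges, [0, 1], [4]))
```

Here genes 0 and 1 together with condition 4 form a biclique, while the flat gene 2 contributes no edges.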

ISA
• Iterative approach
• Considers two normalized matrices, E_G and E_C
• Gene score: average expression over the selected conditions in E_C
• Condition score: average expression over the selected genes in E_G
• Starts with a random subset of genes and iteratively computes condition and gene scores, keeping the conditions and genes whose scores pass the thresholds t_c and t_g
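The iteration can be sketched as follows. Several simplifications are assumed and should not be read as the published method: E_G is taken to be the matrix z-scored per condition across genes, E_C the matrix z-scored per gene across conditions, and a score passes its threshold when it exceeds the mean of all scores by t standard deviations. The seed gene set is passed in explicitly (rather than drawn at random) so the example is reproducible; all names and the toy matrix are hypothetical.

```python
import statistics

def zscore(values):
    """Normalize a list of values to zero mean and unit variance."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]

def select(scores, t):
    """Indices whose score exceeds mean + t * std of all scores."""
    mu = statistics.mean(scores)
    sigma = statistics.pstdev(scores)
    return {i for i, s in enumerate(scores) if s > mu + t * sigma}

def isa(matrix, seed_genes, t_g=0.5, t_c=0.5, max_iter=50):
    """Iterate condition and gene scoring until the selection is stable."""
    n, m = len(matrix), len(matrix[0])
    e_c = [zscore(row) for row in matrix]            # per gene, across conditions
    cols = [zscore(col) for col in zip(*matrix)]     # per condition, across genes
    e_g = [[cols[c][g] for c in range(m)] for g in range(n)]
    genes, conditions = set(seed_genes), set()
    for _ in range(max_iter):
        cond_scores = [sum(e_g[g][c] for g in genes) / len(genes) for c in range(m)]
        new_conditions = select(cond_scores, t_c)
        if not new_conditions:
            break
        gene_scores = [
            sum(e_c[g][c] for c in new_conditions) / len(new_conditions)
            for g in range(n)
        ]
        new_genes = select(gene_scores, t_g)
        if not new_genes or (new_genes == genes and new_conditions == conditions):
            break
        genes, conditions = new_genes, new_conditions
    return sorted(genes), sorted(conditions)

# Genes 0-2 are strongly expressed under conditions 0-1 only.
expr = [
    [9, 8, 1, 1, 1],
    [8, 9, 1, 2, 1],
    [9, 9, 2, 1, 1],
    [1, 2, 1, 2, 1],
    [2, 1, 2, 1, 2],
    [1, 1, 1, 2, 2],
]
print(isa(expr, seed_genes=[0, 1]))
```

Starting from two of the three module genes, the iteration recruits the third gene and settles on conditions 0 and 1. In practice the procedure is run from many random seeds and the distinct fixed points are collected.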

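ISA
[Figure: the ISA iteration. Selected genes give condition scores from E_G, which are thresholded at t_c to select conditions; selected conditions give gene scores from E_C, which are thresholded at t_g to select genes. Source: http://www.genome.jp/kegg/expression/]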

Overview of the Biclustering Methods
[Table taken from Kevin Yip, 2003]

Overview of the Biclustering Methods (continued)
[Table taken from Kevin Yip, 2003]
