Comparative Gene Expression Analysis: Data Analysis Issues and Solutions

Vipin Kumar, William Norris Professor and Head, Department of Computer Science

Problem Definition • Goal: gain biological insights by analyzing which genes behave similarly or divergently across two organisms • Existing techniques can identify pairs of orthologous genes between two organisms • C. albicans and S. cerevisiae have 4000 such pairs

One Approach (Judith Berman, et al.) • Step 1: Identify clusters of functionally related orthologous genes within one organism • Select a functionally related group of genes • Find clusters using similarities computed from the gene expression data of the organism • Step 2: Split each cluster into two clusters • Use the similarities computed from the gene expression data of the second organism • Analyze for similarities and differences

Problems With Step 1 • Clustering techniques may produce incorrect clusters due to • Noise • Varying cluster sizes • Varying cluster density • Non-globular cluster shape • High-dimensional data • Clusters that exist in subsets of the attributes • Clusters may be overlapping • Normalization • Choice of similarity measure

Problems With Step 2 • Given a decomposition of genes into functionally coherent clusters for two organisms, A and B, there is a wide variety of possible relationships between the clusters of the two organisms • Some relationships are not captured by the current approach • Example: a cluster of genes in organism A may (1) be split into two standalone clusters in organism B, or (2) be split into two groups that are each just a part of a larger cluster • Focusing on one cluster at a time does not take into account cross-talk between functional categories

Alternative #1: Similarity-Based Approach • Directly compare the pattern of similarities of a gene g in both organisms • Idea is that the function of a gene is conserved if its relationship to other genes is similar in both organisms • Degree of similarity reflects the degree of overlap • Assign a value between 0 and 1 to each orthologous pair of genes that indicates the divergence or conservation of functionality • A value of 0 implies divergence of function • A value of 1 implies conservation of function • Intermediate values indicate intermediate degrees of conservation/divergence

Shared Nearest Neighbor Approach Idea is that the function of a gene is conserved if its relationship to other genes is similar in both organisms

Shared Nearest Neighbor Approach • For each gene g with orthologues in organisms A and B • Assign a measure based on the overlap of its k-nearest-neighbor lists in the two organisms • Various possibilities • Fraction of overlap in the k-nearest-neighbor lists (0 indicates no overlap, 1 indicates complete overlap) • Use a weighted measure (high weight for high ranks) • A pair of orthologues with a high value of the measure is likely to have conserved behavior
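
The unweighted overlap measure can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: it assumes each organism's expression data is a dictionary keyed by a shared orthologue-pair ID, and it uses Pearson correlation as the similarity measure, which the slides do not specify. The names `pearson`, `knn`, `snn_overlap`, and the toy data are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def knn(expr, gene, k):
    """The k genes most similar to `gene` within one organism."""
    others = [g for g in expr if g != gene]
    others.sort(key=lambda g: pearson(expr[gene], expr[g]), reverse=True)
    return others[:k]

def snn_overlap(expr_a, expr_b, gene, k):
    """Fraction of the k nearest neighbors of `gene` shared by both organisms.

    Assumes k is at most the number of other genes.
    0.0 means no overlap, 1.0 means complete overlap.
    """
    shared = set(knn(expr_a, gene, k)) & set(knn(expr_b, gene, k))
    return len(shared) / k

# Toy data (invented for illustration): keys are orthologue-pair IDs,
# values are expression profiles measured in each organism.
expr_a = {"g1": [1, 2, 3, 4], "g2": [1, 2, 3, 5], "g3": [4, 3, 2, 1], "g4": [1, 3, 2, 4]}
expr_b = {"g1": [1, 2, 3, 4], "g2": [4, 3, 2, 0], "g3": [1, 2, 3, 5], "g4": [1, 3, 2, 4]}

score = snn_overlap(expr_a, expr_b, "g1", k=2)
```

In the toy data the two nearest neighbors of g1 are g2 and g4 in organism A but g3 and g4 in organism B, so only one of two neighbors is shared. A rank-weighted variant would replace the set intersection with a sum of weights over the shared neighbors.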

Alternative #2: Contrast Sets (motivated by Bay and Pazzani, KDD 99) • A contrast set is a set of genes that have very high similarity (in expression patterns) for one organism and low similarity for the other organism • Contrast sets can be overlapping • The set of candidates is exponentially large • Recent advances make it possible to prune the search space and compute them efficiently
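
To make the definition concrete, the sketch below enumerates candidate gene sets of a fixed size by brute force. It is deliberately naive and is not the pruned search the slide refers to: it visits every subset, which is exactly the exponential cost mentioned above. The function name `contrast_sets`, the thresholds `hi` and `lo`, the rule that every pair in the set must pass the threshold, and the toy similarity values are all assumptions made for illustration.

```python
from itertools import combinations

def contrast_sets(sim_a, sim_b, genes, size, hi=0.8, lo=0.2):
    """Gene sets whose pairwise similarity is >= hi in A and <= lo in B.

    sim_a and sim_b map frozenset({gene1, gene2}) to a similarity value.
    Exhaustive enumeration: feasible only for small inputs.
    """
    found = []
    for subset in combinations(sorted(genes), size):
        pairs = [frozenset(p) for p in combinations(subset, 2)]
        tight_in_a = all(sim_a[p] >= hi for p in pairs)
        loose_in_b = all(sim_b[p] <= lo for p in pairs)
        if tight_in_a and loose_in_b:
            found.append(subset)
    return found

# Toy pairwise similarities (invented for illustration).
genes = ["g1", "g2", "g3"]
sim_a = {frozenset(("g1", "g2")): 0.90,
         frozenset(("g1", "g3")): 0.90,
         frozenset(("g2", "g3")): 0.85}
sim_b = {frozenset(("g1", "g2")): 0.10,
         frozenset(("g1", "g3")): 0.90,
         frozenset(("g2", "g3")): 0.15}

pairs_found = contrast_sets(sim_a, sim_b, genes, size=2)
triples_found = contrast_sets(sim_a, sim_b, genes, size=3)
```

The two pairs found share g2, which illustrates that contrast sets can overlap. Swapping `sim_a` and `sim_b` in the call finds sets that are tight in the second organism and loose in the first.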

Alternatives for Step 2 • Assume that the output of step 1 is accurate • Could apply statistical tests for comparing distributions • T-test commonly used for comparing individual genes • Issues for comparing clusters using this scheme • Need to define a multi-dimensional version of the T-test • Only tests equality of the sample means • Assumes that the conditions are the same for the samples • Could apply techniques developed for comparing partitions (Strehl and Ghosh, 2002) • Measures of distance between partitions • Evaluate which clusters contribute most to the distance • Catch: Works only for the same data set (Correlation matrices for the two organisms in this case) • Need a more general solution
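
As an illustration of a measure of agreement between partitions, the sketch below computes normalized mutual information (NMI) between two clusterings of the same genes, using the geometric mean of the two entropies as the normalizer. Choosing NMI here is an assumption, since the slide does not name a specific measure, and `1 - nmi(...)` is only one possible distance. The function names and toy partitions are hypothetical. The code restricts itself to genes present in both partitions, which reflects the catch noted above: the comparison is only defined over a common set of objects.

```python
from collections import Counter
from math import log, sqrt

def entropy(labels):
    """Shannon entropy (natural log) of a list of cluster labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def nmi(part_a, part_b):
    """Normalized mutual information between two partitions.

    part_a and part_b map gene -> cluster label. Only genes present in
    both partitions are compared.
    """
    genes = sorted(part_a.keys() & part_b.keys())
    la = [part_a[g] for g in genes]
    lb = [part_b[g] for g in genes]
    n = len(genes)
    ca, cb, joint = Counter(la), Counter(lb), Counter(zip(la, lb))
    mi = sum((nij / n) * log(nij * n / (ca[i] * cb[j]))
             for (i, j), nij in joint.items())
    ha, hb = entropy(la), entropy(lb)
    if ha == 0 or hb == 0:
        # A single-cluster partition carries no information; treat two
        # such partitions as identical and anything else as unrelated.
        return 1.0 if ha == hb else 0.0
    return mi / sqrt(ha * hb)

# Toy partitions (invented for illustration).
clusters_a = {"g1": 0, "g2": 0, "g3": 1, "g4": 1}
relabeled = {"g1": "x", "g2": "x", "g3": "y", "g4": "y"}
unrelated = {"g1": 0, "g2": 1, "g3": 0, "g4": 1}

same_score = nmi(clusters_a, relabeled)
unrelated_score = nmi(clusters_a, unrelated)
```

Because NMI depends only on how genes are grouped, the relabeled partition scores as identical to the original, while the partition that cuts across both clusters scores zero.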

General Solution to Step 2 • Compare sets of clusters derived from two different but related data sets • Biologically-inspired overlap-based approach: • Consider cluster C1 of genes for first organism and C2 for second • |C1∩C2|/|C2|>α1 implies genes in C2 still working together for a function similar to C1 • Else, |C1∩C2|/|C2|<α2 implies genes in C2 have diverged into some other functional category • Guidelines for choosing the α’s: • Ideally, α1→1 and α2→0 • α1 should be small enough to allow splits into more than two clusters • Similarly, α2 should be just high enough to be able to identify outliers
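
The overlap rule can be written down directly. The sketch below is a minimal illustration under stated assumptions: clusters are Python sets of orthologue IDs, the default values for `alpha1` and `alpha2` are placeholders rather than recommended settings, and the label "undetermined" for overlap fractions that fall between the two thresholds is an addition, since the slide does not say how to treat that range.

```python
def classify_overlap(c1, c2, alpha1=0.8, alpha2=0.2):
    """Classify cluster c2 (second organism) against cluster c1 (first).

    Returns "conserved" if |c1 & c2| / |c2| > alpha1,
            "diverged"  if |c1 & c2| / |c2| < alpha2,
            "undetermined" otherwise (an assumption; see lead-in).
    """
    if not c2:
        raise ValueError("c2 must contain at least one gene")
    fraction = len(c1 & c2) / len(c2)
    if fraction > alpha1:
        return "conserved"
    if fraction < alpha2:
        return "diverged"
    return "undetermined"

# Toy clusters (invented for illustration).
c1 = {"a", "b", "c", "d", "e"}
mostly_inside = {"a", "b", "c", "d"}             # 4 of 4 genes lie in c1
mostly_outside = {"a", "q", "r", "x", "y", "z"}  # 1 of 6 genes lies in c1
half_inside = {"a", "b", "x", "y"}               # 2 of 4 genes lie in c1

labels = [classify_overlap(c1, c) for c in (mostly_inside, mostly_outside, half_inside)]
```

Note that the fraction is normalized by |C2| only, so a small C2 that lies entirely inside a much larger C1 still counts as conserved. That is what lets one cluster in the first organism split into several clusters in the second.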
