
Comparative Gene Expression Analysis: Data Analysis Issues and Solutions

This article discusses the challenges and solutions in analyzing gene expression data for comparative gene expression analysis. It explores different approaches for identifying similarities and differences between orthologous genes in two organisms.

evelinec

Presentation Transcript


  1. Comparative Gene Expression Analysis: Data Analysis Issues and Solutions
     Vipin Kumar, William Norris Professor and Head, Department of Computer Science

  2. Problem Definition
     • Goal: gain biological insights by analyzing which genes behave the same way, and which diverge, across the two organisms
     • Existing techniques can identify pairs of orthologous genes between two organisms
     • C. albicans and S. cerevisiae have 4000 such pairs

  3. One Approach (Judith Berman, et al.)
     • Step 1: Identify clusters of functionally related orthologous genes within one organism
       • Select a functionally related group of genes
       • Find clusters using similarities computed from the gene expression data of the organism
     • Step 2: Split each cluster into two clusters
       • Use the similarities computed from the gene expression data of the second organism
       • Analyze for similarities and differences
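
The two steps above can be sketched in a few lines of Python. The slide does not say which clustering algorithm is used, so both routines here are stand-ins chosen for brevity: step 1 groups genes into connected components of a similarity graph, and step 2 seeds two groups with the least similar pair in the second organism. The function names, the 0.5 threshold, and the toy similarity values are all assumptions made for illustration.

```python
def symmetric(pair_scores):
    """Expand {(g, h): score} into a nested dict with sim[g][h] == sim[h][g]."""
    sim = {}
    for (g, h), score in pair_scores.items():
        sim.setdefault(g, {})[h] = score
        sim.setdefault(h, {})[g] = score
    return sim


def cluster_by_threshold(genes, sim, threshold):
    """Step 1 stand-in: connected components of the graph that links
    every pair of genes whose similarity is at least `threshold`."""
    unseen = set(genes)
    clusters = []
    while unseen:
        start = unseen.pop()
        component, frontier = {start}, [start]
        while frontier:
            g = frontier.pop()
            linked = {h for h in unseen if sim[g][h] >= threshold}
            unseen -= linked
            component |= linked
            frontier.extend(linked)
        clusters.append(component)
    return clusters


def split_in_two(cluster, sim):
    """Step 2 stand-in: use the least similar pair (in the second organism)
    as seeds, then assign each gene to the seed it resembles more."""
    members = sorted(cluster)
    if len(members) < 2:
        return set(members), set()
    pairs = [(g, h) for i, g in enumerate(members) for h in members[i + 1:]]
    seed1, seed2 = min(pairs, key=lambda p: sim[p[0]][p[1]])
    left, right = {seed1}, {seed2}
    for g in members:
        if g not in (seed1, seed2):
            (left if sim[g][seed1] >= sim[g][seed2] else right).add(g)
    return left, right


# Toy data (hypothetical): g1..g4 co-express in organism A, g5 does not.
genes = ["g1", "g2", "g3", "g4", "g5"]
all_pairs = [(g, h) for i, g in enumerate(genes) for h in genes[i + 1:]]
sim_a = symmetric({p: (0.1 if "g5" in p else 0.9) for p in all_pairs})
# In organism B only g1/g2 and g3/g4 remain similar.
sim_b = symmetric({p: (0.9 if set(p) in ({"g1", "g2"}, {"g3", "g4"}) else 0.1)
                   for p in all_pairs})

clusters = cluster_by_threshold(genes, sim_a, threshold=0.5)
splits = [split_in_two(c, sim_b) for c in clusters]
```

On this toy input the four-gene cluster found in organism A is split into {g1, g2} and {g3, g4} by the similarities of organism B, which is the kind of divergence the approach is meant to surface.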

  4. Problems With Step 1
     • Clustering techniques may produce incorrect clusters due to
       • Noise
       • Varying cluster sizes
       • Varying cluster density
       • Non-globular cluster shape
       • High-dimensional data
       • Clusters that exist in subsets of the attributes
       • Clusters may be overlapping
       • Normalization
       • Choice of similarity measure

  5. Problems With Step 2
     • Given a decomposition of genes into functionally coherent clusters for two organisms, A and B, there is a wide variety of relationships between the clusters of the two organisms
     • Some relationships are not captured by the current approach
       • Example: a cluster of genes in organism A may (1) be split into two standalone clusters, or (2) be split into two groups that are each just a part of larger clusters
     • Focusing on one cluster at a time does not take into account cross-talk between functional categories

  6. Alternative #1: Similarity-Based Approach
     • Directly compare the pattern of similarities of a gene g in both organisms
       • Idea is that the function of a gene is conserved if its relationship to other genes is similar in both organisms
       • Degree of similarity reflects the degree of overlap
     • Assign a value between 0 and 1 to each pair that indicates the divergence or conservation of functionality
       • A value of 0 implies divergence of function
       • A value of 1 implies conservation of function
       • Intermediate values indicate intermediate degrees of conservation/divergence
     (Figure label: orthologous pair of genes)

  7. Shared Nearest Neighbor Approach
     • Idea is that the function of a gene is conserved if its relationship to other genes is similar in both organisms

  8. Shared Nearest Neighbor Approach
     • For each pair of orthologues of a gene g in organisms A and B
       • Assign a measure based on the overlap of the two k-nearest-neighbor lists
     • Various possibilities
       • Fraction of overlap in the k-nearest-neighbor lists (0 indicates no overlap, 1 indicates complete overlap)
       • Use a weighted measure (high weight for high ranks)
     • A pair of orthologues with a high value of the measure is likely to have conserved behavior
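
A minimal sketch of the first possibility, the unweighted fraction of overlap, follows. It assumes that each organism's expression data is a dict mapping a shared orthologue identifier to a list of expression values, and that Pearson correlation is the similarity measure; the slides fix neither choice. All names and the toy profiles are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def k_nearest(expr, gene, k):
    """The k genes whose profiles correlate most strongly with `gene`."""
    others = [g for g in expr if g != gene]
    others.sort(key=lambda g: pearson(expr[gene], expr[g]), reverse=True)
    return others[:k]


def snn_conservation(expr_a, expr_b, gene, k):
    """Fraction of the k nearest neighbors of `gene` shared by both organisms.

    Assumes expr_a and expr_b are keyed by the same orthologue identifiers.
    Returns a value in [0, 1]: 0 means no overlap, 1 means complete overlap.
    """
    if not 1 <= k < len(expr_a):
        raise ValueError("k must be between 1 and the number of other genes")
    nn_a = set(k_nearest(expr_a, gene, k))
    nn_b = set(k_nearest(expr_b, gene, k))
    return len(nn_a & nn_b) / k


# Toy profiles (hypothetical): g1 tracks g2 and g3 in organism A,
# but tracks g4 and g5 in organism B.
expr_a = {"g1": [1, 2, 3, 4], "g2": [1, 2, 3, 5], "g3": [2, 3, 4, 4],
          "g4": [4, 3, 2, 1], "g5": [3, 1, 4, 1]}
expr_b = {"g1": [1, 2, 3, 4], "g2": [4, 3, 2, 1], "g3": [3, 1, 4, 1],
          "g4": [1, 2, 3, 5], "g5": [2, 3, 4, 4]}

score = snn_conservation(expr_a, expr_b, "g1", k=2)
```

The weighted variant mentioned on the slide could replace the set intersection with a sum of rank-dependent weights; the weighting scheme is not specified in the source, so it is left out here.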

  9. Alternative #2: Contrast Sets (motivated by Bay and Pazzani, KDD 99)
     • A contrast set is a set of genes that have very high similarity (in expression patterns) in one organism and low similarity in the other organism
     • Contrast sets can be overlapping
     • The set of candidates is exponentially large
     • Recent advances make it possible to prune the search space and compute them efficiently
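
To make the definition concrete, here is a brute-force sketch that enumerates gene sets of a fixed size. It is not the pruned search the slide alludes to, and it reads "very high" and "low" similarity as thresholds on every pairwise similarity within the set, which is one possible interpretation. The thresholds 0.8 and 0.2, the function name, and the toy values are assumptions.

```python
from itertools import combinations


def contrast_sets(genes, sim_a, sim_b, size, high=0.8, low=0.2):
    """Gene sets of `size` members that are tightly similar in organism A
    (every pair >= high) and dissimilar in organism B (every pair <= low).

    sim_a and sim_b are nested dicts; only entries sim[g][h] with g < h
    (in sorted order) are read. Exhaustive, so cost grows combinatorially.
    """
    found = []
    for subset in combinations(sorted(genes), size):
        pairs = list(combinations(subset, 2))
        tight_in_a = all(sim_a[g][h] >= high for g, h in pairs)
        loose_in_b = all(sim_b[g][h] <= low for g, h in pairs)
        if tight_in_a and loose_in_b:
            found.append(set(subset))
    return found


# Toy similarities (hypothetical values).
genes = ["g1", "g2", "g3"]
sim_a = {"g1": {"g2": 0.9, "g3": 0.9}, "g2": {"g3": 0.9}}
sim_b = {"g1": {"g2": 0.1, "g3": 0.95}, "g2": {"g3": 0.1}}

pairs_found = contrast_sets(genes, sim_a, sim_b, size=2)
triples_found = contrast_sets(genes, sim_a, sim_b, size=3)
```

The two pairs found on the toy input share gene g2, which illustrates that contrast sets can overlap. Swapping `sim_a` and `sim_b` finds sets that are tight in B and loose in A.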

  10. Alternatives for Step 2
     • Assume that the output of step 1 is accurate
     • Could apply statistical tests for comparing distributions
       • T-test commonly used for comparing individual genes
       • Issues for comparing clusters using this scheme
         • Need to define a multi-dimensional version of the T-test
         • Only tests equality of the sample means
         • Assumes that the conditions are the same for the samples
     • Could apply techniques developed for comparing partitions (Strehl and Ghosh, 2002)
       • Measures of distance between partitions
       • Evaluate which clusters contribute most to the distance
       • Catch: works only for the same data set (correlation matrices for the two organisms in this case)
     • Need a more general solution
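
As an illustration of a partition-comparison measure, the sketch below computes normalized mutual information, I(A;B) / sqrt(H(A) H(B)), between two cluster labelings of the same genes. Whether this is the exact measure the slide has in mind is an assumption; it is shown because it makes the "same data set" catch visible: both label lists must cover the same genes in the same order.

```python
from collections import Counter
from math import log, sqrt


def normalized_mutual_information(labels_a, labels_b):
    """NMI between two partitions of the same items.

    labels_a[i] and labels_b[i] are the cluster labels of item i in each
    partition. Returns 1.0 for identical partitions (up to relabeling)
    and 0.0 for statistically independent ones.
    """
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("both partitions must label the same non-empty item list")
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mutual = sum((nij / n) * log(nij * n / (count_a[a] * count_b[b]))
                 for (a, b), nij in joint.items())
    h_a = -sum((c / n) * log(c / n) for c in count_a.values())
    h_b = -sum((c / n) * log(c / n) for c in count_b.values())
    if h_a == 0 or h_b == 0:
        # A single-cluster partition has zero entropy; treat two such
        # partitions as identical and anything else as unrelated.
        return 1.0 if h_a == h_b else 0.0
    return mutual / sqrt(h_a * h_b)


# Cluster labels for the same four genes under two clusterings (toy data).
same = normalized_mutual_information([0, 0, 1, 1], ["x", "x", "y", "y"])
unrelated = normalized_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

A distance can be taken as one minus this value, though that convention is also an assumption rather than something stated in the slides.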

  11. General Solution to Step 2
     • Compare sets of clusters derived from two different but related data sets
     • Biologically inspired overlap-based approach:
       • Consider cluster C1 of genes for the first organism and C2 for the second
       • |C1∩C2|/|C2| > α1 implies the genes in C2 are still working together for a function similar to that of C1
       • Else, |C1∩C2|/|C2| < α2 implies the genes in C2 have diverged into some other functional category
     • Guidelines for choosing α1 and α2:
       • Ideally, α1→1 and α2→0
       • α1 should be small enough to allow splits into more than two clusters
       • Similarly, α2 should be just high enough to be able to identify outliers
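
The overlap test above translates directly into code. In this sketch the default thresholds α1 = 0.8 and α2 = 0.2 are placeholders, not values from the slides, and the label "undetermined" for ratios that fall between the two thresholds is an added assumption, since the slide does not say how that case is handled.

```python
def classify_overlap(c1, c2, alpha1=0.8, alpha2=0.2):
    """Classify cluster C2 (second organism) relative to C1 (first organism)
    using the ratio |C1 ∩ C2| / |C2|.

    Both clusters are collections of shared orthologue identifiers.
    """
    c1, c2 = set(c1), set(c2)
    if not c2:
        raise ValueError("C2 must contain at least one gene")
    ratio = len(c1 & c2) / len(c2)
    if ratio > alpha1:
        return "conserved"      # C2 still works together like C1
    if ratio < alpha2:
        return "diverged"       # C2 moved to another functional category
    return "undetermined"       # between the thresholds (assumed label)


# Toy clusters (hypothetical gene identifiers).
c1 = {"a", "b", "c", "d", "e"}
conserved = classify_overlap(c1, {"a", "b", "c", "d"})             # ratio 4/4
diverged = classify_overlap(c1, {"a", "v", "w", "x", "y", "z"})    # ratio 1/6
unclear = classify_overlap(c1, {"a", "b", "x", "y"})               # ratio 2/4
```

Note that the ratio is normalized by |C2| only, so a small C2 fully contained in a large C1 counts as conserved; this is what lets a cluster in the first organism split into several clusters in the second.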
