Statistical Tests for Gene Clusters Across Three Genomic Regions

Statistical Tests for Gene Clusters Spanning Three Genomic Regions
Narayanan Raghupathy*, Rose Hoberman*, and Dannie Durand
Carnegie Mellon University, Pittsburgh, PA
* Contributed equally

Gene clusters: evidence of common ancestry?
Many analyses use gene clusters (distinct chromosomal regions that share homologous gene pairs, but for which neither gene order nor gene content is preserved) as evidence of shared ancestry. However, it is first necessary to rule out the possibility that the regions are unrelated and simply share homologous genes by chance.

Figure: a gene cluster spanning windows W1 and W2. Are W1 and W2 homologous regions?

Gene content overlap models
The significance of a cluster depends not only on the properties of the windows, but also on the size of the genomes and the number of genes the genomes have in common. We design statistical tests for genome models that are appropriate for two common types of comparative genomics problems. The first model is designed for analyses of conserved linkage of genes in three regions from three distinct genomes. The second model is for detecting segments duplicated by a whole genome duplication (WGD), via comparison with the genome of a related, pre-duplication species. We again use a Venn diagram representation to illustrate the extent of gene content overlap among the genomes.

Statistical Tests
We propose a novel test that takes into account both the genes conserved in all three regions (x123) and the genes conserved in only pairs of regions (x12, x13, and x23). We use a combinatorial approach to obtain, for each genome model, an expression for the probability of a cluster under the null hypothesis of random gene order. (Equations omitted for brevity.) The probability is taken over random variables drawn from the distribution given by the null hypothesis, and the event of interest is that each of the quantities is at least as large as the observed quantity.
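As a rough illustration of the null hypothesis described above, the joint tail probability can be approximated by simulation. The following is a minimal sketch for the orthology model only. It assumes that, under random gene order, a window of r consecutive genes behaves like a uniformly random set of r genes from its genome. The function names, the integer gene labels, and the Monte Carlo approach itself are illustrative assumptions; they are not the combinatorial expressions derived by the authors, which are not reproduced in this transcript.

```python
import random


def window_counts(n123, singletons, r, rng):
    """Draw one random window per genome and return (x123, x12, x13, x23).

    Assumption: genes 0..n123-1 are shared by all three genomes; each genome
    also has its own singletons, labeled n123 and above within that genome.
    """
    windows = []
    for s in singletons:
        picked = rng.sample(range(n123 + s), r)  # r distinct genes from this genome
        windows.append({g for g in picked if g < n123})  # keep only shared genes
    w1, w2, w3 = windows
    x123 = len(w1 & w2 & w3)
    # Genes shared by exactly two windows exclude those shared by all three.
    x12 = len(w1 & w2) - x123
    x13 = len(w1 & w3) - x123
    x23 = len(w2 & w3) - x123
    return x123, x12, x13, x23


def estimate_tail(observed, n123, singletons, r, trials, seed=0):
    """Estimate the probability that every simulated quantity is at least
    as large as the corresponding observed quantity (x123, x12, x13, x23)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        simulated = window_counts(n123, singletons, r, rng)
        if all(sim >= obs for sim, obs in zip(simulated, observed)):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    # Hypothetical query: one gene shared by each pair of windows, none by all three.
    p = estimate_tail((0, 1, 1, 1), n123=5000, singletons=(0, 0, 0), r=100, trials=2000)
    print("estimated probability:", p)
```

Because significant clusters have very small probabilities, a simulation of this kind needs a very large number of trials to resolve them, which is one reason exact expressions are preferable in practice.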
Using the combinatorial expressions for these probabilities, we computed cluster probabilities in Mathematica for typical genome parameters and window sizes. We used these computations to investigate the questions addressed below.

Current statistical approaches focus primarily on comparisons of two regions. With the rapid rate of whole genome sequencing, analysis of gene clusters that span three or more chromosomal regions is of increasing interest. However, the statistical questions are more difficult.

When comparing two regions, the number of shared genes, x, is a natural test statistic: the more genes that are shared, the less likely it is that they are shared by chance. In contrast, when comparing three regions, there are many quantities that provide evidence of homology:
• the number of genes shared among all three regions (x123)
• the number of genes shared between exactly two regions (x12, x13, x23)
• the number of genes unique to one window (x1, x2, x3)

Figure: a cluster spanning windows W1, W2, and W3, with x123 = 1, x12 = 3, x13 = 1, x23 = 1. Given a third region W3, are W1 and W2 homologous?

Previous attempts to test the significance of three or more regions have either used multiple pairwise comparisons (reviewed by Simillion et al. [2]) or considered only genes shared between all regions (x123) [1]. How best to combine evidence from different subsets of regions remains an unsolved problem.

To design statistical tests for three regions we need to model:
• the number of genes shared among the three regions
• the extent of gene content overlap among the genomes

Orthology model (a): n123 genes are shared between all three genomes. The remaining genes in each genome (n1, n2, n3) are singletons, genes that do not have homologs in any of the other genomes.

Our goals:
• Develop genome models appropriate for common comparative genomics problems.
• Develop statistical tests for clusters spanning three regions, for each model.
• Study the relative importance of the above quantities to cluster significance.
• Investigate how the genome model affects cluster significance.
• Compare our proposed tests to previous statistical approaches.

We present the first attempt to evaluate the significance of clusters spanning exactly three regions, taking into account both the genes conserved in all regions and the genes conserved in only pairs of regions.

Hypothesis Testing Approach
Our statistical approach tests the hypothesis that a gene cluster is evidence of shared ancestry against a null hypothesis of random gene order. We try to rule out the null hypothesis by showing that the probability of the observed cluster is small under the null hypothesis. Given a set of three windows, each containing r consecutive genes, we wish to determine whether the windows share more homologous genes than expected by chance.

A gene cluster spanning two regions can be characterized by the following quantities:
• the number of shared genes (x)
• the number of genes unique to each window

Duplication model (b): one genome has undergone a whole genome duplication (WGD), and the other is a related genome that diverged from a common ancestor before the WGD.
• n1,2 genes appear twice in the post-duplication genome and once in the pre-duplication genome. These are the genes that were retained in duplicate.
• n1,1 genes appear once in the post-duplication genome and once in the pre-duplication genome. These are the genes whose duplicates were preferentially lost.
• n0,1 genes appear once in the post-duplication genome but do not appear in the pre-duplication genome.

Are pairwise statistical tests sufficient?
The most common strategy for testing the significance of multiple regions is to conduct multiple pairwise comparisons (reviewed in [2]). For example, if region W1 is significantly similar to region W2, and W2 is significantly similar to region W3, then homology between all three regions is inferred, even if W1 and W3 share few or no genes.

We also ask two further questions: does the proportion of singleton genes in the genome matter, and how do retained duplicates after WGD affect cluster significance? Following a WGD, in many cases there is no immediate selective advantage to retaining a gene in duplicate, so one of the duplicates is often lost.
Therefore, paralogous regions may share few paralogous genes, and duplicated regions are often detected by comparison to a related pre-duplication genome. We computed cluster probabilities for the duplication model using the following parameters: n1,1 = 3600, n1,2 = 450, and n0,1 = 500. This is consistent with a recent study of pre- and post-duplication yeast species [3], in which only 16% of duplicates were retained following WGD in S. cerevisiae.

Figure: two clusters, each spanning windows Wpost1, Wpre, and Wpost2.

We illustrate the quantities that characterize a two-region cluster with a Venn diagram representation, in which each circle represents a window and the number of shared genes (x) is given in the intersection.

The pairwise approach allows the use of existing statistical methods, which are designed for comparing two regions. However, the pairwise approach
• requires at least two of the three pairwise comparisons to be independently significant
• does not consider the greater impact of genes shared among all three regions.
We compared the pairwise probabilities to our three-way probabilities for various cluster parameter values. The corresponding figure shows that, even when x123 = 0, pairwise tests underestimate the significance relative to our three-way test, which considers all three regions jointly.

Genomes under comparison often contain singletons, genes that do not have homologs in any of the other genomes (n1, n2, n3 in the orthology model). As the proportion of singletons in the genomes increases, cluster significance increases substantially. This is because, as fewer homologs are shared between the genomes, it is more surprising to find them clustered together.

How much more does a gene shared by all three regions contribute to significance?
Which cluster is less likely to occur by chance when genes are arranged randomly?
(a) Two genes are shared by all three windows (x123 = 2, x12 = x13 = x23 = 0)
(b) Two distinct genes are shared by each pair of windows (x123 = 0, x12 = x13 = x23 = 2)
In both cases, each pair of windows shares two genes. However, the total number of genes shared in (b) is twice as large as in (a). Nonetheless, as the corresponding figure shows, the scenario in cluster (a) is much less likely to occur by chance under the orthology model. This illustrates the importance of x123 to cluster significance. (Figure parameters: n123 = 5000, n1 = n2 = n3 = 0, r = 100; horizontal axis x12 = x13 = x23.)

Which cluster is less likely to occur by chance, if 84% of duplicates were lost following WGD?
(a) Wpre shares three genes with Wpost1, and three other genes with Wpost2
(b) Wpre shares only two genes each with Wpost1 and Wpost2, but Wpost1 and Wpost2 share an additional gene
The corresponding figure shows that these two scenarios are actually quite close in significance, even though the second scenario shares fewer homologous matches. Current approaches typically compare the pre-duplication region independently with each of the post-duplication regions, and thus ignore the values of x23 and x123. These methods could fail to detect clearly significant clusters. (Figure parameters: n1,1 = 3600, n1,2 = 450, n0,1 = 500, r = 100.)

Figure parameters for the singleton analysis: n1 = n2 = n3 = s, n123 + s = 5000, r = 100.

Pairwise versus three-way tests: at the significance threshold used in our comparison, the pairwise approach requires two of the three pairs of regions to share at least seven genes. In contrast, using our three-way test, a cluster is significant when each pair of regions shares only four genes. (Figure parameters: n123 = 5000, n1 = n2 = n3 = 0, r = 100, x123 = 0.) Our results suggest that pairwise tests are not always sufficient and that multi-region tests will be able to identify more distantly related homologous regions.

References
[1] D. Durand and D. Sankoff, J. Comput. Biol. 10, 2003.
[2] C. Simillion et al., Bioessays 26, 2004.
[3] K. P. Byrne and K. H. Wolfe, Genome Res. 10, 2005.
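For the pairwise comparison discussed above, a two-window tail probability can be sketched with a hypergeometric model. This is a simplified illustration, not necessarily the pairwise test used by the authors: it assumes two genomes with the same n genes in one-to-one correspondence (no singletons), a fixed window of r genes in the first genome, and a uniformly random set of r genes in the second. The function name `pairwise_tail` is hypothetical.

```python
from math import comb


def pairwise_tail(n, r, x):
    """Probability that two windows of r genes share at least x genes,
    under the simplified hypergeometric assumptions stated above."""
    total = comb(n, r)  # number of ways to choose the second window
    # Choose k homologs of the first window's genes and r - k other genes.
    favorable = sum(comb(r, k) * comb(n - r, r - k) for k in range(x, r + 1))
    return favorable / total


if __name__ == "__main__":
    # Genome size and window size taken from the figure parameters in the text.
    for shared in range(0, 9):
        print(shared, pairwise_tail(5000, 100, shared))
```

The exact integer arithmetic in `math.comb` avoids the rounding problems that arise when very small tail probabilities are assembled from floating-point factorials.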
