Statistical Tests for Gene Clusters Spanning Three Genomic Regions
Narayanan Raghupathy*, Rose Hoberman*, and Dannie Durand
Carnegie Mellon University, Pittsburgh, PA
Gene clusters: evidence of common ancestry?
Many analyses use gene clusters (distinct chromosomal regions that share homologous gene pairs, but for which neither gene order nor gene content is preserved) as evidence of shared ancestry. However, it is first necessary to rule out the possibility that the regions are unrelated and simply share homologous genes by chance.
A gene cluster
* Contributed equally
Are W1 and W2 homologous regions?
Gene content overlap models
The significance of a cluster depends not only on the properties of windows, but also on the size of the genomes and the number of genes in common between the genomes. We design statistical tests for genome models that are appropriate for two common types of comparative genomics problems.
The first model is designed for analyses of conserved linkage of genes in three regions from three distinct genomes. The second model is for detection of segments duplicated by a whole genome duplication (WGD), via comparison with the genome of a related, pre-duplication species. We again use a Venn diagram representation to illustrate the extent of gene content overlap among the genomes.
We propose a novel test that takes into account both the genes conserved in all three regions (x123) and those conserved in only pairs of regions (x12, x13, and x23). We use a combinatorial approach to obtain, for each genome model, an expression for the probability P(X ≥ x) under the null hypothesis of random gene order, where X denotes the random variables drawn from the distribution given by the null hypothesis. (Equations omitted for brevity.)
The expression X ≥ x is shorthand for X12 ≥ x12, X13 ≥ x13, X23 ≥ x23, and X123 ≥ x123; that is, each of the quantities is at least as large as the observed quantity.
Using these expressions, we computed cluster probabilities in Mathematica for typical genome parameters and window sizes. These simulations were used to investigate the following questions.
Current statistical approaches primarily focus on comparisons of two regions only. With the rapid rate of whole genome sequencing, analysis of gene clusters that span three or more chromosomal regions is of increasing interest. However, the statistical questions are more difficult.
Given a third region W3, are W1 and W2 homologous?
Orthology model: n123 genes are shared between all three genomes. The remaining genes in each genome (n1,n2,n3) are singletons, genes which do not have homologs in any of the other genomes.
To design statistical tests for three regions we need to model:
Hypothesis Testing Approach
Our statistical approach tests the hypothesis that a gene cluster is evidence of shared ancestry against a null hypothesis of random gene order. We try to rule out the null hypothesis by showing that the probability of the observed cluster is small under the null hypothesis.
Given a set of three windows, each containing r consecutive genes, we wish to determine whether the windows share more homologous genes than expected by chance.
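The question above can be explored numerically with a small Monte Carlo sketch. This is not the combinatorial method used on this poster (those expressions are omitted); it is a stand-in that assumes the orthology model with n123 shared genes and per-genome singleton counts, and it treats a window of r consecutive genes in a randomly ordered genome as a uniform random sample of r genes. The function name, the integer gene IDs, and the reading that a sample counts as at least as extreme when each of x12, x13, x23, and x123 is at least as large as observed are assumptions made for illustration.

```python
import random


def estimate_cluster_probability(observed, n123, singletons, r, trials=2000, seed=0):
    """Monte Carlo estimate of P(X >= x) under random gene order.

    observed   -- (x12, x13, x23, x123): genes shared by exactly two windows,
                  and genes shared by all three windows
    n123       -- number of genes shared by all three genomes
    singletons -- (n1, n2, n3): genes without homologs in the other genomes
    r          -- window size in genes
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        windows = []
        for s in singletons:
            # IDs 0..n123-1 are shared genes; IDs >= n123 stand for this
            # genome's singletons, which can never match another genome.
            picks = rng.sample(range(n123 + s), r)
            windows.append({g for g in picks if g < n123})
        w1, w2, w3 = windows
        x123 = len(w1 & w2 & w3)
        counts = (
            len(w1 & w2) - x123,  # x12: shared by W1 and W2 only
            len(w1 & w3) - x123,  # x13: shared by W1 and W3 only
            len(w2 & w3) - x123,  # x23: shared by W2 and W3 only
            x123,
        )
        if all(c >= o for c, o in zip(counts, observed)):
            hits += 1
    return hits / trials
```

For example, `estimate_cluster_probability((2, 2, 2, 0), 5000, (0, 0, 0), 100)` uses the genome size and window size quoted on this poster. Sampling resolves only probabilities that are not too small relative to 1/trials, so very significant clusters need an exact calculation.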
A gene cluster spanning two regions can be characterized by the following quantities:
We present the first attempt to evaluate the significance of clusters spanning exactly three regions, taking into account both the genes conserved in all regions and in only pairs of regions.
Are pairwise statistical tests sufficient?
The most common strategy for testing significance of multiple regions is to conduct multiple pairwise comparisons (reviewed in ). For example, if region W1 is significantly similar to region W2, and W2 is significantly similar to region W3, then homology between all three regions is inferred, even if W1 and W3 share few or no genes.
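A single pairwise comparison of this kind can be sketched for the simplest case. The assumptions here are ours: two genomes with identical gene content (n shared genes, no singletons), two windows of r genes each, and random gene order, so that the number of genes shared by the two windows follows a hypergeometric distribution. The function name is hypothetical, and the published pairwise tests cover more general settings.

```python
from math import comb


def pairwise_tail(n, r, x):
    """P(X >= x) for the number of genes X shared by two windows of r genes.

    Assumes both genomes contain the same n genes and gene order is random,
    so window 2 is a uniform random r-subset of the n genes and X is
    hypergeometric (population n, r marked genes, r draws).
    """
    total = comb(n, r)
    favourable = sum(comb(r, k) * comb(n - r, r - k) for k in range(x, r + 1))
    return favourable / total
```

With n = 5000 and r = 100, the expected number of shared genes is r * r / n = 2, so `pairwise_tail(5000, 100, x)` falls quickly as x grows beyond 2. Chaining two such tests (W1 with W2, then W2 with W3) never looks at the genes shared by all three windows.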
How do retained duplicates after WGD affect cluster significance?
Does the proportion of singleton genes in the genome matter?
Following a WGD, in many cases there is no immediate selective advantage for retaining a gene in duplicate, so one of the duplicates is often lost. Therefore, paralogous regions may share few paralogous genes. Thus, these duplicated regions are often detected by comparison to a related pre-duplication genome.
We computed cluster probabilities for the duplication model using the following parameters:
n1,1 = 3600, n1,2 = 450, and n0,1 = 500. This is consistent with a recent study of pre- and post-duplication yeast species, in which only 16% of duplicates were retained following WGD in S. cerevisiae.
We illustrate these by a Venn diagram representation of a gene cluster, where each circle represents a window, and the number of shared genes (x) is given in the intersection.
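The Venn diagram counts can be computed directly from the gene content of the windows. The sketch below assumes each window is given as a collection of homology-family identifiers with at most one gene per family in a window; the function name and the example identifiers are hypothetical.

```python
def venn_counts(w1, w2, w3):
    """Count the genes in each shared region of the three-window Venn diagram.

    x123 counts genes present in all three windows; x12, x13, and x23 count
    genes present in exactly the two named windows.
    """
    s1, s2, s3 = set(w1), set(w2), set(w3)
    x123 = len(s1 & s2 & s3)
    return {
        "x12": len(s1 & s2) - x123,
        "x13": len(s1 & s3) - x123,
        "x23": len(s2 & s3) - x123,
        "x123": x123,
    }


# Toy windows: family "c" is in all three windows, "b" only in W1 and W2,
# and "d" only in W2 and W3.
example = venn_counts(["a", "b", "c"], ["b", "c", "d"], ["c", "d", "e"])
```

Here `example` is `{"x12": 1, "x13": 0, "x23": 1, "x123": 1}`.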
This approach allows the use of existing statistical methods, which are designed for comparing two regions. However, the pairwise approach
Genomes under comparison often contain singletons, genes which do not have homologs in any of the other genomes (n1, n2, n3 in the orthology model).
As the proportion of singletons in the genomes increases, cluster significance increases substantially. This is because as fewer homologs are shared between the genomes, it is more surprising to find them clustered together.
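The direction of this effect can be checked in a simplified two-window setting. This is an illustration under our own assumptions, not the three-way expression from this poster: two genomes each hold n_shared genes with homologs in the other genome plus s singletons, windows contain r genes, and gene order is random. The function names are hypothetical.

```python
from math import comb


def hypergeom_pmf(population, marked, draws, k):
    """P(exactly k marked items in `draws` draws without replacement)."""
    return comb(marked, k) * comb(population - marked, draws - k) / comb(population, draws)


def pairwise_tail_with_singletons(n_shared, s, r, x):
    """P(at least x genes shared by two windows of r genes; requires r <= n_shared + s)."""
    genome_size = n_shared + s
    # Probability that a window holds exactly k genes that have homologs.
    in_window = [hypergeom_pmf(genome_size, n_shared, r, k) for k in range(r + 1)]
    p = 0.0
    for k in range(x, r + 1):          # shared-type genes in window 1
        for j in range(x, r + 1):      # shared-type genes in window 2
            weight = in_window[k] * in_window[j]
            if weight == 0.0:
                continue
            # Given k and j, the overlap is hypergeometric over n_shared genes.
            tail = sum(hypergeom_pmf(n_shared, k, j, m) for m in range(x, min(k, j) + 1))
            p += weight * tail
    return p
```

Holding the genome size fixed at 200 genes with r = 20, `pairwise_tail_with_singletons(100, 100, 20, 3)` is smaller than `pairwise_tail_with_singletons(200, 0, 20, 3)`: the same three shared genes are less likely by chance when half of each genome consists of singletons. The triple loop is slow for r = 100 in pure Python, which is why small values are used here.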
How much more does a gene shared by all three regions contribute to significance?
We compared the pairwise probabilities to our three-way probabilities for various cluster parameter values. The figure below shows that, even when x123= 0, pairwise tests underestimate the significance, when compared to our three-way test, which considers all three regions jointly.
Which cluster is less likely to occur by chance, when genes are arranged randomly?
n1=n2=n3=s, n123+s= 5000, r = 100
n123=5000, n1=n2=n3=0, r =100, x123=0
b) Wpre shares only two genes each with Wpost1 and Wpost2, but Wpost1 and Wpost2 share an additional gene
For example, given a significance threshold of , the pairwise approach requires two of the three regions to share at least seven genes. In contrast, using our three-way test a cluster is significant when each pair of regions shares only four genes.
Which cluster is less likely to occur by chance, if 84% of duplicates were lost following WGD?
n1,1= 3600, n1,2= 450, n0,1= 500, r=100
b) Two distinct genes are shared by each pair of windows (x123= 0, x12= x13= x23= 2)
The figure at right shows that the two scenarios shown above are actually quite close in significance, even though the second scenario shares fewer homologous matches. Current approaches typically compare the pre-duplication region independently with each of the post-duplication regions, and thus ignore the values of x23 and x123. These methods could fail to detect clearly significant clusters.
x12= x13= x23
n123=5000, n1=n2=n3=0, r=100
Our results suggest that pairwise tests are not always sufficient and multi-region tests will be able to identify more distantly related homologous regions.
In both cases, each pair of windows shares two genes. However, the total number of genes shared in (b) is twice as large as in (a). Nonetheless, as the figure at right shows, the scenario shown in cluster (a) is much less likely to occur by chance under the orthology model. This illustrates the importance of x123 to cluster significance.
x12 = x13
D Durand and D Sankoff, J. Comput. Biol., 10, 2003.
C Simillion et al., Bioessays, 26, 2004.
KP Byrne and KH Wolfe, Genome Res., 10, 2005.