A Ruby in the Rubbish: Using molecular data to look for signatures of selection

Bruce Walsh, jbwalsh@u.arizona.edu
University of Arizona
Depts. of Ecology & Evolutionary Biology, Molecular & Cellular Biology, Plant Sciences, Animal Sciences, Epidemiology & Biostatistics

Search for Genes that experienced artificial (and natural) selection
Akin in spirit to testing candidate genes for association or using genome scans to find QTLs.
In linkage studies: use molecular markers to look for marker-trait associations (phenotypes).
In tests for selection: use molecular markers to look for patterns of selection (patterns of within- and between-species variation).

Types of Genes that have experienced Selection in Crop species
Domestication genes: Alleles fixed in the course of the initial domestication.
Diversification/Improvement genes: Alleles fixed in the course of improvement following domestication.
Adaptation genes: Alleles in natural populations responding to natural selection on environmental conditions (candidates to transfer into elite germplasm).

Searches for regions under selection complement standard linkage-based approaches for QTL detection (line crosses, association mapping).
Using QTL approaches to find domestication genes requires making crosses of wild progenitor x domesticated line.
Localizing adaptation genes to a particular environment via a standard QTL cross is very difficult, as one would miss potential pathways to adaptation by focusing only on candidate phenotypes thought of by the investigator.

The general approaches for using sequence data to search for signs of selection
Key: Use features of variation at a marker locus to test for departures from strict neutrality.
• Tests based on the pattern and amount of within-species polymorphism (departures from neutral predictions). Detect on-going or recent selection.
• Tests based on polymorphism plus between-species divergence. Detect on-going or recent selection.
• Tests based on phylogenetic comparisons between species. Detect historical selection (won't discuss these further).

A quick review of the neutral theory (expected patterns of variation under drift)
• Drift and the coalescence process (it's about time)
• Mutation-drift equilibrium (within-population variation). Function of population size and mutation rate. Expected variation: H ≈ 4Nem when 4Nem is small
• Divergence between populations (between-population variation). Function of time and mutation rate (but not population size), d = 2tm

Mutation-Drift Equilibrium (Single Loci)
Drift removes variation, while mutation introduces it. Thus, an equilibrium amount of genetic variance results.
While alleles change over time, heterozygosity remains roughly constant.

If Ne is the effective population size and m the mutation rate, Crow & Kimura showed the equilibrium heterozygosity is given by
$H = \frac{4 N_e m}{4 N_e m + 1}$
Thus, H is simply a function of the product of population size and mutation rate.
The parameter 4Nem is a fundamental one in molecular evolution and is often denoted by θ.
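The Crow & Kimura expression H = 4Nem/(4Nem + 1) is easy to evaluate numerically. The sketch below is only an illustration: the function name and the population sizes and mutation rate passed to it are hypothetical values, not taken from these slides.

```python
def equilibrium_heterozygosity(n_e: float, m: float) -> float:
    """Equilibrium heterozygosity H = theta / (theta + 1), with theta = 4 * Ne * m."""
    theta = 4.0 * n_e * m
    return theta / (theta + 1.0)


# Hypothetical effective population sizes and mutation rate, for illustration only.
for n_e in (1e3, 1e5, 1e7):
    h = equilibrium_heterozygosity(n_e, m=1e-8)
    print(f"Ne = {n_e:.0e}: H = {h:.6f}")
```

H rises toward 1 as 4Nem grows, and is close to 4Nem itself when 4Nem is much smaller than 1.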

A very powerful way of thinking about drift is the Coalescent Process
Instead of following alleles, think in terms of lineages.
As a consequence of drift, eventually all current copies of alleles trace back to a single ancestral lineage.
Hence, the current lineages coalesce as one moves back in time.

MRCA = most recent common ancestor

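Coalescent theory provides an easy way to see why 4Nem appears.
[Figure: two sequences traced back t generations to their common ancestor, with an expected tm mutations on each branch.]
Expected number of mutations separating the two sequences = 2tm.
For two random sequences within a population, t = 2Ne, giving 2tm = 4Nem.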

From coalescent theory, the expected time back to the MRCA of two sequences is 2N generations.
Hence, for two randomly chosen sequences, the expected number of mutations they differ by is just 2mt = 2m(2N) = 4Nm.
If 4Nm >> 1, two random sequences will typically differ (and hence be heterozygotes).
If 4Nm << 1, two random sequences will typically be identical (and hence be homozygotes).

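The Coalescent for a Sample
[Figure: genealogy of a sample of five sequences, running from Present back to Past, with intervals t5, t4, t3, and t2 during which 5, 4, 3, and 2 lineages remain. The per-generation coalescence rates in these intervals are 5/N, 3/N, 3/(2N), and 1/(2N).]
While k lineages remain, the coalescence rate is qk = k(k-1)/(4N) per generation, so the expected waiting time is 4N/[k(k-1)].
Mean total time back to the MRCA = N(1/5 + 1/3 + 2/3 + 2) = 3.2N.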
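The arithmetic for the five-sequence sample can be checked with a few lines of code. This is a minimal sketch using the rate k(k-1)/(4N) stated on the slide; the function name is illustrative, and times are expressed in units of N generations.

```python
from fractions import Fraction


def expected_coalescent_times(n: int) -> dict:
    """Expected waiting time (in units of N generations) while k lineages remain.

    With coalescence rate k(k-1)/(4N), the mean waiting time is 4N/[k(k-1)].
    """
    return {k: Fraction(4, k * (k - 1)) for k in range(n, 1, -1)}


times = expected_coalescent_times(5)
tmrca = sum(times.values())

for k, t in times.items():
    print(f"{k} lineages: {t} N generations")
print("expected time to MRCA:", float(tmrca), "N generations")
```

Summing the intervals for a sample of n gives 4N(1 - 1/n), which approaches 4N for large samples; most of that time is spent waiting for the last two lineages to coalesce.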

Divergence Between Populations
Mutation and drift also generate a between-line variance, i.e., a population divergence.
As lines separate, the initial heterozygosity is randomly partitioned, creating a between-line variance.
More importantly, as new mutations arise in the separated lines, some of these are fixed by drift, and this drives a constant divergence between populations.

On average, for a population of size N, 2Nm mutations arise each generation.
For any of these, the probability of fixation is just U(1/[2N]) = 1/(2N).
Hence, the rate at which new mutations are fixed within a line is just
(# new per generation) * Pr(fixation) = 2Nm * 1/(2N) = m
Hence, the divergence d(t) after t generations is just d(t) = mt.
Independent of population size!

Logic behind polymorphism-based tests
Key: Time to MRCA relative to drift.
If a locus is under positive selection: more recent MRCA (shorter coalescent).
If a locus is under balancing selection: older MRCA relative to drift (deeper coalescent).
Shorter coalescent = lower levels of variation, longer blocks of disequilibrium.
Deeper coalescent = higher levels of variation, shorter blocks of disequilibrium.

[Figure: example genealogies under a selective sweep, neutrality, and balancing selection.]

Selective sweeps result in a local decrease in Ne around the selected site.
This results in a shorter time to MRCA and a decrease in the amount of polymorphism.
Note that this has no effect on the rate of divergence of neutral sites, as this is independent of Ne.
Conversely, balancing selection increases the local effective population size, increasing the amount of polymorphism.

A scan of levels of polymorphism can thus suggest sites under selection.
[Figure, two panels plotting variation against map location. Directional selection (selective sweep): a local dip in variation, which could also reflect a local region with reduced mutation rate. Balancing selection: a local peak in variation, which could also reflect a local region with elevated mutation rate.]

Example: maize domestication gene tb1
Major changes in plant architecture occurred in the transition from teosinte to maize.
The Doebley lab identified a gene, teosinte branched1 (tb1), involved in many of these architectural changes.
Wang et al. (1999) observed a significant decrease in genetic variation in the 5' NTR region of tb1, suggesting a selective sweep influenced this region. The sweep did not influence the coding region.

Wang et al (1999) Nature 398: 236.

Clark et al. (2004) examined the 5' tb1 region in more detail, finding evidence for a sweep influencing a region of 60-90 kb.
Clark et al. (2004) PNAS 101: 700.

Wang et al. and Clark et al. used a close relative (teosinte) as a control, to rule out the possibility that the reduction in neutral polymorphism was due simply to a reduced mutation rate.
The process of domestication itself is expected to reduce variation genome-wide because of the population bottleneck that is typically induced during domestication.
In maize, the background level of polymorphism (genome-wide) is only about 75% of that of teosinte.

Estimating strength of selection from size of sweep region
Kaplan, Hudson, and Langley (1989) showed that the distance d at which a neutral site can be influenced by a sweep is a function of the strength of selection s and the recombination fraction c, with d = 0.01 s/c.
Hence, s = 100 d c.
For tb1, s ≈ 0.05.
With s in hand, one can also estimate the expected time for selection to fix the allele, which Wang et al. estimated at 300 to 1000 years, indicating a fairly long period of domestication.
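The Kaplan et al. relationship d = 0.01 s/c can be rearranged in code as a quick calculator. This is a sketch under stated assumptions: d is taken to be in base pairs and c to be the recombination fraction per base pair, and the input values below are hypothetical, not the tb1 or Waxy estimates.

```python
def selection_coefficient(d_bp: float, c_per_bp: float) -> float:
    """Strength of selection implied by a sweep of width d: s = 100 * d * c."""
    return 100.0 * d_bp * c_per_bp


def sweep_distance(s: float, c_per_bp: float) -> float:
    """Distance over which a sweep influences neutral sites: d = 0.01 * s / c."""
    return 0.01 * s / c_per_bp


# Hypothetical inputs: a 100 kb sweep region and c = 1e-8 per base pair.
s_hat = selection_coefficient(d_bp=100_000, c_per_bp=1e-8)
print("implied s:", s_hat)
print("distance recovered from s:", sweep_distance(s_hat, 1e-8), "bp")
```

Because s scales with both d and c, the same sweep width implies much stronger selection in a region of high recombination than in a region of low recombination.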

Example: Waxy gene in Rice (Olsen et al. 2006)
"Sticky" (glutinous) rice results from low amylose levels, and is typical of temperate japonica variety groups.
A number of groups showed this is due to a splice mutant in the Waxy gene.
This is an example of an improvement (as opposed to domestication) gene.
Olsen et al. observed a region 250 kb in size around Waxy with a greatly reduced level of polymorphism compared to control populations.
Using the Kaplan et al. expression, this gives s = 4.6!

While the sweep around tb1 did not even influence the coding region of that gene, the Waxy sweep covers 39 rice genes!
One evolutionary consequence of a sweep is that the local reduction in effective population size (which produces the signal of a sweep) also reduces the efficiency of selection on linked genes within the region (the Hill-Robertson effect):
Deleterious alleles have a higher probability of fixation.
Favorable alleles have a reduced probability of fixation.

Accumulation of Deleterious mutations in domesticated rice genomes?
Lu et al. (2006) compared the genomes of Oryza sativa ssp. indica and japonica with their ancestral relative O. rufipogon.
The Ka/Ks ratio (the ratio of the substitution rate of non-synonymous to synonymous changes) was much higher for indica vs. japonica (0.498) than for domesticated vs. wild rice (japonica vs. rufipogon, 0.259).
Lu et al. suggest that roughly 25% of the amino acid differences between indica and japonica are deleterious.
They suggest that excessive reductions in Ne, due to selective sweeps covering much of the genome during selection for domestication, greatly reduced the efficiency of natural selection in removing deleterious alleles.

Formal tests of selection
• Tajima's D. Requires: single-locus, within-population polymorphism data
• McDonald-Kreitman test. Requires: coding region, data from 2 species (within-population variation, between-species divergence)
• Hudson-Kreitman-Aguade (HKA) test. Requires: at least two loci, data from 2 species (within-population variation, between-species divergence)
• Allele frequency vs. LD tests. Requires: dense marker scan around a single locus using within-population data

Tests based on Within-Population Variation
These tend to compare different measures of variation (such as number of alleles vs. pairwise distances among alleles).
Two sequence evolution frameworks are typically used: infinite alleles vs. infinite sites.
Both assume each new mutation generates a new (unique) sequence (this is not the case for STRs).
How do these frameworks differ?

Consider the following five sequences (* marks a polymorphic site):

      * *
A A G A C C
A A G G C C
A A G A C C
A A G G C C
A A G G C A

Infinite alleles model: Treat each different haplotype as a different allele. Here, there are three alleles (AAGACC, AAGGCC, AAGGCA).
Infinite sites model: Treat each site (base position) separately. How many polymorphic sites are there? Here, 2 polymorphic sites (positions 4 and 6).
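The two ways of summarizing the five example sequences can be sketched in code. The function names are illustrative; the sequences are the ones from the slide.

```python
seqs = ["AAGACC", "AAGGCC", "AAGACC", "AAGGCC", "AAGGCA"]


def count_alleles(sequences):
    """Infinite alleles view: each distinct haplotype is one allele."""
    return len(set(sequences))


def segregating_sites(sequences):
    """Infinite sites view: 0-based positions where more than one base is present."""
    return [i for i, column in enumerate(zip(*sequences)) if len(set(column)) > 1]


print("alleles:", count_alleles(seqs))
print("polymorphic sites (1-based):", [i + 1 for i in segregating_sites(seqs)])
```

The same data therefore yield two different summaries: three alleles under the infinite alleles model and two segregating sites under the infinite sites model.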

Two typical classes of departures are seen with polymorphism data
1: An excess of rare alleles, a deficiency of intermediate-frequency alleles (alleles younger than expected).
2: An excess of intermediate-frequency alleles, a deficiency of rare alleles (alleles older than expected).
Pattern 1 is expected under a selective sweep, when coalescent times are shorter than expected.
Pattern 2 is expected under balancing selection, when coalescent times are longer than expected.

Major Complication With Polymorphism-based tests
Demographic factors can also cause these departures from neutral expectations!
Too many young alleles -> recent population expansion
Too many old alleles -> population substructure
Thus, there is a composite alternative hypothesis, so that rejection of the null does not imply selection. Rather, selection is just one option.

Can we overcome this problem?
It is important to do so, as only polymorphism-based tests can indicate on-going selection.
Solution: demographic events should leave a constant signature across the genome. Essentially, all loci experience common demographic factors.
Genome scan approach: look at a large number of markers. These generate the null distribution (most are not under selection); outliers = potentially selected loci.

Summary Statistics for Infinite Sites Model
The key parameter is θ = 4Nem.
• S, number of segregating sites. E(S) = a_n θ
• k, average number of pairwise differences. E(k) = θ
• η, number of singletons. E(η) = θ n/(n-1)
These suggest the following three estimates for θ:
$\hat{\theta}_S = \frac{S}{a_n}; \quad \hat{\theta}_k = k; \quad \hat{\theta}_\eta = \frac{n-1}{n}\,\eta, \quad \text{where } a_n = \sum_{i=1}^{n-1} \frac{1}{i}$
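The three estimators (S/a_n, k, and the singleton-based (n-1)/n times the singleton count) can be computed directly from aligned sequences. This is a minimal sketch: the function name is illustrative, and singletons are counted here as variants seen in exactly one sequence, without using an outgroup to decide which base is derived (an assumption of this example). The input is the five-sequence example used earlier in these slides.

```python
from itertools import combinations


def theta_estimates(sequences):
    """Return (S, theta_S, theta_k, theta_eta) for equal-length aligned sequences."""
    n = len(sequences)
    columns = [col for col in zip(*sequences) if len(set(col)) > 1]

    # S: number of segregating sites; a_n = sum_{i=1}^{n-1} 1/i
    s = len(columns)
    a_n = sum(1.0 / i for i in range(1, n))

    # k: average number of pairwise differences over all n(n-1)/2 pairs
    pairs = list(combinations(sequences, 2))
    k = sum(sum(x != y for x, y in zip(a, b)) for a, b in pairs) / len(pairs)

    # eta: variants present in exactly one sequence (unpolarised singletons)
    eta = sum(col.count(base) == 1 for col in columns for base in set(col))

    return s, s / a_n, k, (n - 1) / n * eta


seqs = ["AAGACC", "AAGGCC", "AAGACC", "AAGGCC", "AAGGCA"]
s, theta_s, theta_k, theta_eta = theta_estimates(seqs)
print(f"S = {s}, theta_S = {theta_s:.3f}, theta_k = {theta_k:.3f}, theta_eta = {theta_eta:.3f}")
```

Under strict neutrality all three estimate the same quantity, so large disagreements among them are what the polymorphism-based tests look for.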

Tajima's D test
One of the first, and most popular, polymorphism tests was Tajima's D test (Tajima 1989).
D contrasts estimates of θ = 4Nem based on S vs. k:
$D = \frac{\hat{\theta}_k - \hat{\theta}_S}{\sqrt{\alpha_D S + \beta_D S^2}}$
where α_D and β_D are constants that depend on the sample size.
Idea: For S we simply count sites, independent of their frequencies. Hence, S is rather sensitive to changes in the frequency of rare alleles.

On the other hand, k is a more frequency-weighted measure, and hence more sensitive to changes in the frequency of intermediate alleles.
D < 0: too many rare alleles. Selective sweep or population expansion. MRCA more recent than expected.
D > 0: too many intermediate-frequency alleles. Balancing selection or population subdivision. MRCA more ancient than expected.
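A sketch of the D calculation follows. It assumes the variance form published by Tajima (1989), e1 S + e2 S(S-1), for the denominator, since the constants are not spelled out on these slides; the function name and arguments are illustrative. The example inputs (n = 5, S = 2, k = 1.0) correspond to the five-sequence example used earlier.

```python
import math


def tajimas_d(n: int, s: int, k: float) -> float:
    """Tajima's D from sample size n, segregating sites s, mean pairwise differences k.

    Assumes Tajima's (1989) variance estimate e1*S + e2*S*(S-1).
    """
    if n < 3:
        raise ValueError("need at least 3 sequences")
    if s == 0:
        return float("nan")  # D is undefined without segregating sites

    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)

    # Numerator: theta_k - theta_S; denominator: its estimated standard deviation
    return (k - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))


d_example = tajimas_d(n=5, s=2, k=1.0)
print(f"D = {d_example:.3f}")
```

With so few sequences and sites the value is far from significant; the point is only that the sign of D follows the sign of the difference between the k-based and S-based estimates.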

D tests whether the level of heterozygosity (average pairwise differences, k) is consistent with the number of polymorphic sites (S).
Under selective sweeps/population expansion, heterozygosity should be significantly less than predicted from the number of polymorphic sites.

Genome-Wide Polymorphism Tests
As mentioned, a general problem with polymorphism tests is that demographic signals can also give the same pattern as selection.
Cavalli-Sforza (1966) was among the first to note that demography affects all genomic locations (roughly) equally, while the effects of selection are unique to a particular locus.
With the advent of very dense marker sets, we are now seeing genome-wide scans over all markers.
Idea: Most are not under selection and hence reflect the common demographic features. Outliers against this pattern suggest selection.
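The outlier idea can be sketched as a ranking of loci against the empirical genome-wide distribution of a test statistic. Everything here is hypothetical: the function name, the 2.5% tail cutoff, and the per-locus values (meant to stand in for something like Tajima's D) are made up for illustration and are not from any study cited in these slides.

```python
def flag_outliers(stat_by_locus: dict, tail: float = 0.025):
    """Return (low, high): loci in the lower and upper tails of the empirical distribution."""
    ranked = sorted(stat_by_locus, key=stat_by_locus.get)
    n_tail = max(1, int(len(ranked) * tail))
    return ranked[:n_tail], ranked[-n_tail:]


# Hypothetical statistic values: 40 background loci plus two extreme loci.
d_values = {f"locus{i:02d}": -0.5 + 0.02 * i for i in range(40)}
d_values["candidate_sweep"] = -2.4
d_values["candidate_balancing"] = 2.1

low, high = flag_outliers(d_values)
print("unusually negative (possible sweep):", low)
print("unusually positive (possible balancing selection):", high)
```

Because the cutoff is taken from the data themselves, a genome-wide shift caused by demography moves the whole distribution and only loci that stand out against it are flagged. A fixed tail fraction always flags some loci, so outliers are candidates rather than proof of selection.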

Logic behind Joint Polymorphism-Divergence tests
Under the neutral theory, heterozygosity is a function of θ = 4Nem, while divergence is a function of mt.
Joint Polymorphism-Divergence tests use these two different expectations to look for concordance with neutral results.
For example, under neutrality, levels of polymorphism and divergence should be positively correlated.

Under neutrality, the ratio of polymorphism to divergence at the i-th locus is just
$\frac{H_i}{d_i} = \frac{4 N_e m_i}{2 t m_i} = \frac{2 N_e}{t}$
Hence, for a series of neutral loci compared in the same populations, this ratio should be very similar.
The very popular Hudson, Kreitman and Aguade (1987), or HKA, test is based on this idea, with one using a series of control (neutral) loci to contrast with the locus of interest.

Joint Polymorphism-Divergence Tests: McDonald-Kreitman Test
One of the most straightforward tests of selection that jointly uses divergence and polymorphism data was proposed by McDonald and Kreitman (1991).
Consider the replacement (rep) & synonymous (syn) sites in a single locus:
$\frac{d_{syn}}{d_{rep}} = \frac{2 t\, m_{syn}}{2 t\, m_{rep}} = \frac{m_{syn}}{m_{rep}}, \qquad \frac{H_{syn}}{H_{rep}} = \frac{4 N_e m_{syn}}{4 N_e m_{rep}} = \frac{m_{syn}}{m_{rep}}$
These ratios have the same expected value.

Since these ratios have the same expected value, the McDonald-Kreitman test proceeds via a simple contingency table contrasting polymorphism vs. divergence at replacement vs. synonymous sites. Key feature: The McDonald-Kreitman test is NOT affected by demography

Example: McDonald & Kreitman looked at the ADH (alcohol dehydrogenase) locus in D. melanogaster & D. simulans.

              Fixed   Polymorphic
Replacement     7          2
Synonymous     17         42

24 fixed differences: 7 replacement, 17 synonymous. 44 polymorphisms: 2 replacement, 42 synonymous.
Fisher's exact test gives p = 0.0073.
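The contingency-table step can be reproduced with a small implementation of Fisher's exact test. This is a sketch written from the hypergeometric definition using only the standard library; the function name is illustrative, and the two-sided p-value is taken as the sum of all table probabilities no larger than that of the observed table (one common convention, assumed here).

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell given fixed margins
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


# Rows: fixed, polymorphic. Columns: replacement, synonymous (counts from the slide).
p_value = fisher_exact_two_sided(7, 17, 2, 42)
print(f"p = {p_value:.4f}")
```

The excess of replacement changes among the fixed differences (7 of 24, versus 2 of 44 polymorphisms) is what drives the small p-value.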

Wang et al.'s LDD Test (Linkage Disequilibrium Decay)
One feature of a selective sweep is derived alleles at high frequency. Under neutrality, older alleles are at higher frequencies.
Sabeti et al. (2002) note that under a sweep such high-frequency young alleles should (because of their recent age) have much longer regions of LD than expected.
Wang et al. (2006) proposed a Linkage Disequilibrium Decay, or LDD, test that looks for excessive LD for high-frequency alleles.
Wang et al. used this approach with 1.6 million human SNPs, finding that 1.6% of the markers showed some signatures of positive selection.

Simulation studies by Wang et al. showed that the LDD test effectively distinguishes selection from population bottlenecks and admixture.
All genome-based tests have an important caveat: the large number of markers used are typically generated by looking for polymorphisms in a very small, and often not very ethnically diverse, sample.
This results in a strong ascertainment bias, for example, an excess of intermediate-frequency markers.
If such biases are not accounted for, they can skew test results.

Caveats and Unanswered Questions
• Even if they have experienced very strong selection, domestication genes may not leave a strong signal at linked neutral markers.
There must be sufficient background variation for a sweep to have a chance of being detected.
Hamblin et al. (2006) found that the genome-wide background variation in Sorghum is too low to reliably detect signatures of selection, likely because of an extreme bottleneck during domestication.
If the ancestral species itself had low variation, it would also be very difficult to detect selective sweeps.

• A more subtle complication results from the frequency of favorable alleles at the start of the domestication process.
A typical adaptive selective sweep is generally thought to occur following the introduction of a single favorable new mutation. Hence, there is only one founding haplotype at the time of selection.
Selection on domestication alleles is akin to a sudden shift in the environment, with many of these alleles pre-existing in the population before domestication.
If the frequency of any such allele is > 0.05, multiple haplotypes are likely present, resulting in considerable variation around the selected site even after fixation, and hence a very weak (if any) signal.

Hence, there is the very real possibility that many important domestication genes will not have left a detectable signature in the pattern of linked neutral variation.

Optimal conditions for detecting selection
High levels of polymorphism at the start of selection.
High effective levels of recombination give a shorter window around the selected site.
High levels of selfing reduce the effective recombination rate (e.g., maize vs. rice).
Signatures of sweeps persist for roughly Ne generations.
