The rise (and fall) of QTL mapping: The fusion of quantitative & molecular genetics

Bruce Walsh (jbwalsh@u.arizona.edu). Depts of Ecology & Evolutionary Biology, Plant Sciences, Animal Sciences, Molecular & Cellular Biology, Epidemiology & Biostatistics. University of Arizona

Rough outline • Classical Quantitative Genetics • The Golden Age: The search for QTLs • History and review of methods • History revised: how successful has the search for QTLs been? • The next wave: • eQTLs • Association mapping • Molecular signatures of selection • Are these improvements? • Summary: Where is quantitative genetics today?

Quantitative Genetics Quantitative genetics is the analysis of traits whose variation is influenced by both genetic and environmental factors. The assumption is that the genotype of an individual cannot be easily predicted from its phenotype. Indeed, the genotypes (and hence loci) contributing to trait variation have historically been assumed to be unknown and largely unknowable. “Classical” quantitative genetics works with genetic variance components, which are often easy to estimate.

Genetic variance components Fisher (1918) reconciled quantitative traits with Mendelian genetics, building on statistical machinery developed by the biometricians. The term variance was first introduced in Fisher’s paper (as well as ANOVA). The phenotypic value is decomposed as Z = G + E. Fisher’s key insight was that, in sexual species, parents do not pass along their genotypic value G to their offspring, but rather pass along only part of it, the breeding value A, where G = A + D + I. Fisher also noted that the variance of A can be estimated from phenotypic covariances among relatives.

Variance components and Selection Response Cov(parent, offspring) = Var(A)/2; Cov(half sibs) = Var(A)/4; Cov(full sibs) = Var(A)/2 + Var(D)/4 + Var(Ec). Thus, without any genetic information, we can still estimate important genetic features associated with the trait variation in a particular population. Key use: the Breeders’ Equation for selection response, $R = h^2 S$, with the heritability $h^2 = \mathrm{Var}(A)/\mathrm{Var}(P)$.
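The covariance relationships and the Breeders' Equation above can be sketched in a few lines of Python. This is only an illustration: the function names and the numerical inputs (a half-sib covariance of 2.5, a phenotypic variance of 40, a selection differential of 8) are hypothetical and not taken from the lecture.

```python
def additive_variance_from_half_sibs(cov_half_sibs):
    """Cov(half sibs) = Var(A)/4, so Var(A) = 4 * Cov(half sibs)."""
    return 4.0 * cov_half_sibs


def heritability(var_a, var_p):
    """Narrow-sense heritability h^2 = Var(A) / Var(P)."""
    return var_a / var_p


def selection_response(h2, selection_differential):
    """Breeders' Equation: R = h^2 * S."""
    return h2 * selection_differential


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
var_a = additive_variance_from_half_sibs(2.5)
h2 = heritability(var_a, 40.0)
response = selection_response(h2, 8.0)
print(var_a, h2, response)
```

No genotype information enters anywhere: the estimate of Var(A) comes purely from the resemblance between relatives.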

Quantitative Genetics: The infinitesimal model At the heart of much of classical quantitative genetics is the infinitesimal model: the genetic variation in a trait is due to a large number of loci, each of small effect. Classical quantitative genetics represents the fusion of Mendelian and population genetics, under the umbrella of classical statistical methods. What about a fusion of quantitative genetics with molecular biology and genomics?

Statistics and Molecular biology The success of “classical” quantitative-genetics (variance components and related statistical measures) has been spectacular, esp. in plant and animal breeding. However, the solely statistical nature of this approach has been unsettling to some, and the demise of the field was predicted once we had a better molecular handle on trait variation. Thus, starting with the ability to score a vast number of molecular markers, the fusion of molecular biology and quantitative genetics seemed a possibility.

Quantitative Trait Loci, QTLs The first “harvest” from the ability to score a modest number of molecular markers was the ability to search for Quantitative Trait Loci (QTLs), loci showing allelic variation that influences trait variation (mid 1980s). Conceptually, this is nothing new, as it is just linkage analysis. Consider the gametes from an AB/ab parent, where A & B are linked loci. We observe an excess of AB and ab gametes, and a deficiency of Ab and aB gametes. Suppose B influences a trait, making it larger. Offspring getting the A allele from this parent disproportionately get the B allele as well, and hence have larger trait values.

Early localization of factors influencing quantitative traits was done by Payne (1918), Sax (1923), and Thoday (1960s). Sax (1923) crossed two inbred bean lines differing in seed pigment and weight, with the pigmented parents having heavier seeds than the nonpigmented parents. These crosses demonstrated that seed pigment is determined by a single locus with two alleles, P and p. Among F2 segregants from this cross, PP and Pp seeds were 4.3 ± 0.8 and 1.9 ± 0.6 centigrams heavier than pp seeds. Hence, the P allele is linked to a factor (or factors) that acts in an additive fashion on seed weight.

Markers and more markers While the basic outlines of QTL mapping have been known for over 70 years, the lack of sufficient genetic markers prevented its widespread use until the mid 1980s. The early studies (in maize) used 50-80 markers, mostly allozymes, that were very loosely linked (marker spacing much greater than 20 cM). With the advent of DNA markers (esp. STRs = microsatellites), the numbers and density of markers have grown, resulting in a parallel development of more statistically sophisticated mapping approaches that use this additional information.

The statistical machinery for QTL mapping • Single-marker linear model approaches • Interval mapping: pairs of markers, and the move to maximum likelihood (ML) approaches • Composite interval mapping: analysis of a marker interval, flanked by adjacent markers (ML-based) • Shrinkage and Bayesian approaches for detecting epistasis • From line-cross analysis to the analysis of outbred populations: mixed models

Conditional Probabilities of QTL Genotypes The basic building block for all QTL methods is $\Pr(Q_k \mid M_j)$, the probability of QTL genotype $Q_k$ given that the marker genotype is $M_j$: $\Pr(Q_k \mid M_j) = \Pr(Q_k M_j)/\Pr(M_j)$. Consider a QTL linked to a marker (recombination fraction = c). Cross MMQQ x mmqq. The gametes forming the F1 are all MQ and mq, so every F1 is MQ/mq. Among the gametes forming the F2, freq(MQ) = freq(mq) = (1-c)/2 and freq(mQ) = freq(Mq) = c/2.

Hence, Pr(MMQQ) = Pr(MQ)Pr(MQ) = $(1-c)^2/4$, Pr(MMQq) = 2Pr(MQ)Pr(Mq) = $2c(1-c)/4$, and Pr(MMqq) = Pr(Mq)Pr(Mq) = $c^2/4$. Since Pr(MM) = 1/4, the conditional probabilities become Pr(QQ | MM) = Pr(MMQQ)/Pr(MM) = $(1-c)^2$, Pr(Qq | MM) = Pr(MMQq)/Pr(MM) = $2c(1-c)$, and Pr(qq | MM) = Pr(MMqq)/Pr(MM) = $c^2$.
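The same bookkeeping can be done mechanically for all three marker genotypes. The sketch below enumerates pairs of F1 gametes and normalizes by the marker-genotype frequency. The function name and the value c = 0.1 are illustrative assumptions, not part of the lecture.

```python
from collections import defaultdict
from itertools import product


def f2_conditional_qtl_probs(c):
    """Return Pr(QTL genotype | marker genotype) in the F2 of MMQQ x mmqq."""
    # Gamete frequencies produced by an MQ/mq F1 with recombination fraction c.
    gametes = {
        ("M", "Q"): (1 - c) / 2,
        ("m", "q"): (1 - c) / 2,
        ("M", "q"): c / 2,
        ("m", "Q"): c / 2,
    }
    joint = defaultdict(float)     # Pr(marker genotype, QTL genotype)
    marginal = defaultdict(float)  # Pr(marker genotype)
    for (g1, p1), (g2, p2) in product(gametes.items(), repeat=2):
        # Sorting puts the uppercase allele first, so heterozygotes are "Mm" / "Qq".
        marker = "".join(sorted(g1[0] + g2[0]))
        qtl = "".join(sorted(g1[1] + g2[1]))
        joint[(marker, qtl)] += p1 * p2
        marginal[marker] += p1 * p2
    return {key: joint[key] / marginal[key[0]] for key in joint}


probs = f2_conditional_qtl_probs(0.1)  # c = 0.1 is an arbitrary example value
for key in sorted(probs):
    print(key, round(probs[key], 4))
```

For the MM marker class this reproduces the three expressions derived by hand above; the Mm and mm classes follow from the same enumeration.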

Expected Marker Means The expected trait mean for marker genotype $M_j$ is just $\mu_{M_j} = \sum_{k=1}^{N} \mu_{Q_k} \Pr(Q_k \mid M_j)$. For example, if the QTL genotypic values are QQ = 2a, Qq = a(1+k), qq = 0, then in the F2 of an MMQQ x mmqq cross, $(\mu_{MM} - \mu_{mm})/2 = a(1-2c)$. • If the trait mean is significantly different for the genotypes at a marker locus, it is linked to a QTL • A small MM-mm difference could be (i) a tightly-linked QTL of small effect or (ii) loose linkage to a large QTL
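As a check on the marker-mean contrast, the result can be worked out term by term. The derivation below uses the conditional probabilities Pr(Q | MM) given earlier and, by symmetry (an added step), Pr(QQ | mm) = c^2, Pr(Qq | mm) = 2c(1-c), Pr(qq | mm) = (1-c)^2.

```latex
\begin{align*}
\mu_{MM} &= 2a\,(1-c)^2 + a(1+k)\,2c(1-c) + 0 \cdot c^2 \\
\mu_{mm} &= 2a\,c^2 + a(1+k)\,2c(1-c) + 0 \cdot (1-c)^2 \\
\mu_{MM} - \mu_{mm} &= 2a\left[(1-c)^2 - c^2\right] = 2a(1-2c) \\
\frac{\mu_{MM} - \mu_{mm}}{2} &= a(1-2c)
\end{align*}
```

The heterozygote term, and with it the dominance coefficient k, cancels from the contrast, which is why a and c appear only through the product a(1-2c).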

Hence, the use of single markers provides for detection of a QTL. However, single-marker means do not allow separate estimation of a and c. Now consider using interval mapping (flanking markers $M_1$ and $M_2$, at recombination fractions $c_1$ and $c_2$ from the QTL): $\frac{\mu_{M_1M_1M_2M_2} - \mu_{m_1m_1m_2m_2}}{2} = a\left(\frac{1 - c_1 - c_2}{1 - c_1 - c_2 + 2c_1c_2}\right) \simeq a(1 - 2c_1c_2)$. This is essentially a for even modest linkage. Hence, a and c can be estimated from the mean values of flanking marker genotypes: $c_1 = \frac{1}{2}\left(1 - \frac{\mu_{M_1M_1} - \mu_{m_1m_1}}{2a}\right) \simeq \frac{1}{2}\left(1 - \frac{\mu_{M_1M_1} - \mu_{m_1m_1}}{\mu_{M_1M_1M_2M_2} - \mu_{m_1m_1m_2m_2}}\right)$
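A small numerical sketch shows how the flanking-marker contrast separates a from c. Expected contrasts are generated from known (hypothetical) values a = 1, c1 = c2 = 0.1 and then fed back through the approximate estimators; the function names are illustrative, and real data would use observed marker-class means rather than expectations.

```python
def single_marker_half_difference(a, c):
    """(mu_MM - mu_mm) / 2 for one marker at recombination fraction c."""
    return a * (1 - 2 * c)


def flanking_marker_half_difference(a, c1, c2):
    """(mu_M1M1M2M2 - mu_m1m1m2m2) / 2 for a QTL between two flanking markers."""
    return a * (1 - c1 - c2) / (1 - c1 - c2 + 2 * c1 * c2)


def estimate_a_and_c1(single_half_diff, flanking_half_diff):
    """Approximate estimators: a from the flanking contrast, c1 from the ratio."""
    a_hat = flanking_half_diff
    c1_hat = 0.5 * (1 - single_half_diff / a_hat)
    return a_hat, c1_hat


# Hypothetical true values used only to generate the expected contrasts.
true_a, true_c1, true_c2 = 1.0, 0.10, 0.10
a_hat, c1_hat = estimate_a_and_c1(
    single_marker_half_difference(true_a, true_c1),
    flanking_marker_half_difference(true_a, true_c1, true_c2),
)
print(round(a_hat, 4), round(c1_hat, 4))
```

With these inputs the estimates land close to, but not exactly on, the true values, reflecting the approximation a(1 - 2 c1 c2) ≈ a.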

Linear Models for QTL Detection The use of differences in the mean trait value for different marker genotypes to detect a QTL and estimate its effects is a use of linear models (one-way ANOVA): $z_{ik} = \mu + b_i + e_{ik}$, where $z_{ik}$ is the value of the trait in the kth individual of marker genotype i and $b_i$ is the effect of marker genotype i on the trait value. Detection: a QTL is linked to the marker if at least one of the $b_i$ is significantly different from zero. Estimation (QTL effect and position): this requires relating the $b_i$ to the QTL effects and map position.
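The detection step is an ordinary one-way ANOVA across marker-genotype classes. A minimal sketch of the F statistic, using only the standard library, is given below; the trait values are made-up numbers, and in practice the statistic would be compared against an F distribution with (groups - 1, N - groups) degrees of freedom, which is not done here.

```python
def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA; `groups` is a list of lists of trait values."""
    all_values = [z for group in groups for z in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]
    # Between-class sum of squares: variation of marker-class means about the grand mean.
    ss_between = sum(
        len(group) * (mean - grand_mean) ** 2
        for group, mean in zip(groups, group_means)
    )
    # Within-class sum of squares: residual variation around each class mean.
    ss_within = sum(
        (z - mean) ** 2 for group, mean in zip(groups, group_means) for z in group
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical trait values for the three F2 marker classes.
trait_by_genotype = {
    "MM": [3.0, 4.0, 5.0],
    "Mm": [2.0, 3.0, 4.0],
    "mm": [1.0, 2.0, 3.0],
}
f_stat = one_way_anova_f(list(trait_by_genotype.values()))
print(f_stat)
```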

Maximum Likelihood Methods ML methods use the entire distribution of the data, not just the marker genotype means. They are more powerful than linear models, but not as flexible in extending solutions (a new analysis is required for each model). Basic likelihood function: $\ell(z \mid M_j) = \sum_{k=1}^{N} \varphi(z; \mu_{Q_k}, \sigma^2) \Pr(Q_k \mid M_j)$. Here $\ell(z \mid M_j)$ is the likelihood of trait value z given that the marker genotype is type j; $\varphi(z; \mu_{Q_k}, \sigma^2)$ is the distribution of the trait value given QTL genotype k, which is normal with mean $\mu_{Q_k}$ (QTL effects enter here); $\Pr(Q_k \mid M_j)$ is the probability of QTL genotype k given marker genotype j (the genetic map and linkage phase enter here); and the sum is over the N possible linked QTL genotypes. This is a mixture model.
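To make the mixture structure concrete, the sketch below evaluates the likelihood of a single trait value for an MM individual in an F2, using the conditional probabilities Pr(Q | MM) from the earlier slide. The QTL genotype means, the residual variance, and the function names are hypothetical placeholders.

```python
import math


def normal_pdf(z, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((z - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def qtl_probs_given_mm(c):
    """Pr(QTL genotype | marker genotype MM) in an F2, recombination fraction c."""
    return {"QQ": (1 - c) ** 2, "Qq": 2 * c * (1 - c), "qq": c ** 2}


def mixture_likelihood(z, c, qtl_means, var):
    """l(z | MM) = sum over QTL genotypes of phi(z; mu_Qk, var) * Pr(Qk | MM)."""
    probs = qtl_probs_given_mm(c)
    return sum(normal_pdf(z, qtl_means[g], var) * probs[g] for g in probs)


# Hypothetical genotype means (a = 1, no dominance) and residual variance.
qtl_means = {"QQ": 2.0, "Qq": 1.0, "qq": 0.0}
for c in (0.0, 0.1, 0.5):
    print(c, mixture_likelihood(2.0, c, qtl_means, 1.0))
```

At c = 0 the mixture collapses to a single normal centered on the QQ mean; as c grows, weight shifts to the other two components.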

ML methods combine both detection and estimation of QTL effects/position. The test for a linked QTL is given by the likelihood-ratio (LR) test, $LR = -2 \ln\left[\frac{\max \ell_r(z)}{\max \ell(z)}\right]$, where $\max \ell_r(z)$ is the maximum of the likelihood under a no-linked-QTL model and $\max \ell(z)$ is the maximum of the full likelihood. The LR score is often plotted by trying different locations for the QTL (i.e., values of c) and computing a LOD score for each: $LOD(c) = -\log_{10}\left[\frac{\max \ell_r(z)}{\max \ell(z; c)}\right] = \frac{LR(c)}{2 \ln 10} \simeq \frac{LR(c)}{4.61}$
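The conversion between the LR statistic and the LOD score is simple enough to sketch directly from maximized log-likelihoods. The two log-likelihood values used below are invented for illustration.

```python
import math


def likelihood_ratio(loglik_reduced, loglik_full):
    """LR = -2 ln[max l_r / max l], computed from natural-log likelihoods."""
    return -2.0 * (loglik_reduced - loglik_full)


def lod_score(loglik_reduced, loglik_full):
    """LOD = -log10[max l_r / max l] = LR / (2 ln 10)."""
    return (loglik_full - loglik_reduced) / math.log(10.0)


# Hypothetical maximized log-likelihoods at one candidate QTL position.
lr = likelihood_ratio(-105.0, -100.0)
lod = lod_score(-105.0, -100.0)
print(lr, lod, lr / (2.0 * math.log(10.0)))
```

Repeating this at each candidate position along the chromosome gives the LOD profile that is usually plotted.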

A typical QTL map from a likelihood analysis

Interval Mapping with Marker Cofactors Consider interval mapping using the markers i and i+1. QTLs linked to these markers, but outside this interval, can contribute (falsely) to estimation of QTL position and effect. Now suppose we also add the two markers flanking the interval (i-1 and i+2). Inclusion of markers i-1 and i+2 fully accounts for any linked QTLs to the left of i-1 and to the right of i+2. However, this still does not account for QTLs in the blue areas of the figure. Interval mapping + marker cofactors is called Composite Interval Mapping (CIM). CIM works by adding an additional term to the linear model, $\sum_{k \neq i,\, i+1} b_k x_{kj}$. CIM also (potentially) includes unlinked markers to account for QTL on other chromosomes. [Figure: marker order i-1, i, i+1, i+2, with the interval being mapped lying between markers i and i+1.]

From Line Crosses to Outbred Populations Much of the above discussion was for the analysis of line-cross data. In such cases, all of the F1 offspring have the same genotype, namely MQ/mq, being heterozygous at all loci that show fixed differences between the lines being crossed. We can thus lump all offspring together. In contrast, with outbred populations, each individual has a unique genotype, and hence each parent must be examined separately. For example, if a father is M1/M2, we contrast phenotypic values in offspring getting M1 vs. M2 from this parent. The reason is that (say) a father could be M1Q/M2q, while his mate might be M1q/M2Q. Likewise, many individuals carry no linkage information, e.g., M1Q/M2Q or M1/M1.

General Pedigree Methods Random effects (hence, variance component) method for detecting QTLs in general pedigrees: $z_i = \mu + A_i + A'_i + e_i$, where $z_i$ is the trait value for individual i, $A_i$ is the genetic effect of the chromosomal region of interest, and $A'_i$ is the genetic value of other (background) QTLs. The covariance between individuals i and j is thus $\sigma(z_i, z_j) = R_{ij}\,\sigma^2_A + 2\Theta_{ij}\,\sigma^2_{A'}$, where $R_{ij}$ is the fraction of the chromosomal region shared IBD between individuals i and j and $2\Theta_{ij}$ is the resemblance-between-relatives correction. Mixed-model approaches are used, with variances estimated for each chromosomal region.

Assume z is MVN, giving the covariance matrix as $V = R\,\sigma^2_A + A\,\sigma^2_{A'} + I\,\sigma^2_e$. Here $R_{ij} = 1$ for $i = j$ and $\hat{R}_{ij}$ for $i \neq j$ (estimated from marker data), while $A_{ij} = 1$ for $i = j$ and $2\Theta_{ij}$ for $i \neq j$ (estimated from the pedigree). The resulting likelihood function is $\ell(z \mid \mu, \sigma^2_A, \sigma^2_{A'}, \sigma^2_e) = \frac{1}{\sqrt{(2\pi)^n\,|V|}} \exp\left[-\frac{1}{2}(z - \mu)^T V^{-1} (z - \mu)\right]$. A significant $\sigma^2_A$ indicates a linked QTL.

What are some of the take-home messages from QTL mapping studies? • Most traits show several (4-30) QTLs that are localized to modest-sized chromosomal segments • Detected QTLs typically account for between 5 and 50% of the observed phenotypic variation (in the F2) • Transgressive segregation is often observed, with high trait alleles being found in low trait value lines, and vice versa (hidden variation for selection). • Epistasis appears to be lacking in many studies, but seems to be fairly common in eQTLs

What are some concerns from QTL mapping studies? • Replication of results is often poor. • It is common for a “single” QTL region to show multiple QTLs upon more careful fine-scale analysis, often with effects in opposite directions • QTL mapping does not get at the underlying genes; it only isolates chromosomal regions of interest, usually with rather poor resolution (20 cM ≈ 20 megabases = 200 - 2000 genes) • When isolated in inbred lines, QTLs often show strong interaction effects (G x G, G x E) that are not apparent in a normal analysis. Hence, they are likely very context-specific.

Genotype X environment interaction Additive and dominance effects of QTL are often environment-specific QTL for Drosophila longevity, different larval rearing densities Slide courtesy of Trudy Mackay

More complicated effects Epistatic effects can be sex- and environment specific QTL for Drosophila longevity Slide courtesy of Trudy Mackay

Cracks in the façade? QTL mapping appears to dispute the infinitesimal model, suggesting that a few discrete loci account for much of the variation. Problem 1: Upon closer analysis, many of these high-value regions themselves decompose into several QTLs, not just one. How far such a decomposition can be continued until no more QTLs appear is unresolved. Problem 2: From a molecular-biology standpoint, QTLs have not really led us significantly closer to the underlying genes, and hence to the molecular mechanisms for quantitative trait variation.

Power for detection Most QTL studies are vastly underpowered. How many individuals must be scored in an F2 design in a line cross (a high-power setting)? For α = 0.01, the sample size required for 90% power of detection (F2 design) is roughly $22/d^2$, where $d = a/\sigma$ is the allelic effect in units of SD. Thus, the sample sizes for d = 0.5, 0.2, 0.1, and 0.05 are 88, 550, 2200, and 8800. A typical QTL study is in the range of n = 350, giving d = 0.25. Effect of linkage: for c = 0.05, 0.1, 0.2, the increase in sample size (over c = 0) is a factor of 1.2, 1.6, and 2.8.
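The rule of thumb n ≈ 22/d^2 is easy to tabulate. In the sketch below, the linkage inflation factor 1/(1-2c)^2 is an added assumption: it follows from the marker contrast shrinking to a(1-2c) and reproduces the quoted factors of 1.2, 1.6, and 2.8, but the slide itself does not state the formula. Function names are illustrative.

```python
import math


def f2_sample_size(d):
    """Approximate F2 sample size for 90% power at alpha = 0.01: n = 22 / d^2."""
    return 22.0 / d ** 2


def detectable_effect(n):
    """Standardized effect detectable with n individuals, inverting n = 22 / d^2."""
    return math.sqrt(22.0 / n)


def linkage_inflation(c):
    """Assumed inflation in sample size for a marker at recombination fraction c."""
    return 1.0 / (1.0 - 2.0 * c) ** 2


sizes = {d: round(f2_sample_size(d)) for d in (0.5, 0.2, 0.1, 0.05)}
print(sizes)
print(round(detectable_effect(350), 2))
print([round(linkage_inflation(c), 1) for c in (0.05, 0.1, 0.2)])
```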

Power and Repeatability: The Beavis Effect QTLs with low power of detection tend to have their effects overestimated, often very dramatically. As power of detection increases, the overestimation of detected QTLs becomes far less serious. For example, a QTL accounting for 0.75% of the total F2 variation has only a 3% chance of being detected with 100 F2 progeny (markers spaced at 20 cM). For cases in which such a QTL is detected, the average estimated total variance it accounts for is 15%! This is often called the Beavis effect, after Bill Beavis, who first noticed it in simulation studies. The Beavis effect raises the real concern that many QTLs of apparent large effect may be artifacts. Under an infinitesimal model this is especially a concern.

Detection vs. localization Darvasi & Soller (1997) give an appropriate expression for the sample size required for a 95% confidence interval in position, CI = 1500/(nd)^2. For a QTL with d = 0.25, 0.1, and 0.05, the sample sizes needed for a 1 cM CI are 1500, 3800, and 7600. Fine mapping (localizing to under 1 cM) requires the generation of special lines, such as advanced intercross (AIC) lines or recombinant inbred lines (RILs). In flies, a series of overlapping deficiency strains can be used.

Tradeoffs in sample designs Most QTL mapping studies are highly underpowered. While QTLs of modest effects can be detected with sample sizes of 500 or less, an order of magnitude more is needed for high-resolution mapping. Adding more markers does not really improve power or resolution very much. Increasing the number of individuals does. Ironically, we are now at the stage where it is far easier to score markers than to score phenotypes. This limits the sample sizes that can be used.

Mapping eQTLs A currently very fashionable trend is the mapping of expression QTLs (eQTLs), locations that influence the amount of protein or RNA made by a particular gene. A common design is to use RILs and examine a number of microarrays across a modest set of lines (10-100). Some improvement in power (over an F2 design) occurs because of the ability to replicate within each RIL and the expanded map distances (4-fold) found in RILs vs. the F2. Still, such designs are underpowered, making localization (cis vs. trans) difficult and leaving the contribution from detected eQTLs inflated by the Beavis effect.

How can we improve the ability to detect QTLs? Two complementary approaches, both of which require very dense marker maps, have been suggested. • Association mapping: much finer resolution with a smaller sample size, using historical recombinants • Methods for detecting genes under (or very recently under) selection.

Association mapping The basic idea is very straightforward: if there is very tight linkage between a marker and a QTL, with marker and QTL alleles in linkage disequilibrium (LD), then a random collection of individuals shows a marker-trait association. Since the region of LD is expected to be very small, this method potentially allows for fine mapping using not a collection of relatives (hard to get), but rather a random (and hence likely much larger) collection of individuals from a population.

Linkage disequilibrium mapping The idea is to use a random sample of individuals from the population rather than a large pedigree. Ironically, in the right settings this approach has more power for fine mapping than pedigree analysis. Why? • The key is the expected number of recombinants. In a pedigree, Prob(no recombinants) in n individuals is $(1-c)^n$ • LD mapping uses the historical recombinants in a sample. Prob(no recomb) = $(1-c)^{2t}$, where t = time back to the most recent common ancestor

The expected number of recombinants in a sample of n sibs is cn. The expected number of recombinants in a sample of n random individuals with a time t back to the MRCA (most recent common ancestor) is 2cnt. Hence, if t is large, there are many more expected recombinants in a random sample, and hence more power for very fine mapping (i.e., c < 0.01). Because there are so many expected recombinants, the approach only works when c is very small.
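The contrast between pedigree and population samples can be seen by plugging numbers into the expressions above. The values c = 0.01, n = 100, and t = 500 generations are arbitrary choices for illustration, as are the function names.

```python
def expected_recombinants_sibs(c, n):
    """Expected recombinants among n sibs: c * n."""
    return c * n


def expected_recombinants_random_sample(c, n, t):
    """Expected historical recombinants among n random individuals: 2 * c * n * t."""
    return 2 * c * n * t


def prob_no_recombinants_pedigree(c, n):
    """Probability of no recombinants among n individuals in a pedigree: (1 - c)^n."""
    return (1 - c) ** n


def prob_no_recombination_since_mrca(c, t):
    """Probability of no recombination over 2t meioses back to the MRCA: (1 - c)^(2t)."""
    return (1 - c) ** (2 * t)


# Arbitrary illustrative values.
c, n, t = 0.01, 100, 500
print(expected_recombinants_sibs(c, n))
print(expected_recombinants_random_sample(c, n, t))
print(prob_no_recombinants_pedigree(c, n))
print(prob_no_recombination_since_mrca(c, t))
```

With these inputs the random sample carries about a thousand times as many expected recombinants as the sibship, which is the source of the extra resolution.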

Dense SNP Association Mapping Mapping genes using known sets of relatives can be problematic because of the cost and difficulty of obtaining enough relatives to have sufficient power. By contrast, it is straightforward to gather large sets of unrelated individuals, for example a large number of cases (individuals with a particular trait/disease) and controls (those without it). With a very dense set of SNP markers (dense = very tightly linked), it is possible to scan a random-mating population for markers in LD with QTLs, simply because c is so small that LD has not yet decayed.

These ideas lead to consideration of a strategy of dense SNP association mapping. For example, using 30,000 equally spaced SNPs in the 3000 cM human genome places any QTL within 0.05 cM of a SNP. Hence, for an association created t generations ago (for example, by a new mutant allele appearing at that QTL), the fraction of the original LD still present is at least $(1-0.0005)^t \simeq \exp(-0.0005\,t)$. Thus for mutations 100, 500, and 1000 generations old (2.5K, 12.5K, and 25K years for humans), this fraction is 95.1%, 77.8%, and 60.6%. We thus have large samples and high disequilibrium, the recipe needed to detect linked QTLs of small effect.
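The decay figures can be checked numerically. The sketch below assumes that, over such short distances, 1 cM corresponds to a recombination fraction of about 0.01 (so 0.05 cM gives c = 0.0005); the function names are illustrative.

```python
import math


def ld_fraction_remaining(c, t):
    """Fraction of initial LD left after t generations: (1 - c)^t."""
    return (1.0 - c) ** t


def ld_fraction_remaining_approx(c, t):
    """Exponential approximation exp(-c * t), accurate when c is small."""
    return math.exp(-c * t)


# 30,000 equally spaced SNPs over 3000 cM: spacing 0.1 cM, so at most 0.05 cM to a SNP.
max_distance_cm = 3000 / 30000 / 2
c = max_distance_cm / 100  # assumption: 1 cM ~ recombination fraction 0.01

for t in (100, 500, 1000):
    exact = ld_fraction_remaining(c, t)
    approx = ld_fraction_remaining_approx(c, t)
    print(t, round(100 * exact, 2), round(100 * approx, 2))
```

The exact and exponential forms agree to within a small fraction of a percentage point at these values of c and t.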

Problems with association mapping Good news: We do not need a set of relatives. Hence, it is easier to gather a large sample. Bad news: One can have marker-trait associations in the absence of linkage. For example, if a marker predicts group membership, and being in that group gives you a different trait value, then a marker-trait covariance will occur. This is the problem of population stratification.

When the population being sampled actually consists of several distinct subpopulations that we have lumped together, marker alleles may provide information as to which group an individual belongs. If there are other risk factors in a group, this can create a false association between marker and trait. Example: The Gm marker was thought (for biological reasons) to be an excellent candidate gene for diabetes in the high-risk population of Pima Indians in the American Southwest. Initially a very strong association was observed: The association was re-examined in a population of Pima that were 7/8th (or more) full heritage: Problem: freq(Gm+) in Caucasians (a population at lower risk for diabetes) is 67%, while Gm+ is rare in full-blooded Pima.

Adjusting for population stratification • Use molecular markers to classify individuals into groups, then do association mapping within each group (structured association mapping). This approach typically uses the program STRUCTURE • Use a simple regression approach, adding additional markers as cofactors for group membership and thereby removing their effect: $y = \mu + \sum_{k=1}^{n} \beta_k M_k + \sum_{j=1}^{m} \gamma_j b_j + e$

Scans for genes under selection • Reduction in levels of polymorphism around a selected site (selective sweep), or an increase in the levels of polymorphism around a locus under balancing selection. • Formal tests based on molecular variation (Tajima’s D, MK, etc.), either as a test for candidate genes or as a scan of the genome for regions showing strong signals • Dense SNP approaches based on linkage disequilibrium and age of allele.

A scan of levels of polymorphism can thus suggest sites under selection. [Figure: two panels plotting variation against map location. Panel 1: directional selection (selective sweep), with a dip in variation labeled "local region with reduced mutation rate". Panel 2: balancing selection, with a peak in variation labeled "local region with elevated mutation rate".]

Example: maize domestication gene tb1 There were major changes in plant architecture in the transition from teosinte to maize. The Doebley lab identified a gene, teosinte branched 1 (tb1), involved in many of these architectural changes. Wang et al. (1999) observed a significant decrease in genetic variation in the 5’ NTR region of tb1, suggesting that a selective sweep influenced this region. The sweep did not influence the coding region.

Wang et al (1999) Nature 398: 236.

Clark et al (2004) examined the 5’ tb1 region in more detail, finding evidence for a sweep influencing a region of 60 - 90 kb. Clark et al (2004) PNAS 101: 700.

Formal tests Strict neutral theory: a single parameter describes heterozygosity, the average number of differences between alleles, and the number of singletons (alleles present once in the sample). A number of tests comparing these various measures of within-population variation have been proposed: Tajima’s D, HKA, Fu and Li’s D* and F*, Fu’s W and Fs, Fay and Wu’s H, etc. One could either test a candidate gene or do a genomic scan using dense markers to test a sliding window along a chromosome.

Rejection of neutrality = locus under selection! A central problem with all of these frequency spectrum tests is that a rejection of the strict neutral model can be caused by changes in population size in addition to a locus under selection. Such demographic signals would be present at all loci, so one approach is to use such signals over all loci to correct the test at any particular locus. Another approach is to use marker information to estimate the demographic parameters and then again use these to generate an appropriate null (neutral) model.

LD tests based on dense markers A newer class of tests, not influenced by demographic factors, are those based on the length of linkage disequilibrium around a target site. Under drift, alleles at moderate to high frequencies are old, and hence have shorter tracks of disequilibrium, because recombination has had time to break down longer tracks. LD-based tests of selection look for long tracks of disequilibrium around alleles at high frequency. This requires dense SNP markers.
