Estimating Theta and Rho in Genetic Analysis: Methods and Applications

Getting Parameters from data Comp 790– Coalescence with Mutations

1. Introduction • No fun

2. Estimating θ • Unlike ρ,θ does not shape the genealogy, it modifies genetic types • Mutation-rich samples allow for more accurate estimation of ‘genealogical shape’-parameters. • More segregating sites -> More information

2. Estimating θ • 2.1 Watterson’s estimator • One of the two very popular estimators • The estimator has mean θ:

2. Estimating θ • The estimator is unbiased and variance decreases with increasing ρ: recombination breaks up linkage and reduces correlation between sites • Under exponential growth, it is downwards biased. Under migration, it is upwards biased, because the MRCA tends to be pushed further back in time.

2. Estimating θ • 2.2 Tajima’s estimator • πij: The number of sites that differ between sequence i and j. • πij has mean θ, because πij is the number of segregating sites in a sample of size two. • As a consequence,

2. Estimating θ

2. Estimating θ • Code for Watterson Estimator

2. Estimating θ

2. Estimating θ 2.3. Fu 1994

2. Estimating θ Fu’s other estimators: i-Mutation: only good with large n UPBLUE: does not generalize to settings with recombination

3. Estimating ρ • Estimating ρ is hard, both statistically and computationally. • Using infinite side model, all mutation events can be listed whereas all recombination events cannot. A recombination can only be inferred with certainty if all four gametes are present. • The number of possible genealogical relationship between sequences subject to recombination is unlimited.

3. Estimating ρ • Both TM and HM might overlook important information in the data. They only provides lower bounds to the number of recombination events. (TM: Least number of gene trees required to explain the sample. HM: 5.11.3) • Assume two non-recombining loci with rate ρbetween them. • Likelihood function:L(θ1, θ2, ρ) = P θ1, θ2, ρ(S1,n=k1,S2,n=k2) • For n=2, the two extreme cases: • L(θ1, θ2, 0) = • L(θ1, θ2, ∞) =

3. Estimating ρ • Assume u1=u2, two genes are of same length, then θ1=θ2. If k1=1 and k2=7, L(θ1, θ2, 0)=0.0014 L(θ1, θ2, ∞) =0.0067 (θL = 4 is the maximum likelihood estimator for both ρ=0 and ∞) The likelihood supports two unlinked loci(ρ= ∞) more than two completely linked loci(ρ=0).  ρ>0 even though the data passes the four gamete test and has TM = HM = 0.  recombination can be inferred even in the absence of incompatibilities. • If k1=k2=4, the likelihood supports two complete linked loci.

3. Estimating ρ • Recombination is difficult to take into account in an analysis and it is tempting to ignore it or assume that its effects are minor. Unfortunately it is not true. • 3.1 Estimators based on summary statistics • Wakeley’s estimator: Wakeley(1997) • A complicated function of a complicated function of ρ. ( the form is fully know though ) • Large variance. The expectation doesn’t strongly depend on ρ. • Likelihood and summary statistics: Wall(2000) • Infer ρ based on the likelihood of (Sn, Kn, TM) where Kn is the number of haplotypes in the sample.

3. Estimating ρ • 3.2 Pseudo-likelihood estimators: Hudson (2001b), Fearnhead & Donnelly(2001) • Consider all pairs of segregating sites, ignoring all non-polymorphic sites • Let nij denote the vector of gamete counts for sites i and j (00,01,…). Let ρij be the scaled recombination rate: • Probability of obtaining nij given that both i,j are polymorphic • The proposed pseudo-likelihood function is: • Then to estimate ρ from this pseudo-likelihood. • It depends on ρ only. Because ρij is assumed proportional to sequence length.

3. Estimating ρ • It is pseudo: • 1.Only likelihood of pairs of segregating sites are considered • 2. Pairs are treated as independent of each other • 3.The likelihood of a pair is conditioned on the pair being segregating in both loci

4. Monte Carlo methods • The principle: throw dice several times and calculate the average • Var(g(X))/M • Use Monte Carlo integration to find P(Sn=k) : A naïve approach • Simulate genealogies of n genes, add mutations and count the number of times a genealogy has exactly k mutations. • Many simulated genealogies will not contribute to the sum.

4. Monte Carlo methods • A better approach: Write P(Sn=k) in the form of an integral • Rearrange the terms and we can get: • Where X is gamma distributed with parameters k+1 and θ/2 • Or: • Where Ln is the sum of all branches in the coalescent tree. • It’s better because every simulated values counts. Comp 790– Continuous-Time Coalescence

4. Monte Carlo methods Comp 790– Continuous-Time Coalescence

4. Monte Carlo methods • 4.1 Likelihood curve • Monte Carlo methods becomes more useful in evaluating P(Sn=k) for a whole range of θ values • E.g.: to calculate L(θ) = Pθ(Sn=k) for a large range of θs and single out the θ value with highest probability. • Recall: • Simulate y1,….yM from Ln and calcuate the empirical average: • Note that only one set of simulations is performed and is used to calculate the likelihood for all θ Comp 790– Continuous-Time Coalescence

4. Monte Carlo methods • Alternatively, recall: • one can extend it this way: consider some fixed θ0. for any θ • Using Monte Carlo technique, the integral can be approximated by: • Where x1,…xM are M values obtained from the proposal distribution gamma(k+1, θ0). θ0 is called the ‘driving value’. An appropriate choice of θ0 could be a simple estimator of θ. E.g. Watterson’s estimator. Comp 790– Continuous-Time Coalescence

4. Monte Carlo methods Comp 790– Continuous-Time Coalescence

4. Monte Carlo methods • 4.2 Monte Carlo integration and the coalescent • Full likelihood of a sample under a coalescent model: • H: history D: data • Let’s define H here: • N!(n-1)!/2^(n-1) different coalescent topologies • Impossible to sum up all these. ( we haven't even considered recomb) Comp 790– Continuous-Time Coalescence

4. Monte Carlo methods • A naïve Monte Carlo approach: • It’s not efficient for most of the coalescent topologies will not be compatible with D.  most simulations do not contribute to the likelihood. • A four sequence example: (1/3 compatible ) Comp 790– Continuous-Time Coalescence

4. Monte Carlo methods • Importance Sampling: • Reduce the variance of the estimated probability • Reduce the number of simulations that contribute little to likelihood • Instead of choosing histories from distribution Pθ(H), sample histories from a proposal distribution Q(H) • Now the likelihood of data can be approximated by: Comp 790– Continuous-Time Coalescence

4. Monte Carlo methods • Ideally, one would like to sample from, where because in that case the approximation becomes exact: Not feasible approach. • A proposal distribution between and • Giffiths and Tavare(1994), Stephens and Donnelly(2000) Comp 790– Continuous-Time Coalescence

4. Monte Carlo methods • Giffiths and Tavare(1994): • Let’s go back to 2.4.2 Infinite Site Model Comp 790– Continuous-Time Coalescence

Giffiths and Tavare(1994): • H is defined as a path through the diagram. • H has probability defined by the product of weights attached to the edges that belong to H. • E.g.H’ follows the rightmost path: • 1st term of Q(H’): • θ0 is the driving value • Last five terms are all 1 Comp 790– Continuous-Time Coalescence

4. Monte Carlo methods • Pθ(D|H’) = 1 ??? • is the product of coalescent probabilities of the events defining the history: • Coalescent -> mutation -> coalescent … • The factor in front of a fraction is the probability that a mutation happens in a given lineage(s) or that a coalescent event happens amongst certain pair of genes. Comp 790– Continuous-Time Coalescence

4. Monte Carlo methods • 4.3 Markov Chain Monte Carlo • Kuhner et al.(1995,1998) • Finite sites model • All coalescent topologies are compatible with data • Likelihood ratio: Comp 790– Continuous-Time Coalescence

4. Monte Carlo methods • The importance sampling function is: • Use Metropolis-Hastings algorithm to construct a Markov Chain with distribution Q(H) • The benefit of the approach is that the Markov Chain tends to stay in areas of the tree space that suport the data well before moving to another area. Comp 790– Continuous-Time Coalescence

Estimating Theta and Rho in Genetic Analysis: Methods and Applications

Estimating Theta and Rho in Genetic Analysis: Methods and Applications

Presentation Transcript

Getting data

Cosmological Parameters and the WMAP data

Estimating Parameters Using Measured Data

Extraction of Parameters from Collider Data?

Chapter 6 Estimating Parameters From Observational Data

Getting suitable data

Getting data

Submitting data to (and getting data from!) BODC

Getting data home

Forms (Getting data from users)

Getting Precise Stellar Parameters in PLATO’s first field

Estimating parameters from data

Move your Data Anywhere: Getting Data to and From Diverse Systems

Extraction of SUSY Parameters from Collider Data

Arc: Getting Data

Cosmological parameters from peculiar velocities

Parameters and Data Driving

Cosmological Parameters and the WMAP data

Getting data home

Save your Data from Getting Lost in Translation