Getting parameters from data
1 / 34

Getting Parameters from data - PowerPoint PPT Presentation

  • Uploaded on

Getting Parameters from data. 1. Introduction. No fun. 2. Estimating θ. Unlike ρ , θ does not shape the genealogy, it modifies genetic types Mutation-rich samples allow for more accurate estimation of ‘genealogical shape’-parameters. More segregating sites -> More information.

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
Download Presentation

PowerPoint Slideshow about 'Getting Parameters from data' - clem

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Getting parameters from data
Getting Parameters from data

Comp 790– Coalescence with Mutations

2 estimating
2. Estimating θ

  • Unlike ρ,θ does not shape the genealogy, it modifies genetic types

  • Mutation-rich samples allow for more accurate estimation of ‘genealogical shape’-parameters.

  • More segregating sites -> More information

2 estimating1
2. Estimating θ

  • 2.1 Watterson’s estimator

    • One of the two very popular estimators

    • The estimator has mean θ:

2 estimating2
2. Estimating θ

  • The estimator is unbiased and variance decreases with increasing ρ: recombination breaks up linkage and reduces correlation between sites

  • Under exponential growth, it is downwards biased. Under migration, it is upwards biased, because the MRCA tends to be pushed further back in time.

2 estimating3
2. Estimating θ

  • 2.2 Tajima’s estimator

    • πij: The number of sites that differ between sequence i and j.

    • πij has mean θ, because πij is the number of segregating sites in a sample of size two.

    • As a consequence,

2 estimating5
2. Estimating θ

  • Code for Watterson Estimator

2 estimating8
2. Estimating θ

2.3. Fu 1994

2 estimating9
2. Estimating θ

2.3. Fu 1994

2 estimating10
2. Estimating θ

Fu’s other estimators:

i-Mutation: only good with large n

UPBLUE: does not generalize

to settings with recombination

3 estimating
3. Estimating ρ

  • Estimating ρ is hard, both statistically and computationally.

    • Using infinite side model, all mutation events can be listed whereas all recombination events cannot. A recombination can only be inferred with certainty if all four gametes are present.

    • The number of possible genealogical relationship between sequences subject to recombination is unlimited.

3 estimating1
3. Estimating ρ

  • Both TM and HM might overlook important information in the data. They only provides lower bounds to the number of recombination events. (TM: Least number of gene trees required to explain the sample. HM: 5.11.3)

  • Assume two non-recombining loci with rate ρbetween them.

  • Likelihood function:L(θ1, θ2, ρ) = P θ1, θ2, ρ(S1,n=k1,S2,n=k2)

  • For n=2, the two extreme cases:

  • L(θ1, θ2, 0) =

  • L(θ1, θ2, ∞) =

3 estimating2
3. Estimating ρ

  • Assume u1=u2, two genes are of same length, then θ1=θ2. If k1=1 and k2=7,

    L(θ1, θ2, 0)=0.0014

    L(θ1, θ2, ∞) =0.0067

    (θL = 4 is the maximum likelihood estimator for both ρ=0 and ∞)

    The likelihood supports two unlinked loci(ρ= ∞) more than two completely linked loci(ρ=0).

     ρ>0 even though the data passes the four gamete test and has TM = HM = 0.

     recombination can be inferred even in the absence of incompatibilities.

  • If k1=k2=4, the likelihood supports two complete linked loci.

3 estimating3
3. Estimating ρ

  • Recombination is difficult to take into account in an analysis and it is tempting to ignore it or assume that its effects are minor. Unfortunately it is not true.

  • 3.1 Estimators based on summary statistics

    • Wakeley’s estimator: Wakeley(1997)

      • A complicated function of a complicated function of ρ. ( the form is fully know though )

      • Large variance. The expectation doesn’t strongly depend on ρ.

    • Likelihood and summary statistics: Wall(2000)

      • Infer ρ based on the likelihood of (Sn, Kn, TM) where Kn is the number of haplotypes in the sample.

3 estimating4
3. Estimating ρ

  • 3.2 Pseudo-likelihood estimators: Hudson (2001b), Fearnhead & Donnelly(2001)

    • Consider all pairs of segregating sites, ignoring all non-polymorphic sites

    • Let nij denote the vector of gamete counts for sites i and j (00,01,…). Let ρij be the scaled recombination rate:

    • Probability of obtaining nij given that both i,j are polymorphic

    • The proposed pseudo-likelihood function is:

    • Then to estimate ρ from this pseudo-likelihood.

      • It depends on ρ only. Because ρij is assumed proportional to sequence length.

3 estimating5
3. Estimating ρ

  • It is pseudo:

    • 1.Only likelihood of pairs of segregating sites are considered

    • 2. Pairs are treated as independent of each other

    • 3.The likelihood of a pair is conditioned on the pair being segregating in both loci

4 monte carlo methods
4. Monte Carlo methods

  • The principle: throw dice several times and calculate the average

    • Var(g(X))/M

  • Use Monte Carlo integration to find P(Sn=k) : A naïve approach

    • Simulate genealogies of n genes, add mutations and count the number of times a genealogy has exactly k mutations.

    • Many simulated genealogies will not contribute to the sum.

4 monte carlo methods1
4. Monte Carlo methods

  • A better approach: Write P(Sn=k) in the form of an integral

    • Rearrange the terms and we can get:

      • Where X is gamma distributed with parameters k+1 and θ/2

    • Or:

      • Where Ln is the sum of all branches in the coalescent tree.

    • It’s better because every simulated values counts.

Comp 790– Continuous-Time Coalescence

4 monte carlo methods2
4. Monte Carlo methods

Comp 790– Continuous-Time Coalescence

4 monte carlo methods3
4. Monte Carlo methods

  • 4.1 Likelihood curve

    • Monte Carlo methods becomes more useful in evaluating P(Sn=k) for a whole range of θ values

    • E.g.: to calculate L(θ) = Pθ(Sn=k) for a large range of θs and single out the θ value with highest probability.

    • Recall:

    • Simulate y1,….yM from Ln and calcuate the empirical average:

    • Note that only one set of simulations is performed and is used to calculate the likelihood for all θ

Comp 790– Continuous-Time Coalescence

4 monte carlo methods4
4. Monte Carlo methods

  • Alternatively, recall:

  • one can extend it this way: consider some fixed θ0. for any θ

  • Using Monte Carlo technique, the integral can be approximated by:

  • Where x1,…xM are M values obtained from the proposal distribution gamma(k+1, θ0). θ0 is called the ‘driving value’. An appropriate choice of θ0 could be a simple estimator of θ. E.g. Watterson’s estimator.

Comp 790– Continuous-Time Coalescence

4 monte carlo methods5
4. Monte Carlo methods

Comp 790– Continuous-Time Coalescence

4 monte carlo methods6
4. Monte Carlo methods

  • 4.2 Monte Carlo integration and the coalescent

    • Full likelihood of a sample under a coalescent model:

    • H: history D: data

    • Let’s define H here:

    • N!(n-1)!/2^(n-1) different coalescent topologies

    • Impossible to sum up all these. ( we haven't even considered recomb)

Comp 790– Continuous-Time Coalescence

4 monte carlo methods7
4. Monte Carlo methods

  • A naïve Monte Carlo approach:

    • It’s not efficient for most of the coalescent topologies will not be compatible with D.  most simulations do not contribute to the likelihood.

    • A four sequence example: (1/3 compatible )

Comp 790– Continuous-Time Coalescence

4 monte carlo methods8
4. Monte Carlo methods

  • Importance Sampling:

    • Reduce the variance of the estimated probability

    • Reduce the number of simulations that contribute little to likelihood

  • Instead of choosing histories from distribution Pθ(H), sample histories from a proposal distribution Q(H)

  • Now the likelihood of data can be approximated by:

Comp 790– Continuous-Time Coalescence

4 monte carlo methods9
4. Monte Carlo methods

  • Ideally, one would like to sample from,


    because in that case the approximation becomes exact:

    Not feasible approach.

  • A proposal distribution between and

  • Giffiths and Tavare(1994), Stephens and Donnelly(2000)

Comp 790– Continuous-Time Coalescence

4 monte carlo methods10
4. Monte Carlo methods

  • Giffiths and Tavare(1994):

  • Let’s go back to 2.4.2 Infinite Site Model

Comp 790– Continuous-Time Coalescence

Getting parameters from data

  • Giffiths and Tavare(1994):

  • H is defined as a path through the diagram.

  • H has probability defined by the product of weights attached to the edges that belong to H.

  • E.g.H’ follows the rightmost path:

    • 1st term of Q(H’):

    • θ0 is the driving value

    • Last five terms are all 1

Comp 790– Continuous-Time Coalescence

4 monte carlo methods11
4. Monte Carlo methods

  • Pθ(D|H’) = 1 ???

  • is the product of coalescent probabilities of the events defining the history:

    • Coalescent -> mutation -> coalescent …

  • The factor in front of a fraction is the probability that a mutation happens in a given lineage(s) or that a coalescent event happens amongst certain pair of genes.

Comp 790– Continuous-Time Coalescence

4 monte carlo methods12
4. Monte Carlo methods

  • 4.3 Markov Chain Monte Carlo

    • Kuhner et al.(1995,1998)

    • Finite sites model

    • All coalescent topologies are compatible with data

    • Likelihood ratio:

Comp 790– Continuous-Time Coalescence

4 monte carlo methods13
4. Monte Carlo methods

  • The importance sampling function is:

  • Use Metropolis-Hastings algorithm to construct a Markov Chain with distribution Q(H)

  • The benefit of the approach is that the Markov Chain tends to stay in areas of the tree space that suport the data well before moving to another area.

Comp 790– Continuous-Time Coalescence