
Methods for sampling genealogies in complex models of divergence

This paper discusses different methods for sampling genealogies in complex models of divergence, exploring both genealogy sampling and summary statistics approaches. It introduces a new method that generates genealogy samples and approximates the posterior probability of the model parameters.


Presentation Transcript


  1. Methods for sampling genealogies in complex models of divergence • Jody Hey, Rutgers University

  2. Acknowledgements • Model development: Rasmus Nielsen • Chimpanzee studies: Yong-Jin Won, Yong Wang, Sang Chul Choi

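  3. The Isolation with Migration Model [Diagram: an ancestral population of size NA (past) splits at time t into two descendant populations of sizes N1 and N2 (present, the populations for data collection), with migration rates m1 and m2 between them.] Θ includes six parameters: N1, N2, NA, m1, m2, t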

  4. Treating genealogies as a nuisance variable Θ – parameters of the model (e.g. population sizes, migration rates) X – data G – genealogy (i.e. coalescent tree)

  5. In practice • Recombination is assumed to be zero within loci and high between loci • The integral over genealogies must be approximated using samples of genealogies • This is slow
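The expression shown on slide 4 was an image and did not survive in the transcript. As a hedged reconstruction, the standard way to treat G as a nuisance variable is to integrate it out of the likelihood. The per-locus product in the second line is an assumption that follows from the recombination assumptions on slide 5 (loci indexed by a hypothetical subscript l); neither line is text from the slides.

```latex
% Likelihood of the parameters, with the genealogy G integrated out
L(\Theta \mid X) = P(X \mid \Theta) = \int P(X \mid G)\, P(G \mid \Theta)\, dG

% With free recombination between loci, loci are treated as independent
L(\Theta \mid X) = \prod_{l} \int P(X_l \mid G_l)\, P(G_l \mid \Theta)\, dG_l
```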

  6. Instead of sampling genealogies: approximate likelihoods using summary statistics • Summary statistic methods have become common due to the limitations of methods that sample genealogies • Can work with loci that have histories of recombination • Can be fast • But they do not use all of the information in the data • So far, they do not perform well with models and histories that include gene exchange

  7. Competition between two lines of research: genealogy sampling, and summary statistics • Genealogy Sampling • limited by assumptions on recombination (so far) • Slow • Works well for estimating parameters • Summary Statistics • Not limited by recombination • Faster • Does not work so well (so far)

  8. A new method for sampling genealogies • We would like a smaller MCMC state space, for which it is easier to design an MCMC updating scheme that leads to rapid convergence • We would like an approach that generates an analytic likelihood function in multiple dimensions • But one that avoids the frailties of that approach, which stem from using samples of G conditioned on a driving value of Θ, Θ0 (Kuhner et al., 1995) • Hey & Nielsen 2007 PNAS 104:2785–2790.

  9. Reconsidering the integration over genealogies Consider an alternative expression, that also integrates over G, but that directly yields a posterior probability of Θ

  10. This is an expectation of P(Θ|G), and it can be approximated given a sample of genealogies drawn at random from the posterior distribution of G, P(G|X) • This step does not depend on the data, X: all the information in the data is contained in the sample drawn from P(G|X) • Yields an analytic function
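The equation on slides 9 and 10 was also an image. The block below is a hedged reconstruction from the slide text: an integral over G that yields P(Θ|X) directly, read as an expectation of P(Θ|G) and approximated by an average over k sampled genealogies. The sample size k and the index i are notation introduced here.

```latex
% Posterior of the parameters as an expectation over the posterior of G
P(\Theta \mid X) = \int P(\Theta \mid G)\, P(G \mid X)\, dG

% Approximation from k genealogies G_1, ..., G_k drawn from P(G | X)
P(\Theta \mid X) \approx \frac{1}{k} \sum_{i=1}^{k} P(\Theta \mid G_i),
\qquad G_i \sim P(G \mid X)

% Each term follows from Bayes' rule and needs the prior probability of G
P(\Theta \mid G) = \frac{P(G \mid \Theta)\, P(\Theta)}{P(G)},
\qquad P(G) = \int P(G \mid \Theta)\, P(\Theta)\, d\Theta
```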

  11. The key to generating samples of genealogies from P(G|X) and to approximating P(Θ|X) is the calculation of the prior probability of G, P(G) • In fact, this can be calculated analytically for the main demographic components of Θ.

  12. Sequence of operations • Run a Markov chain over G and generate random samples from P(G|X) • For each G drawn from this distribution, save P(G) and all necessary information for calculating P(G|Θ) • Build a function that approximates the posterior density of Θ • This is an analytic function and can be evaluated for any value of Θ • The function can be differentiated and searched for maxima.
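The last three steps can be sketched in a deliberately small toy setting. This is not the author's program: it assumes a single population with one parameter θ and a uniform prior on (0, THETA_MAX). It also assumes each sampled genealogy has been reduced to two numbers, the count of coalescent events c and f = Σ k(k−1)/2 · t_k, and it obtains P(G) by numerical quadrature rather than analytically. The MCMC step itself is not shown; the (c, f) pairs stand in for its output and are hypothetical.

```python
import math

THETA_MAX = 10.0  # assumed upper bound of a uniform prior on theta


def log_p_g_given_theta(c, f, theta):
    """log P(G | theta) for a single-population coalescent.

    G is summarised by c coalescent events and
    f = sum over intervals of k(k-1)/2 * t_k (k = number of lineages).
    """
    return c * math.log(2.0 / theta) - 2.0 * f / theta


def p_g(c, f, theta_max=THETA_MAX, steps=20000):
    """Prior probability of G under the uniform prior (midpoint rule)."""
    h = theta_max / steps
    total = sum(
        math.exp(log_p_g_given_theta(c, f, (i + 0.5) * h)) for i in range(steps)
    )
    return total * h / theta_max


def build_posterior(samples, theta_max=THETA_MAX):
    """Return a function theta -> approximate posterior density.

    `samples` is a list of (c, f) summaries, assumed to be draws
    from P(G | X). P(G) is computed once per sample and saved.
    """
    saved = [(c, f, p_g(c, f, theta_max)) for c, f in samples]
    prior = 1.0 / theta_max

    def posterior(theta):
        if not 0.0 < theta < theta_max:
            return 0.0
        terms = (
            math.exp(log_p_g_given_theta(c, f, theta)) * prior / pg
            for c, f, pg in saved
        )
        return sum(terms) / len(saved)  # average of P(theta | G_i)

    return posterior
```

Calling `build_posterior([(4, 3.0), (4, 2.5), (4, 4.1)])` returns a function that can be evaluated at any θ inside the prior range, which is the property the slide emphasises. A maximum can then be located by any one-dimensional search.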

  13. Comparing the likelihood ratio for a true nested model with the likelihood for the full model • 100 data sets simulated under a model with just 2 population sizes and 1 migration rate • [Figure: –2 × log-likelihood ratio for the 100 simulated data sets, compared with a χ² distribution with 2 degrees of freedom]
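The comparison on slide 13 can be illustrated with a minimal sketch. It relies on the fact that a χ² distribution with 2 degrees of freedom has the closed-form tail probability exp(−x/2). The function names and the log-likelihood values in the usage note are hypothetical, not taken from the talk.

```python
import math


def chi2_sf_2df(x):
    """Upper-tail probability of a chi-square with 2 degrees of freedom."""
    return math.exp(-x / 2.0) if x > 0 else 1.0


def lrt_p_value(loglik_full, loglik_nested):
    """Return (-2 * log-likelihood ratio, p-value against chi-square, 2 df)."""
    stat = 2.0 * (loglik_full - loglik_nested)
    return stat, chi2_sf_2df(max(stat, 0.0))


def fraction_rejected(stats, alpha=0.05):
    """Share of simulated data sets whose statistic exceeds the critical value.

    If the chi-square approximation holds and the nested model is true,
    this should be close to alpha.
    """
    critical = -2.0 * math.log(alpha)  # inverse of exp(-x/2) = alpha
    return sum(s > critical for s in stats) / len(stats)
```

For example, `lrt_p_value(-100.0, -103.0)` gives a statistic of 6.0, just above the 5% critical value of about 5.99.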

  14. Chimpanzee Distributions

  15. Chimpanzee Divergence: Posterior Density for Population Size (Ne) [Figure: two panels, "Original results of Won & Hey" and "New Method", each with curves for P. t. troglodytes, P. t. verus, and the ancestor]

  16. Models for more than two populations • Assume that we know the species phylogeny

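  17. For three sampled populations [Diagram: sampled populations N1, N2, and N3; ancestral populations NA0 and NA1; splitting times t0 and t1; migration rates (m) in both directions between coexisting populations] Θ includes 15 parameters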

  18. Multi-population IMa: The Good News • Adding more populations does not introduce new mathematical issues • Building the application is mostly a programming problem, not a math problem • Can do any number of populations for a known phylogeny • Program will “work” for 10 populations (assuming a known phylogeny): 19 population size parameters, 162 migration rate parameters, 9 population splitting times
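The parameter counts quoted on slides 3, 17, and 18 follow a simple pattern, sketched below. The closed forms are inferred from those counts and are not stated in the talk. They assume a fully resolved phylogeny with splits ordered in time and two migration rates for every pair of populations that coexist at some point. The function name is hypothetical.

```python
def im_parameter_counts(k):
    """Parameter counts for an isolation-with-migration model.

    k is the number of sampled populations on a known, fully
    resolved phylogeny.
    """
    if k < 2:
        raise ValueError("need at least two sampled populations")
    sizes = 2 * k - 1             # k sampled + (k - 1) ancestral populations
    splits = k - 1                # one splitting time per internal node
    migration = 2 * (k - 1) ** 2  # two rates per coexisting pair
    return {
        "population_sizes": sizes,
        "migration_rates": migration,
        "splitting_times": splits,
        "total": sizes + migration + splits,
    }
```

With these assumptions, k = 2 gives 6 parameters and k = 3 gives 15, matching the two- and three-population slides. k = 10 gives the 19, 162, and 9 listed above.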

  19. Multiple populations: The Bad News • A lot of data will be required for many situations (hundreds of loci) • Models with many parameters introduce much more potential for model identifiability problems • Program is still slow, and applications with hundreds of loci will require new computing configurations

  20. Chimpanzees in a four-population Isolation with Migration Model • Pan paniscus (Bonobo) • P. t. troglodytes (Central African Chimpanzee) • P. t. schweinfurthii (East African Chimpanzee) • P. t. verus (West African Chimpanzee)

  21. Chimpanzee Distributions

  22. Chimpanzee phylogeny* [Tree with tips: P. t. schweinfurthii (Eastern), P. t. troglodytes (Central), P. t. verus (West), P. paniscus (Bonobo)] *Becquet et al. (2007) PLoS Genet 3:e66 (based on 310 microsatellite loci)

  23. Data • Fischer et al., Curr. Biol. 16:1133-1138: 26 loci, approx. 20 gene copies per species, average length 700 bp • Yu et al. (2003) Genetics 164:1511-1518: 42 loci, approx. 10 gene copies per species, average length 400 bp • Deinard & Kidd (2000): HOXB6 and APOB • Single loci from mitochondria, X chromosome, Y chromosome • Total of 73 loci

  24. Parameter Estimates for Four Chimpanzee Populations [Figure: population tree for Western, Eastern, Central, and Bonobo, marking migration significantly greater than zero. Values as extracted, in figure order, not matched to branches: effective population sizes 7,100; 26,000; 8,200; 7,800; 30,000; 6,900; 17,000. Splitting times in years: 79,000; 440,000; 890,000]
