
Associate Professor Daniel Wilson

Phylogenetics in Practice. Oxford Doctoral Training Programme, Monday 20th – Tuesday 21st November 2017. Wellcome Trust / Royal Society Sir Henry Dale Fellow, Nuffield Department of Medicine. Sarah Earle, Biological Sciences. Steven Lin, Bioinformatics. Clara Grazian, Statistics.


Presentation Transcript


  1. Phylogenetics in Practice Oxford Doctoral Training Programme Monday 20th – Tuesday 21st November 2017 Wellcome Trust / Royal Society Sir Henry Dale Fellow Nuffield Department of Medicine Sarah Earle Biological Sciences Steven Lin Bioinformatics Clara Grazian Statistics Associate Professor Daniel Wilson

  2. Molecular clocks & neutral evolution

  3. Molecular clocks & neutral evolution • When converting phylogenetic branch lengths to calendar time, it is common to assume a strict molecular clock • i.e. a constant rate of substitution along the phylogeny • Is it reasonable to assume a molecular clock? • Population genetics provides some insights • Consider an idealized population which is • Large • Constant • Well mixed

  4. Molecular clocks & neutral evolution The probability of fixation of a mutant depends on its current frequency and selective advantage: $\Pr(\text{fixation}) = \dfrac{1 - e^{-\gamma f}}{1 - e^{-\gamma}}$, where $\gamma = 2PNs$. P is the ploidy; N is the population size; (1+s) is the relative fitness of the mutant; f is the mutant frequency; γ is the population-scaled selective advantage. Kimura (1955) Cold Spring Harbor Symposium on Quantitative Biology 20: 33

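  5. Molecular clocks & neutral evolution The probability of fixation of a new advantageous mutation in a large population is approximately twice its selective advantage, i.e. $2s$. For a single new mutant $f = 1/(PN)$, so $\gamma f = 2s$ and $\dfrac{1 - e^{-\gamma f}}{1 - e^{-\gamma}} \approx 1 - e^{-2s} \approx 2s$ when $\gamma$ is large and $s$ is small. (1+s) is the relative fitness of the mutant; γ is the population-scaled selective advantage. Kimura (1955) Cold Spring Harbor Symposium on Quantitative Biology 20: 33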

  6. Molecular clocks & neutral evolution The probability of fixation of a neutral mutation equals its initial frequency, i.e. $1/(PN)$. P is the ploidy; N is the population size; γ is the population-scaled selective advantage. Kimura (1955) Cold Spring Harbor Symposium on Quantitative Biology 20: 33

  7. Molecular clocks & neutral evolution If the population generates mutants at rate $PN\mu$ then the substitution rate is $PN\mu \times \dfrac{1}{PN} = \mu$. This result implies that under strict neutrality the substitution rate equals the mutation rate. So a constant substitution rate requires a constant mutation rate. γ is the population-scaled selective advantage. Kimura (1955) Cold Spring Harbor Symposium on Quantitative Biology 20: 33
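The fixation formula and its two limits (about 2s for an advantageous new mutant, 1/(PN) for a neutral one) can be checked numerically. The following is a minimal sketch, not code from the course; the function names and the example values of P, N, s and μ are illustrative assumptions.

```python
import math


def fixation_probability(f, gamma):
    """Kimura's fixation probability for a mutant at frequency f.

    gamma = 2*P*N*s is the population-scaled selective advantage.
    """
    if abs(gamma) < 1e-12:
        # Neutral limit: (1 - e^{-gamma f}) / (1 - e^{-gamma}) tends to f
        return f
    # expm1(x) = e^x - 1, so this ratio equals (1 - e^{-gamma f}) / (1 - e^{-gamma})
    return math.expm1(-gamma * f) / math.expm1(-gamma)


def substitution_rate(P, N, mu, s=0.0):
    """Rate at which new mutants arise (P*N*mu) times their fixation probability."""
    gamma = 2 * P * N * s
    new_mutant_frequency = 1 / (P * N)
    return P * N * mu * fixation_probability(new_mutant_frequency, gamma)


# Illustrative values only (assumptions, not taken from the slides)
P, N, s, mu = 2, 10_000, 0.01, 1e-8
new_mutant = 1 / (P * N)
print(fixation_probability(new_mutant, 2 * P * N * s))  # close to 2s
print(fixation_probability(new_mutant, 0.0))            # the initial frequency 1/(PN)
print(substitution_rate(P, N, mu))                      # close to mu
```

Very strongly deleterious mutants (γ below roughly -700) would overflow `math.expm1` in this sketch; a production version would need to handle that case separately.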

  8. Molecular clocks & neutral evolution What about beneficial and deleterious mutants? Kimura’s Neutral Theory allows for beneficial mutants but assumes they are rare and fleeting. The substitution of deleterious mutants is likewise rare and fleeting. So the Neutral Theory predicts a strict molecular clock if the neutral mutation rate is constant. γ is the population-scaled selective advantage. Kimura (1983) The Neutral Theory of Molecular Evolution

  9. Coalescent theory

  10. Coalescent theory • Modern population genetics is largely framed in terms of coalescent theory • Kingman (1982) Journal of Applied Probability 19: 27 • The coalescent is a statistical distribution for the relatedness of individuals assuming the standard neutral model in which an idealized population is: • Large • Constant • Well-mixed • With negligible fitness differences • The relatedness or genealogy modelled by the coalescent is • A tree in the absence of recombination • A network otherwise • The coalescent is useful because • It is statistically tractable • Its assumptions can be defended by appealing to Neutral Theory • It provides a baseline on which many extensions, including the coalescent with selection, have been built

  11. Coalescent theory In the coalescent model: • Every tree topology is equally, i.e. uniformly, likely • The intervals between successive coalescent events follow an exponential distribution with mean $\dfrac{2PN}{k(k-1)}$ • where k is the number of lineages in that interval • P is the ploidy and N is the population size
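The interval distribution above is easy to simulate. This is a rough sketch under the stated model only (exponential waiting times with mean 2PN/(k(k-1)) generations while k lineages remain); the function names and example values are assumptions, and tree topology is not simulated.

```python
import random


def simulate_coalescent_intervals(n, P, N, rng):
    """Waiting times (in generations) while k = n, n-1, ..., 2 lineages remain."""
    intervals = []
    for k in range(n, 1, -1):
        mean = 2 * P * N / (k * (k - 1))
        # expovariate takes a rate, i.e. 1 / mean
        intervals.append(rng.expovariate(1 / mean))
    return intervals


def expected_tree_height(n, P, N):
    """Sum of the mean intervals, which simplifies to 2PN * (1 - 1/n)."""
    return sum(2 * P * N / (k * (k - 1)) for k in range(2, n + 1))


# Illustrative values only (assumptions): 10 samples, diploid, N = 1000
rng = random.Random(1)
heights = [sum(simulate_coalescent_intervals(10, 2, 1000, rng)) for _ in range(5000)]
print(sum(heights) / len(heights), expected_tree_height(10, 2, 1000))
```

Because the mean interval shrinks as k(k-1) grows, most of the tree height comes from the last few coalescences near the root.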

  12. The coalescent with population growth

  13. The coalescent with population growth Compared to a constant-sized population • Every tree topology is still uniformly likely • But the intervals between successive coalescent events are compressed, with the compression getting stronger as the population contracts further back in time • This reduces the difference in mean coalescence intervals near the tips vs the root of the tree • The tree is said to become more star-like • These patterns make it possible to estimate growth rates from trees • Slatkin and Hudson (1991) Genetics 129: 555
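The compression of intervals can be illustrated by simulation. The slide does not specify a growth model, so this sketch assumes exponential growth, with population size N0 at present and N(t) = N0·exp(-rt) at t generations in the past. It rescales each constant-size waiting time by solving the integral of exp(ru) over the interval. All names and values are hypothetical.

```python
import math
import random


def coalescent_times(n, P, N0, r, rng):
    """Cumulative coalescence times in generations before the present.

    r == 0 gives a constant population of size N0.
    r > 0 assumes N(t) = N0 * exp(-r * t) going back in time.
    """
    t = 0.0
    times = []
    for k in range(n, 1, -1):
        # Waiting time if the population had stayed at size N0
        delta = rng.expovariate(k * (k - 1) / (2 * P * N0))
        if r == 0:
            t += delta
        else:
            # Solve (exp(r * t_new) - exp(r * t)) / r = delta for t_new
            t = math.log(math.exp(r * t) + r * delta) / r
        times.append(t)
    return times


# Same random draws for both scenarios, so only the growth rate differs
constant = coalescent_times(10, 1, 10_000, 0.0, random.Random(7))
growing = coalescent_times(10, 1, 10_000, 0.001, random.Random(7))
for c, g in zip(constant, growing):
    print(round(c), round(g))
```

With identical draws, every coalescence under growth happens no later than in the constant-size case, and the deepest intervals shrink the most.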

  14. Bayesian phylogenetics

  15. Maximum likelihood [Diagram: Statistical Model, Data, Parameters] $\dfrac{\Pr(\text{Data} \mid \theta = 1)}{\Pr(\text{Data} \mid \theta = 0)} = \dfrac{1}{f}$ The maximum likelihood estimate (MLE) of θ equals 1 if f < 1. But is he more likely to be guilty than not guilty?

  16. Bayesian inference [Diagram: Statistical Model, Data, Parameters] Bayesians base inference on posterior probabilities, which account for prior information as well as the data: $\dfrac{\Pr(\theta = 1 \mid \text{Data})}{\Pr(\theta = 0 \mid \text{Data})} = \dfrac{\Pr(\theta = 1)}{\Pr(\theta = 0)} \times \dfrac{\Pr(\text{Data} \mid \theta = 1)}{\Pr(\text{Data} \mid \theta = 0)} = \dfrac{1}{N - 1} \times \dfrac{1}{f}$ The maximum likelihood estimate (MLE) of θ equals 1 if f < 1. The maximum a posteriori (MAP) estimate of θ equals 1 if f < 1/(N – 1)

  17. Bayesian inference • Extensive theory motivates the use of Bayesian inference • Bayesian estimates are • Biased by the prior: this is a good thing if your prior is reliable • Less noisy than MLEs, particularly in small samples, because the prior adds information • Increasingly independent of the prior as you get more and more data • Not reliant on large-sample asymptotic approximations for the construction of hypothesis tests and credibility intervals • Invariant to transformation depending on how you construct point estimates and credibility intervals • However • When you do not have reliable or well-informed prior information you need to find another way to find and justify some sort of reference, objective or non-informative prior

  18. Bayesian inference • Bayesian inference is based on the posterior distribution, from which point estimates and credibility intervals are derived $\Pr(\theta = 1 \mid \text{Data}) = \dfrac{\Pr(\theta = 1)\,\Pr(\text{Data} \mid \theta = 1)}{\sum_x \Pr(\theta = x)\,\Pr(\text{Data} \mid \theta = x)} = \dfrac{1/N}{1/N + \frac{N - 1}{N} \times f} = \dfrac{1}{1 + (N - 1) f}$ and $\Pr(\theta = 0 \mid \text{Data}) = \dfrac{(N - 1) f}{1 + (N - 1) f}$ E.g. N = 1000, f = 1/100
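The worked example (N = 1000, f = 1/100) can be evaluated exactly. This is a small sketch that takes the prior Pr(θ = 1) = 1/N and the likelihoods Pr(Data | θ = 1) = 1 and Pr(Data | θ = 0) = f from the formulas above; the function name is an assumption.

```python
from fractions import Fraction


def posterior_theta1(N, f):
    """Pr(theta = 1 | Data) by Bayes' rule for the two-state example."""
    prior1 = Fraction(1, N)
    prior0 = 1 - prior1
    lik1, lik0 = 1, f
    return prior1 * lik1 / (prior1 * lik1 + prior0 * lik0)


N, f = 1000, Fraction(1, 100)
post1 = posterior_theta1(N, f)
print(post1, float(post1))  # 100/1099, about 0.091

mle = 1 if f < 1 else 0                    # likelihood alone favours theta = 1
map_estimate = 1 if post1 > Fraction(1, 2) else 0
print(mle, map_estimate)                   # MLE is 1 but MAP is 0
```

With these values the data are 100 times more probable under θ = 1, yet the posterior probability of θ = 1 is only about 9% because the prior odds are 1 to 999.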

  19. Bayesian inference • Point estimates can be defined as the • Posterior mean • The expected value of the parameter averaged over the uncertainty in the posterior • Posterior median • The point the parameter is equally likely a posteriori to lie above or below • Posterior mode • The point with the highest posterior probability $E(\theta \mid \text{Data}) = \dfrac{1}{1 + (N - 1) f}$ and $\theta_{\text{MAP}} = 1$ if $f < 1/(N - 1)$ or 0 otherwise E.g. N = 1000, f = 1/100

  20. Bayesian inference • 95% credibility intervals can be defined as any interval containing 95% of the posterior probability e.g.: • The 95% highest posterior density (HPD) interval • The 95% HPDI = (1) if $f < \frac{1}{N} \times \frac{1}{20}$; (0) if $f > \frac{1}{N} \times 20$; (0,1) otherwise • The (2.5%, 97.5%) equal-tailed interval • The 95% equal-tailed interval = (1) if $f < \frac{1}{N} \times \frac{1}{40}$; (0) if $f > \frac{1}{N} \times 40$; (0,1) otherwise E.g. N = 1000, f = 1/100
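For a discrete parameter, a highest posterior density set can be built by adding values in order of decreasing posterior probability until the target level is reached. The sketch below applies that idea to the two-state example; the function names are assumptions, and the thresholds on the slide are rounded versions of the exact cut-offs this code implements.

```python
def posterior(N, f):
    """Posterior over theta in {0, 1} with prior Pr(theta = 1) = 1/N."""
    p1 = 1 / (1 + (N - 1) * f)
    return {1: p1, 0: 1 - p1}


def hpd_set(post, level=0.95):
    """Smallest set of values whose posterior probabilities sum to at least `level`."""
    ranked = sorted(post.items(), key=lambda item: item[1], reverse=True)
    chosen, total = [], 0.0
    for value, prob in ranked:
        chosen.append(value)
        total += prob
        if total >= level:
            break
    return sorted(chosen)


print(hpd_set(posterior(1000, 1 / 100)))  # both values are needed to reach 95%
print(hpd_set(posterior(1000, 1e-6)))     # theta = 1 alone exceeds 95%
print(hpd_set(posterior(1000, 0.5)))      # theta = 0 alone exceeds 95%
```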

  21. Bayesian tree builders • Slow • Usage requires mathematical prior distributions • Requires specialist software • Principled – theoretical performance guarantees • Quantitative – able to assess uncertainty • Assumptions explicit • Exploits all information

  22. BEAST

  23. BEAST • Bayesian Evolutionary Analysis Sampling Trees • Drummond, Suchard, Xie, Rambaut (2012) Molecular Biology and Evolution 29: 1969 • A key question is what prior distribution to assume for the tree and branch lengths • The coalescent has advantages as a prior • It is motivated by bottom-up population genetics theory • The tree is forced to be rooted with tips at the known sampling dates – these constraints improve interpretability • Extensions to the standard coalescent such as population growth can be used, and parameters such as growth rates jointly estimated • Other priors are available • The standard coalescent is not appropriate for mixed species data • BEAST also implements the Yule process for rooted, dated trees • MrBayes implements priors for unrooted trees • Ronquist, Huelsenbeck (2003) Bioinformatics 19:1572 • Priors are needed for all other parameters such as the clock rate, growth rate and transition:transversion ratio.

  24. BEAST • To find the Bayesian maximum a posteriori (MAP) estimate of the phylogeny and branch lengths would require a search strategy similar to maximum likelihood methods • The problem with ML and MAP estimates is they are point estimates only and don’t quantify uncertainty • Confidence intervals are approximated for ML (and distance-based) trees using methods like bootstrap • Bayesians can instead use the posterior distribution for both point estimates and credibility intervals • The challenge is to explore the posterior distribution of the tree, branch lengths and other parameters • BEAST uses Markov chain Monte Carlo (MCMC) to achieve this

  25. BEAST • Sometimes it is possible to write a formula for the point estimates and credibility intervals • Sometimes the parameter space is small enough to calculate these quantities by numerical integration, optimization or root-finding • When neither is possible, an alternative way to obtain point and interval estimates is simulation • Sample (i.e. simulate) e.g. 1000 draws from the posterior distribution • Approximate the posterior mean as the sample mean • Approximate the 95% credibility interval from the sample percentiles • The problem is that direct simulation may not be possible • Instead, MCMC starts from an arbitrary parameter value and proposes moves that are accepted or rejected depending on how much they increase or decrease the posterior • Hastings (1970) Biometrika 57: 97 • A random walk is produced which eventually is long enough to thin down to an approximately independent set of draws from the posterior
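The accept/reject random walk described above can be sketched on a toy problem. This is not BEAST's implementation, which proposes moves on trees and many parameters. It is a minimal random-walk Metropolis sampler for a single parameter, with an assumed toy target (a normal posterior with mean 3 and standard deviation 1) and hypothetical function names.

```python
import math
import random


def metropolis(log_post, start, n_iter, step, rng):
    """Random-walk Metropolis: propose x + Normal(0, step), accept or reject."""
    x = start
    lp = log_post(x)
    chain = []
    for _ in range(n_iter):
        proposal = x + rng.gauss(0.0, step)
        lp_new = log_post(proposal)
        # Accept with probability min(1, posterior ratio)
        if rng.random() < math.exp(min(0.0, lp_new - lp)):
            x, lp = proposal, lp_new
        chain.append(x)
    return chain


def toy_log_posterior(x):
    """Assumed toy target: unnormalised log density of Normal(3, 1)."""
    return -0.5 * (x - 3.0) ** 2


chain = metropolis(toy_log_posterior, start=0.0, n_iter=20_000, step=1.0,
                   rng=random.Random(2017))
draws = sorted(chain[1000::10])  # drop burn-in, then thin to reduce autocorrelation
mean = sum(draws) / len(draws)
lower = draws[int(0.025 * len(draws))]
upper = draws[int(0.975 * len(draws))]
print(mean, (lower, upper))      # sample mean and 95% equal-tailed interval
```

Only the ratio of posterior densities is needed, so the normalising constant of the posterior never has to be computed.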

  26. BEAST After 1000 iterations ESS = 250
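The effective sample size (ESS) measures how many independent draws an autocorrelated chain is worth. One simple estimator, sketched here, divides the chain length by 1 + 2 × (sum of positive-lag autocorrelations), stopping at the first non-positive autocorrelation. This truncation rule is an assumption chosen for brevity and is not necessarily the estimator used by BEAST or its companion tools.

```python
def effective_sample_size(chain):
    """ESS = n / (1 + 2 * sum of autocorrelations up to the first non-positive lag)."""
    n = len(chain)
    mean = sum(chain) / n
    var = sum((x - mean) ** 2 for x in chain) / n
    if var == 0:
        raise ValueError("chain is constant, so autocorrelation is undefined")
    rho_sum = 0.0
    for lag in range(1, n):
        cov = sum((chain[i] - mean) * (chain[i + lag] - mean)
                  for i in range(n - lag)) / n
        rho = cov / var
        if rho <= 0:
            break  # treat the remaining lags as noise
        rho_sum += rho
    return n / (1 + 2 * rho_sum)


print(effective_sample_size([0, 1] * 50))       # negatively correlated: no penalty
print(effective_sample_size(list(range(100))))  # a steady drift: very low ESS
```

As a point of reference, a chain whose lag-k autocorrelation is 0.6^k has a theoretical ESS of one quarter of its length, i.e. 250 from 1000 iterations.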

  27. References • Drummond, Suchard, Xie, Rambaut (2012) Molecular Biology and Evolution 29: 1969 • Hastings (1970) Biometrika 57: 97 • Kimura (1955) Cold Spring Harbor Symposium on Quantitative Biology 20: 33 • Kimura (1983) The Neutral Theory of Molecular Evolution • Kingman (1982) Journal of Applied Probability 19: 27 • Slatkin and Hudson (1991) Genetics 129: 555 • Ronquist, Huelsenbeck (2003) Bioinformatics 19: 1572
