Probabilistic methods for phylogenetic trees (Part 2)

Probabilistic methods for phylogenetic trees (Part 2). Sushmita Roy BMI/CS 576 www.biostat.wisc.edu/bmi576/ sroy@biostat.wisc.edu Oct 7th, 2014. RECAP. Probabilistic methods for phylogenetic tree construction P(data|tree) Maximum likelihood


Presentation Transcript


  1. Probabilistic methods for phylogenetic trees (Part 2) Sushmita Roy BMI/CS 576 www.biostat.wisc.edu/bmi576/ sroy@biostat.wisc.edu Oct 7th, 2014

  2. RECAP • Probabilistic methods for phylogenetic tree construction • P(data|tree) • Maximum likelihood • Felsenstein algorithm for computing the likelihood of a sequence given a tree

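  3. Probabilistic models of evolution • The probability of a character switching from a to b along a branch of length t, P(b|a,t), is captured by the substitution matrix S(t) • For example, for DNA this is a 4-by-4 matrix with one entry for each ordered pair of bases (matrix not preserved in this transcript)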

  4. Defining the conditional probability distributions • If we consider t to be evolutionary time, these conditional probabilities can be obtained from what is called a continuous-time Markov process • Such processes are defined by a K-by-K rate matrix R • Each entry of R, R(a,b), gives the rate of substitution from a to b • The time spent in any state (character) is exponentially distributed • If we have R, S(t) can be obtained from R • Using the theory of continuous-time Markov processes
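
One standard way to obtain S(t) from R is the matrix exponential, S(t) = exp(R t). The sketch below approximates it with a truncated Taylor series in plain Python. The function names, the number of series terms, and the example rate values are assumptions made for illustration; they are not taken from the slides.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_matrix(R, t, terms=60):
    """Approximate S(t) = exp(R * t) as the sum of (R t)^k / k! for k < terms."""
    n = len(R)
    Rt = [[R[i][j] * t for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]  # current series term, starts at identity
    for k in range(1, terms):
        term = mat_mul(term, Rt)
        term = [[x / k for x in row] for row in term]  # now (R t)^k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result


# Hypothetical DNA rate matrix: every off-diagonal rate is 0.1,
# and each diagonal entry makes its row sum to zero.
R = [[-0.3 if i == j else 0.1 for j in range(4)] for i in range(4)]
S = transition_matrix(R, 2.0)
for row in S:
    print([round(x, 4) for x in row])
```

Each row of S(t) is a probability distribution over the K characters, so the rows should sum to 1. A plain Taylor series is adequate for small rates and branch lengths; numerical libraries use more robust methods for large ones.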

  5. Rate matrices • A rate matrix R • Is a K-by-K matrix where K is the size of our alphabet • E.g. for DNA K=4 • Different rate matrices make different assumptions about substitutions • Jukes Cantor: all substitutions have the same rate. • Kimura: transitions (A<->G, C<->T) and transversions (A<->C, A<->T, G<->C, G<->T) have different rates. • Hasegawa, Kishino, Yano (HKY): transitions and transversions have different rates and base frequencies may be unequal, so the rate also depends on the target base.
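
To make the difference between the first two models concrete, here is a small sketch that builds Jukes Cantor and Kimura rate matrices. The parameter names alpha and beta, the base order, and the function names are assumptions for illustration; the slides do not fix a parameterization.

```python
BASES = "ACGT"
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def jukes_cantor_rate(alpha):
    """Every substitution has rate alpha; the diagonal makes rows sum to 0."""
    return [[-3 * alpha if a == b else alpha for b in BASES] for a in BASES]


def kimura_rate(alpha, beta):
    """Transitions have rate alpha, transversions have rate beta."""
    R = []
    for a in BASES:
        row = []
        for b in BASES:
            if a == b:
                row.append(-(alpha + 2 * beta))  # one transition, two transversions
            elif (a, b) in TRANSITIONS:
                row.append(alpha)
            else:
                row.append(beta)
        R.append(row)
    return R


for row in kimura_rate(alpha=0.2, beta=0.05):
    print(row)
```

Setting alpha equal to beta in the Kimura matrix recovers the Jukes Cantor matrix, which shows that the simpler model is a special case of the richer one.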

  6. Jukes Cantor Rate matrix • Simplest possible rate matrix for DNA sequence evolution • Assumes all bases change at the same rate • [Rate matrix with rows and columns indexed by A, T, G, C; every off-diagonal entry is the same rate]

  7. Conditional probabilities from Jukes Cantor • The conditional probability matrix, P(a|b,t), has a form similar to that of the rate matrix • [Matrix with rows and columns indexed by A, T, G, C; P(G|C,t) is one of its off-diagonal entries] • Equilibrium distribution: ¼ for all bases
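
Under Jukes Cantor the conditional probabilities have a well-known closed form with only two distinct values: one for staying at the same base and one for ending at any other base. The sketch below writes the common substitution rate as alpha, which is an assumed symbol, and uses illustrative function names.

```python
import math

BASES = "ATGC"


def jc_prob(a, b, t, alpha=1.0):
    """P(a | b, t) under Jukes Cantor with per-substitution rate alpha."""
    decay = math.exp(-4.0 * alpha * t)
    if a == b:
        return 0.25 * (1.0 + 3.0 * decay)  # base unchanged after time t
    return 0.25 * (1.0 - decay)            # any one of the three other bases


def jc_matrix(t, alpha=1.0):
    """Full 4-by-4 matrix; row b holds P(a | b, t) for every a."""
    return [[jc_prob(a, b, t, alpha) for a in BASES] for b in BASES]


for t in (0.0, 0.1, 1.0, 10.0):
    print(t, round(jc_prob("G", "C", t), 4), round(jc_prob("C", "C", t), 4))
```

At t = 0 the matrix is the identity, and as t grows every entry approaches ¼, which is the equilibrium distribution stated on the slide.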

  8. Searching phylogenetic tree space with maximum likelihood • As in the maximum parsimony case, we need to • Score a tree • Search over the space of possible trees • Score a given tree • Branch lengths are parameters • Estimate the branch lengths to maximize the likelihood of the data given the tree • Search over trees • Start with an initial tree, built for example by • A greedy approach of adding the branch that maximizes the likelihood • Neighbor Joining • Revise the tree using nearest neighbor interchange or subtree pruning and regrafting until convergence
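
The scoring step can be sketched as follows: compute the likelihood of an alignment on a fixed topology with Felsenstein's algorithm, then pick the branch length that maximizes it. This is a deliberately crude illustration. It assumes a Jukes Cantor model with rate 1, a uniform root distribution, a single branch length shared by all branches, and a grid search instead of a proper optimizer. The tree encoding and the toy alignment are hypothetical.

```python
import math

BASES = "ACGT"


def jc_prob(a, b, t, alpha=1.0):
    """P(a | b, t) under Jukes Cantor (assumed model for this sketch)."""
    decay = math.exp(-4.0 * alpha * t)
    return 0.25 * (1 + 3 * decay) if a == b else 0.25 * (1 - decay)


def conditional(node, column):
    """Return {base: P(leaves below node | node has this base)}.

    A leaf is a sequence name; an internal node is a tuple
    (left_child, left_branch_length, right_child, right_branch_length).
    """
    if isinstance(node, str):
        return {a: 1.0 if column[node] == a else 0.0 for a in BASES}
    left, t_left, right, t_right = node
    cond_left = conditional(left, column)
    cond_right = conditional(right, column)
    out = {}
    for a in BASES:
        p_left = sum(jc_prob(b, a, t_left) * cond_left[b] for b in BASES)
        p_right = sum(jc_prob(c, a, t_right) * cond_right[c] for c in BASES)
        out[a] = p_left * p_right
    return out


def log_likelihood(tree, alignment):
    """Sum of per-site log likelihoods, assuming independent sites."""
    n_sites = len(next(iter(alignment.values())))
    total = 0.0
    for u in range(n_sites):
        column = {name: seq[u] for name, seq in alignment.items()}
        cond = conditional(tree, column)
        total += math.log(sum(0.25 * cond[a] for a in BASES))
    return total


def make_tree(t):
    """Topology ((x1, x2), x3) with every branch set to length t."""
    return (("x1", t, "x2", t), t, "x3", t)


# Toy alignment, invented for illustration.
alignment = {"x1": "ACGTAC", "x2": "ACGTAA", "x3": "ACCTAA"}
grid = [0.01 * k for k in range(1, 101)]
best_t = max(grid, key=lambda t: log_likelihood(make_tree(t), alignment))
print(best_t, log_likelihood(make_tree(best_t), alignment))
```

In practice each branch has its own length and the maximization uses numerical optimization, but the structure is the same: an inner likelihood computation wrapped in a parameter search, which is in turn wrapped in a search over topologies.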

  9. Some advantages of probabilistic approaches • Probabilistic models can be naturally extended to more realistic models • Model site-specific parameters • Model gaps • A probabilistic framework can be used to evaluate different models of varying complexity (more parameters) • Different evolutionary models • Easily combined with other probabilistic models • Hidden Markov models

  10. Modeling site-specific parameters • Recall we had assumed that the substitution probabilities at each site are the same • This could be relaxed by introducing an additional parameter per site u, r_u
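
One common way to use a per-site parameter, shown here as an assumption rather than as the slide's exact formulation, is to let r_u rescale every branch length at site u. The symbols x^u (the column of characters at site u), T (the tree), t (the branch lengths), and alpha (the Jukes Cantor rate) are introduced for this illustration.

```latex
% Site-specific rate r_u stretches or shrinks time at site u
P(a \mid b, t, r_u) = P(a \mid b, r_u t)

% Example under Jukes Cantor with rate \alpha
P(a \mid a, r_u t) = \tfrac{1}{4}\left(1 + 3 e^{-4 \alpha r_u t}\right),
\qquad
P(a \mid b, r_u t) = \tfrac{1}{4}\left(1 - e^{-4 \alpha r_u t}\right) \quad (a \neq b)

% Sites remain independent given their rates
P(x^{1}, \dots, x^{N} \mid T, t, r_1, \dots, r_N)
  = \prod_{u=1}^{N} P(x^{u} \mid T, r_u t)
```

A site with r_u > 1 evolves faster than average and a site with r_u < 1 evolves more slowly, while the tree and its branch lengths stay shared across sites.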

  11. Probabilistic interpretation of Parsimony • Recall P(a|b,t) is the key quantity of interest • Replace P(a|b,t) by P(a|b) and use -log P(a|b) as the score • Applying the weighted parsimony algorithm with this score to get the minimal-cost tree gives an approximation to the likelihood • The one associated with the most likely assignment of the ancestral states
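
The connection can be sketched with the weighted parsimony (Sankoff) recursion using -log P(a|b) as the substitution cost. The time-free probabilities below (0.7 for no change, 0.1 for each change), the nested-tuple tree encoding, and the function names are hypothetical choices made for illustration.

```python
import math

BASES = "ACGT"


def substitution_cost(a, b, p_same=0.7):
    """Cost -log P(a | b) for a hypothetical model with no branch lengths."""
    p = p_same if a == b else (1.0 - p_same) / 3.0
    return -math.log(p)


def sankoff(node, column):
    """Return {base: minimal cost of the subtree if node has this base}.

    A leaf is a sequence name; an internal node is a tuple of children.
    """
    if isinstance(node, str):
        return {a: 0.0 if column[node] == a else math.inf for a in BASES}
    child_costs = [sankoff(child, column) for child in node]
    return {
        a: sum(min(substitution_cost(b, a) + cost[b] for b in BASES)
               for cost in child_costs)
        for a in BASES
    }


tree = (("x1", "x2"), "x3")
column = {"x1": "A", "x2": "A", "x3": "C"}
root_costs = sankoff(tree, column)
best_cost = min(root_costs.values())
print(best_cost, math.exp(-best_cost))
```

Because costs add where probabilities multiply, exp(-best_cost) is the probability of the substitutions along the single best assignment of ancestral states. The full likelihood sums over all ancestral assignments instead of taking the best one, which is why the parsimony score only approximates it.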

  12. Bootstrap: Assessing reliability of phylogenetic trees • Bootstrap: a computational strategy used to assess confidence in an estimated quantity • E.g. branch length • Tree branching topology • Generate a bunch of trees, {T1,…,TN}, from N random samples of the data • Sample columns/sites with replacement • Reconstruct a tree from sampled columns • One can estimate the confidence of any tree feature based on the proportion of times the feature is seen in a tree in {T1,…,TN}
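
The resampling procedure above can be sketched as follows. The tree builder here is a stand-in that only reports the closest pair of sequences by Hamming distance; it is a hypothetical placeholder for a real reconstruction method, and all function names and the toy alignment are invented for illustration.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Sample alignment columns with replacement, keeping the same length."""
    n_sites = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[c] for c in cols)
            for name, seq in alignment.items()}


def bootstrap_support(alignment, build_tree, has_feature,
                      n_replicates=100, seed=0):
    """Fraction of bootstrap trees in which the feature of interest appears."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_replicates):
        sample = bootstrap_alignment(alignment, rng)
        if has_feature(build_tree(sample)):
            hits += 1
    return hits / n_replicates


def closest_pair(alignment):
    """Stand-in 'tree builder': return the pair of names with fewest mismatches."""
    names = sorted(alignment)

    def hamming(x, y):
        return sum(a != b for a, b in zip(alignment[x], alignment[y]))

    pairs = [(x, y) for i, x in enumerate(names) for y in names[i + 1:]]
    return min(pairs, key=lambda p: hamming(*p))


# Toy alignment, invented for illustration.
alignment = {"x1": "ACGTACGTAC", "x2": "ACGTACGTAA", "x3": "ACCTACCTAA"}
support = bootstrap_support(alignment, closest_pair,
                            lambda pair: pair == ("x1", "x2"))
print(support)
```

With a real method, build_tree would return a full tree and has_feature would test, for example, whether a particular clade is present.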

  13. Example of bootstrap Ziheng Yang and Bruce Rannala, Nature Reviews Genetics 2012

  14. Some common phylogenetic tree construction algorithms • PhyML • Maximum likelihood, nearest neighbor interchange, subtree pruning and regrafting • RAxML (Randomized Axelerated Maximum Likelihood) • Exists in both sequential and parallel versions • Also does subtree pruning and regrafting • PHYLIP (from Felsenstein) • Package for distance-based, parsimony, and ML methods • BEAST (Bayesian) • MCMC-based sampling • MrBayes (Bayesian) • MCMC-based sampling • Visit here for more: http://evolution.genetics.washington.edu/phylip/software.html

  15. Comments about phylogenetic tree construction • Which method to pick? • Neighbor joining: fast, constructs the right tree if the distances are additive • Parsimony: does not make any assumptions about distances • Probabilistic: • More principled, provides a systematic framework to estimate model parameters • Enables us to quantify uncertainty in the model and evaluate different models of evolution • If ML distances are additive, NJ can construct the right tree • If branch lengths are ignored, weighted parsimony with -log P(a|b) costs approximates maximum likelihood • Search space may be large, but • can find the optimal tree efficiently in some cases • heuristic methods can be applied • Difficult to evaluate inferred phylogenies: ground truth not usually known • can look at agreement across different sources of evidence • can look at repeatability across subsamples of the data (bootstrap) • can look at indirect predictions, e.g. conservation of sites in proteins • Methods could be assessed using a simulation framework based on a probabilistic model of phylogenies • Phylogenies for bacteria and viruses are not so straightforward because of lateral transfer of genetic material
