Phylogeny

zola

Presentation Transcript


  1. Phylogeny

  2. Vocabulary of Phylogenetic Trees • A tree is a graph of edges and nodes that illustrates the evolutionary relationships among “Operational Taxonomic Units” (OTUs) • Topology refers to the branching pattern. http://www.ncbi.nlm.nih.gov/About/primer/phylo.html

  3. Rooting and Scaling – Same tree, different look? http://www.ncbi.nlm.nih.gov/About/primer/phylo.html

  4. Three different rooted trees consistent with a four-taxon unrooted tree. What is the total number of possible rooted trees consistent with this unrooted tree? http://www.ncbi.nlm.nih.gov/About/primer/phylo.html

  5. How many possible trees for n taxa?
      Number of Rooted Trees $= \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$
      Number of Unrooted Trees $= \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$
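
The two counts can be checked numerically. The following is a minimal Python sketch of the formulas on this slide; the function names are illustrative, and it assumes strictly bifurcating trees with labelled tips (at least 2 taxa for rooted trees, at least 3 for unrooted trees).

```python
from math import factorial


def rooted_trees(n: int) -> int:
    """Number of rooted bifurcating trees for n labelled taxa (n >= 2)."""
    if n < 2:
        raise ValueError("need at least 2 taxa")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def unrooted_trees(n: int) -> int:
    """Number of unrooted bifurcating trees for n labelled taxa (n >= 3)."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))
```

For n = 4 this gives 3 unrooted and 15 rooted trees. An unrooted tree with n tips has 2n - 3 branches, and a root can be placed on any of them, so each four-taxon unrooted tree corresponds to 5 rooted trees (3 x 5 = 15).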

  6. Phylogeny and Genomics • A species tree provides a framework for analyzing presence and absence of genes in genomes (or traits in organisms) • The species tree may be unknown • A genome is a (comprehensive) source of DNA and (predicted) protein sequences to use for phylogenetic reconstruction • Different regions of the genome may support different trees • Trees are useful for examining evolutionary history of gene families • Knowledge of the species tree affects interpretation of gene family trees.

  7. Knowing the relationship between strains and species provides a framework for interpretation. [Taxa shown: Pantoea stewartii, Erwinia carotovora, Salmonella enterica, Yersinia pestis]

  8. A reasonable guess based on the character “host type”. But is this a good choice if the goal is to reconstruct the “species tree”? Why might you choose to build your tree from molecular sequence data rather than phenotype, even if what you are really interested in is the evolution of host range? [Tree tips: Pantoea stewartii, Erwinia carotovora, Salmonella enterica, Yersinia pestis]

  9. Best tree from molecular phylogenetic analysis using multiple core metabolism proteins: the “true” species tree? Why choose to use multiple genes or proteins instead of one? Why choose core metabolism proteins? Why might it be a bad idea? [Tree tips, in order: Pantoea stewartii, Salmonella enterica, Erwinia carotovora, Yersinia pestis]

  10. Mapping the trait of interest (phenotypes, presence/absence of genes) onto the “true” species tree. Trait/gene of interest: signaling system.
      +  Pantoea stewartii
      -  Salmonella enterica
      +  Erwinia carotovora
      -  Yersinia pestis

  11. From Multiple Alignment to Phylogeny
      Ggaccttcgggcctcacgccatcggatgaacccagatgggattagctagtaggtgaggta  Organism A
      ............................................................  Organism B
      ............................tg.................g.......g....  Organism C
      .................a.....................................g....  Organism D
      .......a...............................................g....  Organism E
      [Figure: tree with tips Organism A, Organism B, Organism C, Organism D, Organism E]

  12. Four Approaches to Tree Reconstruction
      • Distance Methods (MEGA, PAUP, Phylip)
          • Estimate a distance matrix
          • Infer topology and branch lengths
      • Maximum Parsimony (PAUP)
          • Sift through all possible trees to find “the one” that requires the smallest number of evolutionary events
      • Maximum Likelihood (PAUP)
          • Find the tree most likely to have generated the sequence data
      • Bayesian (MrBayes)
          • Produce a probability distribution for all (or a well sampled subset) possible trees using MCMC to explore tree space

  13. Distance: in its simplest form, a count of the differences between two sequences
      Ggaccttcgggcctcacgccatcggatgaacccagatgggattagctagtaggtgaggta  Organism A
      ............................................................  Organism B
      ............................tg.................g.......g....  Organism C
      .................a.....................................g....  Organism D
      .......a...............................................g....  Organism E

                   A   B   C   D   E
      Organism A   -   0   4   2   2
      Organism B       -   4   2   2
      Organism C           -   4   4
      Organism D               -   2
      Organism E                   -

      Use an evolutionary model to correct the distance matrix for unobserved changes.
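
The difference counts in the matrix can be reproduced directly from the alignment. This is a minimal Python sketch, not the method of any particular package: the alignment is written in 10-column chunks for readability, a dot is taken to mean "same base as Organism A", and the helper names (`expand`, `count_differences`, `distance_matrix`) are illustrative.

```python
from itertools import combinations

# Alignment from the slide in 10-column chunks (adjacent string literals
# are concatenated). A dot means "same base as Organism A at this column".
ALIGNMENT = {
    "A": "Ggaccttcgg" "gcctcacgcc" "atcggatgaa" "cccagatggg" "attagctagt" "aggtgaggta",
    "B": ".........." * 6,
    "C": ".........." ".........." "........tg" ".........." ".......g.." ".....g....",
    "D": ".........." ".......a.." ".........." ".........." ".........." ".....g....",
    "E": ".......a.." ".........." ".........." ".........." ".........." ".....g....",
}


def expand(alignment, reference="A"):
    """Replace every dot with the reference base at that column (lower case)."""
    ref = alignment[reference].lower()
    return {
        name: "".join(r if ch == "." else ch for ch, r in zip(row.lower(), ref))
        for name, row in alignment.items()
    }


def count_differences(seq1, seq2):
    """Number of aligned columns at which the two sequences differ."""
    return sum(a != b for a, b in zip(seq1, seq2))


def distance_matrix(alignment):
    """Map each pair of organism names to its raw difference count."""
    seqs = expand(alignment)
    return {
        (x, y): count_differences(seqs[x], seqs[y])
        for x, y in combinations(sorted(seqs), 2)
    }
```

Calling `distance_matrix(ALIGNMENT)` returns the upper triangle of the matrix above, for example 0 for (A, B) and 4 for (C, D). These are uncorrected counts; the model-based correction mentioned on the slide is a separate step.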

  14. Five Models for Nucleotide Substitution (there are others)
      • Jukes and Cantor, 1969: all substitutions are equally likely
      • Kimura, 1980: transitions are more likely than transversions
      • Tamura: transitions are more likely than transversions, and GC content does not equal AT content
      • Tamura and Nei: transitions are more likely than transversions, GC content does not equal AT content, and there is a rate difference between G-A and T-C transitions
      • Unrestricted: every substitution has its own rate; there is no discernible relationship between rates

  15. Models of Nucleotide Substitution. An element e_ij of the matrix stands for the substitution rate from the nucleotide in the ith row to the nucleotide in the jth column.

      Jukes-Cantor (a single rate a)
          A   T   C   G
      A   -   a   a   a
      T   a   -   a   a
      C   a   a   -   a
      G   a   a   a   -

      Kimura (a = transition rate, b = transversion rate)
          A   T   C   G
      A   -   b   b   a
      T   b   -   a   b
      C   b   a   -   b
      G   a   b   b   -
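
These two rate matrices lead to the standard closed-form distance corrections. The sketch below is a minimal Python illustration under stated assumptions: sequences contain only a, c, g and t, `p` is the proportion of differing sites, `P` and `Q` are the proportions of transition and transversion differences, and the function names are illustrative.

```python
import math

PURINES = {"a", "g"}
PYRIMIDINES = {"c", "t"}


def proportions(seq1, seq2):
    """Return (P, Q): proportions of transition and transversion differences."""
    pairs = [(a.lower(), b.lower()) for a, b in zip(seq1, seq2)]
    transitions = transversions = 0
    for a, b in pairs:
        if a == b:
            continue
        # A transition stays within purines (a, g) or within pyrimidines (c, t).
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    return transitions / len(pairs), transversions / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor distance from the proportion p of differing sites."""
    if p >= 0.75:
        raise ValueError("p >= 0.75: sequences are saturated under this model")
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p(P, Q):
    """Kimura two-parameter distance from transition (P) and transversion (Q) proportions."""
    return -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)
```

The corrected distance is always at least as large as the raw proportion, because some changes are hidden by later changes at the same site. When transversions are exactly twice as frequent as transitions (Q = 2P), the Kimura distance reduces to the Jukes-Cantor distance for p = P + Q.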

  16. Infer topology and branch lengths from the matrix using an algorithm like UPGMA. UPGMA (Unweighted Pair Group Method with Arithmetic mean) is a simple method that is also used for microarray clustering. It assumes constant rates of evolution among different lineages, so there is a linear relationship between distance and time.

                   A   B   C   D   E
      Organism A   -   0   4   2   2
      Organism B       -   4   2   2
      Organism C           -   4   4
      Organism D               -   2
      Organism E                   -

  17. UPGMA Step 1 - Cluster the Operational Taxonomic Units (OTUs) with the smallest distance, with branch length = d/2

                   A   B   C   D   E
      Organism A   -   0   4   2   2
      Organism B       -   4   2   2
      Organism C           -   4   4
      Organism D               -   2
      Organism E                   -

      [Figure: Organism A and Organism B joined, drawn against a time axis]

  18. UPGMA Step 2 - Collapse the distance matrix to reflect distance from the AB group by taking the average of the distance from A to all others and from B to all others

                   A   B   C   D   E
      Organism A   -   0   4   2   2
      Organism B       -   4   2   2
      Organism C           -   4   4
      Organism D               -   2
      Organism E                   -

                   AB  C   D   E
      Group AB     -   4   2   2
      Organism C       -   4   4
      Organism D           -   2
      Organism E               -

  19. UPGMA Step 3 - Repeat Step 1 with the collapsed distance matrix • Step 1 - Cluster OTUs with the smallest distance, with branch length = d/2

                   AB  C   D   E
      Group AB     -   4   2   2
      Organism C       -   4   4
      Organism D           -   2
      Organism E               -

      [Figure: Organism A and Organism B joined; Organism D and Organism E joined with branch length 1; time axis]

  20. UPGMA Step 4 - Continue to collapse and join until all taxa are added

                   AB  C   DE
      Group AB     -   4   2
      Organism C       -   4
      Group DE             -

                   ABDE  C
      Group ABDE   -     4
      Organism C         -

      [Figure: final tree with groups (Organism A, Organism B) and (Organism D, Organism E) joined first and Organism C joined last, drawn against a time axis]
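
The four UPGMA steps can be sketched compactly. This is a minimal Python illustration under stated assumptions, not the implementation used by any package named in these slides: the function name and data layout are illustrative, ties are broken by taking the first closest pair found, and the collapse step weights each side by its number of OTUs so that every original OTU counts equally in the average.

```python
from itertools import combinations


def upgma(labels, pair_distances):
    """Cluster `labels` by UPGMA.

    `pair_distances` maps (x, y) label pairs to distances.
    Returns a list of merges (left, right, height), with height = d / 2.
    """
    # Each cluster is a tuple of the original labels it contains.
    clusters = [(name,) for name in labels]
    dist = {frozenset([(x,), (y,)]): d for (x, y), d in pair_distances.items()}
    merges = []
    while len(clusters) > 1:
        # Step 1: find the closest pair of clusters (first found wins ties).
        left, right = min(
            combinations(clusters, 2), key=lambda pair: dist[frozenset(pair)]
        )
        merges.append((left, right, dist[frozenset((left, right))] / 2))
        merged = left + right
        # Step 2: collapse the matrix with a size-weighted average.
        clusters = [c for c in clusters if c not in (left, right)]
        for other in clusters:
            d_left = dist[frozenset((left, other))]
            d_right = dist[frozenset((right, other))]
            dist[frozenset((merged, other))] = (
                len(left) * d_left + len(right) * d_right
            ) / len(merged)
        clusters.append(merged)
    return merges


# Distance matrix from the slides (upper triangle).
SLIDE_DISTANCES = {
    ("A", "B"): 0, ("A", "C"): 4, ("A", "D"): 2, ("A", "E"): 2,
    ("B", "C"): 4, ("B", "D"): 2, ("B", "E"): 2,
    ("C", "D"): 4, ("C", "E"): 4, ("D", "E"): 2,
}
```

With the slide matrix, `upgma(list("ABCDE"), SLIDE_DISTANCES)` joins A with B at height 0, D with E at height 1, AB with DE at height 1, and finally C at height 2. After the first join, the pairs (AB, D), (AB, E) and (D, E) are all tied at distance 2, so a different tie-breaking rule would give an equally valid tree with the same heights.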

  21. Alternative to UPGMA that does not assume a constant evolutionary rate. Neighbor-joining takes a step-wise approach similar to UPGMA, but at every step it joins the pair of nodes that minimizes the total branch length of the tree (minimum evolution). Because it is a greedy algorithm, it is not guaranteed to find the overall optimal (minimal total branch length) tree. Distance methods are fast and scale well to large numbers of taxa.
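
A single neighbor-joining step can be sketched as follows. This is a minimal Python illustration using the Q criterion as it is commonly presented for neighbor-joining; the function name, the data layout and the example matrix are assumptions made for illustration and do not come from the slides.

```python
from itertools import combinations


def nj_step(labels, dist):
    """One neighbor-joining step for more than two taxa.

    `dist` maps frozenset({x, y}) to a distance. Returns the pair to join
    and the branch length from each member to the new internal node.
    """
    n = len(labels)
    # Net divergence: summed distance from each taxon to all the others.
    r = {x: sum(dist[frozenset((x, y))] for y in labels if y != x) for x in labels}

    def q(pair):
        x, y = pair
        return (n - 2) * dist[frozenset(pair)] - r[x] - r[y]

    # Join the pair with the smallest Q value, not the smallest raw distance.
    x, y = min(combinations(labels, 2), key=q)
    d_xy = dist[frozenset((x, y))]
    len_x = d_xy / 2 + (r[x] - r[y]) / (2 * (n - 2))
    return (x, y), (len_x, d_xy - len_x)


# Made-up five-taxon matrix in which lineages evolve at unequal rates.
EXAMPLE = {
    frozenset(pair): d
    for pair, d in {
        ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
        ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
        ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
    }.items()
}
```

For this made-up matrix, `nj_step(list("abcde"), EXAMPLE)` joins a and b with unequal branch lengths, even though d and e have the smallest raw distance and would be joined first by UPGMA. A full neighbor-joining run repeats this step on a reduced matrix until the tree is resolved.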

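  22. Maximum Parsimony - Sift through all possible trees to find “the one” that requires the smallest number of evolutionary events. With so many trees, it is often necessary to use a heuristic search that examines only a subset of all possible trees (for example, TBR branch swapping); branch and bound is an exact alternative that discards groups of trees that cannot beat the best tree found so far.
      Rooted trees: $\frac{(2n-3)!}{2^{n-2}\,(n-2)!}$
      Unrooted trees: $\frac{(2n-5)!}{2^{n-3}\,(n-3)!}$
      [Figure: rooted tree of Organisms A, B, D, E and C drawn against a time axis]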

  23. Maximum Parsimony
      Ggaccttcgggcctcacgccatcggatgaacccagatgggattagctagtaggtgaggta  Organism A
      ............................................................  Organism B
      ............................tg.................g.......g....  Organism C
      .................a.....................................g....  Organism D
      .......a...............................................g....  Organism E
      [Figure: tree grouping Organism A (a) with Organism B (a), apart from Organism D (g), Organism E (g) and Organism C (g); internal nodes are labelled g and a single g -> a change on the branch leading to A and B explains the site: 1 event]

  24. Maximum Parsimony
      Ggaccttcgggcctcacgccatcggatgaacccagatgggattagctagtaggtgaggta  Organism A
      ............................................................  Organism B
      ............................tg.................g.......g....  Organism C
      .................a.....................................g....  Organism D
      .......a...............................................g....  Organism E
      [Figure: alternative tree with tips in the order Organism A (a), Organism E (g), Organism D (g), Organism B (a), Organism C (g); internal nodes are labelled a and three a -> g changes are drawn: 3 events]

  25. Maximum Parsimony • 1 event: right tree? • 3 events: wrong tree? [Figure: the trees from slides 23 and 24 side by side, each drawn against a time axis]
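
Counting events on a given tree can be automated with Fitch's small-parsimony algorithm, sketched here in Python for a single alignment column. The nested-tuple tree encoding and the function name are illustrative. The second topology is a hypothetical alternative that separates A from B; the exact shape of the alternative tree drawn on the slide is not recoverable from this transcript.

```python
def fitch(tree, states):
    """Return (possible_states, changes) for a binary nested-tuple tree.

    Leaves are organism names; an internal node is a pair (left, right).
    `states` maps each organism name to its base at one alignment column.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_states, left_changes = fitch(left, states)
    right_states, right_changes = fitch(right, states)
    common = left_states & right_states
    if common:
        # The children agree on at least one state: no extra change needed.
        return common, left_changes + right_changes
    # The children disagree: one change is required on this part of the tree.
    return left_states | right_states, left_changes + right_changes + 1


# The column where Organisms A and B carry 'a' and C, D and E carry 'g'.
SITE = {"A": "a", "B": "a", "C": "g", "D": "g", "E": "g"}

GROUPED = (("A", "B"), (("D", "E"), "C"))   # A and B together
SPLIT = (("A", "E"), (("D", "B"), "C"))     # hypothetical alternative topology
```

`fitch(GROUPED, SITE)[1]` is 1, matching the single event on the first tree. Fitch's algorithm reports the minimum number of changes a topology requires, so for `SPLIT` it returns 2, which can be lower than the number of events drawn in one particular reconstruction. Either way, the tree that keeps A and B together needs fewer changes and is preferred under parsimony.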

  27. Maximum Likelihood Methods • Given an evolutionary model, evaluate all possible tree topologies and calculate the probability of generating the observed data. • Choose the tree with the highest probability (generally expressed as the log likelihood) • Computationally intensive and sensitive to model selection
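
The probability of the data given a tree can be computed site by site with Felsenstein's pruning algorithm. The sketch below is a minimal Python illustration under the Jukes-Cantor model; the tree encoding, the function names and the fixed branch length of 0.1 are assumptions made for illustration. A real maximum likelihood analysis would also optimize the branch lengths and search over many more topologies.

```python
import math

BASES = "acgt"


def jc_prob(i, j, t):
    """Jukes-Cantor probability that base i becomes base j over branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e


def with_lengths(topology, t):
    """Attach the same branch length t to every branch of a nested-tuple topology."""
    if isinstance(topology, str):
        return topology
    return tuple((with_lengths(child, t), t) for child in topology)


def partial(node, site):
    """Conditional likelihood of the data below `node` for each possible base."""
    if isinstance(node, str):
        return {b: 1.0 if site[node] == b else 0.0 for b in BASES}
    result = {b: 1.0 for b in BASES}
    for child, length in node:
        below = partial(child, site)
        for b in BASES:
            result[b] *= sum(jc_prob(b, c, length) * below[c] for c in BASES)
    return result


def log_likelihood(tree, columns):
    """Log likelihood of a list of alignment columns, with equal base frequencies."""
    total = 0.0
    for site in columns:
        root = partial(tree, site)
        total += math.log(sum(0.25 * root[b] for b in BASES))
    return total
```

As a usage example, take the column where Organisms A and B carry `a` and C, D and E carry `g`. With every branch set to 0.1, the topology `(("A", "B"), (("D", "E"), "C"))` has a higher log likelihood than a topology that separates A from B, such as `(("A", "E"), (("D", "B"), "C"))`.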

  28. Bootstrapping: a method of testing the reliability of the tree. [Figure: the five-organism alignment from slide 11 shown many times, one copy per resampled dataset, next to a tree of Organisms A, B, C, D and E with branch support values of 100%, 50% and 100%]

  29. Bootstrap to Assess Confidence in Branches
      ggaccttcgggcctcacgccatcggatgaacccagatgggattagctagtaggtgaggta  Organism A
      ............................................................  Organism B
      ............................tg.................g.......g....  Organism C
      .................a.....................................g....  Organism D
      .......a...............................................g....  Organism E
      Resample with replacement to produce 1000 alignments of the same size.
      [Figure: example columns drawn at random from the alignment to build one resampled alignment]
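
Resampling with replacement operates on whole alignment columns, so each organism keeps the bases that belong together. This is a minimal Python sketch; the function names and the `seed` parameter are illustrative, and the alignment is assumed to be a dictionary of equal-length strings.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Draw columns with replacement to build one alignment of the same length."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    # The same column indices are used for every organism.
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


def replicates(alignment, n, seed=0):
    """Build n bootstrap alignments using a reproducible random generator."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n)]
```

Calling `replicates(alignment, 1000)` would give the 1000 pseudo-datasets mentioned on the slide; a tree is then built from each one with the same method used for the original alignment.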

  30. Many different Alignments. What percentage of the datasets support each branch? [Figure: the five-organism alignment from slide 11 shown many times, one copy per resampled dataset, next to a tree of Organisms A, B, C, D and E with branch support values of 100%, 50% and 100%]
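
Once a tree has been built from every resampled alignment, branch support is the percentage of those trees that contain the same group of organisms. The following is a minimal Python sketch with illustrative function names; trees are assumed to be encoded as nested tuples of organism names, and the replicate trees in the usage note are hypothetical, not results from the slides.

```python
def clades(tree):
    """Return every group of tips defined by an internal branch of the tree."""
    found = set()

    def tips(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(tips(child) for child in node))
        found.add(members)
        return members

    # The group containing all tips is present in every tree, so drop it.
    found.discard(tips(tree))
    return found


def support(reference, replicate_trees):
    """Percentage of replicate trees containing each clade of the reference tree."""
    counts = {clade: 0 for clade in clades(reference)}
    for tree in replicate_trees:
        present = clades(tree)
        for clade in counts:
            if clade in present:
                counts[clade] += 1
    return {clade: 100.0 * c / len(replicate_trees) for clade, c in counts.items()}
```

For example, with the reference tree `(("A", "B"), (("D", "E"), "C"))` and two hypothetical replicate trees, one identical to the reference and one equal to `(("A", "B"), (("D", "C"), "E"))`, the group of A and B is supported at 100%, the group of D and E at 50%, and the group of C, D and E at 100%.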

  31. Bootstrapping and what it really tells us. The underlying rationale behind bootstrapping is to predict what would happen if more data were collected or small perturbations were made to the existing data. A bootstrap value is not the probability that a branch is placed correctly in the tree. (Holder, M., Lewis, P. 2003) [Figure: “More simulated data”, shown as repeated copies of the five-organism alignment]

  32. An example of incongruence between different genes in Lactobacillus genomes. Nicolas et al. BMC Evolutionary Biology 2007; 7:141. Analyzed 480 proteins: 3:2 ratio of genes supporting Ta vs. Tb, but Tc is almost never seen.