Phylogenetic inference or “How to recognize a tree from quite a long way away”

Phylogenetic inference or “How to recognize a tree from quite a long way away”. Slides are available on the course’s web page. Mikael Thollesson Evolutionary Biology Centre, Uppsala University. “Bioinformation in the cell”. RNA. polypeptide. DNA. mRNA. enzyme. protein. coenzym

Presentation Transcript


  1. Phylogenetic inference, or “How to recognize a tree from quite a long way away” • Slides are available on the course’s web page • Mikael Thollesson, Evolutionary Biology Centre, Uppsala University

  2. “Bioinformation in the cell” (diagram labels: DNA, RNA, mRNA, polypeptide, protein, enzyme, coenzyme; processes: transcription, splicing, translation, protein folding, activation)

  3. “Extended bioinformation” (diagram labels: original sense strand, original anti-sense strand, new sense strand, new anti-sense strand)

  4. Phylogeny from a Bioinformatic viewpoint • A phylogeny is the (event) history more or less exclusively shared by some kind of biological replicators • These replicators can in practice be, for example, • Species, populations, strains • Genomes, genes • Phylogenies can usually be modelled as trees; phylogeny and phylogenetic tree have thus become more or less synonymous, even though they are not • The objective of phylogenetic analysis is to infer this history and these events, usually resulting in a phylogenetic hypothesis in the form of a tree (together with cosmology the only science dealing with particular histories)

  5. Parallel substitutions and multiple substitutions at the same site create ambiguities about the hierarchy • We must make some a priori assumption of homology – for sequences, this is the same as doing a multiple alignment • Ordering the sequences hierarchically after shared evolutionary novelties, synapomorphies, produces a phylogenetic hypothesis (tree) • We cannot distinguish between novelties and ancestral states, just see the differences (figure: a set of 16-base sequences diverging over time, with substitutions shown in lower case and a deletion shown as “--”)

  6. Characters, character states and taxa (terminal units) • Characters: limbs, amnion • Lion: limbs yes, amnion yes • Bald eagle: limbs yes, amnion yes • Bullfrog: limbs yes, amnion no • Cod: limbs no, amnion no • White shark: limbs no, amnion no

  7. • 1. Gather data from the entities whose phylogeny we are interested in: Lineus geniculatus (Lg) TGGGCTGGGATGAAGGGAAGTATCGTGGGCCCGG; Micrura akkeshiensis (Ma) GGGGCTAGAATGAATGGGA-TAACGAGCCCCCGA; Myoisophagus versicolor (Mv) GGGGCTAGAATGAAAGAAA-GTTTGAGACCTCAT; Parvicirrus dubius (Pd) GGGACTGGAATGAAAGAAA-TTTTGAGGCCTTAA • 2. Select a criterion to evaluate how well each possible tree fits the observed data • 3. Find the tree that best fits the data and choose it to be the preferred hypothesis • 4. Evaluate the sampling variation in the data to see if you have enough support for your conclusion (figure: candidate trees for Lg, Ma, Mv and Pd with scores of 22, 26 and 27, and a support value of 95%)

  8. Why do phylogenetics? – Prediction • Prospective biomedical compounds from sponges (Porifera) • Treatment of microsporidia • Gauging biodiversity for conservation “Taxa are not related because of similarity, but similar due to relatedness”

  9. Why? – Sequence of evolutionary events • Why the oaks retain their leaves in contrast to other deciduous trees • Evolution of metabolic pathways • Tracing infection histories for viruses (figure label: Evergreens)

  10. Why? – (Ab)use of the comparative method • Correlation between ability to fly and being black and white • Species, populations, or genes (i.e., entities corresponding to replicators) are not independent samples/observations, since they have a more or less inclusively shared history

  11. Trees and terminology • Terminal nodes (external vertices) represent taxa or genes on which we have observations • Internal vertices represent inferred splitting events (may be interpreted as ancestral species or gene copies) • Unrooted vs. rooted trees • Rooting is normally done using a designated outgroup (figure: unrooted and rooted trees on terminals A–D, with labels for a node or vertex, a branch or edge (e1–e6), and a clade)

  12. Relatedness • X is defined to be more closely related to Y than to Z if and only if X shares a (more recent) history with Y that it does not share with Z (figure: alternative trees on terminals A, B, C and D)

  13. “The standard recipe” for phylogenetic inference • Collect your data • Select an optimality criterion (“Which tree is better?”) • Optional: do data transformations (“corrections”) • Select a search strategy and find the best hypothesis (according to the selected criterion) using this search method • Assess the variation in your data in some way • There are really only two big theoretical problems in phylogenetic inference… • The criterion and calculating the score • Finding the best tree

  14. Step 1 – Data collection • Any observation of inherited traits is in principle useful • Primary homology assessment – from traits to characters and character states; for sequence data this corresponds to alignment • Pair-wise differences (e.g., DNA-DNA hybridization, histocompatibility) can also be used, although with a limited set of criteria • Include one or several outgroups for rooting

  15. Step 2 – Optimality criteria, some selected

  16. Assumptions shared by (almost) all optimality criteria/methods • Characters are independent, and thus the order in the data matrix does not matter (special models exist for, e.g., rRNA and codons) • The substitution process is homogeneous over time/in the entire tree, although the overall rate can vary (special models do not make this assumption) • Substitution rates are the same for all characters (rate variation among characters can be accommodated easily in most methods)

  17. Parsimony optimality criterion • Given two trees, the one requiring the lowest number of character changes necessary to explain the observed character distribution is the better • The parsimony score for a tree is the minimum number of required changes • This score is frequently referred to as number of steps or tree length • The method can be modified using non-uniform weights • Character weights (positional weights) • Character state weights (transformational weights)

  18. Parsimony – an example • a: acgtatgga • b: acgggtgca • g: aacggtgga • d: aactgtgca • Total tree lengths for the three possible trees: 7, 8 and 8 (figure: one character mapped on each of the three trees, with states a: c, b: c, g: a, d: a)
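The three tree lengths on this slide can be checked with a small Fitch-style count. This is only a sketch: it assumes that each fused string on the slide is a one-letter taxon label (a, b, g, d) followed by a nine-base sequence, and the function names are hypothetical, not taken from the slides.

```python
# Assumed reading of the slide: label + 9-base sequence for each taxon.
ALIGNMENT = {
    "a": "acgtatgga",
    "b": "acgggtgca",
    "g": "aacggtgga",
    "d": "aactgtgca",
}

# The three possible unrooted four-taxon trees, written as nested pairs
# (rooted on the internal branch, which does not change the parsimony score).
TREES = {
    "((a,b),(g,d))": (("a", "b"), ("g", "d")),
    "((a,g),(b,d))": (("a", "g"), ("b", "d")),
    "((a,d),(b,g))": (("a", "d"), ("b", "g")),
}


def fitch_site(tree, states):
    """Return (possible states, minimum changes) for one site on a binary tree."""
    if isinstance(tree, str):  # a terminal: its observed state, no changes
        return {states[tree]}, 0
    left_set, left_changes = fitch_site(tree[0], states)
    right_set, right_changes = fitch_site(tree[1], states)
    shared = left_set & right_set
    if shared:  # children can agree: no extra change needed here
        return shared, left_changes + right_changes
    return left_set | right_set, left_changes + right_changes + 1


def tree_length(tree, alignment):
    """Sum the minimum number of changes over all sites (unit weights)."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        states = {taxon: seq[i] for taxon, seq in alignment.items()}
        total += fitch_site(tree, states)[1]
    return total


if __name__ == "__main__":
    for name, tree in TREES.items():
        print(name, tree_length(tree, ALIGNMENT))
```

Under the assumed reading of the data, the tree grouping a with b (and g with d) needs 7 steps and the other two need 8, which agrees with the lengths shown on the slide.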

  19. Using substitution models – Why? • Jukes-Cantor is the simplest model in the class of general time-reversible (GTR) models for DNA • GTR (the most complex symmetric model) has six different rates (one for each pair of bases) and different base frequencies • Example: Jukes-Cantor model, with $Q_{ij} = \alpha$ if $i \neq j$ and $Q_{ij} = -3\alpha$ if $i = j$, and $P(t) = e^{Qt}$ (figure: observed differences plotted against actual changes; substitution diagram for A, G, C and T)
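As a rough illustration of P(t) = e^{Qt}, the sketch below builds a Jukes-Cantor rate matrix and exponentiates it with a plain Taylor series, then compares the result with the usual closed form. It assumes the parameterisation in which α is the rate of each specific base-to-base change (so each diagonal entry of Q is -3α); the helper names are hypothetical.

```python
import math


def jc_rate_matrix(alpha):
    """Jukes-Cantor Q: rate alpha for every change, -3*alpha on the diagonal."""
    return [[-3.0 * alpha if i == j else alpha for j in range(4)] for i in range(4)]


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(q, t, terms=40):
    """Approximate exp(Q*t) by summing the first `terms` Taylor terms."""
    n = len(q)
    qt = [[q[i][j] * t for j in range(n)] for i in range(n)]
    term = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    total = [row[:] for row in term]
    for k in range(1, terms):
        # term_k = term_(k-1) * (Q*t) / k
        term = [[x / k for x in row] for row in mat_mul(term, qt)]
        total = [[total[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return total


def jc_closed_form(alpha, t):
    """Closed-form Jukes-Cantor transition probabilities."""
    e = math.exp(-4.0 * alpha * t)
    same = 0.25 + 0.75 * e   # probability of observing the same base
    diff = 0.25 - 0.25 * e   # probability of one specific different base
    return [[same if i == j else diff for j in range(4)] for i in range(4)]


if __name__ == "__main__":
    numeric = mat_exp(jc_rate_matrix(1.0), 0.1)
    exact = jc_closed_form(1.0, 0.1)
    print(numeric[0][0], exact[0][0])
    print(numeric[0][1], exact[0][1])
```

The series is adequate for the short branch lengths used here; for long branches a library routine for the matrix exponential would be a safer choice.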

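  20. Pair-wise distances – an example • a: acgtatggac • b: acgggtgcac • g: aacggtggac • d: aactgtgcac • pgd = pdg = 2/10 = 0.2 (p distance) • Jukes-Cantor distance: $d = -\tfrac{3}{4}\ln\left(1 - \tfrac{4}{3}p\right)$, which gives about 0.233 for p = 0.2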
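The pair-wise distances in this example can be computed along the following lines. The sketch assumes that each fused string on the slide is a one-letter label followed by a ten-base sequence, and it uses the standard Jukes-Cantor correction d = -3/4 ln(1 - 4p/3); the function names are hypothetical.

```python
import math
from itertools import combinations

# Assumed reading of the slide: label + 10-base sequence for each taxon.
SEQUENCES = {
    "a": "acgtatggac",
    "b": "acgggtgcac",
    "g": "aacggtggac",
    "d": "aactgtgcac",
}


def p_distance(seq1, seq2):
    """Proportion of aligned sites at which the two sequences differ."""
    return sum(x != y for x, y in zip(seq1, seq2)) / len(seq1)


def jc_distance(p):
    """Jukes-Cantor corrected distance; only defined for p < 0.75."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    for t1, t2 in combinations(SEQUENCES, 2):
        p = p_distance(SEQUENCES[t1], SEQUENCES[t2])
        print(t1, t2, p, round(jc_distance(p), 4))
```

The corrected distance is always at least as large as the p distance, because it accounts for multiple substitutions at the same site.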

  21. Minimum evolution optimality criterion • Starts by calculating pair-wise distances between all terminal taxa/sequences • These calculations can incorporate explicit substitution models, e.g., Jukes-Cantor • Given two trees, the one having the lowest sum of branch lengths when fitted to the data is the better • One way to fit the data is using the constraints below, or using least squares approximation • No branch can have negative length, eij≥0 • The path between two terminals along the tree is at least as long as the pair-wise distance, eij≥dij • The score is commonly referred to as tree length (as for parsimony)

  22. Maximum likelihood optimality criterion • Given two trees, the one with the higher likelihood, i.e. the one with the higher conditional probability of observing the data, is the better • Site likelihood is the conditional probability of the data at one site (one character) given the assumed model of evolution and the parameters of the model • Data set likelihood is the product of the site likelihoods (character independence) • Likelihood values under different models are comparable, thus giving us a way to test the adequacy of the model • The model consists of • A substitution model, e.g. Jukes-Cantor • A tree with branch lengths

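  23. Likelihood of a one-branch tree • Taxon1: AC and Taxon2: CC, joined by a single branch of length αt • For Jukes-Cantor: $L_1 = \tfrac{1}{4}\left(\tfrac{1}{4} - \tfrac{1}{4}e^{-4\alpha t}\right)$ for the first site (A vs. C) and $L_2 = \tfrac{1}{4}\left(\tfrac{1}{4} + \tfrac{3}{4}e^{-4\alpha t}\right)$ for the second site (C vs. C) • Ltot = L1·L2, or log Ltot = log L1 + log L2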

  24. Another one-branch tree • 30 nucleotides from ψη-globin genes of two primates on a one-edge tree • Gorilla GAAGTCCTTGAGAAATAAACTGCACACTGG • Orangutan GGACTCCTTGAGAAATAAACTGCACACTGG • There are two differences (marked * in the figure) and 28 similarities • The likelihood is highest at αt = 0.02327, where lnL = -51.133956 (figure: lnL plotted against αt)
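The numbers on this slide can be reproduced with a few lines of code. The sketch below assumes the Jukes-Cantor model with equal base frequencies of 1/4 and with α as the rate of each specific base-to-base change, which is the parameterisation that appears to match the slide; the function names are hypothetical.

```python
import math

GORILLA = "GAAGTCCTTGAGAAATAAACTGCACACTGG"
ORANGUTAN = "GGACTCCTTGAGAAATAAACTGCACACTGG"


def jc_log_likelihood(seq1, seq2, alpha_t):
    """Log-likelihood of two aligned sequences joined by one branch (Jukes-Cantor)."""
    e = math.exp(-4.0 * alpha_t)
    p_same = 0.25 + 0.75 * e   # P(base unchanged along the branch)
    p_diff = 0.25 - 0.25 * e   # P(one specific different base)
    ln_l = 0.0
    for x, y in zip(seq1, seq2):
        # site likelihood = base frequency (1/4) * transition probability
        ln_l += math.log(0.25 * (p_same if x == y else p_diff))
    return ln_l


def ml_alpha_t(seq1, seq2):
    """Closed-form maximum-likelihood estimate of alpha*t for two sequences."""
    p = sum(x != y for x, y in zip(seq1, seq2)) / len(seq1)
    return -0.25 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    best = ml_alpha_t(GORILLA, ORANGUTAN)
    print(best, jc_log_likelihood(GORILLA, ORANGUTAN, best))
```

Summing log site likelihoods instead of multiplying the site likelihoods themselves avoids numerical underflow on longer alignments.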

  25. Likelihoods of a more interesting tree… • Bases at internal nodes are unknown (figure: a tree with terminal bases A, C, A and T, internal nodes u and v, and edges e1–e5)

  26. Step 3 – Finding the best tree • Number of (rooted) trees for n terminals is (2n-3)·(2n-5)·(2n-7)·…·3·1 • 3 taxa -> 3 trees • 4 taxa -> 15 trees • 10 taxa -> 34 459 425 trees • 25 taxa -> 1.19·10^30 trees • 52 taxa -> 2.75·10^80 trees • Finding the optimal tree is an NP-complete or NP-hard problem • Search strategies • Exact: will find the best tree (according to the selected criterion); exhaustive search up to ca 10 taxa, branch and bound up to ca 15 taxa • Heuristic: limits the search to a “reasonable” set of trees; may not find the optimal tree
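The tree counts listed on this slide follow directly from the product formula. A minimal sketch (the function name is hypothetical):

```python
def num_rooted_trees(n):
    """Number of rooted binary trees on n terminals: (2n-3)·(2n-5)·…·3·1."""
    count = 1
    for factor in range(3, 2 * n - 2, 2):  # factors 3, 5, ..., 2n-3
        count *= factor
    return count


if __name__ == "__main__":
    for n in (3, 4, 10, 25, 52):
        print(n, "taxa ->", f"{num_rooted_trees(n):.3g}", "trees")
```

Python integers have arbitrary precision, so the exact counts are available even for 52 taxa; the growth is what makes exhaustive search impractical beyond a handful of taxa.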

  27. Heuristic tree searches usually start with hill climbing (greedy algorithms) to obtain a starting tree • Star decomposition • Stepwise addition • and proceed with some flavour of branch swapping to improve on the starting tree and find better trees

  28. Heuristic tree search – Star decomposition (figure: a star tree on terminals A–E resolved step by step by joining pairs of terminals)

  29. Heuristic tree search – Stepwise addition (figure: terminals D and E added one at a time to a starting tree of A, B and C; candidate trees score 831, 837 and 783, then 914, 921, 915, 916 and 905)

  30. Heuristic tree search – Branch swapping (figure: SPR and TBR rearrangements on a tree with terminals A–I)

  31. Step 2+3 – A dirty shortcut to get a tree… • Instead of evaluating each tree, some methods build a tree using a specific algorithm, usually from pair-wise distances • Neighbor-joining (NJ) is one such method that is widely used • NJ can roughly be viewed as a star decomposition minimizing the sum of branch lengths (evolutionary change)

  32. What is a “good” method? • Efficiency – time to find a/the solution • Power – rate of convergence/how much data are needed • Consistency – convergence to the “correct” solution as data are added • Robustness – performance when assumptions are violated • Falsifiability – rejection of the model when it is inadequate

  33. Performance on simulated data (figure: frequency of correct inference plotted against sequence length; panel labels “0.30 and 0.05 respectively” and “All 0.50”)

  34. Some pros and cons of selected methods • Pair-wise, algorithmic approach (e.g., Neighbor-joining) + Fast + Models can be used when transforming to distances - Information is lost when transforming to pair-wise distances - One will get a tree, but no measure of goodness to compare with other hypotheses (when using algorithmic methods like NJ) • Parsimony + Philosophically appealing – Occam’s razor (no unnecessary assumptions) + Can be applied to most kinds of data without prior knowledge - Can be inconsistent - Can be computationally slow • Maximum likelihood + Model based; enables statistical tests and handles problems with multiple substitutions - Model based; models can be inadequate and give misleading results - Computationally veeeeery slooooowww

  35. Step 4 – Assessing the variation in the data • Variation cannot be assessed by repeated sampling from the statistical population – we have a unique sample • We have to rely on resampling from the data already at hand • Jack-knife – resampling without replacement • Bootstrap – resampling with replacement

  36. Bootstrap • Original data set with n characters • Original analysis, e.g. MP, ML, NJ • Draw n characters randomly with replacement; repeat m times to get m pseudo-replicates, each with n characters • Repeat the original analysis on each of the pseudo-replicate data sets • Evaluate the results from the m analyses • Bootstrap proportions between 0.5 and 1 can be interpreted as a measure of confidence or support • Values below 0.5 are nonsense (figure: trees for the taxa Aus, Beus, Ceus and Deus from the original data set and from the pseudo-replicates, summarised with 75% support for one grouping)
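The resampling step of the bootstrap can be sketched as follows. The example data reuse the four short sequences from the pair-wise distance slide (assuming each is a one-letter label plus ten bases), and `closest_pair` is only a toy stand-in for a real analysis such as MP, ML or NJ; all names are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations

# Assumed example data: label + 10-base sequence for each taxon.
ALIGNMENT = {
    "a": "acgtatggac",
    "b": "acgggtgcac",
    "g": "aacggtggac",
    "d": "aactgtgcac",
}


def bootstrap_replicates(alignment, m, seed=0):
    """Build m pseudo-replicates by drawing n columns with replacement."""
    rng = random.Random(seed)
    n = len(next(iter(alignment.values())))
    replicates = []
    for _ in range(m):
        columns = [rng.randrange(n) for _ in range(n)]  # same columns for all taxa
        replicates.append(
            {taxon: "".join(seq[i] for i in columns) for taxon, seq in alignment.items()}
        )
    return replicates


def closest_pair(alignment):
    """Toy analysis: the pair of taxa with the fewest differences."""
    def differences(pair):
        return sum(x != y for x, y in zip(alignment[pair[0]], alignment[pair[1]]))
    return min(combinations(sorted(alignment), 2), key=differences)


def bootstrap_proportions(alignment, analysis, m=200, seed=0):
    """Fraction of pseudo-replicates that give each analysis result."""
    counts = Counter(analysis(rep) for rep in bootstrap_replicates(alignment, m, seed))
    return {result: count / m for result, count in counts.items()}


if __name__ == "__main__":
    print(bootstrap_proportions(ALIGNMENT, closest_pair))
```

Whole columns are resampled, so every taxon keeps its own state for each drawn character; resampling within rows would destroy the character structure.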

  37. What can go wrong? • Sampling error (i.e., due to finite data) • Assessed by, for example, the bootstrap • Systematic error (inconsistent method) • Tests of the adequacy of the models used • Using different methods with different properties and comparing the results • Inadequate tree search (heuristics) • Reality • A tree may be a poor model of the real history • Information has been lost by subsequent evolutionary changes • “Species” vs. “gene” trees

  38. What is wrong with this tree? • Tree estimated by a consistent method • Negligible (within sequence) sampling error – high bootstrap values (figure: tree of Canis, Gadus and Mus with bootstrap values of 100)

  39. The expected tree… (figure labels: “Species” tree, gene duplication, “Gene” trees)

  40. Two copies (paralogs) present in the genomes (figure: gene tree with two copies each from Canis, Mus and Gadus; comparisons within a copy are orthologous, comparisons between copies are paralogous)

  41. What we have actually studied… • To detect a paralogy problem, several different genes can be used to infer the “species” phylogeny (figure: tree of Canis, Gadus and Mus)

  42. To conclude – • Phylogenetic inference deals with historical events and information transfer – the evolutionary history • Results from phylogenetic analyses are hypotheses for further testing; the true history will remain unknown • Inference is mathematically intricate and computationally heavy, and as a result methods for phylogenetic inference are legion. A good place to start looking for software is http://evolution.genetics.washington.edu/phylip/software.html • There are several pitfalls to avoid when doing the analyses and when interpreting them – and most of the problems are data dependent… • But… Phylogenies have great explanatory power (the only means we have to predict properties of organisms), and ignoring the shared histories can sometimes give completely bogus results in comparative studies