Advanced Methods in Reconstructing Phylogenetic Relationships

Join the practical course in Rio de Janeiro to learn the theory and practice of phylogenetic inference from molecular data. Explore methods, computer programmes, and critical analysis of data.

joysilva

Presentation Transcript


  1. Advanced Methods in Reconstructing Phylogenetic Relationships 2010 Practical Course: March 8th to 13th, 2010, Rio de Janeiro

  2. Darwin’s letter to Thomas Huxley 1857 • “The time will come I believe, though I shall not live to see it, when we shall have fairly true genealogical (phylogenetic) trees of each great kingdom of nature” [Figure: Haeckel’s pedigree of man]

  3. Aims of the course: • To introduce the theory and practice of phylogenetic inference from molecular data • To introduce some of the most useful methods and computer programmes • To encourage a critical attitude to data and its analysis

  4. Some definitions

  5. Richard Owen

  6. Owen’s definition of homology • Homologue: the same organ under every variety of form and function (true or essential correspondence) • Analogy: superficial or misleading similarity (Richard Owen 1843)

  7. Charles Darwin

  8. Darwin and homology • “The natural system is based upon descent with modification .. the characters that naturalists consider as showing true affinity (i.e. homologies) are those which have been inherited from a common parent, and, in so far as all true classification is genealogical; that community of descent is the common bond that naturalists have been seeking” Charles Darwin, Origin of species 1859 p. 413

  9. Homology is... • Homology: similarity that is the result of inheritance from a common ancestor - the identification and analysis of homologies is central to phylogenetic systematics

  10. Phylogenetic systematics • Sees homology as evidence of common ancestry • Uses tree diagrams to portray relationships based upon recency of common ancestry • Monophyletic groups (clades) - contain species which are more closely related to each other than to any outside of the group
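
The definition of a monophyletic group can be checked mechanically on a rooted tree. The following is a minimal sketch, assuming a tree stored as nested Python tuples with tip names as strings; the taxon names and the helper functions `leaves`, `clades` and `is_monophyletic` are hypothetical and introduced only for illustration.

```python
def leaves(node):
    """Return the set of tip names below a node (a tip is a plain string)."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))


def clades(node):
    """Yield the tip set of every clade (subtree) in the rooted tree."""
    yield leaves(node)
    if not isinstance(node, str):
        for child in node:
            yield from clades(child)


def is_monophyletic(tree, taxa):
    """A group is monophyletic if its members are exactly the tips of one clade."""
    return frozenset(taxa) in set(clades(tree))


# Hypothetical rooted tree: (eukaryotes + archaea) with a bacterial outgroup.
tree = ((("eukaryote_1", "eukaryote_2"), ("archaea_1", "archaea_2")),
        "bacteria_outgroup")

print(is_monophyletic(tree, {"eukaryote_1", "eukaryote_2"}))  # a clade
print(is_monophyletic(tree, {"eukaryote_1", "archaea_1"}))    # not a clade
```

The check depends on where the tree is rooted, which is why an outgroup is needed before monophyly can be assessed.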

  11. Cladograms and phylograms • Cladograms show branching order - branch lengths are meaningless • Phylograms show branch order and branch lengths [Figure: the same tree of Bacterium 1-3 and Eukaryote 1-4 drawn as a cladogram and as a phylogram]
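
One way to see the difference between the two diagram types is in Newick notation, where branch lengths appear as `:length` after each node. The sketch below assumes a small hypothetical phylogram (the branch lengths are invented for illustration) and strips the lengths to leave only the branching order; `to_cladogram` is a helper name introduced here, not part of any package.

```python
import re

# Hypothetical phylogram in Newick format; branch lengths are made up.
phylogram = ("((Bacterium_1:0.10,Bacterium_2:0.25):0.05,"
             "(Eukaryote_1:0.40,Eukaryote_2:0.30):0.20);")


def to_cladogram(newick: str) -> str:
    """Remove every ':length' annotation, keeping only the branching order."""
    return re.sub(r":[0-9.eE+-]+", "", newick)


cladogram = to_cladogram(phylogram)
print(cladogram)
```

The topology is identical in both strings; only the phylogram carries information about the amount of change along each branch.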

  12. Rooting using an outgroup [Figure: an unrooted tree of eukaryote and archaea sequences, and the same tree rooted by a bacteria outgroup, with the root and the monophyletic groups marked]

  13. What kind of data?

  14. Fossil skulls

  15. Family tree for humans

  16. Microbial morphologies - some are complex but many are simple - for example look at a drop of lake water:

  17. Linus Pauling

  18. Molecules as documents of evolutionary history • “We may ask the question where in the now living systems the greatest amount of information of their past history has survived and how it can be extracted” • “Best fit are the different types of macromolecules (sequences) which carry the genetic information”

  19. Small subunit ribosomal RNA 18S or 16S rRNA

  20. An alignment involves hypotheses of positional homology between bases or amino acids Alignment of 16S rRNA sequences from different bacteria

  21. Automated Progressive Alignment of Sequences • Essentially a heuristic method and as such is not guaranteed to find the ‘optimal’ alignment. • Most successful implementation is Clustal (Des Higgins). This software is cited 3,000 times per year in the scientific literature.

  22. Des Higgins is very famous

  23. Automatic alignment programs • A variety are available: Clustal W 2.0, Muscle and T-Coffee are among the most popular • All are easy to use and relatively quick (but this depends on how many sequences there are and how similar they are) • Output files are produced which can be read by most phylogenetic analysis programmes • Can fail badly with highly divergent sequences

  24. James McInerney is not here • But he has produced a nice lecture on some background issues for multiple alignment • This can be downloaded from the embo world 2009 directory on our lab webpage: • http://research.ncl.ac.uk/microbial_eukaryotes/index.html

  25. Advice on alignments • Treat cautiously • Can be improved by eye (usually) • Often helps to have colour-coding • Depending on the use, the user should be able to make a judgement on those regions that are reliable or not • For phylogeny reconstruction, only use those positions whose hypothesis of positional homology is unimpeachable (or do experiments)
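
As a rough illustration of restricting an analysis to selected alignment positions, the sketch below keeps only the columns that contain no gap in any sequence. This is an assumption-laden stand-in: a gap-free column is not the same thing as an unimpeachable hypothesis of positional homology, and the toy alignment and the function name `gap_free_columns` are hypothetical.

```python
def gap_free_columns(alignment, gap_chars="-?"):
    """Return a copy of the alignment keeping only columns without gap characters.

    `alignment` maps sequence names to aligned strings of equal length.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    columns = zip(*alignment.values())
    kept = [col for col in columns if not any(c in gap_chars for c in col)]
    return {
        name: "".join(col[i] for col in kept)
        for i, name in enumerate(alignment)
    }


# Toy alignment, invented for illustration.
alignment = {"seq_a": "ACG-TA", "seq_b": "AC-GTA", "seq_c": "ACGGTT"}
trimmed = gap_free_columns(alignment)
print(trimmed)
```

In practice the choice of reliable regions is a judgement made by eye or with dedicated tools; the point here is only that the excluded positions never reach the tree-building step.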

  26. Patterns in sequence data

  27. Exploring patterns in sequence data 1: • Which sequences should we use? • Do the sequences contain phylogenetic signal for the relationships of interest? (might be too conserved or too variable) • Are there features of the data which might mislead us about evolutionary relationships?

  28. Is there a molecular clock? • The idea of a molecular clock was initially suggested by Zuckerkandl and Pauling in 1962 • They noted that rates of amino acid replacements in animal haemoglobins were roughly proportional to time - as judged against the fossil record

  29. Rate Heterogeneity

  30. Rates of amino acid replacement in different proteins

  31. There is no universal molecular clock • The initial proposal saw the clock as a Poisson process with a constant rate • Now known to be more complex - differences in rates occur for: • different sites in a molecule • different genes • different regions of genomes • different genomes in the same cell • different taxonomic groups for the same gene • There is no universal molecular clock
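
As a hedged illustration of what a Poisson process with a constant rate implies: if substitutions accumulate at a constant rate λ per unit time, the number N of substitutions after time t follows the distribution below. The symbols λ, t and N are notation introduced here and do not appear in the slides.

```latex
P(N = k) = \frac{(\lambda t)^{k} e^{-\lambda t}}{k!}, \qquad
\mathbb{E}[N] = \operatorname{Var}[N] = \lambda t
```

The expected number of changes grows linearly with time, which is what would make such a clock usable for dating. The rate differences listed on this slide mean that a single λ does not hold across sites, genes, genomes or taxonomic groups.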

  32. Small subunit ribosomal RNA 18S or 16S rRNA

  33. Failure To Accommodate Rate Heterogeneity Can Lead To Problems When Making Trees

  34. Unequal rates in different lineages may cause problems for phylogenetic analysis • Felsenstein (1978) made a simple model phylogeny including four taxa and a mixture of short and long branches • All methods are susceptible to “long branch” problems • Methods which assume that all sites change at the same rate are particularly poor at recovering the true tree [Figure: true tree and wrong tree for taxa A, B, C and D, with branch lengths p and q, p > q]

  35. Chaperonin 60 Protein Maximum Likelihood Tree (PROTML, Roger et al. 1998, PNAS 95: 229) • Bootstrap values are a common way of assessing support for relationships [Figure: tree with the longest branches marked]
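
The resampling step behind bootstrap values can be sketched as follows: the nonparametric bootstrap draws alignment columns with replacement to build a pseudo-replicate of the same length as the original. The toy alignment and the function name `bootstrap_alignment` are hypothetical, and the tree inference that would be run on each replicate is not shown.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Build one pseudo-replicate by sampling columns with replacement."""
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[name]) != length for name in names):
        raise ValueError("all aligned sequences must have the same length")
    # The same sampled column indices are applied to every sequence,
    # so each resampled site stays a genuine alignment column.
    sampled = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(alignment[name][i] for i in sampled) for name in names}


# Toy alignment, invented for illustration.
alignment = {"taxon_a": "ACGTAC", "taxon_b": "ACGTTC", "taxon_c": "ATGTAC"}
replicate = bootstrap_alignment(alignment, random.Random(0))
print(replicate)
```

A tree is built from each of many replicates, and the bootstrap value for a group is the percentage of replicate trees in which that group appears.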

  36. High bootstrap values can be misleading - adding a single new sequence

  37. A proposal for three domains of life (Woese, Kandler and Wheelis 1990, PNAS 87, 4576)

  38. The 3-domains tree of life • Concatenated LSU+SSU rRNA analyzed using a standard (GTR plus gamma*2) model (Cox et al. 2008, PNAS) [Figure: tree with branches labelled eukaryotes, archaebacteria, eocyte archaebacteria and bacteria; the two longest branches are marked]

  39. The same RNA data analyzed using better models (Cox et al. 2008) • Models: NDCH (GTR+g+2cv)*2, heterogeneous across tree; CAT model [Figure: tree with branches labelled eukaryotes, eocytes, other archaebacteria and bacteria; support values 0.75 and 0.95]

  40. Saturation in sequence data: • Saturation is due to multiple changes at the same site subsequent to lineage splitting • Most data will contain some fast evolving sites which are potentially saturated (e.g. in protein-coding genes, often the third codon position) • In severe cases the data become essentially random and all information about relationships can be lost

  41. Multiple changes at a single site - hidden changes • Seq 1: AGCGAG • Seq 2: GCGGAC [Figure: a single site changing several times (C, G, T, A) along the lineages leading to Seq 1 and Seq 2, with the number of changes on each]
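
The effect of hidden changes can be illustrated by comparing the observed proportion of differing sites with a distance corrected for multiple hits. The sketch below uses the two sequences shown on this slide; the Jukes-Cantor correction is a simple model swapped in here for illustration and is not named in the slides, and the function names are hypothetical.

```python
import math


def p_distance(a, b):
    """Observed proportion of sites that differ between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def jukes_cantor(p):
    """Jukes-Cantor estimate of substitutions per site from a p-distance."""
    if p >= 0.75:
        raise ValueError("p >= 0.75: sequences are saturated under this model")
    return -0.75 * math.log(1 - 4 * p / 3)


seq1 = "AGCGAG"
seq2 = "GCGGAC"
p = p_distance(seq1, seq2)
d = jukes_cantor(p)
print(f"observed differences per site: {p:.3f}")
print(f"estimated changes per site:    {d:.3f}")
```

The corrected estimate is larger than the observed proportion because some sites are inferred to have changed more than once. As the observed proportion approaches 0.75, the correction becomes undefined, which corresponds to the essentially random data described for severe saturation.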

  42. Exploring patterns in sequence data • Do sequences manifest biased base compositions (e.g. thermophilic convergence) or biased codon usage patterns which may obscure phylogenetic signal?

  43. A case study in phylogenetic analysis: Deinococcus and Thermus • Deinococcus are radiation resistant bacteria • Thermus are thermophilic bacteria • BUT: • Both have the same very unusual cell wall based upon ornithine • Both have the same menaquinones (Mk 9) • Both have the same unusual polar lipids • Congruence between these complex characters supports a phylogenetic relationship between Deinococcus and Thermus

  44. % Guanine + Cytosine in 16S rRNA genes from mesophiles and thermophiles (%GC at all sites / at variable sites) • Thermophiles: Thermotoga maritima 62 / 72, Thermus thermophilus 64 / 72, Aquifex pyrophilus 65 / 73 • Mesophiles: Deinococcus radiodurans 55 / 52, Bacillus subtilis 55 / 50
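
Figures of this kind can be computed directly from an alignment. The following is a minimal sketch, assuming short toy sequences invented for illustration (not the 16S rRNA data behind the slide); `gc_content` and `variable_columns` are hypothetical helper names.

```python
def gc_content(seq):
    """Percentage of G + C among the unambiguous bases of a sequence."""
    counted = [base for base in seq.upper() if base in "ACGTU"]
    if not counted:
        raise ValueError("sequence contains no unambiguous bases")
    return 100.0 * sum(base in "GC" for base in counted) / len(counted)


def variable_columns(seqs):
    """Indices of alignment columns where the sequences are not all identical."""
    return [i for i, col in enumerate(zip(*seqs)) if len(set(col)) > 1]


# Toy alignment, invented for illustration.
seqs = ["GGCCAT", "GGCGAC", "GGCTAA"]
cols = variable_columns(seqs)

for seq in seqs:
    at_variable_sites = "".join(seq[i] for i in cols)
    print(seq, round(gc_content(seq), 1), round(gc_content(at_variable_sites), 1))
```

Restricting the calculation to variable sites matters because constant sites dilute the compositional difference between sequences, as the larger spread in the variable-site values on this slide suggests.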

  45. Shared nucleotide or amino acid composition biases can also cause problems for phylogenetic analysis • The correct tree can be obtained if a model is used which allows base/aa composition to vary between sequences - LogDet/Paralinear Distances, Heterogeneous Maximum Likelihood [Figure: true tree and wrong tree for 16S rRNA from Aquifex (73% G+C), Thermus (72%), Deinococcus (52%) and Bacillus (50%)]

  46. Gene trees and species trees • We often assume that gene trees give us species trees [Figure: a species tree for A, B and C alongside a gene tree for a, b and c]

  47. Orthologues and paralogues • Duplication of an ancestral gene to give 2 copies on the same genome = paralogues of each other [Figure: gene tree with the two copies in each lineage (a, b, c and A, B, C), with pairs labelled orthologous or paralogous; asterisks (A*, b*, C*) mark a sampled mixture of orthologues and paralogues]
