Phylogenetics


  1. Phylogenetics Alexei Drummond

  2. Friday quiz: How many rooted binary trees having 20 labeled terminal nodes are there?
     (A) 2027025
     (B) 34459425
     (C) 8.20 × 10^21
     (D) 3.21 × 10^70
     Bonus question: What about unrooted trees?
     CS369 2007

  3. Computational Biology
     • Pairwise sequence alignment (global and local)
     • Multiple sequence alignment
     • Substitution matrices
     • Database searching (BLAST)
     • Sequence statistics
     • Evolutionary tree reconstruction
     Adapted from slide by Dannie Durant

  4. Molecules as Documents of Evolutionary History
     • Macromolecules contain information about the processes and history that formed them
       HIV-1 (UK)  ATCGGATGCTAAAGCATATGACACAGAGGTACATAATGTTT
       HIV-1 (USA) ATCAGATGCTAGAGCTTATGATACAGAGGTACA---TGTTT
     • However, this information is often fragmentary, camouflaged or lost completely
     • One of the aims of computational biology is to recover as much of this information as possible and decipher its meaning

  5. Phylogenetics
     • Views similarity (homology) as evidence of common ancestry
     • Homology: similarity that is the result of inheritance from a common ancestor
     • Uses tree diagrams to portray relationships based upon recency of common ancestry
     • Monophyletic groups (clades) contain species which are more closely related to each other than to any species outside the group
     • Phylogenetics has in recent years become a statistical science based on probabilistic models of evolution

  6. Types of Phylogenies
     • Phylograms show clusters and branch lengths
       • Branch lengths can represent time or genetic distance
       • Vertical dimension is meaningless
     • Cladograms show clusters
       • Branch lengths are meaningless
     [Figure: the same tree of Bacterium 1-3 and Eukaryote 1-4 drawn once as a phylogram and once as a cladogram]

  7. Rooting trees using an outgroup
     [Figure: an unrooted tree of archaea, eukaryote and bacteria taxa, and the same tree rooted by the bacteria outgroup. In the rooted tree the archaea and the eukaryotes each form a monophyletic group (clade), and the root of the ingroup is marked.]

  8. Anatomy of a tree
     [Figure: tree of Bacterium 1-3 and Eukaryote 1-4 with these parts labeled: root, taxon, internal node, external node or tip, internal branch or edge, external branch or edge]

  9. Problems in Phylogenetics
     • Correctly aligning multiple sequences
     • Choosing an evolutionary model of sequence change
       • To estimate the genetic distance between sequences
     • Inferring phylogenetic trees
     • Testing evolutionary hypotheses
       • (we won’t cover this material in 369)

  10. How many trees are there?
     For n taxa there are (2n - 3)!! = (2n - 3)(2n - 5)...(3)(1) rooted, binary trees (a double factorial, the product of the odd numbers up to 2n - 3):

       n     #trees          comment
       4     15              enumerable by hand
       5     105             enumerable by hand on a rainy day
       6     945             enumerable by computer
       7     10395           still searchable very quickly on computer
       8     135135          a bit more than the number of hairs on your head
       9     2027025         greater than the population of Auckland
       10    34459425        ≈ upper limit for exhaustive searching; about the number of possible combinations of numbers in the UK National Lottery
       20    8.20 × 10^21    ≈ upper limit for branch-and-bound searching
       48    3.21 × 10^70    ≈ the number of particles in the universe
       136   2.11 × 10^267   number of trees to choose from in the “Out of Africa” data (Vigilant et al., 1991)
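
The counts on this slide can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides: the function names are made up for illustration. It uses the double factorial (2n - 3)!! for rooted trees and the standard result that unrooted binary trees on n labeled tips number (2n - 5)!!, which is the same as the rooted count for n - 1 tips.

```python
def double_factorial(k):
    """Product k * (k - 2) * (k - 4) * ... down to 1 or 2; 1 for k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def rooted_binary_trees(n):
    """Number of rooted binary trees on n >= 2 labeled tips: (2n - 3)!!."""
    return double_factorial(2 * n - 3)


def unrooted_binary_trees(n):
    """Number of unrooted binary trees on n >= 3 labeled tips: (2n - 5)!!."""
    return double_factorial(2 * n - 5)


# Print n, the rooted count and the unrooted count for a few table rows.
for n in (4, 5, 10, 20):
    print(n, rooted_binary_trees(n), unrooted_binary_trees(n))
```

For the quiz, n = 20 gives the 8.20 × 10^21 rooted trees of answer (C); dividing by 37 gives the unrooted count for the bonus question, about 2.2 × 10^20.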

  11. Phylogenetic Reconstruction
     • There are essentially two types of data for phylogenetic tree estimation:
       • Distance data, usually stored in a distance matrix, e.g. DNA×DNA hybridisation data, morphometric differences, immunological data, pairwise genetic distances
       • Character data, usually stored in a character array, e.g. multiple sequence alignment of DNA sequences, morphological characters

     Characters (taxa in rows, characters 1-10 in columns):

       Taxa   1  2  3  4  5  6  7  8  9  10
       A      1  0  0  0  1  1  0  1  1  1
       B      0  1  0  0  1  1  1  1  1  1
       C      0  0  1  0  0  0  1  1  1  1
       D      0  0  0  1  0  0  0  0  0  1
       E      0  0  0  0  0  0  0  0  0  0

     [Figure: distance matrix for the same taxa A-E; values not recoverable from the transcript]
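
To make the link between the two data types concrete, here is a small sketch that turns a character array into a distance matrix. The character values are read from the flattened slide text, so treat them as a best-effort reconstruction. Counting mismatching characters (the Hamming distance) is an assumption chosen for illustration; the slide does not say how its distance matrix was computed.

```python
from itertools import combinations

# Character array as read from the slide (taxon -> states of characters 1..10).
characters = {
    "A": [1, 0, 0, 0, 1, 1, 0, 1, 1, 1],
    "B": [0, 1, 0, 0, 1, 1, 1, 1, 1, 1],
    "C": [0, 0, 1, 0, 0, 0, 1, 1, 1, 1],
    "D": [0, 0, 0, 1, 0, 0, 0, 0, 0, 1],
    "E": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}


def hamming(x, y):
    """Number of positions at which two equal-length state lists differ."""
    return sum(a != b for a, b in zip(x, y))


# Pairwise distances, keyed by an alphabetically ordered pair of taxon names.
distances = {
    (s, t): hamming(characters[s], characters[t])
    for s, t in combinations(sorted(characters), 2)
}

for (s, t), d in distances.items():
    print(s, t, d)
```

Real analyses would usually correct such raw counts with an evolutionary model before building a tree.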

  12. Phylogenetic Reconstruction
     • Given the huge number of possible trees even for small data sets, we have two options:
       • Build one according to some clustering algorithm
       • Assign a “goodness of fit” criterion (an objective function) and find the tree(s) which optimise(s) this criterion

  13. Phylogenetic Reconstruction

       Tree-building method    Type of data: Distances     Type of data: Nucleotide sites
       Clustering algorithm    UPGMA, Neighbor-Joining
       Optimality criterion    Minimum Evolution           Maximum Parsimony, Maximum Likelihood

  14. Clustering Algorithms
     • The clustering algorithms are usually very fast and simple, but
     • there is no explicit optimality criterion, so
       • we have no measure of how good the tree is!
       • we do not get any idea about other potential trees: were there any better trees?
     • Common methods are Neighbour-Joining and UPGMA.

  15. Clustering Algorithms
     • The UPGMA and neighbor-joining (NJ) methods are both greedy heuristics which join, at each step, the two closest* sub-trees that are not already joined.
     • They are based on the minimum evolution principle.
     • An important concept in both of these methods is a pair of neighbors, which is defined as two nodes that are connected via a single node.
     * NJ uses rate-corrected distances
     [Figure: neighbors A and B connected via a single node, Node 1]

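  16. UPGMA Example
     [Figure: A and C are joined first, each on a branch of length 3.5]

  17. UPGMA Example
     [Figure: B is joined next, on a branch of length 4.25; the internal branch leading to the (A, C) cluster has length 0.75]

  18. UPGMA Example
     [Figure: D is joined last, on a branch of length 6.17; the internal branch leading to the (A, C, B) cluster has length 1.92]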
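
The three steps of the example can be reproduced with a short UPGMA sketch. The distance matrix behind the slides is not in the transcript, so the matrix below is hypothetical: it was chosen only so that the merge heights match the 3.5, 4.25 and 6.17 shown in the figures. The function names are also illustrative, and the implementation favors clarity over speed by recomputing average distances from the leaf-level matrix at each step.

```python
from itertools import combinations


def leaf_distance(dist, a, b):
    """Look up a symmetric leaf-to-leaf distance stored under either key order."""
    return dist[(a, b)] if (a, b) in dist else dist[(b, a)]


def cluster_distance(dist, c1, c2):
    """UPGMA distance: unweighted mean over all leaf pairs between two clusters."""
    total = sum(leaf_distance(dist, a, b) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))


def upgma(dist, labels):
    """Return the merges in order as (cluster_1, cluster_2, node_height)."""
    clusters = [(x,) for x in labels]  # every leaf starts as its own cluster
    merges = []
    while len(clusters) > 1:
        # Greedy step: pick the two closest clusters not already joined.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(dist, *pair))
        # Clock assumption: the new node sits at half the cluster distance.
        height = cluster_distance(dist, c1, c2) / 2
        merges.append((c1, c2, height))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Hypothetical distances, NOT taken from the slides.
dist = {("A", "B"): 8, ("A", "C"): 7, ("A", "D"): 12,
        ("B", "C"): 9, ("B", "D"): 12, ("C", "D"): 13}

merges = upgma(dist, ["A", "B", "C", "D"])
for c1, c2, height in merges:
    print(c1, c2, round(height, 2))
```

The internal branch lengths in the figures are the differences between successive node heights: 4.25 - 3.5 = 0.75 and 6.17 - 4.25 = 1.92.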

  19. UPGMA weaknesses
     [Figure: a tree relating A, B, C and D with unequal branch lengths]
     There is a (non clock-like) tree that fits the distance matrix exactly!

  20. UPGMA properties
     • UPGMA assumes that the rates of evolution are clock-like.
       • Assumes the rate of substitution is the same on all branches of the tree
     • Produces a rooted tree

  21. Neighbor-joining
     • Most widely used distance-based method for phylogenetic reconstruction
     • UPGMA illustrated that it is not enough to pick the closest neighbors (at least when there is rate heterogeneity across branches)
     • Idea: take into account averaged distances to other leaves as well
     • Produces an unrooted tree

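  22. The basic idea
     • We start by moving every node i closer to all other nodes by this amount: $r_i = \frac{1}{n-2} \sum_{k \neq i} d_{ik}$, where n is the number of nodes and $d_{ik}$ is the distance between nodes i and k.
     • As a result the new (squashed) distances are: $D_{ij} = d_{ij} - (r_i + r_j)$
     • We are pushing node i closer to all other nodes by an amount slightly more than the average distance to all other taxa (the sum is divided by n - 2 rather than n - 1).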

  23. The basic idea
     • In effect, the nodes that were far away from everything get pushed towards everything quite a lot.
     • This counteracts the effect of long branches.
     [Figure: four-taxon tree with short branches (0.1) leading to A and B, long branches (0.3) leading to C and D, and an internal branch of length 0.1]
     UPGMA would incorrectly group A and B, whereas NJ would reconstruct the correct tree in this case.
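
The effect can be checked numerically on the four-taxon example. This sketch rests on two assumptions. The first is the tree layout: A (0.1) and C (0.3) on one side of the internal branch (0.1), B (0.1) and D (0.3) on the other, which is one reading of the flattened figure. The second is the standard neighbor-joining correction, r_i = (sum of distances from i to all other nodes) / (n - 2), with squashed distance D_ij = d_ij - (r_i + r_j). All names in the code are illustrative.

```python
from itertools import combinations

labels = ["A", "B", "C", "D"]

# Path lengths on the assumed tree ((A:0.1, C:0.3), (B:0.1, D:0.3)), internal branch 0.1.
dist = {("A", "B"): 0.3, ("A", "C"): 0.4, ("A", "D"): 0.5,
        ("B", "C"): 0.5, ("B", "D"): 0.4, ("C", "D"): 0.7}


def d(i, j):
    """Symmetric lookup of the raw distance between two taxa."""
    return dist[(i, j)] if (i, j) in dist else dist[(j, i)]


def r(i):
    """Correction for taxon i: summed distance to all others, divided by n - 2."""
    return sum(d(i, k) for k in labels if k != i) / (len(labels) - 2)


def squashed(i, j):
    """Rate-corrected distance used by neighbor-joining to pick neighbors."""
    return d(i, j) - (r(i) + r(j))


pairs = list(combinations(labels, 2))
closest_raw = min(pairs, key=lambda p: d(*p))
closest_squashed = min(pairs, key=lambda p: squashed(*p))

print("closest by raw distance:     ", closest_raw)
print("closest by squashed distance:", closest_squashed)
```

On the raw distances the closest pair is A and B, which are not neighbors on this tree. On the squashed distances the pairs (A, C) and (B, D) tie for the minimum, and both are true neighbor pairs.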

  24. Neighbor-joining
     • We use an algorithm very similar to UPGMA to connect the two closest nodes, i and j, using these new squashed distances.
     • We join these into a cluster and make a new node k to correspond to their ancestor, and pick distances from i, j and all other nodes to k.
     • The squashed distances are updated at each step.
     • See Durbin book, p171 for details.

  25. Runtime of the algorithm
     • Both of these clustering-based algorithms take O(n^3) time once we have the distance matrix.
     • There are n steps and in each step we do:
       • (1) find the smallest distance
       • (2) join these two taxa
       • (3) compute the distance from the new ancestor to all others
     • Step (1) takes O(n^2) and the other two steps take O(n)
