
Phylogenetic Inference


elton-pope


Presentation Transcript


  1. Phylogenetic Inference • Involves an attempt to estimate the evolutionary history of a collection of organisms (taxa) or a family of genes • Two major components • Estimation of the evolutionary tree (branching order) • Using estimated trees (phylogenies) as analytical framework for further evolutionary study • Traditional role: systematics and classification

  2. Example 1: Closest living relatives of humans. [Figure: two alternative trees for humans, chimpanzees, bonobos, gorillas, and orangutans, with time axes in millions of years ago (MYA); axis labels 0, 15-30, and 14.] Panels: pre-molecular view (morphology); emerging picture from mtDNA, most nuclear genes, and DNA/DNA hybridization.

  3. Example 2: Who are whales related to? Morphological data suggest that whales are a “sister clade” to extant artiodactylans, but molecular data suggest strongly that whales and hippos are more closely related to each other than hippos are to other artiodactylans Morphology Mt and nuclear DNA sequences, SINEs, LINEs

  4. Other interesting applications. Forensics: transmission of HIV by a Florida dentist. [Figure: phylogenetic tree of HIV sequences from the DENTIST, his patients, and local HIV-infected people. Tip labels shown: DENTIST, Patients A, B, C, D, E, F, G, and Local controls 2, 3, 9, 35. Patients are annotated "Yes" or "No".] Yes: The HIV sequences from these patients fall within the clade of HIV sequences found in the dentist. From Ou et al. (1992) and Page & Holmes (1998), redrawn by Caro-Beth Stewart.

  5. Other interesting applications. Studying dynamics of microbial communities: Sequence 16S rDNA to identify and quantify microbes in soil before and after pesticide exposure (many microbes are previously unknown, so study gene sequences phylogenetically to follow changes in community composition). [Figure labels: known sequences from database; novel microbial sequences.]

  6. Other interesting applications Predicting evolution of influenza viruses Lineages with many mutations in one set of positively selected codons were usually the ones which led to successful strains in subsequent seasons

  7. Other interesting applications Predicting functions of uncharacterized genes Use “character-mapping” to infer functions based on parsimonious reconstructions Many situations where similarity-based methods are inadequate, e.g.:

  8. Other interesting applications • Drug discovery: predicting natural ligands for cell surface receptors that are potential drug targets (e.g., G-protein coupled receptors). G-protein-coupled receptors are a pharmacologically important protein family with approximately 450 genes identified to date. Pathways involving these receptors are the targets of hundreds of drugs, including antihistamines, neuroleptics, antidepressants, and antihypertensives. The functions of many of these proteins are unknown, and determining ligands and signaling pathways is time-consuming and expensive. This difficulty motivates the search for a computational method which can predict ligand and second messenger with high reliability. Classifying this family of proteins helps us classify drugs, a technique which might be called "evolutionary pharmacology"… A computational method based on evolutionary tree reconstruction and employing an accepted-mutation stepmatrix can predict the ligand selectivities and intracellular signaling pathways of uncharacterized receptors, given only the amino acid sequence of the receptor. This dramatically increases the efficiency of functional characterization of new receptors. (http://www.cis.upenn.edu/~krice/receptor.html) • Vaccine development: engineer vaccines to confer immunity against multiple virus populations by targeting their inferred common ancestors

  9. Common Phylogenetic Tree Terminology. [Figure: rooted tree with terminal taxa A, B, C, D, and E.] Terminal nodes: represent the TAXA (genes, populations, species, etc.) used to infer the phylogeny. Branches (edges) and lineages. Internal nodes or divergence points: represent hypothetical ancestors of the taxa. Ancestral node or ROOT of the tree.

  10. The goal of phylogeny inference is to resolve the branching orders of lineages in evolutionary trees. [Figure: three trees of taxa A, B, C, D, and E, with nodes labeled "polytomy or multifurcation" and "a bifurcation".] Completely unresolved or "star" phylogeny; partially resolved phylogeny; fully resolved, bifurcating phylogeny (binary tree).

  11. Three possible unrooted trees for four taxa (A, B, C, D). [Figure: Tree 1, Tree 2, and Tree 3, the three unrooted topologies for taxa A, B, C, and D.] Phylogenetic tree building (or inference) methods are aimed at discovering which of the possible unrooted trees is "correct". We would like this to be the "true" biological tree, that is, one that accurately represents the evolutionary history of the taxa. However, we must settle for discovering the optimal tree for the phylogenetic method of choice (no guarantee that optimality = truth). C-B Stewart, NHGRI lecture, 12/5/00

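  12. The number of unrooted trees increases in a greater than exponential manner with the number of taxa. Number of unrooted trees for N taxa = (2N - 5)!! = 1 × 3 × 5 × ... × (2N - 5). [Figure: example unrooted trees for 3 taxa (A-C), 4 taxa (A-D), 5 taxa (A-E), and 6 taxa (A-F).]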
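The growth described on this slide can be checked numerically. The following is a minimal Python sketch of the (2N - 5)!! count; the function names are illustrative, and the rooted-tree count is an added assumption based on the fact that a binary unrooted tree on N taxa has 2N - 3 branches, each of which is a possible root position.

```python
def num_unrooted_trees(n_taxa: int) -> int:
    """Number of fully resolved unrooted trees for n_taxa >= 3: (2N - 5)!!."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd factors 3, 5, ..., 2N - 5 (the factor 1 is implicit).
    for factor in range(3, 2 * n_taxa - 4, 2):
        count *= factor
    return count


def num_rooted_trees(n_taxa: int) -> int:
    """Each unrooted tree has 2N - 3 branches, and each branch can hold the root."""
    return num_unrooted_trees(n_taxa) * (2 * n_taxa - 3)


if __name__ == "__main__":
    for n in (3, 4, 5, 6, 10, 20):
        print(n, num_unrooted_trees(n), num_rooted_trees(n))
```

For four taxa this gives 3 unrooted trees, and for five taxa 15, which is why exhaustive enumeration quickly becomes impractical as taxa are added.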

  13. Inferring evolutionary relationships between the taxa requires rooting the tree. To root a tree mentally, imagine that the tree is made of string. Grab the string at the root and tug on it until the ends of the string (the taxa) fall opposite the root. [Figure: unrooted tree of taxa A, B, C, and D with the root marked, and the resulting rooted tree.] Note that in this rooted tree, taxon A is no more closely related to taxon B than it is to C or D.

  14. Now, try it again with the root at another position. [Figure: unrooted tree of taxa A, B, C, and D with the root marked on a different branch, and the resulting rooted tree.] Note that in this rooted tree, taxon A is most closely related to taxon B, and together they are equally distantly related to taxa C and D.

  15. An unrooted, four-taxon tree theoretically can be rooted in five different places to produce five different rooted trees. [Figure: unrooted tree 1 (taxa A, B, C, D) with root positions numbered 1 to 5, and the resulting rooted trees 1a, 1b, 1c, 1d, and 1e.] These trees show five different evolutionary relationships among the taxa.

  16. All of these rearrangements show the same evolutionary relationships between the taxa. [Figure: rooted tree 1a redrawn several ways with the order of taxa A, B, C, and D rearranged.]

  17. There are two major ways to root trees. By outgroup: Uses taxa (the "outgroup") that are known to fall outside of the group of interest (the "ingroup"). Requires some prior knowledge about the relationships among the taxa. The outgroup can either be species (e.g., birds to root a mammalian tree) or previous gene duplicates (e.g., α-globins to root β-globins). By midpoint or distance: Roots the tree at the midway point between the two most distant taxa in the tree, as determined by branch lengths. Assumes that the taxa are evolving in a clock-like manner. This assumption is built into some of the distance-based tree building methods. [Figure: tree of taxa A, B, C, and D with branch lengths 10, 3, 2, 2, and 5.] Example: d(A,D) = 10 + 3 + 5 = 18, so the midpoint = 18 / 2 = 9.

  18. Types of data used in phylogenetic inference:
     Character-based methods: Use the aligned characters, such as DNA or protein sequences, directly during tree inference.

     Taxa        Characters
     Species A   ATGGCTATTCTTATAGTACG
     Species B   ATCGCTAGTCTTATATTACA
     Species C   TTCACTAGACCTGTGGTCCA
     Species D   TTGACCAGACCTGTGGTCCG
     Species E   TTGACCAGTTCTCTAGTTCG

     Distance-based methods: Transform the sequence data into pairwise distances (dissimilarities), and then use the matrix during tree building.

                 A      B      C      D      E
     Species A   ----   0.20   0.50   0.45   0.40
     Species B   0.23   ----   0.40   0.55   0.50
     Species C   0.87   0.59   ----   0.15   0.40
     Species D   0.73   1.12   0.17   ----   0.25
     Species E   0.59   0.89   0.61   0.31   ----

     Above the diagonal, Example 1: Uncorrected "p" distance (= observed percent sequence difference, given here as a proportion).
     Below the diagonal, Example 2: Kimura 2-parameter distance (estimate of the true number of substitutions between taxa).
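The two distance measures on this slide can be reproduced from the alignment. The sketch below is a minimal Python illustration, assuming gap-free sequences over A, C, G, T and the standard Kimura 2-parameter formula d = -(1/2) ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitions and transversions; the function name is illustrative.

```python
import math

PURINES = {"A", "G"}


def p_and_k2p(seq1: str, seq2: str) -> tuple[float, float]:
    """Return (uncorrected p distance, Kimura 2-parameter distance)."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = 0
    for x, y in zip(seq1, seq2):
        if x == y:
            continue
        # Purine<->purine or pyrimidine<->pyrimidine is a transition.
        if (x in PURINES) == (y in PURINES):
            transitions += 1
        else:
            transversions += 1
    n_sites = len(seq1)
    P = transitions / n_sites
    Q = transversions / n_sites
    p_distance = P + Q
    # math.log raises ValueError if the sequences are too divergent
    # for the correction to be defined.
    k2p = -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))
    return p_distance, k2p


if __name__ == "__main__":
    species_a = "ATGGCTATTCTTATAGTACG"
    species_b = "ATCGCTAGTCTTATATTACA"
    p, k = p_and_k2p(species_a, species_b)
    print(f"p = {p:.2f}, K2P = {k:.2f}")
```

For Species A and Species B the sketch gives p = 0.20 and a Kimura distance of about 0.23, matching the two entries for that pair in the matrix. The corrected distance is larger because it accounts for multiple substitutions at the same site.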

  19. Similarity vs. Evolutionary Relationship: Similarity and relationship are not the same thing, even though evolutionary relationship is inferred from certain types of similarity. Similar: having likeness or resemblance (an observation). Related: genetically connected (an historical fact). Two taxa can be most similar without being most closely related. [Figure: tree of Taxon A, Taxon B, Taxon C, and Taxon D with branch lengths 6, 1, 1, 3, 1, and 5.] C is more similar in sequence to A (d = 3) than to B (d = 7), but C and B are most closely related (that is, C and B shared a common ancestor more recently than either did with A).

  20. Types of Similarity. Observed similarity between two entities can be due to: Evolutionary relationship: shared ancestral characters ('symplesiomorphies') or shared derived characters ('synapomorphies'). Homoplasy (independent evolution of the same character): convergent events (in either related or unrelated entities), parallel events (in related entities), reversals (in related entities). [Figure: example trees with nucleotide states C, G, and T at the tips.] Character-based methods can tease apart types of similarity and theoretically find the true evolutionary tree. Similarity = relationship only if certain conditions are met (if the distances are 'ultrametric').

  21. METRIC DISTANCES between any two or three taxa (a, b, and c) have the following properties: Property 1: d(a, b) ≥ 0 (non-negativity). Property 2: d(a, b) = d(b, a) (symmetry). Property 3: d(a, b) = 0 if and only if a = b (distinctness). Property 4: d(a, c) ≤ d(a, b) + d(b, c) (triangle inequality). [Figure: triangle with vertices a, b, and c and side lengths 9, 6, and 5.]

  22. ULTRAMETRIC DISTANCES must satisfy the previous four conditions, plus Property 5: d(a, b) ≤ maximum [d(a, c), d(b, c)]. This implies that the two largest distances are equal, so that they define an isosceles triangle. [Figure: triangle with vertices a, b, and c and side lengths 4, 6, and 6, and the corresponding tree with branch labels 2, 2, 2, and 4.] Similarity = Relationship if the distances are ultrametric! If distances are ultrametric, then the sequences are evolving in a perfectly clock-like manner, and thus can be used in UPGMA trees and for the most precise calculations of divergence dates.
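The triangle inequality (Property 4) and the ultrametric condition (Property 5) can be checked mechanically for the two example triangles. This is a minimal Python sketch; the helper names are illustrative, and the assignment of the side lengths 9, 6, 5 and 4, 6, 6 to particular pairs of taxa is an assumption, since the figures did not survive. The outcome does not depend on that assignment.

```python
from itertools import permutations


def make_dist(pairs):
    """Build a symmetric lookup d(x, y) from a {(x, y): value} dict."""
    table = {frozenset(pair): value for pair, value in pairs.items()}
    return lambda x, y: 0.0 if x == y else table[frozenset((x, y))]


def satisfies_triangle_inequality(d, taxa):
    """Property 4: d(a, c) <= d(a, b) + d(b, c) for every ordered triple."""
    return all(d(a, c) <= d(a, b) + d(b, c) for a, b, c in permutations(taxa, 3))


def is_ultrametric(d, taxa):
    """Property 5: d(a, b) <= max(d(a, c), d(b, c)) for every ordered triple."""
    return all(d(a, b) <= max(d(a, c), d(b, c)) for a, b, c in permutations(taxa, 3))


taxa = ["a", "b", "c"]
# Side lengths from the two slides; which pair carries which length is assumed.
metric_example = make_dist({("a", "b"): 9, ("a", "c"): 6, ("b", "c"): 5})
ultrametric_example = make_dist({("a", "b"): 4, ("a", "c"): 6, ("b", "c"): 6})

if __name__ == "__main__":
    for name, d in (("9/6/5", metric_example), ("4/6/6", ultrametric_example)):
        print(name, satisfies_triangle_inequality(d, taxa), is_ultrametric(d, taxa))
```

The 9/6/5 triangle satisfies the triangle inequality but is not ultrametric, because its two largest sides differ. The 4/6/6 triangle is isosceles with the two largest sides equal, so it passes both checks.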

  23. General strategy for estimating a phylogeny 1. Get data 2. Select an optimality criterion (e.g., parsimony, least-squares distance, maximum likelihood) 3. Choose a search strategy (e.g., stepwise addition with branch swapping, branch-and-bound) 4. Evaluate optimality criterion for each tree visited during search, always keeping track of best tree(s) found

  24. Parsimony (optimality criterion) • In general: choose the tree requiring the fewest number of (possibly weighted) character-state changes (= steps) • Assume character independence; can calculate length required by each character and sum over characters to get total tree length

  25. Parsimony variants used for molecular data • Fitch parsimony (unordered/nonadditive): Each change counts 1 step, regardless of the nature of this change • Transversion parsimony: changes between a purine (A or G) and a pyrimidine (C or T) ("transversions") count 1, changes between two purines or between two pyrimidines ("transitions") count 0 • Generalized parsimony: User specifies cost of each type of change [Figure: diagram of changes among A, C, G, and T, with a legend distinguishing changes that cost 1 step from changes that cost 3 steps.]

  26. Calculating tree lengths under parsimony using “brute force” • For each character: • Consider every possible ancestral state reconstruction • Count total cost required for each of these reconstructions • Sum over all characters

  27. Calculating tree lengths using dynamic programming • Analogous to pairwise alignment: determine implications of each possible state assignment at one level (node) for length at next level (parent node)
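The dynamic-programming idea can be sketched for a single character as follows. This minimal Python sketch follows the usual Sankoff-style recursion: for each node and each possible state, store the minimum cost of the subtree below. The cost scheme (transitions cost 1, transversions cost 3) and the nested-tuple tree encoding are assumptions chosen for illustration.

```python
STATES = "ACGT"
PURINES = {"A", "G"}


def cost(x: str, y: str) -> int:
    """Hypothetical step costs: 0 for no change, 1 for a transition, 3 for a transversion."""
    if x == y:
        return 0
    return 1 if (x in PURINES) == (y in PURINES) else 3


def subtree_costs(node):
    """Map each state s to the minimum cost of the subtree if this node has state s.

    A node is either a tip (a one-letter string) or a (left, right) tuple.
    """
    if isinstance(node, str):
        return {s: (0 if s == node else float("inf")) for s in STATES}
    left, right = (subtree_costs(child) for child in node)
    return {
        s: min(cost(s, t) + left[t] for t in STATES)
        + min(cost(s, t) + right[t] for t in STATES)
        for s in STATES
    }


def tree_length(tree) -> float:
    """Length of the tree for one character: best cost over all root states."""
    return min(subtree_costs(tree).values())


if __name__ == "__main__":
    print(tree_length((("A", "G"), ("C", "T"))))
```

Each node is visited once and does a fixed amount of work per pair of states, so the cost grows linearly with the number of taxa, whereas the brute-force approach enumerates every combination of ancestral states.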

  28. Faster algorithms for special cases • Farris (1970) algorithm for ordered characters • Fitch (1971) algorithm for unordered characters • Assign “state sets” to terminal taxa based on observed data, and initialize tree length to 0 • Traverse tree from tips to root; for each node consider state sets of two immediate descendants (children) • If child state sets have a nonempty intersection, new state set equals this intersection • Otherwise, make new state set equal to the union of the two child state sets, and add 1 to the tree length

  29. Example of tree length calculation using Fitch optimization
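Since the figure for this example did not survive, here is a minimal Python sketch of the Fitch procedure from the preceding slide, applied to a small made-up data set. The nested-tuple tree encoding, the function names, and the two-site alignment are assumptions for illustration, not the slide's own example.

```python
def fitch_site(node, states):
    """Return (state set, length) for one character.

    A node is a taxon name (str) or a (left, right) tuple; `states` maps
    each taxon name to its observed state for this character.
    """
    if isinstance(node, str):
        return {states[node]}, 0
    left_set, left_len = fitch_site(node[0], states)
    right_set, right_len = fitch_site(node[1], states)
    common = left_set & right_set
    if common:
        # Nonempty intersection: no extra step is needed at this node.
        return common, left_len + right_len
    # Empty intersection: take the union and count one step.
    return left_set | right_set, left_len + right_len + 1


def fitch_length(tree, alignment):
    """Sum the single-character lengths over all aligned sites."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {taxon: seq[i] for taxon, seq in alignment.items()})[1]
        for i in range(n_sites)
    )


if __name__ == "__main__":
    toy = {"A": "AC", "B": "AC", "C": "GC", "D": "GT"}  # made-up data
    print(fitch_length((("A", "B"), ("C", "D")), toy))
    print(fitch_length((("A", "C"), ("B", "D")), toy))
```

On this toy alignment the tree grouping A with B needs 2 steps and the tree grouping A with C needs 3, so parsimony prefers the first.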

  30. Searching for trees • Generation of all possible trees 1. Generate all 3 trees for first 4 taxa:

  31. Searching for trees 2. Generate all 15 trees for first 5 taxa: (likewise for each of the other two 4-taxon trees)

  32. Searching for trees 3. Full search tree:

  33. Searching for trees. Branch and bound algorithm: The branch-and-bound algorithm for exact solution of the problem of finding an optimal parsimony tree. The search tree is the same as for exhaustive search, with tree lengths for a hypothetical data set shown in boldface type. If a tree lying at a node of this search tree has a length that exceeds the current upper bound on the optimal tree length, this path of the search tree is terminated (indicated by a cross-bar), and the algorithm backtracks and takes the next available path. When a tip of the search tree is reached (i.e., when we arrive at a tree containing the full set of taxa), the tree is either optimal (and hence retained) or suboptimal (and rejected). When all paths leading from the initial 3-taxon tree have been explored, the algorithm terminates, and all most-parsimonious trees will have been identified. Asterisks indicate points at which the current upper bound is reduced. Circled numbers represent the order in which phylogenetic trees are visited in the search tree. See text for additional explanation.

  34. Searching for trees Heuristic search methods A greedy stepwise-addition search applied to the example used for branch-and-bound. The best 4-taxon tree is determined by evaluating the lengths of the three trees obtained by joining taxon D to tree 1 containing only the first three taxa. Taxa E and F are then connected to the five and seven possible locations, respectively, on trees 4 and 9, with only the shortest trees found during each step being used for the next step. In this example, the 233-step tree obtained is not a global optimum. Circled numbers indicate the order in which phylogenetic trees are evaluated in the stepwise-addition search.

  35. Searching for trees Heuristic search methods continued Nearest neighbor interchange: All possible NNIs on 6-taxon tree:

  36. Searching for trees Heuristic search methods continued Subtree pruning regrafting:

  37. Searching for trees Heuristic search methods continued Trees resulting from SPR:

  38. Searching for trees Heuristic search methods continued Tree bisection-reconnection: Reconnection distances:

  39. Searching for trees Heuristic search methods continued Tree bisection-reconnection: Reconnection distances:

  40. Star-decomposition search

  41. Other search strategies • These “hill-climbing” methods work well for up to 20-30 taxa. For larger numbers of taxa, highly prone to entrapment in local optima. Therefore, additional strategies may be necessary: • Random restart (random trees, stepwise addition with random addition sequences) • Other optimization (meta)heuristics: iterated local search (restart after random perturbations); simulated annealing and other stochastic optimization methods • Genetic algorithms and other population-based approaches

  42. Overview of maximum likelihood as used in phylogenetics • Overall goal: Find a tree topology (and associated parameter estimates) that maximizes the probability of obtaining the observed data, given a model of evolution

  43. Overview of maximum likelihood as used in phylogenetics • Overall goal: Find a tree topology (and associated parameter estimates) that maximizes the probability of obtaining the observed data, given a model of evolution. Likelihood(hypothesis) ∝ Prob(data | hypothesis). Likelihood(tree, model) = k Prob(observed sequences | tree, model)

  44. Overview of maximum likelihood as used in phylogenetics • Overall goal: Find a tree topology (and associated parameter estimates) that maximizes the probability of obtaining the observed data, given a model of evolution. Likelihood(hypothesis) ∝ Prob(data | hypothesis). Likelihood(tree, model) = k Prob(observed sequences | tree, model) [not Prob(tree | data, model)]

  45. Computing the likelihood of a single tree. Aligned sequences (1) to (4), with site columns numbered 1 … j … N:
     (1) C…GGACA…C…GTTTA…C
     (2) C…AGACA…C…CTCTA…C
     (3) C…GGATA…A…GTTAA…C
     (4) C…GGATA…G…CCTAG…C
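As a rough illustration of Prob(observed sequences | tree, model) for one site, the Python sketch below computes the likelihood of site j (states C, C, A, G in sequences 1 to 4) on a four-taxon tree. Everything beyond the site pattern is an assumption for illustration: the Jukes-Cantor (JC69) substitution model, the topology ((1,2),(3,4)), the branch lengths, and the function names. It sums over all unobserved ancestral states by working from the tips to the root.

```python
import math
from itertools import product

STATES = "ACGT"


def jc69_prob(x: str, y: str, t: float) -> float:
    """JC69 probability of state y at the end of a branch of length t
    (expected substitutions per site), given state x at its start."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e


def conditional(node):
    """Map each state s to Prob(tip data below this node | node has state s).

    A node is a tip state (str) or a tuple of (child, branch_length) pairs.
    """
    if isinstance(node, str):
        return {s: 1.0 if s == node else 0.0 for s in STATES}
    result = {s: 1.0 for s in STATES}
    for child, t in node:
        below = conditional(child)
        for s in STATES:
            result[s] *= sum(jc69_prob(s, y, t) * below[y] for y in STATES)
    return result


def site_likelihood(tree) -> float:
    """Average the root conditionals over equal base frequencies (1/4 each)."""
    cond = conditional(tree)
    return sum(0.25 * cond[s] for s in STATES)


def make_tree(s1, s2, s3, s4, tip=0.1, inner=0.05):
    """Hypothetical tree ((1,2),(3,4)) with assumed branch lengths."""
    return (
        (((s1, tip), (s2, tip)), inner),
        (((s3, tip), (s4, tip)), inner),
    )


if __name__ == "__main__":
    like_j = site_likelihood(make_tree("C", "C", "A", "G"))
    print(f"L(site j) = {like_j:.6g}, ln L = {math.log(like_j):.4f}")
```

Under the usual assumption that sites evolve independently, the likelihood of the whole alignment is the product of the site likelihoods, so in practice the per-site log-likelihoods are summed.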
