
CSCI2950-C Lecture 8 Molecular Phylogeny: Parsimony and Likelihood

Learn how phylogenetic trees are built from DNA sequences using distance, parsimony, and likelihood methods. Explore algorithms like Sankoff and Fitch for tree labeling.



  1. CSCI2950-C Lecture 8: Molecular Phylogeny: Parsimony and Likelihood http://cs.brown.edu/courses/csci2950-c/

  2. Phylogenetic Trees [figure: two trees with leaves labeled 1–5] How are these trees built from DNA sequences? • Leaves represent existing species • Internal vertices represent ancestors • Root represents the oldest evolutionary ancestor

  3. Phylogenetic Trees [figure: two trees with leaves labeled 1–5] How are these trees built from DNA sequences? Methods • Distance • Parsimony: minimum number of mutations • Likelihood: probabilistic model of mutations

  4. Outline • Last Lecture: Distance-based methods • Additive distances • 4-point condition • UPGMA & neighbor joining • Today: • Parsimony-based methods • Sankoff's and Fitch's algorithms • Likelihood methods • Perfect phylogeny

  5. Weighted Small Parsimony Problem: Formulation • Input: Tree T with each leaf labeled by elements of a k-letter alphabet, and a k × k scoring matrix δ(i, j) • Output: Labeling of internal vertices of the tree T minimizing the weighted parsimony score

  6. Sankoff Algorithm: Dynamic Programming • Calculate and keep track of a score for every possible label at each vertex • s_t(v) = minimum parsimony score of the subtree rooted at vertex v if v has character t • The score at each vertex is based on the scores of its children: • s_t(parent) = min_i {s_i(left child) + δ(i, t)} + min_j {s_j(right child) + δ(j, t)}

  7. Sankoff Algorithm (cont.) • Begin at the leaves: • If the leaf has the character in question, the score is 0 • Else, the score is ∞

  8. Sankoff Algorithm (cont.) s_t(v) = min_i {s_i(u) + δ(i, t)} + min_j {s_j(w) + δ(j, t)} s_A(v) = min_i {s_i(u) + δ(i, A)} + min_j {s_j(w) + δ(j, A)} s_A(v) = 0

  9. Sankoff Algorithm (cont.) s_t(v) = min_i {s_i(u) + δ(i, t)} + min_j {s_j(w) + δ(j, t)} s_A(v) = min_i {s_i(u) + δ(i, A)} + min_j {s_j(w) + δ(j, A)} s_A(v) = 0 + 9 = 9

  10. Sankoff Algorithm (cont.) s_t(v) = min_i {s_i(u) + δ(i, t)} + min_j {s_j(w) + δ(j, t)} Repeat for T, G, and C

  11. Sankoff Algorithm (cont.) Repeat for right subtree

  12. Sankoff Algorithm (cont.) Repeat for root

  13. Sankoff Algorithm (cont.) The smallest score at the root is the minimum weighted parsimony score. In this case it is 9, so label the root with T.

  14. Sankoff Algorithm: Traveling Down the Tree • The scores at the root vertex have been computed by going up the tree • After the scores at the root vertex are computed, the Sankoff algorithm moves down the tree and assigns each vertex its optimal character.

  15. Sankoff Algorithm (cont.) The root score of 9 is derived from 7 + 2, so the left child is labeled T and the right child is labeled T.

  16. Sankoff Algorithm (cont.) And the tree is thus labeled…
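The whole procedure can be sketched compactly. A Python sketch of bottom-up scoring followed by top-down labeling; the nested-pair tree encoding, the unit-cost scoring matrix, and the leaf characters are illustrative assumptions, not the example from the slides:

```python
# A sketch of the full Sankoff algorithm: bottom-up scoring followed
# by top-down labeling.  Tree encoding, scoring matrix, and leaves
# are illustrative assumptions.
INF = float("inf")
ALPHABET = "ACGT"
DELTA = {a: {b: 0 if a == b else 1 for b in ALPHABET} for a in ALPHABET}

def score_up(node):
    """Return (score dict s_t, annotated subtree) for the subtree at node."""
    if isinstance(node, str):              # leaf: 0 for its character, inf otherwise
        return {t: 0 if t == node else INF for t in ALPHABET}, node
    (su, u), (sw, w) = score_up(node[0]), score_up(node[1])
    s = {t: min(su[i] + DELTA[i][t] for i in ALPHABET)
          + min(sw[j] + DELTA[j][t] for j in ALPHABET)
         for t in ALPHABET}
    return s, (s, su, u, sw, w)

def label_down(node, t):
    """Assign each child the character that achieved its parent's minimum."""
    if isinstance(node, str):
        return node
    _, su, u, sw, w = node
    i = min(ALPHABET, key=lambda c: su[c] + DELTA[c][t])
    j = min(ALPHABET, key=lambda c: sw[c] + DELTA[c][t])
    return (t, label_down(u, i), label_down(w, j))

scores, annotated = score_up((("A", "C"), ("G", "T")))
best = min(ALPHABET, key=scores.get)
print(min(scores.values()))          # minimum weighted parsimony score
print(label_down(annotated, best))   # fully labeled tree
```

With unit costs and four distinct leaves, the minimum score here is 3, one mutation per non-root leaf edge except the one matching the root's label.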

  17. Fitch’s Algorithm • Solves the Small Parsimony problem • Published in 1971, four years before Sankoff’s • Assigns a set of letters S(v) to every vertex in the tree • S(l) = observed character for each leaf l

  18. Fitch’s Algorithm: Example [figure: tree with leaves a, c, t, a; the bottom-up pass assigns the internal sets {a,c} and {t,a}, and the top-down pass labels every internal vertex a]

  19. Fitch Algorithm 1) Assign a set of possible letters S_v to every vertex v, traversing the tree from leaves to root • For vertex v with children u and w: • S_v = S_u ∩ S_w if the intersection is non-empty; S_u ∪ S_w otherwise • E.g., if a node has a left child labeled {A, C} and a right child labeled {A, T}, the node will be given the set {A, C, T}

  20. Fitch Algorithm (cont.) 2) Assign labels to each vertex, traversing the tree from root to leaves • Assign the root arbitrarily from its set of letters • For every other vertex, if its parent’s label is in its set of letters, assign it the parent’s label • Else, choose an arbitrary letter from its set as its label
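The two passes above can be sketched in a few lines of Python; the nested-pair tree encoding and the example leaves (a, c, t, a, matching the earlier example slide) are illustrative assumptions:

```python
# A sketch of both passes of Fitch's algorithm on a nested-pair tree.
def fitch_up(node):
    """Bottom-up pass: return (letter set, annotated subtree, #mutations)."""
    if isinstance(node, str):                      # leaf: its observed character
        return {node}, node, 0
    su, u, cu = fitch_up(node[0])
    sw, w, cw = fitch_up(node[1])
    if su & sw:                                    # non-empty intersection
        return su & sw, (su & sw, u, w), cu + cw
    return su | sw, (su | sw, u, w), cu + cw + 1   # union: one extra mutation

def fitch_down(node, parent_label=None):
    """Top-down pass: keep the parent's label whenever possible."""
    if isinstance(node, str):
        return node
    s, u, w = node
    label = parent_label if parent_label in s else min(s)  # "arbitrary" but deterministic
    return (label, fitch_down(u, label), fitch_down(w, label))

root_set, annotated, score = fitch_up((("a", "c"), ("t", "a")))
print(score)                  # small parsimony score
print(fitch_down(annotated))  # labeled tree
```

The mutation count accumulated during the bottom-up pass (one per union step) is exactly the small parsimony score; here it is 2.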

  21. Fitch Algorithm (cont.)

  22. Fitch vs. Sankoff • Both have an O(nk) runtime • Are they actually different? • Let’s compare …

  23. Fitch As seen previously:

  24. Comparison of Fitch and Sankoff • As seen earlier, the scoring matrix for the Fitch algorithm is merely 0 on the diagonal and 1 everywhere else • So let’s do the same problem using the Sankoff algorithm and this scoring matrix

  25. Sankoff

  26. Sankoff vs. Fitch • The Sankoff algorithm gives the same set of optimal labels as the Fitch algorithm • For the Sankoff algorithm, character t is optimal for vertex v if s_t(v) = min_{1≤i≤k} s_i(v) • Let S_v = set of optimal letters for v. • Then • S_v = S_u ∩ S_w if the intersection is non-empty; S_u ∪ S_w otherwise • This is also the Fitch recurrence • The two algorithms are identical

  27. Large Parsimony Problem • Input: An n x m matrix M describing n species, each represented by an m-character string • Output: A tree T with n leaves labeled by the n rows of matrix M, and a labeling of the internal vertices such that the parsimony score is minimized over all possible trees and all possible labelings of internal vertices

  28. Large Parsimony Problem (cont.) • Possible search space is huge, especially as n increases • (2n – 3)!! possible rooted trees • (2n – 5)!! possible unrooted trees • Problem is NP-complete • Exhaustive search only possible with small n (< 10) • Hence, branch and bound or heuristics are used
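As a quick sanity check on how fast this search space grows, a short Python snippet tabulates both counts (the double-factorial helper is an illustrative implementation):

```python
# (2n - 3)!! rooted and (2n - 5)!! unrooted binary trees on n leaves.
def double_factorial(m):
    result = 1
    while m > 1:       # multiply m, m-2, m-4, ... down to 1
        result *= m
        m -= 2
    return result

for n in range(3, 11):
    print(n, double_factorial(2 * n - 3), double_factorial(2 * n - 5))
```

Already at n = 10 there are 17!! = 34,459,425 rooted trees, which is why exhaustive search stops being feasible around n ≈ 10.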

  29. Nearest Neighbor Interchange: A Greedy Algorithm • A branch swapping algorithm • Only evaluates a subset of all possible trees • Defines a neighbor of a tree as one reachable by a nearest neighbor interchange • A rearrangement of the four subtrees defined by one internal edge • Only three different rearrangements per edge

  30. Nearest Neighbor Interchange

  31. Nearest Neighbor Interchange • Start with an arbitrary tree and check its neighbors • Move to a neighbor if it provides the best improvement in parsimony score • No way of knowing if the result is the most parsimonious tree • Could get stuck in a local optimum

  32. Nearest Neighbor Interchange

  33. Subtree Pruning and Regrafting: Another Branch Swapping Algorithm http://artedi.ebc.uu.se/course/BioInfo-10p-2001/Phylogeny/Phylogeny-TreeSearch/SPR.gif

  34. Tree Bisection and Reconnection: Another Branch Swapping Algorithm • Most extensive swapping routine

  35. Homoplasy • Given: • 1: CAGCAGCAG • 2: CAGCAGCAG • 3: CAGCAGCAGCAG • 4: CAGCAGCAG • 5: CAGCAGCAG • 6: CAGCAGCAG • 7: CAGCAGCAGCAG • Most would group 1, 2, 4, 5, and 6 as having evolved from a common ancestor, with a single mutation leading to the presence of 3 and 7

  36. Homoplasy • But what if this was the real tree?

  37. Homoplasy • 6 evolved separately from 4 and 5 • Parsimony groups 4, 5, and 6 together as having evolved from a common ancestor • Homoplasy: Independent (or parallel) evolution of same/similar characters • Parsimony results minimize homoplasy, so if homoplasy is common, parsimony may give wrong results

  38. Contradicting Characters • An evolutionary tree is more likely to be correct when it is supported by multiple characters [figure: tree grouping Human and Dog under MAMMALIA, supported by hair, a single bone in the lower jaw, lactation, etc.; Lizard and Frog branch off earlier] • Note: In this case, tails are homoplastic

  39. Perfect Phylogeny • Evolutionary model • Binary characters {0, 1} • Each character changes state only once in evolutionary history (no homoplasy!) • Tree in which every mutation is on an edge of the tree • All the species in one subtree contain a 0, and all species in the other contain a 1 • For simplicity, assume root = (0, 0, 0, 0, 0) • How can one reconstruct such a tree?

  Species × traits:
        1 2 3 4 5
    A   1 1 0 0 0
    B   0 0 1 0 0
    C   1 1 0 1 0
    D   0 0 1 0 1
    E   1 0 0 0 0

  40. The 4-Gamete Condition • A column i partitions the set of species into two sets i0 and i1 • A column is homogeneous w.r.t. a set of species if it has the same value for all species in the set; otherwise, it is heterogeneous • Example: i is heterogeneous w.r.t. {A, D, E}

        i
    A   0
    B   0   i0 = {A, B, C}
    C   0
    D   1
    E   1   i1 = {D, E, F}
    F   1

  41. 4-Gamete Condition There exists a perfect phylogeny if and only if, for every pair of columns (i, j), j is homogeneous w.r.t. i0 or i1. Equivalently, there exists a perfect phylogeny if and only if, for every pair of columns (i, j), the four row patterns (0,0), (0,1), (1,0), (1,1) do not all appear.
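The condition is easy to test directly. A sketch in Python, with the layout assumption that rows are species and columns are characters; the example matrix is the one from the perfect phylogeny slide:

```python
# A perfect phylogeny exists iff no pair of columns exhibits all four
# gametes (0,0), (0,1), (1,0), (1,1).
from itertools import combinations

def four_gamete_ok(matrix):
    cols = list(zip(*matrix))                      # transpose to columns
    for i, j in combinations(range(len(cols)), 2):
        if len(set(zip(cols[i], cols[j]))) == 4:   # all four pairs occur
            return False
    return True

# The 5-species example matrix from these slides (rows A..E).
M = [[1, 1, 0, 0, 0],
     [0, 0, 1, 0, 0],
     [1, 1, 0, 1, 0],
     [0, 0, 1, 0, 1],
     [1, 0, 0, 0, 0]]
print(four_gamete_ok(M))   # True: a perfect phylogeny exists
```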

  42. 4-Gamete Condition: Proof (only if) Every perfect phylogeny satisfies the 4-gamete condition • Depending on which edge mutation j occurs on, either i0 or i1 must be homogeneous. (if) If the 4-gamete condition is satisfied, does a perfect phylogeny exist? We need to give an algorithm…

  43. An algorithm for constructing a perfect phylogeny • We will consider the case where 0 is the ancestral state and 1 is the mutated state; this will be fixed later. • In any tree, each node (except the root) has a single parent. • It therefore suffices to construct a parent for every node. • In each step, we add a column and refine some of the nodes containing multiple children. • Stop once all columns have been considered.

  44. Inclusion Property • For any pair of columns i, j: i < j if and only if i1 ⊇ j1 • Note that if i < j, then the edge containing mutation i is an ancestor of the edge containing mutation j

  45. Example [figure: star tree with root r and leaves A, B, C, D, E]

        1 2 3 4 5
    A   1 1 0 0 0
    B   0 0 1 0 0
    C   1 1 0 1 0
    D   0 0 1 0 1
    E   1 0 0 0 0

  Initially, there is a single clade r, and each node has r as its parent

  46. Sort Columns • Sort columns according to the inclusion property: i < j if and only if i1 ⊇ j1 • This can be achieved by considering the columns as binary representations of numbers (most significant bit in row 1) and sorting in decreasing order

        1 2 3 4 5
    A   1 1 0 0 0
    B   0 0 1 0 0
    C   1 1 0 1 0
    D   0 0 1 0 1
    E   1 0 0 0 0
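The sort itself is a one-liner in Python. The matrix is the 5-species example from these slides, and reading each column top to bottom as a binary number is the encoding described above:

```python
# Sort columns in decreasing order of their value as binary numbers
# (most significant bit in the first row), giving the inclusion order.
M = [[1, 1, 0, 0, 0],
     [0, 0, 1, 0, 0],
     [1, 1, 0, 1, 0],
     [0, 0, 1, 0, 1],
     [1, 0, 0, 0, 0]]

cols = list(zip(*M))
order = sorted(range(len(cols)),
               key=lambda c: int("".join(map(str, cols[c])), 2),
               reverse=True)
print([c + 1 for c in order])   # 1-based column order
```

For this matrix the columns happen to be in inclusion order already: their binary values are 21, 20, 10, 4, 2.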

  47. Add First Column • In adding column i: • Check each edge and decide on which side of it column i belongs • Finally, add a node if you can resolve a clade [figure: root r keeps children B and D; a new node u under r groups A, C, E]

        1 2 3 4 5
    A   1 1 0 0 0
    B   0 0 1 0 0
    C   1 1 0 1 0
    D   0 0 1 0 1
    E   1 0 0 0 0

  48. Adding Other Columns • Add the other columns on edges using the ordering property [figure: final tree; from r, edge 1 leads to the clade {A, C, E} (E branches off, edge 2 leads to A and C, edge 4 to C), and edge 3 leads to {B, D} (edge 5 to D)]

        1 2 3 4 5
    A   1 1 0 0 0
    B   0 0 1 0 0
    C   1 1 0 1 0
    D   0 0 1 0 1
    E   1 0 0 0 0

  49. Unrooted case • Switch the values in each column, so that 0 is the majority element. • Apply the algorithm for the rooted case
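This preprocessing step can be sketched in Python; the small example matrix is an illustrative assumption:

```python
# Flip any column in which 1 is the majority value, so that 0 becomes
# the majority element, then the rooted algorithm can be applied.
def flip_to_majority_zero(matrix):
    cols = []
    for col in zip(*matrix):
        if 2 * sum(col) > len(col):          # more 1s than 0s: flip the column
            col = tuple(1 - x for x in col)
        cols.append(col)
    return [list(row) for row in zip(*cols)]  # transpose back to rows

print(flip_to_majority_zero([[1, 1, 0],
                             [1, 0, 0],
                             [0, 1, 1]]))
```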

  50. Problems with Parsimony Ignores branch lengths on trees [figure: two trees, each with leaves A, A, A, C; one has a long branch leading to the C leaf] Same parsimony score. Mutation “more likely” on longer branch.
