Learn how phylogenetic trees are built from DNA sequences using distance, parsimony, and likelihood methods. Explore algorithms like Sankoff and Fitch for tree labeling.
CSCI2950-C Lecture 8: Molecular Phylogeny: Parsimony and Likelihood
http://cs.brown.edu/courses/csci2950-c/
Phylogenetic Trees How are these trees built from DNA sequences? • Leaves represent existing species • Internal vertices represent ancestors • Root represents the oldest evolutionary ancestor
Phylogenetic Trees How are these trees built from DNA sequences? Methods • Distance • Parsimony: minimum number of mutations • Likelihood: probabilistic model of mutations
Outline • Last Lecture: Distance-based methods • Additive distances • 4-point condition • UPGMA & Neighbor joining • Today: • Parsimony-based methods • Sankoff's and Fitch's algorithms • Likelihood methods • Perfect phylogeny
Weighted Small Parsimony Problem: Formulation • Input: Tree T with each leaf labeled by elements of a k-letter alphabet, and a k × k scoring matrix δ(i, j) • Output: Labeling of internal vertices of the tree T minimizing the weighted parsimony score (the sum of δ over all edges)
Sankoff Algorithm: Dynamic Programming • Calculate and keep track of a score for every possible label at each vertex • s_t(v) = minimum parsimony score of the subtree rooted at vertex v if v has character t • The score at each vertex is based on the scores of its children: • s_t(parent) = min_i {s_i(left child) + δ(i, t)} + min_j {s_j(right child) + δ(j, t)}
Sankoff Algorithm (cont.) • Begin at the leaves: • If the leaf has the character in question, the score is 0 • Else, the score is ∞
Sankoff Algorithm (cont.) s_t(v) = min_i {s_i(u) + δ(i, t)} + min_j {s_j(w) + δ(j, t)} For t = A: s_A(v) = min_i {s_i(u) + δ(i, A)} + min_j {s_j(w) + δ(j, A)}; the left-child term min_i {s_i(u) + δ(i, A)} = 0
Sankoff Algorithm (cont.) s_t(v) = min_i {s_i(u) + δ(i, t)} + min_j {s_j(w) + δ(j, t)} Adding the right-child term: s_A(v) = 0 + 9 = 9
Sankoff Algorithm (cont.) s_t(v) = min_i {s_i(u) + δ(i, t)} + min_j {s_j(w) + δ(j, t)} Repeat for T, G, and C
Sankoff Algorithm (cont.) Repeat for right subtree
Sankoff Algorithm (cont.) Repeat for root
Sankoff Algorithm (cont.) The smallest score at the root is the minimum weighted parsimony score. In this case it is 9, so label the root with T.
Sankoff Algorithm: Traveling down the Tree • The scores at the root vertex have been computed by going up the tree • After the scores at the root vertex are computed, the Sankoff algorithm moves down the tree and assigns each vertex its optimal character.
Sankoff Algorithm (cont.) The 9 is derived from 7 + 2, so the left child is labeled T and the right child is labeled T
Sankoff Algorithm (cont.) And the tree is thus labeled…
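The bottom-up scoring pass above can be sketched in a few lines of Python. The nested-tuple tree encoding, the example tree, and the unit-cost `delta` are illustrative assumptions, not the lecture's own figure:

```python
# A minimal sketch of the Sankoff bottom-up pass for one character.
# Assumptions (not from the slides): a tree is a nested tuple, where a
# leaf is its observed character ("A") and an internal node is a pair
# (left, right); delta is the unit-cost scoring matrix.
INF = float("inf")
ALPHABET = "ACGT"

def delta(i, t):
    return 0 if i == t else 1            # swap in any k x k matrix

def sankoff_scores(node):
    """Return {t: minimum parsimony score of the subtree if node has t}."""
    if isinstance(node, str):            # leaf: 0 for its character, inf else
        return {t: (0 if t == node else INF) for t in ALPHABET}
    left, right = node
    sl, sr = sankoff_scores(left), sankoff_scores(right)
    return {t: min(sl[i] + delta(i, t) for i in ALPHABET)
             + min(sr[j] + delta(j, t) for j in ALPHABET)
            for t in ALPHABET}

tree = (("T", "C"), (("A", "T"), "G"))   # hypothetical example tree
scores = sankoff_scores(tree)
best = min(scores.values())              # minimum weighted parsimony score
# Traceback (not shown): label the root with an argmin state, then move
# down the tree, reusing the minimizing i and j at each vertex.
```

For this example tree the minimum score is 3, achieved by labeling the root T.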
Fitch’s Algorithm • Solves the Small Parsimony problem • Published in 1971, four years before Sankoff's algorithm • Assigns a set of letters S(v) to every vertex v in the tree • S(l) = observed character for each leaf l
Fitch’s Algorithm: Example a c t a {a,c} {t,a} c t a a a a a a {a,c} {t,a} a a c t a a c t
Fitch Algorithm 1) Assign a set of possible letters S_v to every vertex v, traversing the tree from leaves to root • For a vertex v with children u and w: • S_v = S_u ∩ S_w if the intersection is non-empty; S_u ∪ S_w otherwise • E.g. if the node we are looking at has a left child labeled {A, C} and a right child labeled {A, T}, the intersection {A} is non-empty, so the node will be given the set {A}
Fitch Algorithm (cont.) 2) Assign labels to each vertex, traversing the tree from root to leaves • Assign root arbitrarily from its set of letters • For all other vertices, if its parent’s label is in its set of letters, assign it its parent’s label • Else, choose an arbitrary letter from its set as its label
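Both passes fit in a short sketch, using the same hypothetical nested-tuple tree encoding as before (a leaf is its letter, an internal node is a (left, right) pair):

```python
# A minimal sketch of Fitch's algorithm for one character.
def fitch(node):
    """Bottom-up pass: return (S(v), parsimony score of the subtree)."""
    if isinstance(node, str):
        return {node}, 0                 # S(leaf) = observed character
    (su, cu), (sw, cw) = fitch(node[0]), fitch(node[1])
    inter = su & sw
    if inter:                            # non-empty intersection
        return inter, cu + cw
    return su | sw, cu + cw + 1          # union: costs one extra mutation

root_set, score = fitch((("T", "C"), (("A", "T"), "G")))
# Top-down pass: give the root any letter from root_set; each child keeps
# its parent's letter if it is in the child's set, else picks any of its own.
```

Note that the parsimony score is exactly the number of union steps taken on the way up.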
Fitch vs. Sankoff • Fitch runs in O(nk) time; Sankoff runs in O(nk²), since each vertex minimizes over all k² pairs of child states • Are they actually different? • Let's compare …
Fitch As seen previously:
Comparison of Fitch and Sankoff • As seen earlier, the scoring matrix for the Fitch algorithm is merely δ(i, j) = 0 if i = j, and 1 otherwise • So let's do the same problem using the Sankoff algorithm and this scoring matrix
Sankoff vs. Fitch • The Sankoff algorithm gives the same set of optimal labels as the Fitch algorithm • For the Sankoff algorithm, character t is optimal for vertex v if s_t(v) = min_{1 ≤ i ≤ k} s_i(v) • Let S_v = set of optimal letters for v. Then • S_v = S_u ∩ S_w if the intersection is non-empty; S_u ∪ S_w otherwise • This is also the Fitch recurrence • The two algorithms are identical
Large Parsimony Problem • Input: An n x m matrix M describing n species, each represented by an m-character string • Output: A tree T with n leaves labeled by the n rows of matrix M, and a labeling of the internal vertices such that the parsimony score is minimized over all possible trees and all possible labelings of internal vertices
Large Parsimony Problem (cont.) • The search space is huge, especially as n increases • (2n – 3)!! possible rooted trees • (2n – 5)!! possible unrooted trees • The problem is NP-complete • Exhaustive search is only possible with small n (< 10) • Hence, branch-and-bound or heuristics are used
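The double-factorial counts above can be made concrete; `rooted_trees` and `unrooted_trees` are hypothetical helper names:

```python
# Number of distinct tree topologies on n labeled leaves, using the
# double-factorial formulas from the slide.
def double_factorial(m):
    out = 1
    while m > 1:
        out *= m
        m -= 2
    return out

def rooted_trees(n):                     # (2n - 3)!! rooted topologies
    return double_factorial(2 * n - 3)

def unrooted_trees(n):                   # (2n - 5)!! unrooted topologies
    return double_factorial(2 * n - 5)

# n = 10 species already give 17!! = 34,459,425 rooted trees, which is
# why exhaustive search stops being practical around there.
```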
Nearest Neighbor Interchange: A Greedy Algorithm • A branch-swapping algorithm • Only evaluates a subset of all possible trees • Defines a neighbor of a tree as one reachable by a nearest neighbor interchange • A rearrangement of the four subtrees defined by one internal edge • Only three different rearrangements per edge
Nearest Neighbor Interchange • Start with an arbitrary tree and check its neighbors • Move to a neighbor if it provides the best improvement in parsimony score • No way of knowing if the result is the most parsimonious tree • Could get stuck in a local optimum
Subtree Pruning and Regrafting: Another Branch Swapping Algorithm http://artedi.ebc.uu.se/course/BioInfo-10p-2001/Phylogeny/Phylogeny-TreeSearch/SPR.gif
Tree Bisection and Reconnection Another Branch Swapping Algorithm • Most extensive swapping routine
Homoplasy • Given: • 1: CAGCAGCAG • 2: CAGCAGCAG • 3: CAGCAGCAGCAG • 4: CAGCAGCAG • 5: CAGCAGCAG • 6: CAGCAGCAG • 7: CAGCAGCAGCAG • Most would group 1, 2, 4, 5, and 6 as having evolved from a common ancestor, with a single mutation leading to the presence of 3 and 7
Homoplasy • But what if this was the real tree?
Homoplasy • 6 evolved separately from 4 and 5 • Parsimony groups 4, 5, and 6 together as having evolved from a common ancestor • Homoplasy: Independent (or parallel) evolution of same/similar characters • Parsimony results minimize homoplasy, so if homoplasy is common, parsimony may give wrong results
Contradicting Characters • An evolutionary tree is more likely to be correct when it is supported by multiple characters (figure: Frog, Lizard, Human, Dog; the clade MAMMALIA is supported by hair, a single bone in the lower jaw, lactation, etc.) • Note: In this case, tails are homoplastic
Perfect Phylogeny • Evolutionary model: • Binary characters {0, 1} • Each character changes state only once in evolutionary history (no homoplasy!) • Tree in which every mutation is on an edge of the tree: all the species in one subtree contain a 0, and all species in the other contain a 1 • For simplicity, assume root = (0, 0, 0, 0, 0) • How can one reconstruct such a tree?

Species × traits matrix:
  1 2 3 4 5
A 1 1 0 0 0
B 0 0 1 0 0
C 1 1 0 1 0
D 0 0 1 0 1
E 1 0 0 0 0
The 4-Gamete Condition • A column i partitions the set of species into two sets i0 and i1 (those with value 0 and those with value 1 in column i) • A column is homogeneous w.r.t. a set of species if it has the same value for all species in the set; otherwise, it is heterogeneous • Example: for the column i with values A = 0, B = 0, C = 0, D = 1, E = 1, F = 1, we have i0 = {A, B, C} and i1 = {D, E, F}; i is heterogeneous w.r.t. {A, D, E}
4-Gamete Condition There exists a perfect phylogeny if and only if for every pair of columns (i, j), j is homogeneous w.r.t. i0 or i1. Equivalently: there exists a perfect phylogeny if and only if no pair of columns (i, j) contains all four of the row pairs (0,0), (0,1), (1,0), (1,1)
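The pairwise test translates directly into code. This is a sketch of the four-gamete check on a 0/1 matrix whose rows are species and whose columns are characters, using the example matrix from the slides:

```python
from itertools import combinations

def has_perfect_phylogeny(M):
    """Return True iff no pair of columns shows all four gametes."""
    for i, j in combinations(range(len(M[0])), 2):
        gametes = {(row[i], row[j]) for row in M}
        if len(gametes) == 4:            # all of 00, 01, 10, 11 occur
            return False
    return True

M = [[1, 1, 0, 0, 0],   # A
     [0, 0, 1, 0, 0],   # B
     [1, 1, 0, 1, 0],   # C
     [0, 0, 1, 0, 1],   # D
     [1, 0, 0, 0, 0]]   # E
# This matrix passes the test, so a perfect phylogeny exists for it.
```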
4-Gamete Condition: Proof (only if) Every perfect phylogeny satisfies the 4-gamete condition • Depending on which edge mutation j occurs on, either i0 or i1 is homogeneous. (if) If the 4-gamete condition is satisfied, does a perfect phylogeny exist? We need to give an algorithm…
An algorithm for constructing a perfect phylogeny • We will consider the case where 0 is the ancestral state and 1 is the mutated state; this will be fixed later. • In any tree, each node (except the root) has a single parent. • It is sufficient to construct a parent for every node. • In each step, we add a column and refine some of the nodes containing multiple children. • Stop when all columns have been considered.
Inclusion Property • For any pair of columns i, j: i < j if and only if i1 ⊇ j1 • Note that if i < j, then the edge containing mutation i is an ancestor of the edge containing mutation j
Example Using the matrix above: initially, there is a single clade r, and each node (A, B, C, D, E) has r as its parent
Sort columns • Sort columns according to the inclusion property: i < j if and only if i1 ⊇ j1 • This can be achieved by considering the columns as binary representations of numbers (most significant bit in the first row) and sorting in decreasing order
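The binary-number sort can be sketched on the example matrix; the dictionary encoding and `column_value` helper are assumptions for illustration:

```python
# Sort the columns of the example matrix by decreasing binary value,
# reading species A's row as the most significant bit.
M = {"A": [1, 1, 0, 0, 0],
     "B": [0, 0, 1, 0, 0],
     "C": [1, 1, 0, 1, 0],
     "D": [0, 0, 1, 0, 1],
     "E": [1, 0, 0, 0, 0]}
species = ["A", "B", "C", "D", "E"]

def column_value(j):
    bits = "".join(str(M[s][j]) for s in species)
    return int(bits, 2)                  # e.g. column 1 reads 10101 = 21

order = sorted(range(5), key=column_value, reverse=True)
# Here the columns happen to be in sorted order already, so column 1
# (the largest value) is the mutation closest to the root.
```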
Add first column • In adding column i: • Check each edge and decide which side you belong to • Finally, add a node if you can resolve a clade
Adding other columns • Add the other columns on edges using the ordering property
Unrooted case • Switch the values in each column, so that 0 is the majority element. • Apply the algorithm for the rooted case
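The preprocessing step for the unrooted case is a simple column flip; `flip_to_majority_zero` is a hypothetical name for the sketch:

```python
# Flip each column in which 1 is the majority, so that 0 becomes the
# (assumed ancestral) majority state in every column.
def flip_to_majority_zero(M):
    n = len(M)
    out = [row[:] for row in M]
    for j in range(len(M[0])):
        ones = sum(row[j] for row in M)
        if ones > n - ones:              # 1s outnumber 0s in column j
            for row in out:
                row[j] ^= 1              # flip 0 <-> 1
    return out
# The rooted-case algorithm is then applied to the flipped matrix.
```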
Problems with Parsimony • Ignores branch lengths on trees (figure: two trees over leaves A, A, A, C with the same parsimony score, but the mutation is "more likely" on the longer branch)