Learn how phylogenetic trees are built from DNA sequences using distance, parsimony, and likelihood methods. Explore algorithms like Sankoff and Fitch for tree labeling.
CSCI2950-C Lecture 8: Molecular Phylogeny: Parsimony and Likelihood
http://cs.brown.edu/courses/csci2950-c/
Phylogenetic Trees How are these trees built from DNA sequences? • Leaves represent existing species • Internal vertices represent ancestors • Root represents the oldest evolutionary ancestor
Phylogenetic Trees How are these trees built from DNA sequences? Methods • Distance • Parsimony: minimum number of mutations • Likelihood: probabilistic model of mutations
Outline • Last Lecture: Distance-based methods • Additive distances • 4-point condition • UPGMA & Neighbor joining • Today: • Parsimony-based methods • Sankoff's and Fitch's algorithms • Likelihood methods • Perfect phylogeny
Weighted Small Parsimony Problem: Formulation • Input: Tree T with each leaf labeled by elements of a k-letter alphabet, and a k × k scoring matrix δ(i, j) • Output: Labeling of internal vertices of the tree T minimizing the weighted parsimony score (the sum of δ over all edges)
Sankoff Algorithm: Dynamic Programming • Calculate and keep track of a score for every possible label at each vertex • s_t(v) = minimum parsimony score of the subtree rooted at vertex v if v has character t • The score at each vertex is based on the scores of its children: • s_t(parent) = min_i {s_i(left child) + δ(i, t)} + min_j {s_j(right child) + δ(j, t)}
Sankoff Algorithm (cont.) • Begin at the leaves: • If the leaf has the character in question, the score is 0 • Else, the score is ∞
Sankoff Algorithm (cont.) s_t(v) = min_i {s_i(u) + δ(i, t)} + min_j {s_j(w) + δ(j, t)} For t = A: s_A(v) = min_i {s_i(u) + δ(i, A)} + min_j {s_j(w) + δ(j, A)}; the left-child term min_i {s_i(u) + δ(i, A)} = 0
Sankoff Algorithm (cont.) s_t(v) = min_i {s_i(u) + δ(i, t)} + min_j {s_j(w) + δ(j, t)} Adding the right-child term: s_A(v) = 0 + 9 = 9
Sankoff Algorithm (cont.) s_t(v) = min_i {s_i(u) + δ(i, t)} + min_j {s_j(w) + δ(j, t)} Repeat for T, G, and C
Sankoff Algorithm (cont.) Repeat for right subtree
Sankoff Algorithm (cont.) Repeat for root
Sankoff Algorithm (cont.) The smallest score at the root is the minimum weighted parsimony score. In this case it is 9, so label the root with T.
Sankoff Algorithm: Traveling down the Tree • The scores at the root vertex have been computed by going up the tree • After the scores at the root vertex are computed, the Sankoff algorithm moves down the tree and assigns each vertex its optimal character.
Sankoff Algorithm (cont.) The 9 is derived from 7 + 2, so the left child is labeled T and the right child is labeled T
Sankoff Algorithm (cont.) And the tree is thus labeled…
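The bottom-up scoring pass above can be sketched in a few lines of Python. The nested-tuple tree encoding, the example tree, and the unit-cost `delta` are illustrative assumptions, not the lecture's own figure:

```python
# A minimal sketch of the Sankoff bottom-up pass for one character.
# Assumptions (not from the slides): a tree is a nested tuple, where a
# leaf is its observed character ("A") and an internal node is a pair
# (left, right); delta is the unit-cost scoring matrix.
INF = float("inf")
ALPHABET = "ACGT"

def delta(i, t):
    return 0 if i == t else 1            # swap in any k x k matrix

def sankoff_scores(node):
    """Return {t: minimum parsimony score of the subtree if node has t}."""
    if isinstance(node, str):            # leaf: 0 for its character, inf else
        return {t: (0 if t == node else INF) for t in ALPHABET}
    left, right = node
    sl, sr = sankoff_scores(left), sankoff_scores(right)
    return {t: min(sl[i] + delta(i, t) for i in ALPHABET)
             + min(sr[j] + delta(j, t) for j in ALPHABET)
            for t in ALPHABET}

tree = (("T", "C"), (("A", "T"), "G"))   # hypothetical example tree
scores = sankoff_scores(tree)
best = min(scores.values())              # minimum weighted parsimony score
# Traceback (not shown): label the root with an argmin state, then move
# down the tree, reusing the minimizing i and j at each vertex.
```

For this example tree the minimum score is 3, achieved by labeling the root T.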
Fitch’s Algorithm • Solves the Small Parsimony problem • Published in 1971, four years before Sankoff's algorithm • Assigns a set of letters S(v) to every vertex v in the tree • S(l) = observed character for each leaf l
Fitch’s Algorithm: Example a c t a {a,c} {t,a} c t a a a a a a {a,c} {t,a} a a c t a a c t
Fitch Algorithm 1) Assign a set of possible letters S_v to every vertex v, traversing the tree from leaves to root • For a vertex v with children u and w: • S_v = S_u ∩ S_w if the intersection is non-empty; S_u ∪ S_w otherwise • E.g. if the node we are looking at has a left child labeled {A, C} and a right child labeled {A, T}, the intersection {A} is non-empty, so the node will be given the set {A}
Fitch Algorithm (cont.) 2) Assign labels to each vertex, traversing the tree from root to leaves • Assign root arbitrarily from its set of letters • For all other vertices, if its parent’s label is in its set of letters, assign it its parent’s label • Else, choose an arbitrary letter from its set as its label
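Both passes fit in a short sketch, using the same hypothetical nested-tuple tree encoding as before (a leaf is its letter, an internal node is a (left, right) pair):

```python
# A minimal sketch of Fitch's algorithm for one character.
def fitch(node):
    """Bottom-up pass: return (S(v), parsimony score of the subtree)."""
    if isinstance(node, str):
        return {node}, 0                 # S(leaf) = observed character
    (su, cu), (sw, cw) = fitch(node[0]), fitch(node[1])
    inter = su & sw
    if inter:                            # non-empty intersection
        return inter, cu + cw
    return su | sw, cu + cw + 1          # union: costs one extra mutation

root_set, score = fitch((("T", "C"), (("A", "T"), "G")))
# Top-down pass: give the root any letter from root_set; each child keeps
# its parent's letter if it is in the child's set, else picks any of its own.
```

Note that the parsimony score is exactly the number of union steps taken on the way up.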
Fitch vs. Sankoff • Fitch runs in O(nk) time; Sankoff runs in O(nk²), since each vertex minimizes over all k² pairs of child states • Are they actually different? • Let's compare …
Fitch As seen previously:
Comparison of Fitch and Sankoff • As seen earlier, the scoring matrix for the Fitch algorithm is merely δ(i, j) = 0 if i = j, and 1 otherwise • So let's do the same problem using the Sankoff algorithm and this scoring matrix
Sankoff vs. Fitch • The Sankoff algorithm gives the same set of optimal labels as the Fitch algorithm • For the Sankoff algorithm, character t is optimal for vertex v if s_t(v) = min_{1 ≤ i ≤ k} s_i(v) • Let S_v = set of optimal letters for v. Then • S_v = S_u ∩ S_w if the intersection is non-empty; S_u ∪ S_w otherwise • This is also the Fitch recurrence • The two algorithms are identical
Large Parsimony Problem • Input: An n x m matrix M describing n species, each represented by an m-character string • Output: A tree T with n leaves labeled by the n rows of matrix M, and a labeling of the internal vertices such that the parsimony score is minimized over all possible trees and all possible labelings of internal vertices
Large Parsimony Problem (cont.) • The search space is huge, especially as n increases • (2n – 3)!! possible rooted trees • (2n – 5)!! possible unrooted trees • The problem is NP-complete • Exhaustive search is only possible with small n (< 10) • Hence, branch-and-bound or heuristics are used
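The double-factorial counts above can be made concrete; `rooted_trees` and `unrooted_trees` are hypothetical helper names:

```python
# Number of distinct tree topologies on n labeled leaves, using the
# double-factorial formulas from the slide.
def double_factorial(m):
    out = 1
    while m > 1:
        out *= m
        m -= 2
    return out

def rooted_trees(n):                     # (2n - 3)!! rooted topologies
    return double_factorial(2 * n - 3)

def unrooted_trees(n):                   # (2n - 5)!! unrooted topologies
    return double_factorial(2 * n - 5)

# n = 10 species already give 17!! = 34,459,425 rooted trees, which is
# why exhaustive search stops being practical around there.
```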
Nearest Neighbor Interchange: A Greedy Algorithm • A branch-swapping algorithm • Only evaluates a subset of all possible trees • Defines a neighbor of a tree as one reachable by a nearest neighbor interchange • A rearrangement of the four subtrees defined by one internal edge • Only three different rearrangements per edge
Nearest Neighbor Interchange • Start with an arbitrary tree and check its neighbors • Move to a neighbor if it provides the best improvement in parsimony score • No way of knowing if the result is the most parsimonious tree • Could get stuck in a local optimum
Subtree Pruning and Regrafting: Another Branch Swapping Algorithm http://artedi.ebc.uu.se/course/BioInfo-10p-2001/Phylogeny/Phylogeny-TreeSearch/SPR.gif
Tree Bisection and Reconnection Another Branch Swapping Algorithm • Most extensive swapping routine
Homoplasy • Given: • 1: CAGCAGCAG • 2: CAGCAGCAG • 3: CAGCAGCAGCAG • 4: CAGCAGCAG • 5: CAGCAGCAG • 6: CAGCAGCAG • 7: CAGCAGCAGCAG • Most would group 1, 2, 4, 5, and 6 as having evolved from a common ancestor, with a single mutation leading to the presence of 3 and 7
Homoplasy • But what if this was the real tree?
Homoplasy • 6 evolved separately from 4 and 5 • Parsimony groups 4, 5, and 6 together as having evolved from a common ancestor • Homoplasy: Independent (or parallel) evolution of same/similar characters • Parsimony results minimize homoplasy, so if homoplasy is common, parsimony may give wrong results
Contradicting Characters • An evolutionary tree is more likely to be correct when it is supported by multiple characters (figure: Frog, Lizard, Human, Dog; the clade MAMMALIA is supported by hair, a single bone in the lower jaw, lactation, etc.) • Note: In this case, tails are homoplastic
Perfect Phylogeny • Evolutionary model: • Binary characters {0, 1} • Each character changes state only once in evolutionary history (no homoplasy!) • Tree in which every mutation is on an edge of the tree: all the species in one subtree contain a 0, and all species in the other contain a 1 • For simplicity, assume root = (0, 0, 0, 0, 0) • How can one reconstruct such a tree?

Species × traits matrix:
  1 2 3 4 5
A 1 1 0 0 0
B 0 0 1 0 0
C 1 1 0 1 0
D 0 0 1 0 1
E 1 0 0 0 0
The 4-Gamete Condition • A column i partitions the set of species into two sets i0 and i1 (those with value 0 and those with value 1 in column i) • A column is homogeneous w.r.t. a set of species if it has the same value for all species in the set; otherwise, it is heterogeneous • Example: for the column i with values A = 0, B = 0, C = 0, D = 1, E = 1, F = 1, we have i0 = {A, B, C} and i1 = {D, E, F}; i is heterogeneous w.r.t. {A, D, E}
4-Gamete Condition There exists a perfect phylogeny if and only if for every pair of columns (i, j), j is homogeneous w.r.t. i0 or i1. Equivalently: there exists a perfect phylogeny if and only if no pair of columns (i, j) contains all four of the row pairs (0,0), (0,1), (1,0), (1,1)
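The pairwise test translates directly into code. This is a sketch of the four-gamete check on a 0/1 matrix whose rows are species and whose columns are characters, using the example matrix from the slides:

```python
from itertools import combinations

def has_perfect_phylogeny(M):
    """Return True iff no pair of columns shows all four gametes."""
    for i, j in combinations(range(len(M[0])), 2):
        gametes = {(row[i], row[j]) for row in M}
        if len(gametes) == 4:            # all of 00, 01, 10, 11 occur
            return False
    return True

M = [[1, 1, 0, 0, 0],   # A
     [0, 0, 1, 0, 0],   # B
     [1, 1, 0, 1, 0],   # C
     [0, 0, 1, 0, 1],   # D
     [1, 0, 0, 0, 0]]   # E
# This matrix passes the test, so a perfect phylogeny exists for it.
```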
4-Gamete Condition: Proof (only if) Every perfect phylogeny satisfies the 4-gamete condition • Depending on which edge mutation j occurs on, either i0 or i1 is homogeneous. (if) If the 4-gamete condition is satisfied, does a perfect phylogeny exist? We need to give an algorithm…
An algorithm for constructing a perfect phylogeny • We will consider the case where 0 is the ancestral state and 1 is the mutated state; this will be fixed later. • In any tree, each node (except the root) has a single parent. • It is sufficient to construct a parent for every node. • In each step, we add a column and refine some of the nodes containing multiple children. • Stop when all columns have been considered.
Inclusion Property • For any pair of columns i, j: i < j if and only if i1 ⊇ j1 • Note that if i < j, then the edge containing mutation i is an ancestor of the edge containing mutation j
Example Using the matrix above: initially, there is a single clade r, and each node (A, B, C, D, E) has r as its parent
Sort columns • Sort columns according to the inclusion property: i < j if and only if i1 ⊇ j1 • This can be achieved by considering the columns as binary representations of numbers (most significant bit in the first row) and sorting in decreasing order
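The binary-number sort can be sketched on the example matrix; the dictionary encoding and `column_value` helper are assumptions for illustration:

```python
# Sort the columns of the example matrix by decreasing binary value,
# reading species A's row as the most significant bit.
M = {"A": [1, 1, 0, 0, 0],
     "B": [0, 0, 1, 0, 0],
     "C": [1, 1, 0, 1, 0],
     "D": [0, 0, 1, 0, 1],
     "E": [1, 0, 0, 0, 0]}
species = ["A", "B", "C", "D", "E"]

def column_value(j):
    bits = "".join(str(M[s][j]) for s in species)
    return int(bits, 2)                  # e.g. column 1 reads 10101 = 21

order = sorted(range(5), key=column_value, reverse=True)
# Here the columns happen to be in sorted order already, so column 1
# (the largest value) is the mutation closest to the root.
```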
Add first column • In adding column i: • Check each edge and decide which side you belong to • Finally, add a node if you can resolve a clade
Adding other columns • Add the other columns on edges using the ordering property
Unrooted case • Switch the values in each column, so that 0 is the majority element. • Apply the algorithm for the rooted case
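The preprocessing step for the unrooted case is a simple column flip; `flip_to_majority_zero` is a hypothetical name for the sketch:

```python
# Flip each column in which 1 is the majority, so that 0 becomes the
# (assumed ancestral) majority state in every column.
def flip_to_majority_zero(M):
    n = len(M)
    out = [row[:] for row in M]
    for j in range(len(M[0])):
        ones = sum(row[j] for row in M)
        if ones > n - ones:              # 1s outnumber 0s in column j
            for row in out:
                row[j] ^= 1              # flip 0 <-> 1
    return out
# The rooted-case algorithm is then applied to the flipped matrix.
```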
Problems with Parsimony • Ignores branch lengths on trees (figure: two trees over leaves A, A, A, C with the same parsimony score, but the mutation is "more likely" on the longer branch)