1 / 39

Molecular Evolution and Phylogenetic Tree Reconstruction

1. 4. 3. 5. 2. 5. 2. 3. 1. 4. Molecular Evolution and Phylogenetic Tree Reconstruction. Phylogenetic Trees. Nodes: species Edges: time of independent evolution Edge length represents evolution time AKA genetic distance Not necessarily chronological time.

tovi
Download Presentation

Molecular Evolution and Phylogenetic Tree Reconstruction

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. 1 4 3 5 2 5 2 3 1 4 Molecular Evolution and Phylogenetic Tree Reconstruction

  2. Phylogenetic Trees • Nodes: species • Edges: time of independent evolution • Edge length represents evolution time • AKA genetic distance • Not necessarily chronological time

  3. Inferring Phylogenetic Trees Trees can be inferred by several criteria: • Morphology of the organisms • Can lead to mistakes! • Sequence comparison Example: Mouse: ACAGTGACGCCCCAAACGT Rat: ACAGTGACGCTACAAACGT Baboon: CCTGTGACGTAACAAACGA Chimp: CCTGTGACGTAGCAAACGA Human: CCTGTGACGTAGCAAACGA

  4. Inferring Phylogenetic Trees Sequence-based methods Deterministic (Parsimony) Probabilistic (SEMPHY) Distance-based methods UPGMA Neighbor-Joining Can compute distances from sequences

  5. Distance Between Two Sequences Basic principles: Degree of sequence difference is proportional to length of independent sequence evolution Only use positions where alignment is certain – avoid areas with (too many) gaps

  6. Distance Between Two Sequences Given sequences xi, xj, Define dij = distance between the two sequences One possible definition: dij = fraction f of sites u where xi[u]  xj[u] Better scores are derived by modeling evolution as a continuous change process

  7. Outline • Molecular Evolution • Distance Methods • UPGMA / Average Linkage • Neighbor-Joining • Sequence Methods • Deterministic (Parsimony) • Probabilistic (SEMPHY)

  8. Molecular Evolution Q: How can we model evolution on nucleotide level? (ignore gaps, focus on substitutions) A: Consider what happens at a specific position for small time interval Δt P(t) = vector of probabilities of {A,C,G,T} at time t μAC = rate of transition from A to C per unit time μA = μAC + μAG + μAT rate of transition out of A pA(t+Δt) = pA(t) – pA(t) μA Δt + pC(t) μCA Δt + …

  9. Molecular Evolution In matrix/vector notation, we get P(t+Δt) = P(t) + Q P(t) Δt where Q is the substitution rate matrix

  10. Molecular Evolution This is a differential equation: P’(t) = Q P(t) A substitution rate matrix Q implies a probability distribution over {A,C,G,T} at each position, including stationary (equilibrium) frequencies πA, πC, πG, πT Each Q is an evolutionary model (some work better than others)

  11. Evolutionary Models Jukes-Cantor Kimura Felsenstein HKY

  12. Estimating Distances Solve the differential equation and compute expected evolutionary time given sequences Jukes-Cantor Kimura

  13. Outline Molecular Evolution Distance Methods UPGMA / Average Linkage Neighbor-Joining Sequence Methods Deterministic (Parsimony) Probabilistic (SEMPHY)

  14. A simple clustering method for building tree UPGMA (unweighted pair group method using arithmetic averages) Or the Average Linkage Method Given two disjoint clusters Ci, Cj of sequences, 1 dij = ––––––––– {p Ci, q Cj}dpq |Ci|  |Cj| Claim that if Ck = Ci  Cj, then distance to another cluster Cl is: dil |Ci| + djl |Cj| dkl = –––––––––––––– |Ci| + |Cj|

  15. Algorithm: Average Linkage 1 4 Initialization: Assign each xi into its own cluster Ci Define one leaf per sequence, height 0 Iteration: Find two clusters Ci, Cj s.t. dij is min Let Ck = Ci  Cj Define node connecting Ci, Cj, and place it at height dij/2 Delete Ci, Cj Termination: When two clusters i, j remain, place root at height dij/2 3 5 2 5 2 3 1 4

  16. Average Linkage Example 4 3 2 1 v w x y z

  17. Ultrametric Distances and Molecular Clock Definition: A distance function d(.,.) is ultrametric if for any three distances dij dik  dij, it is true that dij dik = dij The Molecular Clock: The evolutionary distance between species x and y is 2 the Earth time to reach the nearest common ancestor That is, the molecular clock has constant rate in all species The molecular clock results in ultrametric distances years 1 4 2 3 5

  18. Ultrametric Distances & Average Linkage Average Linkage is guaranteed to reconstruct correctly a binary tree with ultrametric distances Proof: Exercise 5 1 4 2 3

  19. Weakness of Average Linkage Molecular clock: all species evolve at the same rate (Earth time) However, certain species (e.g., mouse, rat) evolve much faster Example where UPGMA messes up: AL tree Correct tree 3 2 1 3 4 4 2 1

  20. d1,4 Additive Distances 1 Given a tree, a distance measure is additive if the distance between any pair of leaves is the sum of lengths of edges connecting them Given a tree T & additive distances dij, can uniquely reconstruct edge lengths: • Find two neighboring leaves i, j, with common parent k • Place parent node k at distance dkm = ½ (dim + djm – dij) from any node m  i, j 4 12 8 3 13 7 9 5 11 10 6 2

  21. Reconstructing Additive Distances Given T x T D y 5 4 3 z 3 4 w 7 6 v If we know T and D, but do not know the length of each leaf, we can reconstruct those lengths

  22. Reconstructing Additive Distances Given T x T D y z w v

  23. Reconstructing Additive Distances Given T D x T y z a w D1 v dax = ½ (dvx + dwx – dvw) day = ½ (dvy + dwy – dvw) daz = ½ (dvz + dwz – dvw)

  24. Reconstructing Additive Distances Given T D1 x T y 5 4 b 3 z 3 a 4 c w 7 D2 6 d(a, c) = 3 d(b, c) = d(a, b) – d(a, c) = 3 d(c, z) = d(a, z) – d(a, c) = 7 d(b, x) = d(a, x) – d(a, b) = 5 d(b, y) = d(a, y) – d(a, b) = 4 d(a, w) = d(z, w) – d(a, z) = 4 d(a, v) = d(z, v) – d(a, z) = 6 Correct!!! v D3

  25. Neighbor-Joining • Guaranteed to produce the correct tree if distance is additive • May produce a good tree even when distance is not additive Step 1: Finding neighboring leaves Define Dij = (N – 2) dij – kidik – kjdjk Claim: The above “magic trick” ensures that Dij is minimal iffi, j are neighbors 1 3 0.1 0.1 0.1 0.4 0.4 4 2

  26. Algorithm: Neighbor-Joining Initialization: Define T to be the set of leaf nodes, one per sequence Let L = T Iteration: Pick i, j s.t. Dij is minimal Define a new node k, and set dkm = ½ (dim + djm – dij) for all m  L Add k to T, with edges of lengths dik = ½ (dij + ri – rj), djk = dij – dik where ri = (N – 2)-1kidik Remove i, j from L; Add k to L Termination: When L consists of two nodes, i, j, and the edge between them of length dij

  27. Outline Molecular Evolution Distance Methods UPGMA / Average Linkage Neighbor-Joining Sequence Methods Deterministic (Parsimony) Probabilistic (SEMPHY)

  28. Parsimony • One of the most popular methods: • GIVEN multiple alignment • FIND tree & history of substitutions explaining alignment Idea: Find the tree that explains the observed sequences with a minimal number of substitutions Two computational subproblems: • Find the parsimony cost of a given tree (easy) • Search through all tree topologies (hard)

  29. Example: Parsimony Cost of One Column {A} C = 1 {A} {A, B} C++ A B A A A A B A {B} {A} {A} {A}

  30. Parsimony Scoring Given a tree, and an alignment column u Label internal nodes to minimize the number of required substitutions Initialization: Set cost C = 0; node k = 2N – 1 (last leaf) Iteration: If k is a leaf, set Rk = { xk[u] } // Rk is simply the character of kth species If k is not a leaf, Let i, j be the daughter nodes; Set Rk = Ri Rj if intersection is nonempty Set Rk = Ri  Rj, and increment C if intersection is empty Termination: Minimal cost of tree for column u, = C

  31. Example {B} {A,B} {A} {B} {A} {A,B} {A} A A A A B B A B {A} {A} {A} {A} {B} {B} {A} {B}

  32. Parsimony Traceback Traceback: • Choose an arbitrary nucleotide from R2N – 1 for the root • Having chosen nucleotide r for parent k, If r  Ri choose r for daughter i Else, choose arbitrary nucleotide from Ri Easy to see that this traceback produces some assignment of cost C

  33. Another Parsimony Algorithm Let C(v) be cost for subtree rooted at node v Let C(v,x) be cost for subtree rooted at v if we force v to have value x Initialization: For each leaf v C(v) = 0 C(v,x) = 0 if x is input character that labels v; C(v,x) = ∞ otherwise Iteration: Let u, w be children of v C(v,x) = min(C(u) + 1, C(u,x)) + min(C(v) + 1, C(v,x)) C(v) = min C(v,x) Termination: Minimal cost is C(root)

  34. Probabilistic Methods A more refined measure of evolution along a tree than parsimony P(x1, x2, xroot | t1, t2) = P(xroot) P(x1 | t1, xroot) P(x2 | t2, xroot) If we use Jukes-Cantor, for example, and x1 = xroot = A, x2 = C, t1 = t2 = 1, = pA¼(1 + 3e-4α) ¼(1 – e-4α) = (¼)3(1 + 3e-4α)(1 – e-4α) xroot t1 t2 x1 x2

  35. Probabilistic Methods xroot • If we know all internal labels xu, P(x1, x2, …, xN, xN+1, …, x2N-1 | T, t) = P(xroot)jrootP(xj | xparent(j), tj, parent(j)) • Usually we don’t know the internal labels, therefore P(x1, x2, …, xN | T, t) = xN+1 xN+2 … x2N-1P(x1, x2, …, x2N-1 | T, t) xu x2 xN x1

  36. Felsenstein’s Likelihood Algorithm Define: and recursively compute:

  37. Felsenstein’s Likelihood Algorithm Now using u and U we can compute: and

  38. Probabilistic Methods Given M (ungapped) alignment columns of N sequences, • Define likelihood of a tree: L(T, t) = P(Data | T, t) = m=1…M P(x1m, …, xnm | T, t) Maximum Likelihood Reconstruction: • Given data X = (xij), find a topology T and length vector t that maximize likelihood L(T, t)

  39. Current popular methods HUNDREDS of programs available! http://evolution.genetics.washington.edu/phylip/software.html#methods Some recommended programs: • Discrete—Parsimony-based • Rec-1-DCM3 http://www.cs.utexas.edu/users/tandy/mp.html Tandy Warnow and colleagues • Probabilistic • SEMPHY http://www.cs.huji.ac.il/labs/compbio/semphy/ Nir Friedman and colleagues

More Related