1 / 25

Phylogeny Tree Reconstruction

1. 4. 3. 5. 2. 5. 2. 3. 1. 4. Phylogeny Tree Reconstruction. Inferring Phylogenies. Trees can be inferred by several criteria: Morphology of the organisms Sequence comparison Example: Orc: A C A G T G A C G CCCC AAA C G T Elf: A C A G T G A C G C T A C AAA C G T

Download Presentation

Phylogeny Tree Reconstruction

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. 1 4 3 5 2 5 2 3 1 4 Phylogeny Tree Reconstruction

  2. Inferring Phylogenies Trees can be inferred by several criteria: • Morphology of the organisms • Sequence comparison Example: Orc: ACAGTGACGCCCCAAACGT Elf: ACAGTGACGCTACAAACGT Dwarf: CCTGTGACGTAACAAACGA Hobbit: CCTGTGACGTAGCAAACGA Human: CCTGTGACGTAGCAAACGA

  3. Modeling Evolution During infinitesimal time t, there is not enough time for two substitutions to happen on the same nucleotide So we can estimate P(x | y, t), for x, y  {A, C, G, T} Then let P(A|A, t) …… P(A|T, t) S(t) = … … P(T|A, t) …… P(T|T, t)

  4. Modeling Evolution Reasonable assumption: multiplicative (implying a stationary Markov process) S(t+t’) = S(t)S(t’) That is, P(x | y, t+t’) = z P(x | z, t) P(z | y, t’) Jukes-Cantor: constant rate of evolution 1 - 3    For short time , S() =  1 - 3     1 - 3     1 - 3

  5. Modeling Evolution Jukes-Cantor: For longer times, r(t) s(t) s(t) s(t) S(t) = s(t) r(t) s(t) s(t) s(t) s(t) r(t) s(t) s(t) s(t) s(t) r(t) Where we can derive: r(t) = ¼ (1 + 3 e-4t) s(t) = ¼ (1 – e-4t)

  6. Modeling Evolution Kimura: Transitions: A/G, C/T Transversions: A/T, A/C, G/T, C/G Transitions (rate ) are much more likely than transversions (rate ) r(t)s(t)u(t)s(t) S(t) = s(t)r(t)s(t)u(t) u(t)s(t)r(t)s(t) s(t)u(t)s(t)r(t) Where s(t) = ¼ (1 – e-4t) u(t) = ¼ (1 + e-4t – e-2(+)t) r(t) = 1 – 2s(t) – u(t)

  7. Phylogeny and sequence comparison Basic principles: • Degree of sequence difference is proportional to length of independent sequence evolution • Only use positions where alignment is pretty certain – avoid areas with (too many) gaps

  8. Distance between two sequences Given (portion of) sequences xi, xj, Define dij = distance between the two sequences One possible definition: dij = fraction f of sites u where xi[u]  xj[u] Better model (Jukes-Cantor): dij = - ¾ log(1 – 4f / 3)

  9. A simple clustering method for building tree UPGMA (unweighted pair group method using arithmetic averages) Given two disjoint clusters Ci, Cj of sequences, 1 dij = ––––––––– {p Ci, q Cj}dpq |Ci|  |Cj| Note that if Ck = Ci  Cj, then distance to another cluster Cl is: dil |Ci| + djl |Cj| dkl = –––––––––––––– |Ci| + |Cj|

  10. Algorithm: UPGMA 1 4 Initialization: Assign each xi into its own cluster Ci Define one leaf per sequence, height 0 Iteration: Find two clusters Ci, Cj s.t. dij is min Let Ck = Ci  Cj Define node connecting Ci, Cj, & place it at height dij/2 Delete Ci, Cj Termination: When two clusters i, j remain, place root at height dij/2 3 5 2 5 2 3 1 4

  11. Ultrametric Distances & UPGMA UPGMA is guaranteed to build the correct tree if distance is ultrametric Proof: • The tree topology is unique, given that the tree is binary • UPGMA constructs a tree obeying the pairwise distances 5 2 3 1 4

  12. Weakness of UPGMA Molecular clock: implied time is constant for all species However, certain species (e.g., mouse, rat) evolve much faster Example where UPGMA messes up: UPGMA Correct tree 3 2 1 3 4 4 2 1

  13. d1,4 Additive Distances 1 Given a tree, a distance measure is additive if the distance between any pair of leaves is the sum of lengths of edges connecting them Given a tree T & additive distances dij, can uniquely reconstruct edge lengths: • Find two neighboring leaves i, j, with common parent k • Place parent node k at distance dkm = ½ (dim + djm – dij) from any node m 4 12 8 3 13 7 9 5 11 10 6 2

  14. Neighbor-Joining • Guaranteed to produce the correct tree if distance is additive • May produce a good tree even when distance is not additive Step 1: Finding neighboring leaves Define Dij = dij – (ri + rj) Where 1 ri = –––––k dik |L| - 2 Claim: The above “magic trick” ensures that Dij is minimal iffi, j are neighbors Proof: Too little time today, please read Durbin et al.! 1 3 0.1 0.1 0.1 0.4 0.4 4 2

  15. Algorithm: Neighbor-joining Initialization: Define T to be the set of leaf nodes, one per sequence Let L = T Iteration: Pick i, j s.t. Dij is minimal Define a new node k, and set dkm = ½ (dim + djm – dij) for all m  L Add k to T, with edges of lengths dik = ½ (dij + ri – rj) Remove i, j from L; Add k to L Termination: When L consists of two nodes, i, j, and the edge between them of length dij

  16. Parsimony • One of the most popular methods Idea: Find the tree that explains the observed sequences with a minimal number of substitutions Two computational subproblems: • Find the parsimony cost of a given tree (easy) • Search through all tree topologies (hard)

  17. Parsimony Scoring Given a tree, and an alignment column Label internal nodes to minimize the number of required substitutions Initialization: Set cost C = 0; k = 2N – 1 Iteration: If k is a leaf, set Rk = { xk[u] } If k is not a leaf, Let i, j be the daughter nodes; Set Rk = Ri Rj if intersection is nonempty Set Rk = Ri  Rj, and C += 1, if intersection is empty Termination: Minimal cost of tree for column u, = C

  18. Example {A, B} C+=1 {A} {A, B} C+=1 B A B A {B} {B} {A} {A}

  19. Traceback to find ancestral nucleotides Traceback: • Choose an arbitrary nucleotide from R2N – 1 for the root • Having chosen nucleotide r for parent k, If r  Ri choose r for daughter i Else, choose arbitrary nucleotide from Ri Easy to see that this traceback produces some assignment of cost C

  20. Example Admissible with Traceback x B Still optimal, but inadmissible with Traceback A {A, B} B A x {A} B A A B B {A, B} x B x A A B B A A A B B {B} {A} {A} {B} A x A x A A B B

  21. Search through tree topologies: Branch and Bound Observation: adding an edge to an existing tree can only increase the parsimony cost Enumerate all unrooted trees with at most n leaves: [i3][i5][i7]……[i2N–5]] where each ik can take values from 0 (no edge) to k At each point keep C = smallest cost so far for a complete tree Start B&B with tree [1][0][0]……[0] Whenever cost of current tree T is > C, then: • T is not optimal • Any tree extending T with more edges is not optimal: Increment by 1 the rightmost nonzero counter

  22. Bootstrapping to get the best trees Main outline of algorithm • Select random columns from a multiple alignment – one column can then appear several times • Build a phylogenetic tree based on the random sample from (1) • Repeat (1), (2) many (say, 1000) times • Output the tree that is constructed most frequently

  23. Probabilistic Methods xroot • Instead of parsimony, can define a more refined measure of how “good” (likely) a tree is P(x1, x2, …, xN, xN+1, …, x2N-1 | T, t) = P(xroot)jrootP(xj | xparent(j), tj, parent(j)) • Usually we don’t know the internal labels, therefore P(x1, x2, …, xN | T, t) = xN+1 xN+2 … x2N-1P(x1, x2, …, x2N-1 | T, t) x2 xN x1

  24. Felsenstein’s Likelihood Algorithm To calculate P(x1, x2, …, xN | T, t) Initialization: Set k = 2N – 1 Recursion: Compute P(Lk | a) for all a   If k is a leaf node: Set P(Lk | a) = 1(a = xk) If k is not a leaf node: 1. Compute P(Li | b), P(Lj | b) for all b, for daughter nodes i, j 2. Set P(Lk | a) = b, cP(b | a, ti)P(Li | b) P(c | a, tj) P(Lj | c) Termination: Likelihood at this column = P(x1, x2, …, xN | T, t) = aP(L2N-1 | a)P(a)

  25. Probabilistic Methods Given M (ungapped) alignment columns of N sequences, • Define likelihood of a tree: L(T, t) = P(Data | T, t) = m=1…M P(x1m, …, xnm, T, t) Maximum Likelihood Reconstruction: • Given data X = (xij), find a topology T and length vector t that maximize likelihood L(T, t)

More Related