
Evolutionary HMMs: a Bayesian approach to multiple alignment


Presentation Transcript


  1. Evolutionary HMMs: a Bayesian approach to multiple alignment Presented by: Ryan Cunningham

  2. The goal • Input: Multiple sequences and a tree representing their evolutionary relationship • Output: A multiple sequence alignment which maximizes the probability of the evolutionary relationships between the sequences, given the tree (and other parameters)

  3. Evolutionary Models • Pairwise model for an ancestor aligned to a descendant, conditioned on time and other model parameters

  4. Reversibility • A pairwise model is reversible if

  5. Additivity • A pairwise model satisfies additivity if

  6. The goal • If these constraints are satisfied, then the goal can be expressed as
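The formulas on slides 4–6 were rendered as images in the original deck. In the standard notation of the TKF91/Holmes–Bruno framework (a reconstruction, not the slides' exact typesetting), they read:

```latex
% Reversibility: detailed balance with respect to the equilibrium
% distribution \pi
\pi(x)\, P(y \mid x, t) \;=\; \pi(y)\, P(x \mid y, t)

% Additivity: the Chapman-Kolmogorov property
P(z \mid x, s + t) \;=\; \sum_{y} P(y \mid x, s)\, P(z \mid y, t)

% The goal: with both properties the likelihood factors over the
% branches of the tree T, so the best alignment A maximizes
P(\text{sequences}, A \mid T)
  \;=\; \pi(S_{\text{root}})
        \prod_{(i \to j) \in T} P(S_j \mid S_i,\, t_{ij})
```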

  7. Model • Links Model • Pair HMM • Multiple HMM • Composition of branch alignments • Eliminating internal nodes

  8. Links model • Models indels in a sequence • Each residue can either spawn a child or die • One “immortal link” can spawn residues from the left end of the sequence

  10. Links model • What is the time evolution of the probability of a link surviving and spawning n descendants?

  11. Links model • What is the time evolution of the probability of a link surviving and spawning n descendants? The governing equation balances: • Outflow: probability of a deletion or insertion in a sequence of length n • Inflow: probability of an insertion into a sequence of length n-1 • Inflow: probability of a deletion from a sequence of length n+1
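Reading the three labelled terms as a linear birth–death process with per-link insertion rate λ and deletion rate μ, the master equation for p_n(t) (a hedged reconstruction in TKF91 notation) is:

```latex
\frac{\mathrm{d}p_n(t)}{\mathrm{d}t}
  \;=\; \underbrace{\lambda\,(n-1)\,p_{n-1}(t)}_{\text{insertion into length } n-1}
  \;+\; \underbrace{\mu\,(n+1)\,p_{n+1}(t)}_{\text{deletion from length } n+1}
  \;-\; \underbrace{(\lambda+\mu)\,n\,p_n(t)}_{\text{insertion or deletion at length } n}
```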

  12. Links model • What is the time evolution of the probability of a link dying before time t and spawning n descendants?

  13. Links model • What is the time evolution of the probability of the immortal link spawning n descendants at time t?

  14. Links model • The solution to these differential equations is

  15. Links model • Where…

  16. Links model • Where… • the probability that the ancestral residue survives • the probability of insertions from surviving descendants • the probability of insertions if the ancestral residue is dead
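In TKF91 these three quantities are conventionally written α, β and γ; their closed forms (quoted from the standard model with birth rate λ and death rate μ, not from the slide image itself) are:

```latex
\alpha(t) = e^{-\mu t}
  \qquad\text{(ancestral residue survives)}

\beta(t) = \frac{\lambda\left(1 - e^{(\lambda-\mu)t}\right)}
                {\mu - \lambda\, e^{(\lambda-\mu)t}}
  \qquad\text{(insertions from surviving descendants)}

\gamma(t) = 1 - \frac{\mu\left(1 - e^{(\lambda-\mu)t}\right)}
                     {\left(1 - e^{-\mu t}\right)\left(\mu - \lambda\, e^{(\lambda-\mu)t}\right)}
  \qquad\text{(insertions when the ancestral residue is dead)}
```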

  17. Pair HMM • Just like a standard HMM, but emits two sequences instead of one • Used to model aligned sequences

  18. Pair HMM

  19. Pair HMM • Can be used to realize the links model outlined previously

  20. Pair HMM

  21. Pair HMM • Can be used to realize the links model outlined previously • The path through the Pair HMM is denoted π

  22. Pair HMM • This model is insufficient for our tasks for an obvious reason…

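The forward algorithm over such a pair HMM can be sketched as follows — a minimal toy implementation with a generic three-state (match/insert/delete) topology. The transition and emission values in the usage example are invented for illustration; they are not the links-model parameterization.

```python
# Toy pair HMM forward algorithm: sums P(x, y, path) over every alignment
# path through a Match state (emits an aligned pair), an Insert state
# (emits a residue of x against a gap) and a Delete state (emits a
# residue of y against a gap). "S" is a silent start state.

def pair_hmm_forward(x, y, trans, emit_pair, emit_single):
    """Return P(x, y) summed over all alignments."""
    n, m = len(x), len(y)
    states = ("S", "M", "I", "D")
    # f[s][i][j] = P(emitted x[:i] and y[:j], currently in state s)
    f = {s: [[0.0] * (m + 1) for _ in range(n + 1)] for s in states}
    f["S"][0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:  # Match: consume one residue of each sequence
                f["M"][i][j] = emit_pair(x[i - 1], y[j - 1]) * sum(
                    f[s][i - 1][j - 1] * trans[s]["M"] for s in states)
            if i > 0:            # Insert: consume a residue of x only
                f["I"][i][j] = emit_single(x[i - 1]) * sum(
                    f[s][i - 1][j] * trans[s]["I"] for s in states)
            if j > 0:            # Delete: consume a residue of y only
                f["D"][i][j] = emit_single(y[j - 1]) * sum(
                    f[s][i][j - 1] * trans[s]["D"] for s in states)
    # Terminate: every alignment path must transition to the end state "E".
    return sum(f[s][n][m] * trans[s]["E"] for s in ("M", "I", "D"))

# Invented, roughly normalized parameters (not from the paper):
trans = {s: {"M": 0.8, "I": 0.06, "D": 0.06, "E": 0.08}
         for s in ("S", "M", "I", "D")}
p = pair_hmm_forward("GAT", "GT", trans,
                     emit_pair=lambda a, b: 0.2 if a == b else 0.02,
                     emit_single=lambda a: 0.25)
```

In the links-model realization, the transition probabilities would be functions of the survival and insertion quantities and the branch length t, rather than the constants used here.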
  24. Multiple HMMs • Instead of emitting 2 sequences, emit N sequences • 2^N - 1 states! • Can develop such a model for any tree • The dynamic programming table for an entire tree is too large for practical use

  25. Multiple HMMs

  26. Multiple HMMs

  27. Composition of branch alignments • Given a complete set of pairwise alignments for an evolutionary tree, how do we derive the multiple alignment (and vice versa)? • Rule for two sequences X and Y: residues X_i and Y_j are aligned iff • They are in the same column • That column contains no gaps for intermediate sequences

  28. Composition of branch alignments • In other words, no deletion-to-insertion transitions are allowed • Produces cliques of ungapped columns • Columns that are all gaps within a clique are ignored

  29. Composition of branch alignments • Example (the slide shows a guide tree over nodes 1-4):
  1: GATTACA
  2: G-TTATA
  3: GATAT-A
  4: CATTA-T
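The rule on slide 27 can be sketched as composition of index maps through the intermediate sequence. The 1–2 index pairs below are read off the slide's example sequences; the intermediate-to-third alignment is hypothetical.

```python
# Composing pairwise alignments X~Y and Y~Z into X~Z. An alignment is a
# set of (residue index, residue index) pairs. X_i aligns to Z_k only if
# both pairs pass through the same residue Y_j -- a gap in the
# intermediate sequence Y breaks the chain, which is exactly the
# "no gaps for intermediate sequences" rule.

def compose(a_xy, a_yz):
    y_to_z = dict(a_yz)
    return {(i, y_to_z[j]) for (i, j) in a_xy if j in y_to_z}

# Residue pairs for GATTACA (X) vs G-TTATA (Y), read off the example:
a_xy = {(0, 0), (2, 1), (3, 2), (4, 3), (5, 4), (6, 5)}
# A hypothetical Y vs Z alignment in the same representation:
a_yz = {(0, 0), (1, 2), (2, 3), (3, 4), (5, 6)}
a_xz = compose(a_xy, a_yz)
```

Note how Y residue 4 is absent from `a_yz`, so X residue 5 aligns to nothing in Z: the chain is broken at the intermediate gap.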

  30. Eliminating internal nodes • Sequences at internal nodes are not actually known • Ideally, they should be summed out of the final likelihood function (i.e. the probability of all possible sequences should be considered)

  31. Eliminating internal nodes • This is not feasible for all possible indel histories • All of the branches become dependent • Summing out substitution histories is possible, however • Given a multiple alignment • Post-order traversal of the tree • Compute the conditional likelihood for each clique • Like the Sankoff algorithm discussed last semester • Call these summed-out characters “Felsenstein wildcards”
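The per-clique conditional likelihoods referred to here are computed by Felsenstein's pruning algorithm. A minimal sketch for one alignment column on a binary tree, with an invented two-letter alphabet and substitution matrix:

```python
# Felsenstein pruning for one column: a post-order traversal computing,
# for each node and each character a, the probability of the observed
# leaves below that node given the node has character a. The alphabet
# and substitution matrix are illustrative, not the paper's.

ALPHABET = (0, 1)

def prune(node, sub, leaf_char):
    """node: a leaf name, or a (left, right) tuple of subtrees.
    sub[a][b]: P(child = b | parent = a) along a branch.
    leaf_char: dict mapping leaf name -> observed character."""
    if not isinstance(node, tuple):  # leaf: a delta function on its character
        return [1.0 if a == leaf_char[node] else 0.0 for a in ALPHABET]
    left, right = (prune(child, sub, leaf_char) for child in node)
    # Internal node: product over children of the branch-summed likelihoods.
    return [
        sum(sub[a][b] * left[b] for b in ALPHABET)
        * sum(sub[a][b] * right[b] for b in ALPHABET)
        for a in ALPHABET
    ]

sub = [[0.9, 0.1], [0.1, 0.9]]        # illustrative substitution matrix
tree = (("s1", "s2"), "s3")
root = prune(tree, sub, {"s1": 0, "s2": 0, "s3": 1})
likelihood = sum(0.5 * root[a] for a in ALPHABET)  # uniform root prior
```

The full column likelihood is the root prior dotted with the root's conditional vector, as in the last line.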

  32. Algorithm • Sample from this distribution: • Sample space is too large (decomposition on the right won’t work) for complete indel history • Sample from a subset of alignments at each step using Gibbs sampling

  33. Algorithm • Need to come up with a set of transformations on an alignment • These transformations need to be ergodic • Allow for transformation of any alignment into any other alignment • These transformations need to satisfy detailed balance • Gives convergence to the desired stationary distribution

  34. Move 1: parent sampling • Goal: align two sibling nodes Y and Z and infer their parent X • Construct the multiple HMM, fixing everything else • Sample an alignment of (Y,Z) using the forward algorithm • This imposes alignments of (X,Y) and (X,Z)

  35. Multiple HMM

  36. Move 2: branch sampling • Goal: align two adjacent nodes X and Y • Construct the pair HMM for X and Y, fixing everything else • Resample the alignment using the forward algorithm

  37. Pair HMM

  38. Move 3: node sampling • Goal: resample the sequence at an internal node X • Given parent node W and children nodes Y and Z, fix everything but these branch alignments • Construct the multiple HMM and sample X • Note that we’re sampling the sequence, not an alignment

  39. Multiple HMM

  40. Sufficiency of moves • Can transform any alignment to any other alignment => ergodic • Samples from the conditional distributions, so this is Gibbs sampling => detailed balance • Together these imply the moves will produce an unbiased sample of the distribution
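In standard MCMC notation (not shown on the slide), detailed balance for a move kernel T with respect to the target distribution P over alignments reads:

```latex
P(a)\, T(a \to b) \;=\; P(b)\, T(b \to a)
  \qquad \text{for all alignments } a, b
```

A Gibbs move, which resamples a block from its exact conditional distribution, satisfies this automatically.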

  41. Algorithm • 1. Construct a multiple alignment by parent sampling up the guide tree • 2. Visit each node and each branch once in a random order and resample them • 3. GOTO 2

  42. Refinement 1: Greedy approach • Periodically save the current alignment, then take a greedy approach • Use the Viterbi algorithm instead of the forward algorithm for node and branch sampling • Store this alignment and compare its likelihood to the other saved alignments at the end of the run

  43. Refinement 2: Simulated annealing • Inject a variable “temperature” parameter T that decreases over time • Raise all probabilities to the 1/T power • Increases the “randomness” of the solutions early in the run to explore more of the search space
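Raising probabilities to the 1/T power and renormalizing can be sketched as below; the distribution and temperature values are illustrative.

```python
# Tempering a discrete distribution: raise each probability to 1/T and
# renormalize. T > 1 flattens the distribution (more exploration early
# in the run); T < 1 concentrates it on the mode (greedier behavior as
# the temperature is lowered).

def temper(probs, T):
    powered = [p ** (1.0 / T) for p in probs]
    z = sum(powered)
    return [p / z for p in powered]

probs = [0.7, 0.2, 0.1]
hot = temper(probs, 4.0)    # early in the run: flatter
cold = temper(probs, 0.25)  # late in the run: peakier
```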

  44. Refinement 3: Ordered over-relaxation • Sampling is a random walk, so it diffuses like Brownian motion • It would be better to avoid previously explored regions of the space • This can be accomplished, without biasing the outputs, through the following technique

  45. Refinement 3: Ordered over-relaxation • Impose a weak order on states • Draw some number N of samples • Sort the N samples and the original sample by the specified weak ordering • If the original sample ends up in position k, choose the (N-k)th sample for the next emission
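The selection step can be sketched as below. The Gaussian sampler and the identity ordering key are illustrative stand-ins, not the match-state centroid ordering the paper uses (described on the next slide).

```python
import random

# Ordered over-relaxation selection step: draw N candidate samples, sort
# them together with the current sample under an ordering key, find the
# current sample's rank k, and move to the sample at rank N - k. This
# "mirrors" the chain across the conditional distribution, suppressing
# random-walk behavior without biasing the stationary distribution.

def over_relax(current, sampler, key, n=7):
    candidates = [sampler() for _ in range(n)]
    ranked = sorted(candidates + [current], key=key)
    k = ranked.index(current)          # rank of the current sample
    return ranked[n - k]               # mirrored rank among n + 1 items

random.seed(0)
new_state = over_relax(0.3, lambda: random.gauss(0.0, 1.0), key=lambda s: s)
```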

  46. Refinement 3: Ordered over-relaxation • For this algorithm, the weak ordering is a sort on “the centroid of all match states resolved transverse to the main diagonal of the dynamic programming matrix”

  47. Implementation • Coded in C++ under the GNU GPL • Useful logging and parameter options • Very large package with helper scripts • Incorporated into the DART package

  48. Simulated data • Yule process to create a random tree • A probabilistic process which creates an evolutionary tree • Number of links is distributed by a power law • tkfemit emits an alignment given a tree

  49. Experiments on simulated data • Only looking at leaf sequences now • The tkfdistance program creates a distance (time) matrix from observed sequences • Uses “bracketed minimization” • Weighbor estimates a tree from the distance matrix • Uses neighbor joining, but weights longer distances according to a probabilistic model

  50. Experiments on simulated data • Using the aligned leaf sequences, reconstruct the indel history using wildcards • Three strategies: • (A) Maximum likelihood up the tree, once • (B) Greedy refinement (repeatedly sample up the tree until no improvements are made) • (C) 100 sampling moves from the Gibbs sampler, followed by greedy refinement
