
CS5263 Bioinformatics


Presentation Transcript


  1. CS5263 Bioinformatics Lecture 13: Hidden Markov Models and applications

  2. Project ideas • Implement an HMM • Iterative refinement MSA • Motif finder • Implement an algorithm in a paper • Write a survey paper

  3. NAR Web Server Issue • Every December, Nucleic Acids Research has a special issue on web servers • Not necessary to invent original methods • Biologists appreciate web tools • You get a nice publication • And potentially many citations if your tool is useful (think about BLAST!) • Talk to me if you want to work on this project

  4. Review of last lectures

  5. Problems in HMM • Decoding • Predict the state of each symbol • Viterbi algorithm • Forward-backward algorithm • Evaluation • The probability that a sequence is generated by a model • Forward-backward algorithm • Learning • “unsupervised” • Baum-Welch • Viterbi

  6. A general HMM • K states, completely connected (possibly with 0 transition probabilities) • Each state has a set of emission probabilities (emission probabilities may be 0 for some symbols in some states) [Figure: fully connected states 1, 2, 3, …, K]

  7. The Viterbi algorithm V(j, i+1) = max { V(1,i) + w(1,j) + r(j, xi+1), V(2,i) + w(2,j) + r(j, xi+1), V(3,i) + w(3,j) + r(j, xi+1), …, V(K,i) + w(K,j) + r(j, xi+1) } Or simply: V(j, i+1) = maxl { V(l,i) + w(l,j) + r(j, xi+1) }
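
A minimal sketch of this recurrence in Python (log space; w[l][j] holds log transition weights and r[j][sym] log emission weights as on the slide, while start[j] for the initial log weights is an illustrative addition not defined on the slide):

def viterbi(x, states, w, r, start):
    # V[i][j]: best log score of any state path for x[0..i] that ends in state j
    V = [{j: start[j] + r[j][x[0]] for j in states}]
    ptr = [{}]                                   # back-pointers for traceback
    for i in range(1, len(x)):
        V.append({})
        ptr.append({})
        for j in states:
            # V(j, i) = max_l { V(l, i-1) + w(l, j) + r(j, x_i) }
            best = max(states, key=lambda l: V[i - 1][l] + w[l][j])
            V[i][j] = V[i - 1][best] + w[best][j] + r[j][x[i]]
            ptr[i][j] = best
    last = max(states, key=lambda j: V[-1][j])   # best final state
    path = [last]
    for i in range(len(x) - 1, 0, -1):           # follow back-pointers
        path.append(ptr[i][path[-1]])
    return list(reversed(path)), V[-1][last]

Called with dictionaries of log weights, viterbi(x, states, w, r, start) returns the most probable state path and its log score.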

  8. Viterbi finds the single best path • In many cases it is also interesting to know what are the other possible paths • Viterbi assigns a state to each symbol • In many cases it is also interesting to know the probability that the assignment is correct • Posterior decoding using Forward-backward algorithm

  9. The forward algorithm

  10. [Forward-algorithm equation/figure not preserved in the transcript] Note: this does not include the emission probability of xi
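
For reference, the forward recursion in the usual notation is

f_k(i) = e_k(x_i) \sum_l f_l(i-1)\, a_{lk}

and the inner sum \sum_l f_l(i-1)\, a_{lk} on its own (presumably the quantity the note above refers to) does not yet include the emission probability e_k(x_i).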

  11. Forward-backward algorithm • fk(i): probability of emitting x1…xi and being in state k at position i • bk(i): probability of emitting xi+1…xN, given that position i is in state k • fk(i) * bk(i) = P(πi = k, x)

  12. [Figure: matrix of forward and backward probabilities indexed by sequence position and state] Posterior decoding: P(πi = k | x) = fk(i) * bk(i) / P(x) Space: O(KN) Time: O(K²N)
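
A compact sketch of this computation in Python (plain probabilities for clarity; a[l][k] are transition probabilities, e[k][sym] emission probabilities, start[k] initial probabilities — these container names are illustrative, and a real implementation should rescale or work in log space to avoid underflow on long sequences):

def forward_backward(x, states, a, e, start):
    # Returns the posteriors P(state at position i = k | x) for every i and k.
    n = len(x)
    f = [{k: start[k] * e[k][x[0]] for k in states}]            # forward values f_k(i)
    for i in range(1, n):
        f.append({k: e[k][x[i]] * sum(f[i - 1][l] * a[l][k] for l in states)
                  for k in states})
    b = [None] * n                                              # backward values b_k(i)
    b[n - 1] = {k: 1.0 for k in states}
    for i in range(n - 2, -1, -1):
        b[i] = {k: sum(a[k][l] * e[l][x[i + 1]] * b[i + 1][l] for l in states)
                for k in states}
    px = sum(f[n - 1][k] for k in states)                       # P(x)
    return [{k: f[i][k] * b[i][k] / px for k in states} for i in range(n)]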

  13. Learning • When the states are known • “supervised” learning • When the states are unknown • Estimate parameters from unlabeled data • “unsupervised” learning How to find CpG islands in the porcupine genome?

  14. Basic idea 1. Estimate our “best guess” of the model parameters θ 2. Use θ to predict the unknown labels 3. Re-estimate a new set of θ 4. Repeat 2 & 3 until convergence Two ways to do this:

  15. Viterbi training 1. Estimate our “best guess” of the model parameters θ 2. Find the Viterbi path using the current θ 3. Re-estimate a new set of θ based on the Viterbi path 4. Repeat 2 & 3 until convergence

  16. Baum-Welch training 1. Estimate our “best guess” of the model parameters θ 2. Find P(πi = k | x, θ) using the forward-backward algorithm (E-step) 3. Re-estimate a new set of θ based on all possible paths (M-step) 4. Repeat 2 & 3 until convergence
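
For reference, the re-estimation in step 3 uses expected counts computed from the forward and backward values of slides 11-12 (this is the standard Baum-Welch update, not spelled out on the slide; a_{kl} are transition and e_k(σ) emission probabilities):

A_{kl} = \sum_i \frac{f_k(i)\, a_{kl}\, e_l(x_{i+1})\, b_l(i+1)}{P(x)}, \qquad
E_k(\sigma) = \sum_{i:\, x_i = \sigma} \frac{f_k(i)\, b_k(i)}{P(x)}

a_{kl}^{\text{new}} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}, \qquad
e_k^{\text{new}}(\sigma) = \frac{E_k(\sigma)}{\sum_{\sigma'} E_k(\sigma')}

With multiple training sequences the counts are summed over the sequences, and pseudocounts are commonly added.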

  17. Viterbi vs Baum-Welch training • Viterbi training • Returns a single path • Each position is labeled with a fixed state • Each transition counts as one • Each emission also counts as one • Baum-Welch training • Does not return a single path • Considers the probability that each transition is used • and the probability that a symbol is generated by a certain state • These contribute only partial counts

  18. Probability that a transition is used
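
The equation for this slide did not survive the transcript; in the notation of slides 11-12, the standard expression for the probability that the transition from state k to state l is used at position i is

P(\pi_i = k,\ \pi_{i+1} = l \mid x, \theta) = \frac{f_k(i)\, a_{kl}\, e_l(x_{i+1})\, b_l(i+1)}{P(x)}

Summing this quantity over i gives the expected transition count A_{kl} used in the Baum-Welch update above.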

  19. Viterbi vs Baum-Welch training • Both are guaranteed to converge • Baum-Welch improves the likelihood of the data in each iteration • True EM (expectation-maximization) • Viterbi improves the probability of the most probable path in each iteration • EM-like

  20. Today • Some practical issues in HMM modeling • HMMs for sequence alignment

  21. Duration modeling • For any sub-path, the probability consists of two components • The product of emission probabilities • Depend on symbols and state path • The product of transition probabilities • Depend on state path

  22. Duration modeling • Duration: the number of steps that a state is used consecutively without visiting other states • Model a stretch of DNA for which the distribution does not change for a certain length • The simplest model, a single state s with self-transition probability p and exit probability 1-p, implies that P(length = L) = p^(L-1) (1-p), i.e., the length follows a geometric distribution • Not always appropriate [Figure: state s with self-loop probability p and exit probability 1-p]
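
A small addition for reference (not on the slide): under this geometric model the expected duration is

E[L] = \sum_{L \ge 1} L\, p^{L-1} (1-p) = \frac{1}{1-p}

so a self-transition probability p close to 1 is needed to model long average durations.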

  23. Duration models [Figure: a chain of copies of state s; if only the last copy has a self-loop with probability p (exit 1-p), the duration is a minimum length followed by a geometric tail (“min, then geometric”); if every copy has a self-loop with probability p and exit probability 1-p, the duration follows a negative binomial distribution]
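
To make the two variants concrete (standard results, added here for reference): with n chained copies of s where only the last copy has the self-loop, P(L) = p^{L-n}(1-p) for L ≥ n, i.e. a minimum length of n followed by a geometric tail. With a self-loop on every copy, the total duration is a sum of n geometric variables:

P(L) = \binom{L-1}{n-1}\, p^{L-n} (1-p)^{n}, \qquad L \ge n

which is the negative binomial distribution mentioned on the slide.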

  24. Explicit duration modeling • Example states: Exon, Intron, Intergenic • P(A | I) = 0.3, P(C | I) = 0.2, P(G | I) = 0.2, P(T | I) = 0.3 • Generalized HMM; often used in gene finders [Figure: empirical intron length distribution (probability P vs. length L)]

  25. Silent states • Silent states are states that do not emit symbols (e.g., the state 0 in our previous examples) • Silent states can be introduced in HMMs to reduce the number of transitions

  26. Silent states • Suppose we want to model a sequence in which arbitrary deletions are allowed (later this lecture) • In that case we need a completely forward-connected HMM (O(m²) edges)

  27. Silent states • If we use silent states, we need only O(m) edges • But nothing comes free: we lose fine-grained control over individual transitions. For example, if we want to assign high probability to 1→5 and 2→4, there is no way to also assign low probability to 1→4 and 2→5 • Algorithms can be modified easily to deal with silent states, as long as there are no loops made up only of silent states

  28. Modular design of HMMs • HMMs can be designed modularly • Each module has its own begin/end states (silent) • Each module communicates with other modules only through its begin/end states

  29. [Figure: two HMM modules, one containing states A+, C+, G+, T+ with silent begin/end states B+ and E+, the other containing A-, C-, G-, T- with B- and E-] HMM modules and non-HMM modules can be mixed

  30. Explicit duration modeling • Example states: Exon, Intron, Intergenic • P(A | I) = 0.3, P(C | I) = 0.2, P(G | I) = 0.2, P(T | I) = 0.3 • Generalized HMM; often used in gene finders [Figure: empirical intron length distribution (probability P vs. length L)]

  31. HMM applications • Pair-wise sequence alignment • Multiple sequence alignment • Gene finding • Speech recognition: a good tutorial on course website • Machine translation • Many others

  32. FSA for global alignment [Figure: three-state finite state automaton — a state in which Xi and Yj are aligned, a state in which Xi is aligned to a gap, and a state in which Yj is aligned to a gap; entering a gap state costs d (gap opening), staying in a gap state costs e (gap extension)]

  33. HMM for global alignment [Figure: pair HMM with three states — “Xi aligned to a gap” emitting q(xi) (4 emission probabilities), “Xi and Yj aligned” emitting p(xi, yj) (16 emission probabilities), and “Yj aligned to a gap” emitting q(yj) (4 emission probabilities); the aligned state moves to either gap state with probability δ and stays with probability 1-2δ, and each gap state stays with probability ε and returns with probability 1-ε] Pair-wise HMM: emits two sequences simultaneously. The algorithm is similar to a regular HMM but needs an additional dimension, e.g. in Viterbi we need Vk(i, j) instead of Vk(i); see the recurrences sketched below.
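
A sketch of the Viterbi recurrences for this pair HMM, with match state M and gap states X and Y (begin/end handling omitted; this follows the standard formulation in Durbin et al., chap. 4, under the transition probabilities above):

v^M(i,j) = p(x_i, y_j)\,\max\{(1-2\delta)\, v^M(i-1,j-1),\ (1-\varepsilon)\, v^X(i-1,j-1),\ (1-\varepsilon)\, v^Y(i-1,j-1)\}

v^X(i,j) = q(x_i)\,\max\{\delta\, v^M(i-1,j),\ \varepsilon\, v^X(i-1,j)\}

v^Y(i,j) = q(y_j)\,\max\{\delta\, v^M(i,j-1),\ \varepsilon\, v^Y(i,j-1)\}

Each cell carries two sequence indices, which is the extra dimension mentioned on the slide.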

  34. HMM and FSA for alignment • After a proper transformation between the probabilities and the substitution scores, the two are identical • the substitution score σ(a, b) corresponds to log [p(a, b) / (q(a) q(b))] • d corresponds to log δ • e corresponds to log ε • Details in Durbin book chap 4 • Finding an optimal FSA alignment is equivalent to finding the most probable path with Viterbi

  35. HMM for pair-wise alignment • Theoretical advantages: • Full probabilistic interpretation of alignment scores • Probability of all alignments instead of only the best alignment (forward-backward algorithm) • Posterior probability that Ai is aligned to Bj • Sampling of sub-optimal alignments • Not commonly used in practice • Needleman-Wunsch and Smith-Waterman algorithms work pretty well and are more intuitive to biologists • Other reasons?

  36. Other applications • HMM for multiple alignment • Widely used • HMM for gene finding • Foundation for most gene finders • Include many knowledge-based fine-tunes and GHMM extensions • We’ll only discuss basic ideas

  37. HMM for multiple seq alignment • Proteins form families both across and within species • Ex: globins, zinc fingers • Descended from a common ancestor • Typically have similar three-dimensional structures, functions, and significant sequence similarity • Identifying families is very useful: it suggests functions • So: search and alignment are both useful • Multiple alignment is hard • One very useful approach: profile-HMM

  38. Say we already have a database of reliable multiple alignments of protein families. When a new protein comes along, how do we align it to the existing alignments and find the family that the protein may belong to?

  39. Solution 1 • Use regular expression to represent the consensus sequences • Implemented in the Prosite database • for example C - x(2) - P - F - x - [FYWIV] - x(7) - C - x(8,10) - W - C - x(4) - [DNSR] - [FYW] - x(3,5) - [FYW] - x - [FYWI] - C

  40. Multi-alignments represented by consensus • Consensus sequences are very intuitive • Gaps can be represented by don’t-cares • Aligning sequences with regular expressions can be done easily with DP • Possible to allow mismatches in searching • Problems • Limited power in expressing more divergent positions • E.g. among 100 seqs, 60 have Alanine, 20 have Glycine, 20 have others • Specifying the boundaries of indels is not so easy • unaligned regions have more flexibility to evolve • May have to change the regular expression when a new member of a protein family is identified

  41. Solution 2 • For a non-gapped alignment, we can create a position-specific weight matrix (PWM) • Also called a profile • May use pseudocounts

  42. Scoring by PWMs x = KCIDNTHIKR P(x | M) = Πi ei(xi) Random model: each amino acid xi can be generated with probability q(xi) P(x | random) = Πi q(xi) Log-odds ratio: S = log [P(x | M) / P(x | random)] = Σi log [ei(xi) / q(xi)]
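
A minimal sketch of this scoring in Python (e[i][sym] is the PWM probability of symbol sym at position i and q[sym] the background probability; these names are illustrative, not from the slide):

import math

def pwm_log_odds(x, e, q):
    # Log-odds score S = sum_i log(e_i(x_i) / q(x_i)) of sequence x against a PWM.
    # Assumes an ungapped match with len(x) == len(e), as on the slide.
    return sum(math.log(e[i][sym] / q[sym]) for i, sym in enumerate(x))

For the slide’s example, pwm_log_odds("KCIDNTHIKR", e, q) returns S; a positive S favors the model M over the random model.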

  43. PWMs • Advantage: • Can be used to identify both strong and weak homologies • Easy to implement and use • Probabilistic interpretation • PWMs are used in PSI-BLAST • Given a sequence, get k similar seqs by BLAST • Construct a PWM with these sequences • Search the database for seqs matching the PWM • Improved sensitivity • Problem: • Not intuitive to deal with gaps

  44. PWMs are HMMs • This can only be used to search for sequences without insertions / deletions (indels) • We can add additional states for indels! [Figure: linear chain of match states B → M1 → … → Mk → E; every transition probability = 1; 20 emission probabilities for each match state]

  45. Dealing with insertions • This would allow an arbitrary number of insertions after the j-th position • i.e. the sequence being compared can be longer than the PWM [Figure: chain B → M1 → … → Mj → … → Mk → E with an insertion state Ij (with a self-loop) attached after Mj]

  46. Dealing with insertions • This allows insertions at any position [Figure: chain B → M1 → … → Mj → … → Mk → E with an insertion state (I1 … Ij … Ik, each with a self-loop) after every match state]

  47. Dealing with Deletions • This would allow the subsequence [i, j] to be deleted • i.e. the sequence being compared can be shorter than the PWM [Figure: an extra transition between Mi and Mj that bypasses the match states in between]

  48. Dealing with Deletions • This would allow an arbitrary length of deletion • Completely forward connected • Too many transitions [Figure: match-state chain from B to E with forward transitions between every pair of match states]

  49. Deletion with silent states • Still allows any length of deletions • Fewer parameters • Less detailed control [Figure: a row of silent deletion states Dj above the match chain B → … → Mj → … → E]

  50. Full model • Called profile-HMM • M: match state, I: insertion state, D: deletion state (silent) [Figure: full profile-HMM architecture with states Mj, Ij, Dj between B and E] • Algorithm: basically the same as for a regular HMM
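
For reference, the Viterbi recurrences for the profile HMM in log-odds form (the standard formulation, e.g. Durbin et al., chap. 5; a(s, t) are transition probabilities, e the emission probabilities, q the background frequencies):

V^M_j(i) = \log\frac{e_{M_j}(x_i)}{q_{x_i}} + \max\left\{ V^M_{j-1}(i-1) + \log a_{M_{j-1}M_j},\ V^I_{j-1}(i-1) + \log a_{I_{j-1}M_j},\ V^D_{j-1}(i-1) + \log a_{D_{j-1}M_j} \right\}

V^I_j(i) = \log\frac{e_{I_j}(x_i)}{q_{x_i}} + \max\left\{ V^M_j(i-1) + \log a_{M_j I_j},\ V^I_j(i-1) + \log a_{I_j I_j},\ V^D_j(i-1) + \log a_{D_j I_j} \right\}

V^D_j(i) = \max\left\{ V^M_{j-1}(i) + \log a_{M_{j-1}D_j},\ V^I_{j-1}(i) + \log a_{I_{j-1}D_j},\ V^D_{j-1}(i) + \log a_{D_{j-1}D_j} \right\}

Insert states are usually given background emission probabilities, so the log-odds emission term in V^I vanishes.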
