
CSCE555 Bioinformatics


Presentation Transcript


  1. CSCE555 Bioinformatics, Lecture 6: Hidden Markov Models. Meeting: MW 4:00PM-5:15PM, SWGN2A21. Instructor: Dr. Jianjun Hu. Course page: http://www.scigen.org/csce555. HAPPY CHINESE NEW YEAR. University of South Carolina, Department of Computer Science and Engineering, 2008. www.cse.sc.edu.

  2. Roadmap • Probabilistic Models of Sequences • Introduction to HMM • Profile HMMs as MSA models • Measuring Similarity between a Sequence and an HMM Profile Model • Summary

  3. Multiple Sequence Alignment • Alignment containing multiple DNA / protein sequences • Look for conserved regions → similar function • Example:
#Rat      ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT
#Mouse    ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT
#Rabbit   ATGGTGCATCTGTCCAGT---GAGGAGAAGTCTGC
#Human    ATGGTGCACCTGACTCCT---GAGGAGAAGTCTGC
#Opossum  ATGGTGCACTTGACTTTT---GAGGAGAAGAACTG
#Chicken  ATGGTGCACTGGACTGCT---GAGGAGAAGCAGCT
#Frog     ---ATGGGTTTGACAGCACATGATCGT---CAGCT

  4. Probabilistic Model: Position-Specific Scoring Matrices (PSSM) • Limitations of PSSM?
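
The kind of model this slide refers to can be sketched in a few lines of Python. The following is only an illustrative sketch: the column-frequency counting with a pseudocount and a uniform background is an assumption of this example, not something taken from the lecture.

import math

def build_pssm(aligned_seqs, alphabet="ACGT", pseudocount=1.0):
    # One score column per alignment column: log-odds of observed
    # residue frequency versus a uniform background.
    length = len(aligned_seqs[0])
    background = 1.0 / len(alphabet)
    pssm = []
    for col in range(length):
        counts = {a: pseudocount for a in alphabet}
        for seq in aligned_seqs:
            if seq[col] in counts:          # ignore gaps / ambiguous symbols
                counts[seq[col]] += 1
        total = sum(counts.values())
        pssm.append({a: math.log2((counts[a] / total) / background)
                     for a in alphabet})
    return pssm

block = ["ATGGTG", "ATGGTG", "ATGGTA"]      # toy ungapped aligned block
matrix = build_pssm(block)
print(round(sum(matrix[i][b] for i, b in enumerate("ATGGTG")), 2))

Scoring a candidate sequence is just a sum of per-column scores, which is exactly why a PSSM cannot model variable-length gaps: every sequence must line up column for column.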

  5. Difficulty in biological sequences • Variation within a family of sequences • Gaps of variable lengths • Segments conserved to different degrees • PSSM cannot handle variable-length gaps • Need a statistical sequence model

  6. Regular Expression Model • Regular expressions • Protein spelling is much freer than English spelling • [AT] [CG] [AC] [ACGT]* A [TG] [GC] • Limitation of the regular expression model?

  7. Roadmap • Probabilistic Models of Sequences • Introduction to HMM • Profile HMMs as MSA models • Measuring Similarity between a Sequence and an HMM Profile Model • Summary

  8. Hidden Markov Model (HMM) • An HMM is: • a statistical model • well suited for many tasks in molecular biology • Using HMMs in molecular biology: • Probabilistic profile (profile HMM): built from a family of proteins and used to search a database for other members of the family; resembles the profile and weight-matrix methods • Grammatical structure: gene finding, recognizing signals, prediction (must follow the rules of a gene)

  9. Detect Cheating in a Coin Toss Game • Fair and biased coins could be used • Question: is it possible to determine whether a biased coin was used, based only on the observed head/tail sequence? • HTTTHTHTHTTHHHHTHTHTHTHHHHTHT

  10. Example: Fair Coin Toss • Consider the single-coin scenario • We could model the process producing the sequence of H's and T's as a Markov model with two states, H and T, and equal transition probabilities (all 0.5) • Only one fair coin is used here

  11. Example: Fair and Biased Coins • Consider the scenario where there are two coins: a fair coin and a biased coin • Visible states do not correspond to hidden states: • Visible state: the output, H or T • Hidden state: which coin was tossed • HTTTHTHTHTTHHHHTHTHTHTHHHHTHT
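
To make the hidden/visible distinction concrete, here is a small simulation sketch in Python. The transition and emission probabilities are made-up values for illustration, not numbers from the lecture.

import random

states = ["Fair", "Biased"]
transition = {"Fair":   {"Fair": 0.9, "Biased": 0.1},
              "Biased": {"Fair": 0.1, "Biased": 0.9}}
emission   = {"Fair":   {"H": 0.5, "T": 0.5},
              "Biased": {"H": 0.8, "T": 0.2}}   # biased coin favours heads

def sample(dist):
    # Draw one outcome from a discrete probability distribution.
    r, acc = random.random(), 0.0
    for outcome, p in dist.items():
        acc += p
        if r < acc:
            return outcome
    return outcome

state = "Fair"
hidden, observed = [], []
for _ in range(30):
    hidden.append(state)
    observed.append(sample(emission[state]))   # visible output: H or T
    state = sample(transition[state])          # hidden state moves on

print("".join(observed))                # what the observer sees
print("".join(s[0] for s in hidden))    # the hidden coin sequence (F/B)

The observer only ever sees the first line; inferring the second line from it is the HMM inference problem discussed in the following slides.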

  12. Hidden Markov Models

  13. Ingredients of an HMM • A collection of states: {S1, S2, …, SN} • State transition probabilities (transition matrix): Aij = P(qt+1 = Sj | qt = Si) • Initial state distribution: πi = P(q1 = Si) • Observation symbols: {O1, O2, …, OM} • Observation (emission) probabilities: Bj(k) = P(ot = Ok | qt = Sj)

  14. Ingredients of Our HMM • States: {Ssunny, Srainy, Ssnowy} • State transition probabilities: transition matrix A (values shown on the original slide) • Initial state distribution: π = (0.7, 0.25, 0.05) • Observations: {O1, O2, …, OM} • Observation probabilities: emission matrix B (values shown on the original slide)

  15. Probability of a Sequence of Events • P(O) = P(Ogloves, Ogloves, Oumbrella, …, Oumbrella) = Σ over all hidden paths Q of P(O | Q) P(Q) = Σ over q1, …, q7 of P(O | q1, …, q7) P(q1, …, q7) = 0.7 × 0.86 × 0.32 × 0.14 × 0.6 + …
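
The summation above can be written out directly for a toy model. The sketch below enumerates every hidden path Q, which is only feasible for very short sequences; the two-state parameters are illustrative, not the slide's weather model.

from itertools import product

states = ["S1", "S2"]
pi = {"S1": 0.7, "S2": 0.3}
A = {"S1": {"S1": 0.8, "S2": 0.2}, "S2": {"S1": 0.4, "S2": 0.6}}
B = {"S1": {"x": 0.9, "y": 0.1}, "S2": {"x": 0.2, "y": 0.8}}

def prob_of_observation(obs):
    # Brute-force P(O) = sum over all hidden paths Q of P(O | Q) P(Q).
    total = 0.0
    for path in product(states, repeat=len(obs)):        # every hidden path Q
        p = pi[path[0]] * B[path[0]][obs[0]]             # start + first emission
        for t in range(1, len(obs)):
            p *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
        total += p                                       # accumulate P(O|Q)P(Q)
    return total

print(prob_of_observation(["x", "x", "y"]))

The number of hidden paths grows exponentially with sequence length, which is why the forward and Viterbi algorithms on the following slides use dynamic programming instead.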

  16. Typical HMM Problems • Annotation: given a model M and an observed string S, what is the most probable path through M generating S? • Classification: given a model M and an observed string S, what is the total probability of S under M? • Consensus: given a model M, what is the string with the highest probability under M? • Training: given a set of strings and a model structure, find transition and emission probabilities that assign high probabilities to the strings
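
For the annotation problem, the standard solution is the Viterbi algorithm. Below is a compact dictionary-based sketch (real implementations work in log space to avoid underflow); it can be called with the pi, A, B dictionaries from the earlier toy example.

def viterbi(obs, states, pi, A, B):
    # v[t][s]: probability of the best path ending in state s at time t
    v = [{s: pi[s] * B[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        v.append({})
        back.append({})
        for s in states:
            prev, p = max(((r, v[t - 1][r] * A[r][s]) for r in states),
                          key=lambda rp: rp[1])
            v[t][s] = p * B[s][obs[t]]
            back[t][s] = prev
    last = max(v[-1], key=v[-1].get)           # best final state
    path = [last]
    for t in range(len(obs) - 1, 0, -1):       # trace back through pointers
        path.append(back[t][path[-1]])
    return list(reversed(path)), v[-1][last]

The classification problem replaces the max with a sum (the forward algorithm, sketched later), and training is typically done with Baum-Welch, an expectation-maximization procedure.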

  17. Roadmap • Probabilistic Models of Sequences • Introduction to HMM • Profile HMMs as MSA models • Measuring Similarity between a Sequence and an HMM Profile Model • Summary

  18. HMM Profiles as Sequence Models • Given a multiple alignment of sequences, we can use an HMM to model the sequences • Each column of the alignment may be represented by a hidden state that produced that column • Insertions and deletions may be represented by other states

  19. Profile HMMs • An HMM with a structure that naturally allows position-dependent gap penalties • Main (match) states: model the columns of the alignment • Insert states: model highly variable regions • Delete states: jump over one or more columns, i.e. model the situation where just a few of the sequences have a "-" in the multiple alignment at a position

  20. HMM Sequences Continued

  21. Profile HMM Example • Consider the six sequences shown below • A multiple sequence alignment of these sequences is the first step in the process of inducing the hidden Markov model
SEQ1 G C C C A
SEQ2 A G C
SEQ3 A A G C
SEQ4 A G A A
SEQ5 A A A C
SEQ6 A G C

  22. Profile HMM Topology • The topology of the HMM is established using the consensus sequence • The structure of a profile HMM is shown on the slide: • Squares represent match states • Diamonds represent insert states • Circles represent delete states

  23. Profile HMM Example Continued • The aligned columns correspond to emissions from either a match state or an insert state • The consensus columns are used to define the match states M1, M2, M3 of the HMM • After defining the match states, the corresponding insert and delete states are used to define the complete HMM topology
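
One common heuristic for choosing which alignment columns become match states treats a column as a match column when at most half of its entries are gaps; the lecture may define its consensus columns differently. The sketch below applies this rule to a hypothetical aligned block (not the lecture's six sequences).

aligned = ["AC--GT",
           "ACA-GT",
           "AC--G-",
           "AC-CGT"]

n = len(aligned)
ncols = len(aligned[0])
# A column is a match column if at most half of its entries are gaps.
match_cols = [c for c in range(ncols)
              if sum(seq[c] == "-" for seq in aligned) <= n / 2]
print(match_cols)   # columns to be modelled by match states M1, M2, ...

Columns that fail the test are modelled by insert states, and a gap in a match column corresponds to passing through the matching delete state.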

  24. Transition Probabilities • The values of the transition probabilities are computed from the frequency of each transition as each sequence is considered • The model parameters are computed from the state transition sequences shown in the figure on the slide

  25. Transition Probabilities Continued • The frequency of each of the transitions and the corresponding emission probabilities are shown on the slide
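
A sketch of the frequency-based estimate: given the state path each training sequence follows through the model, each transition probability is the count of that transition divided by the total number of transitions leaving that state. The paths below are hypothetical placeholders, not the ones in the lecture's figure.

from collections import Counter, defaultdict

paths = [["Begin", "M1", "M2", "M3", "End"],
         ["Begin", "M1", "M2", "M3", "End"],
         ["Begin", "M1", "I1", "M2", "M3", "End"],
         ["Begin", "M1", "D2", "M3", "End"]]

counts = defaultdict(Counter)
for path in paths:
    for a, b in zip(path, path[1:]):
        counts[a][b] += 1                      # count transition a -> b

# Normalize the counts leaving each state into probabilities.
transition = {a: {b: c / sum(nexts.values()) for b, c in nexts.items()}
              for a, nexts in counts.items()}
print(transition["M1"])   # {'M2': 0.5, 'I1': 0.25, 'D2': 0.25}

In practice a pseudocount is usually added to every possible transition so that transitions never observed in the training alignment still get a small nonzero probability.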

  26. Emission Probabilities • The emission probability for a state is computed from the frequency with which each symbol is emitted in that state • The emission probability specifies the probability of emitting each of the symbols of the alphabet Σ in state k
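
A commonly used estimate for these probabilities is the frequency-count form below (the pseudocount is optional, and the exact formula on the slide may differ in detail):

ek(b) = (ck(b) + α) / Σ over b' in Σ of (ck(b') + α)

where ck(b) is the number of times symbol b is observed in state k across the training alignment, and α ≥ 0 is a pseudocount; α = 0 gives the plain observed frequency, while α > 0 keeps unobserved symbols from getting probability zero.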

  27. Emission Probabilities Continued • The emission probability for each state is computed as shown on the slide

  28. Searching the Profile HMM • Sequences can be searched against the HMM to detect whether or not they belong to the family of sequences described by the profile HMM • Using a global alignment, the probability of the most probable alignment of a sequence to the model can be determined with the Viterbi algorithm • The full probability of a sequence aligning to the profile HMM is determined with the forward algorithm
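
A sketch of the forward algorithm for a plain HMM, using the same pi/A/B dictionary layout as the earlier sketches (the profile HMM version additionally handles delete states, which emit nothing). It sums over all hidden paths, in contrast to Viterbi's maximum.

def forward(obs, states, pi, A, B):
    # f[s]: probability of the observations so far, ending in state s
    f = {s: pi[s] * B[s][obs[0]] for s in states}        # initialisation
    for o in obs[1:]:
        f = {s: B[s][o] * sum(f[r] * A[r][s] for r in states)
             for s in states}                             # recursion
    return sum(f.values())                                # total probability P(O)

Replacing the sum over previous states with a max (and keeping back-pointers) turns this into Viterbi; both run in time proportional to the sequence length times the square of the number of states.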

  29. How Well Does a Sequence Fit a Model? • The raw probability of a sequence depends on its length • So it is not suitable to use directly as a score

  30. Length-independent Score • Log-odds score • The logarithm of the probability of the sequence divided by the probability according to a null model
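
In symbols, using the forward-algorithm probability P(x | M) and a simple i.i.d. null model R with background residue frequencies q:

score(x) = log [ P(x | M) / P(x | R) ],  with  P(x | R) = ∏ over positions i of q(xi)

Because both probabilities shrink roughly exponentially with sequence length, taking their ratio removes most of the length dependence, so scores can be compared across sequences of different lengths; a positive log-odds score means the sequence is more probable under the profile model than under the background model.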

  31. Length-independent Score • HMM using log-odds

  32. Summary • HMM basics • How to build a profile HMM model • Scoring the fit between a sequence and an HMM model

  33. Next Lecture • Gene-finding • Reading: • Textbook (CG) chapter 4 • Textbook (EB) chapter 8
