
Introduction to Profile Hidden Markov Models


Presentation Transcript


  1. Introduction to Profile Hidden Markov Models • Mark Stamp

  2. Hidden Markov Models • Here, we assume you know about HMMs • If not, see “A revealing introduction to hidden Markov models” • Executive summary of HMMs • HMM is a machine learning technique • Also, a discrete hill climb technique • Train model based on observation sequence • Score given sequence to see how closely it matches the model • Efficient algorithms, many useful applications

  3. HMM Notation • Recall, an HMM is denoted λ = (A,B,π) • Observation sequence is O • Notation: T = length of the observation sequence, N = number of states in the model, M = number of observation symbols, A = state transition probability matrix, B = observation probability matrix, π = initial state distribution

  4. Hidden Markov Models • Among the many uses for HMMs… • Speech analysis • Music search engine • Malware detection • Intrusion detection systems (IDS) • Many more, and more all the time

  5. Limitations of HMMs • Positional information not considered • HMM has no “memory” • Higher order models have some memory • But no explicit use of positional information • Does not handle insertions or deletions • These limitations are serious problems in some applications • In bioinformatics string comparison, sequence alignment is critical • Also, insertions and deletions occur

  6. Profile HMM • Profile HMM (PHMM) designed to overcome limitations on previous slide • In some ways, PHMM easier than HMM • In some ways, PHMM more complex • The basic idea of PHMM • Define multiple B matrices • Almost like having an HMM for each position in sequence

  7. PHMM • In bioinformatics, begin by aligning multiple related sequences • Multiple sequence alignment (MSA) • This is like training phase for HMM • Generate PHMM based on given MSA • Easy, once MSA is known • Hard part is generating MSA • Then can score sequences using PHMM • Use forward algorithm, like HMM

  8. Generic View of PHMM • Circles are Delete states • Diamonds are Insert states • Rectangles are Match states • Match states correspond to HMM states • Arrows are possible transitions • Each transition has associated probability • Transition probabilities are A matrix • Emission probabilities are B matrices • In PHMM, observations are emissions • Match and insert states have emissions

  9. Generic View of PHMM • Circles are Delete states, diamonds are Insert states, rectangles are Match states • Also, begin and end states • [The slide's diagram of the generic PHMM state structure is not reproduced here]

  10. PHMM Notation • Notation used below: X = (x1,…,xn) is the observation (emission) sequence • Mi, Ii, and Di are the match, insert, and delete states at position i • aMi,Mi+1 and similar denote state transition probabilities • eMi(k) and similar denote emission probabilities

  11. PHMM • Match state probabilities easily determined from MSA, that is • aMi,Mi+1 gives transitions between match states • eMi(k) gives the emission probability of symbol k at match state Mi • Note: there are other transition probabilities • For example, aMi,Ii and aMi,Di+1 • Emissions occur at all match and insert states • Remember, emission == observation

  12. MSA • First we show MSA construction • This is the difficult part • Lots of ways to do this • “Best” way depends on specific problem • Then construct PHMM from MSA • The easy part • Standard algorithm for this • How to score a sequence? • Forward algorithm, similar to HMM

  13. MSA • How to construct MSA? • Construct pairwise alignments • Combine pairwise alignments to obtain MSA • Allow gaps to be inserted • Makes better matches • But gaps tend to weaken scoring • So there is a tradeoff

  14. Global vs Local Alignment • In these pairwise alignment examples • “-” is a gap • “|” marks aligned symbols • “*” marks omitted beginning and ending symbols • [The slide's example alignments are not reproduced here]

  15. Global vs Local Alignment • Global alignment is lossless • But gaps tend to proliferate • And gaps increase when we do MSA • More gaps implies more sequences match • So, result is less useful for scoring • We usually only consider local alignment • That is, omit ends for better alignment • For simplicity, we assume global alignment here

  16. Pairwise Alignment • We allow gaps when aligning • How to score an alignment? • Based on an n×n substitution matrix S • Where n is the number of symbols • What algorithm(s) to align sequences? • Usually, dynamic programming • Sometimes, HMM is used • Other? • Local alignment raises more issues

  17. Pairwise Alignment • Example • Note gaps vs misaligned elements • Depends on S and the gap penalty • [The slide's example alignment is not reproduced here]

  18. Substitution Matrix • Masquerade detection • Detect an imposter using a legitimate user's account • Consider 4 different operations • E == send email • G == play games • C == C programming • J == Java programming • How similar are these to each other?

  19. Substitution Matrix • Consider 4 different operations: • E, G, C, J • Possible substitution matrix (see the sketch below): • Diagonal entries are matches • High positive scores • Which other pairs are most similar? • J and C, so substituting C for J is a high score • Game playing and programming are very different • So substituting G for C is a negative score
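
To make this concrete, here is a minimal sketch of such a substitution matrix in Python. The numeric scores are assumptions for illustration, not the values from the original slide; only the qualitative pattern (high diagonal, C/J positive, games vs. programming negative) comes from the text above.

```python
# Hypothetical substitution scores for the four operations:
# E (email), G (games), C (C programming), J (Java programming).
S = {
    ("E", "E"): 9, ("G", "G"): 9, ("C", "C"): 9, ("J", "J"): 9,  # matches
    ("C", "J"): 7,                   # both programming, so a high score
    ("E", "G"): -4, ("E", "C"): -4, ("E", "J"): -4,
    ("G", "C"): -5, ("G", "J"): -5,  # games vs programming: very different
}

def score(a, b):
    """Look up a substitution score, treating S as symmetric."""
    return S[(a, b)] if (a, b) in S else S[(b, a)]
```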

  20. Substitution Matrix • Depending on the problem, it might be easy or very difficult to get a useful S matrix • Consider masquerade detection based on UNIX commands • Sometimes it is difficult to say how “close” two commands are • Suppose instead we are aligning DNA sequences • Then there is a biological rationale for the closeness of symbols

  21. Gap Penalty • Generally must allow gaps to be inserted • But gaps make an alignment more generic • So, less useful for scoring • Therefore, we penalize gaps • How to penalize gaps? • Linear gap penalty function • f(g) = dg (i.e., constant penalty d per gap) • Affine gap penalty function • f(g) = a + e(g - 1) • Gap opening penalty a, then constant penalty e for each additional gap
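
A minimal sketch of the two penalty functions, where g is the length of a run of gaps. The parameter names d, a, and e follow the slide; the default values are arbitrary assumptions.

```python
def linear_gap_penalty(g, d=2.0):
    """Linear penalty f(g) = d*g: a constant penalty per gap."""
    return d * g

def affine_gap_penalty(g, a=3.0, e=1.0):
    """Affine penalty f(g) = a + e*(g - 1): opening a gap costs a,
    and each additional gap in the same run costs e."""
    return a + e * (g - 1) if g > 0 else 0.0
```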

  22. Pairwise Alignment Algorithm • We use dynamic programming • Based on the S matrix and a gap penalty function • Notation (standard DP formulation): the sequences to align are x = (x1,…,xn) and y = (y1,…,ym), and F(i,j) is the score of the best alignment of x1,…,xi with y1,…,yj

  23. Pairwise Alignment DP • Initialization: F(0,0) = 0, F(i,0) = F(i-1,0) - d, and F(0,j) = F(0,j-1) - d (a prefix aligned entirely against gaps, with linear gap penalty d) • Recursion: F(i,j) = max{ F(i-1,j-1) + S(xi,yj), F(i-1,j) - d, F(i,j-1) - d } • (See the sketch below)
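
A minimal sketch of this global alignment DP (Needleman-Wunsch style) in Python, using the linear gap penalty. The score function is the hypothetical one sketched earlier; everything else follows directly from the initialization and recursion above.

```python
def align_score(x, y, score, d=2.0):
    """Best global alignment score of sequences x and y."""
    n, m = len(x), len(y)
    # F[i][j] = best score aligning x[:i] with y[:j]
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):      # initialization: x prefix vs all gaps
        F[i][0] = F[i - 1][0] - d
    for j in range(1, m + 1):      # initialization: y prefix vs all gaps
        F[0][j] = F[0][j - 1] - d
    for i in range(1, n + 1):      # recursion
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),  # align xi, yj
                F[i - 1][j] - d,   # gap in y
                F[i][j - 1] - d,   # gap in x
            )
    return F[n][m]

print(align_score("EEGJC", "EEJJC", score))  # uses the matrix sketched above
```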

  24. MSA from Pairwise Alignments • Given pairwise alignments… • …how to construct MSA? • Generic approach is “progressive alignment” • Select one pairwise alignment • Select another and combine with first • Continue to add more until all are combined • Relatively easy (good) • Gaps may proliferate, unstable (bad)

  25. MSA from Pairwise Alignments • Lots of ways to improve on generic progressive alignment • Here, we mention one such approach • Not necessarily “best” or most popular • Feng-Doolittle progressive alignment • Compute scores for all pairs of n sequences • Select n-1 alignments that a) “connect” all sequences and b) maximize pairwise scores • Then generate a minimum spanning tree • For MSA, add sequences in the order that they appear in the spanning tree

  26. MSA Construction • Create pairwise alignments • Generate a substitution matrix • Use dynamic programming for the pairwise alignments • Use pairwise alignments to make the MSA • Use pairwise alignment scores to construct a spanning tree (e.g., Prim's Algorithm, sketched below) • Add sequences to the MSA in spanning tree order (from highest score, inserting gaps as needed) • Note: the gap penalty is used here
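
A minimal sketch of the spanning tree step using Prim's algorithm over pairwise alignment scores. Since we want to combine the highest-scoring pairs, this builds a maximum-score spanning tree; the choice of starting sequence and the scores dict layout are assumptions for illustration.

```python
def prim_order(seqs, scores):
    """Spanning-tree edges, in the order sequences would join the MSA.

    seqs   -- list of sequence identifiers
    scores -- dict mapping frozenset({a, b}) to a pairwise alignment score
    """
    in_tree = {seqs[0]}            # arbitrary starting sequence (assumption)
    edges = []
    while len(in_tree) < len(seqs):
        # pick the highest-scoring edge that connects one new sequence
        a, b = max(
            ((u, v) for u in in_tree for v in seqs if v not in in_tree),
            key=lambda e: scores[frozenset(e)],
        )
        edges.append((a, b))
        in_tree.add(b)
    return edges
```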

  27. MSA Example • Suppose 10 sequences, with pairwise alignment scores as given on the slide • [The table of pairwise scores is not reproduced here]

  28. MSA Example: Spanning Tree • Spanning tree based on the pairwise scores • [The spanning tree diagram is not reproduced here] • So process pairs in the following order: (5,4), (5,8), (8,3), (3,2), (2,7), (2,1), (1,6), (6,10), (10,9)

  29. MSA Snapshot • Intermediate step and final result • Use “+” for the neutral symbol • Then “-” for gaps in the MSA • Note the increase in gaps • [The slide's intermediate and final MSAs are not reproduced here]

  30. PHMM from MSA • For the PHMM, must determine match and insert states and probabilities from the MSA • “Conservative” columns are match states • Half or fewer of the symbols are gaps • Other columns are insert states • Majority of the symbols are gaps • Delete states are a separate issue • (See the sketch below)
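
A minimal sketch of this column classification rule in Python. The example MSA is hypothetical; a column counts as a match state when half or fewer of its symbols are gaps, otherwise it belongs to an insert state.

```python
def classify_columns(msa):
    """Label each MSA column 'M' (match) or 'I' (insert)."""
    n_rows = len(msa)
    labels = []
    for col in zip(*msa):                       # iterate column by column
        gaps = sum(1 for c in col if c == "-")
        labels.append("M" if gaps <= n_rows / 2 else "I")
    return labels

msa = ["AC--G",                                 # hypothetical MSA
       "AG--G",
       "A-TCG"]
print(classify_columns(msa))                    # ['M', 'M', 'I', 'I', 'M']
```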

  31. PHMM States from MSA • Consider a simpler MSA… • [The slide's MSA is not reproduced here] • Columns 1, 2, and 6 are match states 1, 2, and 3, respectively • Since each has half or fewer gaps • Columns 3, 4, and 5 are combined to form insert state 2 • Since each has more than half gaps • The insert state sits between match states 2 and 3

  32. PHMM Probabilities from MSA • Emission probabilities • Based on the symbol distribution in match and insert states • State transition probabilities • Based on the transitions in the MSA

  33. PHMM Probabilities from MSA • Emission probabilities • But 0 probabilities are bad • The model “overfits” the data • So, use the “add one” rule • Add one to each numerator, and add the total number of symbols to each denominator
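
A minimal sketch of match-state emission probabilities with the add-one rule: each symbol count gets one added, and the alphabet size is added to the denominator, so no emission probability is zero. The column and alphabet here are hypothetical.

```python
def emission_probs(column, alphabet):
    """Add-one-smoothed emission distribution for one match column."""
    observed = [c for c in column if c != "-"]  # gaps do not emit
    total = len(observed) + len(alphabet)       # add-one denominator
    return {s: (observed.count(s) + 1) / total for s in alphabet}

print(emission_probs("AAC-", "ACGT"))
# A: 3/7, C: 2/7, G: 1/7, T: 1/7
```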

  34. PHMM Probabilities from MSA • More emission probabilities • But 0 probabilities are bad • The model “overfits” the data • Again, use the “add one” rule • Add one to each numerator, and add the total number of symbols to each denominator

  35. PHMM Probabilities from MSA • Transition probabilities • We look at some examples • Note that “-” in the MSA corresponds to a delete state • First, consider the begin state • Again, use the add one rule

  36. PHMM Probabilities from MSA • Transition probabilities • When there is no information in the MSA, set the probabilities to uniform • For example, I1 does not appear in the MSA, so we set its three outgoing transitions to aI1,M2 = aI1,I1 = aI1,D2 = 1/3

  37. PHMM Probabilities from MSA • Transition probabilities, another example • What about transitions from state D1? • In the MSA it can only go to M2, so without smoothing aD1,M2 = 1 • Again, use the add one rule (worked below)
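
A hedged worked example of the add-one rule for this case. The exact counts come from the slide's MSA, which is not reproduced; here we assume a single observed transition out of D1, and that D1 can transition to M2, I1, or D2. Add one to each of the three possible transitions, and three to the denominator:

aD1,M2 = (1 + 1)/(1 + 3) = 1/2, aD1,I1 = (0 + 1)/(1 + 3) = 1/4, aD1,D2 = (0 + 1)/(1 + 3) = 1/4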

  38. PHMM Emission Probabilities • Emission probabilities for the given MSA, using the add-one rule • [The slide's table of emission probabilities is not reproduced here]

  39. PHMM Transition Probabilities • Transition probabilities for the given MSA, using the add-one rule • [The slide's table of transition probabilities is not reproduced here]

  40. PHMM Summary • Construct pairwise alignments • Usually, use dynamic programming • Use these to construct MSA • Lots of ways to do this • Using MSA, determine probabilities • Emission probabilities • State transition probabilities • In effect, we have trained a PHMM • Now what???

  41. PHMM Scoring • Want to score sequences to see how closely they match PHMM • How did we score sequences with HMM? • Forward algorithm • How to score sequences with PHMM? • Forward algorithm • But, algorithm is a little more complex • Due to complex state transitions

  42. Forward Algorithm • Notation • Index i runs over the observation sequence and index j runs over the states of the model • xi is the ith observation symbol • qxi is the probability of xi in the “random model” • FMj(i), FIj(i), and FDj(i) are the scores of x1,…,xi up to match, insert, or delete state j (note that in a PHMM, i and j may not agree) • Base case is FM0(0) = 0 • Some states are undefined • Undefined states are ignored in the calculation

  43. Forward Algorithm • Compute P(X|λ) recursively • Note that FMj(i) depends on FMj-1(i-1), FIj-1(i-1), and FDj-1(i-1) • And on the corresponding state transition probabilities (and similarly for FIj(i) and FDj(i)) • (See the sketch below)
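
A minimal sketch of the PHMM forward algorithm, written with plain probabilities for readability; a real implementation works in log space to avoid underflow, which is where the log-odds ratio against the random model q comes in. All parameter layouts here are assumptions: a is a dict of transition probabilities keyed by state-name pairs like ("M0", "M1"), with "End" as an assumed name for the end state; eM and eI map a state index to an emission distribution; q is the "random model" distribution. Undefined states simply carry probability 0.

```python
def phmm_forward(x, N, a, eM, eI, q):
    """Score sequence x against a PHMM with N match states."""
    n = len(x)
    # FM[j][i], FI[j][i], FD[j][i]: probability of emitting x_1..x_i
    # and ending in match/insert/delete state j; 0.0 = undefined.
    FM = [[0.0] * (n + 1) for _ in range(N + 1)]
    FI = [[0.0] * (n + 1) for _ in range(N + 1)]
    FD = [[0.0] * (n + 1) for _ in range(N + 1)]
    FM[0][0] = 1.0                       # base case: the begin state (M0)
    t = lambda s, u: a.get((s, u), 0.0)  # missing transitions have prob 0
    for i in range(n + 1):
        for j in range(N + 1):
            if j > 0 and i > 0:          # match state Mj emits x_i
                sym = x[i - 1]
                FM[j][i] = (eM[j][sym] / q[sym]) * (
                    t(f"M{j-1}", f"M{j}") * FM[j - 1][i - 1]
                    + t(f"I{j-1}", f"M{j}") * FI[j - 1][i - 1]
                    + t(f"D{j-1}", f"M{j}") * FD[j - 1][i - 1])
            if i > 0:                    # insert state Ij emits x_i
                sym = x[i - 1]
                FI[j][i] = (eI[j][sym] / q[sym]) * (
                    t(f"M{j}", f"I{j}") * FM[j][i - 1]
                    + t(f"I{j}", f"I{j}") * FI[j][i - 1]
                    + t(f"D{j}", f"I{j}") * FD[j][i - 1])
            if j > 0:                    # delete state Dj emits nothing
                FD[j][i] = (t(f"M{j-1}", f"D{j}") * FM[j - 1][i]
                            + t(f"I{j-1}", f"D{j}") * FI[j - 1][i]
                            + t(f"D{j-1}", f"D{j}") * FD[j - 1][i])
    # a transition into the end state closes out the score
    return (t(f"M{N}", "End") * FM[N][n]
            + t(f"I{N}", "End") * FI[N][n]
            + t(f"D{N}", "End") * FD[N][n])
```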

  44. PHMM • We will see examples of PHMM later • In particular, • Malware detection based on opcodes • Masquerade detection based on UNIX commands

  45. References • R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, 1998 • L. Huang and M. Stamp, Masquerade detection using profile hidden Markov models, to appear in Computers and Security • S. Attaluri, S. McGhee, and M. Stamp, Profile hidden Markov models for metamorphic virus detection, Journal in Computer Virology, Vol. 5, No. 2, May 2009, pp. 151-169
