
Sequence classification & hidden Markov models



Presentation Transcript


  1. Bioinformatics, Models & algorithms, 8th November 2005 Patrik Johansson, Dept. of Cell & Molecular Biology, Uppsala University Sequence classification & hidden Markov models

  2. A family of proteins shares a similar structure, but not necessarily a similar sequence

  3. Classification of an unknown sequence s to family A or B using HMMs. [Figure: the query sequence s placed between a cluster of family A sequences and a cluster of family B sequences.]

  4. Hidden Markov models, introduction • General method for pattern recognition, compare e.g. neural networks • An HMM generates sequences / sequence distributions • Markov chain of events: three coins A, B & C give a Markov chain Γ = CAABA.. The outcome, e.g. Heads Heads Tails, is generated by the hidden Markov chain Γ. [Figure: three coin states A, B and C connected by transitions.]

  5. Hidden Markov models, introduction.. • The model M emits a symbol (T, H) in each state i according to an emission probability e_i • The next state j is chosen according to a transition probability a_i,j e.g. the sequence s = ‘Tails Heads Tails’ generated over the path Γ = BCC, as in the sketch below. [Figure: the coin chain emitting Heads Tails Tails.]
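A minimal sketch (not from the original slides) of how such a model generates data: three hypothetical coin states A, B and C with made-up emission and transition probabilities, walked to produce a hidden path Γ and an observed Heads/Tails sequence.

import random

states = ["A", "B", "C"]

# Hypothetical emission probabilities e_i: probability of Heads for each coin.
emit_heads = {"A": 0.5, "B": 0.8, "C": 0.2}

# Hypothetical transition probabilities a_ij between the coins.
trans = {
    "A": {"A": 0.6, "B": 0.2, "C": 0.2},
    "B": {"A": 0.3, "B": 0.4, "C": 0.3},
    "C": {"A": 0.1, "B": 0.4, "C": 0.5},
}

def generate(length, start="B"):
    """Walk the hidden Markov chain, emitting one symbol per visited state."""
    path, symbols, state = [], [], start
    for _ in range(length):
        path.append(state)
        symbols.append("Heads" if random.random() < emit_heads[state] else "Tails")
        # Pick the next hidden state from the current state's transition distribution.
        state = random.choices(states, weights=[trans[state][s] for s in states])[0]
    return "".join(path), symbols

path, symbols = generate(3)
print(path, symbols)   # e.g. BCC ['Tails', 'Heads', 'Tails']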

  6. Profile hidden Markov model architecture • A first approach to sequence distribution modelling: a chain of match states M_1 .. M_j .. M_N between a begin state B and an end state E. [Figure: B → M_1 → … → M_N → E.]

  7. Profile hidden Markov model architecture.. • Insertion modelling: an insert state I_j is added between M_j and M_{j+1}. Insertions are treated as random; e_j^I(a) = q(a). [Figure: insert state I_j looping above the match chain M_{j-1} → M_j → M_{j+1} between B and E.]

  8. Profile hidden Markov model architecture.. • Deletion modelling: a silent delete state D_j allows match state M_j to be skipped. Alternatively, direct jump transitions between match states can be used. [Figure: delete state D_j above the match chain.]

  9. Profile hidden Markov model architecture.. Insert & delete states are generalized to all positions. The model M can generate sequences by successive emissions and transitions from state B to state E; a sketch of the full topology follows below. [Figure: full profile HMM with states B, M_j, I_j, D_j and E.]
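As an illustration (my own sketch, following the standard Durbin-style topology rather than anything shown on the slide), the states and allowed transitions of a length-N model can be enumerated like this:

def profile_hmm_topology(N):
    """States and allowed transitions of a length-N profile HMM (B, M_j, I_j, D_j, E)."""
    states = ["B", "I0"] + [f"{k}{j}" for j in range(1, N + 1) for k in ("M", "I", "D")] + ["E"]

    def successors(j):
        # From column j: advance to M_{j+1} (or E after the last column),
        # stay in the insert state I_j, or skip ahead to D_{j+1}.
        out = [f"M{j + 1}" if j < N else "E", f"I{j}"]
        if j < N:
            out.append(f"D{j + 1}")
        return out

    transitions = [(s, t) for s in ("B", "I0") for t in successors(0)]
    for j in range(1, N + 1):
        for kind in ("M", "I", "D"):
            transitions += [(f"{kind}{j}", t) for t in successors(j)]
    return states, transitions

states, transitions = profile_hmm_topology(3)
print(len(states), "states and", len(transitions), "transitions")  # 12 states and 30 transitions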

  10. Probabilistic sequence modelling • Classification criterion: assign s to model M when the posterior probability is large enough,
P(M | s) ≥ threshold  (1)
• Bayes theorem;
P(M | s) = P(s | M) P(M) / P(s)  (2)
• ..but P(M) & P(s) are unknown, so the model is instead compared to a null model N;
P(M | s) / P(N | s) = [P(s | M) P(M)] / [P(s | N) P(N)]  (3)

  11. Probabilistic sequence modelling.. If N models the whole sequence space, i.e. emits every residue with the background probability q,
P(s | N) = ∏_i q(s_i)  (4)
Since the probabilities become very small for long sequences, logarithms are more convenient. Def., log-odds score V;
V = log_z [ P(s | M) / P(s | N) ]  (5)

  12. Probabilistic sequence modelling.. Eq. (4) & (5) give a new classification criterion;
score = log_z [ P(s | M) / P(s | q) ] ≥ d  (6)
..for a certain significance level ε (i.e. the number of incorrect classifications accepted in a database of n sequences) a threshold d is required;
d = log_z (n / ε)  (7)

  13. Probabilistic sequence modelling.. Example: with the significance level chosen as one incorrect classification (false positive) per 1000 sequences scored in a database of n = 10000 sequences, i.e. ε = 10;
d = log_2(10000 / 10) = log_2(1000) ≈ 10.0 bits (z = 2), or
d = ln(1000) ≈ 6.9 nits (z = e)
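A small numeric sketch of this (my own; it assumes the threshold formula d = log_z(n / ε) as reconstructed in eq. (7), with the example numbers from the slide):

import math

def threshold(n, false_positives, base=2.0):
    """Score threshold d = log_base(n / false_positives), eq. (7) as reconstructed above."""
    return math.log(n / false_positives, base)

n = 10_000
eps = n / 1000                                   # one false positive per 1000 sequences scored
print(threshold(n, eps, base=2), "bits")         # log2(1000) ~ 9.97 bits
print(threshold(n, eps, base=math.e), "nits")    # ln(1000)  ~ 6.91 nits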

  14. Large vs. small threshold d. [Figure: with a high d only true family members (A) score above the threshold; with a low d more A sequences are found, but some non-members (B) also score above the threshold, giving false positives.]

  15. Model characteristics One can define sensitivity, ‘how many are found’;
sensitivity = TP / (TP + FN)
..and selectivity, ‘how many are correct’;
selectivity = TP / (TP + FP)
A small sketch of both follows below.
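A tiny sketch of the two measures (the standard definitions in terms of true/false positives and false negatives; the function names are mine, not the slide's):

def sensitivity(tp, fn):
    """Fraction of the true family members that the model finds: TP / (TP + FN)."""
    return tp / (tp + fn)

def selectivity(tp, fp):
    """Fraction of the reported hits that really are family members: TP / (TP + FP)."""
    return tp / (tp + fp)

print(sensitivity(tp=90, fn=10))   # 0.90 -- 90 of 100 family members found
print(selectivity(tp=90, fp=30))   # 0.75 -- 90 of 120 reported hits are correct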

  16. Model construction • From initial alignment: the most common method. Start from an initial multiple alignment of e.g. a protein family • Iteratively: by successive database searches, incorporating new similar sequences into the model • Neural-inspired: the model is trained using some continuous optimization algorithm, e.g. Baum-Welch, steepest descent etc.

  17. Model construction.. A short family alignment gives a simple model M, potential match states marked with an asterisk (*). [Figure: the alignment and the resulting model between states B and E.]

  18. Model construction.. A more generalized model. Example: evaluate the sequence s = ‘AIEH’. [Figure: successively more general model topologies between states B and E.]

  19. Sequence evaluation The optimal alignment, i.e. the path that has the greatest probability of generating the sequence s, can be determined through dynamic programming. The maximum log-odds score V_j^M(s_i) for match state j emitting s_i is calculated from the emission score plus the previous maximum score plus the transition score. [Figure: match state M_j reached from M_{j-1}, I_{j-1} or D_{j-1}.]

  20. Sequence evaluation.. The Viterbi algorithm,
V_j^M(i) = log[ e_Mj(s_i) / q(s_i) ] + max{ V_{j-1}^M(i-1) + log a_{Mj-1,Mj}, V_{j-1}^I(i-1) + log a_{Ij-1,Mj}, V_{j-1}^D(i-1) + log a_{Dj-1,Mj} }  (8)
V_j^I(i) = log[ e_Ij(s_i) / q(s_i) ] + max{ V_j^M(i-1) + log a_{Mj,Ij}, V_j^I(i-1) + log a_{Ij,Ij}, V_j^D(i-1) + log a_{Dj,Ij} }  (9)
V_j^D(i) = max{ V_{j-1}^M(i) + log a_{Mj-1,Dj}, V_{j-1}^I(i) + log a_{Ij-1,Dj}, V_{j-1}^D(i) + log a_{Dj-1,Dj} }  (10)
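A compact Python sketch of this recursion (my own layout and variable names; it assumes emission tables e_match / e_insert, a per-column transition table a and a background distribution q stored as plain dictionaries, which is not necessarily how the lecture's software represents them):

import math

NEG = float("-inf")

def safe_log(p):
    return math.log(p) if p > 0 else NEG

def viterbi_log_odds(seq, N, e_match, e_insert, a, q):
    """Best log-odds score of sequence `seq` against a length-N profile HMM.

    e_match[j][x]  : emission probability of residue x in match state M_j (j = 1..N)
    e_insert[j][x] : emission probability of residue x in insert state I_j (j = 0..N)
    a[j][t]        : transition probability out of column j, t in
                     {"MM","MI","MD","IM","II","ID","DM","DI","DD"};
                     column 0 stands for the begin state B, column N leads into E
    q[x]           : background distribution
    """
    L = len(seq)
    VM = [[NEG] * (L + 1) for _ in range(N + 1)]
    VI = [[NEG] * (L + 1) for _ in range(N + 1)]
    VD = [[NEG] * (L + 1) for _ in range(N + 1)]
    VM[0][0] = 0.0  # begin state, nothing emitted yet

    for j in range(N + 1):
        for i in range(L + 1):
            x = seq[i - 1] if i > 0 else None
            if j > 0 and i > 0:
                # (8) match state M_j emits residue x_i
                VM[j][i] = safe_log(e_match[j][x] / q[x]) + max(
                    VM[j - 1][i - 1] + safe_log(a[j - 1]["MM"]),
                    VI[j - 1][i - 1] + safe_log(a[j - 1]["IM"]),
                    VD[j - 1][i - 1] + safe_log(a[j - 1]["DM"]),
                )
            if j > 0:
                # (10) delete state D_j is silent, it emits nothing
                VD[j][i] = max(
                    VM[j - 1][i] + safe_log(a[j - 1]["MD"]),
                    VI[j - 1][i] + safe_log(a[j - 1]["ID"]),
                    VD[j - 1][i] + safe_log(a[j - 1]["DD"]),
                )
            if i > 0:
                # (9) insert state I_j emits residue x_i
                VI[j][i] = safe_log(e_insert[j][x] / q[x]) + max(
                    VM[j][i - 1] + safe_log(a[j]["MI"]),
                    VI[j][i - 1] + safe_log(a[j]["II"]),
                    VD[j][i - 1] + safe_log(a[j]["DI"]),
                )
    # transition from the last column into the end state E
    return max(VM[N][L] + safe_log(a[N]["MM"]),
               VI[N][L] + safe_log(a[N]["IM"]),
               VD[N][L] + safe_log(a[N]["DM"]))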

  21. Parameter estimation, background • Proteins with similar structures can have very different sequences • Classical sequence alignment based only on heuristic rules & parameters cannot deal with sequence identities below ~ 50-60% • Substitution matrices add static a priori information about amino acids and protein sequences → good alignments down to ~ 25-30% sequence identity, e.g. CLUSTAL • How to get further down into ‘the twilight zone’..? - More, and dynamic, a priori information..!

  22. Parameter estimation Probability of emitting an alanine in the first match state, e_M1(‘A’)..? • Maximum likelihood estimation;
e_Mj(a) = c_j(a) / Σ_a' c_j(a')
where c_j(a) is the observed count of amino acid a in alignment column j

  23. Parameter estimation.. • Add-one pseudocount estimation;
e_Mj(a) = (c_j(a) + 1) / (Σ_a' c_j(a') + 20)
• Background pseudocount estimation, with total pseudocount weight A and background distribution q;
e_Mj(a) = (c_j(a) + A·q(a)) / (Σ_a' c_j(a') + A)
A sketch of all three estimators follows below.
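The three estimators can be compared directly on a toy count vector (a sketch with my own function names; a 20-letter amino acid alphabet is assumed in the add-one case):

def ml_estimate(counts):
    """Maximum likelihood: e(a) = c(a) / sum_a' c(a')."""
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()}

def add_one_estimate(counts, alphabet):
    """Add-one (Laplace) pseudocounts: e(a) = (c(a) + 1) / (sum_a' c(a') + |alphabet|)."""
    total = sum(counts.values()) + len(alphabet)
    return {a: (counts.get(a, 0) + 1) / total for a in alphabet}

def background_estimate(counts, q, A=20):
    """Background pseudocounts: e(a) = (c(a) + A*q(a)) / (sum_a' c(a') + A)."""
    total = sum(counts.values()) + A
    return {a: (counts.get(a, 0) + A * q[a]) / total for a in q}

# e.g. a column with four alanines and one glycine out of five sequences
counts = {"A": 4, "G": 1}
print(ml_estimate(counts)["A"])   # 0.8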

  24. Parameter estimation.. • Substitution mixture estimation. Score: the substitution score gives conditional probabilities P(a | b) → maximum likelihood column frequencies f_j(b) give pseudocounts;
α_j(a) = A Σ_b f_j(b) P(a | b)
Total estimation:
e_Mj(a) = (c_j(a) + α_j(a)) / Σ_a' (c_j(a') + α_j(a'))
A sketch follows below.
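A sketch of the substitution-mixture idea under those assumptions (the conditional probabilities P(a | b) are taken as given, e.g. derived from a substitution matrix; function names and toy inputs are mine):

def substitution_pseudocounts(counts, cond, A=20):
    """Pseudocounts alpha_j(a) = A * sum_b f_j(b) * P(a | b).

    counts : observed residue counts in column j
    cond   : cond[b][a] = P(a | b), conditional substitution probabilities
    A      : total pseudocount weight
    """
    total = sum(counts.values())
    freqs = {b: c / total for b, c in counts.items()}          # maximum likelihood f_j(b)
    alphabet = list(next(iter(cond.values())).keys())
    return {a: A * sum(freqs[b] * cond[b][a] for b in freqs) for a in alphabet}

def total_estimate(counts, pseudo):
    """e_j(a) = (c_j(a) + alpha_j(a)) / sum_a' (c_j(a') + alpha_j(a'))."""
    total = sum(counts.values()) + sum(pseudo.values())
    return {a: (counts.get(a, 0) + pseudo[a]) / total for a in pseudo}

cond = {  # toy conditional substitution probabilities over a 3-letter alphabet
    "A": {"A": 0.7, "G": 0.2, "V": 0.1},
    "G": {"A": 0.2, "G": 0.7, "V": 0.1},
    "V": {"A": 0.1, "G": 0.2, "V": 0.7},
}
alpha = substitution_pseudocounts({"A": 4, "G": 1}, cond)
print(total_estimate({"A": 4, "G": 1}, alpha))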

  25. Parameter estimation.. All the methods above are, in spite of their dynamic implementation, still based on heuristic parameters. A method that compensates for and complements the lack of data in a statistically correct way; • Dirichlet mixture estimation. Looking at sequence alignments, several different amino acid distributions seem to recur, not just the background distribution q. Assume that there are k probability densities that generate these.

  26. Parameter estimation, Dirichlet mixture style.. Given the data, a count vector n, this method forms a linear combination of the k individual estimations, weighted with the probability that n is generated by each component. The k components can be modelled from a curated database of alignments. Using some parametric form of the probability density, an explicit expression for the probability that n has been generated by the j:th component can be derived; a sketch follows below.
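A sketch of that computation (mixture weights and parameters below are invented toy numbers; the formulas follow the standard Dirichlet mixture treatment, e.g. Durbin et al., rather than the slide's exact notation):

from math import exp, lgamma

def log_marginal(n, alpha):
    """log P(n | component), up to terms that do not depend on alpha."""
    s_n, s_a = sum(n.values()), sum(alpha.values())
    out = lgamma(s_a) - lgamma(s_n + s_a)
    for a in alpha:
        out += lgamma(n.get(a, 0) + alpha[a]) - lgamma(alpha[a])
    return out

def dirichlet_mixture_estimate(n, mixture):
    """Estimate e(a) from count vector n and a mixture of (weight, alpha dict) components."""
    scored = [(m, alpha, log_marginal(n, alpha)) for m, alpha in mixture]
    top = max(l for _, _, l in scored)
    weights = [(m * exp(l - top), alpha) for m, alpha, l in scored]  # unnormalized P(k | n)
    z = sum(w for w, _ in weights)
    e = {a: 0.0 for _, alpha in mixture for a in alpha}
    for w, alpha in weights:
        post = w / z                                              # P(k | n)
        total = sum(n.values()) + sum(alpha.values())
        for a in e:
            e[a] += post * (n.get(a, 0) + alpha[a]) / total       # weighted component estimate
    return e

# Toy two-component mixture over a three-letter alphabet (illustrative numbers only).
mixture = [
    (0.6, {"A": 2.0, "G": 0.5, "V": 0.5}),   # component describing alanine-rich columns
    (0.4, {"A": 0.5, "G": 0.5, "V": 2.0}),   # component describing valine-rich columns
]
print(dirichlet_mixture_estimate({"A": 3}, mixture))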

  27. Parameter estimation, Dirichlet Mixture style.. n The k components describe peaks of aa distributions in some kind of multidimensional space Depending on where in sequence space our countvector n lies, i.e. depending on which components that can be assumed to have generatedn, distribution information is incorporated into the probability estimation e

  28. Classification example Alignment of some known glycoside hydrolase family 16 sequences • Define which columns are to be regarded as match states (*) • Build the corresponding model M & HMM graph • Estimate all emission and transition probabilities, e_j & a_jk • Evaluate the log-odds score / probability that an unknown sequence s has been generated by M using the Viterbi algorithm • If score(s | M) > d, the sequence can be classified as a GH16 family member

  29. Classification example.. A certain sequence s1=WHKLRQ.. is evaluated and gets a score of -17.63 nits, i.e. the probability that M has generated s1 is very small. Another sequence s2=SDGSYT.. gets a score of 27.49 nits and can be classified as a family member with good significance.

  30. Summary • Hidden Markov models are used mainly for classification / searching (PFAM), but also for sequence mapping / alignment • As compared to normal alignment, a position specific approach is used for sequence distributions, insertions and deletions • Model building is usually a compromise between sensitivity and selectivity. If more a priori information is incorporated, the sensitivity goes up whereas the selectivity goes down
