
Profile Hidden Markov Models

Profile Hidden Markov Models. Bioinformatics Fall-2004 Dr Webb Miller and Dr Claude Depamphilis Dhiraj Joshi Department of Computer Science and Engineering The Pennsylvania State University. Outline. Introduction to HMMs Profile HMMs

Presentation Transcript


  1. Profile Hidden Markov Models Bioinformatics Fall-2004 Dr Webb Miller and Dr Claude Depamphilis Dhiraj Joshi Department of Computer Science and Engineering The Pennsylvania State University

  2. Outline • Introduction to HMMs • Profile HMMs • Available resources for Profile HMMs • Some online demonstrations

  3. Introduction to HMMs • Hidden Markov Models – Formalism • statistical techniques for modeling patterns in data • First order Markov property - memorylessness • state generally a hidden entity which spawns symbols or features • the same symbol could be emitted by several states • HMM characterized by transition probabilities and emission distribution
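
The formalism on slide 3 can be illustrated with a minimal sketch that samples from a toy two-state HMM. The state names, the DNA alphabet and every probability below are invented for illustration and do not come from the slides.

```python
import random

# Hypothetical two-state HMM over DNA symbols; all numbers are illustrative.
TRANSITIONS = {
    "AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
    "GC_rich": {"AT_rich": 0.2, "GC_rich": 0.8},
}
EMISSIONS = {
    "AT_rich": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "GC_rich": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}

def draw(dist, rng):
    """Draw one key from a {outcome: probability} dictionary."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]

def sample_hmm(length, start_state="AT_rich", seed=0):
    """Return (hidden state path, emitted symbol string) of the given length."""
    rng = random.Random(seed)
    state, states, symbols = start_state, [], []
    for _ in range(length):
        states.append(state)
        # Emission depends only on the current hidden state.
        symbols.append(draw(EMISSIONS[state], rng))
        # First-order Markov property: next state depends only on the current one.
        state = draw(TRANSITIONS[state], rng)
    return states, "".join(symbols)

states, seq = sample_hmm(20)
```

An observer sees only `seq`; the same symbol (for example `A`) can be emitted by either state, which is why the state path is called hidden.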

  4. Introduction to HMMs • Hidden Markov Models – Parameter Estimation • Parameters: transition probabilities and emission probabilities • iterative computational algorithms used • EM algorithm, Viterbi algorithm • algorithms based on dynamic programming to save computational cost • usually the iterations involve variants of the following two steps • estimate the state sequence which maximizes likelihood under a parameter set • update the parameter set based on the estimated state sequence • the algorithms may converge to a local rather than a global optimum
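
The two steps listed on slide 4 can be sketched as one round of Viterbi-style training: decode the most probable state path by dynamic programming, then re-estimate the parameters from counts along that path. This is a simplified illustration for a generic HMM, not the procedure of any particular package; the function names and the add-one pseudocount are assumptions.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable state path for obs and its log probability."""
    log = lambda p: math.log(p) if p > 0 else float("-inf")
    # best[t][s] = log probability of the best path ending in state s at time t
    best = [{s: log(start_p[s]) + log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((r, best[t - 1][r] + log(trans_p[r][s])) for r in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + log(emit_p[s][obs[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]

def reestimate(obs, path, states, alphabet, pseudo=1.0):
    """Update step: normalized transition and emission counts along the path."""
    trans = {r: {s: pseudo for s in states} for r in states}
    emit = {s: {a: pseudo for a in alphabet} for s in states}
    for t, (s, a) in enumerate(zip(path, obs)):
        emit[s][a] += 1
        if t + 1 < len(path):
            trans[s][path[t + 1]] += 1
    normalise = lambda row: {k: v / sum(row.values()) for k, v in row.items()}
    return ({r: normalise(trans[r]) for r in states},
            {s: normalise(emit[s]) for s in states})
```

Alternating `viterbi` and `reestimate` until the path stops changing gives a simple training loop; because each round only improves on the current parameter set, the result depends on the starting parameters.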

  5. Outline • Introduction to HMMs • Profile HMMs • Available resources for Profile HMMs • Some online demonstrations

  6. Profile Hidden Markov Models • Stochastic methods to model multiple sequence alignments – protein and DNA sequences • Potential application domains: • protein families could be modeled as an HMM or a group of HMMs • constructing a profile HMM • new protein sequences could be aligned with stored models to detect remote homology • aligning a sequence with a stored profile HMM • align two or more protein family profile HMMs to detect homology • finding statistical similarities between two profile HMM models

  7. Profile Hidden Markov Models • Constructing a profile HMM • a multiple sequence alignment is assumed as input • each consensus column is modeled by 3 states • match, insert and delete states • number of states depends upon the length of the alignment
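
As a rough sketch of the construction on slide 7, the snippet below picks consensus (match) columns from a toy alignment and lists the resulting states. The 50% gap threshold is a common heuristic assumed here, the alignment is invented, and begin/end states are omitted for brevity.

```python
def match_columns(alignment, max_gap_fraction=0.5):
    """Indices of columns treated as match (consensus) columns.

    A column counts as a match column when at most max_gap_fraction of its
    characters are gaps; this threshold is an assumption, not from the slides.
    """
    n_seqs = len(alignment)
    n_cols = len(alignment[0])
    cols = []
    for j in range(n_cols):
        gaps = sum(1 for seq in alignment if seq[j] == "-")
        if gaps / n_seqs <= max_gap_fraction:
            cols.append(j)
    return cols

def state_names(n_match):
    """Match, insert and delete states for a model with n_match consensus columns."""
    states = ["I0"]  # insert state before the first match column
    for k in range(1, n_match + 1):
        states += [f"M{k}", f"I{k}", f"D{k}"]
    return states

# Toy alignment (hypothetical data): column 3 is mostly gaps.
alignment = [
    "ACG-T",
    "AC--T",
    "A-GAT",
    "ACG-T",
]
cols = match_columns(alignment)
states = state_names(len(cols))
```

Here the mostly-gapped column is handled by an insert state, so the model has four match positions and its state count grows linearly with the number of consensus columns.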

  8. Profile Hidden Markov Models • A typical profile HMM architecture • squares represent match states • diamonds represent insert states • circles represent delete states • arrows represent transitions

  9. Profile Hidden Markov Models • A typical profile HMM architecture • transition between match states - • transition from match state to insert state - • transition within insert state - • transition from match state to delete state - • transition within delete state - • emission of symbol at a state -
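
The symbols that followed each dash on slide 9 did not survive extraction. One widely used notation, following Durbin et al. and assumed here rather than recovered from the slide, writes the quantities for consensus position $k$ and symbol $x$ as:

```latex
\begin{align*}
\text{match to match:}   &\quad a_{M_k M_{k+1}} \\
\text{match to insert:}  &\quad a_{M_k I_k} \\
\text{insert to insert:} &\quad a_{I_k I_k} \\
\text{match to delete:}  &\quad a_{M_k D_{k+1}} \\
\text{delete to delete:} &\quad a_{D_k D_{k+1}} \\
\text{emission of symbol } x \text{ in match state } k\text{:} &\quad e_{M_k}(x)
\end{align*}
```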

  10. Profile Hidden Markov Models • Estimation of parameters • transition probabilities estimated as the frequency of a transition in a given alignment • emission probabilities estimated as the frequency of an emission in a given alignment • pseudo counts usually introduced to account for transitions / emissions which were not present in the alignment

  11. Profile Hidden Markov Models • Estimation of parameters • with pseudo counts • Dirichlet prior distribution used to determine pseudo counts
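
The estimation described on slides 10 and 11 can be sketched for a single match column as follows. The formula in the docstring is the standard single-component pseudocount estimate; the function name, the uniform background and the default weight are assumptions, and a full Dirichlet mixture prior would be more elaborate than this.

```python
from collections import Counter

def emission_probs(column, alphabet, background=None, weight=None):
    """Estimate match-state emission probabilities for one alignment column.

    Implements e(a) = (c_a + A * q_a) / (sum_b c_b + A), where c_a are the
    observed counts, q_a a background distribution and A the total pseudocount
    weight. With uniform q and A = len(alphabet) this is the add-one rule.
    """
    if background is None:
        background = {a: 1.0 / len(alphabet) for a in alphabet}
    if weight is None:
        weight = float(len(alphabet))
    counts = Counter(ch for ch in column if ch != "-")  # gaps are not emissions
    total = sum(counts[a] for a in alphabet) + weight
    return {a: (counts[a] + weight * background[a]) / total for a in alphabet}

# Hypothetical column with three A's, one C and one gap.
probs = emission_probs("AAAC-", "ACGT")
```

Symbols never seen in the column (`G` and `T` here) still receive a non-zero probability, which is the purpose of the pseudo counts.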

  12. Profile Hidden Markov Models • Scoring a sequence against a profile HMM • Viterbi algorithm used to find the best state path • Simulated annealing based methods are also used • Maximization criterion: log likelihood or log odds • The log likelihood score generally depends on the length of the sequence and hence is not preferred • If an alignment is not given initially, it can be learnt iteratively using Viterbi
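
The difference between the two criteria on slide 12 can be seen in a deliberately simplified sketch that scores a sequence along the all-match path only, ignoring transitions, insertions and deletions. The emission table, the uniform background and the function names are assumptions made for illustration.

```python
import math

def log_likelihood(seq, emissions):
    """Sum of log2 emission probabilities along the all-match path (no gaps)."""
    return sum(math.log2(emissions[k][x]) for k, x in enumerate(seq))

def log_odds(seq, emissions, background):
    """Log-likelihood minus the log2 probability under the null (background) model."""
    null = sum(math.log2(background[x]) for x in seq)
    return log_likelihood(seq, emissions) - null

def peaked(symbol, alphabet="ACGT", high=0.7):
    """Hypothetical match-state distribution favouring one symbol."""
    low = (1.0 - high) / (len(alphabet) - 1)
    return {a: (high if a == symbol else low) for a in alphabet}

# Toy three-position model whose consensus is ACG.
emissions = [peaked("A"), peaked("C"), peaked("G")]
background = {a: 0.25 for a in "ACGT"}

ll_hit = log_likelihood("ACG", emissions)        # negative even for a perfect match
lo_hit = log_odds("ACG", emissions, background)  # positive: better than background
lo_miss = log_odds("TTT", emissions, background) # negative: worse than background
```

Every added residue makes the log likelihood more negative regardless of fit, whereas the log-odds score subtracts the same length-dependent cost under the null model, so its sign indicates whether the sequence fits the profile better than background.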

  13. Profile Hidden Markov Models • Comparing two profile HMMs • Profile-profile comparison tool based on information theory • based on Kullback-Leibler divergence criterion for comparing 2 statistical distributions • dynamic programming used to compare entire profiles • detect weak similarities between models
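
The Kullback-Leibler criterion named on slide 13 can be sketched for two emission distributions as below. This shows only the column-to-column measure; how a profile-profile tool combines such scores inside its dynamic programming is not reproduced here, and the two example distributions are invented.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in bits for two distributions over the same symbols.

    Assumes q[a] > 0 wherever p[a] > 0 (pseudo counts normally guarantee this).
    """
    return sum(p[a] * math.log2(p[a] / q[a]) for a in p if p[a] > 0)

# Hypothetical emission distributions of two profile columns.
col_1 = {"A": 0.5, "C": 0.5}
col_2 = {"A": 0.25, "C": 0.75}

d_12 = kl_divergence(col_1, col_2)
d_21 = kl_divergence(col_2, col_1)
```

The divergence is zero for identical columns and is not symmetric (`d_12` differs from `d_21`), so a comparison tool has to choose a direction or use a symmetrized variant.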

  14. Outline • Introduction to HMMs • Profile HMMs • Available resources for Profile HMMs • Some online demonstrations

  15. Available resources for Profile HMMs • HMMER and SAM were among the first available programs for profile HMMs • HMMER: S. Eddy at Washington University • SAM: Sequence Alignment and Modeling System, R. Hughey at University of California, Santa Cruz • available free for research • SAM has online servers to perform sequence comparisons http://www.cse.ucsc.edu/research/compbio/sam.html

  16. Available resources for Profile HMMs • InterPro consortium in Europe has many resources for protein data • Database of protein families and domains • Brings together several different databases under one umbrella • Pfam and SUPERFAMILY are profile HMM libraries associated with InterPro • Pfam is based on HMMER search, and SUPERFAMILY on SAM search and modeling

  17. Available resources for Profile HMMs • SAM’s iterative approach for building HMM • find a set of close homologs using BLASTP • learn the alignment and build model using close homologs • use BLASTP to get more remote homologs using the first set of sequences (relax the E value) • iteratively refine the HMM model • SAM uses Dirichlet priors as pseudo counts for parameters • Hand tuned seed alignments not required as the alignments are learnt by the algorithm – unlike HMMER

  18. Available resources for Profile HMMs • SUPERFAMILY database incorporates: • library of profile HMMs representing all proteins of known structure • assignments to predicted proteins from all completely sequenced genomes • search and alignment services • models and domain assignments are freely available • Based on SCOP classification of protein domains • SAM HMM iterative procedure used for model building and sequence alignment

  19. Available resources for Profile HMMs • In SUPERFAMILY: • Each SCOP superfamily is represented as an HMM model • Model built using the SAM procedure, based on 4 variants • accurate structure based alignments • hand labeled alignments • automatic alignments using ClustalW • sequence members used separately as seeds • Assignment of superfamilies • for a given sequence, every model is scored across the whole sequence using Viterbi scoring • the model which scores highest has its superfamily assigned to the region

  20. Outline • Introduction to HMMs • Profile HMMs • Available resources for Profile HMMs • Some online demonstrations

  21. Online Demonstrations http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/temp/624288710157514.html

  22. References • Durbin, R., Eddy, S., Krogh, A., and Mitchison, G., "Biological Sequence Analysis", Cambridge University Press, 2002 • Baldi, P. and Brunak, S., "Bioinformatics: The Machine Learning Approach", The MIT Press, Cambridge, 1998 • Eddy, S., "Profile Hidden Markov Models", Bioinformatics, vol. 14, no. 9, pp. 755-763, 1998 • Karplus, K., Barrett, C., and Hughey, R., "Hidden Markov models for detecting remote protein homologies", Bioinformatics, vol. 14, no. 10, pp. 846-856, 1998 • Madera, M. and Gough, J., "A comparison of profile hidden Markov model procedures for remote homology detection", Nucleic Acids Research, vol. 30, no. 19, pp. 4321-4328, 2002 • Gough, J., Karplus, K., Hughey, R., and Chothia, C., "Assignment of Homology to Genome Sequences using a Library of Hidden Markov Models that Represent all Proteins of Known Structure", J. Mol. Biol., vol. 313, pp. 903-919, 2001

  23. References • Yona, G. and Levitt, M., "Within the Twilight Zone: A Sensitive Profile-Profile Comparison Tool Based on Information Theory", J. Mol. Biol., vol. 315, pp. 1257-1275, 2002 • Madera, M., Vogel, C., Kummerfeld, S. K., Chothia, C., and Gough, J., "The SUPERFAMILY database in 2004: additions and improvements", Nucleic Acids Research, vol. 32, Database Issue, pp. D235-D239, 2004 • Bateman, A., Birney, E., Durbin, R., Eddy, S., Finn, R., and Sonnhammer, E., "Pfam 3.1: 1313 multiple alignments and profile HMMs match the majority of proteins", Nucleic Acids Research, vol. 27, no. 1, 1999 • Andreeva, A., et al., "SCOP database in 2004: refinements integrate structure and sequence family data", Nucleic Acids Research, vol. 32, Database Issue, pp. D226-D229, 2004 • Many other online resources and tutorials
