1 / 21

HMMs for alignments & Sequence pattern discovery

I519 Introduction to Bioinformatics. HMMs for alignments & Sequence pattern discovery. Contents. Motifs We have seen motifs in regular expression Profiles & consensus Motif search

ghazi
Download Presentation

HMMs for alignments & Sequence pattern discovery

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. I519 Introduction to Bioinformatics HMMs for alignments & Sequence pattern discovery

  2. Contents • Motifs • We have seen motifs in regular expression • Profiles & consensus • Motif search • sequence motifs represent critical positions that are conserved in evolution, so search algorithms employing motifs may be used to identify more divergent sequences than methods based on global sequence similarity • PSI-BLAST (similarity search using PSSM, Position Specific Scoring Matrix) • HMM of protein family (a very brief introduction)

  3. a G g t a c T t C c A t a c g t Alignment a c g t T A g t a c g t C c A t C c g t a c g G A3 0 1 0 31 1 0 ProfileC24 0 0 14 0 0 G 0 1 4 0 0 0 31 T 0 0 0 5 1 0 14 Consensus A C G T A C G T Line up the patterns by their start indexes s = (s1, s2, …, st) Construct matrix profile with frequencies of each nucleotide in columns Consensus nucleotide in each position has the highest score in column Motifs: Profiles and Consensus

  4. Aligned DNA sequences can be represented by a 4 ·n profile matrix reflecting the frequencies of nucleotides in every aligned position. Profile Representation of Protein Families Protein family can be represented by a 20·nprofile representing frequencies of amino acids.

  5. Profiles and HMMs • HMMs can also be used for aligning a sequence against a profile representing protein family. • A 20·n profile P corresponds to n sequentially linked match states M1,…,Mnin the profile HMM of P.

  6. Multiple Alignments and Protein Family Classification • Multiple alignment of a protein family shows variations in conservation along the length of a protein • Example: after aligning many globin proteins, the biologists recognized that the helices region in globins are more conserved than others.

  7. What are Profile HMMs ? • A Profile HMM is a probabilistic representation of a multiple alignment. • A given multiple alignment (of a protein family) is used to build a profile HMM. • This model then may be used to find and score less obvious potential matches of new protein sequences.

  8. Profile HMM A profile HMM

  9. Building a Profile HMM • Multiple alignment is used to construct the HMM model. • Assign each column to a Match state in HMM. Add Insertion and Deletion state. • Estimate the emission probabilities according to amino acid counts in column. Different positions in the protein will have different emission probabilities. • Estimate the transition probabilities between Match, Deletion and Insertion states • The HMM model gets trained to derive the optimal parameters.

  10. States of Profile HMM • Match states M1…Mn (plus begin/end states) • Insertion states I0I1…In • Deletion states D1…Dn

  11. Transition Probabilities in Profile HMM • log(aMI)+log(aIM) = gap initiation penalty • log(aII) = gap extension penalty

  12. Emission Probabilities in Profile HMM • Probabilty of emitting a symbol a at an • insertion state Ij: • eIj(a) = p(a) • where p(a) is the frequency of the • occurrence of the symbol a in all the • sequences.

  13. Profile HMM Alignment • Define vMj(i)as the logarithmic likelihood score of the best path for matching x1..xi to profile HMM ending with xi emitted by the state Mj. • vIj (i) and vDj (i) are defined similarly.

  14. vMj-1(i-1) + log(aMj-1,Mj ) vMj(i) = log (eMj(xi)/p(xi)) + max vIj-1(i-1) + log(aIj-1,Mj ) vDj-1(i-1) + log(aDj-1,Mj ) vMj(i-1) + log(aMj, Ij) vIj(i) = log (eIj(xi)/p(xi)) + max vIj(i-1) + log(aIj, Ij) vDj(i-1) + log(aDj, Ij) Profile HMM Alignment: Dynamic Programming

  15. Paths in Edit Graph and Profile HMM A path through an edit graph and the corresponding path through a profile HMM

  16. Making a Collection of HMM for Protein Families • Use Blast to separate a protein database into families of related proteins • Construct a multiple alignment for each protein family. • Construct a profile HMM model and optimize the parameters of the model (transition and emission probabilities). • Align the target sequence against each HMM to find the best fit between a target sequence and an HMM

  17. Application of Profile HMM to Modeling Globin Proteins • Globins represent a large collection of protein sequences • 400 globin sequences were randomly selected from all globins and used to construct a multiple alignment. • Multiple alignment was used to assign an initial HMM • This model then get trained repeatedly with model lengths chosen randomly between 145 to 170, to get an HMM model optimized probabilities.

  18. hmmer package • Tools for making HMMs and for hmmscan • hmmer3 (as fast as blast)

  19. Sequence Pattern (Motif) Discovery • Finding patterns in multiple alignments, or in unaligned sequences • eMotif (a protein pattern database); eBLOCKs • Gibbs and MEME • To infer patterns in unaligned sequences • Gibbs program starts with a fixed pattern length of W and a random set of locations of the pattern in given input sequences (i.e., the initial pattern is random); and then one sequence is selected at a time randomly and an attempt is made to improve its pattern position. • MEME uses many similar concepts, but uses the EM (expectation maximization) method.

  20. Utilization of Multiple Alignments • Residue conservation • Jalview • Subfamilies • SCI-PHY • FunShift

  21. Readings • Chapter 6

More Related