Profile Hidden Markov Models

Profile Hidden Markov Models For Protein Structure Prediction Colin Cherry colinc@cs

Biological Motivation: • Given a single amino acid target sequence of unknown structure, we want to infer the structure of the resulting protein.

But wait, that’s hard! • There’s physics, chemistry, secondary structure, tertiary structure and all sorts of other nasty stuff to deal with. • Let’s rephrase the problem: • Given a target amino acid sequence of unknown structure, we want to identify the structural family of the target sequence through identification of a homologous sequence of known structure.

It still sounds hard… • In other words: • We find a similar protein with a structure that we understand, and we see if it makes sense to fold our target into the same sort of shape. • If not, we try again with the second most similar structure, and so on. • What we’re doing is taking advantage of the wealth of knowledge that has been collected in protein and structure databases.

So, the next question is: • How do we find a known protein that is similar to our target sequence? • One method happens to be: Hidden Markov Models! (Profile Hidden Markov Models, to be precise)

Lecture Objectives • Once I’m done, you should know: • The standard architecture for a profile HMM. • The three major uses in bioinformatics for a profile HMM. • The high-level concepts behind the algorithms to train and use a profile HMM. • The two different starting points for training an HMM. • How to avoid over-fitting a profile HMM. • The high-level ideas behind using profile HMMs to determine protein structure. • Please feel free to interrupt at any point.

Outline • This talk is broken into the following sections: • Methods for Characterizing a Protein Family • The Architecture of a Profile HMM • Alignment and Training with Profile HMMs • Preventing Over-fitting • Determining Structure • Conclusion

Methods for Characterizing a Protein Family • Objective: Given a number of related sequences, encapsulate what they have in common in such a way that we can recognize other members of the family. • Some standard methods for characterization: • Multiple Alignments • Regular Expressions • Consensus Sequences • Hidden Markov Models

A Characterization Example • How could we characterize this (hypothetical) family of nucleotide sequences? • Keep the multiple alignment • Try a regular expression: [AT] [CG] [AC] [ACTG]* A [TG] [GC] • But what about T G C T - - A G G vs. A C A C - - A T C? Both match the expression. • Try a consensus sequence: A C A - - - A T C • Depends on the distance measure used • Example borrowed from Salzberg, 1998
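The regular expression on this slide can be tried directly. The sketch below assumes that alignment gaps ("-") and display spaces are stripped before matching; the function name `is_member` is made up for illustration.

```python
import re

# The slide's expression with the display spaces removed.
FAMILY_PATTERN = re.compile(r"[AT][CG][AC][ACTG]*A[TG][GC]")

def is_member(aligned_sequence: str) -> bool:
    """Return True if the sequence (gaps and spaces ignored) matches the pattern."""
    residues = aligned_sequence.replace("-", "").replace(" ", "")
    return FAMILY_PATTERN.fullmatch(residues) is not None

print(is_member("T G C T - - A G G"))
print(is_member("A C A C - - A T C"))
```

Both sequences are accepted, which is the weakness of the approach: a regular expression gives a yes/no answer and cannot say that one sequence is a far more typical member than the other.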

HMMs to the rescue! [Figure: example HMM with emission probabilities and transition probabilities labeled]

Insert (Loop) States

Scoring our simple HMM • Sequence #1: “T G C T - - A G G” vs. sequence #2: “A C A C - - A T C” • Regular expression ([AT] [CG] [AC] [ACTG]* A [TG] [GC]): • #1: member; #2: member • HMM: • Probability: #1 scores 0.0023%; #2 scores 4.7% • Log odds: #1 scores -0.97; #2 scores 6.7
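The link between the probability and the log-odds figures can be sketched as follows. The slide does not state the null model or the log base; this sketch assumes a uniform null model over the four nucleotides and natural logarithms, which appears to reproduce the slide's numbers up to rounding. The function name `log_odds` is illustrative.

```python
import math

def log_odds(model_probability: float, length: int, alphabet_size: int = 4) -> float:
    """Natural-log odds of the model probability against a uniform null model."""
    # Assumed null model: every residue is equally likely at every position.
    null_probability = (1.0 / alphabet_size) ** length
    return math.log(model_probability / null_probability)

# Probabilities quoted on the slide, converted from percentages.
score_1 = log_odds(0.000023, length=7)  # T G C T A G G
score_2 = log_odds(0.047, length=7)     # A C A C A T C
print(round(score_1, 2), round(score_2, 2))
```

A positive score means the sequence is more likely under the family model than under the null model; a negative score means the opposite.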

Standard Profile HMM Architecture • Three types of states: • Match • Insert • Delete • One delete and one match per position in model • One insert per transition in model • Start and end “dummy” states Example borrowed from Cline, 1999
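As a rough sketch of the state inventory described above, the following lists the states for a model of a given length. The state names (`M1`, `D1`, `I0`, ...) are a naming assumption, and "one insert per transition" is read here as one insert state between each pair of consecutive positions plus one after START and one before END.

```python
def profile_hmm_states(model_length: int) -> list[str]:
    """List the state names of a profile HMM with `model_length` positions."""
    states = ["START", "I0"]  # I0: insert state on the way out of START
    for position in range(1, model_length + 1):
        # One match, one delete, and one following insert state per position.
        states += [f"M{position}", f"D{position}", f"I{position}"]
    states.append("END")
    return states

states = profile_hmm_states(3)
print(states)
```

For a model of length 3 this gives 3 match, 3 delete, and 4 insert states plus the two dummy states.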

Match States Example borrowed from Cline, 1999

Insert States Example borrowed from Cline, 1999

Delete States Example borrowed from Cline, 1999

Aligning and Training HMMs • Training from a Multiple Alignment • Aligning a sequence to a model • Can be used to create an alignment • Can be used to score a sequence • Can be used to interpret a sequence • Training from unaligned sequences

Training from an existing alignment • This process is what we’ve been seeing up to this point. • Start with a predetermined number of states in your HMM. • For each position in the model, assign a column in the multiple alignment that is relatively conserved. • Emission probabilities are set according to amino acid counts in columns. • Transition probabilities are set according to how many sequences make use of a given delete or insert state.
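The counting step can be sketched on a toy alignment. Everything specific here is an assumption for illustration: the alignment is invented, and "relatively conserved" is approximated by a simple rule (a column becomes a match position if at most half of its rows are gaps). Transition counts are left out to keep the sketch short.

```python
from collections import Counter

def match_columns(alignment: list[str], max_gap_fraction: float = 0.5) -> list[int]:
    """Pick columns with few gaps to serve as match positions (assumed rule)."""
    n_rows = len(alignment)
    columns = []
    for col in range(len(alignment[0])):
        gaps = sum(1 for row in alignment if row[col] == "-")
        if gaps / n_rows <= max_gap_fraction:
            columns.append(col)
    return columns

def emission_probabilities(alignment: list[str], col: int) -> dict[str, float]:
    """Relative residue frequencies in one column, ignoring gaps."""
    counts = Counter(row[col] for row in alignment if row[col] != "-")
    total = sum(counts.values())
    return {residue: n / total for residue, n in counts.items()}

# Hypothetical toy alignment, not the one from the slides.
toy_alignment = [
    "AC-AT",
    "AG-AT",
    "TCGAG",
    "AC-AT",
]
cols = match_columns(toy_alignment)
first_column = emission_probabilities(toy_alignment, cols[0])
print(cols)
print(first_column)
```

The mostly-gap third column is left to an insert state, and the first match position emits A three times as often as T.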

Remember the simple example • Chose six positions in model. • Highlighted area was selected to be modeled by an insert due to variability. • Can also do neat tricks for picking length of model, such as model pruning.

Aligning sequences to a model • Now that we have a profile model, let’s use it! • Conceptually: consider every possible path through the model that would produce the target sequence • Keep the best one and its probability. • The Viterbi algorithm does this without enumerating every path, and it has been around for a while • Dynamic programming based method • Time: O(N*M) Space: O(N*M) • (Assuming a constant # of transitions per state) • N = length of sequence, M = # of states in HMM
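A minimal Viterbi sketch follows, under stated simplifications: it handles only emitting states (the silent delete states of a full profile HMM need extra bookkeeping that is omitted here), it assumes a non-empty sequence, and the three-state toy model with its probabilities is invented for illustration. Transitions are stored per state as a short predecessor table, so the work is proportional to N times the number of transitions, which is O(N*M) when each state has a constant number of them.

```python
import math

def logs(probabilities):
    """Convert a dict of probabilities to log-probabilities."""
    return {key: math.log(p) for key, p in probabilities.items()}

def viterbi(sequence, states, start_logp, predecessors, emit_logp):
    """Return (best log-probability, best state path) for an emitting-state HMM.

    predecessors[s] maps state s to {previous_state: log transition probability}.
    """
    neg_inf = float("-inf")
    # score[i][s]: best log-probability of emitting sequence[:i+1] and ending in s
    score = [{s: neg_inf for s in states} for _ in sequence]
    back = [{s: None for s in states} for _ in sequence]
    for s in states:
        score[0][s] = start_logp.get(s, neg_inf) + emit_logp[s].get(sequence[0], neg_inf)
    for i in range(1, len(sequence)):
        for s in states:
            emit = emit_logp[s].get(sequence[i], neg_inf)
            for prev, trans in predecessors.get(s, {}).items():
                candidate = score[i - 1][prev] + trans + emit
                if candidate > score[i][s]:
                    score[i][s] = candidate
                    back[i][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: score[-1][s])
    path = [last]
    for i in range(len(sequence) - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return score[-1][last], path

# Hypothetical toy model: M1 -> (I1 loop) -> M2.
states = ["M1", "I1", "M2"]
start_logp = logs({"M1": 1.0})
predecessors = {
    "I1": logs({"M1": 0.2, "I1": 0.4}),
    "M2": logs({"M1": 0.8, "I1": 0.6}),
}
emit_logp = {
    "M1": logs({"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}),
    "I1": logs({"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}),
    "M2": logs({"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7}),
}
best_logp, best_path = viterbi("ACT", states, start_logp, predecessors, emit_logp)
print(best_path, math.exp(best_logp))
```

The returned path is the alignment of the sequence to the model (here the middle residue lands in the insert state), and the log-probability is the raw score that a log-odds comparison would start from.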

So… what do we do with an alignment to a model? • Align a bunch of sequences to the model, and get a new multiple alignment. • Align a single sequence to the model and get a numerical score stating how well it fits the model • “Find me all sequences in the database that match this family profile X with a log odds score of at least Y” • Align a single sequence to the model, and get a description of its columns • “Columns 124 and 125 map to insert states of family Y, I wonder what that means?”

Training from unaligned sequences • One method: • Start with a model whose length matches the average length of the sequences and with random emission and transition probabilities. • Align all the sequences to the model. • Use the alignment to alter the emission and transition probabilities. • Repeat. Continue until the model stops changing. • By-product: It produces a multiple alignment.

Training from unaligned continued • Advantages: • You take full advantage of the expressiveness of your HMM. • You might not have a multiple alignment on hand. • Disadvantages: • HMM training methods are local optimizers, so you may not get the best alignment or the best model unless you’re very careful. • Can be alleviated by starting from a logical model instead of a random one.

For those of you keeping score… Lecture Objectives • The standard architecture for a profile HMM. • The three major uses in bioinformatics for a profile HMM. • The high-level concepts behind the algorithms to train and use a profile HMM. • The two different starting points for training an HMM. • How to avoid over-fitting a profile HMM. • The high-level ideas behind using profile HMMs to determine protein structure.

Preventing Over-fitting • Prior Information (Dirichlet Mixtures) • Combines prior information regarding amino acid frequencies with the observed counts at each step of training • Prior distribution used at each step depends on: • Number of examples seen in a given column • Distribution of examples seen • Sequence Weighting • Some sequences are more frequent than others • Your model should not reflect this • Give the less frequent sequences more weight • Get more data
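The effect of prior information can be illustrated with plain pseudocounts. This is a deliberately simpler stand-in for Dirichlet mixtures: a single uniform prior rather than a mixture, shown on a four-letter alphabet, with a made-up function name and pseudocount value. It does show the property listed above, that the prior matters most when few examples have been seen.

```python
def smoothed_emissions(counts: dict[str, int], alphabet: str,
                       pseudocount: float = 1.0) -> dict[str, float]:
    """Blend observed column counts with a uniform prior via pseudocounts."""
    total = sum(counts.get(r, 0) for r in alphabet) + pseudocount * len(alphabet)
    return {r: (counts.get(r, 0) + pseudocount) / total for r in alphabet}

# Same observed distribution (only A), but very different amounts of evidence.
few = smoothed_emissions({"A": 2}, "ACGT")
many = smoothed_emissions({"A": 200}, "ACGT")
print(few["A"], few["C"])
print(many["A"], many["C"])
```

With two observations the unseen residues keep a sizeable probability; with two hundred, the counts dominate and the prior's contribution becomes small. No residue is ever assigned probability zero, which is what guards against over-fitting to a small alignment.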

Finally, Protein Structure Determination! Two good routes to take: • A database of profile HMMs • Make a profile of the target and search • Or, do both

Database of Profiles • Make an HMM for each protein in the database with known structure • Collect its homologs from the database • Build a model with the homologs • Match your protein sequence against every model in the database • Predict the structure of whichever model scores the highest

Profile of Target • SAM-T98: • Best method that made use of no direct structural information at CASP 3 (Critical Assessment of Structure Prediction) • Create a model of your target sequence • Search a database of proteins using that model • Whichever sequence scores highest, predict that structure

How do we build a model using only one sequence?

Profile HMM Effectiveness Overview • Advantages: • Very expressive profiling method • Transparent method: You can view and interpret the model produced • Very effective at detecting remote homologs • Disadvantages: • Slow – full search on a database of 400,000 sequences can take 15 hours • Have to avoid over-fitting and locally optimal models

Score Board: • What is the standard architecture for a profile HMM? • Match, insert, and delete states, plus start and end “dummy” states • What are the three major uses in bioinformatics for a profile HMM? • Aligning a sequence • Scoring a sequence • Interpreting a sequence • What are the high-level concepts behind the algorithms to train and use a profile HMM? • Dynamic programming and iterative alignments

Score Board: • What are the two different starting points for model training? • Either from aligned or unaligned sequences • How do I avoid over-fitting a profile HMM? • Prior values, sequence weighting, get more data • What are the high-level ideas behind using profile HMMs to determine protein structure? • Use a profile of your target, or many profiles of sequences with known structure, to match your target to a known structure

Any questions? • More info available at: • www.cs.ualberta.ca/~colinc
