BINF6201/8201 Hidden Markov Models for Sequence Analysis 4 11-29-2011

BINF6201/8201 Hidden Markov Models for Sequence Analysis 4 11-29-2011

Choice of model topology • The structure (topology) and parameters together determine a HMM. • The parameters of a HHM can be determined by the Baum-Welch algorithm and other optimization methods, • The design of the topology of a HMM is based on the understanding of the problem and the data available to solve the problem.

Profile HMM for sequence families • Profile HMMs are a special type of HMMs used to mode multiple alignments of protein families. Matches Indels • Once a profile HMM is constructed for a protein family, it can be used to evaluate if a new sequence belongs to the family or not (the scoring problem). • The most probable path of a sequence generated by the model can be used to align the sequence to the members of the family (the decoding problem).

Profile HMM for sequence families • Given a block of ungapped multiple alignment of a protein family, we can use the following HMM to model the block. e1(bi) ej(bi) eL(bi) … … Begin M1 Mj End Mn • Here, Mj corresponds to the ungapped column j in the alignment, and is called a match state. Mjemits amino acid bi with probability ej(bi). • The transition probability between two adjacent match states Mj-1 and Mjis 1, i.e., a(j-1) j = 1, because Mjcannot transit to any other states or back to itself.

Profile HMM for sequence families • Since we know the path of a sequence generated by the model, the probability that a sequence x is generated by the model is, • To make this probability more meaningful, we can compare it with a background probability. • The probability that the sequence x is generated randomly (by random model R) is, • The log odds ratio is, which essentially is a position specific scoring/weigh matrix (PSSM). • Therefore, this HMM is equivalent to a PSSM, and we score the sequence x with the PSSM of the block, which is more sensitive than using a general-purposed scoring matrices such as PAM and BLOSSUM in a pair-wise alignment.

Profile HMM for sequence families • To model insertions after the match state Mj, we introduce in the model an insertion state Ij. Ij … … Begin M1 Mj End Mj+1 Mn • In this case, Mj can transit to the next match state Mj+1 or Ij, and Ij can move to Mj+1or remain in Ij. • Ijemits an amino acid b with probability, which is usually set to the background frequency of the amino acid b, qb. • The log odds-ratio for generating an insertion sequence of length k is, • This is equivalent to an affine penalty function, but it is position dependent, therefore is more accurate.

Profile HMM for sequence families • To model deletions at some match states, we use a deletion state Djat each position j. k deletions D1 … Dj-1 Dj Dj+1 Dj+k-1 Dj+k … … … Mj End Begin M1 Mj-1 Mj+1 Mj+k-1 Mj+k • The deletion stateDj does not emit any signal, so it is called a silent state. • The penalty score for a deletion of length k starting at j will be, • Therefore, the penalty for deletions is not equivalent to an affine penalty function, but again, it is position dependent.

Profile HMM for sequence families • The complete profile HMM has the following structure if transitions between insertion and deletion states are not considered. • Leaving them out has little effect on scoring sequences, but may have problem when training the model. • The following profile model considers transitions between insertion and deletion states. • In this model each Mj, Djand Ijhas three transitions, except for the last position at which each state has only two transitions.

Derive profile HMMs from multiple alignments • Given a multiple alignment of a protein family, we first determine how many match states should be used to model the family. • A general rule is to treat columns in which less than 50% of sequences have a deletion as match states. A segment of multiple alignment of hemoglobin proteins • Using this rule we model this segment of alignment by a mode having eight match states.

Derive profile HMMs from multiple alignments • Based on the general design of profile HHMs, we have the following model for the segment of the alignment. A HMM of length 8 • From the alignment, we know the path of each sequence, therefore, the transition and emission probabilities can be estimated by the general formula,

Pseudocounts • When counting events, to avoid zero probabilities, we usually add pseudo-counts to the total counts. • The most simplest way to add pseudocounts is to add one to each frequency, this is called Laplace’s rule, e.g., using this rule, we have, • A slightly more sophisticated method is to add a quantity proportional to the background frequency. • For example, if we add A sequences to the alignment, we expect that Aqbof them will have a b at the position, then the emission probability of b at Mk is where, qb is the background frequency of amino acid b in the alignment.

Dirichlet prior distribution • This means that we add our prior knowledge to the counts, and it is equivalent to that we compute the posterior probability of the theoretic value of eM(b), q after we see some counts of EM(b), n out of a total of K counts,ie., • To see this, we need to do some mathematical derivation. If we consider the frequencies of 20 amino acids in a column in an alignment, these 20 frequencies are summed to 1. • These values will change for different columns in the alignment, so they are random variables, and they follow a Dirichlet distribution, where Z is a normalization factor, and a1,…, ai,…,a20 are the parameters that determine the shape of the distribution. • Interestingly, it can be shown that the mean of qi is,

Dirichlet prior distribution • Let ai=Aqi, then we have a Dirichlet prior distribution, The mean of qi is qi, • Therefore, if we do not know the frequencies of the 20 amino acids in a column, we can use such a Dirichlet distribution to model the prior of these frequencies. The average frequency of amino acid i is qi. • Although the parameter A does not affect the average of frequency qi, it affects the shape of the distribution. • To see the effect of A on the shape of Dirichlet, let’s consider only one type of amino acids (e.g., acidic amino acids) with a prior frequency q, and a mean frequency q. The prior frequency of all the other amino acids is 1- q,, and its mean is 1-q.

Dirichlet prior distribution • The Dirichlet distribution of this frequency q is, • When the average of frequency of this type of amino acids is q=0.05, changing A, we have the following shapes of the Dirichlet distributions. • Although the means of q are the same, the larger the value of A, the narrow the shape of the distribution • In general, when we have a high confidence of q, we use a large A value, otherwise, we should use a small A value.

Dirichlet prior distribution • Now let’s consider posterior distribution after observing data using a Dirichlet prior distribution. Let K be the total number of observed amino acids in a column, of which n are of the type that we are considering. • The likelihood for this to happen can be computed by a binomial distribution, • The posterior distribution is, • Through normalization, we have,

Dirichlet prior distribution • Therefore the posterior probability also follows a Dirichlet distribution, but with different parameters. • The mean of the posterior distribution of q is, • When K is large, adding prior Aq has little effect on the probability, but when K is small, the effects could be big. • This gives the justification that we can use pseudocounts Aqbto estimate the posterior frequency of the amino acids b. • The figure shows the posterior distribution p(q|n) when q=0.05 but the real frequency is 0.5.

Application of profile HMMs • Once a profile HMM is constructed for a protein family, it can be used to score a new sequence. The sequence can be also aligned to the family using the path decoded by the Viterbi algorithm or the forward and backward algorithm. • The two popular tools for profile HMM applications are free on line: Hmmer: http://hmmer.janelia.org/ Sam: http://compbio.soe.ucsc.edu/sam.html • Developed by Sean Eddy and colleagues in earlier 1990s. • It contains tools for building a HMM based on a multiple alignment, and tools for searching a HMM database. • Hammer is also associated with the Pfam protein family database at the same site • The first profile HMMs tools developed by David Haussler and Andrew Krogh in earlier 1990s.

BINF6201/8201 Hidden Markov Models for Sequence Analysis 4 11-29-2011