

  1. BINF6201/8201 Hidden Markov Models for Sequence Analysis 4 11-29-2011

  2. Choice of model topology
  • The structure (topology) and the parameters together determine a HMM.
  • The parameters of a HMM can be determined by the Baum-Welch algorithm and other optimization methods.
  • The design of the topology of a HMM is based on an understanding of the problem and of the data available to solve it.

  3. Profile HMM for sequence families
  • Profile HMMs are a special type of HMM used to model multiple alignments of protein families, capturing both matches and indels.
  • Once a profile HMM is constructed for a protein family, it can be used to evaluate whether a new sequence belongs to the family or not (the scoring problem).
  • The most probable path of a sequence generated by the model can be used to align the sequence to the members of the family (the decoding problem).

  4. Profile HMM for sequence families
  • Given a block of an ungapped multiple alignment of a protein family, we can use the following HMM to model the block.
  [Diagram: Begin → M_1 → … → M_j → … → M_n → End, where state M_j emits residues with probabilities e_j(b_i)]
  • Here, M_j corresponds to the ungapped column j in the alignment and is called a match state. M_j emits amino acid b_i with probability e_j(b_i).
  • The transition probability between two adjacent match states M_{j-1} and M_j is 1, i.e., a_{(j-1)j} = 1, because M_j cannot transition to any other state or back to itself.

  5. Profile HMM for sequence families
  • Since we know the path of a sequence generated by the model, the probability that a sequence x of length n is generated by the model M is,
      P(x|M) = ∏_{j=1}^{n} e_j(x_j).
  • To make this probability more meaningful, we can compare it with a background probability. The probability that the sequence x is generated randomly (by a random model R) is,
      P(x|R) = ∏_{j=1}^{n} q_{x_j},
  where q_b is the background frequency of amino acid b.
  • The log-odds ratio is,
      S(x) = log [P(x|M) / P(x|R)] = ∑_{j=1}^{n} log [e_j(x_j) / q_{x_j}],
  which essentially is a position-specific scoring/weight matrix (PSSM).
  • Therefore, this HMM is equivalent to a PSSM, and scoring the sequence x with the PSSM of the block is more sensitive than using a general-purpose scoring matrix such as PAM or BLOSUM in a pairwise alignment.
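To make the log-odds scoring concrete, here is a minimal Python sketch. The emission table e, the background q, and the 4-letter alphabet are all invented for illustration; a real profile would use the 20 amino acids and probabilities estimated from the alignment.

```python
# Minimal sketch of PSSM scoring against an ungapped profile HMM.
# The emission table e and background q are hypothetical; a 4-letter
# alphabet stands in for the 20 amino acids.
import math

e = [  # e[j][b]: emission probability of symbol b at match state j+1
    {"A": 0.6, "C": 0.2, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.4, "C": 0.1, "G": 0.4, "T": 0.1},
]
q = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # background frequencies

def pssm_score(x):
    """Log-odds score S(x) = sum_j log2(e_j(x_j) / q_{x_j})."""
    return sum(math.log2(e[j][b] / q[b]) for j, b in enumerate(x))

print(pssm_score("ACG"))  # well-matching sequence scores above 0
print(pssm_score("TTT"))  # poorly matching sequence scores below 0
```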

  6. Profile HMM for sequence families
  • To model insertions after the match state M_j, we introduce into the model an insertion state I_j.
  [Diagram: Begin → M_1 → … → M_j → M_{j+1} → … → M_n → End, with I_j reachable from M_j, looping on itself, and returning to M_{j+1}]
  • In this case, M_j can transition to the next match state M_{j+1} or to I_j, and I_j can move on to M_{j+1} or remain in I_j.
  • I_j emits an amino acid b with probability e_{I_j}(b), which is usually set to the background frequency of the amino acid b, q_b.
  • Because the emission terms cancel against the background, the log-odds ratio for generating an insertion sequence of length k is,
      log a_{M_j I_j} + (k−1) log a_{I_j I_j} + log a_{I_j M_{j+1}}.
  • This is equivalent to an affine gap penalty function, but it is position dependent and therefore more accurate.
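A small numerical check of this equivalence, using made-up transition probabilities a_MI, a_II, and a_IM: because I_j emits with background frequencies, the emission terms cancel, and the score decomposes into a gap-open cost plus a per-residue gap-extend cost.

```python
# Sketch: insertion log-odds equals an affine gap penalty.
# Transition probabilities are hypothetical placeholders.
import math

a_MI, a_II, a_IM = 0.05, 0.4, 0.6  # M_j->I_j, I_j->I_j, I_j->M_{j+1}

def insertion_score(k):
    """log a_{Mj,Ij} + (k-1) log a_{Ij,Ij} + log a_{Ij,Mj+1}."""
    return math.log(a_MI) + (k - 1) * math.log(a_II) + math.log(a_IM)

gap_open = math.log(a_MI) + math.log(a_IM)  # cost of opening the gap
gap_extend = math.log(a_II)                 # cost per extra inserted residue
for k in (1, 3, 10):
    assert math.isclose(insertion_score(k), gap_open + (k - 1) * gap_extend)
```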

  7. Profile HMM for sequence families
  • To model deletions at some match states, we use a deletion state D_j at each position j.
  [Diagram: a run of k deletions passes Begin → M_1 → … → M_{j-1} → D_j → D_{j+1} → … → D_{j+k-1} → M_{j+k} → … → End]
  • The deletion state D_j does not emit any symbol, so it is called a silent state.
  • The penalty score for a deletion of length k starting at position j is,
      log a_{M_{j−1} D_j} + ∑_{i=j}^{j+k−2} log a_{D_i D_{i+1}} + log a_{D_{j+k−1} M_{j+k}}.
  • Because the D → D transition probabilities differ from position to position, the penalty for deletions is not equivalent to an affine penalty function, but again, it is position dependent.
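By contrast with the insertion case, a sketch with invented, position-specific D-to-D transition probabilities shows that the deletion score has no single gap-extend cost:

```python
# Sketch: a deletion run scored through silent D states. Because each
# D_i -> D_{i+1} probability can differ, the total is not affine in k.
# All transition values below are hypothetical.
import math

a_MD = 0.03                    # M_{j-1} -> D_j
a_DD = [0.5, 0.2, 0.6, 0.3]    # position-specific D_i -> D_{i+1}
a_DM = 0.7                     # D_{j+k-1} -> M_{j+k}

def deletion_score(k):
    s = math.log(a_MD) + math.log(a_DM)
    s += sum(math.log(a_DD[i]) for i in range(k - 1))
    return s

# Increments differ between lengths, so no single gap-extend cost exists:
print(deletion_score(2) - deletion_score(1))  # log 0.5
print(deletion_score(3) - deletion_score(2))  # log 0.2
```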

  8. Profile HMM for sequence families
  • The complete profile HMM has the following structure if transitions between insertion and deletion states are not considered.
  [Diagram: profile HMM with match, insertion, and deletion states, but no transitions between I and D states]
  • Leaving them out has little effect on scoring sequences, but can cause problems when training the model.
  • The following profile model considers transitions between insertion and deletion states.
  [Diagram: full profile HMM including transitions between insertion and deletion states]
  • In this model, each M_j, D_j, and I_j has three outgoing transitions, except at the last position, where each state has only two.

  9. Derive profile HMMs from multiple alignments
  • Given a multiple alignment of a protein family, we first determine how many match states should be used to model the family.
  • A general rule is to treat columns in which fewer than 50% of the sequences have a deletion as match states (see the sketch after this slide).
  [Figure: a segment of a multiple alignment of hemoglobin proteins]
  • Using this rule, we model this segment of the alignment by a model having eight match states.
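A sketch of the 50% rule in Python. The toy alignment below is illustrative and stands in for the hemoglobin segment shown on the slide; with these rows, the two heavily gapped columns become insert columns and eight match states remain.

```python
# Sketch of the 50% rule: a column is a match state if fewer than half
# of the sequences have a gap there. The toy alignment is illustrative.
alignment = [
    "VGA--HAGEY",
    "V----NVDEV",
    "VEA--DVAGH",
    "VKG------D",
    "VYS--TYETS",
    "FNA--NIPKH",
    "IAGADNGAGV",
]

def match_columns(aln):
    ncols = len(aln[0])
    return [j for j in range(ncols)
            if sum(row[j] == "-" for row in aln) < len(aln) / 2]

print(match_columns(alignment))  # 8 match columns -> 8 match states
```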

  10. Derive profile HMMs from multiple alignments
  • Based on the general design of profile HMMs, we have the following model for the segment of the alignment.
  [Figure: a profile HMM of length 8]
  • From the alignment, we know the path of each sequence; therefore, the transition and emission probabilities can be estimated by the general formulas,
      a_{kl} = A_{kl} / ∑_{l'} A_{kl'}   and   e_k(b) = E_k(b) / ∑_{b'} E_k(b'),
  where A_{kl} is the observed number of transitions from state k to state l, and E_k(b) is the observed number of emissions of b from state k (see the sketch below).
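A sketch of the counting estimate for the emission probabilities e_k(b), reusing the toy alignment and the match columns from the previous sketch; transition counts A_kl would be tallied the same way from each sequence's known path.

```python
# Sketch: e_k(b) = E_k(b) / sum_b' E_k(b'), estimated by counting the
# residues observed in each match column (gaps are skipped).
from collections import Counter

alignment = ["VGA--HAGEY", "V----NVDEV", "VEA--DVAGH",
             "VKG------D", "VYS--TYETS", "FNA--NIPKH", "IAGADNGAGV"]
match_cols = [0, 1, 2, 5, 6, 7, 8, 9]  # from the 50% rule above

def emission_estimates(aln, cols):
    est = {}
    for k, j in enumerate(cols, start=1):
        counts = Counter(row[j] for row in aln if row[j] != "-")
        total = sum(counts.values())
        est[k] = {b: c / total for b, c in counts.items()}
    return est

print(emission_estimates(alignment, match_cols)[1])  # column 1: V is 5/7
```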

  11. Pseudocounts
  • When counting events, to avoid zero probabilities, we usually add pseudocounts to the observed counts.
  • The simplest way to add pseudocounts is to add one to each count; this is called Laplace's rule. Using this rule, the emission probability at M_k becomes,
      e_{M_k}(b) = (E_{M_k}(b) + 1) / (∑_{b'} E_{M_k}(b') + 20).
  • A slightly more sophisticated method is to add a quantity proportional to the background frequency.
  • For example, if we add A sequences to the alignment, we expect that A·q_b of them will have a b at the position; then the emission probability of b at M_k is,
      e_{M_k}(b) = (E_{M_k}(b) + A·q_b) / (∑_{b'} E_{M_k}(b') + A),
  where q_b is the background frequency of amino acid b in the alignment.
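Both pseudocount schemes in a short sketch; the raw counts and the flat background are illustrative. Note that with A = 20 and a flat background q_b = 1/20, the background-weighted scheme reduces exactly to Laplace's rule.

```python
# Sketch of both pseudocount schemes on one column's raw counts.
# Counts and background frequencies are illustrative.
counts = {"V": 5, "F": 1, "I": 1}          # observed residues in a column
alphabet = "ACDEFGHIKLMNPQRSTVWY"

def laplace(counts, alphabet):
    """Laplace's rule: add 1 to every count."""
    total = sum(counts.values()) + len(alphabet)
    return {b: (counts.get(b, 0) + 1) / total for b in alphabet}

def background_pc(counts, q, A):
    """e(b) = (E(b) + A*q_b) / (sum_b' E(b') + A)."""
    total = sum(counts.values()) + A
    return {b: (counts.get(b, 0) + A * qb) / total for b, qb in q.items()}

q = {b: 1 / 20 for b in alphabet}          # flat background, for simplicity
print(laplace(counts, alphabet)["V"])      # (5 + 1) / (7 + 20)
print(background_pc(counts, q, A=20)["V"]) # (5 + 20*0.05) / (7 + 20), same
```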

  12. Dirichlet prior distribution
  • Adding pseudocounts in this way means that we add our prior knowledge to the counts; it is equivalent to computing the posterior probability of the theoretical value θ of e_M(b) after we see n counts of E_M(b) out of a total of K counts, i.e., p(θ | n).
  • To see this, we need to do some mathematical derivation. If we consider the frequencies θ_1, …, θ_20 of the 20 amino acids in a column of an alignment, these 20 frequencies sum to 1.
  • These values change from column to column of the alignment, so they are random variables, and we can model them by a Dirichlet distribution,
      D(θ_1, …, θ_20 | α_1, …, α_20) = (1/Z) ∏_{i=1}^{20} θ_i^{α_i − 1},
  where Z is a normalization factor and α_1, …, α_20 are the parameters that determine the shape of the distribution.
  • Interestingly, it can be shown that the mean of θ_i is,
      E[θ_i] = α_i / ∑_{j=1}^{20} α_j.

  13. Dirichlet prior distribution
  • Let α_i = A·q_i; then we have a Dirichlet prior distribution whose mean for each θ_i is E[θ_i] = A·q_i / ∑_j A·q_j = q_i (see the numerical sketch after this slide).
  • Therefore, if we do not know the frequencies of the 20 amino acids in a column, we can use such a Dirichlet distribution to model the prior of these frequencies. The average frequency of amino acid i is q_i.
  • Although the parameter A does not affect the mean frequencies q_i, it affects the shape of the distribution.
  • To see the effect of A on the shape of the Dirichlet, let's consider only one type of amino acid (e.g., the acidic amino acids) with frequency θ and prior mean q. The combined frequency of all the other amino acids is then 1 − θ, with mean 1 − q.
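A numerical sketch of this property using NumPy's Dirichlet sampler: three hypothetical frequency classes with prior means q, sampled under several values of A. The sample means stay near q while the spread shrinks as A grows.

```python
# Numerical sketch: with alpha_i = A * q_i, the Dirichlet mean of each
# frequency is q_i regardless of A; only the spread depends on A.
import numpy as np

rng = np.random.default_rng(0)
q = np.array([0.05, 0.45, 0.50])   # hypothetical prior means (3 classes)
for A in (2.0, 20.0, 200.0):
    samples = rng.dirichlet(A * q, size=100_000)
    print(A, samples.mean(axis=0).round(3), samples.std(axis=0).round(3))
```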

  14. Dirichlet prior distribution
  • The Dirichlet (here, Beta) distribution of this frequency θ is,
      D(θ) = (1/Z) θ^{A·q − 1} (1 − θ)^{A·(1−q) − 1}.
  • When the mean frequency of this type of amino acid is q = 0.05, changing A changes the shape of the distribution.
  [Figure: Beta densities with mean q = 0.05 for several values of A]
  • Although the means of θ are the same, the larger the value of A, the narrower the distribution.
  • In general, when we have high confidence in q, we use a large A value; otherwise, we should use a small A value.
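Since the single-class marginal is Beta(A·q, A·(1−q)), its variance has the closed form q(1−q)/(A+1), which makes the narrowing effect of A easy to tabulate in a short sketch:

```python
# Sketch: the marginal Beta(A*q, A*(1-q)) has mean q and variance
# q*(1-q)/(A+1), so larger A concentrates the prior around q.
q = 0.05
for A in (1, 10, 100, 1000):
    sd = (q * (1 - q) / (A + 1)) ** 0.5
    print(f"A={A:4d}  mean={q:.2f}  sd={sd:.4f}")
```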

  15. Dirichlet prior distribution
  • Now let's consider the posterior distribution after observing data, using a Dirichlet prior. Let K be the total number of observed amino acids in a column, of which n are of the type that we are considering.
  • The likelihood of this observation is given by a binomial distribution,
      P(n | θ) = C(K, n) θ^n (1 − θ)^{K−n}.
  • The posterior distribution is,
      p(θ | n) ∝ P(n | θ) D(θ) ∝ θ^{n + A·q − 1} (1 − θ)^{K − n + A·(1−q) − 1}.
  • Through normalization, we obtain another Beta density, with parameters n + A·q and K − n + A·(1−q).

  16. Dirichlet prior distribution
  • Therefore the posterior distribution of θ also follows a Dirichlet (Beta) distribution, but with different parameters.
  • The mean of the posterior distribution of θ is,
      E[θ | n] = (n + A·q) / (K + A).
  • When K is large, adding the prior pseudocount A·q has little effect on the probability, but when K is small, the effect can be large (see the sketch below).
  • This justifies using pseudocounts A·q_b to estimate the posterior frequencies of the amino acids b.
  [Figure: the posterior distribution p(θ|n) when the prior mean is q = 0.05 but the observed frequency is 0.5]
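A quick sketch of this behavior with the slide's numbers: prior mean q = 0.05, a pseudocount weight A = 20 (an assumed value), and data whose observed frequency is 0.5. The prior dominates for small K and washes out as K grows.

```python
# Sketch: posterior mean (n + A*q) / (K + A) with prior mean q = 0.05
# but data whose observed frequency is 0.5.
q, A = 0.05, 20
for K in (4, 40, 400, 4000):
    n = K // 2                       # observed frequency 0.5
    print(K, round((n + A * q) / (K + A), 3))
```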

  17. Application of profile HMMs
  • Once a profile HMM is constructed for a protein family, it can be used to score a new sequence. The sequence can also be aligned to the family using the path decoded by the Viterbi algorithm or by the forward and backward algorithms.
  • Two popular profile-HMM tool sets are freely available online:
    HMMER: http://hmmer.janelia.org/
    SAM: http://compbio.soe.ucsc.edu/sam.html
  • HMMER was developed by Sean Eddy and colleagues in the early 1990s. It contains tools for building a HMM from a multiple alignment and tools for searching a HMM database. HMMER is also associated with the Pfam protein family database at the same site.
  • SAM, the first set of profile-HMM tools, was developed by David Haussler, Anders Krogh, and colleagues in the early 1990s.
