1 / 21

A Hidden Markov Model for Progressive Multiple Alignment

A Hidden Markov Model for Progressive Multiple Alignment. Ari L ö ytynoja and Michel C. Milinkovitch Appeared in BioInformatics, Vol 19, no.12 , 2003 Presented by Sowmya Venkateswaran April 20,2006. Outline. Motivations Drawbacks of existing methods System and Methods Substitution Model

kumiko
Download Presentation

A Hidden Markov Model for Progressive Multiple Alignment

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. A Hidden Markov Model for Progressive Multiple Alignment Ari Löytynoja and Michel C. Milinkovitch Appeared in BioInformatics, Vol 19, no.12 , 2003 Presented by Sowmya Venkateswaran April 20,2006

  2. Outline • Motivations • Drawbacks of existing methods • System and Methods • Substitution Model • Hidden Markov Model • Pairwise Alignment using Viterbi Algorithm • Posterior Probability • Multiple Alignment • Results • Discussion

  3. Motivation • Progressive alignment techniques are used for Multiple Sequence Alignment • Used to deduce the phylogeny. • Identify protein families. • Probabilistic methods can be used to estimate the reliability of global/local alignments.

  4. Drawbacks of existing Systems • Iterative application of global/local pairwise sequence alignment algorithms does not guarantee a globally optimum alignment. • A best scoring alignment may not correspond with true alignment. Hence reliability of a score/alignment needs to be inferred.

  5. System and Methods • The idea is to provide a probabilistic framework for a guide tree and define a vector of probabilities at each character site. • Guide tree is constructed by using Neighbor Joining Clustering after producing a distance matrix. It can also be imported from CLUSTALW. • At each internal node, a probabilistic alignment is performed. Pointers from parent to child sites are stored and so also is a vector of probabilities of the different character states( ‘A/C/T/G/-’ for nucleotides or the 20 amino acids with a gap)

  6. Substitution Model • Consider 2 sequences x1…nand y1…m, whose alignment we would like to find and their parent in the guide tree is z1…l. • Pa(xi) is the probability that site xi contains character a. Pa(xi) = 1, if a character a appears at terminal node, else it is 0. • At internal nodes, different characters have different probabilities summing to 1. • If the observed character is ambiguous, probability is shared among different characters.

  7. Emission Probabilities Pxi,yj represents the probability that xi and yj are aligned. pxi,yj=pzk(xi,yj)=∑pzk=a(xi,yj) Pzk=a(xi,yj)=qa∑bsabpb(xi)∑bsabpb(yj) qa is the character background probability sab, probability of aligning characters a and b, is calculated with the Jukes Cantor Model sab=1/n + (n-1)/n * e –(n/n-1) v when a=b sab=1/n - 1/n * e –(n/n-1) v when a≠b n is the size of the alphabet , v is the NJ-estimated branch length

  8. Probabilities • To find pxi,- , the probability that zk evolved to a character on one of the child sites and a gap on the other child is pzk=a(xi,-)=qa∑bsabpb(xi)sa- The same applies for pxi,-. sa- is computed just like sab. • Any other model can be used for calculation of sab, instead of the Jukes Cantor Model. Ex: PAM (20 X 20) substitution matrix can be modified to include gaps and transformed to a (21X21) matrix, and the substitution probabilities can be derived from that.

  9. Hidden Markov Model X pxi,- ε δ 1-ε 1-2δ M pxi,yj 1-ε δ Y p-,yj ε

  10. Hidden Markov Model • δ – probability of moving to an insert state (gap opening penalty) ; lower the value, higher the penalty. • ε – probability of staying at an insert state (gap extension penalty); again, lower the value, more the extension penalty. • pxi,yj ,pxi,- , p-,yj –emission frequencies for match, insert X and insert Y states. • For testing purposes, δ and ε were estimated from pairwise alignments of terminal sequences such that δ=1/2(lm+1) and ε=1-1/(lg+1); lm and lg are the mean lengths of match and gap segments.

  11. Pairwise Alignment • In this probabilistic model, the best alignment between 2 sequences corresponds to the Viterbi path through the HMM. • Since there are 3 states in the model, and each state needs 2-D space, we have 3 2-D tables : vM for match states, vX and vY for the gap states. • A move within M, X or Y tables produces an additional match or extends an existing gap. A move between M table and either X or Y table closes or opens a gap.

  12. Viterbi Recursion Initialization: v(0,0) = 1, v(i,-1) = v(-1,j)=0 Recursion: vM(i,j) = pxi,yj max { (1-2δ) vM(i-1,j-1), (1-ε) vX(i-1,j-1), (1-ε) vY(i-1,j-1) } vX(i,j) = pxi,- max { δvM(i-1,j), εvX(i-1,j) } vY(i,j) = p-,yj max { δvM(i,j-1), εvY(i,j-1) } Termination: vE=max(vM(n,m),vX(n,m),vY(n,m))

  13. Viterbi traceback • At each cell, the relative probabilities of entering the different cells are stored. Ex: pM-M= (1-2δ) vM(i-1,j-1)/N(i,j) where N(i,j) is the normalizing constant, given by N(i,j)=(1-2δ) vM(i-1,j-1)+(1-ε)*[vX(i-1,j-1)+ vY(i-1,j-1)] The above equation is calculated for each of the 3 tables • Trace back algorithm used to find the best path; a match step will create pointers from the parent site to the child sites, and a gap step will create pointer to one and a gap for the 2nd child site.

  14. Posterior Probabilities-Forward algorithm Forward algorithm-sum of probabilities of all paths entering a given cell from the start position. Initialization: f(0,0)=1;f(i,-1)=f(-1,j)=0; Recursion: i=0,…,n j=0,…,m, except (0,0) fM(i,j) = pxi,yj [ (1-2δ) fM(i-1,j-1) + (1-ε) ( fX(i-1,j-1)+ fY(i-1,j-1))] fX(i,j) = pxi,- [ δ fM(i-1,j) + ε fX(i-1,j)] fY(i,j) = p-,yj [ δ fM(i,j-1) + ε fY(i,j-1)] Termination: fE=fM(n,m)+fX(n,m)+fY(n,m)

  15. Backward algorithm Sum of probabilities of all possible alignments between subsequences xi…n and yj…m. Initialization: b(n,m)=1; b(i,m+1) = f(n+1,j) = 0; Recursion: i=n,…,1 j=m,…,1, except (n,m) bM(i,j) = (1-2δ) px(i+1),y(j+1) bM(i+1,j+1) + δ [ px(i+1),- bX(i+1,j) + p-,y(j+1) bY(i,j+1)] bX(i,j) = (1-ε) px(i+1),y(j+1) bM(i+1,j+1) + ε px(i+1),- bX(i+1,j) bY(i,j) = (1-ε) px(i+1),y(j+1) bM(i+1,j+1) + ε p-,y(j+1) bX(i+1,j)

  16. Reliability Check • Assumption: Posterior probability of the sites on the alignment path is a valid estimator of the local reliability of the alignment since it gives the proportion of total probability corresponding to all alignments passing through the cell (i,j). • Posterior probability for a match is given by: P(xi◊yj|x,y) = fM(i,j) bM(i,j) / fE where fM and bM are the total probabilities of all possible alignments between subsequences x1…i and y1…j and xi…n and yj…m respectively Similar probabilities are calculated for Insert X and Insert Y states too.

  17. Multiple alignment • Each parent node site has a vector of probabilities corresponding to each possible character state (including the gap). For a match, pa(zk)=pzk=a(xi,yj)/∑bpzk=b(xi,yj) • Pairwise alignment builds the tree progressively, from the terminal nodes towards an arbitrary root. • Once root node is defined, trace-back is done to find multiple alignment of the nodes below since each node stores pointers to the matching child sites. • If a gap occurs in one of the internal nodes, a gap character state is introduced in all of the sequences of that sub-tree, and recursive call will not proceed further in that branch.

  18. Testing • Algorithms tested on (i) simulated nucleotide sequences 50 random data sets generated using the program Rose. A root random sequence (length 500) was evolved on a random tree to yield sequences of “low” (no. of substitutions per site 0.5) and “high” (1.0) divergences. Also, the insertion/deletion length distribution was set to ‘short’ or ‘long’. (ii) Amino acid data sets from Ref1 of the BAliBASE database. Ref1 contains alignments of less than 6 equi-distant sequences, i.e., the percent-identity between 2 sequences is within a specified range with no large insertion or deletion. Datasets were divided into 3 groups based on lengths, and further into 3 based on similarities.

  19. Results of Simulation on Nucleotide Sequences

  20. Type1 and Type 2 errors vs. minimum posterior probability

  21. Performance and Future Work • ProAlign performs better than ClustalW for the nucleotide sequences, but not for amino acid sequences with sequence identity less than 25%. • Possible reasons may be that the model does not take into account, the protein secondary structure. So, the HMM can be extended to modeling protein secondary structure too. • Minimum posterior probability correlates well with correctness ; can be used to detect/remove unreliably aligned regions

More Related