
Part 1 : Model and Methods


carter

Presentation Transcript


  1. ProbCons: Probabilistic Consistency-based MSA for Proteins. Part 1: Model and Methods

  2. Protein MSA vs DNA MSA
     • 20 symbols versus 4 symbols
     • Sequences are much shorter
     • Substitution and similarity relationships are more complex
     • The substitution model should account for possible functional and structural similarities.
     • Accuracy: DNA, 50% similarity; protein, 20% similarity
     • DNA: fewer sequences to compare
     • Protein: many sequences to compare
     • DNA aligners need to be able to handle long sequences; protein aligners do not

  3. Current protein alignment tools
     Most popular progressive alignment method:
     • CLUSTALW (1994)
     Other new tools:
     • DIALIGN: uses segment-based homology (1998)
     • T-COFFEE: a consistency-based aligner (2000)
     • MAFFT: varieties of iterative refinement techniques (2002)
     • MUSCLE: uses a log-expectation score function (2004)
     • Align-M: for highly divergent sequences (2004)

  4. Basic Idea: Consistency
     • Given 3 sequences x, y and z: if x_i aligns with z_k and y_j aligns with z_k, then x_i aligns with y_j.
     • Attempts to build a scoring function that scores an x_i-y_j alignment by its consistency with the other sequences in the column containing that alignment.
     • Uses multiple-sequence information when scoring pairwise alignments.
     • A probabilistic implementation is a natural fit.

  5. Method Overview
     • A pairwise HMM is used to generate alignments.
     • P_xy(i, j) = P(x_i and y_j align | x, y).
     • The consistency match scores are derived from these posterior probabilities. The resulting objective is called Maximum Expected Accuracy (MEA).
     • Compute the guide tree as usual.
     • Progressive alignment using the guide tree.
     • Post-processing by iterative refinement of the alignment.

  6. Pairwise HMM 1
     [Figure: three-state pair HMM. MATCH emits (x_i, y_j); INSERT X emits (x_i, -); INSERT Y emits (-, y_j). Example alignment: x = ABRACA-DABRA, y = AB-ACARDI---]
     • Basic HMM for sequence alignment between two sequences
     • M emits two letters, one from each sequence
     • Ix emits a letter from x that aligns to a gap
     • Iy emits a letter from y that aligns to a gap

  7. Pairwise HMM 2
     • The emission probabilities of M are estimated from a substitution scoring matrix.
     • Delta and epsilon are related to the gap penalty parameters.
     • High delta = low gap penalty
     • Low delta = high gap penalty
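The role of delta and epsilon can be made concrete with a small forward-backward pass over the three-state pair HMM. This is a minimal sketch, not the ProbCons implementation. It assumes the textbook parameterization (delta as the probability of opening a gap from M, epsilon as the probability of extending one, no direct Ix-to-Iy transition), a toy emission model in place of a substitution matrix, and illustrative default values. The function name `pair_posteriors` and its parameters are hypothetical.

```python
def pair_posteriors(x, y, delta=0.02, epsilon=0.4, alphabet="ACGT"):
    """Return (post, total) with post[i][j] = P(x[i] aligned to y[j] | x, y)."""
    k = len(alphabet)

    def p(a, b):
        # Toy match emission (assumption): identical letters are favoured.
        return 0.8 / k if a == b else 0.2 / (k * (k - 1))

    q = 1.0 / k  # toy gap-state emission (assumption): uniform background
    n, m = len(x), len(y)

    def zeros():
        return [[0.0] * (m + 1) for _ in range(n + 1)]

    # Forward pass: f*[i][j] covers the prefixes x[:i], y[:j].
    fM, fX, fY = zeros(), zeros(), zeros()
    fM[0][0] = 1.0  # the virtual start state behaves like M
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                fM[i][j] = p(x[i - 1], y[j - 1]) * (
                    (1 - 2 * delta) * fM[i - 1][j - 1]
                    + (1 - epsilon) * (fX[i - 1][j - 1] + fY[i - 1][j - 1]))
            if i > 0:
                fX[i][j] = q * (delta * fM[i - 1][j] + epsilon * fX[i - 1][j])
            if j > 0:
                fY[i][j] = q * (delta * fM[i][j - 1] + epsilon * fY[i][j - 1])

    # Backward pass: b*[i][j] covers the suffixes x[i:], y[j:].
    bM, bX, bY = zeros(), zeros(), zeros()
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                bM[i][j] = bX[i][j] = bY[i][j] = 1.0
                continue
            match = p(x[i], y[j]) * bM[i + 1][j + 1] if i < n and j < m else 0.0
            gap_x = q * bX[i + 1][j] if i < n else 0.0
            gap_y = q * bY[i][j + 1] if j < m else 0.0
            bM[i][j] = (1 - 2 * delta) * match + delta * (gap_x + gap_y)
            bX[i][j] = (1 - epsilon) * match + epsilon * gap_x
            bY[i][j] = (1 - epsilon) * match + epsilon * gap_y

    total = fM[n][m] + fX[n][m] + fY[n][m]  # P(x, y) under the model
    post = [[fM[i][j] * bM[i][j] / total for j in range(1, m + 1)]
            for i in range(1, n + 1)]
    return post, total
```

Raising `delta` in this sketch makes gap-containing paths more probable, which is the "high delta = low gap penalty" relationship on the slide.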

  8. Expected Accuracy of an Alignment
     [Equations: accuracy of a candidate alignment relative to the correct alignment; expected accuracy in terms of the alignment probability]
     • Measures the average ‘quality’ of the alignment across all columns.
     • Compute maximum expected accuracy for all sequence pairs using dynamic programming.
     • Gives a rough indication.
     • Finer value
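The dynamic program mentioned on this slide can be sketched as follows, assuming a matrix `post` of posterior match probabilities is already available. The alignment maximizing the summed posteriors of its matched pairs is found with a Needleman-Wunsch-style recurrence without gap penalties, and the sum is divided by the shorter sequence length to give an expected accuracy. The function name and the normalization by `min(n, m)` are assumptions of this sketch.

```python
def mea_alignment(post):
    """Return (pairs, expected_accuracy) for a posterior matrix post[i][j]."""
    n = len(post)
    m = len(post[0]) if n else 0
    # score[i][j]: best summed posterior for aligning the first i and j residues
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + post[i - 1][j - 1],
                              score[i - 1][j],
                              score[i][j - 1])
    # Traceback; gaps are preferred on ties so zero-probability pairs are skipped.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if score[i][j] == score[i - 1][j]:
            i -= 1
        elif score[i][j] == score[i][j - 1]:
            j -= 1
        else:
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
    pairs.reverse()
    shorter = min(n, m)
    return pairs, (score[n][m] / shorter if shorter else 0.0)
```

For example, `mea_alignment([[0.9, 0.05, 0.0], [0.05, 0.1, 0.8]])` matches residue 0 with 0 and residue 1 with 2, skipping the weakly supported middle column.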

  9. Probabilistic Consistency Transformation
     [Equations: consistency term; matrix form]
     • Re-estimate match quality scores by applying the probabilistic consistency transformation.
     • Attempts to score an aligned residue pair by its consistency across all the sequences.
     • Most terms of the matrices are ~0, so they are evaluated with sparse matrix methods.
     • Summed across all alignment columns.
     • Contrast with the Viterbi algorithm, which finds the most likely alignment by maximizing the sum of log probabilities.
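One way to read the matrix form is P'_xy = (1/|S|) Σ_z P_xz P_zy, summed over every sequence z in the set S, with P_xx taken as the identity. The sketch below assumes that reading. It uses dense nested lists for clarity rather than the sparse representation the slide mentions, and the names `consistency_transform`, `post`, and `lengths` are hypothetical.

```python
def matmul(a, b):
    """Plain dense matrix product of two nested-list matrices."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][c] for k in range(inner)) for c in range(cols)]
            for row in a]


def consistency_transform(post, lengths):
    """post[(x, y)] is the posterior matrix for sequences x and y.

    lengths maps each sequence name to its length. Returns a dict with the
    same keys holding the re-estimated matrices.
    """
    names = list(lengths)

    def get(a, b):
        if a == b:  # a sequence aligns to itself with certainty
            n = lengths[a]
            return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
        if (a, b) in post:
            return post[(a, b)]
        return [list(col) for col in zip(*post[(b, a)])]  # transpose

    new = {}
    for (x, y) in post:
        acc = [[0.0] * lengths[y] for _ in range(lengths[x])]
        for z in names:
            prod = matmul(get(x, z), get(z, y))
            for i in range(lengths[x]):
                for j in range(lengths[y]):
                    acc[i][j] += prod[i][j]
        new[(x, y)] = [[v / len(names) for v in row] for row in acc]
    return new
```

With three one-residue sequences where x-y has posterior 0.2 but x-z and y-z have 0.9 and 0.8, the transformed x-y score rises to about 0.37, because z supports the pairing.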

  10. Guide Tree and Progressive Alignment
      • Compute the guide tree by hierarchical clustering.
      • Use MEA as the measure of similarity between 2 sequences to construct the tree.
      • Then progressively align sequences following the guide tree, using the transformed match quality scores.
      • Use a simple scoring function to get a basic alignment and a more sophisticated function to improve on it.
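The clustering step might look like the sketch below, which greedily merges the two most similar clusters and scores the merged cluster against the rest by a size-weighted average. The linkage rule and the names are assumptions of this sketch; the slide only says that hierarchical clustering on MEA similarity is used.

```python
from itertools import combinations


def guide_tree(names, similarity):
    """Build a nested-tuple guide tree from pairwise similarities.

    similarity maps a pair (a, b) of sequence names to a score, for example
    the expected accuracy of their pairwise alignment.
    """
    sim = {frozenset(pair): value for pair, value in similarity.items()}
    size = {name: 1 for name in names}  # cluster -> number of leaves
    while len(size) > 1:
        # Pick the most similar pair of current clusters.
        a, b = max(combinations(list(size), 2),
                   key=lambda pq: sim[frozenset(pq)])
        merged = (a, b)
        for c in size:
            if c not in (a, b):
                # Size-weighted average similarity to the merged cluster.
                sim[frozenset((merged, c))] = (
                    size[a] * sim[frozenset((a, c))]
                    + size[b] * sim[frozenset((b, c))]
                ) / (size[a] + size[b])
        size[merged] = size.pop(a) + size.pop(b)
    return next(iter(size))
```

Progressive alignment would then walk this tree from the leaves upward, aligning the two children of each internal node.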

  11. Iterative Refinement
      • Randomly split the sequences into 2 groups and realign the groups using the previous alignment procedure.
      • Repeated 100 times in ProbCons
      • Can also be repeated until the generated alignment converges
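The refinement loop can be outlined as below. This is a sketch under assumptions: the alignment is a dict from sequence name to gapped row, and `realign` is a hypothetical callback that takes the current alignment plus the two groups and returns a new alignment (in a real aligner it would realign the two sub-alignments against each other).

```python
import random


def iterative_refinement(alignment, realign, rounds=100, seed=0):
    """Repeatedly bipartition the sequences at random and realign the halves."""
    rng = random.Random(seed)
    names = sorted(alignment)
    for _ in range(rounds):
        if len(names) < 2:
            break
        group = [name for name in names if rng.random() < 0.5]
        if not group or len(group) == len(names):
            continue  # skip degenerate splits where one side is empty
        rest = [name for name in names if name not in group]
        alignment = realign(alignment, group, rest)
    return alignment
```

A convergence-based variant would compare the alignment before and after each round and stop once it no longer changes.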

  12. Column Reliability Estimation
      • Defined in terms of the posterior probability values.
      • It is the average value of the pairwise alignment probabilities.
      • Equivalently, the expected number of correct pairwise matches in each alignment column.
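As an illustration, the per-column average could be computed as below. The sketch assumes the alignment is a dict of equal-length gapped rows and that `post[(a, b)]` holds the posterior matrix for each pair of names in dict order. Averaging only over residue pairs actually present in the column is an assumption of this sketch, as are the names.

```python
from itertools import combinations


def column_reliability(rows, post, gap="-"):
    """Average pairwise posterior of the residue pairs in each column."""
    names = list(rows)
    width = len(rows[names[0]])
    position = {name: -1 for name in names}  # last residue index consumed
    reliability = []
    for col in range(width):
        present = []
        for name in names:
            if rows[name][col] != gap:
                position[name] += 1
                present.append(name)
        probs = [post[(a, b)][position[a]][position[b]]
                 for a, b in combinations(present, 2)]
        reliability.append(sum(probs) / len(probs) if probs else 0.0)
    return reliability
```

Columns with fewer than two residues get a reliability of 0.0 here, since no pairwise match is asserted in them.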

  13. Review and Comments
      + Interesting scoring function concept.
      + Works very well without incorporating biological knowledge such as position-specific gap scoring or evolutionary tree construction, which many algorithms like Clustal use.
      + Can be refined further by incorporating these features into the framework.
      • Complex scoring methodology. Computational cost is large for long sequences, e.g. DNA.

  14. ProbCons: Probabilistic Consistency MSA. Part 2: Performance, Comparisons and Results

  15. Benchmark alignment database
      BAliBASE 2.01 (Thompson et al. 1999a)
      • collection of 141 reference protein alignments
      • five reference sets
        Ref1: equi-distant sequences of similar length
        Ref2: families of closely related sequences
        Ref3: equi-distant divergent families
        Ref4: sequences with large N/C-terminal extensions
        Ref5: sequences with large internal insertions

  16. Benchmark alignment database
      PREFAB 3.0 (Edgar 2004)
      • 1932 alignments averaging 49 sequences of length 240
      SABmark 1.63 (Van Walle et al. 2004)
      • two sets of consensus regions based on structural alignment
      • “twilight zone” set with no more than 25% identity
      “Twilight zone”: when sequence identity falls below 30%, alignment accuracies drop considerably

  17. Measure of alignment accuracy
      Compare ProbCons with other aligners
      • using the original benchmarking measures associated with each database
        BAliBASE: SP (sum-of-pairs score), CS (column score)
        PREFAB: Quality (Q) score
        SABmark: fD (developer score), fM (modeler score)
      • for each database, the scores were averaged over all multiple alignment datasets
      • default options for ProbCons: 2 iterations of consistency transformation, 100 rounds of iterative refinement
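To make the pair-based and column-based measures concrete, here is a simplified sketch. It scores a test alignment against a reference as the fraction of reference residue pairs recovered (the idea behind sum-of-pairs-style scores) and as the fraction of reference columns reproduced exactly (a simplified column score). The official benchmark scripts may differ in details such as which columns or regions are counted, so these functions are illustrative assumptions only.

```python
from itertools import combinations


def residue_columns(rows, gap="-"):
    """Turn gapped rows into columns of residue indices (None for a gap)."""
    counters = [0] * len(rows)
    columns = []
    for c in range(len(rows[0])):
        column = []
        for s, row in enumerate(rows):
            if row[c] == gap:
                column.append(None)
            else:
                column.append(counters[s])
                counters[s] += 1
        columns.append(tuple(column))
    return columns


def aligned_pairs(columns):
    """Set of (seq_a, index_a, seq_b, index_b) residue pairs sharing a column."""
    pairs = set()
    for column in columns:
        residues = [(s, i) for s, i in enumerate(column) if i is not None]
        for (s, i), (t, j) in combinations(residues, 2):
            pairs.add((s, i, t, j))
    return pairs


def sp_score(test, ref):
    """Fraction of reference residue pairs that the test alignment recovers."""
    ref_pairs = aligned_pairs(residue_columns(ref))
    test_pairs = aligned_pairs(residue_columns(test))
    return len(ref_pairs & test_pairs) / len(ref_pairs)


def column_score(test, ref):
    """Fraction of reference columns that appear unchanged in the test."""
    ref_columns = residue_columns(ref)
    test_columns = set(residue_columns(test))
    return sum(col in test_columns for col in ref_columns) / len(ref_columns)
```

Both functions expect the rows of `test` and `ref` in the same sequence order and a reference with at least one aligned pair.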

  18. Results on BAliBASE

  19. ProbCons-ext

  20. Significance test for aligner differences

  21. Column reliability

  22. Results on PREFAB
      Aligner        Overall   Time
      ------------   -------   -----------
      Dialign        57.2      12h 25min
      CLUSTALW       58.9      2h 58min
      T-coffee       63.6      144h 51min
      MUSCLE         64.8      3h 11min
      MAFFT          64.8      2h 36min
      ProbCons       66.9      19h 41min
      ProbCons-ext   68.0      37h 46min

  23. Significance test for differences in PREFAB

  24. Result on SABmark

  25. Significance test for differences in SABmark

  26. Comparison of ProbCons variants
      1. highest probability alignment vs highest expected accuracy alignment
      2. all-pairs pairwise alignment vs full multiple alignment
      3. varying the number of consistency transformations
      4. omitting the iterative refinement

  27. Results of ProbCons Variants on SABmark
      (c = number of consistency transformations, Ir = rounds of iterative refinement)
      Algorithm       c   Ir    Output     fD     fM     Time (mm:ss)
      -------------   -   ---   --------   ----   ----   ------------
      1. Viterbi      0   0     Pairwise   27.5   17.2   0:42
      2. Posterior    0   0     Pairwise   29.6   18.5   2:54
      3. Posterior    1   0     Pairwise   32.5   20.4   3:15
      4. Posterior    2   0     Pairwise   33.2   21.0   3:47
      5. Posterior    0   0     Multiple   29.1   19.8   2:57
      6. Posterior    1   0     Multiple   30.9   20.8   3:17
      7. Posterior    2   0     Multiple   31.5   21.3   3:50
      8. Posterior    0   100   Multiple   30.6   20.8   4:14
      9. Posterior    2   100   Multiple   32.1   21.7   5:50

  28. Conclusion
      ProbCons is a practical tool for protein alignment
      • dramatic improvements in alignment accuracy
      • competitive running times
      Main features that contribute to the improvements
      • maximum expected accuracy as the objective function
      • probabilistic consistency transformation

  29. Future work
      Possible extension of the probabilistic model?
      • other features used by CLUSTALW
      • position-specific gap scoring
      • rigorous evolutionary tree construction
      Application of the methodology to other tasks?
      • RNA structure alignment and prediction
      • motif detection and gene finding
