Part 1 : Model and Methods

ProbCons : Probabilistic Consistency based MSA for Proteins Part 1 : Model and Methods

Protein MSA vs DNA MSA
• 20 symbols versus 4 symbols
• Protein sequences are much shorter
• Substitution and similarity relationships are more complex
• The substitution model should account for possible functional and structural similarities
• Accuracy: DNA – 50% similarity; protein – 20% similarity
• DNA – fewer sequences to compare
• Protein – many sequences to compare
• DNA aligners need to be able to handle long sequences, protein aligners do not

Current protein alignment tools
Most popular progressive alignment method:
• CLUSTALW (1994)
Other new tools:
• DIALIGN – using segment-based homology (1998)
• T-COFFEE – a consistency based aligner (2000)
• MAFFT – varieties of iterative refinement techniques (2002)
• MUSCLE – using log-expectation score function (2004)
• Align-M – for highly divergent sequences (2004)

Basic Idea : Consistency
• Given 3 sequences x, y and z: if xi aligns with zk and yj aligns with zk, then xi should align with yj.
• Attempts to build a scoring function that scores an xi-yj match by its consistency with the other sequences in the alignment.
• Uses multiple sequence information when scoring pairwise alignments.
• Implemented probabilistically, which is a natural fit for the idea.

Method Overview
• A pairwise HMM is used to model the alignment of each pair of sequences.
• Posterior match probability: Pxy(i,j) = P(xi and yj align | x, y).
• The consistency match scores are derived from these posterior probabilities; the alignment objective is called Maximum Expected Accuracy (MEA).
• Compute the guide tree as usual.
• Progressive alignment using the guide tree.
• Post-processing by iterative refinement.

Pairwise HMM 1
[Figure: three-state pair HMM. MATCH emits (xi, yj); INSERT X emits (xi, -); INSERT Y emits (-, yj). Example alignment: x = ABRACA-DABRA, y = AB-ACARDI---]
• Basic HMM for sequence alignment between two sequences
• M emits two letters, one from each sequence
• Ix emits a letter from x that aligns to a gap
• Iy emits a letter from y that aligns to a gap

Pairwise HMM 2
• M is estimated from a substitution scoring matrix.
• Delta and epsilon are related to the gap penalty parameters.
• High delta = low gap penalty
• Low delta = high gap penalty
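
The link between delta, epsilon and gap penalties can be illustrated with a minimal sketch. It assumes the usual three-state layout in which M moves to either insert state with probability delta, an insert state extends itself with probability epsilon, and there are no direct Ix/Iy transitions and no end state. The function names and the use of -log(delta) as a stand-in for the gap-open cost are illustrative assumptions, not the exact ProbCons parameterization.

```python
import math

def pair_hmm_transitions(delta, epsilon):
    """Transition table for a three-state pair HMM (assumed layout, no end state)."""
    if not (0.0 < delta < 0.5 and 0.0 < epsilon < 1.0):
        raise ValueError("need 0 < delta < 0.5 and 0 < epsilon < 1")
    return {
        "M":  {"M": 1.0 - 2.0 * delta, "Ix": delta, "Iy": delta},
        "Ix": {"M": 1.0 - epsilon, "Ix": epsilon, "Iy": 0.0},
        "Iy": {"M": 1.0 - epsilon, "Ix": 0.0, "Iy": epsilon},
    }

def gap_open_cost(delta):
    """Illustrative log-space cost of opening a gap: -log(delta)."""
    return -math.log(delta)

if __name__ == "__main__":
    for d in (0.02, 0.2):
        print(f"delta={d}: gap-open cost ~ {gap_open_cost(d):.2f}")
```

A larger delta makes the move from M into an insert state more probable, so its negative log (the penalty) shrinks, which matches the two bullets above.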

Expected Accuracy of an Alignment
[Equation: expected accuracy of an alignment; annotated terms: candidate alignment, correct alignment, alignment probability]
• Measures the average ‘quality’ of the alignment across all columns.
• Compute maximum expected accuracy for all sequence pairs using dynamic programming.
• Gives a rough indication.
• Finer value
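
The dynamic programming step can be sketched as follows, assuming the posterior match probabilities are already available as a matrix P with P[i][j] = P(xi and yj align | x, y). The sketch uses a Needleman-Wunsch style recurrence with no gap penalties, so the score of an alignment is simply the sum of the posteriors of its matched pairs; the normalization by the shorter sequence length and the function name are assumptions made for illustration.

```python
def max_expected_accuracy(P):
    """Return (expected accuracy, matched index pairs) for posterior matrix P."""
    n = len(P)
    m = len(P[0]) if n else 0
    # A[i][j] = best summed posterior for prefixes x[:i], y[:j]
    A = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            A[i][j] = max(A[i - 1][j - 1] + P[i - 1][j - 1],  # match xi with yj
                          A[i - 1][j],                        # xi against a gap
                          A[i][j - 1])                        # yj against a gap
    # Trace back to recover the matched pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if A[i][j] == A[i - 1][j - 1] + P[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif A[i][j] == A[i - 1][j]:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    shorter = min(n, m)
    return (A[n][m] / shorter if shorter else 0.0), pairs

if __name__ == "__main__":
    P = [[0.9, 0.05, 0.0],
         [0.05, 0.1, 0.8]]
    print(max_expected_accuracy(P))
```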

Probabilistic Consistency Transformation
[Equations: consistency term; matrix form]
• Re-estimate match quality scores by applying the probabilistic consistency transformation.
• Attempts to score an aligned residue pair by its consistency across all the sequences.
• Most entries of the matrices are ~0, so they are evaluated with sparse matrix methods.
• Summed across all alignment columns.
• Contrast with the Viterbi algorithm, which finds the most likely alignment by maximizing the sum of log probabilities.
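
The matrix form can be illustrated with a small sketch. It assumes the transformed matrix for a pair (x, y) is the average over every sequence z of the product Pxz · Pzy, with Pxx taken as the identity matrix. Dense nested lists are used here for readability, whereas the slide notes that sparse methods are used in practice; the dictionary layout and all function names are illustrative assumptions.

```python
def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def consistency_transform(post, lengths):
    """post[(a, b)] is the len(a) x len(b) posterior matrix; lengths maps name -> length."""
    names = list(lengths)

    def P(a, b):
        if a == b:
            return identity(lengths[a])  # a sequence aligns to itself
        return post[(a, b)] if (a, b) in post else transpose(post[(b, a)])

    transformed = {}
    for a, b in post:
        total = [[0.0] * lengths[b] for _ in range(lengths[a])]
        for z in names:  # includes z == a and z == b
            product = matmul(P(a, z), P(z, b))
            for i, row in enumerate(product):
                for j, value in enumerate(row):
                    total[i][j] += value
        transformed[(a, b)] = [[v / len(names) for v in row] for row in total]
    return transformed

if __name__ == "__main__":
    lengths = {"x": 1, "y": 1, "z": 1}
    post = {("x", "y"): [[0.5]], ("x", "z"): [[0.8]], ("y", "z"): [[0.9]]}
    print(consistency_transform(post, lengths))
```

In the toy input, x and y both align strongly to z, so the x-y score rises from 0.5 after one transformation.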

Guide Tree and Progressive Alignment
• Compute the guide tree by hierarchical clustering.
• Use MEA as the measure of similarity between 2 sequences to construct the tree.
• Then progressively align sequences as per the guide tree using the transformed match quality scores.
• Use a simple scoring function to get a basic alignment and a sophisticated function to improve on it.
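
A guide tree built by hierarchical clustering might look like the sketch below. It greedily merges the two most similar clusters and updates similarities with a size-weighted average (UPGMA style). That update rule is an assumption chosen for simplicity and may differ from the one ProbCons actually uses; the similarity values stand in for pairwise expected accuracies.

```python
from itertools import combinations

def build_guide_tree(names, similarity):
    """similarity[(a, b)] = expected accuracy of aligning a with b; returns nested tuples."""
    sim = {frozenset(pair): score for pair, score in similarity.items()}
    size = {name: 1 for name in names}  # cluster -> number of leaves
    while len(size) > 1:
        # Pick the most similar pair of current clusters.
        a, b = max(combinations(size, 2), key=lambda pair: sim[frozenset(pair)])
        merged = (a, b)
        na, nb = size.pop(a), size.pop(b)
        for c in size:
            sim[frozenset((merged, c))] = (
                na * sim[frozenset((a, c))] + nb * sim[frozenset((b, c))]
            ) / (na + nb)
        size[merged] = na + nb
    return next(iter(size))

if __name__ == "__main__":
    sims = {("x", "y"): 0.9, ("x", "z"): 0.4, ("y", "z"): 0.5}
    print(build_guide_tree(["x", "y", "z"], sims))
```

Progressive alignment then walks this tree from the leaves upward, aligning the two children of each node.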

Iterative Refinement
• Randomly split the aligned sequences into 2 groups and realign the groups using the previous alignment procedure.
• Repeated 100 times in ProbCons.
• Can also be repeated until the generated alignment converges.
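
The refinement loop can be sketched as below. The group-to-group realignment step is passed in as a caller-supplied function (`realign` is a hypothetical placeholder, not a ProbCons API), and the way the random split is drawn is an assumption; the sketch only shows the split, the removal of all-gap columns within each group, and the repetition.

```python
import random

def drop_all_gap_columns(rows):
    """Remove columns that contain only gaps within this group of rows."""
    keep = [c for c in range(len(rows[0])) if any(r[c] != "-" for r in rows)]
    return ["".join(r[c] for c in keep) for r in rows]

def refine(alignment, realign, rounds=100, seed=0):
    """alignment: dict name -> gapped row. realign(group_a, group_b) -> new alignment dict."""
    rng = random.Random(seed)
    for _ in range(rounds):
        names = list(alignment)
        if len(names) < 2:
            break
        rng.shuffle(names)
        cut = rng.randint(1, len(names) - 1)  # both groups are non-empty
        groups = []
        for part in (names[:cut], names[cut:]):
            rows = drop_all_gap_columns([alignment[n] for n in part])
            groups.append(dict(zip(part, rows)))
        alignment = realign(groups[0], groups[1])
    return alignment
```

A convergence-based variant would stop when `realign` returns an alignment identical to its input for several consecutive rounds.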

Column Reliability Estimation
• This is defined based on the posterior probability value.
• It is the average value of the pairwise alignment probabilities.
• Expected value of correct pairwise matches in each alignment column.
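
A per-column reliability score along these lines could be computed as in the sketch below. It averages the posterior match probabilities over all pairs of sequences that both have a residue in the column; skipping pairs that involve a gap, and scoring columns with fewer than two residues as 0, are assumptions made for this illustration.

```python
from itertools import combinations

def column_reliability(alignment, post):
    """alignment: dict name -> gapped row; post[(a, b)][i][j] = P(a_i aligns with b_j)."""
    names = list(alignment)
    width = len(alignment[names[0]])
    position = {n: -1 for n in names}  # index of the last residue seen per sequence
    scores = []
    for c in range(width):
        for n in names:
            if alignment[n][c] != "-":
                position[n] += 1
        probs = [
            post[(a, b)][position[a]][position[b]]
            for a, b in combinations(names, 2)
            if alignment[a][c] != "-" and alignment[b][c] != "-"
        ]
        scores.append(sum(probs) / len(probs) if probs else 0.0)
    return scores

if __name__ == "__main__":
    aln = {"x": "A-C", "y": "AGC"}
    post = {("x", "y"): [[0.9, 0.05, 0.0], [0.0, 0.1, 0.8]]}
    print(column_reliability(aln, post))
```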

Review and Comments
+ Interesting scoring function concept.
+ Works very well without incorporating biological knowledge, such as position-specific gap scoring and evolutionary tree construction, that many algorithms like Clustal use.
+ Could be refined further by incorporating these features into the framework.
- Complex scoring methodology. Computational cost is large for long sequences, e.g. DNA.

ProbCons : Probabilistic Consistency MSA Part 2 : Performance, Comparisons and Results

Benchmark alignment database
BAliBASE 2.01 (Thompson et al. 1999a)
• collection of 141 reference protein alignments
• five reference sets
  Ref1 : equi-distant sequences of similar length
  Ref2 : families of closely related sequences
  Ref3 : equi-distant divergent families
  Ref4 : sequences with large N/C-terminal extensions
  Ref5 : sequences with large internal insertions

Benchmark alignment database
PREFAB 3.0 (Edgar 2004)
• 1932 alignments averaging 49 sequences of length 240
SABmark 1.63 (Van Walle et al. 2004)
• two sets of consensus regions based on structural alignment
• “twilight zone” set with no more than 25% identity
“twilight zone” – when sequence identity falls below 30%, alignment accuracies drop considerably

Measure of alignment accuracy
Compare ProbCons with other aligners
• using the original benchmarking measures associated with each database
  BAliBASE: SP (sum-of-pairs score), CS (column score)
  PREFAB: Quality (Q) score
  SABmark: fD (developer score), fM (modeler score)
• for each database, averaged the scores over all multiple alignment datasets
• default options for ProbCons: 2 iterations of consistency transformation, 100 rounds of iterative refinement
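
To make the BAliBASE measures concrete, here is a minimal sketch of the sum-of-pairs and column scores. It assumes SP is the fraction of residue pairs aligned in the reference that are also aligned in the test alignment, and CS is the fraction of reference columns reproduced exactly in the test alignment. The sketch scores every column rather than only annotated core blocks, counts only reference columns with at least two residues, and expects both alignments to list the same sequences in the same order; these simplifications and the function names are assumptions.

```python
from itertools import combinations

def columns(rows):
    """Turn gapped rows into columns of residue indices (None for a gap)."""
    counters = [0] * len(rows)
    cols = []
    for c in range(len(rows[0])):
        col = []
        for s, row in enumerate(rows):
            if row[c] == "-":
                col.append(None)
            else:
                col.append(counters[s])
                counters[s] += 1
        cols.append(tuple(col))
    return cols

def aligned_pairs(cols):
    """Set of (seq s, residue i, seq t, residue j) placed in the same column."""
    pairs = set()
    for col in cols:
        residues = [(s, i) for s, i in enumerate(col) if i is not None]
        for (s, i), (t, j) in combinations(residues, 2):
            pairs.add((s, i, t, j))
    return pairs

def sp_score(test, ref):
    ref_pairs = aligned_pairs(columns(ref))
    return len(ref_pairs & aligned_pairs(columns(test))) / len(ref_pairs)

def cs_score(test, ref):
    test_cols = set(columns(test))
    ref_cols = [c for c in columns(ref) if sum(i is not None for i in c) >= 2]
    return sum(c in test_cols for c in ref_cols) / len(ref_cols)

if __name__ == "__main__":
    ref = ["AC-G", "ACTG"]
    test = ["ACG-", "ACTG"]
    print(sp_score(test, ref), cs_score(test, ref))
```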

Results on BAliBASE

ProbCons-ext

Significance test for aligner differences

Column reliability

Results on PREFAB
--------------------------------------------
Aligner         Overall    Time
--------------------------------------------
Dialign         57.2       12h 25min
CLUSTALW        58.9       2h 58min
T-coffee        63.6       144h 51min
MUSCLE          64.8       3h 11min
MAFFT           64.8       2h 36min
ProbCons        66.9       19h 41min
ProbCons-ext    68.0       37h 46min
--------------------------------------------

Significance test for differences in PREFAB

Result on SABmark

Significance test for differences in SABmark

Comparison of ProbCons variants
1. highest probability alignment vs highest expected accuracy alignment
2. all-pairs pairwise alignment vs full multiple alignment
3. varying the number of consistency transformations
4. omitting the iterative refinement

Results of ProbCons Variants on SABmark
(c = number of consistency transformations, Ir = rounds of iterative refinement)
-----------------------------------------------------------------
    Algorithm    c    Ir     Output      fD      fM      Time (mm:ss)
-----------------------------------------------------------------
1.  Viterbi      0    0      Pairwise    27.5    17.2    0:42
2.  Posterior    0    0      Pairwise    29.6    18.5    2:54
3.  Posterior    1    0      Pairwise    32.5    20.4    3:15
4.  Posterior    2    0      Pairwise    33.2    21.0    3:47
5.  Posterior    0    0      Multiple    29.1    19.8    2:57
6.  Posterior    1    0      Multiple    30.9    20.8    3:17
7.  Posterior    2    0      Multiple    31.5    21.3    3:50
8.  Posterior    0    100    Multiple    30.6    20.8    4:14
9.  Posterior    2    100    Multiple    32.1    21.7    5:50
-----------------------------------------------------------------

Conclusion
ProbCons is a practical tool for protein alignment
• dramatic improvements in alignment accuracy
• competitive running times
Main features that contribute to the improvements
• maximum expected accuracy as the objective function
• probabilistic consistency transformation

Future work
Possible extension of the probabilistic model?
• other features used by CLUSTALW
• position-specific gap scoring
• rigorous evolutionary tree construction
Application of the methodology to other tasks?
• RNA structure alignment and prediction
• motif detection and gene finding
