Progressive multiple sequence alignments from triplets

by Matthias Kruspe and Peter F Stadler
Presented by Syed Nabeel

Outline • Background • Motivation • Algorithm • Complexity Analysis • Experiments and Results • Discussions and Future work

Background
• Sequence alignment: a way of arranging the primary sequences of DNA, RNA, or protein to identify regions of similarity.
• Pairwise sequence alignment: an alignment of two sequences that maximizes the common elements of the pair (usually under a scoring scheme).

Multiple sequence alignment (MSA)
• Scoring scheme: used to assess the quality of an alignment. Scores are calculated from substitution matrices such as BLOSUM and PAM.
• Multiple sequence alignment (MSA): an extension of pairwise alignment that incorporates more than two sequences at a time.
• Multiple alignments are often used to identify conserved regions across a group of sequences hypothesized to be evolutionarily related.
• Computing an optimal MSA is an NP-hard problem.

MSA Example

Heuristic methods for MSA
• Progressive methods: ClustalW, T-Coffee, POA, etc.
• Iterative methods: Muscle, DIALIGN, etc.
• Probabilistic methods: Probcons, Hmmt, Muscle, etc.

Progressive method
• Makes explicit use of the evolutionary relatedness of the sequences to build the alignment.
• The complete MSA of the given sequences is calculated from pairwise alignments of previously aligned sequences, following the branching order of a pre-computed "guide" tree.
• The guide tree is usually constructed with a clustering method such as Neighbor-Joining or UPGMA.

Problems with Existing Progressive Methods
• They are not guaranteed to find the optimal alignment.
• They utilize only a small part of the information that is potentially available in the complete data set.
• The relative placement of adjacent insertions and deletions leads to score-equivalent alignments, among which the algorithm chooses one by means of a pragmatic rule (e.g. "Always make insertions before deletions").
• There is no mechanism to identify errors made in previous steps and to correct them during later stages.

Motivation for aln3nn
• Utilizes an exact algorithm to compute alignments of sequence and profile triples.
• Instead of a single guide tree, phylogenetic networks as constructed by the Neighbor-Net algorithm are used.
• Involves an aggregation step that constructs pairs from triples, subdividing 3-way alignments into pairs of alignments.
• Provides a chance to remove erroneously inserted gaps at later aggregation steps.

Dynamic Programming Approaches
• Needleman-Wunsch algorithm: the basic dynamic programming scheme for pairwise sequence comparison.
• Requires quadratic space and time.
• Translates easily to a cubic space and time algorithm for three sequences.
• Uses only simple (non-affine) gap cost functions.
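The quadratic scheme can be sketched in a few lines. This is a minimal illustration of the Needleman-Wunsch recurrence that returns the score only; the function name and the match, mismatch, and gap values are assumptions chosen for the example, not parameters from the paper.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score with a linear gap cost (illustrative values)."""
    n, m = len(a), len(b)
    # F[i][j] holds the best score for aligning a[:i] with b[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        F[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                F[i - 1][j] + gap,     # a[i-1] against a gap
                F[i][j - 1] + gap,     # b[j-1] against a gap
            )
    return F[n][m]


print(needleman_wunsch_score("ACGT", "AGT"))
```

The table has (n+1) x (m+1) cells and each cell is filled in constant time, which is where the quadratic space and time come from.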

Linear vs Affine Gap Costs
• Linear gap cost
  • Has only one parameter d, the cost per unit length of a gap.
  • d is almost always negative, so an alignment with fewer gap positions is favoured over one with more.
  • The overall cost of one large gap is the same as that of many small gaps of the same total length.
• Affine gap cost
  • A higher penalty is assigned for opening a new gap than for extending an existing one.
  • This removes the problem of linear gap costs: one large gap is penalized less than many small gaps of the same total length.
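A small worked comparison makes the difference concrete. The sketch below assumes one common affine convention (the first gap position costs the gap-open score, each further position the gap-extension score); the function names and numeric values are illustrative only.

```python
def linear_gap_cost(length, d=-2):
    """Linear model: every gap position costs d."""
    return d * length


def affine_gap_cost(length, gap_open=-10, gap_extend=-1):
    """Affine model (assumed convention): gap_open for the first position,
    gap_extend for each additional position."""
    if length == 0:
        return 0
    return gap_open + (length - 1) * gap_extend


# One gap of length 6 versus three gaps of length 2 (same total length).
print(linear_gap_cost(6), 3 * linear_gap_cost(2))
print(affine_gap_cost(6), 3 * affine_gap_cost(2))
```

Under the linear model both layouts score the same, while under the affine model the single long gap is penalized less than the three short ones.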

Gotoh's Algorithm
• Makes use of affine gap costs.
• Quadratic CPU and memory requirements for two sequences.
• Alignment of three sequences with affine gap costs requires O(n^3) time and space.
• aln3nn is based on Gotoh's algorithm with minor modifications.

Basic Concepts
• Let A, B, and C denote the three sequences.
• A_i, B_j, and C_k refer to the ith, jth, and kth positions in A, B, and C.
• '-' denotes the gap character.
• Scores for the alignment of two or three non-gap characters are denoted by S(α, β) and S(α, β, γ).
• Gap penalties are determined from gap open (go) and gap extension (ge) scores.
• M(i, j, k) denotes the best score of the alignments of the prefixes of A, B, and C ending at positions i, j, and k, given that the residues (A_i, B_j, C_k) are aligned.
• I_xy(i, j, k) is the best score given that (A_i, B_j, -) is the last column of the partial alignment.
• I_x(i, j, k) is the best score given that the last column is of the form (A_i, -, -).
• The sum-of-pairs model is used for substitution scores: S(α, β, γ) = S(α, β) + S(α, γ) + S(β, γ).
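The sum-of-pairs rule is easy to state in code. In this sketch the pairwise score is a toy match/mismatch function standing in for a substitution matrix such as BLOSUM; the names and values are assumptions for illustration.

```python
def pair_score(a, b, match=1, mismatch=-1):
    """Toy pairwise substitution score (stand-in for a real matrix)."""
    return match if a == b else mismatch


def sum_of_pairs(a, b, c, s=pair_score):
    """S(a, b, c) = S(a, b) + S(a, c) + S(b, c)."""
    return s(a, b) + s(a, c) + s(b, c)


print(sum_of_pairs("A", "A", "A"))
print(sum_of_pairs("A", "A", "C"))
```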

Recurrences
Case 1: (A_i, B_j, C_k). All three sequences are aligned.

Recurrences (contd.)
Case 2: (A_i, B_j, -). Gap in the C sequence.

Recurrences (contd.)
Case 3: (A_i, -, -). Gaps in the B and C sequences.
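The three column types above (plus their symmetric variants) give seven possible last columns in a three-way alignment. The sketch below is a deliberately simplified illustration, not the aln3nn recurrences: it uses a single table and a linear gap cost instead of the separate M, I_xy, and I_x tables needed for affine gaps. All names and score values are hypothetical.

```python
from itertools import product


def column_score(x, y, z, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of one alignment column; '-' is the gap character."""
    total = 0
    for p, q in ((x, y), (x, z), (y, z)):
        if p == "-" and q == "-":
            continue                # gap against gap contributes nothing
        if p == "-" or q == "-":
            total += gap            # residue against gap
        else:
            total += match if p == q else mismatch
    return total


def align3_score(A, B, C):
    """Best sum-of-pairs score of a global three-way alignment (linear gaps)."""
    neg_inf = float("-inf")
    D = [[[neg_inf] * (len(C) + 1) for _ in range(len(B) + 1)]
         for _ in range(len(A) + 1)]
    D[0][0][0] = 0
    # The 7 column types: each sequence contributes a residue (1) or a gap (0).
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    for i, j, k in product(range(len(A) + 1), range(len(B) + 1),
                           range(len(C) + 1)):
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue
            x = A[i - 1] if di else "-"
            y = B[j - 1] if dj else "-"
            z = C[k - 1] if dk else "-"
            cand = D[pi][pj][pk] + column_score(x, y, z)
            if cand > D[i][j][k]:
                D[i][j][k] = cand
    return D[-1][-1][-1]


print(align3_score("ACG", "AG", "ACG"))
```

The table has one cell per triple (i, j, k), which shows directly why the exact three-way alignment needs cubic memory and time.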

aln3nn Optimization
• The approach described above has cubic memory consumption, which is acceptable only for small sequence lengths n.
• aln3nn optimization: divide and conquer
  • Input sequences that exceed a given threshold length l are recursively subdivided into smaller sequences until the length criterion is fulfilled.
  • The partial sequences are aligned separately and the resulting alignments are concatenated afterward.
  • The result is an approximate solution of the global MSA problem.
  • The threshold length depends on sequence properties and on the available memory and CPU resources.

Determining Alignment Order
• The order in which sequences and profiles are aligned has an important influence on the performance of progressive alignment algorithms.
• Pairwise progressive methods use binary guide trees to determine the alignment order.
• The guide tree encapsulates an approximation to the phylogenetic relationships of the input sequences.
• The input sequences form the leaves of this tree.
• Each interior node corresponds to an alignment.
• The root of the guide tree represents the desired multiple alignment of all input sequences.

Phylogenetic Networks in aln3nn
• The Neighbor-Net (Nnet) approach is used to construct a phylogenetic network that determines the alignment order.
• The input sequences are represented as nodes that are all disconnected in the beginning.
• In each aggregation step, Nnet selects two nodes using a specific selection criterion.
• In contrast to Neighbor-Joining, the two nodes are not paired immediately.
• Nnet waits until a node has been paired up a second time.
• Then the corresponding three linked nodes are replaced by two new linked nodes.
• The distances of the newly introduced nodes to the remaining "active" nodes are computed as a linear combination of the distances of the nodes prior to aggregation.
• The entire procedure is repeated until only three active nodes are left.

Agglomeration and Splitting
• Node agglomeration occurs when one of the three involved nodes (B) has two neighbors, while the other two (A and C) have only a single one.
• The alignment ABC is split such that the sequences contained in B are distributed between two subsets B' and B'' so as to maximize the scores of the partial alignments AB' and B''C.

Agglomeration and Splitting (contd.)

Space and Time Complexity
• Simple dynamic programming
  • A 3-way alignment takes O(n^3) space and time (n being the length of the sequences).
  • Thus the alignment of all N sequences takes O(N n^3) time.
• Divide-and-conquer with the cutoff length l
  • Space complexity: O(n^2 + l^3) space is required.
  • This is the space needed to store the additional cost matrices plus the space required for aligning the remaining (sub)sequences of length at most l.

Space and Time Complexity (contd.)
• Time complexity: O(n^2 + n l^2) time is required for the alignment of one triplet.
• The term n^2 results from the time needed to calculate the additional cost matrices plus the time to search for the optimal slicing positions.
• The term n l^2 comes from the alignment of the triplet itself.
• The total time complexity of the alignment is therefore O(N n^2 + N n l^2).
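To get a feel for these bounds, the leading terms can be evaluated for concrete sizes. The sketch below ignores constant factors and uses hypothetical values n = 1000 and l = 100; the function names are illustrative.

```python
def full_dp_cells(n):
    """Leading term of the exact 3-way DP: n^3 table cells."""
    return n ** 3


def divide_and_conquer_cells(n, l):
    """Leading terms of the divide-and-conquer space bound: n^2 + l^3."""
    return n ** 2 + l ** 3


def triplet_time_units(n, l):
    """Leading terms of the divide-and-conquer time bound: n^2 + n * l^2."""
    return n ** 2 + n * l ** 2


n, l = 1000, 100
print(full_dp_cells(n))                # 1000000000
print(divide_and_conquer_cells(n, l))  # 2000000
print(triplet_time_units(n, l))        # 11000000
```

For these sizes the divide-and-conquer variant needs roughly 500 times fewer table cells than the exact cubic scheme, at the price of an approximate solution.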

Running Time Comparisons

Alignments of Structured RNAs
• The aln3nn software can use RNA secondary structure annotation as additional input for nucleic acid alignments.
• A matrix of equilibrium base pairing probabilities P_ij is computed for each input sequence.
• For each sequence position i, probabilities are calculated for the following cases:
  • position i is paired with a position j < i
  • position i is paired with a position j > i
  • position i remains unpaired
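The three per-position probabilities can be derived from the pairing matrix by summing over partners. This is a minimal sketch assuming P is a symmetric n x n list of lists in which P[i][j] is the probability that positions i and j pair; the function name and the toy matrix are hypothetical.

```python
def pairing_annotation(P):
    """For each position i return (p_paired_with_j_less_than_i,
    p_paired_with_j_greater_than_i, p_unpaired)."""
    n = len(P)
    annotation = []
    for i in range(n):
        p_left = sum(P[i][j] for j in range(i))          # partners j < i
        p_right = sum(P[i][j] for j in range(i + 1, n))  # partners j > i
        annotation.append((p_left, p_right, 1.0 - p_left - p_right))
    return annotation


# Toy example: positions 0 and 2 pair with probability 0.8.
P = [
    [0.0, 0.0, 0.8],
    [0.0, 0.0, 0.0],
    [0.8, 0.0, 0.0],
]
for row in pairing_annotation(P):
    print(row)
```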

Structural Score Contributions
• These probabilities are used as structure annotation.
• For a pair of annotated input sequences A and B, structural score contributions are defined for positions i and j.
• The total (mis)match score is the weighted sum of the sequence score and the structure score.
• ψ is the balance term that measures the relative contribution of sequence and structure similarity.
• For very similar sequences one should use ψ ≈ 1, whereas for very dissimilar sequences one should use a score dominated by the structural component.
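The role of ψ can be illustrated with a convex combination. The exact weighting used by aln3nn is not reproduced here; the form below is an assumption that merely matches the stated behaviour (ψ = 1 ignores structure, ψ = 0 ignores sequence), and the function name and input values are hypothetical.

```python
def combined_score(seq_score, struct_score, psi):
    """Assumed weighting: psi * sequence score + (1 - psi) * structure score."""
    if not 0.0 <= psi <= 1.0:
        raise ValueError("psi must lie in [0, 1]")
    return psi * seq_score + (1.0 - psi) * struct_score


for psi in (1.0, 0.5, 0.0):
    print(psi, combined_score(2.0, 0.5, psi))
```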

Experiments and Results

Pairwise versus Three-Way Alignments
• Sets of artificial sequences were generated using the ROSE package.
• The quality of aln3nn alignments was compared to standard progressive alignments of three sequences computed with t_coffee.
• The same scoring model was used in aln3nn and t_coffee.
• The analysis indicated that aln3nn produced better scores as the number of gaps increased.

Comparisons for 3 and 10 sequences

Protein Alignments
• Three types of substitution matrices were used: BLOSUM, PAM, and GONNET.
• aln3nn chooses the best-suited matrix of the given type according to sequence identity.
• The median BAliBASE score of each sequence set is used as the measure of alignment quality.
• Although aln3nn does not employ any heuristic rules to alter scoring parameters, it compares well with other common alignment programs.

Comparison of different alignment programs

RNA Alignments
• RNA sequences often evolve much faster than their secondary structure.
• Alignment quality can be increased dramatically by including structural information.
• Six diverse families of RNA data sets from BRaliBase were used for comparisons.
• The structure conservation index (SCI) was used to assess the quality of the calculated alignments.
• SCI is defined as the ratio of the consensus folding energy of a set of aligned sequences to the average unconstrained folding energy of the individual sequences.
• SCI is close to 0 for structurally divergent sequences and close to 1 for correctly aligned sequences with a common fold.
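The SCI definition translates into a one-line ratio. In practice the energies would come from an RNA folding program; the sketch below takes them as plain numbers, and the function name and the energy values are made up for illustration.

```python
def structure_conservation_index(consensus_energy, individual_energies):
    """SCI = consensus folding energy / mean unconstrained folding energy."""
    if not individual_energies:
        raise ValueError("need at least one individual folding energy")
    mean_energy = sum(individual_energies) / len(individual_energies)
    return consensus_energy / mean_energy


# Hypothetical folding energies in kcal/mol (negative means stable).
singles = [-20.0, -22.0, -18.0]
print(structure_conservation_index(-20.0, singles))  # common fold
print(structure_conservation_index(-2.0, singles))   # little shared structure
```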

Alignment accuracies on RNA samples

Influence of parameter ψ on SCI
• The SCI decreases if structural information is completely ignored (ψ = 1).
• On the other hand, ignoring the sequence information (ψ = 0) yields even worse results.
• The reason is that RNA secondary structure prediction has limited accuracy, so alignments based on predicted structures for individual sequences rest on very noisy data.
• The impact of the ψ parameter also varies between different RNA families.

Impact of the balancing parameter on SCI

Gap Removals
• In some data sets, one fifth of the gaps introduced in the early stages of the progressive alignment are later removed again.
• The following table shows the frequency f of gaps that are removed at intermediate division steps and not re-introduced at later stages.

Discussion and Future Work
• A direct comparison of aln3nn with progressive alignments of the same three sequences shows that the progressive approach leads to significantly suboptimal scores.
• aln3nn incurs additional computational costs compared to pairwise, guide-tree-based approaches, but it achieves competitive alignment accuracies on both protein and nucleic acid data.
• The performance of t_coffee shows that the shortcomings of initial pairwise alignments cannot be fully overcome later on, whereas aln3nn overcomes this problem.
• Future work
  • Modifications to the division step for 3-way alignments
  • Improvements to the branch and bound approach

Thanks
