Seminar in Structural Bioinformatics - Multiple sequence alignment algorithms .


Presentation Transcript


  1. Multiple sequence alignment algorithms. Elya Flax & Inbar Matarasso.

  2. Outline • The importance of multiple string alignments in molecular biology. • CLUSTAL W. • Family representation. • How to score multiple alignments. • The center star method for SP alignment. • Consensus strings. • Approximating the optimal consensus multiple alignment. • Iterative pairwise alignment. • Progressive alignment and contemporary improvements. • Repeated-motif methods.

  3. Motivation • Why multiple string comparison? • Because many important commonalities are faint or widely dispersed, they might not be apparent when comparing two strings alone but may become clear, or even obvious, when comparing a set of related strings.

  4. Definition • Definition: A global multiple alignment of k > 2 strings S = {S1, S2, …, Sk} is a natural generalization of alignment for two strings. Chosen spaces are inserted into each of the k strings so that the resulting strings have the same length, defined to be l. Then the strings are arrayed in k rows of l columns each, so that each character and space of each string is in a unique column.

  5. Biological basis for multiple string comparison • The second fact of biological sequence comparison: Evolutionarily and functionally related molecular strings can differ significantly throughout much of the string and yet preserve the same three-dimensional structure(s), or the same two-dimensional substructure(s) (motifs, domains), or the same active sites, or the same or related dispersed residues (DNA or amino acid).

  6. Three “big-picture” biological uses for multiple string comparison • The representation of protein families and superfamilies. • The identification and representation of conserved sequence features of DNA or protein that correlate with structure and/or function. • The deduction of evolutionary history from DNA or protein sequences.

  7. CLUSTAL W • Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. http://www.ebi.ac.uk/clustalw/

  8. Family and superfamily representation • Often a set of strings (a family) is defined by biological similarity, and one wants to find subsequence commonalities that characterize or represent the family. • There are three common kinds of family representations that come from multiple string comparison: • Profile representation • Consensus sequence representation • Signature representation

  9. Family representation and alignment with profiles • Definition: Given a multiple alignment of a set of strings, a profile for that multiple alignment specifies for each column the frequency that each character appears in the column. A profile is sometimes also called a weight matrix in the biological literature.

  10. Family representation and alignment with profiles • Often the values in the profile are converted to log-odds ratios: if p(y,j) is the frequency that character y appears in column j, and p(y) is the frequency that character y appears anywhere in the multiply aligned sequences, then log( p(y,j)/p(y) ) is commonly used as the y,j profile entry.
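
As a minimal sketch of the two definitions above, the following Python computes a column-frequency profile and its log-odds form. The function names and the toy alignment are illustrative assumptions, not part of the slides; the space character "-" is counted like any other character.

```python
import math
from collections import Counter


def column_profile(alignment):
    """Return one dict per column mapping character -> frequency p(y, j)."""
    k = len(alignment)
    length = len(alignment[0])
    assert all(len(row) == length for row in alignment), "rows must have equal length"
    profile = []
    for j in range(length):
        counts = Counter(row[j] for row in alignment)
        profile.append({ch: c / k for ch, c in counts.items()})
    return profile


def log_odds_profile(alignment):
    """Return log(p(y, j) / p(y)) for every character y seen in column j."""
    profile = column_profile(alignment)
    total = sum(len(row) for row in alignment)
    # p(y): frequency of y anywhere in the aligned sequences (spaces included).
    background = Counter(ch for row in alignment for ch in row)
    return [
        {ch: math.log(p / (background[ch] / total)) for ch, p in column.items()}
        for column in profile
    ]


# Hypothetical toy alignment of four strings (an assumption for illustration).
toy_alignment = ["abc-a", "ababa", "accb-", "cb-bc"]
profile = column_profile(toy_alignment)
log_odds = log_odds_profile(toy_alignment)
```

Characters that never occur in a column have no entry here; a practical implementation would usually add pseudocounts before taking logarithms.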

  11. Aligning a string to a profile • Given a profile P and a new string S, we want to answer the question: “How well does S, or some substring of S, fit the profile P?” • Since a space is a legal character of a profile, a fit of S to P should also allow the insertion of spaces into S, and hence the question is naturally formalized as an easy generalization of pure string alignment. [Figure: an alignment of string aabbc to the column positions of the previous alignment.]

  12. How to optimally align a string to a profile • Recall that for two characters x and y, s(x,y) denotes the alphabet-weight value assigned to aligning x with y in the pure string alignment problem. • Definition: For character y and column j, let p(y,j) be the frequency that character y appears in column j of the profile, and let S(x,j) denote $\sum_{y} [s(x,y) \times p(y,j)]$, the score for aligning x with column j.

  13. How to optimally align a string to a profile • Definition: Writing S1 for the new string S, let V(i,j) denote the value of the optimal alignment of substring S1[1..i] with the first j columns of the profile. • The recurrence: $V(i,0) = \sum_{k \le i} s(S_1(k), \_)$ and $V(0,j) = \sum_{k \le j} S(\_, k)$. For i and j both strictly positive, the general recurrence is $V(i,j) = \max[\, V(i-1,j-1) + S(S_1(i), j),\; V(i-1,j) + s(S_1(i), \_),\; V(i,j-1) + S(\_, j) \,]$. • Time analysis: $O(\sigma nm)$, where n is the length of S, m is the number of columns in the profile, and $\sigma$ is the size of the alphabet.
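
The recurrence for V(i,j) can be sketched in Python as follows. This is an illustration under stated assumptions: the profile is a list of per-column frequency dicts, "-" stands for the space character, and `simple_score` is a made-up scoring function s(x,y) chosen only for the example.

```python
def simple_score(x, y):
    """Hypothetical s(x, y): +2 for a match, -1 otherwise, 0 for space vs. space."""
    if x == "-" and y == "-":
        return 0
    return 2 if x == y else -1


def align_to_profile(string, profile, s=simple_score, space="-"):
    """Return V(n, m), the best global score of `string` against `profile`."""

    def column_score(x, j):
        # S(x, j) = sum over y of s(x, y) * p(y, j); j is 1-based as on the slide.
        return sum(s(x, y) * p for y, p in profile[j - 1].items())

    n, m = len(string), len(profile)
    V = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):  # V(i, 0): every character opposite a space
        V[i][0] = V[i - 1][0] + s(string[i - 1], space)
    for j in range(1, m + 1):  # V(0, j): a space opposite every column
        V[0][j] = V[0][j - 1] + column_score(space, j)
    for i in range(1, n + 1):
        x = string[i - 1]
        for j in range(1, m + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + column_score(x, j),  # x aligned with column j
                V[i - 1][j] + s(x, space),             # x aligned with a space
                V[i][j - 1] + column_score(space, j),  # space aligned with column j
            )
    return V[n][m]


# Hypothetical two-column profile in which column 1 is always "a" and column 2 always "b".
tiny_profile = [{"a": 1.0}, {"b": 1.0}]
score_ab = align_to_profile("ab", tiny_profile)
score_a = align_to_profile("a", tiny_profile)
```

Each table cell evaluates `column_score` by summing over the characters present in the column, which is where the alphabet-size factor in the running time comes from.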

  14. Profile to profile alignment • Another way that profiles are used is to compare one protein set to another. In that case, the profile for one set is compared to the profile of the other.

  15. Introduction to computing multiple string alignments • Definition: Given a set of k > 2 strings S = {S1, S2, ..., Sk}, a local multiple alignment of S is obtained by selecting one substring $S_i'$ from each string $S_i \in S$ and then globally aligning those substrings.

  16. How to score multiple alignments • To date, there is no objective function that has been as well accepted for multiple alignment as edit distance or similarity has been for two-string alignment. • We will discuss three types of objective functions: • sum-of-pairs functions • consensus functions • tree functions

  17. How to score multiple alignments • Definition: Given a multiple alignment M, the induced pairwise alignment of two strings Si and Sj is obtained from M by removing all rows except the two rows for Si and Sj. That is, the induced alignment is the multiple alignment M restricted to Si and Sj. Any two opposing spaces in that induced alignment can be removed if desired.

  18. How to score multiple alignments • Definition: The score of an induced pairwise alignment is determined using any chosen scoring scheme for two-string alignment in the standard manner. [Figure: an example multiple alignment whose three induced pairwise alignments score 4, 5, and 5, for an SP score of 14.]

  19. Multiple alignment with the sum-of-pairs (SP) objective function • Definition: The sum of pairs (SP) score of multiple alignment M is the sum of the scores of pairwise global alignments induced by M. • The SP alignment problem: Compute a global multiple alignment M with minimum sum-of-pairs score.
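
To make the SP score concrete, here is a small Python sketch that scores each induced pairwise alignment and sums the results. The cost values (match 0, mismatch 1, space 1), the "-" space character, and the example alignment are assumptions chosen for illustration.

```python
from itertools import combinations


def induced_pair_score(row_a, row_b, match=0, mismatch=1, space=1, gap="-"):
    """Score the pairwise alignment induced by two rows of a multiple alignment."""
    total = 0
    for x, y in zip(row_a, row_b):
        if x == gap and y == gap:
            continue  # opposing spaces are removed and contribute nothing
        if x == gap or y == gap:
            total += space
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


def sp_score(alignment, **costs):
    """Sum the induced pairwise scores over all unordered pairs of rows."""
    return sum(
        induced_pair_score(a, b, **costs) for a, b in combinations(alignment, 2)
    )


# Hypothetical three-row alignment: the induced pairs score 2, 1, and 1.
example_alignment = ["AC-T", "A-GT", "ACGT"]
example_sp = sp_score(example_alignment)
```

Evaluating the SP score of a given alignment takes time proportional to the number of row pairs times the alignment length; the hard part of the SP alignment problem is finding the alignment that minimizes it.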

  20. An exact solution to the SP alignment problem • Via dynamic programming: for k strings of length n, it takes $\Theta(n^k)$ time. • We will develop the dynamic programming recurrence only for the case of three strings. • We will then develop a speed-up of the basic dynamic programming solution that somewhat increases the number of strings that can be optimally aligned.

  21. An exact solution to the SP alignment problem • Definition: Let S1, S2 and S3 denote three strings of length n1, n2 and n3, respectively, and let D(i,j,k) be the optimal SP score for aligning S1[1..i], S2[1..j] and S3[1..k]. The score for a match, mismatch, or space is specified by the variables smatch, smis and sspace, respectively.

  22. Recurrences for a nonboundary cell (i,j,k)
      for i := 1 to n1 do
        for j := 1 to n2 do
          for k := 1 to n3 do
          begin
            { pairwise costs of the three current characters }
            if S1(i) = S2(j) then cij := smatch else cij := smis;
            if S1(i) = S3(k) then cik := smatch else cik := smis;
            if S2(j) = S3(k) then cjk := smatch else cjk := smis;
            { last column holds three characters }
            d1 := D(i-1,j-1,k-1) + cij + cik + cjk;
            { last column holds two characters and one space }
            d2 := D(i-1,j-1,k) + cij + 2*sspace;
            d3 := D(i-1,j,k-1) + cik + 2*sspace;
            d4 := D(i,j-1,k-1) + cjk + 2*sspace;
            { last column holds one character and two spaces }
            d5 := D(i-1,j,k) + 2*sspace;
            d6 := D(i,j-1,k) + 2*sspace;
            d7 := D(i,j,k-1) + 2*sspace;
            D(i,j,k) := min[d1, d2, d3, d4, d5, d6, d7];
          end;

  23. D values for boundary cells • Let D1,2(i,j) denote the familiar pairwise distance between substrings S1[1..i] and S2[1..j], and let D1,3(i,k) and D2,3(j,k) denote the analogous pairwise distances. Then, • D(i,j,0) = D1,2(i,j) + (i+j)*sspace • D(i,0,k) = D1,3(i,k) + (i+k)*sspace • D(0,j,k) = D2,3(j,k) + (j+k)*sspace • D(0,0,0) = 0
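
Putting the nonboundary recurrence and the boundary cells together, the exact three-string SP computation can be sketched in Python as below. The default costs (smatch 0, smis 1, sspace 1) and the helper name `pairwise_table` are assumptions for illustration; only the optimal score is returned, not the alignment itself.

```python
def pairwise_table(a, b, smatch, smis, sspace):
    """Full table of pairwise distances between all prefixes of a and b."""
    D = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        D[i][0] = i * sspace
    for j in range(1, len(b) + 1):
        D[0][j] = j * sspace
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            c = smatch if a[i - 1] == b[j - 1] else smis
            D[i][j] = min(D[i - 1][j - 1] + c, D[i - 1][j] + sspace, D[i][j - 1] + sspace)
    return D


def sp_align3(s1, s2, s3, smatch=0, smis=1, sspace=1):
    """Optimal (minimum) SP score for three strings, Theta(n1*n2*n3) time."""
    n1, n2, n3 = len(s1), len(s2), len(s3)
    D12 = pairwise_table(s1, s2, smatch, smis, sspace)
    D13 = pairwise_table(s1, s3, smatch, smis, sspace)
    D23 = pairwise_table(s2, s3, smatch, smis, sspace)
    D = [[[0] * (n3 + 1) for _ in range(n2 + 1)] for _ in range(n1 + 1)]
    for i in range(n1 + 1):
        for j in range(n2 + 1):
            for k in range(n3 + 1):
                if k == 0:    # boundary: the S3 prefix is empty
                    D[i][j][k] = D12[i][j] + (i + j) * sspace
                elif j == 0:  # boundary: the S2 prefix is empty
                    D[i][j][k] = D13[i][k] + (i + k) * sspace
                elif i == 0:  # boundary: the S1 prefix is empty
                    D[i][j][k] = D23[j][k] + (j + k) * sspace
                else:
                    cij = smatch if s1[i - 1] == s2[j - 1] else smis
                    cik = smatch if s1[i - 1] == s3[k - 1] else smis
                    cjk = smatch if s2[j - 1] == s3[k - 1] else smis
                    D[i][j][k] = min(
                        D[i - 1][j - 1][k - 1] + cij + cik + cjk,
                        D[i - 1][j - 1][k] + cij + 2 * sspace,
                        D[i - 1][j][k - 1] + cik + 2 * sspace,
                        D[i][j - 1][k - 1] + cjk + 2 * sspace,
                        D[i - 1][j][k] + 2 * sspace,
                        D[i][j - 1][k] + 2 * sspace,
                        D[i][j][k - 1] + 2 * sspace,
                    )
    return D[n1][n2][n3]
```

For example, `sp_align3("ACGT", "AGT", "ACT")` evaluates the three-dimensional table for strings of length 4, 3, and 3; the cubic table size is what limits this approach to a handful of short strings.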

  24. A speed up for the exact solution • The program for multiple alignment that was shown uses recurrences in the backward direction. • In forward dynamic programming, when D(i,j,k) is set, D(i,j,k) is sent forward to the seven cells that can be influenced by it.

  25. A speed up for the exact solution • Definition: Let d1,2(i,j) be the edit distance between suffixes S1[i..n] and S2[j..n] of strings S1 and S2. Define d1,3(i,k) and d2,3(j,k) analogously. • All these d values can be computed in $O(n^2)$ time by reversing the strings and computing three pairwise distances.

  26. A speed up for the exact solution • Suppose that some multiple alignment of S1, S2, and S3 is known and that the alignment has SP score z. • Key idea of the heuristic speed up: Recall that D(i,j,k) is the optimal SP score for aligning S1[1..i], S2[1..j], and S3[1..k]. If D(i,j,k) + d1,2(i,j) + d1,3(i,k) + d2,3(j,k) is greater than z, then node (i,j,k) cannot be on any optimal path and so D(i,j,k) need not be sent forward to any cell.

  27. A bounded-error approximation method for SP alignment • The method is provably fast (runs in polynomial worst-case time) and yet produces alignments whose SP score is guaranteed to be less than twice the score of the optimal SP alignment. • Recall that for two strings, D(Si,Sj) is the (optimal) weighted edit distance between Si and Sj.

  28. An initial key idea: alignments consistent with a tree • Definition: Let S be a set of strings, and let T be a tree where each node is labeled with a distinct string from S. Then, a multiple alignment M of S is called consistent with T if the induced pairwise alignment of Si and Sj has score D(Si,Sj) for each pair of strings (Si,Sj) that label adjacent nodes in T.

  29. A bounded-error approximation method for SP alignment [Figure: (a) a tree with nodes numbered 1 to 5 and labeled by the strings AXZ, AXZ, AXXZ, AYZ, and AYXYZ; (b) the corresponding multiple alignment.]

  30. An initial key idea: alignments consistent with a tree • Theorem: For any set of strings S and for any tree T whose nodes are labeled by distinct strings of S, we can efficiently find a multiple alignment M(T) of S that is consistent with T.

  31. The center star method for SP alignment • We will describe the method in terms of an alphabet-weighted scoring scheme for two-string alignment, and let s(x,y) be the score contributed when a character x is aligned opposite a character y. • Definition: A scoring scheme satisfies the triangle inequality if for any three characters x, y and z, s(x,z) ≤ s(x,y) + s(y,z).

  32. The center star method for SP alignment • Definition: Given a set of k strings S, define a center string $S_c \in S$ as a string in S that minimizes $\sum_{S_j \in S} D(S_c, S_j)$, and let M denote the minimum sum. Define the center star to be a star tree of k nodes, with the center node labeled $S_c$ and with each of the k-1 remaining nodes labeled by a distinct string in $S \setminus \{S_c\}$.
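
Choosing the center string amounts to computing all pairwise distances and taking the string with the smallest row sum. The Python sketch below assumes unit-cost edit distance for D (which satisfies the triangle inequality); the function names and the example strings are illustrative assumptions.

```python
def edit_distance(a, b):
    """Unit-cost edit distance: match 0, mismatch 1, space 1."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j - 1] + (x != y), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]


def center_string(strings, dist=edit_distance):
    """Return (index of S_c, M), where M is the minimum summed distance."""
    best_index, best_sum = None, None
    for c, sc in enumerate(strings):
        total = sum(dist(sc, sj) for j, sj in enumerate(strings) if j != c)
        if best_sum is None or total < best_sum:
            best_index, best_sum = c, total
    return best_index, best_sum


# Hypothetical input set: AXZ has summed distance 1 + 1 + 2 = 4 to the others.
example_strings = ["AXZ", "AXXZ", "AYZ", "AYXYZ"]
center_index, minimum_sum = center_string(example_strings)
```

With k strings of length about n, this step performs k(k-1) pairwise distance computations of O(n^2) time each; ties are broken in favor of the earliest string, which is an arbitrary choice of this sketch.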

  33. The center star method for SP alignment [Figure: a generic center star for six strings, where the center string Sc is S3.]

  34. The center star method for SP alignment • Definition: Define the multiple alignment Mc of the set of strings S to be the multiple alignment consistent with the center star. • Definition: Define d(Si,Sj) as the score of the pairwise alignment of strings Si and Sj induced by Mc. Denote the score of an alignment M as d(M). • It follows that $d(S_i,S_j) \ge D(S_i,S_j)$, $d(M_c) = \sum_{i<j} d(S_i,S_j)$, and $d(S_i,S_c) = D(S_i,S_c)$.

  35. The center star method for SP alignment • Lemma: Assume that the two-string scoring scheme satisfies the triangle inequality. Then for any strings Si and Sj in S, d(Si,Sj) ≤ d(Si,Sc) + d(Sc,Sj) = D(Si,Sc) + D(Sc,Sj).

  36. The center star method for SP alignment • Definition: Let M* be the optimal multiple alignment of the k strings of S. Let d*(Si,Sj) be the score of the pairwise alignment of strings Si and Sj induced by M*. Then $d(M^*) = \sum_{i<j} d^*(S_i,S_j)$.

  37. The center star method for SP alignment • Theorem: d(Mc)/d(M*) ≤ 2(k-1)/k < 2. • Corollary: $kM/2 \le \sum_{i<j} D(S_i,S_j) \le d(M^*) \le d(M_c) \le \frac{2(k-1)}{k} \sum_{i<j} D(S_i,S_j)$.
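
A short derivation of the theorem, written over ordered pairs $i \ne j$ so that each unordered pair is counted twice. It uses only the lemma for Mc, the fact that any induced score is at least the optimal pairwise distance, and the definition of M as the minimum summed distance from a single string; the layout is a sketch rather than the slides' own proof.

```latex
\begin{align*}
2\,d(M_c) &= \sum_{i \ne j} d(S_i,S_j)
  \;\le\; \sum_{i \ne j} \bigl[ D(S_i,S_c) + D(S_c,S_j) \bigr]
  \;=\; 2(k-1) \sum_{j} D(S_c,S_j) \;=\; 2(k-1)\,M, \\
2\,d(M^*) &= \sum_{i \ne j} d^*(S_i,S_j)
  \;\ge\; \sum_{i} \sum_{j \ne i} D(S_i,S_j)
  \;\ge\; k\,M, \\
\frac{d(M_c)}{d(M^*)} &\le \frac{2(k-1)\,M}{k\,M} \;=\; \frac{2(k-1)}{k} \;<\; 2 .
\end{align*}
```

The middle equality in the first line holds because each term $D(S_i,S_c)$ appears once for every $j \ne i$, that is, k-1 times in each of the two sums. The last inequality in the second line holds because every inner sum is at least M.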

  38. Steiner consensus strings • Definition: Given a set of strings S, and given another string S’, the consensus error of S’ relative to S is $E(S') = \sum_{S_i \in S} D(S', S_i)$. • Note that S’ need not be from S.

  39. Steiner consensus strings • Definition: Given a set of strings S, an optimal Steiner string S* for S is a string that minimizes the consensus error E(S*) over all possible strings.

  40. Steiner consensus strings • Lemma: Let S have k strings, and assume that the two-string scoring scheme satisfies the triangle inequality. Then there exists a string $\bar{S} \in S$ such that $E(\bar{S}) / E(S^*) \le 2 - 2/k < 2$.

  41. Steiner consensus strings • Recall that Sc is a string that minimizes $\sum_{S_i \in S} D(S_c, S_i)$ over all strings in S. • Theorem: Assuming that the scoring scheme satisfies the triangle inequality, E(Sc) / E(S*) ≤ 2 – 2/k < 2.

  42. Consensus strings from multiple alignment • Definition: Given a multiple alignment M of a set of strings S, the consensus character of column i of M is the character that minimizes the summed distance to it from all the characters in column i. Let d(i) denote the minimum sum in column i.

  43. Consensus strings from multiple alignment • Definition: The consensus string SM derived from alignment M is the concatenation of the consensus characters for each column of M.

  44. Consensus strings from multiple alignment • Definition: Let M be a multiple alignment of a set of strings S, and let SM be its consensus string containing q characters. Then the alignment error of SM equals $\sum_{i=1}^{q} d(i)$, and the alignment error of M is defined as the alignment error of SM.
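
The consensus character, consensus string, and alignment error can be computed column by column, as in this Python sketch. The unit distance (0 for equal characters, 1 otherwise), the alphabet string including the space "-", and the example alignment are assumptions for illustration.

```python
def unit_distance(x, y):
    """Hypothetical character distance: 0 if equal, 1 otherwise (spaces included)."""
    return 0 if x == y else 1


def consensus(alignment, alphabet, dist=unit_distance):
    """Return (consensus string S_M, alignment error sum of d(i))."""
    chars, error = [], 0
    for column in zip(*alignment):  # iterate over the columns of M
        # Consensus character: minimizes the summed distance to the column.
        best = min(alphabet, key=lambda c: sum(dist(c, x) for x in column))
        chars.append(best)
        error += sum(dist(best, x) for x in column)  # this is d(i)
    return "".join(chars), error


# Hypothetical alignment: columns 2 and 3 each contain one space, so d(i) = 1 there.
example_alignment = ["AC-T", "A-GT", "ACGT"]
consensus_string, alignment_error = consensus(example_alignment, alphabet="ACGT-")
```

When two characters tie for a column, `min` keeps whichever comes first in `alphabet`; the definition allows any minimizing character, so the tie-breaking rule is a choice of this sketch.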

  45. Consensus strings from multiple alignment • Definition: The optimal consensus multiple alignment is a multiple alignment M for input set S whose consensus string SM has the smallest alignment error over all possible multiple alignments of S.

  46. Consensus strings from multiple alignment • Definition: Given set S of k strings, let T be the star tree with Steiner string S* at the root and each of the k strings at distinct leaves of T. Then the multiple alignment of $S \cup \{S^*\}$ consistent with T is said to be consistent with S*.

  47. Consensus strings from multiple alignment • Theorem: Let S’ denote the consensus string of the optimal consensus multiple alignment. Then, removal of the spaces from S’ creates the optimal Steiner string S*. Conversely, removal of the row for S* from the multiple alignment consistent with S* creates the optimal consensus multiple alignment of S.

  48. Approximating the optimal consensus multiple alignment • Theorem: Assuming the triangle inequality, the multiple alignment Mc created by the center star method has an SP score that is never more than 2 – 2/k times the SP score of the optimal SP alignment, and it has a (consensus) alignment error that is never more than 2 – 2/k times the alignment error of the optimal consensus multiple alignment.

  49. Multiple alignment to a (phylogenetic) tree • Definition: Given an input tree T with a distinct string (from a set of strings S) written at each leaf, a phylogenetic alignment for T is an assignment of one string to each internal node of T. Note that the strings assigned to internal nodes need not be distinct and need not be from the input strings S.

  50. Multiple alignment to a (phylogenetic) tree • Definition: If strings S and S’ are assigned to the endpoints of an edge (i,j), then (i,j) has edge distance D(S,S’). The distance along a path is the sum of the distances on the edges in the path. The distance of a phylogenetic alignment is the total of all the edge distances in the tree.
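
Given a tree and a string at every node, the distance of a phylogenetic alignment is just the sum of the edge distances. The Python sketch below assumes unit-cost edit distance for D and represents the tree as a list of edges; the node numbering, labels, and function names are hypothetical.

```python
def edit_distance(a, b):
    """Unit-cost edit distance: match 0, mismatch 1, space 1."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j - 1] + (x != y), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]


def phylogenetic_alignment_distance(edges, labels, dist=edit_distance):
    """Total of the edge distances D(S, S') over all edges (u, v) of the tree."""
    return sum(dist(labels[u], labels[v]) for u, v in edges)


# Hypothetical tree: internal node 0 with three leaves 1, 2, 3.
tree_edges = [(0, 1), (0, 2), (0, 3)]
node_labels = {
    0: "AXZ",   # string assigned to the internal node (part of the alignment)
    1: "AXZ",   # leaf strings are given as input
    2: "AYZ",
    3: "AXXZ",
}
total_distance = phylogenetic_alignment_distance(tree_edges, node_labels)
```

This only evaluates a given assignment; choosing the internal-node strings that minimize the total is the optimization problem the definition sets up.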
