Parametric Sequence Alignment: Exploring Penalty Variation to Optimize DNA and Amino Acid Sequences

Extending the Core Problems

Parametric sequence alignment
• When using sequence alignment methods to study DNA or amino acid sequences, there is often considerable disagreement about how to weight matches, mismatches, insertions and deletions (indels), and gaps.
• ... one must be able to vary the gap and gap size penalties independently and in a query-dependent fashion in order to obtain the maximal sensitivity of the search. Sequence alignment is sensitive to the choices of gap penalty and the form of the relatedness matrix, and it is often desirable to vary these ... One of the most prominent problems is the choice of parametric values, especially gap penalties. When very similar sequences are compared, the choice is not critical; but when the conservation is low, the resulting alignment is strongly affected.
• Parametric sequence alignment is a tool that efficiently explores such penalty variation. It avoids the problem of choosing fixed parameter settings by computing the optimal alignment as a function of variable parameters for weights and penalties. The goal is to partition the parameter space into regions such that in each region one alignment is optimal throughout and such that each region is maximal for this property.

Definitions and first results
• Definition: For any alignment A of two strings, let mt_A, ms_A, id_A, and gp_A, respectively, denote the number of matches, mismatches, indels, and gaps contained in A.
• Without the use of character-specific scoring matrices, the value of A is α × mt_A − β × ms_A − γ × id_A − δ × gp_A, where α, β, γ, and δ are parameters that can be varied to adjust the relative contributions of the matches, mismatches, indels, and gaps.
• With α and β held fixed, the value of each alignment is a linear function of γ and δ, that is, a plane over the γ, δ space.
• Lemma: If the planes for alignments A and A′ intersect and are distinct, then there is a line L in the γ, δ space along which A and A′ have equal value; A has larger value than A′ on one of the half-planes defined by L, and A has smaller value on the other half-plane. If the planes for A and A′ do not intersect, then one of the alignments has larger value than the other at every γ, δ point.
• Corollary: If A is optimal for at least one point p in the γ, δ space, then it is optimal only for point p, or it is optimal only for a line segment that contains p, or it is optimal only for a convex polygon that contains p.
• Theorem: Given two strings S1 and S2, the γ, δ parameter space decomposes into convex polygons such that any alignment that is optimal for some γ, δ point in the interior of a polygon P is optimal for all points in P and nowhere else.
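To make the definition concrete, here is a minimal Python sketch. It assumes the objective has the form α × mt − β × ms − γ × id − δ × gp, that an alignment is written as two equal-length rows with "-" marking a space, and that a gap is a maximal run of consecutive spaces in one string. The function names are hypothetical and chosen only for this example.

```python
def alignment_counts(row1, row2):
    """Return (mt, ms, id, gp) for an alignment written as two equal-length rows."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    mt = ms = ids = gp = 0
    open_gap = None  # row (1 or 2) currently inside a run of spaces, else None
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            ids += 1
            row = 1 if a == "-" else 2
            if row != open_gap:  # a new run of spaces starts a new gap
                gp += 1
            open_gap = row
        else:
            open_gap = None
            if a == b:
                mt += 1
            else:
                ms += 1
    return mt, ms, ids, gp


def parametric_value(counts, alpha, beta, gamma, delta):
    """Assumed objective: alpha*mt - beta*ms - gamma*id - delta*gp."""
    mt, ms, ids, gp = counts
    return alpha * mt - beta * ms - gamma * ids - delta * gp


counts = alignment_counts("AC--GTA", "ACTTG-C")
print(counts)
print(parametric_value(counts, alpha=2, beta=1, gamma=1, delta=3))
```

For this example alignment the counts are 3 matches, 1 mismatch, 3 indels, and 2 gaps. Varying gamma and delta while holding the counts fixed traces out the plane associated with the alignment.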


Parametric alignment with the use of scoring matrices
• Definition: For any alignment A of two strings, let smt_A and sms_A, respectively, denote the total score (obtained from the scoring matrix) for the specific matches in A and the total score for the specific mismatches in A. As before, id_A and gp_A denote the number of indels and gaps contained in A.
• Using scoring matrices, the parametric value of alignment A is α × smt_A + β × sms_A − γ × id_A − δ × gp_A.

Efficient algorithms for computing a polygonal decomposition
• Ray-search problem: Given an alignment A, a point p where A is optimal, and a ray h in γ, δ space starting at p, find the furthest point (call it r*) from p on ray h where A remains optimal. If A remains optimal until h reaches a border of the parameter space, then r* is that border point on h. It is also possible that r* = p.

Newton’s ray-search algorithm
    Set r to the (γ, δ) point where h intersects a border of the parameter space.
    While A is not an optimal alignment at point r do
    begin
        Find an optimal alignment A* at point r.
        Set r to be the unique point on h where the value of A equals the value of A*.
    end;
    Set r* to r.
• Lemma: 1) Newton’s ray-search algorithm finds r* exactly. 2) Unless A is optimal at the initial setting of r, the last computed alignment A* is cooptimal with A at r* and yet is also optimal on h for some nonzero distance beyond r*. 3) When Newton’s ray-search algorithm computes an alignment at a point r on h, none of the alignments computed previously (in this execution of Newton’s algorithm) are optimal at r.
• It follows that if r* = p, then Newton’s method discovers this and returns an alignment A* that is optimal at p and also optimal for some nonzero distance along h.
• For any polygon P intersected by h, a single ray search computes alignments at no more than two points of P.
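The loop above can be sketched in Python under some simplifying assumptions. Each alignment is summarized by a hypothetical triple (base, id, gp), so its value at (γ, δ) is base − γ × id − δ × gp, and the step "find an optimal alignment at r" is replaced by a stand-in oracle that maximizes over an explicit candidate list; a real implementation would run a dynamic-programming alignment there. The ray is parametrized as p + t × direction, with t_max marking the border of the parameter space. Exact rational arithmetic is used so that the equality test for optimality is reliable.

```python
from fractions import Fraction


def value(alignment, gamma, delta):
    """Value of an alignment summarized as (base, id, gp) at the point (gamma, delta)."""
    base, ids, gp = alignment
    return base - gamma * ids - delta * gp


def optimal_at(candidates, gamma, delta):
    """Stand-in oracle; in practice this would be a dynamic-programming alignment."""
    return max(candidates, key=lambda a: value(a, gamma, delta))


def newton_ray_search(A, p, direction, t_max, candidates):
    """Return (t_star, A_star) where r* = p + t_star * direction.

    A_star is the last alignment computed, or None if A is optimal at the border.
    """
    p = tuple(Fraction(x) for x in p)
    d = tuple(Fraction(x) for x in direction)
    t = Fraction(t_max)  # start where the ray meets the border
    last = None
    while True:
        r = (p[0] + t * d[0], p[1] + t * d[1])
        star = optimal_at(candidates, *r)
        if value(star, *r) == value(A, *r):  # A is optimal at r
            return t, last
        # Along the ray each value is linear in t; solve value_A(t) = value_star(t).
        gap_at_p = value(A, *p) - value(star, *p)
        slope_A = -(d[0] * A[1] + d[1] * A[2])
        slope_star = -(d[0] * star[1] + d[1] * star[2])
        t = gap_at_p / (slope_star - slope_A)
        last = star


# Hypothetical candidate alignments, each given as (base, id, gp).
candidates = [(10, 6, 3), (8, 2, 1), (5, 0, 0)]
print(newton_ray_search((10, 6, 3), p=(0, 0), direction=(1, 0), t_max=5,
                        candidates=candidates))
```

In this toy run the search first jumps to t = 5/6, then to t = 1/2, where the starting alignment and (8, 2, 1) are cooptimal, which matches part 2 of the lemma.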

Finding a polygon of the decomposition
• Let A be an alignment that is optimal in the interior of an (unknown) polygon P(A), and let p be a known point where A is optimal.
• First pick any ray h from p and solve the ray-search problem along h. Two degenerate cases can occur: one is that the resulting r* lies on a border of the parameter space; the other is that r* is a vertex of the decomposition. Assume that they do not occur.
• The ray search along h will find a point r* that lies on an edge e of polygon P(A). By part 2 of the lemma on Newton’s ray-search algorithm, the ray search will also return an alignment A* that is optimal in the interior of the polygon bordering edge e.
• The intersection of the two planes for A and A* describes a line l* that contains edge e; the full extent of e can then be determined by solving two more ray-search problems using A. In one problem, ray h is the half-line of l* starting at r* and running in one direction along l*, and in the other problem ray h is the remaining half-line of l* in the other direction. These two ray searches find the opposite endpoints of edge e.
• Once edge e is fully described, we look for another edge of P(A) by selecting another ray h from p that does not intersect edge e. By linking identical endpoints of edges that are found in this way, it is easy to continue selecting rays from p that do not intersect previously discovered edges or vertices of P(A). This method continues until all the discovered edges link together in a closed cycle, which then exactly describes the edges and vertices of P(A).

The special case of global alignment
• Lemma: For any global alignment A with corresponding vector (mt, ms, id): 2mt + 2ms + id = N, where N = n + m is the sum of the two sequence lengths. Hence mt + ms + id/2 = N/2 for any global alignment.
• Proof: A match or a mismatch involves two characters. Thus the total number of characters that form part of a match or mismatch is 2(mt + ms). An indel involves only one character from one sequence. All spaces are counted in a global alignment, so the number of characters involved in indels is id. Each of the N characters is counted once as part of a match, a mismatch, or an indel.
• Corollary: Every global alignment has the same value (N/2) at the point β = −1, γ = −1/2.
• Proof: Plugging into the objective function mt_A − β × ms_A − γ × id_A (that is, with α = 1 and δ = 0), we see that at the point (−1, −1/2) every global alignment A has value mt_A + ms_A + id_A/2, which equals N/2 by the previous Lemma.
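The lemma and corollary can be checked by brute force on a small case. The sketch below enumerates every global alignment of two short example strings and evaluates the objective mt − β × ms − γ × id (an assumed form, with α = 1 and no gap term) at β = −1, γ = −1/2. The function names and example strings are hypothetical.

```python
from fractions import Fraction


def count_vectors(s1, s2):
    """Yield (mt, ms, id) for every global alignment of s1 and s2."""
    def walk(i, j, mt, ms, ids):
        if i == len(s1) and j == len(s2):
            yield mt, ms, ids
            return
        if i < len(s1) and j < len(s2):  # align s1[i] with s2[j]
            if s1[i] == s2[j]:
                yield from walk(i + 1, j + 1, mt + 1, ms, ids)
            else:
                yield from walk(i + 1, j + 1, mt, ms + 1, ids)
        if i < len(s1):                  # s1[i] opposite a space
            yield from walk(i + 1, j, mt, ms, ids + 1)
        if j < len(s2):                  # s2[j] opposite a space
            yield from walk(i, j + 1, mt, ms, ids + 1)
    yield from walk(0, 0, 0, 0, 0)


def value(vec, beta, gamma):
    """Assumed objective with alpha = 1 and no gap term."""
    mt, ms, ids = vec
    return mt - beta * ms - gamma * ids


vectors = list(count_vectors("ACG", "AT"))
values = {value(v, beta=-1, gamma=Fraction(-1, 2)) for v in vectors}
print(len(vectors), values)
```

Here N = 3 + 2 = 5, and every one of the enumerated alignments has value 5/2 at that point, as the corollary states.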

The special case of global alignment
• Theorem: In the case of global alignment with no scoring matrices, there can be at most O(n) polygons in the parametric decomposition, where n ≤ m.
• Proof: Every global alignment has the same value at the point (−1, −1/2), so any line along which two alignments have equal value passes through that point. Hence every polygon boundary lies on a line radiating from (−1, −1/2), and each boundary in the positive β, γ quadrant (the area of interest) must intersect either the horizontal or the vertical axis of the space. We will show that the number of intersections of the vertical axis cannot exceed n. Along the vertical (γ) axis, β is zero, so the parametric problem along that axis is to maximize mt_A − γ × id_A as a function of the single parameter γ. Clearly, as γ increases, mt must decrease whenever the optimal alignment changes (i.e., at each breakpoint along the γ axis). But since the number of matches can only vary from zero to n, there can be at most n polygon boundaries that intersect the γ axis. The same upper bound of n boundaries holds (by the same reasoning) along the horizontal axis.

Uses for parametric alignment
• Sensitivity analysis: check how sensitive the alignment is to changes in the parameters.
• Efficient computation of all cooptimal alignments.

Computing suboptimal alignments
• Optimal alignment, even with a wide range of models and parameter choices, does not always identify the biological phenomena that it is intended to reflect.
• The available objective functions might not reflect the full range of biological forces that cause differences between strings.
• The objective functions might not induce the optimal alignment to form the desired shape.
• The data might contain errors that confound the algorithms.
• There may be ties for the optimal alignment.
• There may be many nearly optimal alignments that are biologically more significant than any optimal one.

First definitions and first results
• Definition: Let R be a path in the alignment graph from the start node s = (0, 0) to the destination node t = (n, m). Path R corresponds to some global alignment (not necessarily optimal) of strings S1 and S2.
• Definition: For any pair of nodes x, y in the alignment graph, let l(x, y) be the length of the longest path from x to y. Let R* be the path corresponding to an optimal global alignment, so that the length of R* is equal to l(s, t) = V(S1, S2).
• Definition: For path R, let δ(R) be the length of R* minus the length of R. δ(R) is called the “deviation” of R (from the optimal), and R is called a δ(R)-near-optimal path. A path R is called δ-near-optimal if δ(R) = δ.
• Definition: For an edge e = (u, v) with score s(u, v), let δ(e) = l(s, t) − [l(s, u) + s(u, v) + l(v, t)]. That is, δ(e) is the difference between the length of R* and the length of the longest s-to-t path that is forced to go through edge e.
• Lemma: δ(e) can be computed for all edges in the time used to compute two optimal alignments plus time proportional to the number of edges.
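The lemma can be illustrated with a short Python sketch: one forward pass computes l(s, u) for every node, one backward pass computes l(v, t), and a final sweep over the edges gives δ(e). The scoring scheme (match +1, mismatch −1, indel −1) and the function names are assumptions made for this example; node (i, j) means that i characters of the first string and j of the second have been consumed.

```python
def alignment_edges(n, m):
    """Yield every edge (u, v) of the alignment graph, sources in row-major order."""
    for i in range(n + 1):
        for j in range(m + 1):
            if i < n and j < m:
                yield (i, j), (i + 1, j + 1)
            if i < n:
                yield (i, j), (i + 1, j)
            if j < m:
                yield (i, j), (i, j + 1)


def edge_deviations(s1, s2, match=1, mismatch=-1, indel=-1):
    """Return {edge: delta(e)} using a hypothetical match/mismatch/indel scoring."""
    n, m = len(s1), len(s2)

    def score(u, v):
        (i, j), (i2, j2) = u, v
        if i2 == i + 1 and j2 == j + 1:  # diagonal edge
            return match if s1[i] == s2[j] else mismatch
        return indel

    edges = list(alignment_edges(n, m))
    neg = float("-inf")
    fwd = {(i, j): neg for i in range(n + 1) for j in range(m + 1)}
    bwd = dict(fwd)
    fwd[(0, 0)] = 0
    bwd[(n, m)] = 0
    for u, v in edges:            # forward pass: fwd[v] becomes l(s, v)
        fwd[v] = max(fwd[v], fwd[u] + score(u, v))
    for u, v in reversed(edges):  # backward pass: bwd[u] becomes l(u, t)
        bwd[u] = max(bwd[u], bwd[v] + score(u, v))
    best = fwd[(n, m)]
    return {(u, v): best - (fwd[u] + score(u, v) + bwd[v]) for u, v in edges}


dev = edge_deviations("ACGT", "AGT")
print(dev[((0, 0), (1, 1))], dev[((0, 0), (0, 1))])
```

Edges with δ(e) = 0 are exactly those that lie on some optimal path; the row-major edge order is a topological order of the graph, which is what lets each pass finish in time proportional to the number of edges.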

Δ-near-optimal alignments
• One way to study near-optimal paths (or alignments) is to specify a cutoff value Δ and then compute information about the set of all paths whose deviation from the optimal is at most Δ, that is, the set of all paths that are δ-near-optimal for some δ ≤ Δ.
• Definition: For an edge e = (u, v), let ε(e) = l(u, t) − [s(u, v) + l(v, t)].
• The interpretation of ε(e) is that it is the “additional penalty for using e on the path from u to t, rather than following the optimal (longest) path from u to t directly.”

Δ-near-optimal alignments
• Theorem: For any s-to-t path R, δ(R) = Σ_{e ∈ R} ε(e). (The sum telescopes: the l(v, t) term of each edge cancels the l(u, t) term of the next, leaving l(s, t) minus the length of R.)
• Corollary: Consider a path R′ from s to u and let δ denote Σ_{e ∈ R′} ε(e). Then the s-to-t path R consisting of path R′ followed by the longest u-to-t path is a δ-near-optimal path.
• Proof: By definition of ε(e), ε(e) = 0 for any edge e on the longest u-to-t path. Hence δ(R) = δ by the previous Theorem.
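A small numerical check of the theorem, written as a Python sketch. It takes ε(e) = l(u, t) − [s(u, v) + l(v, t)] for an edge e = (u, v) and compares the sum of ε over a path with the path's deviation. The scoring constants, function names, and example path are hypothetical choices for illustration only.

```python
MATCH, MISMATCH, INDEL = 1, -1, -1  # hypothetical scoring scheme


def step_score(s1, s2, u, v):
    """Score of the alignment-graph edge u -> v."""
    (i, j), (i2, j2) = u, v
    if i2 == i + 1 and j2 == j + 1:
        return MATCH if s1[i] == s2[j] else MISMATCH
    return INDEL


def suffix_lengths(s1, s2):
    """L[i][j] = l((i, j), t), the longest path length from node (i, j) to t."""
    n, m = len(s1), len(s2)
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            options = []
            if i < n and j < m:
                options.append(step_score(s1, s2, (i, j), (i + 1, j + 1)) + L[i + 1][j + 1])
            if i < n:
                options.append(INDEL + L[i + 1][j])
            if j < m:
                options.append(INDEL + L[i][j + 1])
            if options:  # every node except t has an outgoing edge
                L[i][j] = max(options)
    return L


def epsilon_sum_and_deviation(s1, s2, path):
    """Return (sum of epsilon(e) over the path, deviation of the path)."""
    L = suffix_lengths(s1, s2)
    eps_total = 0
    length = 0
    for u, v in zip(path, path[1:]):
        s = step_score(s1, s2, u, v)
        eps_total += L[u[0]][u[1]] - (s + L[v[0]][v[1]])
        length += s
    return eps_total, L[0][0] - length


path = [(0, 0), (0, 1), (1, 2), (2, 3), (3, 3), (4, 3)]
print(epsilon_sum_and_deviation("ACGT", "AGT", path))
```

The two returned numbers agree for any s-to-t path, because the ε terms telescope.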

Counting and enumerating near-optimal paths - How to count
• Definition: Let N(v, δ) be the number of δ-near-optimal s-to-t paths that go through node v.
• For a given value Δ, the number of s-to-t paths whose deviation from R* is at most Δ is Σ_{δ ≤ Δ} N(t, δ), since every s-to-t path goes through t.
• We compute that sum by evaluating a recurrence for each node v and for each “needed” value of δ. [The recurrence is not preserved in this transcript.]
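Because the recurrence itself does not appear in the text, the following Python sketch assumes one natural formulation rather than the slide's own: let C(v, δ) be the number of s-to-v paths whose edges have ε-values summing to δ, so that C(v, δ) = Σ over edges (u, v) of C(u, δ − ε(u, v)). Since a path's deviation equals the sum of ε(e) over its edges, summing C(t, δ) over δ ≤ Δ counts the paths with deviation at most Δ. The scoring constants and names are hypothetical.

```python
from collections import defaultdict

MATCH, MISMATCH, INDEL = 1, -1, -1  # hypothetical scoring scheme


def count_near_optimal(s1, s2, cutoff):
    """Count s-to-t paths whose deviation from the optimum is at most cutoff."""
    n, m = len(s1), len(s2)

    def out_edges(i, j):
        """Yield (i2, j2, score) for each edge leaving node (i, j)."""
        if i < n and j < m:
            yield i + 1, j + 1, (MATCH if s1[i] == s2[j] else MISMATCH)
        if i < n:
            yield i + 1, j, INDEL
        if j < m:
            yield i, j + 1, INDEL

    # L[i][j] = l((i, j), t): longest path length from (i, j) to t.
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if (i, j) != (n, m):
                L[i][j] = max(s + L[a][b] for a, b, s in out_edges(i, j))

    # counts[i][j][d] = number of s-to-(i, j) paths with epsilon-sum d (assumed recurrence).
    counts = [[defaultdict(int) for _ in range(m + 1)] for _ in range(n + 1)]
    counts[0][0][0] = 1
    for i in range(n + 1):
        for j in range(m + 1):
            for a, b, s in out_edges(i, j):
                eps = L[i][j] - (s + L[a][b])  # epsilon of edge (i, j) -> (a, b)
                for d, c in counts[i][j].items():
                    if d + eps <= cutoff:      # epsilon >= 0, so pruning is safe
                        counts[a][b][d + eps] += c
    return sum(counts[n][m].values())


print(count_near_optimal("ACGT", "AGT", 0), count_near_optimal("ACGT", "AGT", 2))
```

Only deviations up to the cutoff are stored at each node, which is one way to read the phrase “needed” values of δ.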

Counting and enumerating near-optimal paths - Enumeration
• The δ-near-optimal paths can be enumerated in order of increasing δ, and the enumeration can be terminated when δ = Δ or when some fixed number of paths have been found.
• A tree enumerating partial paths is maintained.

A one-dimensional chaining problem
• Consider a set of r (possibly) overlapping intervals drawn on the line R, where each interval j has some associated value v(j). The problem is to select a subset of nonoverlapping intervals whose values sum to as large a number as possible.

One-dimensional algorithm
• Let I be a list of all the 2r numbers representing the locations of the endpoints of the intervals. Sort the numbers in I, and annotate each entry in I with the name of the interval it is part of and whether it is a left or a right endpoint. For convenience, let I be a one-dimensional array.
    Set max to zero.
    For i from 1 to 2r do
    begin
        If I[i] represents the left end of an interval, say interval j, then set V[j] to v(j) + max.
        If I[i] represents the right end of interval j, then set max to the maximum of max and V[j].
    end.
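The sweep above translates almost line for line into Python. In this sketch each interval is a hypothetical (left, right, value) triple with a nonnegative value, and intervals are treated as closed, so two intervals that share an endpoint count as overlapping; that tie-breaking rule is an assumption, since the text does not address equal endpoints.

```python
def best_interval_chain(intervals):
    """Return the maximum total value of a set of pairwise nonoverlapping intervals.

    intervals: list of (left, right, value) triples, treated as closed intervals.
    """
    events = []
    for j, (left, right, _value) in enumerate(intervals):
        events.append((left, 0, j))   # kind 0 = left endpoint
        events.append((right, 1, j))  # kind 1 = right endpoint
    # At equal coordinates left endpoints sort first, so touching intervals overlap.
    events.sort()

    best = 0                          # the running "max" of the pseudocode
    V = [0] * len(intervals)
    for _pos, kind, j in events:
        if kind == 0:
            V[j] = intervals[j][2] + best
        else:
            best = max(best, V[j])
    return best


print(best_interval_chain([(1, 4, 5), (3, 5, 6), (5, 8, 4), (6, 9, 7)]))
```

Sorting the 2r endpoints dominates the running time, so the sketch runs in O(r log r) time; the sweep itself is linear.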


The two-dimensional chain problem
• Definition: A subset of the rectangles is called a chain if no horizontal or vertical line intersects more than one rectangle in the subset and if the rectangles can be ordered so that each one is below and to the right of its predecessor. The value of a chain is the sum of the values of the rectangles in the chain.
• The Chain Problem: Find a chain with maximum value over all chains.

Two-dimensional chain algorithm
• Here l_j and h_j denote the lowest and highest y-coordinates of rectangle j, and each entry of list L is a triple (l_j, V(j), j).
    List L begins empty.
    For i from 1 to 2r do
    begin
        If I[i] is the left end of a rectangle, say rectangle k, then
        begin
            Search L for the last triple where l_j is greater than h_k. That is, find the closest (in the y dimension) rectangle j with a triple in L whose lowest point is strictly above the highest point of rectangle k.
            Set V(k) to v(k) + V(j).
        end
        Else if I[i] is the right end of rectangle k, then
        begin
            Search L for the first triple where l_j is less than or equal to l_k.
            If l_j < l_k, or if l_j = l_k and V(k) > V(j), then insert the triple (l_k, V(k), k) into L, in the proper location to keep the triples sorted by their l values.
            Delete from L the triple for every rectangle j′ where l_j′ ≤ l_k and V(k) > V(j′).
        end
    end.

Two-dimensional chain algorithm
• Theorem: An optimal chain can be found in O(r log r) time.
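One way to reach the O(r log r) bound is sketched below in Python. It keeps the same left-to-right sweep over rectangle ends but swaps the sorted list of triples for a Fenwick tree of prefix maxima indexed by each rectangle's lowest y-coordinate; this substitution is a choice made for the example, not the data structure described above. Rectangles are hypothetical (x_left, x_right, y_low, y_high, value) tuples with nonnegative values, y increases upward, and rectangle j may precede k only if x_right of j < x_left of k and y_low of j > y_high of k.

```python
from bisect import bisect_left, bisect_right


def best_rectangle_chain(rects):
    """Return the maximum chain value for (x_left, x_right, y_low, y_high, value) tuples."""
    lows = sorted({r[2] for r in rects})
    m = len(lows)
    tree = [0] * (m + 1)  # Fenwick tree of prefix maxima; index 1 = largest y_low

    def update(i, val):
        while i <= m:
            if tree[i] < val:
                tree[i] = val
            i += i & -i

    def query(i):
        """Maximum stored value among the i largest y_low coordinates."""
        best = 0
        while i > 0:
            best = max(best, tree[i])
            i -= i & -i
        return best

    events = []
    for k, (xl, xr, _yl, _yh, _v) in enumerate(rects):
        events.append((xl, 0, k))  # kind 0 = left end
        events.append((xr, 1, k))  # kind 1 = right end
    events.sort()                  # left ends first at equal x, keeping "right of" strict

    V = [0] * len(rects)
    best = 0
    for _x, kind, k in events:
        _xl, _xr, yl, yh, v = rects[k]
        if kind == 0:
            above = m - bisect_right(lows, yh)  # number of y_low values strictly above yh
            V[k] = v + query(above)
        else:
            update(m - bisect_left(lows, yl), V[k])
            best = max(best, V[k])
    return best


rects = [(0, 1, 5, 6, 3), (2, 3, 2, 3, 4), (2.5, 4, 5.5, 7, 10), (5, 6, 0, 1, 2)]
print(best_rectangle_chain(rects))
```

A rectangle becomes available as a predecessor only when the sweep passes its right end, so every query sees exactly the rectangles that lie entirely to the left; each event costs O(log r).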
