# Aligning Alignments - PowerPoint PPT Presentation

### Aligning Alignments

Soni Mukherjee

11/11/04

Pairwise Alignment
• Given two sequences, find their optimal alignment
• Score = (#matches) * m - (#mismatches) * s - (#gaps) * d
• Optimal alignment is the alignment with the maximum score
Dynamic Programming
• We want to align x1…xm and y1…yn
• D(i,j) = optimal score of aligning x1…xi and y1…yj
• Solution is D(m, n)
Dynamic Programming

Example alignment:

C--GCCTAG-CT--AG

CT-GC-TAT-CTTTAG

Three possible cases for computing D(i,j):

1. xi aligns to yj

x1……xi-1 xi

y1……yj-1 yj

D(i,j) = D(i-1, j-1) + m, if xi = yj; D(i-1, j-1) - s, otherwise

2. xi aligns to a gap

x1……xi-1 xi

y1……yj -

D(i,j) = D(i-1, j) - d

3. yj aligns to a gap

x1……xi -

y1……yj-1 yj

D(i,j) = D(i, j-1) - d

Dynamic Programming
• Inductive assumption:
• D(i-1, j-1), D(i-1, j) and D(i, j-1) are optimal
• Then:

$$D(i,j) = \max \begin{cases} D(i-1,\,j-1) + s(x_i, y_j) \\ D(i-1,\,j) - d \\ D(i,\,j-1) - d \end{cases}$$

where s(xi, yj) = m if xi = yj; -s otherwise
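The recurrence above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the presenter's code: the function name `needleman_wunsch` and the default values m=1, s=1, d=2 are chosen only for the example, and ties in the traceback are broken in favor of the diagonal move.

```python
def needleman_wunsch(x, y, m=1, s=1, d=2):
    """Return (score, aligned_x, aligned_y) under the linear gap model.

    m = match reward, s = mismatch penalty, d = gap penalty
    (the default values are illustrative assumptions).
    """
    M, N = len(x), len(y)
    # D[i][j] = optimal score of aligning x[:i] with y[:j]
    D = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        D[i][0] = -i * d
    for j in range(1, N + 1):
        D[0][j] = -j * d
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            D[i][j] = max(D[i - 1][j - 1] + sub,  # xi aligns to yj
                          D[i - 1][j] - d,        # xi aligns to a gap
                          D[i][j - 1] - d)        # yj aligns to a gap

    # Traceback from (M, N) to (0, 0) to recover one optimal alignment.
    ax, ay = [], []
    i, j = M, N
    while i > 0 or j > 0:
        sub = m if i > 0 and j > 0 and x[i - 1] == y[j - 1] else -s
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + sub:
            ax.append(x[i - 1])
            ay.append(y[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] - d:
            ax.append(x[i - 1])
            ay.append("-")
            i -= 1
        else:
            ax.append("-")
            ay.append(y[j - 1])
            j -= 1
    return D[M][N], "".join(reversed(ax)), "".join(reversed(ay))
```

For instance, `needleman_wunsch("ACGT", "AGT")` aligns the three shared letters and pays for a single gap.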
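Dynamic Programming
• Matrix D: entry D(i, j) is reached from three neighboring entries
• Diagonal step: + s(X[i], Y[j])
• Horizontal or vertical step: - d

Needleman-Wunsch
• Every non-decreasing path from (0,0) to (M,N) corresponds to an alignment of the two sequences

[Figure: matrix D with x1…xM along one axis and y1…yN along the other]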

Scoring Gaps More Accurately
• Linear gap model:

Gap of length n incurs penalty p(n) = n*d

• Convex gap model:

For all n, p(n+1) - p(n) < p(n) - p(n-1)

$$D(i,j) = \max \begin{cases} D(i-1,\,j-1) + s(x_i, y_j) \\ \max_{k=0,\dots,i-1} \; D(k,\,j) - p(i-k) \\ \max_{k=0,\dots,j-1} \; D(i,\,k) - p(j-k) \end{cases}$$

Running time = $O(N^3)$
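A direct, unoptimized sketch of this recurrence for an arbitrary gap penalty function is given below. The names `align_general_gap` and `sqrt_gap` are hypothetical, and the square-root penalty is only one example of a penalty whose increments shrink as the gap grows. The two inner loops over k are what make the running time cubic.

```python
import math


def sqrt_gap(n):
    """Example penalty with shrinking increments (an assumption for the demo)."""
    return 2 + math.sqrt(n)


def align_general_gap(x, y, p, m=1, s=1):
    """Optimal global alignment score when a gap of length n costs p(n)."""
    M, N = len(x), len(y)
    neg_inf = float("-inf")
    # D[i][j] = optimal score of aligning x[:i] with y[:j]
    D = [[neg_inf] * (N + 1) for _ in range(M + 1)]
    D[0][0] = 0.0
    for i in range(M + 1):
        for j in range(N + 1):
            if i == 0 and j == 0:
                continue
            best = neg_inf
            if i > 0 and j > 0:
                sub = m if x[i - 1] == y[j - 1] else -s
                best = D[i - 1][j - 1] + sub
            # Gap of length i-k: x[k:i] is aligned to gaps.
            for k in range(i):
                best = max(best, D[k][j] - p(i - k))
            # Gap of length j-k: y[k:j] is aligned to gaps.
            for k in range(j):
                best = max(best, D[i][k] - p(j - k))
            D[i][j] = best
    return D[M][N]
```

With the linear penalty `lambda n: 2 * n` this returns the same score as the linear gap model with d = 2.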

Affine Gaps
• p(n) = d + n*e

d = gap open penalty

e = gap extend penalty

[Figure: gap penalty p(n), labeled with d and e]

• Now we need three matrices:

D(i, j) = score of alignment x1…xi to y1…yj if xi aligns to yj

H(i, j) = score of alignment x1…xi to y1…yj if yj aligns to a gap

V(i, j) = score of alignment x1…xi to y1…yj if xi aligns to a gap

Needleman-Wunsch with Affine Gaps

$$D(i,j) = \max \begin{cases} D(i-1,\,j-1) + s(x_i, y_j) \\ H(i-1,\,j-1) + s(x_i, y_j) \\ V(i-1,\,j-1) + s(x_i, y_j) \end{cases}$$

$$H(i,j) = \max \begin{cases} D(i,\,j-1) - d \\ H(i,\,j-1) - e \\ V(i,\,j-1) - d \end{cases}$$

$$V(i,j) = \max \begin{cases} D(i-1,\,j) - d \\ H(i-1,\,j) - d \\ V(i-1,\,j) - e \end{cases}$$

Running time = O(MN)

Affine Gaps
• Essentially, when there is a gap, the algorithm looks back one column to determine whether this gap opens a new gap or continues a previous one:
• Top row "- x" over bottom row "y -": starts a new gap
• Top row "z x" over bottom row "y -": starts a new gap
• Top row "z x" over bottom row "- -": continues an old gap
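The three-matrix scheme can be sketched as below. The code follows the D, H, V recurrences exactly as written on the slides, so the first column of a gap costs d and each further column costs e; under that reading a gap of length n costs d + (n-1)*e, which differs by a constant from p(n) = d + n*e. The function name and the default parameter values are assumptions for illustration.

```python
NEG_INF = float("-inf")


def affine_align_score(x, y, m=1, s=1, d=3, e=1):
    """Optimal global alignment score using the three matrices D, H, V."""
    M, N = len(x), len(y)
    # D: xi aligned to yj; H: yj aligned to a gap; V: xi aligned to a gap.
    D = [[NEG_INF] * (N + 1) for _ in range(M + 1)]
    H = [[NEG_INF] * (N + 1) for _ in range(M + 1)]
    V = [[NEG_INF] * (N + 1) for _ in range(M + 1)]
    D[0][0] = 0  # empty alignment
    for i in range(M + 1):
        for j in range(N + 1):
            if i > 0 and j > 0:
                sub = m if x[i - 1] == y[j - 1] else -s
                D[i][j] = max(D[i - 1][j - 1],
                              H[i - 1][j - 1],
                              V[i - 1][j - 1]) + sub
            if j > 0:
                H[i][j] = max(D[i][j - 1] - d,   # open a gap
                              H[i][j - 1] - e,   # extend the same gap
                              V[i][j - 1] - d)   # open a gap in the other row
            if i > 0:
                V[i][j] = max(D[i - 1][j] - d,
                              H[i - 1][j] - d,
                              V[i - 1][j] - e)
    return max(D[M][N], H[M][N], V[M][N])
```

Setting d equal to e reduces this to the linear gap model, which gives a quick sanity check against the simpler recurrence.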

Multiple Sequence Alignment
• Given N sequences x1, x2,…, xN, insert gaps in each sequence xi such that:
• All sequences have the same length L
• Global score is maximum
• Motivation:
• Faint similarity between two sequences becomes significant if present in many
• Multiple alignments can help improve pairwise alignments
Induced Pairwise Alignments
• Multiple alignment:

x:AC_GCGG_C

y:AC_GC_GAG

z:GCCGC_GAG

• Induces three pairwise alignments:

(x, y):

x: ACGCGG_C

y: ACGC_GAG

(x, z):

x: AC_GCGG_C

z: GCCGC_GAG

(y, z):

y: AC_GCGAG

z: GCCGCGAG

Sum of Pairs
• Sum of Pairs score of a multiple alignment is the sum of the scores of all induced pairwise alignments:

S(m) = k<l s(mk, ml)

wheres(mk, ml) = score of induced alignment (k, l)
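A small sketch of the sum-of-pairs score follows, assuming the linear gap model with illustrative values m=1, s=1, d=2 and "_" as the gap character, as in the example alignment above. Columns where both rows have a gap are skipped, since they vanish in the induced pairwise alignment. The function names are hypothetical.

```python
from itertools import combinations


def pair_score(a, b, m=1, s=1, d=2, gap="_"):
    """Score of the pairwise alignment induced by rows a and b."""
    total = 0
    for ca, cb in zip(a, b):
        if ca == gap and cb == gap:
            continue  # this column disappears in the induced alignment
        if ca == gap or cb == gap:
            total -= d
        elif ca == cb:
            total += m
        else:
            total -= s
    return total


def sum_of_pairs(rows, **params):
    """Sum of the induced pairwise scores over all pairs of rows."""
    return sum(pair_score(a, b, **params) for a, b in combinations(rows, 2))


rows = ["AC_GCGG_C",   # x
        "AC_GC_GAG",   # y
        "GCCGC_GAG"]   # z
print(sum_of_pairs(rows))
```

With these assumed parameters the three induced alignments score 0 for (x, y), -4 for (x, z) and 3 for (y, z).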

Multidimensional Dynamic Programming
• Example in 3-D (3 sequences)
• 7 neighbors per cell

F(i,j,k) = max{ F(i-1,j-1,k-1)+S(xi, xj, xk),

F(i-1,j-1,k )+S(xi, xj, - ),

F(i-1,j ,k-1)+S(xi, -, xk),

F(i-1,j ,k )+S(xi, -, - ),

F(i ,j-1,k-1)+S( -, xj, xk),

F(i ,j-1,k )+S( -, xj, -),

F(i ,j ,k-1)+S( -, -, xk) }

Multidimensional Dynamic Programming
• L = length of each sequence
• N = number of sequences
• Size of matrix = $L^N$
• Neighbors per cell = $2^N - 1$
• Running time = $O(2^N L^N)$
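The neighbor count and the exponential cost are easy to see in a generic sketch. The code below assumes sum-of-pairs column scoring with a linear gap penalty and illustrative values m=1, s=1, d=2; `column_score` and `msa_score` are hypothetical names. It is only practical for a handful of very short sequences.

```python
from itertools import combinations, product


def column_score(col, m=1, s=1, d=2):
    """Sum-of-pairs score of one alignment column (letters or '-')."""
    total = 0
    for a, b in combinations(col, 2):
        if a == "-" and b == "-":
            continue
        if a == "-" or b == "-":
            total -= d
        else:
            total += m if a == b else -s
    return total


def msa_score(seqs, **params):
    """Optimal sum-of-pairs score by full N-dimensional dynamic programming."""
    n = len(seqs)
    # Each move says which sequences contribute a letter to the new column;
    # the all-zero move is excluded, leaving 2^N - 1 neighbors per cell.
    moves = [mv for mv in product((0, 1), repeat=n) if any(mv)]
    F = {(0,) * n: 0}
    # Lexicographic order guarantees every predecessor is filled in first.
    for cell in product(*(range(len(q) + 1) for q in seqs)):
        if not any(cell):
            continue
        best = None
        for mv in moves:
            prev = tuple(c - b for c, b in zip(cell, mv))
            if min(prev) < 0:
                continue
            col = tuple(q[c - 1] if b else "-"
                        for q, c, b in zip(seqs, cell, mv))
            cand = F[prev] + column_score(col, **params)
            if best is None or cand > best:
                best = cand
        F[cell] = best
    return F[tuple(len(q) for q in seqs)]
```

For two sequences this reduces to the pairwise recurrence with a linear gap penalty.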
Progressive Alignment
• Align two of the sequences xi and xj
• Fix that alignment
• Align a third sequence/alignment to the alignment xixj
• Repeat until all sequences are aligned
Progressive Alignment
• When evolutionary tree is known:
• Align closest first, in order of the tree:
• Align (x, y)
• Align (w, z)
• Align (xy, wz)

[Figure: evolutionary tree with leaves x, y, z, w]

Sequence vs Alignment
• Score at each entry adds the score of aligning the letter in y to the column in the alignment xz

[Figure: DP matrix with x1…xM and z1…zL along one axis and y1…yN along the other]

Example
• ith letter of y: A
• jth column of xz: - (in x) over A (in z)

$$D(i,j) = \max \begin{cases} D(i-1,\,j-1) - d + s(A, A) \\ D(i-1,\,j) - d - d \\ D(i,\,j-1) + 0 - d \end{cases}$$
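The example generalizes to the following sketch of aligning a sequence y against an existing alignment, assuming linear gaps and sum-of-pairs column scores. Here i indexes y and j indexes alignment columns, as on the slide; the function names and the default values m=1, s=1, d=2 are assumptions for illustration.

```python
def pair(a, b, m=1, s=1, d=2):
    """Score of one letter or gap against another (gap against gap is 0)."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -d
    return m if a == b else -s


def align_seq_to_alignment(y, rows, m=1, s=1, d=2):
    """Optimal score of aligning sequence y to the alignment given by rows."""
    k = len(rows)
    cols = list(zip(*rows))  # cols[j] holds one character per row
    n, L = len(cols), len(y)
    D = [[0] * (n + 1) for _ in range(L + 1)]
    for i in range(1, L + 1):
        D[i][0] = D[i - 1][0] - k * d  # y's letter against a gap in every row
    for j in range(1, n + 1):
        D[0][j] = D[0][j - 1] + sum(pair("-", c, m, s, d) for c in cols[j - 1])
    for i in range(1, L + 1):
        for j in range(1, n + 1):
            col = cols[j - 1]
            match = sum(pair(y[i - 1], c, m, s, d) for c in col)
            skip = sum(pair("-", c, m, s, d) for c in col)
            D[i][j] = max(D[i - 1][j - 1] + match,  # letter meets column
                          D[i - 1][j] - k * d,      # letter meets all gaps
                          D[i][j - 1] + skip)       # column meets a gap in y
    return D[L][n]
```

Calling it with y = "A" and the single column from the slide (x has a gap, z has A) reproduces the three candidate values listed above.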

Affine Gaps
• ith letter of y matched with jth column of xz
• (j-1)th column of xz gapped

y: - A

x: - -

z: A A

• This induces the yx alignment:

y: - A

x: - -

Affine Gaps
• Recall for pairwise alignment, there were three cases when determining whether a gap starts or continues a gap:
• Top row "- x" over bottom row "y -": starts a new gap
• Top row "z x" over bottom row "y -": starts a new gap
• Top row "z x" over bottom row "- -": continues an old gap
• When aligning a sequence and an alignment, a fourth case arises:
• Top row "- x" over bottom row "- -": starts or continues a gap?
• Two ways to handle the fourth case:
• Optimistic and pessimistic gap counts for sequence vs alignment
• Exact gap counts for sequence vs alignment
Sequence vs Alignment
• A = a1 … am is a sequence of length m
• B is a multiple alignment of length n of k sequences
• represented by a k x n matrix
• each entry bij is either a letter or gap
Optimistic and Pessimistic Gap Counts
• When we have

- x

- -

• Optimistic gap count assumes that this continues a previous gap
• Pessimistic gap count assumes this starts a new gap
• Running time = O(kmn)
Exact Gap Counts
• Recall matrices:

D(i, j) = score of alignment a1…ai to b1…bj if ai aligns to bj

H(i, j) = score of alignment a1…ai to b1…bj if bj aligns to a gap

V(i, j) = score of alignment a1…ai to b1…bj if ai aligns to a gap

• The only ways to get the pattern with top row "- x" over bottom row "- -" are the cases HH, HV, and HD, generalized as HX

Exact Gap Counts
• Three possibilities:
• … DH…HX
• … VH…HX
• H………HX
• Is bij the first character in its row encountered during the run?
• Algorithm with lots of matrices runs in $O(kn^2 + kmn + mn^2)$

Comparison
• Sequence vs Alignment: only three types of paths can cause the pattern below
• Alignment vs Alignment: any path can cause the pattern below

… - - x

… - - -

Aligning Alignments Exactly

John Kececioglu and Dean Starrett, 2003
• Aligning two alignments is NP-complete
• Exact algorithm
• Time and space complexity
• Pruning
• Results
NP-Completeness
• Reduction from the Maximum Cut Problem
• Still NP-complete if:
• Strings are of length at most 5
• Every row has at most 3 gaps
• At most 1 gap in the interior of each string
Exact Algorithm
• Sufficient to know relative order of the rightmost element in the row for each pair:

x: - A

y: - -

• If x’s rightmost element is to the right of y’s rightmost element, this is an extension
• Otherwise, it is a startup
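The rule can be sketched as a small helper. One assumption is made explicit here: "rightmost element" is read as the last non-gap character among the columns to the left of the current column. The names `rightmost` and `gap_kind` are hypothetical.

```python
def rightmost(row, upto, gap="-"):
    """Index of the last non-gap character in row[:upto], or -1 if none."""
    for c in range(upto - 1, -1, -1):
        if row[c] != gap:
            return c
    return -1


def gap_kind(x, y, col, gap="-"):
    """Classify column `col`, where x has a letter and y has a gap.

    Returns "extension" if the gap in y continues a gap already open in the
    induced alignment of x and y, and "startup" if it opens a new one.
    """
    if x[col] == gap or y[col] != gap:
        raise ValueError("expected a letter in x over a gap in y")
    if rightmost(x, col, gap) > rightmost(y, col, gap):
        return "extension"
    return "startup"


# The ambiguous pattern: the previous column is a gap in both rows.
print(gap_kind("A-A", "---", 2))  # x had a letter more recently than y
print(gap_kind("A-A", "C--", 2))  # both rows last had a letter in column 0
```

Only the relative order of the two rightmost positions matters, not their distance from the current column, which is what makes a compact summary of the alignment's right end possible.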
Shapes

A: -AGGCTATCACCTGACCTCCAGG

B: TAG-CTATCAC--GACCGC----

C: CAG-CTATCAC--GACCGC----

D: CAGCCTATCACC-GAACGCCA--

• S1 = {B, C}
• S2 = {D}
• S3 = {A}
• S = (S1, S2, S3)
Shapes
• A shape s for an alignment with k rows is an ordered partition s =(s1, s2, … , sp) where 1 <= p <= k
• If we know s, we know for each gap whether it starts or continues a gap
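As a sketch, the shape of the four-row example can be computed by grouping rows by the position of their rightmost letter and ordering the groups from left to right. This grouping is inferred from the example on the slides rather than quoted from the paper, and the function name `shape` is hypothetical.

```python
def shape(rows, gap="-"):
    """Ordered partition of row names by position of the rightmost letter.

    rows maps a row name to its aligned string. Rows whose last letters fall
    in the same column share a block; blocks are ordered left to right.
    """
    def last_letter(row):
        for c in range(len(row) - 1, -1, -1):
            if row[c] != gap:
                return c
        return -1  # row consists only of gaps

    blocks = {}
    for name, row in rows.items():
        blocks.setdefault(last_letter(row), set()).add(name)
    return [blocks[pos] for pos in sorted(blocks)]


example = {
    "A": "-AGGCTATCACCTGACCTCCAGG",
    "B": "TAG-CTATCAC--GACCGC----",
    "C": "CAG-CTATCAC--GACCGC----",
    "D": "CAGCCTATCACC-GAACGCCA--",
}
print(shape(example))
```

On the example this yields the blocks {B, C}, then {D}, then {A}, matching S = (S1, S2, S3).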
Exact Algorithm
• A is a k x m multiple alignment
• B is a l x n multiple alignment
• C(i, j, s) = cost of an optimal alignment of a1…ai and b1…bj ending in shape s
• Instead of entries (i, j, s), think of entries (i, j), each with a shape list L(i, j)
Exact Algorithm
• For each s in L(i, j):
• For each next-entry (i, j+1), (i+1, j), and (i+1, j+1)
• Add resulting shape t to next-entry’s shape list.
• Find s in L(m, n) that minimizes C(m, n, s) to find the optimum cost
Time and Space Complexity
• Time =

$$\begin{cases} O\big((3+\sqrt{2})^{k}\,(n-k)^{2}\,k^{3/2}\big), & \text{if } k < n \\ O\big((3+\sqrt{2})^{n}\,k^{2}\,n^{-1/2}\big), & \text{if } k \ge n \end{cases}$$

• Space = Time / k

Pruning
• Dominance Pruning - uses a dominance relation on pairs of shapes
• Bound Pruning - exploits upper and lower bounds on the cost of an optimal alignment
• Combining these yields fastest exact algorithm in practice
Dominance Pruning
• Extension - a series of insertions, deletions, and substitutions of columns that extend the alignment into an entry
• Shape s dominates shape t if, for all extensions p, C(s ∘ p) <= C(t ∘ p)
• s is at least as good as t on all extensions

Bound Pruning
• L(s) - lower bound on C(s ∘ p) for all p
• Optimistic algorithm on reverse of input
• U - upper bound on the cost of the optimal alignment of A and B
• Minimum of optimistic, pessimistic, and trivial alignment scores
• If L(s) > U, remove s

Reducing the Space
• Exact Algorithm with dominance pruning can be run in linear space in the number of columns of the input, without increasing the time complexity
• Not possible with bound pruning, which uses quadratic-size tables to lookup lower bounds
Results
• Tractable in practice
• Ceiling phenomenon - number of shapes does not grow once the number of rows exceeds a threshold