Aligning Alignments

Soni Mukherjee

11/11/04

Pairwise Alignment
  • Given two sequences, find their optimal alignment
  • Score = (#matches) * m - (#mismatches) * s - (#gaps) * d
  • Optimal alignment is the alignment with the maximum score
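
As a rough illustration of the scoring formula above, the following sketch scores an alignment that has already been written out with `-` for gaps. The function name `score_alignment` and the default values of m, s and d are assumptions made for the example; skipping columns where both rows hold a gap is also an assumption.

```python
def score_alignment(top, bottom, m=1, s=1, d=2):
    """Score = (#matches) * m - (#mismatches) * s - (#gaps) * d."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            continue  # assumption: a column of two gaps contributes nothing
        if a == "-" or b == "-":
            score -= d  # one gap character
        elif a == b:
            score += m  # match
        else:
            score -= s  # mismatch
    return score


print(score_alignment("AC-T", "ACGT"))  # 3 matches, 1 gap -> 3*1 - 2 = 1
```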
Dynamic Programming
  • We want to align x1…xm and y1…yn
  • D(i, j) = optimal score of aligning x1…xi and y1…yj
  • Solution is D(m, n)
Dynamic Programming
  • Three possible cases for computing D(i, j):

Example alignment:

C--GCCTAG-CT--AG
CT-GC-TAT-CTTTAG

1. xi aligns to yj

x1……xi-1 xi
y1……yj-1 yj

$$D(i, j) = D(i-1, j-1) + \begin{cases} m, & \text{if } x_i = y_j \\ -s, & \text{otherwise} \end{cases}$$

2. xi aligns to a gap

x1……xi-1 xi
y1……yj   -

$$D(i, j) = D(i-1, j) - d$$

3. yj aligns to a gap

x1……xi   -
y1……yj-1 yj

$$D(i, j) = D(i, j-1) - d$$
Dynamic Programming
  • Inductive assumption:
    • D(i-1, j-1), D(i-1, j) and D(i, j-1) are optimal
  • Recurrence:

$$D(i, j) = \max \begin{cases} D(i-1, j-1) + s(x_i, y_j) \\ D(i-1, j) - d \\ D(i, j-1) - d \end{cases}$$

where $s(x_i, y_j) = m$ if $x_i = y_j$, and $-s$ otherwise
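
The recurrence above can be filled in row by row. The sketch below is one possible way to do it, not the presenter's code: the name `needleman_wunsch_score`, the default parameter values, and the boundary rows D(i, 0) = -i*d and D(0, j) = -j*d are assumptions (the slides do not state the initialization).

```python
def needleman_wunsch_score(x, y, m=1, s=1, d=2):
    """Return D(M, N), the optimal global alignment score under a linear gap penalty."""
    M, N = len(x), len(y)
    D = [[0] * (N + 1) for _ in range(M + 1)]
    # Assumed boundary: a prefix aligned against nothing is all gaps.
    for i in range(1, M + 1):
        D[i][0] = -i * d
    for j in range(1, N + 1):
        D[0][j] = -j * d
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            D[i][j] = max(
                D[i - 1][j - 1] + sub,  # xi aligns to yj
                D[i - 1][j] - d,        # xi aligns to a gap
                D[i][j - 1] - d,        # yj aligns to a gap
            )
    return D[M][N]


print(needleman_wunsch_score("ACGT", "AGT"))  # 3 matches and 1 gap -> 1
```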
Dynamic Programming
  • Matrix D

[Figure: a cell of matrix D reached by three moves, labeled +s(X[i],Y[j]), -d and -d]

Needleman-Wunsch
  • Every non-decreasing path from (0,0) to (M,N) corresponds to an alignment of the two sequences

[Figure: dynamic programming matrix with y1 … yN along one axis and x1 … xM along the other]
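
To make the path-to-alignment correspondence concrete, here is a small sketch that turns a path of matrix cells into the two gapped rows. The function name `path_to_alignment` and the convention that the first coordinate indexes x and the second indexes y are assumptions for illustration.

```python
def path_to_alignment(x, y, path):
    """Convert a path of (i, j) cells from (0, 0) to (len(x), len(y)) into two gapped rows."""
    top, bottom = [], []
    for (i0, j0), (i1, j1) in zip(path, path[1:]):
        step = (i1 - i0, j1 - j0)
        if step == (1, 1):    # diagonal: a letter of x over a letter of y
            top.append(x[i0])
            bottom.append(y[j0])
        elif step == (1, 0):  # advance in x only: letter of x over a gap
            top.append(x[i0])
            bottom.append("-")
        elif step == (0, 1):  # advance in y only: gap over a letter of y
            top.append("-")
            bottom.append(y[j0])
        else:
            raise ValueError("each step must advance by at most one in each coordinate")
    return "".join(top), "".join(bottom)


print(path_to_alignment("AC", "AGC", [(0, 0), (1, 1), (1, 2), (2, 3)]))  # ('A-C', 'AGC')
```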

Scoring Gaps More Accurately
  • Linear gap model:

Gap of length n incurs penalty p(n) = n*d

  • Convex gap model:

For all n, p(n+1) - p(n) < p(n) - p(n-1)

$$D(i, j) = \max \begin{cases} D(i-1, j-1) + s(x_i, y_j) \\ \max_{k = 0 \ldots i-1} D(k, j) - p(i-k) \\ \max_{k = 0 \ldots j-1} D(i, k) - p(j-k) \end{cases}$$

Running time = $O(N^3)$
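
A direct transcription of the general-gap recurrence might look like the sketch below. The function name `general_gap_score`, the default m and s, and the choice to apply the same recurrence to the boundary cells are assumptions. The two inner loops over k are what make the running time cubic.

```python
def general_gap_score(x, y, p, m=1, s=1):
    """Optimal score when a gap of length n costs p(n), for an arbitrary penalty function p."""
    M, N = len(x), len(y)
    NEG = float("-inf")
    D = [[NEG] * (N + 1) for _ in range(M + 1)]
    D[0][0] = 0
    for i in range(M + 1):
        for j in range(N + 1):
            if i == 0 and j == 0:
                continue
            best = NEG
            if i > 0 and j > 0:
                sub = m if x[i - 1] == y[j - 1] else -s
                best = D[i - 1][j - 1] + sub
            for k in range(i):  # a gap covering x(k+1)...x(i)
                best = max(best, D[k][j] - p(i - k))
            for k in range(j):  # a gap covering y(k+1)...y(j)
                best = max(best, D[i][k] - p(j - k))
            D[i][j] = best
    return D[M][N]


# Hypothetical penalty p(n) = 2 + n: two matches, then one gap of length 2.
print(general_gap_score("AAAA", "AA", lambda n: 2 + n))  # 2 - 4 = -2
```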

Affine Gaps
  • p(n) = d + n*e

d = gap open penalty

e = gap extend penalty

[Figure: affine gap penalty, with d and e labeled]

  • Now we need three matrices:

D(i, j) = score of alignment x1…xi to y1…yj if xi aligns to yj

H(i, j) = score of alignment x1…xi to y1…yj if yj aligns to a gap

V(i, j) = score of alignment x1…xi to y1…yj if xi aligns to a gap

Needleman-Wunsch with Affine Gaps

$$D(i, j) = \max \begin{cases} D(i-1, j-1) + s(x_i, y_j) \\ H(i-1, j-1) + s(x_i, y_j) \\ V(i-1, j-1) + s(x_i, y_j) \end{cases}$$

$$H(i, j) = \max \begin{cases} D(i, j-1) - d \\ H(i, j-1) - e \\ V(i, j-1) - d \end{cases}$$

$$V(i, j) = \max \begin{cases} D(i-1, j) - d \\ H(i-1, j) - d \\ V(i-1, j) - e \end{cases}$$

Running time = O(MN)
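
The three-matrix recurrences can be sketched as follows. This follows the recurrences as written on the slide, where the first position of a gap costs d and each further position costs e. The function name `affine_gap_score`, the default parameter values and the boundary initialization are assumptions, since the slides do not give them.

```python
def affine_gap_score(x, y, m=1, s=1, d=3, e=1):
    """Best global alignment score using the D, H, V matrices (open costs d, extend costs e)."""
    M, N = len(x), len(y)
    NEG = float("-inf")
    D = [[NEG] * (N + 1) for _ in range(M + 1)]  # xi aligned to yj
    H = [[NEG] * (N + 1) for _ in range(M + 1)]  # yj aligned to a gap
    V = [[NEG] * (N + 1) for _ in range(M + 1)]  # xi aligned to a gap
    D[0][0] = 0
    # Assumed boundary: a leading gap is opened once and then extended.
    for j in range(1, N + 1):
        H[0][j] = -d - (j - 1) * e
    for i in range(1, M + 1):
        V[i][0] = -d - (i - 1) * e
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            D[i][j] = max(D[i - 1][j - 1], H[i - 1][j - 1], V[i - 1][j - 1]) + sub
            H[i][j] = max(D[i][j - 1] - d, H[i][j - 1] - e, V[i][j - 1] - d)
            V[i][j] = max(D[i - 1][j] - d, H[i - 1][j] - d, V[i - 1][j] - e)
    return max(D[M][N], H[M][N], V[M][N])


print(affine_gap_score("AAAA", "AA"))  # two matches, one gap of length 2: 2 - 3 - 1 = -2
```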

Affine Gaps
  • Essentially, when there is a gap, the algorithm looks back one column to determine whether it opens a new gap or continues a previous one:

- x              z x              z x
y -              y -              - -
Starts new gap   Starts new gap   Continues old gap

Multiple Sequence Alignment
  • Given N sequences x1, x2,…, xN, insert gaps in each sequence xi such that:
    • All sequences have the same length L
    • Global score is maximum
  • Motivation:
    • Faint similarity between two sequences becomes significant if present in many
    • Multiple alignments can help improve pairwise alignments
Induced Pairwise Alignments
  • Multiple alignment:

x: AC_GCGG_C
y: AC_GC_GAG
z: GCCGC_GAG

  • Induces three pairwise alignments:

x: ACGCGG_C    x: AC_GCGG_C    y: AC_GCGAG
y: ACGC_GAG    z: GCCGC_GAG    z: GCCGCGAG
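
An induced pairwise alignment is obtained by keeping two rows and dropping every column in which both rows hold a gap. A minimal sketch, assuming `_` as the gap character as on this slide; the name `induced_pairwise` is hypothetical.

```python
def induced_pairwise(row_a, row_b, gap="_"):
    """Keep two rows of a multiple alignment and drop columns where both are gaps."""
    kept = [(a, b) for a, b in zip(row_a, row_b) if not (a == gap and b == gap)]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)


y = "AC_GC_GAG"
z = "GCCGC_GAG"
print(induced_pairwise(y, z))  # ('AC_GCGAG', 'GCCGCGAG')
```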

Sum of Pairs
  • Sum of Pairs score of a multiple alignment is the sum of the scores of all induced pairwise alignments:

$$S(m) = \sum_{k < l} s(m_k, m_l)$$

where $s(m_k, m_l)$ = score of induced alignment (k, l)
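
A sum-of-pairs score can be sketched by scoring every pair of rows and adding the results. The linear gap scoring, the parameter values m = s = d = 1, the `_` gap character and the function names are assumptions made for this example.

```python
from itertools import combinations


def induced_pair_score(row_a, row_b, m=1, s=1, d=1, gap="_"):
    """Score of the pairwise alignment induced by two rows (linear gap penalty)."""
    score = 0
    for a, b in zip(row_a, row_b):
        if a == gap and b == gap:
            continue  # column disappears from the induced alignment
        if a == gap or b == gap:
            score -= d
        elif a == b:
            score += m
        else:
            score -= s
    return score


def sum_of_pairs(rows, **params):
    """Sum of induced pairwise scores over all pairs k < l."""
    return sum(induced_pair_score(a, b, **params) for a, b in combinations(rows, 2))


rows = ["AC_GCGG_C", "AC_GC_GAG", "GCCGC_GAG"]
print(sum_of_pairs(rows))  # 2 + (-1) + 4 = 5
```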

Multidimensional Dynamic Programming
  • Example in 3-D (3 sequences)
  • 7 neighbors per cell

F(i,j,k) = max{ F(i-1,j-1,k-1)+S(xi, xj, xk),

F(i-1,j-1,k )+S(xi, xj, - ),

F(i-1,j ,k-1)+S(xi, -, xk),

F(i-1,j ,k )+S(xi, -, - ),

F(i ,j-1,k-1)+S( -, xj, xk),

F(i ,j-1,k )+S( -, xj, -),

F(i ,j ,k-1)+S( -, -, xk) }

Multidimensional Dynamic Programming
  • L = length of each sequence
  • N = number of sequences
  • Size of matrix = $L^N$
  • Neighbors per cell = $2^N - 1$
  • Running time = $O(2^N L^N)$
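
For three sequences the seven-neighbor recurrence can be written compactly by enumerating the non-empty subsets of sequences that advance in a step. This is only a sketch: the column score S is assumed here to be a sum-of-pairs score with a linear gap penalty, and the names `column_score` and `align3_score` are hypothetical.

```python
from itertools import combinations, product


def column_score(column, m=1, s=1, d=1):
    """Assumed column score: sum over all pairs of symbols in the column."""
    total = 0
    for a, b in combinations(column, 2):
        if a == "-" and b == "-":
            continue
        if a == "-" or b == "-":
            total -= d
        elif a == b:
            total += m
        else:
            total -= s
    return total


def align3_score(x, y, z, **params):
    """Optimal score F(|x|, |y|, |z|) of a three-way alignment."""
    seqs = (x, y, z)
    # The 2^3 - 1 = 7 neighbors: every non-empty choice of sequences that advance.
    moves = [mv for mv in product((0, 1), repeat=3) if any(mv)]
    F = {(0, 0, 0): 0}
    for cell in product(*(range(len(q) + 1) for q in seqs)):
        if cell == (0, 0, 0):
            continue
        best = float("-inf")
        for mv in moves:
            prev = tuple(c - step for c, step in zip(cell, mv))
            if min(prev) < 0:
                continue
            column = tuple(q[p] if step else "-" for q, p, step in zip(seqs, prev, mv))
            best = max(best, F[prev] + column_score(column, **params))
        F[cell] = best
    return F[tuple(len(q) for q in seqs)]


print(align3_score("ACG", "ACG", "ACG"))  # 3 columns, 3 matching pairs each -> 9
```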
Progressive Alignment
  • Align two of the sequences xi and xj
  • Fix that alignment
  • Align a third sequence/alignment to the alignment xixj
  • Repeat until all sequences are aligned
Progressive Alignment
  • When evolutionary tree is known:
  • Align closest first, in order of the tree:
    • Align (x, y)
    • Align (w, z)
    • Align (xy, wz)

[Figure: evolutionary tree with leaves x, y, z, w]
Sequence vs Alignment
  • Score at each entry adds score of aligning the column in y to the column in the alignment xz

[Figure: dynamic programming matrix with the alignment xz (x1 … xM, z1 … zL) along one axis and y1 … yN along the other]

Example
  • ith letter of y: A
  • jth column of xz: - (in x) over A (in z)

$$D(i, j) = \max \begin{cases} D(i-1, j-1) - d + s(A, A) \\ D(i-1, j) - d - d \\ D(i, j-1) + 0 - d \end{cases}$$
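
The example can be generalized into a sequence-versus-alignment recurrence in which each step adds the score of one letter (or a gap) of y against every row of an alignment column. The sketch below assumes a linear gap penalty, a gap-versus-gap score of 0, and parameter values m = s = d = 1; the function names are hypothetical, and only the score between y and the rows of the alignment is counted.

```python
def pair_score(a, b, m=1, s=1, d=1):
    """Score of one symbol against another; two gaps score 0."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -d
    return m if a == b else -s


def seq_vs_alignment_score(y, rows, m=1, s=1, d=1):
    """Best score of aligning sequence y to a fixed multiple alignment given as rows."""
    n_cols = len(rows[0])
    columns = [[row[j] for row in rows] for j in range(n_cols)]
    gap_column = ["-"] * len(rows)

    def letter_vs_column(letter, column):
        return sum(pair_score(letter, c, m, s, d) for c in column)

    D = [[0] * (n_cols + 1) for _ in range(len(y) + 1)]
    for i in range(1, len(y) + 1):
        D[i][0] = D[i - 1][0] + letter_vs_column(y[i - 1], gap_column)
    for j in range(1, n_cols + 1):
        D[0][j] = D[0][j - 1] + letter_vs_column("-", columns[j - 1])
    for i in range(1, len(y) + 1):
        for j in range(1, n_cols + 1):
            D[i][j] = max(
                D[i - 1][j - 1] + letter_vs_column(y[i - 1], columns[j - 1]),
                D[i - 1][j] + letter_vs_column(y[i - 1], gap_column),
                D[i][j - 1] + letter_vs_column("-", columns[j - 1]),
            )
    return D[len(y)][n_cols]


# The slide's example: letter A of y against the column (-, A).
print(seq_vs_alignment_score("A", ["-", "A"]))  # -d + s(A, A) = 0
```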

Affine Gaps
  • ith letter of y matched with jth column of xz
  • (j-1)th column of xz gapped

y: - A

x: - -

z: A A

  • This induces the yx alignment:

y: - A

x: - -

Affine Gaps
  • Recall for pairwise alignment, there were three cases when determining whether a gap starts or continues a gap:

- x              z x              z x
y -              y -              - -
Starts new gap   Starts new gap   Continues old gap

  • When aligning a sequence and an alignment, a fourth case arises:

- x
- -
Starts or continues a gap???
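
The ambiguity of the fourth case can be seen by listing where gaps start in the pairwise alignment induced by two rows. In the sketch below (function name `gap_starts` is hypothetical), the last two columns of both examples are identical, yet the final column starts a gap in one and continues a gap in the other, because the answer depends on columns further back.

```python
def gap_starts(row_a, row_b):
    """Column indices at which a gap starts in the pairwise alignment induced by two rows."""
    starts = []
    prev = "letters"
    for col, (a, b) in enumerate(zip(row_a, row_b)):
        if a == "-" and b == "-":
            continue  # column vanishes from the induced alignment
        if a == "-":
            state = "gap_in_a"
        elif b == "-":
            state = "gap_in_b"
        else:
            state = "letters"
        if state != "letters" and state != prev:
            starts.append(col)  # a new gap opens here
        prev = state
    return starts


print(gap_starts("A-G", "A--"))  # [2]: the last column starts a gap
print(gap_starts("C-G", "---"))  # [0]: the last column continues the gap opened at column 0
```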

Aligning Alignments
John D. Kececioglu and Weiqing Zhang, 1998
  • Optimistic and pessimistic gap counts for sequence vs alignment
  • Exact gap counts for sequence vs alignment
Sequence vs Alignment
  • A = a1 … am is a sequence of length m
  • B is a multiple alignment of length n of k sequences
    • represented by a k x n matrix
    • each entry bij is either a letter or gap
Optimistic and Pessimistic Gap Counts
  • When we have

- x

- -

  • Optimistic gap count assumes that this continues a previous gap
  • Pessimistic gap count assumes this starts a new gap
  • Running time = O(kmn)
Exact Gap Counts
  • Recall matrices:

D(i, j) = score of alignment a1…ai to b1…bj if ai aligns to bj

H(i, j) = score of alignment a1…ai to b1…bj if bj aligns to a gap

V(i, j) = score of alignment a1…ai to b1…bj if ai aligns to a gap

  • Only ways to get

- x
- -

are the cases HH, HV, and HD, generalized as HX

Exact Gap Counts
  • Three possibilities:
    • … DH…HX
    • … VH…HX
    • H………HX
  • Is bij the first character in its row encountered during the run?
  • Algorithm with lots of matrices runs in $O(kn^2 + kmn + mn^2)$

Comparison
  • Sequence vs Alignment: only three types of paths can cause

… - - x
… - - -

  • Alignment vs Alignment: any path can cause

… - - x
… - - -

Aligning Alignments Exactly
John Kececioglu and Dean Starrett, 2003
  • Aligning two alignments is NP-complete
  • Exact algorithm
  • Time and space complexity
  • Pruning
  • Results
NP-Completeness
  • Reduction from the Maximum Cut Problem
  • Still NP-complete if:
    • Strings are of length at most 5
    • Every row has at most 3 gaps
    • At most 1 gap in the interior of each string
Exact Algorithm
  • Sufficient to know relative order of the rightmost element in the row for each pair:

x: - A

y: - -

  • If x’s rightmost element is to the right of y’s rightmost element, this is an extension
  • Otherwise, it is a startup
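
The rule above can be sketched as a comparison of the positions of the rightmost letters in the two rows seen so far, just before a column that places a letter in x over a gap in y. The function names are hypothetical and `-` is assumed as the gap character.

```python
def rightmost_letter(row):
    """Index of the last non-gap character in row, or -1 if the row holds only gaps."""
    for idx in range(len(row) - 1, -1, -1):
        if row[idx] != "-":
            return idx
    return -1


def classify_new_gap(x_prefix, y_prefix):
    """Classify the gap created in y when the next column has a letter in x and a gap in y."""
    if rightmost_letter(x_prefix) > rightmost_letter(y_prefix):
        return "extension"  # y is already inside a gap opposite letters of x
    return "startup"


print(classify_new_gap("AC-", "A--"))  # extension
print(classify_new_gap("A-", "A-"))    # startup
```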
Shapes

A: -AGGCTATCACCTGACCTCCAGG

B: TAG-CTATCAC--GACCGC----

C: CAG-CTATCAC--GACCGC----

D: CAGCCTATCACC-GAACGCCA--

  • S1 = {B, C}
  • S2 = {D}
  • S3 = {A}
  • S = (S1, S2, S3)
Shapes
  • A shape s for an alignment with k rows is an ordered partition s =(s1, s2, … , sp) where 1 <= p <= k
  • If we know s, we know for each gap whether it starts or continues a gap
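
One way to read the shape of an alignment off its rows is to group rows by the column of their rightmost letter and order the groups from left to right. This ordering is an assumption inferred from the four-row example (S1 = {B, C}, S2 = {D}, S3 = {A}); the function names are hypothetical.

```python
def rightmost_letter(row):
    """Index of the last non-gap character in row, or -1 if the row holds only gaps."""
    for idx in range(len(row) - 1, -1, -1):
        if row[idx] != "-":
            return idx
    return -1


def shape_of(alignment):
    """Ordered partition of row names, grouped by the column of each row's rightmost letter."""
    groups = {}
    for name, row in alignment.items():
        groups.setdefault(rightmost_letter(row), set()).add(name)
    return [groups[pos] for pos in sorted(groups)]


alignment = {
    "A": "-AGGCTATCACCTGACCTCCAGG",
    "B": "TAG-CTATCAC--GACCGC----",
    "C": "CAG-CTATCAC--GACCGC----",
    "D": "CAGCCTATCACC-GAACGCCA--",
}
print(shape_of(alignment))
```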
Exact Algorithm
  • A is a k x m multiple alignment
  • B is an l x n multiple alignment
  • C(i, j, s) = cost of an optimal alignment of a1…ai and b1…bj ending in shape s
  • Instead of entries (i, j, s), think of entries (i, j), each with a shape list L(i, j)
Exact Algorithm
  • For each s in L(i, j):
    • For each next-entry (i, j+1), (i+1, j), and (i+1, j+1)
      • Add resulting shape t to next-entry’s shape list.
  • Find s in L(m, n) that minimizes C(m, n, s) to find optimum cost
Time and Space Complexity
  • Time = $O((3 + \sqrt{2})^k (n-k)^2 k^{3/2})$, if k < n
  • Time = $O((3 + \sqrt{2})^n k^2 n^{-1/2})$, if k >= n
  • Space = Time / k

Pruning
  • Dominance Pruning - uses a dominance relation on pairs of shapes
  • Bound Pruning - exploits upper and lower bounds on the cost of an optimal alignment
  • Combining these yields fastest exact algorithm in practice
Dominance Pruning
  • Extension - a series of insertions, deletions, and substitutions of columns that extend the alignment into an entry
  • Shape s dominates shape t if, for all extensions p, $C(s \circ p) \le C(t \circ p)$
  • s is at least as good as t on all extensions

Bound Pruning
  • L(s) - lower bound on $C(s \circ p)$ for all p
    • Optimistic algorithm on reverse of input
  • U - upper bound on the cost of the optimal alignment of A and B
    • Minimum of optimistic, pessimistic, and trivial alignment scores
  • If L(s) > U, remove s

Reducing the Space
  • Exact Algorithm with dominance pruning can be run in linear space in the number of columns of the input, without increasing the time complexity
  • Not possible with bound pruning, which uses quadratic-size tables to look up lower bounds
Results
  • Tractable in practice
  • Ceiling phenomenon - number of shapes does not grow once the number of rows exceeds a threshold