Sequence Alignment Tutorial #3


### Sequence Alignment Tutorial #3

© Ydo Wexler & Dan Geiger

Sequence Alignment (Reminder)

Global Alignment:

Input: two sequences S1, S2 over the same alphabet

Output: two sequences S’1, S’2 of equal length

(S’1, S’2 are S1, S2 with possibly additional gaps)

Example:

• S1= GCGCATGGATTGAGCGA
• S2= TGCGCCATTGATGACC
• A possible alignment:

S’1=-GCGC-ATGGATTGAGCGA

S’2=TGCGCCATTGAT-GACC--

Goal: Measure how similar the two sequences S1 and S2 are
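The output conditions above can be checked mechanically. The following is a minimal sketch; the function name and the rule that no column may hold two gaps are assumptions for illustration, not part of the slides.

```python
def is_global_alignment(a1, a2, s1, s2, gap="-"):
    """Check that (a1, a2) is a gapped version of (s1, s2) of equal length."""
    if len(a1) != len(a2):
        return False
    # Removing the gaps must give back the original sequences.
    if a1.replace(gap, "") != s1 or a2.replace(gap, "") != s2:
        return False
    # Assumption: a column holding two gaps is not allowed in a pairwise alignment.
    return all(not (x == gap and y == gap) for x, y in zip(a1, a2))


# The example from the slide.
s1 = "GCGCATGGATTGAGCGA"
s2 = "TGCGCCATTGATGACC"
a1 = "-GCGC-ATGGATTGAGCGA"
a2 = "TGCGCCATTGAT-GACC--"
print(is_global_alignment(a1, a2, s1, s2))
```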

Sequence Alignment (Reminder)

Local Alignment:

Input: two sequences S1, S2 over the same alphabet

Output: two sequences S’1, S’2 of equal length

(S’1, S’2 are substrings of S1, S2 with possibly additional gaps)

Example:

• S1=GCGCATGGATTGAGCGA
• S2=TGCGCCATTGATGACC
• A possible alignment:

S’1=ATTGA-G

S’2=ATTGATG

Goal: Find the pair of substrings in two input sequences which have the highest similarity
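One common way to reach this goal is a Smith-Waterman style dynamic program. The sketch below computes only the best local score; the scoring values (match +1, mismatch −1, indel −1) are assumptions chosen for illustration, since this slide does not fix a scheme.

```python
def local_alignment_score(s, t, match=1, mismatch=-1, indel=-1):
    """Best score over all pairs of substrings of s and t (assumed scoring values)."""
    best = 0
    prev = [0] * (len(t) + 1)
    for i in range(1, len(s) + 1):
        cur = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            # The 0 option lets a local alignment start fresh at any cell.
            cur[j] = max(0,
                         prev[j - 1] + sub,   # align s[i-1] with t[j-1]
                         prev[j] + indel,     # s[i-1] against a gap
                         cur[j - 1] + indel)  # t[j-1] against a gap
            best = max(best, cur[j])
        prev = cur
    return best


print(local_alignment_score("GCGCATGGATTGAGCGA", "TGCGCCATTGATGACC"))
```

Under these assumed values the alignment shown on the slide (six matches, one indel) scores 5, so the optimum returned for the two example sequences is at least 5.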

Sequence Alignment (Reminder)

-GCGC-ATGGATTGAGCGA

TGCGCCATTGAT-GACC-A

Three elements:

• Perfect matches
• Mismatches
• Insertions & deletions (indel)
• Each position is scored independently
• The score of an alignment is the sum of its position scores
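
Position-wise scoring can be sketched in a few lines. The default values below (match +1, mismatch −1, indel −2) are hypothetical; only the rule "sum of independent position scores" comes from the slide.

```python
def score_alignment(row1, row2, match=1, mismatch=-1, indel=-2, gap="-"):
    """Sum of independent per-column scores (default values are assumptions)."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for x, y in zip(row1, row2):
        if x == gap or y == gap:
            total += indel      # insertion or deletion
        elif x == y:
            total += match      # perfect match
        else:
            total += mismatch   # mismatch
    return total


row1 = "-GCGC-ATGGATTGAGCGA"
row2 = "TGCGCCATTGAT-GACC-A"
print(score_alignment(row1, row2))
```

The two rows are the ones shown on the slide: 13 matches, 2 mismatches and 4 indels.
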
Breaking Number
• Input: Two sequences M, E over the same alphabet (|M|≥|E|)
• Output: The smallest k such that there exist partitions

M=M1M2…Mk , E=E1E2…Ek

where Ei is a substring of Mi for all i = 1..k.

If no such k exists, then return ∞.

Example:

M=AAAATTTAAATTTA

E=AATTATA

M1=AAAATTT M2=AAATTT M3=A

E1=AATT E2=AT E3=A

AAAATTTAAATTTA

--AATT---AT--A

Find an O(|M||E|) algorithm for finding the breaking number of M,E.


Affine gap penalty

Breaking Number (cont)
• Solution: Reduce the problem to global alignment with modifications:
• Do not allow mismatches
• Do not allow gaps in M
• No penalty for gaps at the start/end of the sequence
• Constant penalty for gaps (regardless of their length)
• Scoring scheme:
• Match: 0
• Mismatch: -∞
• Gap introduction: -1
• Gap elongation: 0

AAAATTTAAATTTA

--AATT---AT--A

breaking number = -(score of the optimal alignment) + 1.
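
The reduction can be sketched directly as a dynamic program that counts blocks instead of negative gap scores (blocks = −score + 1). This is an illustrative sketch, not the authors' code: the function name, the two tables `match` and `done`, and the requirement that E is non-empty are assumptions.

```python
import math


def breaking_number(M, E):
    """Smallest k such that E splits into k blocks, each a substring of the
    corresponding block of M; math.inf if no such k exists.
    Assumption: E is non-empty."""
    if not E:
        raise ValueError("E must be non-empty")
    m, n = len(M), len(E)
    INF = math.inf
    # match[i][j]: fewest blocks placing E[:j] in M[:i] with E[j-1] on M[i-1]
    # done[i][j]:  fewest blocks placing E[:j] anywhere in M[:i]
    match = [[INF] * (n + 1) for _ in range(m + 1)]
    done = [[INF] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if M[i - 1] == E[j - 1]:          # mismatches are forbidden
                if j == 1:
                    match[i][j] = 1           # leading gap is free
                else:
                    match[i][j] = min(
                        match[i - 1][j - 1],      # extend the current block
                        done[i - 1][j - 1] + 1)   # open a new block after a gap
            # Skipping M[i-1] costs nothing extra (gap elongation is 0).
            done[i][j] = min(done[i - 1][j], match[i][j])
    return done[m][n]


print(breaking_number("AAAATTTAAATTTA", "AATTATA"))  # → 3
```

Both tables have (|M|+1)(|E|+1) entries and each entry is filled in constant time, which matches the O(|M||E|) bound asked for.
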

Breaking Number (cont)
• Complexity: Standard O(|M||E|) dynamic programming
• Correctness: Two-way argument
• (1) An alignment of score -(k-1) corresponds to a partition of M, E into k blocks
• (2) A partition of M, E into k blocks has an alignment of score -(k-1)
• Optimal alignment has score -∞ ⇒ there is no valid partition, by (2)
• Optimal alignment has score -k ⇒
• There is a valid partition into k+1 blocks, by (1)
• There is no valid partition into fewer blocks, by (2)

Multiple Sequence Alignment

S1=AGGTC

S2=GTTCG

S3=TGAAC

[Figure: two possible alignments of S1, S2, S3]

Multiple Sequence Alignment (cont)
• Input: Sequences S1, S2,…, Sk over the same alphabet
• Output: Gapped sequences S’1, S’2,…, S’k of equal length
• |S’1|= |S’2|=…= |S’k|
• Removing the spaces (gaps) from S’i yields Si

Sum-of-pairs (SP) score for a multiple global alignment is the sum of scores of all pairwise alignments induced by it.

Multiple Sequence Alignment Example

Consider the following alignment:

AC-CDB-

Scoring scheme: match: 0, mismatch/indel: -1

SP score: (-4) + (-3) + (-5) = -12
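
The SP computation can be sketched as follows. Only the first row, the scoring scheme and the three pairwise scores appear in the text above; the second and third rows in the code are assumptions chosen so that the pairwise scores come out as -3, -4 and -5. A column with two gaps scores 0, in line with δ(-,-) = 0 used later in the deck.

```python
from itertools import combinations


def pair_score(r1, r2, match=0, mismatch=-1, indel=-1, gap="-"):
    """Score of the pairwise alignment induced by two rows of a multiple alignment."""
    total = 0
    for x, y in zip(r1, r2):
        if x == gap and y == gap:
            continue            # gap against gap costs nothing
        if x == gap or y == gap:
            total += indel
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


def sp_score(rows):
    """Sum-of-pairs score: add up the induced score of every pair of rows."""
    return sum(pair_score(a, b) for a, b in combinations(rows, 2))


rows = ["AC-CDB-",   # row from the slide
        "-C-ADBD",   # assumed row
        "A-BCDAD"]   # assumed row
print(sp_score(rows))  # → -12
```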

Multiple Sequence Alignment Complexity

Given k strings of length n, there is a generalization of the DP algorithm that finds an optimal SP alignment:

• Instead of a 2-dimensional table we have a k-dimensional table
• Each dimension is of length n+1
• Each entry depends on 2^k - 1 adjacent entries

Complexity: O(2^k · n^k)

This problem is known to be NP-hard (no polynomial-time algorithm is known)
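
To see where the factor 2^k - 1 comes from: in each dimension the previous cell either steps back by one (that string contributes a character to the column) or stays put (it contributes a gap), and the all-gap column is excluded. The helper names below are hypothetical and serve only to illustrate the count.

```python
from itertools import product


def predecessor_offsets(k):
    """All non-zero 0/1 step vectors in k dimensions: 2**k - 1 of them."""
    return [d for d in product((0, 1), repeat=k) if any(d)]


def predecessors(cell):
    """Table entries that the given k-dimensional cell depends on."""
    result = []
    for d in predecessor_offsets(len(cell)):
        prev = tuple(c - step for c, step in zip(cell, d))
        if min(prev) >= 0:      # stay inside the table
            result.append(prev)
    return result


print(len(predecessor_offsets(3)))  # → 7
print(predecessors((1, 1)))
```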

Multiple Sequence Alignment Approximation Algorithm

We use cost instead of score

⇒ Find an alignment of minimal cost

Assumption: the cost function δ is a distance function

• δ(x,x) = 0
• δ(x,y) = δ(y,x) ≥ 0
• δ(x,y) + δ(y,z) ≥ δ(x,z) (triangle inequality)

(e.g. cost of a mismatch ≤ cost of two indels)

D(S,T) - cost of minimum global alignment between S and T

Multiple Sequence Alignment Approximation Algorithm

The ‘star’ algorithm:

Input: Γ - set of k strings S1,…,Sk.

• Find the string S’ (center) that minimizes Σ_{S∈Γ} D(S’,S)
• Denote S1=S’ and the rest of the strings as S2,…,Sk
• Iteratively add S2,…,Sk to the alignment as follows:
• Suppose S1,…,Si-1 are already aligned as S’1,…,S’i-1
• Align Si to S’1 to produce S’i and S’’1, aligned to each other
• Add a gap to S’2,…,S’i-1 in every column where S’’1 has a new gap
• Replace S’1 by S’’1
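
The steps above can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: δ is taken to be unit cost (0 for equal symbols, including gap against gap, 1 otherwise), and the names `align`, `spread` and `star_align` are hypothetical.

```python
from itertools import combinations

GAP = "-"


def delta(x, y):
    """Assumed unit cost: 0 for equal symbols (including gap/gap), 1 otherwise."""
    return 0 if x == y else 1


def align(s, t):
    """Minimum-cost global alignment; returns (cost, gapped s, gapped t)."""
    m, n = len(s), len(t)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = D[i - 1][0] + delta(s[i - 1], GAP)
    for j in range(1, n + 1):
        D[0][j] = D[0][j - 1] + delta(GAP, t[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(D[i - 1][j - 1] + delta(s[i - 1], t[j - 1]),
                          D[i - 1][j] + delta(s[i - 1], GAP),
                          D[i][j - 1] + delta(GAP, t[j - 1]))
    a, b, i, j = [], [], m, n
    while i > 0 or j > 0:       # trace one optimal path back to (0, 0)
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + delta(s[i - 1], t[j - 1]):
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + delta(s[i - 1], GAP):
            a.append(s[i - 1]); b.append(GAP); i -= 1
        else:
            a.append(GAP); b.append(t[j - 1]); j -= 1
    return D[m][n], "".join(reversed(a)), "".join(reversed(b))


def spread(row, old_center, new_center):
    """Add a gap to row in every column that new_center gained over old_center."""
    out, p = [], 0
    for ch in new_center:
        if p < len(old_center) and ch == old_center[p]:
            out.append(row[p]); p += 1      # existing column
        else:
            out.append(GAP)                 # newly inserted gap column
    return "".join(out)


def star_align(strings):
    k = len(strings)
    dist = [[0] * k for _ in range(k)]
    for i, j in combinations(range(k), 2):
        dist[i][j] = dist[j][i] = align(strings[i], strings[j])[0]
    c = min(range(k), key=lambda i: sum(dist[i]))   # center string
    center, aligned = strings[c], {}
    for i in range(k):
        if i == c:
            continue
        _, new_center, row = align(center, strings[i])
        aligned = {j: spread(r, center, new_center) for j, r in aligned.items()}
        aligned[i], center = row, new_center
    aligned[c] = center
    return [aligned[i] for i in range(k)]


for row in star_align(["AGGTC", "GTTCG", "TGAAC"]):
    print(row)
```

Because gaps already present in the center row are kept and δ(-,-) = 0, each string stays optimally aligned to the center in the final result.
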

Multiple Sequence Alignment Approximation Algorithm

Time analysis:

• Choosing S1: execute DP for all sequence pairs, O(k^2 · n^2)
• Adding Si to the alignment: execute DP for Si, S’1, O(i·n^2)

(In the i-th stage the length of S’1 can be up to i·n)

Total complexity: O(k^2 · n^2) + Σ_{i=2..k} O(i·n^2) = O(k^2 · n^2)

Multiple Sequence Alignment Approximation Algorithm

Approximation ratio:

• M* - optimal alignment
• M - The alignment produced by this algorithm
• d(i,j) - the distance M induces on the pair Si,Sj

For all i: d(1,i)=D(S1,Si)

(we compute an optimal alignment between S’1 and Si, and δ(-,-) = 0)

Multiple Sequence Alignment Approximation Algorithm

Triangle inequality

Approximation ratio:

Definition of S1:
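
The three labels above match the usual analysis of the center-star method, which can be sketched as follows. The symbols cost(·) for the sum-of-pairs cost and d*(i,j) for the distance that M* induces on Si, Sj are assumptions introduced here, and the last line assumes cost(M*) > 0.

```latex
\begin{align*}
2\,\mathrm{cost}(M) &= \sum_{i \ne j} d(i,j)
  \le \sum_{i \ne j} \bigl[ d(1,i) + d(1,j) \bigr]
  && \text{(triangle inequality)} \\
 &= 2(k-1) \sum_{i=2}^{k} D(S_1,S_i)
  && \text{(since } d(1,i) = D(S_1,S_i) \text{)} \\
2\,\mathrm{cost}(M^*) &= \sum_{i \ne j} d^*(i,j)
  \ge \sum_{i \ne j} D(S_i,S_j) \\
 &\ge k \sum_{i=2}^{k} D(S_1,S_i)
  && \text{(definition of } S_1 \text{)} \\
\frac{\mathrm{cost}(M)}{\mathrm{cost}(M^*)} &\le \frac{2(k-1)}{k} = 2 - \frac{2}{k}
\end{align*}
```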