# Sequence Alignment Tutorial 2


### Sequence Alignment Tutorial #2

© Ydo Wexler & Dan Geiger

Much of bioinformatics involves sequences

• DNA sequences

• RNA sequences

• Protein sequences

We can think of these sequences as strings of letters

• DNA & RNA: |alphabet|=4

• Protein: |alphabet|=20

Input: two sequences over the same alphabet

Output: an alignment of the two sequences

Example:

• GCGCATGGATTGAGCGA and TGCGCCATTGATGACCA

• A possible alignment:

-GCGC-ATGGATTGAGCGA

TGCGCCATTGAT-GACC-A

Best biological explanation

Biological data

Global Alignment

-GCGC-ATGGATTGAGCGA

TGCGCCATTGAT-GACC-A

Three elements:

• Perfect matches

• Mismatches

• Insertions & deletions (indel)

Example (cont):

Symmetric view of evolution

Global Alignment scoring scheme

Score each position independently:

• Match: +1

• Mismatch: -1

• Indel: -2

Score of an alignment is sum of position scores

Example:

-GCGC-ATGGATTGAGCGA

TGCGCCATTGAT-GACC-A

Score: (+1x13) + (-1x2) + (-2x4) = 3

------GCGCATGGATTGAGCGA

TGCGCC----ATTGATGACCA--

Score: (+1x5) + (-1x6) + (-2x12) = -25
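The position-by-position scoring above can be sketched in a few lines of Python. This is only an illustration: the function name `alignment_score` and its keyword arguments are hypothetical, the defaults are the Match/Mismatch/Indel values of this scoring scheme, and `-` is assumed to mark an indel.

```python
def alignment_score(row_s, row_t, match=1, mismatch=-1, indel=-2):
    """Sum the per-position scores of two equal-length alignment rows."""
    if len(row_s) != len(row_t):
        raise ValueError("alignment rows must have the same length")
    score = 0
    for a, b in zip(row_s, row_t):
        if a == "-" or b == "-":
            score += indel      # one letter aligned to a gap
        elif a == b:
            score += match      # perfect match
        else:
            score += mismatch   # two different letters
    return score


# First example alignment: 13 matches, 2 mismatches, 4 indels
print(alignment_score("-GCGC-ATGGATTGAGCGA", "TGCGCCATTGAT-GACC-A"))  # 3
```

Because each column is scored independently, the total is just a sum, which is the property the dynamic programming solution relies on.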

Two basic variants of sequence alignment:

• Global alignment (Needleman-Wunsch)

• Local alignment (Smith-Waterman)

Today we’ll see:

• Overlap alignment

• Affine cost for gaps

We’ll use ideas of dynamic programming presented in the lecture

Consider the following problem:

• Find the most significant overlap between two sequences S and T.

• Possible overlap relations: (a) the end of one sequence overlaps the start of the other; (b) one sequence is contained in the other.

Difference from local alignment:

Here we require alignment between the endpoints of the two sequences.

Formally:

Given S[1..n] and T[1..m], find i, j such that

d = max{ D(S[1..i], T[j..m]), D(S[i..n], T[1..j]), D(S[1..n], T[i..j]), D(S[i..j], T[1..m]) }

is maximal.

Solution: same as global alignment, except that we do not penalise overhanging ends.

• Initialization: V[i,0] = 0, V[0,j] = 0

• Recurrence: as in global alignment

• Score: the maximum value in the bottom row and the rightmost column
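The three steps above can be sketched as follows. This is a minimal illustration, not the tutorial's own code: the name `overlap_score` and its parameters are hypothetical, and the default scores and the two test sequences are taken from the overlap example used in this tutorial.

```python
def overlap_score(s, t, match=4, mismatch=-1, indel=-5):
    """Best overlap-alignment score of s and t (score only, no traceback)."""
    n, m = len(s), len(t)
    # Initialization: first row and first column are 0,
    # so leading overhangs are not penalised.
    V = [[0] * (m + 1) for _ in range(n + 1)]
    # Recurrence: identical to global alignment.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + sub,   # S[i] aligned to T[j]
                          V[i - 1][j] + indel,     # S[i] aligned to a gap
                          V[i][j - 1] + indel)     # T[j] aligned to a gap
    # Score: maximum over the bottom row and the rightmost column,
    # so trailing overhangs are not penalised either.
    return max(max(V[n]), max(row[m] for row in V))


print(overlap_score("PAWHEAE", "HEAGAWGHEE"))  # 11
```

Returning only the score keeps the sketch short; recovering the alignment itself would need a traceback from the cell that holds the maximum.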

Overlap Alignment (Example)

S = PAWHEAE

T = HEAGAWGHEE

Scoring scheme:

• Match: +4

• Mismatch: -1

• Indel: -5

Overlap Alignment (Example)

The best overlap is:

PAWHEAE------

---HEAGAWGHEE

Pay attention!

A different scoring scheme could yield a different result. For example, with

• Match: +4

• Mismatch: -1

• Indel: -2 (instead of -5)

the result could be:

---PAW-HEAE

HEAGAWGHEE-

• Observation: Insertions and deletions often occur in blocks longer than a single nucleotide.

• Consequence:

• Current scoring scheme gives a constant penalty per gap unit.

• This does not model the above phenomenon well.

Question: How do we modify the scheme to incorporate this?

• Penalty score for a gap of length g: γ(g) = -d - (g-1)e

d - penalty for introduction of a gap

e - penalty for elongating the gap by one unit.

Typically d > e
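A small sketch of how the two penalties differ, assuming the usual affine form in which a gap of length g costs d for its first unit and e for each further unit. The function names and the values d = 5, e = 1 are hypothetical; the per-unit cost of 2 mirrors the Indel: -2 scheme used earlier.

```python
def linear_gap_penalty(g, per_unit=2):
    """Constant penalty per gap unit."""
    return -per_unit * g


def affine_gap_penalty(g, d=5, e=1):
    """Affine penalty: d to open the gap, e for each additional unit."""
    if g <= 0:
        return 0
    return -(d + (g - 1) * e)


# One block of 4 gap units versus four separate single-unit gaps
print(linear_gap_penalty(4), 4 * linear_gap_penalty(1))   # -8 -8
print(affine_gap_penalty(4), 4 * affine_gap_penalty(1))   # -8 -20
```

Under the linear scheme the two cases score the same, while the affine scheme prefers the single block, which matches the observation that indels tend to occur in blocks.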

• Problem:

When aligning S[i] to a gap we do not know whether to penalize by d or e.

Solution: we compute 3 matrices simultaneously

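M(i,j) - the best score of aligning S[1..i] with T[1..j], given that S[i] is aligned to T[j]

IS(i,j) - the best score of aligning S[1..i] with T[1..j], given that S[i] is aligned to a gap

IT(i,j) - the best score of aligning S[1..i] with T[1..j], given that T[j] is aligned to a gap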

This can be obtained by using

Affine gap scores

• Initialization: depending on the problem (global, local, …)

• Recurrence: uses already known values M(i’,j’), IS(i’,j’), IT(i’,j’)
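The exact recurrences are not preserved in this transcript, so the following is only a sketch using the standard three-matrix formulation with global initialization. The function name, the default scores, and the choice to allow a gap in one sequence to be followed directly by a gap in the other (at cost d) are all assumptions.

```python
NEG_INF = float("-inf")


def affine_global_score(s, t, match=1, mismatch=-1, d=3, e=1):
    """Global alignment score with affine gaps: opening costs d, extending costs e."""
    n, m = len(s), len(t)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]   # S[i] aligned to T[j]
    IS = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # S[i] aligned to a gap
    IT = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # T[j] aligned to a gap
    # Global initialization: a leading gap of length k costs d + (k-1)e.
    M[0][0] = 0
    for i in range(1, n + 1):
        IS[i][0] = -d - (i - 1) * e
    for j in range(1, m + 1):
        IT[0][j] = -d - (j - 1) * e
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            M[i][j] = sub + max(M[i - 1][j - 1],
                                IS[i - 1][j - 1],
                                IT[i - 1][j - 1])
            IS[i][j] = max(M[i - 1][j] - d,    # open a new gap
                           IS[i - 1][j] - e,   # extend the current gap
                           IT[i - 1][j] - d)   # switch gap side (assumption)
            IT[i][j] = max(M[i][j - 1] - d,
                           IT[i][j - 1] - e,
                           IS[i][j - 1] - d)
    return max(M[n][m], IS[n][m], IT[n][m])


# "CG" is deleted as one block: +1 (A/A) - 3 (open) - 1 (extend) + 1 (T/T)
print(affine_global_score("ACGT", "AT"))  # -2
```

Keeping a separate matrix per state is what lets the recurrence know whether the previous column was already inside a gap, and hence whether to charge d or e.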

Affine gap scores an insertion.

• Simplification:

Why are two matrices enough?