1 / 38

# Class 4: Sequence Alignment II Gaps, Heuristic Search - PowerPoint PPT Presentation

Class 4: Sequence Alignment II Gaps, Heuristic Search. Alignment with Gaps – Example. 1. 2. Gaps. Both alignments have the same number of matches and spaces but alignment 2 seems better Definition : A gap is any maximal, consecutive run of spaces in a single string.

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

## PowerPoint Slideshow about ' Class 4: Sequence Alignment II Gaps, Heuristic Search' - brenden-tanner

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

### Class 4: Sequence Alignment IIGaps, Heuristic Search

.

• Both alignments have the same number of matches and spaces but alignment 2 seems better

• Definition: A gap is any maximal, consecutive run of spaces in a single string.

• The length of a gap = the number of spaces in it

• Example 1 has 11 gaps, example 2 only 2 gaps

• Idea: develop alignment scores that take gaps (not spaces) into account

• Number of mutational events:

• A single gap – due to a single event that removed a number of residues

• Each gap – due to distinct, independent event

• Protein structure:

• Protein secondary structure consists of alpha helices, beta sheets and loops

• Loops of varying size can lead to very similar structure

• cDNA is the sequence after splicing (introns have been removed) and editing

• We expect regions of high similarity, separated by long gaps

Constant model

• Gives each gap a constant score, spaces are free

• Maximize:

• Time: O(mn)

• Works well with cDNA matching

Affine model

• Penalty for starting a gap + penalty for each additional space

• Each gap costs: Wg + qWs

• Maximize:

• Time: O(mn)

• Widely used

Convex model

• Each extra space contributes less penalty

• Gap function is convex in its length

• Example: Ws + log(q)

• Time O(mnlogm)

• A better model of biology

General model

• The weight of a gap is some arbitrary w(q)

• Time O(mn2 + nm2)

1

Score: -6

2

Score: -6

Scoring Parameters

Match: +1

Indel: -2

1

Score: -6

2

Score: 12

Scoring Parameters

Match: +1

Open gap: -2

1

Score: -17

2

Score: 1

Scoring Parameters

Match: +1

Open gap: -2, each space: -1

1

Score: -6

2

Score: ~7

Scoring Parameters

Match: +1

Open gap: -2, gap length: -logn

We divide all possible alignments of the prefixes s[1..i] and t[1..j] into 3 types

s: i

t: j

s: i-----

t: j

s: i

t: j-----

Recurrence relations:

Initial condition:

Optimal alignment:

Complexity:

Time: O(mn)

Space: O(mn)

Wg+Ws

Ws

A

S(i,j)

S(i,j)

Wg+Ws

S(i,j)

C

Ws

Affine Weight Model

This model has a natural explanation as a finite state automata

• One of the major uses of alignments is to find sequences in a “database”

• Such collections contain massive number of sequences (order of 106)

• Finding homologies in these databases with the standard dynamic programming can take too long

• Example:

• query protein : 232 AAs

• NR protein DB: 2.7 million sequences; 748 million AAs

• m*n = ~ 1.7 *1011cells !

• Instead, most searches rely on heuristic procedures

• These are not guaranteed to find the best match

• Sometimes, they will completely miss a high-scoring match

• We now describe the main ideas used by some of these procedures

• Actual implementations often contain additional tricks and hacks

• The main resource consuming factor in the standard DP is decision of where the gaps are. If there were no gaps, life was easy!

• Almost all heuristic search procedures are based on the observation that real-life well-matching pairs of sequences often do contain long strings with gap-less matches.

• These heuristics try to find significant local gap-less matches and then extend them.

• Suppose that we have two strings s[1..n] and t[1..m] such that nm

• If the optimal global alignment of s and t has few gaps, then path of the alignment will be close to the diagonal

s

t

• To find such a path, it suffices to search in a diagonal region of the matrix

• If the diagonal band has presumed width a, then the dynamic programming step takes O(an)

• Much faster than O(n2) of standard DP in this case

s

a

t

Problem (for local alignment):

• If we know that t[i..j] matches the query s[p..q], then we can use banded DP to evaluate quality of the match

• However, we do not know i,j,p,q !

• How do we select which sub-sequences to align using banded DP?

• Main idea:

Find (fast!) “good” diagonals and extend them to complete matches

• Suppose that we have a relatively long gap-less local match (diagonal):

…AGCGCCATGGATTGAGCGA…

…TGCGACATTGATCGACCTA…

• Can we find “clues” that will let us find it quickly?

t

Signature of a Match

Assumption: good matches contain several “patches” of perfect matches

AGCGCCATGGATTGAGCGA

TGCGACATTGATCGACCTA

• Given s and t, and a parameter k

• Find all pairs (i,j) such that s[i..i+k] and t[j..j+k] match perfectly

• Locate sets of pairs that are on the same diagonal by sorting according to i-j thus…

• Locating diagonals that contain

many close pairs.

• This is faster than O(nm) !

s

i i+k

j

j+k

t

• Extend the “best” diagonal matches to imperfect (yet ungapped) matches, compute alignment scores per diagonal. Pick the best-scoring matches.

• Try to combine close diagonals to potential gapped matches, picking the best-scoring matches.

• Finally, run banded DP on the regions containing these matches, resulting in several good candidate alignments.

• Most applications of FASTA use very small k(2 for proteins, and 4-6 for DNA)

• FASTA drawback is its reliance on perfect matches

• BLAST (Basic Local Alignment Search Tool) uses similar intuition, but relies on high scoringmatches rather than exact matches

• Given parameters: length k, and threshold T

• Two strings s and t of length k are a high scoring pair (HSP) if d(s,t) > T

• Given a query string s, BLAST construct all words w (“neighborhood words”), such that w is an HSP with a k-substring of s.

• Note: not all k-mers have an HSP in s

• Phase 1: compile a list of word pairs (k=3) above threshold T

• Example: for the following query:

…FSGTWYA… (query word is in green)

• A list of words (k=3) is:

• FSG SGT GTW TWY WYA

• YSG TGT ATW SWY WFA

• FTG SVT GSW TWF WYS

scores

GTW 6,5,11 22

neighborhood ASW 6,1,11 18

word hits ATW 0,5,11 16

> threshold NTW 0,5,11 16

GTY 6,5,2 13

GNW 10

neighborhood GAW 9

word hits

below threshold

(T=11)

• Search the database for perfect matches with neighborhoodwords. Those are “hits” for further alignment.

• We can locate seed words in a large database in a single pass, given the database is properly preprocessed (using hashing techniques).

t

Extending Potential Matches

• Once a hit is found, BLAST attempts to find a local alignment that extends it.

• Seeds on the same diagonal tend to be combined (as in FASTA)

• An improvement: look for 2 HSPs on close diagonals

• Extend the alignment between them

• Fewer extensions considered

• There is a version of BLAST,

involving gapped

extensions.

• Generally faster then FASTA,

arguably better.

s

t

• blastn (nucleotide BLAST)

• blastp (protein BLAST)

• tblastn (protein query, translated DB BLAST)

• blastx (translated query, protein DB BLAST)

• tblastx (translated query, translated DB BLAST)

• bl2seq (pairwise alignment)

• Today, most of the biological information can be freely accessed on the web.

• One can:

• Search for information on a known gene

• Check if a sequence exists in a database

• Find a homologous protein, helping us guess:

• Structure

• Function

• Important gateways:

• National Center for Biotechnology (GenBank)

• http://www.ncbi.nlm.nih.gov/

• European Bioinformatics Institue (EMBL-Bank)

• http://www.ebi.ac.uk/

• Expert Protein Analysis System (SwissProt)

• http://www.expasy.org/

→ Different tools and DBs to allow biologists a rich suite of queries

• Nucleotide DBs (GenBank, EMBL-Bank):

• Contain any and every type of DNA fragment:

• Full cDNA, ESTs, repeats, fragments

• “Dirty” and redundant

• Protein DBs (SwissProt):

• Contain amino-acid sequences for full proteins

• High quality, strict screening process

• Lots of annotated information on each protein