Class 4: Fast Sequence Alignment

Alignment in Real Life
  • One of the major uses of alignments is to find sequences in a “database”
  • Such collections contain a massive number of sequences (on the order of 10^6)
  • Finding homologies in these databases with standard dynamic programming can take too long
  • Example:
    • query protein : 232 AAs
    • NR protein DB: 2.7 million sequences; 748 million AAs
    • m*n ≈ 1.7 × 10^11 cells!
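
The cell count in the example can be checked with a few lines of arithmetic. The throughput figure below (`cells_per_second`) is a made-up illustration, not a measurement from the slides; it only serves to turn the cell count into a rough running time.

```python
# Figures taken from the example above.
query_len = 232              # amino acids in the query protein
db_residues = 748_000_000    # amino acids in the NR protein DB

# One DP cell per (query residue, database residue) pair.
cells = query_len * db_residues
print(f"DP cells: {cells:.2e}")

# Hypothetical throughput, chosen only for illustration.
cells_per_second = 10**8
print(f"Estimated time: {cells / cells_per_second / 60:.0f} minutes")
```

Even at this assumed rate a single query takes on the order of half an hour, which is why a server handling many queries cannot rely on full DP.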
Heuristic Search
  • Instead, most searches rely on heuristic procedures
  • These are not guaranteed to find the best match
  • Sometimes, they will completely miss a high-scoring match
  • We now describe the main ideas used by some of these procedures
    • Actual implementations often contain additional tricks and hacks
Basic Intuition
  • The main resource-consuming factor in standard DP is deciding where the gaps are. If there were no gaps, life would be easy!
  • Almost all heuristic search procedures are based on the observation that real-life well-matching pairs of sequences often do contain long strings with gap-less matches.
  • These heuristics try to find significant local gap-less matches and then extend them.
Banded DP
  • Suppose that we have two strings s[1..n] and t[1..m] such that n ≈ m
  • If the optimal global alignment of s and t has few gaps, then the path of the alignment will be close to the diagonal

Banded DP
  • To find such a path, it suffices to search in a diagonal region of the matrix
  • If the diagonal band has presumed width a, then the dynamic programming step takes O(an)
  • Much faster than the O(n^2) of standard DP in this case
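
The banded idea can be sketched as follows. This is a minimal illustration, not the lecture's own implementation: the function name `banded_global_score`, the scoring values (match +1, mismatch -1, gap -2) and the convention that the band covers cells with |i - j| <= a are all assumptions made for the example. Rows are stored as small dictionaries so that only cells inside the band are ever touched.

```python
def banded_global_score(s, t, a, match=1, mismatch=-1, gap=-2):
    """Global alignment score using only DP cells with |i - j| <= a."""
    n, m = len(s), len(t)
    if abs(n - m) > a:
        raise ValueError("band too narrow to reach the final cell")
    NEG = float("-inf")  # score of any cell outside the band

    # Row 0: leading gaps in s, limited to the band.
    prev = {j: j * gap for j in range(0, min(m, a) + 1)}
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - a), min(m, i + a) + 1):
            if j == 0:
                cur[j] = i * gap  # leading gaps in t
                continue
            sub = match if s[i - 1] == t[j - 1] else mismatch
            cur[j] = max(
                prev.get(j - 1, NEG) + sub,  # align s[i] with t[j]
                prev.get(j, NEG) + gap,      # gap in t
                cur.get(j - 1, NEG) + gap,   # gap in s
            )
        prev = cur
    return prev[m]


print(banded_global_score("GATTACA", "GATACA", a=2))
```

Each row visits at most 2a + 1 cells, so the work is O(an). The result equals the full DP score only when the optimal path really stays inside the band.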

Banded DP

Problem (for local alignment):

  • If we know that t[i..j] matches the query s[p..q], then we can use banded DP to evaluate quality of the match
  • However, we do not know i,j,p,q !
  • How do we select which sub-sequences to align using banded DP?
FASTA Overview
  • Main idea: find (fast!) “good” diagonals and extend them to complete matches

  • Suppose that we have a relatively long gap-less local match (diagonal):

…AGCGCCATGGATTGAGCGA…

…TGCGACATTGATCGACCTA…

  • Can we find “clues” that will let us find it quickly?
Signature of a Match

Assumption: good matches contain several “patches” of perfect matches

AGCGCCATGGATTGAGCGA

TGCGACATTGATCGACCTA

FASTA
  • Given s and t, and a parameter k
  • Find all pairs (i,j) such that the k-long words s[i..i+k-1] and t[j..j+k-1] match perfectly
  • Locate sets of pairs that lie on the same diagonal by sorting them according to i-j
  • This identifies the diagonals that contain many close pairs
  • This is faster than O(nm)!
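
The steps above can be sketched as a lookup table of k-words followed by grouping on i - j. The function name `kmer_hits_by_diagonal` and the use of 0-based positions are choices made for this example; it groups hits with a dictionary rather than an explicit sort, which has the same effect.

```python
from collections import defaultdict


def kmer_hits_by_diagonal(s, t, k):
    """Group exact k-word matches between s and t by diagonal i - j."""
    # Index every k-word of t by its start positions.
    index = defaultdict(list)
    for j in range(len(t) - k + 1):
        index[t[j:j + k]].append(j)

    # Look up each k-word of s and file the hit under its diagonal.
    diagonals = defaultdict(list)
    for i in range(len(s) - k + 1):
        for j in index.get(s[i:i + k], ()):
            diagonals[i - j].append((i, j))
    return diagonals


s = "AGCGCCATGGATTGAGCGA"
t = "TGCGACATTGATCGACCTA"
for d, hits in sorted(kmer_hits_by_diagonal(s, t, 3).items()):
    print(d, hits)
```

Building the index is linear in the length of t, and each k-word of s costs only as many steps as it has matches, so the full n-by-m matrix is never filled in.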

FASTA
  • Extend the “best” diagonal matches to imperfect (yet ungapped) matches, compute alignment scores per diagonal. Pick the best-scoring matches.
  • Try to combine close diagonals into potential gapped matches, picking the best-scoring matches.
  • Finally, run banded DP on the regions containing these matches, resulting in several good candidate alignments.
  • Most applications of FASTA use very small k (2 for proteins, and 4-6 for DNA)
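
Scoring an imperfect but ungapped match along one diagonal amounts to finding the best-scoring contiguous run on that diagonal. The sketch below does this with a running sum that restarts whenever it drops to zero or below. The name `best_ungapped_segment` and the +1/-1 scoring are assumptions for illustration; real FASTA uses a substitution matrix and further rescoring steps.

```python
def best_ungapped_segment(s, t, d, match=1, mismatch=-1):
    """Best-scoring ungapped local match on diagonal d = i - j.

    Returns (score, i_start, i_end) with 0-based i_start and exclusive i_end.
    """
    i0 = max(0, d)          # first position of s on this diagonal
    j0 = i0 - d             # matching first position of t
    length = min(len(s) - i0, len(t) - j0)

    best = (0, i0, i0)
    run, run_start = 0, i0
    for off in range(length):
        i, j = i0 + off, j0 + off
        if run <= 0:        # a non-positive prefix never helps: restart here
            run, run_start = 0, i
        run += match if s[i] == t[j] else mismatch
        if run > best[0]:
            best = (run, run_start, i + 1)
    return best


s = "AGCGCCATGGATTGAGCGA"
t = "TGCGACATTGATCGACCTA"
print(best_ungapped_segment(s, t, 0))
```

Running this on each promising diagonal gives the per-diagonal scores from which the best candidates are picked.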
BLAST Overview
  • FASTA's drawback is its reliance on perfect matches
  • BLAST (Basic Local Alignment Search Tool) uses similar intuition, but relies on high-scoring matches rather than exact matches
  • Given parameters: length k, and threshold T
  • Two strings s and t of length k are a high scoring pair (HSP) if d(s,t) > T
High-Scoring Pair
  • Given a query string s, BLAST constructs all words w (“neighborhood words”) such that w is an HSP with a k-substring of s.
  • Note: not all k-mers have an HSP in s
BLAST: phase 1
  • Phase 1: compile a list of word pairs (k=3) scoring above threshold T
  • Example: for the following query:

…FSGTWYA… (query word is in green)

  • A list of words (k=3) is:
  • FSG SGT GTW TWY WYA
  • YSG TGT ATW SWY WFA
  • FTG SVT GSW TWF WYS
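BLAST: phase 1

Scores of neighborhood words for the query word GTW (threshold T=11):

  word   per-position scores   total
  GTW    6, 5, 11              22      above threshold
  GSW    6, 1, 11              18      above threshold
  ATW    0, 5, 11              16      above threshold
  NTW    0, 5, 11              16      above threshold
  GTY    6, 5, 2               13      above threshold
  GNW                          10      below threshold
  GAW                          9       below threshold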
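
Phase 1 can be sketched by enumerating every k-word over the alphabet and keeping those that score above T against some k-word of the query. The function `neighborhood_words` is a name invented for this example, and `toy_score` is a deliberately simple stand-in (+5 for identity, -1 otherwise), not the substitution matrix behind the scores on the slide.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(a, b):
    """Toy substitution score, for illustration only."""
    return 5 if a == b else -1


def neighborhood_words(query, k, T, score=toy_score, alphabet=AMINO_ACIDS):
    """Map each k-word scoring > T against a query k-word to its query positions."""
    neighbors = {}
    for pos in range(len(query) - k + 1):
        q = query[pos:pos + k]
        for cand in product(alphabet, repeat=k):
            total = sum(score(a, b) for a, b in zip(q, cand))
            if total > T:
                neighbors.setdefault("".join(cand), []).append(pos)
    return neighbors


words = neighborhood_words("GTW", k=3, T=8)
print(len(words), "GSW" in words)
```

With the toy scores and T=8, a word qualifies if it differs from the query word in at most one position. Exhaustive enumeration is affordable here because there are only 20^3 = 8000 protein 3-words.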

BLAST: phase 2
  • Search the database for perfect matches with neighborhood words. Those are “hits” for further alignment.
  • We can locate seed words in a large database in a single pass, provided the database is properly preprocessed (using hashing techniques).
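
One way to picture the preprocessing is a hash table from every k-word to the places where it occurs in the database. The sketch below is an assumption about the general shape of such an index, not a description of BLAST's actual data structures. The names `build_word_index` and `find_hits` are hypothetical, and the tiny database is made up.

```python
from collections import defaultdict


def build_word_index(database, k):
    """Map every k-word to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for off in range(len(seq) - k + 1):
            index[seq[off:off + k]].append((seq_id, off))
    return index


def find_hits(neighbors, index):
    """Return (query_pos, seq_id, db_offset) for each neighborhood word in the index.

    `neighbors` maps a neighborhood word to the query positions it came from.
    """
    hits = []
    for word, query_positions in neighbors.items():
        for seq_id, off in index.get(word, ()):
            for qpos in query_positions:
                hits.append((qpos, seq_id, off))
    return hits


# Made-up data for illustration.
database = {"seq1": "AAGTWAA", "seq2": "CCCC"}
index = build_word_index(database, 3)
print(find_hits({"GTW": [3], "GSW": [3]}, index))
```

The index is built once per database and reused for every query, so each query pays only for the lookups of its neighborhood words.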
Extending Potential Matches
  • Once a hit is found, BLAST attempts to find a local alignment that extends it.
  • Seeds on the same diagonal tend to be combined (as in FASTA)
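
Extending a hit without gaps can be sketched as walking outward from the seed in both directions and remembering the best score seen. The stopping rule used here, giving up once the running score falls more than `x_drop` below the best so far, is an assumption added for the example and is not stated on the slides. The name `extend_hit` and the +1/-1 scoring in the usage lines are also illustrative.

```python
def extend_hit(s, t, i, j, k, score, x_drop):
    """Extend the k-long seed s[i:i+k] / t[j:j+k] in both directions, no gaps.

    Returns (total_score, s_start, s_end) with exclusive s_end.
    """
    seed = sum(score(a, b) for a, b in zip(s[i:i + k], t[j:j + k]))

    def one_side(pairs):
        # Track the best prefix score; stop when we fall too far below it.
        best = run = best_len = 0
        for n, (a, b) in enumerate(pairs, start=1):
            run += score(a, b)
            if run > best:
                best, best_len = run, n
            elif best - run > x_drop:
                break
        return best, best_len

    right, r_len = one_side(zip(s[i + k:], t[j + k:]))
    left, l_len = one_side(zip(reversed(s[:i]), reversed(t[:j])))
    return seed + left + right, i - l_len, i + k + r_len


def unit(a, b):
    return 1 if a == b else -1


s = "AGCGCCATGGATTGAGCGA"
t = "TGCGACATTGATCGACCTA"
print(extend_hit(s, t, 5, 5, 3, unit, x_drop=2))
```

A small `x_drop` ends extensions early and saves time; a larger one tolerates longer runs of mismatches before giving up.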
Two HSP diagonal
  • An improvement: look for 2 HSPs on close diagonals
  • Extend the alignment between them
  • Fewer extensions considered
  • There is a version of BLAST involving gapped extensions.
  • Generally faster than FASTA, arguably better.
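
The two-hit idea can be sketched by grouping word hits by diagonal and keeping only pairs that lie close together. For simplicity this example requires both hits to be on exactly the same diagonal, which is narrower than the "close diagonals" wording above. The name `two_hit_seeds` and the `window` parameter are assumptions.

```python
from collections import defaultdict


def two_hit_seeds(hits, window):
    """Return (diagonal, i_first, i_second) for neighbouring hits on one diagonal.

    `hits` holds (i, j) start positions of word hits; two hits qualify when
    they share a diagonal and start at most `window` positions apart.
    """
    by_diag = defaultdict(list)
    for i, j in hits:
        by_diag[i - j].append(i)

    pairs = []
    for d, starts in by_diag.items():
        starts.sort()
        for a, b in zip(starts, starts[1:]):
            if 0 < b - a <= window:
                pairs.append((d, a, b))
    return pairs


hits = [(1, 1), (5, 5), (9, 9), (10, 6), (30, 0)]
print(two_hit_seeds(hits, window=4))
```

Only the surviving pairs trigger an extension, which is why fewer extensions are considered than when every single hit is extended.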

Blast Variants
  • blastn (nucleotide BLAST)
  • blastp (protein BLAST)
  • tblastn (protein query, translated DB BLAST)
  • blastx (translated query, protein DB BLAST)
  • tblastx (translated query, translated DB BLAST)
  • bl2seq (pairwise alignment)