Database Searches - PowerPoint PPT Presentation

Database Searches

BLAST




  • Basic Local Alignment Search Tool

    • Altschul, Gish, Miller, Myers, Lipman, J. Mol. Biol. 215 (1990)

    • Altschul, Madden, Schaffer, Zhang, Zhang, Miller, Lipman, Nucleic Acids Res. 25 (1997)

  • Main ideas:

    • Increase search speed by finding fewer, but better, hot spots during the initial screening phase

    • Use longer word sizes

    • Integrate the scoring matrix into the first phase

      • Contrast with FASTA, whose first phase requires exact word matches

BLAST Terminology

  • Segment pair: a pair of equal-length substrings, one from sequence S1 and one from sequence S2, aligned without gaps

  • Locally maximal segment pair: a segment pair whose alignment score cannot be improved by extending or shortening it

  • Maximal segment pair (MSP): the segment pair with the maximum score over all segment pairs in the sequences S1 and S2

  • High-scoring segment pair (HSP): a segment pair with a score higher than some cutoff score, s

  • w is the word-length parameter; t is the word-score threshold parameter
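
As a rough illustration of these definitions, the sketch below finds a maximal segment pair by brute force: every diagonal of the comparison is scanned for its best-scoring contiguous run. The match/mismatch scores (+5/-4) and the function name are assumptions made for the example, not values from the slides, and BLAST itself avoids this exhaustive scan.

```python
def maximal_segment_pair(s1, s2, match=5, mismatch=-4):
    """Return (score, start_in_s1, start_in_s2, length) of the best ungapped segment pair.

    Hypothetical scoring: +5 for a match, -4 for a mismatch (illustrative only).
    """
    best = (0, 0, 0, 0)
    # A diagonal d pairs s1[i] with s2[i + d]; segment pairs never leave their diagonal.
    for d in range(-(len(s1) - 1), len(s2)):
        i, j = (-d, 0) if d < 0 else (0, d)
        running = 0
        start_i, start_j = i, j
        while i < len(s1) and j < len(s2):
            if running <= 0:
                # A non-positive prefix can never help, so restart the segment here.
                running = 0
                start_i, start_j = i, j
            running += match if s1[i] == s2[j] else mismatch
            if running > best[0]:
                best = (running, start_i, start_j, i - start_i + 1)
            i += 1
            j += 1
    return best


msp = maximal_segment_pair("GGGACGTACGTTTT", "CCCACGTACGTAAA")
# An HSP is any segment pair whose score exceeds a chosen cutoff s.
is_hsp = msp[0] > 30
```

The restart-on-non-positive rule is the usual maximum-subarray trick applied per diagonal; it is used here only because it makes the definition of "cannot be improved by extending or shortening" concrete.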

BLAST Hits


  • A hit is a w-length word in the database that aligns with a word from the query sequence with score > t

  • BLAST looks for hits instead of exact matches

    • Allows word size to be kept high for speed, without sacrificing sensitivity

  • Typically, w = 3-5 for amino acids, ~11-12 for DNA

  • t is the most critical parameter:

    • ↑ t: ↓ “background” hits (faster)

    • ↓ t: ↑ ability to detect more distant relationships (at the cost of increased noise)

Database searches


  • For each word, evaluate score of match (exact or not) according to BLOSUM62

    • E.g., for PQG, score is 7+5+6 = 18

  • There are 20^w possible w-length words, but considering only those with score > t greatly reduces the number of matches

    • E.g., there are 20^3 = 8000 possible matches to PQG, but only 50 achieve score > t = 13
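
The word-scoring step can be sketched as follows. The code enumerates all 20^3 candidate words and keeps those scoring above t against the query word PQG. Only the P, Q and G rows of BLOSUM62 are embedded, transcribed by hand from the commonly published matrix, so treat the numbers as an assumption to verify before reuse; the function names are illustrative.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Three BLOSUM62 rows in AMINO_ACIDS column order (assumed transcription, verify before reuse).
_ROWS = {
    "P": [-1, -2, -2, -1, -3, -1, -1, -2, -2, -3, -3, -1, -2, -4, 7, -1, -1, -4, -3, -2],
    "Q": [-1, 1, 0, 0, -3, 5, 2, -2, 0, -3, -2, 1, 0, -3, -1, 0, -1, -2, -1, -2],
    "G": [0, -2, 0, -1, -3, -2, -2, 6, -2, -4, -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
}
SCORES = {q: dict(zip(AMINO_ACIDS, row)) for q, row in _ROWS.items()}


def word_score(query_word, candidate):
    """Sum of per-position substitution scores for two equal-length words."""
    return sum(SCORES[q][c] for q, c in zip(query_word, candidate))


def neighborhood(query_word, t):
    """Map each w-length word scoring strictly above t to its score."""
    hits = {}
    for letters in product(AMINO_ACIDS, repeat=len(query_word)):
        candidate = "".join(letters)
        s = word_score(query_word, candidate)
        if s > t:
            hits[candidate] = s
    return hits


words = neighborhood("PQG", 13)
```

With the rows as embedded here, the strict cutoff t = 13 keeps PQG, PEG, PRG and PKG. The size of the surviving set is sensitive to the matrix and to whether the comparison is strict, so it need not match the count quoted on the slide.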



Extending a hit

  • After locating a hit, BLAST attempts to extend the hit in both directions, until the score drops more than X below the maximum score attained so far.

  • Extension step typically accounts for > 90% of execution time.
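
A minimal sketch of this ungapped extension is given below. Each direction is extended independently and stops once the running score falls more than `x_drop` below the best score seen in that direction. The +5/-4 scoring and all names are assumptions for illustration; real BLAST scores protein hits with a substitution matrix.

```python
def score_pair(a, b, match=5, mismatch=-4):
    """Hypothetical match/mismatch scoring (illustrative values)."""
    return match if a == b else mismatch


def _extend(query, subject, qi, si, step, x_drop):
    """Walk one direction from (qi, si); return (best gain, residues kept)."""
    best = running = 0
    best_len = length = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        running += score_pair(query[qi], subject[si])
        length += 1
        if running > best:
            best, best_len = running, length
        elif best - running > x_drop:
            break  # dropped more than X below the best score: give up
        qi += step
        si += step
    return best, best_len


def extend_hit(query, subject, q_pos, s_pos, w, x_drop):
    """Extend a w-length hit both ways; return (score, query_start, query_end_exclusive)."""
    seed = sum(score_pair(query[q_pos + k], subject[s_pos + k]) for k in range(w))
    right, r_len = _extend(query, subject, q_pos + w, s_pos + w, +1, x_drop)
    left, l_len = _extend(query, subject, q_pos - 1, s_pos - 1, -1, x_drop)
    return seed + left + right, q_pos - l_len, q_pos + w + r_len


result = extend_hit("GGGACGTACGTTTT", "CCCACGTACGTAAA", 5, 5, 3, 10)
```

In this example the seed GTA at position 5 grows to the full shared region ACGTACGT, and the trailing mismatches are trimmed because only the best-scoring prefix of each extension is kept.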

Improvement: 2-hit method

  • Do extensions only when there are two hits on the same diagonal within some distance A of each other (e.g., A = 40)

  • Reduces sensitivity (ability to detect distantly related sequences)

    • To compensate, use lower t value (e.g., 11 rather than 13)

  • Since we only extend when there are two nearby hits, many fewer regions are extended
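
The two-hit rule can be sketched by grouping hits by diagonal (subject position minus query position) and triggering an extension only when a second hit follows the first within distance A. The requirement that the two hits do not overlap is an added assumption in this sketch, as are the function and parameter names.

```python
from collections import defaultdict


def two_hit_triggers(hits, w, a):
    """Return (diagonal, first_q_pos, second_q_pos) for each pair that would trigger an extension.

    hits: iterable of (query_pos, subject_pos) word hits.
    w:    word length; a: maximum distance between the two hits.
    """
    by_diag = defaultdict(list)
    for q_pos, s_pos in hits:
        by_diag[s_pos - q_pos].append(q_pos)

    triggers = []
    for diag, positions in sorted(by_diag.items()):
        last = None
        for q_pos in sorted(positions):
            if last is not None:
                gap = q_pos - last
                if gap < w:
                    continue  # overlaps the previous hit (assumed rule): ignore it
                if gap <= a:
                    triggers.append((diag, last, q_pos))
            last = q_pos
    return triggers


hits = [(10, 30), (25, 45), (11, 31), (5, 100), (90, 185), (40, 7)]
triggers = two_hit_triggers(hits, w=3, a=40)
```

Here only diagonal 20 qualifies: the hits at query positions 10 and 25 are 15 apart, whereas the two hits on diagonal 95 are 85 apart and the hit on diagonal -33 has no partner.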

Gapped BLAST

  • Allows local alignments with indels (similar to FASTA)

  • Local alignments from different diagonals are merged into a single alignment: one local alignment, followed by some indels, followed by a second local alignment, etc.

    • Equivalent to a path through the dynamic programming matrix composed of alternating diagonal sections and paths connecting them

Gapped BLAST

  • Original BLAST implicitly handled gaps by finding several distinct HSPs and calculating a statistical assessment of the combined result

    • Two or more HSPs each below the cutoff value might in combination rise to statistical significance

  • Gapped BLAST extends hits by allowing gaps when the hits are promising (score exceeds s_g):

    • Advantage: We can afford to miss some HSPs as long as at least one is found

  • Use dynamic programming, starting from the center of each high-scoring region, if s > s_g

    • s_g is chosen such that gapped alignment is triggered in about 1/50 of the sequences compared
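
The triggering logic might look like the sketch below. For brevity it swaps in a plain Smith-Waterman local alignment over the whole sequence pair with a linear gap penalty, rather than BLAST's dynamic programming seeded at the center of a high-scoring region. The scores (+5/-4, gap -6) and the names `local_alignment_score`, `maybe_gapped` and `s_g` are assumptions for illustration.

```python
def local_alignment_score(a, b, match=5, mismatch=-4, gap=-6):
    """Best local alignment score with gaps (Smith-Waterman, linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Either restart (0), extend along the diagonal, or open/extend a gap.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def maybe_gapped(a, b, ungapped_score, s_g):
    """Run the costly gapped step only when the ungapped score exceeds the trigger s_g."""
    if ungapped_score > s_g:
        return local_alignment_score(a, b)
    return None


gapped = maybe_gapped("ACGTACGT", "ACGTTACGT", ungapped_score=25, s_g=22)
skipped = maybe_gapped("ACGTACGT", "ACGTTACGT", ungapped_score=25, s_g=30)
```

In the example the best ungapped segment scores 25, while allowing one gap joins the two ACGT blocks for a score of 34; with a higher trigger the gapped step is skipped entirely.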

PSI-BLAST


  • Position-Specific Iterated BLAST

  • Generates a multiple alignment from statistically significant alignments produced by BLAST

  • Produces a position-specific score matrix (PSSM)

    • Can search the database using the PSSM

    • Match sequences to profile

    • Generate new profiles

    • Repeat (iteration)

    • Search gradually extends to increasingly divergent sequences
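
One iteration of the profile idea can be sketched as: build a log-odds PSSM from already-aligned sequences, then slide it along a new sequence. The sketch assumes gap-free columns, a uniform background frequency and a pseudocount of 1, none of which comes from the slides; PSI-BLAST's actual weighting scheme is more elaborate.

```python
import math

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def build_pssm(aligned, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """One dict of log2-odds scores per alignment column (uniform background assumed)."""
    background = 1.0 / len(alphabet)
    pssm = []
    for col in range(len(aligned[0])):
        counts = {x: pseudocount for x in alphabet}
        for seq in aligned:
            counts[seq[col]] += 1
        total = sum(counts.values())
        pssm.append({x: math.log2((counts[x] / total) / background) for x in alphabet})
    return pssm


def score_window(pssm, window):
    """Score a sequence window of the same width as the profile."""
    return sum(col[res] for col, res in zip(pssm, window))


def best_window(pssm, sequence):
    """Return (score, start) of the best-matching window in sequence."""
    width = len(pssm)
    return max(
        (score_window(pssm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


pssm = build_pssm(["PQG", "PEG", "PKG"])
score, start = best_window(pssm, "AAPQGAA")
```

Sequences found this way would be added to the alignment and the PSSM rebuilt, which is what lets later iterations reach more divergent sequences.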

Flavors of BLAST

  • BLASTP - protein query against protein DB

  • BLASTN - DNA/RNA query against nucleotide DB (e.g., GenBank)

  • BLASTX - 6-frame translated DNA query against protein DB

  • TBLASTN - protein query against 6-frame translated GenBank

  • TBLASTX - 6-frame translated DNA query against 6-frame translated GenBank

  • PSI-BLAST - protein ‘profile’ query against protein DB

  • PHI-BLAST - protein pattern against protein DB
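
The "6-frame translation" used by BLASTX, TBLASTN and TBLASTX can be sketched as below: three reading frames on the given strand and three on its reverse complement. The standard genetic code is assumed, unknown or ambiguous codons are rendered as X, and the function names are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement each base, then reverse the strand."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; anything not in the table becomes 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return the six conceptual translations keyed '+1'..'+3' and '-1'..'-3'."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


frames = six_frames("ATGCCCCAGGGATAA")
```

Each translated frame is then compared as a protein sequence, which is why these flavors cost several times more than a single protein-protein search.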
