
Database Searches


beata

Presentation Transcript


  1. Database Searches BLAST

  2. BLAST • Basic Local Alignment Search Tool • Altschul, Gish, Miller, Myers, Lipman, J. Mol. Biol. 215 (1990) • Altschul, Madden, Schaffer, Zhang, Zhang, Miller, Lipman, Nucleic Acids Res. 25 (1997) • Main ideas: • Increase search speed by finding fewer, but better, hot spots during the initial screening phase • Use longer word sizes • Integrate the scoring matrix into the first phase • Compare with FASTA, which requires exact matches

  3. BLAST Terminology • Segment pair: a pair of equal-length substrings, one from sequence S1 and one from S2, aligned without gaps • Locally maximal segment pair: a segment pair whose alignment score cannot be improved by extending or shortening it • Maximal segment pair (MSP): the segment pair with the maximum score over all segment pairs in S1 and S2 • High-scoring segment pair (HSP): a segment pair with a score higher than some cutoff score s • w is the word-length parameter; t is the threshold parameter
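
The MSP definition can be made concrete with a brute-force search. The sketch below scans every diagonal of the S1 x S2 comparison and keeps the best-scoring ungapped segment on each one. The function name `maximal_segment_pair` and the match/mismatch scores (+5/-4) are illustrative assumptions, not taken from the slides; a protein search would use a substitution matrix instead.

```python
def maximal_segment_pair(s1, s2, match=5, mismatch=-4):
    """Return (score, start_in_s1, start_in_s2, length) of the best ungapped
    segment pair, found by a maximum-subarray scan along every diagonal.
    The +5/-4 scoring is an assumed toy scheme."""
    best = (0, 0, 0, 0)
    # A diagonal is identified by (position in s2) - (position in s1).
    for diag in range(-(len(s1) - 1), len(s2)):
        i, j = max(0, -diag), max(0, diag)
        running = 0
        start_i, start_j = i, j
        while i < len(s1) and j < len(s2):
            if running <= 0:
                # A non-positive prefix can never help; restart the segment here.
                running = 0
                start_i, start_j = i, j
            running += match if s1[i] == s2[j] else mismatch
            if running > best[0]:
                best = (running, start_i, start_j, i - start_i + 1)
            i += 1
            j += 1
    return best


msp = maximal_segment_pair("TTACGTAA", "GGACGTCC")
```

Any segment pair whose score exceeds a chosen cutoff s would count as an HSP; this exhaustive scan is exactly the work that the word-based screening described in the following slides is designed to avoid.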

  4. BLAST: Hits • A hit is a w-length word in the database that aligns with a word from the query sequence with score > t • BLAST looks for hits instead of exact matches • This allows the word size to be kept high for speed without sacrificing sensitivity • Typically, w = 3-5 for amino acids and ~11-12 for DNA • t is the most critical parameter: • ↑t: fewer “background” hits (faster) • ↓t: greater ability to detect distant relationships (at the cost of increased noise)

  5. Hits • For each word, evaluate the score of a match (exact or not) according to BLOSUM62 • E.g., for PQG matched against itself, the score is 7+5+6 = 18 • There are 20^w possible w-length words, but considering only those with score > t greatly reduces the number of matches • E.g., there are 20^3 = 8000 possible matches to PQG, but only 50 achieve score > t = 13
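
The PQG example can be reproduced by enumerating all 20^3 candidate words and keeping those that score above t. This is a minimal sketch: the names `word_score` and `neighborhood` are hypothetical, and only the three BLOSUM62 rows needed for the query word PQG are included. Those rows were typed in from the commonly published matrix and should be checked against an authoritative copy before reuse.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Rows of BLOSUM62 for P, Q and G only (columns follow AMINO_ACIDS order).
# Assumption: values transcribed by hand; verify before relying on them.
BLOSUM62_ROWS = {
    "P": [-1, -2, -2, -1, -3, -1, -1, -2, -2, -3, -3, -1, -2, -4, 7, -1, -1, -4, -3, -2],
    "Q": [-1, 1, 0, 0, -3, 5, 2, -2, 0, -3, -2, 1, 0, -3, -1, 0, -1, -2, -1, -2],
    "G": [0, -2, 0, -1, -3, -2, -2, 6, -2, -4, -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
}


def word_score(query_word, candidate):
    """Sum of per-position substitution scores for two equal-length words."""
    return sum(
        BLOSUM62_ROWS[q][AMINO_ACIDS.index(c)]
        for q, c in zip(query_word, candidate)
    )


def neighborhood(query_word, t):
    """Map every w-length word scoring strictly above t to its score."""
    result = {}
    for letters in product(AMINO_ACIDS, repeat=len(query_word)):
        candidate = "".join(letters)
        score = word_score(query_word, candidate)
        if score > t:
            result[candidate] = score
    return result


hits = neighborhood("PQG", t=13)
```

In a real search the neighborhood of every query word would be precomputed once and stored in a lookup structure, so that scanning the database reduces to checking each database word against that structure.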

  6. BLAST

  7. Extending a hit • After locating a hit, BLAST attempts to extend it in both directions until the score drops more than X below the maximum score attained so far • The extension step typically accounts for > 90% of execution time
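
The X-drop rule can be sketched as an ungapped extension that walks outward from the hit and stops once the running score falls more than X below the best score seen. The helper names and the +5/-4 match/mismatch scoring are assumptions made for illustration; they are not from the slides, and a protein search would score pairs with a substitution matrix.

```python
def score_pair(a, b, match=5, mismatch=-4):
    """Assumed toy scoring: +5 for identical residues, -4 otherwise."""
    return match if a == b else mismatch


def extend_one_way(query, subject, q_pos, s_pos, step, x_drop):
    """Extend from (q_pos, s_pos) in direction step (+1 or -1).
    Returns (best score gain, number of positions kept)."""
    running = best = 0
    length = best_len = 0
    while 0 <= q_pos < len(query) and 0 <= s_pos < len(subject):
        running += score_pair(query[q_pos], subject[s_pos])
        length += 1
        if running > best:
            best, best_len = running, length
        elif best - running > x_drop:
            break  # dropped more than X below the maximum: stop extending
        q_pos += step
        s_pos += step
    return best, best_len


def extend_hit(query, subject, q_start, s_start, w, x_drop):
    """Extend a w-length hit in both directions.
    Returns (total score, query start, query end exclusive)."""
    seed = sum(score_pair(query[q_start + k], subject[s_start + k]) for k in range(w))
    right_gain, right_len = extend_one_way(
        query, subject, q_start + w, s_start + w, +1, x_drop)
    left_gain, left_len = extend_one_way(
        query, subject, q_start - 1, s_start - 1, -1, x_drop)
    return seed + left_gain + right_gain, q_start - left_len, q_start + w + right_len


result = extend_hit("AAAGGGTTT", "CCCGGGTTA", q_start=3, s_start=3, w=3, x_drop=3)
```

Because the seed score is a constant offset, measuring the drop separately in each direction gives the same stopping points as measuring it against the overall maximum for that direction.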

  8. Extending a hit

  9. Improvement: 2-hit method • Do extensions only when there are two hits on the same diagonal within some distance A of each other (e.g., A = 40) • This reduces sensitivity (the ability to detect distantly related sequences) • To compensate, use a lower t value (e.g., 11 rather than 13) • Since we only extend when there are two nearby hits, far fewer regions are extended
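
The two-hit trigger can be sketched by grouping hits by diagonal and looking for consecutive hits that lie within A of each other. The function name `two_hit_seeds` and the hit representation as (query position, database position) pairs are assumptions for illustration. The sketch also assumes that hits overlapping the previously remembered hit are ignored, which the slide does not state.

```python
from collections import defaultdict


def two_hit_seeds(hits, w, max_dist):
    """hits: iterable of (query_pos, db_pos) word hits of length w.
    Returns a list of (diagonal, first_query_pos, second_query_pos) for each
    pair of hits on one diagonal separated by at most max_dist (the slide's A)."""
    by_diagonal = defaultdict(list)
    for q_pos, d_pos in hits:
        by_diagonal[d_pos - q_pos].append(q_pos)

    seeds = []
    for diagonal in sorted(by_diagonal):
        positions = sorted(by_diagonal[diagonal])
        last = positions[0]
        for q_pos in positions[1:]:
            gap = q_pos - last
            if gap < w:
                continue  # assumed rule: skip hits overlapping the remembered one
            if gap <= max_dist:
                seeds.append((diagonal, last, q_pos))
            last = q_pos
    return seeds


example_hits = [(0, 10), (20, 30), (100, 110), (5, 50), (6, 51)]
seeds = two_hit_seeds(example_hits, w=3, max_dist=40)
```

Only the pair at query positions 0 and 20 triggers an extension here: the hit at 100 is too far from its predecessor, and the two hits on the other diagonal overlap.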

  10. Gapped BLAST • Allows local alignments with indels (similar to FASTA) • Local alignments from different diagonals are merged into a single alignment: one local alignment, followed by some indels, followed by a second local alignment, etc. • This is equivalent to a path through the dynamic programming matrix composed of alternating diagonal sections and paths connecting them

  11. Gapped BLAST • Original BLAST implicitly handled gaps by finding several distinct HSPs and calculating a statistical assessment of the combined result • Two or more HSPs, each below the cutoff value, might in combination rise to statistical significance • Gapped BLAST extends hits by allowing gaps when the hits are promising (score s exceeds sg): • Advantage: we can afford to miss some HSPs as long as at least one is found • Use dynamic programming, starting from the center of each high-scoring region, if s > sg • sg is chosen such that gapped alignment is triggered in about 1/50 of the sequences compared

  12. PSI-BLAST • Position-Specific Iterated BLAST • Generates a multiple alignment from statistically significant alignments produced by BLAST • Produces a position-specific score matrix (PSSM) • Can search the database using the PSSM • Match sequences to profile • Generate new profiles • Repeat (iteration) • Search gradually extends to increasingly divergent sequences
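
The step from a multiple alignment to a position-specific score matrix can be sketched as a per-column log-odds calculation. This is a simplified illustration, not PSI-BLAST's actual procedure, which also weights sequences and uses data-dependent pseudocounts. The function names, the uniform background frequencies, and the fixed pseudocount of 1 are all assumptions.

```python
import math
from collections import Counter


def build_pssm(aligned, alphabet, pseudocount=1.0):
    """aligned: equal-length sequences forming a gap-free toy alignment.
    Returns one dict per column mapping residue -> log2-odds score
    against an assumed uniform background."""
    background = 1.0 / len(alphabet)
    pssm = []
    for col in range(len(aligned[0])):
        counts = Counter(seq[col] for seq in aligned if seq[col] in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        pssm.append({
            residue: math.log2(((counts[residue] + pseudocount) / total) / background)
            for residue in alphabet
        })
    return pssm


def score_with_pssm(pssm, segment):
    """Score a segment of the same length as the profile, column by column."""
    return sum(pssm[i][residue] for i, residue in enumerate(segment))


profile = build_pssm(["ACD", "ACE", "ACD"], alphabet="ACDE")
```

In an iterated search, segments scoring well against the profile would be added to the alignment, the profile rebuilt, and the database searched again, which is how the search reaches progressively more divergent sequences.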

  13. Flavors of BLAST • BLASTP - protein query against protein DB • BLASTN - DNA/RNA query against GenBank (DNA) • BLASTX - 6-frame translated DNA query against protein DB • TBLASTN - protein query against 6-frame GenBank translation • TBLASTX - 6-frame translated DNA query against 6-frame GenBank translation • PSI-BLAST - protein ‘profile’ query against protein DB • PHI-BLAST - protein pattern against protein DB
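
The "6-frame translation" used by BLASTX, TBLASTN and TBLASTX can be sketched as follows: translate the DNA at offsets 0, 1 and 2 on the forward strand, then do the same on the reverse complement. The function names and the frame labels (+1 to +3, -1 to -3) are illustrative assumptions; the codon table is the standard genetic code, and unrecognized codons are shown as X.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame(dna):
    """Return a dict of the six reading-frame translations of dna."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


frames = six_frame("ATGGCCTAA")
```

Each of the six protein strings can then be searched like an ordinary protein query, or serve as a database entry for a protein query.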
