Gapped BLAST and PSI-BLAST: a new generation of protein database search programs By Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller and David J. Lipman
Introduction to BLAST • BLAST is a heuristic approximation to dynamic programming based local alignment. • Finds locally maximal segment pairs with scores over a cutoff. • Has a formal statistical theory to assess the significance of scores.
Basic Algorithm • Looks for words of length w with score greater than T. • These hits are then extended to check for segment pairs with score greater than S (>T.) • Tradeoff: Lowering T reduces probability of missing segment pairs (increases sensitivity) but increases number of hits to be extended.
Scanning for hits • Two Approaches: • Positions of length w words in query with score higher than T stored in a 20w sized array and hits detected by array lookup. • A DFA for the appropriate words is generated and used to scan the sequences. A Mealy machine (acceptance on transitions) is used for efficiency.
Other Issues • Hit extension is simplified by stopping when score falls below a threshold compared to the best score found for shorter extensions. • The various parameters are chosen based on experiments using random sequences. • Combinations of MSP’s can be used to get better scores for matching sequences.
Original BLAST: One-Hit Extend each hit to determine if it is in a high-scoring alignment Extension consumes >90% of processing time hit: short word pair whose aligned score ≥ T Two-Hit Method Extension invoked only if there are two non-overlapping word pairs on the same diagonal Lowering T yields more hits, but only a few are extended 3x faster Two-Hit Method T: threshold parameter; as T ↑, speed ↑, probability of missing weak similarities ↑
Two-Hit Method Algorithm • Scan db for hits (word pair scoring ≥ T) • Seek pairs of non-overlapping hits found with distance A of one another on same diagonal • Invoke (ungapped) extension to determine if hits lie within a statistically significant alignment with query. Extend until alignment score has dropped ≥ X below max score yet attained.
Original BLAST Implicitly treat gapped alignments: Locate several distinct HSPs within same db sequence Calculate statistical significance on combined result Gapped BLAST Trigger gapped extension for any HSP exceeding moderate score Sg Gapped extension longer to execute, few undergo this extension Gapped Alignments HSP: high-scoring segment pair; locally optimal
Advantage of New Heuristic for Generating Gapped Alignments • Two or more HSPs may each have low scores independently, but can have a statistically significance together • Only one of the constituent HSPs need to be found to generate a successful combined result – can increase T
Older Gapped Alignments • Confine the dynamic programming to a banded section of the full path graph • Optimal gapped alignment may be outside this band • As width of band ↑, speed ↓
New Heuristic for Generating Gapped Alignments • Starting from a seed HSP, dynamic programming proceeds both bidirectionally through the path graph • Consider only cells for which optimal local alignment score falls ≤ Xg below best score yet found • Region of path graph explored adapts to alignment being constructed • Seed: central residue pair of segment with highest alignment along HSP
New Gapped BLAST • Ungapped extension of second hit invoked for two non-overlapping hits of score ≥ T within distance A of one another • If HSP generated has normalized score ≥ Sg, gapped extension is triggered • Resulting gapped alignment reported if statistically significant (low enough E-value) • Runs on average 3x faster than original BLAST
PSI-BLAST: Overview • Results of initial BLAST search used to construct position-specific scores. • BLAST is repeated using the new scores till no more sequences are found. • Position-specific scores improve the ability of successive BLAST iterations for detecting remote homologs.
Position-specific score matrix • Dimensions: Lx20 • “Multiple Alignment” created using all segments with e-value above a threshold. • Alignment based on pairwise alignments. • Columns with gaps in query ignored. • For each column C a reduced alignment MC is created.
MC includes all sequences with a residue in C and all columns which have the above sequences. • Sequence weighting method used to generate observed residue frequencies. • Score for residue i in column C given by log(Qi/ Pi) • Qi is the weighted sum of observed frequencies and a pseudocount.