BLAST

Database Searching for Similar Sequences • Searching a sequence database for sequences that are similar to a query sequence provides a list of database sequences with which the query sequence can be aligned. • Key issue: efficiency

Local Alignment
• Sequences: CATCACCT and GATACCC
• Let gap = -2, match = 1, mismatch = -1.
• Alignment shown on the slide:
  CATCACCT
  GAT_ACCC
• [Figure: dynamic-programming score matrix with CATCACCT along one axis and GATACCC along the other]
• Computational cost: O(n^2)
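The score matrix for this example can be recomputed with a minimal sketch of local-alignment dynamic programming (in the style of Smith-Waterman, with a linear gap penalty). The function name `local_alignment_matrix` and the choice to put CATCACCT along the columns are assumptions made for illustration; only the scoring values come from the slide. The two nested loops are where the O(n^2) cost comes from.

```python
def local_alignment_matrix(a, b, match=1, mismatch=-1, gap=-2):
    """Fill the local-alignment score matrix H for sequences a (columns) and b (rows).

    Returns (H, best_score, best_cell). Illustrative sketch only.
    """
    rows, cols = len(b) + 1, len(a) + 1
    # Row 0 and column 0 stay at 0: a local alignment may start anywhere.
    H = [[0] * cols for _ in range(rows)]
    best_score, best_cell = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if b[i - 1] == a[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a new local alignment here
                H[i - 1][j - 1] + s,  # align b[i-1] with a[j-1]
                H[i - 1][j] + gap,    # gap in a
                H[i][j - 1] + gap,    # gap in b
            )
            if H[i][j] > best_score:
                best_score, best_cell = H[i][j], (i, j)
    return H, best_score, best_cell


H, best_score, best_cell = local_alignment_matrix("CATCACCT", "GATACCC")
for label, row in zip("-GATACCC", H):
    print(label, row)
print("best local score:", best_score, "at cell", best_cell)
```

Every cell is visited exactly once, so comparing a query against a whole database this way repeats the full matrix fill for each database sequence.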

Database Searching for Similar Sequences: Methods • Dynamic programming requires on the order of N^2 L computations if the database has L sequences (each of length about N) • Popular database searching methods • FASTA [Pearson & Lipman, 1988] • BLAST [Altschul et al., 1990] • Trade-offs of using the fast heuristic methods • Accuracy (sensitivity and selectivity)

FASTA • Problem with dynamic programming: too many calculations are “wasted” by comparing regions that have nothing in common • Initial insight: regions that are similar between two sequences are likely to share short stretches that are identical • Basic method: look for similar regions only near short stretches that match exactly: “hit and extend” sequence searching

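Diagonal Method Example
s = H A R F Y A A Q I V L (positions 1 to 11)
t = V D M A A Q I A (positions 1 to 8)
Look-up table (residue: positions in s)
  A: 2, 6, 7
  F: 4
  H: 1
  I: 9
  L: 11
  Q: 8
  R: 3
  V: 10
  Y: 5
  …
Offsets (position in s minus position in t) for each residue of t
  V (1): +9
  D (2): none
  M (3): none
  A (4): -2, +2, +3
  A (5): -3, +1, +2
  Q (6): +2
  I (7): +2
  A (8): -6, -2, -1
Offset vector (range -7 to +10), nonzero counts
  -6: 1, -3: 1, -2: 2, -1: 1, +1: 1, +2: 4, +3: 1, +9: 1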
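The diagonal method in this example can be sketched in a few lines. The function names `lookup_table` and `offset_counts` are hypothetical, and the sketch uses words of length 1 with 1-based positions to mirror the slide. The offset with the most hits identifies the diagonal on which the two sequences share the most matching residues.

```python
from collections import Counter, defaultdict


def lookup_table(s, k=1):
    """Map each k-tuple of s to the list of 1-based positions where it starts."""
    table = defaultdict(list)
    for i in range(len(s) - k + 1):
        table[s[i:i + k]].append(i + 1)
    return table


def offset_counts(s, t, k=1):
    """Count hits per offset (position in s minus position in t)."""
    table = lookup_table(s, k)
    counts = Counter()
    for j in range(len(t) - k + 1):
        # .get avoids adding empty entries for k-tuples absent from s
        for i in table.get(t[j:j + k], []):
            counts[i - (j + 1)] += 1
    return counts


counts = offset_counts("HARFYAAQIVL", "VDMAAQIA")
for offset in sorted(counts):
    print(f"{offset:+d}: {counts[offset]}")
print("best diagonal:", counts.most_common(1)[0])
```

For the sequences on the slide the busiest diagonal is offset +2, which corresponds to the shared stretch AAQI (positions 6 to 9 in s, 4 to 7 in t).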

Limitations of FASTA • FASTA can miss significant similarity since: • For proteins, similar sequences do not have to share identical residues • Gly-Asp-Gly-Lys-Gly is quite similar to Gly-Glu-Gly-Arg-Gly but there is no match with a ktuple of size 2 • Asp-Lys-Val is quite similar to Glu-Arg-Ile yet it is missed even with a ktuple size of 1 • For nucleic acids, due to codon “wobble”, coding sequences may look like XXy where the X’s are conserved and the y’s are not • GGuUCuACgAAg and GGcUCcACaAAa both code for the same peptide sequence (Gly-Ser-Thr-Lys) but they don’t match with a ktuple size of 3 or higher
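These claims are easy to check by listing the k-tuples two sequences have in common. The helper names `ktuples` and `shared_ktuples` are hypothetical, and the peptides are written with one-letter codes (GDGKG, GEGRG, DKV, ERI) for convenience.

```python
def ktuples(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_ktuples(a, b, k):
    """Return the k-tuples that occur in both sequences (at any position)."""
    return ktuples(a, k) & ktuples(b, k)


# Gly-Asp-Gly-Lys-Gly vs Gly-Glu-Gly-Arg-Gly, ktuple size 2
print(shared_ktuples("GDGKG", "GEGRG", 2))
# Asp-Lys-Val vs Glu-Arg-Ile, ktuple size 1
print(shared_ktuples("DKV", "ERI", 1))
# Two codings of Gly-Ser-Thr-Lys, ktuple sizes 3 and 2
print(shared_ktuples("GGUUCUACGAAG", "GGCUCCACAAAA", 3))
print(sorted(shared_ktuples("GGUUCUACGAAG", "GGCUCCACAAAA", 2)))
```

An empty set means an exact-match seeding step would never start an alignment between the two sequences, however similar they are biologically.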

BLAST (Basic Local Alignment Search Tool) • BLAST searches for words that score above a threshold T rather than words that match exactly. It is also faster because its implementation has been optimized to work with parallel UNIX architectures from an early stage. • Reference • S. F. Altschul, W. Gish, W. Miller, E. W. Myers and D. J. Lipman. Basic Local Alignment Search Tool. J. Mol. Biol. 215:403-410 (1990)

BLAST basics • BLAST is mainly a 3-step algorithm: • Compile list of high-scoring strings (words) • Search for hits – each hit gives a seed • Extend seeds to obtain segment pairs (a segment pair is basically a gapless local alignment)

BLAST • For protein sequences, the list of high-scoring words consists of all words of w characters that score at least T with some word in the query sequence (w = 3 or 4 for protein searches, 11 or 12 for nucleotide sequences). • Search for “hits” using a hash table or a finite state machine. • Key concept: searching for words that score above T rather than words that match exactly

BLAST method for proteins 1. Compile a list of words which score at least T when paired with a word of the query sequence. • Example using PAM-120 for query word ACDE (w=4, T=18): ACDE scored against itself = +3 +9 +5 +5 = 22 • Try all possibilities: AAAA = +3 -3 0 0 = 0, no good; AAAC = +3 -3 0 -7 = -7, no good • ...too slow; try directed change

Generating word list
Query word ACDE scored against itself: +3 +9 +5 +5 = 22
• Change 1st pos. to all acceptable substitutions:
  gCDE = 1 + 9 + 5 + 5 = 20 ok (= pCDE, sCDE, tCDE)
  nCDE = 0 + 9 + 5 + 5 = 19 ok (= dCDE, eCDE, nCDE, vCDE)
  iCDE = -1 + 9 + 5 + 5 = 18 ok (= qCDE)
  kCDE = -2 + 9 + 5 + 5 = 17 not good (= mCDE)
• Change 2nd pos.: can't; all alternatives are negative and the other three positions only add up to 13
• Change 3rd pos. in combination with first position: gCnE = 1 + 9 + 2 + 5 = 17 not good
• Continue, using recursion
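The directed, recursive search can be sketched as a branch-and-bound over word positions: a partial word is abandoned as soon as even the best possible scores for the remaining positions cannot reach T. The scoring table below is a toy assumption, not PAM-120: it contains only the pair scores quoted on the slides and falls back to a made-up default of -3 for every other pair. The names `TOY_SCORES`, `DEFAULT_SCORE`, `pair_score`, and `neighborhood_words` are hypothetical.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
DEFAULT_SCORE = -3  # assumption: stand-in for every pair not listed below

# Pair scores quoted on the slides (symmetric lookup); NOT a full PAM-120.
TOY_SCORES = {("A", "A"): 3, ("C", "C"): 9, ("D", "D"): 5, ("E", "E"): 5,
              ("A", "C"): -3, ("E", "C"): -7, ("D", "N"): 2}
TOY_SCORES.update({("A", x): 1 for x in "GPST"})
TOY_SCORES.update({("A", x): 0 for x in "NDEV"})
TOY_SCORES.update({("A", x): -1 for x in "IQ"})
TOY_SCORES.update({("A", x): -2 for x in "KM"})


def pair_score(a, b):
    """Look up a pair score in either order, falling back to the default."""
    return TOY_SCORES.get((a, b), TOY_SCORES.get((b, a), DEFAULT_SCORE))


def neighborhood_words(query_word, T):
    """Return all words scoring at least T against query_word."""
    n = len(query_word)
    # best_rest[i]: highest score still achievable from position i onward
    best_rest = [0] * (n + 1)
    for i in range(n - 1, -1, -1):
        best_here = max(pair_score(query_word[i], x) for x in AMINO_ACIDS)
        best_rest[i] = best_rest[i + 1] + best_here
    words = []

    def extend(prefix, partial):
        i = len(prefix)
        if i == n:
            words.append(prefix)  # partial >= T is guaranteed by the pruning
            return
        for x in AMINO_ACIDS:
            score = partial + pair_score(query_word[i], x)
            if score + best_rest[i + 1] >= T:  # prune hopeless branches
                extend(prefix + x, score)

    extend("", 0)
    return words


print(sorted(neighborhood_words("ACDE", 18)))
```

With the toy table, the pruning rejects every change at the second position immediately, which is the same argument the slide makes by hand. A real implementation would use a complete substitution matrix.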

BLAST method for proteins 2. Scan the database for hits with the compiled list of words. • Use finite state machine (actually used) • Calculate a state transition table that tells what state to go to based on the next character in the sequence 3. Extend hits in both directions to form segment pairs (without allowing gaps)
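Step 3 can be sketched as an ungapped extension to the left and right of a hit. The stopping rule used here (give up once the running score falls more than `x_drop` below the best score seen) is an assumption, since the slides do not say when extension stops. The names `extend_seed` and `_extend`, the identity scoring function, and the example sequences are likewise hypothetical.

```python
def _extend(q, s, score, x_drop):
    """Extend along paired residues of q and s; return (best gain, length used)."""
    best = run = 0
    best_len = 0
    for k, (a, b) in enumerate(zip(q, s), start=1):
        run += score(a, b)
        if run > best:
            best, best_len = run, k
        elif best - run > x_drop:
            break  # score has dropped too far below the best seen
    return best, best_len


def extend_seed(query, subject, q_start, s_start, w, score, x_drop=2):
    """Extend a length-w hit without gaps.

    Returns (segment score, query start, query end), end exclusive.
    """
    seed = sum(score(query[q_start + k], subject[s_start + k]) for k in range(w))
    right, r_len = _extend(query[q_start + w:], subject[s_start + w:], score, x_drop)
    # Extend leftwards by walking the reversed prefixes.
    left, l_len = _extend(query[:q_start][::-1], subject[:s_start][::-1], score, x_drop)
    return seed + left + right, q_start - l_len, q_start + w + r_len


def identity(a, b):
    """Toy scoring: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


total, q_from, q_to = extend_seed("KLACDEMKG", "PQACDEMKR", 2, 2, 4, identity)
print(total, "KLACDEMKG"[q_from:q_to])
```

Here the hit ACDE grows to the right to cover MK and does not grow to the left, giving the segment pair ACDEMK.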

BLAST
• Compile a list of words which give a score above T
• Scan database for hits with the compiled list of words.
• Extend hits in both directions to form segment pairs
[Figure]
  Query: ACDEMKGLACDEQAYSMLTNSEFTP
  Word list: ACDE, GCDE, NCDE, …, AAAA
  Database sequences (scanned with the FSM): LANACDEGKGL…, GTKLVGCDELV…, GNCDEEEETDPG…, AAAADRGG…, …

FSM for BLAST • Example of a finite state machine for string matching (input alphabet: a, b, c) • Word: ababaca • Database sequence: bcabccaaababacababacabb • [Figure: state-transition diagram with states 0 to 7 and edges labeled a and b]
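The state machine for this example can be built and run with a short sketch. It handles a single word, whereas a BLAST search scans for a whole word list at once; the function names `build_fsm` and `find_word` are hypothetical. State q means "the last q characters read match the first q characters of the word", so reaching state 7 signals a hit.

```python
def build_fsm(word, alphabet):
    """Build the transition table: delta[state][char] -> next state."""
    m = len(word)
    delta = [dict() for _ in range(m + 1)]
    for q in range(m + 1):
        for ch in alphabet:
            # Longest prefix of word that is a suffix of (word[:q] + ch).
            k = min(m, q + 1)
            while k > 0 and not (word[:q] + ch).endswith(word[:k]):
                k -= 1
            delta[q][ch] = k
    return delta


def find_word(word, text, alphabet):
    """Return the 0-based start positions of word in text."""
    delta = build_fsm(word, alphabet)
    state, hits = 0, []
    for i, ch in enumerate(text):
        state = delta[state].get(ch, 0)  # unknown characters reset to state 0
        if state == len(word):
            hits.append(i - len(word) + 1)
    return hits


print(find_word("ababaca", "bcabccaaababacababacabb", "abc"))
```

Each database character is examined exactly once, no matter how often a partial match fails, which is what makes the scan fast. In this example the two occurrences overlap by one character (the final a of the first is the first a of the second), and the machine finds both.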

BLAST Method for DNA 1. Make a list of all words of length w in the query sequence (often w=11 or 12) 2. Compress the database by packing 4 nucleotides into a single byte (use an auxiliary table to tell you where sequences start and stop within the compressed database). This packing doesn't allow for unspecified bases (wildcards)

BLAST Method for DNA 3. Compress the words from the query sequence the same way. 4. Search the compressed database for matches with the compressed words. Since all frames of the query sequence are considered separately, any match of length w>=11 must contain a match of length 8 that lies on a byte boundary of one of the words from the query sequence. Thus the search can scan a (packed) byte at a time, improving speed 4-fold over comparing one nucleotide at a time.
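The packing step can be sketched as follows. The 2-bit code (A=0, C=1, G=2, T=3), the padding of a short final chunk with zero bits, and the names `pack` and `find_byte_aligned` are assumptions made for illustration. The sketch only finds words that start on a byte boundary of the database, which is the case the slide relies on.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit encoding


def pack(seq):
    """Pack 4 nucleotides per byte; return (packed bytes, original length)."""
    packed = bytearray()
    for start in range(0, len(seq), 4):
        chunk = seq[start:start + 4]
        byte = 0
        for base in chunk:
            if base not in CODE:
                # No room for wildcards in a 2-bit code.
                raise ValueError(f"cannot pack unspecified base {base!r}")
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))  # zero-pad a short final chunk
        packed.append(byte)
    return bytes(packed), len(seq)


def find_byte_aligned(db_packed, word):
    """Find word (length a multiple of 4) at a byte boundary of the packed database.

    Returns the nucleotide offset of the first hit, or -1.
    """
    if len(word) % 4 != 0:
        raise ValueError("word length must be a multiple of 4")
    key, _ = pack(word)
    index = db_packed.find(key)  # compares whole bytes, 4 bases at a time
    return index * 4 if index >= 0 else -1


db_packed, db_length = pack("TTTTACGTACGTGGGG")
print(list(db_packed), db_length)
print(find_byte_aligned(db_packed, "ACGTACGT"))
```

The original length is returned alongside the bytes because the padding would otherwise be indistinguishable from trailing A's, which is one reason an auxiliary table of sequence boundaries is needed.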

Low-Complexity Regions • Low-complexity regions are segments that contain certain bases or amino acids more often than one would expect in “normal” nucleotide or protein sequences. • Problem: if the query sequence has a stretch of unusual base composition (e.g., A-T rich) or a repeated sequence element (e.g., an Alu sequence), there will be many hits with "uninteresting" regions.

Low-Complexity Regions • Solution : • Make a list of the words occurring very frequently (more frequently than expected by chance). • Remove these words from the query list of words before searching database. (The words are replaced by strings of Xs.)
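A minimal sketch of this masking idea follows. It counts words within the query itself and uses a fixed cutoff `max_count`; both are simplifying assumptions, since the slide defines "very frequently" relative to chance expectation and does not give a cutoff. The function name `mask_frequent_words` is hypothetical.

```python
from collections import Counter


def mask_frequent_words(seq, w, max_count):
    """Replace every word of length w that occurs more than max_count times with X's."""
    counts = Counter(seq[i:i + w] for i in range(len(seq) - w + 1))
    masked = list(seq)
    for i in range(len(seq) - w + 1):
        if counts[seq[i:i + w]] > max_count:
            masked[i:i + w] = "X" * w  # overlapping words merge into one run
    return "".join(masked)


print(mask_frequent_words("CGATATATATGC", w=2, max_count=2))
```

In this toy query the words AT and TA occur four and three times, so the whole AT-rich stretch is replaced and contributes no words to the search, while the flanking CG and GC are kept.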

BLAST Statistical significance • A key to the utility of BLAST is the ability to calculate expected probabilities of occurrence of maximum segment pairs (MSPs) given w and T • This allows BLAST to rank matching sequences in order of “significance” and to cut off listings at a user-specified probability

Choosing Values for w and T • Trade-off: sensitivity vs. running-time • Choosing a value for w • Small w: many matches to expand • Big w: many words to be generated • w=3/4 is a good compromise • Choosing a value for T • Small T: greater sensitivity, more matches to expand

Basic BLAST Family • BLASTN • DNA query to DNA database • BLASTP • protein query to protein database • BLASTX • DNA query (translated) to protein database • TBLASTN • protein query to DNA database (translated) • TBLASTX • DNA query (translated) to DNA database (translated)

BLAST demo • Example: Ebola sequence http://www.ncbi.nlm.nih.gov/nuccore/KM233090.1

BLAST Refinements • “two-hit” method for extending word pairs • gapped alignments • Iterate with position-specific matrix (PSI-BLAST) • Pattern-hit initiated BLAST (PHI-BLAST)

Multiple sequence alignment
