
BLAST • Chris Cobb • Bioinformatics • April 2008
Overview • Sequence comparison • Compare regions of similarity • Global alignment not necessarily 'valuable' • Indicative of evolutionary relationships. • Not always... • Horizontal gene transfer • Convergent evolution • Similar function different sequence • Databases • Growing exponentially
Sequence similarity • Percent identity • Max matches in an alignment / alignment length • Best alignment gives us maximum number of matches • Example • Cytochrome c • Small electron carrier protein in humans • Remember 3D structure! • 104 amino acids
Cytochrome c • Comparison between mouse and human • (95/104) = 91% match. • Evolutionary distance measured by similarity value.
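The percent identity calculation above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `percent_identity` and the convention that gap characters count toward the alignment length but never as matches are assumptions made here.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity of two already-aligned sequences of equal length.

    Assumption: '-' marks a gap; gap columns count toward the alignment
    length but are never counted as matches.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment must not be empty")
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * matches / len(aligned_a)


# The mouse/human cytochrome c figure from the slide: 95 matches over 104 positions.
print(round(100.0 * 95 / 104, 1))  # 91.3
```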
Sequence Alignment • How do we perform this alignment? • What considerations do we take into account? • The intuitive solution is to place one sequence along the x axis and the other along the y axis of a grid and compare. • In CS this is a dynamic programming problem. • Small example
Dynamic Programming • Set weights on different situations • +1 match • -1 mismatch • -2 insertion/deletion • Note: this gets expensive • Gaps resolved by insertion/deletion • If treated as individual events, carries a lot of weight • Instead use affine penalty function • penalty = G + nL • G : open penalty • L : extension penalty • n : size of gap
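To see why the affine function helps, the sketch below compares it with a flat per-position penalty. The formula penalty = G + nL is from the slide; the function names and the example values G = 5 and L = 1 are assumptions chosen only for illustration.

```python
def linear_gap_penalty(n: int, per_position: int = 2) -> int:
    """Every gap position is a separate event (the -2 per insertion/deletion above)."""
    return n * per_position


def affine_gap_penalty(n: int, gap_open: int, gap_extend: int) -> int:
    """penalty = G + n*L for a gap of size n; no gap means no penalty."""
    if n <= 0:
        return 0
    return gap_open + n * gap_extend


# Illustrative values only (assumed, not BLAST defaults): G = 5, L = 1.
one_long_gap = affine_gap_penalty(10, gap_open=5, gap_extend=1)
ten_short_gaps = 10 * affine_gap_penalty(1, gap_open=5, gap_extend=1)
print(one_long_gap, ten_short_gaps)  # 15 60
```

With these values a single gap of ten positions costs far less than ten separate one-position gaps, which matches the idea that one insertion or deletion event can span many positions.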
Smith-Waterman • Dynamic programming algorithm with affine gap penalty developed by Smith and Waterman in 1981 • Guarantees optimal local alignment • Local alignment avoids noise • Reliable statistical model for optimal local alignments. • Expectation that optimal alignment would occur by chance. (Karlin-Altschul). • Low expectation = high chance they are homologous • Open source implementations available
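A minimal Smith-Waterman sketch follows. To keep it short it uses a simplified linear gap penalty rather than the affine one, and the +1 / -1 / -2 weights from the dynamic programming slide; the function name and return shape are assumptions for this example.

```python
def smith_waterman(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Return (best_score, aligned_a, aligned_b) for one optimal local alignment.

    Simplification: linear gap penalty (each gap position costs `gap`).
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # H[i][j]: best score ending at a[i-1], b[j-1]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                      # start a new local alignment
                          H[i - 1][j - 1] + sub,  # match or mismatch
                          H[i - 1][j] + gap,      # gap in b
                          H[i][j - 1] + gap)      # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score falls to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTTACGTTT", "GGACGTGG"))  # (4, 'ACGT', 'ACGT')
```

The `max(0, ...)` term is what makes the alignment local: a poorly scoring prefix is dropped instead of dragging the score down.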
Substitution Matrix • +1 for match not good metric. • Substitution matrix • Probability one amino acid mutates into another amino acid. • Built empirically with large dataset to reflect true probabilities of mutation during evolutionary process • BLOSUM, PAM common examples. • BLOSUM for local alignment • PAM for global
Karlin-Altschul scoring • Expected frequency of a High-scoring Segment Pair (HSP) occurring by chance: • E = K M N e^(-λS) • S: alignment score • λ: unique positive value used to normalize score • MN: search space (|query| * |target|) • K: constant, approx. 0.1 • E vs. MN: linear change • E vs. S: exponential change (small change in score leads to big effect on value). • Assumptions: • i.i.d. (independent and identically distributed) • Roughly equal in length
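The linear and exponential behaviour of E can be checked numerically. The sketch below implements E = K M N e^(-λS) directly; the function name `expected_hsps` and the value λ = 0.3 are assumptions for illustration (real λ and K depend on the scoring system).

```python
import math


def expected_hsps(score: float, m: int, n: int, lam: float, k: float = 0.1) -> float:
    """E = K * M * N * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * score)


LAM = 0.3  # illustrative value only (assumed)

base = expected_hsps(score=50, m=100, n=1000, lam=LAM)
double_space = expected_hsps(score=50, m=200, n=1000, lam=LAM)
higher_score = expected_hsps(score=60, m=100, n=1000, lam=LAM)

# Doubling the search space doubles E; adding 10 to the score
# multiplies E by exp(-lambda * 10).
print(double_space / base, higher_score / base)
```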
BLAST!!! • Basic Local Alignment Search Tool
BLAST • Pre-indexed database • Position of every 'word' is remembered • High-scoring Segment Pair (HSP) • Local alignment with no gaps that scores high. • Query starts with best HSP and expands from there.
BLAST search • Query • Break query into 'words' ('k' character strings) • n – k + 1 words • n = query length • k = word length (default in BLAST is 11 for nucleotides and 3 for proteins) • Scan database for words • Don't just use exact word, use similar words (those scoring at least T against the query word) • When one of these words matches the target, this is a segment pair. • Extend the HSP until the score drops by 'X' below its max value • Report statistically significant scores.
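The word, scan, and extend steps can be sketched as follows. This is a heavily simplified illustration, not BLAST's actual implementation: it seeds on exact word matches only (no similar words), extends only to the right, and scores +1 / -1. All function names are hypothetical.

```python
from collections import defaultdict


def words(seq: str, k: int):
    """All (position, word) pairs: n - k + 1 words for a sequence of length n."""
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]


def build_index(target: str, k: int):
    """Pre-index the target: remember the position of every word."""
    index = defaultdict(list)
    for pos, word in words(target, k):
        index[word].append(pos)
    return index


def seed_hits(query: str, index, k: int):
    """(query_pos, target_pos) pairs where a query word occurs in the target."""
    return [(qpos, tpos) for qpos, word in words(query, k)
            for tpos in index.get(word, [])]


def extend_right(query: str, target: str, qi: int, ti: int, x_drop: int,
                 match: int = 1, mismatch: int = -1):
    """Ungapped extension; stop once the score is x_drop below its maximum.

    Returns (best_score, length_of_best_segment).
    """
    score = best = best_len = length = 0
    while qi + length < len(query) and ti + length < len(target):
        score += match if query[qi + length] == target[ti + length] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score >= x_drop:
            break
    return best, best_len


print(len(words("GATTACA", 3)))  # 5, i.e. n - k + 1 = 7 - 3 + 1
```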
BLAST statistics • HSP has a P value based on Poisson distribution • Small P value means significant score • Applies to ungapped segments • For gapped alignment, calculate E-value • Karlin-Altschul method • E-value decreases exponentially as score increases! • Boils down to: • Results that aren't likely to happen by chance are best. • When we find these, report them as significant • Low E-values mean more significant results.
Using BLAST • http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/tut1.html
BLAST customization • Lowering Neighborhood Word Threshold (T) finds more distantly related sequences • Raising the Segment Extension Cutoff (X) extends the region that might be considered an HSP • Changing (E) Expectation just changes the score threshold (how good it has to be to show up).
BLAST scores • Raw score • Sum of substitution scores minus gap penalties • Substitution score given by sub matrix (BLOSUM, PAM) • Gap penalty calculated using affine penalty function • G: gap opening penalty • L: gap extension penalty • Bit score • Normalized raw score so that scores from different substitution matrices can be compared. • E-value • Expected number of alignments with the same or better score that would occur by chance • Based on the Karlin-Altschul method
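The bit score normalization can be written as S' = (λS - ln K) / ln 2, after which the E-value becomes E = M N 2^(-S') regardless of the scoring system. The sketch below checks that this agrees with E = K M N e^(-λS); the function names and the values λ = 0.3, K = 0.1 are assumptions for illustration.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value_from_bits(bits: float, m: int, n: int) -> float:
    """E = M * N * 2^(-S'); lambda and K are already folded into the bit score."""
    return m * n * 2.0 ** (-bits)


LAM, K = 0.3, 0.1            # illustrative values only (assumed)
raw, m, n = 50, 100, 1000    # hypothetical raw score and sequence lengths

bits = bit_score(raw, LAM, K)
e_from_bits = e_value_from_bits(bits, m, n)
e_direct = K * m * n * math.exp(-LAM * raw)
print(math.isclose(e_from_bits, e_direct))  # True
```

Because λ and K are absorbed into S', two bit scores can be compared directly even if they came from different substitution matrices.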
Sources • http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=bioinfo.chapter.A05 • http://www.ludwig.edu.au/course/course2002/talks/flegg02search/sld018.htm • http://en.wikipedia.org/wiki/Smith-Waterman_algorithm • http://math.la.asu.edu/~cbs/pdfs/projects/Fall_2005/Karlin-AltschulStatistics.pdf • http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/glossary2.html