Mathematics and computation behind BLAST and FASTA
Xuhua Xia
http://dambe.bio.uottawa.ca

Bioinformatics-enabled research
Sequence variation:
UUCUCAACCAACCAUAAAGAUAU
UUCUCUACAAACCACAAAGACAU
UUCUCAACCAACCAUAAAGAUAU
Difference in biochemical function
BLASTN 2.2.4 [Aug-26-2002]
Query= Seq1 38
480 sequences; 526,317 total letters
Sequences producing significant alignments:    Score (bits)    E value
MG001 1095 bases                               34              7e-004
Score = 34.2 bits (17), Expect = 7e-004
Identities = 35/40 (87%), Gaps = 2/40 (5%)
Query: 1 atgaataacg--attatttccaacgacaaaacaaaaccac 38
         ||||||||||  |||||||||||  |||||| ||||||||
Sbjct: 1 atgaataacgttattatttccaataacaaaataaaaccac 40
Lambda K H
1.37 0.711 1.31
Matrix: blastn matrix:1 -3
Gap Penalties: Existence: 5, Extension: 2
effective length of query: 26
effective length of database: 520,557
Constant gap penalty vs affine function penalty
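Typically one would count only one gap extension (GE) for this 2-site gap, but the calculation below charges an extension for each of the two gapped sites.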
Matches: 35*1 = 35
Mismatches: 3*(-3) = -9
Gap Open: 1*5 = 5
Gap extension: 2*2 =4
R = 35 - 9 - 5 - 4 = 17
S = [λR - ln(K)]/ln(2) = [1.37*17 - ln(0.711)]/ln(2) ≈ 34
E = m*n*2^(-S) = 26 * 520557 * 2^(-34) = 7.878E-04, where m and n are the effective lengths of the query and the database
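The arithmetic above can be reproduced with a short Python sketch. The function names are illustrative, not part of BLAST; the scoring values (match 1, mismatch -3, gap existence 5, gap extension 2) and λ, K, m, n are taken from the BLAST output shown earlier. The bit score is rounded to 34 before computing E, as in the hand calculation.

```python
import math

def raw_score(matches, mismatches, gap_opens, gap_extensions,
              match=1, mismatch=-3, gap_open=5, gap_extend=2):
    """Raw alignment score R from counts of matches, mismatches and gaps."""
    return (matches * match + mismatches * mismatch
            - gap_opens * gap_open - gap_extensions * gap_extend)

def bit_score(raw, lam, k):
    """Bit score S = [lambda*R - ln(K)] / ln(2)."""
    return (lam * raw - math.log(k)) / math.log(2)

def e_value(m, n, bits):
    """E = m * n * 2^(-S), with m and n the effective lengths."""
    return m * n * 2.0 ** (-bits)

R = raw_score(matches=35, mismatches=3, gap_opens=1, gap_extensions=2)
S = bit_score(R, lam=1.37, k=0.711)
E = e_value(m=26, n=520557, bits=round(S))  # S rounded to 34, as above
print(R, round(S, 1), f"{E:.3E}")
```

With the unrounded bit score the E value comes out slightly smaller, which is consistent with the 7e-004 that BLAST reports.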
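BLAST output includes lambda (λ) and K. Mathematically, λ is defined as the positive solution of

$$\sum_{i}\sum_{j} p_i p_j e^{\lambda s_{ij}} = 1$$

where $p_i$, $p_j$ are nucleotide frequencies (i, j = A, C, G, or T), and $s_{ij}$ is the match (when i = j) or mismatch (when i ≠ j) score. In nucleotide BLAST by default, we have $s_{ii} = 1$ and $s_{ij} = -3$. In the simplest case with equal nucleotide frequencies, i.e., when $p_i = 0.25$, the equation above is reduced to

$$\frac{1}{4} e^{\lambda} + \frac{3}{4} e^{-3\lambda} = 1$$

which gives λ ≈ 1.37, the value in the BLAST output.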
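As a rough check, λ can be found numerically as the positive root of sum_i sum_j p_i p_j exp(λ s_ij) = 1. The sketch below uses simple bisection; the function name, bracket and tolerance are assumptions made for illustration, not how BLAST itself computes λ.

```python
import math

def solve_lambda(freqs, match=1.0, mismatch=-3.0, lo=1e-6, hi=10.0, tol=1e-12):
    """Positive root of sum_i sum_j p_i p_j exp(lam * s_ij) - 1 by bisection.

    Assumes the expected score is negative, so the function is negative
    just to the right of the trivial root lam = 0 and positive at `hi`.
    """
    def f(lam):
        total = 0.0
        for i, p_i in freqs.items():
            for j, p_j in freqs.items():
                s_ij = match if i == j else mismatch
                total += p_i * p_j * math.exp(lam * s_ij)
        return total - 1.0

    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if f(mid) < 0:
            lo = mid   # root lies to the right of mid
        else:
            hi = mid   # root lies to the left of mid
    return (lo + hi) / 2.0

equal_freqs = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
lam = solve_lambda(equal_freqs)
print(round(lam, 2))
```

With equal frequencies and the default 1/-3 scores this gives a value close to the 1.37 in the BLAST output; unequal frequencies can be passed in the same dictionary form.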
(for amino acid sequences)
See the updated Chapter 1 and BLASTParameter.xlsm on how to compute K.
Left and Right: -n means moving the query left by n sites and n means moving the query right by n sites.
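One way to read this convention: at shift n, query site i is compared with subject site i + n, so negative n slides the query left and positive n slides it right. The sketch below counts identical sites at every shift; the function name and the toy sequences are made up for illustration and do not come from the slides.

```python
def matches_by_shift(query, subject):
    """Count identical sites for every shift of the query along the subject.

    shift = -n moves the query left by n sites, shift = n moves it right
    by n sites; query site i is compared with subject site i + shift.
    """
    counts = {}
    for shift in range(-(len(query) - 1), len(subject)):
        n_match = 0
        for i, base in enumerate(query):
            j = i + shift
            if 0 <= j < len(subject) and subject[j] == base:
                n_match += 1
        counts[shift] = n_match
    return counts

# Toy sequences (hypothetical): the query matches the subject best
# after being moved right by 2 sites.
counts = matches_by_shift("ACG", "TTACG")
best_shift = max(counts, key=counts.get)
print(best_shift, counts[best_shift])
```

The shift with the most matches marks the diagonal on which query and subject agree best.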
From lecture on contig assembly:
From FASTA algorithm:
Which one is best based on YOUR judgment?
One of the three 3rd best
One of the three 2nd best
Accumulation of nucleotide and amino acid sequences:
Species-specific gene dictionaries, e.g., yeastgenome.org