Mathematics and computation behind blast and fasta
Sponsored Links
This presentation is the property of its rightful owner.
1 / 25

Mathematics and computation behind BLAST and FASTA PowerPoint PPT Presentation

  • Uploaded on
  • Presentation posted in: General

Mathematics and computation behind BLAST and FASTA. Xuhua Xia Bioinformatics-enabled research. Sequence variation: UU C U C AA CC AA CC A U AAA G A U A U UU C U C U A C AAA CC A C AAA G A C A U UU C U C AA CC AA CC A U AAA G A U A U

Download Presentation

Mathematics and computation behind BLAST and FASTA

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

Mathematics and computation behind BLAST and FASTA

Xuhua Xia

Bioinformatics-enabled research

Sequence variation:









  • Difference in

  • coding sequences

  • Regulatory sequences

  • transcription

  • splicing

  • translation

  • translated sequences

  • Difference in

  • protein abundance

  • protein structure

  • cellular localization

  • protein interaction partners

Difference in biochemical function

  • Difference in

  • susceptibility to diseases

  • response to medicine

  • Fitness (survival and reproductive success)

  • Difference in phenotype

  • morphological

  • physiological

  • behavioural

Personalized medicine

Conservation strategies

Evolutionary mechanisms


Nurturing environment

Slide 2

Why string matching?

  • Efficient search against large sequence databases

  • Practical significance from early applications

    • Sequence similarity between an oncogene (genes in viruses that cause a cancer-like transformation of the infected cells), v-sis, and the platelet-derived growth factor (PDGF)

      • M. D. Waterfield et al. 1983. Nature 304:35-39

      • R. F. Doolittle et al. 1983. Science 221:275-227

    • Contig assembly

    • Functional annotation by homology search

  • Fast computational methods in string matching

    • FASTA

    • BLAST

    • Local pair-wise alignment by dynamic programming

Slide 3

Basic stats in string matching

  • Given PA, PC, PG, PT in a target (database) sequence, the probability of a query sequence, say, ATTGCC, having a perfect match of the target sequence is:prob = PAPTPT PGPCPC = PA (PC)2 PG (PT)2

  • Let M be the target sequence length and N be the query sequence length, the “matching operation” can be performed (M – N +1) times, e.g., Query: ATGTarget CGATTGCCCG

  • The probability distribution of the number of matches follows (approximately) a binomial distribution with p = prob and n = (M – N +1)

Slide 4

Basic stats in string matching

  • Probability of having a sequence match: p

  • Probability of having no match: q = 1-p

  • Binomial distribution:

  • When np > 50, the binomial distribution can be approximated by the normal distribution with the mean = np and variance = npq

  • When np < 1 and n is very large, binomial distribution can be approximated by the Poisson distribution with mean and variance equal to np (i.e.,  = 2 = np).

Slide 5

From Binomial to Poisson

Slide 6

Matching two sequences without gap

  • Assuming equal nucleotide frequencies, the probability of a nucleotide site in the query sequence matching a site in the target sequence is p = 0.25.

  • The probability of finding an exact match of L letters is a = pL = 0.25L = 2-2L = 2-S, where S is called the bit score in BLAST.

  • M: query length; N: target length, e.g., M = 8, N = 5, L = 3AACGGTTCCGGTT

  • A sequence of length L can move at (M – L +1) distinct sites along the query and (N – L +1) distinct sites along the target.

  • m = (M-L+1) and n = (N-L+1) are called effective lengths of the two sequences.

  • The expected number of matches with length L is mn2-S, which is called E-value in ungapped BLAST.

  • S is calculated differently in the gapped BLAST

Slide 7

Blast Output (Nuc. Seq.)

BLASTN 2.2.4 [Aug-26-2002]


Query= Seq1 38

Database: MgCDS

480 sequences; 526,317 total letters

Score E

Sequences producing significant alignments: (bits) Value

MG001 1095 bases 34 7e-004

Score = 34.2 bits (17), Expect = 7e-004

Identities = 35/40 (87%), Gaps = 2/40 (5%)

Query: 1 atgaataacg--attatttccaacgacaaaacaaaaccac 38

|||||||||| ||||||||||| |||||| ||||||||

Sbjct: 1 atgaataacgttattatttccaataacaaaataaaaccac 40

Lambda K H

1.37 0.711 1.31

Matrix: blastn matrix:1 -3

Gap Penalties: Existence: 5, Extension: 2

effective length of query: 26

effective length of database: 520,557

Constant gap penalty vs affine function penalty

Typically one would count only 1 GE here.

Matches: 35*1 = 35

Mismatches: 3*(-3) = -9

Gap Open: 1*5 = 5

Gap extension: 2*2 =4

R = 35 - 9 - 5 - 4 = 17

S = [λR – ln(K)]/ln(2) =[1.37*17-ln(0.711)]/ln(2) = 34

E = mn2-S = 26 * 520557 * 2-34 = 7.878E-04






Slide 8

Lambda () and K

BLAST output includes lambda () and K. Mathematically,  is defined as follows:

where pi, pj are nucleotide frequencies (i,j = A, C, G, or T), and sij is the match (when i = j) or mismatch (when i  j) score. In nucleotide BLAST by default, we have sii = 1 and sij = -3. In the simplest case with equal nucleotide frequencies, i.e., when pi = 0.25, the equation above is reduced to

(for amino acid sequences)

See the updated Chapter 1 and BLASTParameter.xlsm on how to compute K.

E-Value in BLAST

  • The e-value is the expected number of random matches that is equally good or better than the reported match. It can be a number near zero or much larger than 1.

  • It is NOT the probability of finding the reported match.

  • Only when the e-value is extremely small can it be interpreted as the probability of finding 1 match that is as good as the reported one (see next slide).

Slide 10

E-value and P(1)

Slide 11

Gapped BLAST

  • Adapted from Crane & Raymer 2003

  • Input sequence: AILVPTVIGCTVPT

  • Algorithm:

    • Break the query sequence into words:AILV, ILVP, LVPT, VPTV, PTVI, TVIG, VIGC, IGCT, GCTV, CTVP, TVPT

    • Discard common words (i.e., words made entirely of common amino acids)

    • Search for matches against database sequences, assess significance and decide whether to discard to continue with extension using dynamic programming:AILVPTVIGCTVPTMVQGWALYDFLKCRAILVPTVIACTCVAMLALYDFLKC

Slide 12

BLAST Programs

Slide 13


  • Another commonly used family of alignment and search tools

  • Generally considered to be more sensitive than BLAST.

  • Illustration with two fictitious sequences used in the Contig Assembly lecture:Seq1: ACCGCGATGACGAATASeq2: GAATACGACTGACGATGGASeq1: ACCGCGATGACGAATASeq2: GAATACGACTGACGATGGA

Slide 14

String Match in FASTA

Left and Right: -n means moving the query left by n sites and n means moving the query right by n sites.

Slide 15

Alternative Matched Strings



From lecture on contig assembly:



From FASTA algorithm:





Which one is best based on YOUR judgment?

One of the three 3rd best


2nd best

Slide 16

Word length of 2





One of the three 2nd best


Slide 17

Comparison: BLAST and FASTA

  • BLAST starts with exact string matching, while FASTA starts with inexact string matching (or exact string matching with a shorter words). BLAST is faster than FASTA.

  • For the examples given, both BLAST and FASTA will find the same best match, i.e., shifting the query sequence by 2 sites to the right.

  • Both perform dynamic programming for extending the match after the initial match.

Slide 18

Optional: BLAST Parameters

  • Lambda  and Karlin-Altschul (K) parameters are important because they directly affect the computation of E value.

  • Both  and K depend on

    • nucleotide (or aminon acid) frequencies

    • match-mismatch matrix

  • All BLAST implementations generally assume that nucleotide (or amino acid) sequences have roughly equal frequencies.

  • For nucleotide (or amino acid) sequences with strongly biased frequencies, BLAST E value obtained with the assumption can be quite misleading, i.e., one should use appropriate  and K.

Case 1: equal , (-3,1)

Case 2: Different , (-3, 1)

Case 3: Different , s/v

K: case 1

K: Case 2

Bioinformatics research workflow

Accumulation of nucleotide and amino acid sequences:









Functional genomics

Systems biology

Digital cells

  • Storage and annotation of the sequences

  • Structural annotation with homology search and de novo gene prediction

  • Functional annotation with gene ontologies

Species-specific gene dictionaries, e.g.,

  • Gene/Protein families (e.g., Pfam)

  • Cluster of orthologous genes (e.g., COG)

  • Supermatrix of gene presence/absence

  • Genome-based pair-wise distance distributions

  • Comparative genomics (the origin of new genes, new features and new species)

  • Phylogenetics (cladogenic process, dating of speciation and gene duplication events)

  • Phylogeny-based inference.




Slide 25

  • Login