- 114 Views
- Uploaded on
- Presentation posted in: General

Mathematics and computation behind BLAST and FASTA

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Mathematics and computation behind BLAST and FASTA

Xuhua Xia

http://dambe.bio.uottawa.ca

Sequence variation:

UUCUCAACCAACCAUAAAGAUAU

UUCUCUACAAACCACAAAGACAU

UUCUCAACCAACCAUAAAGAUAU

UUCUCAACCAACCACAAAGACAU

UUCUCCACGAACCACAAAGAUAU

UUCUCUACAAACCACAAAGAUAU

UUCUCAACCAACCACAAAGACAU

UUCUCUACUAACCACAAAGACAU

- Difference in
- coding sequences
- Regulatory sequences
- transcription
- splicing
- translation
- translated sequences

- Difference in
- protein abundance
- protein structure
- cellular localization
- protein interaction partners

Difference in biochemical function

- Difference in
- susceptibility to diseases
- response to medicine
- Fitness (survival and reproductive success)

- Difference in phenotype
- morphological
- physiological
- behavioural

Personalized medicine

Conservation strategies

Evolutionary mechanisms

...

Nurturing environment

Slide 2

- Efficient search against large sequence databases
- Practical significance from early applications
- Sequence similarity between an oncogene (genes in viruses that cause a cancer-like transformation of the infected cells), v-sis, and the platelet-derived growth factor (PDGF)
- M. D. Waterfield et al. 1983. Nature 304:35-39
- R. F. Doolittle et al. 1983. Science 221:275-227

- Contig assembly
- Functional annotation by homology search

- Sequence similarity between an oncogene (genes in viruses that cause a cancer-like transformation of the infected cells), v-sis, and the platelet-derived growth factor (PDGF)
- Fast computational methods in string matching
- FASTA
- BLAST
- Local pair-wise alignment by dynamic programming

Slide 3

- Given PA, PC, PG, PT in a target (database) sequence, the probability of a query sequence, say, ATTGCC, having a perfect match of the target sequence is:prob = PAPTPT PGPCPC = PA (PC)2 PG (PT)2
- Let M be the target sequence length and N be the query sequence length, the “matching operation” can be performed (M – N +1) times, e.g., Query: ATGTarget CGATTGCCCG
- The probability distribution of the number of matches follows (approximately) a binomial distribution with p = prob and n = (M – N +1)

Slide 4

- Probability of having a sequence match: p
- Probability of having no match: q = 1-p
- Binomial distribution:
- When np > 50, the binomial distribution can be approximated by the normal distribution with the mean = np and variance = npq
- When np < 1 and n is very large, binomial distribution can be approximated by the Poisson distribution with mean and variance equal to np (i.e., = 2 = np).

Slide 5

Slide 6

- Assuming equal nucleotide frequencies, the probability of a nucleotide site in the query sequence matching a site in the target sequence is p = 0.25.
- The probability of finding an exact match of L letters is a = pL = 0.25L = 2-2L = 2-S, where S is called the bit score in BLAST.
- M: query length; N: target length, e.g., M = 8, N = 5, L = 3AACGGTTCCGGTT
- A sequence of length L can move at (M – L +1) distinct sites along the query and (N – L +1) distinct sites along the target.
- m = (M-L+1) and n = (N-L+1) are called effective lengths of the two sequences.
- The expected number of matches with length L is mn2-S, which is called E-value in ungapped BLAST.
- S is calculated differently in the gapped BLAST

Slide 7

BLASTN 2.2.4 [Aug-26-2002]

...

Query= Seq1 38

Database: MgCDS

480 sequences; 526,317 total letters

Score E

Sequences producing significant alignments: (bits) Value

MG001 1095 bases 34 7e-004

Score = 34.2 bits (17), Expect = 7e-004

Identities = 35/40 (87%), Gaps = 2/40 (5%)

Query: 1 atgaataacg--attatttccaacgacaaaacaaaaccac 38

|||||||||| ||||||||||| |||||| ||||||||

Sbjct: 1 atgaataacgttattatttccaataacaaaataaaaccac 40

Lambda K H

1.37 0.711 1.31

Matrix: blastn matrix:1 -3

Gap Penalties: Existence: 5, Extension: 2

…

effective length of query: 26

effective length of database: 520,557

Constant gap penalty vs affine function penalty

Typically one would count only 1 GE here.

Matches: 35*1 = 35

Mismatches: 3*(-3) = -9

Gap Open: 1*5 = 5

Gap extension: 2*2 =4

R = 35 - 9 - 5 - 4 = 17

S = [λR – ln(K)]/ln(2) =[1.37*17-ln(0.711)]/ln(2) = 34

E = mn2-S = 26 * 520557 * 2-34 = 7.878E-04

xp(x)

00.999265217

10.000734513

20.000000270

30.000000000

Slide 8

BLAST output includes lambda () and K. Mathematically, is defined as follows:

where pi, pj are nucleotide frequencies (i,j = A, C, G, or T), and sij is the match (when i = j) or mismatch (when i j) score. In nucleotide BLAST by default, we have sii = 1 and sij = -3. In the simplest case with equal nucleotide frequencies, i.e., when pi = 0.25, the equation above is reduced to

(for amino acid sequences)

See the updated Chapter 1 and BLASTParameter.xlsm on how to compute K.

- The e-value is the expected number of random matches that is equally good or better than the reported match. It can be a number near zero or much larger than 1.
- It is NOT the probability of finding the reported match.
- Only when the e-value is extremely small can it be interpreted as the probability of finding 1 match that is as good as the reported one (see next slide).

Slide 10

Slide 11

- Adapted from Crane & Raymer 2003
- Input sequence: AILVPTVIGCTVPT
- Algorithm:
- Break the query sequence into words:AILV, ILVP, LVPT, VPTV, PTVI, TVIG, VIGC, IGCT, GCTV, CTVP, TVPT
- Discard common words (i.e., words made entirely of common amino acids)
- Search for matches against database sequences, assess significance and decide whether to discard to continue with extension using dynamic programming:AILVPTVIGCTVPTMVQGWALYDFLKCRAILVPTVIACTCVAMLALYDFLKC

Slide 12

Slide 13

- Another commonly used family of alignment and search tools
- Generally considered to be more sensitive than BLAST.
- Illustration with two fictitious sequences used in the Contig Assembly lecture:Seq1: ACCGCGATGACGAATASeq2: GAATACGACTGACGATGGASeq1: ACCGCGATGACGAATASeq2: GAATACGACTGACGATGGA

Slide 14

Left and Right: -n means moving the query left by n sites and n means moving the query right by n sites.

Slide 15

Query: ACCGCGATGACGAATA

Target:GAATACGACTGACGATGGA

From lecture on contig assembly:

Query: ACCGCGATGACGAATA

Target: GAATACGACTGACGATGGA

From FASTA algorithm:

Query: ACCGCGATGACGAATA

Target: GAATACGACTGACGATGGA

Query: ACCGCGATGACGAATA

Target: GAATACGACTGACGATGGA

Which one is best based on YOUR judgment?

One of the three 3rd best

Best

2nd best

Slide 16

Query: ACCGCGATGACGAATA

Target: GAATACGACTGACGATGGA

Query: ACCGCGATGACGAATA

Target: GAATACGACTGACGATGGA

One of the three 2nd best

Best

Slide 17

- BLAST starts with exact string matching, while FASTA starts with inexact string matching (or exact string matching with a shorter words). BLAST is faster than FASTA.
- For the examples given, both BLAST and FASTA will find the same best match, i.e., shifting the query sequence by 2 sites to the right.
- Both perform dynamic programming for extending the match after the initial match.

Slide 18

- Lambda and Karlin-Altschul (K) parameters are important because they directly affect the computation of E value.
- Both and K depend on
- nucleotide (or aminon acid) frequencies
- match-mismatch matrix

- All BLAST implementations generally assume that nucleotide (or amino acid) sequences have roughly equal frequencies.
- For nucleotide (or amino acid) sequences with strongly biased frequencies, BLAST E value obtained with the assumption can be quite misleading, i.e., one should use appropriate and K.

Accumulation of nucleotide and amino acid sequences:

UUCUCAACCAACCAUAAAGAUAU

UUCUCUACAAACCACAAAGACAU

UUCUCAACCAACCAUAAAGAUAU

UUCUCAACCAACCACAAAGACAU

UUCUCCACGAACCACAAAGAUAU

UUCUCUACAAACCACAAAGAUAU

UUCUCAACCAACCACAAAGACAU

UUCUCUACUAACCACAAAGACAU

Functional genomics

Systems biology

Digital cells

- Storage and annotation of the sequences
- Structural annotation with homology search and de novo gene prediction
- Functional annotation with gene ontologies

Species-specific gene dictionaries, e.g., yeastgenome.org

- Gene/Protein families (e.g., Pfam)
- Cluster of orthologous genes (e.g., COG)
- Supermatrix of gene presence/absence
- Genome-based pair-wise distance distributions

- Comparative genomics (the origin of new genes, new features and new species)
- Phylogenetics (cladogenic process, dating of speciation and gene duplication events)
- Phylogeny-based inference.

Mutation

Selection

Adaptation

Slide 25