Mathematics and computation behind blast and fasta
This presentation is the property of its rightful owner.
Sponsored Links
1 / 25

Mathematics and computation behind BLAST and FASTA PowerPoint PPT Presentation


  • 107 Views
  • Uploaded on
  • Presentation posted in: General

Mathematics and computation behind BLAST and FASTA. Xuhua Xia [email protected] http://dambe.bio.uottawa.ca. Bioinformatics-enabled research. Sequence variation: UU C U C AA CC AA CC A U AAA G A U A U UU C U C U A C AAA CC A C AAA G A C A U UU C U C AA CC AA CC A U AAA G A U A U

Download Presentation

Mathematics and computation behind BLAST and FASTA

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Mathematics and computation behind blast and fasta

Mathematics and computation behind BLAST and FASTA

Xuhua Xia

[email protected]

http://dambe.bio.uottawa.ca


Bioinformatics enabled research

Bioinformatics-enabled research

Sequence variation:

UUCUCAACCAACCAUAAAGAUAU

UUCUCUACAAACCACAAAGACAU

UUCUCAACCAACCAUAAAGAUAU

UUCUCAACCAACCACAAAGACAU

UUCUCCACGAACCACAAAGAUAU

UUCUCUACAAACCACAAAGAUAU

UUCUCAACCAACCACAAAGACAU

UUCUCUACUAACCACAAAGACAU

  • Difference in

  • coding sequences

  • Regulatory sequences

  • transcription

  • splicing

  • translation

  • translated sequences

  • Difference in

  • protein abundance

  • protein structure

  • cellular localization

  • protein interaction partners

Difference in biochemical function

  • Difference in

  • susceptibility to diseases

  • response to medicine

  • Fitness (survival and reproductive success)

  • Difference in phenotype

  • morphological

  • physiological

  • behavioural

Personalized medicine

Conservation strategies

Evolutionary mechanisms

...

Nurturing environment

Slide 2


Why string matching

Why string matching?

  • Efficient search against large sequence databases

  • Practical significance from early applications

    • Sequence similarity between an oncogene (genes in viruses that cause a cancer-like transformation of the infected cells), v-sis, and the platelet-derived growth factor (PDGF)

      • M. D. Waterfield et al. 1983. Nature 304:35-39

      • R. F. Doolittle et al. 1983. Science 221:275-227

    • Contig assembly

    • Functional annotation by homology search

  • Fast computational methods in string matching

    • FASTA

    • BLAST

    • Local pair-wise alignment by dynamic programming

Slide 3


Basic stats in string matching

Basic stats in string matching

  • Given PA, PC, PG, PT in a target (database) sequence, the probability of a query sequence, say, ATTGCC, having a perfect match of the target sequence is:prob = PAPTPT PGPCPC = PA (PC)2 PG (PT)2

  • Let M be the target sequence length and N be the query sequence length, the “matching operation” can be performed (M – N +1) times, e.g., Query: ATGTarget CGATTGCCCG

  • The probability distribution of the number of matches follows (approximately) a binomial distribution with p = prob and n = (M – N +1)

Slide 4


Basic stats in string matching1

Basic stats in string matching

  • Probability of having a sequence match: p

  • Probability of having no match: q = 1-p

  • Binomial distribution:

  • When np > 50, the binomial distribution can be approximated by the normal distribution with the mean = np and variance = npq

  • When np < 1 and n is very large, binomial distribution can be approximated by the Poisson distribution with mean and variance equal to np (i.e.,  = 2 = np).

Slide 5


From binomial to poisson

From Binomial to Poisson

Slide 6


Matching two sequences without gap

Matching two sequences without gap

  • Assuming equal nucleotide frequencies, the probability of a nucleotide site in the query sequence matching a site in the target sequence is p = 0.25.

  • The probability of finding an exact match of L letters is a = pL = 0.25L = 2-2L = 2-S, where S is called the bit score in BLAST.

  • M: query length; N: target length, e.g., M = 8, N = 5, L = 3AACGGTTCCGGTT

  • A sequence of length L can move at (M – L +1) distinct sites along the query and (N – L +1) distinct sites along the target.

  • m = (M-L+1) and n = (N-L+1) are called effective lengths of the two sequences.

  • The expected number of matches with length L is mn2-S, which is called E-value in ungapped BLAST.

  • S is calculated differently in the gapped BLAST

Slide 7


Blast output nuc seq

Blast Output (Nuc. Seq.)

BLASTN 2.2.4 [Aug-26-2002]

...

Query= Seq1 38

Database: MgCDS

480 sequences; 526,317 total letters

Score E

Sequences producing significant alignments: (bits) Value

MG001 1095 bases 34 7e-004

Score = 34.2 bits (17), Expect = 7e-004

Identities = 35/40 (87%), Gaps = 2/40 (5%)

Query: 1 atgaataacg--attatttccaacgacaaaacaaaaccac 38

|||||||||| ||||||||||| |||||| ||||||||

Sbjct: 1 atgaataacgttattatttccaataacaaaataaaaccac 40

Lambda K H

1.37 0.711 1.31

Matrix: blastn matrix:1 -3

Gap Penalties: Existence: 5, Extension: 2

effective length of query: 26

effective length of database: 520,557

Constant gap penalty vs affine function penalty

Typically one would count only 1 GE here.

Matches: 35*1 = 35

Mismatches: 3*(-3) = -9

Gap Open: 1*5 = 5

Gap extension: 2*2 =4

R = 35 - 9 - 5 - 4 = 17

S = [λR – ln(K)]/ln(2) =[1.37*17-ln(0.711)]/ln(2) = 34

E = mn2-S = 26 * 520557 * 2-34 = 7.878E-04

xp(x)

00.999265217

10.000734513

20.000000270

30.000000000

Slide 8


Lambda and k

Lambda () and K

BLAST output includes lambda () and K. Mathematically,  is defined as follows:

where pi, pj are nucleotide frequencies (i,j = A, C, G, or T), and sij is the match (when i = j) or mismatch (when i  j) score. In nucleotide BLAST by default, we have sii = 1 and sij = -3. In the simplest case with equal nucleotide frequencies, i.e., when pi = 0.25, the equation above is reduced to

(for amino acid sequences)

See the updated Chapter 1 and BLASTParameter.xlsm on how to compute K.


E value in blast

E-Value in BLAST

  • The e-value is the expected number of random matches that is equally good or better than the reported match. It can be a number near zero or much larger than 1.

  • It is NOT the probability of finding the reported match.

  • Only when the e-value is extremely small can it be interpreted as the probability of finding 1 match that is as good as the reported one (see next slide).

Slide 10


E value and p 1

E-value and P(1)

Slide 11


Gapped blast

Gapped BLAST

  • Adapted from Crane & Raymer 2003

  • Input sequence: AILVPTVIGCTVPT

  • Algorithm:

    • Break the query sequence into words:AILV, ILVP, LVPT, VPTV, PTVI, TVIG, VIGC, IGCT, GCTV, CTVP, TVPT

    • Discard common words (i.e., words made entirely of common amino acids)

    • Search for matches against database sequences, assess significance and decide whether to discard to continue with extension using dynamic programming:AILVPTVIGCTVPTMVQGWALYDFLKCRAILVPTVIACTCVAMLALYDFLKC

Slide 12


Blast programs

BLAST Programs

Slide 13


Fasta

FASTA

  • Another commonly used family of alignment and search tools

  • Generally considered to be more sensitive than BLAST.

  • Illustration with two fictitious sequences used in the Contig Assembly lecture:Seq1: ACCGCGATGACGAATASeq2: GAATACGACTGACGATGGASeq1: ACCGCGATGACGAATASeq2: GAATACGACTGACGATGGA

Slide 14


String match in fasta

String Match in FASTA

Left and Right: -n means moving the query left by n sites and n means moving the query right by n sites.

Slide 15


Alternative matched strings

Alternative Matched Strings

Query: ACCGCGATGACGAATA

Target:GAATACGACTGACGATGGA

From lecture on contig assembly:

Query: ACCGCGATGACGAATA

Target: GAATACGACTGACGATGGA

From FASTA algorithm:

Query: ACCGCGATGACGAATA

Target: GAATACGACTGACGATGGA

Query: ACCGCGATGACGAATA

Target: GAATACGACTGACGATGGA

Which one is best based on YOUR judgment?

One of the three 3rd best

Best

2nd best

Slide 16


Word length of 2

Word length of 2

Query: ACCGCGATGACGAATA

Target: GAATACGACTGACGATGGA

Query: ACCGCGATGACGAATA

Target: GAATACGACTGACGATGGA

One of the three 2nd best

Best

Slide 17


Comparison blast and fasta

Comparison: BLAST and FASTA

  • BLAST starts with exact string matching, while FASTA starts with inexact string matching (or exact string matching with a shorter words). BLAST is faster than FASTA.

  • For the examples given, both BLAST and FASTA will find the same best match, i.e., shifting the query sequence by 2 sites to the right.

  • Both perform dynamic programming for extending the match after the initial match.

Slide 18


Optional blast parameters

Optional: BLAST Parameters

  • Lambda  and Karlin-Altschul (K) parameters are important because they directly affect the computation of E value.

  • Both  and K depend on

    • nucleotide (or aminon acid) frequencies

    • match-mismatch matrix

  • All BLAST implementations generally assume that nucleotide (or amino acid) sequences have roughly equal frequencies.

  • For nucleotide (or amino acid) sequences with strongly biased frequencies, BLAST E value obtained with the assumption can be quite misleading, i.e., one should use appropriate  and K.


Case 1 equal 3 1

Case 1: equal , (-3,1)


Case 2 different 3 1

Case 2: Different , (-3, 1)


Case 3 different s v

Case 3: Different , s/v


K case 1

K: case 1


K case 2

K: Case 2


Bioinformatics research workflow

Bioinformatics research workflow

Accumulation of nucleotide and amino acid sequences:

UUCUCAACCAACCAUAAAGAUAU

UUCUCUACAAACCACAAAGACAU

UUCUCAACCAACCAUAAAGAUAU

UUCUCAACCAACCACAAAGACAU

UUCUCCACGAACCACAAAGAUAU

UUCUCUACAAACCACAAAGAUAU

UUCUCAACCAACCACAAAGACAU

UUCUCUACUAACCACAAAGACAU

Functional genomics

Systems biology

Digital cells

  • Storage and annotation of the sequences

  • Structural annotation with homology search and de novo gene prediction

  • Functional annotation with gene ontologies

Species-specific gene dictionaries, e.g., yeastgenome.org

  • Gene/Protein families (e.g., Pfam)

  • Cluster of orthologous genes (e.g., COG)

  • Supermatrix of gene presence/absence

  • Genome-based pair-wise distance distributions

  • Comparative genomics (the origin of new genes, new features and new species)

  • Phylogenetics (cladogenic process, dating of speciation and gene duplication events)

  • Phylogeny-based inference.

Mutation

Selection

Adaptation

Slide 25


  • Login