Bioinformatics Algorithms and Data Structures

BLAST
Lecturer: Dr. Rose
BLAST Slides: Adaptation of Nir Friedman’s slides from the Computational Methods in Molecular Biology course (Spring 2001) at Hebrew University, Jerusalem, Israel
February 21, 2007

BLAST
• Q: What is BLAST?
• A: BLAST is an acronym: Basic Local Alignment Search Tool, a set of similarity search programs designed to explore all of the available sequence databases, regardless of whether the query is protein or DNA.
• You can find it at: http://www.ncbi.nlm.nih.gov/BLAST/

BLAST
• Q: Why do you care?
• A: Because you are going to do a project.
• U51112: Membrane protein that transports sodium and hydrogen
• J03581: Tyrosinase; people lacking this are albino
• NM_000245: MET, an oncogene; mutations in this cause cancer
• NM_010849: MYC, another oncogene
• NM_007409: Alcohol Dehydrogenase; good to have when drinking
• NM_002475: Myosin, one of the muscle proteins
• XM_086788: Crystallin, the major protein in the lens
• M30047: Myelin basic protein; protects the neurons
• NM_000518: Hemoglobin, oxygen carrying protein in RBC
• NM_000477: Albumin, major serum protein; does a lot of things
• NM_008476: Keratin, skin and integument protein

BLAST
• BLAST is designed to efficiently find alignments of a query string s against large databases.
• Motivation: increase speed by finding fewer and better hot spots.
• Idea: find high-scoring matches using a substitution matrix rather than exact matches.
• We are still searching only for gapless matches.

High-Scoring Pair
• Two strings s and t are a high-scoring pair (HSP) if d(s,t) > T.
• Given a query s[1..n], BLAST constructs all words w (fixed-length strings of length k) such that w scores > T against some k-substring of s.
• Each match of such a word in the database is called a hit.
• Typical k: 12 for nucleotides, 3-5 for amino acids.
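The word-construction step can be sketched by brute force. This is only an illustration: the 4-letter alphabet, the toy scoring values in `pair_score`, and the function names are assumptions made for the example, not BLAST's actual matrices or code, and enumerating every possible word is only practical for small k.

```python
from itertools import product

ALPHABET = "ACGT"  # toy alphabet (assumption for this sketch)

def pair_score(a, b):
    """Toy substitution scores (assumption): +2 match, +1 transition, -1 otherwise."""
    if a == b:
        return 2
    if {a, b} in ({"A", "G"}, {"C", "T"}):
        return 1
    return -1

def word_score(u, v):
    """Gapless score of two equal-length words."""
    return sum(pair_score(a, b) for a, b in zip(u, v))

def neighborhood(query, k, T):
    """Map every word w that scores > T against some k-substring of query
    to the list of query offsets where that happens."""
    words = {}
    for i in range(len(query) - k + 1):
        sub = query[i:i + k]
        for letters in product(ALPHABET, repeat=k):
            w = "".join(letters)
            if word_score(w, sub) > T:
                words.setdefault(w, []).append(i)
    return words

seed_words = neighborhood("ACG", k=3, T=4)
print(sorted(seed_words))
```

With these toy scores, only the exact word and the words that differ from it by a single transition pass the threshold.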

High-Scoring Pair
• Try to extend each such hit to an alignment with maximal score (still with no gaps). Keep all HSPs.
• The threshold is chosen so that a random match with such a score is unlikely.
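A minimal sketch of gapless extension, assuming a hit is given as a pair of start positions plus the word length k. The names `extend_hit` and `match_score` and the +1/-1 scoring are hypothetical. The sketch scans all the way to the ends of both strings and keeps the best-scoring extension on each side; BLAST itself stops early once the running score falls too far below the best score seen.

```python
def match_score(a, b):
    """Toy scoring (assumption): +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1

def extend_hit(s, t, i, j, k, score=match_score):
    """Extend the seed s[i:i+k] / t[j:j+k] along its diagonal, with no gaps.
    Returns (best_score, start_in_s, start_in_t, length)."""
    seed = sum(score(s[i + x], t[j + x]) for x in range(k))

    # Extend to the right, remembering the best running total.
    best_r, ext_r, run, x = 0, 0, 0, 0
    while i + k + x < len(s) and j + k + x < len(t):
        run += score(s[i + k + x], t[j + k + x])
        x += 1
        if run > best_r:
            best_r, ext_r = run, x

    # Extend to the left in the same way.
    best_l, ext_l, run, x = 0, 0, 0, 1
    while i - x >= 0 and j - x >= 0:
        run += score(s[i - x], t[j - x])
        if run > best_l:
            best_l, ext_l = run, x
        x += 1

    return seed + best_l + best_r, i - ext_l, j - ext_l, k + ext_l + ext_r

print(extend_hit("TTACGTAA", "GGACGTCC", 2, 2, 3))
```

The extended segment would then be kept as an HSP only if its score exceeds the chosen threshold.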

Finding Potential Matches
We can locate seed words in a large database in a single pass:
• Construct an FSA (finite-state automaton) that recognizes seed words
• Use hashing techniques to locate matching words
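The hashing option can be sketched with a Python dict acting as the hash table. The data layout (a dict from seed word to query offsets, and a dict from sequence name to sequence) and the name `scan_database` are assumptions made for this example.

```python
def scan_database(seed_words, db, k):
    """Slide a window of length k over every database sequence once and
    look each window up in the hash table of seed words.
    seed_words: dict mapping word -> list of query offsets.
    db: dict mapping sequence name -> sequence.
    Returns a list of hits (query_offset, sequence_name, database_position)."""
    hits = []
    for name, seq in db.items():
        for pos in range(len(seq) - k + 1):
            offsets = seed_words.get(seq[pos:pos + k])
            if offsets:
                hits.extend((q, name, pos) for q in offsets)
    return hits

seed_words = {"ACG": [0], "GCG": [0]}
db = {"seq1": "TTACGT", "seq2": "GCGCG"}
print(scan_database(seed_words, db, k=3))
```

Each database position costs one hash lookup, so the scan is linear in the database size regardless of how many seed words there are.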

Extending Potential Matches
• Once a seed is found, BLAST attempts to find a local alignment that extends the seed.
• Seeds on the same diagonal are combined (as in FASTA).
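Grouping seeds by diagonal can be sketched as follows. A hit at query position q and database position d lies on diagonal d - q; the function name and the tuple layout are assumptions for illustration.

```python
from collections import defaultdict

def group_by_diagonal(hits):
    """hits: iterable of (query_pos, db_pos) pairs.
    Returns a dict mapping diagonal (db_pos - query_pos) to its sorted hits."""
    diagonals = defaultdict(list)
    for q, d in hits:
        diagonals[d - q].append((q, d))
    return {diag: sorted(group) for diag, group in diagonals.items()}

print(group_by_diagonal([(0, 5), (3, 8), (1, 2)]))
```

Seeds that land in the same group are candidates for being merged into one gapless segment.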

Which programs are used?
• Originally BLAST did not allow gaps.
• Now people use gapped BLAST.
• Gapped BLAST joins different diagonals.
• For proteins BLAST is superior.
• For nucleotides FASTA is better.

Review: Unrelated Sequences
• Our model of unrelated sequences is simple.
• Each position is sampled independently from a distribution over the alphabet $\Sigma$.
• We assume there is a distribution $q(\cdot)$ that describes the probability of letters in such positions.
• Then: $P(s[1..n], t[1..n] \mid R) = \prod_{i} q(s[i]) \prod_{i} q(t[i])$
• R denotes the assumption that s and t are random unrelated strings.

Review: Related Sequences
• We assume that each pair of aligned positions (s[i],t[i]) evolved from a common ancestor.
• Let p(a,b) be a distribution over pairs of letters.
• p(a,b) is the probability that some ancestral letter evolved into this particular pair of letters.
• Then: $P(s[1..n], t[1..n] \mid M) = \prod_{i} p(s[i], t[i])$
• Here M denotes the assumption that s and t are related strings.

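Review: Ratio Test for Alignment
• $\frac{P(s,t \mid M)}{P(s,t \mid R)} = \frac{\prod_i p(s[i],t[i])}{\prod_i q(s[i]) \prod_i q(t[i])} = \prod_i \frac{p(s[i],t[i])}{q(s[i])\,q(t[i])}$
• Taking logarithm of both sides, we get $\log \frac{P(s,t \mid M)}{P(s,t \mid R)} = \sum_i \log \frac{p(s[i],t[i])}{q(s[i])\,q(t[i])}$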

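Review: Probabilistic Interpretation of Scoring Rule
• If we take $\sigma(a,b) = \log \frac{p(a,b)}{q(a)\,q(b)}$
• then the score of an alignment is the log-ratio between the two models: $\text{score}(s,t) = \sum_i \sigma(s[i],t[i]) = \log \frac{P(s,t \mid M)}{P(s,t \mid R)}$
• Score > 0: M (related) is more “probable”
• Score < 0: R (unrelated) is more “probable”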
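A small numeric sketch of the log-odds rule, assuming the usual definition sigma(a,b) = log(p(a,b) / (q(a) q(b))) with natural logarithms. The two-letter alphabet and the probabilities are made up for illustration.

```python
import math

def log_odds_matrix(p, q):
    """sigma(a, b) = log( p(a, b) / (q(a) * q(b)) )."""
    return {(a, b): math.log(p[(a, b)] / (q[a] * q[b])) for (a, b) in p}

def alignment_score(s, t, sigma):
    """Score of a gapless alignment: sum of sigma over aligned positions."""
    return sum(sigma[(a, b)] for a, b in zip(s, t))

# Toy model (assumption): letters A and B, related pairs mostly identical.
q = {"A": 0.5, "B": 0.5}
p = {("A", "A"): 0.4, ("B", "B"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1}
sigma = log_odds_matrix(p, q)

print(alignment_score("AAB", "AAB", sigma))  # positive: favors the related model
print(alignment_score("AB", "BA", sigma))    # negative: favors the unrelated model
```

Identical pairs occur more often under the related model than under the background, so they get positive scores; mismatched pairs get negative scores.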

Problems with Scoring Rule
When searching for an optimal alignment in a big database, a number of problems arise with this simple scheme.
• We are assuming P(M) = P(R), that is, that the database contains equal numbers of related and unrelated sequences.
• When searching through a big database, there is a high probability that some unrelated sequence will receive a high score.
• When searching for an optimal local alignment, we have many possible starting points, which biases the best score upward and makes an unrelated sequence look related.

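Prior Probability on the models
• What we really wish to calculate is: $\frac{P(M \mid s,t)}{P(R \mid s,t)} = \frac{P(s,t \mid M)\,P(M)}{P(s,t \mid R)\,P(R)}$
• The log score being: $\log \frac{P(M \mid s,t)}{P(R \mid s,t)} = \text{score}(s,t) + \log \frac{P(M)}{P(R)}$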

Prior Probability on the models
• Our threshold should be: $\text{score}(s,t) > \log \frac{P(R)}{P(M)}$
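To get a feel for the size of such a threshold, here is a sketch that assumes the decision rule "call s and t related when score(s,t) + log(P(M)/P(R)) > 0", with natural logarithms and P(R) = 1 - P(M). The function name and the example prior are illustrative assumptions.

```python
import math

def score_threshold(prior_related):
    """Smallest log-odds score (in nats) at which the related model M becomes
    more probable than R, given the prior P(M) = prior_related."""
    return math.log((1.0 - prior_related) / prior_related)

print(score_threshold(0.5))    # equal priors: threshold 0
print(score_threshold(1e-6))   # one related sequence in a million
```

With equal priors the threshold is 0, which is the simple rule; a rarer related sequence pushes the required score up by the log of the prior odds.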

The Hazard of Large Databases
• Define $f(\alpha) = P(\text{score}(s,t) > \alpha \mid R)$
• This is the probability that two unrelated sequences will match with score > $\alpha$ by chance.
• Assume there are N strings in our database.
• Assuming that they are independent of each other, and all are unrelated to s, we have $P(\max_i \text{score}(s,t_i) > \alpha) = 1 - (1 - f(\alpha))^N$

The Hazard of Large Databases
[Plot: curves f(x, 0.001), f(x, 0.0001), f(x, 0.00001), f(x, 0.000001); horizontal axis x from 0 to 100000, vertical axis from 0 to 1]
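The effect of database size can be tabulated directly. This sketch assumes, as on the slide, N independent database strings that are all unrelated to the query, each exceeding the threshold by chance with probability f, so that at least one chance match occurs with probability 1 - (1 - f)^N. The function name is hypothetical.

```python
def prob_chance_hit(f, n):
    """Probability that at least one of n independent unrelated sequences
    exceeds the score threshold, when each does so with probability f."""
    return 1.0 - (1.0 - f) ** n

for f in (1e-3, 1e-4, 1e-5, 1e-6):
    row = [round(prob_chance_hit(f, n), 3) for n in (1_000, 10_000, 100_000)]
    print(f, row)
```

Even a per-sequence chance of one in a thousand makes a spurious match likely once the database holds a thousand sequences.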

Local Matching
• Question: Which local alignment query is expected to give a higher score:
  • To a short sequence?
  • To a long sequence?
• A local match can begin at any of the nm entries in the DP matrix.
• The score is the maximum over all these starting points.
• If all starting points were independent, we would need to calculate the probability of attaining such a score in nm trials.

Score Significance-Fasta
• How meaningful is a score?
• Calculate distribution of scores and related scores
• Under reasonable assumptions the scores for un-gapped alignment behave according to the Extreme Value Distribution.

Extreme Value Distribution (BLAST)
• We ask the following question: given a database of size n and a query sequence of size m, what is the expected number of hits with score at least S?
• This number is called the E-score: $E(S) = K\,m\,n\,e^{-\lambda S}$
• Notice that the number of such hits follows a Poisson distribution with mean E(S).
• K corrects for the dependencies.
• $\lambda$ depends on the scoring matrix.
• Doubling n or m (the database size or the sequence length) doubles the expectation.
• Doubling S, the score, causes E(S) to decrease exponentially.

BLAST P-value
• Recall the Poisson distribution: $P(k) = \frac{e^{-E} E^{k}}{k!}$
• Probability of finding no hits with a score ≥ S: $P(0) = e^{-E}$
• Therefore the probability of finding at least one hit with score ≥ S is $P = 1 - e^{-E}$
• This is called the P-value.
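A numeric sketch of the E-score and P-value, assuming the form E = K m n exp(-lambda S) and P = 1 - exp(-E). The values of K and lambda below are placeholders chosen for illustration, not the constants of any particular scoring matrix.

```python
import math

def expected_hits(m, n, S, K, lam):
    """Expected number of chance hits with score at least S."""
    return K * m * n * math.exp(-lam * S)

def p_value(E):
    """Probability of at least one chance hit when the hit count is Poisson
    with mean E; expm1 keeps precision when E is tiny."""
    return -math.expm1(-E)

K, lam = 0.1, 0.3          # placeholder constants (assumption)
m, n = 250, 1_000_000      # query length and database size (assumption)
for S in (40, 60, 80):
    E = expected_hits(m, n, S, K, lam)
    print(S, E, p_value(E))
```

For small E the P-value is nearly equal to E, which is why the two are often used interchangeably for strong hits.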

A Typical GenBank entry

Sequence Information

The Sequence

BLAST programs
• BLASTN - Nucleotide query searching a nucleotide database.
• BLASTP - Protein query searching a protein database.
• BLASTX - Translated nucleotide query sequence (6 frames) searching a protein database.
• TBLASTN - Protein query searching a translated nucleotide (6 frames) database.
• TBLASTX - Translated nucleotide query (6 frames) searching a translated nucleotide (6 frames) database.
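The "6 frames" used by the translating programs can be sketched as three reading frames on each strand. The example below uses the standard genetic code and assumes an uppercase DNA string containing only A, C, G and T; the function names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order ('*' marks a stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna):
    """Translate complete codons starting at position 0."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def six_frames(dna):
    """Three forward frames followed by three reverse-complement frames."""
    reverse_complement = dna.translate(COMPLEMENT)[::-1]
    return [translate(strand[frame:])
            for strand in (dna, reverse_complement)
            for frame in range(3)]

print(six_frames("ATGGCC"))
```

Each of the six protein strings can then be compared against protein sequences, which is how a nucleotide query can be searched against a protein database.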

BLAST Search

BLAST Output
• List of hits
  • Database accession codes, name, description.
  • Score in bits (usually > 30 bits is significant)
  • Expectation value E()
• For each hit
  • A header including hit name, description, length
  • Each hit may contain several HSPs
    • Score and expectation value
    • How many identical residues
    • How many residues contributing positively to the score
    • The local alignment itself


PSI-BLAST (Position Specific Iterated)
• PSI-BLAST provides a new automatic “profile like” search.
• Iterative procedure:
  • Perform BLAST on the database.
  • Use significant alignments to construct a “position specific” score matrix.
  • This matrix replaces the query sequence in the next round of database searching.
• The program may be iterated until no new significant alignments are found.
• Most commonly used search method today.

Multiple Alignment
• Proteins can be classified into families:
  • Common structure.
  • Common function.
  • Common evolutionary origin.
• For a set of sequences belonging to some family:
  • Each pair has some differences.
  • But there are some common motifs in almost all sequences of the family.
• A multiple alignment carries more information than pairwise alignment.

Protein Families
• Consider Zinc Fingers:
  • All have the same function: bind to DNA
  • All have similar structure
  • They constitute a Protein Family
• In a protein family some parts of the sequence (the functional parts) are more conserved than others.

Definition
A multiple alignment of strings S1, S2, …, Sk is a series of strings with blanks S’1, S’2, …, S’k such that:
• |S’1| = |S’2| = … = |S’k|
• S’j is an extension of Sj obtained by insertion of blanks.

Example
AGT..CTT.ACGCG
AGTAGCTT...GCG
..TAGC.T..GGCG
.CTA.C.TAACCCG
ACTA...TAAC...
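The two conditions of the definition can be checked mechanically. In this sketch the blank is written "." as in the example, and the unaligned sequences are recovered by deleting the blanks; the function name is hypothetical.

```python
def is_multiple_alignment(originals, aligned, blank="."):
    """True if all aligned rows have equal length and each row becomes the
    corresponding original string once its blanks are removed."""
    if len(originals) != len(aligned):
        return False
    same_length = len({len(row) for row in aligned}) <= 1
    restores = all(row.replace(blank, "") == seq
                   for row, seq in zip(aligned, originals))
    return same_length and restores

rows = [
    "AGT..CTT.ACGCG",
    "AGTAGCTT...GCG",
    "..TAGC.T..GGCG",
    ".CTA.C.TAACCCG",
    "ACTA...TAAC...",
]
originals = [row.replace(".", "") for row in rows]
print(originals)
print(is_multiple_alignment(originals, rows))
```

All five rows have 14 columns, so the example satisfies the definition.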


Sum of Pairs
• The sum-of-pairs score is the sum of pairwise distances between all pairs of sequences, for some scoring matrix.
• It assumes not only that each column of the alignment is independent, but also that each pair of sequences is.
• Each sequence is scored as if descended from k-1 sequences instead of one common ancestor.
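A sketch of the sum-of-pairs computation, column by column. The distance values in `pair_distance` (0 for equal symbols, 1 for a mismatch, 2 for a letter against a blank) are a toy assumption standing in for a real scoring matrix.

```python
from itertools import combinations

def pair_distance(a, b, blank="."):
    """Toy distance (assumption): 0 if equal, 2 if one side is a blank, else 1."""
    if a == b:
        return 0
    if a == blank or b == blank:
        return 2
    return 1

def sum_of_pairs(aligned):
    """Sum the pairwise distance over every column and every pair of rows."""
    total = 0
    for column in zip(*aligned):
        for a, b in combinations(column, 2):
            total += pair_distance(a, b)
    return total

print(sum_of_pairs(["AC.", "AGT", "A.T"]))
```

Because every pair of rows is scored separately, a single substitution in one sequence is counted k-1 times, which is the over-counting the last bullet refers to.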

Calculation of Multiple Alignment
• The optimal alignment can be calculated exactly using k-dimensional dynamic programming.
  • Space complexity $O(n^k)$
  • Time complexity $O(2^k n^k)$
• A heuristic program called ClustalW quickly finds a good multiple alignment.

Creating a PSSM
• After aligning the sequences we see that there are some conserved regions.
• We use the multiple alignment of BLAST results to create a Position Specific Scoring Matrix (PSSM).
• This matrix represents information from a whole family; it is stricter in highly conserved regions.
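A minimal sketch of building a position specific scoring matrix from a multiple alignment, using per-column log-odds against a background distribution. The uniform background, the pseudocount of 1, and the function names are assumptions for illustration; PSI-BLAST itself uses a more elaborate weighting scheme.

```python
import math
from collections import Counter

def build_pssm(aligned, background, pseudocount=1.0, blank="."):
    """One dict per column: letter -> log( column frequency / background )."""
    alphabet = sorted(background)
    pssm = []
    for column in zip(*aligned):
        counts = Counter(c for c in column if c != blank)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        pssm.append({a: math.log(((counts[a] + pseudocount) / total) / background[a])
                     for a in alphabet})
    return pssm

def score_with_pssm(pssm, segment):
    """Gapless score of a segment placed at the first column of the matrix."""
    return sum(col[c] for col, c in zip(pssm, segment))

rows = ["AGT..CTT.ACGCG", "AGTAGCTT...GCG", "..TAGC.T..GGCG",
        ".CTA.C.TAACCCG", "ACTA...TAAC..."]
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
pssm = build_pssm(rows, background)
print(pssm[2])  # third column is T in every row, so T scores highest there
```

A fully conserved column rewards its letter strongly and penalizes everything else, while a mixed column gives scores close to zero, which is the sense in which the matrix is stricter in conserved regions.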

