Gapped Blast and PSI BLAST Basic Local Alignment Search Tool ~Sean Boyle
The BLAST Topics • Exactly What is BLAST? • A Quick Recap of Profiles • A Few Statistics Behind the BLAST Program • The Progression to Gapped BLAST • The advancements in PSI BLAST
Exactly What Is BLAST? • Blast Programs are used for searching both protein and DNA databases for sequence similarities. • BLAST programs can compare protein to protein, DNA to DNA, Protein to DNA, or DNA to protein. • The DNA sequences used in comparison are usually conceptually transcribed before comparison. • BLAST programs use a threshold value which can be adjusted to alter speed and probability. A higher value of T will give greater speed, but also a larger probability of missing weaker similarities. • Can use various substitution matrices such as Blosum(62) or PAM 250.
A Quick Recap of Profiles • A sequence profile is a position specific scoring matrix generated from a group of aligned sequences and a basic scoring matrix. • A profile will have L rows and 22 columns or vice versa. • Amino acid matrix scores are multiplied by the ratio of that amino acid in the sequences being compared over the entire number of amino acid possibilities in the matrix. • A consensus sequence or profile is then derived and used in future comparisons.
A Few Statistics Used in BLAST • Firstly we require that the expected score for two random amino acids ΣPiPjSij to be negative. • Now we can calculate two parameters λ and K. These two variables allow for a normalized scoring system through the equation S‘ = (λS – ln K) / (ln 2). • S’ can now be plugged into the equation E = N/2^s’. • E-Value > 0.01 = will return more loosely related similarities. • E-Value <= 1*10^-5 will return more strictly related similarities.
The Progression to Gapped BLAST • Original BLAST program did not take gaps into account. • BLAST used to look for single alignments of at least length T. Each positive alignment “hit” was then extended. • Gapped BLAST now allows for two non-overlapping alignments of length T within distance A of one another. These alignments “hits” are then extended. • Gapped BLAST allows for gap initiation and extension. • ABCDE ABCDE ACD - - A–CD– (Original Blast) (Gapped Blast)
PSI BLAST • Position-Specific Iterated BLAST • Incorporates position specific matrices “profiles” • Often much better at detecting weak similarities • Before PSI BLAST the same techniques were used, but a large degree of expertise and human intervention was required
Score Matrix Architecture • Profiles very similar to scoring matrix • Protein or nucleotide aligns to profile position • New profile created with every iteration • Profiles created in turn i used in turn i+1 • Gap costs may be position-specific with profiles. • How position specific protein score matrices draw their power • Improved estimation of the probabilities with which amino acids occur at various pattern positions • Relatively precise definition of the boundaries of important motifs • Every matrix constructed has a length exactly the same as the original query sequence
Multiple Alignment Construction & Sequence Weights • All database sequences whose aligned E-value is below a specific threshold are added to the query • Any row (or column) which is >= 98% identical to a previously added alignment is kept out of the profile • Allows for better searching on later iterations • Poor restrictions could lead to large scale profile sequence insertion • Sequences are given different weights depending on evolutionary importance
PSI BLAST Overview • Start off with query and initial score matrix (BLOSUM 62) • Homologs are found using BLAST (align DB to query) • E-Value is used as criteria for sequence insertion into profile • A profile(p1) is constructed from the passing sequences and score matrix • Once again search for homologs using BLAST(align DB to profile) • Once again use E-Value as criteria for insertion into profile • A profile(p2) is constructed from the approved sequences and score matirx.