670 likes | 1.67k Views
BLAST. “ Basic Local Alignment Search Tool ” • specialized software to compare a query sequence (protein or DNA) against a database. • it is designed to be fast, accurate and available to many computing platforms including the internet
E N D
BLAST “Basic Local Alignment Search Tool” • specialized software to compare a query sequence (protein or DNA) against a database. • it is designed to be fast, accurate and available to many computing platforms including the internet • published in 1990 (Atschul et al, J Mol Biol) – one of the highest cited papers of that decade Pevsner J. Introduction to Bioinformatics / NCBI BLAST site
Research using BLAST • discovering the evolution or origin of a particular sequence • identifying homologous sequences • orthologs and paralogs • identifying proteins that have the same motifs or domains • discovering new genes and proteins • exploring protein structure and function Pevsner J. Introduction to Bioinformatics / NCBI BLAST site
Research using BLAST Pertsemlidis et al. 2001. Having a BLAST with bioinformatics (and avoiding BLASTphemy)
Homologs • homologs — have a common ancestry but may not have a common activity • orthologs — are produced only by speciation and tend to have a similar function • paralogs — are produced by gene duplications and tend to have different functions
Motifs and Domains • motif — a short sequence that is a hallmark of a certain function • a cluster of amino acids in an enzyme • a series of cysteines and histidines that coordinate zinc ions • domain — a sequence of that autonomously folds to produce a certain activity • DNA binding domain • phosphotyrosine binding domain
Performing a BLAST search • choose your sequence, often called the “query” • select the appropriate program in the BLAST suite • choose the appropriate database to search • choose the appropriate parameters for the search • BLAST !
choose an organism or database choose a method
insert your query here choose your database BLAST !
Choosing a sequence • can be an accession number from a database • can be a sequence in FASTA format >my-favorite-sequence AGGAGGATAACATATGCATCATCACCATCACCACCTGGTCCCGCGCGGA
Choosing a program • DNA specific / protein specific / translated DNA to protein
Choosing a program • DNA specific / protein specific / translated DNA to protein
Choosing a program • DNA specific / protein specific / translated DNA to protein
Choosing a database • the nr database used to stand for the nonredundant set but is more general than that now • it includes all GenBank (NIH major database) sequences , EMBL (European nucleotide database) sequences, DDBJ (Japanese nucleotide database) sequences, and PDB (Protein Data Bank) sequences
Choosing an alignment matrix • when searching the database, a score has to be determined according to how similar the query sequence with a given entry • this similarity score includes a positive value for matches, lesser values for close matches and other, possibly negative, scores for any gaps or mismatches • two common alignment matrices are PAM and BLOSUM • as a general guide: • PAM10 / BLOSUM80 are used to compare closely related sequences • PAM250 / BLOSUM30 are used to compare more distantly related sequences
Low Complexity Regions Before BLAST performs its sequence comparison, it must filter out sequences that have low complexity Low complexity sequences will lead the program astray, and produce results that do not generally meet the objectives of the researcher
Low Complexity Regions A program called SEG performs the complexity analysis www.ncbi.nlm.nih.gov/Education/BLASTinfo/Seg.html
BLAST methodology BLAST works quickly because everything is precomputed at the level of “words”. Words are typically three amino sequences. BLAST searches words and uses the comparison matrix to compute a score. If the score is higher than a certain threshold, BLAST calls it a “hit. It then extends the sequence around the hit until the score drops again. Thus, a bigger region of homology is potentially built up. This is repeated over many hits.
Sample BLAST output • the Expect (E) value helps you determine if the result is significant • typically, an E value of < 10-6 with > 70% sequence identity for DNA signifies a strong hit • typically, an E value of < 10-3 with >25% sequence identity for proteins signifies a strong hit
E values and bit values • There are two kinds of scores: • • raw scores (calculated from a substitution matrix) • • bit scores (normalized scores) • Bit scores are comparable between different searches because they are normalized to account for the use of different scoring matrices and different database sizes • S’ = bit score = (lS - lnK) / ln2 • The E value corresponding to a given bit score is: • E = mn 2 –S’
E values and bit values • The expect value E is the number of alignments with scores greater than or equal to score S that are expected to occur by chance in a database search. • The value of E decreases exponentially with increasing S (higher S values correspond to better alignments). • Very high scores correspond to very low E values. • An E value is related to a probability value p. • For E = 1, one match with a similar score is expected to occur by chance. For a very much larger or smaller database, you would expect E to vary accordingly
Conserved Domain Database • there are several databases of protein domains that you can search against • NIH CDD (31608 position specific score matrices – PSSMs) • Pfam (10340 PSSMs from UK Sanger Institute) • SMART (791 PSSMs from EMBL Heidelberg) • COG (4873 PSSMs from NIH) • PRK (7504 PSSMs from NIH, curated protein clusters) • TIGR (3603 PSSMs from The Venter Institute)
Conserved Domain Database • Here is a search with a sequence from a protein called AIDA1. It contains a Sterile Alpha Motif, or SAM domain • I am using the CDD search feature of NCBI BLAST
Conserved Domain Database • Yay ! It identified the SAM domain very well (as NIH CDD #166)
Conserved Domain Database • When you wish to delve further, you can toggle an alignment of your query against the PSSM that was selected
Conserved Domain Database • The CDART web server can take the domain you are interested in and show you other proteins that also possess that domain (in this case, 11 pages worth)
Position Specific Iterated (PSI) BLAST • PSI-BLAST is an excellent choice for when you wish to search databases for matches that are highly customized to your query sequence • its strength is in the way it iteratively performs and refines searches • you begin the usual way by entering your query sequence • from the results, PSI-BLAST builds a profile or PSSM for that protein • you can choose which hits to include for the next iteration
Position Specific Iterated (PSI) BLAST • all of the hits here are very similar, but there are some differences. These differences can be used to refine the search and find more relatives R,I,K C D,E,T K,R,T N,L,Y,G
Position Specific Iterated (PSI) BLAST Iteration# hitshits above threshold 1 104 49 2 173 96 3 236 178 4 301 240 5 344 283 6 342 298 7 378 310 8 382 320
The universe of lipocalins (each dot is a protein) retinol-binding protein odorant-binding protein apolipoprotein D
Scoring matrices let you focus on the big (or small) picture retinol-binding protein your query
Scoring matrices let you focus on the big (or small) picture PAM250 PAM30 retinol-binding protein retinol-binding protein Blosum80 Blosum45
PSI-BLAST generates scoring matrices more powerful than PAM or BLOSUM retinol-binding protein retinol-binding protein
PSI-BLAST issues • PSI-BLAST is especially good at detecting weak relationships in proteins • often have biological meaning – this is good ! • but its sensitivity has a drawback • a strong (E = 10-4) hit that doesn’t belong will contaminate your results • the problem is worse if the contaminant is well represented • this problem can be alleviated by • manually inspecting results • lowering the E-value threshold • use filters to not consider any regions that may have bias
Pattern Hit Initiated (PHI) BLAST • PHI-BLAST uses regular expressions and local alignments to find homologous proteins with functional relevance • consider three proteins from the lipocalin family (retinoic acid binding protein and two distantly related bacterial proteins) 1 50 ecblc MRLLPLVAAA TAAFLVVACS SPTPPRGVTV VNNFDAKRYL GTWYEIARFD vc MRAIFLILCS V...LLNGCL G..MPESVKP VSDFELNNYL GKWYEVARLD hsrbp ...MKWVWAL LLLAAWAAAE RDCRVSSFRV KENFDKARFS GTWYAMAKKD
Pattern Hit Initiated (PHI) BLAST • PHI-BLAST uses regular expressions and local alignments to find homologous proteins with functional relevance • consider three proteins from the lipocalin family (retinoic acid binding protein and two distantly related bacterial proteins) 1 50 ecblc MRLLPLVAAA TAAFLVVACS SPTPPRGVTV VNNFDAKRYL GTWYEIARFD vc MRAIFLILCS V...LLNGCL G..MPESVKP VSDFELNNYL GKWYEVARLD hsrbp ...MKWVWAL LLLAAWAAAE RDCRVSSFRV KENFDKARFS GTWYAMAKKD GTWYEI K AV M pattern: GXW[YF][EA][IVLM]
• the additional pattern is easily entered and used at the BLAST website
Protein Data Bank BLAST • you can search your sequence against the PDB database of high resolution NMR and X-ray structures • great way to begin a molecular modeling project >pdb|1SHC|A Chain A, Shc Ptb Domain Complexed With A Trka Receptor Phosphopeptide, Nmr, Minimized Average Structure Length=195 Score = 56.2 bits (134), Expect = 5e-09, Method: Compositional matrix adjust. Identities = 46/173 (26%), Positives = 80/173 (46%), Gaps = 32/173 (18%) Query 3 PVQYWQHHPEKLIFQSCDYKAFYLGSM-LIKELRG------TESTQDACA---------- 45 P + W H +K++ Y Y+G + +++ +R T+ T++A + Sbjct 22 PTRGWLHPNDKVMGPGVSYLVRYMGCVEVLQSMRALDFNTRTQVTREAISLVCEAVPGAK 81 Query 46 ---KMRANCQK--------STEQMKKVPTIILSVSYKGVKFIDATNKNIIAEHEIRNISC 94 + R C + S + +P I L+VS + + A K IIA H +++IS Sbjct 82 GATRRRKPCSRPLSSILGRSNLKFAGMP-ITLTVSTSSLNLMAADCKQIIANHHMQSISF 140 Query 95 A-AQDPEDLSTFAYITKDLKSNHHYCHVFTAFDVNLAYEIILTLGQAFEVAYQ 146 A DP+ AY+ KD N CH+ + LA ++I T+GQAFE+ ++ Sbjct 141 ASGGDPDTAEYVAYVAKD-PVNQRACHILECPE-GLAQDVISTIGQAFELRFK 191
Protein Data Bank BLAST • in the previous slide, I aligned the sequence of a Phosphotyrosine Binding Domain (PTB) from AIDA1 with a few representatives from the PDB • I threaded the sequence of AIDA1 into the structure of “1SHC” and refined it • on the model, I highlighted some aromatics (Phe, Tyr, Trp) that could possibly be solvent exposed • solvent exposed hydrophobic amino acids tend to make proteins “sticky”, not a good quality for the kind of structural biology that I perform