Learn about the Basic Local Alignment Search Tool (BLAST), its databases, and how to optimize searches for genetic data analysis. Understand scoring systems, E-values, and practical exercises.
Tutorial 3: BLAST
• What is BLAST?
• Basic Local Alignment Search Tool
• A set of similarity search programs designed to explore sequence databases.
• What are similarity searches good for?
• One sequence by itself is not informative; it must be compared against existing sequence databases to develop hypotheses about its relatives and function.
[Figure: a query is searched against a database by the BLAST program]
[Screenshot: BLAST search form. Labels: place query, choose database]
BLASTN Databases http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml#nucleotide_databases
[Screenshot: BLAST search form. Labels:]
• Place query
• Choose database
• Optimize similarity level of the search
• Limit output size
• Threshold for results significance
• Primary word match (16-64 nt)
• Reward and penalty for matching and mismatching bases
• Cost to create and extend a gap
• Limit search to specific organism
• Remove low information content
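The slides describe the web search form. As a rough sketch only, the same options can be written down as a command line for the standalone NCBI BLAST+ `blastn` program, which is an assumption here because the tutorial itself only shows the web interface. The helper name `build_blastn_command`, the file name `query.fa`, the database name `nt`, and all default values are hypothetical choices for illustration. The organism limit is left out of this sketch, and the command is only assembled, not run.

```python
import shlex


def build_blastn_command(query, db, task="megablast", evalue=1e-6,
                         word_size=28, reward=1, penalty=-2,
                         gapopen=5, gapextend=2,
                         max_target_seqs=100, dust=True):
    """Assemble (but do not run) a blastn argument list.

    Each argument mirrors one option from the search form:
    task -> similarity level, max_target_seqs -> output size,
    evalue -> significance threshold, word_size -> primary word match,
    reward/penalty -> match/mismatch scores, gapopen/gapextend -> gap costs,
    dust -> low-information (low-complexity) filtering.
    """
    return [
        "blastn",
        "-task", task,
        "-query", query,
        "-db", db,
        "-evalue", str(evalue),
        "-word_size", str(word_size),
        "-reward", str(reward),
        "-penalty", str(penalty),
        "-gapopen", str(gapopen),
        "-gapextend", str(gapextend),
        "-max_target_seqs", str(max_target_seqs),
        "-dust", "yes" if dust else "no",
    ]


# Hypothetical inputs: a FASTA file and a nucleotide database name.
cmd = build_blastn_command("query.fa", "nt")
print(shlex.join(cmd))
```

Keeping the command as a list rather than a single string makes it straightforward to pass to `subprocess.run` later without shell quoting problems.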
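[Figure: global alignment vs. local alignments. Labels: query sequence, matched areas of database sequences]
[Screenshot: BLAST results table. Labels: sequence description, sequence identifier, score (bits), E value, identity, coverage]
[Screenshot: alignment view. Labels: score and E value, identities and gaps, strand]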
Multiple hits on the same subject
Design of the BLAST survey
• Consider your research question:
• Are you looking for a particular gene in a particular species? BLAST against the genome of that species.
• Are you looking for additional members of a gene family across all species? BLAST against the gene collection database.
• Are you looking for exact motif matches? Increase the gap penalty or use megablast.
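Score and E-value
• Score (S): $S = (\text{identities} + \text{mismatches}) - \text{gaps}$
• Bit score (S'): $S' = \dfrac{\lambda S - \ln K}{\ln 2}$
• E-value: $E = K \, m \, n \, e^{-\lambda S}$
• $\lambda$ and $K$ depend on the scoring system.
• $m$ (query length, bp) and $n$ (database length, bp) define the search space.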
Score and E-value
• The score is a measure of the similarity of the query to the sequence shown.
• The E-value is a measure of the reliability of the score.
• The definition of the E-value is: the number of alignments with a score of at least S that are expected to occur by chance in a search of this size.
Score and E-value
• The size of the E-value
• The typical threshold for a good E-value from a BLAST search is $E = 10^{-6}$ (written 1e-6) or lower.
• The reason for such low values is that a chance probability of 0.001 per entry in a million-entry database would still leave about 1000 entries due to chance. A value of $10^{-6}$ would leave only about one entry due to chance.
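The arithmetic behind this threshold can be checked with a minimal sketch. It assumes the simple model used on the slide, where the number of chance entries is the per-entry chance level multiplied by the number of database entries; the function name `expected_chance_hits` is a hypothetical helper, not part of BLAST.

```python
def expected_chance_hits(per_entry_chance, database_entries):
    """Expected number of entries matched purely by chance (simple model)."""
    return per_entry_chance * database_entries


database_entries = 1_000_000  # the million-entry database from the slide

# Compare a lenient threshold with the stricter one recommended above.
for threshold in (1e-3, 1e-6):
    hits = expected_chance_hits(threshold, database_entries)
    print(f"threshold {threshold:g}: about {hits:g} chance entries")
```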
[Table: BLAST program and scoring system]
Exercise
Calculate S, S' and E for the following BLAST hit:
ACGTCGATCGAGCT
|||||||| |||||
AGGTCGTC-GAGGT
Given the following parameters:
• Query length: 150
• $\lambda = 1.37$, $K = 0.711$
• Average sequence length in database: 270
• Number of sequences in database: 4,554,026
Raw score:
$S = (\text{Id} + \text{MM}) - \text{GP} = 13 - 1 = 12$
Bit score:
$S' = \dfrac{1.37 \times 12 - \ln(0.711)}{\ln 2} = \dfrac{16.44 + 0.341}{0.693} \approx 24.2$
E-value:
$E = 0.711 \times 150 \times 270 \times 4{,}554{,}026 \times e^{-1.37 \times 12}$
$E = 131{,}135{,}455{,}683 \times 7.25 \times 10^{-8}$
$E \approx 9504.27$
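The exercise can be reproduced with a short sketch. It takes the simplified raw score from the slides at face value (aligned columns minus gap columns) and assumes the database length is the average sequence length times the number of sequences, as the worked calculation does. The function names are hypothetical helpers, not part of any BLAST library.

```python
import math


def raw_score(query_row, subject_row):
    """Slide's simplified score: (identities + mismatches) - gaps."""
    gaps = sum(1 for q, s in zip(query_row, subject_row) if "-" in (q, s))
    aligned = len(query_row) - gaps  # columns with a base in both rows
    return aligned - gaps


def bit_score(s, lam, k):
    """Bit score S' = (lambda * S - ln K) / ln 2."""
    return (lam * s - math.log(k)) / math.log(2)


def e_value(s, lam, k, query_len, db_len):
    """E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * s)


LAMBDA, K = 1.37, 0.711
QUERY_LEN = 150
DB_LEN = 270 * 4_554_026  # average sequence length x number of sequences

S = raw_score("ACGTCGATCGAGCT", "AGGTCGTC-GAGGT")
S_bits = bit_score(S, LAMBDA, K)
E = e_value(S, LAMBDA, K, QUERY_LEN, DB_LEN)

print(f"S = {S}, S' = {S_bits:.2f}, E = {E:.2f}")
```

An E-value in the thousands means a hit this short and low-scoring is expected many times by chance in a database of this size, so it is not significant.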
Exercise
What is the minimal score needed to achieve a significant E-value ($10^{-6}$)?
$131{,}135{,}455{,}683 \, e^{-1.37 S} = 10^{-6}$
$\ln(131{,}135{,}455{,}683 \, e^{-1.37 S}) = \ln(10^{-6})$
$\ln(131{,}135{,}455{,}683) + \ln(e^{-1.37 S}) = -13.81$
$25.6 - 1.37 S = -13.81$
$S = \dfrac{-13.81 - 25.6}{-1.37}$
$S \approx 28.77$
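The same rearrangement can be sketched in code by solving the E-value formula for S. The parameters are those of the exercise; `minimal_score` is a hypothetical helper name, and rounding up to a whole number assumes that raw scores are integers.

```python
import math


def minimal_score(e_threshold, lam, k, query_len, db_len):
    """Solve K * m * n * exp(-lambda * S) = E for S."""
    return math.log(k * query_len * db_len / e_threshold) / lam


s_min = minimal_score(
    e_threshold=1e-6,
    lam=1.37,
    k=0.711,
    query_len=150,
    db_len=270 * 4_554_026,  # average sequence length x number of sequences
)

# With integer raw scores, the smallest significant score is the next
# whole number at or above s_min.
print(f"S must be at least {s_min:.2f}, i.e. {math.ceil(s_min)} or more")
```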