Introduction to Blast Prepared by Nurul Annasuha Binti Mohd Baharudin
Basic Local Alignment Search Tool(BLAST) • An algorithm for comparing primary biological sequence information (e.g. amino-acid sequences of different proteins , nucleotides of DNA sequences). • A widely used bioinformatics' program. • availability on the World Wide Web through a large server at the NCBI (National Center for Biotechnology Information) and at many other sites.
BLAST is more time-efficient than FASTA • comparative sensitivity - searching only for the more significant patterns in the sequences. • freely available to run on many computer platform.
Background • Designed by Eugene Myers, Stephen Altschul, Warren Gish, David J.Lipman and Webb Miller. • Was published in the Journal of Molecular Biology in 1990. • Smith-Waterman algorithm was used before fast algorithm such as BLAST and FASTA were developed.
Uses of BLAST • Identifying species - Identify a species and find homologous species. e.g. working with a DNA sequence from an unknown species. • Locating domains - To locate known domains within the sequence of interest. e.g. working with a protein sequence as an input
DNA mapping - To compare the chromosomal position of the sequence of interest, to relevant sequences in the database. e.g. working with a known species, and looking to sequence a gene at an unknown location. • Comparison - To locate common gene in two related species, and can be used to map annotations from one organism to another. e.g. working with genes
Input • Input Sequences: • FASTA format • Genbank format
FASTA • Also known as Pearson format • A text-based format for representing : • nucleotidesequences (nucleic acid) • peptide sequences (amino acids) • Advantages: • easy to manipulate and parse sequences using text-processing tools and scripting languages like Python, Ruby, and Perl. Expressed by single-letter codes.
Nucleotide Sequences Nucleic acids code supported: Refer for an unknown nucleic acid residue. • Treated as mismatches in nucleotide alignment. • Too many input sequence will cause rejected. ‘ – ‘ not included in query. To represent gaps ‘ N ‘.
Peptide Sequences Amino acids code supported: ‘U’ is replaced by ‘X’. Refer for an unknown nucleic acid residue. Nucleotide-like code. Too many input sequence rejection. ‘ – ‘ not included in query. To represent gaps ‘ X ‘
Example: • Greater than symbol • Usually start with ‘<‘ symbol. • Useful when doing multiple sequence FASTA format. (concatenating several single sequence FASTA file) description line sequence data • Recommendation : all lines of text be < 80 characters length. • Blank lines are not allowed in the middle of FASTA input. • -->single hyphen or dash (‘-’) can be used to represent a gap • of indeterminate length.
Example: Also starter symbol. This line will be ignored by the software. • lower-case letters are accepted and are mapped into upper-case.
Bare sequence : The sequence portion of a GenBank/GenPept flatfile report.
Example: Multi Sequence FASTA file
Genebank • Genbank sequence database is an open access • Annotated collection of all publicly available nucleotide sequences and their protein translation. • Contained over 65 billion nucleotide bases in more than 61 million sequences
Example: Legal input of Genebank identifier: CAA89576 CAA89576.1 1015707 gi|1015707 e.g Mega BLAST: 1015707 129295
Exercise: Following are illigal input of Genebank identifier. Why they are illigal?
Must be removed. Answer: Should be NO space.
Output HTML : NCBI default format • Agraphical format showing the hits found • Atable showing sequence identifiers for the hits with scoring related data, as well as alignments for the sequence of interest and the hits received with corresponding BLAST scores.
Types of BLAST Prepared by Nurrul Shafiqah Binti Abdullah
Nucleotide BLAST • Protein BLAST • blastx • tblastn • tblastx
Nucleotide BLAST • Search a nucleotide database using a nucleotide query • Algorithm • blastn • megablast • discontinous megablast
blastn • compares a nucleotide query sequence against a nucleotide sequence database
megablast - identify an unknown sequence is to see if that sequence already exists in a public database.
Protein BLAST • Algorithm • blastp, • psi-blast, • phi-blast, • delta-blast
blastp • compares an amino acid query sequence against a protein sequence database
blastx • compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database
tblastn • compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands).
tblastx • compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
On Net http://www.ncbi.nlm.nih.gov/BLAST/
On Computer UNIX/MacOS/Windows ftp://ncbi.nlm.nih.gov/blast/executables/
Question??? What is/are the significance of translation BLAST (blastx, tblastn and tblastx)? • protein sequences are better conserved evolutionarily than nucleotide sequences, • produce more reliable and accurate results when dealing with coding DNA • enable one to be able to directly see the function of the protein sequence, since by translating the sequence of interest before searching often gives the annotated protein hits.
Common Databases for Use with BLAST available at NCBI Prepared by Roslan Bin Armina
A BLAST search has four components: 1. query 2.database 3.program 4.search purpose/goal • Blast database are grouped into protein nucleotide database according to their content. These can shown in table below.
combination of several database into one • example of other database use in blast.
NCBI • Can choose what database to use, whether a combination of databasenr (nr) or specific type of database (refseq_protein swissprot, pat, etc)
NCBI also provides specialized BLAST databases. 1. vector screening database 2. variety of genome databases for different organisms. 3. trace databases