Introduction to Bioinformatics

Introduction to Bioinformatics BLAST

BLAST • Introduction • What is BLAST? • Query Sequence Formats • What does BLAST tell you? • Choices • Variety of BLAST • BLAST Programs: Which One to Use? • Commonly Used BLAST programs • BLAST Databases: Which One to Search? • Understanding the Output • Database Search with BLAST • Blast Steps – How It Works Acknowledgement: The presentation includes adaptations from NCBI’s Introduction to Molecular Biology Information Resources Modules

What is BLAST? • Basic Local Alignment Search Tool • The Google™ of bioinformatics • Query is a DNA or protein sequence, not a text term • Character string comparison against all the sequences in the target database • Rigorous statistics used to identify statistically significant matches

Query Sequence Formats
• Bare sequence
  • QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
    KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
    VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
    FLFLIKHNPTNTIVYFGRYWSP
  •   1 qikdllvsss tdldttlvlv naiyfkgmwk tafnaedtre mpfhvtkqes kpvqmmcmnn
     61 sfnvatlpae kmkilelpfa sgdlsmlvll pdevsdleri ektinfeklt ewtnpntmek
    121 rrvkvylpqm kieekynlts vlmalgmtdl fipsanltgi ssaeslkisq avhgafmels
    181 edgiemagst gviedikhsp eseqfradhp flflikhnpt ntivyfgryw sp
• Identifiers
  • accession, accession.version, or GI numbers
  • e.g., p01013, AAA68881.1, 129295, gi|129295
• FASTA format

Query Sequence in FASTA Format
• FASTA definition line ("def line") that begins with a >, followed by some text on the same single line that briefly describes the query sequence
• Up to 80 nucleotide bases or amino acids per line
• Blank lines not allowed in the middle
• Example
  >gi|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)
  QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
  KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
  VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
  FLFLIKHNPTNTIVYFGRYWSP
• Additional information
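
The FASTA rules above (one def line starting with >, followed by sequence lines) can be illustrated with a small reader. This is a minimal sketch, not part of BLAST itself: the function name parse_fasta is hypothetical, and the reader chooses to skip blank lines rather than reject them.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (def_line, sequence) tuples."""
    records = []
    header, chunks = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # assumption: tolerate blank lines instead of raising an error
        if line.startswith(">"):
            # A new def line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before a '>' definition line")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>gi|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
FLFLIKHNPTNTIVYFGRYWSP
"""

records = parse_fasta(example)
def_line, sequence = records[0]
print(def_line)       # text of the def line without the leading '>'
print(sequence[:10])  # first ten residues of the joined sequence
```

The sequence lines are joined into one string, so the 80-characters-per-line limit only matters for the input file, not for the parsed result.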

What does BLAST tell you? • Putative identity and function of your query sequence • Helps to direct experimental design to prove the function • Find similar sequences in model organisms (e.g., yeast, C. elegans, mouse), which can be used to further study the gene • Compare complete genomes against each other to identify similarities and differences among organisms

Variety of BLASTs: http://www.ncbi.nlm.nih.gov/BLAST/

BLAST Programs: Which One to Use? Depends on: • What type of query sequence you have (nucleotide or protein) • What type of database you will search against (nucleotide or protein) • BLAST program descriptions • brief list • BLAST program selection guide

Commonly Used BLAST Programs • Examples of BLAST programs • BLASTN • Nucleic acids against nucleic acids • BLASTP • Protein query against protein database • Usually better to use than nucleotide-nucleotide BLAST • Since the genetic code is degenerate, blastn can often give less specific results than blastp • ...but... what if we don't have a protein query sequence. What are our options? • BLASTX • Translated nucleic acids against protein database • One way to do a protein BLAST search if you have a nucleotide query sequence • The BLAST program does the translating for you, in all 6 reading frames
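
The six-reading-frame translation that BLASTX performs on a nucleotide query can be sketched as follows. This is an illustration only, assuming the standard genetic code and an unambiguous DNA alphabet; the function names are hypothetical and this is not how the BLAST program is implemented internally.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate from the first base; unknown codons become 'X', stops become '*'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for frame, protein in six_frame_translation("ATGGCC").items():
    print(frame, protein)
```

Each of the six protein strings would then be searched against the protein database, which is why a nucleotide query can still benefit from protein-level comparison.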

BLAST Databases: Which One to Search? What type of data do you want to search against? For example: • Characterized sequences? • Specialized sequences? • Complete genomes or chromosomes? • BLAST database descriptions are available in the: • BLAST help document • BLAST program selection guide

Request ID: RID • An RID is like a ticket number that allows you to retrieve your search results and format them in many different ways over the next 24 hours. • If you've saved RIDs from your recent searches, you can enter the RIDs directly using the Retrieve results with a Request ID page, which is accessible from the bottom of the BLAST home page

Search Results: Understanding the Output • Reference to BLAST paper • Reminders about your specific query • RID • query sequence reminder (contains the information from your FASTA def line) • what database you searched against • Graphical summary • shows where the hits aligned to your query • colors indicate score range • mouse over a colored bar to see info about that hit • Text summary (GI numbers and Def lines) • GI links to complete record in Entrez • Score links to pairwise alignment between your query sequence and the hit • Pairwise alignments • BLAST statistics for your search

Database Search w/ BLAST • Primary use of bioinformatics • Finding similar sequences • BLAST Acknowledgement: Slides 15 – 19 are adapted from lecture notes of Professor Chau-Wen Tseng of CS Department at the University of Maryland with permission.

Database Search w/ BLAST • Set up format options and click the Format button

Database Search w/ BLAST • Versions of BLAST • BLASTN • Nucleic acids against nucleic acids • BLASTP • Protein query against protein database • BLASTX • Translated nucleic acids against protein database • TBLASTN • Protein query against translated nucleic acid database • TBLASTX • Translated nucleic acids against translated nucleic acids
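
The choice among these five programs depends only on the query type, the database type, and whether nucleotide sequences are compared at the protein level. A minimal lookup sketch follows; the function name and the translate_both flag are hypothetical conveniences, and tblastn is the program that searches a protein query against a translated nucleotide database.

```python
def choose_blast_program(query_type, database_type, translate_both=False):
    """Pick a BLAST program name from the query and database sequence types.

    query_type and database_type are "nucleotide" or "protein".
    translate_both only applies to nucleotide-vs-nucleotide searches and
    selects tblastx (both sides translated) instead of blastn.
    """
    valid = {"nucleotide", "protein"}
    if query_type not in valid or database_type not in valid:
        raise ValueError("types must be 'nucleotide' or 'protein'")
    if query_type == "protein":
        # Protein query: direct protein search, or search a translated database.
        return "blastp" if database_type == "protein" else "tblastn"
    if database_type == "protein":
        # Nucleotide query translated in six frames against proteins.
        return "blastx"
    return "tblastx" if translate_both else "blastn"


print(choose_blast_program("nucleotide", "protein"))
print(choose_blast_program("protein", "nucleotide"))
```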

Database Search w/ BLAST • BLAST graphic result

Database Search w/ BLAST • BLAST result
• Matching sequences w/ bit score & E-value
• Hyperlinks to database entry for sequence
• Example

  Sequence (hyperlinked)                   Bit score   E-value
  gi|17330420|gb|BH384278.1|BH384278...    153         3e-36
  gi|17320126|gb|BH373984.1|BH373984...    140         9e-34
  gi|17338337|gb|BH392196.1|BH392196...    112         8e-25
  gi|20373967|gb|BH771010.1|BH771010...    105         1e-21
  gi|17314411|gb|BH368367.1|BH368367...    104         2e-21
  gi|17332712|gb|BH386570.1|BH386570...    64          3e-21

BLAST – Statistical Evaluation • E Value • The number of different alignments, with scores equal to or better than the observed alignment score, that are expected to occur in a database search by chance • The lower the E value, the more significant the score

BLAST – How It Works • Find high scoring local alignments between query sequence and target database • Assumption • True match alignments very likely to contain within them very high scoring matches • Steps • Seeding • Searching • Extension • Evaluation

BLAST Steps • Seeding • For each word of length w in the query (w-mer), generate a list of all possible words (neighbors) that score at least threshold T against it (scores are determined using the scoring matrix) • Default • w = 3 for protein • w = 11 for DNA

Query word (w = 3)
Query: GSDFWQETRASFGCSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFGCATSWPI
Neighborhood words for query word PQG:
  PQG 18, PEG 15, PRG 14, PKG 14, PNG 13, PDG 13, PHG 13, PMG 13, PSG 13
  ---- Neighborhood score threshold (T = 13) ----
  PQA 12, PQN 12, …
This example uses BLOSUM 62.
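
The neighborhood list for the query word PQG can be reproduced by brute force: score every possible 3-letter word against PQG and keep those reaching T = 13. The sketch below is an illustration rather than BLAST's actual implementation; it embeds the 20-residue BLOSUM62 matrix as text, and the function names are hypothetical.

```python
from itertools import product

BLOSUM62_TEXT = """
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
"""


def load_matrix(text):
    """Turn the whitespace-separated table into a {(row, column): score} dict."""
    rows = [line.split() for line in text.strip().splitlines()]
    columns = rows[0]
    return {
        (row[0], col): int(value)
        for row in rows[1:]
        for col, value in zip(columns, row[1:])
    }


BLOSUM62 = load_matrix(BLOSUM62_TEXT)
ALPHABET = "ARNDCQEGHILKMFPSTWYV"


def word_score(query_word, candidate, matrix=BLOSUM62):
    """Sum of position-by-position substitution scores for two equal-length words."""
    return sum(matrix[q, c] for q, c in zip(query_word, candidate))


def neighborhood(word, threshold, matrix=BLOSUM62):
    """Return (neighbor, score) pairs with score >= threshold, best first."""
    hits = []
    for letters in product(ALPHABET, repeat=len(word)):
        candidate = "".join(letters)
        score = word_score(word, candidate, matrix)
        if score >= threshold:
            hits.append((candidate, score))
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))


for neighbor, score in neighborhood("PQG", 13):
    print(neighbor, score)
```

Enumerating all 20^3 candidates is fine for w = 3; it is used here only because it is easy to check by hand.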

BLOSUM 62

BLAST Steps • Searching • Determine the locations of all common “words” between the query and the database (“word hits”) • Identifies all word hits

Query word (w = 3), neighborhood score threshold T = 13
Query: GSDFWQETRASFGCSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFGCATSWPI
Neighborhood words: PQG 18, PEG 15, PRG 14, PKG 14, PNG 13, PDG 13, PHG 13, PMG 13, PSG 13 (PQA 12, PQN 12, … fall below T)
Hit:
Query:   SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA
Subject: TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA
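
The search step can be sketched as a word-index lookup: index every w-mer of the subject, then look up each neighborhood word of each query word. This is a simplified illustration with hypothetical function names; the neighborhood set for PQG is hard-coded from the list above (words scoring at least T = 13) rather than computed.

```python
from collections import defaultdict


def index_words(sequence, w):
    """Map every w-mer in the sequence to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(sequence) - w + 1):
        index[sequence[i:i + w]].append(i)
    return index


def find_word_hits(query, subject, neighbors_of, w=3):
    """Return (query_pos, subject_pos, query_word, subject_word) tuples.

    neighbors_of maps a query word to the set of words accepted as hits for it.
    Positions are 0-based.
    """
    subject_index = index_words(subject, w)
    hits = []
    for q_pos in range(len(query) - w + 1):
        q_word = query[q_pos:q_pos + w]
        for neighbor in sorted(neighbors_of.get(q_word, ())):
            for s_pos in subject_index.get(neighbor, ()):
                hits.append((q_pos, s_pos, q_word, neighbor))
    return hits


query = "SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA"
subject = "TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA"
neighbors_of = {
    "PQG": {"PQG", "PEG", "PRG", "PKG", "PNG", "PDG", "PHG", "PMG", "PSG"},
}

hits = find_word_hits(query, subject, neighbors_of)
print(hits)
```

In this example the only word hit is the neighborhood word PMG in the subject, aligned with the query word PQG at the same offset in both fragments.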

BLAST Steps • Extension • Extend hits to find HSPs (high-scoring segment pairs) that have scores higher than a threshold • Introduce gaps using dynamic programming • Problem of extension • Time-consuming to find the highest score • Solution (heuristic) • Extend until the score drops by X below the best score seen so far
Example (Match = 1, Mismatch = -1, X = 5; one digit per alignment column):
Query:     ABCDEFGHIJKLMNOPQRST
           ||||||  |||||    |
Subject:   ABCDEFZYIJKLMXWVURAB
Score:     12345654567898765654
Drop-off:  00000012100001234345
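
The drop-off heuristic can be sketched for a one-directional, ungapped extension. The code below is a simplified illustration with a hypothetical function name: it uses match = 1, mismatch = -1, and X = 5, and the two toy strings are chosen to mirror the alphabet example above. BLAST itself extends in both directions from the seed and uses a substitution matrix for proteins.

```python
def extend_right(query, subject, match=1, mismatch=-1, x_drop=5):
    """Ungapped rightward extension with an X-drop stopping rule.

    Returns (best_score, best_length, trace), where trace is the running
    score after each aligned column that was examined.
    """
    score = best_score = best_length = 0
    trace = []
    for column, (q, s) in enumerate(zip(query, subject), start=1):
        score += match if q == s else mismatch
        trace.append(score)
        if score > best_score:
            best_score, best_length = score, column
        if best_score - score >= x_drop:
            break  # score has fallen X below the best seen so far
    return best_score, best_length, trace


best_score, best_length, trace = extend_right(
    "ABCDEFGHIJKLMNOPQRST",
    "ABCDEFZYIJKLMXWVURAB",
)
print(best_score, best_length)
print("".join(str(s) for s in trace))
```

The extension is trimmed back to the column where the best score was reached, so the reported segment covers the first 13 columns with a score of 9.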

Query word (w = 3), neighborhood score threshold T = 13
Query: GSDFWQETRASFGCSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFGCATSWPI
Neighborhood words: PQG 18, PEG 15, PRG 14, PKG 14, PNG 13, PDG 13, PHG 13, PMG 13, PSG 13 (PQA 12, PQN 12, … fall below T)
Hit, after extension:
Query:   SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA
         +LA++L+   TP G R++ +W+  P+ D   + ER   + A
Subject: TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA

BLAST Steps • Evaluation • Maximal segment pairs (MSPs) – maximum-scoring HSPs • Evaluate the statistical significance of extended hits (HSPs) • Report only those above the determined threshold (MSPs)

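BLAST – Statistical Evaluation
For local, ungapped alignments:
$$E = K \, m \, n \, e^{-\lambda S}$$
$$p = 1 - e^{-E}$$
• m: size of query
• n: size of database
• E: expected # of HSPs with scores at least S
• p: probability of finding at least one HSP with score at least S
• K, λ: statistical parameters that depend on the scoring system
Good tutorial at: http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html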
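
These quantities can be computed directly, assuming the standard Karlin-Altschul form for ungapped local alignments, E = K·m·n·e^(−λS) and p = 1 − e^(−E). The sketch below uses hypothetical function names; the values K = 0.134 and λ = 0.318 are illustrative assumptions (typical of ungapped BLOSUM62 scoring), and the query and database sizes are made up for the example.

```python
import math


def expected_hsps(score, m, n, K, lam):
    """E-value: expected number of HSPs scoring at least `score` by chance."""
    return K * m * n * math.exp(-lam * score)


def p_at_least_one(e_value):
    """Probability of at least one such HSP, p = 1 - exp(-E)."""
    return -math.expm1(-e_value)  # numerically stable for very small E


def bit_score(score, K, lam):
    """Normalized score S' such that E = m * n * 2**(-S')."""
    return (lam * score - math.log(K)) / math.log(2)


# Illustrative assumptions, not values taken from a real search.
K, LAM = 0.134, 0.318
m, n = 250, 50_000_000  # query length and total database length

for raw_score in (40, 80, 120):
    e = expected_hsps(raw_score, m, n, K, LAM)
    print(raw_score, f"{bit_score(raw_score, K, LAM):.1f}", f"{e:.3g}", f"{p_at_least_one(e):.3g}")
```

For small E the two numbers are nearly identical (p ≈ E), which is why BLAST output reports only the E-value.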

Interpretations of Expected Value • Expected value ranges • E < 10^-100 → very low, homologs or identical genes • E < 10^-3 → moderate, may be related genes • E > 1 → high, probably / may be unrelated • 0.5 < E < 1 → ??? In the “twilight zone”; try a detailed search • If database search • Long list of gradually declining E values → large gene family • Long regions of moderate similarity → more significant than short regions of high identity • Biological relevance • Still need to determine biological significance!!!
