Searching Molecular Databases with BLAST
PowerPoint Slideshow about 'searching molecular databases with blast' - MikeCarlo


Presentation Transcript

Searching Molecular Databases with BLAST

  • Basic Local Alignment Search Tool

  • How BLAST works

  • Interpreting search results

  • The NCBI Web BLAST interface

  • Demonstration and exercises


Why learn sequence database searching?

  • What have I cloned?

  • Is this really “my gene”?

  • Has someone else already found it?

  • What is this protein’s function?

  • What is it related to?

  • Can I get more sequence easily?


Search programs are sequence alignment programs

  • They try to find the best alignment between your probe sequence and every target sequence in the database

  • Finding optimal alignments is a computationally intensive process

  • It is usually not necessary to find optimal alignments, particularly for large databases

  • Alignments are ranked and only top scores are reported


Practical database search methods incorporate shortcuts

  • The fastest sequence database searching programs use heuristic algorithms

  • The basic concept is to break the search and alignment process down into several steps

  • At each step, only a best scoring subset is retained for further analysis
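
The stepwise idea above can be illustrated with a toy "seed and extend" search. This is a minimal sketch under stated assumptions, not BLAST's actual implementation: the function names, the word size `k`, the scoring values, and the `drop` cutoff are all hypothetical choices made for illustration. Step 1 indexes short words in the target, step 2 extends each exact word hit without gaps, and step 3 keeps only the best-scoring subset.

```python
from collections import defaultdict

def index_words(target, k):
    """Step 1: map every k-letter word in the target to its start positions."""
    index = defaultdict(list)
    for i in range(len(target) - k + 1):
        index[target[i:i + k]].append(i)
    return index

def extend_seed(query, target, q, t, k, match=1, mismatch=-1, drop=3):
    """Step 2: extend an exact word hit to the right without gaps.

    Extension stops once the running score has fallen `drop` below the
    best score seen so far (illustrative values, not BLAST defaults).
    """
    score = best = k * match
    best_len = k
    i = k
    while q + i < len(query) and t + i < len(target):
        score += match if query[q + i] == target[t + i] else mismatch
        if score > best:
            best, best_len = score, i + 1
        elif best - score >= drop:
            break
        i += 1
    return best, best_len

def search(query, target, k=4, keep=3):
    """Step 3: score every seed and retain only the top `keep` hits."""
    index = index_words(target, k)
    hits = []
    for q in range(len(query) - k + 1):
        for t in index.get(query[q:q + k], []):
            score, length = extend_seed(query, target, q, t, k)
            hits.append((score, q, t, length))
    hits.sort(reverse=True)
    return hits[:keep]

# Each hit is (score, query start, target start, aligned length).
top_hits = search("ACGTACGT", "TTACGTACGTTT")
```

Only positions that share an exact word with the query are ever examined, which is where the speed-up over an exhaustive alignment comes from.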


What does ‘HEURISTIC’ mean?

  • “a commonsense rule (or set of rules) intended to increase the probability of solving some problem”

  • Why consider every possible alignment once a reasonably good alignment is found?


Heuristic programs find approximate alignments

  • They are less sensitive than “dynamic programming” algorithms such as Smith-Waterman for detecting weak similarity

  • In practice, they run much faster and are usually adequate

  • The BLAST program developed by Stephen Altschul and coworkers at the NCBI is the most widely used heuristic program
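
For comparison, the dynamic programming approach mentioned above can be sketched as a small Smith-Waterman scorer. This is a simplified illustration: it uses a linear gap penalty and arbitrary match/mismatch values (assumptions, not the parameters of any particular tool), and it returns only the best local score rather than the alignment itself.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0, which is what makes the alignment local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

score = smith_waterman_score("TTACGTTT", "GGACGGG")  # shared "ACG" region
```

Every cell of the len(a) × len(b) table is filled in, so the cost grows with the product of the sequence lengths. Repeating that for every sequence in a large database is what makes the exhaustive approach slow.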


BLAST is a collection of five programs for different combinations of query and database sequences
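
The five classic programs are blastn, blastp, blastx, tblastn, and tblastx. The lookup below summarizes which query and database types each one accepts; the dictionary layout and the `candidate_programs` helper are illustrative assumptions, not part of any BLAST distribution.

```python
# program: (query sequence type, database sequence type, level of comparison)
BLAST_PROGRAMS = {
    "blastn":  ("nucleotide", "nucleotide", "nucleotide"),
    "blastp":  ("protein", "protein", "protein"),
    "blastx":  ("nucleotide", "protein", "protein"),     # query translated in 6 frames
    "tblastn": ("protein", "nucleotide", "protein"),     # database translated in 6 frames
    "tblastx": ("nucleotide", "nucleotide", "protein"),  # both translated in 6 frames
}

def candidate_programs(query_type, db_type):
    """Return the programs that accept this query/database combination."""
    return [name for name, (q, d, _) in BLAST_PROGRAMS.items()
            if (q, d) == (query_type, db_type)]

options = candidate_programs("nucleotide", "nucleotide")
```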


Why BLAST is great

  • Very fast and can be used to search extremely large databases

  • Sufficiently sensitive and selective for most purposes

  • Robust - the default parameters can usually be used


BLAST scores are reported in two columns

  • Raw values based on the specific scoring matrix employed

  • As bits, which are matrix-independent normalized values (bigger = better)

  • Significance is represented by E values (smaller = better)
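
The relationship between raw scores, bit scores, and E values can be sketched with the Karlin-Altschul formulas: the bit score is S' = (λS − ln K) / ln 2 and the E value is E = m·n·2^(−S'), where m and n are the query and database lengths. The code below is a simplified illustration. The statistical parameters λ and K depend on the scoring matrix and gap penalties, so the values passed in are assumptions supplied by the caller, and real BLAST additionally corrects m and n to "effective" lengths, which is omitted here.

```python
import math

def bit_score(raw_score, lam, k):
    """Normalize a raw score using the scoring system's lambda and K."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def expect_value(bits, query_len, db_len):
    """Number of chance matches expected with at least this bit score."""
    return query_len * db_len * 2.0 ** (-bits)

# Each additional bit halves the E value for a fixed search space.
e_30 = expect_value(30, query_len=300, db_len=1_000_000)
e_31 = expect_value(31, query_len=300, db_len=1_000_000)
```

Because the E value scales with the size of the search space, the same bit score is less significant in a larger database.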


Typical BLAST Output

Sorted by E value


The EXPECT (E) threshold is used to control score reporting

  • A match will only be reported if its E value falls below the threshold set

  • The default value for E is 10, which means that about 10 matches with scores at least this high are expected to be found by chance in a search of this size

  • Lower EXPECT thresholds are more stringent, and report fewer matches
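
The effect of the threshold can be shown by filtering parsed hits. The sketch below assumes hits in the 12-column tabular layout that BLAST+ writes with `-outfmt 6` (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The three sample rows are made-up illustrative data, not real search results, and the function names are hypothetical.

```python
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

# Hypothetical hits, invented for illustration only.
SAMPLE = "\n".join([
    "query1\tsubjA\t100.00\t50\t0\t0\t1\t50\t101\t150\t1e-20\t93.5",
    "query1\tsubjB\t85.00\t40\t6\t0\t5\t44\t10\t49\t0.002\t38.2",
    "query1\tsubjC\t100.00\t12\t0\t0\t3\t14\t7\t18\t4.5\t22.9",
])

def parse_tabular(text):
    """Parse tab-separated hit lines into dictionaries."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        row = dict(zip(FIELDS, line.split("\t")))
        row["evalue"] = float(row["evalue"])
        row["bitscore"] = float(row["bitscore"])
        hits.append(row)
    return hits

def report(hits, expect=10.0):
    """Keep hits whose E value falls below the threshold, best first."""
    kept = [h for h in hits if h["evalue"] < expect]
    return sorted(kept, key=lambda h: h["evalue"])

hits = parse_tabular(SAMPLE)
default_report = [h["sseqid"] for h in report(hits)]            # all three hits
strict_report = [h["sseqid"] for h in report(hits, expect=0.01)]  # drops subjC
```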


Interpreting BLAST scores

  • Score interpretation is based on context

    • What is the question?

    • What else do you know about the sequences?

    • Scoring is highly dependent on probe length

  • Exact matches will usually have the highest scores (and lowest E values)

    • Short exact matches may score lower than longer partial matches


Interpreting BLAST scores

  • Short exact matches are expected to occur at random.

  • Partial matches over the entire length of a query are stronger evidence for homology than are short exact matches.
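
A rough calculation shows why short exact matches carry little weight. The sketch below assumes a random database in which all four bases are equally likely and independent, which real genomes are not, so treat the numbers as order-of-magnitude estimates for a single strand. The function name is hypothetical.

```python
def expected_chance_matches(word_len, db_letters, alphabet_size=4):
    """Expected exact occurrences of one specific word in random sequence."""
    positions = max(db_letters - word_len + 1, 0)
    return positions * (1.0 / alphabet_size) ** word_len

# In a database of roughly 3 billion bases:
short_word = expected_chance_matches(12, 3_000_000_000)  # well over a hundred
long_word = expected_chance_matches(20, 3_000_000_000)   # far below one
```

Under these assumptions a 12-base exact match is expected many times over by chance alone, while a 20-base exact match is not.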


Homology vs Identity

  • Homologous sequences are descended from a common ancestral sequence.

  • Homology is either true or false. It can never be partial! Saying two sequences are 45% homologous is a misuse of the term.

  • Sequence identity and similarity can be described as a percentage and are used as evidence of homology.
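
Percent identity is a property of an alignment and can be computed directly, whereas homology is an inference drawn from it. The helper below is a minimal sketch with a hypothetical name. It takes two already-aligned strings of equal length with `-` marking gaps, and divides identical columns by the full alignment length, gap columns included.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of alignment columns holding the same residue in both rows."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = len(aligned_a)
    if columns == 0:
        return 0.0
    identical = sum(1 for x, y in zip(aligned_a, aligned_b)
                    if x == y and x != gap)
    return 100.0 * identical / columns

identity = percent_identity("ACGT-A", "ACCTTA")  # 4 identical columns out of 6
```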


BLAST Example

Is this sequence known? What does it encode?


Search Strategy

  • Choose the BLAST program (nucleotide query vs. nucleotide database):

    • megablast: optimized to find identical or nearly identical sequences

    • blastn: will find identical and similar sequences

  • Choose the Database

    • nr (non-redundant) – everything

    • genome specific
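
The same two choices (program and database) also apply outside the web interface. As a hedged sketch, the helper below builds an argument list for the `blastn` executable from the NCBI BLAST+ command-line suite, assuming it is installed. The helper name, the file names, and the database name `my_genome_db` are hypothetical placeholders. The code only constructs the command and does not run it.

```python
def blast_command(query_fasta, database, task="megablast",
                  evalue=10.0, out="hits.tsv"):
    """Build a BLAST+ blastn command line as a list of arguments."""
    if task not in ("megablast", "blastn"):
        raise ValueError("task must be 'megablast' or 'blastn'")
    return [
        "blastn",
        "-task", task,            # megablast for near-identical, blastn for similar
        "-query", query_fasta,
        "-db", database,
        "-evalue", str(evalue),   # EXPECT threshold
        "-outfmt", "6",           # tabular output
        "-out", out,
    ]

# Hypothetical file and database names; pass the list to subprocess.run to execute.
cmd = blast_command("query.fa", "my_genome_db", task="blastn", evalue=1e-5)
```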


blastn Options

  • Paste the query sequence

  • Choose the database

  • Choose the search program


Each line is a hit in the database, sorted vertically by E value.

Colored rectangles along the X axis show where in the query sequence a similarity in the database has been found. Color indicates degree of similarity.


Output sorted by E value


Link to GenBank file


Link to alignment


Link to Entrez Gene


blastn Alignment


BLASTP Example


blastp input


blastp Databases

  • nr - All non-redundant GenBank CDS translations + PDB + SwissProt+PIR

  • swissprot - the last major release of the SWISS-PROT protein sequence database

  • pat - patented sequences

  • pdb - Sequences derived from the 3-dimensional structures in the Protein Data Bank

  • env_nr - Non-redundant environmental samples


BLASTP Output

Conserved Domain Search: conserved domains are shown graphically, with a link to an explanation of the domain.


blastp Output


blastp Alignment


Protein Scoring Matrices

BLOSUM62 is the default BLASTP scoring matrix
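
A scoring matrix assigns a value to every pair of aligned residues, and the raw score of an ungapped alignment is the sum over its columns. The sketch below uses only a small hand-copied subset of BLOSUM62 entries, enough for the example; a real search uses the full 20 × 20 matrix. The helper names are hypothetical.

```python
# A few BLOSUM62 entries, keyed by alphabetically sorted residue pair.
BLOSUM62_SUBSET = {
    ("A", "A"): 4, ("L", "L"): 4, ("I", "I"): 4, ("I", "L"): 2,
    ("K", "K"): 5, ("R", "R"): 5, ("K", "R"): 2,
    ("D", "D"): 6, ("E", "E"): 5, ("D", "E"): 2,
    ("W", "W"): 11,
}

def pair_score(x, y):
    """Look up a residue pair; the matrix is symmetric, so order is ignored."""
    return BLOSUM62_SUBSET[tuple(sorted((x, y)))]

def raw_score(aligned_a, aligned_b):
    """Sum of pair scores over an ungapped alignment of equal-length strings."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(x, y) for x, y in zip(aligned_a, aligned_b))

identical = raw_score("KLDW", "KLDW")     # 5 + 4 + 6 + 11
conservative = raw_score("KLDW", "RIEW")  # 2 + 2 + 2 + 11
```

Conservative substitutions such as K/R or L/I still score positively, which is why protein searches can detect similarity that is no longer visible at the nucleotide level.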


Different matrices produce slightly different alignments


Other BLAST Programs: PSI-BLAST

PSI-BLAST is designed for more sensitive protein-protein similarity searches.

Position-Specific Iterated (PSI)-BLAST is the most sensitive BLAST program, making it useful for finding very distantly related proteins or new members of a protein family. Use PSI-BLAST when your standard protein-protein BLAST search either failed to find significant hits, or returned hits with descriptions such as "hypothetical protein" or "similar to...".


Other BLAST Programs: PHI-BLAST

PHI-BLAST can do a restricted protein pattern search.

Pattern-Hit Initiated (PHI)-BLAST is designed to search for proteins that contain a pattern specified by the user AND are similar to the query sequence in the vicinity of the pattern. This dual requirement is intended to reduce the number of database hits that contain the pattern, but are likely to have no true homology to the query.


Sequence filters

  • Since only a limited number of matches are reported, hits to simple repeats and other low complexity sequences can obscure other more biologically meaningful similarities

  • Filters are used to remove low complexity sequences from the probe

  • Available filters: low complexity; human repeats (blastn)
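
The idea of low-complexity filtering can be illustrated with a sliding-window entropy mask. This is a simplified stand-in and not the algorithm BLAST uses (NCBI BLAST relies on dedicated programs such as DUST for nucleotides and SEG for proteins). The window size, the entropy cutoff, and the function names are assumptions chosen for illustration.

```python
import math
from collections import Counter

def shannon_entropy(window):
    """Entropy in bits of the letter composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def mask_low_complexity(seq, window=12, min_entropy=1.0, mask_char="N"):
    """Replace every window whose entropy is below the cutoff with mask_char."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)

probe = "ACGTGCATCGTA" + "A" * 20 + "TGCATGCAGTCA"
masked_probe = mask_low_complexity(probe)  # the poly-A run is replaced by N
```

Masked positions cannot seed a match, so hits driven only by the repetitive stretch disappear from the report.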


Low Complexity Sequences are Filtered Out


BLASTN vs BLASTP

  • Protein sequences have much higher information content than nucleotide sequences

  • To find evidence for sequence homology, use BLASTP and search protein sequences

  • Is my sequence already in the database?

  • To find identical sequences, search nucleotide databases


Translated BLAST Searches

  • translations use all 6 frames

  • computationally intensive

  • tblastx searches can be very slow with some large databases

  • must specify genetic code
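
A six-frame translation can be sketched as follows: three reading frames on the given strand and three on its reverse complement. The code builds the standard genetic code from the conventional TCAG-ordered amino acid string; passing a different `table` dictionary would model an alternate genetic code. The function names and frame labels are illustrative assumptions.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {"".join(codon): aa
                 for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def reverse_complement(dna):
    """Reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna, table=STANDARD_CODE):
    """Translate complete codons; unknown codons become X, stops become *."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(table.get(codon, "X") for codon in codons)

def six_frame_translations(dna, table=STANDARD_CODE):
    """Return translations for frames +1, +2, +3, -1, -2, -3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:], table)
    return frames

frames = six_frame_translations("ATGGCCTAA")
```

Every translated search has to compare all of these frames, which is part of why tblastx, which translates both query and database, is the slowest of the five programs.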


Alternate Genetic Codes


Taxonomy Reports


BLAST Genomes


Align 2 Sequences with BLAST


BLAST from ORF Finder


Primer BLAST


BLAST Tutorial

  • BLAST tutorial on Biocomp Web page

  • Goal: demonstrate utility and difference between BLASTN and BLASTP searches

  • BLASTN: is my DNA sequence in the database?

  • BLASTP: are there related proteins (homologs) in the database?

