School b i tcd bioinformatics
This presentation is the property of its rightful owner.
Sponsored Links
1 / 33

School B&I TCD Bioinformatics PowerPoint PPT Presentation


  • 73 Views
  • Uploaded on
  • Presentation posted in: General

School B&I TCD Bioinformatics. Database homology searching May 2010. Why search a Database. To find homologous sequences to your unknown to determine function To find other related sequences to do evolutionary studies (trees) or to make specialised database (nematode 16sRNA)

Download Presentation

School B&I TCD Bioinformatics

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


School b i tcd bioinformatics

School B&I TCD Bioinformatics

Database homology searching

May 2010


Why search a database

Why search a Database

  • To find homologous sequences to your unknown to determine function

  • To find other related sequences to do evolutionary studies (trees) or to make specialised database (nematode 16sRNA)

  • To find the mouse or E.coli homolog of your gene of interest

  • To find genes in a newly sequenced genome

  • To predict 3-D structure (blast vs PDB)


Blast

BLAST

  • For many molecular biologists

  • a Blast search

    • with a DNA sequence

    • at NCBI

    • accepting default parameters

  • IS bioinformatics

    • 70% of searches at NCBI at blastN


Blast1

BLAST

  • Basic local alignment search tool

  • More or less unchanged for 20 years

  • Extraordinarily quick:

    • Submit seq via satellite to Bethesda MD

    • Crunch thro 400 GB of sequence

    • Return results via satellite

    • ALL in 30 seconds

  • Available at other sites (EBI, ExPaSy etc.)

    • with better response times, fewer conditions


Ncbi penalties

NCBI penalties

  • 1st request: current time

  • 2nd request: current time + 60 seconds

  • 3rd request: current time + 120 seconds

  • 4th request: current time + 180 seconds

  • 5th request: current time + 240 second

  • DON’T submit multiple simultaneous queries


Reliable servers

Reliable servers

  • http://www.ch.embnet.org/software/bBLAST.html

  • http://www.ch.embnet.org/software/aBLAST.html

    • Basic and advanced SwissBlast

  • http://www.ebi.ac.uk/blast2/index.html

    • EBI (Cambridge, UK) NCBI Blast

  • http://www.ncbi.nlm.nih.gov/blast/

    • Not the only Blast server!


Blast varieties

BLAST varieties

  • blastn: searches a DNA sequence against a DNA database like EMBL, Genbank, or dbEST. also megablast for exact matches

  • blastp: searches a protein sequence against a protein database such as UniProt, or "nr" a non-redundant database which ideally contains one copy of every available sequence.

    Then you have:

  • blastx: searches a DNA sequence (translated in all six reading frames) against a protein database.

  • tblastn: searches a protein sequence against a DNA database (translated in all six reading frames) – essential for searching EST databases.

    and in the interests of completeness there is:

  • tblastx: searches a DNA sequence (translated in all six reading frames) against a DNA database (translated in all six reading frames).

    finally

  • Psi-blast an iterative process using position specific subst matrix PSSM to find distantly related sequences


Algorithm

Algorithm

  • Five steps

    • Break the query seq into short “words” – typically 3 consecutive residues for protein

    • Search the database for (nearly) exact matches to these words

    • Extend match “hits” out on either side until score stops going up – these are HSPs (high scoring segment pairs)

    • Sort the HSPs by some “optimum” criterion

    • Significant hits are then formally scored, aligned and displayed


Alternatives to blast

Alternatives to BLAST

  • FASTA

    • A little slower than Blast

    • More sensitive (so recommended) for DNA to DNA searches

  • Smith-Waterman (Blitz)

    • Much slower than Blast (20x slower)

    • Much more sensitive

  • But Blast is standard because it gives a “good enough” answer quickly.


Blast blast at ncbi

Blast Blast at NCBI

Paste your query sequence here

Parallel search in Conserved domain DB


Input sequence parameters

Input sequence parameters

Optional parameters to change include

  • Parameters that alter the hits found

    • Database searched

    • Word size

    • Substitution matrix

    • Gap penalties

    • Low complexity masking

  • Parameters that alter the results delivered

    • Expectation cut-off

    • Limit by organism or taxonomic group

    • Number of hits reported

    • Number of alignments shown


Database

Database

  • If query is a coding gene: translate and search protein database

  • Search PDB if you want a 3-D structure

  • Search NR if you want any hit

  • Search UniProt to know what the hits are

  • Search dbEST to know if your sequence is expressed

  • UniProt90: no seq is more than 90% ident to any other (for an uncluttered tree) also UniProt50


Word size

Word size

  • Default at NCBI Blast

    • 11 for DNA; option of 7 and 5

    • 3 for protein; option of 2

  • Increasing wordsize will speed search

    … but will lose sensitivity

    • It will miss some useful but distant hits

  • Decreasing wordsize will be more sensitive … but take longer


Substitution matrix

Substitution matrix

  • By default BLAST uses BLOSUM62

  • Can change this

    • Blosum90 mirrors changes in closely related sequences

    • Blosum30 changes in distant sequences

  • Should run 3 blast searches with different substitution matrices.


Substitution matrices

Changes 30-62-90 not linear

Substitution matrices

Top left part of a BLOSUM 90 matrix

A R N D C Q E G H I L

A 5 -2 -2 -3 -1 -1 -1 0 -2 -2 -2

R -2 6 -1 -3 -5 1 -1 -3 0 -4 -3

N -2 -1 7 1 -4 0 -1 -1 0 -4 -4

D -3 -3 1 7 -5 -1 1 -2 -2 -5 -5

C -1 -5 -4 -5 9 -4 -6 -4 -5 -2 -2

Q -1 1 0 -1 -4 7 2 -3 1 -4 -3

E -1 -1 -1 1 -6 2 6 -3 -1 -4 -4

G 0 -3 -1 -2 -4 -3 -3 6 -3 -5 -5

H -2 0 0 -2 -5 1 -1 -3 8 -4 -4

I -2 -4 -4 -5 -2 -4 -4 -5 -4 5 1

L -2 -3 -4 -5 -2 -3 -4 -5 -4 1 5

Positive off-diagonals aresimilar


Reality of matrix change

Reality of matrix change

  • Query: GHDEICI

  • GH + C

  • Sbjct: GHACNCG

    • Scores 39 with Blosum30; 5 with Blosum90

  • Query: HEQCRLEN

  • +E LEN

  • Sbjct: QENAHLEN

    • Scores 19 with Blosum30; 24 with Blosum90

  • So GHDEICIHEQCRLEN will find different hits


Matrix families

Matrix families

  • PAM – point accepted mutation

    • Margaret Dayhoff 1970s (before Genbank)

  • BLOSUM – based on aligned blocks

    • Henikoff and Henikoff 1992

    • Blosum 62 default (based on aligned seqs 62% identical – don’t ask why 62: it works)

  • Low PAM = High Blosum

    • PAM 250 == Blosum 30

    • PAM 30 == Blosum 90


Gap penalty

Gap penalty

  • Original Blast was gap-free

    • Because gapped aligns much more CPU

  • Now can insert “affine” gaps

    • Gap open 10; gap extend 1

  • Raise gap penalty to discourage gaps

    • Preferentially gets closely related hits

  • Lower gap penalty for sensitive search for distant relatives


Low complexity masking

Low Complexity Masking

  • Seg masks low-complexity regions

    • “too many” serines, prolines etc.

  • What a masked sequence looks like:

    >P04729 Wheat gamma gliadin

    MKTFLVFALIAVVATSAIAQMETSCISGLERPWQQQPLPPQQSFSQQPPFSQQQQQPLPQ

    QPSFSQQQPPFSQQQPILSQQPPFSQQQQPVLPQQSPFSQQQQLVLPPQQQQQQLVQQQI

    PIVQPSVLQQLNPCKVFLQQQCSPVAMPQRLARSQMWQQSSCHVMQQQCCQQLQQIPEQS

    RYEAIRAIIYSIILQEQQQGFVQPQQQQPQQSGQGVSQSQQQSQQQLGQCSFQQPQQQLG

    QQPQQQQQQQVLQGTFLQPHQIAHLEAVTSIALRTLPTMCSVNVPLYSATTSVPFGVGTG

    VGAY*

    and after low complexity masking:

    >P04729 SEG low-complexity masked

    MKTFLVFALIAVVATSAIAQMETSCISGLERPWXXXXXXXXXXXXXXXXXXXXXXXXXXX

    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

    XXXXXXXXXXLNPCKVFLQQQCSPVAMPQRLARSQMWXXXXXXXXXXXXXXXXXXXXXXX

    RYEAIRAIIYSIIXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

    XXXXXXXXXXXXXXXXXXXHQIAHLEAVTSIALRTLPTMCSVNVPLYSATTSVPFGVGTG

    VGAY*

  • Xnu masks repeats (from database of common repeats)


Masking

Masking

  • Check if masking on/off by default (differs at sites)

  • Run 2 searches with masking on, masking off ANNYway

  • Masking by hand:

    • If you know about the DNA binding domain already

      • Which is really common and will occupy the top 100 hits against any database

    • Replace the region with Xs

    • Re-run blast to find other motifs/domains/information

    • NCBI blast allows select subsequences

  • DUST masks low info DNA sequences

    • Like polyA tails


Advanced

Advanced


Expectation cut off

Expectation cut-off

  • E value

    • Expected number of chance hits

      • Given the database searched

      • Query HSP length

  • Related to probability

  • Default is 10: “expect to find 10 hits as good as this by chance alone”

  • E values less than 10-4 unreliable

  • So different from p < 0.05

  • For short seqs crank E up to 100 or 1000


Data deluge hits delivered

Data deluge – hits delivered

  • Default suits general purpose

  • V = 50 num of “hits” one line descriptions

  • B = 50 num of alignments between query and hit which are displayed


Swiss blast

Swiss Blast


Ebi blast

EBI blast


Limit by taxon organism

Limit by taxon/organism

Also search at www.ensembl.org for “genomed” organisms


Database choice

Database choice

NR for getting aNNy hit

swissprot or refseq for getting hits that have annotation

Month for recent hits


The output

The output

  • Is in five parts

  • Admin – date, size of database, id of query

  • Graphic display of query and hits

  • List of hits with links to database and to…

  • …alignments (may be > 1 HSP per hit)

  • More admin, including errors/warnings


Notes

Notes!

  • Record the top admin stuff. Your search will be different

    • Next week

    • On a different server

    • With different DB

    • With different input parameters

  • Mouse-over any graphic display

    • Shows domain structure

    • Shows if hit is global or local

  • Read the bottom admin stuff


Blast output

Blast output

sp|P06471|HOR3_HORVU B3-HORDEIN

Length = 264

Score = 62.5 bits (149), Expect = 1e-09

Identities = 32/63 (50%), Positives = 38/63 (59%)

Query: 131 LNPCARSQMWXXXXXXXXXXXXXXXXXXXXXXXRYEAIY

LNPCARSQM R+EA+Y

Sbjct: 111 LNPCARSQMLQQSSCHVLQQQCCQQLPQIPEQLRHEAVY

Query: 191 SII 193

SI+

Sbjct: 171 SIV 173

Low-complexity masked region

What sort of residues masked in

The query sequence??

May be more HSPs for same hit

If big deletion in either seq


General blast protocol

General blast protocol

  • Find server, choose DB, paste seq, GO

  • Rerun search with/out masking

  • Rerun search with two diff subs matrix

  • 2 x 3 for six searches

  • If top N hits all same family/domain then XXX this region and resubmit

  • LOOK at the results; esp strange ones

  • Limit results by organism


Blast notes

Blast notes

  • Twilight zone <25% protein <70% NA

  • Because two sequences give high blast scores to a third, doesn’t mean they are related

    • NNNNDOMAINANNNNDOMAINBNNN

    • NNNNDAMIANANNNNNNNNNNNNNN

    • NNNNNNNNNNNNNNNDIMIAMBNNN

  • E-value vs % ID. >10-4 unreliable

  • Bit score <50 unreliable


Psi blast

PSI-BLAST

  • Uses a PSSM rather than BloSum/PAM

  • Iterative …

    • Can find very distant relatives

    • …so deep insight

  • BUT can iterate off with the fairies


  • Login