School b i tcd bioinformatics
This presentation is the property of its rightful owner.
Sponsored Links
1 / 33

School B&I TCD Bioinformatics PowerPoint PPT Presentation


  • 77 Views
  • Uploaded on
  • Presentation posted in: General

School B&I TCD Bioinformatics. Database homology searching May 2010. Why search a Database. To find homologous sequences to your unknown to determine function To find other related sequences to do evolutionary studies (trees) or to make specialised database (nematode 16sRNA)

Download Presentation

School B&I TCD Bioinformatics

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


School B&I TCD Bioinformatics

Database homology searching

May 2010


Why search a Database

  • To find homologous sequences to your unknown to determine function

  • To find other related sequences to do evolutionary studies (trees) or to make specialised database (nematode 16sRNA)

  • To find the mouse or E.coli homolog of your gene of interest

  • To find genes in a newly sequenced genome

  • To predict 3-D structure (blast vs PDB)


BLAST

  • For many molecular biologists

  • a Blast search

    • with a DNA sequence

    • at NCBI

    • accepting default parameters

  • IS bioinformatics

    • 70% of searches at NCBI at blastN


BLAST

  • Basic local alignment search tool

  • More or less unchanged for 20 years

  • Extraordinarily quick:

    • Submit seq via satellite to Bethesda MD

    • Crunch thro 400 GB of sequence

    • Return results via satellite

    • ALL in 30 seconds

  • Available at other sites (EBI, ExPaSy etc.)

    • with better response times, fewer conditions


NCBI penalties

  • 1st request: current time

  • 2nd request: current time + 60 seconds

  • 3rd request: current time + 120 seconds

  • 4th request: current time + 180 seconds

  • 5th request: current time + 240 second

  • DON’T submit multiple simultaneous queries


Reliable servers

  • http://www.ch.embnet.org/software/bBLAST.html

  • http://www.ch.embnet.org/software/aBLAST.html

    • Basic and advanced SwissBlast

  • http://www.ebi.ac.uk/blast2/index.html

    • EBI (Cambridge, UK) NCBI Blast

  • http://www.ncbi.nlm.nih.gov/blast/

    • Not the only Blast server!


BLAST varieties

  • blastn: searches a DNA sequence against a DNA database like EMBL, Genbank, or dbEST. also megablast for exact matches

  • blastp: searches a protein sequence against a protein database such as UniProt, or "nr" a non-redundant database which ideally contains one copy of every available sequence.

    Then you have:

  • blastx: searches a DNA sequence (translated in all six reading frames) against a protein database.

  • tblastn: searches a protein sequence against a DNA database (translated in all six reading frames) – essential for searching EST databases.

    and in the interests of completeness there is:

  • tblastx: searches a DNA sequence (translated in all six reading frames) against a DNA database (translated in all six reading frames).

    finally

  • Psi-blast an iterative process using position specific subst matrix PSSM to find distantly related sequences


Algorithm

  • Five steps

    • Break the query seq into short “words” – typically 3 consecutive residues for protein

    • Search the database for (nearly) exact matches to these words

    • Extend match “hits” out on either side until score stops going up – these are HSPs (high scoring segment pairs)

    • Sort the HSPs by some “optimum” criterion

    • Significant hits are then formally scored, aligned and displayed


Alternatives to BLAST

  • FASTA

    • A little slower than Blast

    • More sensitive (so recommended) for DNA to DNA searches

  • Smith-Waterman (Blitz)

    • Much slower than Blast (20x slower)

    • Much more sensitive

  • But Blast is standard because it gives a “good enough” answer quickly.


Blast Blast at NCBI

Paste your query sequence here

Parallel search in Conserved domain DB


Input sequence parameters

Optional parameters to change include

  • Parameters that alter the hits found

    • Database searched

    • Word size

    • Substitution matrix

    • Gap penalties

    • Low complexity masking

  • Parameters that alter the results delivered

    • Expectation cut-off

    • Limit by organism or taxonomic group

    • Number of hits reported

    • Number of alignments shown


Database

  • If query is a coding gene: translate and search protein database

  • Search PDB if you want a 3-D structure

  • Search NR if you want any hit

  • Search UniProt to know what the hits are

  • Search dbEST to know if your sequence is expressed

  • UniProt90: no seq is more than 90% ident to any other (for an uncluttered tree) also UniProt50


Word size

  • Default at NCBI Blast

    • 11 for DNA; option of 7 and 5

    • 3 for protein; option of 2

  • Increasing wordsize will speed search

    … but will lose sensitivity

    • It will miss some useful but distant hits

  • Decreasing wordsize will be more sensitive … but take longer


Substitution matrix

  • By default BLAST uses BLOSUM62

  • Can change this

    • Blosum90 mirrors changes in closely related sequences

    • Blosum30 changes in distant sequences

  • Should run 3 blast searches with different substitution matrices.


Changes 30-62-90 not linear

Substitution matrices

Top left part of a BLOSUM 90 matrix

A R N D C Q E G H I L

A 5 -2 -2 -3 -1 -1 -1 0 -2 -2 -2

R -2 6 -1 -3 -5 1 -1 -3 0 -4 -3

N -2 -1 7 1 -4 0 -1 -1 0 -4 -4

D -3 -3 1 7 -5 -1 1 -2 -2 -5 -5

C -1 -5 -4 -5 9 -4 -6 -4 -5 -2 -2

Q -1 1 0 -1 -4 7 2 -3 1 -4 -3

E -1 -1 -1 1 -6 2 6 -3 -1 -4 -4

G 0 -3 -1 -2 -4 -3 -3 6 -3 -5 -5

H -2 0 0 -2 -5 1 -1 -3 8 -4 -4

I -2 -4 -4 -5 -2 -4 -4 -5 -4 5 1

L -2 -3 -4 -5 -2 -3 -4 -5 -4 1 5

Positive off-diagonals aresimilar


Reality of matrix change

  • Query: GHDEICI

  • GH + C

  • Sbjct: GHACNCG

    • Scores 39 with Blosum30; 5 with Blosum90

  • Query: HEQCRLEN

  • +E LEN

  • Sbjct: QENAHLEN

    • Scores 19 with Blosum30; 24 with Blosum90

  • So GHDEICIHEQCRLEN will find different hits


Matrix families

  • PAM – point accepted mutation

    • Margaret Dayhoff 1970s (before Genbank)

  • BLOSUM – based on aligned blocks

    • Henikoff and Henikoff 1992

    • Blosum 62 default (based on aligned seqs 62% identical – don’t ask why 62: it works)

  • Low PAM = High Blosum

    • PAM 250 == Blosum 30

    • PAM 30 == Blosum 90


Gap penalty

  • Original Blast was gap-free

    • Because gapped aligns much more CPU

  • Now can insert “affine” gaps

    • Gap open 10; gap extend 1

  • Raise gap penalty to discourage gaps

    • Preferentially gets closely related hits

  • Lower gap penalty for sensitive search for distant relatives


Low Complexity Masking

  • Seg masks low-complexity regions

    • “too many” serines, prolines etc.

  • What a masked sequence looks like:

    >P04729 Wheat gamma gliadin

    MKTFLVFALIAVVATSAIAQMETSCISGLERPWQQQPLPPQQSFSQQPPFSQQQQQPLPQ

    QPSFSQQQPPFSQQQPILSQQPPFSQQQQPVLPQQSPFSQQQQLVLPPQQQQQQLVQQQI

    PIVQPSVLQQLNPCKVFLQQQCSPVAMPQRLARSQMWQQSSCHVMQQQCCQQLQQIPEQS

    RYEAIRAIIYSIILQEQQQGFVQPQQQQPQQSGQGVSQSQQQSQQQLGQCSFQQPQQQLG

    QQPQQQQQQQVLQGTFLQPHQIAHLEAVTSIALRTLPTMCSVNVPLYSATTSVPFGVGTG

    VGAY*

    and after low complexity masking:

    >P04729 SEG low-complexity masked

    MKTFLVFALIAVVATSAIAQMETSCISGLERPWXXXXXXXXXXXXXXXXXXXXXXXXXXX

    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

    XXXXXXXXXXLNPCKVFLQQQCSPVAMPQRLARSQMWXXXXXXXXXXXXXXXXXXXXXXX

    RYEAIRAIIYSIIXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

    XXXXXXXXXXXXXXXXXXXHQIAHLEAVTSIALRTLPTMCSVNVPLYSATTSVPFGVGTG

    VGAY*

  • Xnu masks repeats (from database of common repeats)


Masking

  • Check if masking on/off by default (differs at sites)

  • Run 2 searches with masking on, masking off ANNYway

  • Masking by hand:

    • If you know about the DNA binding domain already

      • Which is really common and will occupy the top 100 hits against any database

    • Replace the region with Xs

    • Re-run blast to find other motifs/domains/information

    • NCBI blast allows select subsequences

  • DUST masks low info DNA sequences

    • Like polyA tails


Advanced


Expectation cut-off

  • E value

    • Expected number of chance hits

      • Given the database searched

      • Query HSP length

  • Related to probability

  • Default is 10: “expect to find 10 hits as good as this by chance alone”

  • E values less than 10-4 unreliable

  • So different from p < 0.05

  • For short seqs crank E up to 100 or 1000


Data deluge – hits delivered

  • Default suits general purpose

  • V = 50 num of “hits” one line descriptions

  • B = 50 num of alignments between query and hit which are displayed


Swiss Blast


EBI blast


Limit by taxon/organism

Also search at www.ensembl.org for “genomed” organisms


Database choice

NR for getting aNNy hit

swissprot or refseq for getting hits that have annotation

Month for recent hits


The output

  • Is in five parts

  • Admin – date, size of database, id of query

  • Graphic display of query and hits

  • List of hits with links to database and to…

  • …alignments (may be > 1 HSP per hit)

  • More admin, including errors/warnings


Notes!

  • Record the top admin stuff. Your search will be different

    • Next week

    • On a different server

    • With different DB

    • With different input parameters

  • Mouse-over any graphic display

    • Shows domain structure

    • Shows if hit is global or local

  • Read the bottom admin stuff


Blast output

sp|P06471|HOR3_HORVU B3-HORDEIN

Length = 264

Score = 62.5 bits (149), Expect = 1e-09

Identities = 32/63 (50%), Positives = 38/63 (59%)

Query: 131 LNPCARSQMWXXXXXXXXXXXXXXXXXXXXXXXRYEAIY

LNPCARSQM R+EA+Y

Sbjct: 111 LNPCARSQMLQQSSCHVLQQQCCQQLPQIPEQLRHEAVY

Query: 191 SII 193

SI+

Sbjct: 171 SIV 173

Low-complexity masked region

What sort of residues masked in

The query sequence??

May be more HSPs for same hit

If big deletion in either seq


General blast protocol

  • Find server, choose DB, paste seq, GO

  • Rerun search with/out masking

  • Rerun search with two diff subs matrix

  • 2 x 3 for six searches

  • If top N hits all same family/domain then XXX this region and resubmit

  • LOOK at the results; esp strange ones

  • Limit results by organism


Blast notes

  • Twilight zone <25% protein <70% NA

  • Because two sequences give high blast scores to a third, doesn’t mean they are related

    • NNNNDOMAINANNNNDOMAINBNNN

    • NNNNDAMIANANNNNNNNNNNNNNN

    • NNNNNNNNNNNNNNNDIMIAMBNNN

  • E-value vs % ID. >10-4 unreliable

  • Bit score <50 unreliable


PSI-BLAST

  • Uses a PSSM rather than BloSum/PAM

  • Iterative …

    • Can find very distant relatives

    • …so deep insight

  • BUT can iterate off with the fairies


  • Login