Databases indexation. Laurent Falquet, EPFL March, 2005. Swiss Institute of Bioinformatics Swiss EMBnet node. Data access concept sequential direct Indexing EMBOSS Fetch Other. BLAST Why indexing? formatdb Parsing output Excel import/export Tab delimited Comma delimited. Overview.

Databases indexation

Laurent Falquet, EPFL March, 2005

Swiss Institute of Bioinformatics

Swiss EMBnet node

Overview

Data access concept
  • sequential
  • direct
Indexing
  • EMBOSS
  • Fetch
  • Other
BLAST
  • Why indexing?
  • formatdb
  • Parsing output
Excel import/export
  • Tab delimited
  • Comma delimited

Data access: sequential vs direct

Sequential access
  • access times vary from very short to very long
Direct access
  • access times show only very small variations

[Figure: disk with track, sector and head labelled]

Similar concept for databases

Flat files = sequential
Indexing = simulated direct

>seq1
cgatgtcatgtg
>seq2
cgatcgtagctgtagctgtag
>seq3
catgtgcatgcgacgt
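
"Indexing = simulated direct" can be illustrated with a minimal Python sketch. It is an assumption-laden illustration, not how EMBOSS or any other tool on these slides stores its indices: the function names `build_index` and `fetch` are hypothetical, and the flat file is held in memory. The idea is simply to record the byte offset of every header once, then seek straight to a record.

```python
import io

def build_index(handle):
    """Map each FASTA identifier to the byte offset of its '>' header line."""
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            index[line[1:].split()[0].decode()] = offset
    return index

def fetch(handle, index, seq_id):
    """Jump directly to one record instead of scanning the whole file."""
    handle.seek(index[seq_id])
    handle.readline()  # skip the header line
    parts = []
    for line in handle:
        if line.startswith(b">"):  # next record reached
            break
        parts.append(line.strip().decode())
    return "".join(parts)

# The three sequences from the slide, held in memory as a stand-in for a flat file
flat_file = io.BytesIO(
    b">seq1\ncgatgtcatgtg\n>seq2\ncgatcgtagctgtagctgtag\n>seq3\ncatgtgcatgcgacgt\n"
)
index = build_index(flat_file)
print(index)
print(fetch(flat_file, index, "seq3"))
```

Building the index still needs one sequential pass; the gain comes from every later lookup being a single seek.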

Tools

EMBOSS
  • dbiflat
  • dbifasta
  • dbiblast
  • seqret
  • seqretsplit
  • entret
Other examples
  • SRS (Icarus language)
      http://srs.ebi.ac.uk
      http://www.lionbioscience.com/
  • indexer & fetch (warning: local SIB tool)
  • Relational (MySQL, Oracle…)

EMBOSS: how to index?

Where is your file?
What is the format?
Where should the indices go?
Where is the emboss.default file? (.embossrc)

Other EMBOSS tools
  • textsearch
  • whichdb

EMBOSS example

Input file and directory
  ~/embossidx/ECOLI.dat
  cd embossidx
Index creation
  dbiflat -idformat swiss -dbname ECOLI.dat -directory . -release 1.0 -date 12/02/05 -fields AC
Generates 4 files
  • acnum.hit
  • acnum.trg
  • division.lkp
  • entrynam.idx
Don't forget to modify ~/.embossrc

.embossrc

set emboss_filter 1

# Ecoli
DB ecoli [
  type: P
  comment: "E.coli proteome"
  method: emblcd
  format: swiss
  dir: "~/embossidx"
  file: "ECOLI.dat"
  release: "1.0"
  indexdir: "~/embossidx"
]

Example of queries
  seqret ecoli:thio_ecoli
  seqret ecoli:P00274
  entret ecoli:thio_ecoli
and even
  seqret 'ecoli:*_ECOLI'

Indexer & fetch

Warning: this is a local SIB tool!

Input file and directory
  ~/embossidx/ECOLI.dat
  cd embossidx
Index creation
  indexer -h '^ID' -t '^//' -i -p '^ID\s+(\S+)' ECOLI.dat ecoli.idx
Generates 1 file
  • ecoli.idx
Don't forget to modify the config file
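
The three regular expressions on the indexer command line suggest how such an index can be built: one pattern marks the start of an entry, one marks its end, and one extracts the key. The Python sketch below is only a guess at the principle, not the SIB tool's implementation. The function names, the (offset, length) index layout, the lower-casing of keys and the two toy entries are all assumptions made for illustration.

```python
import io
import re

HEADER = re.compile(rb"^ID")              # start of an entry
TERMINATOR = re.compile(rb"^//")          # end of an entry
ID_PATTERN = re.compile(rb"^ID\s+(\S+)")  # captures the entry name

def index_flat_file(handle):
    """Return {lower-cased entry name: (start offset, length in bytes)}."""
    index = {}
    start = name = None
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if start is None and HEADER.match(line):
            found = ID_PATTERN.match(line)
            if found:
                start = offset
                name = found.group(1).decode().lower()
        elif start is not None and TERMINATOR.match(line):
            index[name] = (start, handle.tell() - start)
            start = name = None
    return index

def fetch_entry(handle, index, name):
    """Read one whole entry with a single seek and a single read."""
    start, length = index[name.lower()]
    handle.seek(start)
    return handle.read(length).decode()

# Two toy entries (invented for this sketch, not real Swiss-Prot records)
data = io.BytesIO(
    b"ID   ENTRY_A   toy record\nDE   first toy entry\n//\n"
    b"ID   ENTRY_B   toy record\nDE   second toy entry\n//\n"
)
index = index_flat_file(data)
print(fetch_entry(data, index, "Entry_B"))
```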

Config file: fetch.conf

fetch.conf
  #dbkey format indexfile datafile
  ecoli sp ~/embossidx/ecoli.idx ~/embossidx/ECOLI.dat

Example of queries
  fetch -c fetch.conf ecoli:thio_ecoli
  fetch -c fetch.conf -f 'ecoli:thio_ecoli[20..50]'

BLAST

Maintained at NCBI
Source distributed freely with several accessory tools
  ftp://ftp.ncbi.nlm.nih.gov/toolbox/ncbi_tools/ncbi.tar.gz
Requires compilation to install on your local computer
blastall contains
  • blastp
  • blastn
  • blastx
  • tblastn
  • tblastx
Other tools
  • blastpgp
  • megablast
  • formatdb

Available Blast programs

Program   Query                                 Database
blastp    protein                               protein
blastn    nucleotide                            nucleotide
blastx    nucleotide (translated to protein)    protein
tblastn   protein                               nucleotide (translated to protein)
tblastx   nucleotide (translated to protein)    nucleotide (translated to protein)

What makes BLAST so fast?

Index all words of 3 aa or 11 bp in the sequence database
List all words that score > T against the words of the query
Search the indexed database for all perfect matches to those words
Try to align matches that are on the same diagonal

Indexing for Blast (1)

A substitution matrix is used to compute the word scores

[Figure: the query is cut into words (REL, RSL, LKP, ...). Each is scored against the list of all possible words with 3 amino acid residues (8000: AAA, AAC, AAD, ..., YYY). Words with score > T (ACT, RSL, TVF, ...) form the list of words matching the query with a score > T; words with score < T are dropped.]
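
The word-list step can be sketched in a few lines of Python. This is an illustration only: the scoring function is a toy stand-in (+5 for an identical residue, -1 otherwise) chosen to keep the example short, whereas BLAST uses a real substitution matrix. The threshold value and the function names are likewise assumptions.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def residue_score(a, b):
    """Toy substitution score (assumption): +5 for identity, -1 otherwise."""
    return 5 if a == b else -1

def word_score(word1, word2):
    return sum(residue_score(a, b) for a, b in zip(word1, word2))

def query_words(query, size=3):
    """All overlapping words of the query."""
    return [query[i:i + size] for i in range(len(query) - size + 1)]

def neighbourhood(word, threshold):
    """Every possible word whose score against `word` is > threshold."""
    candidates = ("".join(c) for c in product(AMINO_ACIDS, repeat=len(word)))
    return [c for c in candidates if word_score(word, c) > threshold]

all_words = ["".join(c) for c in product(AMINO_ACIDS, repeat=3)]
print(len(all_words))                     # 20 ** 3 possible words
print(query_words("RELKP"))
matching = neighbourhood("RSL", threshold=8)
print(len(matching), matching[:5])
```

With this toy scoring, a threshold of 8 keeps the word itself and every word differing at one position; a real matrix keeps conservative substitutions and rejects unlikely ones.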

Indexing for Blast (2)

[Figure: the list of words matching the query with a score > T (ACT, RSL, TVF, ...) is used to search the database sequences for exact matches.]

  • List of sequences containing words similar to the query (hits)
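
The exact-match search can be pictured as a lookup in an inverted index, sketched below in Python. The database sequences are toy strings invented for the example, and the data layout (word mapped to a list of sequence id and position) is an assumption for illustration, not the on-disk format BLAST uses.

```python
from collections import defaultdict

def index_database(sequences, size=3):
    """Inverted index: word -> list of (sequence id, position)."""
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        for pos in range(len(seq) - size + 1):
            index[seq[pos:pos + size]].append((seq_id, pos))
    return index

def find_hits(index, words):
    """Look up each word from the query word list; only exact matches count."""
    return {word: index[word] for word in words if word in index}

# Toy database sequences (invented for this sketch)
database = {"db1": "MKRSLTV", "db2": "ACTGGRSL", "db3": "PPPPPP"}
index = index_database(database)
hits = find_hits(index, ["ACT", "RSL", "TVF"])
print(hits)
```

The database is scanned once when the index is built; each query word then costs one dictionary lookup instead of a pass over every sequence.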

Indexing for Blast (3)

[Figure: two plots with axes "Database sequence" and "Query", each labelled with the distance A]

Extension using dynamic programming
  • limited to a restricted region
  • limited through a score drop-off threshold
Ungapped extension if:
  • 2 "hits" are on the same diagonal but at a distance less than A
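
The "same diagonal, distance less than A" rule can be written down as a short Python sketch. Two hits lie on the same diagonal when the difference between database position and query position is identical. The hit coordinates and the value of A below are invented, and details of the real program (for example the handling of overlapping hits) are left out.

```python
from collections import defaultdict

def two_hit_seeds(hits, max_distance):
    """hits: list of (query position, database position) word matches.

    Returns (diagonal, first query pos, second query pos) for consecutive
    hits on the same diagonal that are less than max_distance apart.
    """
    by_diagonal = defaultdict(list)
    for query_pos, db_pos in hits:
        by_diagonal[db_pos - query_pos].append(query_pos)

    seeds = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        for first, second in zip(positions, positions[1:]):
            if second - first < max_distance:
                seeds.append((diagonal, first, second))
    return seeds

# Invented hits: the first two share diagonal 10, the others are isolated
hits = [(2, 12), (9, 19), (30, 33), (40, 90)]
print(two_hit_seeds(hits, max_distance=40))
print(two_hit_seeds(hits, max_distance=5))
```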

BLAST indexing with formatdb

Formatdb
  mydb.seq must contain sequences in FASTA format
  formatdb -i mydb.seq -p T -n mydb
Generates 3 files
  • mydb.psq
  • mydb.pin
  • mydb.phr
Then start a Blast:
  blastall -p blastp -d mydb -i myseq (-optional parameters)
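
When many databases or queries are involved, the two command lines above are easier to generate than to type. A minimal Python sketch follows; it only assembles the argument lists using the flags shown on the slide, and the helper names are hypothetical. Actually running the commands (for instance with `subprocess.run`) assumes that formatdb and blastall are installed and on the PATH.

```python
import shlex

def formatdb_command(fasta_file, db_name, protein=True):
    """Argument list for formatdb, using only the flags shown on the slide."""
    return ["formatdb", "-i", fasta_file, "-p", "T" if protein else "F", "-n", db_name]

def blastall_command(program, db_name, query_file, extra=()):
    """Argument list for blastall; `extra` holds any optional parameters."""
    return ["blastall", "-p", program, "-d", db_name, "-i", query_file, *extra]

index_cmd = formatdb_command("mydb.seq", "mydb")
search_cmd = blastall_command("blastp", "mydb", "myseq")
print(shlex.join(index_cmd))
print(shlex.join(search_cmd))
# To execute: subprocess.run(index_cmd, check=True), then the same for search_cmd
```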

Blast local vs remote

blastall
  • Executed locally
  • Slow
  • No need to transfer the db
blastall.remote
  • Executed remotely
  • Fast
  • Requires special privileges and db transfer

Multiple Blasts?

1 seq vs db seq
  • 1 FASTA seq as input
db seq vs db seq
  • Several single FASTA seq files as input or
  • 1 multiple FASTA seq file as input
Possibility to export results as XML
Use Perl to automate the queries and parse the output
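
XML export makes the results easy to read from a script. The slide suggests Perl; the Python sketch below shows the same idea with the standard library. The XML fragment is hand-written and heavily trimmed for illustration: it uses element names from the NCBI BLAST XML output (`Hit_def`, `Hsp_bit-score`, `Hsp_evalue`) and the values of the ACCA_ECOLI hit shown later in this deck, but a real report contains many more elements.

```python
import xml.etree.ElementTree as ET

# Hand-written, trimmed fragment (assumption: real reports are much larger)
SAMPLE = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit>
          <Hit_def>ACCA_ECOLI P30867 Acetyl-coenzyme A carboxylase</Hit_def>
          <Hit_len>318</Hit_len>
          <Hit_hsps>
            <Hsp>
              <Hsp_bit-score>266</Hsp_bit-score>
              <Hsp_evalue>1e-72</Hsp_evalue>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""

def list_hsps(xml_text):
    """Return one (hit description, bit score, e-value) tuple per HSP."""
    rows = []
    root = ET.fromstring(xml_text)
    for hit in root.iter("Hit"):
        for hsp in hit.iter("Hsp"):
            rows.append((
                hit.findtext("Hit_def"),
                float(hsp.findtext("Hsp_bit-score")),
                float(hsp.findtext("Hsp_evalue")),
            ))
    return rows

rows = list_hsps(SAMPLE)
for description, bits, evalue in rows:
    print(description, bits, evalue, sep="\t")
```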

Parsing Blast output

BLASTP 2.2.10 [Oct-19-2004]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs", Nucleic Acids Res. 25:3389-3402.

Query= ACCA_BACSU O34847 Acetyl-coenzyme A carboxylase carboxyl
transferase subunit alpha (EC 6.4.1.2).
(325 letters)

Database: ecoli_blast
4339 sequences; 1,373,039 total letters

Searching.........done

                                                                 Score    E
Sequences producing significant alignments:                      (bits) Value

ACCA_ECOLI P30867 Acetyl-coenzyme A carboxylase carboxyl transfe... 266 1e-72

Parsing Blast output (2)

>ACCA_ECOLI P30867 Acetyl-coenzyme A carboxylase carboxyl
transferase subunit alpha (EC 6.4.1.2).
Length = 318

Score = 266 bits (681), Expect = 1e-72
Identities = 143/312 (45%), Positives = 188/312 (60%), Gaps = 3/312 (0%)

Query: 5 LEFEKPVIELQTKIAELKKFTQDS---DMDLSAEIERLEDRLAKLQDDIYKNLKPWDRVQ 61
L+FE+P+ EL+ KI L ++ D+++ E+ RL ++ +L I+ +L W Q
Sbjct: 5 LDFEQPIAELEAKIDSLTAVSRQDEKLDINIDEEVHRLREKSVELTRKIFADLGAWQIAQ 64

Query: 62 IARLADRPTTLDYIEHLFTDFFECHGDRAYGDDEAIVGGIAKFHGLPVTVIGHQRGKDTK 121
+AR RP TLDY+ F +F E GDRAY DD+AIVGGIA+ G PV +IGHQ+G++TK
Sbjct: 65 LARHPQRPYTLDYVRLAFDEFDELAGDRAYADDKAIVGGIARLDGRPVMIIGHQKGRETK 124

Query: 122 ENLVRNFGMPHPEGYRKALRLMKQADKFNRPIICFIDTKGAYPGRAAEERGQSEAIAKNL 181
E + RNFGMP PEGYRKALRLM+ A++F PII FIDT GAYPG AEERGQSEAIA+NL
Sbjct: 125 EKIRRNFGMPAPEGYRKALRLMQMAERFKMPIITFIDTPGAYPGVGAEERGQSEAIARNL 184

Query: 182 FEMAGLRVPXXXXXXXXXXXXXXXXXXXXXXXHMLENSTYSVISPEGAAALLWKDSSLAK 241
EM+ L VP +ML+ STYSVISPEG A++LWK + A
Sbjct: 185 REMSRLGVPVVCTVIGEGGSGGALAIGVGDKVNMLQYSTYSVISPEGCASILWKSADKAP 244

Query: 242 KAAETMKITAPDLKELGIIDHMIKEVKGGAHHDVKLQASYMDXXXXXXXXXXXXXXXXXX 301
AAE M I AP LKEL +ID +I E GGAH + + A+ +
Sbjct: 245 LAAEAMGIIAPRLKELKLIDSIIPEPLGGAHRNPEAMAASLKAQLLADLADLDVLSTEDL 304

Query: 302 VQQRYEKYKAIG 313
+RY++ + G
Sbjct: 305 KNRRYQRLMSYG 316
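
The statistics lines of such a report follow a fixed pattern, so they can be picked out with regular expressions. The Python sketch below is a deliberately small illustration that handles only the "Score" and "Identities" lines shown above; the patterns and the `parse_hsp` name are assumptions, and a full parser such as the one in BioPerl copes with many more cases.

```python
import re

SCORE_RE = re.compile(
    r"Score\s*=\s*([\d.]+) bits \((\d+)\), Expect\s*=\s*([0-9.e+-]+)"
)
IDENT_RE = re.compile(r"Identities = (\d+)/(\d+) \((\d+)%\)")

def parse_hsp(lines):
    """Collect bit score, raw score, e-value and identities from report lines."""
    stats = {}
    for line in lines:
        found = SCORE_RE.search(line)
        if found:
            evalue = found.group(3)
            if evalue.startswith("e"):  # BLAST may print e.g. "e-100"
                evalue = "1" + evalue
            stats["bits"] = float(found.group(1))
            stats["raw_score"] = int(found.group(2))
            stats["evalue"] = float(evalue)
        found = IDENT_RE.search(line)
        if found:
            stats["identities"] = (int(found.group(1)), int(found.group(2)))
            stats["percent_identity"] = int(found.group(3))
    return stats

report_lines = [
    "Length = 318",
    "Score = 266 bits (681), Expect = 1e-72",
    "Identities = 143/312 (45%), Positives = 188/312 (60%), Gaps = 3/312 (0%)",
]
stats = parse_hsp(report_lines)
print(stats)
```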

Parsing Blast output (3)

With BioPerl:

#!/usr/local/bin/perl
use strict;
use warnings;
use Bio::SearchIO;

# Open the BLAST report given as the first command-line argument
my $blast_report = Bio::SearchIO->new(-format => 'blast',
                                      -file   => $ARGV[0]);

print "Query name:\tQuery description:\tHit name:\tHit description:\tE-value\tScore\n";

# One result per query, one hit per database sequence, one HSP per local alignment
while (my $result = $blast_report->next_result) {
    print $result->query_name(), "\t", $result->query_description(), "\n";
    while (my $hit = $result->next_hit()) {
        print "\t\t", $hit->name(), "\t", $hit->description();
        while (my $hsp = $hit->next_hsp()) {
            print "\t", $hsp->evalue(), "\t", $hsp->score();
        }
        print "\n";
    }
}

exit 0;

MS-Excel import/export

Excel can import
  • Tab delimited
  • Comma delimited
Excel can export
  • Tab delimited
  • Space delimited

AC/ID        desc                           score   e-value
THIO_ECOLI   thioredoxin Escherichia coli   234     2.1e-5
THIO_HUMAN   thioredoxin Homo sapiens       120     0.001

MS-Excel import/export

Tab delimited file:
  • \t delimits the columns
  • \n delimits the lines
  • Optional first line contains the column titles

Example:
AC/ID\tdesc\tscore\te-value\n
THIO_ECOLI\tthioredoxin Escherichia coli\t234\t2.1e-5\n
THIO_HUMAN\tthioredoxin Homo sapiens\t120\t0.001\n
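
A tab delimited file like this one can be written and read back with Python's standard `csv` module, as in the sketch below. It works on an in-memory buffer so that it runs anywhere; writing to a real file only requires replacing the buffer with an opened file. The rows are the ones from the slide.

```python
import csv
import io

rows = [
    ["AC/ID", "desc", "score", "e-value"],
    ["THIO_ECOLI", "thioredoxin Escherichia coli", "234", "2.1e-5"],
    ["THIO_HUMAN", "thioredoxin Homo sapiens", "120", "0.001"],
]

# Write: \t between columns, \n at the end of each line
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerows(rows)
text = buffer.getvalue()
print(text)

# Read back: the first line supplies the column titles
records = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
for record in records:
    print(record["AC/ID"], float(record["e-value"]))
```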

MS-Excel import/export

Comma delimited file:
  • , delimits the columns, each value is surrounded by ' '
  • \n delimits the lines
  • Optional first line contains the column titles

Example:
'AC/ID','desc','score','e-value'\n
'THIO_ECOLI','thioredoxin Escherichia coli','234','2.1e-5'\n
'THIO_HUMAN','thioredoxin Homo sapiens','120','0.001'\n
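
Surrounding each value with quotes matters as soon as a value contains the delimiter itself. The Python sketch below keeps the single-quote convention of the slide by setting `quotechar` explicitly (the `csv` module defaults to double quotes). The description with an embedded comma is a modified version of the slide's example, introduced only to show the effect.

```python
import csv
import io

buffer = io.StringIO()
writer = csv.writer(
    buffer,
    delimiter=",",
    quotechar="'",          # slide convention; csv defaults to a double quote
    quoting=csv.QUOTE_ALL,  # surround every value, as on the slide
    lineterminator="\n",
)
writer.writerow(["AC/ID", "desc", "score", "e-value"])
# Modified example: the description now contains a comma
writer.writerow(["THIO_ECOLI", "thioredoxin, Escherichia coli", "234", "2.1e-5"])
text = buffer.getvalue()
print(text)

# Reading back with the same quote character keeps the comma inside one field
parsed = list(csv.reader(io.StringIO(text), delimiter=",", quotechar="'"))
print(parsed[1])
```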