Genome, Protein and Model Organism Databases. Anne Estreicher Swiss-Prot Group Swiss Institute of Bioinformatics Geneva – Switzerland Anne.Estreicher@isb-sib.ch. Bioinformatic and Comparative Genome Analysis Course HKU-Pasteur Research Centre - Hong Kong, China August 17 - August 29, 2009.
Outline • Introduction (definitions, history…) • From DNA sequence to genomic tools • The flow of information: from DNA to proteins • Protein sequence databases • MODs at a glance
What is a database? A collection of related data, which are: • structured • searchable • updated periodically • cross-referenced. It also includes the associated tools necessary for access/query, download, etc.
Why do we need databases ? • Data need to be stored, curated and made available for analysis and knowledge discovery • Efficient way of sharing data, independently of regular publications • Essential resources for both experimental and computational biologists
Databases in biology: not a new issue… • 1954 First protein sequence (insulin by F. Sanger) • 1965 Atlas of Protein Sequence and Structure (65 proteins)
The first protein sequence "database" by Margaret Dayhoff (1965) contained 65 proteins
Databases: not a new issue… • 1954 First protein sequence (insulin by F. Sanger) • 1965 Atlas of Protein Sequence and Structure (65 proteins) • Mid 70s Improvements in DNA sequencing • 1979 Los Alamos Sequence Library (Walter Goad) • 1980 ~80 genes fully sequenced -> Need to store the data and to make them available for analysis (in a format readable by both humans and machines) -> ARCHIVE -> RACE for the central position in life sciences… And the winner is…
Databases: not a new issue… EMBL-Bank - Europe 1980 • GenBank - USA 1982 • DDBJ - Asia 1986, leading to the establishment of the INSDC (International Nucleotide Sequence Database Collaboration) -> daily exchanges of data
EMBL-Bank - GenBank - DDBJ • Main resources for DNA and RNA sequences; • Sequences used to be retrieved from publications -> they now come from direct submissions by individual researchers, genome sequencing projects and patent applications: • “Journal publishers generally require sequence deposition prior to publication so that an accession number can be included in the paper.” • 1. True for nucleic acid sequences, not for protein sequences; • 2. Not always put into practice • => Sequences that are not submitted are LOST! • Archives (primary databases) • Data belong to the submitters
EMBL-Bank - GenBank - DDBJ Archives (primary databases) => data belong to the submitter • Minimal checks, such as vector contamination • Annotation by the submitters
Databases: not a new issue… • 1954 First protein sequence (insulin by F. Sanger) • 1965 Atlas of Protein Sequence and Structure (65 proteins) • 1979 Los Alamos Sequence Library (Walter Goad) – DNA • 1982 EMBL-Bank – DNA • 1984 GenBank – DNA • 1986 DDBJ – DNA
Databases: not a new issue… • 1954 First protein sequence (insulin by F. Sanger) • 1965 Atlas of Protein Sequence and Structure (65 proteins) • 1979 Los Alamos Sequence Library (Walter Goad) – DNA • 1982 EMBL-Bank – DNA • 1984 GenBank – DNA • 1986 DDBJ – DNA -> ARCHIVES (primary databases) may not be sufficient -> need to annotate the data to produce KNOWLEDGE • 1986 Swiss-Prot – protein sequences – a paradigm for annotated (secondary) databases
The Swiss-Prot concept • non-redundant: Protein products of 1 gene / 1 species -> 1 entry, • Manually annotated (=> curator judgement on data!), • Highly cross-referenced (1st life-science database to provide cross-references) (links to > 130 databases from www.uniprot.org).
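The "1 gene / 1 species -> 1 entry" rule can be pictured as a grouping step. The following is only a minimal sketch of that idea, not Swiss-Prot's actual curation pipeline (which relies on manual curator judgement); the record fields and the toy records are hypothetical.

```python
from collections import defaultdict

def merge_by_gene_and_species(records):
    """Group raw records so that each (gene, species) pair yields one entry.

    Each record is a dict with the hypothetical keys
    'gene', 'species', 'sequence' and 'source'.
    """
    grouped = defaultdict(lambda: {"sequences": [], "sources": []})
    for rec in records:
        entry = grouped[(rec["gene"], rec["species"])]
        # Keep each distinct sequence once; remember every report that mentions it.
        if rec["sequence"] not in entry["sequences"]:
            entry["sequences"].append(rec["sequence"])
        entry["sources"].append(rec["source"])
    return dict(grouped)

# Hypothetical toy records (not real database content).
records = [
    {"gene": "geneA", "species": "Homo sapiens", "sequence": "MKT", "source": "report 1"},
    {"gene": "geneA", "species": "Homo sapiens", "sequence": "MKT", "source": "report 2"},
    {"gene": "geneA", "species": "Mus musculus", "sequence": "MKS", "source": "report 3"},
]
entries = merge_by_gene_and_species(records)
```

Two reports of the same human gene collapse into a single entry, while the mouse ortholog keeps its own entry because the species differs.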
Databases: not a new issue… • 1954 First protein sequence (insulin by F. Sanger) • 1965 Atlas of Protein Sequence and Structure (65 proteins) • 1979 Los Alamos Sequence Library (Walter Goad) – DNA • 1982 EMBL-Bank – DNA • 1984 GenBank – DNA; Protein Information Resource (PIR) – protein sequences • 1986 DDBJ – DNA; Swiss-Prot – protein sequences • 1996 TrEMBL (Translated EMBL) – protein sequences: complement of Swiss-Prot to cope with the increasing amount of new sequences; AUTOMATIC ANNOTATION!
UniProtKB/Swiss-Prot growth [Figure: number of entries vs. release number] • 1986: 3’939 entries • 1996 (creation of TrEMBL): Swiss-Prot 52’205 entries; TrEMBL 61’137 entries • Swiss-Prot rel. 57.5 (07-Jul-2009): 470’369 entries
UniProtKB growth [Figure: number of entries vs. release number, 1986 - 1996 - 2009; TrEMBL (automated curation) and Swiss-Prot (manual curation)] • TrEMBL rel. 40.5 (07-Jul-2009): 8’594’382 entries • Swiss-Prot rel. 57.5 (07-Jul-2009): 470’369 entries • TrEMBL growth (sequences/day): • 2004 1’500 • 2006-2007 3’500 • >5’000 • ~8’000
New challenge • Flood of data -> need to be stored, curated and made available for analysis and knowledge discovery
(R)evolution of these last 20 years: Life sciences used to be rich in hypotheses, well-off in knowledge and poor in data. Today they are very rich in data, not so well-off in knowledge and very poor in hypotheses. [Diagram labels: List of parts, ?, Complex system]
Danger ! EMBL Database Growth http://www.ebi.ac.uk/embl/Services/DBStats/
http://www.ncbi.nlm.nih.gov/genomes/static/gpstat.html http://www.ncbi.nlm.nih.gov/genomes/GenomesHome.cgi?taxid=10239&hopt=stat In 4 months, 374 new genomes appeared and 77 were completed => ~100 genomes/month (in 2008: ~50 genomes/month) + ~2’360 viral (& viroid) genomes => Total: ~5’600 genomes
Metagenomics: study of genetic material recovered directly from environmental samples • Global Ocean Sampling (C. Venter) • Whale fall • Soil, sand beach, New York air, … • Human fluids, mouse gut … [Photo: Venter’s Sorcerer II]
Flood in the world of proteins… • 1965: first protein sequence "database" by Margaret Dayhoff (65 proteins) • July 2009: ~20 million unique protein sequences (source: UniParc - http://www.uniprot.org/uniparc/) UniParc: non-redundant database that contains most of the publicly available protein sequences in the world (includes sequences from EMBL-Bank/DDBJ/GenBank nucleotide sequence databases, Ensembl, FlyBase, H-Invitational Database (H-Inv), International Protein Index (IPI), Patent Offices (EPO, JPO and USPTO), PIR-PSD, Protein Data Bank (PDB), Protein Research Foundation (PRF), RefSeq, Saccharomyces Genome database (SGD), TAIR Arabidopsis thaliana Information Resource, TROME, UniProtKB/Swiss-Prot and TrEMBL, Vertebrate Genome Annotation database (VEGA) and WormBase).
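A sequence archive that is "non-redundant" in the UniParc sense stores each distinct sequence once and records every source database that contains it. The sketch below illustrates that idea only; the identifier scheme (`SEQ0001`, …), the source names and the sequences are hypothetical, and real UniParc identifiers and internals are not reproduced here.

```python
def build_archive(submissions):
    """Store each distinct protein sequence once, with all its cross-references.

    `submissions` is an iterable of (source_db, source_id, sequence) tuples.
    Returns a dict mapping the sequence to an archive entry.
    """
    archive = {}
    for source_db, source_id, sequence in submissions:
        seq = sequence.upper()
        if seq not in archive:
            # Hypothetical identifier scheme, assigned in order of first appearance.
            archive[seq] = {"id": f"SEQ{len(archive) + 1:04d}", "xrefs": []}
        archive[seq]["xrefs"].append((source_db, source_id))
    return archive

# Hypothetical submissions: the same sequence arrives from two sources.
submissions = [
    ("sourceA", "A1", "MKTAYIAK"),
    ("sourceB", "B7", "mktayiak"),
    ("sourceA", "A2", "MSLLTEVE"),
]
archive = build_archive(submissions)
```

The archive ends up with two entries: the first carries two cross-references, one per source, so no information about where a sequence came from is lost by de-duplicating.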
New challenge • Flood of data • Flood of databases…
NAR 1st issue of the year is always dedicated to databases + "clean" list of databases provided (! not exhaustive !)
The NAR Online Molecular Biology Database collection in 2009 A total of 1’170 databases (19 obsolete removed) http://www.oxfordjournals.org/nar/database/a/
NAR "clean" list of databases http://www.oxfordjournals.org/nar/database/a/
Most recent NAR paper about the database (not available for all db, some described in other journals)
A "clean" list of databases can be found in the NAR online molecular biology database collection http://www.oxfordjournals.org/nar/database/a/
BIOLOGICAL DATABASE CATEGORIES • Databases of nucleic acid sequences (RNA, DNA) • Databases of protein sequences • Databases of protein motifs and protein domains • Databases of structures • Databases of genomes • Databases of genes • Databases of expression profiles • Databases of SNPs and mutations • Databases of metabolic pathways • Databases of protein interactions • Databases of taxonomy • … Databases containing sequences or data directly derived from sequences.
DNA sequences : What ? Where ? How ? & genomic tools NCBI UCSC
GenBank entry AF415175 http://www.ncbi.nlm.nih.gov/nuccore/16589063 • Accession number: stable (should always be cited in publications) • Molecule type; possible types: genomic DNA and RNA; mRNA; other DNA and RNA; rRNA; transcribed RNA; tRNA; unassigned DNA and RNA; viral cRNA • Date of submission • Definition • Nucleotide sequence
Accession number • Molecule type • Date of submission • Definition • Taxonomy • Nucleotide sequence
Accession number • Molecule type • Date of submission • Definition • Taxonomy • References • Nucleotide sequence
Accession number • Molecule type • Date of submission • Definition • Taxonomy • References • Features: information provided by the submitter; may include annotation of the sequence (organism, molecule type, chromosomal location, tissue type, gene name) • CDS annotation => protein sequence + Protein IDentifier (PID: stable identifier & version number) • Nucleotide sequence
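The header fields annotated above (accession number, molecule type, date, definition) and the nucleotide sequence can be pulled out of a GenBank-style flat file with a few lines of code. This is a simplified sketch: the sample record is made up for illustration (it is not entry AF415175), and the parser ignores multi-line fields, references and the feature table.

```python
# Made-up record in GenBank-like layout, for illustration only.
SAMPLE_RECORD = """\
LOCUS       XX000001     12 bp    mRNA    linear   PRI 01-JAN-2000
DEFINITION  Hypothetical example record.
ACCESSION   XX000001
ORIGIN
        1 atggcctaag cc
//
"""

def parse_flat_file(text):
    """Extract a few header fields and the sequence from one flat-file record."""
    record = {"sequence": ""}
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-record marker
            break
        if in_sequence:
            # Sequence lines mix position numbers, spaces and bases; keep letters only.
            record["sequence"] += "".join(ch for ch in line if ch.isalpha())
            continue
        keyword, _, value = line.partition(" ")
        value = value.strip()
        if keyword == "LOCUS":
            fields = value.split()
            record["molecule_type"] = fields[3]  # name, length, "bp", molecule type, ...
            record["date"] = fields[-1]
        elif keyword == "DEFINITION":
            record["definition"] = value
        elif keyword == "ACCESSION":
            record["accession"] = value.split()[0]
        elif keyword == "ORIGIN":
            in_sequence = True
    return record

record = parse_flat_file(SAMPLE_RECORD)
```

For real work, an established parser is a safer choice than hand-written code, since actual records contain wrapped lines and many more field types than this sketch handles.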
Gives access to the nucleic acid sequence of the CDS (not of the entire mRNA) Protein sequence
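The relationship between an mRNA entry, its CDS and the derived protein sequence can be sketched as follows: the CDS is the sub-range of the mRNA given by the annotated coordinates, and the protein is its translation. The example assumes the standard genetic code, a simple forward-strand CDS with 1-based inclusive coordinates, and a made-up mRNA; joined or complemented locations are not handled.

```python
# Standard genetic code, laid out with the bases in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def extract_cds(mrna, start, end):
    """Return the CDS given 1-based, inclusive coordinates on the mRNA."""
    return mrna[start - 1:end]

def translate(cds):
    """Translate a CDS codon by codon, stopping at the first stop codon.

    Ambiguous bases (e.g. N) are not handled and raise a KeyError.
    """
    cds = cds.upper().replace("U", "T")
    protein = []
    for pos in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[pos:pos + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

# Made-up mRNA with untranslated bases on both sides of the CDS.
mrna = "GGATGGCCTAAGG"
cds = extract_cds(mrna, 3, 11)
protein = translate(cds)
```

Here the CDS is `ATGGCCTAA` and the protein is `MA`: the two flanking bases on each side belong to the mRNA but not to the CDS, which is why the CDS view differs from the full mRNA sequence.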
"Features" may provide much more information depending upon the sequence and the submitter… 3’end of chromosome Y EMBL #AJ271736
Very similar view, links and options from the 3 sites: EMBL-Bank – GenBank - DDBJ http://www.ebi.ac.uk/embl/ http://www.ncbi.nlm.nih.gov/ http://www.ddbj.nig.ac.jp/
Databases @ NCBI http://www.ncbi.nlm.nih.gov/Database/datamodel/index.html The Entrez system: integrated, text-based search and retrieval system used at NCBI for the major databases, including PubMed, Nucleotide and Protein Sequences, Protein Structures, Complete Genomes, Taxonomy, and others => Maximal interconnectivity
Databases @ NCBI http://www.ncbi.nlm.nih.gov/Database/datamodel/index.html
Simple search with an EMBL-Bank/GenBank/DDBJ accession number
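A search by accession number can also be done programmatically. The slides only show the web interface; the sketch below assumes NCBI's E-utilities `efetch` service as the programmatic counterpart of Entrez and merely builds the request URL, without contacting the server. The helper name `efetch_url` is made up for this example.

```python
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_url(accession, db="nuccore", rettype="gb", retmode="text"):
    """Build an efetch URL that requests one record as a GenBank flat file."""
    query = urlencode(
        {"db": db, "id": accession, "rettype": rettype, "retmode": retmode}
    )
    return f"{EFETCH_URL}?{query}"

# Accession number of the GenBank entry used as the example in this course.
url = efetch_url("AF415175")
```

Opening the resulting URL in a browser, or fetching it with any HTTP client, should return the flat-file view of the entry. NCBI asks users of this service to respect its usage guidelines, such as limiting request rates.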