Biomolecular databases

Bioinformatics: Biomolecular databases
Jacques van Helden (Jacques.van-Helden@univ-amu.fr)
Université d’Aix-Marseille, France
Lab. Technological Advances for Genomics and Clinics (TAGC, INSERM Unit U1090)
http://tagc.univ-mrs.fr/
Former address (1999-2011): Université Libre de Bruxelles, Belgium
Bioinformatique des Génomes et des Réseaux (BiGRe lab)
http://www.bigre.ulb.ac.be/

Contents
• Examples of biological databases
• Nucleic sequences: GenBank, EMBL, and DDBJ
• Protein sequences: UniProt
• The Gene Ontology (GO) project
• Issues and perspectives for biological databases

Biomolecular Databases Examples of biomolecular databases

Examples of biomolecular databases
• Sequence and structure databases
  • Protein sequences (UniProt)
  • DNA sequences (EMBL, GenBank, DDBJ)
  • 3D structures (PDB)
  • Structural motifs (CATH)
  • Sequence motifs (PROSITE, PRODOM)
• Genome sequences and annotations
  • Genome-specific databases (SGD, FlyBase, AceDB, PlasmoDB, …)
  • Multiple genomes (Integr8, NCBI, KEGG, TIGR, …)
• Molecular functions
  • Transcriptional regulation (TRANSFAC, RegulonDB, InteractDB)
  • Enzymatic catalysis (Expasy, LIGAND/KEGG, BRENDA)
  • Transport (YTPdb)
• Biological processes
  • Metabolic pathways (EcoCyc, LIGAND/KEGG, Biocatalysis/biodegradation)
  • Signal transduction pathways (CSNdb, Transpath)
  • Protein-protein interactions (DIP, BIND, MINT)
  • Gene networks (GeneNet, FlyNets)

Databases of databases
• There are hundreds of databases related to molecular biology and biochemistry, and new ones are created every year.
• Every year, the first issue of Nucleic Acids Research is dedicated to biological databases.
  • http://nar.oupjournals.org/
  • 2011 issue: http://nar.oxfordjournals.org/content/39/suppl_1
• The same journal maintains a database of databases: the Molecular Biology Database Collection.
  • http://www.oxfordjournals.org/nar/database/c/
• Some bioinformatics centres maintain multiple databases, with cross-links between them. The SRS server at EBI holds an impressive collection of databases.
  • http://srs.ebi.ac.uk/

Biomolecular Databases Nucleic sequence databases: GenBank, EMBL, and DDBJ

Nucleic sequence databases
• Before an article dealing with a sequence can be published, scientific journals require that the sequence be deposited in a reference database.
• There are 3 main repositories for nucleic acid sequences: GenBank, EMBL, and DDBJ.
• Sequences deposited in any one of these 3 databases are automatically synchronized to the other 2.
Source: Okubo et al. (2006) NAR 34: D6-D9

The sequencing pace
• Nucleic sequences
  • GenBank (April 2011) http://www.ncbi.nlm.nih.gov/genbank/
    • 126,551,501,141 bases in 135,440,924 sequence records in the traditional GenBank divisions
    • 191,401,393,188 bases in 62,715,288 sequence records in the Whole Genome Shotgun (WGS) division
• Entire genomes
  • GOLD Release V.2 (Oct 2011) contains ~2000 completely sequenced genomes.
  • http://www.genomesonline.org/gold_statistics.htm
• Protein sequences
  • Essentially obtained by translating putative genes found in nucleic sequences (almost no direct protein sequencing).
  • UniProtKB/TrEMBL (2011) contains 17 million protein sequences.
  • http://www.ebi.ac.uk/swissprot/sptr_stats/index.html
Adapted from Didier Gonze

Size of the nucleotide database
EMBL Nucleotide Sequence Database: Release Notes - Release 113, September 2012
http://www.ebi.ac.uk/embl/Documentation/Release_notes/current/relnotes.html

Class                                      entries      nucleotides
-------------------------------------------------------------------
CON:Constructed                          7,236,371  359,112,791,043
EST:Expressed Sequence Tag              73,715,376   40,997,082,803
GSS:Genome Sequence Scan                34,528,104   21,985,922,905
HTC:High Throughput CDNA sequencing        491,770      594,229,662
HTG:High Throughput Genome sequencing      152,599   25,159,746,658
PAT:Patents                             24,364,832   12,117,896,594
STD:Standard                            13,920,617   37,665,112,606
STS:Sequence Tagged Site                 1,322,570      636,037,867
TSA:Transcriptome Shotgun Assembly       8,085,693    5,663,938,279
WGS:Whole Genome Shotgun                88,288,431  305,661,696,545
                                       -----------  ---------------
Total                                  252,106,363  450,481,663,919

Division                                   entries      nucleotides
-------------------------------------------------------------------
ENV:Environmental Samples               30,908,230   14,420,391,278
FUN:Fungi                                6,522,586   11,614,472,226
HUM:Human                               32,094,500   38,072,362,804
INV:Invertebrates                       31,907,138   52,527,673,643
MAM:Other Mammals                       40,012,731  145,678,620,711
MUS:Mus musculus                        11,745,671   19,701,637,499
PHG:Bacteriophage                            8,511       85,549,111
PLN:Plants                              52,428,994   55,570,452,118
PRO:Prokaryotes                          2,808,489   28,807,572,238
ROD:Rodents                              6,554,012   33,326,106,733
SYN:Synthetic                            4,045,013      782,174,055
TGN:Transgenic                             285,307      849,743,891
UNC:Unclassified                         8,617,225    4,957,442,673
VRL:Viruses                              1,358,528    1,518,575,082
VRT:Other Vertebrates                   22,809,428   42,568,889,857
                                       -----------  ---------------
Total                                  252,106,363  450,481,663,919
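The per-class figures in the Release 113 table can be checked and summarized with a few lines of Python. The sketch below is only an illustration: the numbers are copied from the class table, while the variable and function names are chosen here and are not part of the release notes. It computes the mean entry length per class and compares the column sums with the reported totals. With these numbers, the reported nucleotide total equals the sum of all classes except CON, whereas the reported entry total includes CON. A plausible reading, which the slide itself does not state, is that constructed entries are assembled from other entries and their nucleotides are not counted twice.

```python
# Class-level rows copied from the Release 113 table: (class, entries, nucleotides).
CLASS_ROWS = [
    ("CON", 7_236_371, 359_112_791_043),
    ("EST", 73_715_376, 40_997_082_803),
    ("GSS", 34_528_104, 21_985_922_905),
    ("HTC", 491_770, 594_229_662),
    ("HTG", 152_599, 25_159_746_658),
    ("PAT", 24_364_832, 12_117_896_594),
    ("STD", 13_920_617, 37_665_112_606),
    ("STS", 1_322_570, 636_037_867),
    ("TSA", 8_085_693, 5_663_938_279),
    ("WGS", 88_288_431, 305_661_696_545),
]

# Totals as printed at the bottom of the table.
REPORTED_ENTRIES = 252_106_363
REPORTED_NUCLEOTIDES = 450_481_663_919


def mean_entry_length(entries, nucleotides):
    """Average number of nucleotides per entry."""
    return nucleotides / entries


# Column sums: entries over all classes, nucleotides with CON left out.
total_entries = sum(entries for _, entries, _ in CLASS_ROWS)
total_nt_without_con = sum(nt for name, _, nt in CLASS_ROWS if name != "CON")

for name, entries, nucleotides in CLASS_ROWS:
    print(f"{name}: {mean_entry_length(entries, nucleotides):,.0f} nt per entry")

print("entry total matches:", total_entries == REPORTED_ENTRIES)
print("nucleotide total (CON excluded) matches:",
      total_nt_without_con == REPORTED_NUCLEOTIDES)
```

The mean lengths make the nature of each class visible at a glance: short reads such as ESTs average a few hundred nucleotides per entry, while HTG entries are several orders of magnitude longer.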

GenBank (NCBI - USA) http://www.ncbi.nlm.nih.gov/Genbank/

The EMBL Nucleotide Sequence Database (EBI - UK) http://www.ebi.ac.uk/embl/

DDBJ - DNA Data Bank of Japan http://www.ddbj.nig.ac.jp/

Size of the nucleic sequence databases
• Summary of database contents for the 3 main databases of nucleic sequences.
• Source: NAR database issue, January 2006.

Biomolecular Databases UniProt: protein sequences and functional annotations

UniProt - the Universal Protein Resource
http://www.uniprot.org/
Figures: number of entries (polypeptides) in Swiss-Prot (http://www.expasy.org/sprot/relnotes/relstat.html); taxonomic distribution of the sequences; distribution within Eukaryotes
• Database content (Sept 2012)
  • UniProtKB:
    • 24,532,088 entries
    • Translation of EMBL coding sequences (non-redundant with Swiss-Prot)
  • UniProtKB/Swiss-Prot section (reviewed):
    • 537,505 entries
    • annotation by experts
    • high information content
    • many references to the literature
    • good reliability of the information
  • The rest (more than 90% of the entries; about 98% by the counts above)
    • Automatic annotation by sequence similarity.
• Features
  • The most comprehensive protein database in the world.
  • A huge team: >100 annotators + developers.
  • Annotation by experts: annotators are specialized in different types of proteins or organisms.
  • Recognized worldwide as an essential resource.
• References
  • Bairoch et al. The SWISS-PROT protein sequence data bank. Nucleic Acids Res (1991) vol. 19 Suppl pp. 2247-9
  • The UniProt Consortium. The Universal Protein Resource (UniProt) 2009. Nucleic Acids Res (2008). Database Issue.

UniProt example - Human Pax-6 protein: header (name and synonyms)

UniProt example - Human Pax-6 protein: human-based annotation by specialists

UniProt example - Human Pax-6 protein: structured annotation (keywords and Gene Ontology terms)

UniProt example - Human Pax-6 protein: protein interactions; alternative products

UniProt example - Human Pax-6 protein: detailed description of regions, variations, and secondary structure

UniProt example - Human Pax-6 protein: peptidic sequence

UniProt example - Human Pax-6 protein: references to original publications

UniProt example - Human Pax-6 protein: cross-references to many databases (fragment shown)

3D Structure of macromolecules

PDB - The Protein Data Bank http://www.rcsb.org/pdb/

Genome browsers

EnsEMBL Genome Browser (Sanger Institute + EBI) http://www.ensembl.org/

UCSC Genome Browser (University of California, Santa Cruz - USA) http://genome.ucsc.edu/
Human gene Pax6 aligned with vertebrate genomes

UCSC Genome Browser (University of California, Santa Cruz - USA) http://genome.ucsc.edu/
Drosophila gene eyeless (homolog of Pax6) aligned with insect genomes

UCSC Genome Browser (University of California, Santa Cruz - USA) http://genome.ucsc.edu/
Drosophila 120 kb chromosomal region covering the Achaete-Scute Complex

ECR Browser http://ecrbrowser.dcode.org/

EnsEMBL - Example: Drosophila gene Pax6 http://www.ensembl.org/

Comparative genomics

Integr8 - access to complete genomes and proteomes http://www.ebi.ac.uk/integr8/

Integr8 - genome summaries http://www.ebi.ac.uk/integr8/

Integr8 - clusters of orthologous genes (COGs) http://www.ebi.ac.uk/integr8/

Integr8 - clusters of paralogous genes http://www.ebi.ac.uk/integr8/

Databases of protein domains

Prosite - protein domains, families and functional sites http://www.expasy.ch/prosite/

Prosite - aligned sequences and logo http://www.expasy.ch/prosite/
• Some of the sequences that were used to build the Prosite profile for the Zn(2)-C6 fungal-type DNA-binding domain (ZN2_CY6_FUNGAL_2, PS50048).
• The sequence logo (below) indicates the level of conservation of each residue in each column of the alignment.
• Note the 6 cysteines, characteristic of this domain.
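The conservation displayed by a sequence logo is commonly measured as the information content of each alignment column, in bits. The Python sketch below illustrates that calculation under simplifying assumptions: a 20-letter amino-acid alphabet, no small-sample correction, and a toy alignment invented for this example (it is not the Prosite alignment for PS50048). Function names are chosen here for illustration.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=20):
    """Information content (bits) of one alignment column: log2(N) - entropy."""
    residues = [r for r in column if r not in "-."]  # ignore gap characters
    if not residues:
        return 0.0
    counts = Counter(residues)
    total = len(residues)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy


def alignment_information(sequences):
    """Information content of every column of equally long aligned sequences."""
    return [column_information(column) for column in zip(*sequences)]


# Hypothetical toy alignment: columns 2 and 5 are fully conserved cysteines.
toy_alignment = ["ACDKC", "ACEKC", "GCDRC", "ACNKC"]

for position, bits in enumerate(alignment_information(toy_alignment), start=1):
    print(f"column {position}: {bits:.2f} bits")
```

A fully conserved column reaches the maximum of log2(20) bits, which is why invariant residues such as the cysteines of this domain appear as the tallest letters in a logo.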

Prosite - Example of profile matrix http://www.expasy.ch/prosite/

Prosite - Example of sequence logo http://www.expasy.ch/prosite/

Prosite - Example of domain signature http://www.expasy.ch/prosite/
• The domain signature is a string-based pattern representing the residues that are characteristic of a domain.
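Because a signature is a string-based pattern, it can be translated into a regular expression and scanned against a protein sequence. The sketch below handles the core elements of the PROSITE pattern syntax: `-` separators, `x` for any residue, `[ABC]` for allowed residues, `{ABC}` for forbidden residues, `(n)` or `(n,m)` for repetitions, and `<` / `>` for terminal anchors. It is a minimal illustration, not a complete parser. The six-cysteine pattern and the test sequence are invented for this example and are not the actual Prosite signature of any entry.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.rstrip(".")  # patterns are conventionally ended by a period
    regex_parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repetition such as (2) or (5,12).
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":
            core = "."                               # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"           # forbidden residues
        # [ABC] and single letters are already valid regular expressions.
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        regex_parts.append(prefix + core + repeat + suffix)
    return "".join(regex_parts)


# Hypothetical six-cysteine pattern and sequence, for illustration only.
toy_pattern = "C-x(2)-C-x(6)-C-x(5,12)-C-x(2)-C-x(6,8)-C."
toy_sequence = ("MK" + "C" + "AA" + "C" + "A" * 6 + "C" + "A" * 7
                + "C" + "AA" + "C" + "A" * 6 + "C" + "LE")

regex = prosite_to_regex(toy_pattern)
hit = re.search(regex, toy_sequence)
print(regex)
print("match found" if hit else "no match")
```

String patterns of this kind are simple to scan but are all-or-nothing: a single substitution at a fixed position prevents a match, which is one reason profile matrices are also used to describe domains.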

PFAM (Sanger Institute - UK) http://pfam.sanger.ac.uk/
Protein families represented by multiple sequence alignments and hidden Markov models (HMMs)

CATH - Protein Structure Classification http://www.cathdb.info/
• CATH is a hierarchical classification of protein domain structures, which clusters proteins at four major levels:
  • Class (C)
  • Architecture (A)
  • Topology (T)
  • Homologous superfamily (H)
• The boundaries and assignments for each protein domain are determined using a combination of automated and manual procedures, which include computational techniques, empirical and statistical evidence, literature review and expert analysis.
• References
  • Orengo et al. The CATH Database provides insights into protein structure/function relationships. Nucleic Acids Res (1999) vol. 27 (1) pp. 275-9
  • Cuff et al. The CATH classification revisited--architectures reviewed and new ways to characterize structural divergence in superfamilies. Nucleic Acids Res (2008) pp.
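The four-level hierarchy can be illustrated with the dotted numeric codes that CATH assigns to domains, one number per level (C.A.T.H). The Python sketch below parses such a code and reports the deepest level that two domains share. The helper names are chosen here for illustration and the example codes are used only to exercise the functions; nothing below queries the CATH database.

```python
from typing import NamedTuple, Optional

LEVELS = ("Class", "Architecture", "Topology", "Homologous superfamily")


class CathCode(NamedTuple):
    class_: int
    architecture: int
    topology: int
    superfamily: int


def parse_cath_code(code: str) -> CathCode:
    """Split a dotted C.A.T.H code into its four numeric levels."""
    parts = code.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected 4 dot-separated levels, got {code!r}")
    return CathCode(*(int(part) for part in parts))


def deepest_shared_level(code_a: str, code_b: str) -> Optional[str]:
    """Name of the deepest level at which two codes still agree, or None."""
    shared = None
    for level, a, b in zip(LEVELS, parse_cath_code(code_a), parse_cath_code(code_b)):
        if a != b:
            break
        shared = level
    return shared


# Illustrative codes: same class, architecture and topology, different superfamily.
print(deepest_shared_level("3.30.70.270", "3.30.70.100"))
# Codes that already differ at the class level share nothing.
print(deepest_shared_level("1.10.8.10", "2.40.50.140"))
```

Reading the code from left to right moves from the coarsest grouping to the finest, so two domains that agree on more leading numbers are more closely related in the classification.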

CATH - Protein Structure Classification http://www.cathdb.info/

InterPro (EBI - UK) http://www.ebi.ac.uk/interpro/
• “A database of protein families, domains, repeats and sites in which identifiable features found in known proteins can be applied to new protein sequences.”

InterPro (EBI - UK): Antennapedia-like Homeobox (entry IPR001827)

Biomolecular Databases The Gene Ontology (GO) database

Ontology definition
• Ontology: the part of metaphysics that deals with being as being, independently of its particular determinations.
Source (translated from French): Le Petit Robert - dictionnaire alphabétique et analogique de la langue française. 1993
