Biomolecular databases

Jacques van Helden

Université d’Aix-Marseille, France

Lab. Technological Advances for Genomics and Clinics (TAGC, INSERM Unit U1090)

FORMER ADDRESS (1999-2011)

Université Libre de Bruxelles, Belgium

Bioinformatique des Génomes et des Réseaux (BiGRe lab)

  • Examples of biological databases
    • Nucleic sequences: Genbank, EMBL, and DDBJ
    • Protein sequences: UniProt
    • The Gene Ontology (GO) project
  • Issues and perspectives for biological databases
Examples of biomolecular databases
  • Sequence and structure databases
    • Protein sequences (UniProt)
    • DNA sequences (EMBL, Genbank, DDBJ)
    • 3D structures (PDB)
    • Structural motifs (CATH)
    • Sequence motifs (PROSITE, PRODOM)
  • Genome sequences and annotations
    • Genome-specific databases (SGD, FlyBase, AceDB, PlasmoDB, …)
    • Multiple genomes (Integr8, NCBI, KEGG, TIGR, …)
  • Molecular functions
    • Transcriptional regulation (TRANSFAC, RegulonDB, InteractDB)
    • Enzymatic catalysis (Expasy, LIGAND/KEGG, BRENDA)
    • Transport (YTPdb)
  • Biological processes
    • Metabolic pathways (EcoCyc, LIGAND/KEGG, Biocatalysis/biodegradation)
    • Signal transduction pathways (CSNdb, Transpath)
    • Protein-protein interactions (DIP, BIND, MINT)
    • Gene networks (GeneNet, FlyNets)
Databases of databases
  • There are hundreds of databases related to molecular biology and biochemistry. New databases are created every year.
  • Every year, the first issue of Nucleic Acids Research is dedicated to biological databases
    • 2011 Issue:
  • The same journal maintains a database of databases: the Molecular Biology Database Collection
  • Some bioinformatics centres maintain multiple databases, with cross-links between them. The SRS server at EBI holds an impressive collection of databases.
Nucleic sequence databases

Okubo et al. (2006) NAR 34: D6-D9

  • To publish an article dealing with a sequence, scientific journals require that the sequence first be deposited in a reference database.
  • There are 3 main repositories for nucleic acid sequences: GenBank, EMBL, and DDBJ.
  • Sequences deposited in any of these 3 databases are automatically synchronized with the other 2.
The sequencing pace
  • Nucleic sequences
    • Genbank (April 2011)
      • 126,551,501,141 bases in 135,440,924 sequence records in the traditional GenBank divisions
      • 191,401,393,188 bases in 62,715,288 sequence records in the Whole Genome Shotgun (WGS) division
  • Entire genomes
    • GOLD Release V.2 (Oct 2011) contains ~2000 completely sequenced genomes.
  • Protein sequences
    • Essentially obtained by translation of putative genes in nucleic sequences (almost no direct protein sequencing).
    • UniProtKB/TrEMBL (2011) contains 17 million protein sequences.

Adapted from Didier Gonze

Size of the nucleotide database

EMBL Nucleotide Sequence Database: Release Notes - Release 113, September 2012

Class                                       entries        nucleotides
CON: Constructed                          7,236,371    359,112,791,043
EST: Expressed Sequence Tag              73,715,376     40,997,082,803
GSS: Genome Sequence Scan                34,528,104     21,985,922,905
HTC: High Throughput cDNA sequencing        491,770        594,229,662
HTG: High Throughput Genome sequencing      152,599     25,159,746,658
PAT: Patents                             24,364,832     12,117,896,594
STD: Standard                            13,920,617     37,665,112,606
STS: Sequence Tagged Site                 1,322,570        636,037,867
TSA: Transcriptome Shotgun Assembly       8,085,693      5,663,938,279
WGS: Whole Genome Shotgun                88,288,431    305,661,696,545
                                        -----------    ---------------
Total                                   252,106,363    450,481,663,919

Division                                    entries        nucleotides
ENV: Environmental Samples               30,908,230     14,420,391,278
FUN: Fungi                                6,522,586     11,614,472,226
HUM: Human                               32,094,500     38,072,362,804
INV: Invertebrates                       31,907,138     52,527,673,643
MAM: Other Mammals                       40,012,731    145,678,620,711
MUS: Mus musculus                        11,745,671     19,701,637,499
PHG: Bacteriophage                            8,511         85,549,111
PLN: Plants                              52,428,994     55,570,452,118
PRO: Prokaryotes                          2,808,489     28,807,572,238
ROD: Rodents                              6,554,012     33,326,106,733
SYN: Synthetic                            4,045,013        782,174,055
TGN: Transgenic                             285,307        849,743,891
UNC: Unclassified                         8,617,225      4,957,442,673
VRL: Viruses                              1,358,528      1,518,575,082
VRT: Other Vertebrates                   22,809,428     42,568,889,857
                                        -----------    ---------------
Total                                   252,106,363    450,481,663,919

The EMBL Nucleotide Sequence Database (EBI - UK)
Size of the nucleic sequence databases
  • Summary of database contents for the 3 main databases of nucleic sequences.
  • Source: NAR database issue January 2006.
UniProt - the Universal Protein Resource

[Figure panels: Number of entries (polypeptides) in Swiss-Prot; Taxonomic distribution of the sequences; Within Eukaryotes]

  • Database content (Sept 2012)
    • UniProtKB:
      • 24,532,088 entries
      • Translation of EMBL coding sequences (non-redundant with Swiss-Prot)
    • UniProtKB/Swiss-Prot section (reviewed):
      • 537,505 entries
      • annotation by experts
      • high information content
      • many references to the literature
      • good reliability of the information
    • The rest (about 98% of the entries, based on the counts above)
      • Automatic annotation by sequence similarity.
  • Features
    • The most comprehensive protein database in the world.
    • A huge team: >100 annotators + developers.
    • Annotation by experts: annotators are specialized for different types of proteins or organisms.
    • Recognized worldwide as an essential resource.
  • References
    • Bairoch et al. The SWISS-PROT protein sequence data bank. Nucleic Acids Res (1991) vol. 19 Suppl pp. 2247-9
    • The UniProt Consortium. The Universal Protein Resource (UniProt) 2009. Nucleic Acids Res (2008). Database Issue.
UniProt example - Human Pax-6 protein

Detailed description of regions, variations, and secondary structure
UCSC Genome Browser (University of California, Santa Cruz - USA)

Human gene Pax6 aligned with Vertebrate genomes


Drosophila gene eyeless (homolog to Pax6) aligned with Insect genomes


Drosophila 120kb chromosomal region covering the Achaete-Scute Complex

EnsEMBL - Example: Drosophila gene Pax6
Integr8 - access to complete genomes and proteomes
Integr8 - clusters of orthologous genes (COGs)
Integr8 - clusters of paralogous genes
Prosite - protein domains, families and functional sites
Prosite - aligned sequences and logo
  • Some of the sequences that were used to build the Prosite profile for the Zn(2)-C6 fungal-type DNA-binding domain (ZN2_CY6_FUNGAL_2, PS50048).
  • The Sequence Logo (below) indicates the level of conservation of each residue in each column of the alignment.
  • Note the 6 cysteines, characteristic of this domain.
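The conservation level shown by a sequence logo can be illustrated with a short computation. The sketch below is a minimal example, assuming the usual Shannon-information definition (maximum entropy of a 20-letter alphabet minus the observed column entropy) and omitting the small-sample correction that logo tools often apply. The toy alignment is hypothetical and is not taken from the Prosite entry.

```python
import math
from collections import Counter

def column_information(column, alphabet_size=20):
    """Information content (in bits) of one alignment column, ignoring gaps."""
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    # Shannon entropy of the observed residue frequencies.
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    # Conservation = maximum possible entropy minus observed entropy.
    return math.log2(alphabet_size) - entropy

# Hypothetical toy alignment: columns 1 and 4 are invariant cysteines.
alignment = [
    "CRKCA",
    "CKRCS",
    "CRQCT",
    "CKKCA",
]

columns = ["".join(seq[i] for seq in alignment) for i in range(len(alignment[0]))]
info = [column_information(col) for col in columns]
for position, (col, bits) in enumerate(zip(columns, info), start=1):
    print(f"column {position}: {col}  {bits:.2f} bits")
```

Fully conserved columns (here the cysteines) reach the maximum of log2(20), about 4.32 bits, which is why they appear as the tallest letters in a logo.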
Prosite - Example of profile matrix
Prosite - Example of sequence logo
Prosite - Example of domain signature
  • The domain signature is a string-based pattern representing the residues that are characteristic of a domain.
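A string-based signature of this kind can be searched with ordinary regular expressions. The sketch below assumes the common PROSITE pattern conventions (elements separated by "-", "x" for any residue, "[..]" for allowed residues, "{..}" for forbidden residues, "(n)" or "(n,m)" for repeats, "<" and ">" for sequence ends) and handles only this subset. The function name `prosite_to_regex` and the six-cysteine pattern are hypothetical illustrations, not the actual Prosite entry.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pieces = []
    for element in pattern.rstrip(".").split("-"):
        at_start = element.startswith("<")   # "<" anchors to the N-terminus
        at_end = element.endswith(">")       # ">" anchors to the C-terminus
        element = element.strip("<>")
        # Split off an optional repetition suffix such as (2) or (5,12).
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element
        ).groups()
        if core == "x":
            core = "."                        # x = any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # {..} = forbidden residues
        # "[..]" and single letters are already valid regex syntax.
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        pieces.append(("^" if at_start else "") + core + ("$" if at_end else ""))
    return "".join(pieces)

# Hypothetical six-cysteine signature, for illustration only.
signature = "C-x(2)-C-x(6)-C-x(5,12)-C-x(2)-C-x(6,8)-C"
regex = prosite_to_regex(signature)
print(regex)  # C.{2}C.{6}C.{5,12}C.{2}C.{6,8}C

# Synthetic test sequence built to contain the six spaced cysteines.
sequence = "MA" + "C" + "AA" + "C" + "A" * 6 + "C" + "A" * 5 + "C" + "AA" + "C" + "A" * 6 + "C" + "KK"
match = re.search(regex, sequence)
print(match.start(), match.end())
```

Translating to a regular expression keeps the example short; dedicated scanning tools also handle features that this sketch ignores.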

Pfam (Sanger Institute - UK)
  • Protein families represented by multiple sequence alignments and hidden Markov models (HMMs)

CATH - Protein Structure Classification
  • CATH is a hierarchical classification of protein domain structures, which clusters proteins at four major levels:
    • Class (C),
    • Architecture (A),
    • Topology (T)
    • Homologous superfamily (H).
  • The boundaries and assignments for each protein domain are determined using a combination of automated and manual procedures which include computational techniques, empirical and statistical evidence, literature review and expert analysis.
  • References
    • Orengo et al. The CATH Database provides insights into protein structure/function relationships. Nucleic Acids Res (1999) vol. 27 (1) pp. 275-9
    • Cuff et al. The CATH classification revisited--architectures reviewed and new ways to characterize structural divergence in superfamilies. Nucleic Acids Res (2008) pp.
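The four-level hierarchy described above lends itself to a simple tuple representation. The sketch below assumes that a domain's position is written as four dot-separated numbers in the order C.A.T.H; the class name `CathCode`, the helper `deepest_shared_level`, and the example codes are used here purely for illustration.

```python
from typing import NamedTuple, Optional

class CathCode(NamedTuple):
    class_: int          # Class (C)
    architecture: int    # Architecture (A)
    topology: int        # Topology (T)
    superfamily: int     # Homologous superfamily (H)

    @classmethod
    def parse(cls, text: str) -> "CathCode":
        parts = text.split(".")
        if len(parts) != 4:
            raise ValueError(f"expected 4 dot-separated levels, got {text!r}")
        return cls(*(int(p) for p in parts))

    def prefix(self, depth: int) -> str:
        """Identifier of the node at the given depth (1=C, 2=A, 3=T, 4=H)."""
        return ".".join(str(n) for n in self[:depth])

def deepest_shared_level(a: CathCode, b: CathCode) -> Optional[str]:
    """Name of the deepest hierarchy level shared by two domains, if any."""
    levels = ["Class", "Architecture", "Topology", "Homologous superfamily"]
    shared = None
    for depth, name in enumerate(levels, start=1):
        if a[:depth] != b[:depth]:
            break
        shared = name
    return shared

# Illustrative inputs (assumed to follow the C.A.T.H numbering scheme).
first = CathCode.parse("3.40.50.300")
second = CathCode.parse("3.40.50.720")
print(first.prefix(2))                       # 3.40
print(deepest_shared_level(first, second))   # Topology
```

Comparing prefixes of the tuple is enough to decide how closely two domains are classified, because each level refines the one before it.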
InterPro (EBI - UK)
  • “A database of protein families, domains, repeats and sites in which identifiable features found in known proteins can be applied to new protein sequences.”
Ontology definition
  • Ontology: the part of metaphysics that focuses on being as being, independently of its particular determinations. Source: Le Petit Robert - dictionnaire alphabétique et analogique de la langue française. 1993.
The "bio-ontologies"
  • Answer to the problem of inconsistencies in the annotations
    • Controlled vocabulary
    • Hierarchical classification between the terms of the controlled vocabulary
  • E.g.: The Gene Ontology
    • molecular function ontology
    • process ontology
    • cellular component ontology
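The combination of a controlled vocabulary with hierarchical relationships can be sketched as a small directed acyclic graph in which each term lists its parent terms. The term names and parent links below are a simplified, hypothetical subset chosen for illustration; they are not copied from the actual Gene Ontology files.

```python
# Each term of the controlled vocabulary maps to its direct parents.
# Hypothetical, simplified subset of a process ontology.
parents = {
    "methionine biosynthetic process": [
        "methionine metabolic process",
        "sulfur amino acid biosynthetic process",
    ],
    "methionine metabolic process": ["sulfur amino acid metabolic process"],
    "sulfur amino acid biosynthetic process": [
        "sulfur amino acid metabolic process",
        "amino acid biosynthetic process",
    ],
    "sulfur amino acid metabolic process": ["amino acid metabolic process"],
    "amino acid biosynthetic process": ["amino acid metabolic process"],
    "amino acid metabolic process": ["biological process"],
    "biological process": [],
}

def ancestors(term):
    """Return all terms reachable by following parent links from `term`."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen

result = ancestors("methionine biosynthetic process")
for name in sorted(result):
    print(name)
```

Because a term may have several parents, a gene product annotated with a specific term is implicitly annotated with every ancestor, which is what makes queries on general terms possible.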
Gene Ontology Database (http://www.geneontology.org/) - Example: methionine biosynthetic process
Status of GO annotations (NAR DB issue 2006)
  • Term definitions
    • Biological process terms 9,805
    • Molecular function terms 7,076
    • Cellular component terms 1,574
    • Sequence Ontology terms 963
  • Genomes with annotation 30
    • Excludes annotations from UniProt, which represent 261 annotated proteomes.
  • Annotated gene products
    • Total 1,618,739
    • Electronic only 1,460,632
    • Manually curated 158,107
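The proportions behind these counts are easy to check. The short sketch below uses only the three figures listed above; the variable names are chosen for illustration.

```python
# Annotated gene products, as listed in the 2006 status summary above.
total = 1_618_739
electronic_only = 1_460_632
manually_curated = 158_107

# The two categories should partition the total.
consistent = (electronic_only + manually_curated == total)

manual_fraction = manually_curated / total
electronic_fraction = electronic_only / total

print(f"counts consistent: {consistent}")
print(f"manually curated:  {manual_fraction:.1%}")     # 9.8%
print(f"electronic only:   {electronic_fraction:.1%}")  # 90.2%
```

Roughly one annotated gene product in ten had been manually curated at that time; the rest relied on electronic annotation alone.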
QuickGO (http://www.ebi.ac.uk/quickgo)
  • Web site
  • A user-friendly Web interface to the Gene Ontology.
  • Graphical display of the hierarchical relationships between terms.
  • Convenient browsing between classes.
Remarks on "bio-ontologies"
  • Improvement compared to free text
    • controlled vocabulary (choice among synonyms)
    • hierarchical relationships between the concepts
  • Nothing to do with the philosophical concept of ontology
    • A "bio-ontology" is usually nothing more than a taxonomic classification of the terms of a controlled vocabulary
  • Multiple possibilities of classification criteria
    • e.g. compartment subtypes (plasma membrane is a membrane)
    • e.g. compartment locations (nucleus is inside cytoplasm is inside plasma membrane)
  • To be useful, should remain purpose-based
    • each biologist might wish to define his/her own classification based on his/her needs and scope of interest
    • impossible to define a unifying standard for all biologists
  • No representation of molecular interactions
    • relationships between objects are only hierarchical, not horizontal or cyclic
    • e.g. does not describe which genes are the targets of a given transcription factor
What is biological function ?
  • A general definition
    • Function: the characteristic action or role of an element or organ within a whole (often opposed to structure). Source: Le Petit Robert - dictionnaire alphabétique et analogique de la langue française. 1982.
  • Function and gene ontology
    • Understanding function requires establishing the link between a molecular activity and the context in which it takes place (process).
    • Multifunctionality
      • Same activity can play different roles in different processes.
        • Example: the scute gene in Drosophila melanogaster encodes a transcription factor (activity) involved in sex determination, determination of neural precursors, and Malpighian tubules (3 processes).
      • Multiple activities of the same protein in a given process
        • Example: PutA in Escherichia coli contains 2 enzymatic domains (proline dehydrogenase and pyrroline-5-carboxylate dehydrogenase activities) + a DNA-binding domain (DNA-binding transcription factor) -> 3 molecular activities in the same process (proline utilization).
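The two kinds of multifunctionality can be made concrete by recording each annotation as a (gene product, activity, process) triple. The sketch below is one possible representation, not a description of how any database stores its data; the activity labels for PutA are generic placeholders.

```python
from collections import defaultdict

# (gene product, molecular activity, biological process) triples
# based on the two examples above; PutA activity labels are placeholders.
annotations = [
    ("scute", "transcription factor", "sex determination"),
    ("scute", "transcription factor", "determination of neural precursors"),
    ("scute", "transcription factor", "malpighian tubules"),
    ("PutA", "enzymatic activity 1", "proline utilization"),
    ("PutA", "enzymatic activity 2", "proline utilization"),
    ("PutA", "DNA-binding transcription factor", "proline utilization"),
]

activities = defaultdict(set)
processes = defaultdict(set)
for product, activity, process in annotations:
    activities[product].add(activity)
    processes[product].add(process)

for product in sorted(activities):
    print(
        f"{product}: {len(activities[product])} activity(ies), "
        f"{len(processes[product])} process(es)"
    )
# PutA: 3 activity(ies), 1 process(es)
# scute: 1 activity(ies), 3 process(es)
```

Keeping activity and process as separate fields of the same record preserves the link between them, which is lost when the two are annotated in independent vocabularies.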
HapMap (http://www.hapmap.org)
  • The International HapMap Project is a multi-country effort to identify and catalog genetic similarities and differences in human beings.
  • Associations between genetic variations (SNPs, ...) and diseases + response to pharmaceuticals.