Databases at PBIL-PRABI

Specific databases, developed in the lab
• Homologous Gene Family Databases
  • Protein and nucleic acid sequences
  • Protein families and gene families, with:
    • Sequence alignments
    • Phylogenetic trees
• Applications:
  • Molecular evolution, genome structure and evolution
  • Phylogeny, Tree of Life
  • Comparative mapping
  • Gene function prediction
  • Population genetics
• Computing resources of the IN2P3 Calculation Center
  • Parallelisation (BLAST, CLUSTALW, MUSCLE, PHYML): gain from 50x to 150x

Homologous Gene Family Database at PBIL-PRABI
• HOGENOM 4
  • 514 complete genomes: Bacteria, Archaea, Eukarya
  • Data from EBI, NCBI, TIGR, JGI, etc.
  • 2,000,000 proteins and 2,000,000 CDS
  • 150,000 families
  • 2,000,000 x 2,000,000 protein sequence comparisons (BLAST)
  • 150,000 alignments of 2 to 2,000 sequences (MUSCLE)
  • 70,000 phylogenetic trees (GBLOCKS + PHYML)
• HOMOLENS 4
  • 40 animal genomes from Ensembl 49
  • 700,000 proteins, 900,000 CDS
  • 23,000 families
  • 700,000 x 700,000 protein sequence comparisons (BLAST)
  • 23,000 alignments of 2 to 2,000 sequences (MUSCLE)
  • 15,000 phylogenetic trees (GBLOCKS + PHYML)

Building HOGENOM 1. Harvesting complete genome data available from several websites
• Genome Reviews
  • Bacteria and archaea from EBI (and a few eukaryotes): 381 genomes
  • Standardized and comprehensively annotated genomes
• Microbial Genomes
  • Bacteria from NCBI: 94 genomes that were not found in Genome Reviews
• Ensembl
  • Animals: selection of a few representative species: 11 genomes
• EBI
  • Eukaryotic complete genomes other than animals: 13 genomes
• JGI, NCBI, NHGRI, Genoscope, TIGR, GenBank
  • Complete genomes from particular species of interest: 11 genomes

Building HOGENOM 2. Building ACNUC databases
• 514 complete genomes (flat files in EMBL format) → ACNUC indexing → ACNUC database of genomic sequences
• Querying ACNUC → coding DNA sequences (CDS) → proteins
• Proteins + UniProt annotations → complete proteomes (flat files in UniProt format) → ACNUC indexing → ACNUC database of protein sequences (2,000,000 sequences)

Building HOGENOM 3. Protein sequence families: similarity search
• Input: 2,000,000 protein sequences
• BLASTP with SEG filtering, BLOSUM62 matrix, E ≤ 10^-4
• Parallelised calculations at IN2P3
• Output: local pairwise alignments
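The E-value threshold above can be illustrated with a small filter over BLAST hits. This is a minimal sketch, assuming the hits were exported in the standard 12-column tabular layout (query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score). The function name `significant_hits` and the decision to drop self-hits are illustrative assumptions, not part of the HOGENOM pipeline as described.

```python
import csv
import io

E_VALUE_CUTOFF = 1e-4  # threshold quoted on the slide: E <= 10^-4


def significant_hits(handle, cutoff=E_VALUE_CUTOFF):
    """Yield (query, subject, evalue) for tabular BLAST hits with E-value <= cutoff.

    Assumption: 12 tab-separated columns, E-value in column 11.
    Self-hits (query == subject) are skipped, which is a choice made here.
    """
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject, evalue = row[0], row[1], float(row[10])
        if query != subject and evalue <= cutoff:
            yield query, subject, evalue


# Three made-up hits: a self-hit, a strong hit, and a weak hit
example = io.StringIO(
    "P1\tP1\t100.0\t300\t0\t0\t1\t300\t1\t300\t0.0\t600\n"
    "P1\tP2\t62.5\t280\t105\t0\t5\t284\t3\t282\t2e-90\t310\n"
    "P1\tP3\t31.0\t90\t62\t1\t10\t99\t40\t129\t0.5\t28\n"
)
kept = list(significant_hits(example))
print(kept)
```

Only the P1/P2 hit survives: the self-hit is skipped and the P1/P3 hit has an E-value above the cutoff.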

Building HOGENOM 3. Protein sequence families: clustering
• Two proteins are linked when their HSPs cover ≥ 80 % of the sequence length and their similarity is ≥ 50 %
• Linked proteins A, B and C are clustered into one protein family
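The clustering step can be sketched as single-linkage clustering with a union-find structure. This is a minimal sketch under stated assumptions: each pair of proteins has already been summarised as a tuple (query, subject, coverage, similarity) with both values as fractions, and links are treated as transitive (A linked to B and B linked to C gives one family). The tuple layout and the function name `cluster_families` are hypothetical and not taken from the HOGENOM code.

```python
from collections import defaultdict

MIN_COVERAGE = 0.80    # HSPs must cover at least 80 % of the length
MIN_SIMILARITY = 0.50  # similarity must be at least 50 %


def cluster_families(hits):
    """Group proteins into families by single-linkage clustering.

    hits: iterable of (query, subject, coverage, similarity) tuples,
    with coverage and similarity given as fractions between 0 and 1.
    Returns a sorted list of sorted member lists.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for query, subject, coverage, similarity in hits:
        # find() registers both proteins, so unlinked ones become singletons
        root_q, root_s = find(query), find(subject)
        if coverage >= MIN_COVERAGE and similarity >= MIN_SIMILARITY:
            parent[root_q] = root_s  # merge the two families

    families = defaultdict(set)
    for protein in parent:
        families[find(protein)].add(protein)
    return sorted(sorted(members) for members in families.values())


hits = [
    ("A", "B", 0.92, 0.61),
    ("B", "C", 0.85, 0.55),
    ("C", "D", 0.40, 0.70),  # rejected: coverage below 80 %
]
print(cluster_families(hits))  # [['A', 'B', 'C'], ['D']]
```

Union-find keeps the merge cost close to constant per hit, which matters when the number of pairwise hits is very large.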

Building HOGENOM 3. Protein sequence families
• Protein family (sequences A to G) → MUSCLE → multiple alignment (parallelised calculations, IN2P3)
• Multiple alignment → GBLOCKS → PHYML v3 (SH-like branch supports, NNI) → phylogenetic tree (parallelised calculations, IN2P3)

Building HOGENOM 4. Reannotation of ACNUC databases
• Reannotation of genomic sequences
  • Gene family associated with each CDS
  • GC percentage, internal introns, 3' and 5' non-coding regions (NCR)
• Reannotation of protein sequences
  • Family associated with each protein
• 2 ACNUC databases
  • 2,000,000 proteins
  • 2,000,000 CDS
  • 150,000 families with associated alignments and trees
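As an illustration of one of the reannotated fields, the GC percentage of a CDS can be computed as follows. This is a minimal sketch: the function name `gc_percentage` and the choice to ignore ambiguous symbols such as N are assumptions made here, not a description of how ACNUC computes the value.

```python
def gc_percentage(sequence):
    """Return the G+C percentage of a nucleotide sequence.

    Symbols other than A, C, G and T (for example N) are ignored,
    which is an assumption of this sketch.
    """
    seq = sequence.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0  # no unambiguous base to count
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / counted


print(gc_percentage("ATGC"))       # 50.0
print(gc_percentage("atgGCCtaa"))  # 4 of 9 bases are G or C
```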

Building HOGENOM 5. Formatting tree data for FamFetch and TreePattern
• TreePattern
  • Allows querying the tree database with complex tree patterns
  • For example: detection of gene transfers
• Tree reconciliation
  • Duplication and speciation events are assigned to the nodes of all gene trees according to the NCBI species taxonomy
  • The tree database can then be queried with tree patterns that include speciation and duplication events
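The idea of labelling gene-tree nodes as duplications or speciations can be illustrated with the simple species-overlap rule. Note that this is a swapped-in, simpler technique than the reconciliation against the NCBI taxonomy described above: a node is called a duplication when its two subtrees share at least one species, and a speciation otherwise. The tree encoding (nested pairs) and the function name `label_events` are hypothetical.

```python
def label_events(tree):
    """Label internal nodes of a rooted binary gene tree.

    A leaf is a (gene, species) pair of strings; an internal node is a
    (left, right) pair of subtrees. Returns (species_set, events), where
    events holds one label per internal node in post-order.
    Rule used (species overlap, an assumption of this sketch): a node is
    a duplication if its two subtrees share a species.
    """
    left, right = tree
    if isinstance(left, str):  # leaf: (gene, species)
        return {right}, []
    left_species, left_events = label_events(left)
    right_species, right_events = label_events(right)
    kind = "duplication" if left_species & right_species else "speciation"
    return left_species | right_species, left_events + right_events + [kind]


# Two paralogous copies (a and b), each present in human and mouse
gene_tree = (
    (("a1", "human"), ("a2", "mouse")),
    (("b1", "human"), ("b2", "mouse")),
)
species, events = label_events(gene_tree)
print(events)  # ['speciation', 'speciation', 'duplication']
```

Once every node carries such a label, a pattern search can ask, for instance, for subtrees that contain only speciation nodes in order to select orthologous genes.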

Building HOMOLENS: same pipeline
• Data origin
  • Ensembl flat files (40 organisms)
• Modification of flat files
  • Sequences are named according to organism and chromosome (when possible)
  • Example: HOMO1_1.PE1 is the first CDS in the first "contig" of chromosome 1
• 2 ACNUC databases
  • 700,000 proteins
  • 900,000 CDS
  • 23,000 families with associated alignments and trees
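The naming convention can be illustrated with a small parser. This is a sketch only: the layout `<organism code><chromosome>_<contig number>.PE<CDS number>` is inferred from the single example HOMO1_1.PE1, so the pattern, the field names and the restriction to numeric chromosomes are all assumptions and may not match every name in the database.

```python
import re

# Assumed layout, inferred from the one example HOMO1_1.PE1:
# <organism code><chromosome>_<contig number>.PE<CDS number>
# Only numeric chromosome labels are handled in this sketch.
NAME_PATTERN = re.compile(
    r"(?P<organism>[A-Z]+)(?P<chromosome>\d*)_(?P<contig>\d+)\.PE(?P<cds>\d+)"
)


def parse_sequence_name(name):
    """Split a sequence name into its assumed components."""
    match = NAME_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognised sequence name: {name!r}")
    fields = match.groupdict()
    return {
        "organism": fields["organism"],
        "chromosome": fields["chromosome"] or None,  # None when absent
        "contig": int(fields["contig"]),
        "cds": int(fields["cds"]),
    }


parsed = parse_sequence_name("HOMO1_1.PE1")
print(parsed)
```

For the slide's example this yields organism HOMO, chromosome 1, contig 1 and CDS 1.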

A few figures: sequence and species distribution in HOGENOM families

Use of HOGENOM: several tools
• QueryWin, raa_query
  • Complex ACNUC queries, data extraction
  • Graphical interface or command-line application
• FamFetch
  • Visualisation of trees and alignments, with user-defined taxon colouring
  • Tree pattern search for selecting orthologous or paralogous genes
• Web
  • ACNUC queries
  • Queries according to complex taxonomic criteria
  • Visualisation of trees and alignments
• seqinR
  • ACNUC queries
  • Statistical analysis in R

Use of HOGENOM Web

Use of HOGENOM FamFetch TreePattern

Bottlenecks 1. Data origins
• Data harvesting
  • Genome data are scattered across several sites
  • Data redundancy:
    • Cereon/Dupont
    • NCBI/EBI
• Data modifications
  • Sequence names
  • Check/change taxonomy fields
• Variable data quality:
  • Ensembl (EBI), Genome Reviews (EBI), Microbial Genomes (NCBI): good quality, validated, reliable
  • Some EBI complete genomes: missing data, mix of strains
  • Some JGI and Sanger genomes: raw data (FASTA files), no validation, poor quality
• Manual intervention needed

Bottlenecks 2. Data increase: technical problems
• Calculation time
  • BLAST, MUSCLE, PHYML
• Data visualisation
  • Alignments and trees are difficult to handle for big families
• Protein sequence data redundancy
  • Intra-database redundancy:
    • Identical sequences in different species
  • Inter-database redundancy:
    • For example: human sequences are found in HOVERGEN, HOMOLENS and HOGENOM

Bottlenecks 3. Data increase: user-related problems
• Exhaustive HOGENOM including the whole available taxonomy:
  • Huge families are difficult to handle
  • High redundancy due to bacterial strains
  • Is it really useful?
• Exhaustive HOGENOM restricted to a particular taxonomic level
  • Examples:
    • All complete genomes from plants
    • Complete genomes of all the strains of a given bacterium
• Non-exhaustive HOGENOM including a wide taxonomic set
  • Examples:
    • All the bacterial genomes, each bacterium being represented by a single strain
    • Representative organisms covering a wide phylogenetic distance
• Alternative: specialised HOGENOM-like databases

Perspectives 1/3: Participative Repository of Complete Genomes
• Filled in by the members of the laboratory interested in a particular genome
• Collects information about the available complete genomes:
  • Definition, taxonomy
  • Data origin
  • Bibliography summary (including statistics)
  • Access on the server
  • If possible: ACNUC version and automatic data analysis
• GOLD-like (Genomes OnLine Database)
• MySQL

Perspectives 2/3: Non-Redundant Sequences Database "BGENR"
• BGENR is a database for internal use
• Aims to
  • Suppress sequence data redundancy
  • Minimise BLAST similarity calculations and storage space
  • Connect the different databases together
  • Track sequence history, keep archives
• Built from non-redundant protein sequences coming from several sources:
  • UniProt
  • Translated CDS from Ensembl, Genome Reviews, Microbial Genomes, and several complete genomes
• Will be the sequence repository used to build all the databases
• Will be used to build a database of BLAST hits
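One way to suppress redundancy while still connecting the source databases is to key each distinct protein sequence by a digest and record every source entry that carries it. The sketch below is an assumption about how such a repository could work, not a description of BGENR itself; the record layout (source, accession, sequence), the function name `build_nonredundant`, the use of SHA-256 and the accessions in the example are all hypothetical.

```python
import hashlib


def build_nonredundant(records):
    """Collapse identical protein sequences into single entries.

    records: iterable of (source_db, accession, sequence) tuples.
    Returns {digest: {"sequence": str, "sources": [(source_db, accession), ...]}}.
    Whitespace and letter case are normalised before hashing (a choice
    made for this sketch).
    """
    entries = {}
    for source_db, accession, sequence in records:
        normalised = "".join(sequence.split()).upper()
        digest = hashlib.sha256(normalised.encode("ascii")).hexdigest()
        entry = entries.setdefault(digest, {"sequence": normalised, "sources": []})
        entry["sources"].append((source_db, accession))
    return entries


# Made-up records: the first two carry the same sequence
records = [
    ("UniProt", "EXAMPLE_1", "MKTAYIAKQR"),
    ("Ensembl", "EXAMPLE_2", "mktayiakqr\n"),
    ("Genome Reviews", "EXAMPLE_3", "MSLLTEVETY"),
]
repository = build_nonredundant(records)
print(len(repository), "distinct sequences")  # 2 distinct sequences
```

Each distinct sequence then needs to go through the BLAST search only once, and the list of sources keeps the link back to every database in which it appears.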

Perspectives 3/3: BLAST Hits Database
• Contains all the HSPs from a BLAST similarity search of all the BGENR sequences against themselves
• Will be used to
  • Build the homologous gene family databases on demand (taxon-specific databases, non-exhaustive databases, etc.)
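Building a taxon-specific database on demand then amounts to selecting, from the stored hits, those whose two sequences both belong to the requested species, before any clustering is run. This is a minimal sketch with hypothetical names (`hits_for_taxa`, `species_of`) and an in-memory list standing in for the hits database.

```python
def hits_for_taxa(hits, species_of, wanted_species):
    """Keep only stored hits whose two sequences belong to the wanted species.

    hits: iterable of (query, subject) pairs already stored in the hits database.
    species_of: dict mapping a sequence identifier to its species name.
    wanted_species: species to include in the on-demand database.
    """
    wanted = set(wanted_species)
    return [
        (query, subject)
        for query, subject in hits
        if species_of.get(query) in wanted and species_of.get(subject) in wanted
    ]


# Made-up identifiers and species assignments
species_of = {"s1": "rice", "s2": "maize", "s3": "human", "s4": "rice"}
stored_hits = [("s1", "s2"), ("s1", "s3"), ("s2", "s4")]
plant_hits = hits_for_taxa(stored_hits, species_of, ["rice", "maize"])
print(plant_hits)  # [('s1', 's2'), ('s2', 's4')]
```

Because the expensive similarity search is already stored, only this selection and the downstream clustering, alignment and tree steps need to be repeated for each new database.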

Conclusion
The future of HOGENOM relies on solving the following issues:
• Reconciliation
  • Not possible at the moment due to the NCBI bacterial taxonomy
• Connection between databases
  • Between exhaustive and non-exhaustive databases
  • Between non-exhaustive databases
• Visualisation
• Incremental calculations
  • Trees, alignments, similarity search
• Selection of representative genomes
  • How to do that?

Calculations are done at the IN2P3 Calculation Center
• HOMOLENS
  • 600,000 x 600,000 protein sequence comparisons (BLAST)
  • 30,000 alignments of 2 to 2,000 sequences (MUSCLE)
  • 16,000 phylogenetic trees (PHYML, SH-like branch supports)
• HOGENOM
  • 2,000,000 x 2,000,000 protein sequence comparisons (BLAST)
  • 150,000 alignments of 2 to 2,000 sequences (MUSCLE)
  • 70,000 phylogenetic trees (PHYML, SH-like branch supports, NNI+SPR)
• Calculation time gain: about 100x
• About 1 month for BLAST, MUSCLE and PHYML together

Calculations are done at the IN2P3 Calculation Center
• 50 engineers
• 1,800 users from particle physics, nuclear physics, astrophysics and biology
• 50 large experiments, including the LHC (CERN)
• Operates 24 hours a day, 7 days a week
• Has welcomed biology and medical users since 2002
• 4,000 processors, soon 7,000
• BQS (Batch Queuing System)


Update frequencies
• Automatic update: structuring of data under ACNUC
  • GenBank and EMBL: daily
  • UniProt: weekly
• Semi-automatic update: structuring of data under ACNUC, plus:
  • Data harvesting, data reannotation, quality checks
  • BLAST, alignments (MUSCLE), phylogenetic trees (PHYML)
• Updates per year:

Year       2003       2004   2005       2006   2007
HOVERGEN   3          2      0          1      1
HOMOLENS   -          -      1 (first)  1      1
HOGENOM    1 (first)  1      1          0      1 (new method)
Total      4          3      2          2      3
