Bioinformatics Challenges in Data Handling and Presentation to the Bioinformaticists

Presentation Transcript


  1. Bioinformatics Challenges in Data Handling and Presentation to the Bioinformaticists

  2. Metagenomic nucleotide sequence and annotation: Range of environments
     • Global ocean survey
     • Human faecal virus communities
     • Human distal gut microbiome
     • Phosphorus removal sludge communities
     • Obesity-associated gut microbiome
     • Acidophilic bacterial community
     • Mouse gut flora

  3. Metagenomic nucleotide sequence and annotation: Data growth: projects

  4. Metagenomic nucleotide sequence and annotation: Data growth: volume of dataset

  5. Metagenomic nucleotide sequence and annotation: Assembly issues
     • Most metagenome records have not been assembled into scaffolds in INSDC records (only 4 of 24 projects so far) and remain as unassembled WGS records
     • Those that have been assembled into scaffolds show very limited assembly - of the four assembled projects, one contains almost as many scaffolds as contigs

  6. Metagenomic nucleotide sequence and annotation: Metadata issues
     • Metadata, particularly sampling information, are often not shown, or are provided with limited granularity, restricting re-analysis by users
     • INSDC offers appropriate structures for such metadata, but they are frequently not used, even when the information is available to the submitters

     Current:

         FT   source          1..2866
         FT                   /organism="marine metagenome"
         FT                   /environmental_sample
         FT                   /mol_type="genomic DNA"
         FT                   /isolation_source="isolated as part of a large dataset
         FT                   composed predominantly from surface water marine samples
         FT                   collected along a voyage from Eastern North American coast
         FT                   to the Eastern Pacific Ocean, including locations in the
         FT                   Sargasso Sea, Panama Canal, and the Galapagos Islands"
         FT                   /note="metagenomic"
         FT                   /db_xref="taxon:408172"

     Could be:

         FT   source          1..2866
         FT                   /organism="marine metagenome"
         FT                   /environmental_sample
         FT                   /mol_type="genomic DNA"
         FT                   /country="French Polynesia: Moorea, Cooks Bay"
         FT                   /lat_lon="17.476 S 149.81 W"
         FT                   /isolation_source="marine surface water; sample
         FT                   depth: 34M; size range: 0.1-0.8 microns; water
         FT                   temperature: 28.900; salinity: 35.100"
         FT                   /db_xref="taxon:408172"
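To illustrate why structured qualifiers such as /country and /lat_lon make re-analysis easier than free text in /isolation_source, here is a minimal Python sketch that reads the qualifiers of an "FT source" feature. The function names and the simplified parsing rules are assumptions made for this illustration; this is not an INSDC or EMBL tool, and it does not handle the full flat-file format.

```python
import re

# Hypothetical sample, modelled on the "Could be" record on the slide.
EXAMPLE = '''FT   source          1..2866
FT                   /organism="marine metagenome"
FT                   /environmental_sample
FT                   /mol_type="genomic DNA"
FT                   /country="French Polynesia: Moorea, Cooks Bay"
FT                   /lat_lon="17.476 S 149.81 W"
FT                   /isolation_source="marine surface water; sample
FT                   depth: 34M; size range: 0.1-0.8 microns; water
FT                   temperature: 28.900; salinity: 35.100"
FT                   /db_xref="taxon:408172"
'''

def parse_source_qualifiers(flatfile_text):
    """Collect /key="value" and bare /flag qualifiers from 'FT' lines."""
    # Drop the leading 'FT' tag and rejoin continuation lines with spaces.
    body = " ".join(
        line[2:].strip()
        for line in flatfile_text.splitlines()
        if line.startswith("FT")
    )
    qualifiers = {}
    for match in re.finditer(r'/(\w+)(?:="([^"]*)")?', body):
        key, value = match.group(1), match.group(2)
        # Qualifiers without a value (e.g. /environmental_sample) become True.
        qualifiers[key] = value if value is not None else True
    return qualifiers

def parse_lat_lon(value):
    """Turn '17.476 S 149.81 W' into signed decimal degrees (lat, lon)."""
    m = re.fullmatch(r"([\d.]+) ([NS]) ([\d.]+) ([EW])", value.strip())
    if m is None:
        raise ValueError(f"unrecognised lat_lon value: {value!r}")
    lat = float(m.group(1)) * (1 if m.group(2) == "N" else -1)
    lon = float(m.group(3)) * (1 if m.group(4) == "E" else -1)
    return lat, lon

if __name__ == "__main__":
    quals = parse_source_qualifiers(EXAMPLE)
    print(quals["country"], parse_lat_lon(quals["lat_lon"]))
```

With the "Current" style of record, the same code would find no /lat_lon or /country key at all, and the sampling location could only be guessed from the prose in /isolation_source.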

  7. Metagenomic nucleotide sequence and annotation: Taxonomy issues
     • Taxonomic annotation in metagenomic data is simplistic - a very small number of non-specific taxa are necessarily used to describe all of the raw data
     • Analysis methodology, particularly binning, is inconsistent across the dataset, so taxonomic assertions in assembled sequence are of uncertain provenance
     • Standards on whether or not single contigs should contribute to scaffolds for more than one taxon are yet to be established

  8. Metagenomes and UniProt (1/2)
     • As of this month, ~6 million protein sequences from Global Ocean Survey have been released (vs. 4,534,260 UniProtKB entries)
     • Future exponential increase is anticipated:
       • The growth of public protein sequence data is exponential with a doubling time of about 20 months
       • Metagenomics data will have a substantially shorter doubling time
     • GOS data will more than double the existing protein-coding sequences in UniProtKB
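The doubling-time claim on this slide can be turned into a small projection. The sketch below assumes a constant doubling time, so that size after t months is N0 * 2^(t / T); the 20-month figure and the 4,534,260 entry count come from the slide, while the function names and the 12-month alternative are hypothetical values chosen only for illustration.

```python
import math

def projected_size(initial, months, doubling_months=20.0):
    """Size after `months`, assuming a constant doubling time."""
    return initial * 2 ** (months / doubling_months)

def months_to_reach(initial, target, doubling_months=20.0):
    """Months needed to grow from `initial` to `target` under the same model."""
    return doubling_months * math.log2(target / initial)

if __name__ == "__main__":
    uniprotkb_entries = 4_534_260  # figure quoted on the slide
    for doubling in (20.0, 12.0):  # 12 months is an illustrative assumption
        size = projected_size(uniprotkb_entries, 60, doubling)
        print(f"doubling every {doubling:g} months: {size:,.0f} entries after 5 years")
```

Under this model, shortening the doubling time has a large effect over a few years, which is the planning concern the slide raises for metagenomics data.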

  9. Metagenomes and UniProt (2/2)
     • Perspectives
       • Vast amount of sequence data
       • Environmental context in metadata
       • New kind of data requires new storage, processing, and data mining procedures
     • Taxonomically unassigned data will not be included in the UniProt Knowledgebase
     • UniMES – UniProt Metagenomics and Environmental sequences (June 2007)

  10. UniMes requirements
      • Distinct storage and dissemination: separated from current UniProt databases
      • Distinct production pipeline
      • Distinct accession number range: MES followed by 11 hexadecimal digits, e.g. MES00000000001
      • Distinct data mining pipelines: less restricted rules due to the lack of basic knowledge about the taxonomic origin of these sequences
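The accession format described above (the prefix MES followed by 11 hexadecimal characters) can be checked and generated with a few lines of Python. This is a sketch only: the helper names are hypothetical, and the use of uppercase hexadecimal is an assumption, since the slide gives a single example that contains only decimal digits.

```python
import re

# Assumption: uppercase hex digits; the slide does not specify the case.
ACCESSION_PATTERN = re.compile(r"MES[0-9A-F]{11}")

def is_unimes_accession(text):
    """True if `text` is 'MES' followed by exactly 11 hex characters."""
    return ACCESSION_PATTERN.fullmatch(text) is not None

def format_accession(number):
    """Render a non-negative integer as a zero-padded accession string."""
    if not 0 <= number < 16 ** 11:
        raise ValueError("number does not fit in 11 hexadecimal characters")
    return "MES" + format(number, "011X")

if __name__ == "__main__":
    print(format_accession(1), is_unimes_accession(format_accession(1)))
```

Eleven hexadecimal characters allow 16^11 (about 1.76 × 10^13) distinct accessions, which gives a sense of the headroom the range provides.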

  11. UniMes pipeline overview (diagram; labels: DNA Metagenomics (to be established), EMBL, Genomic sequence (EMBL), Metagenomics data (WGS), Other Submissions, Primary data, UniProt Metagenomics, UniProt Archive, UniProt Knowledgebase, Clustering, Secondary analysis, Classification, Automatic annotation rules)

  12. UniProtKB vs. UniMes: Database growth

  13. UniMes storage growth

  14. UniMes hardware requirements (1/2)
      • 2 HP/Compaq AlphaServer ES45 machines with 4 1250 MHz CPUs and 12 GB memory
        • Oracle database designed to store and maintain data derived from EMBL
        • Oracle Warehouse for data analysis, integration and display
      • 64-bit Linux farm (AMD Opteron) using 40 nodes for data mining procedures

  15. UniMes hardware requirements (2/2)
      • New Oracle servers: Sun Fire V490 with 4 1500 MHz UltraSPARC IV CPUs and 16 GB memory
      • We have enough physical storage and CPU power for 2007

  16. UniMes dissemination
      • FASTA and XML files
      • UniProt Web Site: text and similarity searches

  17. GOS submission
      • Submission of nucleic acid sequence data to EMBL/GenBank/DDBJ is mandatory for publication of a scientific paper
      • Craig Venter Institute submission to EMBL/GenBank/DDBJ in March 2007
      • Environmental metadata can only be found on the CAMERA website
      • Metadata are of great importance for metagenomic sequence data:
        • Descriptions of sampling sites and habitats
        • Analysis of metagenomics sequence data
      • URGENT need for the community to agree on what metadata must be included with the submission of any metagenomics sample

  18. UniMes and GOS data

  19. UniMes and GOS data

  20. UniMes and GOS data

  21. UniMes and GOS data (three panels)
      • Top 10 InterPro entries hitting UniProt
      • Top 10 InterPro entries hitting GOS
      • Top 10 InterPro entries hitting UniParc (including GOS)

  22. UniMes and GOS data: Analysis
      • Calculation time: 763,425 CPU hours
      • Storage for InterPro hits to GOS: 50 GB
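To put the CPU-hour figure in perspective, the arithmetic below converts it to elapsed days for a given number of workers. It is a rough sketch under strong assumptions: perfect parallel scaling and one CPU per worker. The worker count of 40 is borrowed from the node count quoted on the hardware slide; the slides do not say how many CPUs each node has or how the analysis was actually scheduled.

```python
def wall_clock_days(cpu_hours, workers):
    """Elapsed days if `cpu_hours` of work is spread evenly over `workers` CPUs."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    return cpu_hours / workers / 24

if __name__ == "__main__":
    cpu_hours = 763_425  # calculation time quoted on the slide
    for workers in (1, 40, 400):  # 40 echoes the farm size; others are illustrative
        days = wall_clock_days(cpu_hours, workers)
        print(f"{workers:>4} workers: {days:,.1f} days")
```

On a single CPU the same work would take roughly 87 years, which is why a compute farm is needed for this kind of analysis.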
