The Sequence Read Archive at EBI

Presentation Transcript


  1. The Sequence Read Archive at EBI. Guy Cochrane, EMBL-EBI

  2. European Nucleotide Archive 2 16.08.2014 European Nucleotide Archive

  3. ENA Mechanisms
     • Sequence similarity search
     • Term search
     • Download
     • Browse
     • Pipe into analysis tools
     • APIs
     • Direct presentation
     • Local data capture
     • Data exchange
     • Ensembl (genebuild, variation, regulatory build)
     • UniProt
     • ArrayExpress
     • 1k Genomes DCC
     • Infrastructure service
     • Brokered submissions

  4. SRA service
     • Establish a global repository for next-generation platform data
       - submission services
       - through extension of data exchange collaborations with partners at NCBI and DDBJ
     • Provide a route for data dissemination
       - as ongoing infrastructure to support large-scale studies
       - as a complement to publications
       - relieve data generators of large hardware requirements
     • Provide data access to users
       - for re/meta-analysis of existing data
       - to enable serendipitous discoveries

  5. Next gen. brings broader applications
     • de novo assembly
     • re-sequencing
     • gene expression
     • gene discovery
     • epigenomics
     • community genomics & transcriptomics
     • others

  6. Next gen. is different
     • Read length
     • Data volume per run
     • Metadata:data volume ratio
     • Read substructure
     • Complexity of metadata

  7. A sustainable data model for SRA
     • Study, sample and experimental information
     • Publication and author information
     • Machine configuration
     • Access to run datasets
     • Access to selected reads within runs
     • Intensity, noise data, sequence, quality

  8. A sustainable data model for SRA
     • SRA XML schema
     • format-specific toolkit
     • specialist binary formats (SRF and SRA)

  9. Status
     • Infrastructure
       - Metadata schema initiated by NCBI, now under co-development with EBI
       - Adoption of community data format, SRF
       - Migration to NCBI's SRA data format
       - Common accession namespace established with NCBI
     • Data capture
       - Large sequencing centres (e.g. Sanger, BGI)
       - Small-scale submissions (e.g. Sanger Pathogen Sequencing Unit, Illumina UK)
     • Data and metadata exchange
       - NCBI-collected data and metadata mirrored at EBI
     • Data presentation
       - All metadata available as XML
       - All data available via FTP and Aspera
       - Beta browser launched in early October

  10. SRA contents
     • Chart axis: nucleotides (terabases)
     • 985 studies
     • 1,253 organisms
     • 5,329 samples
     • 9,296 experiments
     • 27,662 runs

  11. SRA by platform

  12. SRA by study type

  13. Aspera technology
     • fasp protocol significantly faster than FTP
     • Intelligent adaptive rate control mechanism
     • Secure: on-the-fly data encryption and integrity verification
     • Client download: http://www.asperasoft.com/downloads/connect-win.html
     • Command line client and web browser plug-in

  14. Submissions
     • Manual
       - XML examples and information about supported data formats made available
       - Data and XML metadata files uploaded with FTP or Aspera into drop box
       - Notification e-mail to datasubs@ebi.ac.uk to initiate processing
       - Metadata files validated and cross-checked against data files
       - Accessions returned by e-mail
     • Automated
       - Data files are uploaded into drop box
       - RESTful service used to submit data files and XML metadata files
       - Metadata file validation and accessioning is synchronous
       - Data file validation is asynchronous
     • Documentation: http://www.ebi.ac.uk/embl/Documentation/ENA-Reads.html
     • Contact: datasubs@ebi.ac.uk

  15. SRA as infrastructure
     • ArrayExpress: sequence-based transcriptomics data
     • Bioinvestigation Index: sequence-based multi-omics data
     • European Genome-Phenome Archive (EGA): sequence-based, ethically protected data

  16. Retrieval
     • Data
       - FTP: ftp.era.ebi.ac.uk
       - Aspera: fasp.era.ebi.ac.uk
     • Metadata
       - FTP: ftp.era-xml.ebi.ac.uk
     • Browser (in beta)
       - http://www.ebi.ac.uk/ena/data/view/<SRA object accession>&display=xml
       - http://www.ebi.ac.uk/ena/data/view/<SRA object accession>&display=html
     • Search by accession/description text
       - EB-eye search tool on all EBI pages
       - http://www.ebi.ac.uk/ebisearch/advancedsearch.ebi
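
The two browser addresses on the Retrieval slide share one pattern, so they can be assembled from an accession. The sketch below is only an illustration: the helper name `ena_view_url` and the accession `ERA000001` are hypothetical, and the URL pattern is copied from the slide as presented, with no claim that the service still responds at this address.

```python
# Base address as shown on the Retrieval slide.
BASE = "http://www.ebi.ac.uk/ena/data/view/"

# The slide lists exactly these two display modes.
DISPLAY_MODES = ("xml", "html")


def ena_view_url(accession: str, display: str = "xml") -> str:
    """Build a browser URL for an SRA object accession (hypothetical helper)."""
    if display not in DISPLAY_MODES:
        raise ValueError(f"display must be one of {DISPLAY_MODES}, got {display!r}")
    # The slide joins the accession and the display option with '&'.
    return f"{BASE}{accession}&display={display}"


if __name__ == "__main__":
    # 'ERA000001' is a made-up accession used purely for illustration.
    print(ena_view_url("ERA000001", "xml"))
    print(ena_view_url("ERA000001", "html"))
```

Rejecting unknown display modes keeps the helper limited to the two views the slide actually documents.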

  17. EB-eye search

  18. Summary of hits

  19. Submission view

  20. Study view

  21. Sample view

  22. Experiment view

  23. Run view

  24. Data files
     • Currently provided in Sequence Read Format (SRF) and SRA toolkit
     • Derived fastq files available
     • BAM format to be added in due course
     • Data files can hold intensity data, base calls and qualities.

  25. Data file manipulation
     • SRF
       - Io_lib of the Staden package, http://sourceforge.net/projects/staden
       - Solid2srf provided by ABI
       - Functionalities include conversion (native <-> SRF <-> fastq), indexing, summary generation
     • SRA toolkit
       - Software development kit, SRA SDK, http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=show&f=software&m=software&s=software
       - Functionalities include format conversion, column extraction, selection, etc.

  26. Futures: sequence similarity search
     • Figure: a collection of short reads of varying length (e.g. GATT, AGAT, GATCCGATGAG, GCTCTAG)

  27. Futures: leveraging community standardisation efforts
     • Coherent communities exist that develop standards around what information to collect and how to represent it
     • Systematic incorporation into data capture and presentation tools
     • Validation against minimal standards and stamp of approval

  28. Futures: data reduction strategies
     • Disk space is finite!
     • Intensity series have limited value for reuse
     • Both future sample availability and application are factors
     • Second base useful for polymorphism studies
     • Proposal that minimal archived data includes sequence and quality
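
The proposal on the data reduction slide, archiving only sequence and quality, can be illustrated with a toy record. This is a minimal sketch under stated assumptions: the `ReadRecord` layout and the `reduce_record` helper are hypothetical and do not reflect the actual SRF or SRA formats.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class ReadRecord:
    """Hypothetical in-memory read; not the real SRF/SRA layout."""
    name: str
    sequence: str
    quality: str
    # Per-cycle intensity values: bulky, and of limited reuse value per the slide.
    intensities: Optional[Tuple[float, ...]] = None


def reduce_record(record: ReadRecord) -> ReadRecord:
    """Return a copy that keeps the name, sequence and quality but drops intensities."""
    return replace(record, intensities=None)


if __name__ == "__main__":
    full = ReadRecord("read1", "GATT", "IIII", (0.9, 0.8, 0.7, 0.95))
    print(reduce_record(full))
```

The original record is left unchanged, so a caller can decide per study whether the full or the reduced form is archived.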

  29. Futures: the mapped read issue
     • User defines coordinates on a reference, reads returned that relate to this part of the reference:
       - Give me all reads that map to given gene in digital gene expression assay
       - Give me all reads that provide support for a given polymorphism
       - Give me all reads that provide support for a given splice model
     • Calculation
       - Up to date, but computationally heavy, with reference tracking issues
     • Capture
       - Consistent with literature, but submission is not straightforward
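
The query described on the mapped read slide, returning the reads that relate to user-defined coordinates on a reference, amounts to an interval-overlap test once mappings exist. The sketch below assumes the mappings are already available (the "capture" option); the `MappedRead` type, the `reads_in_region` helper and the 1-based inclusive coordinate convention are all assumptions made for illustration, not part of any SRA interface.

```python
from typing import Iterable, List, NamedTuple


class MappedRead(NamedTuple):
    """Hypothetical mapping of one read onto a reference."""
    name: str
    reference: str
    start: int  # 1-based, inclusive (assumed convention)
    end: int    # inclusive


def reads_in_region(reads: Iterable[MappedRead], reference: str,
                    start: int, end: int) -> List[MappedRead]:
    """Return the reads on `reference` that overlap the closed interval [start, end]."""
    return [
        r for r in reads
        if r.reference == reference and r.start <= end and r.end >= start
    ]


if __name__ == "__main__":
    reads = [
        MappedRead("r1", "chr1", 100, 135),
        MappedRead("r2", "chr1", 140, 175),
        MappedRead("r3", "chr2", 100, 135),
    ]
    # Ask for everything overlapping positions 130-150 on chr1.
    for read in reads_in_region(reads, "chr1", 130, 150):
        print(read.name)
```

A linear scan is enough to show the idea; at archive scale an index over reference coordinates would be needed, which is part of what makes this an open issue.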

  30. People and funding
     • Data submissions and management: Sheila Plaister, Bob Vaughan, Ruth Akhtar, Petra ten Hoopen, Christopher Hunter, Richard Gibson
     • Database programmers: Ying Chang, Iain Cleland, Mikyung Jang, Rasko Leinonen, Quan Lin, Lawrence Bower, Siamak Sobhany, Gemma Hoad, Rajesh Radhakrishnan, Fehmi Demiralp, Vadim Kalunin, Neil Goodgame, Nadeem Faruque
     • Database development and coordination: Bob Vaughan, Nadeem Faruque, Rasko Leinonen, Guy Cochrane
     • Sequencing data and tools (Sanger): Steven Leonard, James Bonfield
     • Sequence search tools: Guy Slater, Ewan Birney
     • Data exchange collaborators: NCBI, DDBJ
     • EBI external services team
     • Funding: European Molecular Biology Laboratory and Wellcome Trust

  31. ENA points of access
     • http://www.ebi.ac.uk/embl/Documentation/ENA-Reads.html
     • http://www.ebi.ac.uk/ena/data/view/<SRA object accession>
