
GCC Genomics Core Computing


Presentation Transcript


  1. GCC Genomics Core Computing

  2. Current situation (GCC 1.0) [diagram]: current cluster of 3 nodes (8 cores, 16 GB RAM each; 2 TB storage), connected via the UZ network to the Roche 454 sequencer and UZ NAS storage. Per run: ~1 million reads, ~2 GB raw data.

  3. New sequencer: 1000x increase. 1.1 TB per run (200 Gbp), ~1000 million reads, an 8-day run! Basic analysis of 1 full run takes < 1 week on 3 nodes with 48 GB RAM and 8 CPU cores each (and needs 7 TB of space). Sequencing at full capacity means computing at full capacity on 24 CPU cores (see the throughput sketch below).
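
A minimal back-of-the-envelope sizing sketch in Python, using only the figures quoted on this slide (1.1 TB and ~1000 million reads per 8-day run, 7 TB of working space, 3 nodes of 8 cores); the way the numbers are combined is illustrative, not a measured benchmark.

```python
# Back-of-the-envelope sizing for the new sequencer.
# All inputs are the figures quoted on the slide; nothing here is measured.

RAW_TB_PER_RUN = 1.1          # raw data per run
READS_PER_RUN = 1000e6        # ~1000 million reads
RUN_DAYS = 8                  # duration of one sequencing run
ANALYSIS_TB_PER_RUN = 7.0     # working space needed during basic analysis
NODES, CORES_PER_NODE = 3, 8  # analysis nodes (48 GB RAM each)

def sustained_mb_per_s(tb: float, days: float) -> float:
    """Average data-production rate over the run, in MB/s."""
    return tb * 1e6 / (days * 24 * 3600)

print(f"Raw data rate:        ~{sustained_mb_per_s(RAW_TB_PER_RUN, RUN_DAYS):.1f} MB/s sustained")
print(f"Reads per run:        {READS_PER_RUN:.0e}")
print(f"Disk during analysis: {ANALYSIS_TB_PER_RUN} TB per run")
print(f"Cores to keep up with back-to-back runs: {NODES * CORES_PER_NODE}")
```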

  4. Meta-analyses & post-analyses
  • Several-fold higher needs than basic run analyses
  • Integrate multiple runs (e.g., patients versus controls, families, etc.; a toy integration sketch follows this list)
  • Integrate with previous data
  • Integrate with publicly available data
    • RNA-Seq + gene expression data from GEO
  • Integrate with other data sources
    • DNA-Seq + RNA-Seq + Methyl-Seq
  • Integrate with genome browsers
    • Galaxy, UCSC, Ensembl
  • Make analysis pipelines available to users as a service
  • Custom analyses as a service or in collaboration
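
As a purely illustrative example of what "integrate multiple runs" might look like in practice, the sketch below merges hypothetical per-run variant tables and contrasts patients against controls; the file names, column layout and the use of pandas are assumptions, not part of the GCC pipeline.

```python
# Toy example: combine per-run variant tables and contrast patients vs. controls.
# File names and columns are hypothetical; only the idea comes from the slide.
import pandas as pd

runs = {
    "run01_patient": "run01_variants.tsv",
    "run02_patient": "run02_variants.tsv",
    "run03_control": "run03_variants.tsv",
}

tables = []
for sample, path in runs.items():
    df = pd.read_csv(path, sep="\t")   # expected columns: chrom, pos, ref, alt
    df["sample"] = sample
    df["group"] = "patient" if "patient" in sample else "control"
    tables.append(df)

combined = pd.concat(tables, ignore_index=True)

# For each variant, count in how many patient and control samples it was seen.
counts = (combined
          .groupby(["chrom", "pos", "ref", "alt", "group"])["sample"]
          .nunique()
          .unstack(fill_value=0))
for group in ("patient", "control"):
    if group not in counts.columns:
        counts[group] = 0

# Candidate variants: present in patients, never seen in controls.
patient_only = counts[(counts["patient"] > 0) & (counts["control"] == 0)]
print(patient_only.head())
```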

  5. Ideal computing setup [diagram]: High Performance Computing (HPC); 500 MB/s.

  6. UZ - gbiomed - VSC [diagram: three computing environments]
  • UZ (UZ-patient data): current cluster (3 nodes, 8 cores / 16 GB RAM each, 2 TB), distributed computing (Open-MPI, SGE), UZ NAS storage. To add: additional RAM (32 GB or 48 GB per node), additional storage? Software: CASAVA, CLCBio, Roche.
  • gbiomed: flexible computing. To acquire: servers, storage (DAS or NAS? Dell, NetApp?), switches. Software: academic tools, CLCBio?
  • VSC: distributed computing, ~100 CPUs, 6 GB RAM/core, Torque/PBS, NetApp + DDN storage. Pay per use: computing (0.5 EUR / cpu-hour), storage (750-1500 EUR / TB). Software: CASAVA (parallelized by the user), academic tools (bowtie, bwa, …), CLCBio? (A hedged job-submission sketch follows below.)
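
Since the slide names Torque/PBS as the VSC scheduler and bwa among the academic aligners, here is a minimal sketch of how one lane might be submitted; the queue settings, walltime, reference and file paths are hypothetical placeholders, not the Genomics Core's actual pipeline.

```python
# Minimal sketch: build and submit a Torque/PBS job that aligns one lane of
# Illumina reads with bwa. Paths, resource limits and sample names are
# hypothetical; the reference is assumed to be bwa-indexed already.
import subprocess

def submit_bwa_job(sample, fastq, reference, cores=8):
    script = f"""#!/bin/bash
#PBS -N bwa_{sample}
#PBS -l nodes=1:ppn={cores}
#PBS -l walltime=24:00:00

cd $PBS_O_WORKDIR
bwa aln -t {cores} {reference} {fastq} > {sample}.sai
bwa samse {reference} {sample}.sai {fastq} > {sample}.sam
"""
    # qsub reads the job script from stdin and prints the assigned job id.
    result = subprocess.run(["qsub"], input=script, text=True,
                            capture_output=True, check=True)
    return result.stdout.strip()

if __name__ == "__main__":
    job_id = submit_bwa_job("lane1", "lane1.fastq", "hg19.fa")
    print("submitted", job_id)
```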

  7. To be discussed
  • How can the HiSeq 2000 choose between the UZ and KU Leuven networks to send run data to storage?
    • 1 Gb network link
    • ~350 GB per run, compressed (see the transfer-time estimate below)
  • Where to store data after secondary analysis?
    • Cheap storage: external HDDs, tape
  • Who does what?
    • Jeroen / Jan for UZ?
    • Stein / Gert / Raf for Biomed?
  • Can we already buy additional RAM for the UZ cluster?
  • Can we connect gbiomed servers directly to UZ storage? What are the requirements?
  • Estimate the load over the 3 levels
    • # users, # runs
    • Difficult to estimate now; evaluate after 1 year
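
A quick worked estimate for the data-transfer question above, assuming the ~350 GB compressed per run quoted on this slide; the 70% effective-throughput factor and the 10 Gb comparison (taken from the later slides) are assumptions.

```python
# Rough transfer-time estimate for one compressed run.
RUN_GB = 350                 # compressed data per run (from the slide)
EFFICIENCY = 0.7             # assumed effective fraction of the nominal link speed

for link_gbit_s in (1, 10):  # 1 Gb link today; 10 Gb link proposed in later slides
    effective_gbit_s = link_gbit_s * EFFICIENCY
    hours = (RUN_GB * 8) / effective_gbit_s / 3600
    print(f"{link_gbit_s:>2} Gbit/s link: ~{hours:.1f} h per run")
```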

  8. What's next
  • Test with 1000 Genomes data
  • Decide on gbiomed hardware
  • List of things needed at UZ
  • Start testing CASAVA on the UZ system and on the VSC
  • Test CLCBio on the UZ system for Illumina data

  9. Storage
  • How much do we need?
    • 1.1 TB per run; 7 TB of space during analysis
    • BUT: keep only runs that are being analyzed (~3 at a time?), so roughly 10-15 TB (see the storage sizing sketch below)
  • After analysis:
    • Data delivered to the client
    • Data compressed and moved to offline storage: cheap HDD array? Tape? External HDDs?
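
A small bracketing calculation around the 10-15 TB estimate above; the figures come from this slide, but the assumption about how many concurrent runs sit in the disk-heavy analysis phase at the same time is mine, which is why the bracket is wider than the slide's own range.

```python
# Online storage sizing from the figures on this slide: keep only runs that
# are actively being analyzed (~3 at a time, per the slide's own estimate).
RAW_TB_PER_RUN = 1.1
ANALYSIS_TB_PER_RUN = 7.0
CONCURRENT_RUNS = 3

# Pessimistic: every concurrent run needs its full analysis footprint at once.
peak_tb = CONCURRENT_RUNS * ANALYSIS_TB_PER_RUN
# Optimistic: only one run is in the disk-heavy analysis phase at a time.
min_tb = ANALYSIS_TB_PER_RUN + (CONCURRENT_RUNS - 1) * RAW_TB_PER_RUN

print(f"Online storage needed: ~{min_tb:.0f}-{peak_tb:.0f} TB")
```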

  10. Proposal for GCC 2.0 (ideas under construction) [architecture diagram]
  • Sequencers: Roche 454 and Illumina HiSeq 2000
  • VSC (existing, pay per cpu-hour): ICTS/VSC NetApp + DDN storage; non-patient-related data (!)
  • UZ computing nodes (existing): 8C/16GB/2TB + 8C/16GB + 8C/16GB; 32C/256GB node; UZ NetApp storage (!); patient-related data; fast interconnect, high I/O bandwidth
  • gbiomed computing nodes: 8C/48GB + 8C/48GB; non-patient-related data (e.g., model organisms, cell lines, …)
  • 10 Gb link (!)
  • (! = to create, to test, or to open)

  11. GCC 2.0 features
  • Divide and conquer: a solution at 3 levels (a toy routing sketch follows this list)
    • UZ: for UZ-patient-related data (protected)
    • Gbiomed: ad hoc, flexible computing for research (non-UZ-patient-related data)
    • VSC: high-performance computing (non-UZ-patient-related data)
  • Storage (too expensive to duplicate)
    • VSC storage with Gbiomed access (create a 10 Gb fast interconnect from ICTS to gbiomed)
    • UZ storage with Gbiomed access (create an 'open-access' policy for non-patient-related data)
    • Gbiomed ad hoc storage (HDDs in the local servers)
  • Computing
    • VSC for HPC
    • Servers in UZ (patient-related data)
    • Servers in gbiomed (for research-related ad hoc analyses, web services, development, software testing, …)
    • Requires fast (10 Gb Ethernet) access to ICTS storage and fast (and open) access to UZ-open storage
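
A toy encoding of the three-level "divide and conquer" rule described above; the data classes and tier names come from the slide, but expressing the policy as a function and its exact return values are illustrative assumptions.

```python
# Toy routing rule for the three-level GCC 2.0 setup described on this slide.
# The tier names come from the slide; the encoding as a function is illustrative.

def allowed_tiers(patient_related: bool, uz_patient: bool) -> list[str]:
    """Return the environments where a dataset may be stored and analysed."""
    if patient_related and uz_patient:
        # UZ-patient-related data stays inside the protected UZ environment.
        return ["UZ (protected)"]
    # Non-UZ-patient-related research data can use all three levels.
    return ["gbiomed (ad hoc, flexible)", "VSC (HPC)", "UZ-open storage"]

print(allowed_tiers(patient_related=True, uz_patient=True))    # -> UZ only
print(allowed_tiers(patient_related=False, uz_patient=False))  # -> research tiers
```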

  12. GCC 2.0 cost, timing & effort estimates
  • Budget from Stichting tegen Kanker
    • 200-250 kEUR left for computing
    • A solution for the first 3 years should be possible (excluding bioinformatics manpower)
  • Budget spread between VSC, gbiomed and UZ: to be decided internally in the Genomics Core (a worked cost estimate follows this list)
    • VSC: x%
      • Storage (86,400 EUR for 32 TB; ~80 TB is needed for 25 runs per year)
      • Computing time (29,594 EUR for 55,000 cpu-hours)
    • Gbiomed local servers and local storage: y%
    • UZ additional storage: z%
    • Software licenses (CLCBio) (price quote requested)
  • More investments needed over time (e.g., new hardware lasts only ~3 years)
  • Timing: 31 August 2010?
  • Estimated effort (to be discussed)
    • VSC:
      • Create the 10 Gb Ethernet link to gbiomed (cost?)
      • … man-days for startup and testing (network connections, storage, software)
      • Maintenance included in the price
    • Genomics Core bioinformaticians (VRC, CME): … man-days for startup and testing
    • Gbiomed IT: … man-days for setting up local servers & integration with ICTS storage; … FTE for maintenance of local servers
    • UZ: … man-days for additional storage and setting up the NetApp share
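
A worked breakdown of the VSC quotes listed above; the arithmetic simply divides the quoted totals, and the yearly extrapolation uses the slide's own figure of ~80 TB for 25 runs per year.

```python
# Unit costs derived from the VSC quotes on this slide.
STORAGE_EUR = 86_400        # quoted price for 32 TB of VSC storage
STORAGE_TB = 32
COMPUTE_EUR = 29_594        # quoted price for 55,000 cpu-hours
COMPUTE_CPU_HOURS = 55_000

eur_per_tb = STORAGE_EUR / STORAGE_TB
eur_per_cpu_hour = COMPUTE_EUR / COMPUTE_CPU_HOURS
print(f"Storage:   {eur_per_tb:,.0f} EUR/TB")
print(f"Computing: {eur_per_cpu_hour:.2f} EUR/cpu-hour")

# Rough yearly storage cost if ~80 TB is needed for 25 runs per year.
NEEDED_TB_PER_YEAR = 80
print(f"~{NEEDED_TB_PER_YEAR} TB/year: ~{NEEDED_TB_PER_YEAR * eur_per_tb:,.0f} EUR")
```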

  13. Hurdles to overcome
  • 1) 10 Gb Ethernet link between the VSC and gbiomed
    • For non-UZ-patient-related data
    • To transfer Illumina data to the VSC
    • To run ad hoc analyses on local gbiomed servers, connected to the VSC storage, without the need to duplicate the storage solution and the data (too costly)
    • An absolute requirement, currently not available; a necessary investment for future VSC-BMW interactions
  • 2) UZ-patient-related data cannot be transferred to VSC storage, nor computed on at the VSC
    • Can the VSC provide a secure transfer, storage and computing environment for UZ data? If not, data analysis and storage for UZ data remain within UZ.
  • 3) Link between UZ storage and gbiomed for non-patient-related data
    • A gbiomed-UZ 10 Gb link is possible in principle. Perhaps during a transition period (while waiting for the 10 Gb VSC-gbiomed link)?

  14. Alternatives
  • All-in-one solution (PSSC Labs)
  • Public tender

  15. Bioinformatics analyses
  • Estimated effort from a Genomics Core bioinformatician for the basic analysis of 1 run: ~2-3 man-days
    • Included in the service fee?
    • This analysis will not be satisfactory for most projects
  • Fee-based bioinformatics and data analysis service for more advanced analyses?
    • Many users have a bioinformatician in the group or already collaborate with bioinformaticians
    • Contribution in the service fee towards GCC hardware & maintenance costs and software licenses
  • Estimated effort:
    • Either only basic analysis services are offered: ½ FTE bioinformatics postdoc
    • Or basic plus advanced bioinformatics services are offered: 1 FTE bioinformatics postdoc
