
Computational challenges


Presentation Transcript


  1. Computational challenges Sean Eddy HHMI Janelia Farm Research Campus

  2. My charge: 2008: 2 Tb; 2009: 32 Tb; 2010: 150 Tb; 2011: 165 Tb
     How will we keep up with this?
     • maintaining/annotating quality
     • storage
     • communication (network bandwidth)
     • analysis (including software and databases)
     • integration

  3. Ewan Birney, Michael Brent, Jeremy Buhler, Goran Ceric, Barak Cohen, Richard Durbin, Jonathan Eisen, Rob Finn, Paul Gardner, Ian Holmes, Scott Hunicke-Smith, Rob Knight, David Konerding, Saul Kravitz, Anthony Leonardo, Rob Mitra, Ryan Richt, Jason Stajich, Lincoln Stein, Granger Sutton, George Weinstock, Rick Wilson;
     Vivien Bonazzi and Adam Felsenfeld (NIH NHGRI); Steven Brenner (UC Berkeley); David Dooling (WashU Genome Center); Dan Meiron (Dept. of Applied & Computational Mathematics and Aeronautics, Caltech)
     http://cryptogenomicon.org

  4. FY09 NHGRI: $488M (about $40M databases, about $60M informatics)
     FY09 NIH: $29,000M. HHMI: $760M (Janelia Farm alone: $120M)
     CERN: $1000M. SLAC: $300M. LSST: $45M (start 2015)
     • Informatics challenges affect all biomedical research
     • NHGRI lacks resources to solve these problems alone
     • First planning priority is dealing with NHGRI's own data well
     • At the same time:
       • lead and catalyze -- show others how to do it; best practices
       • work together -- NCBI, EBI, NCI, and others share our problems

  5. Data volume per se is not the problem.

  6. [Chart (Bob Gourley): projected DoD sensor data volume -- Global Hawk data, Predator UAV video, UUVs, Firescout VTUAV data, and future sensors X/Y/Z -- climbing from terabytes (10^12) toward exabytes (10^18), zettabytes (10^21), and yottabytes (10^24) on a log scale, outpacing GIG data capacity (services, transport & storage) from 2000 through 2015 and beyond. One theater's 2006 data stream was ~270 TB of NTM data/year against ~250 TB of theater storage capacity: a widening capability gap (Large Data JCTD).]
     http://ctovision.typepad.com/InfoSharingTechnologyFutures.ppt

  7. Moore’s law: CPU power doubles in ~18-24 mo. Hard drive capacity doubles in ~12 mo. Network bandwidth doubles in ~20 mo.

  8. Fundamental computing capabilities should increase ~7-10x in 5 years and ~50-100x in 10 years.
     We project a ~100x increase in sequencing volume in 3-5 years.
     Therefore: yes, next-gen sequencing tech bumps us up, and we can't just sit on our hands; but we only have to be a little more clever.
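
To make the growth arithmetic explicit, here is a minimal back-of-the-envelope sketch in Python, assuming simple exponential growth with the rough doubling times quoted on the previous slide (the numbers are illustrative, not measurements):

```python
# Back-of-the-envelope projection: fold-increase after a number of years,
# given a doubling time in months. Assumes smooth exponential growth; the
# doubling times are the rough figures quoted on the previous slide.

def growth_factor(doubling_months: float, years: float) -> float:
    """Fold-increase after `years`, given a doubling time in months."""
    return 2 ** (years * 12 / doubling_months)

for label, months in [("CPU (~21 mo)", 21), ("disk (~12 mo)", 12), ("network (~20 mo)", 20)]:
    print(f"{label}: ~{growth_factor(months, 5):.0f}x in 5 yr, "
          f"~{growth_factor(months, 10):.0f}x in 10 yr")

# A ~21-month CPU doubling time gives ~7x in 5 years and ~50x in 10 years,
# consistent with the 7-10x and 50-100x ranges above.
```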

  9. Fortunately, we are not alone.

  10. Private sector datasets and computing capacity are already huge.
      Google, Yahoo!, Microsoft: probably ~100 PB or so. Ebay, Facebook, Walmart: probably ~10 PB or so.
      For example, Microsoft is constructing a new $500M data center in Chicago: four new electrical substations totaling 168 MW of power; about 200 40' truckable containers, each containing ~1000-2000 servers; an estimated 200K-400K servers total.
      Comparisons to Google, Microsoft, etc. aren't entirely appropriate; the scale of their budgets vs. ours isn't comparable (Google FY2007: $11.5B; ~$1B to computing hardware).
      Though they do give us early warning of coming trends (container data centers; cloud computing).

  11. CERN Large Hadron Collider (LHC): ~10 PB/year at start; ~1000 PB in ~10 years; 2500 physicists collaborating. http://www.cern.ch

  12. Large Synoptic Survey Telescope (LSST): NSF, DOE, and private donors; ~5-10 PB/year at start in 2012; ~100 PB by 2025.
      Pan-STARRS (Haleakala, Hawaii): US Air Force; now 800 TB/year, soon 4 PB/year.
      http://www.lsst.org; http://pan-starrs.ifa.hawaii.edu/public/

  13. 1. Petabyte data volumes are manageable using commodity tech.
      Pan-STARRS: 80 "data bricks", RAID-6; 3 PB for ~$1M.

  14. 2. Just because you can store raw data doesn't mean you should:
      filter data at the source and at every stop along the way, using a strategy appropriate to the particular experiment/analysis.
      The CERN LHC ATLAS detector generates ~10^5 times more data than is stored (40 million events/sec → ~200 events/sec stored).
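
A minimal sketch of the filter-at-the-source idea in Python, using a toy event stream and an arbitrary selection threshold (nothing here reflects the actual LHC trigger logic):

```python
# Minimal sketch of "filter at the source": keep only events that pass a cheap
# selection test, so downstream storage sees a tiny fraction of the raw stream.
# The event source and the threshold are hypothetical placeholders.

import random

def raw_event_stream(n):
    """Stand-in for a detector producing n raw events."""
    for _ in range(n):
        yield {"energy": random.expovariate(1.0)}   # arbitrary toy measurement

def triggered(events, energy_cut=10.0):
    """Pass only the rare, interesting events downstream."""
    for ev in events:
        if ev["energy"] > energy_cut:
            yield ev

kept = sum(1 for _ in triggered(raw_event_stream(1_000_000)))
print(f"stored {kept} of 1,000,000 raw events")
# Typically a few dozen out of a million: a ~10^4-10^5-fold reduction.
```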

  15. 3. Distributed, hierarchical, redundant data archives and analysis
      (CERN LHC's four tiers of data centers: 10 Tier 1 sites, 30 Tier 2 sites).
      4. Computational infrastructure is integral to experimental design.

  16. Hardware technology is important, but is not where we are stressed.
      Our single most important problem is the democratization of sequence analysis.
      Biology has become an informatics- and data-heavy science, but we lack a culture that supports pervasive computational analysis.
      Our weak links are computational infrastructure and the training and expertise of bench scientists.

  17. The good old days, one genome: cloning, mapping, sequencing (genome center) → assembly → international DNA databases

  18. [Diagram: data producers (genome centers; ENCODE centers, CEGS; departmental core sequencers; individual PI's labs) generate reference genome assemblies, comparative sequence, transcript sequence, ChIP-seq/CLIP-seq, resequencing (mutants, variants), and phenotype data, flowing into international databases, genome browsers, model organism databases, boutique databases, and supplementary material.]
      1. Evolving toward a tiered structure (like physics/astronomy)
      2. Must integrate lots of different data (unlike physics/astronomy)

  19. A return to the paper as the unit of advance, not the genome.
      The output of genome sequencing and assembly is simple, modular, and well-understood, including the associated quality metrics.
      This meant we could shoot pre-publication data into the databases and it was useful.
      Now that next-gen sequencing is a multipurpose digital assay tool, the details of methods, experimental design, and analysis all matter again.
      We've been calling this information "metadata", as if it were merely a database format issue to solve with XML; it is not.
      It's the information in a properly written paper.

  20. Democratization means: for an individual PI's lab to generate reliable/reusable datasets, integrate them with other datasets, conduct large-scale computational analyses, and write great papers, with results that can in turn be integrated with others, those individual labs need:
      • good software for mapping/assembling sequence
      • access to reliable, modular, well-annotated datasets
      • good software for integrated data analysis
      • efficient means of sharing/distributing their datasets

  21. Bottlenecks (challenges, opportunities):
      • Availability of good software.
      • Availability of other datasets in a form that can be most readily integrated into analysis workflows.
      • Computing infrastructure to do the work.

  22. The main challenge with software:
      Software and database infrastructure requires engineering discipline as well as science, but our culture values science, not engineering.
      The result: a software literature full of good ideas that don't get fully baked; tools that work in one place but aren't portable.
      The commercialization path largely hasn't worked. Why? Market too small? Too dynamic? Poisoned by open source?
      Commercialization isn't a complete answer anyway: the tools themselves are research, and require open publication.

  23. Suggested approach to better software:
      There is currently little support niche for the engineering of robust research software in biology (exceptions include NCI caBIG and NCBI).
      "Centers of excellence" in software engineering could be established to harden/productize research tools while they're still in the R&D phase: reward engineering for its own sake (compare Tech D funding at genome centers).
      Encourage commercialization of stable tools once they've left the R&D phase: SBIR mechanisms.
      Compare Road Map NCBCs: http://www.ncbcs.org

  24. One desired outcome: earlier, more widespread adoption of analysis best practices (no more using BLAST to map short reads); more efficient use of time and computational resources; less big iron and less global warming required.

  25. The main challenges with datasets:
      • overly reliant on monolithic, overly centralized international databases
      • versioning: instability of coordinate systems interferes with data integration
      • poor ability to improve annotation and quality of archived data

  26. An aggregated monolithic database makes sense if you're going to search it all at once.
      Historically, we think in terms of the sequence databases and homology searches.
      Literature/text search also makes sense (Google).
      But does a monolithic archive make sense for all data? For example: is the Short Read Archive useful?

  27. An approach for better datasets: modularity.
      Do one thing well; define standards for input/output so tools can be chained together in powerful combinations.
      Akin to the CERN/LHC tiered structure, where each tier digests data from the previous tier, adding new information while compressing what came before -- but modularity rather than tiers, because our data isn't a hierarchical single experiment like the LHC.
      Example: I really don't want the raw short reads from your ChIP-seq experiment; I want the histogram of them mapped to a reference genome, with defined methods and reliability measures.
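
To make the ChIP-seq example concrete, here is a minimal sketch of reducing mapped reads to a per-base coverage histogram over a reference interval; the BED-like input format and the file name are assumptions for illustration, not any particular tool's output:

```python
# Minimal sketch of "send me the histogram, not the raw reads": collapse mapped
# reads into per-base coverage over a reference interval. Assumes a tab-delimited
# chrom/start/end record per mapped read (BED-like, 0-based, half-open).

from collections import Counter

def coverage_histogram(mapped_reads_path, chrom, start, end):
    """Return per-base read coverage for chrom:start-end (0-based, half-open)."""
    coverage = [0] * (end - start)
    with open(mapped_reads_path) as fh:
        for line in fh:
            c, s, e = line.split("\t")[:3]
            if c != chrom:
                continue
            s, e = int(s), int(e)
            # clip the read to the requested window, then add 1 to each covered base
            for pos in range(max(s, start), min(e, end)):
                coverage[pos - start] += 1
    return coverage

if __name__ == "__main__":
    # "mapped_reads.bed" is a hypothetical file of mapped read positions.
    cov = coverage_histogram("mapped_reads.bed", "chr1", 1_000_000, 1_001_000)
    print(Counter(cov).most_common(5))   # the most common coverage depths
```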

  28. The lowest level of modularity is supplementary material.
      Supplementary material should be electronic datasets in standard exchange formats suitable for integration with other data (not an unreviewed, wordy alternative version of the same paper written to circumvent page limits).
      Not an NHGRI problem; a community problem requiring consciousness-raising and commitment at the journals.
      R. Gentleman, "Reproducible research: a case study." Stat Appl Genet Mol Biol 4: Article 2 (2005).
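
As a small illustration of what integration-ready supplementary material could look like, this sketch writes made-up peak calls as records in a widely used exchange format (BED) rather than as a prose table; the file name and the peaks themselves are placeholders:

```python
# Sketch of integration-ready supplementary material: emit results as records
# in a standard exchange format (BED) that downstream tools can consume directly.
# The peak calls and output file name below are made-up placeholders.

peaks = [
    ("chr1", 1_000_150, 1_000_400, "peak_1", 87),
    ("chr2",   500_200,   500_650, "peak_2", 42),
]

with open("supplementary_peaks.bed", "w") as out:
    for chrom, start, end, name, score in peaks:
        # BED columns: chrom, start (0-based), end (half-open), name, score
        out.write(f"{chrom}\t{start}\t{end}\t{name}\t{score}\n")
```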

  29. [Diagram: consider the fate of a coding gene annotation. It propagates both bottom-up and top-down through the international DNA databases, the international protein databases, and the model organism databases, fed by integration-ready data from supplementary material.]

  30. The main challenge with computing infrastructure:
      • Efficient large-scale analysis and data require data centers.
      • Data centers exhibit strong economies of scale, due to load balancing, space, cooling, power, and staffing.
      • Most individual labs cannot justify the cost of an efficient data center, nor can they keep it loaded.
      • NIH traditionally funds at the individual-lab level.
      • Individual labs are wasting money on subscaled computing.

  31. Example: the Janelia Farm data center circa 2006:
      • 528 nodes (1056 cores): 480 w/ 4 GB RAM, 40 w/ 8 GB RAM, 8 w/ 64 GB RAM
      • 1 gigabit to each node; 10 gigabit between racks
      • EMC DMX-3 + 8 EMC Celerra file servers; MPFSi (parallel NFS)
      • 200 TB disk; 1 PB offline backup
      • crucially: the entire data center is accessible from our desktops (no transfer lag in/out)
      • 2 full-time staff (including one demigod); $3M capital expense, recurring every 3 years; serves ~40 labs with widely mixed needs

  32. Approaches to computational infrastructure:
      • Enable department- or institute-level data centers
        (NCRR? However, this requires a plan for the 3-year technology refresh rate; more a consumable than a capital expense.)
      • Web services ("SOA", service-oriented architecture)
        For certain well-defined computational tasks, a remote server can process a query and return a formatted answer.
        Includes annotation/integration problems: DAS, for example.
      • Cloud computing
        For arbitrary computational tasks, you can create a virtual machine image, send it to a remote server, and have it execute there.
        "Move the compute to the data": large datasets can be hosted.
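
A minimal sketch of the web-services idea: a client asks a remote annotation server for the features in a genomic region and parses the structured reply. The server URL and the exact XML layout below are assumptions modeled loosely on a DAS-style "features" query, not a guaranteed match to any real server:

```python
# Sketch of the web-services ("SOA") idea: query a remote annotation server for
# the features in a region and get a structured answer back. The endpoint URL
# and XML element names are hypothetical, DAS-style placeholders.

import urllib.request
import xml.etree.ElementTree as ET

def fetch_features(server, segment, start, end):
    """Query a DAS-style features endpoint; return (type, start, end) tuples."""
    url = f"{server}/features?segment={segment}:{start},{end}"
    with urllib.request.urlopen(url) as response:
        tree = ET.parse(response)
    return [
        (feat.findtext("TYPE"), feat.findtext("START"), feat.findtext("END"))
        for feat in tree.iter("FEATURE")
    ]

if __name__ == "__main__":
    # Hypothetical server name; a real DAS source documents its own URL scheme.
    for feature in fetch_features("http://example.org/das/hg18", "chr7", 100000, 200000):
        print(feature)
```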

  33. Recommendations:
      1. Develop "centers of excellence" for software engineering.
      2. Modularize the organization of databases for key genomic resources; reduce reliance on monolithic centralized archives: think tiers, except not in a hierarchy.
      3. Strengthen that modularity all the way down to the level of supplementary material in publications: reproducible methods, integration-ready results.
      4. Plan for hardware infrastructure at the department and institute level.
      5. Catalyze development of web services and cloud computing resources, especially on hosted large datasets.
      6. Engage resources outside traditional biology: create "grand challenges" attractive to the high-performance computing community.
