
Developing Cyberinfrastructure for Data-Oriented Science and Engineering

Dr. Francine Berman

Director, San Diego Supercomputer Center

Professor and High Performance Computing Endowed Chair, UC San Diego

The Digital World

[Slide graphic: the digital world spans Education, Entertainment, Commerce, and Information]
Research, Education, and Data

Data collections span the disciplines, from the Life Sciences to the Arts and Humanities:
  • Japanese Art Images – 70.6 GB
  • TeraBridge – 800 GB
  • NVO – 100+ TB
  • SCEC – 153 TB
  • Projected LHC Data – 10 PB/year
Today’s Research and Education Applications Cover the Spectrum

  • Large-scale data is required as input, intermediate, and output for many modern HPC applications
  • Applications vary in how well they can perform in distributed mode (grid computing)
  • Researchers are increasingly dependent on both High Performance Computing (HPC) and highly reliable data

[Slide graphic: data-oriented applications plotted on axes of Data (more BYTES) vs. Compute (more FLOPS), ranging from Home, Lab, Campus, Desktop and World of Warcraft to PDB applications and Medium, Large, and Leadership HPC]

SDSC Mission

The mission of the San Diego Supercomputer Center (SDSC) is to empower communities in data-oriented research, education, and practice through the innovation and provision of cyberinfrastructure.

Cyberinfrastructure = resources (computers, data storage, networks, scientific instruments, experts, etc.) + “glue” (integrating software, systems, and organizations).

SDSC Cyberinfrastructure
  • 2.4 PB Storage-area Network (SAN)
  • 25 PB StorageTek/IBM tape library
  • HPSS and SAM-QFS archival systems
  • DB2, Oracle, MySQL
  • Storage Resource Broker
  • Supporting servers: IBM 32-way p690s, 72-CPU SunFire 15K, etc.

Support for community data collections and databases

Data management, mining, analysis, and preservation


  • DataStar
    • 15.6 TFLOPS Power 4+ system
    • 7.125 TB total memory
    • Up to 4 GBps I/O to disk
    • 115 TB GPFS filesystem
  • Blue Gene Data
    • First academic IBM Blue Gene system
    • 17.1 TF
    • 1.5 TB total memory
    • 3 racks, each with 2,048 PowerPC processors and 128 I/O nodes
  • TeraGrid Cluster
    • 524 Itanium2 IA-64 processors
    • 2 TB total memory
    • Also 16 2-way data I/O nodes


  • User Services
  • Application/Community Collaborations
  • Education and Training
  • SDSC Synthesis Center
  • Data-oriented Community SW, toolkits, portals, codes
Data and Simulations – TeraShake

Major Earthquakes on the San Andreas Fault, 1680–present
  • Researchers use geological, historical, and environmental data to simulate massive earthquakes
  • These simulations are critical for understanding seismic movement and assessing potential impact
  • Simulation results provide new scientific information enabling better
    • Estimation of seismic risk
    • Emergency preparation, response, and planning
    • Design of the next generation of earthquake-resistant structures
  • Results provide information that can help save many lives and billions in economic losses

[Slide graphic: map of historical ruptures on the fault, including events of M 7.8, M 7.8, and M 7.7]

How dangerous is the San Andreas Fault?

TeraShake Visualization

  • Simulation of a magnitude 7.7 earthquake on the southern San Andreas Fault
    • Physics-based dynamic source model – simulation of a mesh of 1.8 billion cubes with spatial resolution of 200 m
    • Simulated the first minutes of a magnitude 7.7 earthquake: 22,728 time steps of 0.011 second each
    • Simulation generates 45+ TB of data
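The mesh figures above can be sanity-checked with quick arithmetic. A minimal sketch, assuming the published TeraShake simulation box of 600 km × 300 km × 80 km (the domain size is an assumption here; it is not stated on the slide):

```python
# Back-of-envelope check of the TeraShake mesh figures.
# The 600 km x 300 km x 80 km domain is an assumed value (the published
# TeraShake box), not stated on the slide itself.
dx = 200                      # spatial resolution, meters
nx = 600_000 // dx            # 3000 cells along strike
ny = 300_000 // dx            # 1500 cells across
nz = 80_000 // dx             # 400 cells in depth
cells = nx * ny * nz
print(cells)                  # 1800000000 -- the 1.8 billion cubes

simulated = 22_728 * 0.011    # time steps x step size
print(round(simulated))       # ~250 seconds of simulated shaking
```

Note that 22,728 steps × 0.011 s works out to roughly 250 s, a bit over four minutes of simulated motion.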

Project leadership:

Tom Jordan (SCEC), Bernard Minster (SIO), Reagan Moore (SDSC), Carl Kesselman (ISI)

TeraShake Data Choreography

Resources must support a complicated orchestration of computation and data movement:
  • 240 processors on SDSC DataStar for 5 days, using 1 TB of main memory
  • Continuous I/O at 2 GB/sec to the parallel file system
  • 47 TB of output data for 1.8 billion grid points
  • 10–20 TB of data archived per day
  • “Fat nodes” with 256 GB of memory for pre-processing and post-run visualization
  • Data parking of 100s of TBs for many months

Finer-resolution simulations require even more resources; TeraShake is being scaled to run on petascale architectures.
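The I/O figures above imply concrete time and volume budgets. A rough sketch using only the slide's numbers (decimal units assumed):

```python
# Rough I/O timing implied by the slide's numbers (decimal TB/GB assumed).
output_tb = 47                      # total output, TB
io_rate_gb_s = 2                    # sustained I/O, GB/sec
write_hours = output_tb * 1000 / io_rate_gb_s / 3600
print(round(write_hours, 1))        # ~6.5 hours of pure writing for 47 TB

# Archiving at 10-20 TB/day over the 5-day run moves 50-100 TB per run,
# consistent with "data parking of 100s of TBs" as runs accumulate.
archived_tb = (10 * 5, 20 * 5)
print(archived_tb)                  # (50, 100)
```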

“I have desired to see a large earthquake simulation for over a decade. This dream has been accomplished.” 

Bernard Minster, Scripps Institution of Oceanography

TeraShake at Petascale – A qualitative difference in prediction accuracy creates even greater data infrastructure demands
  • Petascale platform will allow much higher resolution for very accurate prediction of ground motion

Estimates courtesy of the Southern California Earthquake Center

Data as a Driver – the Protein Data Bank: A Resource for the Global Biology Community

The Protein Data Bank
  • Largest repository on the planet for structural information about proteins
  • Provides free worldwide public access 24/7 to accurate protein data
  • PDB maintained by the Worldwide PDB, administered by the Research Collaboratory for Structural Bioinformatics (RCSB), directed by Helen Berman

Each structure costs roughly $200K to generate; the 2006 holdings will have cost roughly $80B in research investment.

January 2007 Molecule of the Month: Importins – a complex of 3 proteins which aids in protein synthesis by ferrying molecules back and forth between the inside and the outside of the nucleus through tube-shaped nuclear pores.

[Slide chart: Growth of Yearly/Total Structures in PDB – roughly 500 structures or fewer per year from 1976–1990; in 2006, >5,000 structures in one year and >36,000 total structures]
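The per-structure cost squares with the annual research-investment estimate cited later in this talk; a quick consistency check:

```python
# Consistency check: ~$200K per structure times the >5,000 structures
# deposited in a single recent year is about $1B of research investment,
# matching H. Berman's 2005 estimate quoted elsewhere in this talk.
cost_per_structure = 200_000       # dollars (slide figure)
structures_per_year = 5_000        # slide figure for 2006
print(cost_per_structure * structures_per_year)  # 1000000000 -> $1B/year
```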

How Does the PDB Work?
  • PDB is accessible over the Internet and serves 10,000 users a day (> 200,000 hits)
  • H. Berman estimated that in 2005, more than $1B of research funding was spent to generate the data that were collected, curated, and distributed by the PDB
  • Data are collected, annotated, and validated at one of 3 worldwide PDB sites (Rutgers in the US)
  • Infrastructure required: 20 highly trained personnel and significant computational, storage, and networking capabilities
  • Infrastructure: PDB portal served by a cluster at SDSC. The PDB system is designed with multiple failover capabilities to ensure 24/7 access and 99.99% uptime. PDB infrastructure requires 20 TB of storage at SDSC

[Slide diagram: new queries and new tools enter through the WWW user interface, FTP tree, and CORBA interface; query results are served to remote applications from the SDSC machine room]
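The quoted service levels translate into concrete operational numbers; a minimal sketch using only the figures on this slide:

```python
# What the quoted service levels mean in practice.
uptime = 0.9999
downtime_min_per_year = (1 - uptime) * 365.25 * 24 * 60
print(round(downtime_min_per_year, 1))   # ~52.6 minutes/year of allowed downtime

hits_per_day = 200_000
print(round(hits_per_day / 86_400, 1))   # ~2.3 hits/second average load
```

A 99.99% ("four nines") target leaves under an hour of downtime per year, which is why the slide emphasizes multiple failover capabilities.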
Using the PDB
  • PDB provides a critical building block for research, education, and practice in the Biosciences
  • PDB tools include
    • Data Extraction and Preparation
    • Data Format Conversion
    • Data Validation
    • Dictionary and Data Management
    • Tools supporting the OMG CORBA Standard for Macromolecular Structure Data, etc.

[Slide diagram: tools to access and to federate data, spanning Cell Biology (PDB level) to Medicinal Chemistry]
Storage of research data in SDSC’s archives shows a consistent increase in the need for capacity
  • Most of the data is supercomputer simulation output, but digital library collections and experimental data are contributing to growth rates
  • Consistent exponential growth with a ~15-month doubling time drives planning and cost projections
  • Technology advancements help, but media cost per byte is not decreasing as quickly as storage demand is increasing

Information courtesy of Richard Moore
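The ~15-month doubling time noted above translates directly into planning numbers; a minimal sketch:

```python
# Growth implied by a ~15-month doubling time in archived bytes.
doubling_months = 15
per_year = 2 ** (12 / doubling_months)
print(round(per_year, 2))            # ~1.74x growth per year

five_year = 2 ** (60 / doubling_months)
print(five_year)                     # 16.0 -- capacity needed in 5 years
```

A 16× capacity requirement every five years is why the slide stresses that falling media costs alone cannot absorb the growth.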

National Data Repository: SDSC DataCentral
  • First broad program of its kind to support national research and community data collections and databases
    • “Data allocations” provided on SDSC resources
  • Data collection and database hosting
    • Batch oriented access, collection management services
  • Comprehensive data resources: disk, tape, databases, SRB, web services, tools, 24/7 operations, collection specialists, etc.

Web-based portal access

Services, Tools, and Technologies Key for Data-related Capability

Data Systems

Data Services
  • Data migration/upload, usage and support (SRB)
  • Database selection and schema design (Oracle, DB2, MySQL)
  • Database application tuning and optimization
  • Portal creation and collection publication
  • Data analysis (e.g. Matlab) and mining (e.g. WEKA)

Data-oriented Toolkits and Tools
  • Biology Workbench
  • Montage (astronomy mosaicking)
  • Kepler (workflow management)
  • Vista Volume renderer (visualization), etc.
Increasing Need to Sustain Digital Data for the Foreseeable Future

[Slide graphic: stakeholders include the Public Sector (digital State and Federal records), the Private Sector (the Entertainment Industry), UCSD Libraries, and Researchers and Educators]

What data is the most valuable?
  • Key criteria
    • Irreplaceable
    • Longitudinal
    • Used by many
    • Expensive
    • Needed in the future
    • Culturally or scientifically meaningful

Examples: reference collections, Federal records, data needing rescue, and other irreplaceable data.

Key Challenges for Digital Preservation
  • What should we preserve?
    • What materials must be “rescued”?
    • How to plan for preservation of materials by design?
  • How should we preserve it?
    • Formats
    • Storage media
    • Stewardship – who is responsible, and for how long?
  • Who should pay for preservation?
    • The content generators?
    • The government?
    • The users?
  • Who should have access?

Print media provides easy access for long periods of time but is hard to data-mine

Digital media is easier to data-mine but requires managing the evolution of media and planning resources over time

Preservation and Risk

Less risk means more replicas, more resources, more people
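The replica/risk tradeoff can be made concrete with a simple independence model. The 1% annual loss probability below is an illustrative assumption, not an SDSC figure:

```python
# Illustrative model of why more replicas mean less risk: if each copy has
# an independent annual loss probability p (the 1% here is a made-up
# illustration, not an SDSC figure), losing ALL copies requires every one
# to fail in the same year.
p = 0.01
for copies in (1, 2, 3):
    print(copies, f"{p ** copies:.0e}")   # 1 1e-02 / 2 1e-04 / 3 1e-06
```

Under this (optimistic) independence assumption, three copies turn a 1-in-100 annual risk into 1-in-a-million, which is the rationale for the three-copy scheme on the next slides.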

Chronopolis™: An Integrated Approach to Long-term Digital Preservation

SDSC, the UCSD Libraries, NCAR, UMd, NARA, the Library of Congress, and NSF are working together on long-term preservation of digital collections

  • Chronopolis™ provides a comprehensive approach to infrastructure for long-term preservation integrating
    • Collection ingestion
    • Access and Services
    • Research and development for new functionality and adaptation to evolving technologies
    • Business model, data policies, and management issues critical to success of the infrastructure
[Slide diagram: Chronopolis™ Federation architecture – consortium members (e.g. SDSC, UMd, NCAR) federated as Chronopolis sites]
Chronopolis™ – Replication and Distribution
  • 3 replicas of valuable collections considered reasonable mitigation for risk of data loss
  • Chronopolis™ Consortium will store 3 copies of preservation collections:
    • “Bright copy”– Chronopolis ™ site supports ingestion, collection management, user access
    • “Dim copy”– Chronopolis ™ site supports remote replica of bright copy and supports user access
    • “Dark copy”– Chronopolis ™ site supports reference copy that may be used for disaster recovery but no user access
  • Each site may play different roles for different collections
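The bright/dim/dark scheme above can be sketched as a placement table. Site and collection names here are hypothetical placeholders, not actual Chronopolis assignments:

```python
# Sketch of the bright/dim/dark replication scheme described above.
# Site names and collection IDs are hypothetical placeholders.
ROLES = {"bright", "dim", "dark"}   # bright: ingest + manage + access;
                                    # dim: remote replica + access;
                                    # dark: disaster-recovery copy, no user access

# collection -> {site: role}; each site plays different roles per collection
placement = {
    "C1": {"site_a": "bright", "site_b": "dim", "site_c": "dark"},
    "C2": {"site_b": "bright", "site_c": "dim", "site_a": "dark"},
}

def valid(placement):
    """Each collection must hold exactly one bright, one dim, and one dark copy."""
    return all(set(sites.values()) == ROLES for sites in placement.values())

print(valid(placement))   # True
```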

[Slide diagram: each site holds a mix of copies – e.g. the bright copy of C1 and dim copy of C2 at one site, the dim copy of C1 and bright copy of C2 at another, and dark copies of C1 and C2 at a third]

SDSC Playing a Leadership Role in Development of a National Digital Data Framework
  • SDSC working with the Library of Congress on Distributed Data Stewardship (e.g. the Prokudin-Gorskii Photographs)
  • SDSC storing national collections for the National Archives and Records Administration
  • SDSC storing genetic research data for the City of Hope
  • SDSC developing data visualizations for UCSD Moores Cancer Center

[Slide chart: collections plotted by cost to store vs. value of content]

Community Cyberinfrastructure at SDSC
  • SDSC Summer Institutes, Training, Outreach
  • DataCentral data repository
  • Allocated HPC resources (via TeraGrid)
  • Community CI-oriented R&D projects
  • SW, visualization and other services

Thank You