A Grand Challenge for the Information Age

Dr. Natasha Balac, San Diego Supercomputer Center, UC San Diego

SDSC overview Production Systems User Services and Development • National NSF facility since 1985 and UCSD Organized Research Unit • ~400 staff & students • Core & TeraGrid programs provide high-end computational and storage resources to US researchers, based on a national peer-review proposal/allocation process • Supports many programs including • “Core” cyberinfrastructure program • National TeraGrid program • Protein Data Bank (PDB) • Biomedical Informatics Research Network (BIRN) • Optiputer • Network for Earthquake Engineering Simulation (NEES) • Geosciences Network (GEON) • Alliance for Cell Signaling (AfCS) • High Performance Wireless Research and Education Network (HPWREN) • National Virtual Observatory Technology Research and Development Science Research and Development Data and Knowledge Systems

SDSC in Brief SDSC CAIDA image from the 2007 “Design and the Elastic Mind” Exhibit at the NY Museum of Modern Art • Funding • In 2007, SDSC was home to over 110 research projects and received research funding in excess of $45M • ~85% funding from NSF; also NIH, DOE, NARA, LC and other agencies/industry • Facilities • SDSC’s data center is the largest academic data center in the world with 36+ PB capacity • SDSC hosted over 90 resident events, workshops, and summer institutes in our facilities in 2007. • SDSC’s focus on increased efficiency reduced our utility usage by 18% • SDSC’s new building is LEED Silver-equivalent, the first on the UCSD campus. • Research • SDSC hosted over 100 separate community digital data sets and collections for sponsors such as NSF, NIH, and the Library of Congress. • SDSC staff and collaborators published scholarly articles in a spectrum of journals including Cell, Science, Nature, Journal of Seismology, Journal of the American Chemical Society, Journal of Medicinal Chemistry, Nano Letters, PLoS Computational Biology, and many others

Cyberinfrastructure … the organized aggregate of information technologies coordinated to address problems in science and society “If infrastructure is required for an industrial economy, then we could say that cyberinfrastructure is required for a knowledge economy.” NSF 2003 Final Report of the Blue Ribbon Advisory Panel on Cyberinfrastructure [“Atkins Report”]

Cloud Platforms and Virtualization SDSC’s Mission: To transform science and society through Cyberinfrastructure Health Sciences Drug Design Earthquake dynamics; structural integrity of buildings, levees Biofuels, Renewable energy Climate modeling: Global warming Data management and analysis Web, DB, Portals High Performance computing IT Training Disaster Response SDSC Cyberinfrastructure Visualization Green IT Scientific Instruments Terascale and Petascale Computers Cyberinfrastructure Services Performance optimization Data preservation and Life cycle management Storage Data Storage

CI Innovation Ongoing collaborations in cloud computing, power efficiency, virtualization, disaster response, drug design, etc. accelerating research, education, and practice Data Cyberinfrastructure New data resource will complement SDSC’s green data center – one of the largest academic data centers in the world Computing Cyberinfrastructure New computers will provide a unique resource for massive data analysis, and provide the seed for growing a large-scale, professionally maintained computational platform at UCSD SDSC Initiatives Harnessing the 2 Most Significant Trends in Information Technology Unlimited Computation Unlimited Data

The Fundamental Driver of the Information Age is Digital Data Education Entertainment Shopping Health Information Business

Data at multiple scales in the Biosciences Data from multiple sources in the Geosciences Data Access and Use Data Integration Anatomy Disciplinary Databases Users Physiology Organisms Organs Cell Biology Cells Proteomics Organelles Genomics Bio-polymers Medicinal Chemistry Atoms Digital Data Critical for Research and Education Where should we drill for oil? What is the Impact of Global Warming? How are the continents shifting? Data Integration Complex “multiple-worlds” mediation What genes are associated with cancer? What parts of the brain are responsible for Alzheimer’s? Geo-Physical Geo-Chronologic Geo-Chemical Foliation Map Geologic Map

Today’s Presentation • Data Cyberinfrastructure Today – Designing and developing infrastructure to enable today’s data-oriented applications • Challenges in Building and Delivering Capable Data Infrastructure • Sustainable Digital Preservation – Grand Challenge for the Information Age

Data Cyberinfrastructure Today – Designing and Developing Infrastructure for Today’s Data-Oriented Applications

Today’s Data-oriented Applications Span the Spectrum DATA (more BYTES) Designing Infrastructure for Data: Data and High Performance Computing Data and Grids Data and Cyberinfrastructure Services Data-intensive and Compute-intensive HPC applications Data-intensive applications Home, Lab, Campus, Desktop Applications Compute-intensive HPC Applications Data Grid Applications COMPUTE (more FLOPS) NETWORK (more BW) Grid Applications

DATA (more BYTES) Data and High Performance Computing • For many applications, development of “balanced systems” needed to support applications which are both data-intensive and compute-intensive. Codes for which • Grid platforms not a strong option • Data must be local to computation • I/O rates exceed WAN capabilities • Continuous and frequent I/O is latency intolerant • Scalability is key • Need high-bandwidth and large-capacity local parallel file systems, archival storage Data-intensive and Compute-intensive HPC applications Data-intensive applications Data-intensive applications Compute-intensive HPC Applications Compute-intensive applications COMPUTE (more FLOPS)

Earthquake Simulation at Petascale – better prediction accuracy creates greater data-intensive demands Information courtesy of the Southern California Earthquake Center

Data and Grids • Data applications were some of the first applications which • required Grid environments • could naturally tolerate longer latencies • Grid model supports key data application profiles • Compute at site A with data from site B • Store Data Collection at site A with copies at sites B and C • Operate instrument at site A, move data to site B for storage, post-processing, etc. CERN data providing key driver for grid technologies

Data Services Key for TeraGrid Science Gateways • Science Gateways provide common application interface for science communities on TeraGrid • Data services key for Gateway communities • Analysis • Visualization • Management • Remote access, etc. NVO LEAD GridChem Information and images courtesy of Nancy Wilkins-Diehr

Unifying Data over the Grid – the TeraGrid GPFS WAN Effort • User wish list • Unlimited data capacity (everyone’s aggregate storage almost looks like this) • Transparent, high speed access anywhere on the Grid • Automatic archiving and retrieval • No Latency • TeraGrid GPFS-WAN effort focuses on providing “infinite” (SDSC) storage over the grid • Looks like local disk to grid sites • Uses automatic migration with a large cache to keep files always “online” and accessible • Data automatically archived without user intervention Information courtesy of Phil Andrews

Data Grids • SRB - Storage Resource Broker • Persistent naming of distributed data • Management of data stored in multiple types of storage systems • Organization of data as a shared collection with descriptive metadata, access controls, audit trails • iRODS - integrated Rule-Oriented Data System • Rules control execution of remote micro-services • Manage persistent state information • Validate assertions about collection • Automate execution of management policies Slide adapted from presentation by Dr. Reagan Moore, UCSD/SDSC

iRODS: integrated Rule-Oriented Data System http://irods.sdsc.edu • Organizes distributed data into shared collections, while automating the application of management policies • Each policy is expressed as a set of rules that control the execution of a set of micro-services • Persistent state information is maintained to track the results of all operations Slide adapted from presentation by Dr. Reagan Moore, UCSD/SDSC
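The relationship between policies, rules, micro-services, and persistent state can be sketched in a few lines of Python. This is only an illustration of the idea on the slide, not the iRODS rule language or API: `Rule`, `RuleEngine`, `compute_checksum`, `replicate`, and the `"ingest"` event name are all hypothetical, and the in-memory `state_log` stands in for a persistent metadata catalog.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A micro-service is modeled as a function that takes a data-object record
# (a plain dict here) and returns a short result string.
MicroService = Callable[[dict], str]


@dataclass
class Rule:
    event: str                          # event that triggers the rule
    micro_services: List[MicroService]  # executed in order when the rule fires


@dataclass
class RuleEngine:
    rules: List[Rule] = field(default_factory=list)
    # Stand-in for persistent state: one entry per micro-service execution.
    state_log: List[Dict[str, str]] = field(default_factory=list)

    def fire(self, event: str, obj: dict) -> None:
        """Run the micro-services of every rule registered for this event."""
        for rule in self.rules:
            if rule.event != event:
                continue
            for service in rule.micro_services:
                result = service(obj)
                self.state_log.append(
                    {"object": obj["name"], "service": service.__name__, "result": result}
                )


def compute_checksum(obj: dict) -> str:
    # Hypothetical micro-service: record a SHA-256 checksum on the object.
    obj["checksum"] = hashlib.sha256(obj["data"]).hexdigest()
    return "checksum stored"


def replicate(obj: dict) -> str:
    # Hypothetical micro-service: only counts replicas, no data is copied.
    obj["replicas"] = obj.get("replicas", 1) + 1
    return f"replicas={obj['replicas']}"


# An "ingest" policy expressed as one rule with two micro-services.
engine = RuleEngine(rules=[Rule("ingest", [compute_checksum, replicate])])
record = {"name": "survey.dat", "data": b"example bytes"}
engine.fire("ingest", record)
```

After the call, `engine.state_log` holds one entry per micro-service run, which is the kind of record a later rule could inspect to validate assertions about the collection.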

iRODS (integrated Rule-Oriented Data System) architecture diagram. Components shown: Client Interface, Admin Interface, Rule Invoker, Service Manager, Rule Engine, Rule Base, Current State, Confs, Rule Modifier Module, Config Modifier Module, Metadata Modifier Module, Consistency Check Modules, Micro Service Modules, Resources, Resource-based Services, Metadata-based Services, Metadata Persistent Repository Slide adapted from presentation by Dr. Reagan Moore, UCSD/SDSC

Data Management Applications (What do they have in common?) • Data grids • Share data - organize distributed data as a collection • Digital libraries • Publish data - support browsing and discovery • Persistent archives • Preserve data - manage technology evolution • Real-time sensor systems • Federate sensor data - integrate across sensor streams • Workflow systems • Analyze data - integrate client- & server-side workflows Slide adapted from presentation by Dr. Reagan Moore, UCSD/SDSC

iRODS Approach • To meet the diverse requirements, the architecture must: • Be highly modular • Be highly extensible • Provide infrastructure independence • Enforce management policies • Provide scalability mechanisms • Manipulate structured information • Enable community standards Slide adapted from presentation by Dr. Reagan Moore, UCSD/SDSC

Data Management Challenges • Authenticity • Manage descriptive metadata for each file • Manage access controls • Manage consistent updates to administrative metadata • Integrity • Manage checksums • Replicate files • Synchronize replicas • Federate data grids • Infrastructure independence • Manage collection properties • Manage interactions with storage systems • Manage distributed data Slide adapted from presentation by Dr. Reagan Moore, UCSD/SDSC

Types of Risk • Media failure • Replicate data onto multiple media • Vendor specific systemic errors • Replicate data onto multiple vendor products • Operational error • Replicate data onto a second administrative domain • Natural disaster • Replicate data to a geographically remote site • Malicious user • Replicate data to a deep archive

How Many Replicas • Three sites minimize risk • Primary site • Supports interactive user access to data • Secondary site • Supports interactive user access when first site is down • Provides 2nd media copy, located at a remote site, uses different vendor product, independent administrative procedures • Deep archive • Provides 3rd media copy, staging environment for data ingestion, no user access
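The pairing of risks with replication strategies, and the resulting three-site layout, can be checked mechanically. The sketch below is an assumption-laden illustration: the `Copy` fields, the site, vendor, and domain labels, and the rule that "two distinct values" is enough to mitigate a risk are all hypothetical simplifications of the slides.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Copy:
    site: str           # geographic location of the copy
    vendor: str         # storage vendor product holding the copy
    admin_domain: str   # organization and procedures operating the copy
    user_access: bool   # False models a deep archive with no user access


def risks_covered(copies: List[Copy]) -> Dict[str, bool]:
    """Map each risk named on the slide to whether this set of copies mitigates it."""

    def distinct(attr: str) -> bool:
        return len({getattr(c, attr) for c in copies}) >= 2

    return {
        "media failure": len(copies) >= 2,
        "vendor-specific systemic error": distinct("vendor"),
        "operational error": distinct("admin_domain"),
        "natural disaster": distinct("site"),
        "malicious user": any(not c.user_access for c in copies),
    }


# Hypothetical three-site plan: primary, secondary, deep archive.
plan = [
    Copy("site-A", "vendor-X", "ops-A", True),
    Copy("site-B", "vendor-Y", "ops-B", True),
    Copy("site-C", "vendor-Y", "ops-C", False),
]
coverage = risks_covered(plan)
```

With these assumptions the three-copy plan covers all five risks, while a single primary copy covers none of them.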

Data Reliability • Manage checksums • Verify integrity • Rule to verify checksums • Synchronize replicas • Verify consistency between metadata and records in vault • Rule to verify presence of required metadata • Federate data grids • Synchronize metadata catalogs
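Checksum verification and replica synchronization can be sketched with Python's standard `hashlib`. This is a minimal illustration, not an iRODS rule: replicas are modeled as in-memory byte strings, and the function names, replica labels, and repair strategy (copy from the first replica that still matches the recorded checksum) are assumptions.

```python
import hashlib
from typing import Dict, List


def sha256_hex(data: bytes) -> str:
    """Checksum recorded in the catalog when the object is ingested."""
    return hashlib.sha256(data).hexdigest()


def verify_replicas(catalog_checksum: str, replicas: Dict[str, bytes]) -> List[str]:
    """Return the names of replicas whose content no longer matches the catalog."""
    return [name for name, data in replicas.items()
            if sha256_hex(data) != catalog_checksum]


def synchronize(catalog_checksum: str, replicas: Dict[str, bytes]) -> Dict[str, bytes]:
    """Return a replica set in which every copy equals a verified good copy."""
    good = [data for data in replicas.values()
            if sha256_hex(data) == catalog_checksum]
    if not good:
        raise ValueError("no replica matches the recorded checksum")
    return {name: good[0] for name in replicas}


original = b"observational record"
recorded = sha256_hex(original)
replicas = {
    "primary": original,
    "secondary": b"observational recorX",  # simulated corruption
    "archive": original,
}
damaged = verify_replicas(recorded, replicas)
repaired = synchronize(recorded, replicas)
```

Here `damaged` names only the corrupted secondary copy, and re-running the verification on `repaired` finds nothing to fix.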

Data Services – Beyond Storage to Use What services do users want? How can I combine my data with my colleague’s data? How should I organize my data? How do I make sure that my data will be there when I want it? What are the trends and what is the noise in my data? My data is confidential; how do I make sure that it is seen/used only by the right people? How should I display my data? How can I make my data accessible to my collaborators?

Services: Integrated Environment Key to Usability analysis modeling • Database selection and schema design • Portal creation and collection publication • Data analysis • Data mining • Data hosting • Preservation services • Domain-specific tools • Biology Workbench • Montage (astronomy mosaicking) • Kepler (Workflow management) • Data visualization • Data anonymization, etc. Integrated Infrastructure Data Access simulation visualization Data Manipulation Data Management File systems, Database systems, Collection Management, Data Integration, etc. computers instruments Data Storage Many Data Sources Sensor-nets

Data Hosting: SDSC DataCentral – A Comprehensive Facility for Research Data • Broad program to support research and community data collections and databases • DataCentral services include: • Public Data Collections and Database Hosting • Long-term storage and preservation (tape and disk) • Remote data management and access (SRB, iRODS, portals) • Data Analysis, Visualization and Data Mining • Professional, qualified 24/7 support PDB – 28 TB • DataCentral resources include • 3 PB On-line disk • 36 PB StorageTek tape library capacity • 540 TB Storage-area Network (SAN) • DB2, Oracle, MySQL • Storage Resource Broker, iRODS • GPFS-WAN with 800 TB Web-based portal access

Data Cyberinfrastructure at SDSC • Comprehensive data environment that incorporates access to the full spectrum of data-enabling resources • hosting, managing and publishing data in digital libraries • sharing data through the Web and data grids • creating, optimizing, porting large scale databases • data intensive computing with high bandwidth data movement • analyzing, visualizing, rendering and data mining large scale data • preservation of data in persistent archives • building collections, portals, ontologies, etc. • providing resources, services and expertise

Portals, Data Grids, WAN File Systems Data Intensive computing, High Bandwidth Data Movement Consulting, Support, SACs, Ontologies, Education Foster Sharing and Collaboration, Collection Management, Data Services Preservation, Digital Libraries, Offsite Backup, Chronopolis Data Analysis, Databases, Data Mining, Visualization, Rendering Data to Discovery

SDSC Data Infrastructure Resources • 3 PB+ On-line disk • 36 PB StorageTek tape library capacity • 540 TB Storage-area Network (SAN) • DB2, Oracle, MySQL • SAS, R, MATLAB, Mathematica • Storage Resource Broker • Wide area file system with 800 TB Petabyte-scale high-performance tape storage system High-performance SATA & SAN disk storage system

36 PB

DataCentral Allocated Collections include

Data Visualization is key SCEC Earthquake simulations Visualization of Cancer Tumors Prokudin-Gorskii historical images Information and images courtesy of Amit Chourasia, SCEC, Steve Cutchin, Moores Cancer Center, David Minor, U.S. Library of Congress

Building and Delivering Capable Data Cyberinfrastructure

Building Capable Data Cyberinfrastructure: Incorporating the “ilities” • Scalability • Interoperability • Reliability • Capability • Sustainability • Predictability • Accessibility • Responsibility • Accountability • …

Reliability • How can we maximize data reliability? • Replication, UPS systems, heterogeneity, etc. • How can we measure data reliability? • Network availability = 99.999% uptime (“5 nines”) • What is the equivalent number of “0’s” for data reliability? Reliability: What can go wrong Information courtesy of Reagan Moore
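The “5 nines” figure translates into a concrete downtime budget, which the short calculation below makes explicit. The function names are illustrative, and a year is taken as 365.25 days; the slide leaves open what the analogous measure for data reliability should be, so this sketch covers only the availability arithmetic.

```python
import math

MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes


def downtime_minutes_per_year(availability: float) -> float:
    """Expected unavailable minutes per year at a given availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def nines(availability: float) -> float:
    """Number of nines in an availability fraction, e.g. 0.999 gives about 3."""
    return -math.log10(1.0 - availability)


five_nines_downtime = downtime_minutes_per_year(0.99999)  # roughly 5.26 minutes
five_nines_count = nines(0.99999)                         # roughly 5
```

At 99.999% uptime the budget is a little over five minutes of downtime per year.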

Responsibility and Accountability • What are reasonable expectations between users and repositories? • What are reasonable expectations between federated partner repositories? • What are appropriate models for evaluating repositories? • What incentives promote good stewardship? What should happen if/when the system fails? • Who owns the data? • Who takes care of the data? • Who pays for the data? • Who can access the data?

Good Data Infrastructure Incurs Real Costs Capability Costs Capacity Costs • Reliability increased by up-to-date and robust hardware and software for • Replication (disk, tape, geographically) • Backups, updates, syncing • Audit trails • Verification through checksums, physical media, network transfers, copies, etc. • Data professionals needed to facilitate • Infrastructure maintenance • Long-term planning • Restoration and recovery • Access, analysis, preservation, and other services • Reporting, documentation, etc. • Most valuable data must be replicated • SDSC research collections have been doubling every 15 months • SDSC storage is 36 PB and counting. Data is from supercomputer simulations, digital library collections, etc. Information courtesy of Richard Moore
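A 15-month doubling time implies a specific compound growth rate, which matters for capacity cost planning. The sketch below works in relative units and assumes the doubling rate stays constant, which the slide does not claim; `projected_size` is an illustrative name.

```python
def projected_size(current: float, months: float, doubling_months: float = 15.0) -> float:
    """Size after `months`, assuming steady doubling every `doubling_months`."""
    return current * 2.0 ** (months / doubling_months)


# Growth of a collection of relative size 1.0 under the 15-month assumption.
annual_factor = projected_size(1.0, 12)      # about 1.74x per year
five_year_factor = projected_size(1.0, 60)   # 2**4 = 16x in five years
```

Under that assumption a collection grows about 74% per year and sixteenfold in five years.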

Economic Sustainability Relay Funding • Making Infinite Funding Finite • Difficult to support infrastructure for data preservation as an infinite, increasing mortgage • Creative partnerships help create sustainable economic models User fees, recharges Geisel Library at UCSD Consortium support Endowments Hybrid solutions

Preserving Digital Information Over the Long Term

How much Digital Data is there? SDSC HPSS tape archive = 36+ petabytes • 5 exabytes of digital information produced in 2003 • 161 exabytes of digital information produced in 2006 • 25% of the 2006 digital universe is born digital (digital pictures, keystrokes, phone calls, etc.) • 75% is replicated (emails forwarded, backed up transaction records, movies in DVD format) • 1 zettabyte aggregate digital information projected for 2010 iPod (up to 20K songs) = 80 GB 1 novel = 1 megabyte U.S. Library of Congress manages 295 TB of digital data, 230 TB of which is “born digital” Source: “The Expanding Digital Universe: A forecast of Worldwide Information Growth through 2010” IDC Whitepaper, March 2007
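The figures on this slide imply average annual growth factors that are easy to compute. The sketch assumes the 2003, 2006, and 2010 numbers are directly comparable and that 1 zettabyte is taken as 1,000 exabytes; both are assumptions made for the arithmetic, not statements from the source.

```python
def annual_growth_factor(start: float, end: float, years: int) -> float:
    """Constant yearly multiplier that takes `start` to `end` in `years` years."""
    return (end / start) ** (1.0 / years)


# Figures from the slide, in exabytes (1 ZB taken as 1000 EB).
growth_2003_2006 = annual_growth_factor(5.0, 161.0, 3)      # about 3.2x per year
growth_2006_2010 = annual_growth_factor(161.0, 1000.0, 4)   # about 1.6x per year
```

Read this way, the projection to 2010 assumes a much slower yearly multiplier than the jump between the 2003 and 2006 figures.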

How much Storage is there? • 2007 is the “crossover year” where the amount of digital information is greater than the amount of available storage • Given the projected rates of growth, we will never have enough space again for all digital information Source: “The Expanding Digital Universe: A forecast of Worldwide Information Growth through 2010” IDC Whitepaper, March 2007

Focus for Preservation: the “most valuable” data • What is “valuable”? • Community reference data collections (e.g. UniProt, PDB) • Irreplaceable collections • Official collections (e.g. census data, electronic federal records) • Collections which are very expensive to replicate (e.g. CERN data) • Longitudinal and historical data • and others … Value Cost Time

National, International Scale “Regional” Scale Local Scale The Data Pyramid A Framework for Digital Stewardship Repositories/Facilities • Preservation efforts should focus on collections deemed “most valuable” • Key issues: • What do we preserve? • How do we guard against data loss? • Who is responsible? • Who pays? Etc. Increasing Value Increasing Trust Increasing risk/responsibility Increasing stability Increasing infrastructure Digital Data Collections Reference, nationally important, and irreplaceable data collections National / International-scale data repositories, archives, and libraries. Key research and community data collections “Regional”-scale libraries and targeted data centers. Personal data collections Private repositories.

Digital Collections of Community Value National, International Scale “Regional” Scale Local Scale • Key techniques for preservation: replication, heterogeneous support The Data Pyramid

A Conceptual Model for Preservation Data Grids The Chronopolis Model • Geographically distributed preservation data grid that supports long-term management of, stewardship of, and access to digital collections • Implemented by developing and deploying a distributed data grid, and by supporting its human, policy, and technological infrastructure • Integrates targeted technology forecasting and migration to support long-term life-cycle management and preservation Distributed Production Preservation Environment Digital Information of Long-Term Value Technology Forecasting and Migration Administration, Policy, Outreach

Chronopolis Focus Areas and Demonstration Project Partners 2 Prototypes: National Demonstration Project Library of Congress Pilot Project Partners SDSC/UCSD U Maryland UCSD Libraries NCAR NARA Library of Congress NSF ICPSR Internet Archive NVO UCSD Libraries • Chronopolis R&D, Policy, and Infrastructure Focus areas: • Assessment of the needs of potential user communities and development of appropriate service models • Development of formal roles and responsibilities of providers, partners, users • Assessment and prototyping of best practices for bit preservation, authentication, metadata, etc. • Development of appropriate cost and risk models for long-term preservation • Development of appropriate success metrics to evaluate usefulness, reliability, and usability of infrastructure Information courtesy of Robert McDonald, David Minor, Ardys Kozbial

Chronopolis Federation architecture NCAR U Md SDSC Chronopolis Site National Demonstration Project – Large-scale Replication and Distribution • Focus on supporting multiple, geographically distributed copies of preservation collections: • “Bright copy”– Chronopolis site supports ingestion, collection management, user access • “Dim copy”– Chronopolis site supports remote replica of bright copy and supports user access • “Dark copy”– Chronopolis site supports reference copy that may be used for disaster recovery but no user access • Each site may play different roles for different collections Dim copy C1 Dark copy C1 Dark copy C2 Bright copy C2 Bright copy C1 Dim copy C2 • Demonstration collections included: • National Virtual Observatory (NVO) [1 TB Digital Palomar Observatory Sky Survey] • Copy of Interuniversity Consortium for Political and Social Research (ICPSR) data [1 TB Web-accessible Data] • NCAR Observational Data [3 TB of Observational and Re-Analysis Data]
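The bright, dim, and dark roles, and the fact that a site can hold different roles for different collections, map naturally onto a small data structure. In the sketch below the role semantics follow the slide, but the assignment of roles to SDSC, NCAR, and U Md for collections C1 and C2 is a hypothetical placement chosen for illustration; the original diagram's exact layout is not recoverable from the text.

```python
from enum import Enum
from typing import Dict, List


class Role(Enum):
    BRIGHT = "bright"  # ingestion, collection management, user access
    DIM = "dim"        # remote replica of the bright copy, user access
    DARK = "dark"      # reference copy for disaster recovery, no user access


USER_ACCESS_ROLES = {Role.BRIGHT, Role.DIM}

# Hypothetical placement: collection -> {site: role}.
placement: Dict[str, Dict[str, Role]] = {
    "C1": {"SDSC": Role.BRIGHT, "NCAR": Role.DIM, "U Md": Role.DARK},
    "C2": {"U Md": Role.BRIGHT, "SDSC": Role.DIM, "NCAR": Role.DARK},
}


def sites_with_user_access(collection: str) -> List[str]:
    """Sites where users may read the given collection."""
    return [site for site, role in placement[collection].items()
            if role in USER_ACCESS_ROLES]


def is_well_formed(roles: Dict[str, Role]) -> bool:
    """A collection needs exactly one bright, one dim, and one dark copy."""
    held = list(roles.values())
    return all(held.count(role) == 1 for role in Role)


c1_access = sorted(sites_with_user_access("C1"))
```

In this placement SDSC is the bright site for C1 and the dim site for C2, illustrating that roles belong to a site and collection pair rather than to a site alone.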

SDSC/UCSD Libraries Pilot Project with U.S. Library of Congress Prokudin-Gorskii Photographs (Library of Congress Prints and Photographs Division) http://www.loc.gov/exhibits/empire/ (also collection of web crawls from the Internet Archive) Goal: To “… demonstrate the feasibility and performance of current approaches for a production digital Data Center to support the Library of Congress’ requirements.” • Historically important 600 GB Library of Congress image collection • Images over 100 years old with red, blue, green components (kept as separate digital files). • SDSC stores 5 copies with dark archival copy at NCAR • Infrastructure must support idiosyncratic file structure. Special logging and monitoring software developed so that both SDSC and Library of Congress could access information Library of Congress Pilot Project information courtesy of David Minor
