
Presentation Transcript


  1. Architecting Scientific Data Systems in the 21st Century Dan Crichton Principal Computer Scientist Program Manager, Data Systems and Technology NASA Jet Propulsion Laboratory

  2. Architecting the “End-to-End” Science Data System
  • Focus on
    • science data generation
    • data capture, end-to-end
    • access to science data by the community
  • Multiple scientific domains
    • Earth science
    • Planetary science
    • Biomedical research
    • Applied technology research
  • SW/Sys architectures
  • Product lines
  • Emerging technologies

  3. Challenges in Science Data Systems
  • A major challenge is organizing the wealth of science data, which requires both standards and data engineering/curation
    • Search and access depend on good curation
    • Community support is critical to capture and curate the data in a manner that is useful to the community
  • Usability of data continues to be a big challenge
    • Planetary science requires that ALL science data and/or science data pipelines be peer reviewed prior to release of the data
    • Standard formats are critical
  • Data sharing continues to be a challenge
    • Policies at the grant level, coupled with standard data management plans, are helping
  • Computation and storage, historically major concerns, are now commodity services
    • Google, Microsoft Research, Yahoo!, and Amazon provide services to e-science in the form of “cloud computing”

  4. National Research Council: Committee on Data Management and Computation
  • CODMAC (1980s) identified seven core principles:
    • Scientific involvement
    • Scientific oversight
    • Data availability, including usable formats, ancillary data, timely distribution, validated data, and documentation
    • Proper facilities
    • Structured, transportable, adequately documented software
    • Data storage in permanent and retrievable form
    • Adequate data system funding
  • CODMAC has led to national efforts to organize scientific results in partnership with the science community (particularly the physical sciences)
  • What does CODMAC mean in the 21st century?

  5. The “e-science” Trend…
  • Highly distributed, multi-organizational systems
    • Systems are moving towards loosely coupled systems or federations in order to solve science problems that span center and institutional environments
  • Sharing of data and services that allow for the discovery, access, and transformation of data
    • Systems are moving towards publishing services and data in order to address data- and computationally-intensive problems
  • Infrastructures are being built to handle future demand
    • Address complex modeling, inter-disciplinary science, and decision support needs
    • Need a dynamic environment where data and services can be used quickly as the building blocks for constructing predictive models and answering critical science questions

  6. JPL e-science Examples
  • Planetary Science Data System
    • Highly diverse (40 years of science data from NASA and international missions)
    • Geographically distributed; moving international
    • New centers plugging in (i.e. data nodes)
    • Multi-center data system infrastructure
    • Heterogeneous nodes with common interfaces
    • Integrated based on enterprise-wide data standards
    • Sits on top of COTS-based middleware
  • EDRN Cancer Research
    • Highly diverse (30+ centers performing parallel studies using different instruments)
    • Geographically distributed
    • New centers plugging in (i.e. data nodes)
    • Multi-center data system infrastructure
    • Heterogeneous sites with common interfaces allowing access to distributed portals
    • Integrated based on common data standards
    • Secure (e.g. encryption, authentication, authorization)

  7. Architectural Drivers in Science Data Systems
  • Increasing data volumes requiring new approaches for data production, validation, processing, discovery, and data transfer/distribution (e.g., scalability relative to available resources)
  • Increased emphasis on usability of the data (e.g., discovery, access, and analysis)
  • Increasing diversity of data sets and complexity of integrating across missions/experiments (e.g., a common information model for describing the data)
  • Increasing distribution of coordinated processing and operations (e.g., federation)
  • Increased pressure to reduce the cost of supporting new missions
  • Increasing desire for PIs to have integrated tool sets to work with data products within their own environments (e.g., perform their own generation and distribution)
  Planetary Science Archive (image)

  8. Architectural Focus
  • Consistent distributed capabilities
    • Resource discovery (data, metadata, services, etc.), unified repository access, simple transformations, bulk transfer of multiple products, and unified catalog access
  • Move towards an era of “grid-ing” loosely coupled science systems
    • Develop on-demand, shared services, e.g.
      • Processing
      • Translation
    • Deploy high-throughput data movement mechanisms
  • Move capability up the mission pipeline
    • Reduce local software solutions that do not scale
  • Increasing importance of developing an “enterprise” approach with common services
    • Build value-added services and capabilities on top of the infrastructure

  9. Object Oriented Data Technology*
  • Started in 1998 as a research and development task funded at JPL by the Office of Space Science to
    • Apply information technology to space science
    • Provide an infrastructure for distributed data management
    • Research methods for interoperability, knowledge management, and knowledge discovery
    • Develop software frameworks for data management to reuse software, manage risk, reduce cost, and leverage IT experience
  • OODT initial focus
    • Data archiving – manage heterogeneous data products and resources in a distributed, metadata-driven environment
    • Data location and discovery – locate data products across multiple archives, catalogs, and data systems
    • Data retrieval – retrieve diverse data products from distributed data sources and integrate them
  * 2003 NASA Software of the Year Runner-Up

  10. Architectural Principles*
  • Separate the technology and the information architecture
  • Encapsulate the messaging layer to support different messaging implementations
  • Encapsulate individual data systems to hide uniqueness
  • Provide data system location independence
  • Require that communication between distributed systems use metadata
  • Define a model for describing systems and their resources
  • Provide scalability in linking both the number of nodes and the size of data sets
  • Allow systems using different data dictionaries and metadata implementations to be integrated
  • Leverage existing software where possible (e.g., open source)
  * Crichton, D., Hughes, J. S., Hyon, J., Kelly, S. “Science Search and Retrieval using XML”, Proceedings of the 2nd National Conference on Scientific and Technical Data, National Academy of Science, Washington, DC, 2000.
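As a hedged sketch of two of these principles, “separate the technology and the information architecture” and “require that communication between distributed systems use metadata”, the Python fragment below describes a data element and a query purely as metadata and hands them to a pluggable transport. All names are illustrative and are not OODT's actual interfaces; the element attributes only loosely follow ISO/IEC 11179.

```python
# Illustrative only: metadata-driven communication between distributed nodes.
# Field names loosely follow ISO/IEC 11179 data element attributes.

data_element = {
    "name": "TARGET_NAME",
    "definition": "Name of the body or object observed by the instrument",
    "data_type": "string",
    "registration_authority": "Planetary Data System (illustrative)",
}

# A query is itself just metadata: element/value constraints that any node
# can interpret against its own local catalog implementation.
query = {
    "elements": ["TARGET_NAME", "START_TIME"],
    "constraints": [("TARGET_NAME", "=", "MARS")],
}

def send_query(node_url, query, transport):
    """Hand the metadata query to a pluggable transport (CORBA, RMI, HTTP, ...).
    Because the messaging layer is encapsulated, swapping transports does not
    change the query structure."""
    return transport.send(node_url, query)
```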

  11. Distributed Architecture
  1. Science data tools and applications use “APIs” to connect to a virtual data repository
  2. Middleware creates the data grid infrastructure connecting distributed, heterogeneous systems and data
  3. Repositories store and retrieve many types of data
  Diagram: visualization, web search, and analysis tools connect through the OODT API to the OODT reusable data grid framework, which fronts mission, biomedical, and engineering data repositories
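A hypothetical sketch (not OODT code) of these three layers: tools call one API, the data grid middleware fans the call out, and each repository hides its own storage behind a common interface. Class and method names here are invented for illustration.

```python
# Invented class names; shows only the layering idea from the slide.

class Repository:
    def query(self, constraints: dict) -> list:
        raise NotImplementedError

class MissionRepository(Repository):
    def query(self, constraints):
        # Translate generic constraints into the mission catalog's own query.
        return [{"id": "mission-product-001", "source": "mission"}]

class BiomedicalRepository(Repository):
    def query(self, constraints):
        return [{"id": "biomedical-product-042", "source": "biomedical"}]

class DataGrid:
    """Middleware facade: visualization, web search, and analysis tools see
    one virtual repository regardless of where products actually live."""
    def __init__(self, repositories):
        self.repositories = repositories

    def query(self, constraints):
        results = []
        for repo in self.repositories:
            results.extend(repo.query(constraints))
        return results

grid = DataGrid([MissionRepository(), BiomedicalRepository()])
print(grid.query({"target": "MARS"}))
```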

  12. Software Implementation
  • OODT is open source
    • Developed using open source software (i.e. Java/J2EE and XML)
    • Implemented reusable, extensible Java-based software components
    • Core software for building and connecting data management systems
  • Provided messaging as a “plug-in” component that can be replaced independently of the other core components; messaging components include:
    • CORBA, Java RMI, JXTA, Web Services, etc.
    • REST seems to have prevailed
  • Provided client APIs in Java, C++, HTTP, Python, and IDL
  • Simple installation on a variety of platforms (Windows, Unix, Mac OS X, etc.)
  • Used international data architecture standards
    • ISO/IEC 11179 – Specification and Standardization of Data Elements
    • Dublin Core Metadata Initiative
    • W3C’s Resource Description Framework (RDF) from the Semantic Web community
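As a hedged illustration of the HTTP-style client access mentioned above, the sketch below issues a product query to a hypothetical endpoint. The URL, query parameters, and JSON response shape are invented for illustration and are not the actual OODT product-service interface.

```python
# Hypothetical endpoint and response format; illustrates the access pattern only.
import json
import urllib.parse
import urllib.request

base_url = "http://example.org/data-system/products"   # hypothetical endpoint
params = urllib.parse.urlencode({"q": "TARGET_NAME = MARS", "format": "json"})

# Issue the query over plain HTTP and decode a JSON list of product records.
with urllib.request.urlopen(f"{base_url}?{params}") as response:
    products = json.load(response)

for product in products:
    print(product.get("name"), product.get("size"))
```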

  13. Characteristics of Informatics in Space Science
  • Often unique, one-of-a-kind missions
    • Can drive technological changes
  • Instruments are competed and developed by academic and industrial partners
  • Highly distributed acquisition and processing across partner organizations
  • Highly diverse data sets given the heterogeneity of the instruments and the targets (i.e. the solar system)
  • Missions are required to share science data results with the research community, requiring:
    • A common domain information model used to drive system implementations
    • Expert scientific help to the user community on using the data
    • Peer review of data results to ensure quality
    • Distribution of data to the community
  • Planetary science data from NASA (and some international) missions is deposited into the Planetary Data System

  14. Distributed Space Architecture
  Diagram: the end-to-end flow from spacecraft, landers, and scientific instruments (data acquisition and command, instrument/sensor operations, mission operations) through a relay satellite to science data processing, the science data archive, and data analysis and modeling by the science team and the external science community, with telemetry, planning, and science information packages exchanged as simple and primitive information objects along the way
  • Common meta-models for describing space information objects
  • Common data dictionary end-to-end

  15. Planetary Science Data Standards
  Diagram: a PDS image label (ODL) describes an image and maps to the PDS Image class (object-oriented)
  • JPL has led and managed development of the planetary science data standards for NASA and the international community
    • ESA, ISRO, JAXA, etc. leverage the planetary science data standards
  • A diverse model used across the community that unifies data systems
  • A core “information” model that has been used to describe every type of data from NASA’s planetary exploration missions and instruments
    • ~4000 different types of data
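Because the example label image does not survive in the transcript, here is a heavily simplified, illustrative ODL-style label embedded in Python. The keyword names follow common PDS3 conventions, but the values and structure are invented for illustration, not copied from any real product.

```python
# Illustrative, simplified PDS3/ODL-style label for a single image product.
label = """
PDS_VERSION_ID        = PDS3
RECORD_TYPE           = FIXED_LENGTH
^IMAGE                = "EXAMPLE.IMG"
TARGET_NAME           = MARS
OBJECT                = IMAGE
  LINES               = 1024
  LINE_SAMPLES        = 1024
  SAMPLE_BITS         = 8
END_OBJECT            = IMAGE
END
"""

# A trivial key/value parse -- real labels need a proper ODL parser.
pairs = [line.split("=", 1) for line in label.splitlines() if "=" in line]
metadata = {key.strip(): value.strip() for key, value in pairs}
print(metadata["TARGET_NAME"])
```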

  16. 2001 Mars Odyssey: A Paradigm Change
  • Pre-October 2002, there was no unified view across distributed operational planetary science data repositories
    • Science data distributed across the country
    • Science data distributed on physical media
  • Planetary data archive increasing from 4 TB in 2001 to 100 TB in 2009
    • Traditional distribution infeasible due to cost and system constraints
    • Mars Odyssey could not be distributed using the traditional method
  • Current work with the OODT Data Grid Framework has provided the technology for NASA’s planetary data management infrastructure to
    • Support online distribution of science data to planetary scientists
    • Enable interoperability between nine institutions
    • Support real-time access to data products
    • Provide uniform software interfaces to all Mars Odyssey data, allowing scientists and developers to link in their own tools
  • Operational October 1, 2002
  • Moving to multi-terabyte online data movement in 2009

  17. Explosion of Data in Biomedical Research
  • “To thrive, the field that links biologists and their data urgently needs structure, recognition and support. The exponential growth in the amount of biological data means that revolutionary measures are needed for data management, analysis and accessibility. Online databases have become important avenues for publishing biological data.” – Nature, September 2008
  • The capture and sharing of data to support collaborative research is leading to new opportunities to examine data in many sciences
    • NASA routinely releases “data analysis programs” to analyze and process existing data
  • EDRN has become a leader in building informatics technologies and constructing databases for cancer research. The tools and technologies are now ready for wider use!
  EDRN Data Repositories (image)

  18. Bioinformatics: National Cancer Institute Early Detection Research Network (EDRN)
  • Initiated in 2000, renewed in 2005
  • 100+ researchers (both members and associated members)
  • ~40+ research institutions
  • Mission of EDRN
    • Discover, develop, and validate biomarkers for cancer detection, diagnosis, and risk assessment
    • Conduct correlative studies/trials to validate biomarkers as indicators of early cancer, pre-invasive cancer, risk, or as surrogate endpoints
    • Develop quality assurance programs for biomarker testing and evaluation
    • Forge public-private partnerships
  • Leverages the experience of building distributed planetary science data systems for biomedicine

  19. EDRN Knowledge Environment
  • EDRN has been a pioneer in the use of informatics technologies to support biomarker research
  • EDRN has developed a comprehensive infrastructure to support biomarker data management across EDRN’s distributed cancer centers
    • Twelve institutions are sharing data
    • Same architectural framework as planetary science
  • It supports capture of and access to a diverse set of information and results
    • Biomarkers
    • Proteomics
    • Biospecimens
    • Various technologies and data products (image, micro-satellite, …)
    • Study management

  20. EDRN’s Ontology Model
  • EDRN has developed a high-level ontology model for biomarker research, which provides standards for the capture of biomarker information across the enterprise
  • Specific models are derived from this high-level model
    • Model of biospecimens
    • Model for each class of science data
  • EDRN is specifically focusing on a granular model for annotating biomarkers, studies, and scientific results
  • EDRN has a set of Common Data Elements (CDEs) used to provide standard data elements and values for the capture and exchange of data (see the sketch below)
  EDRN CDE Tools / EDRN Biomarker Ontology Model (images)
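To make the Common Data Element idea concrete, here is a minimal sketch in Python of a named element with a definition and an enumerated set of permissible values. The element name, definition, and values are invented for illustration and are not actual EDRN CDEs.

```python
# Illustrative CDE structure: a controlled element with permissible values.
from dataclasses import dataclass, field

@dataclass
class CommonDataElement:
    name: str
    definition: str
    permissible_values: tuple = field(default_factory=tuple)

    def validate(self, value):
        # An empty value set means free text; otherwise the value must match.
        return not self.permissible_values or value in self.permissible_values

specimen_type = CommonDataElement(
    name="SPECIMEN_TYPE",   # hypothetical element, not an actual EDRN CDE
    definition="Kind of biospecimen collected for a biomarker study",
    permissible_values=("serum", "plasma", "tissue", "urine"),
)

print(specimen_type.validate("serum"))   # True
print(specimen_type.validate("hair"))    # False
```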

  21. Earth Science Distributed Process Management
  SeaWinds on ADEOS II (launched December 2002) (image)
  • Leveraged the OODT software framework for constructing ground data systems for earth science missions
    • Used OODT Catalog and Archive Service software
    • Constructed “workflows”
    • Execution of “processors” based on a set of rules (see the sketch below)
    • Provided “lights out” operations
  • Multiple missions
    • SeaWinds
    • QuikSCAT
    • Orbiting Carbon Observatory (OCO)
    • NPP Sounder PEATE
    • SMAP
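A minimal sketch of the “execute processors based on a set of rules” idea: each rule watches for products in a given state and, when its condition holds, runs a processor with no operator in the loop. Processor names, states, and the rule structure are invented for illustration and are not the OODT workflow API.

```python
# Illustrative rule-driven pipeline; names and states are invented.

def calibrate(product):
    product["state"] = "calibrated"
    return product

def geolocate(product):
    product["state"] = "geolocated"
    return product

rules = [
    {"when": lambda p: p["state"] == "raw", "run": calibrate},
    {"when": lambda p: p["state"] == "calibrated", "run": geolocate},
]

def run_pipeline(product, rules):
    """Apply rules repeatedly until none fires ('lights out' operation)."""
    fired = True
    while fired:
        fired = False
        for rule in rules:
            if rule["when"](product):
                product = rule["run"](product)
                fired = True
                break
    return product

print(run_pipeline({"id": "granule-001", "state": "raw"}, rules))
```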

  22. Supporting Climate Research
  • The Earth Observing System Data and Information System (EOSDIS) serves NASA earth scientists’ data needs
  • Two major legacies remain
    • Archiving of the explosion in observational data in Distributed Active Archive Centers (DAACs)
      • Request-driven retrieval from the archive is time consuming
    • Adoption of the Hierarchical Data Format (HDF) for data files
      • Defined by and unique to each instrument, and not necessarily consistent between instruments
  • What are the next steps to accelerating use of an ever-increasing observational data collection?
    • What data are available?
    • What is the information content?
    • How should it be interpreted in climate modeling research?
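To illustrate why instrument-specific HDF layouts complicate reuse, the sketch below simply inventories what a file contains before any science can be done. It assumes an HDF5 granule and the h5py package; many older EOS products are HDF4/HDF-EOS and need different readers, and the file name here is hypothetical.

```python
# List every dataset in an HDF5 granule with its shape, dtype, and units.
import h5py

def describe(path):
    with h5py.File(path, "r") as f:
        def show(name, obj):
            if isinstance(obj, h5py.Dataset):
                units = obj.attrs.get("units", "?")   # attribute names vary by product
                print(name, obj.shape, obj.dtype, units)
        f.visititems(show)

describe("example_granule.h5")   # hypothetical file name
```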

  23. EOSDIS DAACs: Earth Observing System Data and Information System Distributed Active Archive Centers

  24. EOSDIS DAACs: Earth Observing System Data and Information System Distributed Active Archive Centers

  25. Current Data System
  • The system serves static data products; users must find, move, and manipulate all data themselves
  • Users must change spatial and temporal resolutions to match
  • Users must understand instrument observation strategies and subtleties to interpret the data

  26. Climate Data eXchange (CDX)
  • Develop an architecture that enables sharing of climate model output and NASA observational data
  • Develop an architectural model that evaluates the trade space
  • Provide extensive server-side computational services
    • Increase performance
    • Subsetting, reformatting, re-gridding (see the sketch below)
  • Deliver an “open source” toolkit
  • Connect NASA and DOE
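A hedged sketch of the kind of server-side subsetting and re-gridding listed above, using plain NumPy nearest-neighbour matching on synthetic data. Operational CDX services would use the projects’ own subsetting and re-gridding code; this only shows the shape of the operations.

```python
# Synthetic data; simple nearest-neighbour re-gridding for illustration only.
import numpy as np

lat = np.linspace(-90, 90, 181)          # 1-degree source latitude grid
field = np.cos(np.deg2rad(lat))          # stand-in for a model/observation variable

# Subset: keep the tropics only.
mask = (lat >= -30) & (lat <= 30)
lat_sub, field_sub = lat[mask], field[mask]

# Re-grid: nearest neighbour onto a coarser 5-degree target grid.
target = np.arange(-30, 31, 5)
idx = np.abs(lat_sub[:, None] - target[None, :]).argmin(axis=0)
field_on_target = field_sub[idx]

print(target)
print(np.round(field_on_target, 3))
```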

  27. Combining Instrument Data to Enable Climate Research: AIRS and MLS
  • Combining AIRS and MLS requires:
    • Rectifying horizontal, vertical, and temporal mismatch
    • Assessing and correcting for the instruments’ scene-specific error characteristics (see left diagram)

  28. Climate Data Exchange: Key Questions to be Answered; Specific Tools (H2O, CO2, …)

  29. Summary
  • Software is critical to supporting collaborative research in science
    • Virtual organizations
    • Transparent access to data
    • End-to-end environments
  • Software architecture is critical to
    • Reducing the cost of building science data systems
    • Building virtual organizations
    • Constructing software product lines
    • Driving standards
  • Science is still learning how to best leverage technology in a collaborative discovery environment, but significant progress is being made!

  30. THANK YOU… Dan Crichton • Dan.Crichton@jpl.nasa.gov • +1 818 354 9155
