Data and the Grid: From Databases to Global Knowledge Communities • Ian Foster, Argonne National Laboratory, University of Chicago • www.mcs.anl.gov/~foster • Image Credit: Electronic Visualization Lab, UIC • Keynote Talk, 15th Intl Conf on Scientific and Statistical Database Management, Boston, July 11, 2003
My Presentation 1) Data integration as a new opportunity • Driven by advances in technology & science • The need to discover, access, explore, analyze diverse distributed data sources • Grid technologies as a substrate for essential management functions 2) Science as collaborative workflow • The need to organize, archive, reuse, explain, and schedule scientific workflows • Virtual data as a unifying concept
It’s Easy to Forget How Different 2003 is From 1993 • Enormous quantities of data: Petabytes • For an increasing number of communities, gating step is not collection but analysis • Ubiquitous Internet: 100+ million hosts • Collaboration & resource sharing the norm • Ultra-high-speed networks: 10+ Gb/s • Global optical networks • Huge quantities of computing: 100+ Top/s • Moore’s law gives us all supercomputers
Consequence: The Emergence of Global Knowledge Communities • Teams organized around common goals • Communities: “Virtual organizations” • With diverse membership & capabilities • Heterogeneity is a strength not a weakness • And geographic and political distribution • No location/organization possesses all required skills and resources • Must adapt as a function of the situation • Adjust membership, reallocate responsibilities, renegotiate resources
Global Knowledge Communities Often Driven by Data: E.g., Astronomy • No. & sizes of data sets as of mid-2002, grouped by wavelength • 12 waveband coverage of large areas of the sky • Total about 200 TB data • Largest catalogs near 1B objects • Data and images courtesy Alex Szalay, Johns Hopkins
Data Integration as a Fundamental Challenge • Many sources of data, services, computation • Security & policy must underlie access & management decisions • Discovery: registries (R) organize services of interest to a community • Access: resource management (RM) is needed to ensure progress & arbitrate competing demands • Data integration activities may require access to, & exploration of, data at many locations • Exploration & analysis may involve complex, multi-step workflows • Diagram labels: Policy service, Security service
Performance Requirements Demand Whole-System Management • Assume • Remote data at 1 GB/s • 10 local bytes per remote byte • 100 operations per byte • Remote data over wide area link (end-to-end switched lambda?): 1 GB/s • >1 GByte/s achievable today (FAST, 7 streams, LA to Geneva) • Local network, parallel I/O: 10 GB/s • Parallel computation: 1000 Gop/s
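The slide's figures can be checked with a few lines of arithmetic. This is only a back-of-envelope sketch: the function name `required_rates` is made up for illustration, and it assumes that "100 operations per byte" applies to each local byte, which is the reading that reproduces the 1000 Gop/s shown on the slide.

```python
# Back-of-envelope check of the slide's assumptions (bytes/s and operations/s).
REMOTE_RATE = 1e9        # remote data arrives at 1 GB/s (from the slide)
LOCAL_PER_REMOTE = 10    # local bytes touched per remote byte (from the slide)
OPS_PER_BYTE = 100       # operations per byte; assumed to mean per local byte


def required_rates(remote_rate, local_per_remote, ops_per_byte):
    """Return (local I/O rate, compute rate) needed to keep up with the link."""
    local_io = remote_rate * local_per_remote
    compute = local_io * ops_per_byte
    return local_io, compute


local_io, compute = required_rates(REMOTE_RATE, LOCAL_PER_REMOTE, OPS_PER_BYTE)
print(f"parallel I/O: {local_io / 1e9:.0f} GB/s")   # parallel I/O: 10 GB/s
print(f"computation:  {compute / 1e9:.0f} Gop/s")   # computation:  1000 Gop/s
```

The point of the calculation is that a single fast wide-area link implies matching demands on local storage and compute, so the three have to be provisioned together.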
Data Integration: Key Challenges • Of course, familiar issues: data organization, schema definition/mediation, etc., etc. • But also new challenges relating to dynamic, distributed communities • Establishment, negotiation, management, & evolution of multi-organizational federations • And to the sheer number of resources, speed of networks, and volume of data • Coordination, management, provisioning, & monitoring of workflows & required resources
Enter Grid Technologies • Infrastructure (“middleware”) for establishing, managing, and evolving multi-organizational federations • Dynamic, autonomous, domain independent • On-demand, ubiquitous access to computing, data, and services • Mechanisms for creating and managing workflow within such federations • New capabilities constructed dynamically and transparently from distributed services • Service-oriented, virtualization
The Emergence of Open Grid Standards • Timeline 1990 to 2010, with increasing functionality and standardization • Custom solutions • Globus Toolkit: de facto standard, single implementation (Internet standards) • Open Grid Services Arch: real standards, multiple implementations (Web services, etc.) • Managed shared virtual systems (computer science research)
OGSA Structure • A standard substrate: the Grid service • Standard interfaces and behaviors that address key distributed system issues: naming, service state, lifetime, notification • A Grid service is a Web service • … supports standard service specifications • Agreement, data access & integration, workflow, security, policy, diagnostics, etc. • Target of current & planned GGF efforts • … and arbitrary application-specific services based on these & other definitions
Open Grid Services Infrastructure • Client introspection: What port types? What policy? What state? • GridService (required) • Other standard interfaces: factory, notification, collections • Grid Service Handle, handle resolution, Grid Service Reference • Service data elements • Lifetime management: explicit destruction, soft-state lifetime • Data access • Implementation • Hosting environment/runtime (“C”, J2EE, .NET, …)
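The two lifetime-management styles named on this slide, explicit destruction and soft-state lifetime, can be sketched in a few lines. This is an illustration of the general idea only, not the OGSI interface: the class and method names (`SoftStateService`, `HostingEnvironment`, `extend`, `sweep`) are hypothetical, and time is passed in as a plain number so the example is deterministic.

```python
class SoftStateService:
    """Hypothetical service instance that expires unless its lifetime is extended."""

    def __init__(self, name, now, lifetime):
        self.name = name
        self.termination_time = now + lifetime
        self.destroyed = False

    def extend(self, now, lifetime):
        """Keepalive: push the termination time forward, never backward."""
        if not self.is_alive(now):
            raise RuntimeError(f"{self.name} has already terminated")
        self.termination_time = max(self.termination_time, now + lifetime)

    def destroy(self):
        """Explicit destruction requested by a client."""
        self.destroyed = True

    def is_alive(self, now):
        return not self.destroyed and now < self.termination_time


class HostingEnvironment:
    """Hypothetical container that reclaims services whose lifetime has lapsed."""

    def __init__(self):
        self.services = {}

    def create(self, name, now, lifetime):
        service = SoftStateService(name, now, lifetime)
        self.services[name] = service
        return service

    def sweep(self, now):
        """Remove dead services and return their names in sorted order."""
        expired = sorted(n for n, s in self.services.items() if not s.is_alive(now))
        for n in expired:
            del self.services[n]
        return expired


env = HostingEnvironment()
transfer = env.create("transfer", now=0, lifetime=30)
query = env.create("query", now=0, lifetime=30)
cache = env.create("cache", now=0, lifetime=300)

transfer.extend(now=20, lifetime=30)  # client keepalive: now expires at t=50
cache.destroy()                       # explicit destruction
reclaimed = env.sweep(now=40)         # "query" lapsed, "cache" was destroyed
print(reclaimed, list(env.services))
```

The design point is that a client that crashes or loses connectivity simply stops sending keepalives, so its state is reclaimed without any explicit cleanup message.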
Open Grid Services Infrastructure (OGSI) • GWD-R (draft-ggf-ogsi-gridservice-23), February 17, 2003 • http://www.ggf.org/ogsi-wg • Editors: S. Tuecke, ANL; K. Czajkowski, USC/ISI; I. Foster, ANL; J. Frey, IBM; S. Graham, IBM; C. Kesselman, USC/ISI; D. Snelling, Fujitsu Labs; P. Vanderbilt, NASA • “The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration”, Foster, Kesselman, Nick, Tuecke, 2002
Example: Reliable File Transfer Service • Clients request and manage file transfer operations • Clients query &/or subscribe to service data • Diagram labels: Grid Service; interfaces (File Transfer, Notf’n Source, Policy); service data elements (Pending, Performance, Policy, Faults); Internal State; Fault Monitor; Perf. Monitor • Data transfer operations
OGSA and Data Integration • OGSI provides key enabling mechanisms for distributed data integration • Introspect on distributed system elements • Create and manage distributed state • We need more than OGSI, of course, e.g., • WS-Agreement: negotiate agreements between service provider and consumer • OGSA-DAI: Data Access and Integration • WS-Management: service management • Security and policy
Distributed Virtual Integration Architecture • Layers: Data Intensive X-ology Researchers; Data Intensive Applications for X-ology Research; Simulation, Analysis & Integration Technology for X-ology; Generic Virtual Data Access and Integration Layer; OGSA; OGSI: Interface to Grid Infrastructure; Compute, Data & Storage Resources • Services: Job Submission, Brokering, Workflow, Structured Data Integration, Registry, Banking, Authorisation, Data Transport, Resource Usage, Transformation, Structured Data Access • Structured Data: Relational, XML, Semi-structured • Slide Courtesy Malcolm Atkinson, UK eScience Center
Data as Service: OGSA Data Access & Integration • Service-oriented treatment of data appears to have significant advantages • Leverage OGSI introspection, lifetime, etc. • Compatibility with Web services • Standard service interfaces being defined • Service data: e.g., schema • Derive new data services from old (views) • Externalize to e.g. file/database format • Perform queries or other operations
Data Access & Integration Services • 1a. Request to Registry for sources of data about “x” • 1b. Registry responds with Factory handle • 2a. Request to Factory for access to database • 2b. Factory creates GridDataService to manage access • 2c. Factory returns handle of GDS to client • 3a. Client queries GDS with XPath, SQL, etc. • 3b. GDS interacts with database • 3c. Results of query returned to client as XML • Diagram: Client, Registry, Factory, Grid Data Service, XML / Relational database; interactions are SOAP/HTTP, service creation, API interactions • Slide Courtesy Malcolm Atkinson, UK eScience Center
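The registry, factory, and data-service interaction on this slide can be mimicked in a single process. The sketch below is not OGSA-DAI code: the classes `Registry`, `Factory`, and `GridDataService` are hypothetical stand-ins, plain Python object references play the role of service handles, there is no SOAP/HTTP layer, and an in-memory SQLite table with invented contents stands in for the database.

```python
import sqlite3
import xml.etree.ElementTree as ET


class GridDataService:
    """Hypothetical data service that wraps one database connection."""

    def __init__(self, connection):
        self._conn = connection

    def query(self, sql):
        cursor = self._conn.execute(sql)                   # 3b: talk to the database
        columns = [d[0] for d in cursor.description]
        root = ET.Element("results")
        for row in cursor:
            row_el = ET.SubElement(root, "row")
            for column, value in zip(columns, row):
                ET.SubElement(row_el, column).text = str(value)
        return ET.tostring(root, encoding="unicode")       # 3c: results as XML


class Factory:
    """Hypothetical factory that creates a data service on request."""

    def __init__(self, connection):
        self._conn = connection

    def create_service(self):
        return GridDataService(self._conn)                 # 2b and 2c


class Registry:
    """Hypothetical registry mapping a topic to a factory."""

    def __init__(self):
        self._factories = {}

    def register(self, topic, factory):
        self._factories[topic] = factory

    def find(self, topic):
        return self._factories[topic]                      # 1a and 1b


# Set-up: an in-memory table with made-up contents.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE galaxies (name TEXT, redshift REAL)")
conn.execute("INSERT INTO galaxies VALUES ('NGC1', 0.02)")
registry = Registry()
registry.register("galaxies", Factory(conn))

# Client side, following steps 1 to 3 of the slide.
factory = registry.find("galaxies")
gds = factory.create_service()
xml_result = gds.query("SELECT name, redshift FROM galaxies")   # 3a
print(xml_result)
```

Keeping discovery, creation, and query as three separate steps is what lets a community add or move data sources without changing client code.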
Globus Toolkit v3 (GT3): Open Source OGSA Technology • Implements and builds on OGSI interfaces • Supports primary GT2 interfaces • Public key authentication • Scalable service discovery • Secure, reliable resource access • High-performance data movement (GridFTP) • Numerous new services included or planned • SLA negotiation, service registry, community authorization, data access & integration, … • Rapidly growing adoption and contributions • E.g., OGSA-DAI from U.K. eScience program
My Presentation 1) Data integration as a new opportunity • Driven by advances in technology & science • The need to discover, access, explore, analyze diverse distributed data sources • Grid technologies as a substrate for essential management functions 2) Science as collaborative workflow • The need to organize, archive, reuse, explain, & schedule scientific workflows • Virtual data as a unifying concept
Science as Workflow • Data integration = the derivation of new data from old, via coordinated computation(s) • May be computationally demanding • The workflows used to achieve integration are often valuable artifacts in their own right • Thus we must be concerned with how we • Build workflows • Share and reuse workflows • Explain workflows • Schedule workflows
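One concrete reading of "schedule workflows" is ordering the steps of a dependency graph so that each step runs only after its inputs exist, with independent steps grouped for parallel execution. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later); the step names are invented for illustration and do not come from the talk.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each step maps to the set of steps it depends on.
workflow = {
    "fetch_images": set(),
    "fetch_catalog": set(),
    "calibrate": {"fetch_images"},
    "detect_objects": {"calibrate"},
    "cross_match": {"detect_objects", "fetch_catalog"},
}

sorter = TopologicalSorter(workflow)
sorter.prepare()

# Collect steps into batches; every step in a batch could run concurrently.
batches = []
while sorter.is_active():
    ready = sorter.get_ready()
    batches.append(sorted(ready))
    sorter.done(*ready)

for number, batch in enumerate(batches, start=1):
    print(number, batch)
```

A real scheduler would also weigh data location and resource availability; this shows only the dependency-ordering part.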
Virtual Data Concept • Capture and manage information about relationships among • Data (of widely varying representations) • Programs (& their execution needs) • Computations (& execution environments) • Apply this information to, e.g. • Discovery: Data and program discovery • Workflow: Structured paradigm for organizing, locating, specifying, & requesting data • Explanation: provenance • Planning and scheduling • Other uses we haven’t thought of
Motivations • “I’ve come across some interesting data, but I need to understand the nature of the corrections applied when it was constructed before I can trust it for my purposes.” • “I’ve detected a calibration error in an instrument and want to know which derived data to recompute.” • “I want to apply an astronomical analysis program to millions of objects. If the results already exist, I’ll save weeks of computation.” • “I want to search an astronomical database for galaxies with certain characteristics. If a program that performs this analysis exists, I won’t have to write one from scratch.” • Diagram: Data, Transformation, Derivation; relationships: consumed-by/generated-by, created-by, execution-of
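Two of these motivations, finding which derived data to recompute and explaining how a result was produced, reduce to walking a graph of recorded derivations. The following is a minimal sketch of that idea and not Chimera's catalog or language: the `Derivation` record, the dataset names, and the transformation names are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Derivation:
    """Hypothetical record: one execution of a transformation."""
    transformation: str
    inputs: tuple
    outputs: tuple


# Made-up catalog entries for illustration.
CATALOG = [
    Derivation("calibrate", ("raw.dat",), ("calibrated.dat",)),
    Derivation("find_clusters", ("calibrated.dat",), ("clusters.dat",)),
    Derivation("histogram", ("clusters.dat",), ("size_distribution.dat",)),
    Derivation("make_catalog", ("other.dat",), ("other_catalog.dat",)),
]


def downstream(catalog, bad):
    """Return every dataset derived, directly or indirectly, from `bad`."""
    affected = set()
    queue = deque([bad])
    while queue:
        current = queue.popleft()
        for derivation in catalog:
            if current in derivation.inputs:
                for output in derivation.outputs:
                    if output not in affected:
                        affected.add(output)
                        queue.append(output)
    return affected


def provenance(catalog, target):
    """Return the derivations that led to `target`, nearest step first."""
    steps, seen = [], set()
    queue = deque([target])
    while queue:
        current = queue.popleft()
        for derivation in catalog:
            if current in derivation.outputs and derivation not in seen:
                seen.add(derivation)
                steps.append(derivation)
                queue.extend(derivation.inputs)
    return steps


to_recompute = downstream(CATALOG, "raw.dat")
history = [d.transformation for d in provenance(CATALOG, "size_distribution.dat")]
print(sorted(to_recompute))
print(history)
```

The same records also answer the reuse question: if a requested output already appears in the catalog, it does not need to be computed again.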
Chimera Virtual Data System (www.griphyn.org/chimera) • Virtual data catalog • Transformations, derivations, data • Virtual data language • Catalog definitions • Query tool • Applications include browsers and data analysis applications
Chimera Virtual Data Schema describes describes Metadata
Virtual Data in CMS HEP Analysis • Define a virtual data space for exploration by other scientists • Knowledge capture • Derived-data nodes in the diagram, each labeled by its parameters: (mass = 200); (mass = 200, decay = bb); (mass = 200, decay = ZZ); (mass = 200, decay = WW); (mass = 200, decay = WW, stability = 3); (mass = 200, decay = WW, stability = 1); (mass = 200, event = 8); (mass = 200, decay = WW, event = 8); (mass = 200, decay = WW, stability = 1, event = 8); (mass = 200, plot = 1); (mass = 200, decay = WW, plot = 1); (mass = 200, decay = WW, stability = 1, plot = 1) • Work and slide by Rick Cavanaugh and Dimitri Bourilkov, University of Florida
Virtual Data in CMS HEP Analysis • Search for WW decays of the Higgs Boson for which only stable, final state particles are recorded: stability = 1 • Knowledge capture • On-demand data gen. • Workload mgmt • Diagram: the same tree of derived-data nodes as on the previous slide • Work and slide by Rick Cavanaugh and Dimitri Bourilkov, University of Florida
Virtual Data in CMS HEP Analysis • Search for WW decays of the Higgs Boson for which only stable, final state particles are recorded: stability = 1 • Scientist discovers an interesting result – wants to know how it was derived. • Knowledge capture • On-demand data gen. • Workload mgmt • Explain provenance • Diagram: the same tree of derived-data nodes as on the previous slides • Work and slide by Rick Cavanaugh and Dimitri Bourilkov, University of Florida
Virtual Data in CMS HEP Analysis • Search for WW decays of the Higgs Boson for which only stable, final state particles are recorded: stability = 1 • Scientist discovers an interesting result – wants to know how it was derived. • The scientist adds a new derived data branch (mass = 200, decay = WW, stability = 1, LowPt = 20, HighPt = 10000) and continues to investigate • Knowledge capture • On-demand data gen. • Workload mgmt • Explain provenance • Collaboration • Diagram: the same tree of derived-data nodes as on the previous slides, plus the new branch • Work and slide by Rick Cavanaugh and Dimitri Bourilkov, University of Florida
Virtual Data “Explorations” Can be Long-Lived Computations • Production Run on the Integration Testbed • Simulate 1.5 million full CMS events for physics studies: ~500 sec per event on 850 MHz processor • 2 months continuous running across 5 testbed sites • Managed by a single person at the US-CMS Tier 1
Virtual Data in Sloan Galaxy Cluster Analysis • DAG • Sloan Data • Galaxy cluster size distribution • Jim Annis, Steve Kent, Vijay Sehkri, Fermilab, Michael Milligan, Yong Zhao, Chicago
Virtual Data in Genome Analysis • Figure 1. GADU data flow • DOESG Resource
Also Very Relevant: Workflow & Web Services • Diagram labels: User; WF-Pilot (design, execution monitoring); WF-Engine (scheduling and execution); WF-Compiler (translation); AWF; EWF; ET schemas; AAV rules; conversion rules; web service invocation; query rewriting; semantic type checking; data type conversion; web service matching; Genbank; BLAST • Repositories: Abstract Task (AT) Repository; Executable Task (ET) Repository; Data & Parameter Ontologies; Datatype & Conversion Repository • B. Ludäscher, I. Altintas, A. Gupta – http://kbi.sdsc.edu/SciDAC-SDM/scidac-tn-02-01.pdf
Summary 1) Data integration as a new opportunity • Driven by advances in technology & science • The need to discover, access, explore, analyze diverse distributed data sources • Grid technologies as a substrate for essential management functions 2) Science as collaborative workflow • The need to organize, archive, reuse, explain, and schedule scientific workflows • Virtual data as a unifying concept
For More Information • The Globus Project™ • www.globus.org • Technical articles • www.mcs.anl.gov/~foster • Open Grid Services Arch. • www.globus.org/ogsa • Chimera • www.griphyn.org/chimera • Global Grid Forum • www.ggf.org 2nd Edition: November 2003