
Data and the Grid: From Databases to Global Knowledge Communities

Data and the Grid: From Databases to Global Knowledge Communities. Ian Foster Argonne National Laboratory University of Chicago www.mcs.anl.gov/~foster. Image Credit: Electronic Visualization Lab, UIC.


Presentation Transcript


  1. Data and the Grid: From Databases to Global Knowledge Communities Ian Foster, Argonne National Laboratory & University of Chicago, www.mcs.anl.gov/~foster. Image credit: Electronic Visualization Lab, UIC. Keynote Talk, 15th Intl Conf on Scientific and Statistical Database Management, Boston, July 11, 2003

  2. My Presentation 1) Data integration as a new opportunity • Driven by advances in technology & science • The need to discover, access, explore, analyze diverse distributed data sources • Grid technologies as a substrate for essential management functions 2) Science as collaborative workflow • The need to organize, archive, reuse, explain, and schedule scientific workflows • Virtual data as a unifying concept

  3. It’s Easy to Forget How Different 2003 is From 1993 • Enormous quantities of data: petabytes • For an increasing number of communities, the gating step is not collection but analysis • Ubiquitous Internet: 100+ million hosts • Collaboration & resource sharing the norm • Ultra-high-speed networks: 10+ Gb/s • Global optical networks • Huge quantities of computing: 100+ Top/s • Moore’s law gives us all supercomputers

  4. Consequence: The Emergence of Global Knowledge Communities • Teams organized around common goals • Communities: “Virtual organizations” • With diverse membership & capabilities • Heterogeneity is a strength, not a weakness • And geographic and political distribution • No location/organization possesses all required skills and resources • Must adapt as a function of the situation • Adjust membership, reallocate responsibilities, renegotiate resources

  5. The Emergence of Global Knowledge Communities

  6. Global Knowledge Communities Often Driven by Data: E.g., Astronomy • No. & sizes of data sets as of mid-2002, grouped by wavelength • 12-waveband coverage of large areas of the sky • Total about 200 TB of data • Largest catalogs near 1B objects Data and images courtesy Alex Szalay, Johns Hopkins

  7. Data Integration as a Fundamental Challenge • Many sources of data, services, computation • Discovery: registries organize services of interest to a community • Access: resource management is needed to ensure progress & arbitrate competing demands • Security & policy must underlie access & management decisions • Data integration activities may require access to, & exploration of, data at many locations • Exploration & analysis may involve complex, multi-step workflows

  8. Performance Requirements Demand Whole-System Management • Assume remote data arriving at 1 GB/s, 10 local bytes touched per remote byte, and 100 operations per byte • This implies parallel I/O at 10 GB/s and parallel computation at 1000 Gop/s over the local network • On the wide-area link (end-to-end switched lambda?), >1 GByte/s is achievable today (FAST, 7 streams, LA-Geneva)
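The slide's numbers are self-consistent, and the relationship is easy to check. The following sketch is plain arithmetic (no Grid APIs), deriving the local I/O and compute rates from the three stated assumptions:

```python
# Back-of-envelope check of the slide's whole-system numbers.
# All three input assumptions come straight from the slide.

remote_rate_gb_s = 1        # remote data arriving at 1 GB/s
local_per_remote = 10       # 10 local bytes touched per remote byte
ops_per_byte = 100          # 100 operations per byte

local_io_gb_s = remote_rate_gb_s * local_per_remote   # parallel I/O needed
compute_gop_s = local_io_gb_s * ops_per_byte          # parallel computation needed

print(f"parallel I/O:         {local_io_gb_s} GB/s")
print(f"parallel computation: {compute_gop_s} Gop/s")
```

This reproduces the 10 GB/s parallel I/O and 1000 Gop/s parallel computation figures shown in the slide's system diagram.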

  9. Data Integration: Key Challenges • Of course, familiar issues: data organization, schema definition/mediation, etc., etc. • But also new challenges relating to dynamic, distributed communities • Establishment, negotiation, management, & evolution of multi-organizational federations • And to the sheer number of resources, speed of networks, and volume of data • Coordination, management, provisioning, & monitoring of workflows & required resources

  10. Enter Grid Technologies • Infrastructure (“middleware”) for establishing, managing, and evolving multi-organizational federations • Dynamic, autonomous, domain independent • On-demand, ubiquitous access to computing, data, and services • Mechanisms for creating and managing workflow within such federations • New capabilities constructed dynamically and transparently from distributed services • Service-oriented, virtualization

  11. The Emergence of Open Grid Standards • A progression of increasing functionality & standardization, 1990-2010: • Custom solutions, built on Internet standards • Globus Toolkit: a de facto standard with a single implementation • Open Grid Services Architecture, built on Web services, etc.: real standards, multiple implementations • Managed shared virtual systems: ongoing computer science research

  12. OGSA Structure • A standard substrate: the Grid service • Standard interfaces and behaviors that address key distributed system issues: naming, service state, lifetime, notification • A Grid service is a Web service • … supports standard service specifications • Agreement, data access & integration, workflow, security, policy, diagnostics, etc. • Target of current & planned GGF efforts • … and arbitrary application-specific services based on these & other definitions

  13. Open Grid Services Infrastructure • Every Grid service implements the required GridService interface; other standard interfaces include factory, notification, and collections • Service state is exposed as service data elements, supporting introspection: What port types? What policy? What state? • Lifetime management: explicit destruction and soft-state lifetime • Clients hold a Grid Service Handle, resolved via handle resolution to a Grid Service Reference • Implementations (e.g., data access) run in a hosting environment/runtime (“C”, J2EE, .NET, …)
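To make the OGSI ideas concrete, here is a toy Python model of the per-service state the slide describes: named service data elements, introspection, explicit destruction, and a soft-state lifetime that clients must periodically extend. All class and method names are invented for illustration; this is not the OGSI or Globus API.

```python
import time

class ToyGridService:
    """Toy model (not the real OGSI/Globus API) of per-service state:
    named service data elements, introspection, explicit destruction,
    and a soft-state lifetime that must be refreshed to stay alive."""

    def __init__(self, lifetime_s):
        self._sde = {}                               # service data elements
        self._expires = time.time() + lifetime_s     # soft-state lease

    def set_service_data(self, name, value):
        self._sde[name] = value

    def find_service_data(self, name):
        # Introspection: what port types? what policy? what state?
        return self._sde.get(name)

    def request_lifetime_extension(self, extra_s):
        # Soft-state keep-alive: a client periodically extends the lease;
        # if no client does, the service eventually expires on its own.
        self._expires = max(self._expires, time.time()) + extra_s

    def destroy(self):
        # Explicit destruction, the other lifetime-management path.
        self._expires = 0

    def alive(self):
        return time.time() < self._expires

svc = ToyGridService(lifetime_s=30)
svc.set_service_data("portTypes", ["GridService", "NotificationSource"])
print(svc.find_service_data("portTypes"))
svc.destroy()
print(svc.alive())   # False after explicit destruction
```

The soft-state design means a crashed client cannot leak a service forever: with no lifetime extensions, the lease simply runs out.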

  14. Open Grid Services Infrastructure • Open Grid Services Infrastructure (OGSI), GWD-R (draft-ggf-ogsi-gridservice-23), February 17, 2003, http://www.ggf.org/ogsi-wg • Editors: S. Tuecke (ANL), K. Czajkowski (USC/ISI), I. Foster (ANL), J. Frey (IBM), S. Graham (IBM), C. Kesselman (USC/ISI), D. Snelling (Fujitsu Labs), P. Vanderbilt (NASA) • See also “The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration”, Foster, Kesselman, Nick, Tuecke, 2002

  15. Example: Reliable File Transfer Service • A Grid service through which clients request and manage file transfer operations • Internal state (pending transfers, performance, policy, faults) is exposed as service data elements, which clients query &/or subscribe to • Standard interfaces include a notification source, fault monitor, and performance monitor • The service itself carries out the underlying data transfer operations

  16. OGSA and Data Integration • OGSI provides key enabling mechanisms for distributed data integration • Introspect on distributed system elements • Create and manage distributed state • We need more than OGSI, of course, e.g., • WS-Agreement: negotiate agreements between service provider and consumer • OGSA-DAI: Data Access and Integration • WS-Management: service management • Security and policy

  17. Distributed Virtual Integration Architecture • A layered architecture, from applications down to resources: • Data-intensive X-ology researchers and data-intensive applications for X-ology research • Simulation, analysis & integration technology for X-ology • Generic virtual data access and integration layer: registry, brokering, workflow, job submission, authorisation, banking, resource usage, data transport, transformation, and structured data access & integration (relational, XML, semi-structured) • OGSA and OGSI: interface to Grid infrastructure • Compute, data & storage resources Slide courtesy Malcolm Atkinson, UK eScience Center

  18. Data as Service: OGSA Data Access & Integration • Service-oriented treatment of data appears to have significant advantages • Leverage OGSI introspection, lifetime, etc. • Compatibility with Web services • Standard service interfaces being defined • Service data: e.g., schema • Derive new data services from old (views) • Externalize to e.g. file/database format • Perform queries or other operations

  19. Data Access & Integration Services • 1a. Client asks the Registry for sources of data about “x”; 1b. Registry responds with a Factory handle • 2a. Client asks the Factory for access to a database; 2b. the Factory creates a GridDataService (GDS) to manage access; 2c. the Factory returns the GDS handle to the client • 3a. Client queries the GDS with XPath, SQL, etc.; 3b. the GDS interacts with the XML/relational database; 3c. query results are returned to the client as XML • Interactions are carried over SOAP/HTTP; services are created via the factory API Slide courtesy Malcolm Atkinson, UK eScience Center
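The registry/factory/data-service pattern on this slide can be illustrated with a small in-process Python sketch. All class and method names here are hypothetical stand-ins, not the actual OGSA-DAI interfaces, and the "database" is just a list of rows; the point is the three-step flow: look up a factory, have it create a data service, then query that service.

```python
# In-process sketch of the three-step interaction on the slide. Class and
# method names are hypothetical stand-ins (not the OGSA-DAI interfaces),
# and the "database" is just a list of rows.

class GridDataService:
    def __init__(self, database):
        self.database = database

    def perform(self, predicate):
        # 3a/3b: evaluate the query against the backing database;
        # 3c: return the results (XML in the real system).
        return [row for row in self.database if predicate(row)]

class Factory:
    def __init__(self, database):
        self.database = database

    def create_service(self):
        # 2b/2c: create a GridDataService and return its handle.
        return GridDataService(self.database)

class Registry:
    def __init__(self):
        self.factories = {}

    def register(self, topic, factory):
        self.factories[topic] = factory

    def lookup(self, topic):
        # 1a/1b: respond with the Factory handle for data about `topic`.
        return self.factories[topic]

db = [{"id": 1, "x": 5}, {"id": 2, "x": 50}]
registry = Registry()
registry.register("x", Factory(db))

factory = registry.lookup("x")                 # step 1: find a factory
gds = factory.create_service()                 # step 2: create a data service
print(gds.perform(lambda row: row["x"] > 10))  # step 3: query it
```

The factory step is what gives each client its own transient, stateful service instance, rather than a shared stateless endpoint.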

  20. Globus Toolkit v3 (GT3)Open Source OGSA Technology • Implements and builds on OGSI interfaces • Supports primary GT2 interfaces • Public key authentication • Scalable service discovery • Secure, reliable resource access • High-performance data movement (GridFTP) • Numerous new services included or planned • SLA negotiation, service registry, community authorization, data access & integration, … • Rapidly growing adoption and contributions • E.g., OGSA-DAI from U.K. eScience program

  21. My Presentation 1) Data integration as a new opportunity • Driven by advances in technology & science • The need to discover, access, explore, analyze diverse distributed data sources • Grid technologies as a substrate for essential management functions 2) Science as collaborative workflow • The need to organize, archive, reuse, explain, & schedule scientific workflows • Virtual data as a unifying concept

  22. Science as Workflow • Data integration = the derivation of new data from old, via coordinated computation(s) • May be computationally demanding • The workflows used to achieve integration are often valuable artifacts in their own right • Thus we must be concerned with how we • Build workflows • Share and reuse workflows • Explain workflows • Schedule workflows

  23. Sloan Digital Sky Survey Production System

  24. Virtual Data Concept • Capture and manage information about relationships among • Data (of widely varying representations) • Programs (& their execution needs) • Computations (& execution environments) • Apply this information to, e.g. • Discovery: Data and program discovery • Workflow: Structured paradigm for organizing, locating, specifying, & requesting data • Explanation: provenance • Planning and scheduling • Other uses we haven’t thought of
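A minimal sketch of the virtual-data bookkeeping described above, in hypothetical Python (Chimera's actual catalog and virtual data language are far richer): record transformations and derivations, materialize a dataset on demand only if it does not already exist, and answer provenance queries from the same records.

```python
# Minimal sketch of a virtual data catalog (names hypothetical; Chimera's
# actual catalog and VDL are far richer). It records transformations and
# derivations, materializes data on demand, reuses results that already
# exist, and can explain a dataset's provenance from the same records.

class VirtualDataCatalog:
    def __init__(self):
        self.transformations = {}   # name -> program (a callable here)
        self.derivations = {}       # output dataset -> (transformation, inputs)
        self.data = {}              # datasets that have been materialized

    def define_transformation(self, name, fn):
        self.transformations[name] = fn

    def define_derivation(self, output, transformation, inputs):
        self.derivations[output] = (transformation, inputs)

    def materialize(self, name):
        if name in self.data:                    # result exists: reuse it
            return self.data[name]
        tname, inputs = self.derivations[name]   # else derive it on demand
        args = [self.materialize(i) for i in inputs]
        self.data[name] = self.transformations[tname](*args)
        return self.data[name]

    def provenance(self, name):
        # Explain how a dataset was (or would be) produced.
        if name not in self.derivations:
            return name
        tname, inputs = self.derivations[name]
        return f"{name} = {tname}({', '.join(self.provenance(i) for i in inputs)})"

vdc = VirtualDataCatalog()
vdc.data["raw"] = [3, 1, 2]                               # an existing dataset
vdc.define_transformation("calibrate", sorted)            # stand-in program
vdc.define_derivation("calibrated", "calibrate", ["raw"])
print(vdc.materialize("calibrated"))   # derived on demand -> [1, 2, 3]
print(vdc.provenance("calibrated"))    # calibrated = calibrate(raw)
```

The same records thus serve both uses the slide lists: discovery/reuse ("if the results already exist, I'll save weeks of computation") and explanation of provenance.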

  25. Motivations • “I’ve come across some interesting data, but I need to understand the nature of the corrections applied when it was constructed before I can trust it for my purposes.” • “I’ve detected a calibration error in an instrument and want to know which derived data to recompute.” • “I want to apply an astronomical analysis program to millions of objects. If the results already exist, I’ll save weeks of computation.” • “I want to search an astronomical database for galaxies with certain characteristics. If a program that performs this analysis exists, I won’t have to write one from scratch.” • Underlying model: data are consumed-by/generated-by derivations; each derivation is an execution-of a transformation

  26. Chimera Virtual Data System(www.griphyn.org/chimera) • Virtual data catalog • Transformations, derivations, data • Virtual data language • Catalog definitions • Query tool • Applications include browsers and data analysis applications

  27. Chimera Virtual Data Schema (figure: schema diagram; schema elements describe metadata)

  28. Virtual Data in CMS HEP Analysis • Define a virtual data space for exploration by other scientists • Knowledge capture • Each virtual dataset is identified by its defining parameters, e.g. mass = 200 combined with decay = bb, ZZ, or WW; stability = 1 or 3; event = 8; plot = 1 Work and slide by Rick Cavanaugh and Dimitri Bourilkov, University of Florida

  29. Virtual Data in CMS HEP Analysis • Search for WW decays of the Higgs boson for which only stable, final-state particles are recorded: select datasets with decay = WW and stability = 1 from the virtual data space • Knowledge capture • On-demand data generation • Workload management Work and slide by Rick Cavanaugh and Dimitri Bourilkov, University of Florida
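The selection in this slide sequence is essentially a metadata query over a parameter-keyed virtual data space. A toy Python version follows; the parameter values are taken from the slides, but the code itself is illustrative, not the CMS or Chimera tooling.

```python
# Toy metadata query over a parameter-keyed virtual data space, in the
# spirit of the CMS example. Parameter values come from the slides; the
# code itself is illustrative, not the CMS/Chimera tooling.

catalog = [
    {"mass": 200, "decay": "bb"},
    {"mass": 200, "decay": "ZZ"},
    {"mass": 200, "decay": "WW", "stability": 3},
    {"mass": 200, "decay": "WW", "stability": 1},
    {"mass": 200, "decay": "WW", "stability": 1, "event": 8},
    {"mass": 200, "decay": "WW", "stability": 1, "plot": 1},
]

def select(catalog, **criteria):
    """All virtual datasets whose defining parameters match the criteria."""
    return [d for d in catalog
            if all(d.get(k) == v for k, v in criteria.items())]

# "WW decays for which only stable, final-state particles are recorded":
hits = select(catalog, decay="WW", stability=1)
print(len(hits))   # 3 matching derivations
```

Because datasets are named by their defining parameters rather than by file paths, a scientist can pose the physics question directly, whether or not the matching data have been materialized yet.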

  30. Virtual Data in CMS HEP Analysis • A scientist discovers an interesting result and wants to know how it was derived • Knowledge capture • On-demand data generation • Workload management • Explain provenance Work and slide by Rick Cavanaugh and Dimitri Bourilkov, University of Florida

  31. Virtual Data in CMS HEP Analysis • The scientist adds a new derived-data branch (mass = 200, decay = WW, stability = 1, LowPt = 20, HighPt = 10000) and continues to investigate • Knowledge capture • On-demand data generation • Workload management • Explain provenance • Collaboration Work and slide by Rick Cavanaugh and Dimitri Bourilkov, University of Florida

  32. Virtual Data “Explorations” Can be Long-Lived Computations • Production Run on the Integration Testbed • Simulate 1.5 million full CMS events for physics studies: ~500 sec per event on 850 MHz processor • 2 months continuous running across 5 testbed sites • Managed by a single person at the US-CMS Tier 1
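The quoted figures can be sanity-checked with simple arithmetic: 1.5 million events at ~500 CPU-seconds each is 7.5 × 10^8 CPU-seconds, which over two months of continuous running implies on the order of 145 processors kept busy across the five testbed sites. In Python (assuming "2 months" means roughly 60 days):

```python
# Sanity check of the production-run numbers quoted on the slide.
# Assumes "2 months" means roughly 60 days of continuous running.

events = 1_500_000
sec_per_event = 500                       # ~500 CPU-seconds per event (850 MHz)
cpu_seconds = events * sec_per_event      # total work

wall_seconds = 60 * 24 * 3600             # ~2 months, continuous
cpus_needed = cpu_seconds / wall_seconds  # processors kept busy throughout

print(f"total work: {cpu_seconds:.2e} CPU-seconds")
print(f"~{cpus_needed:.0f} processors busy across the 5 testbed sites")
```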

  33. Virtual Data in Sloan Galaxy Cluster Analysis • A virtual-data DAG transforms Sloan data into a galaxy cluster size distribution Jim Annis, Steve Kent, Vijay Sehkri, Fermilab; Michael Milligan, Yong Zhao, Chicago

  34. Virtual Data in Genome Analysis • Figure 1: GADU data flow (DOESG resource)

  35. Bringing it All Together: A Virtual Data Grid

  36. Also Very Relevant: Workflow & Web Services • WF-Pilot: user-facing workflow design and execution monitoring • WF-Compiler: translates an abstract workflow (AWF) into an executable workflow (EWF) using ET schemas, AAV rules, and conversion rules; performs query rewriting, semantic type checking, data type conversion, and web service matching • WF-Engine: scheduling and execution, invoking web services (e.g., Genbank, BLAST) • Supporting components: Abstract Task (AT) repository, Executable Task (ET) repository, data & parameter ontologies, datatype & conversion repository B. Ludäscher, I. Altintas, A. Gupta, http://kbi.sdsc.edu/SciDAC-SDM/scidac-tn-02-01.pdf

  37. Summary 1) Data integration as a new opportunity • Driven by advances in technology & science • The need to discover, access, explore, analyze diverse distributed data sources • Grid technologies as a substrate for essential management functions 2) Science as collaborative workflow • The need to organize, archive, reuse, explain, and schedule scientific workflows • Virtual data as a unifying concept

  38. For More Information • The Globus Project™ • www.globus.org • Technical articles • www.mcs.anl.gov/~foster • Open Grid Services Arch. • www.globus.org/ogsa • Chimera • www.griphyn.org/chimera • Global Grid Forum • www.ggf.org 2nd Edition: November 2003
