Virtual Data Grid Architecture
Ewa Deelman, Ian Foster, Carl Kesselman, Miron Livny
GriPhyN Summary
The GriPhyN research agenda aims at IT advances that will enable groups of scientists distributed worldwide to harness Petascale processing, communication, and data resources to transform raw experimental data into scientific discoveries. The goals of the GriPhyN project are to achieve the fundamental IT advances required to realize Petascale Virtual Data Grids and to demonstrate, evaluate, and transfer these research results via the creation of a Virtual Data Toolkit to be used by the four major physics experiments and other projects.
Major Points
• Project has two complementary & supporting elements
  • IT research project: will be judged on contributions to knowledge
  • CS/application partnership: will also be judged on successful transfer to experiments
• Two associated unifying concepts
  • Virtual data as the central intellectual concept
  • Toolkit as a central deliverable and technology transfer vehicle
Virtual Data as a Key Intellectual Challenge and Unifying Concept
“These characteristics combine to enable the definition and delivery of a potentially unlimited virtual space of data products derived from other data. In this virtual space, requests can be satisfied via direct retrieval of materialized products and/or computation, with local and global resource management, policy, and security constraints determining the strategy used.”
Virtual Data (cont'd)
“The concept of virtual data recognizes that all except irreproducible raw experimental data need ‘exist’ physically only as the specification for how they may be derived. The grid may materialize zero, one, or many copies of derivable data depending on probable demand and the relative costs of computation, storage, and transport.”
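The trade-off in the quotation above (keep a stored copy of a derivable product, or keep only its recipe and recompute on demand) can be illustrated with a small cost comparison. This is a minimal sketch, not the GriPhyN method: the function name, the cost parameters, and the simple linear cost model are all assumptions made for illustration, and all costs are in the same arbitrary unit.

```python
def should_materialize(expected_requests, compute_cost, storage_cost, transfer_cost):
    """Decide whether storing one copy beats recomputing on every request.

    Hypothetical linear cost model (an assumption, not from the slides):
      - recompute-on-demand: pay compute_cost once per request
      - materialize: pay compute_cost once, storage_cost once,
        and transfer_cost once per request
    """
    recompute_total = expected_requests * compute_cost
    stored_total = compute_cost + storage_cost + expected_requests * transfer_cost
    return stored_total < recompute_total


# A product requested often is worth storing; one requested once is not.
popular = should_materialize(expected_requests=10, compute_cost=5.0,
                             storage_cost=2.0, transfer_cost=0.5)
rare = should_materialize(expected_requests=1, compute_cost=5.0,
                          storage_cost=2.0, transfer_cost=0.5)
```

A real Data Grid would also have to weigh policy and security constraints and the choice among "zero, one, or many" copies; the sketch covers only the zero-versus-one case.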
(Simple) Virtual Data Example
• (LIGO) “Gravitational strain for 2 minutes around each of 200 gamma-ray bursts over the last year”
• For each requested data value, need to
  • Determine if it is materialized; if so, where; if not, how to compute it
  • Plan data movements and computations required to obtain all results
  • Execute this plan
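The three steps above (determine, plan, execute) can be sketched as a small recursive planner. This is an illustrative sketch under stated assumptions, not the project's design: the `Derivation` class, the two dictionaries standing in for virtual data catalogs, and the product and transform names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Derivation:
    """Hypothetical recipe: which transform produces a product, and from what."""
    transform: str
    inputs: tuple


def plan(product, materialized, derivations, steps=None):
    """Build an ordered list of fetch/compute steps for one requested product.

    materialized: dict mapping product name -> location of a stored copy
    derivations:  dict mapping product name -> Derivation
    Both dicts are stand-ins for catalogs. Cyclic derivations are not handled.
    """
    if steps is None:
        steps = []
    if product in materialized:
        # Already materialized: plan a data movement from its location.
        step = ("fetch", product, materialized[product])
    elif product in derivations:
        # Not materialized: first plan every input, then the computation.
        recipe = derivations[product]
        for needed in recipe.inputs:
            plan(needed, materialized, derivations, steps)
        step = ("compute", product, recipe.transform)
    else:
        raise KeyError(f"no stored copy and no derivation for {product!r}")
    if step not in steps:
        steps.append(step)
    return steps


def execute(steps):
    """Stand-in executor: describes each step instead of running it."""
    return [f"{action} {product} ({detail})" for action, product, detail in steps]


# Hypothetical catalog contents for a LIGO-style request.
materialized = {"raw_strain": "site_a"}
derivations = {"strain_window": Derivation("extract_window", ("raw_strain",))}
steps = plan("strain_window", materialized, derivations)
log = execute(steps)
```

Here the planner fetches the stored raw data and then schedules the transform; a request for 200 bursts would call `plan` once per requested window and could share the resulting `steps` list to avoid duplicate fetches.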
GriPhyN Goals
“Explore concept of virtual data and its applicability to data-intensive science,” i.e.,
• Transparency with respect to location
  • Known concept; but how to realize in a large-scale, performance-oriented Data Grid?
• Transparency with respect to materialization
  • To determine: is this useful?
• Automated management of computation
  • Issues of scale, transparency
Primary GriPhyN R&D Components
[Figure. Box labels, top to bottom: Production Team, Individual Investigator, Other Users; Interactive User Tools; Virtual Data Tools, Request Planning and Scheduling Tools, Request Execution Management Tools; Performance Estimation and Evaluation; Resource Management Services, Security and Policy Services, Other Grid Services; Transforms; Raw data source; Distributed resources (code, storage, computers, and network).]
Data Grid Reference Architecture: Purpose
• Identify primary components of a Data Grid architecture (part vocabulary, part requirements definition, part strategy)
• Suggest potential implementation approaches
• Identify principal areas in which uncertainty exists and hence research is required
Observations on Architecture
• We need an architecture so that we can
  • Coordinate our own activities
  • Coordinate with other Data Grid projects
  • Explain to others (experiments, NSF, CS community) what we are doing
• An architecture must:
  • Facilitate CS research activities by simplifying evaluation of alternatives
  • Not preclude experimentation with (radically) alternative approaches
Documents
• A Data Grid Reference Architecture
• Representing Virtual Data: A Catalog Architecture for Location and Materialization Transparency
• Virtual Data Research Challenges
• Requirements documents from CMS, LIGO, SDSS
Data Grid Reference Architecture
[Figure. Box labels: User Applications; Request Formulation; Virtual Data Catalogs; Request Manager; Request Planner; Request Executor; Storage Systems; Code Repositories; Computers; Networks.]
Relationship Between Components
[Figure. Labels: Virtual Data; Data Grids; Grids.]
Layered Grid Architecture
[Figure: Grid layers shown alongside the Internet Protocol Architecture.]
• Application
• User: “Specialized services”: user- or application-specific distributed services
• Collective: “Managing multiple resources”: ubiquitous infrastructure services
• Resource: “Sharing single resources”: negotiating access, controlling use
• Connectivity: “Talking to things”: communication (Internet protocols) & security
• Fabric: “Controlling things locally”: access to, & control of, resources
Internet Protocol Architecture layers, for comparison: Application, Transport, Internet, Link
GriPhyN Data Grid Reference Architecture
• Application: Discipline-Specific Data Grid Application
• Collective: Catalogs, Replica Management, Request Management, Community Policy, …
• Resource: Access to data, access to computers, access to network performance data, …
• Connectivity: Communication, service discovery (DNS), authentication, delegation
• Fabric: Storage Systems, Compute Systems, Networks, Code Repositories, …
Existing Components
• Globus Toolkit
  • MDS-2 information service: access to static & dynamic configuration & state information
  • GRAM resource access protocol
  • GridFTP data access and transfer protocol
  • Replica catalog, replica management
  • Grid Security Infrastructure: single sign-on
• Condor, Condor-G resource management
• SRB catalog services
Globus Data Grid Components
[Figure. Labels: Application; Attribute Specification; Metadata Catalog; Logical Collection and Logical File Name; Replica Catalog; Multiple Locations; Replica Selection; MDS; NWS; Performance Information & Predictions; Selected Replica; GridFTP commands; Disk Cache, Tape Library, Disk Array, Disk Cache; Replica Location 1, Replica Location 2, Replica Location 3.]
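The replica selection step named on this slide (several locations for one logical file name, narrowed to a selected replica using performance predictions) can be sketched as follows. This is a hedged illustration only: it does not use any Globus, MDS, or NWS API, and the function name, the dictionaries standing in for the replica catalog and the predictions, and the transfer-time estimate are all assumptions.

```python
def select_replica(logical_name, replica_catalog, predicted_bandwidth_mbps, size_mb):
    """Pick the replica location with the lowest estimated transfer time.

    replica_catalog:          dict, logical file name -> list of locations
                              (stand-in for a replica catalog lookup)
    predicted_bandwidth_mbps: dict, location -> predicted bandwidth in Mbit/s
                              (stand-in for performance predictions)
    size_mb:                  file size in megabytes
    """
    locations = replica_catalog.get(logical_name, [])
    # Ignore locations with no usable prediction.
    candidates = [loc for loc in locations
                  if predicted_bandwidth_mbps.get(loc, 0) > 0]
    if not candidates:
        raise LookupError(f"no reachable replica for {logical_name!r}")

    def estimated_seconds(loc):
        # megabytes -> megabits, divided by megabits per second
        return size_mb * 8 / predicted_bandwidth_mbps[loc]

    return min(candidates, key=estimated_seconds)


# Hypothetical catalog and predictions for three replica locations.
catalog = {"lfn://example/strain.dat": ["location_1", "location_2", "location_3"]}
bandwidth = {"location_1": 100.0, "location_2": 400.0, "location_3": 50.0}
best = select_replica("lfn://example/strain.dat", catalog, bandwidth, size_mb=2000)
```

With these made-up numbers the second location wins because its predicted bandwidth is highest; in practice the prediction source and the cost function would be richer (load, storage latency for tape versus disk, policy).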
Short-Term (2001) Developments
• Deployment of, and experimentation with, basic tools: data movement, data location, computation management
  • Already started in CMS and LIGO
• Requirements definition for experiments
  • Already started with documents from CMS, LIGO, SDSS
• Virtual data catalog prototype
• Prototyping of other elements TBD
• Work breakdown with EDG, PPDG
Goals for this Meeting
• Identify major areas in which the Data Grid Reference Architecture needs improvement
• Identify how each CS research thrust contributes to this refinement process, and on what schedule
  • Research, software, and/or experiments
• Identify how each application area will contribute to evaluating DGRA ideas
  • Experiments conducted