
Grid Data Management Components






Presentation Transcript


  1. Grid Data Management Components Adam Belloum Computer Architecture & Parallel Systems group University of Amsterdam adam@science.uva.nl

  2. The Problem (application pull) • A new class of applications is emerging in different domains, involving huge collections of data that are geographically distributed and owned by different collaborating organizations • Examples of such applications • The Large Hadron Collider at CERN (2005) • Climate modeling

  3. Requirements of the Grid-based Applications • Efficient data transfer service • Efficient data access service • Reliability and security • Possibility to create and manage multiple copies of the data

  4. Requested Data Management System [Diagram] A distributed application requests data from a Data Management System built from services specific to the Data Grid infrastructure (Replica Selection, Replica Location, Replica Management, Metadata Repository) layered on low-level services shared with other Grid components (Storage System, Transport service, Security).

  5. The Data Grid is then … • The Data Grid is the infrastructure that provides the services required for manipulating geographically distributed, large collections of measured and computed data • security services • replica services • data transfer services • etc. • Design Principles: • Mechanism neutrality • Compatibility with Grid infrastructure • Uniformity of information infrastructure

  6. The Data Grid Architecture [Diagram] High-level components (Replica Selection, Replica Management) are Data Grid-specific services built on core services shared with other Grid components (Resource Management, Metadata Repository, Storage System, Security, Transport service).

  7. Replica services for Data Grid • Possibility to create multiple copies of the data • Efficient and reliable management of the replicas • Efficient replication strategy: Replica Management Service • Location of the replicas: replica location mechanism • Coherence of the replicas: replica consistency mechanism • Selection of the replica: Replica Selection Service • Security of the replicas: secure replica mechanism

  8. Transfer services for Data Grid • Fast mechanisms for large data transfer • Reliable transfer mechanisms • Secure transfer mechanisms • GridFTP

  9. Security services for Data Grid • Authentication • Who can access or view the data? • Authorization • Who is authorized to effectively use the data? • Accounting • Users may be charged for using the data

  10. Replica Management for Data Grid [Diagram] Efficient access to the data sets? Create replicas of the data sets at multiple sites.

  11. Replica Management Problem • When a request for a large file is issued, a considerable amount of bandwidth is required to serve it. The bandwidth available at that time directly affects the latency of access to the requested file. • Solution: replicate files near the potential users (in other domains this is called caching)

  12. What is the replica manager? It is a Grid service responsible for creating complete and partial copies of datasets (mainly collections of files) • Grid data model: • Datasets are stored in files grouped into collections • A replica • is a subset of a collection that is stored on a particular physical storage system Bill Allcock et al. “Data Management and Transfer in High-Performance Computational Grid Environments”

  13. The role of the replica manager service • Its purpose is to map a logical file name to a physical name for the file on a specific storage system Note: it does not use any semantic information contained in the logical file names

  14. Services relevant to the Replica Manager [Diagram] From top to bottom: • Applications: particle physics, climate modeling, etc. • High-level services: Replica Management service, Replica Selection service, Metadata services, Distributed Catalog service, Information services • Management protocols: storage, catalog, network, compute • Core services: communications, service discovery (DNS), authentication, delegation, … • Resources: storage systems, networks, compute systems, replica catalog, metadata catalog

  15. Framework of the replica manager service • Separation of replication and metadata information • Only the information needed to map logical file names to physical locations is considered • Replication semantics • The replicas are not guaranteed to be coherent • The information on the original copy is not saved • Replica management consistency • The replica manager is able to recover and return to a consistent state

  16. Replica Management Targets • Replica management should answer the following questions: • Which files should be replicated? • static files, large files, … • When should a replica be created? • for frequently accessed files, … • Where should the replicas be located? • close to users, on fast storage systems, …

  17. Replication Strategy [Cartoon] How should I replicate the data? You need a replication strategy.

  18. Simple Dynamic Replication Strategies • Best Client • Files are replicated at the node where they are most frequently requested • Cascading replication • Replicas are created each time a request threshold is reached, starting from the original node (root) and following the hierarchy of the nodes • Plain caching • Files are stored locally at the client side • Fast Spread • Files are stored on each node of the path to the destination • Caching plus cascading replication
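The Best Client strategy above can be sketched in a few lines. This is a toy illustration only; the node names, request log, and threshold are invented for the example:

```python
from collections import Counter

def best_client(request_log, threshold):
    """Best-client strategy: replicate a file at every node whose
    request count for that file reaches the threshold."""
    counts = Counter(request_log)              # node -> number of requests seen
    return {node for node, n in counts.items() if n >= threshold}

# Toy request log: node n1 asks for the file three times, n2 and n3 once each.
print(sorted(best_client(["n1", "n2", "n1", "n3", "n1"], threshold=3)))  # ['n1']
```

Cascading replication works the same way per level of the node hierarchy: when a node's counter crosses the threshold, the replica is pushed one level down toward the clients.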

  19. Dynamic Model-Driven Replication The decisions of whether to replicate a file and where to locate the replicas are taken following a performance model that compares the costs and the benefits of creating replicas of a particular file at certain locations • Single-system stability • Transfer time between nodes • Storage cost • Accuracy of the replica location mechanism • Etc. Kavitha R. et al. “Improving Data Availability through Dynamic Model-Driven Replication in Large Peer-to-Peer Communities”

  20. Dynamic Model-Driven Replication • The model-driven approach tries to answer critical questions: • What is the optimal number of replicas for a given file? • Which is the best location for the replicas? • When does a file need to be replicated?

  21. Number of replicas for a file • Is defined for a given target availability • Proposed model: RLacc · (1 − (1 − p)^r) ≥ Avail • Where • p: the probability that a node is up • RLacc: the accuracy of the location mechanism • Avail: the required availability • r: the number of replicas Kavitha R. et al. “Improving Data Availability through Dynamic Model-Driven Replication in Large Peer-to-Peer Communities”
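The smallest number of replicas satisfying this model can be found by solving RLacc · (1 − (1 − p)^r) ≥ Avail for r. A minimal sketch, with numeric values invented for illustration:

```python
import math

def min_replicas(p, rl_acc, avail):
    """Smallest r such that rl_acc * (1 - (1 - p)**r) >= avail.

    p: probability that a node is up (0 < p < 1)
    rl_acc: accuracy of the replica location mechanism
    avail: required availability
    """
    if avail >= rl_acc:
        # Even infinitely many replicas cannot exceed rl_acc.
        raise ValueError("unreachable target: need Avail < RLacc")
    # Rearranged: (1 - p)**r <= 1 - avail / rl_acc, then take logs.
    r = math.ceil(math.log(1 - avail / rl_acc) / math.log(1 - p))
    return max(r, 1)

# Example: nodes up 90% of the time, location accuracy 0.99, target 0.95.
print(min_replicas(p=0.9, rl_acc=0.99, avail=0.95))  # 2
```

With two replicas, 0.99 · (1 − 0.1²) = 0.9801 ≥ 0.95, while a single replica gives only 0.891.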

  22. Best location for the replicas • A query to the Discovery service returns a number of nodes (candidates for replication) which: • do not contain a copy of the file • have available storage • and have a reasonable response time • The best candidates should maximize the difference between: • the replication benefit (as high as possible) • and the replication cost (as low as possible) Kavitha R. et al. “Improving Data Availability through Dynamic Model-Driven Replication in Large Peer-to-Peer Communities”

  23. Best location for the replicas • Replication cost: S(F, N2) + trans(F, N1, N2) where • N1: node that currently contains the file • N2: candidate for a new replica • S(F, N): storage cost for a file F at node N • trans(F, a, b): transfer cost between locations a and b • The benefit of creating a replica is trans(F, N1, User) − trans(F, N2, User) Kavitha R. et al. “Improving Data Availability through Dynamic Model-Driven Replication in Large Peer-to-Peer Communities”
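The cost/benefit comparison above can be sketched as follows; the node names, transfer times, and storage costs are invented for illustration:

```python
# Toy per-link transfer costs (seconds) and per-node storage costs for one file.
TRANS = {("n1", "user"): 120, ("n2", "user"): 20, ("n3", "user"): 60,
         ("n1", "n2"): 30, ("n1", "n3"): 15}
STORE = {"n2": 10, "n3": 5}

def net_gain(candidate, holder="n1", user="user"):
    """benefit - cost of replicating the file from holder to candidate:
    benefit = trans(F, N1, user) - trans(F, N2, user)
    cost    = S(F, N2) + trans(F, N1, N2)"""
    benefit = TRANS[(holder, user)] - TRANS[(candidate, user)]
    cost = STORE[candidate] + TRANS[(holder, candidate)]
    return benefit - cost

best = max(["n2", "n3"], key=net_gain)
print(best, net_gain(best))  # n2 60
```

Here n2 wins: it saves 100 s per user access at a one-off cost of 40, while n3 saves only 60 s at a cost of 20.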

  24. Replica Catalog [Cartoon] How do I keep track of the replicas? Create a catalog.

  25. The Replica Catalog The Replica Catalog is a key component of the replica management service; it provides the mapping between logical and physical entities. The Replica Catalog registers three types of entities: • Logical collections: represent a number of logical file names • Locations: map a logical collection to a particular physical instance of that collection • Logical files: represent a unique logical file name

  26. Replica Catalog [Diagram] Example catalog contents: a logical collection (filenames Jan 1998, Feb 1998, Mar 1998, Jun 1998, etc.) with a location entry (protocol: GridFTP, hostname: jupiter.isi.edu, path: nfs/v6/climate) and logical file entries carrying attributes (e.g. Jan 1998, size: 1468762).

  27. Operations allowed on the Replica Catalog • Publish (file_publish) • copies a file from a storage system not registered in the replica catalog to a registered storage system and updates the replica catalog. • Copy (file_copy) • copies a file from a registered storage system to another registered storage system and updates the replica catalog. This creates the replicas. • Delete (file_delete) • deletes a filename from a replica catalog location entry and optionally removes the file from the registered storage system.
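The three operations can be sketched with a minimal in-memory catalog. This is a toy model of the idea, not the Globus API; all names (class, methods, sites, LFN) are invented:

```python
class ReplicaCatalog:
    """Minimal in-memory sketch of a replica catalog: maps a logical
    file name (LFN) to the set of registered storage systems holding it."""

    def __init__(self, registered_storage):
        self.storage = set(registered_storage)   # storage systems the catalog knows
        self.entries = {}                        # LFN -> set of storage system names

    def publish(self, lfn, dest):
        # file_publish: copy from an unregistered source onto a registered
        # storage system and record the first catalog entry.
        assert dest in self.storage, "destination must be registered"
        self.entries.setdefault(lfn, set()).add(dest)

    def copy(self, lfn, src, dest):
        # file_copy: replicate between two registered storage systems.
        assert src in self.entries.get(lfn, set()), "source holds no copy"
        assert dest in self.storage, "destination must be registered"
        self.entries[lfn].add(dest)

    def delete(self, lfn, location):
        # file_delete: drop one location entry for the LFN.
        self.entries.get(lfn, set()).discard(location)

cat = ReplicaCatalog({"siteA", "siteB"})
cat.publish("climate/jan1998", "siteA")
cat.copy("climate/jan1998", "siteA", "siteB")
print(sorted(cat.entries["climate/jan1998"]))  # ['siteA', 'siteB']
```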

  28. Replica management recovery • At least two functions are required to restart the replica manager after a failure: • restart • rollback

  29. Replica Location Service [Cartoon] Where did I put the replicas?! You need a replica location service.

  30. Replica Location Service [Diagram] Layered data services: application-oriented data services on top of the data management services (Reliable Replication Services, Metadata Service, File Transfer Service), with the Replica Location Service underneath. Ann Chervenak et al. “Giggle: A Framework for Constructing Scalable Replica Location Services”

  31. Role of the replica location service The main task of the Replica Location Service is to find a specified number of Physical File Names (PFNs) given a Logical File Name (LFN) • The minimal set of required properties is • autonomy • best-effort consistency • adaptiveness

  32. Distributed, Adaptive Replica Location Service [Diagram] Replica location nodes answer queries based on LFNs, forward queries to other nodes, and distribute digests over an overlay network using soft state; storage sites register and delete (LFN, PFN) pairs. Matei Ripeanu & Ian Foster “Decentralized, Adaptive Replica Location Service”
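Soft state, where registered (LFN, PFN) mappings expire unless the storage site refreshes them, can be sketched like this. It is a toy model of the mechanism, not the real service; the TTL, names, and URLs are invented:

```python
import time

class SoftStateIndex:
    """Sketch of soft-state registration: (LFN, PFN) mappings expire
    unless the storage site re-registers them before the TTL runs out."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.mappings = {}   # (lfn, pfn) -> expiry timestamp

    def register(self, lfn, pfn, now=None):
        now = time.time() if now is None else now
        self.mappings[(lfn, pfn)] = now + self.ttl   # refresh extends the lease

    def lookup(self, lfn, now=None):
        now = time.time() if now is None else now
        # Drop expired entries lazily, then return the live PFNs for the LFN.
        self.mappings = {k: t for k, t in self.mappings.items() if t > now}
        return [pfn for (l, pfn) in self.mappings if l == lfn]

idx = SoftStateIndex(ttl_seconds=30)
idx.register("lfn1", "gsiftp://siteA/f1", now=0)
idx.register("lfn1", "gsiftp://siteB/f1", now=0)
print(len(idx.lookup("lfn1", now=10)))   # 2 (both leases still live)
print(idx.lookup("lfn1", now=40))        # [] (both expired, never refreshed)
```

The appeal of soft state is that a crashed storage site needs no explicit deregistration: its stale mappings simply time out.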

  33. Replica Selection Service [Cartoon] Many replicas exist across the Grid; you need a replica selection service.

  34. The Problem of Replica Selection • An application that requires access to replicated data: • queries a specific metadata repository; • the logical file names identify the existence of replicas; • the application requires access to the most appropriate replica (according to specific characteristics). • This task is achieved by replica selection. Sudharshan Vazhkudai “Replica Selection in the Globus Data Grid”

  35. The role of replica selection • Replica selection is the process of choosing a replica from among those spread across the Grid, based on some characteristics specified by the application: • access speed • geographical location • access cost • etc. Sudharshan Vazhkudai “Replica Selection in the Globus Data Grid”

  36. A data selection scenario [Diagram] (1) The application sends the attributes of the desired data to the Replica Selection Service; (2) the service forwards them to the Metadata Service and (3) receives the corresponding logical file names; (4) it queries the Replica Management Service and (5) receives the locations of one or more replicas; (6) it submits the candidate source/destination transfers to the Information Service and (7) receives performance measurements and predictions; (8) it returns the location of the selected replica to the application. Sudharshan Vazhkudai “Replica Selection in the Globus Data Grid”

  37. How does replica selection achieve its goals? [Diagram] Using the core services (Replica Management, Resource Management, Metadata Repository, Storage System): (1) locate the replicas; (2) get the capability and usage policy of all the replicas; (3) search for the replicas that match the application's characteristics. Sudharshan Vazhkudai “Replica Selection in the Globus Data Grid”
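The locate/inspect/match steps reduce to a scoring problem over the located replicas. A sketch under invented assumptions (the replica attributes, the crude size/bandwidth transfer estimate, and the single cost constraint are all illustrative):

```python
def select_replica(replicas, app_needs):
    """Keep the replicas whose usage policy satisfies the application's
    constraints, then rank the rest by estimated transfer time."""
    def estimate(r):
        return r["size_mb"] / r["bandwidth_mbps"]   # crude transfer-time estimate
    eligible = [r for r in replicas if r["access_cost"] <= app_needs["max_cost"]]
    return min(eligible, key=estimate)

replicas = [
    {"pfn": "gsiftp://siteA/f", "size_mb": 800, "bandwidth_mbps": 40, "access_cost": 1},
    {"pfn": "gsiftp://siteB/f", "size_mb": 800, "bandwidth_mbps": 100, "access_cost": 5},
]
# siteB would be faster, but its access cost violates the application's limit.
print(select_replica(replicas, {"max_cost": 3})["pfn"])  # gsiftp://siteA/f
```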

  38. The replica selection • Two services are necessary to the replica selection service: • Replica Management Service (high-level service) • provides information on all existing replicas • Resource Management (core service) • provides information on the characteristics of the underlying resources. Sudharshan Vazhkudai “Replica Selection in the Globus Data Grid”

  39. Metacomputing Directory Service [Diagram] An information collection, publication, and access service for Grid resources: • Grid Resource Information Server (GRIS): runs at each resource (e.g. a storage resource); collects and publishes system configuration metadata, handles security and state propagation, and dynamically generates information • Grid Index Information Service (GIIS): registers GRISs and supports broad user queries; GIISs can themselves be organized hierarchically. Sudharshan Vazhkudai “Replica Selection in the Globus Data Grid”

  40. Storage Broker [Cartoon] Replicas are registered in the replica catalog; you need a storage broker service to search, match, and access them.

  41. Matching Problem • The matching process depends on: • the physical characteristics of the resources and the load on the CPUs, networks, and storage devices that are part of the end-to-end path linking possible sources and sinks • These factors are very dynamic (they can change dramatically over time) • A predictor is needed to estimate future usage Sudharshan Vazhkudai “Predicting the Performance of Wide Area Data Transfers”

  42. Intelligent Matching Process • Have the replica location service expose performance information about: • previous data transfers, which can be used to predict future behaviour between sites • Prediction of end-to-end system performance: • create a model of each system component involved in the end-to-end data transfer (CPU, cache hits, disk access, network, …) • use observations of past applications across the entire system. Sudharshan Vazhkudai “Predicting the Performance of Wide Area Data Transfers”

  43. Collecting the observations • Tools: NWS, NetLogger, Web100, iperf, NetPerf • Experience has shown a substantial difference in performance between a small network probe (64 KB) and the actual data transfer (GridFTP) • From logs of past applications • The sporadic nature of large data transfers means that often there is no data available about current conditions Sudharshan Vazhkudai “Predicting the Performance of Wide Area Data Transfers”
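One simple way to exploit such logs, rather than trusting a small probe, is a moving-average predictor over the most recent observed transfers. A sketch; the throughput numbers and window size are invented:

```python
def predict_throughput(history, window=5):
    """Predict the next transfer's throughput as the mean of the most
    recent observations (a simple moving-average predictor)."""
    recent = history[-window:]
    return sum(recent) / len(recent)

# Observed end-to-end GridFTP throughputs (MB/s) from past application logs.
logs = [4.1, 3.8, 5.0, 4.6, 4.9, 5.2]
print(round(predict_throughput(logs), 2))  # 4.7
```

Richer predictors (weighted averages, regression on file size) follow the same pattern: fit on past end-to-end transfers, not on probe traffic.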

  44. Summary • Two services were not discussed in this course: the security service and the data transfer service • A number of techniques for replica management have not been addressed: • replica location using small-world models • systems for representing, querying, and automating data derivation • Most of the topics not addressed in this course are covered by documents available at www.globus.org/research/papers.html
