
Data Management Services in GT2 and GT3

Presentation Transcript


  1. Data Management Services in GT2 and GT3

  2. Requirements for Grid Data Management • Terabytes or petabytes of data • Often read-only data, “published” by experiments • Other systems need to maintain data consistency • Large data storage and computational resources shared by researchers around the world • Distinct administrative domains • Respect local and global policies governing how resources may be used • Access raw experimental data • Run simulations and analysis to create “derived” data products Data Management

  3. Requirements for Grid Data Management (Cont.) • Locate data • Record and query for existence of data • Data access based on metadata • High-level attributes of data • Support high-speed, reliable data movement • E.g., for efficient movement of large experimental data sets • Support flexible data access • E.g., databases, hierarchical data formats (HDF), aggregation of small objects • Data Filtering • Process data at storage system before transferring Data Management

  4. Requirements for Grid Data Management (Cont.) • Planning, scheduling and monitoring execution of data requests and computations • Management of data replication • Register and query for replicas • Select the best replica for a data transfer • Security • Protect data on storage systems • Support secure data transfers • Protect knowledge about existence of data • Virtual data • Desired data may be stored on a storage system (“materialized”) or created on demand Data Management

  5. Functional View of Grid Data Management [Diagram: an Application uses a Metadata Service (location based on data attributes), a Replica Location Service (location of one or more physical replicas) and Information Services (state of grid resources, performance measurements and predictions), governed by Security and Policy; a Planner handles data location, replica selection and selection of compute and storage nodes, and an Executor initiates data transfers and computations against Data Movement, Data Access, Compute Resources and Storage Resources] Data Management

  6. Architecture Layers Collective 2: Services for coordinating multiple resources that are specific to an application domain or virtual organization (e.g., Authorization, Consistency, Workflow) Collective 1: General services for coordinating multiple resources (e.g., RLS, MCS, RFT, Federation, Brokering) Resource: sharing single resources (e.g., GridFTP, SRM, DBMS) Connectivity (e.g., TCP/IP, GSI) Fabric (e.g., storage, compute nodes, networks) Data Management

  7. Outline: Data Services for Grids • The Replica Location Service (RLS) • A distributed registry of replicas for data discovery; maintains mappings between logical names for data and physical locations of replicas • The Metadata Catalog Service (MCS) • A catalog that associates descriptive attributes (metadata) that describe data items with logical names for data items • The GridFTP data transport protocol • Extends the basic FTP protocol to provide parallel transfers, striped transfers, grid security, third-party transfers, control of TCP buffer sizes • The Reliable File Transfer (RFT) service • A grid service (extension of a web service) that maintains state about outstanding transfers and is able to retry and restart after client failures Data Management

  8. Replica Management in Grids • Data intensive applications • Produce Terabytes or Petabytes of data • Replicate data at multiple locations • Fault tolerance • Performance: avoid wide area data transfer latencies, achieve load balancing • Issues: • Locating replicas of desired files • Creating new replicas • Scalability • Reliability Data Management

  9. A Replica Location Service • A Replica Location Service (RLS) is a distributed registry service that records the locations of data copies and allows discovery of replicas • Maintains mappings between logical identifiers and target names • Physical targets: Map to exact locations of replicated data • Logical targets: Map to another layer of logical names, allowing storage systems to move data without informing the RLS • RLS was designed and implemented in a collaboration between the Globus project and the DataGrid project Data Management

  10. Replica Location Indexes and Local Replica Catalogs [Diagram: RLI index nodes aggregating a set of Local Replica Catalogs] • LRCs contain consistent information about logical-to-target mappings on a site • RLI nodes aggregate information about LRCs • Soft state updates from LRCs to RLIs: relaxed consistency of index information, used to rebuild index after failures • Arbitrary levels of RLI hierarchy Data Management

  11. Giggle: A Replica Location Service Framework • We define a flexible RLS framework • Allows users to make tradeoffs among: • consistency • space overhead • reliability • update costs • query costs • By different combinations of 5 essential elements, the framework supports a variety of RLS designs Data Management

  12. A Flexible RLS Framework Five elements: 1. Consistent Local State: Records mappings between logical names and target names and answers queries 2. Global State with relaxed consistency: Global index supports discovery of replicas at multiple sites; relaxed consistency 3. Soft state mechanisms for maintaining global state: LRCs send information about their mappings (state) to RLIs using soft state protocols 4. Compression of state updates (optional): reduce communication, CPU and storage overheads 5. Membership service: for location of participating LRCs and RLIs and dealing with changes in membership Data Management

  13. 1. Consistent Local State: Local Replica Catalog • Maintains consistent information about replicas at a single replica site (may aggregate multiple storage resources) • Contains mappings between logical names and target names • Answers queries: • What target names are associated with a logical name? • What logical names are associated with a target name? • Associates user-defined attributes with logical and target names and mappings • Sends soft state updates describing LRC mappings to global index nodes Data Management
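To make the two query directions concrete, here is a minimal in-memory sketch of an LRC's mapping tables in Java. It is illustrative only: the real LRC stores these mappings in a relational database (MySQL or PostgreSQL, as described later) and also handles attributes and soft state updates, and all class and identifier names below are invented.

  import java.util.*;

  // Minimal in-memory sketch of an LRC's logical-name <-> target-name mappings.
  // Illustrative only; the actual LRC backs these tables with a relational database.
  public class LocalReplicaCatalogSketch {
      private final Map<String, Set<String>> lfnToTargets = new HashMap<>();
      private final Map<String, Set<String>> targetToLfns = new HashMap<>();

      // Register a mapping between a logical name and a target name.
      public void add(String lfn, String target) {
          lfnToTargets.computeIfAbsent(lfn, k -> new HashSet<>()).add(target);
          targetToLfns.computeIfAbsent(target, k -> new HashSet<>()).add(lfn);
      }

      // Query: what target names are associated with a logical name?
      public Set<String> targetsFor(String lfn) {
          return lfnToTargets.getOrDefault(lfn, Collections.emptySet());
      }

      // Query: what logical names are associated with a target name?
      public Set<String> lfnsFor(String target) {
          return targetToLfns.getOrDefault(target, Collections.emptySet());
      }

      public static void main(String[] args) {
          LocalReplicaCatalogSketch lrc = new LocalReplicaCatalogSketch();
          lrc.add("lfn:experiment-run-42", "gsiftp://storage1.example.org/data/run42.dat");
          lrc.add("lfn:experiment-run-42", "gsiftp://storage2.example.org/data/run42.dat");
          System.out.println(lrc.targetsFor("lfn:experiment-run-42"));
          System.out.println(lrc.lfnsFor("gsiftp://storage1.example.org/data/run42.dat"));
      }
  }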

  14. 2. Global State with Relaxed Consistency: Replica Location Index • Require a global index to support discovery of replicas at multiple sites • Consists of a set of one or more Replica Location Index nodes (RLIs) • Each RLI must: • Contain mappings between logical names and LRCs • Accept periodic state updates from LRCs • Answer queries for mappings associated with a logical name • Implement timeouts of information stored in the index • Global index has relaxed consistency • RLIs are not required to maintain persistent state Data Management

  15. 2. The Replica Location Index (Cont.) Can construct a wide range of index configurations by varying framework parameters: • Number of RLIs • Redundancy of RLIs • Can guarantee that all LRCs send soft state updates to at least n RLIs • Partitioning of RLIs • Divide the logical file namespace or storage systems among RLIs Data Management
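The following sketch illustrates how the redundancy and partitioning parameters might interact, assuming a simple scheme that hashes each logical name and sends its soft state updates to n consecutive RLIs in a configured list. This is one hypothetical policy among many the framework allows (partitioning by storage site or by logical collection are others); the RLI addresses and class names are invented.

  import java.util.*;

  // Sketch: choose which RLIs receive soft state updates for a given logical name.
  // Hashing the logical name partitions the namespace; sending to n successive RLIs
  // provides redundancy. Purely illustrative of the framework parameters.
  public class RliPartitioner {
      private final List<String> rlis;   // RLI node addresses (hypothetical)
      private final int redundancy;      // each LFN is indexed by at least n RLIs

      public RliPartitioner(List<String> rlis, int redundancy) {
          this.rlis = rlis;
          this.redundancy = Math.min(redundancy, rlis.size());
      }

      public List<String> rlisFor(String lfn) {
          int start = Math.floorMod(lfn.hashCode(), rlis.size());
          List<String> chosen = new ArrayList<>();
          for (int i = 0; i < redundancy; i++) {
              chosen.add(rlis.get((start + i) % rlis.size()));
          }
          return chosen;
      }

      public static void main(String[] args) {
          RliPartitioner p = new RliPartitioner(
              Arrays.asList("rls://rli1.example.org", "rls://rli2.example.org",
                            "rls://rli3.example.org"), 2);
          System.out.println(p.rlisFor("lfn:experiment-run-42"));
      }
  }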

  16. An RLS with No Redundancy, Partitioning of Index by Storage Sites [Diagram: each Local Replica Catalog sends updates to exactly one of the Replica Location Indexes, partitioning the index by storage site with no redundancy] Data Management

  17. An RLS with Redundancy Data Management

  18. 3. Soft State Mechanisms for Maintaining Global State • LRCs send information about their mappings (state) to RLIs using soft state protocols • Soft state: information times out and must be periodically refreshed • Advantages of soft state mechanisms: • Stale information in RLIs removed implicitly via timeouts • RLIs need not maintain persistent state: can reconstruct state from soft state updates • Some delay in propagating changes in LRC state to RLIs • Provides relaxed consistency • Soft state update strategies: • Complete state or incremental updates • Send immediately after LRC state changes or periodically Data Management
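A minimal sketch of the RLI side of such a soft state protocol, assuming the index maps each logical name to the LRCs that last reported it, with a per-entry expiry time: entries that are not refreshed before the timeout are simply ignored, so stale state disappears without explicit deletion. The real RLS server's data structures and configurable timeouts differ; the names here are invented.

  import java.util.*;
  import java.util.concurrent.ConcurrentHashMap;

  // Sketch of an RLI index built purely from soft state updates.
  // Each (lfn, lrc) entry expires unless the LRC refreshes it in time.
  public class SoftStateIndex {
      private final long timeoutMillis;
      // lfn -> (lrc address -> expiry timestamp)
      private final Map<String, Map<String, Long>> index = new ConcurrentHashMap<>();

      public SoftStateIndex(long timeoutMillis) { this.timeoutMillis = timeoutMillis; }

      // Called when a soft state update listing this LFN arrives from an LRC.
      public void refresh(String lfn, String lrc) {
          index.computeIfAbsent(lfn, k -> new ConcurrentHashMap<>())
               .put(lrc, System.currentTimeMillis() + timeoutMillis);
      }

      // Query: which LRCs currently claim to hold mappings for this LFN?
      public Set<String> lrcsFor(String lfn) {
          Map<String, Long> entries = index.getOrDefault(lfn, Collections.emptyMap());
          long now = System.currentTimeMillis();
          Set<String> live = new HashSet<>();
          for (Map.Entry<String, Long> e : entries.entrySet()) {
              if (e.getValue() > now) live.add(e.getKey());   // expired entries are ignored
          }
          return live;
      }
  }

In this sketch an LRC would call refresh() for every logical name in each update it sends, and rebuilding the index after an RLI failure amounts to waiting for the next round of soft state updates.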

  19. 4. Compression of State Updates • Optional mechanism for reducing: • communication requirements for state updates • storage requirements on RLIs • Compression options: • Hash digest techniques (e.g., Bloom filters) • Use structural or semantic information in logical names (e.g., logical collection names) • Others • Lossy compression may lose accuracy about mappings, e.g., with Bloom filters: • Small probability of false positives on RLI queries • Lose the ability to do wildcard searches on logical names in RLIs Data Management
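The Bloom filter idea behind the hash digest option can be sketched as follows: hash every logical name in the LRC into a fixed-size bitmap, ship only the bitmap, and let the RLI answer membership queries against it. This is the generic textbook construction, not the Globus code; the bitmap size and hash scheme below are arbitrary choices for illustration.

  import java.util.BitSet;

  // Generic Bloom filter sketch: a compressed, lossy summary of the set of logical
  // names registered in an LRC. Membership tests may yield false positives but never
  // false negatives; wildcard queries are impossible against the bitmap.
  public class BloomFilterSketch {
      private final BitSet bits;
      private final int size;
      private final int numHashes;

      public BloomFilterSketch(int size, int numHashes) {
          this.bits = new BitSet(size);
          this.size = size;
          this.numHashes = numHashes;
      }

      // Derive several hash values from one string (illustrative scheme only).
      private int hash(String lfn, int i) {
          return Math.floorMod(lfn.hashCode() * 31 + i * 0x9E3779B9, size);
      }

      public void add(String lfn) {
          for (int i = 0; i < numHashes; i++) bits.set(hash(lfn, i));
      }

      public boolean mightContain(String lfn) {
          for (int i = 0; i < numHashes; i++) {
              if (!bits.get(hash(lfn, i))) return false;   // definitely absent
          }
          return true;   // present, or a false positive
      }

      public static void main(String[] args) {
          BloomFilterSketch filter = new BloomFilterSketch(10_000_000, 7);
          filter.add("lfn:experiment-run-42");
          System.out.println(filter.mightContain("lfn:experiment-run-42")); // true
          System.out.println(filter.mightContain("lfn:no-such-file"));      // almost certainly false
      }
  }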

  20. 5. Membership Service Used for the following: • Locating participating LRCs and RLIs • Keeping track of which servers send and receive soft state updates from one another • Dealing with changes in membership (an RLI leaves or joins): • Membership service notifies LRCs of the change in the RLI(s) to which they send state • May repartition LFNs among the set of RLIs • (Currently only static membership configuration is provided) Data Management

  21. Replica Location Service In Context • The Replica Location Service is one component in a layered data management architecture • Provides a simple, distributed registry of mappings • Consistency management provided by higher-level services Data Management

  22. Components of RLS Implementation • Front-End Server • Multi-threaded • Supports GSI authentication • Common implementation for LRC and RLI • Back-End Server • MySQL or PostgreSQL relational database • Holds logical name to target name mappings • Client APIs: C and Java • Client command line tool Data Management

  23. Implementation Features • Two types of soft state updates from LRCs to RLIs • Complete list of logical names registered in the LRC • Bloom filter summaries of LRC contents • Immediate mode • When active, sends updates of new entries after 30 seconds (default) or after 100 updates • User-defined attributes • May be associated with logical or target names • Partitioning (without Bloom filters) • Divide LRC soft state updates among RLI index nodes using pattern matching of logical names • Currently, static configuration only Data Management
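A sketch of the immediate-mode rule stated above, using the defaults of 30 seconds or 100 new entries, whichever is reached first. The flush method is a placeholder for the actual update sent to the RLI, and the class is invented for illustration; the real server implements this logic internally.

  import java.util.ArrayList;
  import java.util.List;

  // Sketch of immediate-mode update batching: new LRC entries are queued and flushed
  // to the RLI either after a time threshold or a count threshold, whichever is hit first.
  public class ImmediateModeBatcher {
      private static final int MAX_BATCH = 100;          // default count threshold
      private static final long MAX_DELAY_MS = 30_000;   // default 30-second threshold

      private final List<String> pending = new ArrayList<>();
      private long oldestPendingMs = -1;

      public synchronized void recordNewMapping(String lfn) {
          if (pending.isEmpty()) oldestPendingMs = System.currentTimeMillis();
          pending.add(lfn);
          maybeFlush();
      }

      public synchronized void maybeFlush() {
          boolean tooMany = pending.size() >= MAX_BATCH;
          boolean tooOld = !pending.isEmpty()
                  && System.currentTimeMillis() - oldestPendingMs >= MAX_DELAY_MS;
          if (tooMany || tooOld) {
              sendToRli(new ArrayList<>(pending));   // placeholder for the real update call
              pending.clear();
          }
      }

      private void sendToRli(List<String> batch) {
          System.out.println("sending " + batch.size() + " new mappings to RLI");
      }
  }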

  24. Installing the LRC and RLI • First requires installing the underlying database • PostgreSQL, MySQL • For each of these, must install both database and ODBC driver • See RLS installation guide for instructions on RLS server installation • Requires latest Globus Packaging Toolkit (GPT) • Source and binary bundles • Clients • C • Java (JNI wrapper, native Java client in progress) • Command line client tool Data Management

  25. RLS Server and Soft State Update Configuration • RLS server configuration • Whether an LRC or RLI or both • If LRC, configure • Method of soft state update to send (stored in database, set via command line tool) • May send updates of different types to different RLIs • Frequency of soft state updates (in config file) • If RLI, configure • Method of soft state update to accept (in config file) • Can configure RLS server to act as a service provider to the MDS (Monitoring and Discovery Service) Data Management

  26. Configuring Soft State Updates (Cont.) • LFN List • Send the list of logical names stored in the LRC • Can do exact and wildcard searches on the RLI • RLI must maintain a database and update it whenever a new soft state update arrives • Soft state updates get increasingly expensive (space, network transfer time, CPU time on the RLI to update the RLI DB) as the number of LRC entries increases • E.g., with 1 million entries, it takes 20 minutes to update MySQL on a dual-processor 2 GHz machine (CPU-limited in this case) Data Management

  27. Configuring Soft State Updates (Cont.) • Bloom filters • Construct a summary of LRC state by hashing logical names, creating a bitmap • Compression • Updates much smaller and faster • Can be stored in memory on the RLI, no database needed • E.g., with 1 million entries, an update takes less than 1 second • Supports a higher query rate • Small probability of false positives (lossy compression) • Lose the ability to do wildcard queries Data Management
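For reference, the false positive probability of a standard Bloom filter with m bits, k hash functions and n inserted logical names is approximately

  p \approx \left(1 - e^{-kn/m}\right)^{k}

As a worked example (not an RLS measurement): summarizing n = 1,000,000 names in m = 10,000,000 bits (about 1.2 MB) with k = 7 hash functions gives p of roughly 0.008, i.e. under one false positive per hundred queries for names that are not actually registered.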

  28. Configuring Soft State Updates (Cont.) • Whether or not to use immediate mode • Sends updates after 30 seconds (configurable) or after a fixed number of updates (100 by default) • Full updates are sent at a reduced rate • Immediate mode usually sends less data, because full updates are less frequent • Tradeoffs depend on the volatility of the data • Usually advantageous; an exception would be the initial loading of a large database • Frequency of updates • Tradeoff between fast updates of the RLI and allowing some inconsistency between LRC and RLI content Data Management

  29. Wide Area Complete Soft State Update Performance • LRCs in Geneva and Pisa updating RLI at Glasgow • Full soft state updates quite slow for large databases, dominated by update costs on RLI database • Performance does not scale as LRCs grow: need compression of soft state updates Data Management

  30. Soft State Performance With Bloom Filters • Sending a Bloom filter bitmap summarizing 1 million LRC mapping entries • Store Bloom filters in RLI memory • Takes less than 1 millisecond to send updates on a LAN • Currently measuring wide area performance • Bloom filter advantages • Reduce the size of soft state updates • Reduce associated storage overheads and network requirements • Sending updates is faster and scales better with the size of the LRC Data Management

  31. globus-rls-admin Command Line Administration Tool globus-rls-admin option [ rli ] [ server ] -p: verifies that the server is responding -A: add an RLI to the list of servers to which the LRC sends updates -s: show the list of servers to which updates are sent -c all: retrieve all configuration options -S: show statistics for the RLS server -e: clear the LRC database Data Management

  32. globus-rls-cli Command Line Tool globus-rls-cli [ -c ] [ -h ] [ -l reslimit ] [ -s ] [ -t timeout ] [ -u ] [ command ] rls-server • If command is not specified, enters interactive mode • Create an initial mapping from a logical name to a target name: globus-rls-cli create logicalName targetName1 rls://myrls.isi.edu • Add a mapping from same logical name to a second replica/target name: globus-rls-cli add logicalName targetName2 rls://myrls.isi.edu Data Management

  33. globus-rls-cli (cont.) Attribute Functions • globus-rls-cli attribute add <object> <attr> <obj-type> <attr-type> • Add an attribute to an object • object should be the lfn or pfn name • obj-type should be one of lfn or pfn • attr-type should be one of date, float, int, or string • attribute modify <object> <attr> <obj-type> <attr-type> • attribute query <object> <attr> <obj-type> Data Management

  34. globus-rls-cli (cont.) Bulk Operations • bulk add <lfn> <pfn> [<lfn> <pfn> ...] • Bulk add lfn, pfn mappings. • bulk delete <lfn> <pfn> [<lfn> <pfn> ...] • Bulk delete lfn, pfn mappings. • bulk query lrc lfn [<lfn> ...] • Bulk query the LRC for lfns. • bulk query lrc pfn [<pfn> ...] • Bulk query the LRC for pfns. • bulk query rli lfn [<lfn> ...] • Bulk query the RLI for lfns. Data Management

  35. globus-rls-cli (cont.) Bulk Attribute Operations • globus-rls-cli attribute bulk add <object> <attr> <obj-type> • Bulk add attribute values • globus-rls-cli attribute bulk delete <object> <attr> <obj-type> • globus-rls-cli attribute bulk query <attr> <obj-type> <object> • globus-rls-cli attribute define <attr> <obj-type> <attr-type> • globus-rls-cli attribute delete <object> <attr> <obj-type> Data Management

  36. Registering a mapping using C API
  globus_module_activate(GLOBUS_RLS_CLIENT_MODULE)
  globus_rls_client_connect(serverURL, serverHandle)
  globus_rls_client_lrc_create(serverHandle, logicalName, targetName1)
  globus_rls_client_lrc_add(serverHandle, logicalName, targetName2)
  globus_rls_client_close(serverHandle)
  Data Management

  37. Registering a mapping using Java API
  RLSClient rls = new RLSClient(URLofServer);
  RLSClient.LRC lrc = rls.getLRC();
  lrc.create(logicalName, targetName1);
  lrc.add(logicalName, targetName2);
  rls.Close();
  Data Management
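For context, the same call sequence embedded in a compilable skeleton. The import path and the generic exception handling are assumptions about the GT Java client packaging (consult the RLS client API documentation shipped with your installation); only the five statements from the slide are taken from the source, and the server URL and names are placeholders.

  // Assumed package; the actual layout may differ between GT releases.
  import org.globus.replica.rls.RLSClient;

  public class RegisterMapping {
      public static void main(String[] args) {
          try {
              RLSClient rls = new RLSClient("rls://myrls.isi.edu"); // placeholder server
              RLSClient.LRC lrc = rls.getLRC();                     // LRC interface of the server
              lrc.create("logicalName", "targetName1");             // initial lfn -> target mapping
              lrc.add("logicalName", "targetName2");                // register a second replica
              rls.Close();                                          // release the connection
          } catch (Exception e) {                                   // client calls may fail at runtime
              e.printStackTrace();
          }
      }
  }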

  38. Status of RLS and Future Work • Continued development of RLS • Code available as source and binary bundles at: www.globus.org/rls • RLS is part of GT3.0 (as a GT2 service) • RLS will become an OGSI-compliant grid service • The replica location grid service specification will be standardized through the Global Grid Forum • First step may be wrapping the current GT2 services in a GT3 wrapper • Significant changes related to treatment of data entities as first-class OGSI-compliant services Data Management

  39. Higher-Level OGSA Replication Services • Registration and Copy Service • Calls RFT to perform reliable file transfer • Calls RLS to register newly created replicas • Atomic operations; roll back to previous consistent state if part of operation fails • General replication services with various consistency levels/guarantees • Subscription-based model • Updates of data items must be propagated to all replicas according to update policies • Plan is also to standardize these through GGF OGSA Data Replication Services Working Group Data Management

  40. Outline: Data Services for Grids • The Replica Location Service (RLS) • A distributed registry of replicas for data discovery; maintains mappings between logical names for data and physical locations of replicas • The Metadata Catalog Service (MCS) • A catalog that associates descriptive attributes (metadata) that describe data items with logical names for data items • The GridFTP data transport protocol • Extends the basic FTP protocol to provide parallel transfers, striped transfers, grid security, third-party transfers, control of TCP buffer sizes • The Reliable File Transfer (RFT) service • A grid service (extension of a web service) that maintains state about outstanding transfers and is able to retry and restart after client failures Data Management

  41. Grid Infrastructure for Metadata Cataloguing and Discovery • Metadata is information that describes data sets • Distinguish between logical metadata and physical metadata • Logical metadata: Describes the contents of files and collections • Variables contained in the data set, annotations • Provenance information • Applies to all physical file instances or replicas • Stored in the Metadata Catalog Service • Physical metadata: Describes a particular physical instance of a file • Mappings from physical to logical names are stored in a Replica Location Service • Physical file information such as size, owner, modifier, etc. is typically stored in a file system or storage service Data Management

  42. Metadata Examples • Application-specific • Temperature, longitude, latitude, depth • Time, duration, sensor • Application-independent • Creator, logical name, time created, access control • Notion of a data collection: data collected during an experiment, or data collected over a certain time interval • Notion of a view: users might want to group the data in the way that they want to look at it Data Management

  43. Metadata Service Requirements • Storing attributes associated with logical files • Responding to queries based on logical file name or on attribute names and values • Extensibility to support user-defined and application-specific attributes • Consistency of content • Security: authentication and authorization • Support for logical collections: aggregations of logical files • Support for logical views • Provenance information: history of creation and transformation • Auditing Data Management
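As a toy illustration of attribute-based queries (the second requirement above), the sketch below keeps user-defined attributes per logical file in memory and returns the logical names matching a set of attribute/value predicates. It is not the MCS implementation, which answers such queries against a relational backend and supports typed attributes, collections and views; all names are invented.

  import java.util.*;

  // Toy metadata store: logical file name -> attribute map, queried by exact
  // attribute values. Illustrative only.
  public class MetadataCatalogSketch {
      private final Map<String, Map<String, String>> attributesByLfn = new HashMap<>();

      public void setAttribute(String lfn, String attr, String value) {
          attributesByLfn.computeIfAbsent(lfn, k -> new HashMap<>()).put(attr, value);
      }

      // Return logical names whose attributes match all given (name, value) pairs.
      public List<String> query(Map<String, String> predicates) {
          List<String> matches = new ArrayList<>();
          for (Map.Entry<String, Map<String, String>> e : attributesByLfn.entrySet()) {
              if (e.getValue().entrySet().containsAll(predicates.entrySet())) {
                  matches.add(e.getKey());
              }
          }
          return matches;
      }

      public static void main(String[] args) {
          MetadataCatalogSketch mcs = new MetadataCatalogSketch();
          mcs.setAttribute("lfn:climate-run-7", "variable", "temperature");
          mcs.setAttribute("lfn:climate-run-7", "frequency", "monthly");
          System.out.println(mcs.query(Map.of("variable", "temperature")));
      }
  }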

  44. Use of Metadata Catalogs in ESG Data Management

  45. History of Metadata Catalog Service Development • Identified the need for a stand-alone metadata service • Designed a general schema for metadata attributes • General attributes (based largely on the Storage Resource Broker) • Ability to specify user-defined attributes • Implemented a prototype system in mid-2002 • Used the prototype in several projects in late 2002 • Earth System Grid • GriPhyN LIGO (Gravitational Wave Physics) • Gathered lessons from use in these systems • Currently re-designing the Metadata Catalog Service for greater functionality, extensibility and performance Data Management

  46. Data Model [Diagram: MCS data model relating logical files, logical views and logical collections] Data Management

  47. MCS Data Model and Implementation • Logical files, logical collections and logical views • May associate pre-defined or user-defined attributes with files, collections or views • Prototype is a centralized service based on open source web service and database technology [Diagram: MCS Java client API communicating over SOAP/HTTP with an MCS server built on the Apache Axis SOAP engine, backed by a MySQL database] Data Management

  48. Experience with MCS within the Earth System Grid Project • Store climate model metadata corresponding to ESG schema • ESG metadata in XML format • Parse or “shred” the metadata and store in MCS relational tables • Create new user-defined attributes for domain-specific metadata schema • Shredding is fairly slow and cumbersome • Query performance is acceptable • Can recreate the original XML documents • Used in SC2002 ESG Demo and in subsequent demonstrations Data Management
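A rough sketch of what "shredding" XML metadata into relational rows means in practice: parse the document and emit one (logical name, attribute, value) row per attribute, which would then be inserted into the MCS tables. The XML fragment below is hypothetical and far simpler than the actual ESG schema; it only illustrates the flattening step.

  import javax.xml.parsers.DocumentBuilderFactory;
  import org.w3c.dom.Document;
  import org.w3c.dom.Element;
  import org.w3c.dom.NodeList;
  import java.io.ByteArrayInputStream;
  import java.nio.charset.StandardCharsets;

  // Sketch of "shredding" an XML metadata record into flat attribute rows.
  public class ShredDemo {
      public static void main(String[] args) throws Exception {
          // Hypothetical metadata fragment; the real ESG schema is richer.
          String xml = "<dataset logicalName=\"lfn://esg/tas_monthly.nc\">"
                     + "<attribute name=\"variable\" value=\"temperature\"/>"
                     + "<attribute name=\"frequency\" value=\"monthly\"/>"
                     + "</dataset>";
          Document doc = DocumentBuilderFactory.newInstance()
                  .newDocumentBuilder()
                  .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
          String lfn = doc.getDocumentElement().getAttribute("logicalName");
          NodeList attrs = doc.getElementsByTagName("attribute");
          // Each printed row would become an INSERT into an MCS attribute table.
          for (int i = 0; i < attrs.getLength(); i++) {
              Element a = (Element) attrs.item(i);
              System.out.printf("(%s, %s, %s)%n",
                      lfn, a.getAttribute("name"), a.getAttribute("value"));
          }
      }
  }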

  49. MCS and GriPhyN • Provide on-demand data derivation based on existing data “recipes” • If data products already available, no need to recompute • Data easily stored in relational db • Used to find the existing data products • Query MCS based on application-specific attributes, receive list of logical file names • Store information about newly created data products Data Management

  50. For 2003: Redesigning the MCS • New implementation will be based on OGSA Database Access and Integration (DAI) Service • Being standardized through Global Grid Forum • Reference implementation involving IBM, Oracle, UK eScience researchers, academic institutions • Provides both relational and native XML back ends • Provides a grid service front end with grid security • Provides a general pass-through SQL query interface • Testing OGSA DAI services with ESG metadata • Supporting provenance information • Common schema with the Chimera project • Provenance information describes data transformations Data Management
