Distributed Metadata with the AMGA Metadata Catalog

Nuno Santos, Birger Koblitz

20 June 2006

Workshop on Next-Generation Distributed Data Management


Abstract

  • Metadata Catalogs on Data Grids – The case for replication

  • The AMGA Metadata Catalog

  • Metadata Replication with AMGA

  • Benchmark Results

  • Future Work/Open Challenges

Metadata Catalogs

  • Metadata on the Grid

    • File Metadata - Describe files with application-specific information

      • Purpose: file discovery based on their contents

    • Simplified Database Service - Store generic structured data on the Grid

      • Not as powerful as a full DB, but easier to use and with better Grid integration (security, hiding of DB heterogeneity)

  • Metadata Services are essential for many Grid applications

  • Must be accessible Grid-wide

    But Data Grids can be large…

An Example - The LCG Sites

  • LCG – LHC Computing Grid

    • Distribute and process the data generated by the LHC (Large Hadron Collider) at CERN

    • ~200 sites and ~5,000 users worldwide

Taken from: http://goc03.grid-support.ac.uk/googlemaps/lcg.html

Challenges for Catalog Services

  • Scalability

    • Hundreds of grid sites

    • Thousands of users

  • Geographical Distribution

    • Network latency

  • Dependability

    • In a large and heterogeneous system, failures will be common

  • A centralized system does not meet the requirements

    • Distribution and replication required

Off-the-shelf DB Replication?

  • Most DB systems have DB replication mechanisms

    • Oracle Streams, Slony for PostgreSQL, MySQL replication

  • Example: 3D Project at CERN

    (Distributed Deployment of Databases)

    • Uses Oracle Streams for replication

    • Being deployed only at a few LCG sites (~10 sites, Tier-0 and Tier-1s)

      • Requires Oracle ($$$) and expert on-site DBAs ($$$)

      • Most sites don’t have these resources

  • Off-the-shelf replication is vendor-specific

    • But Grids are heterogeneous by nature

    • Sites have different DB systems available

Only partial solution to the problem of metadata replication

Replication in the Catalog

  • Alternative we are exploring:

    Replication in the Metadata Catalog

  • Advantages

    • Database independent

    • Metadata-aware replication

      • More efficient – replicate Metadata commands

      • Better functionality – Partial replication, federation

    • Ease of deployment and administration

      • Built into the Metadata Catalog

      • No need for dedicated DB admin

  • The AMGA Metadata Catalog is the basis for our work on replication

The AMGA Metadata Catalog

  • Metadata Catalog of the gLite Middleware (EGEE)

  • Several groups of users among the EGEE community:

    • High Energy Physics

    • Biomed

  • Main features

    • Dynamic schemas

    • Hierarchical organization

    • Security:

      • Authentication: user/pass, X509 Certs, GSI

      • Authorization: VOMS, ACLs

AMGA Implementation

  • C++ implementation

  • Back-ends

    • Oracle, MySQL, PostgreSQL, SQLite

  • Front-end - TCP Streaming

    • Text-based protocol like TELNET, SMTP, POP…

  • Examples:

    Adding data:

      addentry /DLAudio/song.mp3 /DLAudio:Author 'John Smith' /DLAudio:Album 'Latest Hits'

    Retrieving data:

      selectattr /DLAudio:FILE /DLAudio:Author /DLAudio:Album 'like(/DLAudio:FILE, "%.mp3")'
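
  As an illustration of the TCP streaming front-end, the following is a minimal Python sketch that sends the two commands above over a plain newline-terminated text connection. The host name, port and single-read reply handling are assumptions made for the example, and the authentication step a real session starts with (user/password, X509 or GSI) is omitted.

  import socket

  def send_command(sock, command):
      # The front-end speaks a plain text protocol; here we assume commands are
      # newline-terminated and that a single recv() is enough to read the reply.
      sock.sendall((command + "\n").encode())
      return sock.recv(4096).decode()

  # Placeholder endpoint; a real session would also authenticate first
  # (user/password, X509 certificates or GSI).
  with socket.create_connection(("amga.example.org", 8822)) as sock:
      print(send_command(sock, "addentry /DLAudio/song.mp3 "
                               "/DLAudio:Author 'John Smith' /DLAudio:Album 'Latest Hits'"))
      print(send_command(sock, "selectattr /DLAudio:FILE /DLAudio:Author /DLAudio:Album "
                               "'like(/DLAudio:FILE, \"%.mp3\")'"))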

Standalone Performance

  • Single server scales well up to 100 concurrent clients

  • Could not go past 100 clients; limited by the database back-end

  • WAN access is one to two orders of magnitude slower than LAN access

Replication can solve both bottlenecks


Metadata Replication with AMGA

Requirements of EGEE Communities

  • Motivation: Requirements of EGEE’s user communities.

    • Mainly HEP and Biomed

  • High Energy Physics (HEP)

    • Millions of files, 5,000+ users distributed across 200+ computing centres

    • Mainly (read-only) file metadata

    • Main concerns: scalability, performance and fault-tolerance

  • Biomed

    • Manage medical images on the Grid

      • Data produced in a distributed fashion by laboratories and hospitals

      • Highly sensitive data: patient details

    • Smaller scale than HEP

    • Main concern: security

Metadata Replication

Some replication models:

  • Partial replication

  • Full replication

  • Federation

  • Proxy

Architecture

  • Main design decisions

    • Asynchronous replication – for tolerating high latencies and for fault-tolerance

    • Partial replication – Replicate only what is interesting for the remote users

    • Master-slave – Writes only allowed on the master

      • But mastership is granted to metadata collections, not to nodes
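
A minimal sketch of these decisions follows (illustrative only, with assumed names; not AMGA's actual implementation): writes are accepted only for collections the node is master of, each update is appended to per-slave queues for the collections that slave subscribes to, and shipping happens asynchronously.

  import queue

  class ReplicatingCatalog:
      def __init__(self, owned_collections):
          self.owned = set(owned_collections)   # mastership is per collection, not per node
          self.data = {}                        # collection -> applied metadata commands
          self.subscribers = {}                 # slave id -> (subscribed collections, pending updates)

      def subscribe(self, slave_id, collections):
          # Partial replication: a slave receives only the collections it is interested in.
          self.subscribers[slave_id] = (set(collections), queue.Queue())

      def write(self, collection, command):
          # Master-slave: writes are only accepted on the master of the collection.
          if collection not in self.owned:
              raise PermissionError("not master of " + collection)
          self.data.setdefault(collection, []).append(command)
          # Asynchronous replication: updates are queued and shipped later by a
          # background sender, so a slow or distant slave never blocks the write.
          for subscribed, pending in self.subscribers.values():
              if collection in subscribed:
                  pending.put((collection, command))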

Status

  • Initial implementation completed

    • Available functionality:

      • Full and partial replication

      • Chained replication (master → slave1 → slave2)

      • Federation - basic support

        • Data is always copied to slave

      • Cross DB replication: PostgreSQL → MySQL tested

        • Other combinations should work (give or take some debugging)

  • Available as part of AMGA


Benchmark Results

Benchmark Study

  • Investigate the following:

    • Overhead of replication and scalability of the master

    • Behaviour of the system under faults

Scalability

  • Setup:

    • Insertion rate at master: 90 entries/s

    • Total: 10,000 entries

    • 0 slaves - replication updates are logged but not shipped (slaves disconnected)

  • Small increase in CPU usage as the number of slaves increases

    • With 10 slaves, a 20% increase over standalone operation

  • Number of update logs sent scales almost linearly with the number of slaves

Fault Tolerance

  • The next test illustrates the fault-tolerance mechanisms

  • Slave fails

    • Master keeps the updates for the slave

    • Replication log grows

  • Slave reconnects

    • Master sends pending updates

    • Eventually the system recovers to a steady state with the slave up to date

  • Test conditions:

    • Insertion rate at master: 50 entries/s

    • Total: 20,000 entries

    • Two slaves, both start connected

    • Slave1 disconnects temporarily

Fault Tolerance and Recovery

  • While slave1 is disconnected, the replication log grows in size

    • The log is limited in size; the slave is unsubscribed if it does not reconnect in time

  • After the slave reconnects, the system recovers in around 60 seconds
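
The behaviour above can be pictured with a small sketch of a bounded per-slave replication log (again illustrative, with assumed names and an assumed size limit): entries accumulate while the slave is unreachable, are drained on reconnection, and the slave is dropped if the log outgrows its limit.

  from collections import deque

  class ReplicationLog:
      def __init__(self, max_entries=10000):     # size limit is an assumed example value
          self.pending = deque()                 # updates not yet delivered to the slave
          self.max_entries = max_entries
          self.subscribed = True

      def append(self, update):
          if not self.subscribed:
              return
          self.pending.append(update)
          # While the slave is disconnected the log keeps growing; if it exceeds
          # the limit the slave is unsubscribed and must re-synchronise later.
          if len(self.pending) > self.max_entries:
              self.pending.clear()
              self.subscribed = False

      def on_reconnect(self, send):
          # On reconnection the master ships all pending updates, after which the
          # system is back in a steady state with the slave up to date.
          while self.pending:
              send(self.pending.popleft())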


Future Work/Open Challenges

Scalability

  • Support hundreds of replicas

    • HEP use case. Extreme case: one replica catalog per site

  • Challenges

    • Scalability

    • Fault-tolerance – tolerate failures of slaves and of master

  • Current method of shipping updates (direct streaming) might not scale

    • Chained replication (divide and conquer)

      • Already possible with AMGA; performance still needs to be studied

    • Group communication

Federation

  • Federation of independent catalogs

    • Biomed use case

  • Challenges

    • Provide a consistent view over the federated catalogs

    • Shared namespace

    • Security - Trust management, access control and user management

  • Ideas

Conclusion

  • Replication of metadata catalogs is necessary for Data Grids

  • We are exploring replication at the catalog level, using AMGA

  • Initial implementation completed

    • First results are promising

  • Currently working on improving scalability and on federation

  • More information about our current work at:

    http://project-arda-dev.web.cern.ch/project-arda-dev/metadata/
