
Distributed Metadata with the AMGA Metadata Catalog


Presentation Transcript


  1. Distributed Metadata with the AMGA Metadata Catalog Nuno Santos, Birger Koblitz 20 June 2006 Workshop on Next-Generation Distributed Data Management

  2. Abstract
  • Metadata Catalogs on Data Grids – the case for replication
  • The AMGA Metadata Catalog
  • Metadata Replication with AMGA
  • Benchmark Results
  • Future Work/Open Challenges

  3. Metadata Catalogs
  • Metadata on the Grid
    • File metadata – describes files with application-specific information
      • Purpose: file discovery based on their contents
    • Simplified database service – stores generic structured data on the Grid
      • Not as powerful as a DB, but easier to use and with better Grid integration (security, hiding DB heterogeneity)
  • Metadata services are essential for many Grid applications
    • Must be accessible Grid-wide
  • But Data Grids can be large…

  4. An Example – The LCG Sites
  • LCG – the LHC Computing Grid
    • Distributes and processes the data generated by the LHC (Large Hadron Collider) at CERN
  • ~200 sites and ~5,000 users worldwide
  (Map taken from: http://goc03.grid-support.ac.uk/googlemaps/lcg.html)

  5. Challenges for Catalog Services
  • Scalability
    • Hundreds of Grid sites
    • Thousands of users
  • Geographical distribution
    • Network latency
  • Dependability
    • In a large and heterogeneous system, failures will be common
  • A centralized system does not meet these requirements
    • Distribution and replication are required

  6. Off-the-shelf DB Replication?
  • Most DB systems have replication mechanisms
    • Oracle Streams, Slony for PostgreSQL, MySQL replication
  • Example: the 3D Project at CERN (Distributed Deployment of Databases)
    • Uses Oracle Streams for replication
    • Being deployed only at a few LCG sites (~10 sites, Tier-0 and Tier-1s)
    • Requires Oracle ($$$) and expert on-site DBAs ($$$) – most sites don't have these resources
  • Off-the-shelf replication is vendor-specific
    • But Grids are heterogeneous by nature – sites have different DB systems available
  • Only a partial solution to the problem of metadata replication

  7. Replication in the Catalog
  • Alternative we are exploring: replication in the Metadata Catalog
  • Advantages
    • Database independent
    • Metadata-aware replication
      • More efficient – replicates metadata commands (see the sketch below)
      • Better functionality – partial replication, federation
    • Ease of deployment and administration
      • Built into the Metadata Catalog
      • No need for a dedicated DB admin
  • The AMGA Metadata Catalog is the basis for our work on replication
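
  To make "replicates metadata commands" concrete, the fragment below contrasts a hypothetical command-level update record with the row-level change records a generic database replicator would ship. Both formats are invented for illustration; neither is AMGA's actual log or table layout.

    # Purely illustrative contrast between command-level replication (what a
    # metadata-aware catalog can ship) and row-level replication (what a
    # generic DB replicator ships). Neither record format is AMGA's actual one.

    # One catalog command captures the whole logical update, independently of
    # the backend database and its schema layout:
    command_update = {
        "seq": 1042,
        "command": "addentry /DLAudio/song.mp3 "
                   "/DLAudio:Author 'John Smith' /DLAudio:Album 'Latest Hits'",
    }

    # A row-level replicator would instead ship vendor-specific changes for
    # every table touched by the same operation:
    row_updates = [
        {"table": "entries",    "op": "INSERT", "values": {"path": "/DLAudio/song.mp3"}},
        {"table": "attributes", "op": "INSERT", "values": {"name": "Author", "value": "John Smith"}},
        {"table": "attributes", "op": "INSERT", "values": {"name": "Album",  "value": "Latest Hits"}},
    ]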

  8. The AMGA Metadata Catalog
  • Metadata Catalog of the gLite middleware (EGEE)
  • Several groups of users among the EGEE community:
    • High Energy Physics
    • Biomed
  • Main features
    • Dynamic schemas
    • Hierarchical organization
    • Security:
      • Authentication: username/password, X.509 certificates, GSI
      • Authorization: VOMS, ACLs

  9. AMGA Implementation
  • C++ implementation
  • Back-ends: Oracle, MySQL, PostgreSQL, SQLite
  • Front-end: TCP streaming
    • Text-based protocol, like TELNET, SMTP, POP…
  • Examples (a client sketch follows below):
    • Adding data:
      addentry /DLAudio/song.mp3 /DLAudio:Author 'John Smith' /DLAudio:Album 'Latest Hits'
    • Retrieving data:
      selectattr /DLAudio:FILE /DLAudio:Author /DLAudio:Album 'like(/DLAudio:FILE, "%.mp3")'
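
  Because the front-end is a plain text protocol over TCP, it can in principle be driven by any socket client. The Python sketch below sends the two example commands to a server. The host name, the port, the response handling (read until the server goes quiet) and the omission of the authentication handshake are all simplifying assumptions for illustration; real deployments use the gLite AMGA client tools and libraries.

    # Minimal sketch of driving AMGA's text-based TCP front-end.
    # Endpoint and response framing are assumptions for illustration only;
    # the authentication handshake required by a real server is omitted.
    import socket

    HOST, PORT = "amga.example.org", 8822   # hypothetical endpoint

    def send_command(sock, command):
        """Send one metadata command and return the raw text reply."""
        sock.sendall((command + "\n").encode())
        sock.settimeout(2.0)
        chunks = []
        try:
            while True:
                data = sock.recv(4096)
                if not data:
                    break
                chunks.append(data)
        except socket.timeout:
            pass                             # assume the server finished replying
        return b"".join(chunks).decode(errors="replace")

    with socket.create_connection((HOST, PORT)) as s:
        # Adding data: attach Author/Album attributes to a new entry
        print(send_command(s,
            "addentry /DLAudio/song.mp3 /DLAudio:Author 'John Smith' "
            "/DLAudio:Album 'Latest Hits'"))
        # Retrieving data: select attributes of all .mp3 entries
        print(send_command(s,
            "selectattr /DLAudio:FILE /DLAudio:Author /DLAudio:Album "
            "'like(/DLAudio:FILE, \"%.mp3\")'"))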

  10. Standalone Performance
  • A single server scales well up to 100 concurrent clients
    • Could not go past 100 – limited by the database
  • WAN access is one to two orders of magnitude slower than LAN access
  • Replication can address both bottlenecks

  11. Metadata Replication with AMGA

  12. Requirements of EGEE Communities
  • Motivation: the requirements of EGEE's user communities, mainly HEP and Biomed
  • High Energy Physics (HEP)
    • Millions of files, 5,000+ users distributed across 200+ computing centres
    • Mainly (read-only) file metadata
    • Main concerns: scalability, performance and fault tolerance
  • Biomed
    • Manages medical images on the Grid
    • Data produced in a distributed fashion by laboratories and hospitals
    • Highly sensitive data: patient details
    • Smaller scale than HEP
    • Main concern: security

  13. Metadata Replication
  • Some replication models (shown as diagrams on the original slide):
    • Partial replication
    • Full replication
    • Federation
    • Proxy

  14. Architecture
  • Main design decisions (illustrated in the sketch below):
    • Asynchronous replication – to tolerate high latencies and to provide fault tolerance
    • Partial replication – replicate only what is of interest to the remote users
    • Master-slave – writes are allowed only on the master
      • But mastership is granted per metadata collection, not per node
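
  The sketch below illustrates these design decisions only; it is not AMGA's implementation and all class and method names are hypothetical. Writes are accepted only by the node that holds mastership of the target collection, every accepted write is appended to an update log, and a shipping step later forwards the pending entries to the slaves subscribed to that collection (partial, asynchronous replication).

    # Illustrative model of the design decisions above (not AMGA code):
    # master-slave writes, per-collection mastership, and partial,
    # asynchronous replication via an update log.
    from collections import defaultdict

    class CatalogNode:
        def __init__(self, name, mastered_collections):
            self.name = name
            # Mastership is granted per metadata collection, not per node.
            self.mastered = set(mastered_collections)
            self.log = []                          # pending update log
            self.subscribers = defaultdict(list)   # collection -> slaves
            self.data = defaultdict(dict)          # collection -> {entry: attrs}

        def write(self, collection, entry, attrs):
            if collection not in self.mastered:
                raise PermissionError(f"{self.name} is not master of {collection}")
            self.data[collection][entry] = attrs
            self.log.append(("addentry", collection, entry, attrs))  # ship later

        def subscribe(self, slave, collection):
            # Partial replication: a slave gets only the collections it asked for.
            self.subscribers[collection].append(slave)

        def ship_updates(self):
            # Would run asynchronously in the background on a real system.
            for cmd, collection, entry, attrs in self.log:
                for slave in self.subscribers[collection]:
                    slave.apply(cmd, collection, entry, attrs)
            self.log.clear()

    class Slave:
        def __init__(self, name):
            self.name, self.data = name, defaultdict(dict)

        def apply(self, cmd, collection, entry, attrs):
            if cmd == "addentry":
                self.data[collection][entry] = attrs

    master = CatalogNode("cern", mastered_collections=["/DLAudio"])
    replica = Slave("site-a")
    master.subscribe(replica, "/DLAudio")
    master.write("/DLAudio", "song.mp3", {"Author": "John Smith"})
    master.ship_updates()                          # replica now holds the entry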

  15. Status
  • Initial implementation completed
  • Available functionality:
    • Full and partial replication
    • Chained replication (master → slave1 → slave2)
    • Federation – basic support; data is always copied to the slave
    • Cross-DB replication: PostgreSQL → MySQL tested
      • Other combinations should work (give or take some debugging)
  • Available as part of AMGA

  16. Benchmark Results

  17. Benchmark Study
  • Investigate the following:
    • Overhead of replication and scalability of the master
    • Behaviour of the system under faults

  18. Scalability
  • Setup
    • Insertion rate at the master: 90 entries/s
    • Total: 10,000 entries
    • "0 slaves" means the master saves replication updates but does not ship them (slaves disconnected)
  • Results
    • Small increase in CPU usage as the number of slaves increases: with 10 slaves, a 20% increase over standalone operation
    • The number of update logs sent scales almost linearly with the number of slaves

  19. Fault Tolerance
  • The next test illustrates the fault-tolerance mechanisms:
    • A slave fails
    • The master keeps the updates for the slave – the replication log grows
    • The slave reconnects – the master sends the pending updates
    • Eventually the system recovers to a steady state with the slave up to date
  • Test conditions (setup):
    • Insertion rate at the master: 50 entries/s
    • Total: 20,000 entries
    • Two slaves, both start connected
    • Slave1 disconnects temporarily

  20. Fault Tolerance and Recovery
  • While slave1 is disconnected, the replication log grows in size
    • The log is limited in size; a slave is unsubscribed if it does not reconnect in time (see the sketch below)
  • After the slave reconnects, the system recovers in around 60 seconds
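
  The sketch below illustrates, with assumed names and an assumed log limit, the catch-up behaviour described on the last two slides: updates destined for a disconnected slave are buffered in a size-limited replication log, replayed when the slave reconnects, and the slave is unsubscribed if the log fills up before then. It is not AMGA code.

    # Hypothetical model of the bounded replication log and slave catch-up.
    MAX_LOG_ENTRIES = 1000                # assumed bound on the replication log

    class SlaveState:
        def __init__(self):
            self.connected = True
            self.pending = []             # updates logged while disconnected

    class ReplicatingMaster:
        def __init__(self, send):
            self.send = send              # callable(slave_name, update)
            self.slaves = {}              # slave_name -> SlaveState

        def subscribe(self, name):
            self.slaves[name] = SlaveState()

        def on_disconnect(self, name):
            self.slaves[name].connected = False

        def record_update(self, update):
            for name, state in list(self.slaves.items()):
                if state.connected:
                    self.send(name, update)          # normal shipping path
                else:
                    state.pending.append(update)     # replication log grows
                    if len(state.pending) > MAX_LOG_ENTRIES:
                        del self.slaves[name]        # slave unsubscribed

        def on_reconnect(self, name):
            state = self.slaves[name]
            state.connected = True
            for update in state.pending:             # ship the backlog
                self.send(name, update)
            state.pending.clear()                    # steady state again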

  21. Future Work/Open Challenges

  22. Scalability
  • Goal: support hundreds of replicas
    • HEP use case; extreme case: one replica catalog per site
  • Challenges
    • Scalability
    • Fault tolerance – tolerate failures of slaves and of the master
    • The current method of shipping updates (direct streaming) might not scale
  • Possible approaches
    • Chained replication (divide and conquer) – already possible with AMGA, but its performance needs to be studied (see the sketch below)
    • Group communication
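
  As a minimal sketch of the chained-replication idea (hypothetical classes, not part of AMGA): each intermediate replica applies an update locally and then forwards it to its own downstream subscribers, so the master streams directly to only a small number of first-level slaves even when there are hundreds of replicas overall.

    # Hypothetical sketch of chained replication (master -> slave1 -> slave2):
    # each node applies an update locally, then forwards it downstream, which
    # keeps the master's fan-out small.
    class ChainedReplica:
        def __init__(self, name):
            self.name = name
            self.entries = {}
            self.downstream = []              # replicas fed by this node

        def subscribe(self, replica):
            self.downstream.append(replica)

        def apply(self, entry, attrs):
            self.entries[entry] = attrs       # apply locally first
            for replica in self.downstream:   # then forward along the chain
                replica.apply(entry, attrs)

    master = ChainedReplica("master")
    slave1 = ChainedReplica("slave1")
    slave2 = ChainedReplica("slave2")
    master.subscribe(slave1)
    slave1.subscribe(slave2)                  # chain: master -> slave1 -> slave2
    master.apply("/DLAudio/song.mp3", {"Author": "John Smith"})
    assert "/DLAudio/song.mp3" in slave2.entries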

  23. Federation
  • Federation of independent catalogs
    • Biomed use case
  • Challenges
    • Provide a consistent view over the federated catalogs
    • Shared namespace
    • Security – trust management, access control and user management
  • Ideas

  24. Conclusion
  • Replication of metadata catalogs is necessary for Data Grids
  • We are exploring replication in the catalog itself, using AMGA
    • Initial implementation completed
    • First results are promising
  • Currently working on improving scalability and on federation
  • More information about our current work: http://project-arda-dev.web.cern.ch/project-arda-dev/metadata/
