
RLS Production Services


Presentation Transcript


  1. RLS Production Services Maria Girone PPARC-LCG, CERN LCG-POOL and IT-DB Physics Services 10th GridPP Meeting, CERN, 3rd June 2004 • What is the RLS • RLS and POOL • Service Overview • Experience in Data Challenges • Towards a Distributed RLS • Summary

  2. What is the RLS • The LCG Replica Location Service (LCG-RLS) is the central Grid file catalog, responsible for maintaining a consistent list of accessible files (physical and logical names) together with their relevant file metadata attributes • The RLS (and POOL) refers to files via a unique and immutable file identifier (FileID), generated at creation time • Stable inter-file references [Slide diagram: a FileID linked to its logical names (LFN1…LFNn), its physical replicas (PFN1…PFNn) and its file metadata (jobid, owner, …)]
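A minimal sketch of the data model this slide describes: an immutable FileID (GUID) assigned once at creation, with the logical names, physical replicas and metadata attached to it. The class and field names are illustrative, not the actual EDG/POOL API.

    import uuid

    class CatalogEntry:
        """One file as the RLS sees it: an immutable identifier plus mutable mappings."""
        def __init__(self):
            self.file_id = str(uuid.uuid4())  # unique, immutable FileID (GUID), set at creation
            self.lfns = set()                 # logical file names (GUID <-> LFN)
            self.pfns = set()                 # physical replica names (GUID <-> PFN)
            self.metadata = {}                # file-level attributes, e.g. jobid, owner

    entry = CatalogEntry()
    entry.lfns.add("lfn:/grid/cms/dc04/hits.root")
    entry.pfns.add("castor:/castor/cern.ch/cms/dc04/hits.root")
    entry.metadata["owner"] = "cms-production"

Inter-file references stored elsewhere point at entry.file_id, which never changes, so logical names can be renamed and replicas added or removed without breaking them.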

  3. POOL and the LCG-RLS • POOL is the LCG Persistency Framework • See the talk from Radovan Chytracek • The LCG-RLS is one of the three POOL File Catalog implementations • XML-based local file catalog • MySQL-based shared catalog • RLS-based Grid-aware file catalog • A complete production chain deploys several of these • Cascading changes from isolated worker nodes (XML catalog) up to the RLS service (see the sketch below) • DC04 used the MySQL catalog at Tier1 and the RLS at Tier0 • RLS deployment at Tier1 sites • See the talk from James Casey
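The cascading chain can be pictured as repeated publish steps between catalogs. The contact strings below follow the POOL convention of prefixing the backend type, but the exact values and the publish helper are hypothetical stand-ins, not the actual POOL tools.

    def publish(source_catalog, destination_catalog):
        """Stand-in for a cross-catalog publish step: copy new entries upward."""

    # Worker node: each job registers its outputs in an isolated XML catalog.
    wn_catalog    = "xmlcatalog_file:PoolFileCatalog.xml"
    # Tier1 (as in DC04): a shared MySQL catalog collects entries from many jobs.
    tier1_catalog = "mysqlcatalog_mysql://pool@tier1.example.org/catalog"
    # Tier0: the Grid-aware RLS catalog at CERN.
    tier0_catalog = "edgcatalog_http://rls.example.cern.ch/cms"

    publish(wn_catalog, tier1_catalog)     # cascade worker-node entries upward...
    publish(tier1_catalog, tier0_catalog)  # ...and on to the central RLS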

  4. RLS Service Goals • The RLS is a critical service for the correct operation of the Grid! • Minimal downtime for both scheduled and unscheduled interruptions • Good level of availability at both the iAS and DB level • Meet the requirements of the Data Challenges • In terms of performance (look-up/insert rate) and capacity (total number of GUID-PFN mappings and file-level meta-data entries) • Currently, the performance is not limited by the service itself • Prepare for future needs and increase reliability/manageability

  5. RLS Service Overview • Currently deploys the LRC and RMC middleware components from EDG • The Distributed Replica Location Index is not deployed in LCG-2 • For now, a central service deployed at CERN • The RLS uses Oracle Application Server (iAS) and Database (DB) • A dedicated farm node (iAS) per VO • A shared disk server (DB) for the production VOs • A similar set-up is used for testing and software certification [Slide diagram: RLS AppServers for test, certification and production (one per VO: ALICE, ATLAS, CMS, LHCb, DTEAM, plus a spare) in front of the corresponding test, certification and production RLS DBs]
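In outline, the set-up on this slide maps each VO to its own application-server endpoint while the production VOs share one database server. The hostnames below are made up purely for illustration.

    # One dedicated iAS farm node per VO (hypothetical hostnames).
    RLS_ENDPOINTS = {
        "alice": "http://rls-alice.example.cern.ch:7777",
        "atlas": "http://rls-atlas.example.cern.ch:7777",
        "cms":   "http://rls-cms.example.cern.ch:7777",
        "lhcb":  "http://rls-lhcb.example.cern.ch:7777",
        "dteam": "http://rls-dteam.example.cern.ch:7777",
    }
    # All production VOs share a single Oracle disk server behind the iAS layer;
    # separate test and certification instances mirror this layout.
    PRODUCTION_DB = "rls-db-prod.example.cern.ch"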

  6. Handling Interventions • High level – ‘run like an experiment’: • On-call team; a primary responsible and a backup • Documented procedures, training for on-call personnel, daily meetings • List of experts to call in case the standard actions do not work • Planning of interventions • Most frequent: security patches • iAS: can transparently switch to a new box using a DNS alias change • Used for both scheduled and unscheduled interruptions • DB: short interruption to move to a ‘stand-by’ DB • Total up-time achieved: 99.91% (see the arithmetic below) • Looking at standard Oracle solutions for High Availability: • iAS clusters and DB clusters • Data Guard (for data protection)
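To put the 99.91% figure in absolute terms, a quick back-of-the-envelope check:

    # Downtime budget implied by 99.91% total up-time.
    uptime = 0.9991
    hours_per_year = 365.25 * 24
    downtime_hours = (1 - uptime) * hours_per_year
    print(f"{downtime_hours:.1f} hours of downtime per year")    # ~7.9 hours/year
    print(f"{downtime_hours * 60 / 12:.0f} minutes per month")   # ~39 minutes/month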

  7. Experience in Data Challenges • The RLS was used for the first time in production during the CMS Data Challenge DC04 (3M PFNs and file metadata stored) • ATLAS and LHCb are ramping up • The service was stable throughout DC04 • Looking up file information by GUID seems sufficiently fast • Clear problems with respect to the performance of the RLS • Partially due to the normal “learning curve” on all sides in using a new system • Bulk operations were missing in the deployed RLS version (see the sketch below) • Also, cross-catalog queries are not efficient by RLS design • Several solutions produced ‘in flight’ • EDG-based tools, POOL workarounds • Support for bulk operations is now addressed by IT-GD (in edg-rls v2.2.7); POOL will support it in the next release (POOL V1.7)
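Why the missing bulk operations hurt: registering N mappings one at a time costs N client-server round trips, while a bulk call costs one. The rls_* functions below are hypothetical stand-ins for the catalog API, used only to make the contrast concrete.

    def rls_add_mapping(guid, pfn):
        """Stand-in for a single-entry insert: one server round trip per call."""

    def rls_add_mappings_bulk(mappings):
        """Stand-in for a bulk insert: one round trip for the whole batch."""

    mappings = [("guid-%04d" % i, "gsiftp://se.example.org/file%04d" % i)
                for i in range(1000)]

    # RLS version deployed during DC04: ~1000 round trips.
    for guid, pfn in mappings:
        rls_add_mapping(guid, pfn)

    # With bulk support (edg-rls v2.2.7, POOL V1.7): ~1 round trip.
    rls_add_mappings_bulk(mappings)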

  8. Towards a Distributed RLS • The RLS in LCG-2 still lacks consistent replication between multiple catalog servers • The EDG RLI component has not been deployed as part of LCG • A central single catalog is expected to result in scalability and availability problems • Joint evaluation with CMS of Oracle asynchronous database replication as part of DC04 (in parallel to production) • Tested a minimal (two-node) multi-master system between CERN and CNAF • Catalog inserts/updates propagated in both directions • First results • The RLS application could be deployed with only minor changes • No stability or performance problems observed so far • Network problems and temporary server unavailability were handled gracefully (see the sketch below) • Unfortunately, the set-up could not be tested in full production mode during DC04 due to lack of time/resources
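Why asynchronous replication tolerated network problems gracefully: each insert commits locally at once and is queued for deferred propagation to the peer; if the link is down, the change simply waits in the queue. The sketch below mimics this behaviour at a very high level and is not how Oracle implements it.

    from collections import deque

    class ReplicatedCatalog:
        def __init__(self, name):
            self.name = name
            self.entries = {}      # GUID -> PFN mappings held at this site
            self.outbox = deque()  # local changes awaiting propagation to the peer
            self.peer = None

        def insert(self, guid, pfn):
            self.entries[guid] = pfn          # local commit succeeds immediately
            self.outbox.append((guid, pfn))   # change queued for the peer

        def push_changes(self, link_up=True):
            # Called periodically; a down link loses nothing, changes stay queued.
            while link_up and self.outbox:
                guid, pfn = self.outbox.popleft()
                self.peer.entries[guid] = pfn

    cern, cnaf = ReplicatedCatalog("CERN"), ReplicatedCatalog("CNAF")
    cern.peer, cnaf.peer = cnaf, cern
    cern.insert("guid-1", "castor:/castor/cern.ch/file1")  # committed at CERN at once
    cern.push_changes(link_up=False)  # network problem: change waits in the outbox
    cern.push_changes(link_up=True)   # link restored: change propagates to CNAF
    assert "guid-1" in cnaf.entries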

  9. Next Generation RLS • The LCG Grid Deployment group is currently working with the experiments to gather requirements for the next generation RLS • Taking into account the experience from DC04 • Build on the DC04 work: move to replicated rather than distributed catalogs? • Still need to prove • Stability and performance with production access patterns • Scaling to a sufficient number of replicas (4-6 Tier1 sites?) • Automated resolution of catalog conflicts that may arise as a consequence of asynchronous replication (illustrated below) • Propose to continue the evaluation, possibly using Oracle Streams • In the context of the Distributed Database Deployment activity, in the LCG deployment area
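The conflict problem, and one way it can be resolved automatically: if two sites update the same GUID while out of contact, each change eventually reaches the other site, and both must pick the same winner. A deterministic rule such as latest-timestamp-wins (with the site name as tie-breaker) makes the catalogs converge; this mirrors the idea behind Oracle's built-in latest-timestamp resolution method but is only an illustration, not the product mechanism.

    def resolve(local, remote):
        """Each value is (pfn, timestamp, site); return the deterministic winner."""
        return max(local, remote, key=lambda v: (v[1], v[2]))

    cern_view = ("castor:/castor.cern.ch/file1", 1086192000, "CERN")
    cnaf_view = ("castor:/castor.cnaf.infn.it/file1", 1086192007, "CNAF")

    # Both sites apply the same rule to the same pair, so both converge
    # on the same winner without manual intervention.
    assert resolve(cern_view, cnaf_view) == resolve(cnaf_view, cern_view)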

  10. Summary • The Replica Location Service is a central part of the LCG infrastructure • Strong requirements in terms of reliability of the service • Significant contribution from GridPP-funded people • The LCG-RLS middleware and service have passed their first production test • Good service stability was achieved • Experience in the Data Challenges has proven essential for improving the performance and scalability of the RLS middleware • The Oracle replication tests are expected to provide important input for defining a replicated RLS and the handling of distributed metadata in general

  11. The RLS Supported Configuration • A “Local Replica Catalogue” (LRC) • Contains the GUID <-> PFN mapping for all local files • A “Replica Metadata Catalogue” (RMC) • Contains the GUID <-> LFN mapping for all local files and all file metadata information • A “Replica Location Index” (RLI) <-- Not deployed in LCG-2 • Allows files at other sites to be found • All LRCs are configured to publish to all remote RLIs • The lookup flow across the three components is sketched below
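Put together, the three catalogues support a Grid-wide lookup: resolve an LFN to its GUID in the RMC, ask the RLI which sites' LRCs know that GUID, then query those LRCs for the physical replicas. The toy data below is illustrative only.

    rmc = {"lfn:/grid/cms/hits.root": "guid-42"}   # RMC: LFN -> GUID (plus metadata)
    rli = {"guid-42": ["CERN", "CNAF"]}            # RLI: GUID -> sites whose LRC holds it
    lrc = {                                        # one LRC per site: GUID -> local PFNs
        "CERN": {"guid-42": ["castor:/castor.cern.ch/hits.root"]},
        "CNAF": {"guid-42": ["castor:/castor.cnaf.infn.it/hits.root"]},
    }

    guid = rmc["lfn:/grid/cms/hits.root"]          # 1. resolve the logical name
    for site in rli.get(guid, []):                 # 2. the RLI points at holding sites
        print(site, lrc[site][guid])               # 3. each site's LRC returns PFNs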
