
DIRAC – the distributed production and analysis for LHCb

A. Tsaregorodtsev, CPPM, Marseille. CHEP 2004, 27 September – 1 October 2004, Interlaken.


Presentation Transcript


  1. DIRAC – the distributed production and analysis for LHCb
  A. Tsaregorodtsev, CPPM, Marseille
  CHEP 2004, 27 September – 1 October 2004, Interlaken

  2. Authors
  • DIRAC development team: TSAREGORODTSEV Andrei, GARONNE Vincent, STOKES-REES Ian, GRACIANI-DIAZ Ricardo, SANCHEZ-GARCIA Manuel, CLOSIER Joel, FRANK Markus, KUZNETSOV Gennady, CHARPENTIER Philippe
  • Production site managers: BLOUW Johan, BROOK Nicholas, EGEDE Ulrik, GANDELMAN Miriam, KOROLKO Ivan, PATRICK Glen, PICKFORD Andrew, ROMANOVSKI Vladimir, SABORIDO-SILVA Juan, SOROKO Alexander, TOBIN Mark, VAGNONI Vincenzo, WITEK Mariusz, BERNET Roland

  3. Outline
  • DIRAC in a nutshell
  • Design goals
  • Architecture and components
  • Implementation technologies
  • Interfacing to LCG
  • Conclusions
  http://dirac.cern.ch

  4. DIRAC in a nutshell
  DIRAC – Distributed Infrastructure with Remote Agent Control
  • LHCb grid system for Monte-Carlo simulation data production and analysis
  • Integrates computing resources available at LHCb production sites as well as on the LCG grid
  • Composed of a set of lightweight services and a network of distributed agents that deliver workload to computing resources
  • Runs autonomously once installed and configured on production sites
  • Implemented in Python, using the XML-RPC service access protocol

  5. DIRAC scale of resource usage
  • Deployed on 20 “DIRAC” and 40 “LCG” sites
  • Effectively saturated LCG and all available computing resources during the 2004 Data Challenge
  • Supported 3500 simultaneous jobs across 60 sites
  • Produced, transferred, and replicated 65 TB of data, plus meta-data
  • Consumed over 425 CPU years during the last 4 months
  See presentation by J. Closier, Wed 16:30 [403]

  6. DIRAC design goals
  • Light implementation
    • Must be easy to deploy on various platforms
    • Non-intrusive: no root privileges, no dedicated machines on sites
    • Must be easy to configure, maintain and operate
    • Using standard components and third-party developments as much as possible
  • High level of adaptability
    • There will always be resources outside the LCG domain: sites that cannot afford LCG, desktops, …
    • We have to use them all in a consistent way
  • Modular design at each level
    • New functionality can be added easily

  7. DIRAC architecture (diagram)

  8. DIRAC architecture
  • Service Oriented Architecture (SOA)
    • Inspired by the OGSA/OGSI “grid services” concept
    • Followed the LCG/ARDA RTAG architecture blueprint (ARDA/RTAG proposal)
  • Open architecture with well-defined interfaces
    • Allowing for replaceable, alternative services
    • Providing choices and competition

  9. DIRAC Services and Resources (diagram)
  • User interfaces: Job monitor, Production manager, GANGA UI, User CLI, BK query webpage, FileCatalog browser
  • DIRAC services: Job Management Service (WMS), JobMonitorSvc, ConfigurationSvc, MonitoringSvc, JobAccountingSvc with AccountingDB, BookkeepingSvc, FileCatalogSvc
  • Agents running at the sites
  • DIRAC resources: DIRAC Sites with DIRAC CEs, LCG Resource Broker, DIRAC Storage (disk files accessed via gridftp, bbftp, rfio)

  10. DIRAC workload management
  • Realizes the PULL scheduling paradigm
    • Agents request jobs whenever the corresponding resource is free
    • Uses Condor ClassAds and the Matchmaker to find jobs suitable for the resource profile
  • Agents steer job execution on the site
  • Jobs report their state and environment to the central Job Monitoring service
  See poster presentation by V. Garonne, Wed [365]
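  As a rough illustration of the PULL paradigm on this slide, the sketch below shows an agent that asks a matcher service for work only when its resource is free. The endpoint URL, the resource description keys and the requestJob/reportStatus method names are assumptions for illustration, not the actual DIRAC WMS interface.

import time
import xmlrpc.client

# Placeholder endpoint; in DIRAC the service addresses would come from the
# Configuration Service.
MATCHER_URL = "http://wms.example.org:8080/RPC2"

def local_slot_free():
    # Stand-in for a check of the local batch system or worker node.
    return True

def run_job(job):
    # Stand-in for steering the job on the local resource.
    print("running job", job["JobID"])

def pull_loop(resource_description):
    matcher = xmlrpc.client.ServerProxy(MATCHER_URL)
    while True:
        if local_slot_free():
            # The agent asks for work only when the resource is free (PULL model).
            job = matcher.requestJob(resource_description)
            if job:
                run_job(job)
                matcher.reportStatus(job["JobID"], "Done")
        time.sleep(60)   # polling period

# pull_loop({"Site": "DIRAC.Example.org", "Platform": "slc3_ia32", "MaxCPUTime": 100000})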

  11. Matching efficiency
  • Averaged 420 ms match time over 60,000 jobs
  • Using Condor ClassAds and the Matchmaker
  • Queued jobs are grouped by categories
  • Matches are performed by category
  • Typically 1,000 to 20,000 jobs queued
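  The category-based matching described above can be sketched as follows. How DIRAC actually derives a category from a job's ClassAd is not shown on the slide, so the category key and the requirement checks here are illustrative assumptions only.

from collections import defaultdict

def category_key(job):
    # Assumption: jobs with identical requirements fall into the same category.
    return (job["Site"], job["Platform"], job["MaxCPUTime"])

def group_by_category(queued_jobs):
    categories = defaultdict(list)
    for job in queued_jobs:
        categories[category_key(job)].append(job)
    return categories

def satisfies(resource, key):
    site, platform, max_cpu = key
    return (site in ("ANY", resource["Site"])
            and platform == resource["Platform"]
            and max_cpu <= resource["CPUTimeLeft"])

def match(resource, queued_jobs):
    # Evaluate one requirement set per category instead of every queued job,
    # which keeps the match time low even with ~20,000 jobs in the queue.
    for key, jobs in group_by_category(queued_jobs).items():
        if satisfies(resource, key):
            return jobs[0]
    return None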

  12. DIRAC: Agent modular design
  • Agent is a container of pluggable modules
    • Modules can be added dynamically
  • Several agents can run on the same site, equipped with different sets of modules as defined in their configuration
  • Data management is based on specialized agents running on the DIRAC sites
  • Example modules (from the diagram): JobAgent, PendingJobAgent, BookkeepingAgent, TransferAgent, MonitorAgent, CustomAgent
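  A minimal sketch of the agent-as-a-container idea, assuming a simple AgentModule/execute() convention for the pluggable modules; the names and the loading mechanism are illustrative, not the actual DIRAC module interface.

import importlib
import time

class AgentContainer:
    """An agent is a container of pluggable modules loaded from its configuration."""
    def __init__(self, module_names):
        self.modules = []
        for name in module_names:
            # Modules can be added dynamically; here they are ordinary Python modules
            # that expose an AgentModule class (an assumption for this sketch).
            mod = importlib.import_module(name)
            self.modules.append(mod.AgentModule())

    def run(self, poll_period=60):
        while True:
            for module in self.modules:
                module.execute()   # each module performs one unit of work per cycle
            time.sleep(poll_period)

# Two agents on the same site could run different module sets, e.g.
# AgentContainer(["JobAgent", "MonitorAgent"]) and AgentContainer(["TransferAgent"]).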

  13. File Catalogs
  • DIRAC incorporates two different File Catalogs
    • Replica tables in the LHCb Bookkeeping Database
    • File Catalog borrowed from the AliEn project
  • Both catalogs have identical XML-RPC service interfaces and can be used interchangeably
  • This was done for redundancy and to gain experience
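  A hedged sketch of how two catalogs with identical XML-RPC interfaces can be used interchangeably; the service URLs and the getReplicas method name are placeholders, not the actual catalog endpoints.

import xmlrpc.client

CATALOG_URLS = [
    "http://bookkeeping.example.org:8080/FileCatalog",   # Bookkeeping replica tables
    "http://alien-fc.example.org:8080/FileCatalog",      # AliEn-based catalog
]

def get_replicas(lfn):
    # Query the catalogs through the same interface; either one can answer,
    # which gives redundancy if one service is unavailable.
    for url in CATALOG_URLS:
        catalog = xmlrpc.client.ServerProxy(url)
        try:
            return catalog.getReplicas(lfn)
        except Exception:
            continue   # fall back to the other catalog
    raise RuntimeError("no catalog available for %s" % lfn)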

  14. Data management tools
  • DIRAC Storage Element is a combination of a standard server and a description of its access in the Configuration Service
    • Pluggable transport modules: gridftp, bbftp, sftp, ftp, http, …
  • DIRAC ReplicaManager interface (API and CLI)
    • get(), put(), replicate(), register(), etc.
  • Reliable file transfer
    • A Request DB keeps outstanding transfer requests
    • A dedicated agent takes a file transfer request and retries it until it is successfully completed
  • Using the WMS for data transfer monitoring
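  The reliable-transfer pattern on this slide (a Request DB plus a retrying agent) might look roughly like the sketch below. The SQLite schema and the ReplicaManager stub are assumptions for illustration; only the method names follow the slide.

import sqlite3
import time

class ReplicaManager:
    """Stub for the ReplicaManager API named on the slide."""
    def replicate(self, lfn, destination_se):
        # Stand-in for the real transfer over gridftp, bbftp, ...
        ...

def transfer_agent(db_path="requests.db", poll_period=300):
    rm = ReplicaManager()
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE IF NOT EXISTS requests (lfn TEXT, se TEXT, done INTEGER DEFAULT 0)")
    while True:
        # Pick up every outstanding request and retry it until it succeeds.
        for lfn, se in db.execute("SELECT lfn, se FROM requests WHERE done = 0").fetchall():
            try:
                rm.replicate(lfn, se)
                db.execute("UPDATE requests SET done = 1 WHERE lfn = ? AND se = ?", (lfn, se))
                db.commit()
            except Exception:
                pass   # leave the request pending; it will be retried on the next cycle
        time.sleep(poll_period)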

  15. Other Services
  • Configuration Service
    • Provides configuration information for various system components (services, agents, jobs)
  • Bookkeeping (Metadata + Provenance) database
    • Stores data provenance information
    • See presentation by C. Cioffi, Mon [392]
  • Monitoring and Accounting service
    • A set of services to monitor job states and progress and to accumulate statistics of resource usage
    • See presentation by M. Sanchez, Thu [388]

  16. Implementation technologies

  17. XML-RPC protocol
  • Standard, simple, available out of the box in the standard Python library
    • Both server and client
    • Uses the expat XML parser
  • Server
    • Very simple, socket based
    • Multithreaded
    • Supports request rates of up to 40 Hz
  • Client
    • Dynamically built service proxy
  • No need was felt for anything more complex (SOAP, WSDL, …)
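  A minimal sketch of such a multithreaded server and its dynamically built client proxy, using only the standard library (shown here with the modern Python 3 module names); the service name, port and ping method are illustrative, not a DIRAC service.

from socketserver import ThreadingMixIn
from xmlrpc.server import SimpleXMLRPCServer
import xmlrpc.client

class ThreadedXMLRPCServer(ThreadingMixIn, SimpleXMLRPCServer):
    """Multithreaded variant of the simple socket-based server."""

def ping(name):
    return "hello %s" % name

def run_server(host="localhost", port=8080):
    server = ThreadedXMLRPCServer((host, port), allow_none=True)
    server.register_function(ping)
    server.serve_forever()

def call_service():
    # The client-side service proxy is built dynamically from the URL,
    # so remote methods are called like local functions.
    proxy = xmlrpc.client.ServerProxy("http://localhost:8080", allow_none=True)
    return proxy.ping("DIRAC")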

  18. Instant Messaging in DIRAC
  • Jabber/XMPP IM
    • Asynchronous, buffered, reliable messaging framework
    • Connection based: authenticate once
    • “Tunnel” back to the client – bi-directional connection with just outbound connectivity (no firewall problems, works with NAT)
  • Used in DIRAC to
    • Send control instructions to components (services, agents, jobs): XML-RPC over Jabber, broadcast or to a specific destination
    • Monitor the state of components
    • Provide an interactivity channel with running jobs
  See poster presentation by I. Stokes-Rees, Wed [368]
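  The "XML-RPC over Jabber" idea can be sketched with the standard library marshalling functions; the send_im transport and the method name are placeholders, since the actual Jabber/XMPP layer used by DIRAC is not shown on the slide.

import xmlrpc.client

def send_im(jid, payload):
    # Placeholder for the Jabber/XMPP transport: deliver 'payload' to the
    # component addressed by its Jabber ID (a broadcast would loop over IDs).
    ...

def call_over_im(jid, method, *args):
    # Marshal the call exactly as it would appear in an HTTP XML-RPC request body,
    # then carry it over the IM channel instead of HTTP.
    request = xmlrpc.client.dumps(args, methodname=method)
    send_im(jid, request)

def handle_im(payload):
    # On the receiving component: unmarshal and dispatch the control instruction.
    args, method = xmlrpc.client.loads(payload)
    print("received control instruction:", method, args)

# Example (hypothetical method name): ask an agent to pause one of its modules.
# call_over_im("agent01@im.example.org", "pauseModule", "TransferAgent")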

  19. Services reliability issues
  • If something can fail, it will
  • Regular back-up of the underlying database
  • Journaling of all write operations
  • Running more than one instance of a service
    • E.g. the Configuration Service runs at CERN and in Oxford
  • Running services reliably with the runit set of tools:
    • Automatic restart on failure
    • Automatic logging with the possibility of rotation
    • Simple interface to pause, stop, and send control signals to services
    • Runs a service in daemon mode
    • Can be used entirely in user space
  http://smarden.org/runit/

  20. Interfacing to LCG

  21. Dynamically deployed agents
  How to involve resources where the DIRAC agents are not yet installed or cannot be installed?
  • Workload management with resource reservation
    • Send the agent as a regular job
    • Turn a WN into a virtual LHCb production site
  • This strategy was applied for the DC04 production on LCG:
    • Effectively using LCG services to deploy the DIRAC infrastructure on LCG resources
  • Efficiency:
    • >90% success rate for DIRAC jobs on LCG, while the success rate of the LCG jobs themselves was 60%
    • No harm to the DIRAC production system
  • One person ran the LHCb DC04 production on LCG
  • Most intensive use of LCG2 up to now (> 200 CPU years)
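  A rough sketch of the resource-reservation idea: a small script is submitted to LCG as an ordinary job and, on the worker node, installs and starts a DIRAC agent in user space, turning the node into a temporary LHCb production site. The download URL, tarball layout and start-up options are purely hypothetical, not the actual DC04 scripts.

import os
import subprocess
import tempfile
import urllib.request

# Placeholder location of the agent software distribution.
AGENT_TARBALL = "http://dirac.example.org/dirac-agent.tar.gz"

def deploy_and_run_agent():
    workdir = tempfile.mkdtemp(prefix="dirac-agent-")
    os.chdir(workdir)

    # Fetch and unpack the agent software in user space (no root privileges needed).
    urllib.request.urlretrieve(AGENT_TARBALL, "dirac-agent.tar.gz")
    subprocess.check_call(["tar", "xzf", "dirac-agent.tar.gz"])

    # Start the agent: from now on the worker node pulls LHCb jobs like a DIRAC site.
    subprocess.check_call(["python", "dirac-agent/start_agent.py",
                           "--site", "LCG.pilot"])

if __name__ == "__main__":
    deploy_and_run_agent()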

  22. Conclusions
  • A Service Oriented Architecture is essential for building ad hoc grid systems that meet the needs of particular organizations out of components available on the “market”
  • The concept of light, easy-to-customize-and-deploy agents as components of a distributed workload management system proved to be very useful
  • The scalability of the system made it possible to saturate all available resources, including LCG, during the recent Data Challenge exercise
