
The ATLAS Strategy for Distributed Analysis on several Grid Infrastructures

D. Liko, IT/PSS, for the ATLAS Distributed Analysis Community

Presentation Transcript


  1. The ATLAS Strategy for Distributed Analysis on several Grid Infrastructures D. Liko, IT/PSS for the ATLAS Distributed Analysis Community

  2. Overview • Distributed Analysis in ATLAS • Grids, Computing Model • The ATLAS Strategy • Production system • Direct submission • Common Aspects • Data management • Transformations • GUI • Initial experiences • Production system on LCG • PANDA on OSG • GANGA

  3. ATLAS Grid Infrastructure • Three grids • LCG • OSG • Nordugrid • Significant resources, but different middleware • Teams working on solutions are typically associated with one grid and its middleware • In principle ATLAS resources are available to all ATLAS users • Users are interested in using their local systems with priority • Hence not only a central system, but flexibility concerning the middleware Poster 181: Prototype of the Swiss ATLAS Computing Infrastructure

  4. Distributed Analysis • At this point the emphasis is on a batch model to implement the ATLAS Computing Model • Interactive solutions are difficult to realize on top of the current middleware layer • We expect our users to send large batches of short jobs to optimize their turnaround • Scalability • Data Access • Analysis in parallel to production • Job Priorities

  5. ATLAS Computing Model • Data for analysis will be distributed across all Tier-1 and Tier-2 centers • AOD & ESD • T1 & T2 are open for analysis jobs • The computing model foresees 50% of grid resources to be allocated for analysis • Users will send jobs to the data and extract the relevant data • typically NTuples or similar

  6. Requirements • Data for a year of data taking • AOD – 150 TB • ESD • Scalability • Last year up to 10,000 jobs per day for production (job duration up to 24 hours) • The grid and our needs will grow • We expect our analysis users to run much shorter jobs • Job delivery capacity of the order of 10^6 jobs per day • Peak capacity • Involves several grids • Longer jobs can reduce this number (but might not always be practical) • Job Priorities • Today we need short queues • In the future we need to steer the resource consumption of our physics and detector groups based on VOMS groups
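
A rough back-of-the-envelope check of the numbers on this slide (a minimal sketch only; the assumed average job duration and slot count are illustrative, not figures from the presentation):

```python
# Rough scaling estimate for the analysis job rate quoted above.
# The assumed job length and slot count are illustrative placeholders.
SECONDS_PER_DAY = 24 * 3600

target_jobs_per_day = 1_000_000     # order-of-magnitude goal from the slide
avg_job_duration_s = 30 * 60        # assume ~30-minute analysis jobs
concurrent_slots = 20_000           # assumed CPU slots available across the grids

# Jobs one slot can finish per day, and the resulting daily throughput.
jobs_per_slot_per_day = SECONDS_PER_DAY / avg_job_duration_s
total_jobs_per_day = concurrent_slots * jobs_per_slot_per_day

# Sustained submission rate needed to reach the target.
required_rate_hz = target_jobs_per_day / SECONDS_PER_DAY

print(f"throughput with assumed slots: {total_jobs_per_day:,.0f} jobs/day")
print(f"submission rate needed for 10^6 jobs/day: {required_rate_hz:.1f} jobs/s")
```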

  7. ATLAS Strategy • Production system • Seamless access to all ATLAS grid resources • Direct submission to the grids • LCG • LCG/gLite Resource Broker • CondorG • OSG • PANDA • Nordugrid • ARC Middleware

  8. [Architecture diagram: the ATLAS production system (Prodsys) — jobs from the ProdDB are dispatched through the grid-specific executors (Dulcinea, Lexor, CondorG, PANDA) to resource brokers (RB) and computing elements (CE).]

  9. Production System • Provides a layer on top of the middleware • Increases the robustness of the system • Retry and fallback mechanisms for both workload and data management • Our grid experience is captured in the executors • Jobs can be run on all systems • Redesign based on the experience of last year • New Supervisor – Eowyn • New Executors • Connects to the new Data Management • Adaptation for Distributed Analysis • Configurable user jobs • Access control based on X.509 certificates • Graphical User Interface ATCOM Presentation 110: ATLAS Experience on Large Scale Production on the Grid
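
To illustrate the retry-and-fallback idea mentioned above, here is a minimal conceptual sketch; the submission functions are toy stand-ins, not the actual Prodsys executor interfaces:

```python
import random
import time

# Toy stand-ins for grid-specific submission back-ends; the real Prodsys
# executors (e.g. Lexor, CondorG) are not reproduced here.
def submit_via_broker_a(job):
    if random.random() < 0.5:
        raise RuntimeError("broker A rejected the job")
    return f"A-{job}-id"

def submit_via_broker_b(job):
    return f"B-{job}-id"

def run_with_retries(job, executors, max_attempts=3, backoff_s=1):
    """Try each executor in turn, retrying a few times before falling back."""
    for submit in executors:
        for attempt in range(1, max_attempts + 1):
            try:
                return submit(job)            # success: return the grid job id
            except RuntimeError as err:       # any submission failure
                print(f"{submit.__name__} attempt {attempt} failed: {err}")
                time.sleep(backoff_s)
    raise RuntimeError(f"all executors exhausted for job {job}")

print(run_with_retries("evgen-001", [submit_via_broker_a, submit_via_broker_b]))
```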

  10. LCG • Resource Broker • Scalability • Reliability • Throughput • New gLite Resource Broker • Bulk submission • Many other enhancements • Studied in the ATLAS LCG/EGEE Taskforce • Special setup in Milano & Bologna • gLite – 2-way Intel Xeon 2.8 GHz CPU (with hyper-threading), 3 GByte memory • LCG – 2-way Intel Xeon 2.4 GHz CPU (without hyper-threading), 2 GByte memory • Both use the same BDII (52 CEs in total) • Several bug fixes and optimizations • Steady collaboration with the developers

  11. LCG vs gLite Resource Broker • Bulk submission is much faster • Sandbox handling is better and faster • Matchmaking is now the limiting factor • Strong effect from ranking

  12. CondorG • Conceptually similar to the LCG RB, but different architecture • Scaling by increasing the number of schedulers • No logging & bookkeeping, but a scheduler keeps track of the jobs • Used in parallel during the DC2 & Rome productions and increased our use of grid resources • Submission via the Production System, but direct submission is also conceivable Presentation 401: A Grid of Grids using CondorG

  13. Last year's experience • Adding a CondorG-based executor to the production system helped us increase the number of jobs on LCG

  14. PANDA • New Prodsys executor for OSG • Pilot jobs • Resource brokering • Close integration with DDM • Operational in production since December Presentation 347: PANDA: Production and Distributed Analysis System for ATLAS
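
The pilot-job idea mentioned above, in a minimal conceptual sketch (the queue, payload format, and function names are hypothetical illustrations, not the PANDA server protocol):

```python
import queue
import subprocess

# A toy "server-side" job queue; in PANDA the pilot instead contacts the
# central server to ask for a payload once it is running on a worker node.
job_queue = queue.Queue()
job_queue.put({"id": 42, "cmd": ["echo", "running analysis payload"]})

def pilot(work_queue):
    """A pilot lands on a worker node first, then pulls the real work.

    The key point: the grid only ever sees the lightweight pilot; the
    brokering of actual payloads happens late, when the pilot asks what
    to run on the resources it already holds.
    """
    while not work_queue.empty():
        job = work_queue.get()
        result = subprocess.run(job["cmd"], capture_output=True, text=True)
        print(f"job {job['id']} finished, rc={result.returncode}, "
              f"output={result.stdout.strip()!r}")

pilot(job_queue)
```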

  15. PANDA • Direct submission • Regional production • Analysis jobs • Key features for analysis • Analysis transformations • Job chaining • Easy job submission • Monitoring • DDM end-user tool • Transformation repository

  16. ARC Middleware • Standalone ARC client software – 13 MB installation • The CE has extended functionality • Input files can be staged in and are cached • Output files can be staged out • Controlled by XRSL, an extended version of the Globus RSL • Brokering is part of the submission in the client software • Job delivery rates of 30 to 50 per minute have been reported • Logging & bookkeeping on the site • Currently about 5000 CPUs, 800 available for ATLAS
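
For flavour, a job description in the xRSL style mentioned above could look roughly like the string composed below (a sketch only; the attribute names follow common xRSL usage but should be checked against the NorduGrid documentation for the CE in question):

```python
# Compose a minimal xRSL-style job description as a string. Attribute names
# (executable, inputFiles, cpuTime, ...) are illustrative, not validated.
attributes = [
    '(executable="run_analysis.sh")',
    '(arguments="AOD.pool.root")',
    '(inputFiles=("run_analysis.sh" "")("AOD.pool.root" ""))',
    '(outputFiles=("ntuple.root" ""))',
    '(stdout="stdout.txt")',
    '(stderr="stderr.txt")',
    '(jobName="atlas-analysis-test")',
    '(cpuTime="120")',
]
xrsl = "&" + "".join(attributes)
print(xrsl)
```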

  17. Common Aspects • Data management • Transformations • GUI

  18. ATLAS Data Management • Based on datasets • The PoolFileCatalog API is used to hide grid differences • On LCG, LFC acts as the local replica catalog • Aims to provide uniform access to data on all grids • FTS is used to transfer data between the sites • Data management is evidently a central aspect of Distributed Analysis • PANDA is closely integrated with DDM and operational • The LCG instance was closely coupled with SC3 • Right now we run a smaller instance for test purposes • The final production version will be based on new middleware for SC4 (FPS) Presentation 75: A Scalable Distributed Data Management System for ATLAS
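
A conceptual sketch of the "hide the grid differences behind one catalogue interface" idea described above (the class and method names are hypothetical; the real PoolFileCatalog and LFC APIs are not reproduced here):

```python
from abc import ABC, abstractmethod

class ReplicaCatalogue(ABC):
    """Uniform lookup interface: logical file name -> physical replicas."""

    @abstractmethod
    def replicas(self, lfn):
        ...

class LFCCatalogue(ReplicaCatalogue):
    """Stand-in for an LFC-backed replica catalogue on LCG."""
    def replicas(self, lfn):
        # A real implementation would query LFC; here we return a fake SURL.
        return [f"srm://some-lcg-se/atlas/{lfn}"]

class LocalXMLCatalogue(ReplicaCatalogue):
    """Stand-in for a site-local XML file catalogue."""
    def replicas(self, lfn):
        return [f"file:///data/atlas/{lfn}"]

def resolve_files(lfns, catalogue):
    """Resolve the logical files of a dataset through one common interface."""
    return {lfn: catalogue.replicas(lfn) for lfn in lfns}

# The analysis code never needs to know which grid it is running on:
print(resolve_files(["AOD.0001.pool.root"], LFCCatalogue()))
```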

  19. Transformations • Common transformations are a fundamental aspect of the ATLAS strategy • Overall there is no homogeneous system… but a common transformation system allows the same job to run on all supported systems • All systems should support them • In the end users can adapt easily to a new submission system if they do not need to adapt their jobs • Separation of functionality into grid-dependent wrappers and grid-independent execution scripts • A set of parameters is used to configure the specific job options • A new implementation in Python is under way
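
A minimal sketch of the wrapper/execution-script split described above (file names, parameters, and the payload command are invented for illustration; the real ATLAS transformations are not shown):

```python
import subprocess
import sys

# Grid-independent "execution script": all configuration arrives as plain
# parameters, so the same script can run unchanged on LCG, OSG or Nordugrid.
def run_transformation(input_file, output_file, max_events):
    job_options = [
        f"InputCollections=['{input_file}']",
        f"OutputNTuple='{output_file}'",
        f"EvtMax={max_events}",
    ]
    # Placeholder payload; a real transformation would launch Athena here.
    cmd = ["echo", "athena.py"] + job_options
    return subprocess.run(cmd).returncode

# Grid-dependent "wrapper" part: each grid only has to map its own job
# arguments onto the common parameter set before calling the script.
if __name__ == "__main__":
    args = sys.argv[1:] or ["AOD.pool.root", "ntuple.root", "1000"]
    sys.exit(run_transformation(args[0], args[1], int(args[2])))
```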

  20. GANGA – The GUI for the Grid • Common project with LHCb • Plugins define the applications • Currently: Athena and Gaudi, ADA (DIAL) • …and the backends • Currently: Fork, LSF, PBS, Condor, LCG, gLite, DIAL and DIRAC Presentation 318: GANGA – A Grid User Interface
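
In GANGA's Python interface, combining an application plugin with a backend plugin looks roughly like the sketch below (the exact options of the Athena and LCG plugins may differ between GANGA versions; treat this as an illustration, not a recipe):

```python
# Inside a GANGA session the plugin classes are already in scope; option
# names below are illustrative and may vary with the GANGA version in use.
j = Job()
j.name = 'aod-analysis-test'
j.application = Athena()     # application plugin: describes the Athena job
j.backend = LCG()            # backend plugin: decides where the job runs
j.submit()

# Moving the same job to a batch system is just a different backend plugin,
# e.g. j.backend = LSF(), without touching the application configuration.
```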

  21. GANGA – latest developments • New version 4 • Job splitting • GUI • Work on plugins for various systems is ongoing

  22. Initial experiences • PANDA on OSG • Analysis with the Production System • GANGA

  23. PANDA on OSG • pathena • Lightweight submission interface to PANDA • DIAL • The system submits analysis jobs to PANDA to get access to grid resources • First users are working on the system Presentation 38: DIAL: Distributed Interactive Analysis of Large Datasets
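
As a flavour of how lightweight the pathena route is, a submission could be wrapped as below (a sketch; the `--inDS`/`--outDS` option names are the commonly documented ones, but verify them against the pathena version installed at your site, and the file and dataset names are placeholders):

```python
import subprocess

# Placeholder job-options file and dataset names; replace with real ones.
cmd = [
    "pathena", "MyAnalysis_jobOptions.py",
    "--inDS",  "some.input.dataset.AOD",      # input dataset known to DDM
    "--outDS", "user.jdoe.myanalysis.ntup",   # output dataset to be created
]
subprocess.run(cmd, check=True)
```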

  24. Distributed Analysis using Prodsys • Currently based on CondorG • A Lexor-based system is on its way • GUI: ATCOM • A central team operates the executor as a service • Several analyses were ported to the system • Selected users are testing it Poster 264: Distributed Analysis with the ATLAS Production System

  25. GANGA • Most relevant • Athena application • LCG backend • Evaluated by several users • Simulation & Analysis • Faster submission is necessary • Prodsys/PANDA/gLite/CondorG • Feedback • All based on the CLI • A new GUI will be presented soon

  26. Summary • The systems have been exposed to selected users • Positive feedback • Direct contact with the experts is still essential • For this year – power users and grid experts… • Main issues • Data distribution → New DDM • Scalability → New Prodsys/PANDA/gLite/CondorG • Analysis in parallel to Production → Job Priorities

  27. Conclusions • As of today, Distributed Analysis in ATLAS is still work in progress (the detector too) • The expected data volume requires us to perform analysis on the grid • Important pieces are coming into place • We will verify Distributed Analysis according to the ATLAS Computing Model in the context of SC4
