ATLAS Grid

Presentation Transcript


  1. ATLAS Grid
  Kaushik De, University of Texas at Arlington
  HiPCAT THEGrid Workshop, UTA, July 8, 2004

  2. Goals and Status
  • Grids - need no introduction, after the previous talks
  • ATLAS - experiment at the LHC, now being installed; over 2000 physicists, 1-2 petabytes of data per year, starting 2007-08
  • ATLAS needs the Grid
  • Deployment and testing are well under way
  • Series of data challenges (using simulated data) to build computing/grid capability slowly up to full capacity before 2007
  • UTA has played a leading role in ATLAS grids since 2002
  • Focus of this talk - how to get involved in grid projects and make contributions as physicists and computer scientists
  • A huge amount of application-specific software still needs to be developed, tested, optimized, and deployed...

  3. ATLAS Data Challenges
  • Original goals (Nov 15, 2001)
  • Test the computing model, its software and its data model, and ensure the correctness of the technical choices to be made
  • Data Challenges should be executed at the prototype Tier centres
  • Data Challenges will be used as input for a Computing Technical Design Report due by the end of 2003 (?) and for preparing a MoU
  • Current status
  • Goals are evolving as we gain experience
  • Computing TDR ~end of 2004
  • DCs are a ~yearly sequence of increasing scale and complexity
  • DC0 and DC1 (completed)
  • DC2 (2004), DC3, and DC4 planned
  • Grid deployment and testing is a major part of the DCs

  4. ATLAS DC1: July 2002 - April 2003
  • Goals: produce the data needed for the HLT TDR; get as many ATLAS institutes involved as possible; worldwide collaborative activity
  • Participation: 56 institutes
  • Australia, Austria, Canada, CERN, China, Czech Republic, Denmark *, France, Germany, Greece, Israel, Italy, Japan, Norway *, Poland, Russia, Spain, Sweden *, Taiwan, UK, USA *
  • * using Grid

  5. DC1 Statistics (G. Poulard, July 2003)

  Process                  | No. of events | CPU time (kSI2k.months) | CPU-days (400 SI2k) | Volume of data (TB)
  Simulation, physics evt. | 10^7          | 415                     | 30000               | 23
  Simulation, single part. | 3x10^7        | 125                     | 9600                | 2
  Lumi02 pile-up           | 4x10^6        | 22                      | 1650                | 14
  Lumi10 pile-up           | 2.8x10^6      | 78                      | 6000                | 21
  Reconstruction           | 4x10^6        | 50                      | 3750                |
  Reconstruction + Lvl1/2  | 2.5x10^6      | (84)                    | (6300)              |
  Total                    |               | 690 (+84)               | 51000 (+6300)       | 60

  6. U.S. DC1 Data Production
  • Year-long process, Summer 2002-2003
  • Played the 2nd largest role in ATLAS DC1
  • Exercised both farm- and grid-based production
  • 10 U.S. sites participating
  • Tier 1: BNL; Tier 2 prototypes: BU, IU/UC; Grid Testbed sites: ANL, LBNL, UM, OU, SMU, UTA (UNM & UTPA will join for DC2)
  • Generated ~2 million fully simulated, piled-up and reconstructed events
  • U.S. was the largest grid-based DC1 data producer in ATLAS
  • Data used for the HLT TDR, the Athens physics workshop, reconstruction software tests...

  7. U.S. ATLAS Grid Testbed
  • BNL - U.S. Tier 1, 2000 nodes, 5% for ATLAS, 10 TB, HPSS through Magda
  • LBNL - pdsf cluster, 400 nodes, 5% for ATLAS (more if idle, ~10-15% used), 1 TB
  • Boston U. - prototype Tier 2, 64 nodes
  • Indiana U. - prototype Tier 2, 64 nodes
  • UT Arlington - 200 CPUs, 70 TB
  • Oklahoma U. - OSCER facility
  • U. Michigan - test nodes
  • ANL - test nodes, JAZZ cluster
  • SMU - 6 production nodes
  • UNM - Los Lobos cluster
  • U. Chicago - test nodes

  8. U.S. Production Summary
  • Exercised both farm- and grid-based production
  • Valuable large-scale grid-based production experience
  • Total ~30 CPU-years delivered to DC1 from the U.S.
  • Total produced file size ~20 TB on the HPSS tape system, ~10 TB on disk
  • (The slide includes a per-site production summary: black entries are majority grid-produced, blue entries majority farm-produced.)

  9. DC1 Production Systems
  • Local batch systems - bulk of production
  • GRAT - grid scripts, ~50k files produced in the U.S.
  • NorduGrid - grid system, ~10k files in the Nordic countries
  • AtCom - GUI, ~10k files at CERN (mostly batch)
  • GCE - Chimera based, ~1k files produced
  • GRAPPA - interactive GUI for the individual user
  • EDG/LCG - test files only
  • + systems I forgot...
  • Great diversity - but convergence is needed soon!

  10. GRAT Software
  • GRid Applications Toolkit
  • Developed by KD, Horst Severini, Mark Sosebee, and students
  • Based on Globus, Magda & MySQL
  • Shell & Python scripts, modular design
  • Rapid development platform
  • Quickly develop packages as needed by the DC
  • Physics simulation (GEANT/ATLSIM)
  • Pile-up production & data management
  • Reconstruction
  • Test grid middleware, test grid performance
  • Modules can be easily enhanced or replaced, e.g. EDG resource broker, Chimera, replica catalogue... (in progress)
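
A rough illustration of the "modular shell/Python scripts around Globus, Magda & MySQL" approach described above; the function names, remote run script and bookkeeping table are invented for illustration and are not the actual GRAT code:

```python
# Hypothetical GRAT-style submission module: a thin Python wrapper around the
# Globus command-line tools plus a MySQL bookkeeping table (Magda-like).
import subprocess

def submit_simulation(gatekeeper, run_script, partition):
    """Run one simulation partition on a remote gatekeeper via globus-job-run."""
    cmd = ["globus-job-run", gatekeeper, run_script, str(partition)]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError("submission failed: %s" % result.stderr)
    return result.stdout  # output of the remote job wrapper

def register_output(cursor, lfn, site):
    """Record a produced logical file in a MySQL catalog table."""
    cursor.execute(
        "INSERT INTO produced_files (lfn, site) VALUES (%s, %s)", (lfn, site)
    )
```

Swapping out the submission command or the catalog call is what makes such modules easy to enhance or replace (e.g. by an EDG resource broker or Chimera), as the slide notes.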

  11. DC1 Production Experience
  • Grid paradigm works, using Globus
  • Opportunistic use of existing resources: run anywhere, from anywhere, by anyone...
  • Successfully exercised grid middleware with increasingly complex tasks
  • Simulation: create physics data from pre-defined parameters and input files; CPU intensive
  • Pile-up: mix ~2500 min-bias data files into the physics simulation files; data intensive
  • Reconstruction: data intensive, multiple passes
  • Data tracking: multiple steps, one -> many -> many more mappings

  12. DC2: Goals
  • The goals include:
  • Full use of Geant4, POOL and the LCG applications
  • Pile-up and digitization in Athena
  • Deployment of the complete Event Data Model and the Detector Description
  • Simulation of the full ATLAS detector and the 2004 combined test beam
  • Test the calibration and alignment procedures
  • Use the Grid middleware and tools widely
  • Large-scale physics analysis
  • Computing model studies (document end 2004)
  • Run as much of the production as possible on Grids
  • Demonstrate use of multiple grids
  • (These slides are from Gilbert Poulard's ATLAS talk a couple of weeks ago)

  13. Task Flow for DC2 Data
  • (Diagram: event generation with Pythia produces HepMC events; Geant4 detector simulation produces hits + MCTruth; digitization and pile-up produce digits (RDO); event mixing produces byte-stream raw digits; reconstruction produces ESD. Persistency: Athena-POOL.)
  • Data volumes for 10^7 events: roughly 5-30 TB per sample (physics events, min. bias events, piled-up events, mixed events, mixed events with pile-up)

  14. DC2 Operation
  • Consider DC2 as a three-part operation:
  • Part I: production of simulated data (June-July 2004)
  • Needs Geant4, digitization and pile-up in Athena, POOL persistency
  • "Minimal" reconstruction just to validate the simulation suite
  • Will run "preferably" on the "Grid"
  • Part II: test of Tier-0 operation (August 2004)
  • Needs full reconstruction software following the RTF report design, definition of AODs and TAGs
  • (Calibration/alignment and) reconstruction will run on the Tier-0 prototype as if the data were coming from the online system (at 10% of the rate)
  • Input is ByteStream (raw data)
  • Output (ESD+AOD) will be distributed to the Tier-1s in real time for analysis
  • Part III: test of distributed analysis on the Grid (Sept.-Oct. 2004)
  • Access to event and non-event data from anywhere in the world, both in organized and chaotic ways
  • In parallel: run distributed reconstruction on simulated data (from RODs)

  15. DC2 Resources (based on release 8.0.5)

  Phase I (June-July):
  Process                    | No. of events | Time duration (months) | CPU power (kSI2k) | Volume of data (TB) | At CERN (TB) | Off site (TB)
  Simulation                 | 10^7          | 1                      | 2000*             | 20                  | 4            | 16
  RDO (Hits)                 | 10^7          | 1                      | ~100              | 20                  | 4            | 16
  Pile-up (*) / Digitization | 10^7          | 1                      | 600               | 35                  | 7            | ~30
  Event mixing & Byte-stream | 10^7          | 1                      | (small)           | 25                  | 25           | 0
  Total Phase I              | 10^7          | 1                      | 2800              | ~100                | ~40          | ~50

  Phase II (mid-August):
  Reconstruction Tier-0      | 10^7          | 0.5                    | 600               | 5                   | 5            | 10
  Reconstruction Tier-1      | 10^7          | 2                      | 600               | 5                   | 0            | 5

  Total                      | 10^7          |                        |                   | 110                 | ~45          | ~65

  16. New Production System for DC2
  • Goals
  • Automated data production system for all ATLAS facilities
  • Common database for all production - currently Oracle
  • Common supervisor run by all facilities/managers - Windmill
  • Common data management system - Don Quijote
  • Executors developed by middleware experts (Capone, LCG, NorduGrid, batch systems, CanadaGrid...)
  • Final verification of data done by the supervisor
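
A minimal sketch of the flow implied by this slide, with stand-in ProdDB and executor objects (none of this is the actual Windmill code): the supervisor pulls job definitions from the common database, hands batches to whichever executor has free capacity, and performs the final verification itself.

```python
# Hypothetical supervisor loop following the common-supervisor/executor split
# described above; ProdDB and the executor objects are stand-ins.
import time

def supervise(prod_db, executors, poll_seconds=60):
    while True:
        for ex in executors:
            wanted = ex.num_jobs_wanted()            # executor advertises free capacity
            if wanted > 0:
                jobs = prod_db.acquire_jobs(wanted)  # lock job definitions in the common DB
                ex.execute_jobs(jobs)                # hand them to the grid-specific executor
        for ex in executors:
            for job_id, state in ex.get_status():
                if state == "done" and prod_db.verify_output(job_id):
                    prod_db.mark_done(job_id)        # final verification done by the supervisor
                elif state == "failed":
                    prod_db.release(job_id)          # make the job available again
        time.sleep(poll_seconds)
```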

  17. ATLAS Production System (architecture diagram)
  • (Diagram: the production database (prodDB), the AMI bookkeeping database and the Don Quijote data management system (dms) sit above the Windmill supervisor instances; each supervisor talks via SOAP or Jabber to an executor - Lexor (LCG exe), Dulcinea (NG exe), Capone (G3 exe) and an LSF exe - which submits to its grid or batch system (LCG, NorduGrid, Grid3, LSF) and uses the corresponding RLS replica catalog.)

  18. ATLAS Production System
  • Components
  • Supervisor: Windmill (Grid3)
  • Executors:
  • Capone (Grid3)
  • Dulcinea (NG)
  • Lexor (LCG)
  • "Legacy systems" (FZK; Lyon)
  • Data Management System (DMS): Don Quijote (CERN)
  • Bookkeeping: AMI (LPSC)

  19. Supervisor - Executors (Jabber communication pathway)
  • (Diagram: the Windmill supervisor exchanges the messages numJobsWanted, executeJobs, getExecutorData, getStatus, fixJob and killJob with the executors - Lexor, Dulcinea, Capone and the legacy systems - which in turn drive the execution sites on the grids; the supervisor also talks to the production database (jobs database) and to Don Quijote (file catalog).)
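
The six message names in the diagram define the supervisor-executor contract. A minimal Python rendering of that contract, with the method names transliterated from the messages and everything else assumed (the real Lexor, Dulcinea and Capone implement the same verbs against their own middleware):

```python
# Hypothetical executor interface mirroring the six supervisor messages shown
# in the diagram: numJobsWanted, executeJobs, getExecutorData, getStatus,
# fixJob, killJob.
from abc import ABC, abstractmethod

class Executor(ABC):
    @abstractmethod
    def num_jobs_wanted(self) -> int:
        """How many new jobs this executor can accept right now."""

    @abstractmethod
    def execute_jobs(self, job_definitions: list) -> None:
        """Translate the job definitions and submit them to the underlying grid."""

    @abstractmethod
    def get_executor_data(self) -> dict:
        """Static information about this executor (grid flavour, version, ...)."""

    @abstractmethod
    def get_status(self) -> list:
        """(job_id, state) pairs for the jobs this executor is managing."""

    @abstractmethod
    def fix_job(self, job_id: str) -> None:
        """Try to repair a job flagged as problematic by the supervisor."""

    @abstractmethod
    def kill_job(self, job_id: str) -> None:
        """Cancel a running job at the supervisor's request."""
```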

  20. Windmill - Supervisor
  • Supervisor development / U.S. DC production team
  • UTA: Kaushik De, Mark Sosebee, Nurcan Ozturk + students
  • BNL: Wensheng Deng, Rich Baker
  • OU: Horst Severini
  • ANL: Ed May
  • Windmill web page: http://www-hep.uta.edu/windmill
  • Windmill status
  • DC2 started June 24th, 2004
  • Windmill is being used by all three grids in ATLAS
  • Thousands of jobs worldwide every day
  • Exciting and busy times - bringing interoperability to grids

  21. Windmill Messaging
  • (Diagram: the Jabber server acts as an XML switch; the supervisor agent and executor agent exchange XMPP (XML) messages through it, and a web server is reachable via SOAP.)
  • All messaging is XML based
  • Agents communicate using the Jabber (open chat) protocol
  • Agents have the same command line interface - GUI in the future
  • Agents & web server can run at the same or different locations
  • The executor accesses the grid directly and/or through web services
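
All messages are XML carried over Jabber or SOAP. Purely as an illustration (the real Windmill message schema is not shown in the talk, so the element and attribute names below are invented), a request could be assembled and parsed with the Python standard library:

```python
# Illustrative only: building and parsing a small XML message of the kind
# exchanged between supervisor and executor agents.
import xml.etree.ElementTree as ET

def build_request(verb, sender):
    msg = ET.Element("message", {"from": sender, "type": "request"})
    ET.SubElement(msg, verb)                # e.g. <numJobsWanted/>
    return ET.tostring(msg, encoding="unicode")

def parse_message(xml_text):
    root = ET.fromstring(xml_text)
    body = root[0]                          # first child carries the payload
    return root.get("from"), body.tag

print(build_request("numJobsWanted", "windmill-uta"))
# <message from="windmill-uta" type="request"><numJobsWanted /></message>
```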

  22. Intelligent Agents
  • (Diagram: Jabber clients connect to a Jabber server over XMPP.)
  • Supervisor and executors are intelligent communication agents
  • Uses the Jabber open-source instant messaging framework
  • The Jabber server routes XMPP messages - acts as an XML data switch
  • Reliable p2p asynchronous message delivery through firewalls
  • Built-in support for dynamic 'directory', 'discovery', 'presence'
  • Extensible - we can add monitoring and debugging agents easily
  • Provides 'chat' capability for free - collaboration among operators
  • Jabber grid proxy under development (LBNL - Agarwal)

  23. Capone Executor
  • Various executors are being developed
  • Capone - U.S. VDT executor, by U. of Chicago and Argonne
  • Lexor - LCG executor, mostly by Italian groups
  • NorduGrid, batch (Munich), Canadian, Australian(?)
  • Capone is based on GCE (Grid Computing Environment)
  • (VDT Client/Server, Chimera, Pegasus, Condor, Globus)
  • Status:
  • Python module
  • Process "thread" for each job
  • Archive of managed jobs
  • Job management
  • Grid monitoring
  • Aware of key parameters (e.g. available CPUs, jobs running)
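
Capone is a Python module that keeps a processing "thread" and an archive entry for each managed job. A toy version of that pattern (not Capone's actual code; the life-cycle steps are invented placeholders) could look like:

```python
# Toy illustration of the "process thread per job" + "archive of managed jobs"
# pattern attributed to Capone above.
import threading

class JobThread(threading.Thread):
    def __init__(self, job):
        super().__init__(daemon=True)
        self.job = job
        self.state = "received"

    def run(self):
        for step in ("translate", "submit", "monitor", "register-output"):
            self.state = step
            self.job.do(step)               # each step talks to the grid middleware
        self.state = "done"

class JobArchive:
    """Keeps every managed job and its thread so status queries can be answered."""
    def __init__(self):
        self.threads = {}

    def manage(self, job_id, job):
        t = JobThread(job)
        self.threads[job_id] = t
        t.start()

    def status(self):
        return [(job_id, t.state) for job_id, t in self.threads.items()]
```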

  24. Capone Architecture (from Marco Mambelli)
  • Message interface: Web Service, Jabber
  • Translation level: Windmill, ADA
  • CPE (Capone Process Engine)
  • Processes: Grid, Stub, Don Quijote

  25. Don Quijote
  • Data management for the ATLAS automatic production system
  • Allows transparent registration and movement of replicas between all Grid "flavors" used by ATLAS (across Grids)
  • Grid3, NorduGrid, LCG
  • (Support for legacy systems might be introduced soon)
  • Avoids creating yet another catalog
  • which Grid middleware wouldn't recognize (e.g. Resource Brokers)
  • Uses the existing "native" catalogs and data management tools
  • Provides a unified interface
  • Accessible as a service
  • Lightweight clients
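
Don Quijote's design point is a thin, unified client over the existing per-grid catalogs rather than yet another catalog. A minimal sketch of that idea (the class and method names are assumptions, not the real DQ client API):

```python
# Sketch of a unified interface over the native replica catalogs of the three
# grid flavours; backend classes and method names are illustrative only.
class ReplicaCatalog:
    """Common interface; one concrete backend per grid flavour."""
    def register(self, lfn, surl): ...
    def lookup(self, lfn): ...

class LCGCatalog(ReplicaCatalog): ...        # would wrap the LCG RLS tools
class NorduGridCatalog(ReplicaCatalog): ...
class Grid3Catalog(ReplicaCatalog): ...

class DonQuijoteClient:
    """Routes each request to the native catalog of the target grid."""
    def __init__(self):
        self.backends = {
            "lcg": LCGCatalog(),
            "nordugrid": NorduGridCatalog(),
            "grid3": Grid3Catalog(),
        }

    def register(self, grid, lfn, surl):
        return self.backends[grid].register(lfn, surl)

    def replicate(self, lfn, src_grid, dst_grid):
        surl = self.backends[src_grid].lookup(lfn)
        # the actual transfer would use the destination grid's own tools
        return self.backends[dst_grid].register(lfn, surl)
```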

  26. Monitoring & Accounting
  • At a (very) early stage in DC2
  • Needs more discussion within ATLAS
  • Metrics being defined
  • Development of a coherent approach
  • Current efforts:
  • Job monitoring "around" the production database
  • Publish on the web, in real time, the relevant data concerning the running of DC2 and event production
  • SQL queries are submitted to the ProdDB hosted at CERN
  • The result is HTML formatted and published on the web
  • A first basic tool is already available as a prototype
  • On LCG: an effort to verify the status of the Grid
  • Two main tasks: site monitoring and job monitoring
  • Based on GridICE, a tool deeply integrated with the current production Grid middleware
  • On Grid3: MonALISA
  • On NorduGrid: the NorduGrid monitor
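
The job-monitoring prototype described above is essentially "SQL in, HTML out". A generic version of that pattern, with invented table and column names and any DB-API connection standing in for the Oracle ProdDB at CERN:

```python
# Generic "SQL query -> HTML table" monitoring page in the spirit of the
# prototype described above; schema and connection are placeholders.
def jobs_per_site_html(conn):
    cur = conn.cursor()
    cur.execute("SELECT site, status, COUNT(*) FROM jobs GROUP BY site, status")
    rows = cur.fetchall()
    html = ["<table border='1'>",
            "<tr><th>Site</th><th>Status</th><th>Jobs</th></tr>"]
    for site, status, count in rows:
        html.append("<tr><td>%s</td><td>%s</td><td>%d</td></tr>"
                    % (site, status, count))
    html.append("</table>")
    return "\n".join(html)
```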

  27. DC2 Status
  • DC2 monitoring: http://www.nordugrid.org/applications/prodsys/index.html

  28. DC2 Production (Phase 1)
  • Will be done as much as possible on the Grid
  • We are ready to use the 3 grid flavours: LCG-2, Grid3+ and NorduGrid
  • All 3 look "stable" (adiabatic evolution)
  • Newcomers:
  • Interface from LCG to Grid Canada
  • UVic, NRC and Alberta accept LCG jobs via the TRIUMF interface CE
  • Keep the possibility to use "legacy" batch systems
  • Data will be stored on the Tier-1s
  • Will use several "catalogs"; DQ will take care of them
  • Current plan:
  • ~20% on Grid3
  • ~20% on NorduGrid
  • ~60% on LCG
  • To be adjusted based on experience

  29. Current Grid3 Status (3/1/04) (http://www.ivdgl.org/grid2003)
  • 28 sites, multi-VO
  • Shared resources
  • ~2000 CPUs
  • Dynamic - sites roll in/out

  30. NorduGrid Resources: Details
  • NorduGrid middleware is deployed in:
  • Sweden (15 sites)
  • Denmark (10 sites)
  • Norway (3 sites)
  • Finland (3 sites)
  • Slovakia (1 site)
  • Estonia (1 site)
  • Sites to join before/during DC2 (preliminary):
  • Norway (1-2 sites)
  • Russia (1-2 sites)
  • Estonia (1-2 sites)
  • Sweden (1-2 sites)
  • Finland (1 site)
  • Germany (1 site)
  • Many of the resources will be available for ATLAS DC2 via the NorduGrid middleware
  • The Nordic countries will coordinate their shares
  • For the others, ATLAS representatives will negotiate the usage

  31. Sites in LCG-2: 4 June 2004
  • 22 countries
  • 58 sites (45 Europe, 2 US, 5 Canada, 5 Asia, 1 HP)
  • Coming: New Zealand, China, other HP (Brazil, Singapore)
  • 3800 CPUs

  32. After DC2: "Continuous Production"
  • We have requests for:
  • Single-particle simulation (a lot)!
  • Already part of DC2 production (calibrations)
  • To be defined:
  • The detector geometry (which layout?)
  • The luminosity, if pile-up is required
  • Others? (e.g. cavern background)
  • Physics samples for the physics workshop studies (June 2005)
  • DC2 uses the ATLAS "Final Layout"
  • It is intended to move to the "Initial Layout"
  • Assuming that the geometry description is ready by the beginning of August, we can foresee an intensive MC production starting ~mid-September
  • Initial thoughts:
  • ~50 million physics events; that means ~10 million events per month from end-September to March 2005
  • Production could be done either by the production team or by the physics groups
  • The production system should be able to support both
  • The Distributed Analysis system will provide an interface for job submission

  33. Conclusion
  • Data Challenges are important for ATLAS software and computing infrastructure readiness
  • Grids will be the default testbed for DC2
  • The U.S. is playing a major role in DC2 planning & production
  • 12 U.S. sites are ready to participate in DC2
  • Major U.S. role in production software development
  • The new grid production system, led by UTA, is being adopted worldwide for grid production in ATLAS
  • Physics analysis will be the emphasis of DC2 - a new experience
  • Stay tuned
