ATLAS DC2 Production on Grid3 • M. Mambelli, University of Chicago, for the US ATLAS DC2 team • September 28, 2004, CHEP04
ATLAS Data Challenges • Purpose • Validate the LHC computing model • Develop distributed production & analysis tools • Provide large datasets for physics working groups • Schedule • DC1 (2002-2003): full software chain • DC2 (2004): automatic grid production system • DC3 (2006): drive final deployments for startup
ATLAS DC2 Production • Phase I: Simulation (Jul-Sep 04) • generation, simulation & pileup • produced datasets stored at Tier1 centers, then at CERN (Tier0) • scale: ~10M events, 30 TB • Phase II: “Tier0 Test” @ CERN (1/10 scale) • Produce ESD, AOD (reconstruction) • Stream to Tier1 centers • Phase III: Distributed analysis (Oct-Dec 04) • access to event and non-event data from anywhere in the world, in both organized and chaotic ways (cf. D. Adams, #115)
ATLAS Production System • Components • Production database • ATLAS job definition and status • Supervisor (all Grids): Windmill (L. Goossens, #501) • Job distribution and verification system • Data Management: Don Quijote (M. Branco, #142) • Provides ATLAS layer above Grid replica systems • Grid Executors • LCG: Lexor (D. Rebatto, #364) • NorduGrid: Dulcinea (O. Smirnova, #499) • Grid3: Capone (this talk)
ATLAS Global Architecture • Diagram: the production database (prodDB at CERN) and the AMI metadata catalog feed the Windmill supervisor, which communicates over SOAP and Jabber with the Grid executors: Lexor (LCG), Dulcinea (NorduGrid), Capone (Grid3, this talk) and a legacy executor (LSF); Don Quijote (“DQ”) provides data management across the RLS catalogs of the three Grids
Capone and Grid3 Requirements • Interface to Grid3 (GriPhyN VDT based) • Manage all steps in the job life cycle • prepare, submit, monitor, output & register • Manage workload and data placement • Process messages from Windmill Supervisor • Provide useful logging information to user • Communicate executor and job state information to Windmill (ProdDB)
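To make the life-cycle requirement above concrete, here is a minimal Python sketch of the states an executor like Capone has to drive a job through; the state names and the helper are illustrative, not the actual Capone implementation.

```python
# Illustrative only: a minimal model of the job life cycle an executor
# must manage (prepare, submit, monitor, output & register).
from enum import Enum

class JobState(Enum):
    RECEIVED = "received"        # definition arrives from the Windmill supervisor
    PREPARED = "prepared"        # inputs resolved, workflow built
    SUBMITTED = "submitted"      # handed to the grid (Condor-G/DAGMan)
    RUNNING = "running"          # executing on a Grid3 site
    STAGED_OUT = "staged_out"    # outputs copied to the destination SE
    REGISTERED = "registered"    # output LFN/PFN recorded in the catalog
    FINISHED = "finished"        # reported back to Windmill for validation
    FAILED = "failed"

NOMINAL_ORDER = [JobState.RECEIVED, JobState.PREPARED, JobState.SUBMITTED,
                 JobState.RUNNING, JobState.STAGED_OUT, JobState.REGISTERED,
                 JobState.FINISHED]

def advance(state: JobState) -> JobState:
    """Step a job forward along the failure-free path; FAILED is terminal."""
    if state not in NOMINAL_ORDER:
        return state
    i = NOMINAL_ORDER.index(state)
    return NOMINAL_ORDER[min(i + 1, len(NOMINAL_ORDER) - 1)]
```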
Capone Execution Environment • GCE Server side • ATLAS releases and transformations • Pacman installation, dynamically by grid-based jobs • Execution sandbox • Chimera kickstart executable • Transformation wrapper scripts • MDS info providers (required site-specific attributes) • GCE Client side (web service) • Capone • Chimera/Pegasus, Condor-G (from VDT) • Globus RLS and DQ clients “GCE” = Grid Component Environment
Capone Architecture • Message interface • Web Service • Jabber • Translation layer • Windmill schema • CPE (Process Engine) • Processes • Grid3: GCE interface • Stub: local shell testing • DonQuijote (future) • Diagram: messages from Windmill and DonQuijote arrive over Web Service and Jabber, pass through the translation layer to the CPE, which drives the Grid, Stub and ADA process modules
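A hedged sketch of that layering in Python; the class and method names are hypothetical and the code is only a conceptual skeleton of message interface, translation layer, CPE and process modules.

```python
# Conceptual skeleton of the layering: message interface -> translation
# layer -> CPE -> process modules. Names are hypothetical.

class TranslationLayer:
    """Maps Windmill-schema messages to internal process-engine requests."""
    def to_internal(self, windmill_msg: dict) -> dict:
        return {"action": windmill_msg.get("verb"),
                "payload": windmill_msg.get("body", {})}

class Grid3Process:
    """GCE interface: would plan and submit the real workflow on Grid3."""
    def handle(self, request: dict) -> str:
        return f"grid3 handled '{request['action']}'"

class StubProcess:
    """Local-shell back end used for testing without a grid."""
    def handle(self, request: dict) -> str:
        return f"stub handled '{request['action']}'"

class CaponeProcessEngine:
    """CPE: routes translated requests to the configured process module."""
    def __init__(self, backend):
        self.translator = TranslationLayer()
        self.backend = backend

    def on_message(self, windmill_msg: dict) -> str:
        # Messages may arrive over the Web Service or Jabber interface;
        # by this point they are assumed to be parsed into a dict.
        return self.backend.handle(self.translator.to_internal(windmill_msg))

# Example; the verb name is illustrative, not the actual Windmill vocabulary.
print(CaponeProcessEngine(StubProcess()).on_message({"verb": "executeJob"}))
```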
Capone System Elements • GriPhyN Virtual Data System (VDS) • Transformation • A workflow accepting input data (datasets) and parameters and producing output data (datasets) • Simple (executable) / Complex (DAG) • Derivation • A transformation whose formal parameters have been bound to actual values • Directed Acyclic Graph (DAG) • Abstract DAG (DAX) created by Chimera, with no reference to concrete elements in the Grid • Concrete DAG (cDAG) created by Pegasus, where CE, SE and PFNs have been assigned • Underlying services: Globus, RLS, Condor
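The VDS terms above can be summarized as plain data structures; the sketch below is a conceptual model of transformation, derivation, DAX and cDAG only, not the Chimera/Pegasus API.

```python
# Conceptual model of the VDS elements; not the Chimera/Pegasus API.
from dataclasses import dataclass, field

@dataclass
class Transformation:
    """A workflow template: executable (or DAG) plus formal parameters."""
    name: str
    formal_params: list

@dataclass
class Derivation:
    """A transformation with its formal parameters bound to actual values."""
    transformation: Transformation
    actual_params: dict

@dataclass
class AbstractDAG:
    """DAX (from Chimera): logical names only, no grid-specific bindings."""
    derivations: list = field(default_factory=list)

@dataclass
class ConcreteDAG:
    """cDAG (from Pegasus): each node bound to a CE, SE and PFNs."""
    bindings: dict = field(default_factory=dict)  # derivation -> (CE, SE, PFNs)
```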
Capone Grid Interactions • Diagram: Capone on the submit host (Chimera/VDC, Pegasus, Condor-G schedd and GridManager) interacts with Windmill and the ProdDB, with DonQuijote and RLS for data management, with monitoring services (MDS, GridCat, MonALISA), and with a Grid3 site’s CE (gatekeeper, worker nodes) and SE via gsiftp
A job in Capone (1, submission) • Reception • Job received from Windmill • Translation • Un-marshalling, ATLAS transformation • DAX generation • Chimera generates abstract DAG • Input file retrieval from RLS catalog • Check RLS for input LFNs (retrieval of GUID, PFN) • Scheduling: CE and SE are chosen • Concrete DAG generation and submission • Pegasus creates Condor submit files • DAGMan invoked to manage remote steps
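The submission sequence reads as a short pipeline. In the sketch below every helper is a stub standing in for the real Chimera, RLS, Pegasus and Condor-G/DAGMan interfaces, so only the ordering of the steps is meaningful.

```python
# Stubs marking the submission steps; none of these calls are real APIs.
def translate(job):                 # un-marshal the Windmill message
    return {"input_lfns": job.get("inputs", [])}
def build_dax(request):             # Chimera: generate the abstract DAG (DAX)
    return {"dax": "abstract-workflow"}
def query_rls(lfns):                # RLS: LFN -> (GUID, PFN) lookup
    return {lfn: ("guid", "pfn") for lfn in lfns}
def choose_site(inputs):            # scheduling: pick a CE and SE
    return ("example-CE", "example-SE")
def plan_with_pegasus(dax, ce, se): # Pegasus: concrete DAG + Condor submit files
    return {"cdag": (dax, ce, se)}
def submit_dagman(cdag):            # DAGMan/Condor-G drives the remote steps
    return "dagman-cluster-id"

def submit_job(windmill_job: dict) -> str:
    request = translate(windmill_job)
    dax = build_dax(request)
    inputs = query_rls(request["input_lfns"])
    ce, se = choose_site(inputs)
    cdag = plan_with_pegasus(dax, ce, se)
    return submit_dagman(cdag)

print(submit_job({"inputs": ["input.pool.root"]}))
```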
A job in Capone (2, execution) • Remote job running / status checking • Stage-in of input files, creation of POOL FileCatalog • Athena (ATLAS code) execution • Remote execution check • Verification of output files and exit codes • Recovery of metadata (GUID, MD5sum, exe attributes) • Stage-out: transfer from CE site to destination SE • Output registration • Registration of the output LFN/PFN and metadata in RLS • Finish • Job completed successfully; Capone tells Windmill the job is ready for validation • Job status is sent to Windmill throughout execution • Windmill/DQ validate & register output in ProdDB
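A sketch of the post-execution check and registration step follows. The MD5 and size recovery mirrors the metadata listed above, while the GUID and the registration function are placeholders (a real job takes the GUID from the POOL file catalog and registers through the Globus RLS client).

```python
# Illustrative post-job check: verify exit code and outputs, recover
# metadata, then register LFN/PFN in the replica catalog (placeholder).
import hashlib
import os
import uuid

def post_job_check(output_files, exit_code: int) -> dict:
    """Verify the remote execution and collect per-file metadata."""
    if exit_code != 0:
        raise RuntimeError(f"Athena exited with code {exit_code}")
    metadata = {}
    for path in output_files:
        if not os.path.exists(path):
            raise RuntimeError(f"missing output file: {path}")
        with open(path, "rb") as f:
            md5 = hashlib.md5(f.read()).hexdigest()
        metadata[path] = {"guid": str(uuid.uuid4()),   # placeholder GUID
                          "md5": md5,
                          "size": os.path.getsize(path)}
    return metadata

def register_output(lfn: str, pfn: str, meta: dict) -> None:
    """Placeholder for the RLS registration of the output LFN/PFN."""
    print(f"register {lfn} -> {pfn} with {meta}")
```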
Performance Summary (9/20/04) • Several physics and calibration samples produced • 56K job attempts at Windmill level • 9K of these aborted before grid submission, mostly with RLS or the selected CE down • “Full” success rate: 66% • Average success rate after submission: 70% • Includes subsequent problems at the submit host • Includes errors from development • 60 CPU-years consumed since July • 8 TB produced
ATLAS DC2 CPU usage (G. Poulard, 9/21/04) • Total ATLAS DC2: ~1470 kSI2k-months • ~100,000 jobs • ~7.94 million events • ~30 TB
ATLAS DC2 ramp-up • Chart: CPU-days per day, ramping up from mid-July to September 10
Job Distribution on Grid3 (J. Shank, 9/21/04) • Chart: distribution of DC2 jobs across Grid3 sites
Site Statistics (9/20/04) • Average success rate by site: 70%
Capone & Grid3 Failure Statistics 9/20/04 • Total jobs (validated) 37713 • Jobs failed 19303 • Submission 472 • Execution 392 • Post-job check 1147 • Stage out 8037 • RLS registration 989 • Capone host interruptions 2725 • Capone succeed, Windmill fail 57 • Other 5139
Production lessons • Single points of failure • Production database • RLS, DQ, VDC and Jabber servers • One local network domain • Distributed RLS • System expertise (people) • Fragmented production software • Fragmented operations (defining/fixing jobs in the production database) • Client (Capone submit) hosts • Load and memory requirements for job management • Load caused by job state checking (interaction with Condor-G) • Many processes • No client-host persistency • Need a local database for job recovery (next phase of development, sketched below) • DOEGrids certificate or certificate revocation list expiration
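A hedged sketch of the local job-recovery database called for above; the SQLite schema and table name are hypothetical, not the eventual Capone design.

```python
# Hypothetical local persistence layer so a restarted submit host can
# recover in-flight jobs instead of losing them on interruption.
import sqlite3

def open_state_db(path: str = "capone_jobs.db") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS jobs (
                      job_id  TEXT PRIMARY KEY,
                      state   TEXT NOT NULL,
                      updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP)""")
    return db

def checkpoint(db: sqlite3.Connection, job_id: str, state: str) -> None:
    """Record the latest state after every transition."""
    db.execute("INSERT OR REPLACE INTO jobs (job_id, state) VALUES (?, ?)",
               (job_id, state))
    db.commit()

def recover(db: sqlite3.Connection):
    """List jobs that were still in flight when the host went down."""
    return db.execute("SELECT job_id, state FROM jobs "
                      "WHERE state NOT IN ('finished', 'failed')").fetchall()
```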
Production lessons (II) • Site infrastructure problems • Hardware problems • Software distribution, transformation upgrades • File systems (NFS the major culprit); various solutions by site administrators • Errors in stage-out caused by poor network connections and gatekeeper load • Fixed by adding I/O throttling and checking the number of TCP connections (see the sketch below) • Lack of storage management (e.g. SRM) on sites means submitters do some cleanup remotely; not a major problem so far, but we have not had much competition • Load on gatekeepers • Improved by moving md5sum off the gatekeeper • Post-job processing • Remote execution (mostly in pre/post-job steps) is error-prone • Reason for a failure often difficult to understand • No automated tools for validation
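As an illustration of the I/O throttling fix mentioned above, a small Python sketch that caps concurrent stage-out transfers; the limit and the copy command are placeholders (the real transfers go over gsiftp).

```python
# Illustrative throttle: allow at most N simultaneous stage-out transfers
# so the gatekeeper and network are not overloaded.
import subprocess
import threading

MAX_CONCURRENT_TRANSFERS = 4                     # placeholder limit
_slots = threading.Semaphore(MAX_CONCURRENT_TRANSFERS)

def throttled_stage_out(src: str, dest: str) -> int:
    """Copy one output file, waiting for a free transfer slot first."""
    with _slots:                                 # blocks when all slots are busy
        # Placeholder command; the production transfers use gsiftp clients
        # with retries rather than a plain local copy.
        return subprocess.call(["cp", src, dest])
```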
Operations Lessons • Grid3 iGOC and the US Tier1 developed an operations response model • Tier1 center • core services • “on-call” person always available • response protocol developed • iGOC • Coordinates problem resolution for the Tier1 during “off hours” • Trouble handling for non-ATLAS Grid3 sites; problems resolved at weekly iVDGL operations meetings • Shift schedule (8-midnight since July 23) • 7 trained DC2 submitters • Keeps queues saturated, reports site and system problems, cleans working directories • Extensive use of email lists • Partial use of alternatives like Web portals, IM
Conclusions • Completely new system • Grid3 simplicity requires more functionality and state management on the executor submit host • All functions of job planning, job state tracking, and data management (stage-in, out) managed by Capone rather than grid systems • clients exposed to all manner of grid failures • good for experience, but a client-heavy system • Major areas for upgrade to Capone system • Job state management and controls, state persistency • Generic transformation handling for user-level production
Authors • GIERALTOWSKI, Gerald (Argonne National Laboratory) • MAY, Edward (Argonne National Laboratory) • VANIACHINE, Alexandre (Argonne National Laboratory) • SHANK, Jim (Boston University) • YOUSSEF, Saul (Boston University) • BAKER, Richard (Brookhaven National Laboratory) • DENG, Wensheng (Brookhaven National Laboratory) • NEVSKI, Pavel (Brookhaven National Laboratory) • MAMBELLI, Marco (University of Chicago) • GARDNER, Robert (University of Chicago) • SMIRNOV, Yuri (University of Chicago) • ZHAO, Xin (University of Chicago) • LUEHRING, Frederick (Indiana University) • SEVERINI, Horst (Oklahoma University) • DE, Kaushik (University of Texas at Arlington) • MCGUIGAN, Patrick (University of Texas at Arlington) • OZTURK, Nurcan (University of Texas at Arlington) • SOSEBEE, Mark (University of Texas at Arlington)
Acknowledgements • Windmill team (Kaushik De) • Don Quijote team (Miguel Branco) • ATLAS production group, Luc Goossens, CERN IT (prodDB) • ATLAS software distribution team (Alessandro de Salvo, Fred Luehring) • US ATLAS testbed sites and Grid3 site administrators • iGOC operations group • ATLAS Database group (ProdDB Capone-view displays) • Physics Validation group: UC Berkeley, Brookhaven Lab • More info • US ATLAS Grid http://www.usatlas.bnl.gov/computing/grid/ • DC2 shift procedures http://grid.uchicago.edu/dc2shift • US ATLAS Grid Tools & Services http://grid.uchicago.edu/gts/