
The Capone Workflow Manager


Presentation Transcript


1. The Capone Workflow Manager
M. Mambelli, University of Chicago
R. Gardner, University of Chicago
J. Gieraltowski, Argonne National Laboratory
14th February 2006, CHEP06, Mumbai, India

2. Capone
• Workflow manager for Grid3 and OSG
• Designed for ATLAS (managed and user) production
• Used for testbed (Grid3 and OSG sites) troubleshooting and testing
• Used as a platform to test/integrate experimental technologies (PXE) in the Grid environment
• Uses GriPhyN VDT (Globus, Condor, VDS) as grid middleware
• Easy installation and support (released as a Pacman package)
• No longer used for official ATLAS production

3. Past ATLAS Production with Capone
• DC2
  • Phase I: Simulation (Jul-Sep 04)
    • generation, simulation & pile-up
    • produced datasets stored at Tier1 centers, then CERN (Tier0)
    • scale: ~10M events, 30 TB
  • Phase II: "Tier0 Test" at CERN (1/10 scale)
    • produce ESD, AOD (reconstruction)
    • stream to Tier1 centers
  • Phase III: Distributed analysis (Oct-Dec 04)
    • access to event and non-event data from anywhere in the world, both in organized and chaotic ways
• Rome
  • Jan-May 2005
  • full-chain Monte Carlo production
  • user production
• User production and testing

4. ATLAS Global Architecture
[Architecture diagram: the prodDB (CERN), the AMI metadata catalog and the Don Quijote "DQ" data management system sit above the Windmill (or Eowyn) supervisors, which communicate via SOAP and Jabber/py with the grid executors: Capone (Grid3 executor, the subject of this talk), Dulcinea (NorduGrid executor), Lexor and Lexor-CG (LCG executors) and a legacy executor for LSF, each grid with its own RLS.]

5. Capone and Grid Requirements
• Interface to Grid3/OSG (GriPhyN VDT based)
• Manage all steps in the job life cycle: prepare, submit, monitor, output & register (a state-machine sketch follows this slide)
• Manage workload and data placement
• Process messages from the Windmill supervisor
• Provide useful logging information to the user
• Communicate executor and job state information to Windmill (ProdDB)
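The job life cycle above maps naturally onto a small state machine with one state per step. The sketch below is illustrative only: it is not Capone's code, and all names are invented.

    from enum import Enum

    class JobState(Enum):
        # One state per life-cycle step listed on the slide.
        PREPARED = "prepared"
        SUBMITTED = "submitted"
        RUNNING = "running"              # covered by the "monitor" step
        OUTPUT_STAGED = "output_staged"
        REGISTERED = "registered"
        FAILED = "failed"

    # Legal transitions: each step may only follow the previous one (or fail).
    TRANSITIONS = {
        JobState.PREPARED: {JobState.SUBMITTED, JobState.FAILED},
        JobState.SUBMITTED: {JobState.RUNNING, JobState.FAILED},
        JobState.RUNNING: {JobState.OUTPUT_STAGED, JobState.FAILED},
        JobState.OUTPUT_STAGED: {JobState.REGISTERED, JobState.FAILED},
    }

    def advance(current, target):
        """Move a job to a new state, rejecting out-of-order transitions,
        so the state reported back to the supervisor stays consistent."""
        if target not in TRANSITIONS.get(current, set()):
            raise ValueError(f"illegal transition {current.value} -> {target.value}")
        return target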

6. Capone Architecture
[Diagram: message protocols (Web Service, Jabber) connect Windmill and the user to Capone; messages pass through a translation layer to the CPE, which drives the Grid, Stub and Don Quijote processes; the GCE Server sits on the remote-site side.]
• Message interface
  • Web Service
  • Jabber
• Translation layer
  • Windmill schema
• CPE (Process Engine; a dispatch sketch follows this slide)
  • Processes
    • Grid3/OSG: GCE interface
    • Stub: local shell testing
    • Don Quijote (future)
• Server side: GCE Server
  • ATLAS releases and TRFs
  • Execution sandbox (kickstart)
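A rough sketch of how such a layered design fits together: a supervisor message arrives over the Web Service or Jabber interface, the translation layer maps it from the Windmill schema to an internal job description, and the process engine hands it to a backend process (Grid3/OSG GCE, local stub, ...). Class and field names below are assumptions for illustration, not Capone's actual API.

    class StubProcess:
        """Local-shell backend used for testing, standing in for the GCE interface."""
        def execute(self, job):
            print(f"would run transformation {job['trf']} for job {job['id']} locally")

    class ProcessEngine:
        """Dispatches translated supervisor requests to the configured backend process."""
        def __init__(self, backends):
            self.backends = backends                 # e.g. {"grid3": ..., "stub": ...}

        def handle(self, message):
            # Translation layer: map the supervisor schema onto an internal job dict.
            job = {"id": message["jobname"], "trf": message["transformation"]}
            self.backends[message.get("target", "stub")].execute(job)

    engine = ProcessEngine({"stub": StubProcess()})
    engine.handle({"jobname": "evgen.000001", "transformation": "csc_evgen_trf", "target": "stub"})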

7. Capone Grid Interactions
[Diagram: Capone on the submit host interacts with Windmill and the ProdDB, the Chimera Virtual Data Catalog (VDC), Pegasus, Condor-G (schedd and GridManager) and Don Quijote; monitoring information comes from RLS, MDS, GridCat and MonALISA; on the remote site, the gatekeeper (CE), worker nodes (WN) and storage element (SE) are reached, with data moved via gsiftp.]

8. Performance Summary of DC2 (Dec 04)
• Several physics and calibration samples produced
• 91K job attempts at the Windmill level
• 9K of these aborted before grid submission
  • mostly RLS down or the selected CE down
• "Full" success rate: 63%
• Average success rate after submission: 70% (consistency check below)
  • includes subsequent problems at the submit host
  • includes errors from development
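The two rates are mutually consistent: 63% of all 91K attempts is roughly 70% of the ~82K jobs that actually reached the grid. A quick check with the rounded figures from the slide:

    attempts = 91_000                        # job attempts at the Windmill level
    aborted = 9_000                          # aborted before grid submission
    submitted = attempts - aborted           # ~82K jobs reached the grid

    successes = 0.63 * attempts              # "full" success rate over all attempts
    print(f"{successes / submitted:.0%}")    # ~70%: the quoted post-submission rate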

9. Performance Summary of Rome (5/2005)
• Several physics and calibration samples produced
• 253K job attempts at the Windmill level
• "Full" success rate: 73%
  • includes subsequent problems at the submit host
  • includes errors from development
• Scalability is a problem for short jobs
  • submission rate
  • handling of many small jobs
• Data movement is also problematic

10. Capone Failure Statistics
• Submission: 2.4%
• Execution: 2.0%
• Post-job check: 5.9%
• Stage out: 41.6%
• RLS registration: 5.1%
• Capone host interruptions: 14.1%
• Capone succeeded, Windmill failed: 0.3%
• Other: 26.6%

11. Production lessons
• Single points of failure
  • prodsys or grid components
  • system expertise (people)
  • fragmented production software
• Client (Capone submit) hosts
  • load and memory requirements for job management
  • load caused by job state checking (interaction with Condor-G)
  • many processes (VDT DAGMan processes)
• No client host persistency
  • need a local database for job recovery (a checkpointing sketch follows this slide)
• Not enough tools for testing
• Certificate problems (expiration, CRL expiration)
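The slides do not show how the local recovery database was eventually implemented; the snippet below only sketches the idea, checkpointing job state into a local SQLite file so that in-flight jobs can be found again after a submit-host restart (the schema and function names are assumptions).

    import sqlite3, time

    conn = sqlite3.connect("capone_jobs.db")   # hypothetical local job store
    conn.execute("""CREATE TABLE IF NOT EXISTS jobs (
                        job_id    TEXT PRIMARY KEY,
                        state     TEXT NOT NULL,   -- e.g. submitted, running, stage_out
                        condor_id TEXT,            -- Condor-G id, once submitted
                        updated   REAL)""")

    def checkpoint(job_id, state, condor_id=None):
        """Record the latest known state so a restarted manager can resume the job."""
        conn.execute("INSERT OR REPLACE INTO jobs VALUES (?, ?, ?, ?)",
                     (job_id, state, condor_id, time.time()))
        conn.commit()

    def recover():
        """Return the jobs that were still in flight when the host went down."""
        return conn.execute("SELECT job_id, state, condor_id FROM jobs "
                            "WHERE state NOT IN ('registered', 'failed')").fetchall()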

12. Improvements
• DAG batching in Condor-G
  • scales better by reducing the load on the submit host
• Multiple-stage, multithreaded (persistent) servers
  • overcome the Python thread limitation
  • maintain server redundancy
• Recoverability
  • checkpointing, to recover from Capone or submit-host failures
  • rollback
  • recovery procedures
• Workarounds (retries, …) for Grid problems (a retry sketch follows this slide)
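One generic form of workaround for transient grid problems is a retry with backoff around the failing operation, for example stage-out or RLS registration, the two largest failure categories on slide 10. This is a sketch of the pattern, not the recovery procedure Capone actually used.

    import random, time

    def with_retries(operation, attempts=3, base_delay=30):
        """Run a flaky grid operation, retrying with exponential backoff and jitter."""
        for attempt in range(1, attempts + 1):
            try:
                return operation()
            except Exception as err:          # in practice, catch the specific grid error
                if attempt == attempts:
                    raise                      # give up; the job is then marked failed
                delay = base_delay * 2 ** (attempt - 1) * random.uniform(0.5, 1.5)
                print(f"attempt {attempt} failed ({err}); retrying in {delay:.0f}s")
                time.sleep(delay)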

13. Performance and Scalability Tests
• Submit host
  • dual CPU, 1.3 GHz Xeon, 1 GB RAM
• Job mix
  • event generation
  • generic CPU usage (900 s, 30 min); a synthetic-payload sketch follows this slide
  • file I/O
• Testbed
  • 9 OSG sites (UTA_dpcc, UC_Teraport_OSG_ITB, BU_ATLAS_Tier2, BNL_ATLAS, UC_ATLAS_Tier2, PSU_Grid3, IU_ATLAS_Tier2, OUHEP, SMU_Physics_Cluster)
• Tests
  • multiple tests, repetition and sustained rate
  • job submission
  • job recovery (system crash, DNS problem)
  • sustained submission, overload
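The "generic CPU usage" entries in the job mix are synthetic payloads that simply keep a worker node busy for a fixed time (900 s or 30 min). A trivial stand-in for such a payload, assumed rather than taken from the talk:

    import sys, time

    def burn_cpu(seconds):
        """Keep one core busy for roughly `seconds` seconds (synthetic test payload)."""
        end = time.time() + seconds
        x = 0.0
        while time.time() < end:
            x = (x * x + 1.0) % 1e9          # meaningless arithmetic, just to load the CPU
        return x

    if __name__ == "__main__":
        # e.g. "python burn.py 900" or "python burn.py 1800"
        burn_cpu(float(sys.argv[1]) if len(sys.argv) > 1 else 900)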

14. Test results
• Results (avg/min/max)
  • submission rate to Capone (jobs/min): 541 / 281 / 1132
  • submission rate to the Grid (jobs/min): 18 / 15 / 48
• Number of jobs handled by Capone, as reported by ./capone status summary (avg/min/max)
  • running jobs: 4019 / 0 / 6746
  • total jobs: 7176 / 0 / 8100
• The number of jobs visible in Condor-G is lower
  • it covers only part of the execution
  • only remotely running/queued jobs appear

15. Development and support practices
• 2-developer team
• Pacman packaging and easy updates (one-line installation or update)
• 2 releases/branches, starting with Capone 1.0.x/1.1.x
  • a stable branch for production (bug fixes only)
  • a development branch (new features)
• iGOC
  • redirection of Capone problem reports
  • collaboration in site troubleshooting; problems resolved at weekly iVDGL operations meetings
• Use of community tools
  • Savannah portal (CVS, Bugzilla, file repository)
  • TWiki (documentation)
  • mailing lists and IM for communication and troubleshooting

16. Conclusions
• More flexible execution model
  • possibility to execute TRFs using shared or local disk areas
  • no need for preinstalled transformations (they can be staged in with the job)
• Improved performance
  • job checkpointing and recoverability from submit-host failures
  • maximum number of jobs no longer limited by the maximum number of Python threads
  • recovery actions for some Grid errors
  • higher submission rate for clients
    • the submission rate to the Grid could be higher, but there were always queued jobs
• Feasibility of development and support by a small team
  • production and development versions
  • extended documentation
  • production and user support and troubleshooting

17. Acknowledgements
• Windmill team (Kaushik De)
• Don Quijote team (Miguel Branco)
• ATLAS production group, Luc Goossens, CERN IT (prodDB)
• ATLAS software distribution team (Alessandro De Salvo, Fred Luehring)
• US ATLAS testbed sites and Grid3/OSG site administrators
• iGOC operations group
• ATLAS Database group (ProdDB Capone-view displays)
• Physics Validation group: UC Berkeley, Brookhaven Lab
• More info
  • TWiki: https://uimon.cern.ch/twiki/bin/view/Atlas/Capone
  • Savannah portal: http://griddev.uchicago.edu/savannah/projects/atgce/
