
The Capone Workflow Manager


Presentation Transcript


1. The Capone Workflow Manager
M. Mambelli, University of Chicago
R. Gardner, University of Chicago
J. Gieraltowski, Argonne National Laboratory
14th February 2006, CHEP06, Mumbai, India

2. Capone
• Workflow manager for Grid3 and OSG
• Designed for ATLAS (managed and user) production
• Used for testbed (Grid3 and OSG sites) troubleshooting and testing
• Used as a platform to test/integrate experimental technologies (PXE) in the Grid environment
• Uses GriPhyN VDT (Globus, Condor, VDS) as grid middleware
• Easy installation and support (released as a Pacman package)
• No longer used for official ATLAS production

3. Past ATLAS Production with Capone
• DC2
  • Phase I: Simulation (Jul-Sep 04)
    • generation, simulation & pile-up
    • produced datasets stored at Tier1 centers, then CERN (Tier0)
    • scale: ~10M events, 30 TB
  • Phase II: "Tier0 Test" at CERN (1/10 scale)
    • produce ESD, AOD (reconstruction)
    • stream to Tier1 centers
  • Phase III: Distributed analysis (Oct-Dec 04)
    • access to event and non-event data from anywhere in the world, both in organized and chaotic ways
• Rome
  • Jan-May 2005
  • full-chain Monte Carlo production
  • user production
• User production and testing

4. ATLAS Global Architecture
[Architecture diagram: the prodDB (CERN), the AMI metadata catalog and the Don Quijote "DQ" data management system sit above the Windmill (or Eowyn) supervisors, which communicate via SOAP and Jabber/py with the grid executors: Capone (Grid3 executor, the subject of this talk), Dulcinea (NorduGrid executor), Lexor and Lexor-CG (LCG executors) and a legacy executor for LSF, each grid with its own RLS.]

5. Capone and Grid Requirements
• Interface to Grid3/OSG (GriPhyN VDT based)
• Manage all steps in the job life cycle: prepare, submit, monitor, output & register (a state-machine sketch follows this slide)
• Manage workload and data placement
• Process messages from the Windmill supervisor
• Provide useful logging information to the user
• Communicate executor and job state information to Windmill (ProdDB)
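The job life cycle above maps naturally onto a small state machine with one state per step. The sketch below is illustrative only: it is not Capone's code, and all names are invented.

    from enum import Enum

    class JobState(Enum):
        # One state per life-cycle step listed on the slide.
        PREPARED = "prepared"
        SUBMITTED = "submitted"
        RUNNING = "running"              # covered by the "monitor" step
        OUTPUT_STAGED = "output_staged"
        REGISTERED = "registered"
        FAILED = "failed"

    # Legal transitions: each step may only follow the previous one (or fail).
    TRANSITIONS = {
        JobState.PREPARED: {JobState.SUBMITTED, JobState.FAILED},
        JobState.SUBMITTED: {JobState.RUNNING, JobState.FAILED},
        JobState.RUNNING: {JobState.OUTPUT_STAGED, JobState.FAILED},
        JobState.OUTPUT_STAGED: {JobState.REGISTERED, JobState.FAILED},
    }

    def advance(current, target):
        """Move a job to a new state, rejecting out-of-order transitions,
        so the state reported back to the supervisor stays consistent."""
        if target not in TRANSITIONS.get(current, set()):
            raise ValueError(f"illegal transition {current.value} -> {target.value}")
        return target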

6. Capone Architecture
[Diagram: message protocols (Web Service, Jabber) connect Windmill and the user to Capone; messages pass through a translation layer to the CPE, which drives the Grid, Stub and Don Quijote processes; the GCE Server sits on the remote-site side.]
• Message interface
  • Web Service
  • Jabber
• Translation layer
  • Windmill schema
• CPE (Process Engine; a dispatch sketch follows this slide)
  • Processes
    • Grid3/OSG: GCE interface
    • Stub: local shell testing
    • Don Quijote (future)
• Server side: GCE Server
  • ATLAS releases and TRFs
  • Execution sandbox (kickstart)
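A rough sketch of how such a layered design fits together: a supervisor message arrives over the Web Service or Jabber interface, the translation layer maps it from the Windmill schema to an internal job description, and the process engine hands it to a backend process (Grid3/OSG GCE, local stub, ...). Class and field names below are assumptions for illustration, not Capone's actual API.

    class StubProcess:
        """Local-shell backend used for testing, standing in for the GCE interface."""
        def execute(self, job):
            print(f"would run transformation {job['trf']} for job {job['id']} locally")

    class ProcessEngine:
        """Dispatches translated supervisor requests to the configured backend process."""
        def __init__(self, backends):
            self.backends = backends                 # e.g. {"grid3": ..., "stub": ...}

        def handle(self, message):
            # Translation layer: map the supervisor schema onto an internal job dict.
            job = {"id": message["jobname"], "trf": message["transformation"]}
            self.backends[message.get("target", "stub")].execute(job)

    engine = ProcessEngine({"stub": StubProcess()})
    engine.handle({"jobname": "evgen.000001", "transformation": "csc_evgen_trf", "target": "stub"})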

7. Capone Grid Interactions
[Diagram: Capone on the submit host interacts with Windmill and the ProdDB, the Chimera Virtual Data Catalog (VDC), Pegasus, Condor-G (schedd and GridManager) and Don Quijote; monitoring information comes from RLS, MDS, GridCat and MonALISA; on the remote site, the gatekeeper (CE), worker nodes (WN) and storage element (SE) are reached, with data moved via gsiftp.]

8. Performance Summary of DC2 (Dec 04)
• Several physics and calibration samples produced
• 91K job attempts at the Windmill level
• 9K of these aborted before grid submission
  • mostly RLS down or the selected CE down
• "Full" success rate: 63%
• Average success rate after submission: 70% (consistency check below)
  • includes subsequent problems at the submit host
  • includes errors from development
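The two rates are mutually consistent: 63% of all 91K attempts is roughly 70% of the ~82K jobs that actually reached the grid. A quick check with the rounded figures from the slide:

    attempts = 91_000                        # job attempts at the Windmill level
    aborted = 9_000                          # aborted before grid submission
    submitted = attempts - aborted           # ~82K jobs reached the grid

    successes = 0.63 * attempts              # "full" success rate over all attempts
    print(f"{successes / submitted:.0%}")    # ~70%: the quoted post-submission rate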

9. Performance Summary of Rome (5/2005)
• Several physics and calibration samples produced
• 253K job attempts at the Windmill level
• "Full" success rate: 73%
  • includes subsequent problems at the submit host
  • includes errors from development
• Scalability is a problem for short jobs
  • submission rate
  • handling of many small jobs
• Data movement is also problematic

10. Capone Failure Statistics
• Submission: 2.4%
• Execution: 2.0%
• Post-job check: 5.9%
• Stage out: 41.6%
• RLS registration: 5.1%
• Capone host interruptions: 14.1%
• Capone succeeded, Windmill failed: 0.3%
• Other: 26.6%

11. Production lessons
• Single points of failure
  • prodsys or grid components
  • system expertise (people)
  • fragmented production software
• Client (Capone submit) hosts
  • load and memory requirements for job management
  • load caused by job state checking (interaction with Condor-G)
  • many processes (VDT DAGMan processes)
• No client host persistency
  • need a local database for job recovery (a checkpointing sketch follows this slide)
• Not enough tools for testing
• Certificate problems (expiration, CRL expiration)
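The slides do not show how the local recovery database was eventually implemented; the snippet below only sketches the idea, checkpointing job state into a local SQLite file so that in-flight jobs can be found again after a submit-host restart (the schema and function names are assumptions).

    import sqlite3, time

    conn = sqlite3.connect("capone_jobs.db")   # hypothetical local job store
    conn.execute("""CREATE TABLE IF NOT EXISTS jobs (
                        job_id    TEXT PRIMARY KEY,
                        state     TEXT NOT NULL,   -- e.g. submitted, running, stage_out
                        condor_id TEXT,            -- Condor-G id, once submitted
                        updated   REAL)""")

    def checkpoint(job_id, state, condor_id=None):
        """Record the latest known state so a restarted manager can resume the job."""
        conn.execute("INSERT OR REPLACE INTO jobs VALUES (?, ?, ?, ?)",
                     (job_id, state, condor_id, time.time()))
        conn.commit()

    def recover():
        """Return the jobs that were still in flight when the host went down."""
        return conn.execute("SELECT job_id, state, condor_id FROM jobs "
                            "WHERE state NOT IN ('registered', 'failed')").fetchall()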

12. Improvements
• DAG batching in Condor-G
  • scales better by reducing the load on the submit host
• Multiple-stage, multithreaded (persistent) servers
  • overcome the Python thread limitation
  • maintain server redundancy
• Recoverability
  • checkpointing, to recover from Capone or submit-host failures
  • rollback
  • recovery procedures
• Workarounds (retries, …) for Grid problems (a retry sketch follows this slide)
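One generic form of workaround for transient grid problems is a retry with backoff around the failing operation, for example stage-out or RLS registration, the two largest failure categories on slide 10. This is a sketch of the pattern, not the recovery procedure Capone actually used.

    import random, time

    def with_retries(operation, attempts=3, base_delay=30):
        """Run a flaky grid operation, retrying with exponential backoff and jitter."""
        for attempt in range(1, attempts + 1):
            try:
                return operation()
            except Exception as err:          # in practice, catch the specific grid error
                if attempt == attempts:
                    raise                      # give up; the job is then marked failed
                delay = base_delay * 2 ** (attempt - 1) * random.uniform(0.5, 1.5)
                print(f"attempt {attempt} failed ({err}); retrying in {delay:.0f}s")
                time.sleep(delay)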

13. Performance and Scalability Tests
• Submit host
  • dual CPU, 1.3 GHz Xeon, 1 GB RAM
• Job mix
  • event generation
  • generic CPU usage (900 s, 30 min); a synthetic-payload sketch follows this slide
  • file I/O
• Testbed
  • 9 OSG sites (UTA_dpcc, UC_Teraport_OSG_ITB, BU_ATLAS_Tier2, BNL_ATLAS, UC_ATLAS_Tier2, PSU_Grid3, IU_ATLAS_Tier2, OUHEP, SMU_Physics_Cluster)
• Tests
  • multiple tests, repetition and sustained rate
  • job submission
  • job recovery (system crash, DNS problem)
  • sustained submission, overload
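The "generic CPU usage" entries in the job mix are synthetic payloads that simply keep a worker node busy for a fixed time (900 s or 30 min). A trivial stand-in for such a payload, assumed rather than taken from the talk:

    import sys, time

    def burn_cpu(seconds):
        """Keep one core busy for roughly `seconds` seconds (synthetic test payload)."""
        end = time.time() + seconds
        x = 0.0
        while time.time() < end:
            x = (x * x + 1.0) % 1e9          # meaningless arithmetic, just to load the CPU
        return x

    if __name__ == "__main__":
        # e.g. "python burn.py 900" or "python burn.py 1800"
        burn_cpu(float(sys.argv[1]) if len(sys.argv) > 1 else 900)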

14. Test results
• Results (avg/min/max)
  • submission rate to Capone (jobs/min): 541 / 281 / 1132
  • submission rate to the Grid (jobs/min): 18 / 15 / 48
• Number of jobs handled by Capone, as reported by ./capone status summary (avg/min/max)
  • running jobs: 4019 / 0 / 6746
  • total jobs: 7176 / 0 / 8100
• The number of jobs visible in Condor-G is lower
  • it covers only part of the execution
  • only remotely running/queued jobs appear

15. Development and support practices
• 2-developer team
• Pacman packaging and easy updates (one-line installation or update)
• 2 releases/branches, starting with Capone 1.0.x/1.1.x
  • a stable branch for production (bug fixes only)
  • a development branch (new features)
• iGOC
  • redirection of Capone problem reports
  • collaboration in site troubleshooting; problems resolved at weekly iVDGL operations meetings
• Use of community tools
  • Savannah portal (CVS, Bugzilla, file repository)
  • TWiki (documentation)
  • mailing lists and IM for communication and troubleshooting

16. Conclusions
• More flexible execution model
  • possibility to execute TRFs using shared or local disk areas
  • no need for preinstalled transformations (they can be staged in with the job)
• Improved performance
  • job checkpointing and recoverability from submit-host failures
  • maximum number of jobs no longer limited by the maximum number of Python threads
  • recovery actions for some Grid errors
  • higher submission rate for clients
    • the submission rate to the Grid could be higher, but there were always queued jobs
• Feasibility of development and support by a small team
  • production and development versions
  • extended documentation
  • production and user support and troubleshooting

17. Acknowledgements
• Windmill team (Kaushik De)
• Don Quijote team (Miguel Branco)
• ATLAS production group, Luc Goossens, CERN IT (prodDB)
• ATLAS software distribution team (Alessandro De Salvo, Fred Luehring)
• US ATLAS testbed sites and Grid3/OSG site administrators
• iGOC operations group
• ATLAS Database group (ProdDB Capone-view displays)
• Physics Validation group: UC Berkeley, Brookhaven Lab
• More info
  • TWiki: https://uimon.cern.ch/twiki/bin/view/Atlas/Capone
  • Savannah portal: http://griddev.uchicago.edu/savannah/projects/atgce/
