DAGMan • Directed Acyclic Graph Manager • DAGMan allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you. • (e.g., “Don’t run job B until job A has completed successfully.”)
Why is This Important? • Most real science involves complex sequences of tasks – on many resources at many sites. • E.g., move data, compute, check, move back, etc. • … and many types of jobs working together • Condor, Grid (Condor-G), MPI, shell scripts, etc. • Failures are a certainty, so recoverability of the sequence – not just the jobs – is crucial.
What is a DAG? • A DAG is the data structure used by DAGMan to represent these dependencies. • Each job is a “node” in the DAG. • Each node can have any number of “parent” or “child” nodes – as long as there are no cycles! [diagram: diamond DAG – Job A at the top, Jobs B and C in the middle, Job D at the bottom]
Defining a DAG • A DAG is defined by a .dag file, listing each of its nodes and their dependencies: [diagram: diamond DAG with Jobs A, B, C, D]
    # diamond.dag
    Job A a.sub
    Job B b.sub
    Job C c.sub
    Job D d.sub
    Parent A Child B C
    Parent B C Child D
• Each node will run the Condor or Grid job specified by its accompanying Condor submit file.
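A minimal submit file for node A might look like the following sketch (the executable and file names are assumptions; note that DAGMan watches the jobs’ user log file to track their progress):

```
# a.sub -- hypothetical submit file for node A
universe   = vanilla
executable = a.out
output     = a.out.txt
error      = a.err.txt
log        = diamond.log
queue
```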
Submitting a DAG • To start your DAG, just run condor_submit_dag with your .dag file, and Condor will start a personal DAGMan daemon to begin running your jobs:
    % condor_submit_dag diamond.dag
• condor_submit_dag submits a Scheduler Universe job to run DAGMan under Condor, so DAGMan itself is robust against failures, machine reboots, etc.
Running a DAG • DAGMan acts as a “meta-scheduler”, managing the submission of your jobs to Condor based on the DAG dependencies. [diagram: DAGMan reads the .dag file and submits node A to the Condor job queue; B, C, and D wait]
Running a DAG (cont’d) • DAGMan holds & submits jobs to the Condor queue at the appropriate times. [diagram: A has completed; B and C are in the Condor job queue; D is still held by DAGMan]
Running a DAG (cont’d) • In case of a job failure, DAGMan continues until it can no longer make progress, and then creates a “rescue” file with the current state of the DAG. [diagram: one of the submitted nodes has failed (X); DAGMan writes a rescue file; D cannot run]
Recovering a DAG • Once the failed job is ready to be re-run, the rescue file can be used to restore the prior state of the DAG. [diagram: DAGMan reads the rescue file and resubmits the failed node; D still waits]
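The rescue file is itself a DAG file; as a sketch (the exact file name and notation are assumptions), nodes that already completed are simply marked DONE so DAGMan skips them when the DAG is resubmitted:

```
# rescue file for diamond.dag (sketch)
Job A a.sub DONE
Job B b.sub DONE
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D
```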
Finishing a DAG • Once the DAG is complete, the DAGMan job itself is finished, and exits. [diagram: all four nodes (A, B, C, D) have completed; the Condor job queue is empty]
Additional DAGMan Features • Provides other handy knobs for job management… • nodes can have PRE & POST scripts • job submission can be “throttled” • NEW: failed nodes can be automatically re-tried a configurable number of times
PRE & POST Scripts • Scripts execute locally on the submit host, before a node’s job is submitted (PRE) or after it completes (POST)… [diagram: PRE script attached to Job A; POST script attached to Job D] • Example:
    # diamond.dag
    PRE A prepare-A.sh
    Job A a.sub
    Job B b.sub
    Job C c.sub
    Job D d.sub
    POST D double-check.sh
    Parent A Child B C
    Parent B C Child D
• PRE/POST scripts are considered part of the node.
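As an illustration, a POST script like the hypothetical double-check.sh above might validate a node’s output. This is only a sketch (the file names and the check are assumptions), relying on the fact that the script’s exit status tells DAGMan whether the node succeeded:

```shell
#!/bin/sh
# double-check.sh -- hypothetical POST script sketch for node D.
# A nonzero exit status from a POST script marks the node as failed.
check_output() {
    # succeed only if the node's output file exists and is non-empty
    [ -s "$1" ]
}

# demo: a non-empty file passes the check, an empty one fails it
echo result > full.out
: > empty.out
check_output full.out  && echo "full.out ok"
check_output empty.out || echo "empty.out would fail the node"
rm -f full.out empty.out
```

In the DAG above, DAGMan would run double-check.sh after node D’s job finishes and use its exit status as D’s success or failure.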
DAG “Throttling” • You can tell DAGMan to limit the maximum number of jobs it submits at any one time: condor_submit_dag -maxjobs N • useful for managing resource limitations (e.g., licenses) • You can also limit the number of simultaneous PRE or POST scripts. • Added after Vladimir Litvin’s 7000-node DAG started 7000 PRE scripts on his machine!
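For example, throttling the diamond DAG to at most 50 queued jobs would look like the first line below; the flag names in the second line for throttling PRE/POST scripts are an assumption, not confirmed by this talk:

```
% condor_submit_dag -maxjobs 50 diamond.dag
% condor_submit_dag -maxpre 10 -maxpost 10 diamond.dag
```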
Node RETRY • Tells DAGMan to re-run a node multiple times if necessary… [diagram: diamond DAG with Jobs A, B, C, D] • Example:
    # diamond.dag
    Job A a.sub
    Job B b.sub
    RETRY B 5
    Job C c.sub
    RETRY C 5
    Job D d.sub
    Parent A Child B C
    Parent B C Child D
DAGMan Progress • Testing… lots of testing. • 10,000+ node DAGs run smoothly • Developed automated DAG testing tools to generate random DAGs and test for correct execution (Ning Lin & Will McDonald) • Lots of bugs fixed
DAGMan Progress (cont’d) • New features • Improved logging (timestamps, etc.) • More efficient recovery • Node RETRY capability • DAG info in condor_q (with -dag flag) • Robust in more failure cases • Recursive DAGs for conditional execution • DAGMan for Windows (Ray Pingree)
DAGMan Success • DAGMan is becoming part of the common framework for running on the grid. • Particle Physics Data Grid (PPDG) • Grid Physics Network (GriPhyN) • Many Super Computing 2001 demos • more…
DAGMan in the GriPhyN Architecture [diagram by Ian Foster (Argonne): an Application produces a DAG for a Planner, which hands a concrete DAG to an Executor (DAGMan, Kangaroo); surrounding components include Catalog Services (MCAT; GriPhyN catalogs), Info Services and Monitoring (MDS), Replica Management (GDMP), Policy/Security (GSI, CAS), a Reliable Transfer Service (GridFTP; GRAM; SRM), and Compute (Globus GRAM) and Storage Resources]
DAGMan in PPDG Tools [diagram by Jim Amundson (Fermilab)]
What’s Next? • More flexible control of node execution • Currently implicit: “all my parents returned 0”. • Why not, “all parents returned 0 AND ran for more than two hours” or “parent A returned 0 and parent B returned 42”? • 1st step: represent DAG nodes internally as ClassAds • Allows DAGMan to decide when to run nodes based on arbitrary requirements
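As a sketch of this proposal, a node represented as a ClassAd might carry an explicit requirements expression; the attribute names and the ParentExit function below are hypothetical illustrations, not existing DAGMan syntax:

```
[
    NodeName     = "D";
    SubmitFile   = "d.sub";
    // hypothetical: run D only if B succeeded and C returned 42
    Requirements = ParentExit("B") == 0 && ParentExit("C") == 42
]
```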
What’s Next? (cont’d) • Extend DAGMan to utilize the DaP Scheduler (DaP?) to intelligently schedule data transfers along with Condor and Condor-G jobs. [diagram: DAGMan coordinating Condor, Condor-G, and the DaP Scheduler]
Thank You! • Interested in seeing more? • Come to the DAGMan BoF • Wednesday 9am - noon • Room 3393, Computer Sciences (1210 W. Dayton St.) • Email us: • email@example.com • Try it! • http://www.cs.wisc.edu/condor