Job Submission with Globus, Condor, and Condor-G Selim Kalayci Florida International University 07/21/2009 Note: Slides are compiled from various TeraGrid documentation
Grid Job Management using Globus • Common WS interface to schedulers • Unix, Condor, LSF, PBS, SGE, … • More generally: interface for process execution management • Lay down execution environment • Stage data • Monitor & manage lifecycle • Kill it, clean up
Grid Job Management Goals Provide a service to securely: • Create an environment for a job • Stage files to/from environment • Cause execution of job process(es) • Via various local resource managers • Monitor execution • Signal important state changes to client • Enable client access to output files • Streaming access during execution
GRAM • GRAM: Globus Resource Allocation and Management • GRAM is a Globus Toolkit component • For Grid job management • GRAM is a unifying remote interface to Resource Managers • Yet preserves local site security/control • Remote credential management • File staging via RFT and GridFTP
A Simple Example • First, log in to queenbee.loni-lsu.teragrid.org • Command example:
% globusrun-ws -submit -c /bin/date
Submitting job...Done.
Job ID: uuid:002a6ab8-6036-11d9-bae6-0002a5ad41e5
Termination time: 01/07/2005 22:55 GMT
Current job state: Active
Current job state: CleanUp
Current job state: Done
Destroying job...Done.
• A successful submission creates a new ManagedJob resource with its own unique EPR for messaging • Use the -o option to save the EPR to a file:
% globusrun-ws -submit -o job.epr -c /bin/date
A Simple Example (2) • To see the output, use the -s (stream) option:
% globusrun-ws -submit -s -c /bin/date
Termination time: 06/14/2007 18:07 GMT
Current job state: Active
Current job state: CleanUp-Hold
Wed Jun 13 14:07:54 EDT 2007
Current job state: CleanUp
Current job state: Done
Destroying job...Done.
Cleaning up any delegated credentials...Done.
• If you want to send the output to a file, use the -so option:
% globusrun-ws -submit -s -so job.out -c /bin/date
…
% cat job.out
Wed Jun 13 14:07:54 EDT 2007
A Simple Example (3) • Submitting your job to different schedulers • Fork
% globusrun-ws -submit -Ft Fork -s -c /bin/date
(Actually, Fork is the default, so you can omit -Ft in this case.) • SGE
% globusrun-ws -submit -Ft SGE -s -c /bin/date
• Submitting to a remote site
% globusrun-ws -submit -F tg-login.frost.ncar.teragrid.org -c /bin/date
Batch Job Submissions
% globusrun-ws -submit -batch -o job_epr -c /bin/sleep 50
Submitting job...Done.
Job ID: uuid:f9544174-60c5-11d9-97e3-0002a5ad41e5
Termination time: 01/08/2005 16:05 GMT
% globusrun-ws -status -j job_epr
Current job state: Active
% globusrun-ws -status -j job_epr
Current job state: Done
% globusrun-ws -kill -j job_epr
Requesting original job description...Done.
Destroying job...Done.
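Since a batch submission returns immediately, a common pattern is to poll the saved EPR file until the job finishes. A sketch in shell (assumes the job_epr file from the example above; the 10-second interval and the Done/Failed state matching are illustrative choices, not a prescribed idiom):

```
while true; do
  state=$(globusrun-ws -status -j job_epr)
  echo "$state"
  case "$state" in
    *Done*|*Failed*) break ;;
  esac
  sleep 10
done
```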
Resource Specification Language (RSL) • RSL is the language used by the clients to submit a job. • All job submission parameters are described in RSL, including the executable file and arguments. • You can specify the type and capabilities of resources to execute your job. • You can also coordinate Stage-in and Stage-out operations through RSL.
Submitting a job through RSL • Command:
% globusrun-ws -submit -f touch.xml
• Contents of touch.xml file:
<job>
  <executable>/bin/touch</executable>
  <argument>touched_it</argument>
</job>
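RSL can also describe the stage-in/stage-out operations mentioned earlier. A sketch of a GT4 job description that stages an input file in via RFT before running (the gsiftp host and filenames here are hypothetical examples, not real endpoints):

```xml
<!-- Sketch: stage input.txt in from a GridFTP server, then run cat on it -->
<job>
  <executable>/bin/cat</executable>
  <argument>input.txt</argument>
  <stdout>${GLOBUS_USER_HOME}/job.out</stdout>
  <fileStageIn>
    <transfer>
      <!-- hypothetical source host -->
      <sourceUrl>gsiftp://myhost.example.org:2811/~/input.txt</sourceUrl>
      <destinationUrl>file:///${GLOBUS_USER_HOME}/input.txt</destinationUrl>
    </transfer>
  </fileStageIn>
</job>
```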
Condor • Condor is a software system that creates a High-Throughput Computing (HTC) environment • Created at UW-Madison • Condor is a specialized workload management system for compute-intensive jobs • Detects machine availability • Harnesses available resources • Uses remote system calls to send R/W operations over the network • Provides powerful resource management by matching resource owners with consumers (broker)
How Condor works Condor provides: • a job queueing mechanism • scheduling policy • priority scheme • resource monitoring, and • resource management. Users submit their serial or parallel jobs to Condor, Condor places them into a queue, … chooses when and where to run the jobs based upon a policy, … carefully monitors their progress, and … ultimately informs the user upon completion.
Condor - features • Checkpoint & migration • Remote system calls • Able to transfer data files and executables across machines • Job ordering • Job requirements and preferences can be specified via powerful expressions
Condor lets you manage a large number of jobs. • Specify the jobs in a file and submit them to Condor • Condor runs them and keeps you notified on their progress • Mechanisms to help you manage huge numbers of jobs (1000’s), all the data, etc. • Handles inter-job dependencies (DAGMan) • Users can set Condor's job priorities • Condor administrators can set user priorities
Condor-G • Condor-G is a specialization of Condor. It is also known as the “Globus universe” or “Grid universe”. • Condor-G can submit jobs to Globus resources, just like globusrun-ws. • Condor-G combines the inter-domain resource management protocols of the Globus Toolkit and the intra-domain resource and job management methods of Condor for managing Grid jobs.
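A Condor-G submission looks like an ordinary Condor submit file with the grid universe selected. A sketch targeting a GT4 WS-GRAM service (the hostname/URL below is a hypothetical example; syntax details vary with the Condor version):

```
# Sketch of a Condor-G submit file for the grid universe
# (service URL is a hypothetical example)
universe      = grid
grid_resource = gt4 https://remote.example.org:8443/wsrf/services/ManagedJobFactoryService Fork
executable    = /bin/date
output        = date.out
error         = date.err
log           = date.log
queue
```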
Condor-G … • does whatever it takes to run your jobs, even if … • The gatekeeper is temporarily unavailable • The job manager crashes • Your local machine crashes • The network goes down
Remote Resource Access: Globus — [diagram: a client in Organization A runs "globusrun myjob …"; the Globus GRAM protocol carries the request to the Globus JobManager in Organization B, which fork()s the job]
Globus + Condor — [diagram: as above, but the Globus JobManager in Organization B submits the job to a local Condor pool instead of forking it]
Condor-G + Globus + Condor — [diagram: Condor-G in Organization A queues myjob1 … myjob5 and uses the Globus GRAM protocol to hand them to the Globus JobManager in Organization B, which submits them to its Condor pool]
Just to be fair… • The gatekeeper doesn’t have to submit to a Condor pool. • It could be PBS, LSF, Sun Grid Engine… • Condor-G will work fine whatever the remote batch system is.
Four Steps to Run a Job with Condor • Choose a Universe for your job • Make your job batch-ready • Create a submit description file • Run condor_submit • These choices tell Condor how, when, and where to run the job, and describe exactly what you want to run.
1. Choose a Universe • There are many choices • Vanilla: any old job • Standard: checkpointing & remote I/O • Java: better for Java jobs • MPI: Run parallel MPI jobs • Virtual Machine: Run a virtual machine as job • … • For now, we’ll just consider vanilla
2. Make your job batch-ready • Must be able to run in the background: • no interactive input, windows, GUI, etc. • Condor is designed to run jobs as a batch system, with pre-defined inputs for jobs • Can still use STDIN, STDOUT, and STDERR (the keyboard and the screen), but files are used for these instead of the actual devices • Organize data files
3. Create a Submit Description File • A plain ASCII text file • Condor does not care about file extensions • Tells Condor about your job: • Which executable to run and where to find it • Which universe • Location of input, output and error files • Command-line arguments, if any • Environment variables • Any special requirements or preferences
Simple Submit Description File
# myjob.submit file
# Simple condor_submit input file
# (Lines beginning with # are comments)
# NOTE: the words on the left side are not
# case sensitive, but filenames are!
Universe   = vanilla
Executable = analysis
Log        = my_job.log
Queue
4. Run condor_submit • You give condor_submit the name of the submit file you have created: condor_submit my_job.submit • condor_submit parses the submit file
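On success, condor_submit reports the cluster the job landed in; illustrative output (the cluster number depends on the local queue's history):

```
% condor_submit my_job.submit
Submitting job(s).
1 job(s) submitted to cluster 1.
```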
Another Submit Description File
# Example condor_submit input file
# (Lines beginning with # are comments)
# NOTE: the words on the left side are not
# case sensitive, but filenames are!
Universe   = vanilla
Executable = /home/wright/condor/my_job.condor
Input      = my_job.stdin
Output     = my_job.stdout
Error      = my_job.stderr
Arguments  = -arg1 -arg2
Queue
“Clusters” and “Processes” • If your submit file describes multiple jobs, we call this a “cluster” • Each job within a cluster is called a “process” or “proc” • If you only specify one job, you still get a cluster, but it has only one process • A Condor “Job ID” is the cluster number, a period, and the process number (“23.5”) • Process numbers always start at 0
Example Submit Description File for a Cluster
# Example condor_submit input file that defines
# a cluster of two jobs with different iwd
Universe   = vanilla
Executable = my_job
Arguments  = -arg1 -arg2
InitialDir = run_0
# (this Queue becomes job 2.0)
Queue
InitialDir = run_1
# (this Queue becomes job 2.1)
Queue
Submit Description File for a BIG Cluster of Jobs • The initial directory for each job is specified with the $(Process) macro, and instead of submitting a single job, we use “Queue 600” to submit 600 jobs at once • $(Process) will be expanded to the process number for each job in the cluster (from 0 up to 599 in this case), so we’ll have “run_0”, “run_1”, … “run_599” directories • All the input/output files will be in different directories!
Submit Description File for a BIG Cluster of Jobs
# Example condor_submit input file that defines
# a cluster of 600 jobs with different iwd
Universe   = vanilla
Executable = my_job
Arguments  = -arg1 -arg2
InitialDir = run_$(Process)
Queue 600
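The $(Process) expansion can be mimicked with a plain shell loop, just to visualize the directory names Condor generates (illustrative only — Condor performs this substitution itself when it materializes the cluster; a 3-job cluster is used here for brevity):

```shell
# Print the per-job initial directories that $(Process) would
# produce for a 3-job cluster: run_0, run_1, run_2
for proc in 0 1 2; do
  echo "run_${proc}"
done
```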
Other Condor commands • condor_q – show status of job queue • condor_status – show status of compute nodes • condor_rm – remove a job • condor_hold – hold a job temporarily • condor_release – release a job from hold
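Illustrative condor_q output for the cluster submitted above (the exact columns and submitter line vary by Condor version; hostname and owner here are hypothetical):

```
% condor_q
-- Submitter: host.example.org : <192.168.1.10:9618> : host.example.org
 ID      OWNER        SUBMITTED     RUN_TIME ST PRI SIZE CMD
  2.0    wright      7/21 10:02   0+00:01:10 R  0   9.8  my_job -arg1 -arg2
  2.1    wright      7/21 10:02   0+00:00:55 R  0   9.8  my_job -arg1 -arg2

2 jobs; 0 idle, 2 running, 0 held
```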
Submitting more complex jobs • Express dependencies between jobs → workflows • Condor DAGMan • Next week
Hands-on Lab • http://users.cs.fiu.edu/~skala001/Condor_Lab.htm