Batch Systems • In a number of scientific computing environments, multiple users must share a compute resource: • research clusters • supercomputing centers • On multi-user HPC clusters, the batch system is a key component for aggregating compute nodes into a single, sharable computing resource • The batch system becomes the “nerve center” for coordinating the use of resources and controlling the state of the system in a way that must be “fair” to its users • As current and future expert users of large-scale compute resources, you need to be familiar with the basics of a batch system
Batch Systems • The core functionality of all batch systems are essentially the same, regardless of the size or specific configuration of the compute hardware: • Multiple Job Queues: • queues provide an orderly environment for managing a large number of jobs • queues are defined with a variety of limits for maximum run times, memory usage, and processor counts; they are often assigned different priority levels as well • may be interactive or non-interactive • Job Control: • submission of individual jobs to do some work (eg. serial, or parallel HPC applications) • simple monitoring and manipulation of individual jobs, and collection of resource usage statistics (e.g., memory usage, CPU usage, and elapsed wall-clock time per job) • Job Scheduling • policy which decides priority between individual user jobs • allocates resources to scheduled jobs
Batch Systems • Job Scheduling Policies: • the scheduler must decide how to prioritize all the jobs on the system and allocate necessary resources for each job (processors, memory, file-systems, etc) • scheduling process can be easy or non-trivial depending on the size and desired functionality • first in, first out (FIFO)scheduling: jobs are simply scheduled in the order in which they are submitted • political scheduling: enables some users to have more priority than others • fairshare scheduling, scheduler ensures users have equal access over time • Additional features may also impact scheduling order: • advanced reservations - resources can be reserved in advance for a particular user or job • backfill -can be combined with any of the scheduling paradigms to allow smaller jobs to run while waiting for enough resources to become available for larger jobs • back-fill of smaller jobs helps maximize the overall resource utilization • back-fill can be your friend for small duration jobs
Batch Systems • Common batch systems you may encounter in scientific computing: • Platform LSF • PBS • Loadleveler (IBM) • SGE • All have similar functionality but different syntax • Reasonably straight forward to convert your job scripts from one system to another • Above all include specific batch system directives which can be placed in a shell script to request certain resources (processors, queues, etc). • We will focus on LSF primarily since it is the system running on Lonestar
Launch mpirun Queue Master Batch Submission Process internet Compute Nodes Server Head Submission: bsub < job C1 C2 C3 C4 Queue: Job Script waits for resources on Server Master: Compute Node that executes the job script, launches ALL MPI processes Launch: ssh to each compute node to start executable (e.g. a.out) ibrun ./a.out mpirun –np # ./a.out
LSF Batch System • Lonestar uses Platform LSF for both the batch queuing system and scheduling mechanism (provides similar functionality to PBS, but requires different commands for job submission and monitoring) • LSF includes global fairshare, a mechanism for ensuring no one user monopolizes the computing resources • Batch jobs are submitted on the front end and are subsequently executed on compute nodes as resources become available • Order of job execution depends on a variety of parameters: • Submission Time • Queue Priority: some queues have higher priorities than others • Backfill Opportunities: small jobs may be back-filled while waiting for bigger jobs to complete • Fairshare Priority: users who have recently used a lot of compute resources will have a lower priority than those who are submitting new jobs • Advanced Reservations: jobs my be blocked in order to accommodate advanced reservations (for example, during maintenance windows) • Number of Actively Scheduled Jobs: there are limits on the maximum number of concurrent processors used by each user
Lonestar Queue Definitions • Additional Queue Limits • In the normal and high queues, only a maximum of 512 processes can be used at one time. Jobs requiring more processors are deferred for possible scheduling until running jobs complete. For example, a single user can have the following job combinations eligible for scheduling: • Run 2 jobs requiring 256 procs • Run 4 jobs requiring 128 procs each • Run 8 jobs requiring 64 procs each • Run 16 jobs requiring 32 procs each • A maximum of 25 queued jobs per user is allowed at one time
LSF Fairshare • A global fairshare mechanism is implemented on Lonestar to provide fair access to its substantial compute resources • Fairshare computes a dynamic priority for each user and uses this priority in making scheduling decisions • Dynamic priority is based on the following criteria • Number of shares assigned • Resources used by jobs belonging to the user: • Number of job slots reserved • Run time of running jobs • Cumulative actual CPU time (not normalized), adjusted so that recently used CPU time is weighted more heavily than CPU time used in the distant past
Priority LSF Fairshare • bhpart: Command to see current fairshare priority. For example: lslogin1--> bhpart -r HOST_PARTITION_NAME: GlobalPartition HOSTS: all SHARE_INFO_FOR: GlobalPartition/ USER/GROUP SHARES PRIORITY STARTED RESERVED CPU_TIME RUN_TIME avijit 1 0.333 0 0 0.0 0 chona 1 0.333 0 0 0.0 0 ewalker 1 0.333 0 0 0.0 0 minyard 1 0.333 0 0 0.0 0 phaa406 1 0.333 0 0 0.0 0 bbarth 1 0.333 0 0 0.0 0 milfeld 1 0.333 0 0 2.9 0 karl 1 0.077 0 0 51203.4 0 vmcalo 1 0.000 320 0 2816754.8 7194752
Commonly Used LSF Commands Note: most of these commands support a “-l” argument for long listings. For example: bhist –l <jobID> will give a detailed history of a specific job. Consult the man pages for each of these commands for more information.
LSF Batch System • LSF Defined Environment Variables:
LSF Batch System • Comparison of LSF, PBS and Loadleveler commands that provide similar functionality
Batch System Concerns • Submission (need to know) • Required Resources • Run-time Environment • Directory of Submission • Directory of Execution • Files for stdout/stderr Return • Email Notification • Job Monitoring • Job Deletion • Queued Jobs • Running Jobs
Job name Stdout Output file name (%J = jobID) Stderr Output file name Submission queue Your Project Name Max Run Time (15 minutes) Echo pertinent environment info LSF: Basic MPI Job Script Total number of processes #!/bin/csh #BSUB -n 32 #BSUB -J hello #BSUB -o %J.out #BSUB -e %J.err #BSUB -q normal #BSUB -P A-ccsc #BSUB -W 0:15 echo "Master Host = "`hostname` echo "LSF_SUBMIT_DIR: $LS_SUBCWD" echo "PWD_DIR: "`pwd` ibrun ./hello Execution command Parallel application manager and mpirun wrapper script executable
Job name Stdout Output file name (%J = jobID) Stderr Output file name Submission queue Your Project Name Max Run Time (15 minutes) LSF: Extended MPI Job Script Total number of processes #!/bin/csh #BSUB -n 32 #BSUB -J hello #BSUB -o %J.out #BSUB -e %J.err #BSUB -q normal #BSUB -P A-ccsc #BSUB -W 0:15 #BSUB -w ‘ended(1123)' #BSUB -u email@example.com #BSUB -B #BSUB -N echo "Master Host = "`hostname` echo "LSF_SUBMIT_DIR: $LS_SUBCWD" ibrun ./hello Dependency on Job <1123> Email address Email when job begins execution Email job report informationupon completion
LSF: Job Script Submission • When submitting jobs to LSF using a job script, a redirection is required for bsub to read the commands. Consider the following script:lslogin1> cat job.script#!/bin/csh#BSUB -n 32#BSUB -J hello#BSUB -o %J.out#BSUB -e %J.err#BSUB -q normal#BSUB -W 0:15echo "Master Host = "`hostname`echo "LSF_SUBMIT_DIR: $LS_SUBCWD“echo "PWD_DIR: "`pwd`ibrun ./hello • To submit the job:lslogin1% bsub < job Re-direction is required!
LSF: Interactive Execution • Several ways to run interactively • Submit entire command to bsub directly:> bsub –q development -I -n 2 -W 0:15 ibrun ./helloYour job is being routed to the development queueJob <11822> is submitted to queue <development>.<<Waiting for dispatch ...>><<Starting on compute-1-0>> Hello, world! --> Process # 0 of 2 is alive. ->compute-1-0 --> Process # 1 of 2 is alive. ->compute-1-0 • Submit using normal job script and include additional -I directive:> bsub -I < job.script
Batch Script Suggestions • Echo issuing commands • (“set -x” and “set echo” for ksh and csh). • Avoid absolute pathnames • Use relative path names or environment variables ($HOME, $WORK) • Abort job when a critical command fails. • Print environment • Include the "env" command if your batch job doesn't execute the same as in an interactive execution. • Use “./” prefix for executing commands in the current directory • The dot means to look for commands in the present working directory. Not all systems include "." in your $PATH variable. (usage: ./a.out). • Track your CPU time
LSF Job Monitoring (showq utility) lslogin1% showq ACTIVE JOBS-------------------- JOBID JOBNAME USERNAME STATE PROC REMAINING STARTTIME 11318 1024_90_96x6 vmcalo Running 64 18:09:19 Fri Jan 9 10:43:53 11352 naf phaa406 Running 16 17:51:15 Fri Jan 9 10:25:49 11357 24N phaa406 Running 16 18:19:12 Fri Jan 9 10:53:46 23 Active jobs 504 of 556 Processors Active (90.65%) IDLE JOBS---------------------- JOBID JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME 11169 poroe8 xgai Idle 128 10:00:00 Thu Jan 8 10:17:06 11645 meshconv019 bbarth Idle 16 24:00:00 Fri Jan 9 16:24:183 Idle jobs BLOCKED JOBS------------------- JOBID JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME 11319 1024_90_96x6 vmcalo Deferred 64 24:00:00 Thu Jan 8 18:09:11 11320 1024_90_96x6 vmcalo Deferred 64 24:00:00 Thu Jan 8 18:09:11 17 Blocked jobs Total Jobs: 43 Active Jobs: 23 Idle Jobs: 3 Blocked Jobs: 17
LSF Job Monitoring (bjobs command) lslogin1% bjobs JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME 11635 bbarth RUN normal lonestar 2*compute-8 *shconv009 Jan 9 16:24 2*compute-9-22 2*compute-3-25 2*compute-8-30 2*compute-1-27 2*compute-4-2 2*compute-3-9 2*compute-6-13 11640 bbarth RUN normal lonestar 2*compute-3 *shconv014 Jan 9 16:24 2*compute-6-2 2*compute-6-5 2*compute-3-12 2*compute-4-27 2*compute-7-28 2*compute-3-5 2*compute-7-5 11657 bbarth PEND normal lonestar *shconv028 Jan 9 16:38 11658 bbarth PEND normal lonestar *shconv029 Jan 9 16:38 11662 bbarth PEND normal lonestar *shconv033 Jan 9 16:38 11663 bbarth PEND normal lonestar *shconv034 Jan 9 16:38 11667 bbarth PEND normal lonestar *shconv038 Jan 9 16:38 11668 bbarth PEND normal lonestar *shconv039 Jan 9 16:38 Note: Use “bjobs -u all” to see jobs from all users.
LSF Job Monitoring (lsuser utility) lslogin1$ lsuser -u vap JOBID QUEUE USER NAME PROCS SUBMITTED 547741 normal vap vap_hd_sh_p96 14 Tue Jun 7 10:37:01 2005 HOST R15s R1m R15m PAGES MEM SWAP TEMP compute-11-11 2.0 2.0 1.4 4.9P/s 1840M 2038M 24320M compute-8-3 2.0 2.0 2.0 1.9P/s 1839M 2041M 23712M compute-7-23 2.0 2.0 1.9 2.3P/s 1838M 2038M 24752M compute-3-19 2.0 2.0 2.0 2.6P/s 1847M 2041M 23216M compute-14-19 2.0 2.0 2.0 2.1P/s 1851M 2040M 24752M compute-3-21 2.0 2.0 1.7 2.0P/s 1845M 2038M 24432M compute-13-11 2.0 2.0 1.5 1.8P/s 1841M 2040M 24752M
LSF Job Manipulation/Monitoring • To kill a running or queued job (takes ~30 seconds to complete):bkill <jobID> bkill -r <jobID> (Use when bkill alone won’t delete the job) • To suspend a queued job:bstop <jobId> • To resume a suspended job:bresume <jobID> • To see more information on why a job is pending:bjobs –p <jobID> • To see a historical summary of a job:bhist <jobID>lslogin1> bhist 11821Summary of time in seconds spent in various states:JOBID USER JOB_NAME PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL11821 karl hello 131 0 127 0 0 0 258