An Overview of the Portable Batch System


1. An Overview of the Portable Batch System
Gabriel Mateescu
National Research Council Canada, IMSB
gabriel.mateescu@nrc.ca
www.sao.nrc.ca/~gabriel/presentations/sgi_pbs

2. Outline
• PBS highlights
• PBS components
• Resources managed by PBS
• Choosing a PBS scheduler
• Installation and configuration of PBS
• PBS scripts and commands
• Adding preemptive job scheduling to PBS

3. PBS Highlights
• Developed by Veridian / MRJ
• Robust, portable, effective, extensible batch job queuing and resource management system
• Supports different schedulers
• Supports heterogeneous clusters
• OpenPBS: the open-source version
• PBS Pro: the commercial version

4. Recent Versions of PBS
• PBS 2.2, November 1999:
  • both the FIFO and SGI schedulers have bugs in enforcing resource limits
  • poor support for stopping and resuming jobs
• OpenPBS 2.3, September 2000:
  • better FIFO scheduler: resource limits enforced, backfilling added
• PBS Pro 5.0, September 2000:
  • claims support for job stopping/resuming, better scheduling, and IRIX cpusets

5. Resources managed by PBS
• PBS manages jobs, CPUs, memory, hosts, and queues
• PBS accepts batch jobs, enqueues them, runs them, and delivers output back to the submitter
• Resources describe attributes of jobs, queues, and hosts
• The scheduler chooses the jobs that fit within queue and cluster resources

6. Main Components of PBS
• Three daemons:
  • pbs_server - the server
  • pbs_sched - the scheduler
  • pbs_mom - the job executor and resource monitor
• The server accepts commands and communicates with the other daemons
• Client commands:
  • qsub - submit a job
  • qstat - view queue and job status
  • qalter - change a job's attributes
  • qdel - delete a job

7. Batch Queuing
[Diagram: job-exclusive scheduling of jobs from Queue A and Queue B onto the nodes (CPUs + memory) of an SGI Origin system]

8. Resource Examples (requested as sketched below)
• ncpus - number of CPUs per job
• mem - resident memory per job
• pmem - per-process memory
• vmem - virtual memory per job
• cput - CPU time per job
• walltime - real time per job
• file - file size per job
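
These resources are requested with qsub's -l option, either on the command line or as #PBS directives in the job script; a minimal sketch with hypothetical values:

#PBS -l ncpus=4
#PBS -l mem=512mb
#PBS -l pmem=128mb
#PBS -l cput=2:00:00
#PBS -l walltime=6:00:00
#PBS -l file=1gb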

9. Resource limits
• resources_max - per-job limit for a resource; determines whether a job fits in a queue
• resources_default - default amount of a resource assigned to a job
• resources_available - advice to the scheduler on how much of a resource can be used by all running jobs
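
All three attributes are set with the qmgr tool, as in the following sketch (queue name and values are hypothetical; fuller qmgr examples appear later in the deck):

% qmgr
Qmgr: set queue hpc resources_max.ncpus = 4
Qmgr: set queue hpc resources_default.ncpus = 1
Qmgr: set server resources_available.ncpus = 8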

10. Choosing a Scheduler (1)
• FIFO scheduler:
  • First-fit placement: enqueues a job in the first queue whose limits the job can satisfy, even if that queue has no free resources at the moment while another queue could run the job immediately
  • Supports per-job and (in version 2.3) per-queue resource limits: ncpus, mem
  • Supports per-server limits on the number of CPUs and on memory (based on the server attribute resources_available)

11. Choosing a Scheduler (2)
• Algorithms in the FIFO scheduler (selected in the scheduler's config file, as sketched below):
  • FIFO - sort jobs by queuing time, running the earliest job first
  • Backfill - relax the FIFO rule for parallel jobs, as long as jobs run out of order do not delay jobs submitted before them
  • Fair share - sort and schedule jobs based on past usage of the machine by the job owners
  • Round-robin - pick a job from each queue in turn
  • By key - sort jobs by a set of keys: shortest_job_first, smallest_memory_first
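
These policies are chosen in the FIFO scheduler's sched_priv/sched_config file. A sketch of the relevant entries, with option names as in the sample file shipped with OpenPBS 2.3 and purely illustrative values (check them against the shipped sched_config):

round_robin: False all
by_queue: True all
strict_fifo: False all
fair_share: False all
help_starving_jobs: True all
sort_by: shortest_job_first all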

12. Choosing a Scheduler (3)
• The FIFO scheduler supports round-robin load balancing as of version 2.3
• The FIFO scheduler:
  • decouples a job's requirement on the number of CPUs from its requirement on the amount of memory
  • with simple first-fit placement, the user may be forced to specify an execution queue explicitly when the job could fit in more than one queue

13. Choosing a Scheduler (4)
• SGI scheduler:
  • supports FIFO, fair share, and backfilling, and attempts to avoid job starvation
  • supports both per-job and per-queue limits on the number of CPUs and on memory
  • the per-server limit is the number of node cards
  • makes a best effort in choosing a queue where to run a job; a job without enough resources to run is kept in the submit queue
  • ties the number of CPUs allocated per job to the memory allocated per job

14. Resource allocation
• The SGI scheduler allocates nodes:
  node = [ PE_PER_NODE cpus, MB_PER_NODE Mbyte ]
• The number of nodes N for a job is the smallest N such that
  [ ncpus, mem ] <= [ N*PE_PER_NODE, N*MB_PER_NODE ]
  where ncpus and mem are the job's CPU and memory limits, specified, e.g., with #PBS -l mem
• The job attributes Resource_List.{ncpus, mem} are set to
  Resource_List.ncpus = N * PE_PER_NODE
  Resource_List.mem = N * MB_PER_NODE
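
A worked example using the values from the install slides (PE_PER_NODE = 2, MB_PER_NODE = 512 MB): a job requesting ncpus=3 and mem=600mb needs N = max(ceil(3/2), ceil(600/512)) = 2 nodes, so PBS sets Resource_List.ncpus = 4 and Resource_List.mem = 1024mb.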

15. Queue and Server Limits
• FIFO scheduler:
  • per-job limits (ncpus, mem) are defined by the resources_max queue attributes
  • as of version 2.3, resources_max also defines per-queue limits
  • per-server resource limits are enforced with the resources_available attributes

16. Queue and Server Limits
• SGI scheduler:
  • per-job limits (ncpus, mem) are defined by the resources_max queue attributes
  • resources_max also defines per-queue limits
  • the per-server limit is given by the number of Origin node cards; unlike the FIFO scheduler, resources_available limits are not enforced

17. Job enqueuing (1)
• The scheduler places each job in some queue
• This involves several tests for resources
• Which queue a job is enqueued into depends on:
  • what limits are tested
  • first-fit versus best-fit placement
• A job can fit in a queue if the resources requested by the job do not exceed the maximum value of the resources defined for the queue; for example, for the resource ncpus:
  Resource_List.ncpus <= resources_max.ncpus

18. Job enqueuing (2)
• A job fits in a queue if the resources already assigned to the queue plus the requested resources do not exceed the queue's maximum; for example, for ncpus:
  resources_assigned.ncpus + Resource_List.ncpus <= resources_max.ncpus
• A job fits in the system if the sum of the resources assigned to all queues plus the requested resources does not exceed the available resources; for example, for ncpus:
  Σ resources_assigned.ncpus + Resource_List.ncpus <= resources_available.ncpus
  (the sum runs over all queues)
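
For instance, with resources_max.ncpus = 8 on a queue whose running jobs already hold resources_assigned.ncpus = 6, a new job requesting Resource_List.ncpus = 4 passes the per-job test (4 <= 8) but does not fit in the queue (6 + 4 > 8).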

19. First fit versus best fit
• The FIFO scheduler finds the first queue where a job can fit and dispatches the job to that queue
  • if the job does not actually fit at the moment, it waits for the requested resources in the execution queue
• The SGI scheduler keeps the job in the submit queue until it finds an execution queue where the job fits, then dispatches the job to that queue
• If queues are defined with monotonically increasing resource limits (e.g., CPU time), then first fit carries no penalty
• However, if a job can fit in several queues, the SGI scheduler will find a better schedule
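
As a hypothetical illustration of monotonically increasing limits: with queues short (cput <= 1:00:00), medium (cput <= 4:00:00), and long (cput <= 12:00:00), the first queue that admits a job is also the tightest fit, so first-fit and best-fit placement coincide.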

20. Limits on the number of running jobs
• Per-queue and per-server limits on the number of running jobs (set with qmgr, as sketched below):
  • max_running
  • max_user_run, max_group_run - maximum number of running jobs per user or per group
• Unlike the FIFO scheduler, the SGI scheduler enforces these limits only on a per-queue basis
• The SGI scheduler also enforces MAX_JOBS from the scheduler config file, a substitute for the server's max_running
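
A sketch of setting these limits with qmgr (queue name and values are hypothetical):

% qmgr
Qmgr: set queue hpc max_running = 10
Qmgr: set queue hpc max_user_run = 2
Qmgr: set queue hpc max_group_run = 4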

21. SGI Origin Install (1)
• Source files are under OpenPBS_v2_3/src
• Consider the SGI scheduler
• Make sure the machine-dependent values defined in scheduler.cc/samples/sgi_origin/toolkit.h match the actual machine hardware:
  #define MB_PER_NODE ((size_t) 512*1024*1024)
  #define PE_PER_NODE 2
• PE_PER_NODE may be set to 1 to allocate half-nodes, if MB_PER_NODE is set accordingly (e.g., halved)

22. SGI Origin Install (2)
• Bug fixes in scheduler.cc/samples/sgi_origin/pack_queues.c
• Operator precedence bug (line 198): without parentheses around the bitwise AND, == binds more tightly than &, the test is never true, and schd_evaluate_system() is bypassed; the corrected code is:

for (qptr = qlist; qptr != NULL; qptr = qptr->next) {
    /* fixed: parenthesize the bitwise AND so it is evaluated
       before the comparison with 0 */
    if ((qptr->queue->flags & QFLAGS_FULL) == 0) {
        if (!schd_evaluate_system(...)) {
            /* DONT_START_JOB (0), so don't change allfull */
            continue;
        }
        /* ... */
    }
}

23. SGI Origin Install (3)
• Fix of a logic bug in pack_queues.c: if a system limit is exceeded, the scheduler should not try to schedule the job:

for (qptr = qlist; qptr != NULL; qptr = qptr->next) {
    if ((qptr->queue->flags & QFLAGS_FULL) == 0) {
        if (!schd_evaluate_system(...)) {
            /* DONT_START_JOB (0), so don't change allfull */
            continue;
        }
        /* ... */
    }
}
/* if allfull is set, do not attempt to schedule */
for (qptr = (allfull) ? NULL : qlist; qptr != NULL; qptr = qptr->next) {
    /* ... */
}

24. SGI Origin Install (4)
• Fix of a logic bug in user_limits.c, function user_running()
• This function counts the number of running jobs, so it must test for equality between the job state and 'R':

user_running(...) {
    for (job = queue->jobs; job != NULL; job = job->next) {
        /* count only running jobs owned by the given user */
        if ((job_state == 'R') && !strcmp(job->owner, user))
            jobs_running++;
        /* ... */
    }
}

25. SGI Origin Install (5)
• The limit ncpus is not enforced in the function mom_over_limit(), located in the file mom_mach.c under the directory src/resmom/irix6array:

#define SGI_ZOMBIE_WRONG 1
int mom_over_limit(...) {
    /* ... */
#if !defined(SGI_ZOMBIE_WRONG)
    return (TRUE);
#endif
    /* ... */
}

26. SGI Origin Install (6)
• Script to run the configure command:
___________________________________________________
#!/bin/csh -f
set PBS_HOME=/usr/local/pbs
set PBS_SERVER_HOME=/usr/spool/pbs
# Select the SGI or the FIFO scheduler
set SCHED="--set-sched-code=sgi_origin --enable-nodemask"
#set SCHED="--set-sched-code=fifo --enable-nodemask"
$HOME/PBS/OpenPBS_v2_3/configure \
  --prefix=$PBS_HOME \
  --set-server-home=$PBS_SERVER_HOME \
  --set-cc=cc --set-cflags="-Dsgi -D_SGI_SOURCE -64 -g" \
  --set-sched=cc $SCHED --enable-array --enable-debug

27. SGI Origin Install (7)
• Build and install; makePBS below denotes the configure script from the previous slide, and after installation the scheduler's files (config, decay_usage) live under /usr/spool/pbs/sched_priv:
___________________________________________________
# cd /usr/local/pbs
# makePBS
# make
# make install
# cd /usr/spool/pbs

28. Configuring for the SGI scheduler
• Queue types:
  • one submit queue
  • one or several execution queues
• Per-server limit on the number of running jobs
• Load control
• Fair-share scheduling:
  • past usage of the machine is used in ranking the jobs
  • decayed past usage per user is kept in sched_priv/decay_usage
• Scheduler restart action
• PBS manager tool: qmgr

29. Queue definition
• File sched_priv/config:
SUBMIT_QUEUE            submit
BATCH_QUEUES            hpc,back
MAX_JOBS                256
ENFORCE_PRIME_TIME      False
ENFORCE_DEDICATED_TIME  False
SORT_BY_PAST_USAGE      True
DECAY_FACTOR            0.75
SCHED_ACCT_DIR          /usr/spool/pbs/server_priv/accounting
SCHED_RESTART_ACTION    RESUBMIT

30. Load Control
• Load control for the SGI scheduler, in sched_priv/config:
TARGET_LOAD_PCT       90%
TARGET_LOAD_VARIANCE  -15%,+10%
• Load control for the FIFO scheduler, in mom_priv/config:
$max_load 2.0
$ideal_load 1.0

31. PBS for the SGI scheduler
• Qmgr tool:
s server managers = bob@n0.bar.com
create queue submit
s q submit queue_type = Execution
s q submit resources_max.ncpus = 4
s q submit resources_max.mem = 1gb
s q submit resources_default.mem = 256mb
s q submit resources_default.ncpus = 1
s q submit resources_default.nice = 15
s q submit enabled = True
s q submit started = True

32. PBS for the SGI scheduler
create queue hpc
s q hpc queue_type = Execution
s q hpc resources_max.ncpus = 2
s q hpc resources_max.mem = 512mb
s q hpc resources_default.mem = 256mb
s q hpc resources_default.ncpus = 1
s q hpc acl_groups = marley
s q hpc acl_group_enable = True
s q hpc enabled = True
s q hpc started = True

33. PBS for the SGI scheduler
• Server attributes:
s server default_queue = submit
s server acl_hosts = *.bar.com
s server acl_host_enable = True
s server scheduling = True
s server query_other_jobs = True

34. PBS for the FIFO scheduler
• The scheduler's file is sched_priv/sched_config instead of sched_priv/config, and queues are not defined there
• The submit queue is a Route queue:
s q submit queue_type = Route
s q submit route_destinations = hpc
s q submit route_destinations += back
• Server attributes:
s server resources_available.mem = 1gb
s server resources_available.ncpus = 4

35. PBS Job Scripts
• Job scripts contain PBS directives and shell commands:
#PBS -l ncpus=2
#PBS -l walltime=12:20:00
#PBS -m ae
#PBS -c c=30
cd ${PBS_O_WORKDIR}
mpirun -np 2 foo.x
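
In this sketch, -l requests resources (two CPUs and 12h20m of wall-clock time), -m ae asks PBS to mail the user when the job aborts or ends, and -c c=30 requests periodic checkpointing at an interval of roughly 30 minutes (checkpoint support is platform dependent); ${PBS_O_WORKDIR} is the directory from which qsub was invoked.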

36. Basic PBS commands
• Jobs are submitted with qsub:
% qsub [-q hpc] foo.pbs
13.node0.bar.com
• Job status is queried with qstat [-f|-a], which reports the job owner, name, queue, status, session ID, number of CPUs, and walltime:
% qstat -a 13
• Job attributes are altered with qalter:
% qalter -l walltime=20:00:00 13

37. Job Submission and Tracking
• Find jobs in state R (running) or submitted by user bob:
% qselect -s R
% qselect -u bob
• Query queue status to find whether the queue is enabled/started and the number of jobs in the queue:
% qstat [-f|-a] -Q
• Delete a job:
% qdel 13
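
Since qselect prints matching job IDs one per line, its output can feed other commands; a sketch (hypothetical job set) that deletes all of bob's running jobs:

% qdel `qselect -u bob -s R`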

38. Job Environment and I/O
• The job's current directory is the submitter's $HOME, which is also the default location for files created by the job; change it with cd in the script
• The standard output and error of the job are spooled to JobName.{o|e}JobID in the submitter's current directory; override this with:
#PBS -o pathname
#PBS -e pathname
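
For example, with hypothetical paths:

#PBS -o /home/bob/job1.out
#PBS -e /home/bob/job1.err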

39. Tips
• Trace the history of a job:
% tracejob 13
gives a time-stamped sequence of events affecting the job
• Use cron jobs to clean up daemon log files under mom_logs, sched_logs, and server_logs:
# crontab -e
9 2 * * 0 find /usr/spool/pbs/mom_logs -type f -mtime +7 -exec rm {} \;
9 2 * * 0 find /usr/spool/pbs/sched_logs -type f -mtime +7 -exec rm {} \;
9 2 * * 0 find /usr/spool/pbs/server_logs -type f -mtime +7 -exec rm {} \;

40. Sample PBS Front-End
[Diagram: a two-node setup (node0, node1) with an execution server running pbs_server, pbs_sched, and pbs_mom, and a submission server providing the client commands qsub, qdel, ...]

41. PBS for clusters
• File staging: copy files (other than stdout/stderr) between a submission-only host and the server (general form: -W stagein=execution_path@storage_host:storage_path):
#PBS -W stagein=/tmp/bar@n1:/home/bar/job1
#PBS -W stageout=/tmp/bar/job1/*@n1:/home/bar/job1
PBS uses the directory /tmp/bar/job1 as a scratch directory
• File staging may precede job start, which helps hide latencies

42. Setting up a PBS Cluster
• Assume n1 runs the pbs_mom daemon
• $PBS_SERVER_HOME/server_priv/nodes:
n0 np=2 gaussian
n1 np=2 irix
• n0:$PBS_SERVER_HOME/mom_priv/config:
$clienthost n1
$ideal_load 1.5
$max_load 2.0
• n1:$PBS_SERVER_HOME/mom_priv/config:
$ideal_load 1.5
$max_load 2.0

43. Setting up a PBS Cluster
• Qmgr tool:
s server managers = bob@n0.bar.com
create queue hpc
s q hpc queue_type = Execution
s q hpc Priority = 100
s q hpc resources_max.ncpus = 2
s q hpc resources_max.nodect = 1
s q hpc acl_groups = marley
s q hpc acl_group_enable = True

44. Setting up a PBS Cluster
• Server attributes:
s server default_node = n0
s server default_queue = hpc
s server acl_hosts = *.bar.com
s server acl_host_enable = True
s server resources_default.nodect = 1
s server resources_default.nodes = 1
s server resources_default.neednodes = 1
s server max_user_run = 2

45. PBS features
• The job submitter can request a number of nodes with given properties; the properties are those attached to hosts in the server_priv/nodes file (slide 42)
• For example:
  • request a node with the property gaussian:
    #PBS -l nodes=gaussian
  • request two nodes with the property irix:
    #PBS -l nodes=2:irix

46. PBS Security Features
• All files used by PBS are owned by root and can be written only by root
• The configuration files sched_priv/config and mom_priv/config are readable only by root
• $PBS_HOME/pbs_environment defines $PATH; it is writable only by root
• The pbs_mom daemon accepts connections only from a privileged port on localhost or from a host listed in mom_priv/config
• The server accepts commands only from selected hosts and users

47. Why preemptive scheduling?
• Resource reservation (CPU, memory) is needed to achieve high job throughput
• Static resource reservation may lead to low machine utilization and high job waiting times, and hence slow job turn-around
• An approach is needed that achieves both high job throughput and rapid job turn-around

48. Static Reservation Pitfall (1)
[Diagram: a parallel computer or cluster whose nodes (CPU + memory) are split by a partition boundary into a Physics Group partition and a Biotech Group partition, with job requests arriving at each partition]

49. Static Reservation Pitfall (2)
• The Physics Group's Job 1 is assigned 3 nodes and dispatched
• The Biotech Group's Job 2 is also dispatched, while Job 3 cannot execute before Job 2 finishes: only 1 node is available within the group's partition
• However, the machine as a whole has enough resources for Job 3

50. Proposed Approach (1)
• Leverage the features of the Portable Batch System (PBS)
• Extend PBS with preemptive job scheduling
• All queues but one have reserved resources (CPUs, memory) and hold jobs that cannot be preempted: these are the dedicated queues
• Define one queue for jobs that may be preempted: the background queue
