
An Overview of the Portable Batch System

Gabriel Mateescu

National Research Council Canada

I M S B

[email protected]

www.sao.nrc.ca/~gabriel/presentations/sgi_pbs

Outline
  • PBS highlights
  • PBS components
  • Resources managed by PBS
  • Choosing a PBS scheduler
  • Installation and configuration of PBS
  • PBS scripts and commands
  • Adding preemptive job scheduling to PBS
PBS Highlights
  • Developed by Veridian / MRJ
  • Robust, portable, effective, extensible batch job queuing and resource management system
  • Supports different schedulers
  • Supports heterogeneous clusters
  • Open PBS - open source version
  • PBS Pro - commercial version
Recent Versions of PBS
  • PBS 2.2, November 1999:
    • both the FIFO and SGI scheduler have bugs in enforcing resource limits
    • poor support for stopping & resuming jobs
  • OpenPBS 2.3, September 2000
    • better FIFO scheduler: resource limits enforced, backfilling added
  • PBS Pro 5.0, September 2000
    • claims support for job stopping/resuming, better scheduling, IRIX cpusets
Resources managed by PBS
  • PBS manages jobs, CPUs, memory, hosts and queues
  • PBS accepts batch jobs, enqueues them, runs the jobs, and delivers output back to the submitter
  • Resources - describe attributes of jobs, queues, and hosts
  • Scheduler - chooses the jobs that fit within queue and cluster resources
Main Components of PBS
  • Three daemons:
    • pbs_server - the server
    • pbs_sched - the scheduler
    • pbs_mom - the job executor & resource monitor
  • The server accepts commands and communicates with the daemons
    • qsub - submit a job
    • qstat - view queue and job status
    • qalter - change job’s attributes
    • qdel - delete a job
Batch Queuing

[Diagram: batch queuing with job-exclusive scheduling - jobs flow through Queue A and Queue B onto an SGI Origin system of nodes (CPUs + memory)]

Resource Examples
  • ncpus - number of CPUs per job
  • mem - resident memory per job
  • pmem - per-process memory
  • vmem - virtual memory per job
  • cput - CPU time per job
  • walltime - real time per job
  • file - file size per job
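In a job script, these resources are requested with #PBS -l directives; a brief sketch (the values shown are illustrative only):

```
#PBS -l ncpus=4
#PBS -l mem=512mb,vmem=1gb
#PBS -l cput=2:00:00,walltime=4:00:00
#PBS -l file=1gb
```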
Resource limits
  • resources_max - per job limit for a resource; determines whether a job fits in a queue
  • resources_default - default amount of a resource assigned to a job
  • resources_available - advice to the scheduler on how much of a resource can be used by all running jobs
Choosing a Scheduler (1)
  • FIFO scheduler:
    • First-fit placement: dispatches a job to the first queue where it may fit, even if it does not currently fit there and another queue could run it immediately
    • Supports per job and (in version 2.3) per queue resource limits: ncpus, mem
    • Supports per-server limits on the number of CPUs and memory (based on the server attribute resources_available)
Choosing a Scheduler (2)
  • Algorithms in FIFO scheduler
    • FIFO - sort jobs by queuing time, running the earliest job first
    • Backfill - relax the FIFO rule for parallel jobs, as long as out-of-order jobs do not delay jobs that precede them in FIFO order
    • Fair share - sort & schedule jobs based on the owners' past usage of the machine
    • Round-robin - pick a job from each queue in turn
    • By key - sort jobs by a set of keys: shortest_job_first, smallest_memory_first
Choosing a Scheduler (3)
  • FIFO scheduler supports round robin load balancing as of version 2.3
  • FIFO scheduler
    • decouples a job's requirement on the number of CPUs from its requirement on the amount of memory
    • simple first-fit placement may force the user to specify an execution queue for a job that could fit in more than one queue
Choosing a Scheduler (4)
  • SGI scheduler
    • supports FIFO, fair share, backfilling, and attempts to avoid job starvation
    • supports both per job limits and per queue limits on number of CPUs, memory
    • per server limit is the number of node cards
    • makes a best effort in choosing a queue in which to run a job; a job without enough resources to run is kept in the submit queue
    • ties the number of CPUs allocated to the memory allocated per job
Resource allocation
  • SGI scheduler allocates nodes -

node = [ PE_PER_NODE cpus, MB_PER_NODE Mbyte ]

  • Number of nodes N for a job is such that

[ ncpus, mem] <= [ N*PE_PER_NODE, N* MB_PER_NODE ]

where ncpus and mem are the job's CPU and memory limits, specified, e.g., with #PBS -l mem

  • Job attributes Resource_List.{ncpus, mem} set to

Resource_List.ncpus = N * PE_PER_NODE

Resource_List.mem = N * MB_PER_NODE

Queue and Server Limits
  • FIFO scheduler:
    • per job limits (ncpus, mem) are defined by resources_max queue attributes
    • as of version 2.3, resources_max also defines per queue limits
    • per server resource limits enforced with resources_available attributes
Queue and Server Limits
  • SGI scheduler:
    • per job limits (ncpus, mem) are defined by resources_max queue attributes
    • resources_max also defines per queue limits
    • per server limit is given by the number of Origin node cards. Unlike the FIFO scheduler, resources_available limits are not enforced
Job Enqueuing (1)
  • The scheduler places each job in some queue
  • This involves several tests for resources
  • Which queue a job is enqueued into depends on
    • what limits are tested
    • first-fit versus best fit placement
  • A job can fit in a queue if the resources requested by the job do not exceed the maximum value of the resources defined for the queue. For example, for the resource ncpus

Resource_List.ncpus <= resources_max.ncpus

Job Enqueuing (2)
  • A job fits in a queue if the amount of resources assigned to the queue plus the requested resources do not exceed the maximum number of resources for the queue. For example, for ncpus

resources_assigned.ncpus + Resource_List.ncpus <= resources_max.ncpus

  • A job fits in the system if the sum of all assigned resources does not exceed the available resources. For example, for the ncpus resource,

Σ resources_assigned.ncpus + Resource_List.ncpus <= resources_available.ncpus

First fit versus best fit
  • The FIFO scheduler finds the first queue where a job can fit and dispatches the job to that queue
    • if the job does not currently fit, it waits for the requested resources in the execution queue
  • The SGI scheduler keeps the job in the submit queue until it finds an execution queue where the job fits then dispatches the job to that queue
  • If queues are defined to have monotonically increasing resource limits (e.g., CPU time), then first fit is not a penalty.
  • However, if a job can fit in several queues, the SGI scheduler will find a better schedule
Limits on the number of running jobs
  • Per queue and per server limits on the number of running jobs:
    • max_running
    • max_user_run, max_group_run - maximum number of running jobs per user or group
  • Unlike the FIFO scheduler, the SGI scheduler enforces these limits only on a per queue basis
    • It enforces MAX_JOBS from the scheduler config file - substitute for max_running
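With qmgr, these limits might be set per queue as follows (the queue name and values are hypothetical):

```
s q hpc max_running = 10
s q hpc max_user_run = 2
s q hpc max_group_run = 4
```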
SGI Origin Install (1)
  • Source files under OpenPBS_v2_3/src
  • Consider the SGI scheduler
  • Make sure the machine-dependent values defined in scheduler.cc/samples/sgi_origin/toolkit.h match the actual machine hardware

#define MB_PER_NODE ((size_t) 512*1024*1024)

#define PE_PER_NODE 2

  • May set PE_PER_NODE = 1 to allocate half-nodes if MB_PER_NODE is set accordingly
SGI Origin Install (2)
  • Bug fixes in scheduler.cc/samples/sgi_origin/pack_queues.c
  • Operator precedence bug (line 198):

for ( qptr = qlist; qptr != NULL; qptr = qptr->next ) {

    if ( ( qptr->queue->flags & QFLAGS_FULL ) == 0 ) {

        // without the inner parentheses, bad operator precedence bypasses this test

        if ( !schd_evaluate_system(...) ) {

            // DONT_START_JOB (0), so don't change allfull

            continue;

        }

        // ...

    }

}

SGI Origin Install (3)
  • Fix of a logical bug in pack_queues.c: if a system limit is exceeded should not try to schedule the job

for ( qptr = qlist; qptr != NULL; qptr = qptr->next ) {

    if ( ( qptr->queue->flags & QFLAGS_FULL ) == 0 ) {

        if ( !schd_evaluate_system(...) ) {

            // DONT_START_JOB (0), so don't change allfull

            continue;

        }

        // ...

    }

}

for ( qptr = (allfull) ? NULL : qlist; qptr != NULL; qptr = qptr->next ) {

    // if allfull is set, do not attempt to schedule

}

SGI Origin Install (4)
  • Fix of a logical bug in user_limits.c, function user_running()
  • This function counts the number of running jobs, so it must test for equality between the job state and 'R'

user_running( ... )

{

    for ( job = queue->jobs; job != NULL; job = job->next ) {

        if ( (job_state == 'R') && (!strcmp(job->owner, user)) )

            jobs_running++;

        // ...

    }

}

SGI Origin Install (5)
  • The limit ncpus is not enforced in the function mom_over_limit(), located in the file mom_mach.c under the directory src/resmom/irix6array

#define SGI_ZOMBIE_WRONG 1

int mom_over_limit( ... ) {

// ...

#if !defined(SGI_ZOMBIE_WRONG)

return (TRUE);

#endif

// ...

}

SGI Origin Install (6)

Script to run the configure command

___________________________________________________

#!/bin/csh -f

set PBS_HOME=/usr/local/pbs

set PBS_SERVER_HOME=/usr/spool/pbs

# Select SGI or FIFO scheduler

set SCHED="--set-sched-code=sgi_origin --enable-nodemask"

#set SCHED="--set-sched-code=fifo --enable-nodemask"

$HOME/PBS/OpenPBS_v2_3/configure \

--prefix=$PBS_HOME \

--set-server-home=$PBS_SERVER_HOME \

--set-cc=cc --set-cflags="-Dsgi -D_SGI_SOURCE -64 -g" \

--set-sched=cc $SCHED --enable-array --enable-debug

SGI Origin Install (7)

___________________________________________________

# cd /usr/local/pbs

# makePBS        (the configure script from the previous slide)

# make

# make install

# cd /usr/spool/pbs

sched_priv/ holds the scheduler files: config, decay_usage

Configuring for SGI scheduler
  • Queue types
    • one submit queue
    • one or several execution queues
  • Per server limit on the number of running jobs
  • Load Control
  • Fair share scheduling
    • Past usage of the machine used in ranking the jobs
    • Decayed past usage per user is kept in sched_priv/decay_usage
  • Scheduler restart action
  • PBS manager tool: qmgr
Queue definition
  • File sched_priv/config

SUBMIT_QUEUE submit

BATCH_QUEUES hpc,back

MAX_JOBS 256

ENFORCE_PRIME_TIME False

ENFORCE_DEDICATED_TIME False

SORT_BY_PAST_USAGE True

DECAY_FACTOR 0.75

SCHED_ACCT_DIR /usr/spool/pbs/server_priv/accounting

SCHED_RESTART_ACTION RESUBMIT

Load Control
  • Load control for SGI scheduler

sched_priv/config

TARGET_LOAD_PCT 90%

TARGET_LOAD_VARIANCE -15%,+10%

  • Load Control for FIFO scheduler

mom_priv/config

$max_load 2.0

$ideal_load 1.0

PBS for SGI scheduler
  • Qmgr tool

s server [email protected]

create queue submit

s q submit queue_type = Execution

s q submit resources_max.ncpus = 4

s q submit resources_max.mem = 1gb

s q submit resources_default.mem = 256mb

s q submit resources_default.ncpus = 1

s q submit resources_default.nice = 15

s q submit enabled = True

s q submit started = True

PBS for SGI scheduler

create queue hpc

s q hpc queue_type = Execution

s q hpc resources_max.ncpus = 2

s q hpc resources_max.mem = 512mb

s q hpc resources_default.mem = 256mb

s q hpc resources_default.ncpus = 1

s q hpc acl_groups = marley

s q hpc acl_group_enable = True

s q hpc enabled = True

s q hpc started = True

PBS for SGI scheduler
  • Server attributes

set server default_queue = submit

s server acl_hosts = *.bar.com

s server acl_host_enable = True

s server scheduling = True

s server query_other_jobs = True

PBS for FIFO scheduler
  • The FIFO scheduler uses the file sched_config instead of config; queues are not defined there
  • Submit queue is Route queue

s q submit queue_type = Route

s q submit route_destinations = hpc

s q submit route_destinations += back

  • Server attributes

s server resources_available.mem = 1gb

s server resources_available.ncpus = 4

PBS Job Scripts
  • Job scripts contain PBS directives and shell commands

#PBS -l ncpus=2

#PBS -l walltime=12:20:00

#PBS -m ae

#PBS -c c=30

cd ${PBS_O_WORKDIR}

mpirun -np 2 foo.x

Basic PBS commands
  • Jobs are submitted with qsub

% qsub [-q hpc] foo.pbs

13.node0.bar.com

  • Job status is queried with qstat [-f|-a] to get job owner, name, queue, status, session ID, # CPUs, walltime

% qstat -a 13

  • Alter job attributes

% qalter -l walltime=20:00:00 13

Job Submission and Tracking
  • Find jobs in status R (running) or submitted by user bob

% qselect -s R

% qselect -u bob

  • Query queue status to find if the queue is enabled/started, and the number of jobs in the queue

qstat [-f | -a ] -Q

  • Delete a job: qdel 13
Job Environment and I/O
  • The job's current directory is the submitter's $HOME, which is also the default location for files created by the job; change it with cd in the script
  • The standard out and err of the job are spooled to JobName.{o|e}JobID in the submitter’s current directory. Override this with

#PBS -o | -e pathname

Tips
  • Trace the history of a job

% tracejob 13 - gives a time-stamped sequence of events affecting a job

  • Cron jobs for cleaning up daemon work files under mom_logs, sched_logs, server_logs
  • #crontab -e

9 2 * * 0 find /usr/spool/pbs/mom_logs -type f -mtime +7 -exec rm {} \;

9 2 * * 0 find /usr/spool/pbs/sched_logs -type f -mtime +7 -exec rm {} \;

9 2 * * 0 find /usr/spool/pbs/server_logs -type f -mtime +7 -exec rm {} \;

Sample PBS Front-End

[Diagram: a submission-only front end (qsub, qdel, ...) and an execution server running pbs_server, pbs_sched, and pbs_mom, on nodes node0 and node1]

PBS for clusters
  • File staging - copy files (other than stdout/stderr) from a submission-only host to the server

#PBS -W stagein=/tmp/bar/job1@n1:/home/bar/job1

#PBS -W stageout=/tmp/bar/job1/*@n1:/home/bar/job1

PBS uses the directory /tmp/bar/job1 as a scratch directory

  • File staging may precede job start, which helps hide latencies
Setting up a PBS Cluster
  • Assume n1 runs the pbs_mom daemon
  • $PBS_SERVER_HOME/server_priv/nodes

n0 np=2 gaussian

n1 np=2 irix

  • n0:$PBS_SERVER_HOME/mom_priv/config

$clienthost n1

$ideal_load 1.5

$max_load 2.0

  • n1:$PBS_SERVER_HOME/mom_priv/config

$ideal_load 1.5

$max_load 2.0

Setting up a PBS Cluster
  • Qmgr tool

s server [email protected]

create queue hpc

s q hpc queue_type = Execution

s q hpc Priority = 100

s q hpc resources_max.ncpus = 2

s q hpc resources_max.nodect = 1

s q hpc acl_groups = marley

s q hpc acl_group_enable = True

Setting up a PBS Cluster
  • Server attributes

set server default_node = n0

set server default_queue = hpc

s server acl_hosts = *.bar.com

s server acl_host_enable = True

s s resources_default.nodect = 1

s s resources_default.nodes = 1

s s resources_default.neednodes = 1

set server max_user_run = 2

PBS features
  • The job submitter can request a number of nodes with some properties
  • For example
    • request a node with the property gaussian:

#PBS -l nodes=gaussian

    • request two nodes with the property irix

#PBS -l nodes=2:irix

PBS Security Features
  • All files used by PBS are owned by root and can be written only by root
  • Configuration files: sched_priv/config, mom_priv/config are readable only by root
  • $PBS_HOME/pbs_environment defines $PATH; it is writable only by root
  • pbs_mom daemon accepts connections from a privileged port on localhost or from a host listed in mom_priv/config
  • The server accepts commands from selected hosts and users
Why preemptive scheduling?
  • Resource reservation (CPU, memory) is needed to achieve high job throughput
  • Static resource reservation may lead to low machine utilization, high job waiting times, and hence slow job turn-around
  • An approach is needed to achieve both high job throughput and rapid job turn-around
Static Reservation Pitfall (1)

[Diagram: a parallel computer or cluster of nodes (CPU + memory), split by a static partition boundary between the Physics group and the Biotech group, each receiving its own job requests]

Static Reservation Pitfall (2)
  • Physics Group’s Job 1 is assigned 3 nodes and dispatched
  • Biotech Group’s Job 2 is also dispatched, while Job 3 cannot execute before Job 2 finishes: there is only 1 node available for the group
  • However, there are enough resources for Job 3
Proposed Approach (1)
  • Leverage the features of the Portable Batch System (PBS)
  • Extend PBS with preemptive job scheduling
  • All queues but one have reserved resources (CPUs, memory) and hold jobs that cannot be preempted. These are the dedicated queues
  • Define a queue for jobs that may be preempted: the background queue
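Such a background queue could be created with qmgr along these lines (the queue name, priority, and limits are hypothetical):

```
create queue back
s q back queue_type = Execution
s q back Priority = 1
s q back resources_max.ncpus = 4
s q back enabled = True
s q back started = True
```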
Proposed Approach (2)
  • Each user belongs to a group and each group is authorized to submit jobs to some dedicated queues as well as to the background queue
  • The sum of the resources defined for the dedicated queues does not exceed the machine resources
  • The resources assigned to jobs in a dedicated queue do not exceed the queue resource limits
Proposed Approach (3)
  • Jobs fitting in a dedicated queue are dispatched, observing job owner’s access rights
  • Jobs not fitting in a dedicated queue are dispatched to the background queue, if there are enough available resources in the system
  • Jobs in the background queue borrow resources from the dedicated queues
Proposed Approach (4)
  • If a job entering the system would fit in a dedicated queue once the resources lent to the background queue are reclaimed, job preemption is triggered
  • Jobs from the background queue will be held to release the resources needed by a dedicated queue
  • Held jobs are re-queued and will be dispatched along with the other pending jobs
Example (1)

Two queues, each with a capacity of 4 CPUs

Job   Queue     #CPU   Submit time   CPU time
_____________________________________________
1     Physics    1      0             4 h
2     Biotech    2      0             4 h
3     Physics    4      0             3 h
4     Biotech    2      2 h           1 h
5     Physics    2      2 h           1 h

Example (2)

Turn-around times (with vs. without preemption)

        with    without
Job 1   4 h     4 h
Job 2   4 h     4 h
Job 3   4 h     7 h
Job 4   3 h     3 h
Job 5   3 h     3 h

43% reduction (from 7 h to 4 h) for Job 3

Key Points
  • Provide guaranteed resources per user group and per job
  • Allow resources not used by the dedicated queues to be borrowed by the background queue
  • Provide a mechanism for reclaiming resources lent to the background queue
  • Achieve low job waiting time and high job throughput

Benefits of the Approach

  • Reduce job waiting time by harnessing resources not used by the dedicated queues
  • Reduce job wall-time by reserving resources for all the jobs
  • Pending jobs fitting in dedicated queues can reclaim resources from jobs that borrowed those resources and run in the background queue
For more information
  • Veridian web site:

www.openpbs.org

www.pbspro.com

  • NRC - IMSB documentation and links

www.sao.nrc.ca/~gabriel/pbs/pbs_user.html
