
An Overview of the Portable Batch System

Gabriel Mateescu

National Research Council Canada

IMSB

gabriel.mateescu@nrc.ca

www.sao.nrc.ca/~gabriel/presentations/sgi_pbs

Outline
  • PBS highlights
  • PBS components
  • Resources managed by PBS
  • Choosing a PBS scheduler
  • Installation and configuration of PBS
  • PBS scripts and commands
  • Adding preemptive job scheduling to PBS
PBS Highlights
  • Developed by Veridian / MRJ
  • Robust, portable, effective, extensible batch job queuing and resource management system
  • Supports different schedulers
  • Supports heterogeneous clusters
  • OpenPBS - open source version
  • PBS Pro - commercial version
Recent Versions of PBS
  • PBS 2.2, November 1999:
    • both the FIFO and SGI scheduler have bugs in enforcing resource limits
    • poor support for stopping & resuming jobs
  • OpenPBS 2.3, September 2000
    • better FIFO scheduler: resource limits enforced, backfilling added
  • PBS Pro 5.0, September 2000
    • claims support for job stopping/resuming, better scheduling, IRIX cpusets
Resources managed by PBS
  • PBS manages jobs, CPUs, memory, hosts and queues
  • PBS accepts batch jobs, enqueues them, runs the jobs, and delivers output back to the submitter
  • Resources - describe attributes of jobs, queues, and hosts
  • Scheduler - chooses the jobs that fit within queue and cluster resources
Main Components of PBS
  • Three daemons:
    • pbs_server - the server
    • pbs_sched - the scheduler
    • pbs_mom - the job executor & resource monitor
  • The server accepts commands and communicates with the daemons
    • qsub - submit a job
    • qstat - view queue and job status
    • qalter - change job’s attributes
    • qdel - delete a job
Batch Queuing

[Diagram: jobs from Queue A and Queue B are scheduled job-exclusively onto the nodes (CPUs + memory) of an SGI Origin system.]

Resource Examples
  • ncpus - number of CPUs per job
  • mem - resident memory per job
  • pmem - per-process memory
  • vmem - virtual memory per job
  • cput - CPU time per job
  • walltime - real time per job
  • file - file size per job
Resource limits
  • resources_max - per job limit for a resource; determines whether a job fits in a queue
  • resources_default - default amount of a resource assigned to a job
  • resources_available - advice to the scheduler on how much of a resource can be used by all running jobs
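The three limit types can be sketched with qmgr; the queue name workq and all values below are illustrative, not taken from the slides:

```text
set queue workq resources_max.ncpus = 8
set queue workq resources_default.ncpus = 1
set server resources_available.ncpus = 16
```

The first line caps any single job at 8 CPUs, the second assigns 1 CPU to jobs that request none, and the third advises the scheduler that at most 16 CPUs should be in use across all running jobs.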
Choosing a Scheduler (1)
  • FIFO scheduler:
    • First-fit placement: enqueues a job in the first queue whose limits the job satisfies, even if that queue has no free resources at the moment and another queue could run the job right away
    • Supports per job and (in version 2.3) per queue resource limits: ncpus, mem
    • Supports per server limits on the number of CPUs and on memory (based on the server attribute resources_available)
Choosing a Scheduler (2)
  • Algorithms in FIFO scheduler
    • FIFO - sort jobs by queuing time, running the earliest job first
    • Backfill - relax the FIFO rule for parallel jobs, as long as out-of-order jobs do not delay jobs that were submitted before them in FIFO order
    • Fair share: sort & schedule jobs based on past usage of the machine by the job owners
    • Round-robin - pick a job from each queue
    • By key - sort jobs by a set of keys: shortest_job_first, smallest_memory_first
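These policies are selected in the FIFO scheduler's sched_priv/sched_config file. A sketch of the relevant entries; the key names follow the sample sched_config shipped with OpenPBS 2.3, so verify against the file in your release:

```text
round_robin: False          all
by_queue: True              prime
strict_fifo: False          all
fair_share: False           all
help_starving_jobs: True    all
sort_by: shortest_job_first all
```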
Choosing a Scheduler (3)
  • FIFO scheduler supports round robin load balancing as of version 2.3
  • FIFO scheduler
    • allows the job's requirement on the number of CPUs to be decoupled from its requirement on the amount of memory
    • simple first-fit placement may force the user to specify an execution queue for a job that could fit in more than one queue
Choosing a Scheduler (4)
  • SGI scheduler
    • supports FIFO, fair share, backfilling, and attempts to avoid job starvation
    • supports both per job limits and per queue limits on number of CPUs, memory
    • per server limit is the number of node cards
    • makes a best effort in choosing a queue in which to run a job; a job for which there are not enough resources is kept in the submit queue
    • ties the number of CPUs allocated per job to the memory allocated per job
Resource allocation
  • SGI scheduler allocates nodes -

node = [ PE_PER_NODE cpus, MB_PER_NODE Mbyte ]

  • Number of nodes N for a job is such that

[ ncpus, mem ] <= [ N*PE_PER_NODE, N*MB_PER_NODE ]

where ncpus and mem are the job's CPU and memory limits, specified, e.g., with #PBS -l mem

  • Job attributes Resource_List.{ncpus, mem} set to

Resource_List.ncpus = N * PE_PER_NODE

Resource_List.mem = N * MB_PER_NODE
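The computation of N above can be sketched as a small POSIX shell function; the function name is ours, and PE_PER_NODE = 2, MB_PER_NODE = 512 MB follow the toolkit.h values shown later:

```shell
#!/bin/sh
# Sketch (not PBS code): smallest N with ncpus <= N*PE_PER_NODE
# and mem <= N*MB_PER_NODE.
PE_PER_NODE=2       # CPUs per Origin node
MB_PER_NODE=512     # Mbytes per Origin node

nodes_needed() {
    ncpus=$1
    mem_mb=$2
    # integer ceiling divisions for the CPU and memory requirements
    n_cpu=$(( (ncpus + PE_PER_NODE - 1) / PE_PER_NODE ))
    n_mem=$(( (mem_mb + MB_PER_NODE - 1) / MB_PER_NODE ))
    if [ "$n_cpu" -gt "$n_mem" ]; then echo "$n_cpu"; else echo "$n_mem"; fi
}

nodes_needed 3 600    # prints 2: ceil(3/2) = 2 nodes also cover 600 MB
```

With N = 2, the scheduler would then set Resource_List.ncpus = 4 and Resource_List.mem = 1024 MB for this job.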

Queue and Server Limits
  • FIFO scheduler:
    • per job limits (ncpus, mem) are defined by resources_max queue attributes
    • as of version 2.3, resources_max also defines per queue limits
    • per server resource limits enforced with resources_available attributes
Queue and Server Limits
  • SGI scheduler:
    • per job limits (ncpus, mem) are defined by resources_max queue attributes
    • resources_max also defines per queue limits
    • per server limit is given by the number of Origin node cards. Unlike the FIFO scheduler, resources_available limits are not enforced
Job Enqueuing (1)
  • The scheduler places each job in some queue
  • This involves several tests for resources
  • Which queue a job is enqueued into depends on
    • what limits are tested
    • first-fit versus best fit placement
  • A job can fit in a queue if the resources requested by the job do not exceed the maximum value of the resources defined for the queue. For example, for the resource ncpus

Resource_List.ncpus <= resources_max.ncpus

Job Enqueuing (2)
  • A job fits in a queue if the amount of resources assigned to the queue plus the requested resources do not exceed the maximum number of resources for the queue. For example, for ncpus

resources_assigned.ncpus + Resource_List.ncpus <= resources_max.ncpus

  • A job fits in the system if the sum of all assigned resources does not exceed the available resources. For example, for the ncpus resource,

Σ resources_assigned.ncpus + Resource_List.ncpus <= resources_available.ncpus
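The queue-level test can be sketched as a shell predicate (illustrative only; the names are ours, not PBS attributes):

```shell
#!/bin/sh
# Sketch: does a job requesting `requested` units of a resource fit in a
# queue that already has `assigned` units in use, given the queue's `max`?
queue_fits() {
    assigned=$1; requested=$2; max=$3
    [ $(( assigned + requested )) -le "$max" ]
}

queue_fits 3 2 4 || echo "does not fit"   # 3 + 2 > 4
queue_fits 1 2 4 && echo "fits"           # 1 + 2 <= 4
```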

First fit versus best fit
  • The FIFO scheduler finds the first queue where a job can fit and dispatches the job to that queue
    • if the job does not actually fit, it waits for the requested resources in the execution queue
  • The SGI scheduler keeps the job in the submit queue until it finds an execution queue where the job fits, then dispatches the job to that queue
  • If queues are defined to have monotonically increasing resource limits (e.g., CPU time), then first fit carries no penalty
  • However, if a job can fit in several queues, the SGI scheduler will find a better schedule
Limits on the number of running jobs
  • Per queue and per server limits on the number of running jobs:
    • max_running
    • max_user_run, max_group_run - max number of running jobs per user or group
  • Unlike the FIFO scheduler, the SGI scheduler enforces these limits only on a per queue basis
    • It enforces MAX_JOBS from the scheduler config file - substitute for max_running
SGI Origin Install (1)
  • Source files under OpenPBS_v2_3/src
  • Consider the SGI scheduler
  • Make sure the machine-dependent values defined in scheduler.cc/samples/sgi_origin/toolkit.h match the actual machine hardware

#define MB_PER_NODE ((size_t) 512*1024*1024)

#define PE_PER_NODE 2

  • May set PE_PER_NODE = 1 to allocate half-nodes, if MB_PER_NODE is set accordingly
SGI Origin Install (2)
  • Bug fixes in scheduler.cc/samples/sgi_origin/pack_queues.c
  • Operator precedence bug (line 198): the mask against QFLAGS_FULL must be parenthesized, since == binds tighter than &

for ( qptr = qlist; qptr != NULL; qptr = qptr->next ) {
    if ( ( qptr->queue->flags & QFLAGS_FULL ) == 0 ) {
        // bad operator precedence bypassed this call
        if ( !schd_evaluate_system(...) ) {
            // DONT_START_JOB (0), so don't change allfull
            continue;
        }
        // ...
    }
}

SGI Origin Install (3)
  • Fix of a logical bug in pack_queues.c: if a system limit is exceeded, the scheduler should not try to schedule the job

for ( qptr = qlist; qptr != NULL; qptr = qptr->next ) {
    if ( ( qptr->queue->flags & QFLAGS_FULL ) == 0 ) {
        if ( !schd_evaluate_system(...) ) {
            // DONT_START_JOB (0), so don't change allfull
            continue;
        }
        // ...
    }
}

for ( qptr = (allfull) ? NULL : qlist; qptr != NULL; qptr = qptr->next ) {
    // if allfull is set, do not attempt to schedule
}

SGI Origin Install (4)
  • Fix of a logical bug in user_limits.c, function user_running()
  • This function counts the number of running jobs, so it must test for equality between the job state and 'R'

user_running( ... )
{
    for ( job = queue->jobs; job != NULL; job = job->next ) {
        if ( ( job_state == 'R' ) && ( !strcmp(job->owner, user) ) )
            jobs_running++;
        // ...
    }
}

SGI Origin Install (5)
  • The limit ncpus is not enforced in the function mom_over_limit(), located in the file mom_mach.c under the directory src/resmom/irix6array

#define SGI_ZOMBIE_WRONG 1

int mom_over_limit( ... ) {
    // ...
#if !defined(SGI_ZOMBIE_WRONG)
    return (TRUE);
#endif
    // ...
}

SGI Origin Install (6)

Script to run the configure command

___________________________________________________

#!/bin/csh -f
set PBS_HOME=/usr/local/pbs
set PBS_SERVER_HOME=/usr/spool/pbs
# Select SGI or FIFO scheduler
set SCHED="--set-sched-code=sgi_origin --enable-nodemask"
#set SCHED="--set-sched-code=fifo --enable-nodemask"
$HOME/PBS/OpenPBS_v2_3/configure \
    --prefix=$PBS_HOME \
    --set-server-home=$PBS_SERVER_HOME \
    --set-cc=cc --set-cflags="-Dsgi -D_SGI_SOURCE -64 -g" \
    --set-sched=cc $SCHED --enable-array --enable-debug

SGI Origin Install (7)

___________________________________________________

# cd /usr/local/pbs
# makePBS            (the configure script from the previous slide)
# make
# make install
# cd /usr/spool/pbs

After installation, the sched_priv directory under /usr/spool/pbs holds the scheduler's config and decay_usage files.

Configuring for SGI scheduler
  • Queue types
    • one submit queue
    • one or several execution queues
  • Per server limit on the number of running jobs
  • Load Control
  • Fair share scheduling
    • Past usage of the machine used in ranking the jobs
    • Decayed past usage per user is kept in sched_priv/decay_usage
  • Scheduler restart action
  • PBS manager tool: qmgr
Queue definition
  • File sched_priv/config

SUBMIT_QUEUE            submit
BATCH_QUEUES            hpc,back
MAX_JOBS                256
ENFORCE_PRIME_TIME      False
ENFORCE_DEDICATED_TIME  False
SORT_BY_PAST_USAGE      True
DECAY_FACTOR            0.75
SCHED_ACCT_DIR          /usr/spool/pbs/server_priv/accounting
SCHED_RESTART_ACTION    RESUBMIT

Load Control
  • Load control for SGI scheduler - sched_priv/config

TARGET_LOAD_PCT 90%
TARGET_LOAD_VARIANCE -15%,+10%

  • Load control for FIFO scheduler - mom_priv/config

$max_load 2.0
$ideal_load 1.0

PBS for SGI scheduler
  • Qmgr tool

s server managers=bob@n0.bar.com
create queue submit
s q submit queue_type = Execution
s q submit resources_max.ncpus = 4
s q submit resources_max.mem = 1gb
s q submit resources_default.mem = 256mb
s q submit resources_default.ncpus = 1
s q submit resources_default.nice = 15
s q submit enabled = True
s q submit started = True

PBS for SGI scheduler

create queue hpc
s q hpc queue_type = Execution
s q hpc resources_max.ncpus = 2
s q hpc resources_max.mem = 512mb
s q hpc resources_default.mem = 256mb
s q hpc resources_default.ncpus = 1
s q hpc acl_groups = marley
s q hpc acl_group_enable = True
s q hpc enabled = True
s q hpc started = True

PBS for SGI scheduler
  • Server attributes

set server default_queue = submit
s server acl_hosts = *.bar.com
s server acl_host_enable = True
s server scheduling = True
s server query_other_jobs = True

PBS for FIFO scheduler
  • Uses the file sched_config instead of config; queues are not defined there
  • Submit queue is a Route queue

s q submit queue_type = Route
s q submit route_destinations = hpc
s q submit route_destinations += back

  • Server attributes

s server resources_available.mem = 1gb
s server resources_available.ncpus = 4

PBS Job Scripts
  • Job scripts contain PBS directives and shell commands

#PBS -l ncpus=2
#PBS -l walltime=12:20:00
#PBS -m ae
#PBS -c c=30
cd ${PBS_O_WORKDIR}
mpirun -np 2 foo.x

Basic PBS commands
  • Jobs are submitted with qsub

% qsub [-q hpc] foo.pbs
13.node0.bar.com

  • Job status is queried with qstat [-f|-a] to get the job owner, name, queue, status, session ID, # CPUs, walltime

% qstat -a 13

  • Alter job attributes

% qalter -l walltime=20:00:00 13

Job Submission and Tracking
  • Find jobs in status R (running) or submitted by user bob

% qselect -s R
% qselect -u bob

  • Query queue status to find whether a queue is enabled/started and the number of jobs in it

qstat [-f | -a] -Q

  • Delete a job: qdel 13
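qselect's output is designed to be fed to other commands; for example, the idiom from the qselect man page for deleting all of user bob's jobs (requires bob's or operator privileges, and a live PBS server):

```text
% qdel $(qselect -u bob)
```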
Job Environment and I/O
  • The job’s current directory is the submitter’s $HOME, which is also the default location for the files created by the job. Changed with cd in the script
  • The standard out and err of the job are spooled to JobName.{o|e}JobID in the submitter’s current directory. Override this with

#PBS -o | -e pathname

Tips
  • Trace the history of a job

% tracejob 13 - gives a time-stamped sequence of the events affecting job 13

  • Cron jobs for cleaning up daemon log files under mom_logs, sched_logs, server_logs
  • # crontab -e

9 2 * * 0 find /usr/spool/pbs/mom_logs -type f -mtime +7 -exec rm {} \;
9 2 * * 0 find /usr/spool/pbs/sched_logs -type f -mtime +7 -exec rm {} \;
9 2 * * 0 find /usr/spool/pbs/server_logs -type f -mtime +7 -exec rm {} \;

Sample PBS Front-End

[Diagram: node0, the execution server, runs the pbs_server, pbs_sched, and pbs_mom daemons; node1, the submission server, is where users run qsub, qdel, ...]

PBS for clusters
  • File staging - copy files (other than stdout/stderr) from a submission-only host to the server

#PBS -W stagein=/tmp/bar@n1:/home/bar/job1
#PBS -W stageout=/tmp/bar/job1/*@n1:/home/bar/job1

PBS uses the directory /tmp/bar/job1 as a scratch directory
  • File staging may precede job starting - helps in hiding latencies
Setting up a PBS Cluster
  • Assume n1 runs the pbs_mom daemon
  • $PBS_SERVER_HOME/server_priv/nodes

n0 np=2 gaussian
n1 np=2 irix

  • n0:$PBS_SERVER_HOME/mom_priv/config

$clienthost n1
$ideal_load 1.5
$max_load 2.0

  • n1:$PBS_SERVER_HOME/mom_priv/config

$ideal_load 1.5
$max_load 2.0

Setting up a PBS Cluster
  • Qmgr tool

s server managers=bob@n0.bar.com
create queue hpc
s q hpc queue_type = Execution
s q hpc Priority = 100
s q hpc resources_max.ncpus = 2
s q hpc resources_max.nodect = 1
s q hpc acl_groups = marley
s q hpc acl_group_enable = True

Setting up a PBS Cluster
  • Server attributes

set server default_node = n0
set server default_queue = hpc
s server acl_hosts = *.bar.com
s server acl_host_enable = True
s s resources_default.nodect = 1
s s resources_default.nodes = 1
s s resources_default.neednodes = 1
set server max_user_run = 2

PBS features
  • The job submitter can request a number of nodes with some properties
  • For example
    • request a node with the property gaussian:

#PBS -l nodes=gaussian

    • request two nodes with the property irix:

#PBS -l nodes=2:irix
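The same node requests can be made with the -l option on the qsub command line instead of inside the script:

```text
% qsub -l nodes=gaussian foo.pbs
% qsub -l nodes=2:irix foo.pbs
```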

PBS Security Features
  • All files used by PBS are owned by root and can be written only by root
  • Configuration files: sched_priv/config, mom_priv/config are readable only by root
  • $PBS_HOME/pbs_environment defines $PATH; it is writable only by root
  • pbs_mom daemon accepts connections from a privileged port on localhost or from a host listed in mom_priv/config
  • The server accepts commands from selected hosts and users
Why preemptive scheduling?
  • Resource reservation (CPU, memory) is needed to achieve high job throughput
  • Static resource reservation may lead to low machine utilization, high job waiting times, and hence slow job turn-around
  • An approach is needed to achieve both high job throughput and rapid job turn-around
Static Reservation Pitfall (1)

[Diagram: a parallel computer or cluster statically partitioned between the Physics group and the Biotech group; each node (CPU + memory) lies on one side of the partition boundary, and job requests arrive for each partition.]

Static Reservation Pitfall (2)
  • Physics Group’s Job 1 is assigned 3 nodes and dispatched
  • Biotech Group’s Job 2 is also dispatched, while Job 3 cannot execute before Job 2 finishes: there is only 1 node available for the group
  • However, there are enough resources for Job 3
Proposed Approach (1)
  • Leverage the features of the Portable Batch System (PBS)
  • Extend PBS with preemptive job scheduling
  • All queues but one have reserved resources (CPUs, memory) and hold jobs that cannot be preempted. These are the dedicated queues
  • Define a queue for jobs that may be preempted: the background queue
Proposed Approach (2)
  • Each user belongs to a group and each group is authorized to submit jobs to some dedicated queues as well as to the background queue
  • The sum of the resources defined for the dedicated queues does not exceed the machine resources
  • The resources assigned to jobs in a dedicated queue do not exceed the queue resource limits
Proposed Approach (3)
  • Jobs fitting in a dedicated queue are dispatched, observing job owner’s access rights
  • Jobs not fitting in a dedicated queue are dispatched to the background queue, if there are enough available resources in the system
  • Jobs in the background queue borrow resources from the dedicated queues
Proposed Approach (4)
  • If a job entering the system would fit in a dedicated queue provided resources lent to the background queue are reclaimed, job preemption is triggered
  • Jobs from the background queue will be held to release the resources needed by a dedicated queue
  • Held jobs are re-queued and will be dispatched along with the other pending jobs
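With stock PBS client commands, one way to realize this hold/release cycle (a sketch; the slides do not name the exact mechanism, and the job id is illustrative) is the qhold/qrls pair; on systems with checkpoint support, qhold checkpoints and re-queues a running job:

```text
% qhold 13     hold job 13 from the background queue, releasing its resources
% qrls 13      later, release the hold so the job is eligible for dispatch again
```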
Example (1)

Two queues, each with a 4-CPU capacity

Job   Queue     #CPU   Submit time   CPU time
1     Physics   1      0             4 h
2     Biotech   2      0             4 h
3     Physics   4      0             3 h
4     Biotech   2      2 h           1 h
5     Physics   2      2 h           1 h

Example (2)

Turn-around times

         with    without
Job 1    4 h     4 h
Job 2    4 h     4 h
Job 3    4 h     7 h
Job 4    3 h     3 h
Job 5    3 h     3 h

Job 3's turn-around drops from 7 h to 4 h, a 43 % reduction

Key Points
  • Provide guaranteed resources per user group and per job
  • Allow resources not used by the dedicated queues to be borrowed by the background queue
  • Provide a mechanism for reclaiming resources lent to the background queue
  • Achieve low job waiting time and high job throughput

Benefits of the Approach

  • Reduce job waiting time by harnessing resources not used by the dedicated queues
  • Reduce job wall-time by reserving resources for all the jobs
  • Pending jobs fitting in dedicated queues can reclaim resources from jobs that borrowed those resources and run in the background queue
For more information
  • Veridian web sites:

www.openpbs.org
www.pbspro.com

  • NRC - IMSB documentation and links:

www.sao.nrc.ca/~gabriel/pbs/pbs_user.html