
An Overview of the Portable Batch System

Gabriel Mateescu

National Research Council Canada

I M S B

[email protected]

www.sao.nrc.ca/~gabriel/presentations/sgi_pbs


Outline

  • PBS highlights

  • PBS components

  • Resources managed by PBS

  • Choosing a PBS scheduler

  • Installation and configuration of PBS

  • PBS scripts and commands

  • Adding preemptive job scheduling to PBS


PBS Highlights

  • Developed by Veridian / MRJ

  • Robust, portable, effective, extensible batch job queuing and resource management system

  • Supports different schedulers

  • Supports heterogeneous clusters

  • Open PBS - open source version

  • PBS Pro - commercial version


Recent Versions of PBS

  • PBS 2.2, November 1999:

    • both the FIFO and the SGI scheduler have bugs in enforcing resource limits

    • poor support for stopping & resuming jobs

  • OpenPBS 2.3, September 2000

    • better FIFO scheduler: resource limits enforced, backfilling added

  • PBS Pro 5.0, September 2000

    • claims support for job stopping/resuming, better scheduling, IRIX cpusets


Resources managed by PBS

  • PBS manages jobs, CPUs, memory, hosts and queues

  • PBS accepts batch jobs, enqueues them, runs the jobs, and delivers output back to the submitter

  • Resources - describe attributes of jobs, queues, and hosts

  • Scheduler - chooses the jobs that fit within queue and cluster resources


Main Components of PBS

  • Three daemons:

    • pbs_server - the server

    • pbs_sched - the scheduler

    • pbs_mom - the job executor & resource monitor

  • The server accepts commands and communicates with the daemons

    • qsub - submit a job

    • qstat - view queue and job status

    • qalter - change job’s attributes

    • qdel - delete a job


Batch Queuing

[Diagram: job-exclusive scheduling - queues A and B dispatch jobs onto the nodes (CPUs + memory) of an SGI Origin system]


Resource Examples

  • ncpus - number of CPUs per job

  • mem - resident memory per job

  • pmem - per-process memory

  • vmem - virtual memory per job

  • cput - CPU time per job

  • walltime - real (wall-clock) time per job

  • file - file size per job


Resource limits

  • resources_max - per job limit for a resource; determines whether a job fits in a queue

  • resources_default - default amount of a resource assigned to a job

  • resources_available - advice to the scheduler on how much of a resource can be used by all running jobs


Choosing a Scheduler (1)

  • FIFO scheduler:

    • First-fit placement: enqueues a job in the first queue whose limits admit it, even if the job does not currently fit there and another queue could run it immediately

    • Supports per job and (in version 2.3) per queue resource limits: ncpus, mem

    • Supports per server limits on the number of CPUs and on memory (based on the server attribute resources_available)


Choosing a Scheduler (2)

  • Algorithms in FIFO scheduler

    • FIFO - sort jobs by queuing time, running the earliest-submitted job first

    • Backfill - relax the FIFO rule for parallel jobs, as long as jobs run out of order do not delay jobs submitted before them

    • Fair share: sort & schedule jobs based on past usage of the machine by the job owners

    • Round-robin - pick a job from each queue

    • By key - sort jobs by a set of keys: shortest_job_first, smallest_memory_first
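
The policies above reduce to different sort keys over the waiting jobs. A minimal Python sketch (illustrative only - the job records and field names are hypothetical, not PBS data structures):

```python
# Each scheduling policy is just a different sort key over the waiting jobs.
jobs = [
    {"name": "a", "qtime": 3, "cput": 10, "mem": 512},
    {"name": "b", "qtime": 1, "cput": 30, "mem": 128},
    {"name": "c", "qtime": 2, "cput": 5,  "mem": 256},
]

fifo     = sorted(jobs, key=lambda j: j["qtime"])  # earliest submitted first
shortest = sorted(jobs, key=lambda j: j["cput"])   # shortest_job_first
smallest = sorted(jobs, key=lambda j: j["mem"])    # smallest_memory_first

print([j["name"] for j in fifo])      # ['b', 'c', 'a']
print([j["name"] for j in shortest])  # ['c', 'a', 'b']
print([j["name"] for j in smallest])  # ['b', 'c', 'a']
```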


Choosing a Scheduler (3)

  • FIFO scheduler supports round robin load balancing as of version 2.3

  • FIFO scheduler

    • allows decoupling the job’s requirement on the number of CPUs from its requirement on the amount of memory

    • simple first-fit placement may force the user to specify an execution queue for a job, even when the job could fit in more than one queue


Choosing a Scheduler (4)

  • SGI scheduler

    • supports FIFO, fair share, backfilling, and attempts to avoid job starvation

    • supports both per job limits and per queue limits on number of CPUs, memory

    • per server limit is the number of node cards

    • makes a best effort in choosing a queue where to run a job; a job for which there are not enough resources is kept in the submit queue

    • ties the number of CPUs allocated to the memory allocated per job


Resource allocation

  • SGI scheduler allocates nodes -

    node = [ PE_PER_NODE cpus, MB_PER_NODE Mbyte ]

  • Number of nodes N for a job is such that

    [ ncpus, mem] <= [ N*PE_PER_NODE, N* MB_PER_NODE ]

    where ncpus and mem are the job’s CPU and memory limits, specified, e.g., with #PBS -l mem

  • Job attributes Resource_List.{ncpus, mem} set to

    Resource_List.ncpus = N * PE_PER_NODE

    Resource_List.mem = N * MB_PER_NODE
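
The rounding above can be sketched in Python (an illustration, not PBS code; nodes_for is a hypothetical helper, and the constants are the toolkit.h defaults expressed in MB):

```python
import math

PE_PER_NODE = 2     # CPUs per Origin node (toolkit.h default)
MB_PER_NODE = 512   # MB of memory per Origin node

def nodes_for(ncpus, mem_mb):
    """Smallest N with ncpus <= N*PE_PER_NODE and mem_mb <= N*MB_PER_NODE."""
    return max(math.ceil(ncpus / PE_PER_NODE),
               math.ceil(mem_mb / MB_PER_NODE))

# A job asking for 3 CPUs and 600 MB is rounded up to 2 nodes, so its
# Resource_List becomes ncpus = 4, mem = 1024 MB:
n = nodes_for(3, 600)
print(n, n * PE_PER_NODE, n * MB_PER_NODE)  # 2 4 1024
```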


Queue and Server Limits

  • FIFO scheduler:

    • per job limits (ncpus, mem) are defined by resources_max queue attributes

    • as of version 2.3, resources_max also defines per queue limits

    • per server resource limits enforced with resources_available attributes


Queue and Server Limits

  • SGI scheduler:

    • per job limits (ncpus, mem) are defined by resources_max queue attributes

    • resources_max also defines per queue limits

    • per server limit is given by the number of Origin node cards. Unlike the FIFO scheduler, resources_available limits are not enforced


Job Enqueueing (1)

  • The scheduler places each job in some queue

  • This involves several tests for resources

  • Which queue a job is enqueued into depends on

    • what limits are tested

    • first-fit versus best fit placement

  • A job can fit in a queue if the resources requested by the job do not exceed the maximum value of the resources defined for the queue. For example, for the resource ncpus

    Resource_List.ncpus <= resources_max.ncpus


Job Enqueueing (2)

  • A job fits in a queue if the amount of resources assigned to the queue plus the requested resources do not exceed the maximum number of resources for the queue. For example, for ncpus

    resources_assigned.ncpus + Resource_List.ncpus <= resources_max.ncpus

  • A job fits in the system if the sum of all assigned resources does not exceed the available resources. For example, for the ncpus resource,

    Σ resources_assigned.ncpus + Resource_List.ncpus <= resources_available.ncpus
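
The three fit tests can be written as predicates. A Python sketch (the attribute names mirror the PBS ones, but the dictionary layout and function names are hypothetical):

```python
# Per-job limit: the request does not exceed the queue's resources_max.
def fits_queue_limit(job, queue):
    return job["Resource_List.ncpus"] <= queue["resources_max.ncpus"]

# Per-queue limit: already-assigned plus requested stays within resources_max.
def fits_queue_load(job, queue):
    return (queue["resources_assigned.ncpus"]
            + job["Resource_List.ncpus"]) <= queue["resources_max.ncpus"]

# Per-server limit: the sum over all queues stays within resources_available.
def fits_system(job, queues, server):
    assigned = sum(q["resources_assigned.ncpus"] for q in queues)
    return assigned + job["Resource_List.ncpus"] <= server["resources_available.ncpus"]

job    = {"Resource_List.ncpus": 2}
hpc    = {"resources_max.ncpus": 4, "resources_assigned.ncpus": 3}
server = {"resources_available.ncpus": 8}

print(fits_queue_limit(job, hpc))       # True  (2 <= 4)
print(fits_queue_load(job, hpc))        # False (3 + 2 > 4)
print(fits_system(job, [hpc], server))  # True  (3 + 2 <= 8)
```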


First fit versus best fit

  • The FIFO scheduler finds the first queue where a job can fit and dispatches the job to that queue

    • if the job does not actually fit, it will wait for the requested resources in the execution queue

  • The SGI scheduler keeps the job in the submit queue until it finds an execution queue where the job fits, then dispatches the job to that queue

  • If queues are defined to have monotonically increasing resource limits (e.g., CPU time), then first fit is not a penalty.

  • However, if a job can fit in several queues, then the SGI scheduler will find a better schedule


Limits on the number of running jobs

  • Per queue and per server limits on the number of running jobs:

    • max_running

    • max_user_run, max_group_run - maximum number of running jobs per user or per group

  • Unlike the FIFO scheduler, the SGI scheduler enforces these limits only on a per queue basis

    • It enforces MAX_JOBS from the scheduler config file - a substitute for max_running


SGI Origin Install (1)

  • Source files under OpenPBS_v2_3/src

  • Consider the SGI scheduler

  • Make sure the machine-dependent values defined in scheduler.cc/samples/sgi_origin/toolkit.h match the actual machine hardware

    #define MB_PER_NODE ((size_t) 512*1024*1024)

    #define PE_PER_NODE 2

  • May set PE_PER_NODE = 1 to allocate half-nodes if MB_PER_NODE is set accordingly


SGI Origin Install (2)

  • Bug fixes in scheduler.cc/samples/sgi_origin/pack_queues.c

  • Operator precedence bug (line 198):

    for ( qptr = qlist; qptr != NULL; qptr = qptr->next) {

    if ( ( qptr->queue->flags & QFLAGS_FULL ) == 0 ) {

    // bad operator precedence bypasses this test without the parentheses

    if ( !schd_evaluate_system(...) ) {

    // DONT_START_JOB (0) so don’t change allfull

    continue;

    }

    // ...

    }

    }


SGI Origin Install (3)

  • Fix of a logical bug in pack_queues.c: if a system limit is exceeded, the scheduler should not try to schedule the job

    for ( qptr = qlist; qptr != NULL; qptr = qptr->next) {

    if ( ( qptr->queue->flags & QFLAGS_FULL ) == 0 ) {

    if ( !schd_evaluate_system(...) ) {

    // DONT_START_JOB (0) so don’t change allfull

    continue;

    }

    // ...

    }

    }

    for ( qptr = (allfull) ? NULL : qlist; qptr != NULL; qptr = qptr->next ) {

    // if allfull is set, do not attempt to schedule

    }


SGI Origin Install (4)

  • Fix of a logical bug in user_limits.c, function user_running()

  • This function counts the number of running jobs, so it must test for equality between the job state and ‘R’

    user_running ( ... )

    {

    for ( job = queue->jobs; job != NULL; job = job->next) {

    if ( (job->state == 'R') && (!strcmp(job->owner, user)) )

    jobs_running++;

    // ...

    }

    }


SGI Origin Install (5)

  • The limit ncpus is not enforced in the function mom_over_limit(), located in the file mom_mach.c under the directory src/resmom/irix6array

    #define SGI_ZOMBIE_WRONG 1

    int mom_over_limit( ... ) {

    // ...

    #if !defined(SGI_ZOMBIE_WRONG)

    return (TRUE);

    #endif

    // ...

    }


SGI Origin Install (6)

Script to run the configure command

___________________________________________________

#!/bin/csh -f

set PBS_HOME=/usr/local/pbs

set PBS_SERVER_HOME=/usr/spool/pbs

# Select SGI or FIFO scheduler

set SCHED="--set-sched-code=sgi_origin --enable-nodemask"

#set SCHED="--set-sched-code=fifo --enable-nodemask"

$HOME/PBS/OpenPBS_v2_3/configure \

--prefix=$PBS_HOME \

--set-server-home=$PBS_SERVER_HOME \

--set-cc=cc --set-cflags="-Dsgi -D_SGI_SOURCE -64 -g" \

--set-sched=cc $SCHED --enable-array --enable-debug


SGI Origin Install (7)

___________________________________________________

# cd /usr/local/pbs

# makePBS (the configure script from the previous slide)

# make

# make install

# cd /usr/spool/pbs

The scheduler’s private directory sched_priv holds the files config and decay_usage


Configuring for SGI scheduler

  • Queue types

    • one submit queue

    • one or several execution queues

  • Per server limit on the number of running jobs

  • Load Control

  • Fair share scheduling

    • Past usage of the machine is used in ranking the jobs

    • Decayed past usage per user is kept in sched_priv/decay_usage

  • Scheduler restart action

  • PBS manager tool: qmgr


Queue definition

  • File sched_priv/config

    SUBMIT_QUEUE submit

    BATCH_QUEUES hpc,back

    MAX_JOBS 256

    ENFORCE_PRIME_TIME False

    ENFORCE_DEDICATED_TIME False

    SORT_BY_PAST_USAGE True

    DECAY_FACTOR 0.75

    SCHED_ACCT_DIR /usr/spool/pbs/server_priv/accounting

    SCHED_RESTART_ACTION RESUBMIT
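
A plausible reading of SORT_BY_PAST_USAGE with DECAY_FACTOR 0.75 is that each decay period the recorded usage is multiplied by the factor before new usage is added; the exact update rule is an assumption here, not taken from the PBS sources. A Python sketch:

```python
DECAY_FACTOR = 0.75  # from the config above

def decayed_usage(usage_per_period):
    """Fold a per-period usage history (oldest first) into one decayed total.
    NOTE: assumed update rule, for illustration only."""
    total = 0.0
    for used in usage_per_period:
        total = total * DECAY_FACTOR + used
    return total

# A user who consumed 100 CPU-hours in each of the last three periods is
# ranked by 231.25 decayed CPU-hours, not 300:
print(decayed_usage([100, 100, 100]))  # 231.25
```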


Load Control

  • Load control for SGI scheduler

    sched_priv/config

    TARGET_LOAD_PCT 90%

    TARGET_LOAD_VARIANCE -15%,+10%

  • Load Control for FIFO scheduler

    mom_priv/config

    $max_load 2.0

    $ideal_load 1.0


PBS for SGI Scheduler (1)

  • Qmgr tool

    s server [email protected]

    create queue submit

    s q submit queue_type = Execution

    s q submit resources_max.ncpus = 4

    s q submit resources_max.mem = 1gb

    s q submit resources_default.mem = 256mb

    s q submit resources_default.ncpus = 1

    s q submit resources_default.nice = 15

    s q submit enabled = True

    s q submit started = True


PBS for SGI Scheduler (2)

create queue hpc

s q hpc queue_type = Execution

s q hpc resources_max.ncpus = 2

s q hpc resources_max.mem = 512mb

s q hpc resources_default.mem = 256mb

s q hpc resources_default.ncpus = 1

s q hpc acl_groups = marley

s q hpc acl_group_enable = True

s q hpc enabled = True

s q hpc started = True


PBS for SGI Scheduler (3)

  • Server attributes

    set server default_queue = submit

    s server acl_hosts = *.bar.com

    s server acl_host_enable = True

    s server scheduling = True

    s server query_other_jobs = True


PBS for FIFO scheduler

  • The FIFO scheduler uses the file sched_config instead of config, and queues are not defined there

  • The submit queue is a Route queue

    s q submit queue_type = Route

    s q submit route_destinations = hpc

    s q submit route_destinations += back

  • Server attributes

    s server resources_available.mem = 1gb

    s server resources_available.ncpus = 4


PBS Job Scripts

  • Job scripts contain PBS directives and shell commands

    #PBS -l ncpus=2

    #PBS -l walltime=12:20:00

    #PBS -m ae

    #PBS -c c=30

    cd ${PBS_O_WORKDIR}

    mpirun -np 2 foo.x


Basic PBS commands

  • Jobs are submitted with qsub

    % qsub [-q hpc] foo.pbs

    13.node0.bar.com

  • Job status is queried with qstat [-f|-a] to get the job owner, name, queue, status, session ID, number of CPUs, and walltime

    % qstat -a 13

  • Alter job attributes

    % qalter -l walltime=20:00:00 13


Job Submission and Tracking

  • Find jobs in status R (running) or submitted by user bob

    % qselect -s R

    % qselect -u bob

  • Query queue status to find if the queue is enabled/started, and the number of jobs in the queue

    qstat -Q [-f]

  • Delete a job: qdel 13


Job Environment and I/O

  • The job’s current directory is the submitter’s $HOME, which is also the default location for the files created by the job; change it with cd in the script

  • The standard out and err of the job are spooled to JobName.{o|e}JobID in the submitter’s current directory. Override this with

    #PBS -o pathname or #PBS -e pathname


Tips

  • Trace the history of a job

    % tracejob 13 - gives a time-stamped sequence of the events affecting a job

  • Cron jobs for cleaning up daemon work files under mom_logs, sched_logs, server_logs

  • #crontab -e

    9 2 * * 0 find /usr/spool/pbs/mom_logs -type f -mtime +7 -exec rm {} \;

    9 2 * * 0 find /usr/spool/pbs/sched_logs -type f -mtime +7 -exec rm {} \;

    9 2 * * 0 find /usr/spool/pbs/server_logs -type f -mtime +7 -exec rm {} \;


Sample PBS Front-End

[Diagram: two hosts, node0 and node1 - an execution server running pbs_server, pbs_sched, and pbs_mom, and a submission front-end providing qsub, qdel, ...]


PBS for clusters

  • File staging - copy files (other than stdout/stderr) from a submission-only host to the server

    #PBS -W stagein=/tmp/bar/job1@n0:/home/bar/job1

    #PBS -W stageout=/tmp/bar/job1@n0:/home/bar/job1

    PBS uses the directory /tmp/bar/job1 as a scratch directory

  • File staging may precede job starting - helps in hiding latencies


Setting up a PBS Cluster (1)

  • Assume n1 runs the pbs_mom daemon

  • $PBS_SERVER_HOME/server_priv/nodes

    n0 np=2 gaussian

    n1 np=2 irix

  • n0:$PBS_SERVER_HOME/mom_priv/config

    $clienthost n1

    $ideal_load 1.5

    $max_load 2.0

  • n1:$PBS_SERVER_HOME/mom_priv/config

    $ideal_load 1.5

    $max_load 2.0


Setting up a PBS Cluster (2)

  • Qmgr tool

    s server [email protected]

    create queue hpc

    s q hpc queue_type = Execution

    s q hpc Priority = 100

    s q hpc resources_max.ncpus = 2

    s q hpc resources_max.nodect = 1

    s q hpc acl_groups = marley

    s q hpc acl_group_enable = True


Setting up a PBS Cluster (3)

  • Server attributes

    set server default_node = n0

    set server default_queue = hpc

    s server acl_hosts = *.bar.com

    s server acl_host_enable = True

    s s resources_default.nodect = 1

    s s resources_default.nodes = 1

    s s resources_default.neednodes = 1

    set server max_user_run = 2


PBS features

  • The job submitter can request a number of nodes with some properties

  • For example

    • request a node with the property gaussian:

      #PBS -l nodes=gaussian

    • request two nodes with the property irix

      #PBS -l nodes=2:irix


PBS Security Features

  • All files used by PBS are owned by root and can be written only by root

  • Configuration files: sched_priv/config, mom_priv/config are readable only by root

  • $PBS_HOME/pbs_environment defines $PATH; it is writable only by root

  • pbs_mom daemon accepts connections from a privileged port on localhost or from a host listed in mom_priv/config

  • The server accepts commands from selected hosts and users


Why preemptive scheduling?

  • Resource reservation (CPU, memory) is needed to achieve high job throughput

  • Static resource reservation may lead to low machine utilization, high job waiting times, and hence slow job turn-around

  • An approach is needed to achieve both high job throughput and rapid job turn-around


Static Reservation Pitfall (1)

[Diagram: a parallel computer or cluster whose nodes (CPU + memory) are statically partitioned between a Physics group and a Biotech group; job requests arrive on both sides of the partition boundary]


Static Reservation Pitfall (2)

  • Physics Group’s Job 1 is assigned 3 nodes and dispatched

  • Biotech Group’s Job 2 is also dispatched, while Job 3 cannot execute before Job 2 finishes: there is only 1 node available for the group

  • However, there are enough resources for Job 3
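
The arithmetic of the pitfall can be made concrete with assumed sizes (the partition sizes and node requests below are hypothetical, chosen to match the description):

```python
physics_nodes, biotech_nodes = 4, 4  # assumed static partition sizes
job1, job2, job3 = 3, 3, 2           # assumed node requests

physics_free = physics_nodes - job1  # 1 node left in the Physics partition
biotech_free = biotech_nodes - job2  # 1 node left in the Biotech partition

print(job3 <= biotech_free)                 # False: Job 3 blocked by its partition
print(job3 <= biotech_free + physics_free)  # True: the machine as a whole could run it
```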


Proposed Approach (1)

  • Leverage the features of the Portable Batch System (PBS)

  • Extend PBS with preemptive job scheduling

  • All queues but one have reserved resources (CPUs, memory) and hold jobs that cannot be preempted. These are the dedicated queues

  • Define a queue for jobs that may be preempted: the background queue


Proposed Approach (2)

  • Each user belongs to a group and each group is authorized to submit jobs to some dedicated queues as well as to the background queue

  • The sum of the resources defined for the dedicated queues does not exceed the machine resources

  • The resources assigned to jobs in a dedicated queue do not exceed the queue resource limits


Proposed Approach (3)

  • Jobs fitting in a dedicated queue are dispatched, observing job owner’s access rights

  • Jobs not fitting in a dedicated queue are dispatched to the background queue, if there are enough available resources in the system

  • Jobs in the background queue borrow resources from the dedicated queues


Proposed Approach (4)

  • If a job entering the system would fit in a dedicated queue provided that the resources lent to the background queue are reclaimed, job preemption is triggered

  • Jobs from the background queue will be held to release the resources needed by a dedicated queue

  • Held jobs are re-queued and will be dispatched along with the other pending jobs
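
The reclaim step above can be sketched as follows (an illustrative Python sketch of the proposed policy, not an implementation; all names are hypothetical):

```python
def jobs_to_hold(needed_cpus, free_cpus, background_jobs):
    """Pick background jobs to hold until needed_cpus CPUs are free.
    background_jobs is a list of (name, ncpus); returns None if even
    holding every background job cannot free enough CPUs."""
    held = []
    for name, ncpus in background_jobs:
        if free_cpus >= needed_cpus:
            break
        held.append(name)    # holding a job releases its borrowed CPUs
        free_cpus += ncpus
    return held if free_cpus >= needed_cpus else None

# 1 CPU free; a 4-CPU dedicated-queue job arrives; two 2-CPU background
# jobs are held (and later re-queued) to release the lent resources:
print(jobs_to_hold(4, 1, [("bg1", 2), ("bg2", 2)]))  # ['bg1', 'bg2']
```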


Example (1)

Two queues, each with a capacity of 4 CPUs

Job   Queue     #CPUs   Submit time   CPU time
_________________________________
1     Physics   1       0             4 h
2     Biotech   2       0             4 h
3     Physics   4       0             3 h
4     Biotech   2       2 h           1 h
5     Physics   2       2 h           1 h


Example (2)

Turn-around times (with / without preemption)

Job 1   4 h / 4 h
Job 2   4 h / 4 h
Job 3   4 h / 7 h
Job 4   3 h / 3 h
Job 5   3 h / 3 h

75 % reduction in waiting time for Job 3 (1 h with preemption instead of 4 h without)


Key Points

  • Provide guaranteed resources per user group and per job

  • Allow resources not used by the dedicated queues to be borrowed by the background queue

  • Provide a mechanism for reclaiming resources lent to the background queue

  • Achieve low job waiting time and high job throughput


Benefits of the Approach

  • Reduce job waiting time by harnessing resources not used by the dedicated queues

  • Reduce job wall-time by reserving resources for all the jobs

  • Pending jobs that fit in dedicated queues can reclaim resources from background-queue jobs that borrowed them


For more information

  • Veridian web site:

    www.openpbs.org

    www.pbspro.com

  • NRC - IMSB documentation and links

    www.sao.nrc.ca/~gabriel/pbs/pbs_user.html

