SLURM for Yorktown Bluegene/Q

SLURM on Wat2q
• Goals
  • Set up a scheduler for the Yorktown Bluegene system to increase research utilization of the system.
  • Become familiar with the Bluegene/Q SRM (system resource manager) interfaces, as they are a model for future HPC control APIs.
• Divide the Yorktown system into multiple submidplane blocks.
• Develop scripts to allow users (optionally) to land on a specific submidplane block.
• Get SLURM to run the bgas.pl script automatically based on information in the SLURM sbatch command used to queue a job.
  • This requires that jobs be limited to running on complete partitions.
  • SLURM by default will attempt to run a job on part of a submidplane partition if that partition is already booted.
  • This is accomplished with prolog scripts.

SLURM Scheduling Jobs

SLURM Allocation Vs. Task Placement
• Allocation is the selection of the resources needed for the job.
  • This is done by the “sbatch” command.
  • Each job includes zero or more job steps (srun).
  • Each job step comprises one or more tasks.
• Task placement is the process of assigning a subset of the job’s allocated resources (CPUs) to each task.
  • This is handled by the SLURM “srun” command invoked from within the script scheduled by “sbatch”.

Effectively this becomes a game of Tetris

Slurm documentation
• Slurm docs can be found here:
  • http://slurm.schedmd.com/documentation.html
• Typical commands:

SLURM functions
• SLURMD carries out five key tasks and has five corresponding subsystems:
  • Machine Status: responds to SLURMCTLD requests for machine state information and sends asynchronous reports of state changes to help with queue control.
  • Job Status: responds to SLURMCTLD requests for job state information and sends asynchronous reports of state changes to help with queue control.
  • Remote Execution: starts, monitors, and cleans up after a set of processes (usually shared by a parallel job), as decided by SLURMCTLD (or by direct user intervention).
  • Stream Copy Service: handles all STDERR, STDIN, and STDOUT for remote tasks. This may involve redirection, and it always involves locally buffering job output to avoid blocking local tasks.
  • Job Control: propagates signals and job-termination requests to any SLURM-managed processes (often interacting with the Remote Execution subsystem).

Slurm software
• SLURM daemons don’t execute directly on the compute nodes.
• SLURM gets system state from the Bluegene/Q control system and allocates resources through it.
• This interface is entirely contained in a SLURM plugin (src/plugins/select/bluegene).
• The user interacts with the Bluegene using the following SLURM commands:
  • sbatch
  • srun
  • scontrol
  • squeue

Slurm Architecture for Bluegene/Q

Job Launch Process

Sview of BlueGene system

Slurm naming conventions
• Slurm names things with torus coordinates.
  • Top-level names use 4-dimension midplane coordinates.
  • Submidplane partitions use 5-dimension torus coordinates.
[Table: Slurm name / Bgq name pairs, with example larger blocks]

Slurm queuing a job.
• Use the sbatch command to queue a script that will run one or more jobs.
• Within the script presented to the sbatch command, issue one or more “srun” commands.
• The srun command will eventually cause a runjob command to be created.
• For example, the following schedules the script rj01.sh to be run when a 64-node block on the partition “prod” is booted:
    sbatch --nodes=64 --partition=prod rj01.sh
• Inside rj01.sh we have:
    #!/bin/bash
    srun --chdir=/bgusr/home1/bvt_scratch /bgusr/home1/bgqadmin/bvtapps/dgemmdiag/dgemmdiag.elf
• The srun will call runjob as follows:
    runjob --exe /bgusr/home1/bgqadmin/bvtapps/dgemmdiag/dgemmdiag.elf --block RMP28Ap122959767 --cwd /bgusr/home1/bvt_scratch
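The srun-to-runjob correspondence in this example can be sketched as a small shell helper. This is an illustration only: srun_to_runjob is a hypothetical function written for this note, not part of SLURM, and it covers only the three values shown in the example (block, working directory, executable). The real translation is done inside SLURM, and the block name is chosen by SLURM at allocation time.

```shell
#!/bin/sh
# Hypothetical helper (not part of SLURM): print the runjob command that
# corresponds to a simple srun call, mirroring the example on this slide.
srun_to_runjob() (
    block=$1   # block name chosen by SLURM when the allocation is made
    cwd=$2     # value passed to srun --chdir
    exe=$3     # executable passed to srun
    printf 'runjob --exe %s --block %s --cwd %s\n' "$exe" "$block" "$cwd"
)

# Reproduce the runjob line from the slide.
srun_to_runjob RMP28Ap122959767 /bgusr/home1/bvt_scratch \
    /bgusr/home1/bgqadmin/bvtapps/dgemmdiag/dgemmdiag.elf
```

The function only prints the command, so it can be tried on any machine without a Bluegene attached.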

Queuing a job with only one script.
• Using sbatch/srun to queue a job typically requires two scripts: one to queue the job (sbatch) and one to run one or more jobs (srun) once the block is allocated.
• One can do this with a single script using this simple boilerplate:
    #!/bin/bash
    if [ -z "$SLURM_JOBID" ]; then
        # Not inside a SLURM job yet: queue this same script.
        sbatch --gid=bqluan --time=5:00 --nodes=128 --ntasks-per-node=32 -O --qos=umax-128 "$0"
    else
        # Running inside the allocation: launch the job step.
        srun --chdir=/gpfs/DDNgpfs2/bqluan/mushroomP \
            --output=equilibrate-4V-21-new.out --error=equilibrate-4V-21-new.err \
            /gpfs/DDNgpfs1/smts/bin/bgq/namd2.9 equilibrate-4V-21-new.namd
    fi
• The above script is a re-expression of the following (original) runjob script:
    runjob --block R01-M0-N04-128 --ranks-per-node 32 --cwd /gpfs/DDNgpfs2/bqluan/mushroomP \
        --exe /gpfs/DDNgpfs1/smts/bin/bgq/namd2.9 \
        --args equilibrate-4V-21-new.namd > equilibrate-4V-21-new.out 2> equilibrate-4V-21-new.err &

Srun/runjob decoder
[Table: Srun option / Runjob option]
• Launcher options is a catch-all for all other runjob options.
• For example:
    --launcher-opts="--timeout 300 --strace"

Partitions (SLURM queue names).
• We have set up multiple basic SLURM queues (partitions).
  • prod – regular production nodes (R00-M0, R00-M1, R01-M0, R01-M1).
  • bgas – full system bgas allocation (R00-M0, R00-M1, R01-M0, R01-M1).
• There are a couple of midplane-level reservations set up to run each day.
  • bgas_daily – active 3am to 3:30pm.
  • bgas_full – 3:30pm to 6pm.
• The default queue/partition is the “prod” queue.
• The queue/partition name is used by the prolog script to determine whether the IO nodes need to be switched to BGAS or to production.

SLURM small block divisions.
• Block divisions as of May 2024.
  • bgq0000 (R00-M0) – divided into 16 32-way blocks.
  • bgq0001 (R00-M1) – divided into 32, 64, 128, 256-way (overlapping) blocks.
  • bgq0010 (R01-M0) – divided into 64, 128, 256-way (overlapping) blocks.
  • bgq0011 (R01-M1) – divided into 64, 128, 256-way (overlapping) blocks.
• The sbatch option “--nodes=xx”, where xx is 32, 64, 128, or 256, will cause a job to land on one of the small block partitions. SLURM will pick which small block to run it on.
• Prolog scripts ensure that partial blocks are not used (e.g. two 32-way jobs running on the same 64-way block at the same time).
• You can restrict which midplane SLURM selects its blocks from with --nodelist=xxxx, where xxxx is bgq0000, bgq0001, bgq0010, or bgq0011.
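The pairing of SLURM midplane names and Bluegene names listed above can be kept in a small lookup, for example when turning a --nodelist value into the name the Bluegene control system uses. This is a minimal sketch: slurm_to_bgq is a hypothetical helper name, and the table contains only the four midplanes named on this slide, with no attempt to derive a general naming rule.

```shell
#!/bin/sh
# Hypothetical helper: map a SLURM midplane name to its Bluegene/Q name.
# Only the four midplanes listed on the slide are known.
slurm_to_bgq() (
    case $1 in
        bgq0000) echo R00-M0 ;;
        bgq0001) echo R00-M1 ;;
        bgq0010) echo R01-M0 ;;
        bgq0011) echo R01-M1 ;;
        *) echo "unknown midplane: $1" >&2; exit 1 ;;
    esac
)

# Show the Bluegene name for each value accepted by --nodelist.
for mp in bgq0000 bgq0001 bgq0010 bgq0011; do
    echo "$mp -> $(slurm_to_bgq "$mp")"
done
```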

Getting SLURM to run on a specific node card/block
• To get SLURM to land on a specific block we use the prolog script together with the “nodelist” and “constraint” options for sbatch.
• For example:
    sbatch --partition=prod --nodelist=bgq0000 --nodes=32 --constraint=N00-32
• NOTE:
  • The --nodes option and the constraint must agree as to the size.
  • A sub-block of that size MUST exist on the nodelist requested.
• Valid constraints are:
  • Nxx-32, where xx is 00-15
  • Nxx-64, where xx is 00, 02, 04, 06, 08, 10, 12, 14
  • Nxx-128, where xx is 00, 04, 08, 12
  • Nxx-256, where xx is 00, 08
• If the block cannot be scheduled, the job will be canceled and a message will appear in the stdout file (slurm-$jobid.out).
• Using the higher-numbered Nxx cards for 64-way and 32-way jobs is discouraged: the system tries the lowest-numbered card first and then works through each node card in turn until it reaches the requested card.
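The constraint rules above reduce to a simple check: the size in the constraint must equal --nodes, the card number must be in the range 00-15, and it must be a multiple of the number of node cards the block covers (size / 32). A minimal sketch of that check follows; valid_constraint is a hypothetical helper, not something SLURM or the prolog scripts are known to provide, and it does not verify that a sub-block of that size exists on the requested midplane.

```shell
#!/bin/sh
# Hypothetical helper: valid_constraint NODES CONSTRAINT
# Succeeds when CONSTRAINT (e.g. N04-128) agrees with NODES and starts on a
# node card allowed by the list on this slide.
valid_constraint() (
    nodes=$1
    constraint=$2
    case $nodes in
        32|64|128|256) ;;
        *) exit 1 ;;
    esac
    # Must look like Nxx-<nodes>, with a two-digit card number.
    case $constraint in
        N[0-9][0-9]-"$nodes") ;;
        *) exit 1 ;;
    esac
    card=${constraint#N}     # e.g. "04-128"
    card=${card%%-*}         # e.g. "04"
    card=${card#0}           # drop a leading zero so 08 is not parsed as octal
    cards=$((nodes / 32))    # node cards covered by one block of this size
    [ "$card" -le 15 ] && [ $((card % cards)) -eq 0 ]
)

# Print the sbatch command only when the size and constraint agree.
if valid_constraint 32 N00-32; then
    echo "sbatch --partition=prod --nodelist=bgq0000 --nodes=32 --constraint=N00-32"
fi
```

Running a check like this before submitting avoids waiting for the prolog script to cancel a job that could never be scheduled.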

SLURM Job order.
• If the user uses the --constraint option to select a specific node card, the order in which jobs were submitted may not be respected.
• This is because the prolog scripts can reject the node SLURM first selects, either because SLURM is trying to run on a block larger than requested or because of a constraint.
• When the job is rejected on a specific node, it gets re-queued, and this will cause some reordering.
• If job order is required, one can use the --dependency=singleton and --job-name options as follows:
    sbatch --job-name=a --dependency=singleton -N32 --constraint=N01-32 rj01.s
• Another way to do this is with the other forms of “--dependency”:
  • after:job_id[:jobid...] : This job can begin execution after the specified jobs have begun execution.
  • afterany:job_id[:jobid...] : This job can begin execution after the specified jobs have terminated.
  • afternotok:job_id[:jobid...] : This job can begin execution after the specified jobs have terminated in some failed state (non-zero exit code, node failure, timed out, etc).
  • afterok:job_id[:jobid...] : This job can begin execution after the specified jobs have successfully executed (ran to completion with an exit code of zero).
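As a small illustration of the dependency syntax listed above, the sketch below assembles a --dependency option from a dependency type and a list of job IDs. dependency_opt is a hypothetical helper, and the job IDs 101 and 102 are made-up values; in practice the IDs come from earlier sbatch submissions.

```shell
#!/bin/sh
# Hypothetical helper: dependency_opt KIND JOBID...
# Prints an sbatch --dependency option in the KIND:job_id[:jobid...] form.
dependency_opt() (
    kind=$1
    shift
    case $kind in
        after|afterany|afternotok|afterok) ;;
        *) exit 1 ;;
    esac
    # At least one job ID is required.
    [ $# -ge 1 ] || exit 1
    opt="--dependency=$kind"
    for id in "$@"; do
        opt="$opt:$id"
    done
    echo "$opt"
)

# Example with made-up job IDs: run rj01.sh only after both jobs succeed.
echo "sbatch $(dependency_opt afterok 101 102) rj01.sh"
```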

SLURM – reservations.
• SLURM can reserve an entire midplane for jobs submitted under a specific reservation id.
• The current version can only reserve entire midplane blocks (not sub-midplane).
• The September release of SLURM is supposed to have better sub-midplane capabilities for both node selection and reservations.
• Creating a reservation:
    scontrol create reservation user=myid starttime=now duration=120 \
        nodes=bgq0001
• This will reply with a reservation id as follows:
    Reservation created: myid_5
• Using the reservation:
    sbatch --reservation=myid_5 --nodes=64 my.script
• This web page outlines reservations in more detail: https://computing.llnl.gov/linux/slurm/reservations.html

Reservation Time-limit interaction.
• Each job in the queue has an execution time limit imposed on it.
• The default normally comes from the queue name.
• It can be overridden at various levels, such as on the sbatch command line.
• The initial default for the SLURM queues is 1 hour. To override it, use the --time parameter on sbatch as follows:
    sbatch --time=xxx nameofscript.sh
• The xxx value is in minutes; other forms of date/time can be found in the sbatch man page (“man sbatch”).
• The job will not run if its time limit overlaps a node reservation.
• For example, if there is a reservation every day at 3:30 pm for the entire machine and the job’s time limit would overlap that full-system reservation, the job won’t run until after the reservation is over.
• If the time limit exceeds the queue/partition time limit, the job will be left in the pending state indefinitely.
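The overlap rule can be worked through with simple arithmetic on minutes since midnight. The sketch below is a simplification under stated assumptions: fits_before_reservation is a hypothetical helper, it assumes the job would start immediately, and it treats a job ending exactly when the reservation starts as not overlapping. SLURM makes the real decision.

```shell
#!/bin/sh
# Hypothetical helper: fits_before_reservation NOW LIMIT RES_START
#   NOW        current time, in minutes since midnight
#   LIMIT      job time limit in minutes (the sbatch --time value)
#   RES_START  reservation start, in minutes since midnight
# Succeeds when a job started at NOW would end by RES_START.
fits_before_reservation() (
    now=$1
    limit=$2
    res_start=$3
    [ $((now + limit)) -le "$res_start" ]
)

res=$((15 * 60 + 30))   # full-system reservation at 3:30 pm
now=$((15 * 60))        # job submitted at 3:00 pm

# Default 1 hour limit: 3:00 pm + 60 min runs past 3:30 pm.
if fits_before_reservation "$now" 60 "$res"; then
    echo "60 minute job can start now"
else
    echo "60 minute job waits until the reservation ends"
fi

# Shorter limit given with --time=20: 3:00 pm + 20 min ends before 3:30 pm.
if fits_before_reservation "$now" 20 "$res"; then
    echo "20 minute job can start now"
fi
```

This is why requesting a realistic --time value, rather than relying on the 1 hour default, lets short jobs run in the gap before a daily reservation.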

QOS settings.
• QOS (quality of service) settings are used by SLURM to limit the amount of resources a given user/group/account/job can consume at any one time.
• Our initial deployment of SLURM will associate a default QOS setting with each user, limiting them to the total number of compute nodes that they previously had as a static allocation.
• This keeps users from consuming all of the machine by submitting multiple sbatch commands, but still allows a user to run up to four 32-way jobs if their normal allocation was 128 nodes.
• Each user will have a “default QOS” setting associated with their ID as well as a list of QOS settings they are allowed to use.
  • umax-32 = user max nodes = 32
  • umax-64 = user max nodes = 64
  • umax-128 = user max nodes = 128
  • …
• One can select one of the authorized QOS settings on the sbatch command line as follows:
    sbatch --qos=umax-128 --nodes=32 xx.sh
• The above command would allow the user to run four 32-way jobs in parallel before the queue would back up their jobs behind other work.
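The relationship between a umax-N setting and the number of jobs that can run side by side is an integer division of the node limit by the job size. A minimal sketch, assuming the QOS name always has the umax-<nodes> form shown above; max_parallel_jobs is a hypothetical helper, and SLURM itself enforces the real limit.

```shell
#!/bin/sh
# Hypothetical helper: max_parallel_jobs QOS NODES
# Prints how many NODES-sized jobs fit under a umax-<limit> QOS at once.
max_parallel_jobs() (
    qos=$1
    nodes=$2
    case $qos in
        umax-*) ;;
        *) exit 1 ;;
    esac
    limit=${qos#umax-}
    # Both values must be plain positive integers.
    case $limit in ''|*[!0-9]*) exit 1 ;; esac
    case $nodes in ''|*[!0-9]*) exit 1 ;; esac
    [ "$nodes" -gt 0 ] || exit 1
    echo $((limit / nodes))
)

# With umax-128, 32-way jobs can run four at a time.
echo "umax-128, 32-way jobs: $(max_parallel_jobs umax-128 32) in parallel"
```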
