Explore job management basics, submission methods, life in the queue, score accrual, and job termination at ALCF. Learn about Cobalt, script mode, the software environment, and optimization for beginners.
Critical Flags, Variables, and Other Important ALCF Minutiae
Jini Ramprakash, Technical Support Specialist
Argonne Leadership Computing Facility
Presentation outline
• It’s all about your job!
• Job management
• Job basics
  • Submission
  • Queuing
  • Execution
  • Termination
• Software environment
• Optimization for beginners
• ALCF resources, outlined
Argonne Leadership Computing Facility - supported by the Office of Science of the U.S. Department of Energy
Job management
• Cobalt (the ALCF resource scheduler) is used on all ALCF systems
  • Similar to PBS but not the same
  • Find more information at http://trac.mcs.anl.gov/projects/cobalt
• Job management commands:
  • qsub: submit a job
  • qstat: query a job status
  • qdel: delete a job
  • qalter: alter batched job parameters
  • qmove: move job to different queue
  • qhold: place queued (non-running) job on hold
  • qrls: release hold on job
  • showres: show current and future reservations
Job basics – submission
• Two modes of submitting jobs
  • Basic
  • Script mode
• Get all flags and options by running 'man qsub'
• For example:
  • qsub -A alchemy -n 40960 --mode c1 -t 720 --env "OMP_NUM_THREADS=4" lead_to_gold
  • In English: Charge project “Alchemy” for this job. Run on 40960 nodes, with one MPI rank per node. Run for 720 minutes. Set the “OMP_NUM_THREADS” environment variable to 4. Run the “lead_to_gold” binary.
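The example submission above can be assembled programmatically. The following is a minimal sketch, not an ALCF tool: the helper name build_qsub_command is hypothetical, and it uses only the flags shown on the slide (-A, -n, --mode, -t, --env). It builds the argument list and prints it without submitting anything.

```python
import shlex


def build_qsub_command(project, nodes, mode, walltime_minutes, executable, env=None):
    """Return the qsub argument list for a basic (non-script) submission.

    Hypothetical helper for illustration; only flags from the slide are used.
    """
    args = [
        "qsub",
        "-A", project,                # project to charge
        "-n", str(nodes),             # node count
        "--mode", mode,               # c1 = one MPI rank per node
        "-t", str(walltime_minutes),  # walltime in minutes
    ]
    if env is not None:
        args += ["--env", env]        # a single NAME=value setting
    args.append(executable)
    return args


command = build_qsub_command(
    "alchemy", 40960, "c1", 720, "lead_to_gold", env="OMP_NUM_THREADS=4"
)
# Print the command line instead of running it; pass `command` to
# subprocess.run on a real login node if you want to submit.
print(shlex.join(command))
```

Keeping the command as a list rather than a single string avoids shell-quoting mistakes when a value contains spaces.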
qsub checks your submission for sanity
• Did you specify a node count and walltime? Are they legal?
• Is the mode you specified valid?
• Did you ask for more than the minimum runtime?
• Are you a member of the project you specified? Does that project have a usable allocation?
• If so … all systems go! You get a JOBID, and the job is put in the queue
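The checks listed above can be pictured as a small validation function. This is an illustrative sketch only, not Cobalt's actual logic: the function name, the limits dictionary, the minimum walltime, and the list of valid modes are all assumptions made for the example.

```python
# Hypothetical limits for illustration; the real values are site policy.
LIMITS = {
    "max_nodes": 49152,         # Mira's node count, from the resources slide
    "min_walltime_minutes": 5,  # assumed value
    "valid_modes": {"c1", "c2", "c4", "c8", "c16", "c32", "c64", "script"},  # assumed
}


def check_submission(nodes, walltime_minutes, mode, project, memberships, limits=LIMITS):
    """Return a list of problems; an empty list means the job can be queued.

    `memberships` maps each project the user belongs to onto True/False,
    where the flag says whether that project has a usable allocation.
    """
    problems = []
    if nodes is None or walltime_minutes is None:
        problems.append("node count and walltime are required")
    else:
        if not 1 <= nodes <= limits["max_nodes"]:
            problems.append("illegal node count")
        if walltime_minutes < limits["min_walltime_minutes"]:
            problems.append("walltime is below the minimum runtime")
    if mode not in limits["valid_modes"]:
        problems.append("invalid mode")
    if project not in memberships:
        problems.append("user is not a member of the project")
    elif not memberships[project]:
        problems.append("project has no usable allocation")
    return problems


accepted = check_submission(40960, 720, "c1", "alchemy", {"alchemy": True})
rejected = check_submission(None, 720, "c99", "alchemy", {})
print(accepted)
print(rejected)
```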
Not there yet!
Job basics - life in the queue
• Periodically, your job’s score will increase
• Periodically, the scheduler will decide if there are any jobs it wants to run
• Check current state with qstat
• At some point, your score will be high enough, and it will be YOUR TURN!
Score accrual
• Large jobs are prioritized
• Jobs that have been waiting long are prioritized
• INCITE/ALCC projects are prioritized
• Negative allocations have a score cap lower than the starting score of other jobs
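A toy model can make the four rules above concrete. The weights, the starting score, and the cap below are invented for illustration and are not Cobalt's real scoring function; only the qualitative behavior (size, wait time, INCITE/ALCC priority, capped score for negative allocations) comes from the slide.

```python
def toy_score(nodes, hours_waiting, project_type, allocation_balance,
              start_score=100.0, negative_cap=50.0):
    """Toy priority score; every constant here is a made-up assumption."""
    score = start_score
    score += nodes / 1024.0            # larger jobs score higher
    score += 2.0 * hours_waiting       # score grows while the job waits
    if project_type in ("INCITE", "ALCC"):
        score += 25.0                  # INCITE/ALCC projects are prioritized
    if allocation_balance < 0:
        # Overdrawn projects are capped below everyone else's starting score.
        score = min(score, negative_cap)
    return score


big = toy_score(8192, 1, "discretionary", 1000)
small = toy_score(512, 1, "discretionary", 1000)
incite = toy_score(512, 1, "INCITE", 1000)
overdrawn = toy_score(8192, 48, "INCITE", -1)
print(big, small, incite, overdrawn)
```

In this model an overdrawn project can still run, but only when no job from a project in good standing is competing for the same nodes.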
Job basics - execution
• Book-keeping
  • Put a start record in the database. Output a log file start record. Send email of job start if --notify was requested. Start job timers
• Fire up to execute the job
  • Cobalt boots partition
  • runjob starts executable
Script mode jobs
• All jobs launch via runjob on the service nodes
• Script mode jobs launch your script on a special login node
• That script is responsible for calling runjob to launch the actual compute-node job
• You are charged for the duration of the script
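In script mode, the script itself has to assemble and issue the runjob call. The sketch below only builds and prints such a command. It rests on assumptions not stated on the slide: the runjob flags --block, --np, -p, and --envs, the ":" separator before the executable, and the COBALT_PARTNAME environment variable are taken from general BG/Q usage, and the helper name and fallback block name are hypothetical. Check 'man runjob' before relying on any of them.

```python
import os
import shlex


def build_runjob_command(block, ranks, ranks_per_node, executable, env=None):
    """Return a runjob argument list (flag names are assumptions, see lead-in)."""
    args = [
        "runjob",
        "--block", block,            # partition booted for this job
        "--np", str(ranks),          # total number of MPI ranks
        "-p", str(ranks_per_node),   # ranks per node
    ]
    if env is not None:
        args += ["--envs", env]      # a single NAME=value setting
    args += [":", executable]        # ':' separates runjob options from the binary
    return args


# Assumption: the scheduler exports the partition name to the script.
block = os.environ.get("COBALT_PARTNAME", "HYPOTHETICAL-BLOCK")
command = build_runjob_command(block, 512, 1, "lead_to_gold", env="OMP_NUM_THREADS=4")
print(shlex.join(command))
```

Because charging covers the whole script, any setup or post-processing placed around the runjob call is billed at the same rate as the compute step.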
Job basics – termination (a.k.a. are we there yet?)
• Your requested wall-time ticks down. Either your runjob returns, or you run out of wall-time and your job is forcibly removed
• Job-end cleanup happens
  • If your partition wasn’t cleaned up, that happens now
• Job-end book-keeping happens
  • Database, log file, notify if requested
Job basics – termination, life after your job
• If another job depended on yours, it can be released to run. If your job ended with a non-zero exit code, the dependent job moves to dep_fail instead
• That night, the log files will be fed into clusterbank (the ALCF accounting system) to create charges
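The dependency rule above reduces to a single branch on the exit code. A minimal sketch, with a hypothetical function name and made-up job IDs; "dep_fail" is the state named on the slide, while "released" is just a label chosen here for the other outcome.

```python
def resolve_dependents(exit_code, dependent_job_ids):
    """Map each dependent job to its state after the parent job ends."""
    # Exit code 0 releases dependents; anything else sends them to dep_fail.
    new_state = "released" if exit_code == 0 else "dep_fail"
    return {job_id: new_state for job_id in dependent_job_ids}


after_success = resolve_dependents(0, [1001, 1002])   # hypothetical job IDs
after_failure = resolve_dependents(137, [1001, 1002])
print(after_success)
print(after_failure)
```

The practical consequence is that a script-mode job should return a meaningful exit code, since that value decides whether follow-on jobs run.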
Non-standard job events
• Reservations and/or draining
• qsub rejection
• Job holds
• Job redefinition (qalter)
• Job removal (qdel)
• Abnormal job failure
• Why isn’t this job running?
Software environment - SoftEnv
• A tool for managing your environment
  • Sets your PATH to access desired front-end tools
  • Your compiler version can be changed here
• Settings:
  • Maintained in the file ~/.soft
  • Add/remove keywords from ~/.soft to change environment
  • Make sure @default is at the very end
• Commands:
  • softenv: lists all keywords defined on the systems
  • resoft: reloads initial environment from ~/.soft file
  • soft add|remove keyword: temporarily modify environment by adding/removing keywords
• http://www.mcs.anl.gov/hs/software/systems/softenv/softenv-intro.html
Software libraries
• ALCF supports two sets of libraries:
  • IBM system and provided libraries: /bgsys/drivers/ppcfloor
    • glibc
    • mpi
  • Site supported libraries and programs: /soft/
    • PETSc
    • ESSL
    • And many others
• See http://www.alcf.anl.gov/resource-guides/software-and-libraries
Compiler wrappers
• MPI wrappers for IBM XL cross-compilers:
• MPI wrappers for GNU cross-compilers:
Optimization for beginners
• Suggested set of optimization levels from least to most optimization:
  • -O0 # best level for use with a debugger
  • -O2 # good level for verifying correctness, baseline perf
  • -O2 -qmaxmem=-1 -qhot=level=0
  • -O3 -qstrict (preserves program semantics)
  • -O3
  • -O3 -qhot=level=1
  • -O4
  • -O5
Optimization tips
• -qlistopt generates a listing with all flags used in compilation
• -qreport produces a listing, shows how code was optimized
• Performance can decrease at higher levels of optimization, especially at -O4 or -O5
• May specify different optimization levels for different routines/files
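The last tip, different optimization levels for different files, can be expressed as a lookup from source file to flag set. This is a sketch under stated assumptions: the file names are invented, the wrapper name mpixlc is an assumption (the slide listing the wrapper names did not survive), and -c and -o are the usual compile-only and output options. The optimization and listing flags themselves come from the slides.

```python
import shlex

DEFAULT_FLAGS = ["-O2"]  # baseline level for verifying correctness

# Hypothetical per-file overrides.
PER_FILE_FLAGS = {
    "solver.c": ["-O3", "-qhot=level=1"],  # hot loop, optimize harder
    "io.c": ["-O0"],                       # keep debuggable
}


def compile_command(source, compiler="mpixlc", extra_flags=()):
    """Return the argument list that compiles one source file to an object."""
    flags = PER_FILE_FLAGS.get(source, DEFAULT_FLAGS)
    obj = source.rsplit(".", 1)[0] + ".o"
    return [compiler, *flags, *extra_flags, "-c", source, "-o", obj]


for src in ["solver.c", "io.c", "main.c"]:
    print(shlex.join(compile_command(src)))

# Ask for an optimization report on the file compiled most aggressively.
print(shlex.join(compile_command("solver.c", extra_flags=["-qreport"])))
```

Starting every file at the baseline and raising the level only where profiling justifies it makes it easier to spot the case where a higher level slows the code down.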
ALCF Resources – BG/Q systems
• Mira – BG/Q system
  • 49,152 nodes / 786,432 cores
  • 786 TB of memory
  • Peak flop rate: 10 PF
  • Linpack flop rate: 8.1 PF
• Cetus (T&D) – BG/Q system
  • 1,024 nodes / 16,384 cores
  • 16 TB of memory
  • Peak flop rate: 208 TF
• Vesta (T&D) – BG/Q system
  • 2,048 nodes / 32,768 cores
  • 32 TB of memory
  • Peak flop rate: 416 TF
ALCF Resources – supporting systems
• Tukey
  • NVIDIA system
  • 100 nodes / 1600 x86 cores / 200 M2070 GPUs
  • 6.4 TB x86 memory / 1.2 TB GPU memory
  • Peak flop rate: 220 TF
• Storage
  • Scratch: 28.8 PB raw capacity, 240 GB/s bw (GPFS)
  • Home: 1.8 PB raw capacity, 45 GB/s bw (GPFS)
  • Storage upgrade planned in 2015
ALCF Resources
• Mira: 48 racks / 768K cores, 10 PF
• Cetus (Dev): 1 rack / 16K cores, 208 TF
• Vesta (Dev): 2 racks / 32K cores, 416 TF
• Tukey (Viz): 100 nodes / 1600 cores, 200 NVIDIA GPUs, 220 TF
• Networks: 100 Gb (via ESnet, Internet2, UltraScienceNet)
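The rack, node, and core counts quoted on the resource slides can be cross-checked with a few divisions. The short sketch below uses only numbers from the slides; the per-rack and per-node ratios are derived from them, not quoted from any other source.

```python
# Rack, node, and core counts as given on the resource slides.
SYSTEMS = {
    "Mira": {"racks": 48, "nodes": 49152, "cores": 786432},
    "Cetus": {"racks": 1, "nodes": 1024, "cores": 16384},
    "Vesta": {"racks": 2, "nodes": 2048, "cores": 32768},
}


def ratios(system):
    """Return (nodes per rack, cores per node) for one system."""
    return system["nodes"] // system["racks"], system["cores"] // system["nodes"]


for name, system in SYSTEMS.items():
    nodes_per_rack, cores_per_node = ratios(system)
    kilo_cores = system["cores"] // 1024  # the "768K / 16K / 32K" shorthand
    print(f"{name}: {nodes_per_rack} nodes/rack, {cores_per_node} cores/node, {kilo_cores}K cores")
```

The same ratios give a quick sense of scale for the earlier submission example: 40960 nodes is 40 of Mira's 48 racks.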
Coming up next…
• Data Transfers in the ALCF - Robert Scott, ALCF
Thank You!
• Questions?