
Discover Cluster Upgrades: Transition to Intel Haswells and SLES11.SP3

Stay updated on Discover's hardware changes and schedule, featuring the new Intel Xeon “Haswell” nodes: 28 cores per node, 120 GB of usable memory (128 GB installed), FDR InfiniBand, and the new SCU10 scalable unit. Prepare for the transition to SCU10 and the SLES11 SP3 “sp3” partition, and adjust your scripts for best performance. Learn about memory allocation, script modifications, compiler and MPI compatibility, and precautions for the new hardware, for a smooth transition to the latest technology and improved cluster performance.


Presentation Transcript


  1. Discover Cluster Upgrades: Hello Haswells and SLES11 SP3, Goodbye Westmeres • February 3, 2015 • NCCS Brown Bag

  2. Agenda • Discover Cluster Hardware Changes & Schedule – Brief Update • Using Discover SCU10 Haswell / SLES11 SP3 • Q & A

  3. Discover Hardware Changes & Schedule update

  4. Discover’s New Intel Xeon “Haswell” Nodes • Discover’s Intel Xeon “Haswell” nodes: • 28 cores per node, 2.6 GHz • Usable memory: 120 GB per node, ~4.25 GB per core (128 GB total) • FDR InfiniBand (56 Gbps), 1:1 blocking • SLES11 SP3 • NO SWAP space, but DO have lscratch and shmem disk space • SCU10: • 720* Haswell nodes general use (1,080 nodes total), 30,240 cores total, 1,229 TFLOPS peak total • *Up to 360 of the 720 nodes may be episodically allocated for priority work • SCU11: • ~600 Haswell nodes, 16,800 cores total, 683 TFLOPS peak

  5. Discover Hardware Changes in a Nutshell TFLOPS for General User Work • January 30, 2015 (-70 TFLOPS): • Removed: 516 Westmere (12-core) nodes (SCU3, SCU4) • February 2, 2015 (+806 TFLOPS for general work): • Added: ~720* Haswell (28-core) nodes (2/3 of SCU10) • *Up to 360 of the 720 nodes may be episodically allocated to a priority project • Week of February 9, 2015 (-70 TFLOPS): • Removed: 516 Westmere (12-core) nodes (SCU1, SCU2) • Removed: 7 oldest (‘Dunnington’) Dalis (dali02-dali08) • Late February/early March 2015 (+713 TFLOPS for general work): • Added: 600 Haswell (28-core) nodes (SCU11)

  6. Discover Node Count for General Work – Fall/Winter Evolution

  7. Discover Processor Cores for General Work – Fall/Winter Evolution

  8. Oldest Dali Nodes to Be Decommissioned • The oldest Dali nodes (dali02 – dali08) will be decommissioned starting February 9 (plenty of newer Dali nodes remain). • You should see no impact from the decommissioning of old Dali nodes, provided you have not been explicitly specifying one of the dali02 – dali08 node names when logging in.

  9. Using Discover SCU10 and Haswell / SLES11 SP3

  10. How to use SCU10 • 720 Haswell nodes on SCU10 available in sp3 partition • To be placed on a login node with the SP3 development environment, after providing your NCCS LDAP password, specify “discover-sp3” at the “Host” prompt: Host: discover-sp3 • However, you may submit to the sp3 partition from any login node.

  11. How to use SCU10 • To submit a job to the sp3 partition, use either: • Command line: sbatch --partition=sp3 --constraint=hasw myjob.sh • Or inline directives: #SBATCH --partition=sp3 #SBATCH --constraint=hasw
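
For reference, a minimal batch-script sketch using these directives; the job name, task count, walltime, and executable are placeholder values, not site defaults:

#!/bin/bash
#SBATCH --job-name=hasw_test   # placeholder job name
#SBATCH --partition=sp3        # the SP3 (Haswell / SLES11 SP3) partition
#SBATCH --constraint=hasw      # request Haswell nodes
#SBATCH --ntasks=56            # total tasks (example value: two 28-core nodes' worth)
#SBATCH --time=01:00:00        # walltime (example value)

# Load your compiler/MPI modules here, then launch the executable.
srun ./my_app                  # ./my_app is a placeholder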

  12. Porting your work: the fine print… • There is a small (but non-zero) chance your scripts and binaries will run with no changes at all. • Nearly all scripts and binaries will require changes to make best use of SCU10.

  13. Porting your work: the fine print… • There is a small (but non-zero) chance your scripts and binaries will run with no changes at all. • Nearly all scripts and binaries will require changes to make best use of SCU10, sooo… With great power comes great responsibility. - Ben Parker (2002)

  14. Adjust for new core count • Haswell nodes have 28 cores and 128 GB per node • More than 2× the memory per core of the Sandy Bridge nodes • Specify the total cores/tasks you need, not nodes. • Example (sized for 16-core Sandy Bridge nodes): #SBATCH --ntasks=800 Not #SBATCH --nodes=50 • This allows SLURM to allocate whatever resources are available (see the sketch below).
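
A hedged sketch of the slide's example, requesting tasks rather than nodes so SLURM can place the job on whatever nodes are free (the executable is a placeholder):

#!/bin/bash
#SBATCH --ntasks=800   # 800 tasks: 50 Sandy Bridge nodes, or ~29 Haswell nodes
##SBATCH --nodes=50    # avoid hard-coding a node count like this

srun ./my_app          # placeholder executable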

  15. If you must control the details… • … still don’t use --nodes. • If you need more than ~4 GB/core, use fewer cores/node. #SBATCH --ntasks-per-node=N… • Assumes 1 task/core (the usual case). • Or specify required memory: #SBATCH --mem-per-cpu=N_MB… • SLURM will figure out how many nodes are needed to meet this specification.
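
Two hedged sketches of the directive lines for the approaches above; the task counts and memory values are illustrative, not recommendations:

# Approach 1: fewer tasks per node, so each task sees more memory.
#SBATCH --ntasks=112
#SBATCH --ntasks-per-node=14   # half of the 28 cores, roughly doubling memory per task

# Approach 2: state the memory requirement and let SLURM choose the layout.
#SBATCH --ntasks=112
#SBATCH --mem-per-cpu=8192     # per-task memory in MB (8 GB here)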

  16. Script changes summary • Avoid specifying --partition unless absolutely necessary. • And sometimes not even then… • Avoid specifying --nodes. • Ditto. • Let SLURM do the work for you. • That’s what it’s there for, and it allows for better resource utilization.

  17. Source code changes • You might not need to recompile… • … but the SP3 upgrade may require it. • SCU10’s hardware is brand-new and may require a recompile. • New features, e.g. AVX2 vector registers • SGI nodes, not IBM • FDR vs. QDR InfiniBand • NO SWAP SPACE!

  18. And did I mention… • … NO SWAP SPACE! • This is critical. • When you run out of memory now, you won’t start to swap – your code will throw an exception. • Ameliorated by higher GB/core ratio… • … but we still expect some problems from this. • Use policeme to monitor the memory requirements of your code.

  19. If you do recompile… • Current working compiler modules: • All Intel C compilers (Intel Fortran not yet tested) • gcc 4.5, 4.8.1, 4.9.1 • g95 0.93 • Current working MPI modules: • SGI MPT • Intel MPI 4.1.1.036 and later • MVAPICH2 1.8.1, 1.9, 1.9a, 2.0, 2.0a, 2.1a • OpenMPI 1.8.1, 1.8.2, 1.8.3
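
If you do rebuild to pick up Haswell's AVX2 instructions, a hedged sketch with the Intel C compiler; the module name is an assumption, so check "module avail" on Discover for the real one:

module load comp/intel-15.0.0            # hypothetical module name; verify with "module avail"
icc -O2 -xCORE-AVX2 -o my_app my_app.c   # -xCORE-AVX2 generates AVX2 code for Haswell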

  20. MPI “gotchas” • Programs using old Intel MPI must be upgraded. • MVAPICH2 and OpenMPI have only been tested on single-node jobs. • All MPI modules (except SGI MPT) may experience stability issues when node counts are >~300. • Symptom: Abnormally long MPI teardown times.

  21. cron jobs • discover-cron is still at SP1. • When running SP3-specific code, you may need to ssh to an SP3 node for proper execution. • Not extensively tested yet.

  22. Sequential job execution • Jobs may not execute in submission order. • Small and interactive jobs favored during the day. • Large jobs favored at night. • If execution order is important, the dependencies must be specified to SLURM. • Multiple dependencies can be specified with the --dependency option. • Can depend on start, end, failure, error, etc.

  23. Dependency example

# String to hold the job IDs.
job_ids=''

# Submit the first parallel processing job, save the job ID.
job_id=`sbatch job1.sh | cut -d ' ' -f 4`
job_ids="$job_ids:$job_id"

# Submit the second parallel processing job, save the job ID.
job_id=`sbatch job2.sh | cut -d ' ' -f 4`
job_ids="$job_ids:$job_id"

# Submit the third parallel processing job, save the job ID.
job_id=`sbatch job3.sh | cut -d ' ' -f 4`
job_ids="$job_ids:$job_id"

# Wait for the processing jobs to finish successfully, then
# run the post-processing job.
sbatch --dependency=afterok$job_ids postjob.sh

  24. Coming attraction: shared nodes • SCU10 nodes will initially be exclusive: 1 job/node • This is how we roll on discover now. • May leave a lot of unused cores and/or memory. • Eventually, SCU10 nodes (and maybe others) will be shared among jobs. • Same or different users. • What does this mean?

  25. Shared nodes (future) • You will no longer be able to assume that all of the node resources are for you. • Specifying task and memory requirements will ensure SLURM gets you what you need. • Your jobs must learn to “work and play well with others”. • Unexpected job interactions, esp. with I/O, may cause unusual behavior when nodes are shared.

  26. Shared nodes (future, continued) • If you absolutely must have a minimum number of CPUs in a node, the --mincpus=N option to sbatch will ensure you get it.
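
A hedged one-line illustration; the value 16 is a placeholder:

#SBATCH --mincpus=16   # require at least 16 CPUs per node allocated to this job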

  27. Questions & Answers • NCCS User Services: support@nccs.nasa.gov • 301-286-9120 • https://www.nccs.nasa.gov • Thank you

  28. Supplemental Slides

  29. Discover Compute Nodes, February 3, 2015 (Peak ~1,629 TFLOPS) • “Haswell” nodes, 28 cores per node, 2.6 GHz (new) • SLES11 SP3 • SCU10, 4.5 GB memory per core (new) • 720* nodes general use (1,080 nodes total), 30,240 cores total, 1,229 TFLOPS peak total (*360 nodes episodically allocated for priority work) • “Sandy Bridge” nodes, 16 cores per node, 2.6 GHz (no change) • SLES11 SP1 • SCU8, 2 GB memory per core • 480 nodes, 7,680 cores, 160 TFLOPS peak • SCU9, 4 GB memory per core • 480 nodes, 7,680 cores, 160 TFLOPS peak • “Westmere” nodes, 12 cores per node, 2 GB memory per core, 2.6 GHz • SLES11 SP1 • SCU1, SCU2 (SCUs 3, 4, and 7 already removed) • 516 nodes, 6,192 cores total, 70 TFLOPS peak

  30. Discover Compute Nodes, March 2015 (Peak ~2,200 TFLOPS) • “Haswell” nodes, 28 cores per node • SLES11 SP3 • SCU10, 4.5 GB memory per core • 720* nodes general use (1,080 nodes total), 30,240 cores total, 1,229 TFLOPS peak total (*360 nodes episodically allocated for priority work) • SCU11, 4.5 GB memory per core (new) • ~600 nodes, 16,800 cores total, 683 TFLOPS peak • “Sandy Bridge” nodes, 16 cores per node (no change) • SLES11 SP1 • SCU8, 2 GB memory per core • 480 nodes, 7,680 cores, 160 TFLOPS peak • SCU9, 4 GB memory per core • 480 nodes, 7,680 cores, 160 TFLOPS peak • No remaining “Westmere” nodes

  31. Discover Compute Timeline, late January through March 2015 (weekly columns: Jan. 26-30, Feb. 2-6, Feb. 9-13, Feb. 17-20, Feb. 23-27, Mar. 2-27)

• SCU10 (SLES11 SP3, 1,080 nodes, 30,240 cores, Intel Haswell, 1,229 TF peak): SCU10 integration, then general access (+720* nodes). SCU10 arrived in mid-November 2014. Following installation and resolution of initial power issues, the NCCS provisioned SCU10 with Discover images and integrated it with GPFS storage. NCCS stress testing and targeted high-priority use occurred in January 2015. (*360 nodes episodically allocated for priority work)

• SCU 1, 2, 3, 4 decommissioning (SLES11 SP1, 1,032 nodes, 12,384 cores, Intel Westmere, 139 TF peak): drain 516 nodes, remove 516 nodes, then drain and remove the remaining 516 nodes. To make room for the new SCU11 compute nodes, the nodes of Scalable Units 1, 2, 3, and 4 (12-core Westmeres installed in 2011) are being removed from operations during February. Removal of half of these nodes will coincide with general access to SCU10, the remaining half during installation of SCU11.

• SCU11 (SLES11 SP3, 600 nodes, 16,800 cores, Intel Haswell, 683 TF peak): physical installation, configuration, stress testing, integration, then general access (+600 nodes). SCU11 (600 Haswell nodes) has been delivered and will be installed starting Feb. 9. The NCCS will then provision the system with Discover images and integrate it with GPFS storage. Power and I/O connections from Westmere SCUs 1, 2, 3, and 4 are needed for SCU11, so SCUs 1, 2, 3, and 4 must be removed prior to SCU11 integration.

• SCU 8 and 9 (SLES11 SP1, 960 nodes, 15,360 cores, Intel Sandy Bridge, 320 TF peak): no changes during this period (January – March 2015). In November 2014, 480 nodes previously allocated for a high-priority project were made available for all user processing.

  32. Discover “SBU” Computational Capacity for General Work – Fall/Winter Evolution

  33. Total Discover Peak Computing Capability as a Function of Time (Intel Xeon Processors Only)

  34. Total Number of Discover Intel Xeon Processor Cores as a Function of Time

  35. Storage Augmentations • Dirac (Mass Storage) Disk Augmentation • 4 Petabytes usable (5 Petabytes “raw”), installed • Gradual data move starts the week of February 9 (many files and inodes to move) • Discover Storage Expansion • 8 Petabytes usable (10 Petabytes “raw”), installed • For both general use and the targeted “Climate Downscaling” project • Phased deployment, including optimizing the arrangement of existing project and user nobackup space
