
The BlueGene/L Supercomputer


Presentation Transcript


  1. The BlueGene/L Supercomputer Manish Gupta IBM Thomas J. Watson Research Center

  2. What is BlueGene/L?
     • One of the world’s fastest supercomputers
     • A new approach to the design of scalable parallel systems
     • The current approach to large systems is to build clusters of large SMPs (NEC Earth Simulator, ASCI machines, Linux clusters)
       – Expensive switches for high performance
       – High electrical power consumption: low computing power density
       – Significant amount of resources devoted to improving single-thread performance
     • Blue Gene follows a more modular approach, with a simple building block (or cell) that can be replicated ad infinitum as necessary – aggregate performance is what matters
       – System-on-a-chip offers cost/performance advantages
       – Integrated networks for scalability
       – Familiar software environment, simplified for HPC

  3. BlueGene/L Compute System-on-a-Chip ASIC

  4. BlueGene/L
     • October 2003: BG/L half-rack prototype, 500 MHz, 512 nodes / 1024 processors, 2 TFlop/s peak, 1.4 TFlop/s sustained
     • April 2004: BlueGene/L 500 MHz 4-rack prototype, 4096 compute nodes, 64 I/O nodes, 16 TF/s peak, 11.68 TF/s sustained

  5. BlueGene/L
     • September 2004: BlueGene/L 700 MHz 8-rack prototype, 36.01 TF/s sustained

  6. BlueGene/L Networks
     3-Dimensional Torus
     • Interconnects all compute nodes (65,536)
     • Virtual cut-through hardware routing
     • 1.4 Gb/s on all 12 node links (2.1 GB/s per node)
     • 1 µs latency between nearest neighbors, 5 µs to the farthest
     • Communications backbone for computations
     • 0.7/1.4 TB/s bisection bandwidth, 68 TB/s total bandwidth
     Global Tree
     • One-to-all broadcast functionality
     • Reduction operations functionality
     • 2.8 Gb/s of bandwidth per link
     • Latency of one-way tree traversal 2.5 µs
     • Interconnects all compute and I/O nodes (1024)
     Low-Latency Global Barrier and Interrupt
     • Latency of round trip 1.3 µs
     Ethernet
     • Incorporated into every node ASIC
     • Active in the I/O nodes (1:64)
     • All external comm. (file I/O, control, user interaction, etc.)
     Control Network

  7. How to Make System Software Scale to 64K Nodes?
     • Take existing software and keep on scaling it until we succeed? or
     • Start from scratch?
       – New languages and programming paradigms

  8. How to Make System Software Scale to 64K Nodes?
     • Take existing software and keep on scaling it until we succeed? or
     • Start from scratch?
       – New languages and programming paradigms

  9. Problems with the “Existing Software” Approach
     • Reliability
       – If software fails on any node (independently) once a month, a node failure would be expected on a 64K-node system every 40 seconds (see the worked estimate just below)
     • Interference effect
       – Was about to send a message, but oops, got swapped out…
     • Resource limitations
       – Reserve a few buffers for every potential sender…
     • Optimization point is different
       – What about small messages?
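
     A quick check of that estimate: assuming independent failures and a 30-day month, one failure per node per month is one failure per 2,592,000 seconds per node, so across 65,536 nodes the expected time between failures somewhere in the system is about 2,592,000 / 65,536 ≈ 40 seconds.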

  10. Interference problem (diagram: sender S and receiver R are both scheduled, and the message goes through immediately)

  11. Interference problem (diagram: S is swapped out just as it is about to send; R sits idle until S is swapped back in)

  12. Interference problem (diagram: next R is swapped out and swapped back in, so S waits in turn; uncoordinated scheduling keeps serializing the exchange)

  13. Problems with the “Existing Software” Approach
     • Reliability
       – If software fails on any node (independently) once a month, a node failure would be expected on a 64K-node system every 40 seconds
     • Interference effect
       – Was about to send a message, but oops, got swapped out…
     • Resource limitations
       – Reserve a few buffers for every potential sender to hold early messages
     • Optimization point is different
       – What about small messages?

  14. Problems with the “New Software” Approach
     • Sure, message passing is tedious – has anything else been proven to scale?
     • Do you really want me to throw away my 1-million-line MPI program and start fresh?
     • If I start fresh, what’s the guarantee my “new” way of programming wouldn’t be rendered obsolete by future innovations?

  15. Our Solution
     • Simplicity
       – Avoid features not absolutely necessary for high-performance computing
       – Using simplicity to achieve both efficiency and reliability
     • New organization of familiar functionality
       – Same interface, new implementation
       – Hierarchical organization
     • Message passing provides the foundation
       – Research on higher-level programming models builds on that base

  16. BlueGene/L Software: Hierarchical Organization
     • Compute nodes are dedicated to running the user application, and almost nothing else – a simple compute node kernel (CNK)
     • I/O nodes run Linux and provide a more complete range of OS services – files, sockets, process launch, debugging, and termination
     • Service node performs system management services (e.g., heartbeating, error monitoring) – largely transparent to application/system software
     (diagram: compute nodes form the application volume, I/O nodes the operational surface, and the service node the control surface)

  17. Blue Gene/L System Software Architecture (diagram) – Front-end nodes and file servers reach the machine over the functional Ethernet; the service node runs DB2, MMCS, and the scheduler and controls the hardware over the control Ethernet and JTAG via the IDo chip; each of the 1024 psets (Pset 0 … Pset 1023) pairs one I/O node (Linux, ciod) with 64 compute nodes (C-Node 0 … C-Node 63, each running CNK), connected by the tree and torus networks.
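
     The pset structure makes the division of labor concrete: system calls that CNK cannot serve locally are shipped over the tree network to the ciod daemon on the pset's I/O node, which executes them under Linux and returns the result. A minimal sketch of that idea follows; the CioRequest layout and the tree_send / tree_recv loopback stubs are invented for illustration and are not the real CIO protocol.

     /* Sketch of CNK-style I/O function shipping. The CioRequest/CioReply layout
      * and tree_send/tree_recv are invented stand-ins, NOT the real CIO protocol;
      * the "network" here is a loopback so the sketch compiles and runs anywhere. */
     #include <stdint.h>
     #include <string.h>
     #include <stdio.h>

     typedef struct {
         uint32_t syscall_nr;    /* which system call is being shipped */
         int32_t  fd;            /* file descriptor as known on the I/O node */
         uint32_t len;           /* number of payload bytes that follow */
         char     payload[240];  /* a small write fits in a handful of tree packets */
     } CioRequest;

     typedef struct { int32_t retval; } CioReply;

     static CioRequest last_req;                       /* loopback "wire" */
     static void tree_send(const void *buf, size_t n) {
         memcpy(&last_req, buf, n < sizeof last_req ? n : sizeof last_req);
     }
     static void tree_recv(void *buf, size_t n) {
         CioReply r = { (int32_t)last_req.len };       /* pretend ciod wrote it all */
         memcpy(buf, &r, n < sizeof r ? n : sizeof r);
     }

     /* What a shipped write() might look like from the compute node's side. */
     static int shipped_write(int fd, const void *data, uint32_t len) {
         CioRequest req = { 4 /* write */, fd, len, {0} };
         memcpy(req.payload, data, len < sizeof req.payload ? len : sizeof req.payload);
         tree_send(&req, sizeof req);   /* forward to ciod on the pset's I/O node */
         CioReply rep;
         tree_recv(&rep, sizeof rep);   /* block until ciod returns the result */
         return rep.retval;
     }

     int main(void) {
         const char msg[] = "hello from a compute node\n";
         printf("shipped_write returned %d\n", shipped_write(1, msg, sizeof msg - 1));
         return 0;
     }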

  18. Programming Models and Development Environment
     • Familiar aspects
       – SPMD model – Fortran, C, C++ with MPI (MPI-1 + a subset of MPI-2); full language support
       – Automatic SIMD FPU exploitation
       – Linux development environment
         - User interacts with the system through front-end nodes running Linux – compilation, job submission, debugging
         - Compute Node Kernel provides the look and feel of a Linux environment – POSIX system calls (with restrictions)
       – Tools – support for debuggers (Etnus TotalView), hardware performance monitors (HPMLib), trace-based visualization (Paraver)
     • Restrictions (lead to significant scalability benefits)
       – Strictly space sharing – one parallel job (user) per partition of the machine, one process per processor of a compute node
       – Virtual memory constrained to physical memory size
         - Implies no demand paging, only static linking
     • Other issues: mapping of applications to the torus topology
       – More important for larger (multi-rack) systems
       – Working on techniques to provide transparent support
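
     For reference, the "familiar" starting point really is plain MPI-1 C code; a minimal SPMD example of the kind that runs unchanged under CNK (standard MPI calls only, nothing BG/L-specific):

     /* Minimal MPI-1 SPMD example: every rank contributes its rank number
      * and rank 0 prints the sum. Compiles with any MPI-1 implementation. */
     #include <stdio.h>
     #include <mpi.h>

     int main(int argc, char **argv) {
         int rank, size;
         MPI_Init(&argc, &argv);
         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
         MPI_Comm_size(MPI_COMM_WORLD, &size);

         int local = rank, sum = 0;
         MPI_Reduce(&local, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

         if (rank == 0)
             printf("%d ranks, sum of ranks = %d\n", size, sum);

         MPI_Finalize();
         return 0;
     }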

  19. Execution Modes for the Compute Node
     • Communication coprocessor mode: CPU 0 executes the user application while CPU 1 handles communication
       – Preferred mode of operation for communication-intensive and memory-bandwidth-intensive codes
       – Requires coordination between the CPUs, which is handled in libraries
     • Computation offload feature (optional): CPU 1 also executes some parts of the user application offloaded by CPU 0
       – Can be selectively used for compute-bound parallel regions
       – Asynchronous coroutine model (co_start / co_join); see the sketch below
       – Needs a careful sequence of cache-line flush, invalidate, and copy operations to deal with the lack of L1 cache coherence in hardware
     • Virtual node mode: CPU 0 and CPU 1 each handle both computation and communication
       – Two MPI processes on each node, one bound to each processor
       – Distributed-memory semantics – lack of L1 coherence is not a problem
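
     To make the co_start / co_join model concrete, here is an illustrative sketch; the signatures and the inline stub execution are invented for this example (the real library runs the started routine on CPU 1 and handles the cache flush/invalidate sequence mentioned above):

     /* Illustration of a co_start / co_join style offload. The signatures and the
      * synchronous stub implementation are invented for this sketch; on BG/L the
      * started routine would run on CPU 1 and co_join would wait for it. */
     #include <stdio.h>

     typedef struct { void (*fn)(void *); void *arg; int done; } co_handle;

     static co_handle co_start(void (*fn)(void *), void *arg) {
         co_handle h = { fn, arg, 0 };
         h.fn(h.arg);          /* stub: run inline; real version dispatches to CPU 1 */
         h.done = 1;
         return h;
     }

     static void co_join(co_handle *h) {
         while (!h->done) { /* real version would poll until CPU 1 finishes */ }
     }

     /* A compute-bound region offloaded to the coprocessor. */
     static void scale_half(void *arg) {
         double *v = (double *)arg;
         for (int i = 512; i < 1024; i++) v[i] *= 2.0;   /* CPU 1's half */
     }

     int main(void) {
         static double v[1024];
         for (int i = 0; i < 1024; i++) v[i] = i;

         co_handle h = co_start(scale_half, v);          /* hand half to CPU 1 */
         for (int i = 0; i < 512; i++) v[i] *= 2.0;      /* CPU 0 does its half */
         co_join(&h);                                    /* wait for the offloaded part */

         printf("v[0]=%g v[1023]=%g\n", v[0], v[1023]);
         return 0;
     }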

  20. The BlueGene/L MPICH2 organization (with ANL) (diagram) – MPI is layered over the Abstract Device Interface (pt2pt, datatype, topo, debug, collectives) on the message-passing side and PMI on the process-management side; the device/backend boxes include CH3, MM, bgltorus, simple, uniprocessor, mpd, and socket; the bgltorus message layer sits on the torus packet layer, tree packet layer, GI device, and CIO protocol, which drive the torus, tree, and global-interrupt (GI) networks.

  21. Performance-Limiting Factors in the MPI Design
     Hardware
     • Torus network link bandwidth: 0.25 bytes/cycle/link (theoretical), 0.22 bytes/cycle/link (effective); 12 × 0.22 = 2.64 bytes/cycle/node
     • Streaming memory bandwidth: 4.3 bytes/cycle/CPU – memory copies are expensive (see the note below)
     • CPU/network interface: 204 cycles to read a packet, 50–100 cycles to write a packet
     • Alignment restrictions: handling badly aligned data is expensive
     • Short FIFOs: the network needs frequent attention
     • Network ordering semantics and routing: deterministic routing is in-order but gives bad torus performance; adaptive routing gives excellent network performance but out-of-order packets – in-order semantics is expensive
     • Dual-core setup, memory coherency: explicit coherency management via the “blind device” and cache-flush primitives; requires communication between processors, best done in large chunks
     Software
     • Coprocessor cannot manage MPI data structures: CNK is single-threaded; MPICH2 is not thread safe
     • Context switches are expensive
     • Interrupt-driven execution is slow
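
     A rough consequence of those numbers: keeping all 12 links busy at the effective rate moves 2.64 bytes/cycle through the node, and copying every one of those bytes once more costs about two bytes/cycle of memory traffic per byte moved (a read plus a write), i.e., roughly 5.3 bytes/cycle, which already exceeds the 4.3 bytes/cycle one CPU can stream. That arithmetic is why "memory copies are expensive".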

  22. Packetization and Packet Alignment (sender/receiver diagram)
     • Constraint: the torus hardware only handles 16-byte-aligned data
     • When the sender and receiver alignments are the same:
       – head and tail are transmitted in a single “unaligned” packet
       – aligned packets go directly to/from the torus FIFOs
     • When the alignments differ, an extra memory copy is needed
       – Sometimes the torus read operation can be combined with the re-alignment operation
     (see the alignment-splitting sketch below)
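
     A minimal sketch of the head/body/tail split described above, in plain C pointer arithmetic; the function and its byte-count outputs are illustrative only and do not touch the real torus packet interface:

     /* Split [buf, buf+len) into an unaligned head, a 16-byte-aligned body, and an
      * unaligned tail, as a sender would before handing the body to the torus FIFOs.
      * The "send" here just counts bytes; it is not the real packet interface. */
     #include <stdint.h>
     #include <stdio.h>

     #define ALIGN 16u

     static void split_for_torus(const char *buf, size_t len,
                                 size_t *head, size_t *body, size_t *tail) {
         uintptr_t addr = (uintptr_t)buf;
         size_t skew = addr % ALIGN;
         *head = skew ? (ALIGN - skew) : 0;          /* bytes before the first aligned boundary */
         if (*head > len) *head = len;               /* tiny message: all head */
         size_t rest = len - *head;
         *body = rest - rest % ALIGN;                /* whole 16-byte chunks, FIFO-friendly */
         *tail = rest - *body;                       /* leftover bytes after the body */
     }

     int main(void) {
         static char data[1000];
         const char *msg = data + 5;                 /* deliberately misaligned start */
         size_t head, body, tail;
         split_for_torus(msg, 777, &head, &body, &tail);
         printf("head=%zu body=%zu tail=%zu (total %zu)\n",
                head, body, tail, head + body + tail);
         return 0;
     }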

  23. The BlueGene/L Message Layer
     • Looks very much like LAPI, GAMA – just a lot simpler ;-)
     • Simplest function: deliver a buffer of bytes from one node to another
     • Can do this using one of many protocols (see the selection sketch below)
       – One-packet protocol
       – Rendezvous protocol
       – Eager protocol
       – Adaptive eager protocol!
       – Virtual node mode copy protocol!
       – Collective function protocols!
       – … and others
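
     A sketch of how a message layer might pick among those protocols by message size; the thresholds and names here are illustrative assumptions (roughly one packet's payload, and the short-message regime mentioned on the next slide), not the actual BG/L message layer logic:

     /* Sketch of size-based protocol selection in a message layer. The thresholds
      * and the protocol enum are illustrative, not the actual BG/L message layer. */
     #include <stdio.h>

     typedef enum { ONE_PACKET, EAGER, RENDEZVOUS } protocol_t;

     /* Assume one packet carries on the order of a couple hundred payload bytes,
      * and that beyond some crossover it pays to reserve the receive buffer first. */
     static protocol_t choose_protocol(size_t bytes) {
         if (bytes <= 240)        return ONE_PACKET;  /* fits in a single packet */
         if (bytes <= 10 * 1024)  return EAGER;       /* push data, receiver buffers it */
         return RENDEZVOUS;                           /* handshake first, then bulk transfer */
     }

     int main(void) {
         size_t sizes[] = { 64, 4096, 1 << 20 };
         const char *names[] = { "one-packet", "eager", "rendezvous" };
         for (int i = 0; i < 3; i++)
             printf("%8zu bytes -> %s\n", sizes[i], names[choose_protocol(sizes[i])]);
         return 0;
     }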

  24. Optimizing Point-to-Point Communication (short messages: 0–10 KBytes)
     • The thing to watch is overhead: bandwidth, CPU load, co-processor, network load
     • Network load is not a factor: not enough network traffic
     • The BlueGene/L network requires 16-byte-aligned loads and stores; memory copies are needed to resolve alignment issues
     • Routing is a compromise:
       – Deterministic routing ensures good latency but creates network hotspots
       – Adaptive routing avoids hotspots but doubles latency
       – Currently, deterministic routing is more advantageous at up to 4K nodes
       – The balance may change as we scale to 64K nodes: shorter messages, more traffic

  25. Optimizing Collective Performance: Barrier and Short-Message Allreduce
     • Barrier is implemented as an all-broadcast in each dimension (phases 1, 2, 3 in the diagram)
       – BG/L torus hardware can send “deposit” packets along a line, giving a low-latency broadcast
       – Since packets are short, the likelihood of conflicts is low
       – Latency = O(xsize + ysize + zsize)
     • Allreduce for very short messages is implemented with a similar multi-phase algorithm (see the simulation below)
     • Implemented by Yili Zheng (summer student)
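
     The multi-phase Allreduce can be checked numerically: if each phase replaces every node's value with the sum over its line in one dimension, then after the x, y, and z phases every node holds the global sum. A small self-contained simulation of just that arithmetic (it models the math only, not the deposit-packet broadcasts the hardware uses for each phase):

     /* Simulates the multi-phase allreduce: phase 1 sums along x-lines, phase 2
      * along y-lines, phase 3 along z-lines. After all three phases, every node's
      * value equals the global sum. */
     #include <stdio.h>
     #include <string.h>

     #define X 4
     #define Y 3
     #define Z 5

     static double cur[X][Y][Z], nxt[X][Y][Z];

     static void phase(int dim) {
         for (int x = 0; x < X; x++)
             for (int y = 0; y < Y; y++)
                 for (int z = 0; z < Z; z++) {
                     double s = 0.0;
                     if (dim == 0) for (int k = 0; k < X; k++) s += cur[k][y][z];
                     if (dim == 1) for (int k = 0; k < Y; k++) s += cur[x][k][z];
                     if (dim == 2) for (int k = 0; k < Z; k++) s += cur[x][y][k];
                     nxt[x][y][z] = s;   /* every node on the line gets the line sum */
                 }
         memcpy(cur, nxt, sizeof cur);
     }

     int main(void) {
         double total = 0.0;
         for (int x = 0; x < X; x++)
             for (int y = 0; y < Y; y++)
                 for (int z = 0; z < Z; z++) {
                     cur[x][y][z] = x + 10 * y + 100 * z;
                     total += cur[x][y][z];
                 }

         phase(0); phase(1); phase(2);   /* reduce over x, y, z lines in turn */

         printf("expected %.0f, node (0,0,0) holds %.0f, node (3,2,4) holds %.0f\n",
                total, cur[0][0][0], cur[X-1][Y-1][Z-1]);
         return 0;
     }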

  26. Barrier and Short-Message Allreduce: Latency and Scaling (charts: short-message Allreduce latency vs. message size; barrier latency vs. machine size)

  27. Dual FPU Architecture
     • Designed with input from compiler and library developers
     • SIMD instructions over both register files
       – FMA operations over double-precision data
       – More general operations available with cross and replicated operands
         - Useful for complex arithmetic, matrix multiply, FFT
     • Parallel (quadword) loads/stores
       – Fastest way to transfer data between the processors and memory
       – Data needs to be 16-byte aligned
       – Load/store with swapped order available – useful for matrix transpose

  28. Strategy to Exploit the SIMD FPU
     • Automatic code generation by the compiler
       – User can help the compiler via pragmas and intrinsics
       – Pragma for data alignment: __alignx(16, var)
       – Pragmas for parallelism
         - Disjoint: #pragma disjoint (*a, *b)
         - Independent: #pragma ibm independent loop
       – Intrinsics
         - An intrinsic function is defined for each parallel floating-point operation
         - E.g.: D = __fpmadd(B, C, A) => fpmadd rD, rA, rC, rB
         - Control over instruction selection; the compiler retains responsibility for register allocation and scheduling
     • Using library routines where available
       – Dense matrix BLAS – e.g., DGEMM, DGEMV, DAXPY
       – FFT
       – MASS, MASSV

  29. Example: Vector Add

     void vadd(double* a, double* b, double* c, int n)
     {
       int i;
       for (i=0; i<n; i++) {
         c[i] = a[i] + b[i];
       }
     }

  30. Compiler Transformations for Dual FPU

     void vadd(double* a, double* b, double* c, int n)
     {
       int i;
       for (i=0; i<n-1; i+=2) {
         c[i]   = a[i]   + b[i];
         c[i+1] = a[i+1] + b[i+1];
       }
       for (; i<n; i++)
         c[i] = a[i] + b[i];
     }

  31. Compiler Transformations for Dual FPU

     void vadd(double* a, double* b, double* c, int n)
     {
       int i;
       for (i=0; i<n-1; i+=2) {
         c[i]   = a[i]   + b[i];
         c[i+1] = a[i+1] + b[i+1];
       }
       for (; i<n; i++)
         c[i] = a[i] + b[i];
     }

     The unrolled loop body maps onto paired FPU instructions:
       LFPL  (pa, sa) = (a[i], a[i+1])
       LFPL  (pb, sb) = (b[i], b[i+1])
       FPADD (pc, sc) = (pa+pb, sa+sb)
       SFPL  (c[i], c[i+1]) = (pc, sc)

  32. Pragmas and Advanced Compilation Techniques

     void vadd(double* a, double* b, double* c, int n)
     {
       #pragma disjoint(*a, *b, *c)
       __alignx(16,a+0);
       __alignx(16,b+0);
       __alignx(16,c+0);
       int i;
       for (i=0; i<n; i++) {
         c[i] = a[i] + b[i];
       }
     }

     Now Available (Using TPO)
     • Interprocedural pointer alignment analysis
     • Loop transformations to enable SIMD code generation in the absence of compile-time alignment information
       – loop versioning
       – loop peeling
     Coming soon
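
     One practical consequence of the __alignx(16, ...) assertions above is that callers must actually supply 16-byte-aligned arrays. A small caller-side sketch using plain POSIX posix_memalign (the vadd body here is an un-annotated stand-in for the slide's version; nothing is BG/L-specific):

     /* Caller-side sketch: guarantee the 16-byte alignment that the __alignx
      * assertions promise. posix_memalign is plain POSIX. */
     #define _POSIX_C_SOURCE 200112L
     #include <stdio.h>
     #include <stdlib.h>

     /* Plain-C stand-in for the pragma-annotated vadd() shown on slide 32. */
     static void vadd(double *a, double *b, double *c, int n) {
         for (int i = 0; i < n; i++) c[i] = a[i] + b[i];
     }

     int main(void) {
         const int n = 1024;
         double *a, *b, *c;
         /* 16-byte-aligned allocations make the quadword load/store forms legal. */
         if (posix_memalign((void **)&a, 16, n * sizeof *a) ||
             posix_memalign((void **)&b, 16, n * sizeof *b) ||
             posix_memalign((void **)&c, 16, n * sizeof *c))
             return 1;

         for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2.0 * i; }
         vadd(a, b, c, n);
         printf("c[10] = %g\n", c[10]);   /* expect 30 */

         free(a); free(b); free(c);
         return 0;
     }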

  33. LINPACK Summary
     • Pass-1 hardware (@ 500 MHz)
       – #4 on the June 2004 TOP500 list
       – 11.68 TFlop/s on 4096 nodes (71% of peak)
     • Pass-2 hardware (@ 700 MHz)
       – #8 on the June 2004 TOP500 list: 8.65 TFlop/s on 2048 nodes
       – Improved recently to 8.87 TFlop/s, which would have been #7 (77% of peak)
       – Achieved 36.01 TFlop/s with 8192 nodes on 9/16/04, beating the Earth Simulator (78% of peak)
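
     A back-of-the-envelope check of those percentages, using the peak implied by slide 4 (4096 nodes = 16 TF/s at 500 MHz, i.e., 8 flops per cycle per node from the two cores' dual FMA pipes): at 700 MHz a node peaks at 5.6 GFlop/s, so 2048 nodes peak at about 11.47 TF/s (8.87 / 11.47 ≈ 77%) and 8192 nodes at about 45.9 TF/s (36.01 / 45.9 ≈ 78%); at 500 MHz, 4096 nodes peak at 16.4 TF/s (11.68 / 16.4 ≈ 71%).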

  34. Cache Coherence: A War Story
     • Setup: Buffer 1 is sent from (by CPU 0); Buffer 2 is received into (by CPU 1) and is memory the main processor must not touch
     • CPU 0’s send loop:
         loop: ld   …, buffer
               st   …, network
               bdnz loop
     • On the last iteration:
       – the branch predictor predicts the branch taken
       – the ld executes speculatively
       – the cache miss causes the first line of the forbidden buffer area to be fetched into the cache
       – the system executes the branch and rolls back the speculative load
       – but it does not roll back the cache-line fetch (because it is nondestructive)
     • Conclusion: CPU 0 ends up with stale data in its cache – but only when the cache line actually survives until it is used
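
     The defensive fix this story motivates is to flush any possibly stale lines from CPU 0's L1 before it reads a buffer owned by CPU 1. A sketch under those assumptions, using the generic PowerPC dcbf/sync instructions via inline assembly; the helper names, the 32-byte line size, and the non-PowerPC fallback are choices made for this illustration, not the BG/L kernel's actual primitives:

     /* Sketch: flush possibly stale L1 lines before CPU 0 reads a buffer that
      * CPU 1 owns. dcbf and sync are generic PowerPC instructions; the fallback
      * keeps the sketch compiling elsewhere. */
     #include <stdint.h>
     #include <stddef.h>

     #define L1_LINE 32u

     static inline void flush_line(const void *p) {
     #if defined(__powerpc__) || defined(__PPC__)
         __asm__ volatile("dcbf 0,%0" : : "r"(p) : "memory");  /* data cache block flush */
     #else
         (void)p;                               /* no-op off PowerPC; sketch only */
     #endif
     }

     /* Flush every cache line overlapping [buf, buf+len) before trusting memory. */
     static void flush_buffer(const void *buf, size_t len) {
         uintptr_t a   = (uintptr_t)buf & ~(uintptr_t)(L1_LINE - 1);
         uintptr_t end = (uintptr_t)buf + len;
         for (; a < end; a += L1_LINE)
             flush_line((const void *)a);
     #if defined(__powerpc__) || defined(__PPC__)
         __asm__ volatile("sync" ::: "memory"); /* order the flushes before later loads */
     #else
         __asm__ volatile("" ::: "memory");     /* at least a compiler barrier */
     #endif
     }

     int main(void) {
         static char recv_buf[4096];               /* "buffer 2": written by CPU 1 */
         flush_buffer(recv_buf, sizeof recv_buf);  /* drop stale copies before reading */
         return recv_buf[0];
     }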

  35. HPC Challenge: Random Access Updates (GUP/s)

  36. HPC Challenge: Latency (usec)

  37. Measured MPI Send Bandwidth and Latency
     Latency @ 700 MHz = 3.3 + 0.090 × (Manhattan distance) + 0.045 × (midplane hops)
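
     To read the fit: a hypothetical destination 20 torus hops away that also crosses 4 midplane boundaries would see roughly 3.3 + 0.090 × 20 + 0.045 × 4 ≈ 5.3, versus about 3.4 for a nearest neighbor one hop away (same units as the 3.3 base term).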

  38. Noise measurements (from Adolfy Hoisie) Ref: Blue Gene: A Performance and Scalability Report at the 512-Processor Milestone, PAL/LANL, LA-UR-04-1114, March 2004.

  39. SPPM on fixed grid size (BG/L 700 MHz)

  40. ASCI Purple Benchmarks – UMT2K
     • UMT2K: unstructured-mesh radiation transport
     • Strong scaling – problem size fixed
     • Excellent scalability up to 128 nodes
     • Load-balancing problems on scaling up to 512 nodes; needs algorithmic changes in the original program

  41. SAGE on fixed grid size (BG/L 700 MHz)

  42. Effect of mapping on SAGE

  43. Miranda results (by LLNL)

  44. ParaDiS on BG/L vs. MCR (Linux cluster; peak 11.6 TF/s, LINPACK 7.634 TF/s)
     • Study of dislocation dynamics in metals (courtesy: Kim Yates)
     • MCR is a large (11.2 TF) tightly coupled Linux cluster: 1,152 nodes, each with two 2.4-GHz Pentium 4 Xeon processors and 4 GB of memory

  45. CPMD History
     • Born at IBM Zurich from the original Car-Parrinello code in 1993
     • Developed at many other sites over the years (more than 150,000 lines of code); it has many unique features, e.g. path-integral MD, QM/MM interfaces, TD-DFT and LR calculations
     • Since 2001 distributed free to academic institutions (www.cpmd.org); more than 5000 licenses in more than 50 countries

  46. CPMD results (BG/L 500 MHz)

  47. BGL: 1TF/Rack vs QCDOC: ½ TF/Rack

  48. BlueGene/L Software Team
     • Yorktown: 10 people (+ students); activities in all areas of system software; focus on development of MPI and new features; does some test, but depends on Rochester
     • Haifa: 4 people; focus on job scheduling; LoadLeveler; interfacing with Poughkeepsie
     • Rochester: 15 people (plus performance & test); activities in all areas of system software; most of the development; follows the process between research and product; main test center
     • India Research Lab: 3 people; checkpoint/restart; runtime error verification; benchmarking
     • Toronto: Fortran 95, C, C++ compilers

  49. Conclusions
     • Using low-power processors and chip-level integration is a promising path to supercomputing
     • We have developed a BG/L system software stack with a Linux-like personality for user applications
       – Custom solution (CNK) on compute nodes for highest performance
       – Linux solution on I/O nodes for flexibility and functionality
     • Encouraging performance results – NAS Parallel Benchmarks, ASCI Purple Benchmarks, LINPACK, and early applications showing good performance
     • Many challenges ahead, particularly in performance and reliability
     • Looking for collaborations
       – Work with a broader class of applications on BG/L – investigate scaling issues
       – Research on higher-level programming models
