
MPI for BG/L

Presentation Transcript


  1. MPI for BG/L • George Almási

  2. Outline • Preliminaries • BG/L MPI Software Architecture • Optimization framework • Status & Future direction

  3. BG/L MPI: Who's who
  Users:
  • Watson: John Gunnels, BlueMatter team
  • NCAR: John Dennis, Henry Tufo
  • LANL: Adolfy Hoisie, Fabrizio Petrini, Darren Kerbyson
  • IBM India: Meeta Sharma, Rahul Garg
  • LLNL, Astron
  Testers:
  • Functionality testing: Glenn Leckband, Jeff Garbisch (Rochester)
  • Performance testing: Kurt Pinnow, Joe Ratterman (Rochester)
  • Performance analysis: Jesus Labarta (UPC), Nils Smeds, Bob Walkup, Gyan Bhanot, Frank Suits
  Developers:
  • MPICH2 framework: Bill Gropp, Rusty Lusk, Brian Toonen, Rajeev Thakur, others (ANL)
  • BG/L port, library core: Charles Archer (Rochester), George Almasi, Xavier Martorell
  • Torus primitives: Nils Smeds, Philip Heidelberger
  • Tree primitives: Chris Erway, Burk Steinmacher
  Enablers:
  • System software group (you know who you are)

  4. The BG/L MPI Design Effort • Started off with constraints and ideas from everywhere, pulling in every direction • Use algorithm X for HW feature Y • MPI package choice, battle over required functionality • Operating system, job start management constraints • 90% of work was to figure out which ideas made immediate sense • Immediately implement • Implement in the long term, but ditch for the first year • Evaluate only when hardware becomes available • Forget it • Development framework established by January 2003 • Project grew alarmingly: • January 2003: 1 fulltime + 1 postdoc + 1 summer student • January 2004: ~ 30 people (implementation, testing, performance)

  5. MPICH2-based BG/L Software Architecture
  [Layered architecture diagram: MPI (pt2pt, datatype, topo, collectives) sits on the MPICH2 Abstract Device Interface. Message passing goes through CH3 and the bgltorus device into the Message Layer and Packet Layer, which drive the Torus, Tree and GI devices. Process management goes through PMI (simple, uniprocessor, mpd, socket back ends) and the CIO protocol.]

  6. Architecture Detail: Message Layer
  [Diagram: the message layer keeps one virtual connection per peer (rank 0 at torus coordinates (0,0,0), rank 1 at (0,0,1), ..., rank n at (x,y,z)), each with its own send queue (sendQ) and receive state. A progress engine containing a connection manager, a send manager and a dispatcher drives the connections. Each MPID_Request carries the message data (send queue of msg1 ... msgP), the user buffer, an (un)packetizer, and protocol & state information.]
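  The diagram above survives only as labels. As a rough picture of the data structures it names (per-rank send queues, a dispatcher, an (un)packetizer attached to each request), here is a hypothetical C sketch; every type and field name is invented for illustration and is not the actual BG/L message layer code:

      #include <stddef.h>

      /* Hypothetical sketch of the message-layer structures named in the
       * diagram above; names and fields are invented for illustration. */
      typedef struct Message {
          void           *user_buffer;   /* application buffer being sent     */
          size_t          length;        /* message size in bytes             */
          size_t          bytes_done;    /* progress of the (un)packetizer    */
          int             protocol;      /* short / eager / rendezvous / ...  */
          struct Message *next;          /* next message in the send queue    */
      } Message;

      typedef struct Connection {        /* one virtual connection per rank   */
          int       rank;                /* MPI rank of the peer              */
          int       x, y, z;             /* the peer's torus coordinates      */
          Message  *sendq_head;          /* per-rank send queue (sendQ)       */
          Message  *sendq_tail;
          /* receive-side reassembly state would live here as well */
      } Connection;

      typedef struct MessageLayer {
          Connection *conn;              /* indexed by rank                   */
          int         nranks;
      } MessageLayer;

      /* Progress engine: the send manager drains the send queues into the
       * torus FIFOs; the dispatcher reads incoming packets and hands each
       * one to the matching connection. */
      void advance(MessageLayer *ml);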

  7. Performance Limiting Factors in the MPI Design
  Hardware:
  • Torus network: link bandwidth 0.25 Bytes/cycle/link (theoretical), 0.22 Bytes/cycle/link (effective); 12 * 0.22 = 2.64 Bytes/cycle/node
  • Streaming memory bandwidth: 4.3 Bytes/cycle/CPU; memory copies are expensive
  • Dual-core setup, memory coherency: explicit coherency management via "blind device" and cache-flush primitives; requires communication between processors, best done in large chunks; the coprocessor cannot manage MPI data structures
  • Network order semantics and routing: deterministic routing delivers in order but gives bad torus performance; adaptive routing gives excellent network performance but out-of-order packets; in-order semantics is expensive
  • CPU/network interface: 204 cycles to read a packet, 50-100 cycles to write a packet; alignment restrictions (handling badly aligned data is expensive); short FIFOs (the network needs frequent attention)
  Software:
  • Only tree channel 1 is available to MPI
  • CNK is single-threaded; MPICH2 is not thread safe
  • Context switches are expensive
  • Interrupt-driven execution is slow
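  A worked restatement of the bandwidth numbers above (the last sentence is an inference, not stated on the slide):

      B_{\mathrm{torus}} = 12 \times 0.22\ \mathrm{Bytes/cycle} = 2.64\ \mathrm{Bytes/cycle/node},
      \qquad B_{\mathrm{mem}} = 4.3\ \mathrm{Bytes/cycle/CPU}

  Since a memory copy both reads and writes each byte, one extra copy moves payload at roughly B_mem / 2 ≈ 2.15 Bytes/cycle, comparable to the node's entire torus rate of 2.64 Bytes/cycle, which is presumably why memory copies are singled out as expensive.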

  8. Optimizing short-message latency
  • The thing to watch is overhead: bandwidth, CPU load, co-processor, network load
  • Memory copies take care of alignment
  • Deterministic routing ensures MPI semantics; adaptive routing would double the message-layer overhead; the balance here may change as we scale to 64k nodes
  • Network load is not a factor: not enough network traffic
  • Today: half the nearest-neighbor roundtrip latency is about 3000 cycles, i.e. about 6 µs @ 500 MHz; within SOW specs @ 700 MHz
  • Can improve 20-25% by shortening packets
  • Composition of roundtrip latency: high level (MPI) 26%, message layer 13%, per-packet overhead 29%, HW 32%
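  Restating the latency figure above in wall-clock terms (simple unit conversion; the 700 MHz number is just arithmetic, the SOW target itself is not given here):

      \frac{3000\ \mathrm{cycles}}{500\ \mathrm{MHz}} = 6\ \mu\mathrm{s},
      \qquad \frac{3000\ \mathrm{cycles}}{700\ \mathrm{MHz}} \approx 4.3\ \mu\mathrm{s}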

  9. Optimizing MPI for High CPU/Network Traffic (neighbor-to-neighbor communication)
  • Most important thing to optimize for: CPU overhead per packet
  • At maximum torus utilization, only about 90 CPU cycles are available to prepare/handle a packet; sad (measured) reality: READ 204 cycles, WRITE 50-100 cycles, plus MPI overhead
  • Packet overhead reduction: "cooked" packets contain the destination address and assume an initial dialog (rendezvous); the rendezvous costs about 3000 cycles but saves about 100 cycles/packet, allows adaptively routed packets, and permits coprocessor mode
  • Coprocessor mode is essential (allows 180 cycles/CPU/packet); explicit cache management costs about 5000 cycles/message; system support necessary (coprocessor library, scratchpad library); lingering RIT1 memory issues
  • Adaptive routing is essential; MPI semantics are achieved by an initial, deterministically routed scout packet
  • Packet alignment issues are handled with 0 memory copies by overlapping realignment with torus reading
  • Drawback: only works well for long messages (10 KBytes+)
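  The per-packet cycle budget follows from the torus numbers on slide 7; a worked version, assuming a 256-byte maximum torus packet (the packet size is an assumption, not stated in this deck):

      \frac{256\ \mathrm{Bytes/packet}}{2.64\ \mathrm{Bytes/cycle}} \approx 97\ \mathrm{cycles/packet\ per\ node}

  If part of each packet is header, the budget drops toward the 90-cycle figure quoted above; with both CPUs servicing the network in coprocessor mode the per-CPU budget roughly doubles, which is consistent with the 180 cycles/CPU/packet figure.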

  10. Per-node asymptotic bandwidth in MPI
  [Two surface plots: per-node bandwidth in heater mode and per-node bandwidth in coprocessor mode. Axes: number of senders (0-6) and number of receivers (0-6); vertical axis: bandwidth, 0-2 Bytes/cycle.]

  11. The cost of packet re-alignment
  • The cost (cycles) of reading a packet from the torus into un-aligned memory
  [Chart: cycles (0-600) vs. alignment offset (0-15), comparing "non-aligned receive", "receive + copy", and "ideal".]
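  A minimal sketch of the "receive + copy" strategy from the chart: always receive each packet into an aligned bounce buffer, then memcpy into the (possibly unaligned) user buffer. The function torus_read_packet, the 256-byte packet size and the 16-byte alignment are assumptions for illustration only:

      #include <string.h>

      #define PACKET_BYTES 256   /* assumed maximum torus packet payload */

      /* Assumed primitive: reads one packet from the torus FIFO into an
       * aligned buffer and returns the number of payload bytes. */
      extern int torus_read_packet(void *aligned_buf);

      /* Aligned bounce buffer (GCC-style alignment attribute). */
      static char bounce[PACKET_BYTES] __attribute__((aligned(16)));

      /* Receive into the aligned bounce buffer, then copy to the user
       * buffer at whatever alignment it happens to have. Costs one extra
       * memcpy but avoids the slow unaligned-receive path. */
      int recv_packet_into(void *user_buf, size_t offset)
      {
          int n = torus_read_packet(bounce);
          memcpy((char *)user_buf + offset, bounce, (size_t)n);
          return n;
      }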

  12. Optimizing for high network traffic, short messages
  • High network traffic: adaptive routing is an absolute necessity
  • Short messages: cannot use the rendezvous protocol; CPU load is not a limiting factor; the coprocessor is irrelevant
  • Message reordering solution: worst case up to 1000 cycles/packet; per-CPU bandwidth limited to 10% of nominal peak
  • Flow control solution: quasi-sync protocol, with an Ack packet for each unordered message; only works for messages long enough that Tmsg > latency
  • The situation is not prevalent on the 8x8x8 network, but will be one of the scaling problems: cross-section increases with n², #CPUs increases with n³ (see the note below)
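  The scaling worry in the last bullet, written out: for an n x n x n torus,

      \frac{\mathrm{cross\text{-}section\ links}}{\#\mathrm{CPUs}} \sim \frac{n^2}{n^3} = \frac{1}{n}

  so the bisection bandwidth available per CPU shrinks as the machine grows toward 64k nodes, making the congested-network, short-message case progressively harder.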

  13. MPI communication protocols • A mechanism to optimize MPI behavior based on communication requirements

  14. MPI communication protocols and their uses
  [Chart: protocol regions (short protocol, eager protocol, rendezvous protocol, quasi-sync) plotted against message size, CPU load and network load, with the coprocessor limit and the rendezvous limit marked as thresholds.]
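  A hypothetical sketch of the kind of protocol selection the chart describes; the threshold names and values are invented for illustration and are not the actual BG/L MPI cutoffs:

      #include <stddef.h>

      /* Hypothetical protocol-selection sketch; thresholds are invented. */
      typedef enum { PROTO_SHORT, PROTO_EAGER, PROTO_RENDEZVOUS, PROTO_QUASI_SYNC } proto_t;

      #define SHORT_LIMIT   240     /* fits in a single packet (assumed)  */
      #define EAGER_LIMIT  4096     /* assumed rendezvous cutoff          */

      proto_t choose_protocol(size_t nbytes, int high_network_load)
      {
          if (nbytes <= SHORT_LIMIT)
              return PROTO_SHORT;            /* one packet, no handshake     */
          if (nbytes <= EAGER_LIMIT)
              /* medium messages: eager normally; under heavy network load
               * use the quasi-sync protocol (per-message acks) for flow
               * control */
              return high_network_load ? PROTO_QUASI_SYNC : PROTO_EAGER;
          /* long messages: rendezvous pays for its ~3000-cycle handshake
           * by enabling adaptive routing and coprocessor mode */
          return PROTO_RENDEZVOUS;
      }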

  15. MPI in Virtual Node Mode
  • Splitting resources between CPUs: 50% each of memory and cache; 50% each of the torus hardware; tree channel 0 used by CNK, tree channel 1 shared by the CPUs; common memory: the scratchpad
  • Virtual node mode is good for computationally intensive codes with a small memory footprint and small/medium network traffic
  • Deployed, used by the BlueMatter team

  16. Optimal MPI task -> torus mapping
  • NAS BT has a 2D mesh communication pattern; how do we map it onto the 3D mesh/torus? By folding and inverting planes in the 3D mesh (see the sketch below)
  • NAS BT scaling: computation scales down with n⁻², communication scales down with n⁻¹
  [Chart: NAS BT scaling in virtual node mode; per-CPU performance (MOps/s/CPU, 0-100) vs. number of processors (121 to 961), comparing the naïve mapping against the optimized mapping.]
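  A sketch of one way to fold a 2D process grid onto a 3D mesh by folding and inverting planes, as the slide describes; this illustrates the general idea, not the mapping actually used for NAS BT:

      #include <stdio.h>

      #define PX 8
      #define PY 8
      #define PZ 8

      /* Fold a PX x (PY*PZ) logical 2D grid onto a PX x PY x PZ mesh.
       * Consecutive 2D columns remain physical neighbors because every
       * other YZ-plane is traversed in the reverse (inverted) direction. */
      void grid2torus(int row, int col, int *x, int *y, int *z)
      {
          *x = row;
          *z = col / PY;                        /* which plane           */
          int w = col % PY;                     /* position in the plane */
          *y = (*z % 2 == 0) ? w : PY - 1 - w;  /* invert odd planes     */
      }

      int main(void)
      {
          /* Print the torus coordinates assigned to 2D row 0. */
          for (int col = 0; col < PY * PZ; col++) {
              int x, y, z;
              grid2torus(0, col, &x, &y, &z);
              printf("(0,%2d) -> (%d,%d,%d)\n", col, x, y, z);
          }
          return 0;
      }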

  17. Optimizing MPI Collective Operations • MPICH2 comes with default collective algorithms: • Functionally, we are covered • But default algorithms not suitable for torus topology • Written with ethernet-like networks in mind • Work has started on optimized collectives: • For torus network: broadcast, alltoall • For tree network: barrier, broadcast, allreduce • Work on testing for functionality and performance just begun • Rochester performance testing team

  18. Broadcast on a mesh (torus)
  [Diagram: successive phases of the mesh broadcast, labeled 4S+2R, 3S+2R, 2S+2R, 1S+2R, 0S+2R.]
  Based on ideas from Vernon Austel, John Gunnels, Phil Heidelberger, Nils Smeds. Implemented & measured by Nils Smeds.
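  The figure survives only as its phase labels, so the actual BG/L mesh broadcast cannot be reconstructed from this slide. As an illustration of the general family of techniques, here is a generic pipelined broadcast along a line of ranks (think: one torus dimension), written against plain MPI point-to-point calls; it is a sketch, not the algorithm above:

      #include <mpi.h>

      /* Pipelined broadcast along a line of ranks; rank 0 is the root.
       * The buffer is cut into chunks so that downstream ranks can start
       * forwarding early chunks while later ones are still in flight. */
      void line_bcast(char *buf, int nbytes, MPI_Comm comm)
      {
          int rank, size;
          MPI_Comm_rank(comm, &rank);
          MPI_Comm_size(comm, &size);

          const int chunk = 4096;            /* pipeline granularity (bytes) */
          for (int off = 0; off < nbytes; off += chunk) {
              int len = (nbytes - off < chunk) ? nbytes - off : chunk;
              MPI_Status st;
              if (rank > 0)                  /* receive the chunk from the left */
                  MPI_Recv(buf + off, len, MPI_BYTE, rank - 1, 0, comm, &st);
              if (rank + 1 < size)           /* forward it to the right         */
                  MPI_Send(buf + off, len, MPI_BYTE, rank + 1, 0, comm);
          }
      }

  Because each rank forwards chunk i while earlier ranks are already sending chunk i+1, the completion time is roughly one message transfer time plus a pipeline fill term proportional to the line depth, instead of the depth times the full message time.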

  19. Optimized Tree Collectives
  [Charts: Tree Broadcast Bandwidth and Tree Integer Allreduce Bandwidth; bandwidth (Bytes/s) vs. message size (1024 to 4194304 Bytes) for processor counts from 8 to 512. The broadcast bandwidth is essentially flat at about 2.40E+08 Bytes/s; the allreduce chart spans 0 to 2.5E+08 Bytes/s.]
  Implementation with Chris Erway & Burk Steinmacher. Measurements from Kurt Pinnow.

  20. BG/L MPI: Status Today (2/6/2004)
  • MPI-1 compliant; passes the large majority of the Intel/ANL MPI test suite
  • Coprocessor mode available: 50-70% improvement in bandwidth; regularly tested; not fully deployed (hampered by BLC 1.0 bugs)
  • Virtual node mode available: deployed; not tested regularly
  • Process management: user-defined process-to-torus mappings available
  • Optimized collectives: optimized torus broadcast ready for deployment pending code review and optimizations; optimized tree broadcast, barrier and allreduce almost ready for deployment
  • Functionality: OK. Performance: a good foundation

  21. Where are we going to hurt next?
  • Anticipating this year: 4 racks in the near (?) future; we don't anticipate major scaling problems; CEO milestone at the end of the year
  • We are up to 2^9 of 2^16 nodes. That's halfway on a log scale. We have not hit any "unprecedented" sizes yet; LLNL can run MPI jobs on more machines than we have
  • Fear factor: the combination of a congested network and short messages
  • Lessons from last year:
  • Alignment problems
  • Co-processor mode: a coding nightmare; overlapping computation with communication; the coprocessor cannot touch data without the main processor cooperating
  • Excessive CPU load is hard to handle: even with the coprocessor, we still cannot handle 2.6 Bytes/cycle/node (yet)
  • Flow control: unexpected messages slow reception down

  22. Conclusion • We are in the middle of moving from functionality mode to a performance-centric mode • Rochester is taking over functionality and routine performance testing • Teams in Watson & Rochester are collaborating on collective performance • We don't know how to run 64k MPI processes • Imperative to keep the design fluid enough to counter surprises • Establishing a large community for measuring and analyzing behavior • A lot of performance work is needed • New protocol(s) • Collectives on the torus and tree
