
Some Large Machines… Richard Kaufmann, Compaq HPC (http://www.compaq.com/hpc)


Presentation Transcript


  1. Some Large Machines… Richard Kaufmann, http://www.compaq.com/hpc

  2. High Performance Technical Computing Systems • Compaq’s HPTC business takes AlphaServers, and… • Adds a “Systems Area Network” (a message passing interconnect), and… • Software that allows users to use a “bunch of SMPs” as a system… • Highly focused on usability, maintainability and upgradability. • …to make Supercomputers

  3. Some Past Machines • 1994-1998: Memory Channel • Price Performance • BabyZilla: 8 Servers x 4 CPUs • AlphaServer 4100 Cluster • 32 * 1.2GF/CPU = 38.4 GFLOPS • LLNL, CERN, EBI, CASPUR, Celera, Sanger, MIT Pleiades (7x4), more.. • Capability • TurboZilla: 8 Servers x 14 CPUs • LLNL, SNL, IPGP Paris, 32x8 Power, … • 1999- : Quadrics • 128 SMPs • LLNL (128x4), USGov 2 x 128x4, CEA France, WPAFB, ...

  4. Future Machines(the public ones, anyway!) • NSF/PSC 6TF • DoE ASCI 30TF • CEA DAM xTF

  5. Alpha Is Why HPTC Folks Talk To Us! • EV67: 0.25 µm, 667-8xx MHz, 4-wide superscalar, out-of-order execution, backside L2 cache port • EV6x: 0.18 µm, 1-1.4 GHz • EV7: 0.18 µm, >1000 MHz, L2 cache on-chip, RAMBUS, glueless MP • EV8: 0.13 µm, >1400 MHz, 8-wide superscalar, SMT

  6. “It’s The Memory System, &^$%^$%!!” • Increased memory bandwidth per FLOP • EV56: 250-350MB/s sustainable • EV6: ~2GB/s sustainable • 6X-8X improvement over EV56 for stride 1; further erosion of traditional supercomputers • 3X-4X for stride N (doubled cache line width) • Makes RISC machines more predictable; fewer “performance divots.” • Alpha (and its platforms) are clearly differentiated from the competition: • Significantly more memory bandwidth • Much higher memory capacities (32GB on the quad)

  7. Single CPU TRIAD (chart)
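
The TRIAD kernel behind a chart like this is the stride-1 loop a[i] = b[i] + q*c[i]. A minimal C sketch of how that sustainable bandwidth is measured (the array size, scalar value, and use of clock() here are illustrative choices, not taken from the slides):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 10000000   /* large enough to spill out of the caches */

    int main(void) {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c);
        double q = 3.0;
        for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

        clock_t t0 = clock();
        for (long i = 0; i < N; i++)   /* stride-1 TRIAD: 2 loads + 1 store per iteration */
            a[i] = b[i] + q * c[i];
        double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

        /* three arrays of 8-byte doubles are moved through memory */
        printf("TRIAD bandwidth: %.1f MB/s\n", 3.0 * 8.0 * N / secs / 1e6);
        free(a); free(b); free(c);
        return 0;
    }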

  8. Scaled Computing Offerings • Small Servers • Midrange Servers • Big Servers • Groups of servers • Little groups of big servers • Availability, VLM applications • Big groups of little servers • HPTC (message passing) • Big groups of big servers • “The Gorillas,” e.g. the DoE

  9. A Big Group of Little Servers: TeraCluster, Delivered September 29, 1999

  10. ES40 Nodes (4 per rack) (Identical to CEA-Civil System)

  11. Mid-Range AlphaServers • Compaq DS20e System • 2 CPUs, small PC tower • 5.13 GB/s peak, 1.3 GB/s McCalpin Memory B/W • Crossbar ties together memories, CPUs and I/O processors • No locality issues • Compaq ES40 System • 4 CPUs, bigger cabinet • 5.13 GB/s peak, 2.5 GB/s McCalpin Memory B/W • Double I/O

  12. Tied together with an extremely fast interconnect -- to solve single problems!

  13. What’s a Fast SAN For? • Message Passing Applications • “Cluster of SMPs” predominant model for parallel codes. • “Single System” Characteristics • It’s easier to manage the system: • Single File System • Nearly-Uniform File System Performance

  14. Scaled Computing via Message Passing • Three APIs supported • MPI, shmem, UPC • Selected because of customer demand • Art of the Pragmatic: need to achieve reasonable peak % -- at the expense of ease of programming. • Unexploited Alternatives • Not economically viable: Vectors • Problematic Scaling: Threads on Ultra-scale SMPs
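
Of the three APIs, MPI is the most broadly used; a minimal sketch of the cluster-of-SMPs style it implies -- each rank computes locally, then a collective combines the results across the machine -- buildable with any MPI implementation's mpicc (the reduction itself is only an illustration):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* each rank produces a partial result... */
        double local = (double)rank;

        /* ...and a collective combines them across all SMPs */
        double global = 0.0;
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum over %d ranks = %g\n", nprocs, global);

        MPI_Finalize();
        return 0;
    }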

  15. Quadrics SAN • Switching network • Each connection is 340 MB/s bi-directional. • Multiple virtual circuits and load balancing to minimize contention. Expected < 30% degradation @ 128 SMPs. • Built out of 8-way crossbar chips (shown: 4 up links and 4 down links). • Latency: < 3 µsecs end-end from user app. ~5 µsecs end-end from MPI
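
Latency figures like these are usually quoted from a ping-pong test; a minimal MPI sketch (message size, iteration count, and halving the round trip are the usual conventions, and because it goes through MPI it corresponds to the ~5 µsec figure rather than the raw hardware latency):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank;
        char byte = 0;
        const int iters = 10000;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {          /* rank 0 sends, then waits for the echo */
                MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {   /* rank 1 echoes each message back */
                MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0)   /* one-way latency is half the average round trip */
            printf("latency: %.2f usec\n", (t1 - t0) / iters / 2.0 * 1e6);

        MPI_Finalize();
        return 0;
    }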

  16. Federated Switches • Larger switch configurations -- up to 1024 ports. • Made up of a number of “standard” parts. • Prototype under construction now.

  17. The NSF / Pittsburgh Supercomputing Machine • 682 Quad-Processor Servers • 6 TeraFLOPS • Will be the largest non-defense computer in the world. • Tied together with two rails of Quadrics’ interconnect. • (Multiple adapters and switch layers → more bandwidth and failover) • PSC • Multi-decade relationship with DEC/Compaq. • This business is built on long-term relationships.

  18. Proof Point: Sweep3D • Extremely nasty code • Tons of short messages per timestep • Key code for DoE Customer • And the CEA… • Usually pessimizes across a SAN
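
What makes Sweep3D latency-bound is its wavefront structure: each rank waits for a small boundary face from its upstream neighbor, computes, and forwards another small face downstream, once per angle per timestep. A minimal 1D MPI sketch of that pattern (the angle count and face size are illustrative, not the real code's):

    #include <mpi.h>
    #include <stdio.h>

    #define NANGLES 64   /* sweeps (angles) per timestep -- illustrative */
    #define FACE    16   /* doubles per boundary face -- illustrative    */

    int main(int argc, char **argv) {
        int rank, nprocs;
        double face[FACE] = {0};

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        for (int a = 0; a < NANGLES; a++) {
            /* wait for the incoming face from the upstream neighbor */
            if (rank > 0)
                MPI_Recv(face, FACE, MPI_DOUBLE, rank - 1, a,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            /* (local sweep computation for this angle would go here) */

            /* forward a short face downstream: many small messages per timestep */
            if (rank < nprocs - 1)
                MPI_Send(face, FACE, MPI_DOUBLE, rank + 1, a, MPI_COMM_WORLD);
        }

        if (rank == 0)
            printf("completed %d pipelined sweeps\n", NANGLES);

        MPI_Finalize();
        return 0;
    }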

  19. Time (Big is Bad!)

  20. A Huge Linux Cluster • Sandia C-Plant • > 2,000 1U Alpha Single-CPU Servers • Myricom’s Myrinet Network

  21. ASCI 30T • Compaq is building the world’s largest computer (no qualifier!) for Los Alamos National Laboratory • 374 32-way GS320s • 30 TeraFLOPS • Eight rails of Quadrics • 4GB/s * 374 = 1.5TB/s of Message Passing Bandwidth! • Delivery in 2002. • Linear rack length: 1 km!

  22. The System • 374 Servers * 32 cpus = 11,968 processors • 12 TB of memory • 600TB Usable Storage • Made up of local file systems and a globally accessible file system • Eight rails of Quadrics - PCI-X Elan4 • Management console/front-ends

  23. Modular Server Building Block [diagram: four EV6x CPUs and four memory boards connected through a switch, with an Elan adapter and I/O] • One to four CPUs per building block • Four 1.6GB/sec switched CPU/memory connections simultaneously • 6.4GB/sec aggregate memory bandwidth per building block • Up to 8GB per memory board, 32GB per building block • 4 to 8 PCI buses (28 slots) per building block • No slot trade-offs for CPUs, memory, or I/O

  24. Modular 32 Processor Architecture [diagram: two groups of sixteen EV67 CPUs, each with four memory boards, a switch, and I/O, joined through global ports and a global switch] • Modular growth • 32 EV6x CPUs in single cabinet • Up to 48 CPUs with EV7 upgrade • Mixed CPU speeds supported • Bandwidth, memory, and I/O scale with processors: • 1.6GB/s memory bandwidth/CPU • Up to 256GB of memory • Up to 64 PCI buses • Maintains SMP programming model • Bisection bandwidth = 8 x 1.6 GB/s = 12.8GB/s

  25. Quadrics Roadmap: Adapters • Elan-3 adapter supports 208MB/s on PCI 64/33, but can also run on PCI 64/66, so future systems will get more bandwidth out of the existing network. • For the 30T timeframe, the adapter will support PCI-X • 1GB/s PCI bus

  26. Global File System Overview • Highly available • File Servers organized into groups of eight servers. • File unavailable only if all eight servers in a group are down. • File Servers split across two CFS Domains • Each domain has 32 file servers • Tightly-coupled caching and coherency mechanism within each domain • Proxy Mechanism • Scales coherency across domains (compute and other CFS) • All Nodes see a single coherent, scalable file system. [diagram: file servers and compute servers]

  27. System Availability • File Service • Resilient in the face of server failures • Multiple access paths to all storage • Redundant paths at all levels • Multiple servers • Compute Nodes • Auto configure out and restart partition - RMS • User-controlled resiliency • Checkpoint / Restart

  28. Post-2000 • EV7 • Next-generation Interconnect

  29. EV7: Same Box, New Wrapping Paper • EV68c Core • Minimized Risk • Known performance bottlenecks removed • 16 pending loads + 16 pending victims • 1.5 MB on-chip L2 Cache • 6-way set associative • RAMBUS • On-chip Router for glueless SMP • Only one I/O ASIC effort needed to build a 64-way SMP system!

  30. Single CPU EV7 Feeds and Speeds • 8 RAMBUS straws * 1.6GB/s • 12.8 GB/s peak • Higher ratio of sustainable mem b/w [diagram: EV7 with multi-GB RAMBUS, 1.6+1.6=3.2 GB/s of I/O (AGP, PCI, …), and a 3.2+3.2=6.4 GB/s SMP link on each of N, W, E, S]

  31. Intra-SMP EV7 Feeds and Speeds [diagram: sixteen EV7 chips connected in a 2D torus] • 6.4GB/s SMP links (3.2 in + 3.2 out) • 64-way SMPs supported in WildFire cabinet (upgrade) or native cabinet • 16-64 I/O ports • Arranged in a 2D torus • Bisection b/w = sqrt(CPUs) * 3.2 GB/s * 2
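
As a worked instance of that formula (my arithmetic, not from the slides): the sixteen-chip torus drawn above would give sqrt(16) * 3.2 GB/s * 2 = 25.6 GB/s of bisection bandwidth, and a full 64-way SMP would give sqrt(64) * 3.2 GB/s * 2 = 51.2 GB/s.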

  32. 2002: 30-36TF • Basis of CEA System – First deployment. • 39TF in 256 SMPs mid-02

  33. 2003: 100TF in EV8 • 256SMPs * 390GF = 100TF • Each EV8 is (conservatively) 1.5GHz and 4 FLOPS/cycle = 6.1GF • An SMP is 64 chips, but 256 CPUs!
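
Spelling out the arithmetic behind those numbers (my inference, not stated on the slide): 64 chips * ~6.1 GF per chip gives roughly 390 GF per SMP, and 256 SMPs * 390 GF is roughly 100 TF; the jump from 64 chips to 256 CPUs per SMP presumably reflects EV8's SMT (256 / 64 = 4 hardware threads per chip).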

  34. [Chart: performance roadmap; legible axis labels: 100GF, 1TF, 10TF, 100]

  35. Challenges – there are so many… • Scaling • MPI • Core ops and collectives • Debugging • Optimizing • Multi-Rail Utilizations • Job Scheduling • Performance Debugging • Checkpoint Restart • Reliability/availability • Lots of parts • Lots of bits • Manufacturing/Integration/Installation etc…..
