
Exascale: Power, Cooling, Reliability, and Future Arithmetic


Presentation Transcript


  1. Exascale: Power, Cooling, Reliability, and Future Arithmetic. John Gustafson, HPC User Forum, Seattle, September 2010

  2. The First Cluster: The “Cosmic Cube” Chuck Seitz and Geoffrey Fox developed a $50,000 alternative to Cray vector mainframes in 1984. Motivated by QCD physics problems, but soon found useful for a very wide range of apps. 64 Intel 8086/8087 nodes, 700 watts, 6 cubic feet in total.

  3. But Fox & Seitz did not invent the Cosmic Cube. Stan Lee did. From Wikipedia: “A device created by a secret society of scientists to further their ultimate goal of world conquest.” Tales of Suspense #79, July 1966

  4. Terascale Power Use Today (Not to Scale) A 1-Tflop/s machine today draws about 5 kW in total:
  • Compute: 200 W (1 Tflop/s @ 200 pJ per flop)
  • Memory: 150 W (0.1 byte/flop @ 1.5 nJ per byte)
  • Communication: 100 W (100 pJ of communication per flop)
  • Disk: 100 W (10 TB of disk @ 10 W per TB)
  • Control: 1,500 W (0.2 instruction/flop @ 7.5 nJ per instruction)
  • Power supply loss: 950 W (19%; 81%-efficient power supplies)
  • Heat removal: 2,000 W (40% of total power consumed; all levels, chip to facility)
  Derived from data from S. Borkar, Intel Fellow.
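To double-check the slide's arithmetic, here is a minimal Python sketch (my addition, not part of the deck). The per-operation energies are the ones quoted above; the only assumption is that the 19% supply loss and 40% heat removal are fractions of total facility power, which is the reading under which the components sum to 5 kW.

```python
# Rebuild the terascale power budget from the per-operation energies on the slide.
# Figures come from the slide (derived from S. Borkar's data); only the arithmetic is mine.

FLOPS = 1e12  # 1 Tflop/s

loads_w = {
    "compute": 200e-12 * FLOPS,        # 200 pJ per flop              -> 200 W
    "memory":  0.1 * 1.5e-9 * FLOPS,   # 0.1 byte/flop @ 1.5 nJ/byte  -> 150 W
    "comm":    100e-12 * FLOPS,        # 100 pJ of comm. per flop     -> 100 W
    "disk":    10 * 10.0,              # 10 TB @ 10 W per TB          -> 100 W
    "control": 0.2 * 7.5e-9 * FLOPS,   # 0.2 instr/flop @ 7.5 nJ each -> 1500 W
}

useful = sum(loads_w.values())          # 2050 W of direct load
total = useful / (1 - 0.19 - 0.40)      # PSU loss is 19%, heat removal 40% of facility power
print(f"direct load {useful:.0f} W, facility total {total:.0f} W")
print(f"PSU loss {0.19 * total:.0f} W, heat removal {0.40 * total:.0f} W")
```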

  5. Let’s See that Drawn to Scale… A SIMD accelerator approach gives up Control to reduce watts per Tflop/s. This can work for applications that are very regular and SIMD-like (vectorizable with long vectors).

  6. Energy Cost by Operation Type Notice that 12000 pJ @ 3 GHz = 36 watts! SiCortex’s solution: drop the memory speed, but the performance dropped proportionately. Larger caches actually reduce power consumption.

  7. Energy Cost of a Future HPC Cluster No cost for interconnect? Hmm… But while we’ve been building HPC clusters, Google and Microsoft have been very, very busy…

  8. Cloud computing has already eclipsed HPC for sheer scale
  • Cloud computing means using a remote data center to manage scalable, reliable, on-demand access to applications
  • Provides applications and infrastructure over the Internet
  • “Scalable” here means:
    • Possibly millions of simultaneous users of the app.
    • Exploiting thousand-fold parallelism in the app.
  From Tony Hey, Microsoft

  9. Mega-Data Center Economy of Scale
  • A 50,000-server facility is 6–7x more cost-effective than a 1,000-server facility in key respects
  • Don’t expect a TOP500 score soon.
    • Secrecy?
    • Or… not enough interconnect?
  Over 18 million square feet. Each.
  Each data center is the size of 11.5 football fields.
  Data courtesy of James Hamilton

  10. Computing by the truckload
  • Build racks, cooling, and communication together in a “container”
  • Hookups: power, cooling, and interconnect
  • I estimate each center is already over 70 megawatts… and 20 petaflops total!
  • But: designed for capacity computing, not capability computing
  From Tony Hey, Microsoft

  11. Arming for search engine warfare • It’s starting to look like… • a steel mill!

  12. A steel mill takes ~500 megawatts
  • Self-contained power plant
  • Is this where “economy of scale” will top out for clusters as well?
  • Half the steel mills in the US are abandoned
  • Maybe some should be converted to data centers!

  13. “With great power comes great responsibility.” (Uncle Ben) “Yes, and also some really big heat sinks.” (John G.)

  14. An unpleasant math surprise lurks… 64-bit precision is looking long in the tooth. (gulp!)
  [Chart: hardware word size in bits by year, 1940–2010, against a Moore’s Law trend line: Zuse 22 bits; Univac and IBM 36 bits; CDC 60 bits; Cray 64 bits; most vendors 64 bits; x86 80 bits (stack only).]
  At 1 exaflop/s (10^18 flop/s), 15 decimals don’t last long.
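As a concrete illustration (mine, not the talk's) of why 15 decimals "don't last long": an IEEE 754 double carries a 53-bit significand, so a running sum stops absorbing terms once it is about 10^16 times their size, and a 10^18 flop/s machine performs that many operations in a hundredth of a second.

```python
# A double's 53-bit significand gives roughly 16 significant decimals, so unit
# increments are lost once a value reaches about 2^53 ~ 9.0e15.

big = 1.0e16
print(big + 1.0 == big)      # True: adding 1.0 no longer changes the value

total = 1.0e16
for _ in range(1000):
    total += 1.0
print(total)                 # still 1e+16; all 1000 additions were rounded away
```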

  15. It’s unlikely a code uses the best precision.
  • Too few bits gives unacceptable errors
  • Too many bits wastes memory, bandwidth, joules
  • This goes for integers as well as floating point
  [Chart: the optimum precision, in bits (0–80), plotted for all floating-point operations in an application, with IEEE 754 double precision (64 bits) marked for comparison.]

  16. Ways out of the dilemma…
  • Better hardware support for 128-bit, if only for use as a check
  • Interval arithmetic has promise, if programmers can learn to use it properly (not just apply it to point arithmetic methods); see the sketch below
  • Increasing precision automatically with the memory hierarchy might even allow a return to 32-bit
  • Maybe it’s time to restore Numerical Analysis to the standard curriculum of computer scientists?
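To give the interval-arithmetic bullet some flavor, here is a toy Python sketch (my illustration; the class name and the one-ulp outward rounding are stand-ins, not anything from the talk): every value is carried as a pair of bounds, and every operation widens its result outward so the exact result of the operation is guaranteed to stay enclosed. It needs Python 3.9+ for math.nextafter.

```python
import math  # math.nextafter requires Python 3.9+


class Interval:
    """Toy interval [lo, hi]: every result is widened outward by one float step,
    so the exact mathematical result of each operation stays enclosed."""

    def __init__(self, lo, hi=None):
        self.lo = float(lo)
        self.hi = float(lo if hi is None else hi)

    @staticmethod
    def _outward(lo, hi):
        # Python has no directed-rounding modes; stepping each endpoint one
        # representable float outward is a conservative substitute.
        return Interval(math.nextafter(lo, -math.inf),
                        math.nextafter(hi, math.inf))

    def __add__(self, other):
        return Interval._outward(self.lo + other.lo, self.hi + other.hi)

    def __mul__(self, other):
        p = [self.lo * other.lo, self.lo * other.hi,
             self.hi * other.lo, self.hi * other.hi]
        return Interval._outward(min(p), max(p))

    def __repr__(self):
        return f"[{self.lo:.17g}, {self.hi:.17g}]"


# The width of the result is an explicit, guaranteed bound on rounding error:
total = Interval(0.0)
for _ in range(10):
    total = total + Interval(0.1)
print(total)   # encloses the exact sum of the ten stored values; its width bounds the error
```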

  17. The hierarchical precision idea: more capacity toward RAM; lower latency, more precision, and more range toward the registers.
  • Registers: 16 FP values, 1-cycle latency, 69 significant decimals, range 10^-2525221 to 10^+2525221
  • L1 cache: 4,096 FP values, 6-cycle latency, 33 significant decimals, range 10^-9864 to 10^+9864
  • L2 cache: 2 million FP values, 30-cycle latency, 15 significant decimals, range 10^-307 to 10^+307
  • Local RAM: 2 billion FP values, 200-cycle latency, 7 significant decimals, range 10^-38 to 10^+38
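A minimal NumPy sketch of the idea behind slides 16 and 17 (my illustration, not the talk's design): the bulk data stays in 32-bit storage, while the accumulator closest to the arithmetic is carried in a wider type, which recovers most of the accuracy without paying 64-bit cost in memory capacity and bandwidth.

```python
import math
import numpy as np

# "Local RAM": a large array stored in single precision (about 7 significant decimals).
rng = np.random.default_rng(0)
data = rng.random(1_000_000, dtype=np.float32)

sum32 = float(data.sum(dtype=np.float32))   # accumulate in the storage precision
sum64 = float(data.sum(dtype=np.float64))   # accumulate in a wider "register" type
exact = math.fsum(float(x) for x in data)   # exactly rounded reference sum

print("storage is float32 either way; only the accumulator width differs")
print(f"float32 accumulator error: {abs(sum32 - exact):.2e}")
print(f"float64 accumulator error: {abs(sum64 - exact):.2e}")
```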

  18. Another unpleasant surprise lurking… No more automatic caches?
  • Hardware cache policies are designed to minimize miss rates, which leaves cache utilization low (typically around 20%).
  • Memory transfers will soon be half the power consumed by a computer, and computing is already power-constrained.
  • Software will need to manage memory hierarchies explicitly, and source codes will need to expose memory moves rather than hide them (as sketched below).
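As a toy example of what "exposing memory moves" can look like in source code, here is a blocked matrix multiply in Python (mine, for illustration only; the tile size and staging buffers are arbitrary choices, and real code would be in C, Fortran, or similar). Each operand tile is copied into a small contiguous buffer before use, so the data movement appears as explicit statements the programmer or an autotuner can schedule, rather than as a side effect of a hardware replacement policy.

```python
import numpy as np


def blocked_matmul(a, b, tile=64):
    """Square matrix multiply with explicit tiling and explicit tile staging."""
    n = a.shape[0]
    c = np.zeros((n, n), dtype=a.dtype)
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            # Small working tile of the result, kept "close to the arithmetic".
            c_tile = np.zeros((min(tile, n - i0), min(tile, n - j0)), dtype=a.dtype)
            for k0 in range(0, n, tile):
                # Explicit memory moves: stage the operand tiles in contiguous buffers.
                a_tile = np.ascontiguousarray(a[i0:i0 + tile, k0:k0 + tile])
                b_tile = np.ascontiguousarray(b[k0:k0 + tile, j0:j0 + tile])
                c_tile += a_tile @ b_tile
            c[i0:i0 + c_tile.shape[0], j0:j0 + c_tile.shape[1]] = c_tile
    return c


# Quick correctness check against NumPy's own matmul.
x = np.random.default_rng(1).random((256, 256))
y = np.random.default_rng(2).random((256, 256))
assert np.allclose(blocked_matmul(x, y), x @ y)
```

In NumPy the extra copies buy nothing by themselves; the point is purely structural: the moves between levels of the hierarchy are visible in the source instead of being left to speculative, automatic caching.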

  19. Summary
  • Mega-data-center clusters have eclipsed HPC clusters, but HPC can learn a lot from their methods in getting to exascale.
  • Clusters may grow to the size of steel mills, dictated by economies of scale.
  • We may have to rethink the use of 64-bit flops everywhere, for a variety of reasons.
  • Speculative data motion (that is, automatic caching) reduces operations per watt… it’s on the way out.
