
Algorithms and Architecture


Presentation Transcript


  1. Algorithms and Architecture William D. Gropp, Mathematics and Computer Science, www.mcs.anl.gov/~gropp

  2. Algorithms • What is an algorithm? • A set of instructions to perform a task • How do we evaluate an algorithm? • Correctness • Accuracy • Not an absolute • Efficiency • Relative to current and future machines • How do we measure efficiency? • Often by counting floating point operations • Compare to “peak performance”

  3. Real and Idealized Computer Architectures • Any algorithm assumes an idealized architecture • Common choice: • Floating point work costs time • Data movement is free • Real systems: • Floating point is free (fully overlapped with other operations) • Data movement costs time…a lot of time • Classical complexity analysis for numerical algorithms is no longer correct (more precisely, no longer relevant) • Known since at least BLAS2 and BLAS3

  4. CPU and Memory Performance [Chart: CPU vs. DRAM performance trends over time, with regions marked "floating point relevant" and "floating point irrelevant"]

  5. More Recent Results (or meet Y2K, or correctness first)

  6. Trends in Computer Architecture I • Latency to memory will continue to grow relative to CPU speed • Latency hiding techniques require finding increasing amounts of independent work: Little’s law implies • Number of concurrent memory references = Latency * rate • For 1 reference per cycle, this is already 100–1000 concurrent references
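Little's law here is plain arithmetic: the concurrency the memory system must sustain equals latency times request rate. A minimal sketch of the numbers quoted above (the latency values are illustrative assumptions):

    #include <stdio.h>

    int main(void)
    {
        double rate = 1.0;                    /* memory references issued per cycle */
        double latency[] = {100.0, 1000.0};   /* memory latency in CPU cycles       */

        /* Little's law: outstanding references = latency * rate */
        for (int i = 0; i < 2; i++)
            printf("latency %4.0f cycles -> %4.0f concurrent references needed\n",
                   latency[i], latency[i] * rate);
        return 0;
    }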

  7. Trends in Computer Architecture II • Clock speeds will continue to increase • The rate of clock rate increase has increased recently  • Light travels 3 cm (in a vacuum) in one cycle of a 10 GHz clock • CPU chips won’t be causally connected within a single clock cycle, i.e., a signal will not cross the chip in a single clock cycle • Processors will be parallel!

  8. Trends in Computer Architecture III • Power dissipation problems will force more changes • Current trends imply chips with power densities greater than that of a nuclear reactor • Already a problem: The current issue of Consumer Reports looks at the likelihood of getting a serious burn from your laptop! • Will force new ways to get performance, such as extensive parallelism

  9. Consequences • Gap between memory and processor performance will continue to grow • Data motion will dominate the cost of many (most) calculations • The key is to find a computational cost abstraction that is as simple as possible but no simpler

  10. Architecture Invariants • Performance is determined by memory performance • Memory system design for performance makes system performance less predictable • Fast memories possible, but • Expensive ($) • Large (meters³) • Power hungry (Watts) • Algorithms that don’t take these realities into account may be irrelevant

  11. Node Performance • Laptops now have a peak speed (based on clock rate) of over 1 Gflops (10 Cray-1s!) • Observed (sustained) performance is often a small fraction of peak • Why is the gap between “peak” and “sustained” performance so large? • Let’s look at a simple numerical kernel

  12. Sparse Matrix-Vector Product • Common operation for optimal (in floating-point operations) solution of linear systems • Sample code:
      for row=1,n
        m = i[row] - i[row-1]; sum = 0;
        for k=1,m
          sum += *a++ * x[*j++];
        y[row] = sum;
  • Data structures are a[nnz], j[nnz], i[n], x[n], y[n]
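For reference, a self-contained C version of the same kernel, written against a conventional CSR layout with n+1 row offsets; the 0-based indexing, the offset convention, and the tiny test matrix are illustrative assumptions, not taken from the slide:

    #include <stdio.h>

    /* y = A*x for a CSR matrix: row r uses entries rowptr[r] .. rowptr[r+1]-1 */
    void spmv_csr(int n, const int *rowptr, const int *col,
                  const double *a, const double *x, double *y)
    {
        for (int row = 0; row < n; row++) {
            double sum = 0.0;
            for (int k = rowptr[row]; k < rowptr[row + 1]; k++)
                sum += a[k] * x[col[k]];      /* one multiply-add per stored entry */
            y[row] = sum;
        }
    }

    int main(void)
    {
        /* 3x3 example:  [2 1 0; 0 3 0; 1 0 4] */
        int    rowptr[] = {0, 2, 3, 5};
        int    col[]    = {0, 1, 1, 0, 2};
        double a[]      = {2, 1, 3, 1, 4};
        double x[]      = {1, 2, 3};
        double y[3];

        spmv_csr(3, rowptr, col, a, x, y);
        printf("%g %g %g\n", y[0], y[1], y[2]);   /* expect: 4 6 13 */
        return 0;
    }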

  13. Simple Performance Analysis • Memory motion: • nnz (sizeof(double) + sizeof(int)) + n (2*sizeof(double) + sizeof(int)) • Assume a perfect cache (never load same data twice) • Computation • nnz multiply-add (MA) • Roughly 12 bytes per MA • Typical WS node can move 1-4 bytes/MA • Maximum performance is 8-33% of peak
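The 8-33% bound can be reproduced with the slide's byte counts (8-byte doubles, 4-byte ints) and the matrix dimensions given on the later slides; the 1 and 4 bytes moved per multiply-add are the slide's assumed machine rates:

    #include <stdio.h>

    int main(void)
    {
        double nnz = 5047120.0, n = 90708.0;             /* sizes from slides 15-16        */
        double bytes = nnz * (8 + 4) + n * (2 * 8 + 4);  /* a[] and j[], plus i[], x[], y[] */
        double bytes_per_ma = bytes / nnz;               /* roughly 12 bytes per MA        */

        /* machines that move 1 and 4 bytes per multiply-add */
        for (double bw = 1.0; bw <= 4.0; bw += 3.0)
            printf("%.0f byte/MA: at most %.0f%% of peak\n",
                   bw, 100.0 * bw / bytes_per_ma);
        return 0;
    }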

  14. More Performance Analysis • Instruction Counts: • nnz (2*load-double + load-int + mult-add) +n (load-int + store-double) • Roughly 4 instructions per MA • Maximum performance is 25% of peak (33% if MA overlaps one load/store) • (wide instruction words can help here) • Changing matrix data structure (e.g., exploit small block structure) allows reuse of data in register, eliminating some loads (x and j) • Implementation improvements (tricks) cannot improve on these limits
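A minimal sketch of the data-structure change mentioned in the last bullets, using a fixed 2x2 block size (block size, array names, and layout are illustrative assumptions in the spirit of block-CSR formats): each loaded x value and column index is reused across both rows of a block, removing some of the x and j loads counted above.

    /* 2x2 block-CSR: bptr[] holds block-row offsets, bcol[] block column
       indices, and ba[] stores each 2x2 block contiguously in row-major order. */
    void spmv_bcsr2(int nblockrows, const int *bptr, const int *bcol,
                    const double *ba, const double *x, double *y)
    {
        for (int br = 0; br < nblockrows; br++) {
            double s0 = 0.0, s1 = 0.0;                  /* the two rows of this block row */
            for (int k = bptr[br]; k < bptr[br + 1]; k++) {
                const double *b  = ba + 4 * k;
                double        x0 = x[2 * bcol[k]];      /* each x value and column index  */
                double        x1 = x[2 * bcol[k] + 1];  /* is loaded once, used for       */
                s0 += b[0] * x0 + b[1] * x1;            /* two rows                       */
                s1 += b[2] * x0 + b[3] * x1;
            }
            y[2 * br]     = s0;
            y[2 * br + 1] = s1;
        }
    }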

  15. Realistic Measures of Peak Performance: Sparse Matrix-Vector Product One vector; matrix size m = 90,708; nonzero entries nz = 5,047,120 (Thanks to Dinesh Kaushik, and to ORNL and ANL for compute time)

  16. Realistic Measures of Peak Performance: Sparse Matrix-Vector Product One vector; matrix size m = 90,708; nonzero entries nz = 5,047,120

  17. What About CPU-Bound Operations? • Dense Matrix-Matrix Product • Most studied numerical program by compiler writers • Core of some important applications • More importantly, the core operation in High Performance Linpack • Benchmark used to “rate” the top 500 fastest systems • Should give optimal performance…

  18. The Compiler Will Handle It (?) [Chart: dense matrix-matrix product performance, hand-tuned code (from ATLAS) vs. compiler-generated code] • Large gap between natural code and specialized code • Enormous effort required to get good performance

  19. Performance for Real Applications • Dense matrix-matrix example shows that even for well-studied, compute-bound kernels, compiler-generated code achieves only a small fraction of available performance • “Fortran” code uses “natural” loops, i.e., what a user would write for most code • Others use multi-level blocking, careful instruction scheduling, etc. (a simple contrast is sketched after this slide) • Algorithm design also needs to take into account the capabilities of the system, not just the hardware • Example: Cache-Oblivious Algorithms (http://supertech.lcs.mit.edu/cilk/papers/abstracts/abstract4.html)
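To make the contrast concrete, here is a sketch (not from the talk) of the "natural" triple loop next to a one-level cache-blocked version; tuned libraries such as ATLAS go much further, with register blocking, careful instruction scheduling, and prefetching. Both routines accumulate into C, which is assumed to be zeroed first; N and NB are illustrative values with NB dividing N.

    #define N  512     /* matrix order (illustrative)     */
    #define NB 64      /* cache block size (illustrative) */

    /* "Natural" code: what a user would typically write */
    void matmul_naive(double C[N][N], const double A[N][N], const double B[N][N])
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < N; k++)
                    C[i][j] += A[i][k] * B[k][j];
    }

    /* One level of blocking: work on NB x NB tiles so the tiles of A, B, and C
       being combined stay resident in cache while they are reused. */
    void matmul_blocked(double C[N][N], const double A[N][N], const double B[N][N])
    {
        for (int ii = 0; ii < N; ii += NB)
            for (int jj = 0; jj < N; jj += NB)
                for (int kk = 0; kk < N; kk += NB)
                    for (int i = ii; i < ii + NB; i++)
                        for (int j = jj; j < jj + NB; j++)
                            for (int k = kk; k < kk + NB; k++)
                                C[i][j] += A[i][k] * B[k][j];
    }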

  20. Challenges in Creating a Performance Model Based on Memory Accesses • Different levels of the memory hierarchies have significantly different performance • Cache behavior sensitive to details of data layout • Still no good calculus for predicting performance [Charts: STREAM performance in MB/s versus data size; interleaved U/V data layout causes data to be displaced while still needed for later steps]

  21. Parallel Performance Issues • Coordination of accesses • Ensuring that two (or more) processes/threads do not access data before it is ready • Related to main (not cache) memory behavior and timing • Intrinsically nonscalable operations • Inner products • Opportunity for algorithm designers • Use nonblocking or split operations (a sketch follows this slide) • Arrange algorithms so that other operations can take place while inner product is being assembled
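One way to realize the "nonblocking or split" inner product today is MPI-3's nonblocking reduction (which postdates this talk); the sketch below is an illustration, not part of the original slides: it starts the global sum of a local dot product, overlaps it with independent work, and waits only when the value is needed.

    #include <mpi.h>

    /* Split inner product: start the global reduction, overlap it with other
       work, and complete it only when the result is actually required. */
    double split_dot(const double *x, const double *y, int n, MPI_Comm comm)
    {
        double local = 0.0, global;
        for (int i = 0; i < n; i++)
            local += x[i] * y[i];

        MPI_Request req;
        MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm, &req);

        /* ... other useful work that does not depend on the inner product ... */

        MPI_Wait(&req, MPI_STATUS_IGNORE);
        return global;
    }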

  22. Is Parallel Computing Hard? • Conjecture: • The difficulty in writing high-performance code is directly related to the ratio of memory access times • Reason: hiding latency requires blocking, prefetch, and other special techniques • Parallel Computing now the easy part: • (Easy only relative to performance programming for uniprocessors)

  23. Aggregate Parallel Performance of PETSc-FUN3D IBM Power 4: 512 Processors (1.3 GHz); Pentium 4 Xeon Cluster: 250 Processors (2.4 GHz)

  24. Is Performance Everything? “In August 1991, the Sleipner A, an oil and gas platform built in Norway for operation in the North Sea, sank during construction. The total economic loss amounted to about $700 million. After investigation, it was found that the failure of the walls of the support structure resulted from a serious error in the finite element analysis of the linear elastic model.” (http://www.ima.umn.edu/~arnold/disasters/sleipner.html)

  25. Correctness and Accuracy • Many current algorithms designed to balance performance and accuracy • These choices often made when computers were 10^6 times slower than they are now • Is it time to re-examine these choices, particularly for applications that are now done on laptops?

  26. New (and Not So New) Architectures • Commodity high-performance processors • Conventional desk/lap tops • Game systems • Cell phones (voice recognition, games) • DARPA High Productivity Computing Systems project • Vectors and Streams

  27. Typical 6-Gflop Commodity System (circa 2000) • Over 100 (32-bit) Petaflops (0.1 Exaflops) already delivered!

  28. PIM-based Node [Diagram: conventional CPUs with caches on a memory interconnect handle setup and “conventional” (scalar) code; PIM CPUs handle memory-intensive operations; a network connection links nodes] • Homogeneous • All memory the same • Heterogeneous • Different memory access costs • Different processors(?)

  29. IBM BlueGene/L • Processor-rich system • High performance, memory-oriented interconnect • Processors near memory can reduce the need to move data

  30. (Somewhat) Exotic Solutions:HTMT • Radically new technologies: • Superconducting (RSFQ) logic (240+ GHz) • Multithreaded/stranded execution model • Multilevel PIM-enhanced memory • Optical networks • Optical (holographic) main store • http://htmt.caltech.edu/ • Still a (deep) memory hierarchy • Elements finding their way into new innovative architectures

  31. “Conventional” Supercomputers: Earth Simulator [Diagram: processor nodes PN #636–#639 connected through crossbar switches XSW #126–#127]

  32. Algorithms • Exploit problem behavior at different scales • Multigrid • Domain Decomposition • Generalizes multigrid (or multigrid generalizes DD) • Provides a spectrum of robust, optimal methods for a wide range of problems • Nonlinear versions hold great promise • Continuation • Divide and conquer • Multipole and Wavelets • Cache-sensitive algorithms • See Karp in SIAM Review 1996 • Even the Mathematicians know about this now (McGeoch, AMS Notices March 2001)

  33. Conclusions • Performance models should count data motion, not flops • Computers will continue to have multiple levels of memory hierarchy • Algorithms should exploit them • Computers will be parallel • Algorithms can make effective use of greater adaptivity to give better time-to-solution and accuracy • Denial is not a solution
