
Memory-Aware Compilation



Presentation Transcript


  1. Memory-Aware Compilation Philip Sweany 10/20/2011

  2. Architectural Diversity
  • “Simple” load/store
  • Instruction-level parallel
  • Heterogeneous multi-core parallelism
  • “Traditional” parallel architectures
    • Vector
    • MIMD
    • Many core
  • Next???

  3. Load/Store Architecture
  • All arithmetic must take place in registers
  • Cache hits typically 3-5 cycles
  • Cache misses more like 100 cycles
  • The compiler tries to keep scalars in registers
  • Graph-coloring register assignment
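
As a concrete illustration of the last bullet, here is a minimal Chaitin-style simplify/select register-assignment sketch in C. The interference graph, the five virtual registers, and the three physical registers are invented for the example; this is not the allocator used in the talk, and a real allocator would also handle spilling and coalescing.

#include <stdio.h>

#define NVARS 5   /* virtual registers in the example */
#define NREGS 3   /* available physical registers (illustrative) */

/* interferes[i][j] is 1 when virtual registers i and j are live at the same time */
int interferes[NVARS][NVARS] = {
    /* a  b  c  d  e */
    {  0, 1, 1, 0, 0 },   /* a */
    {  1, 0, 1, 1, 0 },   /* b */
    {  1, 1, 0, 1, 1 },   /* c */
    {  0, 1, 1, 0, 1 },   /* d */
    {  0, 0, 1, 1, 0 },   /* e */
};

int main(void)
{
    int stack[NVARS], removed[NVARS] = {0}, color[NVARS];
    int top = 0;

    /* simplify: repeatedly remove a node with fewer than NREGS neighbours;
       a real allocator would spill to memory when no such node exists */
    for (int n = 0; n < NVARS; n++) {
        int pick = -1;
        for (int i = 0; i < NVARS && pick < 0; i++) {
            if (removed[i]) continue;
            int deg = 0;
            for (int j = 0; j < NVARS; j++)
                if (!removed[j] && interferes[i][j]) deg++;
            if (deg < NREGS) pick = i;
        }
        if (pick < 0) { printf("spill needed\n"); return 1; }
        removed[pick] = 1;
        stack[top++] = pick;
    }

    /* select: pop nodes and give each the lowest colour not used by an
       already-coloured interfering neighbour */
    while (top > 0) {
        int i = stack[--top];
        int used[NREGS] = {0};
        removed[i] = 0;
        for (int j = 0; j < NVARS; j++)
            if (!removed[j] && interferes[i][j]) used[color[j]] = 1;
        for (int r = 0; r < NREGS; r++)
            if (!used[r]) { color[i] = r; break; }
        printf("v%d -> R%d\n", i, color[i]);
    }
    return 0;
}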

  4. Instruction-Level Parallelism (ILP)
  • ILP architectures include:
    • Multiple pipelined functional units
    • Static or dynamic scheduling
  • The compiler schedules instructions to reduce execution time:
    • Local scheduling
    • Global scheduling
    • Software pipelining

  5. “Typical” ILP Architecture
  • 8 “generic” pipelined functional units
  • Timing:
    • Register operations require 1 cycle
    • Memory operations (loads) require 5 cycles on a hit or 50 cycles on a miss, pipelined of course
    • Stores are buffered, so they don’t require time directly

  6. Matrix Multiply
  Matrix_multiply(a, b, c : int[4][4]):
      for i from 0 to 3
          for j from 0 to 3
              c[i][j] = 0
              for k from 0 to 3
                  c[i][j] += a[i][k] * b[k][j]
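
For reference, a directly runnable C version of the slide’s pseudocode. The 4×4 size and int element type come from the slide; the driver in main (multiplying by the identity) is added only to make it executable.

#include <stdio.h>

#define N 4

/* c = a * b for N x N integer matrices, as in the slide-6 pseudocode */
void matrix_multiply(int a[N][N], int b[N][N], int c[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            c[i][j] = 0;
            for (int k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
        }
}

int main(void)
{
    int a[N][N], b[N][N], c[N][N];

    /* fill a with i + j and b with the identity, so c should equal a */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = i + j;
            b[i][j] = (i == j);
        }
    matrix_multiply(a, b, c);
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++)
            printf("%d ", c[i][j]);
        printf("\n");
    }
    return 0;
}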

  7. Single Loop Schedule (ILP)
  • t1 = a[i][k]  #  t2 = b[k][j]
  • nop
  • nop
  • nop
  • t3 = t1 * t2
  • t0 += t3
  (with t0 = c[i][j] before the loop and c[i][j] = t0 after the loop)
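
A schedule of this shape can be produced mechanically. The sketch below is a tiny greedy list scheduler over the four operations of the loop body, using the slide-5 latencies (5-cycle loads, 1-cycle register operations, 8 units). It is an illustration only, not the scheduler described in the talk; with a 5-cycle load latency it leaves four empty cycles between the loads and the multiply, giving a 7-cycle iteration, which is consistent with the factor-of-7 speedup claimed on slide 9.

#include <stdio.h>

#define NOPS   4   /* operations in the loop body */
#define NUNITS 8   /* "generic" pipelined functional units (slide 5) */

/* One operation: its text, result latency, and data-flow predecessors. */
typedef struct {
    const char *text;
    int latency;
    int ndeps;
    int deps[NOPS];
} Op;

int main(void)
{
    /* the four operations of the slide-7 loop body */
    Op ops[NOPS] = {
        { "t1 = a[i][k]", 5, 0, {0} },     /* load, 5-cycle hit */
        { "t2 = b[k][j]", 5, 0, {0} },     /* load, 5-cycle hit */
        { "t3 = t1 * t2", 1, 2, {0, 1} },  /* needs both loads */
        { "t0 = t0 + t3", 1, 1, {2} },     /* needs the multiply */
    };
    int issue[NOPS], done[NOPS], scheduled = 0, cycle = 0;

    for (int i = 0; i < NOPS; i++)
        issue[i] = -1;

    /* greedy list scheduling: each cycle, issue every operation whose
       predecessors have completed, until the functional units run out */
    while (scheduled < NOPS) {
        int used = 0;
        for (int i = 0; i < NOPS && used < NUNITS; i++) {
            if (issue[i] >= 0)
                continue;                  /* already issued */
            int ready = 1;
            for (int d = 0; d < ops[i].ndeps; d++) {
                int p = ops[i].deps[d];
                if (issue[p] < 0 || done[p] > cycle)
                    ready = 0;             /* a predecessor is not finished */
            }
            if (ready) {
                issue[i] = cycle;
                done[i] = cycle + ops[i].latency;
                printf("cycle %d: %s\n", cycle, ops[i].text);
                scheduled++;
                used++;
            }
        }
        cycle++;
    }
    return 0;
}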

  8. Software Pipelining
  • Software pipelining can “cover” any latency, removing the nops from the single-loop schedule, IFF conditions are “right.” They are right for matrix multiply, so …

  9. Software Pipelined Matrix Mult
  All of the operations can be issued in a single cycle, speeding up the loop by a factor of 7:
  t1 = a[i][k],  t2 = b[k][j],  t3 = t1[-5] * t2[-5],  t0 += t3
  (t1[-5] and t2[-5] denote the values loaded five iterations earlier, so the multiply never waits on a load.)
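
One way to picture the pipelined kernel in ordinary C: treat the inner k-loop as a dot product and delay each multiply-accumulate until five iterations after its loads are issued, the t1[-5] / t2[-5] above. The prologue/kernel/epilogue structure below is a hand-written sketch of that shape, not output from the modulo scheduler discussed in the talk.

#include <stdio.h>

#define LAT 5   /* assumed load latency, as in the slide-5 timing model */

/* Software-pipelined dot product: the multiply-accumulate in iteration k
   uses the values loaded LAT iterations earlier. */
int dot_pipelined(const int *a, const int *b, int n)
{
    int abuf[LAT], bbuf[LAT];
    int sum = 0;
    int k, start;

    /* prologue: issue the first LAT load pairs to fill the pipeline */
    for (k = 0; k < LAT && k < n; k++) {
        abuf[k % LAT] = a[k];
        bbuf[k % LAT] = b[k];
    }
    /* kernel: one load pair and one multiply-accumulate per iteration;
       the MAC consumes the values loaded in iteration k - LAT */
    for (k = LAT; k < n; k++) {
        sum += abuf[k % LAT] * bbuf[k % LAT];
        abuf[k % LAT] = a[k];
        bbuf[k % LAT] = b[k];
    }
    /* epilogue: drain the loads that have no following kernel iteration */
    start = n > LAT ? n - LAT : 0;
    for (k = start; k < n; k++)
        sum += abuf[k % LAT] * bbuf[k % LAT];

    return sum;
}

int main(void)
{
    int a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    int b[8] = {1, 1, 1, 1, 1, 1, 1, 1};

    printf("%d\n", dot_pipelined(a, b, 8));   /* prints 36 */
    return 0;
}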

  10. Improved Software Pipelining?
  • Unroll-and-jam on nested loops can significantly shorten execution time
  • Using a cache-reuse model can give better schedules than assuming all cache accesses are hits, and can reduce register requirements compared with assuming all accesses are misses
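
A sketch of what unroll-and-jam does to the slide-6 loop nest: the i loop is unrolled by 2 and the two copies of the k loop are jammed together, so each load of b[k][j] feeds two multiply-accumulates. This is illustrative only; the reported results came from unroll-and-jam on FORTRAN loops combined with modulo scheduling, which this plain C fragment does not reproduce. It can serve as a drop-in replacement for the matrix_multiply function shown earlier (N assumed even).

#define N 4

/* unroll-and-jam: unroll i by 2, jam the copies into the k loop */
void matrix_multiply_uj(int a[N][N], int b[N][N], int c[N][N])
{
    for (int i = 0; i < N; i += 2)
        for (int j = 0; j < N; j++) {
            int t0 = 0, t1 = 0;
            for (int k = 0; k < N; k++) {
                int bkj = b[k][j];          /* loaded once, used twice */
                t0 += a[i][k]     * bkj;
                t1 += a[i + 1][k] * bkj;
            }
            c[i][j]     = t0;
            c[i + 1][j] = t1;
        }
}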

  11. Results of Software Pipelining Improvements
  • Using unroll-and-jam on 26 FORTRAN nested loops before performing modulo scheduling:
    • Decreased loop execution time by up to 94.2%, and by 56.9% on average
    • Greatly increased register requirements, often by a factor of 5

  12. Results of Software Pipelining Improvements
  • Using a simple cache-reuse model, our modulo scheduler:
    • Improved execution time by roughly 11% over an all-hit assumption, with little change in register usage
    • Used 17.9% fewer registers than an all-miss assumption, while generating 8% slower code

  13. “OMAP” Resources
  • Chiron
  • Tesla
  • Shared Memory
  • FPGA
  • Ducati
  • Multi-CPU

  14. Optimizing Compilers for Modern Architectures
  Syllabus (Allen and Kennedy, Preface)

  15. Dependence-Based Compilation
  • Vectorization and parallelization require a deeper analysis than optimization for scalar machines
    • Must be able to determine whether two accesses to the same array might be to the same location
  • Dependence is the theory that makes this possible
    • There is a dependence between two statements if they might access the same location, there is a path from one to the other, and one access is a write
  • Dependence has other applications:
    • Memory hierarchy management: restructuring programs to make better use of cache and registers
      • Includes input dependences
    • Scheduling of instructions
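
A small, made-up C example of the distinction this slide is drawing: the first loop carries a flow (true) dependence between array accesses, so its iterations cannot safely run in parallel; the second has no loop-carried dependence and can be vectorized or parallelized.

/* loop-carried flow dependence: iteration i reads a[i-1], written by iteration i-1 */
void carried(double *a, const double *b, int n)
{
    for (int i = 1; i < n; i++)
        a[i] = a[i - 1] + b[i];
}

/* no loop-carried dependence: each iteration touches only a[i] */
void independent(double *a, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = 2.0 * a[i];
}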

  16. Syllabus I
  • Introduction
    • Parallel and vector architectures. The problem of parallel programming. Bernstein’s conditions and the role of dependence. Compilation for parallel machines and automatic detection of parallelism.
  • Dependence Theory and Practice
    • Fundamentals, types of dependences. Testing for dependence: separable, GCD, and Banerjee tests. Exact dependence testing. Construction of direction and distance vectors.
  • Preliminary Transformations
    • Loop normalization, scalar data-flow analysis, induction-variable substitution, scalar renaming.
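
As a sketch of one of the dependence tests named above, here is the classic GCD test in C: for subscripts c1*i + d1 and c2*i + d2 of the same array in a loop, a dependence can exist only if gcd(c1, c2) divides d2 - d1. The function names and example subscripts are invented for illustration, and the test is conservative: a “maybe” answer does not prove a dependence exists.

#include <stdio.h>
#include <stdlib.h>

/* greatest common divisor */
static int gcd(int a, int b) { return b == 0 ? abs(a) : gcd(b, a % b); }

/* GCD test: may references a[c1*i + d1] and a[c2*i + d2] depend? */
static int gcd_test_may_depend(int c1, int d1, int c2, int d2)
{
    int g = gcd(c1, c2);
    if (g == 0)                      /* both subscripts are loop-invariant */
        return d1 == d2;
    return (d2 - d1) % g == 0;
}

int main(void)
{
    /* a[2*i] vs a[2*i + 1]: gcd(2,2)=2 does not divide 1 -> independent */
    printf("%d\n", gcd_test_may_depend(2, 0, 2, 1));   /* prints 0 */
    /* a[4*i] vs a[2*i + 2]: gcd(4,2)=2 divides 2 -> dependence possible */
    printf("%d\n", gcd_test_may_depend(4, 0, 2, 2));   /* prints 1 */
    return 0;
}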

  17. Syllabus II
  • Fine-Grain Parallel Code Generation
    • Loop distribution and its safety. The Kuck vectorization principle. The layered vector code-generation algorithm and its complexity. Loop interchange.
  • Coarse-Grain Parallel Code Generation
    • Loop interchange. Loop skewing. Scalar and array expansion. Forward substitution. Alignment. Code replication. Array renaming. Node splitting. Pattern recognition. Threshold analysis. Symbolic dependence tests. Parallel code generation and its problems.
  • Control Dependence
    • Types of branches. If conversion. Control dependence. Program dependence graph.
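
A minimal illustration of loop interchange, one of the transformations listed above (the example code is illustrative, not from the course): for a row-major C array, interchanging the nest so that j is innermost gives unit-stride accesses in the inner loop, improving cache reuse. The interchange is legal here because the nest carries no dependence.

#define N 64

/* before: the inner loop strides through memory by N elements */
void add_ji(double x[N][N], double y[N][N])
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            x[i][j] += y[i][j];
}

/* after interchange: the inner loop walks contiguous elements */
void add_ij(double x[N][N], double y[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            x[i][j] += y[i][j];
}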

  18. Syllabus III
  • Memory Hierarchy Management
    • The use of dependence in scalar register allocation and management of the cache memory hierarchy.
  • Scheduling for Superscalar and Parallel Machines
    • Role of dependence. List scheduling. Software pipelining. Work scheduling for parallel systems. Guided self-scheduling.
  • Interprocedural Analysis and Optimization
    • Side-effect analysis, constant propagation, and alias analysis. Flow-insensitive and flow-sensitive problems. Side effects to arrays. Inline substitution, linkage tailoring, and procedure cloning. Management of interprocedural analysis and optimization.
  • Compilation of Other Languages
    • C, Verilog, Fortran 90, HPF.

  19. What is High Performance Computing?
  • What architectural models are there?
  • What system software is required? Standard?
  • How should we evaluate high performance?
    • Run time?
    • Run time × machine cost?
    • Speedup?
    • Efficient use of CPU resources?
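
For the “speedup” and “efficient use of CPU resources” questions, the usual textbook definitions (standard formulas, not from the slides), where T_1 is the best serial run time and T_p the run time on p processors:

S(p) = T_1 / T_p, \quad E(p) = S(p) / p

Linear speedup corresponds to S(p) = p, i.e. efficiency E(p) = 1.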
