
Memory-Aware Compilation



Presentation Transcript


  1. Memory-Aware Compilation Philip Sweany 10/20/2011

  2. Architectural Diversity
  • “Simple” load/store
  • Instruction-level parallel
  • Heterogeneous multi-core parallelism
  • “Traditional” parallel architectures
    • Vector
    • MIMD
    • Many core
  • Next???

  3. Load/Store Architecture
  • All arithmetic must take place in registers
  • Cache hits typically 3-5 cycles
  • Cache misses more like 100 cycles
  • The compiler tries to keep scalars in registers
  • Graph-coloring register assignment
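
As a concrete illustration of the last bullet, here is a minimal Chaitin-style simplify/select register-assignment sketch in C. The interference graph, the five virtual registers, and the three physical registers are invented for the example; this is not the allocator used in the talk, and a real allocator would also handle spilling and coalescing.

#include <stdio.h>

#define NVARS 5   /* virtual registers in the example */
#define NREGS 3   /* available physical registers (illustrative) */

/* interferes[i][j] is 1 when virtual registers i and j are live at the same time */
int interferes[NVARS][NVARS] = {
    /* a  b  c  d  e */
    {  0, 1, 1, 0, 0 },   /* a */
    {  1, 0, 1, 1, 0 },   /* b */
    {  1, 1, 0, 1, 1 },   /* c */
    {  0, 1, 1, 0, 1 },   /* d */
    {  0, 0, 1, 1, 0 },   /* e */
};

int main(void)
{
    int stack[NVARS], removed[NVARS] = {0}, color[NVARS];
    int top = 0;

    /* simplify: repeatedly remove a node with fewer than NREGS neighbours;
       a real allocator would spill to memory when no such node exists */
    for (int n = 0; n < NVARS; n++) {
        int pick = -1;
        for (int i = 0; i < NVARS && pick < 0; i++) {
            if (removed[i]) continue;
            int deg = 0;
            for (int j = 0; j < NVARS; j++)
                if (!removed[j] && interferes[i][j]) deg++;
            if (deg < NREGS) pick = i;
        }
        if (pick < 0) { printf("spill needed\n"); return 1; }
        removed[pick] = 1;
        stack[top++] = pick;
    }

    /* select: pop nodes and give each the lowest colour not used by an
       already-coloured interfering neighbour */
    while (top > 0) {
        int i = stack[--top];
        int used[NREGS] = {0};
        removed[i] = 0;
        for (int j = 0; j < NVARS; j++)
            if (!removed[j] && interferes[i][j]) used[color[j]] = 1;
        for (int r = 0; r < NREGS; r++)
            if (!used[r]) { color[i] = r; break; }
        printf("v%d -> R%d\n", i, color[i]);
    }
    return 0;
}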

  4. Instruction-Level Parallelism (ILP)
  • ILP architectures include:
    • Multiple pipelined functional units
    • Static or dynamic scheduling
  • The compiler schedules instructions to reduce execution time:
    • Local scheduling
    • Global scheduling
    • Software pipelining

  5. “Typical” ILP Architecture
  • 8 “generic” pipelined functional units
  • Timing:
    • Register operations require 1 cycle
    • Memory operations (loads) require 5 cycles on a hit or 50 cycles on a miss, pipelined of course
    • Stores are buffered, so they don’t require time directly

  6. Matrix Multiply
  Matrix_multiply(a, b, c : int[4][4]):
      for i from 0 to 3
          for j from 0 to 3
              c[i][j] = 0
              for k from 0 to 3
                  c[i][j] += a[i][k] * b[k][j]
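
For reference, a directly runnable C version of the slide’s pseudocode. The 4×4 size and int element type come from the slide; the driver in main (multiplying by the identity) is added only to make it executable.

#include <stdio.h>

#define N 4

/* c = a * b for N x N integer matrices, as in the slide-6 pseudocode */
void matrix_multiply(int a[N][N], int b[N][N], int c[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            c[i][j] = 0;
            for (int k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
        }
}

int main(void)
{
    int a[N][N], b[N][N], c[N][N];

    /* fill a with i + j and b with the identity, so c should equal a */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = i + j;
            b[i][j] = (i == j);
        }
    matrix_multiply(a, b, c);
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++)
            printf("%d ", c[i][j]);
        printf("\n");
    }
    return 0;
}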

  7. Single Loop Schedule (ILP)
  • t1 = a[i][k]  #  t2 = b[k][j]
  • nop
  • nop
  • nop
  • t3 = t1 * t2
  • t0 += t3
  (with t0 = c[i][j] before the loop and c[i][j] = t0 after the loop)
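
A schedule of this shape can be produced mechanically. The sketch below is a tiny greedy list scheduler over the four operations of the loop body, using the slide-5 latencies (5-cycle loads, 1-cycle register operations, 8 units). It is an illustration only, not the scheduler described in the talk; with a 5-cycle load latency it leaves four empty cycles between the loads and the multiply, giving a 7-cycle iteration, which is consistent with the factor-of-7 speedup claimed on slide 9.

#include <stdio.h>

#define NOPS   4   /* operations in the loop body */
#define NUNITS 8   /* "generic" pipelined functional units (slide 5) */

/* One operation: its text, result latency, and data-flow predecessors. */
typedef struct {
    const char *text;
    int latency;
    int ndeps;
    int deps[NOPS];
} Op;

int main(void)
{
    /* the four operations of the slide-7 loop body */
    Op ops[NOPS] = {
        { "t1 = a[i][k]", 5, 0, {0} },     /* load, 5-cycle hit */
        { "t2 = b[k][j]", 5, 0, {0} },     /* load, 5-cycle hit */
        { "t3 = t1 * t2", 1, 2, {0, 1} },  /* needs both loads */
        { "t0 = t0 + t3", 1, 1, {2} },     /* needs the multiply */
    };
    int issue[NOPS], done[NOPS], scheduled = 0, cycle = 0;

    for (int i = 0; i < NOPS; i++)
        issue[i] = -1;

    /* greedy list scheduling: each cycle, issue every operation whose
       predecessors have completed, until the functional units run out */
    while (scheduled < NOPS) {
        int used = 0;
        for (int i = 0; i < NOPS && used < NUNITS; i++) {
            if (issue[i] >= 0)
                continue;                  /* already issued */
            int ready = 1;
            for (int d = 0; d < ops[i].ndeps; d++) {
                int p = ops[i].deps[d];
                if (issue[p] < 0 || done[p] > cycle)
                    ready = 0;             /* a predecessor is not finished */
            }
            if (ready) {
                issue[i] = cycle;
                done[i] = cycle + ops[i].latency;
                printf("cycle %d: %s\n", cycle, ops[i].text);
                scheduled++;
                used++;
            }
        }
        cycle++;
    }
    return 0;
}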

  8. Software Pipelining
  • Software pipelining can “cover” any latency, removing the nops from the single-loop schedule, IFF conditions are “right.” They are right for matrix multiply, so …

  9. Software Pipelined Matrix Mult
  All of the operations can be issued in a single cycle, speeding up the loop by a factor of 7:
  t1 = a[i][k],  t2 = b[k][j],  t3 = t1[-5] * t2[-5],  t0 += t3
  (t1[-5] and t2[-5] denote the values loaded five iterations earlier, so the multiply never waits on a load.)
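
One way to picture the pipelined kernel in ordinary C: treat the inner k-loop as a dot product and delay each multiply-accumulate until five iterations after its loads are issued, the t1[-5] / t2[-5] above. The prologue/kernel/epilogue structure below is a hand-written sketch of that shape, not output from the modulo scheduler discussed in the talk.

#include <stdio.h>

#define LAT 5   /* assumed load latency, as in the slide-5 timing model */

/* Software-pipelined dot product: the multiply-accumulate in iteration k
   uses the values loaded LAT iterations earlier. */
int dot_pipelined(const int *a, const int *b, int n)
{
    int abuf[LAT], bbuf[LAT];
    int sum = 0;
    int k, start;

    /* prologue: issue the first LAT load pairs to fill the pipeline */
    for (k = 0; k < LAT && k < n; k++) {
        abuf[k % LAT] = a[k];
        bbuf[k % LAT] = b[k];
    }
    /* kernel: one load pair and one multiply-accumulate per iteration;
       the MAC consumes the values loaded in iteration k - LAT */
    for (k = LAT; k < n; k++) {
        sum += abuf[k % LAT] * bbuf[k % LAT];
        abuf[k % LAT] = a[k];
        bbuf[k % LAT] = b[k];
    }
    /* epilogue: drain the loads that have no following kernel iteration */
    start = n > LAT ? n - LAT : 0;
    for (k = start; k < n; k++)
        sum += abuf[k % LAT] * bbuf[k % LAT];

    return sum;
}

int main(void)
{
    int a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    int b[8] = {1, 1, 1, 1, 1, 1, 1, 1};

    printf("%d\n", dot_pipelined(a, b, 8));   /* prints 36 */
    return 0;
}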

  10. Improved Software Pipelining?
  • Unroll-and-jam on nested loops can significantly shorten execution time
  • Using a cache-reuse model can give better schedules than assuming all cache accesses are hits, and can reduce register requirements compared with assuming all accesses are misses
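
A sketch of what unroll-and-jam does to the slide-6 loop nest: the i loop is unrolled by 2 and the two copies of the k loop are jammed together, so each load of b[k][j] feeds two multiply-accumulates. This is illustrative only; the reported results came from unroll-and-jam on FORTRAN loops combined with modulo scheduling, which this plain C fragment does not reproduce. It can serve as a drop-in replacement for the matrix_multiply function shown earlier (N assumed even).

#define N 4

/* unroll-and-jam: unroll i by 2, jam the copies into the k loop */
void matrix_multiply_uj(int a[N][N], int b[N][N], int c[N][N])
{
    for (int i = 0; i < N; i += 2)
        for (int j = 0; j < N; j++) {
            int t0 = 0, t1 = 0;
            for (int k = 0; k < N; k++) {
                int bkj = b[k][j];          /* loaded once, used twice */
                t0 += a[i][k]     * bkj;
                t1 += a[i + 1][k] * bkj;
            }
            c[i][j]     = t0;
            c[i + 1][j] = t1;
        }
}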

  11. Results of Software Pipelining Improvements
  • Using unroll-and-jam on 26 FORTRAN nested loops before performing modulo scheduling:
    • Decreased loop execution time by up to 94.2%, and by 56.9% on average
    • Greatly increased register requirements, often by a factor of 5

  12. Results of Software Pipelining Improvements
  • Using a simple cache-reuse model, our modulo scheduler:
    • Improved execution time by roughly 11% over an all-hit assumption, with little change in register usage
    • Used 17.9% fewer registers than an all-miss assumption, while generating 8% slower code

  13. “OMAP” Resources
  • Chiron
  • Tesla
  • Shared Memory
  • FPGA
  • Ducati
  • Multi-CPU

  14. Optimizing Compilers for Modern Architectures
  Syllabus (Allen and Kennedy, Preface)

  15. Dependence-Based Compilation
  • Vectorization and parallelization require a deeper analysis than optimization for scalar machines
    • Must be able to determine whether two accesses to the same array might be to the same location
  • Dependence is the theory that makes this possible
    • There is a dependence between two statements if they might access the same location, there is a path from one to the other, and one access is a write
  • Dependence has other applications:
    • Memory hierarchy management: restructuring programs to make better use of cache and registers
      • Includes input dependences
    • Scheduling of instructions
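
A small, made-up C example of the distinction this slide is drawing: the first loop carries a flow (true) dependence between array accesses, so its iterations cannot safely run in parallel; the second has no loop-carried dependence and can be vectorized or parallelized.

/* loop-carried flow dependence: iteration i reads a[i-1], written by iteration i-1 */
void carried(double *a, const double *b, int n)
{
    for (int i = 1; i < n; i++)
        a[i] = a[i - 1] + b[i];
}

/* no loop-carried dependence: each iteration touches only a[i] */
void independent(double *a, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = 2.0 * a[i];
}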

  16. Syllabus I
  • Introduction
    • Parallel and vector architectures. The problem of parallel programming. Bernstein’s conditions and the role of dependence. Compilation for parallel machines and automatic detection of parallelism.
  • Dependence Theory and Practice
    • Fundamentals, types of dependences. Testing for dependence: separable, GCD, and Banerjee tests. Exact dependence testing. Construction of direction and distance vectors.
  • Preliminary Transformations
    • Loop normalization, scalar data-flow analysis, induction-variable substitution, scalar renaming.
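
As a sketch of one of the dependence tests named above, here is the classic GCD test in C: for subscripts c1*i + d1 and c2*i + d2 of the same array in a loop, a dependence can exist only if gcd(c1, c2) divides d2 - d1. The function names and example subscripts are invented for illustration, and the test is conservative: a “maybe” answer does not prove a dependence exists.

#include <stdio.h>
#include <stdlib.h>

/* greatest common divisor */
static int gcd(int a, int b) { return b == 0 ? abs(a) : gcd(b, a % b); }

/* GCD test: may references a[c1*i + d1] and a[c2*i + d2] depend? */
static int gcd_test_may_depend(int c1, int d1, int c2, int d2)
{
    int g = gcd(c1, c2);
    if (g == 0)                      /* both subscripts are loop-invariant */
        return d1 == d2;
    return (d2 - d1) % g == 0;
}

int main(void)
{
    /* a[2*i] vs a[2*i + 1]: gcd(2,2)=2 does not divide 1 -> independent */
    printf("%d\n", gcd_test_may_depend(2, 0, 2, 1));   /* prints 0 */
    /* a[4*i] vs a[2*i + 2]: gcd(4,2)=2 divides 2 -> dependence possible */
    printf("%d\n", gcd_test_may_depend(4, 0, 2, 2));   /* prints 1 */
    return 0;
}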

  17. Syllabus II
  • Fine-Grain Parallel Code Generation
    • Loop distribution and its safety. The Kuck vectorization principle. The layered vector code-generation algorithm and its complexity. Loop interchange.
  • Coarse-Grain Parallel Code Generation
    • Loop interchange. Loop skewing. Scalar and array expansion. Forward substitution. Alignment. Code replication. Array renaming. Node splitting. Pattern recognition. Threshold analysis. Symbolic dependence tests. Parallel code generation and its problems.
  • Control Dependence
    • Types of branches. If conversion. Control dependence. Program dependence graph.
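
A minimal illustration of loop interchange, one of the transformations listed above (the example code is illustrative, not from the course): for a row-major C array, interchanging the nest so that j is innermost gives unit-stride accesses in the inner loop, improving cache reuse. The interchange is legal here because the nest carries no dependence.

#define N 64

/* before: the inner loop strides through memory by N elements */
void add_ji(double x[N][N], double y[N][N])
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            x[i][j] += y[i][j];
}

/* after interchange: the inner loop walks contiguous elements */
void add_ij(double x[N][N], double y[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            x[i][j] += y[i][j];
}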

  18. Syllabus III
  • Memory Hierarchy Management
    • The use of dependence in scalar register allocation and management of the cache memory hierarchy.
  • Scheduling for Superscalar and Parallel Machines
    • Role of dependence. List scheduling. Software pipelining. Work scheduling for parallel systems. Guided self-scheduling.
  • Interprocedural Analysis and Optimization
    • Side-effect analysis, constant propagation, and alias analysis. Flow-insensitive and flow-sensitive problems. Side effects to arrays. Inline substitution, linkage tailoring, and procedure cloning. Management of interprocedural analysis and optimization.
  • Compilation of Other Languages
    • C, Verilog, Fortran 90, HPF.

  19. What is High Performance Computing?
  • What architectural models are there?
  • What system software is required? Standard?
  • How should we evaluate high performance?
    • Run time?
    • Run time × machine cost?
    • Speedup?
    • Efficient use of CPU resources?
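
For the “speedup” and “efficient use of CPU resources” questions, the usual textbook definitions (standard formulas, not from the slides), where T_1 is the best serial run time and T_p the run time on p processors:

S(p) = T_1 / T_p, \quad E(p) = S(p) / p

Linear speedup corresponds to S(p) = p, i.e. efficiency E(p) = 1.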
