
Mooly Sagiv



Presentation Transcript


  1. Compiler Optimizations for Memory Hierarchy Chapter 20 http://research.microsoft.com/~trishulc/ http://www.cs.umd.edu/~tseng/ High Performance Compilers for Parallel Computing (Wolfe) Mooly Sagiv

  2. Outline • Motivation • Instruction Cache Optimizations • Scalar Replacement of Aggregates • Data Cache Optimizations • Where does it fit in a compiler • Complementary Techniques • Preliminary Conclusion

  3. Motivation • Every year • CPUs are improving by 50%-60% • Main memory speed is improving by 10% • So what? • What can we do? • Programmers • Compiler writers • Operating system designers • Hardware architects

  4. A Typical Machine [Figure: CPU and cache on a memory bus to main memory; a bus adaptor links the memory bus to an I/O bus with I/O controllers for disks, a network, and graphics output]

  5. Types of Locality in Programs • Temporal Locality • The same data is accessed many times in successive instructions • Example: while (…) { x = x + a; } • Spatial Locality • “Nearby” memory locations are accessed many times in successive instructions • Example: for (i = 1; i < n; i++) { x[i] = x[i] + a; }

  6. Compiler Optimizations for Memory Hierarchy • Register allocation (Chapter 16) • Improve locality • Improve branch prediction • Software prefetching • Improve memory allocation

  7. A Reasonable Assumption • The machine has two separate caches • Instruction cache • Data cache • Employ different compiler optimizations • Instruction cache optimizations • Data Cache optimizations

  8. Instruction-Cache Optimizations • Instruction Prefetching • Procedure Sorting • Procedure and Block Placement • Intraprocedural Code Positioning (Pettis & Hansen 1990) • Procedure Splitting • Tailored for the specific cache policy

  9. Instruction Prefetching • Many machines prefetch the instructions of blocks predicted to be executed • Some RISC architectures support “software” prefetch • iprefetch address (Sparc-V9) • Criteria for inserting prefetches • Tprefetch - the latency of a prefetch • t - how far in advance of use the address is known • A prefetch pays off when t ≥ Tprefetch

  10. Procedure Sorting • Interprocedural optimization • Place the caller and the callee close to each other • Applies to statically linked procedures • Create an "undirected" call graph • Label arcs with execution frequencies • Use a greedy approach to select neighboring procedures
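The greedy step can be sketched in C. This is a simplified, hypothetical rendering on a small fixed call graph: every procedure starts in its own chain, and the heaviest edge whose endpoints lie in different chains repeatedly merges those chains, so frequent caller/callee pairs end up laid out together. (The full Pettis-Hansen pass also records the order of procedures inside each chain; this sketch tracks only chain membership.)

```c
#include <assert.h>

#define NPROC 4  /* hypothetical: four procedures in the example graph */

struct edge { int a, b, weight; };  /* undirected call-graph arc */

void sort_procedures(int chain_of[NPROC], const struct edge *e, int ne) {
    for (int p = 0; p < NPROC; p++) chain_of[p] = p;  /* singleton chains */
    for (;;) {
        /* find the heaviest edge whose endpoints are in different chains */
        int best = -1;
        for (int i = 0; i < ne; i++)
            if (chain_of[e[i].a] != chain_of[e[i].b] &&
                (best < 0 || e[i].weight > e[best].weight))
                best = i;
        if (best < 0) break;                 /* nothing left to merge */
        int from = chain_of[e[best].b], to = chain_of[e[best].a];
        for (int p = 0; p < NPROC; p++)      /* merge the two chains  */
            if (chain_of[p] == from) chain_of[p] = to;
    }
}
```

Procedures connected by heavy edges land in the same chain first, which is the property the layout exploits.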

  11. [Figure: example call graph over procedures P1–P8, with edges labeled by execution frequencies such as 100, 90, 50, 40, 32, 20, 5, 3]

  12. Intraprocedural Code Positioning • Move infrequently executed code out of main body • “Straighten” the code • Higher fraction of fetched instructions are actually executed • Operates on a control flow graph • Edges are annotated with execution frequencies • Cover the graph with traces

  13. Intraprocedural Code Positioning • Input • Control flow graph • Edges are annotated with execution frequencies • Bottom-up trace selection • Initially each basic block is a trace • Combine traces along the maximum-frequency edge from a trace tail to a trace head • Place traces starting from the entry • Traces with many outgoing edges appear earlier • Successive traces are close • Fix up the code by inserting and deleting branches
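The bottom-up trace-selection step above can be sketched in C on a small fixed CFG. This is a hedged, simplified sketch: each basic block starts as its own trace, and we repeatedly take the highest-frequency edge that runs from the tail of one trace to the head of another (without closing a cycle) and concatenate the two traces; `nextb`/`prevb` record the chosen within-trace successor and predecessor.

```c
#include <assert.h>

#define NB 4  /* hypothetical: four basic blocks in the example CFG */

struct edge { int from, to, freq; };

static int head_of(const int prevb[NB], int b) {
    while (prevb[b] >= 0) b = prevb[b];  /* walk back to the trace head */
    return b;
}

void select_traces(int nextb[NB], int prevb[NB], const struct edge *e, int ne) {
    for (int b = 0; b < NB; b++) nextb[b] = prevb[b] = -1;
    for (;;) {
        int best = -1;
        for (int i = 0; i < ne; i++)
            if (nextb[e[i].from] < 0 &&                 /* 'from' is a trace tail */
                prevb[e[i].to] < 0 &&                   /* 'to' is a trace head   */
                head_of(prevb, e[i].from) != e[i].to && /* would not close a cycle */
                (best < 0 || e[i].freq > e[best].freq))
                best = i;
        if (best < 0) break;
        nextb[e[best].from] = e[best].to;  /* splice tail onto head */
        prevb[e[best].to] = e[best].from;
    }
}
```

On a diamond CFG the hot path (highest-frequency edges) becomes one straightened trace, and the cold block is left to be placed separately.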

  14. [Figure: example control-flow graph from entry to exit over blocks B1–B9, with edges labeled by execution frequencies]

  15. Procedure Splitting • Enhances the effectiveness of • Procedure sorting • Code positioning • Divides procedures into “hot” and “cold” parts • Place hot code in a separate section

  16. Scalar Replacement of Array Elements • Reduce the number of memory accesses • Improve the effectiveness of register allocation
do i = 1..N
  do j = 1..N
    do k = 1..N
      C(i, j) = C(i, j) + A(i, k) * B(k, j)
    enddo
  enddo
enddo
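In C, the loop nest above with C(i, j) scalar-replaced looks as follows (a minimal sketch; the fixed size N is an illustration): the element is loaded once before the k loop, accumulated in a local variable the register allocator can keep in a register, and stored once afterwards, instead of being loaded and stored on every k iteration.

```c
#include <assert.h>

#define N 4  /* illustrative matrix size */

void matmul_scalar_replaced(double C[N][N], double A[N][N], double B[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double c = C[i][j];              /* one load               */
            for (int k = 0; k < N; k++)
                c += A[i][k] * B[k][j];      /* no memory traffic on C */
            C[i][j] = c;                     /* one store              */
        }
}
```

The inner loop now touches memory only for A and B; the 2N accesses to C(i, j) per (i, j) pair collapse to one load and one store.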

  17. Data-Cache Optimizations • Loop transformations • Re-arrange loops in scientific code • Allow parallel/pipelined/vector execution • Improve locality • Data placement of dynamic storage • Software prefetching

  18. Loop Transformations • Unimodular transformations • Loop interchange • Loop permutation • Loop skewing • Loop fusion • Loop distribution • Loop tiling
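Loop interchange, the simplest transformation above, can be illustrated on a row-major C array (a sketch with an illustrative size M): the i-outer version walks memory contiguously, while the j-outer version strides M doubles between consecutive accesses. Both loop orders compute the same result; only the spatial locality differs.

```c
#include <assert.h>

#define M 64  /* illustrative array size */

double sum_j_outer(double a[M][M]) {      /* poor spatial locality: stride M */
    double s = 0.0;
    for (int j = 0; j < M; j++)
        for (int i = 0; i < M; i++)
            s += a[i][j];
    return s;
}

double sum_i_outer(double a[M][M]) {      /* good spatial locality: stride 1 */
    double s = 0.0;
    for (int i = 0; i < M; i++)
        for (int j = 0; j < M; j++)
            s += a[i][j];
    return s;
}
```

Interchange is legal here because the two loops carry no dependence; in general a compiler must prove legality before permuting loops.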

  19. Tiling • Perform array operations in small blocks • Rearrange the loops so that the innermost loops fit in cache (due to fewer iterations) • Allow reuse in all tiled dimensions • Padding may be required to avoid cache conflicts

  20.
do i = 1..N, T
  do j = 1..N, T
    do k = 1..N, T
      do ii = i, min(i+T-1, N)
        do jj = j, min(j+T-1, N)
          do kk = k, min(k+T-1, N)
            C(ii, jj) = C(ii, jj) + A(ii, kk) * B(kk, jj)
          enddo
        enddo
      enddo
    enddo
  enddo
enddo
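The same tiled nest rendered in C (a sketch; the sizes NN and TT are illustrative): the three outer loops step over TT-by-TT tiles, and the three inner loops stay inside one tile, so each tile of A, B and C is reused while it is still cache-resident.

```c
#include <assert.h>

#define NN 8                               /* illustrative matrix size */
#define TT 4                               /* illustrative tile size   */
#define MIN(x, y) ((x) < (y) ? (x) : (y))

void matmul_tiled(double C[NN][NN], double A[NN][NN], double B[NN][NN]) {
    for (int i = 0; i < NN; i += TT)
        for (int j = 0; j < NN; j += TT)
            for (int k = 0; k < NN; k += TT)
                /* inner loops never leave the current TT-by-TT tile */
                for (int ii = i; ii < MIN(i + TT, NN); ii++)
                    for (int jj = j; jj < MIN(j + TT, NN); jj++)
                        for (int kk = k; kk < MIN(k + TT, NN); kk++)
                            C[ii][jj] += A[ii][kk] * B[kk][jj];
}
```

The MIN bounds handle the case where NN is not a multiple of TT, mirroring the min(i+T-1, N) bounds in the slide.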

  21. Dynamic Storage • Improve spatial locality at allocation time • Examples • Use the type of the data structure at malloc • Reorganize the heap • Allocate a tree node close to its parent • Useful information • Types • Traversal patterns • Research frontier

  22.
void addList(struct List *list, struct Patient *patient) {
  struct List *b;
  while (list != NULL) {
    b = list;
    list = list->forward;
  }
  list = (struct List *) ccmalloc(sizeof(struct List), b);
  list->patient = patient;
  list->back = b;
  list->forward = NULL;
  b->forward = list;
}

  23. Software Prefetching • Requires special hardware (Alpha, PowerPC, Sparc-V9) • Reduces the cost of subsequent accesses in loops • Not limited to scientific code • More effective for large memory bandwidth
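A hedged sketch of loop prefetching using GCC/Clang's __builtin_prefetch as a portable stand-in for instructions like SPARC-V9 iprefetch. DIST is a hypothetical prefetch distance, chosen so the line arrives before it is used, i.e. roughly the prefetch latency Tprefetch ahead of the access; in practice it is tuned per machine.

```c
#include <assert.h>

#define DIST 16  /* hypothetical prefetch distance, in array elements */

long sum_with_prefetch(const long *a, int n) {
    long s = 0;
    for (int i = 0; i < n; i++) {
        if (i + DIST < n)
            /* hint: read access, low temporal locality after use */
            __builtin_prefetch(&a[i + DIST], 0, 1);
        s += a[i];
    }
    return s;
}
```

The prefetch is only a hint: the result is identical with or without it, and on hardware without the instruction the builtin compiles to nothing.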

  24. With a prefetch pointer:
struct node { int val; struct node *next; struct node *jump; };
…
ptr = the_list->head;
while (ptr->next) {
  prefetch(ptr->jump);
  …
  ptr = ptr->next;
}

Without one:
struct node { int val; struct node *next; };
…
ptr = the_list->head;
while (ptr->next) {
  …
  ptr = ptr->next;
}

  25. [Figure: where these fit in the textbook's optimization order, levels A (HIR) through E (link time): scalar replacement of array references and data-cache optimizations at level A; other passes shown include procedure integration, global value numbering, in-line expansion, and interprocedural register allocation at later levels; constant folding and simplifications apply throughout]

  26. LIR (D) • Inline expansion • Leaf-routine optimizations • Shrink wrapping • Machine idioms • Tail merging • Branch optimization and conditional moves • Dead-code elimination • Software pipelining, … • Instruction scheduling 1 • Register allocation • Instruction scheduling 2 • Intraprocedural I-cache optimizations • Instruction prefetching • Data prefetching • Branch prediction • Constant folding and simplifications

  27. Link-time optimizations (E) • Interprocedural register allocation • Aggregation of global references • Interprocedural I-cache optimizations

  28. Complementary Techniques • Cache aware data structures • Smart hardware • Cache aware garbage collection

  29. Preliminary Conclusion • For imperative programs, current I-cache optimizations suffice to get good speed-ups (10%) • For D-cache optimizations: • Locality optimizations are effective for regular scientific code (46%) • Software prefetching is effective with large memory bandwidth • For pointer-chasing programs more research is needed • Memory optimization is a profitable area
