
Compiler Managed Partitioned Data Caches for Low Power



Presentation Transcript


  1. Compiler Managed Partitioned Data Caches for Low Power
Rajiv Ravindran*, Michael Chu, and Scott Mahlke
Advanced Computer Architecture Lab, Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor
* Currently with the Java, Compilers, and Tools Lab, Hewlett-Packard, Cupertino, California

  2. Introduction: Memory Power
• On-chip memories are a major contributor to system energy
• Data caches alone account for ~16% in the StrongARM [Unsal et al., '01]
Hardware techniques (banking, dynamic voltage/frequency scaling, dynamic resizing):
+ Transparent to the user
+ Handle arbitrary instruction/data accesses
– Limited program information
– Reactive
Software techniques (software-controlled scratch-pads, data/code reorganization):
+ Whole-program information
+ Proactive
– No dynamic adaptability
– Conservative

  3. Reducing Data Memory Power: Compiler Managed, Hardware Assisted
Combine the strengths of both sides — hardware techniques are transparent but reactive with limited program information, while software techniques have whole-program knowledge but are conservative and not adaptive — to get:
• Global program knowledge
• Proactive optimizations
• Dynamic adaptability
• Efficient execution
• Aggressive software optimizations

  4. Data Caches: Tradeoffs
Advantages:
+ Capture spatial/temporal locality
+ Transparent to the programmer
+ More general than software scratch-pads
+ Efficient lookups
Disadvantages:
– Fixed replacement policy
– Set index ignores program locality
– Set-associativity has high overhead: multiple data/tag arrays are activated per access

  5. Traditional Cache Architecture
[Figure: 4-way set-associative cache. The address is split into tag, set, and offset fields; all four tag/data/LRU ways are read and compared in parallel (four =? comparators feeding a 4:1 mux), and replacement chooses among all four ways.]
• Lookup: activate all ways on every access
• Replacement: choose among all the ways

  6. Partitioned Cache Architecture
[Figure: the same 4-way structure with the ways exposed as partitions P0–P3. Each load/store carries an address, a k-bit partition vector, and an R/U bit.]
• Lookup: restricted to the partitions specified in the bit-vector if the R bit is set; otherwise defaults to all partitions
• Replacement: always restricted to the partitions specified in the bit-vector
• Advantages:
• Improve performance by controlling replacement
• Reduce cache access power by restricting the number of partitions probed
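The lookup and replacement rules above can be put in executable form. This is a minimal illustrative model, not the paper's implementation — the class and method names are my own, and the real design's LRU choice among flagged partitions is simplified to "first flagged way":

```python
class PartitionedCache:
    """Toy model of the partitioned cache: each way is a partition.

    Every access carries a k-bit partition vector and an R/U flag:
    restricted ('R') lookups probe only the flagged partitions, while
    unrestricted ('U') lookups fall back to probing all of them.
    Replacement is always confined to the flagged partitions.
    """

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self.tags = [{} for _ in range(num_partitions)]  # set index -> tag
        self.tag_checks = 0  # tag-array activations (the power proxy)

    def access(self, set_index, tag, bitvector, restricted):
        ways = [p for p in range(self.num_partitions)
                if not restricted or bitvector[p]]
        for p in ways:
            self.tag_checks += 1
            if self.tags[p].get(set_index) == tag:
                return True  # hit
        # Miss: replace within the flagged partitions only
        # (the real hardware would pick the LRU flagged way).
        victim = next(p for p in range(self.num_partitions) if bitvector[p])
        self.tags[victim][set_index] = tag
        return False
```

With a one-hot vector and the R flag, repeated accesses to the same block cost one tag check each instead of one per way, which is exactly the power saving the architecture targets.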

  7. Partitioned Caches: Example

for (i = 0; i < N1; i++) {
  ...
  for (j = 0; j < N2; j++)
    y[i + j] += *w1++ + x[i + j];
  for (k = 0; k < N3; k++)
    y[i + k] += *w2++ + x[i + k];
}

Each object is mapped to its own partition of a 3-way cache:
• way-0: y — ld1/st1, ld2/st2 (annotation: ld1 [100], R)
• way-1: w1/w2 — ld5, ld6 (annotation: ld5 [010], R)
• way-2: x — ld3, ld4 (annotation: ld3 [001], R)
• Reduces the number of tag checks per iteration from 12 to 4!
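The "12 to 4" figure is simple arithmetic, assuming the four memory operations per inner-loop iteration (load y, store y, load w, load x) against the 3-way baseline:

```python
OPS_PER_ITERATION = 4   # load y, store y, load w1 (or w2), load x
WAYS = 3                # 3-way set-associative baseline

# Conventional cache: every access probes the tag array of every way.
conventional_checks = OPS_PER_ITERATION * WAYS

# Partitioned cache: each access carries a one-hot bit-vector with the
# R flag set, so only a single partition's tag array is probed.
partitioned_checks = OPS_PER_ITERATION * 1

print(conventional_checks, partitioned_checks)  # 12 4
```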

  8. Compiler Controlled Data Partitioning
• Goal: place loads/stores into cache partitions
• Analyze the application's memory characteristics:
• Cache requirements — the number of partitions each load/store needs
• Predicted conflicts
• Place loads/stores into partitions so that each one:
• satisfies its caching needs
• avoids conflicts, overlapping with others where possible

  9. Cache Analysis: Estimating Number of Partitions
• Find the minimal number of partitions that avoids conflict/capacity misses
• Probabilistic hit-rate estimate
• Use the working-set to compute the number of partitions
[Figure: access streams for the j-loop (X W1 Y Y) and the k-loop (X W2 Y Y); instruction M re-touches block B1 every iteration, so M has a working-set size of 1.]

  10. Cache Analysis: Estimating Number of Partitions
• Avoid conflict/capacity misses for each instruction
• Estimate its hit rate from the reuse distance (D), the total number of cache blocks (B), and the associativity (A) [Brehob et al., '99]
[Figure: estimated hit rates across cache sizes of 8–32 blocks and associativities 1–4, for reuse distances D = 0, 1, 2.]
• In reality, energy matrices are computed and the most energy-efficient configuration is picked per instruction
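The hit-rate estimate can be sketched with a standard probabilistic model in the spirit of the cited Brehob et al. work: a reference with reuse distance D hits if fewer than A of the D distinct intervening blocks land in its set, assuming blocks map to the S = B/A sets uniformly at random. This is an illustrative reconstruction, not necessarily the paper's exact formula:

```python
from math import comb

def hit_probability(D, B, A):
    """Estimate the hit probability of a reference with reuse distance D
    in a cache of B blocks organized as A-way sets (S = B // A sets).

    The reference hits if fewer than A of the D distinct intervening
    blocks fall into its set; with uniform random set mapping, that
    count is Binomial(D, 1/S), so we sum the binomial PMF below A.
    """
    S = B // A
    p = 1.0 / S
    return sum(comb(D, i) * p**i * (1 - p)**(D - i) for i in range(A))
```

Under this model D = 0 always hits, and the estimate degrades as D grows — which is why the compiler can trade partitions (smaller effective B) against energy per access.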

  11. Cache Analysis: Computing Interferences
• Avoid conflicts among temporally co-located references
• Model conflicts using an interference graph
[Figure: the j-loop and k-loop access streams (X W1 Y Y / X W2 Y Y) relabeled with fused instructions M1–M4, each with reuse distance D = 1; edges connect instructions that are temporally co-located.]

  12. Partition Assignment
• The placement phase can overlap references in a partition
• Compute the combined working-set using the graph-theoretic notion of a clique:
• For each clique, the new D is the sum (Σ) of D over its nodes
• The combined D for all overlaps is the maximum over all cliques
Example (M1–M4, each with D = 1):
• Clique 1: M1, M2, M4 → new reuse distance D = 3
• Clique 2: M1, M3, M4 → new reuse distance D = 3
• Combined reuse distance = max(3, 3) = 3
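The clique computation can be written out directly. Brute-force enumeration is acceptable at this scale, since the interference graphs are built per loop and are small; the function names below are illustrative:

```python
from itertools import combinations

def is_clique(nodes, edges):
    """True if every pair of nodes is connected by an interference edge."""
    return all(frozenset(pair) in edges for pair in combinations(nodes, 2))

def combined_reuse_distance(reuse, edges):
    """Max over all cliques of the summed reuse distances of their nodes.

    reuse maps each fused memory instruction to its estimated reuse
    distance D; edges is a set of frozenset pairs of instructions that
    are temporally co-located. Brute-force over all subsets.
    """
    nodes = list(reuse)
    best = 0
    for size in range(1, len(nodes) + 1):
        for subset in combinations(nodes, size):
            if is_clique(subset, edges):
                best = max(best, sum(reuse[n] for n in subset))
    return best
```

On the slide's example — M1–M4 with D = 1 each, and cliques {M1, M2, M4} and {M1, M3, M4} — this returns 3, matching the combined reuse distance above.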

  13. Experimental Setup
• Trimaran compiler and simulator infrastructure
• ARM9 processor model
• Cache configurations:
• 1 KB to 32 KB
• 32-byte block size
• 2-, 4-, and 8-partition caches vs. 2-, 4-, and 8-way set-associative caches
• Mediabench suite
• CACTI for cache energy modeling

  14. Reduction in Tag & Data-Array Checks
[Chart: average number of ways accessed per reference for 2-, 4-, and 8-partition caches at sizes from 1 KB to 32 KB, plus the overall average.]
• 36% reduction on an 8-partition cache

  15. Improvement in Fetch Energy
[Chart: percentage fetch-energy improvement on a 16 KB cache — 2-part vs. 2-way, 4-part vs. 4-way, and 8-part vs. 8-way — across the Mediabench programs (epic, cjpeg, djpeg, unepic, pegwitenc/dec, rawcaudio/rawdaudio, mpeg2enc/dec, pgpencode/decode, gsmencode/decode, g721encode/decode) and their average.]

  16. Summary
• Maintains the advantages of a hardware cache
• Exposes placement and lookup decisions to the compiler
• Avoids conflicts, eliminates redundant tag checks
• 24% energy savings for a 4 KB cache with 4 partitions
• Extensions: hybrid scratch-pads and caches
• Disable selected tags, converting those partitions into scratch-pads
• 35% additional savings in a 4 KB cache with 1 partition used as a scratch-pad

  17. Thank You & Questions

  18. Cache Analysis Step 1: Instruction Fusioning
• Combine loads/stores that access the same set of objects
• Avoids coherence problems and duplication across partitions
• Uses points-to analysis

for (i = 0; i < N1; i++) {
  ...
  for (j = 0; j < readInput1(); j++)
    y[i + j] += *w1++ + x[i + j];
  for (k = 0; k < readInput2(); k++)
    y[i + k] += *w2++ + x[i + k];
}

[Figure: the loads/stores ld1/st1, ld2/st2, and ld3–ld6 grouped into fused instructions (M1, M2, …) according to the objects they may access.]
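A sketch of the fusioning step, assuming the points-to sets are already available from analysis. The greedy merge below and its names are illustrative, not the paper's algorithm: instructions whose points-to sets overlap are merged into one fused instruction, so no object can end up cached in two partitions:

```python
def fuse_instructions(points_to):
    """Group memory instructions whose points-to sets overlap.

    points_to maps an instruction name to the set of objects it may
    access. Returns a list of (instructions, objects) groups; each
    group becomes one fused instruction (M1, M2, ...) so that every
    object is handled by exactly one partition assignment.
    """
    groups = []  # list of (set of instructions, set of objects)
    for instr, objs in points_to.items():
        merged_instrs, merged_objs = {instr}, set(objs)
        remaining = []
        for insts, os in groups:
            if os & merged_objs:            # overlapping points-to sets
                merged_instrs |= insts      # -> fuse into one group
                merged_objs |= os
            else:
                remaining.append((insts, os))
        remaining.append((merged_instrs, merged_objs))
        groups = remaining
    return groups
```

On the loop above, the four accesses to y fuse into one instruction, the two accesses to x into another, and each weight pointer stays separate.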

  19. Partition Assignment
• Greedily place instructions based on their cache estimates
• Overlap instructions if required
• Compute the number of partitions for overlapped instructions:
• Enumerate cliques within the interference graph
• Compute the combined working-set over all cliques
• Assign the R/U bit to control lookup
[Figure: interference graph over M1–M4 (each with D = 1), with cliques 1 and 2 highlighted.]

  20. Related Work
• Direct-addressed and cool caches [Unsal '01, Asanovic '01]: tags maintained in registers that are addressed within loads/stores
• Split temporal/spatial cache [Rivers '96]: hardware managed, two partitions
• Column partitioning [Devadas '00]: individual ways can be configured as scratch-pads, but no load/store-based partitioning
• Region-based caching [Tyson '02]: separate regions for heap, stack, and globals; our approach offers finer-grained control and management
• Pseudo set-associative caches [Calder '96, Inoue '99, Albonesi '99]: reduce tag-check power but compromise cycle time; orthogonal to our technique

  21. Code Size Overhead
[Chart: percentage of instructions added per Mediabench program, split into annotated loads/stores and extra MOV instructions; the largest overheads are around 15–16%.]
