
Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching

Somayeh Sardashti and David A. Wood, University of Wisconsin-Madison


Presentation Transcript


  1. Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching Somayeh Sardashti and David A. Wood, University of Wisconsin-Madison

  2. Where does energy go?

  3. Communication vs. Computation [Keckler, MICRO 2011] Moving data costs ~200X the energy of computing on it. Improving cache utilization is critical for energy efficiency!

  4. Compressed Cache: Compress and Compact Blocks • Higher effective cache size • Small area overhead • Higher system performance • Lower system energy • Previous work limits compression effectiveness: • Limited number of tags • High internal fragmentation • Energy-expensive re-compaction

  5. Decoupled Compressed Cache (DCC) Saving system energy by improving LLC utilization through cache compression. Non-contiguous sub-blocks and decoupled super-blocks address the limits of previous work: • Limited number of tags • High internal fragmentation • Energy-expensive re-compaction

  6. Decoupled Compressed Cache (DCC) Saving system energy by improving LLC utilization through cache compression. Outperforms a 2X LLC with only 1.08X the LLC area: 14% higher performance and 12% lower energy.

  7. Outline • Motivation • Compressed caching • Our Proposals: Decoupled compressed cache • Experimental Results • Conclusions

  8. Uncompressed Caching A fixed one-to-one tag/data mapping. [Figure: tags A, B, C each map to their own data block.]

  9. Compressed Caching Compress cache blocks. Compact the compressed blocks to make room. Add more tags to increase effective capacity. [Figure: compressed blocks A, B, C packed together, freeing space for more blocks.]

  10. (1) Compression: how to compress blocks? • There are different compression algorithms. • Not the focus of this work. • But the choice of algorithm matters! [Figure: the compressor shrinks block C from 64 bytes to 20 bytes.]

  11. Compression Potentials [Figure: compression ratio (up to 3.9) vs. cycles to decompress for several compression algorithms.] Compression Ratio = Original Size / Compressed Size. A high compression ratio gives a potentially large normalized effective cache capacity. We use C-PACK+Z for the rest of the talk!
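The two metrics on this slide can be sketched in a few lines of Python. The 64-byte block size and the 64B-to-20B example come from the talk; the helper names and the capacity model (greedy packing, ignoring tag limits and fragmentation) are illustrative assumptions.

```python
BLOCK_SIZE = 64  # bytes per uncompressed LLC block, as in the talk

def compression_ratio(original_size: int, compressed_size: int) -> float:
    """Compression Ratio = Original Size / Compressed Size (slide definition)."""
    return original_size / compressed_size

def normalized_effective_capacity(compressed_sizes, num_data_bytes):
    """Idealized upper bound: blocks stored per uncompressed-block slot,
    ignoring tag limits and internal fragmentation (an assumption of
    this sketch, not a claim from the talk)."""
    stored = used = 0
    for size in compressed_sizes:
        if used + size > num_data_bytes:
            break
        used += size
        stored += 1
    return stored / (num_data_bytes // BLOCK_SIZE)

# The 64B -> 20B example from slide 10:
print(compression_ratio(64, 20))                         # 3.2
# Six 20B blocks in a two-slot (128B) region:
print(normalized_effective_capacity([20] * 6, 128))      # 3.0
```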

  12. (2) Compaction: how to store and find blocks? • Critical to achieving the compression potential. • This work focuses on compaction. Fixed-Size Compressed Cache (FixedC) [Kim et al., WMPI 2002; Yang et al., MICRO 2002] suffers internal fragmentation! [Figure: compressed blocks A, B, C each occupy a fixed half-size slot.]
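A minimal sketch of why FixedC-style compaction fragments: each block gets either a full 64B slot or, if it compresses to at most half, a fixed 32B half-slot, and everything between the compressed size and the slot size is wasted. The function names and example sizes are hypothetical.

```python
SLOT, HALF = 64, 32  # full slot and fixed half-slot, in bytes

def fixedc_slot(compressed_size: int) -> int:
    """Slot actually consumed by a block under a FixedC-style scheme."""
    return HALF if compressed_size <= HALF else SLOT

def internal_fragmentation(compressed_sizes) -> int:
    """Total wasted bytes across a set of compressed blocks."""
    return sum(fixedc_slot(s) - s for s in compressed_sizes)

# A 20B block still burns a 32B half-slot (12B wasted); a 33B block
# barely misses the half-slot and burns a full 64B slot (31B wasted).
print(internal_fragmentation([20, 33, 64]))  # 12 + 31 + 0 = 43
```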

  13. (2) Compaction: how to store and find blocks? Variable-Size Compressed Cache (VSC) [Alameldeen and Wood, ISCA 2004] [Figure: blocks A, B, C, D each occupy a variable number of contiguous 16B sub-blocks.]
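The VSC allocation rule above can be sketched as follows, assuming the 16B sub-block granularity used in the talk's examples: a compressed block takes ceil(size/16) contiguous sub-blocks, so waste is bounded by one sub-block per block rather than up to half a slot.

```python
import math

SUB_BLOCK = 16  # bytes; the sub-block granularity from the talk's examples

def vsc_sub_blocks(compressed_size: int) -> int:
    """Sub-blocks a compressed block occupies under a VSC-style scheme."""
    return math.ceil(compressed_size / SUB_BLOCK)

def vsc_waste(compressed_size: int) -> int:
    """Internal fragmentation: allocated bytes minus compressed size."""
    return vsc_sub_blocks(compressed_size) * SUB_BLOCK - compressed_size

print(vsc_sub_blocks(20), vsc_waste(20))  # 2 sub-blocks, 12 bytes wasted
print(vsc_sub_blocks(33), vsc_waste(33))  # 3 sub-blocks, 15 bytes wasted
```

Compared with the FixedC example, a 33B block now wastes 15 bytes instead of 31.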

  14. Previous Compressed Caches (Limit 1) Limited tag/metadata: high area overhead from adding 4X or more tags. (Limit 2) Internal fragmentation: low cache capacity utilization. [Figure: prior designs reach normalized effective capacities of roughly 1.7-3.1, short of the 3.9 potential.] Normalized Effective Capacity = Number of Valid LLC Blocks / Max Number of (Uncompressed) Blocks.

  15. (Limit 3) Energy-Expensive Re-Compaction VSC requires energy-expensive re-compaction: when an update grows B to 2 sub-blocks, the blocks packed after it must be shifted, causing 3X higher LLC dynamic energy! [Figure: updating B forces the sub-blocks of C and D to move.]
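The re-compaction cost can be illustrated with a toy model (the layout and cost metric are assumptions of this sketch, not measurements from the talk): with contiguous packing, every sub-block stored after the updated block must be read and rewritten.

```python
def recompaction_moves(layout, victim, new_len):
    """layout: ordered list of (block_id, num_sub_blocks) packed
    contiguously. Returns how many sub-blocks must move when `victim`
    grows to `new_len` sub-blocks, plus the new layout."""
    moves, seen_victim, new_layout = 0, False, []
    for blk, n in layout:
        if blk == victim:
            new_layout.append((blk, new_len))
            seen_victim = True
        else:
            if seen_victim:
                moves += n  # everything packed after the victim shifts
            new_layout.append((blk, n))
    return moves, new_layout

# The slide's scenario: B grows from 1 to 2 sub-blocks, so C and D
# (3 sub-blocks between them) must all be moved.
moves, _ = recompaction_moves([("A", 2), ("B", 1), ("C", 1), ("D", 2)],
                              "B", 2)
print(moves)  # 3
```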

  16. Outline • Motivation • Compressed caching • Our Proposals: Decoupled compressed cache • Experimental Results • Conclusions

  17. Decoupled Compressed Cache (1) Exploiting Spatial Locality • Low Area Overhead (2) Decoupling tag/data mapping • Eliminate energy expensive re-compaction • Reduce internal fragmentation (3) Co-DCC: Dynamically co-compacting super-blocks • Further reduce internal fragmentation

  18. (1) Exploiting Spatial Locality Neighboring blocks co-reside in the LLC (89% on average).

  19. (1) Exploiting Spatial Locality DCC tracks LLC blocks at super-block granularity: a quad (Q) super-tag covers four neighboring blocks (A, B, C, D) with per-block state, while a singleton (S) tag covers a lone block (E). [Figure: with 2X super tags instead of 4X per-block tags, up to 4X blocks are tracked with low area overheads!]
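The super-block tag above can be sketched as follows. The block and super-block sizes (64B blocks, 4 blocks per super-block) follow the talk; the class layout and field names are illustrative, not the actual hardware format.

```python
BLOCK_BITS = 6        # 64-byte blocks
SUPER_BLOCK_BITS = 8  # 4 aligned blocks (256 bytes) per super-block

class SuperTag:
    """One tag entry covering up to four neighboring blocks."""
    def __init__(self, super_tag):
        self.super_tag = super_tag   # address bits above the super-block
        self.state = [None] * 4      # per-block coherence/valid state

def super_tag_of(addr: int) -> int:
    return addr >> SUPER_BLOCK_BITS

def block_id_of(addr: int) -> int:
    """Which of the 4 blocks within the super-block this address names."""
    return (addr >> BLOCK_BITS) & 0x3

# Four consecutive 64B blocks (a "quad") share a single super tag.
tag = SuperTag(super_tag_of(0x1000))
for addr in range(0x1000, 0x1100, 0x40):
    assert super_tag_of(addr) == tag.super_tag  # same tag for all four
    tag.state[block_id_of(addr)] = "V"
print(tag.state)  # ['V', 'V', 'V', 'V']
```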

  20. (2) Decoupling Tag/Data Mapping DCC decouples the tag/data mapping to eliminate re-compaction: sub-blocks are allocated flexibly, so updating B simply writes its new sub-blocks (B1, B2) into free space. [Figure: quad (A, B, C, D) and singleton (E) super tags with non-contiguous sub-blocks.]

  21. (2) Decoupling Tag/Data Mapping Back pointers identify the owner block of each sub-block: each back pointer holds a tag ID and a block ID. [Figure: every data sub-block carries a back pointer to its owner among the super tags.]
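The back-pointer mechanism can be sketched as a small lookup: each data sub-block records its owner as a (tag ID, block ID) pair, so a block's sub-blocks are found by matching back pointers rather than by position. The data structures here are illustrative, not the exact RTL.

```python
def find_sub_blocks(back_pointers, tag_id, blk_id):
    """Return the indices of the data sub-blocks owned by block
    (tag_id, blk_id), in sub-block order."""
    return [i for i, bp in enumerate(back_pointers)
            if bp == (tag_id, blk_id)]

# A hypothetical set: block B is (tag 0, blk 1) and its two sub-blocks
# sit at non-contiguous positions 2 and 5; None marks a free sub-block.
bps = [(0, 0), (1, 0), (0, 1), (0, 3), None, (0, 1)]
print(find_sub_blocks(bps, 0, 1))  # [2, 5]
```

Because ownership lives with the data, growing B just claims another free sub-block and sets its back pointer; no neighboring block ever moves.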

  22. (3) Co-Compacting Super-Blocks Co-DCC dynamically co-compacts super-blocks, further reducing internal fragmentation. [Figure: the compressed blocks of quad (A, B, C, D) are packed back to back, sharing sub-blocks.]
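The fragmentation saving from co-compaction can be sketched numerically, assuming 16B sub-blocks: plain DCC rounds each block of a quad up to sub-blocks separately, while Co-DCC packs the quad contiguously and rounds once. The compressed sizes below are hypothetical.

```python
import math

SUB_BLOCK = 16  # bytes

def dcc_sub_blocks(sizes):
    """Per-block rounding, as in plain DCC."""
    return sum(math.ceil(s / SUB_BLOCK) for s in sizes)

def co_dcc_sub_blocks(sizes):
    """The quad packed back to back, rounded up once (Co-DCC style)."""
    return math.ceil(sum(sizes) / SUB_BLOCK)

quad = [20, 10, 25, 9]  # hypothetical compressed sizes of A, B, C, D
print(dcc_sub_blocks(quad), co_dcc_sub_blocks(quad))  # 6 vs 4
```

Here co-compaction drops the quad from 6 sub-blocks to 4, since the per-block rounding waste is eliminated.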

  23. Outline • Motivation • Compressed caching • Our Proposals: Decoupled compressed cache • Experimental Results • Conclusions

  24. Experimental Methodology • Integrated DCC with the AMD Bulldozer cache. • We model the timing and allocation constraints of sequential regions at the LLC in detail. • No need for an alignment network. • Verilog implementation and synthesis of the tag-match and sub-block-selection logic. • One additional cycle of latency due to sub-block selection.

  25. Experimental Methodology • Full-system simulation with a simulator based on GEMS. • Wide range of applications with different levels of cache sensitivity: • Commercial workloads: apache, jbb, oltp, zeus • Spec-OMP: ammp, applu, equake, mgrid, wupwise • Parsec: blackscholes, canneal, freqmine • Spec 2006 mixes (m1-m8): bzip2, libquantum-bzip2, libquantum, gcc, astar-bwaves, cactus-mcf-milc-bwaves, gcc-omnetpp-mcf-bwaves-lbm-milc-cactus-bzip, omnetpp-lbm

  26. Effective LLC Capacity [Figure: normalized effective LLC capacity vs. normalized LLC area for the baseline, FixedC, VSC, DCC, Co-DCC, and a 2X baseline.]

  27. (Co-)DCC Performance [Figure: values of 0.96, 0.95, 0.93, 0.90, and 0.86, normalized to the baseline.] (Co-)DCC boosts system performance significantly.

  28. (Co-)DCC Energy Consumption [Figure: values of 0.97, 0.96, 0.93, 0.91, and 0.88, normalized to the baseline.] (Co-)DCC reduces system energy by reducing the number of accesses to main memory.

  29. Summary • Analyzed the limits of compressed caching: • Limited number of tags • Internal fragmentation • Energy-expensive re-compaction • Decoupled Compressed Cache improves the performance and energy of compressed caching: • Decoupled super-blocks • Non-contiguous sub-blocks • Co-DCC further reduces internal fragmentation • Practical designs [details in the paper]

  30. Backup • (De-)Compression overhead • DCC data array organization with AMD Bulldozer • DCC Timing • DCC Lookup • Applications • Co-DCC design • LLC effective capacity • LLC miss rate • Memory dynamic energy • LLC dynamic energy

  31. (De-)Compression Overhead

  32. DCC Data Array Organization (AMD Bulldozer)

  33. DCC Timing

  34. DCC Lookup • Access super tags and back pointers in parallel • Find the matched back pointers • Read the corresponding sub-blocks and decompress [Figure: reading block C of quad (A, B, C, D) gathers its sub-blocks C0 and C1 via their back pointers.]
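The three lookup steps above can be sketched end to end. The data structures, the identity decompressor, and the example contents are all illustrative assumptions; in hardware the tag and back-pointer probes happen in parallel, which this sequential sketch only approximates.

```python
def dcc_lookup(super_tags, back_pointers, data, addr_tag, blk_id,
               decompress=lambda b: b):
    # Step 1: super-tag match (probed in parallel with back pointers
    # in hardware; sequential here for clarity).
    tag_id = next((i for i, t in enumerate(super_tags) if t == addr_tag),
                  None)
    if tag_id is None:
        return None  # super-block not present: miss
    # Step 2: match back pointers against (tag_id, blk_id).
    idxs = [i for i, bp in enumerate(back_pointers)
            if bp == (tag_id, blk_id)]
    if not idxs:
        return None  # super-block present but this block absent
    # Step 3: read the matching sub-blocks and decompress.
    return decompress(b"".join(data[i] for i in idxs))

# Hypothetical set: block C is (tag 0, blk 2) with sub-blocks C0, C1.
tags = [0x10, 0x2A]
bps = [(0, 0), (0, 2), (1, 0), (0, 2)]
data = [b"AAAA", b"C0C0", b"EEEE", b"C1C1"]
print(dcc_lookup(tags, bps, data, 0x10, 2))  # b'C0C0C1C1'
```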

  35. Applications Sensitive to Cache Capacity and Latency Sensitive to Cache Latency Cache Insensitive Sensitive to Cache Capacity

  36. Co-DCC Design

  37. LLC Effective Cache Capacity

  38. LLC Miss Rate

  39. Memory Dynamic Energy

  40. LLC Dynamic Energy
