
Adaptive Cache Compression for High-Performance Processors



Presentation Transcript


  1. Adaptive Cache Compression for High-Performance Processors Alaa Alameldeen and David Wood University of Wisconsin-Madison Wisconsin Multifacet Project http://www.cs.wisc.edu/multifacet

  2. Overview
  • Design of high-performance processors
    • Processor speed improves faster than memory
    • Memory latency dominates performance
    • Need more effective cache designs
  • On-chip cache compression
    • Increases effective cache size
    • Increases cache hit latency
  • Does cache compression help or hurt?
  Alaa Alameldeen – Adaptive Cache Compression

  3. Does Cache Compression Help or Hurt?

  6. Does Cache Compression Help or Hurt?
  • Adaptive Compression determines when compression is beneficial

  7. Outline
  • Motivation
  • Cache Compression Framework
    • Compressed Cache Hierarchy
    • Decoupled Variable-Segment Cache
  • Adaptive Compression
  • Evaluation
  • Conclusions

  8. Compressed Cache Hierarchy
  • L1 I-Cache and L1 D-Cache (uncompressed), fed by the Instruction Fetcher and Load-Store Queue
  • L2 Cache (compressed), connected to and from memory
  • Decompression pipeline on the L2-to-L1 path, with an Uncompressed Line Bypass
  • Compression pipeline, with an L1 Victim Cache, on the L1-to-L2 path

  9. Decoupled Variable-Segment Cache
  • Objective: pack more lines into the same space
  • Start from a 2-way set-associative cache with 64-byte lines (Tag Area: Address A, Address B; plus the Data Area)
  • Each tag contains an address tag, permissions, and LRU (replacement) bits

  10. Decoupled Variable-Segment Cache
  • Objective: pack more lines into the same space
  • Add two more tags per set (Address C, Address D)

  11. Decoupled Variable-Segment Cache
  • Objective: pack more lines into the same space
  • Add compression size, compression status, and more LRU bits to each tag

  12. Decoupled Variable-Segment Cache
  • Objective: pack more lines into the same space
  • Divide the data area into 8-byte segments

  13. Decoupled Variable-Segment Cache
  • Objective: pack more lines into the same space
  • Data lines composed of 1-8 segments

  14. Decoupled Variable-Segment Cache
  • Objective: pack more lines into the same space

  Tag Area   Compression Status   Compressed Size (segments)
  Addr A     uncompressed         3
  Addr B     compressed           2
  Addr C     compressed           6
  Addr D     compressed           4

  • One tag (Addr D) is present even though its line isn't stored
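The set layout above can be sketched in code. The following is a minimal Python model (class and variable names are my own, not from the talk), assuming an uncompressed line always occupies all eight of its segments even though its tag still records the size it would have if compressed:

```python
# Minimal model of one set: four tags share a data area of sixteen
# 8-byte segments (room for two uncompressed 64-byte lines).
SEGMENTS_PER_SET = 16
MAX_TAGS = 4
UNCOMPRESSED_SEGMENTS = 8  # a 64-byte line spans 8 segments

class CacheSet:
    def __init__(self):
        # each entry: (address tag, CStatus, CSize in segments)
        self.lines = []

    def used_segments(self):
        # an uncompressed line occupies all 8 segments regardless of CSize
        return sum(UNCOMPRESSED_SEGMENTS if not cstatus else csize
                   for _, cstatus, csize in self.lines)

    def fits(self, compressed, csize):
        """A new line fits if a tag is free and enough segments remain."""
        footprint = csize if compressed else UNCOMPRESSED_SEGMENTS
        return (len(self.lines) < MAX_TAGS and
                self.used_segments() + footprint <= SEGMENTS_PER_SET)

# The example set from the slide: A is stored uncompressed (CSize 3 is
# what it would occupy compressed); D's tag exists but its data does not fit.
s = CacheSet()
s.lines = [("A", False, 3), ("B", True, 2), ("C", True, 6)]
print(s.used_segments())   # 8 + 2 + 6 = 16 segments in use
print(s.fits(True, 4))     # False: Addr D's line cannot be stored
```

This also explains slide 21: with A uncompressed the set is full, but the compressed sizes sum to only 15 segments, so compressing everything would have made room for D.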

  15. Outline
  • Motivation
  • Cache Compression Framework
  • Adaptive Compression
    • Key Insight
    • Classification of L2 accesses
    • Global compression predictor
  • Evaluation
  • Conclusions

  16. Adaptive Compression
  • Use past to predict future
  • Key Insight: the LRU Stack [Mattson, et al., 1970] indicates for each reference whether compression helps or hurts
  • Decision: if Benefit(Compression) > Cost(Compression), compress future lines; otherwise, do not compress future lines

  17. Cost/Benefit Classification
  • Classify each cache reference
  • Four-way set-associative cache with space for two 64-byte lines
  • Total of 16 available segments

  LRU Stack   Compression Status   Compressed Size (segments)
  Addr A      uncompressed         3
  Addr B      compressed           2
  Addr C      compressed           6
  Addr D      compressed           4

  18. An Unpenalized Hit
  • Read/Write Address A
  • LRU stack order = 1 ≤ 2 → Hit regardless of compression
  • Uncompressed line → No decompression penalty
  • Neither cost nor benefit

  19. A Penalized Hit
  • Read/Write Address B
  • LRU stack order = 2 ≤ 2 → Hit regardless of compression
  • Compressed line → Decompression penalty incurred
  • Compression cost

  20. An Avoided Miss
  • Read/Write Address C
  • LRU stack order = 3 > 2 → Hit only because of compression
  • Compression benefit: eliminated off-chip miss

  21. An Avoidable Miss
  • Read/Write Address D
  • Line is not in the cache, but its tag exists at LRU stack order = 4
  • Missed only because some lines are not compressed: Sum(CSize) = 3 + 2 + 6 + 4 = 15 ≤ 16
  • Potential compression benefit

  22. An Unavoidable Miss
  • Read/Write Address E
  • Line is not in the cache and its tag does not exist
  • LRU stack order > 4 → Compression wouldn't have helped
  • Neither cost nor benefit
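Slides 17-22 together define a five-way classification of L2 references. A hedged Python sketch (function and argument names are illustrative) for the example set, where an uncompressed cache would hold only the top two lines of the LRU stack:

```python
UNCOMPRESSED_WAYS = 2   # lines an uncompressed 2-way set would hold
TOTAL_SEGMENTS = 16     # data-area segments available per set

def classify(stack_order, line_present, line_compressed, sum_csize):
    """Classify one L2 reference.
    stack_order: 1-based LRU stack position of the matching tag,
                 or None if no tag matches.
    sum_csize:   total compressed size (segments) of all tagged lines."""
    if stack_order is None:
        return "unavoidable miss"        # no tag at all: compression can't help
    if line_present and stack_order <= UNCOMPRESSED_WAYS:
        # would have hit even without compression
        return "penalized hit" if line_compressed else "unpenalized hit"
    if line_present:
        return "avoided miss"            # hit only because of compression
    if sum_csize <= TOTAL_SEGMENTS:
        return "avoidable miss"          # fully compressed, all lines would fit
    return "unavoidable miss"

# The five references from the slides (Sum(CSize) = 3 + 2 + 6 + 4 = 15):
print(classify(1, True, False, 15))      # unpenalized hit  (Address A)
print(classify(2, True, True, 15))       # penalized hit    (Address B)
print(classify(3, True, True, 15))       # avoided miss     (Address C)
print(classify(4, False, True, 15))      # avoidable miss   (Address D)
print(classify(None, False, False, 15))  # unavoidable miss (Address E)
```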

  23. Compression Predictor
  • Estimate: Benefit(Compression) – Cost(Compression)
  • Single counter: Global Compression Predictor (GCP)
    • Saturating up/down 19-bit counter
  • GCP updated on each cache access
    • Benefit: increment by memory latency
    • Cost: decrement by decompression latency
    • Optimization: normalize to decompression latency = 1
  • Cache allocation
    • Allocate compressed line if GCP ≥ 0
    • Allocate uncompressed line if GCP < 0
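The predictor update rule can be sketched as follows. This is a minimal model, assuming (as the cost/benefit slides suggest) that both avoided and avoidable misses count as benefit; the memory latency of 80 normalized units is an assumed value, not from the slides:

```python
# Sketch of the Global Compression Predictor: one saturating 19-bit
# counter, normalized so that one decompression costs 1 unit.
GCP_MAX = 2**18 - 1
GCP_MIN = -(2**18)
NORMALIZED_MEMORY_LATENCY = 80   # assumed value for illustration

class GlobalCompressionPredictor:
    def __init__(self):
        self.counter = 0

    def update(self, access_class):
        # benefit: a miss that compression avoided (or could have avoided)
        if access_class in ("avoided miss", "avoidable miss"):
            self.counter = min(GCP_MAX,
                               self.counter + NORMALIZED_MEMORY_LATENCY)
        # cost: a hit that paid the decompression latency
        elif access_class == "penalized hit":
            self.counter = max(GCP_MIN, self.counter - 1)

    def allocate_compressed(self):
        return self.counter >= 0   # compress future lines while benefit >= cost

gcp = GlobalCompressionPredictor()
gcp.update("penalized hit")
print(gcp.allocate_compressed())   # False: so far only cost has accrued
gcp.update("avoided miss")
print(gcp.allocate_compressed())   # True: counter is now -1 + 80 = 79
```

Note that unpenalized hits and unavoidable misses leave the counter unchanged, matching the "neither cost nor benefit" cases on slides 18 and 22.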

  24. Outline
  • Motivation
  • Cache Compression Framework
  • Adaptive Compression
  • Evaluation
    • Simulation Setup
    • Performance
  • Conclusions

  25. Simulation Setup
  • Simics full-system simulator augmented with:
    • Detailed OoO processor simulator [TFSim, Mauer, et al., 2002]
    • Detailed memory timing simulator [Martin, et al., 2002]
  • Workloads:
    • Commercial workloads:
      • Database servers: OLTP and SPECJBB
      • Static web serving: Apache and Zeus
    • SPEC2000 benchmarks:
      • SPECint: bzip, gcc, mcf, twolf
      • SPECfp: ammp, applu, equake, swim

  26. System Configuration
  • A dynamically scheduled SPARC V9 uniprocessor
  • Configuration parameters: (table not reproduced in this transcript)

  27. Simulated Cache Configurations
  • Always: all compressible lines are stored in compressed format
    • Decompression penalty for all compressed lines
  • Never: all cache lines are stored in uncompressed format
    • Cache is 8-way set-associative with half the number of sets
    • Does not incur decompression penalty
  • Adaptive: our adaptive compression scheme

  28. Performance
  • Speedup charts for SPECint, SPECfp, and commercial workloads

  29. Performance

  30. Performance
  • 35% speedup (best case)
  • 18% slowdown (worst case)

  31. Performance
  • Bug in GCP update
  • Adaptive performs similar to the best of Always and Never

  32. Effective Cache Capacity

  33. Cache Miss Rates
  • Misses per 1000 instructions: 0.09, 2.52, 12.28, 14.38
  • Penalized hits per avoided miss: 6709, 489, 12.3, 4.7

  34. Adapting to L2 Sizes
  • Misses per 1000 instructions: 104.8, 36.9, 0.09, 0.05
  • Penalized hits per avoided miss: 0.93, 5.7, 6503, 326000

  35. Conclusions
  • Cache compression increases cache capacity but slows down cache hit time
    • Helps some benchmarks (e.g., apache, mcf)
    • Hurts other benchmarks (e.g., gcc, ammp)
  • Our proposal: adaptive compression
    • Uses the (LRU) replacement stack to determine whether compression helps or hurts
    • Updates a single global saturating counter on cache accesses
  • Adaptive compression performs similar to the better of Always Compress and Never Compress

  36. Backup Slides
  • Frequent Pattern Compression (FPC)
  • Decoupled Variable-Segment Cache
  • Classification of L2 Accesses
  • (LRU) Stack Replacement
  • Cache Miss Rates
  • Adapting to L2 Sizes – mcf
  • Adapting to L1 Size
  • Adapting to Decompression Latency – mcf
  • Adapting to Decompression Latency – ammp
  • Phase Behavior – gcc
  • Phase Behavior – mcf
  • Can We Do Better Than Adaptive?

  37. Decoupled Variable-Segment Cache
  • Each set contains four tags and space for two uncompressed lines
  • Data area divided into 8-byte segments
  • Each tag is composed of:
    • Address tag
    • Permissions
    • CStatus: 1 if the line is compressed, 0 otherwise
    • CSize: size of compressed line in segments
    • LRU/replacement bits
  • (Address tag, permissions, and LRU bits are the same as in an uncompressed cache)

  38. Frequent Pattern Compression
  • A significance-based compression algorithm
  • Related work:
    • X-Match and X-RL algorithms [Kjelso, et al., 1996]
    • Address and data significance-based compression [Farrens and Park, 1991; Citron and Rudolph, 1995; Canal, et al., 2000]
  • A 64-byte line is decompressed in five cycles
  • More details in technical report: "Frequent Pattern Compression: A Significance-Based Compression Algorithm for L2 Caches," Alaa R. Alameldeen and David A. Wood, Dept. of Computer Sciences Technical Report CS-TR-2004-1500, April 2004 (available online).

  39. Frequent Pattern Compression (FPC)
  • A significance-based compression algorithm combined with zero run-length encoding
  • Compresses each 32-bit word separately
  • Suitable for short (32-256 byte) cache lines
  • Compressible patterns: zero runs; sign-extended 4, 8, and 16 bits; zero-padded half-word; two sign-extended half-words; repeated byte
  • A 64-byte line is decompressed in a five-stage pipeline
  • More details in Technical Report CS-TR-2004-1500 (see slide 38)
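The per-word pattern matching can be illustrated in code. The sketch below only classifies a 32-bit word into the pattern list above; the prefix encodings, exact pattern ordering, and the zero-padded half-word orientation (nonzero upper half assumed here) are my assumptions, not taken from the tech report:

```python
def sign_extends(value, bits):
    """True if the 32-bit word equals its low `bits` bits sign-extended."""
    lo = value & ((1 << bits) - 1)
    if lo & (1 << (bits - 1)):
        lo -= 1 << bits
    return (lo & 0xFFFFFFFF) == value

def halfword_sign_extends8(h):
    """True if a 16-bit half-word equals its low byte sign-extended."""
    lo = h & 0xFF
    if lo & 0x80:
        lo -= 0x100
    return (lo & 0xFFFF) == h

def match_pattern(word):
    word &= 0xFFFFFFFF
    if word == 0:
        return "zero"                    # folded into zero runs when adjacent
    if sign_extends(word, 4):
        return "4-bit sign-extended"
    if sign_extends(word, 8):
        return "8-bit sign-extended"
    if sign_extends(word, 16):
        return "16-bit sign-extended"
    if word & 0xFFFF == 0:
        return "zero-padded half-word"   # assumed: nonzero upper, zero lower
    if (halfword_sign_extends8(word >> 16) and
            halfword_sign_extends8(word & 0xFFFF)):
        return "two sign-extended half-words"
    if word == (word & 0xFF) * 0x01010101:
        return "repeated byte"
    return "uncompressed"

print(match_pattern(5))            # 4-bit sign-extended
print(match_pattern(0xABABABAB))   # repeated byte
print(match_pattern(0x12345678))   # uncompressed
```

A real implementation would emit a pattern prefix plus the significant bits for each word; this sketch only shows the significance test that makes narrow values compressible.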

  40. Classification of L2 Accesses
  • Cache hits:
    • Unpenalized hit: hit to an uncompressed line that would have hit without compression
    • Penalized hit: hit to a compressed line that would have hit without compression
    • Avoided miss: hit to a line that would NOT have hit without compression
  • Cache misses:
    • Avoidable miss: miss to a line that would have hit with compression
    • Unavoidable miss: miss to a line that would have missed even with compression

  41. (LRU) Stack Replacement
  • Differentiating penalized hits from avoided misses:
    • Only hits to the top half of the tags in the LRU stack are penalized hits
  • Differentiating avoidable from unavoidable misses:
    • Does not depend on LRU replacement specifically:
      • Any replacement algorithm works for the top half of tags
      • Any stack algorithm works for the remaining tags
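The stack bookkeeping behind this classification is simple. A minimal sketch (not the paper's hardware mechanism) that tracks one set's tags as a Python list: referencing a tag returns its 1-based stack order and moves it to the MRU position, and with four tags and two uncompressed ways, orders 1-2 would hit without compression while orders 3-4 hit only with it:

```python
def reference(stack, tag):
    """Return the 1-based LRU stack order of `tag` (None on a cold miss)
    and promote it to the MRU position. The stack is not capped here;
    a real set would bound it at the number of tags (illustrative only)."""
    if tag in stack:
        order = stack.index(tag) + 1
        stack.remove(tag)
    else:
        order = None
    stack.insert(0, tag)
    return order

stack = []
for t in ["A", "B", "C"]:
    reference(stack, t)        # cold misses fill the stack: [C, B, A]
print(reference(stack, "A"))   # 3: below the top half, so compression matters
print(stack)                   # ['A', 'C', 'B']
```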

  42. Cache Miss Rates

  43. Adapting to L2 Sizes – mcf
  • Misses per 1000 instructions: 98.9, 88.1, 12.4, 0.02
  • Penalized hits per avoided miss: 11.6, 4.4, 12.6, 2×10^6

  44. Adapting to L1 Size

  45. Adapting to Decompression Latency – mcf

  46. Adapting to Decompression Latency – ammp

  47. Phase Behavior – gcc
  • Charts: Predictor Value (K) and Cache Size (MB) over time

  48. Phase Behavior – mcf
  • Charts: Predictor Value (K) and Cache Size (MB) over time

  49. Can We Do Better Than Adaptive?
  • Optimal is an unrealistic configuration: Always with no decompression penalty
