
Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture



Presentation Transcript


  1. Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture Seongbeom Kim, Dhruba Chandra, and Yan Solihin Dept. of Electrical and Computer Engineering North Carolina State University {skim16, dchandr, solihin}@ncsu.edu

  2–5. Cache Sharing in CMP [figure: two processor cores, each with a private L1 cache, sharing one L2 cache; thread t1 runs on Processor Core 1 and thread t2 on Processor Core 2] When t1 and t2 co-run, t2’s throughput is significantly reduced due to unfair cache sharing. Seongbeom Kim, NCSU

  6–7. Shared L2 Cache Space Contention [figure: shared L2 cache space contention between co-scheduled threads]

  8. Impact of Unfair Cache Sharing [figure: scheduling timelines; with uniprocessor scheduling, time slices rotate t1, t2, t3, t1, t4; with 2-core CMP scheduling, P1 runs t1 in every slice while P2 runs t3, t3, t2, t2, t4] • Problems of unfair cache sharing: sub-optimal throughput, thread starvation, priority inversion, thread-mix dependent throughput • Fairness: uniform slowdown for co-scheduled threads

  9. Contributions • Cache fairness metrics: easy to measure; approximate uniform slowdown well • Fair caching algorithms: static/dynamic cache partitioning optimizing fairness, with simple hardware modifications • Simulation results: fairness improved 4x; throughput improved 15%, comparable to the cache-miss-minimization approach

  10. Related Work • Cache miss minimization in CMP: G. Suh, S. Devadas, L. Rudolph, HPCA 2002 • Balancing throughput and fairness in SMT: K. Luo, J. Gummaraju, M. Franklin, ISPASS 2001; A. Snavely and D. Tullsen, ASPLOS 2000 • …

  11. Outline • Fairness Metrics • Static Fair Caching Algorithms (See Paper) • Dynamic Fair Caching Algorithms • Evaluation Environment • Evaluation • Conclusions

  12–13. Fairness Metrics • Uniform slowdown. Let Talone_i be the execution time of thread ti when it runs alone, and Tshared_i its execution time when it shares the cache with others.

  14–16. Fairness Metrics • Uniform slowdown • Ideally, co-scheduled threads slow down uniformly: Tshared_i / Talone_i = Tshared_j / Talone_j for every pair of threads ti, tj. • We want to minimize M0 = Σ_{i,j} |X_i − X_j|, where X_i = Tshared_i / Talone_i. Since execution times are hard to measure online, X_i is approximated with L2 miss statistics, e.g. X_i = Miss_shared_i / Miss_alone_i (metric M1) or X_i = MissRate_shared_i / MissRate_alone_i (metric M3).
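The metrics above share one shape: a per-thread slowdown ratio X_i, and a fairness score that sums pairwise differences of those ratios (0 means perfectly uniform slowdown). A minimal sketch, with hypothetical helper names not taken from the paper:

```python
# Sketch of the fairness metrics above (illustrative helpers, not the
# authors' code). X_i is each thread's slowdown ratio; the metric sums
# |X_i - X_j| over all thread pairs, so 0 means uniform slowdown.
from itertools import combinations

def slowdown_ratios(shared, alone):
    """X_i = (measurement when sharing) / (measurement when alone)."""
    return [s / a for s, a in zip(shared, alone)]

def fairness_metric(shared, alone):
    """Sum of |X_i - X_j| over all thread pairs (M0-style metric)."""
    x = slowdown_ratios(shared, alone)
    return sum(abs(xi - xj) for xi, xj in combinations(x, 2))

# Fed execution times this is M0; fed miss counts or miss rates, the
# same formula yields the M1 / M3 approximations.
print(fairness_metric([2.0, 3.0], [1.0, 1.0]))  # -> 1.0 (unfair)
print(fairness_metric([2.0, 2.0], [1.0, 1.0]))  # -> 0.0 (uniform)
```

The second call shows why uniform slowdown, not equal miss rates, is the target: both threads slow down by 2x, and the metric is zero.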

  17. Outline • Fairness Metrics • Static Fair Caching Algorithms (See Paper) • Dynamic Fair Caching Algorithms • Evaluation Environment • Evaluation • Conclusions

  18–19. Partitionable Cache Hardware • Modified LRU cache replacement policy (G. Suh et al., HPCA 2002) • Example: the current partition is P1: 448B, P2: 576B; the target partition is P1: 384B, P2: 640B. On a P2 miss, the policy evicts an LRU line belonging to P1, which is over its target, so the current partition converges to the target: P1: 384B, P2: 640B.
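A minimal software model of this modified LRU policy for one fully associative set (a hypothetical Python sketch, not the paper's hardware; tags are assumed unique across threads for simplicity): each line records its owner thread, and on a miss the victim is the LRU line of a thread that exceeds its target allocation, so the actual partition drifts toward the target one line per miss.

```python
# Toy model of partition-aware LRU replacement for a single set.
# self.lines is ordered from LRU (index 0) to MRU (last).
class PartitionedSet:
    def __init__(self, num_lines, target):
        self.num_lines = num_lines   # associativity of the set
        self.target = dict(target)   # thread id -> target line count
        self.lines = []              # list of (owner, tag) pairs

    def usage(self, thread):
        """Lines currently owned by this thread."""
        return sum(1 for owner, _ in self.lines if owner == thread)

    def access(self, thread, tag):
        """Returns True on hit; on a miss, allocates per modified LRU."""
        for i, (owner, t) in enumerate(self.lines):
            if t == tag:
                self.lines.append(self.lines.pop(i))  # move to MRU
                return True
        if len(self.lines) == self.num_lines:
            # Prefer the LRU line of a thread that is over its target.
            victim = next((i for i, (o, _) in enumerate(self.lines)
                           if self.usage(o) > self.target.get(o, 0)), None)
            if victim is None:
                # Everyone is at target: evict the requester's own LRU line.
                victim = next((i for i, (o, _) in enumerate(self.lines)
                               if o == thread), 0)
            self.lines.pop(victim)
        self.lines.append((thread, tag))
        return False
```

For example, with a 4-line set and targets {thread 1: 1 line, thread 2: 3 lines}, letting thread 1 fill the set and then streaming thread 2 through it ends with exactly 1 line owned by thread 1 and 3 by thread 2, mirroring the slide's convergence of current partition to target.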

  20. Dynamic Fair Caching Algorithm (example: optimizing the M3 metric) • State kept per thread: MissRate alone (profiled; here P1: 20%, P2: 5%), MissRate shared (measured during each repartitioning interval), and the target partition.

  21–22. 1st interval (target partition P1: 256KB, P2: 256KB): measured MissRate shared is P1: 20%, P2: 15%. Evaluate M3: X_P1 = 20% / 20% = 1.0, X_P2 = 15% / 5% = 3.0. Repartition! P2 is slowed more, so with a partition granularity of 64KB the target becomes P1: 192KB, P2: 320KB.

  23–24. 2nd interval: MissRate shared becomes P1: 20%, P2: 10%. Evaluate M3: X_P1 = 20% / 20%, X_P2 = 10% / 5%; P2 is still slowed more, so the target becomes P1: 128KB, P2: 384KB.

  25–26. 3rd interval: MissRate shared becomes P1: 25%, P2: 9%. Rollback check for P2, the thread that received extra space: Δ = MRold − MRnew = 10% − 9% = 1%. Do rollback if Δ < Trollback: the extra 64KB did not pay off, so the target partition reverts to P1: 192KB, P2: 320KB.
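The interval-by-interval walkthrough above can be sketched as follows (illustrative Python with hypothetical names, not the hardware implementation; treating Trollback as relative to the old miss rate is an assumption here):

```python
# Sketch of one repartitioning step of the dynamic fair caching loop.
# At each interval boundary, the thread with the larger slowdown ratio
# X_i = MissRate_shared / MissRate_alone gains one 64KB granule from
# the thread with the smaller ratio.
GRANULE_KB = 64

def repartition(partition, miss_shared, miss_alone):
    """partition: {tid: size_kb}; returns the new target partition."""
    x = {tid: miss_shared[tid] / miss_alone[tid] for tid in partition}
    loser = max(x, key=x.get)    # most-slowed thread: grow it
    winner = min(x, key=x.get)   # least-slowed thread: shrink it
    new = dict(partition)
    if loser != winner and new[winner] > GRANULE_KB:
        new[winner] -= GRANULE_KB
        new[loser] += GRANULE_KB
    return new

def maybe_rollback(old_part, new_part, mr_old, mr_new, t_rollback=0.20):
    """Keep new_part only if the grown thread's miss rate improved by
    at least t_rollback (modeled here as a fraction of mr_old)."""
    delta = mr_old - mr_new
    return new_part if delta >= t_rollback * mr_old else old_part
```

Replaying the slides' numbers: with MissRate alone P1: 20%, P2: 5% and measured MissRate shared P1: 20%, P2: 15%, `repartition({1: 256, 2: 256}, ...)` moves 64KB from P1 to P2, giving {1: 192, 2: 320}; the next interval's 10% for P2 moves another granule, giving {1: 128, 2: 384}, matching the walkthrough.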

  27. Fair Caching Overhead • Partitionable cache hardware • Profiling: static profiling for M1, M3; dynamic profiling for M1, M3, M4 • Storage: per-thread registers holding miss rate/count for the "alone" case and the "shared" case • Repartitioning algorithm: < 100 cycles overhead in a 2-core CMP, invoked at every repartitioning interval

  28. Outline • Fairness Metrics • Static Fair Caching Algorithms (See Paper) • Dynamic Fair Caching Algorithms • Evaluation Environment • Evaluation • Conclusions

  29. Evaluation Environment • UIUC's SESC simulator (cycle-accurate)

  30. Evaluation Environment • 18 benchmark pairs • Algorithm parameters • Static algorithm: FairM1 • Dynamic algorithms: FairM1Dyn, FairM3Dyn, FairM4Dyn

  31. Outline • Fairness Metrics • Static Fair Caching Algorithms (See Paper) • Dynamic Fair Caching Algorithms • Evaluation Environment • Evaluation • Correlation results • Static fair caching results • Dynamic fair caching results • Impact of rollback threshold • Impact of time interval • Conclusions

  32–33. Correlation Results [figure: correlation of the candidate metrics with M0] M1 and M3 show the best correlation with M0.

  34–36. Static Fair Caching Results [figure: fairness and throughput of the static algorithms] • FairM1 achieves throughput comparable to MinMiss, with better fairness. • Opt confirms that better fairness is achieved without throughput loss.

  37–40. Dynamic Fair Caching Results [figure: fairness and throughput of the dynamic algorithms] • FairM1Dyn and FairM3Dyn show the best fairness and throughput. • Improvement in fairness results in throughput gain. • Fair caching sometimes degrades throughput (2 out of 18 pairs).

  41–42. Impact of Rollback Threshold in FairM1Dyn [figure: fairness and throughput across Trollback values] A Trollback of 20% shows the best fairness and throughput.

  43–44. Impact of Repartitioning Interval in FairM1Dyn [figure: fairness and throughput across interval lengths] An interval of 10K L2 accesses shows the best fairness and throughput.

  45. Outline • Fairness Metrics • Static Fair Caching Algorithms (See Paper) • Dynamic Fair Caching Algorithms • Evaluation Environment • Evaluation • Conclusions

  46. Conclusions • Problems of unfair cache sharing: sub-optimal throughput, thread starvation, priority inversion, thread-mix dependent throughput • Contributions: cache fairness metrics; static/dynamic fair caching algorithms • Benefits of fair caching: fairness improved 4x; throughput improved 15%, comparable to the cache-miss-minimization approach; fair caching simplifies scheduler design; simple hardware support

  47–48. Partitioning Histogram [figure: distribution of chosen partitions over time] • Mostly oscillating between two partitioning choices. • A Trollback of 35% can still find a better partition.

  49. Impact of Partition Granularity in FairM1Dyn A granularity of 64KB shows the best fairness and throughput.

  50. Impact of Initial Partition in FairM1Dyn Differences across various initial partitions are tolerable.
