
Managing Wire Delay in Large CMP Caches

Presentation Transcript


  1. Managing Wire Delay in Large CMP Caches. Bradford M. Beckmann and David A. Wood, Multifacet Project, University of Wisconsin-Madison. MICRO 2004, 12/8/04

  2. Overview
     • Managing wire delay in shared CMP caches
     • Three techniques extended to CMPs
       • On-chip Strided Prefetching (not in talk – see paper)
         • Scientific workloads: 10% average reduction
         • Commercial workloads: 3% average reduction
       • Cache Block Migration (e.g. D-NUCA)
         • Block sharing limits average reduction to 3%
         • Dependence on a difficult-to-implement smart search
       • On-chip Transmission Lines (e.g. TLC)
         • Reduce runtime by 8% on average
         • Bandwidth contention accounts for 26% of L2 hit latency
     • Combining techniques
       • Potentially alleviates isolated deficiencies
       • Up to 19% reduction vs. baseline
       • Implementation complexity

  3. Current CMP: IBM Power 5 (diagram: 2 CPUs, each with L1 I and D caches, sharing 3 L2 cache banks)

  4. CMP Trends (diagram: an 8-CPU CMP, each CPU with L1 I and D caches, over many shared L2 banks, comparing the distance reachable in one cycle in 2004 technology versus 2010 technology)

  5. Baseline: CMP-SNUCA (diagram: 8 CPUs, each with L1 I and D caches, surrounding a shared L2 that is statically partitioned across many banks)
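
For context, the static-NUCA baseline maps each block to a fixed home bank by address interleaving, so a block never moves and its hit latency is set by how far its home bank sits from the requesting CPU. Below is a minimal Python sketch of such a mapping; the 64-byte block size and 256-bank count are illustrative assumptions, not values read off this slide.

```python
# Minimal sketch of a static NUCA (S-NUCA) address-to-bank mapping.
# Block size and bank count are illustrative assumptions.
BLOCK_OFFSET_BITS = 6   # assume 64-byte cache blocks
NUM_L2_BANKS = 256      # assume a 256-bank shared L2

def snuca_home_bank(physical_addr: int) -> int:
    """Return the fixed L2 bank holding this address.

    The mapping is a pure function of the address, so a block never
    migrates; its hit latency depends only on the distance between the
    requesting CPU and this one home bank.
    """
    block_addr = physical_addr >> BLOCK_OFFSET_BITS
    return block_addr % NUM_L2_BANKS

if __name__ == "__main__":
    for addr in (0x1000, 0x1040, 0x7F3A80):
        print(hex(addr), "-> bank", snuca_home_bank(addr))
```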

  6. Outline • Global interconnect and CMP trends • Latency Management Techniques • Evaluation • Methodology • Block Migration: CMP-DNUCA • Transmission Lines: CMP-TLC • Combination: CMP-Hybrid

  7. Block Migration: CMP-DNUCA (diagram: blocks A and B migrate through the shared L2 banks toward the CPUs that reference them)

  8. On-chip Transmission Lines
     • Similar to contemporary off-chip communication
     • Provides a different latency / bandwidth tradeoff
       • Wires behave more like transmission lines as frequency increases
     • Utilize transmission line qualities to our advantage
       • No repeaters – route directly over large structures
       • ~10x lower latency across long distances (a back-of-the-envelope comparison follows below)
     • Limitations
       • Requires thick wires and dielectric spacing
       • Increases manufacturing cost
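
The ~10x figure can be made concrete with a simple model: a conventional repeated global wire has delay roughly proportional to distance, while a transmission line propagates near the speed of light in the surrounding dielectric. The constants below (67 ps/mm for the repeated wire, a relative permittivity of 4) are illustrative assumptions chosen only to show the shape of the tradeoff, not measured values from the paper.

```python
# Back-of-the-envelope comparison: repeated RC wire vs. on-chip
# transmission line. All constants are illustrative assumptions.
REPEATED_WIRE_PS_PER_MM = 67.0   # assumed delay of a repeated global wire
SPEED_OF_LIGHT_MM_PER_PS = 0.3   # c is roughly 300 mm/ns
RELATIVE_PERMITTIVITY = 4.0      # assumed dielectric constant

def repeated_wire_delay_ps(distance_mm: float) -> float:
    # Repeated-wire delay grows roughly linearly with distance.
    return REPEATED_WIRE_PS_PER_MM * distance_mm

def transmission_line_delay_ps(distance_mm: float) -> float:
    # A transmission line propagates at about c / sqrt(eps_r); with no
    # repeaters it can be routed directly over large structures.
    velocity_mm_per_ps = SPEED_OF_LIGHT_MM_PER_PS / RELATIVE_PERMITTIVITY ** 0.5
    return distance_mm / velocity_mm_per_ps

if __name__ == "__main__":
    for d in (5.0, 10.0, 20.0):  # distances across a large chip, in mm
        rc = repeated_wire_delay_ps(d)
        tl = transmission_line_delay_ps(d)
        print(f"{d:4.0f} mm: repeated wire {rc:6.0f} ps, "
              f"transmission line {tl:4.0f} ps (~{rc / tl:.0f}x)")
```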

  9. Transmission Lines: CMP-TLC (diagram: 16 8-byte transmission-line links connect the 8 CPUs to the centrally placed L2 banks)

  10. Combination: CMP-Hybrid (diagram: the CMP-DNUCA layout augmented with 8 32-byte transmission-line links between the center banks and the CPUs)

  11. Outline • Global interconnect and CMP trends • Latency Management Techniques • Evaluation • Methodology • Block Migration: CMP-DNUCA • Transmission Lines: CMP-TLC • Combination: CMP-Hybrid

  12. Methodology
     • Full system simulation
       • Simics
       • Timing model extensions: out-of-order processor, memory system
     • Workloads
       • Commercial: apache, jbb, oltp, zeus
       • Scientific
         • SPLASH-2: barnes & ocean
         • SPEC OMP: apsi & fma3d

  13. System Parameters

  14. Outline • Global interconnect and CMP trends • Latency Management Techniques • Evaluation • Methodology • Block Migration: CMP-DNUCA • Transmission Lines: CMP-TLC • Combination: CMP-Hybrid

  15. CMP-DNUCA: Organization (diagram: the L2 banks are grouped into bankclusters, labeled Local, Inter., and Center, surrounding the 8 CPUs)

  16. Hit Distribution: Grayscale Shading (diagram: darker shading marks banks with a greater % of L2 hits)

  17. CMP-DNUCA: Migration
     • Migration policy: gradual movement
     • Increases local hits and reduces distant hits
     (diagram: blocks step from other bankclusters through my center and my inter. bankclusters toward my local bankcluster; a sketch of this gradual promotion follows below)
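
A minimal sketch of a gradual one-hop promotion policy in this spirit appears below. The bankcluster chain follows the migration strategy shown on slide 35; the lookup table and promotion trigger are illustrative assumptions rather than the hardware mechanism.

```python
# Minimal sketch of gradual (one-hop) block migration toward the
# requesting CPU. The chain mirrors the migration strategy on slide 35;
# the lookup table and trigger are illustrative assumptions.
MIGRATION_CHAIN = [
    "other local", "other inter.", "other center",
    "my center", "my inter.", "my local",
]

# One CPU's view: block address -> index into MIGRATION_CHAIN.
block_position = {}

def on_l2_hit(block_addr: int) -> str:
    """Promote the block one bankcluster step per L2 hit.

    One-hop movement keeps widely shared blocks from ping-ponging
    between distant local bankclusters, while repeated hits from a
    single CPU still pull its private data all the way local.
    """
    idx = block_position.get(block_addr, 0)
    if idx < len(MIGRATION_CHAIN) - 1:
        idx += 1
    block_position[block_addr] = idx
    return MIGRATION_CHAIN[idx]

if __name__ == "__main__":
    for hit in range(6):
        print(f"hit {hit}: block 0x40 now in", on_l2_hit(0x40))
```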

  18. CMP-DNUCA: Hit Distribution (figure: Ocean, broken down per CPU)

  19. CMP-DNUCA: Hit Distribution (figure: Ocean, all CPUs). Block migration successfully separates the data sets.

  20. CMP-DNUCA: Hit Distribution (figure: OLTP, all CPUs)

  21. CMP-DNUCA: Hit Distribution (figure: OLTP, broken down per CPU). Hit clustering: most L2 hits are satisfied by the center banks.

  22. CMP-DNUCA: Search
     • Search policy
       • Uniprocessor DNUCA solution: partial tags
         • Quick summary of the L2 tag state at the CPU
       • No known practical implementation for CMPs
         • Size impact of multiple partial tags
         • Coherence between block migrations and partial tag state
       • CMP-DNUCA solution: two-phase search (sketched below)
         • 1st phase: the CPU's local, inter., and the 4 center bankclusters
         • 2nd phase: the remaining 10 bankclusters
         • Slow 2nd-phase hits and L2 misses
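
A minimal sketch of the two-phase lookup from one CPU's point of view. The bankcluster names and the probe() helper are illustrative assumptions; probe() simply stands in for sending a tag lookup to a bankcluster.

```python
# Minimal sketch of CMP-DNUCA's two-phase search from one CPU's view.
# Bankcluster names and the probe() helper are illustrative assumptions.
FIRST_PHASE = ["my local", "my inter.",
               "center 0", "center 1", "center 2", "center 3"]  # 6 bankclusters
SECOND_PHASE = [f"other {i}" for i in range(10)]                # remaining 10

def probe(bankcluster: str, block_addr: int) -> bool:
    """Stand-in for a tag lookup sent to one bankcluster (always misses here)."""
    return False

def l2_lookup(block_addr: int):
    # 1st phase: the bankclusters nearest the CPU plus the shared center,
    # where migration concentrates most hits.
    for bc in FIRST_PHASE:
        if probe(bc, block_addr):
            return "hit (phase 1)", bc
    # 2nd phase: the remaining bankclusters; hits here pay extra latency.
    for bc in SECOND_PHASE:
        if probe(bc, block_addr):
            return "hit (phase 2)", bc
    # Only after both phases miss can the request go off-chip, which is
    # why slow 2nd-phase searches also delay L2 misses.
    return "miss", None

if __name__ == "__main__":
    print(l2_lookup(0x7F3A80))
```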

  23. CMP-DNUCA: L2 Hit Latency

  24. CMP-DNUCA Summary
     • Limited success
       • Ocean successfully splits: regular scientific workload with little sharing
       • OLTP congregates in the center: commercial workload with significant sharing
     • Smart search mechanism
       • Necessary for a performance improvement
       • No known implementations
       • Upper bound: perfect search

  25. Outline • Global interconnect and CMP trends • Latency Management Techniques • Evaluation • Methodology • Block Migration: CMP-DNUCA • Transmission Lines: CMP-TLC • Combination: CMP-Hybrid

  26. L2 Hit Latency (figure; bars labeled D: CMP-DNUCA, T: CMP-TLC, H: CMP-Hybrid)

  27. Overall Performance (figure). Transmission lines improve both L2 hit and L2 miss latency.

  28. Conclusions
     • Individual latency management techniques
       • Strided prefetching: addresses only a subset of misses
       • Cache block migration: sharing impedes migration
       • On-chip transmission lines: limited bandwidth
     • Combination: CMP-Hybrid
       • Potentially alleviates these bottlenecks
       • Disadvantages: relies on a smart-search mechanism; manufacturing cost of transmission lines

  29. Backup Slides

  30. Strided Prefetching
     • Utilizes repeatable memory access patterns
       • Targets a subset of misses
       • Tolerates latency within the memory hierarchy
     • Our implementation
       • Similar to Power4
       • Unit and non-unit stride misses, prefetching at both the L1-to-L2 and L2-to-memory levels
     (a sketch of a simple stride detector follows below)
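
A minimal sketch of a stride detector in this spirit: it remembers the last address and stride per stream and issues prefetches once the same stride repeats. Keying streams by PC, the confidence threshold, and the prefetch degree are all illustrative assumptions, not the Power4-style configuration evaluated in the paper.

```python
# Minimal sketch of a strided prefetcher: remember the last address and
# stride per stream and prefetch ahead once the same stride repeats.
# Keying by PC, the threshold, and the degree are illustrative assumptions.
from collections import defaultdict

CONFIDENCE_THRESHOLD = 2   # identical strides needed before prefetching
PREFETCH_DEGREE = 2        # blocks to prefetch ahead of the current miss

class StridePrefetcher:
    def __init__(self):
        # stream id (here: PC) -> (last_addr, last_stride, confidence)
        self.table = defaultdict(lambda: (None, 0, 0))

    def on_miss(self, pc: int, addr: int):
        last_addr, last_stride, confidence = self.table[pc]
        prefetches = []
        if last_addr is not None:
            stride = addr - last_addr             # unit or non-unit stride
            if stride != 0 and stride == last_stride:
                confidence += 1
            else:
                confidence = 1
            if confidence >= CONFIDENCE_THRESHOLD:
                prefetches = [addr + stride * (i + 1)
                              for i in range(PREFETCH_DEGREE)]
            last_stride = stride
        self.table[pc] = (addr, last_stride, confidence)
        return prefetches

if __name__ == "__main__":
    pf = StridePrefetcher()
    for a in (0x100, 0x180, 0x200, 0x280):        # stride of 0x80
        print(hex(a), "->", [hex(p) for p in pf.on_miss(pc=0x400, addr=a)])
```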

  31. On and Off-chip Prefetching (figure: commercial and scientific benchmarks)

  32. CMP Sharing Patterns

  33. CMP Request Distribution

  34. CMP-DNUCA: Search Strategy (diagram: the 1st search phase covers the requesting CPU's local, inter., and the center bankclusters; the 2nd phase covers the remaining bankclusters). Uniprocessor DNUCA uses a partial tag array for smart searches; this carries significant implementation complexity for CMP-DNUCA.

  35. CMP-DNUCA: Migration Strategy (diagram: blocks move along the bankcluster chain other local, other inter., other center, my center, my inter., my local, toward the requesting CPU)

  36. Uncontended Latency Comparison

  37. CMP-DNUCA: L2 Hit Distribution (figure, per benchmark)

  38. CMP-DNUCA: L2 Hit Latency

  39. CMP-DNUCA: Runtime

  40. CMP-DNUCA Problems
     • Hit clustering
       • Shared blocks move within the center, equally far from all processors
     • Search complexity
       • 16 separate clusters
       • Partial tags impractical: distributed information, synchronization complexity

  41. CMP-TLC: L2 Hit Latency (figure; bars labeled D: CMP-DNUCA, T: CMP-TLC)

  42. Runtime: Isolated Techniques

  43. CMP-Hybrid: Performance

  44. Energy Efficiency
