
Clustered Data Cache Designs for VLIW Processors


Presentation Transcript


  1. Clustered Data Cache Designs for VLIW Processors PhD Candidate: Enric Gibert Advisors: Antonio González, Jesús Sánchez

  2. Motivation • Two major problems in processor design • Wire delays • Energy consumption D. Matzke, "Will Physical Scalability Sabotage Performance Gains?" in IEEE Computer 30(9), pp. 37-39, 1997. Data from www.sandpile.org

  3. Clustering [Diagram: four clusters (CLUSTER 1-4), each with its own functional units (FUs) and register file, connected by register-to-register communication buses; memory buses link all clusters to a shared L1 cache backed by the L2 cache.]

  4. Data Cache • Latency • Energy • Leakage will soon dominate energy consumption • Cache memories will probably be the main source of leakage (S. Hill, Hot Chips 13) • In this Thesis: • Latency Reduction Techniques • Energy Reduction Techniques

  5. Contributions of this Thesis • Memory hierarchy for clustered VLIW processors • Latency Reduction Techniques • Distribution of the Data Cache among clusters • Cost-effective cache coherence solutions • Word-Interleaved distributed data cache • Flexible Compiler-Managed L0 Buffers • Energy Reduction Techniques • Heterogeneous Multi-module Data Cache • Unified processors • Clustered processors

  6. Evaluation Framework • IMPACT C compiler • Compile + optimize + memory disambiguation • Mediabench benchmark suite • Microarchitectural VLIW simulator

  7. Presentation Outline • Latency reduction techniques • Software memory coherence in distributed caches • Word-interleaved distributed cache • Flexible Compiler-Managed L0 Buffers • Energy reduction techniques • Multi-Module cache for clustered VLIW processor • Conclusions

  8. Distributing the Data Cache [Diagram: the L1 cache is split into one L1 cache module per cluster; each cluster keeps its FUs and register file, the clusters remain connected by register-to-register communication buses, and memory buses connect the cache modules to the shared L2 cache.]

  9. MultiVLIW [Diagram: a distributed L1 cache with one module per cluster; cache blocks are kept coherent across modules with an MSI cache coherence protocol on top of the shared L2 cache.] (Sánchez and González, MICRO-33)

  10. Presentation Outline • Latency reduction techniques • Software memory coherence in distributed caches • Word-interleaved distributed cache • Flexible Compiler-Managed L0 Buffers • Energy reduction techniques • Multi-Module cache for clustered VLIW processor • Conclusions

  11. Memory Coherence [Diagram: one cluster updates X in its cache module while another cluster reads X; the new value must travel over the memory buses to the next memory level and to the other cache modules. Remote accesses, misses, replacements and other traffic share these buses, so the bus latency is non-deterministic.]

  12. Coherence Solutions: Overview • Local scheduling solutions  applied to loops • Memory Dependent Chains (MDC) • Data Dependence Graph Transformations (DDGT) • Store replication • Load-store synchronization • Software-based solutions with little hardware support • Applicable to different configurations • Word-interleaved cache • Replicated distributed cache • Flexible Compiler-Managed L0 Buffers

  13. Scheme 1: Mem. Dependent Chains • Sets of memory dependent instructions • Memory disambiguation by the compiler • Conservative assumptions • Assign instructions in same set to same cluster [Diagram: a store to X and a load from X are memory dependent, so both are assigned to the cluster whose cache module holds X; register dependences (e.g. to an ADD in another cluster) may still cross clusters.]
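To make the grouping step concrete, here is a minimal sketch (not the thesis' implementation) of how memory-dependent instructions could be collected into sets with a union-find and then pinned to a cluster; the mem_dep pairs and the cluster-selection rule are assumptions for illustration.

```c
#include <stdio.h>

#define N_MEMOPS   6
#define N_CLUSTERS 4

/* Union-find over memory instructions: two instructions that may touch
   the same address (per the compiler's disambiguation) end up in the
   same set, and the whole set is later pinned to a single cluster. */
static int parent[N_MEMOPS];

static int find(int x) { return parent[x] == x ? x : (parent[x] = find(parent[x])); }
static void unite(int a, int b) { parent[find(a)] = find(b); }

int main(void) {
    for (int i = 0; i < N_MEMOPS; i++) parent[i] = i;

    /* Hypothetical "may alias" pairs reported by memory disambiguation. */
    int mem_dep[][2] = { {0, 2}, {2, 5}, {1, 3} };
    for (int d = 0; d < (int)(sizeof mem_dep / sizeof mem_dep[0]); d++)
        unite(mem_dep[d][0], mem_dep[d][1]);

    /* Every instruction inherits the cluster of its set leader, so all
       accesses to a potentially shared location stay in one module. */
    for (int i = 0; i < N_MEMOPS; i++) {
        int cluster = find(i) % N_CLUSTERS;   /* placeholder policy */
        printf("memop %d -> set %d -> cluster %d\n", i, find(i), cluster);
    }
    return 0;
}
```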

  14. Scheme 2: DDG Transformations (I) • 2 transformations applied together • Store replication  overcome memory-flow (MF) and memory-output (MO) dependences • Little support from the hardware [Diagram: the store to X is replicated into one local instance plus remote instances in every other cluster, so a later load from X always finds an up-to-date copy in its local cache module.]

  15. Scheme 2: DDG Transformations (II) • Load-store synchronization  overcome memory-anti (MA) dependences [Diagram: the load from X explicitly synchronizes with the later store to X through a register communication (SYNC), so the anti-dependent store can safely execute in a different cluster.]

  16. Results: Memory Coherence • Memory Dependent Chains (MDC) • Bad, since it restricts the assignment of instructions to clusters • Good when memory disambiguation is accurate • DDG Transformations (DDGT) • Good when there is pressure on the memory buses • Increases the number of local accesses • Bad when there is pressure on the register buses • Big increase in inter-cluster communications • Both solutions are useful for different cache schemes

  17. Presentation Outline • Latency reduction techniques • Software memory coherence in distributed caches • Word-interleaved distributed cache • Flexible Compiler-Managed L0 Buffers • Energy reduction techniques • Multi-Module cache for clustered VLIW processor • Conclusions

  18. Word-Interleaved Cache • Simplify the hardware • As compared to MultiVLIW • Avoid replication • Strides of +1/-1 elements are predominant • Page interleaving • Block interleaving • Word interleaving  best suited
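As a rough sketch of the three interleaving granularities (the block and page sizes below are assumed, not the evaluated configuration), the owning cache module is simply a different slice of the address:

```c
#include <stdint.h>
#include <stdio.h>

#define N_CLUSTERS 4
#define WORD_SIZE  4      /* bytes */
#define BLOCK_SIZE 32     /* assumed L1 block size */
#define PAGE_SIZE  4096   /* assumed page size */

/* Which cluster's cache module owns a byte address under each
   interleaving granularity: consecutive words, blocks or pages
   rotate across the modules. */
static int word_interleaved(uint32_t addr)  { return (addr / WORD_SIZE)  % N_CLUSTERS; }
static int block_interleaved(uint32_t addr) { return (addr / BLOCK_SIZE) % N_CLUSTERS; }
static int page_interleaved(uint32_t addr)  { return (addr / PAGE_SIZE)  % N_CLUSTERS; }

int main(void) {
    /* A stride-1 walk over an int array: word interleaving spreads the
       accesses evenly across the four modules, which is why it suits
       the predominant +1/-1 element strides. */
    for (uint32_t a = 0x1000; a < 0x1020; a += WORD_SIZE)
        printf("addr 0x%x -> word:%d block:%d page:%d\n",
               a, word_interleaved(a), block_interleaved(a), page_interleaved(a));
    return 0;
}
```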

  19. Architecture [Diagram: an L1 cache block holding words W0-W7 is distributed by word interleaving, so each cluster's cache module keeps a two-word subblock with its own TAG (module 1: W0 and W4, module 2: W1 and W5, and so on). Depending on the cluster a memory instruction executes in and where the requested word lives, an access is a local hit, local miss, remote hit or remote miss.]

  20. Instruction Scheduling (I): Unrolling [Diagram: elements a[0]..a[7] are word-interleaved across the four cache modules. In the original loop, for (i=0; i<MAX; i++) { ld r3, @a[i] … }, the single load can only be local 25% of the time. After unrolling by 4, for (i=0; i<MAX; i=i+4) { ld r3, @a[i]; ld r3, @a[i+1]; ld r3, @a[i+2]; ld r3, @a[i+3]; … }, each of the four loads can be placed in the cluster that owns its element, reaching 100% local accesses.]

  21. Instruction Scheduling (II) • Assign an appropriate latency to each memory instruction • Small latencies  ILP ↑, stall time ↑ • Large latencies  ILP ↓, stall time ↓ • Start with a large latency (remote miss) + iteratively reassign more appropriate latencies (local miss, remote hit, local hit) [Diagram: with a small assumed latency the dependent add can be scheduled at cycle 3, increasing ILP but risking stalls; with a large assumed latency the add moves to cycle 5, lowering ILP but avoiding stalls.]
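A toy sketch of the iterative latency-assignment idea: start every memory instruction at the most pessimistic (remote-miss) latency and greedily move it to a more optimistic class while the estimated schedule improves. The latency values, the stand-in cost model and the acceptance test are all assumptions; the real pass reschedules the loop and also accounts for stall time.

```c
#include <stdio.h>

/* Latency classes from most pessimistic to most optimistic
   (the cycle counts are illustrative, not the thesis' numbers). */
enum { REMOTE_MISS, LOCAL_MISS, REMOTE_HIT, LOCAL_HIT, N_CLASSES };
static const int latency[N_CLASSES] = { 20, 12, 7, 2 };

#define N_MEMOPS 3

/* Stand-in cost model: each memory op heads a dependence chain with some
   dependent work, and the schedule is bounded by the longest chain. */
static const int tail_work[N_MEMOPS] = { 5, 9, 3 };

static int schedule_length(const int cls[N_MEMOPS]) {
    int len = 0;
    for (int i = 0; i < N_MEMOPS; i++) {
        int chain = latency[cls[i]] + tail_work[i];
        if (chain > len) len = chain;
    }
    return len;
}

int main(void) {
    int cls[N_MEMOPS];
    for (int i = 0; i < N_MEMOPS; i++) cls[i] = REMOTE_MISS;  /* start pessimistic */

    /* Try a more optimistic class per instruction and keep it only when
       the estimated schedule actually gets shorter. */
    for (int i = 0; i < N_MEMOPS; i++)
        for (int c = cls[i] + 1; c < N_CLASSES; c++) {
            int before = schedule_length(cls), saved = cls[i];
            cls[i] = c;
            if (schedule_length(cls) >= before) { cls[i] = saved; break; }
        }

    for (int i = 0; i < N_MEMOPS; i++)
        printf("memop %d -> latency class %d (%d cycles)\n", i, cls[i], latency[cls[i]]);
    return 0;
}
```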

  22. Instruction Scheduling (III) • Assign instructions to clusters • Non-memory instructions • Minimize inter-cluster communications • Maximize workload balance among clusters • Memory instructions  2 heuristics • Preferred cluster (PrefClus): average preferred cluster of the memory-dependent set • Minimize inter-cluster communications (MinComs): minimize communications for the 1st instruction of the memory-dependent set

  23. Memory Accesses • Sources of remote accesses: • Indirect accesses, restrictions from memory-dependent chains, double-precision data, …

  24. Attraction Buffers • Cost-effective mechanism  ↑ local accesses [Diagram: each cluster adds a small Attraction Buffer (AB) next to its cache module; in a loop that executes load a[i] with i advancing by 4, the words it touches (a[0], a[4], …) are copied into the executing cluster's buffer, raising its local accesses from 0% to 50% in the example.] • Results • ~15% increase in local accesses • ~30-35% reduction in stall time • 5-7% reduction in overall execution time
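A small functional sketch of what an Attraction Buffer does (entry count, FIFO replacement and the word-ownership function are assumptions): the first touch of a remote word copies it into the local buffer, so later touches of the same word become local.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define AB_ENTRIES 8          /* assumed number of buffer entries */
#define N_CLUSTERS 4

/* One cluster's Attraction Buffer: a tiny, fully associative copy of
   remotely owned words, filled on use (FIFO replacement assumed). */
struct attraction_buffer {
    uint32_t tag[AB_ENTRIES];
    int      valid[AB_ENTRIES];
    int      next;             /* FIFO victim pointer */
};

static int owner_cluster(uint32_t addr) { return (addr / 4) % N_CLUSTERS; }

static int ab_lookup(const struct attraction_buffer *ab, uint32_t addr) {
    for (int i = 0; i < AB_ENTRIES; i++)
        if (ab->valid[i] && ab->tag[i] == addr) return 1;
    return 0;
}

static void ab_insert(struct attraction_buffer *ab, uint32_t addr) {
    ab->tag[ab->next] = addr;
    ab->valid[ab->next] = 1;
    ab->next = (ab->next + 1) % AB_ENTRIES;
}

int main(void) {
    struct attraction_buffer ab;   /* the buffer of one cluster */
    memset(&ab, 0, sizeof ab);
    int my_cluster = 2, local = 0, total = 0;

    /* Two passes over the same addresses: the first pass pulls remote
       words into the buffer, the second pass finds them locally. */
    for (int pass = 0; pass < 2; pass++)
        for (uint32_t addr = 0; addr < 32; addr += 4) {
            total++;
            if (owner_cluster(addr) == my_cluster || ab_lookup(&ab, addr))
                local++;                 /* local module or buffer hit */
            else
                ab_insert(&ab, addr);    /* remote access: attract the word */
        }

    printf("local accesses: %d of %d\n", local, total);
    return 0;
}
```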

  25. Performance

  26. Presentation Outline • Latency reduction techniques • Software memory coherence in distributed caches • Word-interleaved distributed cache • Flexible Compiler-Managed L0 Buffers • Energy reduction techniques • Multi-Module cache for clustered VLIW processor • Conclusions

  27. Why L0 Buffers • Still keep the hardware simple, but… • …allow dynamic binding between addresses and clusters

  28. L0 Buffers • Small number of entries  flexibility • Adaptive to the application + dynamic address-cluster binding • Controlled by software  load/store hints • Mark instructions to access the buffers: which and how • Flexible Compiler-Managed L0 Buffers [Diagram: each cluster (INT, FP and MEM units plus a register file) gets a small L0 buffer in front of the shared L1 cache, filled through unpack logic.]

  29. Mapping Flexibility [Diagram: the same 16-byte L1 block can be brought into the L0 Buffers in two ways. With linear mapping, consecutive words go to a single cluster's buffer, which suits a load a[0] with a 1-element stride. With interleaved mapping (1 cycle penalty in the unpack logic), the block's words are scattered across the four clusters' buffers, which suits four loads with a 4-element stride (load a[0], load a[1], load a[2], load a[3]).]
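A minimal sketch of the two mapping choices the hints can select between; the destination functions below are illustrative assumptions about what the unpack logic computes, not its actual implementation.

```c
#include <stdio.h>

#define N_CLUSTERS 4
#define WORD_SIZE  4
#define BLOCK_SIZE 16   /* L1 block of 4 words, as in the slide */

/* Where the unpack logic would place word `w` (0..3) of an L1 block when
   it is brought into the L0 Buffers, for a request issued by `cluster`. */

/* Linear mapping: the whole block goes into the requesting cluster's
   buffer, so a stride-1 load finds every word locally. */
static int linear_dest(int requesting_cluster, int w) {
    (void)w;
    return requesting_cluster;
}

/* Interleaved mapping (1 extra cycle in the unpack logic): consecutive
   words are scattered round-robin across the clusters' buffers, so four
   loads with a 4-element stride each find their words locally. */
static int interleaved_dest(int requesting_cluster, int w) {
    (void)requesting_cluster;
    return w % N_CLUSTERS;
}

int main(void) {
    for (int w = 0; w < BLOCK_SIZE / WORD_SIZE; w++)
        printf("a[%d]: linear -> cluster %d, interleaved -> cluster %d\n",
               w, linear_dest(0, w), interleaved_dest(0, w));
    return 0;
}
```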

  30. Hints and L0-L1 Interface • Memory hints • Access or bypass the L0 Buffers • Data mapping: linear/interleaved • Prefetch hints  next/previous blocks • L0 Buffers are write-through with respect to L1 • Simplifies replacements • Makes the hardware simple • No arbitration • No logic to pack data back correctly • Simplifies coherence among L0 Buffers

  31. Instruction Scheduling • Selective loop unrolling • No unroll vs. unroll by N • Assign latencies to memory instructions • Critical instructions (slack) use L0 Buffers • Do not overflow L0 Buffers • Use counter of L0 Buffer free entries / cluster • Do not schedule critical instruction into cluster with counter == 0 • Memory coherence • Cluster assignment + schedule instructions • Minimize global communications • Maximize workload balance • Critical  Priority to clusters where L0 Buffer can be used • Explicit prefetching
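A small sketch of the "do not overflow" rule: the scheduler keeps a per-cluster counter of free L0 Buffer entries and only gives a critical instruction an L0 hint when its cluster still has room (the buffer size and the reserve/bypass policy are assumptions).

```c
#include <stdio.h>

#define N_CLUSTERS 4
#define L0_ENTRIES 8   /* assumed L0 Buffer size */

/* Free-entry counters kept by the scheduler, one per cluster's L0 Buffer. */
static int free_entries[N_CLUSTERS] = { L0_ENTRIES, L0_ENTRIES, L0_ENTRIES, L0_ENTRIES };

/* A critical memory instruction only gets an L0 hint on a cluster whose
   buffer still has room; otherwise it bypasses the buffers. */
static int schedule_critical_memop(int preferred_cluster) {
    if (free_entries[preferred_cluster] > 0) {
        free_entries[preferred_cluster]--;   /* reserve an entry */
        return 1;                            /* hint: access the L0 Buffer */
    }
    return 0;                                /* hint: bypass to L1 */
}

int main(void) {
    for (int i = 0; i < 10; i++)             /* ten critical ops on cluster 2 */
        printf("memop %d -> %s\n", i,
               schedule_critical_memop(2) ? "L0 Buffer" : "bypass to L1");
    return 0;
}
```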

  32. Number of Entries

  33. Performance

  34. Global Comparative

  35. Presentation Outline • Latency reduction techniques • Software memory coherence in distributed caches • Word-interleaved distributed cache • Flexible Compiler-Managed L0 Buffers • Energy reduction techniques • Multi-Module cache for clustered VLIW processor • Conclusions

  36. Motivation • Energy consumption  1st-class design goal • Heterogeneity • ↓ supply voltage and/or ↑ threshold voltage • Cache memory  ARM10 • D-cache  24% of dynamic energy • I-cache  22% of dynamic energy • Exploit heterogeneity in the L1 D-cache? [Diagram: processor front-end structures tuned for performance, processor back-end structures tuned for energy.]

  37. Multi-Module Data Cache • It is possible to exploit heterogeneity! [Diagrams: Instruction-Based Multi-Module (Abella and González, ICCD 2003) — the L1 D-cache is split into a FAST and a SLOW cache module in front of the L2 D-cache, and a criticality table indexed by the instruction PC steers each access to one of them. Variable-Based Multi-Module — the address space is split into a FAST space and a SLOW space (stack frames distributed via SP1/SP2, heap data and global data placed in either space), and each space is served by the corresponding cache module.]

  38. Cache Configurations [Diagram: each of the two clusters (FU+RF) can be given an 8KB FAST module (L=2, 1 R/W port), an 8KB SLOW module (L=4, 1 R/W port; roughly twice the latency at about one third of the energy), or no module at all, giving configurations such as FAST+FAST, FAST+SLOW, FAST+NONE, SLOW+SLOW and SLOW+NONE, all backed by the L2 D-cache and connected by register buses.]

  39. Instruction-to-Variable Graph (IVG) • Built with profiling information • Variables = global, local, heap [Diagram: a bipartite graph linking the memory instructions (LD1, LD2, ST1, LD3, ST2, LD4, LD5) to the variables they access (V1-V4); each variable is mapped to either the FIRST or the SECOND cache module, and the instructions are steered accordingly.]

  40. Greedy Mapping Algorithm • Initial mapping  all variables to the first address space • Assign affinities to instructions • Express a preferred cluster for memory instructions: a value in [0,1] • Propagate affinities to other instructions • Schedule code + refine the mapping [Flow: compute IVG  compute affinities + propagate affinities  compute mapping  schedule code.]
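A minimal sketch of how per-instruction affinities could be derived from the IVG; the access counts, the variable-to-module mapping and the exact affinity formula (fraction of accesses to variables in the second module) are assumptions for illustration.

```c
#include <stdio.h>

#define N_INSTRS 5
#define N_VARS   4

/* Profile-derived IVG edge weights: accesses[i][v] = how often memory
   instruction i touched variable v (numbers are made up). */
static const int accesses[N_INSTRS][N_VARS] = {
    { 90,  0, 10,  0 },   /* LD1 */
    {  0, 50,  0,  0 },   /* LD2 */
    { 20,  0,  0, 80 },   /* ST1 */
    {  0,  0, 60,  0 },   /* LD3 */
    {  0, 40,  0, 40 },   /* LD4 */
};

/* Current variable-to-module mapping: 0 = FIRST module, 1 = SECOND. */
static const int var_module[N_VARS] = { 0, 0, 1, 1 };

int main(void) {
    /* Affinity in [0,1]: fraction of an instruction's dynamic accesses
       that fall on variables mapped to the SECOND module.  0 means the
       instruction strongly prefers the cluster holding the FIRST module,
       1 the cluster holding the SECOND one. */
    for (int i = 0; i < N_INSTRS; i++) {
        int total = 0, second = 0;
        for (int v = 0; v < N_VARS; v++) {
            total += accesses[i][v];
            if (var_module[v]) second += accesses[i][v];
        }
        double affinity = total ? (double)second / total : 0.5;
        printf("instr %d: affinity %.2f\n", i, affinity);
    }
    return 0;
}
```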

  41. Computing and Propagating Affinity [Diagram: a data dependence graph in which the loads and stores (LD1-LD4, ST1) receive affinity 0 or 1 according to whether the variables they access (V1-V4 in the IVG) are mapped to the FIRST or the SECOND module; the affinities are then propagated along the dependence edges, using instruction latencies and slack, so that non-memory instructions such as adds and muls inherit intermediate values (e.g. affinity 0.4).]

  42. Cluster Assignment • Cluster affinity + affinity range  used to: • Define a preferred cluster • Guide the instruction-to-cluster assignment process • Strongly preferred cluster (affinity ≤ 0.3 or ≥ 0.7 in the example) • Schedule the instruction in that cluster • Weakly preferred cluster (affinity inside the (0.3, 0.7) range) • Schedule the instruction where global communications are minimized [Diagram: with an affinity range of (0.3, 0.7), instruction IA with affinity 0 goes to cluster 1, IB with affinity 0.9 goes to cluster 2, and IC with affinity 0.4 is only weakly preferred and is placed wherever communications are minimized.]
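A small sketch of the affinity-range rule from the example above; the communication cost model is a made-up placeholder for the real minimize-communications heuristic.

```c
#include <stdio.h>

/* Affinity range separating strongly from weakly preferred clusters
   (the 0.3/0.7 thresholds come from the slide's example). */
#define LOW  0.3
#define HIGH 0.7

/* Hypothetical cost model: extra inter-cluster communications incurred
   by placing an instruction on cluster 0 or cluster 1. */
static int comms_if_placed(int cluster, int instr) {
    static const int cost[3][2] = { {0, 2}, {1, 1}, {3, 0} };  /* made up */
    return cost[instr][cluster];
}

static int assign_cluster(int instr, double affinity) {
    if (affinity <= LOW)  return 0;   /* strongly prefers first cluster  */
    if (affinity >= HIGH) return 1;   /* strongly prefers second cluster */
    /* Weak preference: pick whichever cluster minimizes global comms. */
    return comms_if_placed(0, instr) <= comms_if_placed(1, instr) ? 0 : 1;
}

int main(void) {
    const double affinity[3] = { 0.0, 0.4, 0.9 };   /* IA, IC, IB from the slide */
    for (int i = 0; i < 3; i++)
        printf("instr %d (affinity %.1f) -> cluster %d\n",
               i, affinity[i], assign_cluster(i, affinity[i]));
    return 0;
}
```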

  43. EDD Results

  44. Other Results • ED  the SLOW schemes are better • In all cases, these schemes are better than a unified cache • 29-31% better in EDD, 19-29% better in ED • No single configuration is best for all cases

  45. Reconfigurable Cache Results • The OS can set each module to one of three states: FAST mode / SLOW mode / turned off • The OS reconfigures the cache on a context switch • Depending on the applications scheduled in and scheduled out • Two different VDD and VTH for the cache • Reconfiguration overhead: 1-2 cycles [Flautner et al. 2002] • Simple heuristic to show potential • For each application, choose the estimated best cache configuration

  46. Presentation Outline • Latency reduction techniques • Software memory coherence in distributed caches • Word-interleaved distributed cache • Flexible Compiler-Managed L0 Buffers • Energy reduction techniques • Multi-Module cache for clustered VLIW processor • Conclusions

  47. Conclusions • Cache partitioning is a good latency reduction technique • Cache heterogeneity can be exploited to improve energy efficiency • The most energy- and performance-efficient scheme is a distributed data cache • Dynamic vs. static mapping between addresses and clusters • Dynamic for performance (L0 Buffers) • Static for energy consumption (variable-based mapping) • Hardware vs. software-based memory coherence solutions • Software solutions are viable

  48. List of Publications • Distributed Data Cache Memories • ICS, 2002 • MICRO-35, 2002 • CGO-1, 2003 • MICRO-36, 2003 • IEEE Transactions on Computers, October 2005 • Concurrency & Computation: Practice and Experience (to appear late ’05 / ’06) • Heterogeneous Data Cache Memories • Technical report UPC-DAC-RR-ARCO-2004-4, 2004 • PACT, 2005

  49. Questions…
