
Variable-Based Multi-Module Data Caches for Clustered VLIW Processors

Enric Gibert (1,2), Jaume Abella (1,2), Jesús Sánchez (1), Xavier Vera (1), Antonio González (1,2). (1) Intel Barcelona Research Center. (2) Departament d’Arquitectura de Computadors, Universitat Politècnica de Catalunya, Barcelona.


Presentation Transcript


  1. Variable-Based Multi-Module Data Caches for Clustered VLIW Processors • Enric Gibert (1,2), Jaume Abella (1,2), Jesús Sánchez (1), Xavier Vera (1), Antonio González (1,2) • (1) Intel Barcelona Research Center, Intel Labs, Barcelona • (2) Departament d’Arquitectura de Computadors, Universitat Politècnica de Catalunya, Barcelona

  2. Issue #1: Energy Consumption • First-class design goal • Heterogeneity • ↓ supply voltage and/or ↑ threshold voltage • Cache memory → ARM10 • D-cache → 24% dynamic energy • I-cache → 22% dynamic energy • Heterogeneity can be exploited in the D-cache for VLIW processors [Figure: processor front-end and back-end blocks, labeled "Higher performance, higher energy" and "Lower performance, lower energy"]

  3. Issue #2: Wire Delays • From capacity-bound to communication-bound • One possible solution: clustering • Unified cache clustered VLIW processor • Used as a baseline throughout this work [Figure: clusters 1 to n, each with FUs and a register file, connected by global communication buses and attached to a single cache through memory buses]

  4. Contributions • GOAL: exploit heterogeneity in the L1 D-cache for clustered VLIW processors • Power-efficient distributed L1 data cache • Divide data cache into two modules and assign each to a cluster • Modules may be heterogeneous • Map variables statically between cache modules • Develop instruction scheduling techniques • Results summary • Heterogeneous distributed data cache → good design point • Distributed data cache vs. unified data cache • Distributed caches outperform unified schemes in EDD and ED • No single distributed cache configuration is the best • Reconfigurable distributed cache → allows additional improvements

  5. Talk Outline • Variable-Based Multi-Module Data Cache • Distributed Cache Configurations • Instruction Scheduling • Results • Conclusions

  6. Variable-Based Multi-Module Cache • Memory instructions have a preferred cluster → cluster affinity • “Wrong” cluster assignment → affects performance, not correctness [Figure: the logical address space is split into a FIRST SPACE and a SECOND SPACE, each holding global data, heap data and a stack (stack pointers SP1 and SP2, distributed stack frames); variables such as X and Y live in one space. The FIRST MODULE and SECOND MODULE sit between the L2 D-cache and clusters 1 and 2 (FU and RF), which are linked by register buses. Annotations for the example accesses "load X" and "load *p": send request, access memory, send reply back; stall clusters, empty communication buses, resume execution]

  7. Talk Outline • Variable-Based Multi-Module Data Cache • Distributed Cache Configurations • Instruction Scheduling • Results • Conclusions

  8. Distributed Cache Configurations • FAST+NONE • FAST+FAST • FAST+SLOW • SLOW+SLOW • SLOW+NONE • Each module: 8KB, 1 R/W port [Figure: for each configuration, clusters 1 and 2 (FU+RF) connected by register buses, with a FAST or SLOW module attached to a cluster as the first or second module in front of the L2 D-cache; the FAST and SLOW modules are annotated "latency ↑, energy ↓"]

  9. Talk Outline • Variable-Based Multi-Module Data Cache • Distributed Cache Configurations • Instruction Scheduling • Results • Conclusions

  10. Instructions-to-Variables Graph • Built with profiling information • Variables = global, local, heap [Figure: graph linking the memory instructions LD1, LD2, LD3, LD4, LD5, ST1 and ST2 to the variables V1, V2, V3 and V4; each variable is mapped to the FIRST or SECOND cache module, attached to cluster 1 and cluster 2 (FU+RF)]
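
The instructions-to-variables graph on slide 10 can be pictured as a weighted bipartite graph built from a profile run. The following is a minimal sketch under assumptions that are not in the slides: the profile is a flat list of (instruction, variable) access records, edge weights are access counts, and the name `build_ivg` is hypothetical.

```python
from collections import defaultdict

def build_ivg(profile):
    """Return {instruction: {variable: access_count}} from profile records.

    `profile` is an iterable of (instruction, variable) pairs, one pair per
    dynamic memory access seen in the profiling run (assumed format).
    """
    ivg = defaultdict(lambda: defaultdict(int))
    for instr, var in profile:
        ivg[instr][var] += 1
    # Convert to plain dicts so missing keys raise instead of being created.
    return {instr: dict(edges) for instr, edges in ivg.items()}

# Hypothetical profile: LD1 touches two variables, ST1 only one.
profile = [("LD1", "V1")] * 60 + [("LD1", "V2")] * 40 + [("ST1", "V3")] * 100
ivg = build_ivg(profile)
print(ivg["LD1"])  # {'V1': 60, 'V2': 40}
```

Keeping counts on the edges, rather than a plain set of variables, is what lets later steps weigh how often an instruction touches each module.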

  11. Greedy Mapping / Scheduling Algorithm • Initial mapping → all variables to the first address space • Assign affinities to instructions • Express a preferred cluster for memory instructions: a value in [0,1] • Propagate affinities from memory instructions to other instructions • Schedule code + refine mapping [Flow chart boxes: Compute IVG; Compute affinities using IVG + propagate affinities; Compute mapping; Schedule code]

  12. Computing and Propagating Affinity [Figure: dependence graph with the nodes add1 to add7, LD1 to LD4, mul1 and ST1; each node carries a latency (L=1, except mul1 with L=3) and a slack (0, 2 or 5). The memory instructions LD1, LD2, ST1, LD3 and LD4 are linked to the variables V1 to V4, which are mapped to the FIRST or SECOND module; instructions are labeled AFFINITY=0, AFFINITY=1 or AFF.=0.4. Below: clusters 1 and 2 (FU, RF) with their cache modules, connected by register buses]
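
One plausible reading of the affinity labels on slide 12 is that a memory instruction's affinity is the fraction of its profiled accesses that go to variables mapped to the second module, so 0 means "always first module" and 1 means "always second module". The sketch below encodes only that assumption; the function name, the mapping format and the 0.5 fallback are hypothetical, and the slack-based propagation to non-memory instructions is left out because the transcript does not give its rule.

```python
def memory_affinity(edges, mapping):
    """Affinity in [0, 1] of one memory instruction.

    edges:   {variable: access_count} for this instruction (its IVG edges).
    mapping: {variable: "FIRST" or "SECOND"}, the current variable mapping.
    """
    total = sum(edges.values())
    if total == 0:
        return 0.5  # assumption: no profile data means no preference
    second = sum(n for var, n in edges.items() if mapping[var] == "SECOND")
    return second / total

mapping = {"V1": "FIRST", "V2": "SECOND"}
print(memory_affinity({"V1": 60, "V2": 40}, mapping))  # 0.4
print(memory_affinity({"V1": 100}, mapping))           # 0.0
```

Under this reading, an instruction that only ever touches one variable gets a binary affinity of 0 or 1.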

  13. Cluster Assignment • Cluster affinity + affinity range → used to: • Define a preferred cluster • Guide the instruction-to-cluster assignment process • Strongly preferred cluster • Schedule instruction in that cluster • Weakly preferred cluster • Schedule instruction where global comms. are minimized [Figure: instructions IA, IB and IC with the labels Affinity=0, Affinity=0.4 and Affinity=0.9; affinity range (0.3, 0.7) with the thresholds ≤ 0.3 and ≥ 0.7; variables V1, V2 and V3 with the numbers 100, 60 and 40; clusters 1 and 2 (FU+RF), each with a cache module; IC is shown with a "?" between the two clusters]
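
The cluster-assignment rule on slide 13 can be sketched as a threshold test on the affinity. This is an illustration under assumptions, not the authors' scheduler: affinity 0 is taken to mean cluster 1 and affinity 1 cluster 2, `comm_cost` is a hypothetical per-cluster estimate of the inter-cluster communications an assignment would cause, and ties go to cluster 1.

```python
def assign_cluster(affinity, comm_cost, affinity_range=(0.3, 0.7)):
    """Pick cluster 1 or 2 for an instruction.

    affinity:       value in [0, 1]; 0 favors cluster 1, 1 favors cluster 2.
    comm_cost:      {1: cost, 2: cost}, estimated global communications if the
                    instruction is placed in that cluster (assumed input).
    affinity_range: (low, high); outside it the preference is strong.
    """
    low, high = affinity_range
    if affinity <= low:
        return 1  # strongly preferred: cluster 1
    if affinity >= high:
        return 2  # strongly preferred: cluster 2
    # Weakly preferred: minimize global communications (ties go to cluster 1).
    return min((1, 2), key=lambda c: comm_cost[c])

print(assign_cluster(0.0, {1: 5, 2: 0}))  # 1
print(assign_cluster(0.9, {1: 0, 2: 5}))  # 2
print(assign_cluster(0.4, {1: 3, 2: 1}))  # 2
```

With `affinity_range=(0, 1)`, only instructions whose affinity is exactly 0 or 1 are pinned to a cluster; every other instruction is placed by communication cost.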

  14. Talk Outline • Variable-Based Multi-Module Data Cache • Distributed Cache Configurations • Instruction Scheduling • Results • Conclusions

  15. Evaluation Framework • IMPACT compiler infrastructure + 16 Mediabench benchmarks • Cache parameters • CACTI 3.0 + SIA projections + ARM10 datasheets • Data cache consumes 1/3 of the processor energy • Leakage accounts for 50% of the total energy • Results outline • Distributed cache schemes: F+Ø, F+F, F+S, S+S, S+Ø • Affinity range • EDD and ED comparison → the lower, the better • F+Ø used as baseline throughout presentation • Comparison with a unified cache scheme • FAST and SLOW unified schemes • State-of-the-art scheduling techniques for these schemes • Reconfigurable distributed cache [Figure: FAST module (8KB, 1 R/W port, L = 2) and SLOW module (8KB, 1 R/W port, L = 4), annotated "latency x2, energy by 1/3"]
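
The slides use ED and EDD without defining them. Assuming the conventional meaning (energy-delay product and energy-delay-squared product), and normalizing to the F+Ø baseline as slide 15 states, the metrics for a configuration c would read as follows, where E and D denote total energy and execution time (symbols introduced here for illustration):

```latex
\[
\mathrm{ED}(c) = E_c \, D_c, \qquad \mathrm{EDD}(c) = E_c \, D_c^{2}
\]
\[
\mathrm{ED}_{\mathrm{norm}}(c) = \frac{E_c \, D_c}{E_{F+\emptyset} \, D_{F+\emptyset}}, \qquad
\mathrm{EDD}_{\mathrm{norm}}(c) = \frac{E_c \, D_c^{2}}{E_{F+\emptyset} \, D_{F+\emptyset}^{2}}
\]
```

Under this reading, a normalized value below 1 means the configuration beats the baseline, and EDD weighs execution time more heavily than ED does.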

  16. Affinity Range • Affinity plays a key role in cluster assignment • 36% - 44% better in EDD than no-affinity • 32% better in ED than no-affinity • (0,1) affinity range is the best • ~92% of memory instructions access a single variable • Binary affinity for memory instructions

  17. EDD Results

  18. ED Results

  19. Comparison With Unified Cache • Instruction scheduling: Aletà et al. (PACT’02) • Distributed schemes are better than unified schemes • 29-31% better in EDD and 19-29% better in ED [Figure: two unified-cache processors, one with a FAST cache and one with a SLOW cache, each shared by clusters 1 and 2 (FUs and RF)]

  20. Reconfigurable Distributed Cache • The OS can set each module to one of three states: • FAST mode / SLOW mode / Turned-off • The OS reconfigures the cache on a context switch • Depending on the applications scheduled in and scheduled out • Two different VDD and VTH for the cache • Reconfiguration overhead: 1-2 cycles [Flautner et al. 2002] • Simple heuristic to show potential • For each application, choose the estimated best cache configuration
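
The "simple heuristic" on slide 20 picks, per application, the configuration estimated to be best. A minimal sketch follows, assuming that "best" means the lowest energy × delay² (the transcript does not say which metric the heuristic uses) and that per-configuration energy and delay estimates are already available; all names and numbers are hypothetical.

```python
def best_configuration(estimates):
    """Return the configuration name with the lowest energy * delay**2.

    estimates: {config_name: (energy, delay)}, assumed to be normalized
    estimates for one application.
    """
    return min(estimates, key=lambda cfg: estimates[cfg][0] * estimates[cfg][1] ** 2)

def plan_reconfiguration(apps):
    """Map each application to the configuration the OS would select."""
    return {app: best_configuration(est) for app, est in apps.items()}

# Made-up estimates, for illustration only.
apps = {
    "app_a": {"F+F": (1.0, 1.0), "F+S": (0.8, 1.05)},
    "app_b": {"F+Ø": (0.7, 1.3), "S+S": (0.5, 1.6)},
}
print(plan_reconfiguration(apps))  # {'app_a': 'F+S', 'app_b': 'F+Ø'}
```

On a context switch, the OS would look up the incoming application in such a table and switch each module to FAST mode, SLOW mode or off accordingly.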

  21. Talk Outline • Variable-Based Multi-Module Data Cache • Distributed Cache Configurations • Instruction Scheduling • Results • Conclusions

  22. Conclusions • Distributed Variable-Based Multi-Module Cache • Affinity is crucial for achieving good performance • 36-44% better in EDD and 32% in ED than no-affinity • Heterogeneity (FAST+SLOW) is a good design point • 4-11% better in EDD and from 6% worse to 10% better in ED • No single cache configuration is the best • Reconfigurable cache modules → exploit additional 3-4% • Distributed schemes vs. unified schemes • All distributed schemes outperform unified ones • 29-31% better in EDD, 19-29% better in ED

  23. Q&A
