
Enric Gibert¹, Jesús Sánchez², Antonio González¹,²

Local Scheduling Techniques for Memory Coherence in a Clustered VLIW Processor with a Distributed Data Cache. Enric Gibert¹, Jesús Sánchez², Antonio González¹,². ¹Dept. d’Arquitectura de Computadors, Universitat Politècnica de Catalunya (UPC), Barcelona. ²Intel Barcelona Research Center


Presentation Transcript


  1. Local Scheduling Techniques for Memory Coherence in a Clustered VLIW Processor with a Distributed Data Cache • Enric Gibert¹, Jesús Sánchez², Antonio González¹,² • ¹Dept. d’Arquitectura de Computadors, Universitat Politècnica de Catalunya (UPC), Barcelona • ²Intel Barcelona Research Center, Intel Labs Barcelona

  2. Motivation • Capacity vs. Communication-bound • Clustered microarchitectures • Simpler + faster • Power consumption • Communications not homogeneous • Clustering → embedded/DSP domain

  3. Clustered Microarchitectures • [Figure: CLUSTER 1 to CLUSTER 4, each with functional units (FUs) and a register file, linked by register-to-register communication buses; the L1 cache appears both as a single L1 cache and as one L1 cache module per cluster, connected through memory buses, below the L2 cache]

  4. Contributions • Distribution of data cache • Architecture design + data mapping • Word-interleaved scheme [ICS’02] • Appropriate scheduling techniques [MICRO’02] • Memory coherence • Scheduling techniques for mem. coherence • Local software-based techniques • Applied to word-interleaved cache • Complex conf. (with Attraction Buffers – refer to paper) • Simple conf. (without Attraction Buffers) • Applicable to any other cache configuration

  5. Talk Outline • Architecture and Scheduling Algorithms • Memory Coherence Problem • Solutions • Memory Dependent Chains (MDC) • DDG Transformations (DDGT) • Evaluation • Conclusions

  6. Word-Interleaved Distribution • Cache block in the L2 cache: TAG W0 W1 W2 W3 W4 W5 W6 W7 • Cache module of CLUSTER 1: TAG W0 W4 • Cache module of CLUSTER 2: TAG W1 W5 • Cache module of CLUSTER 3: TAG W2 W6 • Cache module of CLUSTER 4: TAG W3 W7 • Figure label: subblock 1 • Access types: local hit, remote hit, local miss, remote miss • [Figure: four clusters, each with a cache module, functional units and a register file, connected by register-to-register communication buses]
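The mapping on slide 6 can be sketched in a few lines of C. This is only an illustration: the 4-byte word size is an assumption (it is consistent with the 16-byte stride for four elements on slide 7), clusters are numbered from 0 here rather than from 1, and the names `home_cluster` and `classify` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed parameters: 4 clusters (from the slides), 4-byte words (assumption). */
enum { NUM_CLUSTERS = 4, WORD_BYTES = 4 };

typedef enum { LOCAL_HIT, REMOTE_HIT, LOCAL_MISS, REMOTE_MISS } access_kind;

/* 0-based cluster whose cache module holds the word at byte address addr:
 * consecutive words go to consecutive clusters, so W0 and W4 share a module. */
unsigned home_cluster(uint32_t addr) {
    return (addr / WORD_BYTES) % NUM_CLUSTERS;
}

/* Classify an access issued from issuing_cluster; hit tells whether the
 * home module currently holds the word. */
access_kind classify(uint32_t addr, unsigned issuing_cluster, int hit) {
    assert(issuing_cluster < NUM_CLUSTERS);
    int local = home_cluster(addr) == issuing_cluster;
    if (hit) return local ? LOCAL_HIT : REMOTE_HIT;
    return local ? LOCAL_MISS : REMOTE_MISS;
}
```

Under these assumptions, byte addresses 0 and 16 (W0 and W4) both map to cluster index 0, which matches the "TAG W0 W4" module in the figure.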

  7. Scheduling Techniques • Modulo scheduling • Loop unrolling • Assignment of latencies • Padding + Profiling • Data layout: a[0], a[4] in the cache module of CLUSTER 1; a[1], a[5] in CLUSTER 2; a[2], a[6] in CLUSTER 3; a[3], a[7] in CLUSTER 4 • Original loop: for (i=0; i<MAX; i++) { ld r3, a[i]; r4 = OP(r3); st r4, b[i] } • Unrolled loop: for (i=0; i<MAX; i+=4) { ld r31, a[i] (stride 16 bytes); ld r32, a[i+1] (stride 16 bytes); ld r33, a[i+2] (stride 16 bytes); ld r34, a[i+3] (stride 16 bytes); ... }
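Why unrolling by four helps can be checked with a small sketch. It assumes that one array element occupies one interleaved word and that a[0] is aligned to the start of a cache block (which is what padding would arrange); the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { CLUSTERS = 4 };  /* from the slides */

/* 0-based home cluster of a[index], assuming one element per interleaved
 * word and a[0] aligned to a cache-block boundary (both assumptions). */
unsigned cluster_of_element(size_t index) {
    return (unsigned)(index % CLUSTERS);
}

/* Returns 1 if the load of a[i + k], with i stepping by unroll, touches the
 * same cluster in every iteration up to max elements, and 0 otherwise. */
int load_is_cluster_stable(size_t k, size_t unroll, size_t max) {
    assert(unroll > 0 && k < unroll);
    unsigned first = cluster_of_element(k);
    for (size_t i = 0; i + k < max; i += unroll) {
        if (cluster_of_element(i + k) != first) return 0;
    }
    return 1;
}
```

With an unroll factor of 4, each of the four loads has a fixed home cluster, so the scheduler can place it there; with the original loop (a factor of 1), the single load walks through all four clusters.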

  8. Cluster Assignment • Non-memory instructions • Minimize register communications • Maximize workload balance • Memory instructions → 2 heuristics: • PrefClus Heuristic • Preferred Cluster = most accessed cluster • Profiling + Padding • MinComs Heuristic • Minimize register communications • Maximize workload balance • Post-pass phase to increase local accesses
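The PrefClus rule ("preferred cluster = most accessed cluster") amounts to an argmax over profiled access counts. The sketch below assumes the profile is available as one counter per cluster for a given memory instruction; the counter layout, the tie-breaking rule (lowest index wins) and the function name are illustrative assumptions, not details from the talk.

```c
#include <assert.h>
#include <stddef.h>

enum { N_CLUSTERS = 4 };  /* from the slides */

/* Returns the 0-based cluster that a memory instruction accessed most often
 * in the profile run. Ties go to the lowest index (an arbitrary choice). */
unsigned preferred_cluster(const unsigned long accesses[N_CLUSTERS]) {
    assert(accesses != NULL);
    unsigned best = 0;
    for (unsigned c = 1; c < N_CLUSTERS; c++) {
        if (accesses[c] > accesses[best]) best = c;
    }
    return best;
}
```

For a hypothetical profile of {10, 70, 15, 5} accesses, the function returns index 1, that is, the second cluster.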

  9. Talk Outline • Architecture and Scheduling Algorithms • Memory Coherence Problem • Solutions • Memory Dependent Chains (MDC) • DDG Transformations (DDGT) • Evaluation • Conclusions

  10. Memory Coherence Problem • [Figure: CLUSTER 1 to CLUSTER 4; two cache modules, holding a[0], a[4] and a[3], a[7], connected by memory buses to the NEXT MEMORY LEVEL; labels: Store to a[0], Read a[0], Update a[0]] • Memory bus traffic: remote accesses, misses, replacements, others • NON-DETERMINISTIC BUS LATENCY!!!
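The hazard on slide 10 can be illustrated with a toy timing model: if two dependent accesses to a[0] travel to the owning cache module with different, unpredictable bus delays, they may arrive in the wrong order. The model below (arrival = issue cycle + bus latency) and all cycle numbers used with it are assumptions made for illustration only.

```c
#include <assert.h>

/* Cycle at which a request reaches the cache module that owns the address
 * (toy model: issue cycle plus whatever delay the bus imposed). */
unsigned arrival_cycle(unsigned issue_cycle, unsigned bus_latency) {
    return issue_cycle + bus_latency;
}

/* Returns 1 if a store that precedes a load in program order also reaches
 * the owning module before that load, and 0 if the order is inverted. */
int order_preserved(unsigned store_issue, unsigned store_bus_latency,
                    unsigned load_issue, unsigned load_bus_latency) {
    assert(store_issue < load_issue);  /* program order: store first */
    return arrival_cycle(store_issue, store_bus_latency)
         < arrival_cycle(load_issue, load_bus_latency);
}
```

With equal delays, a store issued at cycle 0 and a load issued at cycle 1 stay ordered; if the store is remote and waits 6 cycles on a busy bus while the load is local (delay 0), the load reads a[0] before the update arrives.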

  11. Talk Outline • Architecture and Scheduling Algorithms • Memory Coherence Problem • Solutions • Memory Dependent Chains (MDC) • DDG Transformations (DDGT) • Evaluation • Conclusions

  12. Solutions Outline • Local scheduling solutions → applied at a loop granularity • Memory Dependent Chains (MDC) • Data Dependence Graph Transformations (DDGT) • Store replication • Load-store synchronization • Software-based solutions • Applicable to other configurations • Replicated distributed cache • MultiVLIW [MICRO00] …

  13. Memory Dependent Chains • Sets of aliased instructions: • Memory Dependent Chains (MDC) • Instructions in same set: • Assigned to same cluster • Restrictions on cluster assignment • PrefClus: average preferred cluster • MinComs: minimize comms. when scheduling first node • [Figure: example DDG with nodes n1 load, n2 load, n3 add, n4 store, n6 load, n7 div, n8 add; edge labels: MF = memory-flow, MA = memory-anti, RF = register-flow]
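One way to form the sets of aliased instructions is a union-find pass over the memory dependence edges of the DDG, so that every instruction in a set can then be given the same cluster. The slides do not say how the chains are computed, so the union-find choice, the node numbering and all names below are assumptions.

```c
#include <assert.h>

enum { MAX_NODES = 64 };  /* arbitrary bound for the sketch */

typedef struct {
    int parent[MAX_NODES];
    int n;
} chains;

/* Start with every memory instruction in its own chain. */
void chains_init(chains *s, int n) {
    assert(n > 0 && n <= MAX_NODES);
    s->n = n;
    for (int i = 0; i < n; i++) s->parent[i] = i;
}

/* Representative of the chain containing x (with path halving). */
int chains_find(chains *s, int x) {
    assert(x >= 0 && x < s->n);
    while (s->parent[x] != x) {
        s->parent[x] = s->parent[s->parent[x]];
        x = s->parent[x];
    }
    return x;
}

/* Record a memory dependence (MF, MA or MO) between instructions a and b:
 * both now belong to the same chain. */
void chains_add_mem_dep(chains *s, int a, int b) {
    int ra = chains_find(s, a);
    int rb = chains_find(s, b);
    s->parent[ra] = rb;
}

/* 1 if a and b must be assigned to the same cluster under MDC. */
int same_chain(chains *s, int a, int b) {
    return chains_find(s, a) == chains_find(s, b);
}
```

Register-flow (RF) edges are deliberately not passed to `chains_add_mem_dep`: only memory dependences tie instructions into a chain.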

  14. Memory Dependent Chains • [Figure: store to a[0] and load from a[0]; CLUSTER 1 to CLUSTER 4; two cache modules, holding a[0], a[4] and a[3], a[7], connected by memory buses to the NEXT MEMORY LEVEL]

  15. DDGT: Store Replication • Overcome MEM_FLOW (MF) and MEM_OUT (MO) • [Figure: store replication turns store A into a local instance and remote instances (store A, store A’, store A’’, store A’’’); first example: store A with an MF dependence to load B; second example: store A with an MO dependence to store B, where store B is also replicated (store B, store B’, store B’’, store B’’’)]
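The "local instance" and "remote instances" of a replicated store can be pictured as follows: the store gets one instance per cluster, and for any given address exactly one of them sits in the cluster that owns the word. The sketch assumes the word-interleaved mapping with 4-byte words and 0-based cluster indices; what the remote instances actually do at run time is not stated on the slide and is not modelled here. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { NUM_CLUSTERS = 4, WORD_BYTES = 4 };  /* 4-byte words are an assumption */

typedef struct {
    unsigned cluster;  /* cluster this instance is scheduled in */
    int is_local;      /* 1 if this cluster's module owns the address */
} store_instance;

/* 0-based cluster whose cache module owns the word at addr. */
unsigned owner_cluster(uint32_t addr) {
    return (addr / WORD_BYTES) % NUM_CLUSTERS;
}

/* Create one instance of the store per cluster and mark the local one. */
void replicate_store(uint32_t addr, store_instance out[NUM_CLUSTERS]) {
    assert(out != 0);
    for (unsigned c = 0; c < NUM_CLUSTERS; c++) {
        out[c].cluster = c;
        out[c].is_local = (owner_cluster(addr) == c);
    }
}

/* Number of instances marked local; always 1 for this mapping. */
unsigned count_local(const store_instance inst[NUM_CLUSTERS]) {
    unsigned n = 0;
    for (unsigned c = 0; c < NUM_CLUSTERS; c++) n += inst[c].is_local ? 1u : 0u;
    return n;
}
```

Because every cluster holds an instance, the store's operands have to be available in every cluster, which is one way to read the "increase number of register communications" remark on slide 16.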

  16. DDGT: Store Replication • Increase number of register communications!!! • [Figure: local instance and remote instances of a store; CLUSTER 1 to CLUSTER 4; two cache modules, holding a[0], a[4] and a[3], a[7], connected by memory buses to the NEXT MEMORY LEVEL]

  17. DDGT: ld-st Synchronization • Overcome MEM_ANTI (MA) dependences • Special cases: • Store is already REG_FLOW dependent on the load • Impossible recurrences • [Figure: two load-store sync. examples, before and after the transformation; nodes: load A, add, store B, store C and a fake consumer ("fake cons"); edge labels: RF, MA, MO, SYNC]

  18. MDC Solution: Case Study • Example: load A, load B, store C, with MA, MF, MF dependences • Figure labels: "always accesses data in cluster 1", "always accesses data in cluster 2" • Latency LH = 1 cycle • Latency RH = 5 cycles • Impact on compute time • May increase the IIres • [Figure: modulo reservation tables (MRT) over clusters C1 to C4, with IIres=2 and IIres=3] • Impact on stall time • May increase remote accesses • Extra stall cycles = 3 cycles / iteration
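The effect of MDC on IIres can be reproduced with the usual resource bound of modulo scheduling, restricted here to memory units: the busiest cluster determines the bound. The sketch assumes each cluster has the same number of memory units and ignores every other resource type; the placements used with it are illustrative and are not read off the slide's MRT figure.

```c
#include <assert.h>

enum { CLUSTER_COUNT = 4 };  /* from the slides */

/* Resource-constrained initiation interval for memory operations only:
 * max over clusters of ceil(mem_ops[c] / mem_units_per_cluster), at least 1. */
unsigned res_ii(const unsigned mem_ops[CLUSTER_COUNT],
                unsigned mem_units_per_cluster) {
    assert(mem_units_per_cluster > 0);
    unsigned ii = 1;
    for (unsigned c = 0; c < CLUSTER_COUNT; c++) {
        unsigned need = (mem_ops[c] + mem_units_per_cluster - 1)
                      / mem_units_per_cluster;
        if (need > ii) ii = need;
    }
    return ii;
}
```

With one memory unit per cluster, spreading three memory instructions as {2, 1, 0, 0} gives a bound of 2, while forcing all three into one cluster, as a memory dependent chain does, gives {3, 0, 0, 0} and a bound of 3.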

  19. DDGT Solution: Case Study • Example: load A, store B and a set of memory instructions X, with MA and MF dependences • Impact on compute time • More instructions (IIres) • Store replication • Fake consumers (few) • Register communications • [Figure: modulo reservation tables (MRT) over clusters C1 to C4, with IIres=2 and IIres=3] • Impact on stall time • Small • New dependences may decrease slack of some memory instructions

  20. Talk Outline • Architecture and Scheduling Algorithms • Memory Coherence Problem • Solutions • Memory Dependent Chains (MDC) • DDG Transformations (DDGT) • Evaluation • Conclusions

  21. Evaluation Framework • IMPACT C compiler • Compile + optimize + memory disambiguation • Mediabench benchmark suite

  22. Evaluation Framework

  23. Local vs. Remote Accesses

  24. Execution Time

  25. Other Configurations • Configuration 1 • Register buses: 2 buses, latency 4 • Memory buses: 4 buses, latency 2 • More pressure on register buses • MDC outperforms DDGT in all cases → MDC requires fewer register communications • Configuration 2 • Register buses: 4 buses, latency 2 • Memory buses: 2 buses, latency 4 • More pressure on memory buses • DDGT outperforms best MDC in several cases: epicdec 17%, pgpdec 20%, pgpenc 9%, rasta 7%…

  26. Talk Outline • Architecture and Scheduling Algorithms • Memory Coherence Problem • Solutions • Memory Dependent Chains (MDC) • DDG Transformations (DDGT) • Evaluation • Conclusions

  27. Conclusions • Memory coherence problem • Two software-based solutions: MDC and DDGT • Applied to a clustered VLIW processor with a word-interleaved cache • MDC vs DDGT • Results depend on architecture configuration • MDC outperforms DDGT in most cases • DDGT better by up to 20% in a specific configuration • Sets of memory dependent insts. are small • DDGT → freedom in cluster assignment • Increase local accesses by 15% → reduce stall time

  28. Questions?
