
An Interleaved Cache Clustered VLIW Processor

E. Gibert, J. Sánchez* and A. González*. Dept. d’Arquitectura de Computadors, Universitat Politècnica de Catalunya (UPC). *Also at Intel Barcelona Research Center. June 2002.


Presentation Transcript


  1. An Interleaved Cache Clustered VLIW Processor • E. Gibert, J. Sánchez* and A. González* • Dept. d’Arquitectura de Computadors, Universitat Politècnica de Catalunya (UPC) • *Also at Intel Barcelona Research Center • June 2002

  2. Motivation • Capacity-bound vs. communication-bound • Solution: clustered microarchitectures • Partition some hardware resources • Simpler + faster • Power consumption • Communications not homogeneous • Goal: clustering the memory hierarchy in statically scheduled processors

  3. Talk Outline • State-of-the-art: multiVLIW • Interleaved Cache Clustered VLIW • Scheduling Algorithms • Enhancement: Attraction Buffers • Experimental Framework • Results • Conclusions

  4. State-of-the-art: MultiVLIW • Sánchez and González [MICRO’00] • [Figure: clusters 1 to n, each with F.U.s, a register file, and its own L1 data cache; the clusters are linked by register-to-register buses, and the L1 data caches are connected by a coherency network to the next memory level]

  5. Talk Outline • State-of-the-art: multiVLIW • Interleaved Cache Clustered VLIW • Scheduling Algorithms • Enhancement: Attraction Buffers • Experimental Framework • Results • Conclusions

  6. Basic Interleaved Cache Clustered VLIW Processor • [Figure: four clusters (CLUSTER 1 to CLUSTER 4), each with FUs, a register file, and its own cache module, linked by register-to-register buses. Also shown: memory buses and the NEXT MEMORY LEVEL. A cache block (TAG, W0 to W7) is word-interleaved across the cache modules as subblocks, each stored with its own tag: subblock 1 holds W0 and W4, and the other modules hold W1/W5, W2/W6, and W3/W7]
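The word interleaving in the figure can be sketched as a simple address-to-module mapping. This is an illustration only: it assumes N = 4 clusters and an interleaving factor of I = 4 bytes (the values used on the later example slide) and an eight-word block as drawn; `home_module` and `subblocks` are hypothetical names, not taken from the talk.

```python
N_CLUSTERS = 4    # N: number of clusters / cache modules (assumed)
INTERLEAVE = 4    # I: interleaving factor in bytes, i.e. one word (assumed)
BLOCK_WORDS = 8   # words per cache block, W0..W7 as in the figure


def home_module(addr):
    """Return the 0-based cache module that holds the word at byte address addr."""
    return (addr // INTERLEAVE) % N_CLUSTERS


def subblocks():
    """Group the word indices of one cache block by the module storing them."""
    groups = {m: [] for m in range(N_CLUSTERS)}
    for w in range(BLOCK_WORDS):
        groups[home_module(w * INTERLEAVE)].append(w)
    return groups


print(subblocks())  # {0: [0, 4], 1: [1, 5], 2: [2, 6], 3: [3, 7]}
```

Module 0 here corresponds to CLUSTER 1 in the figure, so its subblock is W0 and W4.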

  7. Talk Outline • State-of-the-art: multiVLIW • Interleaved Cache Clustered VLIW • Scheduling Algorithms • Enhancement: Attraction Buffers • Experimental Framework • Results • Conclusions

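  8. Modulo Scheduling • Extract ILP from loops → overlap execution of iterations • [Figure: loop L with operations A, B, C; successive iterations (A B C, A’ B’ C’, A’’ B’’ C’’) start II cycles apart and overlap; the figure labels II, SC, and the kernel, where C, B’, and A’’ execute together]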
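A minimal sketch of how iterations overlap under modulo scheduling: an operation placed at offset t in the loop body issues at cycle i × II + t for iteration i. The offsets for A, B, C and II = 1 are assumptions chosen to mirror the figure; `modulo_schedule` is a hypothetical helper, not the scheduler used in the talk.

```python
def modulo_schedule(op_offsets, ii, iterations):
    """Map each cycle to the (op, iteration) pairs issued in that cycle."""
    timeline = {}
    for it in range(iterations):
        for op, offset in op_offsets.items():
            timeline.setdefault(it * ii + offset, []).append((op, it))
    return timeline


# Three single-cycle operations, one per stage, and a new iteration every cycle.
sched = modulo_schedule({"A": 0, "B": 1, "C": 2}, ii=1, iterations=3)
for cycle in sorted(sched):
    print(cycle, sched[cycle])
```

Cycle 2 of this toy schedule contains C of iteration 0, B of iteration 1, and A of iteration 2, which is the steady-state kernel.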

  9. Base Scheduling Algorithm • Used for Unified Cache • [Flowchart: START → Sort nodes → Next node → Select possible clusters → How many? If 0: II=II+1. If >0: Best profit in output edges → How many? If 1: Schedule it. If >1: Least loaded → Schedule it]
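The cluster-selection step of the flowchart can be sketched as follows. This is a loose illustration under stated assumptions: the slide does not define "profit in output edges", so the `profit` function below (count of already-placed neighbours in a cluster) is a stand-in, and `pick_cluster` and all of its parameters are hypothetical.

```python
def pick_cluster(node, free_slots, load, placed, neighbors):
    """Choose a cluster for node, or return None when II must be increased.

    free_slots: cluster -> bool, load: cluster -> int,
    placed: node -> cluster, neighbors: node -> list of nodes.
    """
    candidates = [c for c, free in free_slots.items() if free]
    if not candidates:
        return None  # no possible cluster: the caller would set II = II + 1

    def profit(cluster):
        # Stand-in metric (assumption): neighbours already placed in this cluster.
        return sum(1 for n in neighbors.get(node, []) if placed.get(n) == cluster)

    best = max(profit(c) for c in candidates)
    best_clusters = [c for c in candidates if profit(c) == best]
    if len(best_clusters) == 1:
        return best_clusters[0]
    # Several clusters tie on profit: take the least loaded one.
    return min(best_clusters, key=lambda c: load[c])


free = {0: True, 1: True, 2: False}
load = {0: 3, 1: 1, 2: 0}
placed = {"ld": 0}
neighbors = {"add": ["ld"]}
print(pick_cluster("add", free, load, placed, neighbors))  # 0
print(pick_cluster("mul", free, load, placed, neighbors))  # 1
```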

  10. Interleaved Cache Scheduling Algorithm • Unroll loop to maximize instructions with a stride multiple of NxI → access ONE cache module • Assign latencies to memory instructions • Assign memory instructions to clusters: • IPBC (Interleaved Pre-Build Chains) → minimize stall time • IBC (Interleaved Build Chains) → minimize compute time
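The unrolling step can be illustrated with a small calculation. The slides give no formula, so this sketch assumes the goal is the smallest unroll factor U such that U × stride is a multiple of N × I; `unroll_factor` is a hypothetical helper, and a real compiler would also have to bound U by code size and register pressure.

```python
from math import gcd


def unroll_factor(stride_bytes, n_clusters=4, interleave_bytes=4):
    """Smallest U such that U * stride_bytes is a multiple of N * I."""
    span = n_clusters * interleave_bytes
    return span // gcd(span, abs(stride_bytes))


print(unroll_factor(4))   # 4: a 4-byte stride becomes a 16-byte stride
print(unroll_factor(8))   # 2
print(unroll_factor(16))  # 1: already a multiple of N * I
```

With N = 4 and I = 4 bytes, a loop walking an array of 4-byte elements is unrolled four times, so each unrolled load always touches the same cache module.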

  11. Memory Dependent Instructions • IPBC → preferred info is used vs. IBC → minimize register comms. • [Figure: loads and stores grouped into memory dependent chain 1 and memory dependent chain 2, together with add, load, and store nodes annotated Preferred=1 or Preferred=2]

  12. Talk Outline • State-of-the-art: multiVLIW • Interleaved Cache Clustered VLIW • Scheduling Algorithms • Enhancement: Attraction Buffers • Experimental Framework • Results • Conclusions

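  13. Enhancement: Attraction Buffers • [Figure: the local logic sends the ADDRESS to both the CACHE MODULE (local data) and the ATTRACTION BUFFER (ABuffer); an Attraction Buffer entry holds a TAG and words such as W2 and W6, the tag is compared with the address (=), and a word select picks the data; each structure returns data and a hit signal]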

  14. An Example • N = 4 clusters, I = 4 bytes • Original loop:
        for (i=0; i<MAX; i++) {
          ld r3, a[i]
          r4 = OP(r3)
          st r4, b[i]
        }
      • Unroll x4 → 16 byte strides (NxI multiple):
        for (i=0; i<MAX; i+=4) {
          ld r31, a[i]     (stride 16)
          ld r32, a[i+1]
          ld r33, a[i+2]
          ld r34, a[i+3]
          r41 = OP(r31)
          r42 = OP(r32)
          r43 = OP(r33)
          r44 = OP(r34)
          st r41, b[i]
          st r42, b[i+1]
          st r43, b[i+2]
          st r44, b[i+3]
        }
      • [Figure: a[0], a[1], a[2], a[3], ... are interleaved across the cache modules of CLUSTER 1 to CLUSTER 4; CLUSTER 4, with its local module and ABuffer, executes ld r31, a[0]; the entries shown are a[3], a[7] and a[0], a[4]]
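The effect suggested by this example can be sketched with a toy model. Everything here is an assumption layered on the slides: array a starts at address 0 with 4-byte elements, blocks are 32 bytes, every access hits in some cache module, the Attraction Buffer has unlimited capacity, and a remote access copies the whole remote subblock into the local Attraction Buffer. The `Cluster` class is hypothetical.

```python
N_CLUSTERS = 4    # N (from the example slide)
INTERLEAVE = 4    # I in bytes (from the example slide)
BLOCK_BYTES = 32  # assumed block size: eight 4-byte words


class Cluster:
    """Toy model of one cluster's loads; stores and coherence are ignored."""

    def __init__(self, cid):
        self.cid = cid        # 0-based cluster index
        self.abuffer = set()  # attracted subblocks as (block number, home module)
        self.counts = {"local": 0, "abuffer": 0, "remote": 0}

    def load(self, addr):
        home = (addr // INTERLEAVE) % N_CLUSTERS
        key = (addr // BLOCK_BYTES, home)
        if home == self.cid:
            kind = "local"
        elif key in self.abuffer:
            kind = "abuffer"       # served locally from the Attraction Buffer
        else:
            kind = "remote"
            self.abuffer.add(key)  # attract the remote subblock for later use
        self.counts[kind] += 1
        return kind


# CLUSTER 4 (index 3) executes "ld r31, a[i]" for i = 0, 4, 8, 12.
c4 = Cluster(3)
kinds = [c4.load(4 * i) for i in (0, 4, 8, 12)]
print(kinds)  # ['remote', 'abuffer', 'remote', 'abuffer']
```

In this model the access to a[0] is remote, but it brings a[4] along in the same subblock, so the next iteration's load is served from the Attraction Buffer.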

  15. Enhancement: Attraction Buffers • Why remote accesses? Why Attraction Buffers? • Double precision accesses → low benefit • Indirect accesses: a[b[i]] → low benefit • “Unclear” preferred cluster → big benefit, e.g. for (i=0; i<MAX; i++) for (k=i; k<i+MAX; k+=4) ld a[k], ld a[k+1], ld a[k+2], ld a[k+3] • Memory dependent chains → big benefit • IBC: preferred cluster info is not used → big benefit

  16. Talk Outline • State-of-the-art: multiVLIW • Interleaved Cache Clustered VLIW • Scheduling Algorithms • Enhancement: Attraction Buffers • Experimental Framework • Results • Conclusions

  17. Experimental Framework • IMPACT C compiler • Modulo scheduling on hyperblock loops • BASE for a Unified Cache • IPBC and IBC for an Interleaved Cache • IPBC and IBC for the MultiVLIW • The same unrolling factor has been used for all architecture configurations! • Mediabench benchmark suite

  18. Experimental Framework

  19. Results (I) • IPBC vs. IBC → similar cycle count results • MultiVLIW vs. Interleaved → similar results BUT… lower complexity!

  20. Results (II) • Memory dependent chains • Interleaved cache → workload imbalance + remote accesses • MultiVLIW → workload imbalance • Working on techniques to overcome scheduling restrictions

  21. Results (III) • Local hits are increased by 15% • Stall time reduced by 30%

  22. Conclusions • Scheduling Algorithms • Good latency assignment process (stall time accounts for 9% of execution time) • Coherence kept through memory dependent chains (5% cycle count degradation) • Attraction Buffers • Effective to increase local hits (15% average) + reduce stall time (30% average) • Reduce remote hits to previously accessed subblocks (70% average) • Cycle count results • similar to Unified Cache and MultiVLIW

  23. Questions
