
Improving Memory System Performance for Soft Vector Processors


Presentation Transcript


  1. Improving Memory System Performance for Soft Vector Processors Peter Yiannacouras J. Gregory Steffan Jonathan Rose WoSPS – Oct 26, 2008

  2. Soft Processors in FPGA Systems. Data-level parallelism → soft vector processors. [Diagram: a soft processor, programmed with C + a compiler, is easier to use; custom logic, built with HDL + CAD, is faster, smaller, and lower power.] Soft processors are configurable – how can we make use of this?

  3. Vector Processing Primer

     // C code
     for (i = 0; i < 16; i++)
       b[i] += a[i];

     // Vectorized code
     set    vl, 16
     vload  vr0, b
     vload  vr1, a
     vadd   vr0, vr0, vr1
     vstore vr0, b

     Each vector instruction holds many units of independent operations (b[0]+=a[0] through b[15]+=a[15]). [Diagram: with 1 vector lane, the 16 element operations of the vadd execute one at a time.]

  4. Vector Processing Primer – 16x speedup. (Same C code and vectorized code as slide 3.) Each vector instruction holds many units of independent operations. [Diagram: with 16 vector lanes, all 16 element operations b[0]+=a[0] through b[15]+=a[15] execute in parallel, one per lane.]
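
To make the lane-count argument concrete, here is a minimal C sketch (not from the slides; VL and NUM_LANES are illustrative constants) that models how the 16 independent element operations of the vadd map onto parallel lanes: with 1 lane they take 16 steps, with 16 lanes a single step – the 16x speedup above.

    /* Minimal sketch: in each "step" all lanes operate on consecutive elements,
     * so the outer loop runs ceil(VL / NUM_LANES) times --
     * 16 iterations with 1 lane (slide 3), 1 iteration with 16 lanes (slide 4). */
    #include <stdio.h>

    #define VL        16   /* vector length, as in "set vl,16"      */
    #define NUM_LANES 16   /* hypothetical lane count: try 1 vs 16  */

    int main(void) {
        int a[VL], b[VL];
        for (int i = 0; i < VL; i++) { a[i] = i; b[i] = 10 * i; }

        int steps = 0;
        for (int base = 0; base < VL; base += NUM_LANES) {        /* one lane-parallel step   */
            for (int lane = 0; lane < NUM_LANES && base + lane < VL; lane++)
                b[base + lane] += a[base + lane];                 /* all lanes at once in HW  */
            steps++;
        }
        printf("element operations: %d, lane-parallel steps: %d\n", VL, steps);
        return 0;
    }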

  5. Sub-Linear Scalability. Vector lanes are not being fully utilized.

  6. Where Are The Cycles Spent? With 16 lanes, 2/3 (67%) of cycles are spent waiting on the memory unit, often due to cache misses.

  7. Our Goals
     • Improve the memory system:
       • Better cache design
       • Hardware prefetching
     • Evaluate improvements for real:
       • Using a complete hardware design (in Verilog)
       • On real FPGA hardware (Stratix 1S80C6)
       • Running full benchmarks (EEMBC)
       • From off-chip memory (DDR-133MHz)

  8. Current Infrastructure. [Diagram: the VESPA pipelines, detailed on slide 9, form the HARDWARE side.] SOFTWARE: EEMBC C benchmarks are compiled with GCC and linked (ld) for the scalar μP into an ELF binary together with vectorized assembly subroutines assembled by GNU as with vpu vector support; the MINT instruction set simulator provides verification. HARDWARE: the Verilog design is simulated in Modelsim (RTL simulator) for cycle counts and verification, and synthesized with Altera Quartus II v8.0 for area and frequency.

  9. VESPA Architecture Design. [Pipeline diagram: a 3-stage scalar pipeline (Icache, decode, register file, ALU, writeback), a 3-stage vector control pipeline, and a 6-stage vector pipeline (decode, replicate, hazard check, vector register file, per-lane ALUs with multiply, saturate, and right-shift, writeback) share the Dcache.] Supports integer and fixed-point operations, and predication; 32-bit datapaths.

  10. Memory System Design. [Diagram: a vld.w loads 16 contiguous 32-bit words. In VESPA with 16 lanes, the scalar core and the vector coprocessor's lanes 0–15 share a vector memory crossbar into the Dcache (4KB, 16B line), which is backed by DDR with a 9-cycle access latency.]

  11. Memory System Design. [Diagram: the same organization, but with the Dcache grown 4x in line size and 4x in capacity, to 16KB with 64B lines, still backed by 9-cycle DDR.] The wider line gives reduced cache accesses plus some prefetching effect; a small sketch of the access-count arithmetic follows.
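
A small C sketch of the arithmetic behind "reduced cache accesses" (the function name and the example address are illustrative): it counts how many distinct cache lines a unit-stride vld.w of 16 contiguous 32-bit words touches under the two dcache geometries above – 4 lines with 16B lines, but only 1 with 64B lines when the access is line-aligned, which is why the wider line also acts as a small prefetch.

    #include <stdio.h>
    #include <stdint.h>

    /* Number of distinct cache lines spanned by num_words contiguous 32-bit words. */
    static int lines_touched(uint32_t base_addr, int num_words, int line_bytes) {
        uint32_t first = base_addr / line_bytes;
        uint32_t last  = (base_addr + num_words * 4 - 1) / line_bytes;
        return (int)(last - first + 1);
    }

    int main(void) {
        uint32_t addr = 0x1000;   /* hypothetical line-aligned base address */
        printf("16B lines: %d cache accesses\n", lines_touched(addr, 16, 16));  /* 4 */
        printf("64B lines: %d cache access\n",  lines_touched(addr, 16, 64));   /* 1 */
        return 0;
    }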

  12. Improving Cache Design
     • Vary the cache depth & cache line size
       • Using a parameterized design
       • Cache line size: 16, 32, 64, 128 bytes
       • Cache depth: 4, 8, 16, 32, 64 KB
     • Measure performance on 9 benchmarks
       • 6 from EEMBC, all executed in hardware
     • Measure area cost
       • Equate silicon area of all resources used
       • Report in units of Equivalent LEs

  13. Cache Design Space – Performance (Wall Clock Time). [Chart; the design points clock at 122–129MHz.] The best cache design almost doubles the performance of the original VESPA. Cache line size matters more than cache depth (the benchmarks do a lot of streaming). More pipelining/retiming could reduce the clock frequency penalty.

  14. Cache Design Space – Area. [Diagram: a 64B (512-bit) cache line built from M4K block RAMs, each 16 bits wide and 4096 bits in capacity, so 32 M4Ks per line, i.e. a minimum of 16KB of storage; MRAM blocks are the alternative.] System area almost doubled in the worst case.

  15. Cache Design Space – Area (continued)
     a) Choose the cache depth to fill the block RAMs (M4Ks) needed for the line size – see the sketch below
     b) Don't use MRAMs: big, few, and overkill
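
Below is a brief C sketch of the block-RAM sizing argument behind point (a), assuming (as on slide 14) that each M4K is used 16 bits wide; the variable names are illustrative. A 512-bit line needs 32 M4Ks side by side, and those 32 M4Ks hold 16KB regardless of how shallow the cache is, so any depth below 16KB wastes RAM.

    #include <stdio.h>

    int main(void) {
        const int line_bytes = 64;      /* cache line size under study        */
        const int m4k_bits   = 4096;    /* capacity of one Altera M4K block   */
        const int m4k_width  = 16;      /* bits per M4K in this configuration */

        int line_bits   = line_bytes * 8;              /* 512 bits           */
        int m4ks_needed = line_bits / m4k_width;       /* 32 M4Ks            */
        int min_bytes   = m4ks_needed * m4k_bits / 8;  /* 16384 bytes = 16KB */
        int min_depth   = min_bytes / line_bytes;      /* 256 lines          */

        printf("%d M4Ks -> minimum %dKB dcache (%d lines of %dB)\n",
               m4ks_needed, min_bytes / 1024, min_depth, line_bytes);
        return 0;
    }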

  16. Hardware Prefetching Example. [Diagram: without prefetching, each vld.w that misses in the Dcache pays the 9-cycle DDR penalty. With prefetching of 3 blocks, the first vld.w misses and fetches its line plus the next 3 blocks, so the following vld.w instructions hit.]

  17. Hardware Data Prefetching. We measure performance/area using a 64B, 16KB dcache.
     • Advantages:
       • Little area overhead
       • Parallelizes memory fetching with computation
       • Uses the full memory bandwidth
     • Disadvantage:
       • Cache pollution
     • We use sequential prefetching triggered on:
       • a) any miss, or
       • b) a sequential vector instruction miss
     A minimal model of the any-miss policy is sketched below.
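
The sketch below is a minimal C model of sequential prefetching triggered on any miss (the direct-mapped cache, its size, and K=3 are assumptions for illustration; this is not the VESPA RTL): on a miss it fetches the missing line plus the next K lines, which turns most of a streaming pattern's misses into hits at the cost of possible pollution.

    #include <stdio.h>
    #include <stdbool.h>

    #define LINE_BYTES 64
    #define NUM_LINES  256   /* 16KB dcache, modeled as direct-mapped       */
    #define K          3     /* prefetch depth, as in the slide 16 example  */

    static unsigned tags[NUM_LINES];
    static bool     valid[NUM_LINES];

    static void fill_line(unsigned line_addr) {        /* models one DDR fetch */
        unsigned idx = line_addr % NUM_LINES;
        tags[idx]  = line_addr;
        valid[idx] = true;
    }

    /* Returns true on a hit; on a miss, fetches the line and prefetches the next K. */
    static bool access_with_prefetch(unsigned byte_addr) {
        unsigned line_addr = byte_addr / LINE_BYTES;
        unsigned idx = line_addr % NUM_LINES;
        if (valid[idx] && tags[idx] == line_addr)
            return true;
        for (unsigned i = 0; i <= K; i++)              /* miss + K-block prefetch */
            fill_line(line_addr + i);
        return false;
    }

    int main(void) {
        int misses = 0;
        for (unsigned a = 0; a < 16 * LINE_BYTES; a += LINE_BYTES)   /* streaming access */
            if (!access_with_prefetch(a)) misses++;
        printf("misses for 16 sequential lines with K=%d: %d\n", K, misses);  /* 4 instead of 16 */
        return 0;
    }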

  18. Prefetching K Blocks – Any Miss. [Chart] Only half the benchmarks are significantly sped up; the rest are not receptive. Maximum speedup is 2.2x, with a peak average speedup of 28%.

  19. Prefetching Area Cost: Writeback Buffer. [Diagram: prefetching 3 blocks on a vld.w miss can displace dirty lines from the Dcache during the 9-cycle DDR access.] Two options: deny the prefetch, or buffer all dirty lines in a writeback (WB) buffer. The buffer's area cost is small – 1.6% of system area, mostly block RAMs with little logic – and there is no clock frequency impact.
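
A minimal C sketch of the trade-off above (the buffer depth and function name are hypothetical, not taken from VESPA): when a prefetched line would evict a dirty victim, the controller either parks the victim in the writeback buffer or, if the buffer is full, denies that prefetch.

    #include <stdio.h>
    #include <stdbool.h>

    #define WB_DEPTH 4                         /* hypothetical writeback buffer depth */

    static unsigned wb_buf[WB_DEPTH];          /* addresses of buffered dirty lines   */
    static int      wb_count = 0;              /* drained to DDR in the background    */

    /* Returns true if the prefetch may proceed. */
    static bool allow_prefetch(unsigned victim_line, bool victim_dirty) {
        if (!victim_dirty)
            return true;                       /* clean victim: safe to overwrite     */
        if (wb_count < WB_DEPTH) {
            wb_buf[wb_count++] = victim_line;  /* option 2: buffer the dirty line     */
            return true;
        }
        return false;                          /* option 1: deny this prefetch        */
    }

    int main(void) {
        printf("prefetch over clean victim allowed: %d\n", allow_prefetch(7, false));
        printf("prefetch over dirty victim allowed: %d\n", allow_prefetch(9, true));
        return 0;
    }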

  20. Any Miss vs. Sequential Vector Miss. [Chart] The two triggers are collinear – nearly all misses in our benchmarks come from sequential vector memory instructions.

  21. Vector Length Prefetching
     • Previously: a constant number of cache lines prefetched
     • Now: use a multiple of the vector length (VL)
     • Only for sequential vector memory instructions
     • E.g., a vector load of 32 elements
     • Guarantees <= 1 miss per vector memory instruction
     [Diagram: elements 0–31, vld.w → fetch + prefetch 28*k.] A sketch of the line-count calculation follows.
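
A short C sketch of the prefetch-amount calculation (the element size, line size, and function name are assumptions for illustration): on a miss of a sequential vector memory instruction the prefetcher fetches k * VL elements' worth of cache lines, so with k >= 1 the rest of that vector access – and, for larger k, the ones that follow – is already resident.

    #include <stdio.h>

    #define LINE_BYTES 64
    #define ELEM_BYTES 4     /* 32-bit elements */

    /* Cache lines to bring in on a miss: k vector lengths of data, rounded up. */
    static int prefetch_lines(int vl, int k) {
        int bytes = k * vl * ELEM_BYTES;
        return (bytes + LINE_BYTES - 1) / LINE_BYTES;
    }

    int main(void) {
        printf("VL=32, k=1: prefetch %d lines\n", prefetch_lines(32, 1));  /* 2  */
        printf("VL=32, k=8: prefetch %d lines\n", prefetch_lines(32, 8));  /* 16 */
        return 0;
    }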

  22. Vector Length Prefetching – Performance. [Chart] 1*VL prefetching provides good speedup (21% on average) without tuning and with no cache pollution; 8*VL is best, with a peak average speedup of 29% and a maximum of 2.2x. Some benchmarks remain not receptive.

  23. Overall Memory System Performance. [Chart: memory unit stall cycles fall from 67% to 48% to 31%, and miss cycles to 4%, moving from the original 4KB cache to the 16KB cache with prefetching.] The wider line plus prefetching reduces memory unit stall cycles significantly and eliminates all but 4% of miss cycles.

  24. Improved Scalability. Previously: 3–8x speedup range, average of 5x for 16 lanes. Now: 6–13x range, average of 10x for 16 lanes.

  25. Summary
     • Explored cache design
       • ~2x performance for ~2x system area
       • Area growth due largely to the memory crossbar
       • Widened cache line size to 64B and depth to 16KB
     • Enhanced VESPA with hardware data prefetching
       • Up to 2.2x performance, average of 28% for K=15
       • Vector length prefetcher gains 21% on average for 1*VL
       • Good for mixed workloads, no tuning, no cache pollution
       • Peak at 8*VL, average of 29% speedup
     • Overall improved VESPA memory system & scalability
       • Decreased miss cycles to 4%
       • Decreased memory unit stall cycles to 31%

  26. Vector Memory Unit. [Diagram: requests enter a Memory Request Queue. For each memory lane i (Memory Lanes = 4; L = #Lanes − 1), a MUX selects either stride*i or index_i, which is added to the base to form that lane's address. Write data (wrdata0 … wrdataL) is steered through the Write Crossbar into a Memory Write Queue ahead of the Dcache, while load data (rddata0 … rddataL) returns to the lanes through the Read Crossbar.]
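
As a rough C model of the per-lane address generation described above (not the actual Verilog; the function and variable names are illustrative), each memory lane forms its address as base + stride*lane for strided accesses or base + index[lane] for indexed accesses, and the resulting requests are steered to the Dcache through the crossbars:

    #include <stdio.h>
    #include <stdint.h>

    #define LANES 4                      /* "Memory Lanes = 4" in the slide */

    /* Fills addrs[] with one request address per memory lane. */
    static void gen_addresses(uint32_t base, int32_t stride,
                              const uint32_t *index, int indexed,
                              uint32_t addrs[LANES]) {
        for (int lane = 0; lane < LANES; lane++)
            addrs[lane] = indexed ? base + index[lane]                 /* indexed (gather/scatter) */
                                  : base + (uint32_t)(stride * lane);  /* constant stride          */
    }

    int main(void) {
        uint32_t addrs[LANES];
        uint32_t idx[LANES] = { 0, 12, 4, 40 };          /* hypothetical index values */

        gen_addresses(0x1000, 4, NULL, 0, addrs);        /* unit-stride 32-bit words  */
        for (int i = 0; i < LANES; i++) printf("strided lane %d: 0x%x\n", i, (unsigned)addrs[i]);

        gen_addresses(0x2000, 0, idx, 1, addrs);         /* indexed access            */
        for (int i = 0; i < LANES; i++) printf("indexed lane %d: 0x%x\n", i, (unsigned)addrs[i]);
        return 0;
    }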
