
Fine-Grain Performance Scaling of Soft Vector Processors



  1. Fine-Grain Performance Scaling of Soft Vector Processors. Peter Yiannacouras, Jonathan Rose, Gregory J. Steffan. ESWEEK – CASES 2009, Grenoble, France, Oct 13, 2009.

  2. FPGA Systems and Soft Processors. Soft processors are used in 25% of FPGA designs [source: Altera, 2009]: implementing computation in software with a compiler takes weeks, while building custom HW in HDL with CAD tools takes months, so soft processors are easier even though custom HW is faster, smaller, and uses less power. Hard processors compete but require a specialized device, with increased cost, board space, latency, and power. This work targets data-level parallelism with vector processors, simplifying FPGA design by customizing the configurable soft processor architecture.

  3. Vector Processing Primer. Each vector instruction holds many units of independent operations:

  // C code
  for (i = 0; i < 16; i++)
      c[i] = a[i] + b[i];

  // Vectorized code
  set    vl, 16
  vload  vr0, a
  vload  vr1, b
  vadd   vr2, vr0, vr1
  vstore vr2, c

  With 1 vector lane, the vadd executes its element operations, vr2[15]=vr0[15]+vr1[15] down through vr2[0]=vr0[0]+vr1[0], one at a time.

  4. Vector Processing Primer (continued). With 16 vector lanes, all 16 independent element operations of the vadd execute at once, giving a 16x speedup on the same vectorized code. Previous work on soft vector processors (CASES'08) demonstrated their scalability, flexibility, and portability.

  5. VESPA Architecture Design (Vector Extended Soft Processor Architecture). VESPA couples a 3-stage scalar pipeline (Decode, RF, ALU, WB) with a 3-stage vector control pipeline (Decode, VC/VS RF, VC/VS WB) and a 6-stage vector pipeline (Decode, Replicate, Hazard check, VR RF, execute, VR WB). The 32-bit vector lanes each contain an ALU and memory unit, with multipliers in some lanes; the scalar and vector pipelines share the Dcache. VESPA supports integer and fixed-point operations [VIRAM].

  6. In This Work
  • Evaluate for real using modern hardware: scale to 32 lanes (previous work did 16 lanes)
  • Add more fine-grain architectural parameters:
    • Scale more finely by augmenting with parameterized vector chaining support
    • Customize to functional unit demand by augmenting with heterogeneous lanes
  • Explore a large design space

  7. Evaluation Infrastructure. We evaluate soft vector processors with high accuracy. Software side: EEMBC benchmarks are compiled with GCC and linked (GNU as/ld) with vectorized assembly subroutines into a binary, which runs on an instruction set simulator for verification. Hardware side: the full Verilog hardware design of the VESPA soft vector processor goes through RTL simulation (cycle counts with DDR2 memory, verified against the instruction set simulation) and through FPGA CAD software targeting a Stratix III 340, yielding area, power, and clock frequency.

  8. VESPA Scalability. Speedup reaches up to 19x, with an average of 11x, for 32 lanes → good scaling. The number of lanes is a powerful parameter, but it is coarse-grained: relative area grows 1 → 1.3 → 1.9 → 3.2 → 6.3 → 12.3 as the lanes double from 1 to 32.

  9. Vector Lane Design Space. Doubling the number of lanes is too coarse-grained: the configurations jump in area (measured in equivalent ALMs, with the largest reaching 8% of the largest FPGA). Reprogrammability allows a more exact fit.

  10. In This Work
  • Evaluate for real using modern hardware: scale to 32 lanes (previous work did 16 lanes)
  • Add more fine-grain architectural parameters:
    • Scale more finely by augmenting with parameterized vector chaining support
    • Customize to functional unit demand by augmenting with heterogeneous lanes
  • Explore a large design space

  11. Vector Chaining. Chaining is the simultaneous execution of independent element operations within dependent instructions. For example, vmul vr20, vr10, vr11 depends on vadd vr10, vr1, vr2 through vr10, but element operation i of the vmul needs only element i produced by the vadd; the element operations 0-7 are otherwise independent, so the two instructions can overlap.

  12. Vector Chaining in VESPA (Lanes=4). Without chaining (B=1), the vector register file is unified and only a single instruction executes at a time: a vmul's Mem and Mul operations must wait until the preceding vadd has completely finished in the ALUs. With chaining (B=2), the vector register file is split into Bank 0 and Bank 1, and multiple instructions execute simultaneously, muxed onto the lanes' functional units. Performance increases if instructions are correctly scheduled.

  13. ALU Replication (Lanes=4, with vector chaining, B=2). With APB=false, the banks share the lanes' ALUs through muxes, so two chained instructions that both need an ALU (e.g. vadd then vsub) still execute one at a time. With APB=true, the ALUs are replicated per bank, so the vadd and vsub execute simultaneously.

  14. Vector Chaining Speedup (on an 8-lane VESPA). Chaining gives a significant cycle speedup over no chaining (22-35% on average), but the benefit is application dependent (5%-76%): some benchmarks want more banks, some want more ALUs, and some don't care. Chaining can be quite costly in area (27%-92%). It is nevertheless more fine-grain than doubling the lanes, achieving 19-89% of that speedup at 86% of the area.

  15. In This Work
  • Evaluate for real using modern hardware: scale to 32 lanes (previous work did 16 lanes)
  • Add more fine-grain architectural parameters:
    • Scale more finely by augmenting with parameterized vector chaining support
    • Customize to functional unit demand by augmenting with heterogeneous lanes
  • Explore a large design space

  16. Heterogeneous Lanes. A 4-lane VESPA (L=4) need not place a multiplier in every lane: with 2 multiplier lanes (X=2), a vmul routes all of its element operations through the lanes that have multipliers.

  17. Heterogeneous Lanes (continued). The lanes without multipliers stall during the vmul. This saves area but reduces speed, depending on the demand on the multiplier.

  18. Impact of Heterogeneous Lanes (on a 32-lane VESPA). The performance penalty is application dependent, ranging from 0% (free) through moderate to 85% (expensive). Area savings are modest (6%-13%) because the multipliers are dedicated FPGA blocks.

  19. In This Work
  • Evaluate for real using modern hardware: scale to 32 lanes (previous work did 16 lanes)
  • Add more fine-grain architectural parameters:
    • Scale more finely by augmenting with parameterized vector chaining support
    • Customize to functional unit demand by augmenting with heterogeneous lanes
  • Explore a large design space

  20. Design Space Exploration using VESPA Architectural Parameters. The parameters fall into three groups: compute architecture, instruction set architecture, and memory architecture.

  21. VESPA Design Space (768 architectural configurations). Normalized wall clock time spans an 18x range and normalized coprocessor area (1, 2, 4, 8, 16, 32, 64) spans a 28x range, trading off roughly 1:1: about 4x more area buys about 4x more speed. The fine-grain design space allows a better-fit architecture, and the smooth trade-off is evidence of efficiency.

  22. Summary. Use software for non-critical data-parallel computation.
  • Evaluated VESPA on modern FPGA hardware: scales up to 32 lanes with 11x average speedup
  • Augmented VESPA with fine-tunable parameters:
    • Vector chaining (by banking the register file): 22-35% better average performance than without, with the impact of the chaining configuration very application-dependent
    • Heterogeneous lanes (lanes without multipliers): multipliers saved at some performance cost (sometimes free)
  • Explored a vast architectural design space: 18x range in performance, 28x range in area

  23. Thank You! • VESPA release: http://www.eecg.utoronto.ca/VESPA

  24. VESPA Parameters. The full parameter table spans compute architecture, instruction set architecture, and memory architecture.

  25. VESPA Scalability. Up to 27x, with an average of 15x, for 32 lanes → good scaling. The number of lanes is a powerful parameter but too coarse-grained: relative area grows 1 → 1.3 → 1.9 → 3.2 → 6.3 → 12.3.

  26. Proposed Soft Vector Processor System Design Flow. We propose adding vector extensions to existing soft processors, and we want to evaluate soft vector processors for real. A designer combines user code with portable vectorized software routines (distributed by the FPGA vendor, e.g. www.fpgavendor.com) on a portable, flexible, scalable soft processor connected to vector lanes, a memory interface, and peripherals. If the soft processor is the bottleneck, increase the number of vector lanes rather than resorting to custom HW.

  27. Vector Memory Unit (Lanes=4; L = #lanes - 1). For each lane i (i = 0 ... L), a mux selects the request address: base + stride*i for strided accesses, or an index (index0 ... indexL) for indexed accesses. Requests enter the memory request queue; write data (wrdata0 ... wrdataL) passes through the write crossbar into the memory write queue, and read data (rddata0 ... rddataL) returns from the Dcache through the read crossbar.

  28. Overall Memory System Performance (16 lanes). A wider cache line plus prefetching reduces memory unit stall cycles significantly (67% → 48% → 31%) and eliminates all but 4% of miss cycles (cache sizes of 4KB and 16KB).
