
VEGAS: Soft Vector Processor with Scratchpad Memory




Presentation Transcript


  1. VEGAS: Soft Vector Processor with Scratchpad Memory Christopher Han-Yu Chou Aaron Severance, Alex D. Brant, Zhiduo Liu, Saurabh Sant, Guy Lemieux University of British Columbia

  2. Motivation • Embedded processing on FPGAs • High performance, computationally intensive • Soft processors, e.g. Nios/MicroBlaze, are too slow • How to deliver high performance? • Multiprocessor on FPGA • Custom hardware accelerators (Verilog RTL) • Synthesized accelerators (C to FPGA)

  3. Motivation • Soft vector processor to the rescue • Previous work has demonstrated the soft vector processor as a viable option providing: • Scalable performance and area • Purely software-based • Decoupled hardware/software development • Key performance bottlenecks • Memory access latency • On-chip data storage efficiency

  4. Contribution • VEGAS Architecture key features • Cacheless Scratchpad Memory • Fracturable ALUs • Concurrent memory access via DMA • Advantages • Eliminates on-chip data replication • Also: huge # of vectors, long vector lengths • More parallel ALUs • Fewer memory loads/stores

  5. VEGAS Architecture • Vector core: VEGAS @ 120 MHz • Scalar core: Nios II/f @ 200 MHz • Concurrent execution, FIFO-synchronized • VEGAS DMA engine & external DDR2

  6. Scratchpad Memory in Action [Figure: vector scratchpad memory supplying srcA, srcB, and Dest operands to Vector Lanes 0-3]

  7. Scratchpad Memory in Action [Figure: srcA and Dest operands in the scratchpad, continued from the previous slide]

  8. Scratchpad Advantage • Performance • Huge working set (256 kB and up) • Explicitly managed by software • Async load/store via concurrent DMA • Efficient data storage • Double-clocked memory (a traditional RF needs 2x copies) • 8b data stays 8b (a traditional 32b RF uses 4x the storage) • No cache (a traditional RF adds +1 copy)
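
The "async load/store via concurrent DMA" bullet is what enables double buffering: the DMA engine streams the next block from DDR2 into the scratchpad while the vector core computes on the current block. A minimal sketch of that pattern follows; the function names (vegas_dma_to_scratchpad, vegas_dma_wait, process_chunk) are hypothetical placeholders, not the actual VEGAS API.

    /* Double-buffered DMA sketch (hypothetical API names, not the real VEGAS calls).
     * While the vector core works on one scratchpad buffer, the DMA engine
     * concurrently fills the other buffer from external DDR2. */
    #include <stddef.h>
    #include <stdint.h>

    #define CHUNK 4096  /* bytes per scratchpad buffer */

    extern void vegas_dma_to_scratchpad(void *spad_dst, const void *ddr_src, size_t bytes); /* assumed */
    extern void vegas_dma_wait(void);                                                       /* assumed */
    extern void process_chunk(void *spad_buf, size_t bytes);                                /* vector kernel */

    void stream_process(const uint8_t *ddr_src, size_t total_bytes,
                        void *spad_buf0, void *spad_buf1)
    {
        void *bufs[2] = { spad_buf0, spad_buf1 };
        size_t nchunks = total_bytes / CHUNK;

        /* Prime the pipeline: fetch the first chunk. */
        vegas_dma_to_scratchpad(bufs[0], ddr_src, CHUNK);

        for (size_t i = 0; i < nchunks; i++) {
            vegas_dma_wait();                    /* chunk i is now in the scratchpad */
            if (i + 1 < nchunks)                 /* start fetching chunk i+1 in the background */
                vegas_dma_to_scratchpad(bufs[(i + 1) & 1],
                                        ddr_src + (i + 1) * CHUNK, CHUNK);
            process_chunk(bufs[i & 1], CHUNK);   /* vector core computes while DMA runs */
        }
    }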

  9. Scratchpad Advantage • Accessed by address register • Huge # of vectors in scratchpad • VEGAS uses only 8 vector addr. reg. (V0..V7) • Modify content to access different vectors • Auto-increment lessens need to change V0..V7 • Long vector lengths • Fill entire scratchpad
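
To make the V0..V7 scheme concrete, below is a minimal software model of address-register access with auto-increment: a register holds a scratchpad address, an instruction names registers rather than data, and auto-increment steps the registers to the next vector. The names, word indexing, and increment-by-VL behavior are simplifications for illustration, not the exact VEGAS semantics.

    /* Minimal software model of address-register-based scratchpad access.
     * Illustrative only; not the actual VEGAS ISA. */
    #include <stdint.h>
    #include <stdio.h>

    #define SPAD_WORDS 1024
    #define VL         8               /* vector length in 32b elements (example) */

    static int32_t  spad[SPAD_WORDS];  /* models the scratchpad */
    static uint32_t vaddr[8];          /* models address registers V0..V7 (word indices) */

    /* dest = srcA + srcB over VL elements, then auto-increment all three registers */
    static void vadd_autoinc(int vd, int va, int vb)
    {
        for (int i = 0; i < VL; i++)
            spad[vaddr[vd] + i] = spad[vaddr[va] + i] + spad[vaddr[vb] + i];
        vaddr[vd] += VL; vaddr[va] += VL; vaddr[vb] += VL;
    }

    int main(void)
    {
        for (int i = 0; i < SPAD_WORDS; i++) spad[i] = i;  /* dummy data */
        vaddr[1] = 0; vaddr[2] = 256; vaddr[3] = 512;      /* V1=srcA, V2=srcB, V3=dest */

        vadd_autoinc(3, 1, 2);   /* first pair of vectors */
        vadd_autoinc(3, 1, 2);   /* next vectors, no register rewrites needed */
        printf("dest[0..1] = %d %d\n", spad[512], spad[513]);
        return 0;
    }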

  10. Scratchpad Advantage: Median Filter • Vector address registers → easier than unrolling • Traditional vector median filter (compare-and-swap selection network):
      for j = 0..12
        for i = j..24
          V1 = vector[i]          ← vector load
          V2 = vector[j]          ← vector load
          CompareAndSwap(V1, V2)
          vector[j] = V2          ← vector store
          vector[i] = V1          ← vector store
  • Optimize away 1 vector load + 1 vector store per iteration by keeping vector[j] in a temporary • Total of 222 loads and 222 stores
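
For reference, a plain-C model of the compare-and-swap selection network above (scalar illustration code, not actual VEGAS vector code): each of the 25 "vectors" holds one 5x5-window element for a strip of W pixels, and 13 selection passes leave the median in vec[12]. W = 64 and starting the inner loop at j+1 (skipping the redundant self-comparison) are illustrative choices.

    #include <stdint.h>

    #define W 64                       /* pixels processed per strip (vector length) */

    /* Element-wise compare-and-swap: a gets the minima, b gets the maxima. */
    static void compare_and_swap(int32_t *a, int32_t *b)
    {
        for (int k = 0; k < W; k++) {
            int32_t lo = a[k] < b[k] ? a[k] : b[k];
            int32_t hi = a[k] < b[k] ? b[k] : a[k];
            a[k] = lo;
            b[k] = hi;
        }
    }

    /* vec[0..24][0..W-1]: the 25 window elements for W pixels.
     * After 13 selection passes, vec[12] holds the median for every pixel. */
    void median_strip(int32_t vec[25][W])
    {
        for (int j = 0; j <= 12; j++)
            for (int i = j + 1; i < 25; i++)
                compare_and_swap(&vec[j][0], &vec[i][0]);
    }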

  11. Scratchpad Advantage: Median Filter

  12. Fracturable ALUs • Multiplier – uses 4 x 16b multipliers • Multiplier also does shifts + rotate • Adder – uses 4 x 8b adders

  13. Fracturable ALUs Advantage • Increased processing power • 4-Lane VEGAS • 4 x 32b operations / cycle • 8 x 16b operations / cycle • 16 x 8b operations / cycle • Median filter example • 32b data: 184 cycles / pixel • 16b data: 93 cycles / pixel • 8b data: 47 cycles / pixel
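
To make the fracturing idea concrete, below is a software (SWAR) model of a partitioned 32b add: the same word-wide addition produces four independent 8b sums when carries are blocked at byte boundaries, which is how 8b data reaches 4x the operations per cycle per lane. This illustrates the principle only; VEGAS fractures its adder in hardware (4 x 8b adder segments, per slide 12), not with bit-masking tricks.

    #include <stdint.h>
    #include <stdio.h>

    /* Four lane-wise 8b adds packed in one 32b word, with no carry crossing
     * byte boundaries (classic SWAR partitioned add). */
    static uint32_t add8x4(uint32_t a, uint32_t b)
    {
        /* Add the low 7 bits of each byte, then fold the top bits back in. */
        uint32_t low = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);
        return low ^ ((a ^ b) & 0x80808080u);
    }

    int main(void)
    {
        uint32_t a = 0x01FF7F80u, b = 0x01010101u;
        printf("8b lanes: %08X\n", add8x4(a, b));   /* each byte wraps independently */
        printf("32b add : %08X\n", a + b);          /* single full-width add */
        return 0;
    }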

  14. Area and Frequency

  15. ALM Usage

  16. Performance

  17. Area-Delay Product • Area x Delay is the inverse of "throughput per mm2", so lower is better • Compared to earlier vector processors, VEGAS offers 2-3x better throughput per unit area

  18. Integer Matrix Multiply • Integer Matrix Multiply • 4096 x 4096 integers (64MB data set) • Intel Core 2 (65nm), 2.5GHz, 16GB DDR2 • Vanilla IJK: 474 seconds • Vanilla KIJ: 134 s • Tiled IJK: 93 s • Tiled KIJ: 68 s • VEGAS (65nm Altera Stratix3) • Vector: 44 s (Nios only: 5407 s) • 256kB Scratchpad, 32 Lanes (about 50% of chip) • 200MHz NIOS, 100MHz Vector, 1GB DDR2 SODIMM
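
For reference, the two scalar loop orders compared on slide 18, written out in plain C (N = 4096 as on the slide; this is illustration code, not the benchmarked source). The tiled variants block these same loops so each tile's working set fits in cache or, on VEGAS, in the scratchpad; the KIJ inner loop is unit-stride over B and C, which is also the natural shape for long vector operations.

    #include <stddef.h>
    #include <stdint.h>

    #define N 4096

    /* Vanilla IJK: the innermost loop strides B by N words, so it misses cache heavily. */
    void matmul_ijk(const int32_t *A, const int32_t *B, int32_t *C)
    {
        for (size_t i = 0; i < N; i++)
            for (size_t j = 0; j < N; j++) {
                int32_t sum = 0;
                for (size_t k = 0; k < N; k++)
                    sum += A[i * N + k] * B[k * N + j];
                C[i * N + j] = sum;
            }
    }

    /* Vanilla KIJ: C must be zero-initialized; the innermost loop walks B and C
     * row-wise, i.e. unit-stride, which maps directly onto long vector operations. */
    void matmul_kij(const int32_t *A, const int32_t *B, int32_t *C)
    {
        for (size_t k = 0; k < N; k++)
            for (size_t i = 0; i < N; i++) {
                int32_t a = A[i * N + k];
                for (size_t j = 0; j < N; j++)
                    C[i * N + j] += a * B[k * N + j];
            }
    }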

  19. Conclusions • Key features • Scratchpad memory • Enhances performance with fewer loads/stores • No on-chip data replication; efficient storage • Double-clocked to hide memory latency • Fracturable ALUs • Operate on 8b, 16b, 32b data efficiently • Single vector core accelerates many applications • Result • 2-3x better area-delay product than VIPERS/VESPA • Outperforms Intel Core 2 at integer matrix multiply

  20. Issues / Future Work • No floating-point yet • Adding "complex function" support, to include floating-point or similar operations • Algorithms with only short vectors • Split the vector processor into 2, 4, or 8 pieces • Run multiple instances of the algorithm • Multiple vector processors • Connecting them to work cooperatively • Goals: increase throughput, exploit task-level parallelism (i.e., chaining or pipelining)
