
Lecture 16: Accelerator Design in the XUP Board



Presentation Transcript


1. ECE 412: Microcomputer Laboratory, Lecture 16: Accelerator Design in the XUP Board

2. Objectives
• Understand accelerator design considerations in a practical FPGA environment
• Gain knowledge of the details of the XUP platform required for efficient accelerator design

3. Four Fundamental Models of Accelerator Design
• No OS service (in simple embedded systems)
• Base OS service: accelerator as a user-space mmapped I/O device
• Virtualized device with OS scheduling support

4. Hybrid Hardware/Software Execution Model
[Diagram: a user-level function or device driver exists as a soft object (source code passed through compiler analysis/transformations) or a hard object (human-designed hardware passed through synthesis) at compile time; at user runtime, the application, linker/loader, and resource manager bind the resulting DLL; at kernel runtime, Linux OS modules manage memory, the CPU, FPGA accelerators, and devices.]
• Hardware accelerator as a DLL
- Seamless integration of hardware accelerators into the Linux software stack for use by mainstream applications
- The DLL approach enables transparent interchange of software and hardware components
• Application-level execution model
- Compiler deep analysis and transformations generate CPU code, hardware library stubs, and synthesized components
- FPGA bitmaps serve as the hardware counterpart to existing software modules
- The same dynamic-linking library interfaces and stubs apply to both software and hardware implementations
• OS resource management
- Services (API) for allocation, partial reconfiguration, saving and restoring status, and monitoring
- The multiprogramming scheduler can pre-fetch hardware accelerators in time for their next use
- Access to the new hardware is controlled to ensure trust under private or shared use
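To make the interchange concrete, here is a minimal C sketch of the dynamic-linking path, assuming hypothetical library names (libxform_hw.so, libxform_sw.so) and a hypothetical exported function xform(); none of these names come from the lecture. The application resolves the symbol through the ordinary dlopen/dlsym interface, so whether a software routine or an FPGA-backed stub answers the call is invisible to it.

#include <dlfcn.h>
#include <stdio.h>

/* Signature shared by the software and hardware implementations. */
typedef int (*xform_fn)(const int *in, int *out, int n);

int main(void)
{
    /* Prefer the hardware-backed stub library; fall back to software.
       Both export the same symbol, so the caller never changes. */
    void *lib = dlopen("libxform_hw.so", RTLD_NOW);
    if (!lib)
        lib = dlopen("libxform_sw.so", RTLD_NOW);
    if (!lib) {
        fprintf(stderr, "dlopen: %s\n", dlerror());
        return 1;
    }

    xform_fn xform = (xform_fn)dlsym(lib, "xform");
    if (!xform) {
        fprintf(stderr, "dlsym: %s\n", dlerror());
        return 1;
    }

    int in[8] = {0}, out[8];
    xform(in, out, 8);  /* identical call for either implementation */

    dlclose(lib);
    return 0;
}

(Link with -ldl.) The slide's resource manager would sit behind this choice, loading a bitmap and handing back the stub library when the FPGA implementation is selected.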

5. MP3 Decoder: Madplay Lib. Dithering as a DLL
[Diagram: the application decodes MP3 blocks and writes samples through a dithering DLL; the dynamic linker (DL) binds the call either to the software dithering DLL or, through a stub, to the hardware dithering DLL. The dithering pipeline (quantization, dithering, clipping, fed by noise shaping, biasing, and a random generator) takes 6 cycles on the FPGA, and output reaches the AC'97 codec through the OS sound driver.]
• Madplay shared-library dithering function implemented as both a software and an FPGA DLL
• Software profiling shows audio_linear_dither() takes 97% of application time
• The DL (dynamic linker) can switch the call to the hardware or the software implementation
• The library is used by ~100 video and audio applications
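For illustration, a C sketch of what the hardware side of such a dithering DLL could look like; the device path /dev/dither, the read/write protocol, and the simplified signature are assumptions, not madplay's actual interface. The stub keeps the software function's name so the dynamic linker can bind existing callers to it unchanged.

#include <fcntl.h>
#include <unistd.h>

static int accel_fd = -1;

/* Simplified stand-in for the software function's signature. */
long audio_linear_dither(const long *samples, long *dithered, int n)
{
    if (accel_fd < 0) {
        accel_fd = open("/dev/dither", O_RDWR);  /* lazily, once */
        if (accel_fd < 0)
            return -1;
    }
    /* Marshal samples to the FPGA pipeline and collect the results.
       The caller cannot tell this from the software implementation. */
    if (write(accel_fd, samples, n * sizeof *samples) < 0)
        return -1;
    if (read(accel_fd, dithered, n * sizeof *dithered) < 0)
        return -1;
    return n;
}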

6. CPU-Accelerator Interconnect Options
• PLB (Processor Local Bus)
- Wide transfers: 64 bits
- Access to the DRAM channel
- Runs at 1/3 of the CPU frequency
- Big penalty if the bus is busy during the first attempt to access it
• OCM (On-Chip Memory) interconnect
- Narrower: 32 bits
- No direct access to the DRAM channel
- Runs at the CPU clock frequency

7. Motion Estimation Design & Experience
• Significant overhead in the mmap and open calls
• This arrangement can therefore only support accelerators that will be invoked many times
• Note the dramatic reduction in computation time
• Note the large overhead in data marshalling and write-back
• Full search gives 10% better compression
• Diamond search is sequential and thus not suitable for acceleration
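The overhead pattern follows from how a user-space mmapped accelerator is driven; here is a hedged C sketch, with a hypothetical device path and register layout, showing why the open/mmap cost is paid once while each invocation afterwards is only a few loads and stores.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define ACCEL_REG_SPAN 0x1000   /* one page of control registers */

int main(void)
{
    int fd = open("/dev/accel", O_RDWR);      /* expensive, paid once */
    if (fd < 0) { perror("open"); return 1; }

    volatile uint32_t *regs =
        mmap(NULL, ACCEL_REG_SPAN, PROT_READ | PROT_WRITE,
             MAP_SHARED, fd, 0);              /* expensive, paid once */
    if ((void *)regs == MAP_FAILED) { perror("mmap"); return 1; }

    /* Per-invocation cost is now a handful of bus transactions.
       Register offsets and meanings below are hypothetical. */
    for (int frame = 0; frame < 1000; frame++) {
        regs[0] = frame;                      /* e.g. block address */
        regs[1] = 1;                          /* e.g. start command */
        while (regs[2] == 0)                  /* e.g. poll done bit */
            ;
    }

    munmap((void *)regs, ACCEL_REG_SPAN);
    close(fd);
    return 0;
}

Amortized over many invocations the setup cost vanishes, which is exactly why this arrangement only pays off for frequently invoked accelerators.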

8. JPEG: An Example
[Diagram: the original RGB image is converted from RGB to YUV; each of the Y, U, and V planes is downsampled and then passes through the 2D Discrete Cosine Transform (DCT) and Quantization (QUANT), which execute in parallel on independent blocks and form the accelerator candidate implemented as reconfigurable logic; Run-Length Encoding (RLE) and Huffman Coding (HC) form the inherently sequential region that produces the compressed image.]
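A structural C sketch of that pipeline, with hypothetical types and stub stage bodies, shows why DCT and Quant are the accelerator candidate: the per-block stages are mutually independent, while entropy coding must consume blocks in scan order.

#include <stddef.h>

typedef struct { short coeff[64]; } block_t;                 /* one 8x8 block */
typedef struct { size_t nblocks; block_t *blocks; } plane_t; /* Y, U, or V    */

/* Stub stage bodies; only the dependence structure matters here. */
static void dct_quant(block_t *b)         { (void)b; /* 2-D DCT + QUANT */ }
static void rle_huffman(const block_t *b) { (void)b; /* RLE + HC        */ }

static void encode_planes(plane_t planes[3])
{
    /* Parallel region: every block is independent, so each call can be
       dispatched to the reconfigurable-logic accelerator on its own. */
    for (int p = 0; p < 3; p++)
        for (size_t i = 0; i < planes[p].nblocks; i++)
            dct_quant(&planes[p].blocks[i]);

    /* Inherently sequential region: the entropy coder must see blocks
       in order to emit one contiguous bitstream. */
    for (int p = 0; p < 3; p++)
        for (size_t i = 0; i < planes[p].nblocks; i++)
            rle_huffman(&planes[p].blocks[i]);
}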

9. JPEG Accelerator Design & Experience
• Based on Model (d)
• System call overhead on each invocation
• Better protection
• DCT and Quant are accelerated
• Data flows directly from DCT to Quant
• Data copying to the user DMA buffer dominates the cost
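A C sketch of the user-side invocation under model (d); the ioctl command number, the request struct, and the device path are hypothetical, chosen only to show that every batch of macroblocks costs one kernel crossing plus the copies the next slide breaks down.

#include <stdint.h>
#include <sys/ioctl.h>

struct accel_req {
    void    *blocks;      /* macroblock buffer in user space    */
    uint32_t num_blocks;  /* number of blocks to DCT + quantize */
};

/* Hypothetical command number for the DCT/Quant pipeline. */
#define ACCEL_DCT_QUANT _IOWR('a', 0, struct accel_req)

/* fd comes from a one-time open("/dev/accel", O_RDWR). */
int accel_dct_quant(int fd, int16_t *blocks, uint32_t n)
{
    struct accel_req req = { .blocks = blocks, .num_blocks = n };
    /* One system call per invocation: the kernel copies the data in,
       runs DCT and Quant back to back in hardware, and copies it out. */
    return ioctl(fd, ACCEL_DCT_QUANT, &req);
}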

10. Execution Flow of the DCT System Call
[Diagram: the call's timeline crosses the application, operating system, and hardware columns; data moves over the PLB, and the DMA controller is programmed through the DCR bus.]

Application code:
open("/dev/accel");  /* only once: enables accelerator access for the application */
…
/* construct macroblocks */
macroblock = …
syscall(&macroblock, num_blocks);
…
/* macroblock now has transformed data */

Inside the system call:
1. Data copy: the PPC copies the user buffer into a kernel memory buffer.
2. Flush cache range: the PPC flushes the buffer's cache lines out to memory.
3. Setup DMA transfer: the PPC programs the DMA controller, which moves the data over the PLB to the accelerator.
4. Poll: the PPC polls while the accelerator executes.
5. Setup DMA transfer: the PPC programs the DMA controller to move the results back over the PLB.
6. Invalidate cache range: the PPC invalidates the buffer's cache lines.
7. Data copy: the PPC copies the results back into the user buffer.
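The kernel half of that flow could look roughly like the sketch below. copy_from_user(), copy_to_user(), and cpu_relax() are real Linux kernel primitives; struct accel_dev, BLOCK_BYTES, and the accel_* helper functions are placeholders standing in for the driver's own code, so this is a shape, not a working driver.

/* Placeholder driver state and helpers; names are illustrative only. */
static long accel_run_dct(struct accel_dev *dev,
                          void __user *ubuf, u32 num_blocks)
{
    size_t len = num_blocks * BLOCK_BYTES;

    /* Data copy: user buffer -> kernel DMA buffer (a dominant cost). */
    if (copy_from_user(dev->dma_buf, ubuf, len))
        return -EFAULT;

    accel_flush_cache_range(dev->dma_buf, len);      /* flush cache range   */
    accel_dma_start(dev, TO_ACCEL, len);             /* set up DMA transfer */
    while (!accel_done(dev))                         /* poll while hardware */
        cpu_relax();                                 /* executes            */
    accel_dma_start(dev, FROM_ACCEL, len);           /* DMA results back    */
    accel_dma_wait(dev);                             /* until transfer ends */
    accel_invalidate_cache_range(dev->dma_buf, len); /* invalidate range    */

    /* Data copy: kernel DMA buffer -> user buffer. */
    if (copy_to_user(ubuf, dev->dma_buf, len))
        return -EFAULT;
    return 0;
}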

11. Software Versus Hardware Acceleration
Overhead is a major issue!

12. Device Driver Access Cost
