ECE 412: Microcomputer Laboratory
Lecture 16: Accelerator Design in the XUP Board
Objectives
• Understand accelerator design considerations in a practical FPGA environment
• Gain knowledge of the XUP platform details required for efficient accelerator design
Four Fundamental Models of Accelerator Design
(a) No OS service (simple embedded systems)
(b) Basic OS service
(c) Accelerator as a user-space mmap'd I/O device
(d) Virtualized device with OS scheduling support
Hybrid Hardware/Software Execution Model

[Figure: at compile time, source code passes through compiler analysis/transformations (soft objects) or human hardware design plus synthesis (hard objects), both producing DLLs; at user runtime, the application, resource manager, and linker/loader run over Linux OS kernel modules that manage memory, the CPU, FPGA accelerators, and devices]

• Hardware accelerator as a DLL
  • Seamless integration of hardware accelerators into the Linux software stack for use by mainstream applications
  • The DLL approach enables transparent interchange of software and hardware components
• Application-level execution model
  • Deep compiler analysis and transformations generate CPU code, hardware library stubs, and synthesized components
  • FPGA bitmaps serve as the hardware counterpart to existing software modules
  • The same dynamic-linking library interfaces and stubs apply to both the software and hardware implementations
• OS resource management
  • Services (API) for allocation, partial reconfiguration, saving and restoring status, and monitoring
  • The multiprogramming scheduler can prefetch hardware accelerators in time for their next use
  • Access to the new hardware is controlled to ensure trust under private or shared use
MP3 Decoder: Madplay Library Dithering as a DLL

[Figure: the application reads and decodes MP3 blocks, then calls the dithering function on each sample through a stub; the dynamic linker (DL) binds the call to either the software dithering DLL or the hardware dithering DLL. Both implement the same stages (noise shaping, biasing, random generator, dithering, clipping, quantization); the FPGA pipeline completes in 6 cycles. Output flows through the OS sound driver to the AC'97 codec]

• The madplay shared-library dithering function is implemented as both a software and an FPGA DLL
• Profiling shows audio_linear_dither() accounts for 97% of application time
• The DL (dynamic linker) can switch the call to the hardware or software implementation
• The library is used by ~100 video and audio applications
CPU-Accelerator Interconnect Options
• PLB (Processor Local Bus)
  • Wide transfers: 64 bits
  • Direct access to the DRAM channel
  • Runs at 1/3 of the CPU frequency
  • Large penalty if the bus is busy on the first access attempt
• OCM (On-Chip Memory) interconnect
  • Narrower: 32 bits
  • No direct access to the DRAM channel
  • Runs at the CPU clock frequency
Motion Estimation Design & Experience
• Significant overhead in the mmap and open calls
• This arrangement can therefore only support accelerators that are invoked many times
• Note the dramatic reduction in computation time
• Note the large overhead in data marshalling and write-back
• Full Search gives 10% better compression
• Diamond Search is sequential and thus not suitable for acceleration
JPEG: An Example

[Figure: JPEG pipeline — the original image is converted from RGB to YUV; the Y, U, and V planes are downsampled; each block then passes through the 2D Discrete Cosine Transform (DCT), Quantization (QUANT), Run-Length Encoding (RLE), and Huffman Coding (HC) to produce the compressed image. The per-block stages execute in parallel on independent blocks and are marked as accelerator candidates for implementation in reconfigurable logic; Huffman coding is an inherently sequential region]
JPEG Accelerator Design & Experience
• Based on Model (d)
• System-call overhead on each invocation
• Better protection
• DCT and QUANT are accelerated; data flows directly from DCT to QUANT
• The data copy to the user DMA buffer dominates the cost
Execution Flow of the DCT System Call

Application code:
  open(/dev/accel); /* only once */
  …
  /* construct macroblocks */
  macroblock = …
  syscall(&macroblock, num_blocks)
  …
  /* macroblock now has transformed data */

[Figure: timeline across application, operating system, and hardware —
1. Enable accelerator access for the application (via the DCR bus)
2. Data copy (PPC ↔ memory, over the PLB)
3. Flush cache range (PPC ↔ memory)
4. Set up the DMA transfer to the accelerator (PPC → DMA controller, PLB)
5. The accelerator executes while the PPC polls
6. Set up the DMA transfer back (PPC → DMA controller)
7. Invalidate cache range
8. Data copy back to the user buffer (PPC ↔ memory, over the PLB)]
Software Versus Hardware Acceleration

Overhead is a major issue!
Device Driver Access Cost