Hardware accelerator for PPC microprocessor

Hardware accelerator for PPC microprocessor Final presentation By: Instructor: Kopitman Reem Fiksman Evgeny Stolberg Dmitri

Agenda • Ways to implement an algorithm • Starting with ASC • HW architecture • SW architecture • System optimization • Generic module (iDCT) • Timing results

Abstract • Problem • There are complex functions (e.g. FFT) which take a lot of CPU resources • Consider the possible ways of implementing such functions and choose the best solution under the specified constraints • Solutions • Pure SW implementation • Pure HW implementation • Combined HW + SW - ASC technology

Abstract • SW • Low cost • Low performance • HW • High cost • High performance • Combined

Project Goals • Study of ASC (A Stream Compiler) • Study of functions in the PamDC library • Implementation of an interface between a generic module and the CPU using ASC • Implementation of a specific module to test the interface • Implementation of the same module in SW, and conclusions about performance

ASC - A Stream Compiler • Combined (SW/HW) code • Familiar C++ syntax • Generates flexible HW • Standard netlist output (EDIF) • Supported by standard CAD tools • Provides HW optimization • UNIX oriented

ASC – code example

    #include <stdio.h>
    #include "asc.h"

    int main(int argc, char **argv)
    {
        printf("Hello World\n");   // plain software part

        STREAM_START;              // ASC code start

        // Hardware variable declarations
        HWint in(IN);
        HWint out(OUT);
        HWint tmp(TMP);

        STREAM_LOOP(16);
        tmp = (in << 1) + 55;      // doubled input plus a constant offset
        out = tmp;

        STREAM_END;                // ASC code end
    }

Software output: Hello World
Hardware output: 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87
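The hardware output listed on this slide can be cross-checked with a plain C++ model of the stream body. This is only a sketch: it assumes the input stream is the sequence 0, 1, 2, ... (the slide does not state the input), and `stream_body` and `run_stream` are hypothetical names that are not part of ASC.

```cpp
#include <cassert>
#include <vector>

// Plain-software model of the stream body from the slide:
//   tmp = (in << 1) + 55; out = tmp;
int stream_body(int in) {
    int tmp = (in << 1) + 55;
    return tmp;
}

// Applies the stream body to the inputs 0, 1, ..., count-1.
// The input pattern is an assumption; the slide shows only the outputs.
std::vector<int> run_stream(int count) {
    assert(count >= 0);
    std::vector<int> out;
    for (int in = 0; in < count; ++in) {
        out.push_back(stream_body(in));
    }
    return out;
}
```

Under that assumption, `run_stream(17)` yields the 17 values 55, 57, ..., 87 shown as the hardware output.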

System components • Memec evaluation board • Xilinx Virtex-II Pro FPGA with PPC405 • JTAG • LCD, serial port for debug • SW tools • Xilinx EDK • Xilinx Platform Studio • ChipScope

Design Approach - general • [Block diagram: FPGA with the PPC405 processor, memory modules with EDAC, DRAM, peripheral modules (including the ASC module), monitor modules and other peripherals, all attached to the system bus (PLB)]

ASC interface (General view) • [Block diagram. Blocks: Generic Module, PLB bus, interrupt controller, DMA engine, DMA buffer, Serdes, FIFO_in, FIFO_out. Signals: CTRL, Data, Addr, Data_in, Data_out, Fifo_full]

SW review – main algorithm • [Flowchart] • Start/reset • System blocks initialization (FIFO, DMA, GPIO, LCD) • DMA busy? (Yes / No) • Read data packets from ASC application • Write data packets to ASC application • Calculation complete? (No / Yes)
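The control flow of this slide can be sketched as a host-side loop. The sketch below is not the project's code: `FakeAscDevice` is a hypothetical stand-in that echoes packets back so the loop can run on a PC, and the order of the write and read steps as well as the busy-wait on each side are assumptions, since the flowchart arrows are not recoverable.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using Packet = std::vector<std::uint8_t>;

// Hypothetical stand-in for the FIFO/DMA path to the ASC module.
// It echoes every packet back unchanged instead of computing anything.
struct FakeAscDevice {
    std::vector<Packet> pending;
    bool initialized = false;

    void init() { initialized = true; }       // FIFO, DMA, GPIO, LCD setup on real HW
    bool dma_busy() const { return false; }   // real code would poll a DMA status register
    void write_packet(const Packet& p) { pending.push_back(p); }
    Packet read_packet() {
        assert(!pending.empty());
        Packet p = pending.front();
        pending.erase(pending.begin());
        return p;
    }
};

// Main loop: initialize, then for each packet wait until the DMA is idle,
// write the packet, wait again, and read the result back.
// The loop ends when every packet has been processed ("calculation complete").
std::vector<Packet> process_all(FakeAscDevice& dev, const std::vector<Packet>& input) {
    dev.init();
    std::vector<Packet> results;
    for (const Packet& packet : input) {
        while (dev.dma_busy()) { /* wait for DMA */ }
        dev.write_packet(packet);
        while (dev.dma_busy()) { /* wait for DMA */ }
        results.push_back(dev.read_packet());
    }
    return results;
}
```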

SW review – C code fundamentals • DMA – control and data TX/RX functions • LCD – setup and data TX functions • Data size manipulation • Timer control functions • MASK definitions – for user-friendly use

iDCT abstract • Reconstructs an image or audio block from its discrete cosine transform • Why iDCT? It is a complex iterative algorithm which takes a lot of CPU resources

ASC design – iDCT module • Discrete Cosine Transform • This transform is used in the current standards for still image (JPEG) and video (MPEG) compression. • The principle: Xm - matrix of discrete samples (iDCT samples) Tm - cosine coefficient matrix Fm - DCT matrix
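The formula relating Xm, Tm and Fm did not survive on this slide. A common matrix form, assumed here rather than taken from the slides, uses an orthonormal DCT-II coefficient matrix T, so that F = T X Tᵀ and X = Tᵀ F T. The sketch below uses an 8x8 block size (as in JPEG) and hypothetical function names; it is a plain reference model, not the ASC hardware design.

```cpp
#include <array>
#include <cmath>

constexpr int N = 8;  // block size, assumed 8x8 as in JPEG
using Matrix = std::array<std::array<double, N>, N>;

// Orthonormal DCT-II coefficient matrix: T[k][n] = c(k) * cos((2n+1) k pi / (2N)).
Matrix dct_matrix() {
    const double pi = std::acos(-1.0);
    Matrix t{};
    for (int k = 0; k < N; ++k) {
        const double c = (k == 0) ? std::sqrt(1.0 / N) : std::sqrt(2.0 / N);
        for (int n = 0; n < N; ++n) {
            t[k][n] = c * std::cos((2 * n + 1) * k * pi / (2.0 * N));
        }
    }
    return t;
}

Matrix multiply(const Matrix& a, const Matrix& b) {
    Matrix c{};
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < N; ++k)
                c[i][j] += a[i][k] * b[k][j];
    return c;
}

Matrix transpose(const Matrix& a) {
    Matrix t{};
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            t[j][i] = a[i][j];
    return t;
}

// Forward transform: F = T * X * T^T.
Matrix dct2d(const Matrix& x) {
    const Matrix t = dct_matrix();
    return multiply(multiply(t, x), transpose(t));
}

// Inverse transform: X = T^T * F * T (valid because T is orthonormal).
Matrix idct2d(const Matrix& f) {
    const Matrix t = dct_matrix();
    return multiply(multiply(transpose(t), f), t);
}
```

Applying `idct2d` to the result of `dct2d` returns the original block up to floating-point rounding, which is a convenient check for a software reference against a hardware module.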

ASC design – Optimization (1) • ASC supports optimization for: • Latency • Throughput • Area • For large amounts of data: throughput optimization, which minimizes calculation time

ASC design – Optimization (2) • Optimization… Throughput, Area, Latency?

ASC design – Optimization (3) • Optimization – Area consumption • Absolute values refer to Xilinx Virtex II Pro XC2VP7 FPGA

ASC design – Optimization (4) • Optimization – Area consumption • Optimization by latency is the choice: best throughput and latency, with average area consumption

Clock calculations • [Flowchart] • Get time 1 • Set DMA control • Tx / Rx data packet complete? (No / Yes) • Get time 2 • Calc_time = time2 – time1 • LCD write: data + calculation time
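The measurement scheme on this slide (take a timestamp, run the transfer, take a second timestamp, subtract) can be sketched on a PC with `std::chrono`. The project itself used timers on the PPC405 board, so the clock source here is a substitution, and `measure_us` is a hypothetical helper name.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Runs `work` once and returns the elapsed time in microseconds,
// mirroring Calc_time = time2 - time1 from the slide.
template <typename Work>
std::int64_t measure_us(Work&& work) {
    const auto time1 = std::chrono::steady_clock::now();
    work();  // on the board: set DMA control and wait for Tx/Rx completion
    const auto time2 = std::chrono::steady_clock::now();
    assert(time2 >= time1);  // steady_clock never goes backwards
    return std::chrono::duration_cast<std::chrono::microseconds>(time2 - time1).count();
}
```

A monotonic clock is used so that the difference cannot become negative if the system time is adjusted during the measurement.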

iDCT running results – SW (1) • [Chart "SW performance": calculation time [us] (0 to 1400) vs. packet length (x*32), for 1 to 120 basic packets] • Calculation time grows linearly with data packet length, as expected for iDCT • Basic packet size is 32 bytes; the packet length scale is in number of basic packets

iDCT running results – SW (2) • [Chart "Exponential data increase": log(calculation time [us]) vs. log(packet length) (x*32), for 1 to 1000000 basic packets] • Calculation time grows exponentially when the data length is increased exponentially

iDCT running results – HW (1) • FIFO size influence (512 bytes) • High calculation time vs. writing new data to FIFO

iDCT running results – HW (2) • FIFO size influence (512 bytes) • High calculation time vs. writing new data to FIFO • Basic packet size is 32 bytes; the packet length scale is in number of basic packets
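The two figures on this slide give a simple relation: a 512-byte FIFO holds 512 / 32 = 16 basic packets. The sketch below works that out and counts how many FIFO fills a transfer needs. It assumes the FIFO is filled completely before it is drained, which the slides do not state, and all names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kFifoBytes = 512;   // FIFO size from the slide
constexpr std::size_t kPacketBytes = 32;  // basic packet size from the slide
constexpr std::size_t kPacketsPerFifo = kFifoBytes / kPacketBytes;

static_assert(kPacketsPerFifo == 16, "512-byte FIFO holds 16 basic packets");

// Converts a payload size in bytes to a number of basic packets.
// The payload is required to be a whole number of packets.
std::size_t bytes_to_packets(std::size_t bytes) {
    assert(bytes % kPacketBytes == 0);
    return bytes / kPacketBytes;
}

// Number of FIFO fills needed to transfer `packets` basic packets
// (ceiling division), assuming each fill uses the whole FIFO.
constexpr std::size_t fifo_fills(std::size_t packets) {
    return (packets + kPacketsPerFifo - 1) / kPacketsPerFifo;
}
```

Under this assumption a transfer of 16 packets fits in one fill, while 17 packets already need a second one.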

iDCT running results – SW vs. HW

Innovations • Make this generic interface hard coded and include it as part of an FPGA (IP) development package. • Development is reduced to C++ coding only • Interconnection between the PPC and the Generic Module becomes transparent • Make the current design faster by using separate DMA channels for read and write

The end
