
Hardware accelerator for PPC microprocessor




  1. Hardware accelerator for PPC microprocessor Final presentation By: Instructor: Kopitman Reem Fiksman Evgeny Stolberg Dmitri

  2. Agenda • Ways to implement an algorithm • Starting with ASC • HW architecture • SW architecture • System optimization • Generic module (iDCT) • Timing results

  3. Abstract • Problem • There are complex functions (e.g. FFT) which take a lot of CPU resources • Consider the ways of implementing such functions and choose the best solution according to the specified constraints • Solutions • Pure SW implementation • Pure HW implementation • Combined HW + SW - ASC technology

  4. Abstract • SW • Low cost • Low performance • HW • High cost • High performance • Combined (HW + SW)

  5. Project Goals • Study of ASC (A Stream Compiler) • Study of the functions in the PamDC library • Implementation of an interface between a generic module and the CPU using ASC • Implementation of a specific module to test the interface • Implementation of the same module in SW, and conclusions about performance

  6. ASC - A Stream Compiler • Combined (SW/HW) code • Familiar C++ syntax • Generates flexible HW • Standard netlist output (EDIF) • Supported by standard CAD tools • Provides HW optimization • UNIX oriented

  7. ASC – code example

      #include "asc.h"

      main(int argc, char **argv)
      {
          printf("Hello World\n");

          STREAM_START;               // ASC code start

          // Hardware Variable Declarations
          HWint in(IN);
          HWint out(OUT);
          HWint tmp(TMP);

          STREAM_LOOP(16);
          tmp = (in << 1) + 55;
          out = tmp;

          STREAM_END;                 // ASC code end
      }

      Software output: Hello World
      Hardware output: 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87
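The hardware output on this slide can be reproduced with a plain software model of the stream expression. The sketch below is not ASC code; `emulate_stream` is a hypothetical helper, and it assumes the input stream is the counter 0, 1, 2, ..., which is what makes the first value 55 and the seventeenth value 87.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Software model of the slide's stream kernel: out = (in << 1) + 55.
// Assumption: the input stream is the counter 0, 1, 2, ...; with that
// input the results match the hardware output listed on the slide.
std::vector<int> emulate_stream(int count) {
    assert(count >= 0);
    std::vector<int> out;
    out.reserve(static_cast<std::size_t>(count));
    for (int in = 0; in < count; ++in) {
        int tmp = (in << 1) + 55;  // same expression as in the ASC code
        out.push_back(tmp);
    }
    return out;
}
```

Calling `emulate_stream(17)` yields the 17 values shown on the slide, from 55 up to 87 in steps of 2.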

  8. System components • Memec evaluation board • Xilinx Virtex-II Pro FPGA with PPC405 • JTAG • LCD, serial port for debug • SW tools • Xilinx EDK • Xilinx Platform Studio • ChipScope

  9. System Bus (PLB) Design Approach - general • [Block diagram: FPGA containing the PPC405 processor, memory modules (memory with EDAC, DRAM), peripheral modules (ASC and other peripherals) and monitor modules]

  10. ASC interface (General view) • [Block diagram: Generic Module, PLB bus, interrupt controller, DMA engine, DMA buffer, Serdes, FIFO_in, FIFO_out; signals: CTRL, Data_in, Data_out, Data, Addr, Fifo_full]

  11. SW review – main algorithm • [Flowchart: Start/reset; system blocks initialization (FIFO, DMA, GPIO, LCD); decision "DMA busy?" (Yes/No); read data packets from ASC application; write data packets to ASC application; decision "Calculation complete?" (Yes/No)]
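The main algorithm on this slide is a polling loop around the DMA engine. A minimal sketch follows, assuming a write-then-read order per packet. `MockAccelerator` and `run_transfer` are hypothetical names, not the project's code or Xilinx driver calls. The mock simply echoes the data back instead of computing anything.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the DMA engine plus the ASC module.
// It echoes written data back and reports "busy" for two polls
// after every transfer.
struct MockAccelerator {
    int busy_polls = 0;        // polls remaining until the DMA is idle
    std::vector<int> pending;  // data written but not yet read back

    bool dma_busy() {
        if (busy_polls > 0) { --busy_polls; return true; }
        return false;
    }
    void write_packet(const std::vector<int>& packet) {
        pending = packet;
        busy_polls = 2;
    }
    std::vector<int> read_packet() {
        std::vector<int> result = pending;
        pending.clear();
        busy_polls = 2;
        return result;
    }
};

// Loop from the flowchart: wait while the DMA is busy, write a packet,
// wait again, read the result, and repeat until every packet is done.
std::vector<std::vector<int>> run_transfer(
        MockAccelerator& acc,
        const std::vector<std::vector<int>>& packets) {
    assert(!acc.dma_busy() || true);  // polling an idle or busy DMA is both fine
    std::vector<std::vector<int>> results;
    for (const auto& packet : packets) {
        while (acc.dma_busy()) { /* poll until idle */ }
        acc.write_packet(packet);
        while (acc.dma_busy()) { /* poll until idle */ }
        results.push_back(acc.read_packet());
    }
    return results;  // corresponds to "calculation complete"
}
```

On the real board the busy check would read a DMA status register and the loop would likely be interrupt-assisted; the sketch only shows the control flow.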

  12. SW review – C code fundamentals • DMA – control and data TX/RX functions • LCD – setup and data TX functions • Data size manipulation • Timer control functions • Mask definitions – user-friendly orientation

  13. iDCT abstract • Reconstructs an image or audio block from its discrete cosine transform • Why iDCT? It is a complex iterative algorithm that takes a lot of CPU resources
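To make the reconstruction concrete, here is a minimal sketch of a one-dimensional DCT and its inverse. It is not the project's module: it assumes the common orthonormal DCT-II / DCT-III pair and uses the direct O(n²) sums rather than any fast algorithm. The function names `dct` and `idct` are illustrative.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

const double kPi = std::acos(-1.0);

// Forward DCT-II with orthonormal scaling (assumed convention).
std::vector<double> dct(const std::vector<double>& x) {
    const std::size_t n = x.size();
    assert(n > 0);
    std::vector<double> f(n, 0.0);
    for (std::size_t k = 0; k < n; ++k) {
        const double scale = std::sqrt((k == 0 ? 1.0 : 2.0) / n);
        for (std::size_t i = 0; i < n; ++i) {
            f[k] += scale * x[i] *
                    std::cos((2.0 * i + 1.0) * k * kPi / (2.0 * n));
        }
    }
    return f;
}

// Inverse transform (DCT-III): rebuilds the samples from the coefficients.
std::vector<double> idct(const std::vector<double>& f) {
    const std::size_t n = f.size();
    assert(n > 0);
    std::vector<double> x(n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < n; ++k) {
            const double scale = std::sqrt((k == 0 ? 1.0 : 2.0) / n);
            x[i] += scale * f[k] *
                    std::cos((2.0 * i + 1.0) * k * kPi / (2.0 * n));
        }
    }
    return x;
}
```

Applying `idct(dct(x))` returns the original samples up to floating-point rounding. The nested loops also show why the transform is costly on a CPU: every output sample needs a full pass over the coefficients.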

  14. ASC design – iDCT module • Discrete Cosine Transform • This transform is used in the current standards for still image (JPEG) and video (MPEG) compression • The principle: Xm - matrix of discrete samples (iDCT samples) Tm - cosine coefficient matrix Fm - DCT matrix
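The formula relating the three matrices did not survive in the transcript. A commonly used matrix form is given below as an assumption, with m taken to be the block size and the orthonormal scaling chosen for illustration; the slide itself may have used a different normalization.

```latex
% Assumed standard 2-D form; X_m, T_m, F_m as named on the slide.
\begin{align}
  F_m &= T_m \, X_m \, T_m^{\mathsf{T}} && \text{(DCT)} \\
  X_m &= T_m^{\mathsf{T}} \, F_m \, T_m && \text{(iDCT)} \\
  (T_m)_{k,i} &= c(k)\cos\!\left(\frac{(2i+1)\,k\,\pi}{2m}\right),
  \qquad c(0)=\sqrt{\tfrac{1}{m}},\quad c(k)=\sqrt{\tfrac{2}{m}}\ \ (k>0)
\end{align}
```

With this scaling T_m is orthogonal, so the inverse transform needs only the transpose of the coefficient matrix.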

  15. ASC design – Optimization (1) • ASC supports optimization for: • Latency • Throughput • Area • For large amounts of data: throughput optimization (calculation time is optimized)

  16. ASC design – Optimization (2) • Optimization… Throughput, Area, Latency?

  17. ASC design – Optimization (3) • Optimization – Area consumption • Absolute values refer to Xilinx Virtex II Pro XC2VP7 FPGA

  18. ASC design – Optimization (4) • Optimization – Area consumption • Optimization by latency is the choice: best throughput and latency, with average area consumption

  19. Clock calculations • [Flowchart: get time 1; set DMA control; decision "Tx/Rx data packet complete?" (Yes/No); get time 2; Calc_time = time2 – time1; LCD write of data and calculation time]
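The measurement on this slide is a plain difference of two timestamps taken around the transfer. A minimal host-side sketch is shown below, with `std::chrono::steady_clock` standing in for the board's hardware timer; `measure_us` is a hypothetical helper name and not part of the project's code.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Returns the time spent in `work`, in microseconds:
// Calc_time = time2 - time1, as in the flowchart.
long long measure_us(const std::function<void()>& work) {
    assert(static_cast<bool>(work));  // an empty std::function cannot be called
    const auto time1 = std::chrono::steady_clock::now();  // get time 1
    work();                                               // Tx/Rx + calculation
    const auto time2 = std::chrono::steady_clock::now();  // get time 2
    return std::chrono::duration_cast<std::chrono::microseconds>(
               time2 - time1).count();
}
```

A monotonic clock is used so that the difference cannot go negative if the wall clock is adjusted during the measurement.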

  20. iDCT running results – SW (1) • [Chart "SW performance": calculation time [us] (0 to 1400) vs. packet length (x*32), packet lengths 1 to 120] • Calculation time grows linearly with data packet length, as expected for iDCT • Basic packet size is 32 bytes. The packet length scale is in number of basic packets

  21. iDCT running results – SW (2) • [Chart "Exponential data increase": log(calculation time [us]) vs. log(packet length) (x*32), packet lengths 1 to 1000000] • Exponential growth in calculation time with exponentially increasing data length

  22. iDCT running results – HW (1) • FIFO size influence (512 bytes) • High calculation time vs. writing new data to FIFO

  23. iDCT running results – HW (2) • FIFO size influence (512 bytes) • High calculation time vs. writing new data to FIFO • Basic packet size is 32 bytes. The packet length scale is in number of basic packets
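The two sizes quoted on the HW result slides fix how many basic packets fit in the FIFO at once: 512 / 32 = 16. The sketch below works that out and, under the assumption that a longer transfer is simply split into FIFO-sized chunks, counts how many FIFO fills a transfer needs. The names are illustrative and the chunking model is not confirmed by the slides.

```cpp
#include <cstddef>

constexpr std::size_t kPacketBytes = 32;   // basic packet size (from the slides)
constexpr std::size_t kFifoBytes   = 512;  // FIFO size (from the slides)

// Number of basic packets that fit in the FIFO at once.
constexpr std::size_t packets_per_fifo() {
    return kFifoBytes / kPacketBytes;
}

// Assumed model: a transfer is split into FIFO-sized chunks, so the
// number of fills is the byte count divided by the FIFO size, rounded up.
constexpr std::size_t fifo_fills(std::size_t packets) {
    return (packets * kPacketBytes + kFifoBytes - 1) / kFifoBytes;
}

static_assert(packets_per_fifo() == 16, "512-byte FIFO holds 16 packets");
```

Under this model a transfer of up to 16 basic packets needs a single fill, and the 17th packet forces a second one.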

  24. iDCT running results – SW vs. HW

  25. Innovations • Make this generic interface hard-coded and include it as part of the FPGA (IP) development package • Development is reduced to C++ coding only • Interconnection between the PPC and the Generic Module becomes transparent • Make the current design faster by using separate DMA channels for read and write

  26. The end
