
DSP Algorithms on FPGA, Part II: Digital Image Processing



Presentation Transcript


  1. DSP Algorithms on FPGA, Part II: Digital Image Processing

  2. Content • Overview of image processing and FPGAs • Algorithm to FPGA Mapping Flow • Nested Loop Algorithms and MODG • Example: Motion Estimation • Conclusion and Future Trends

  3. Video signal in different formats (resolution in pixels, frame rate in frames/s, pixel rate in Mpixels/s) • PAL: 720*576, 25 f/s, 10.4 Mp/s • NTSC: 720*480, 29.97 f/s, 10.4 Mp/s • HDTV: 1920*1080, 30.0 f/s, 62.2 Mp/s Common delivery forms: • Analog (cable) • USB • FireWire
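The pixel rates on slide 3 follow from width × height × frame rate. A minimal sketch that reproduces them; the function name `pixel_rate_mpps` and the `formats` table layout are illustrative and do not come from the slides:

```python
def pixel_rate_mpps(width, height, frames_per_second):
    """Pixel rate in megapixels per second (1 Mp = 10**6 pixels)."""
    return width * height * frames_per_second / 1e6


# Resolution and frame rate as listed on slide 3.
formats = {
    "PAL": (720, 576, 25.0),
    "NTSC": (720, 480, 29.97),
    "HDTV": (1920, 1080, 30.0),
}

for name, (w, h, fps) in formats.items():
    print(f"{name}: {pixel_rate_mpps(w, h, fps):.1f} Mp/s")
```

Rounded to one decimal place, the results agree with the 10.4, 10.4 and 62.2 Mp/s quoted on the slide.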

  4. Image Processing Characteristics • Need to maximize the available logic by supporting N-D multiple configurable devices. For example: Image *

  5. Challenges: How to...? • Appropriate partitioning of algorithms between hardware and software • Exploiting spatial and temporal parallelism • Integrating the configurable computer into the software framework • Selecting a suitable configuration strategy How shall we deal with these challenges?

  6. Why SRAM-Based FPGAs? (Pros) • Higher logic/storage capacity * Fast carry chain for adders/subtractors * Built-in XOR gates/LUT * Array of bit-parallel multipliers * Fast and local storage: array of SRAM blocks * Interconnect supports: three-state buffers/LUT • Equivalent to fine-grained reconfigurable hardware * Finer-grained pipelining can help preserve the performance at low power supply voltage • More mature CMOS manufacturing technology

  7. Algorithm to FPGA Mapping Flow

  8. The Matrix Multiplication MODG A number of different execution orders can be carried out to achieve the same algorithm.

  9. Nested Do Loop Algorithms and Inter-Iteration Dependence Graph
     Do i = 1 to M
       Do j = 1 to N
         c[i,j] = 0;
         Do k = 1 to K
           c[i,j] = c[i,j] + a[i,k]*b[k,j];
         EndDo k
       EndDo j
     EndDo i
     Dependence vectors • da = (i,j,k)^T = (0,1,0)^T • db = (i,j,k)^T = (1,0,0)^T • dc = (i,j,k)^T = (0,0,1)^T
     • Index space J^3 = {(i,j,k)^T : 1 ≤ i,j,k ≤ 3} (M = N = K = 3)
     • Inter-iteration data dependence graph (DG)
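Slide 8 notes that several execution orders realise the same algorithm. The sketch below visits the index space of the slide 9 loop nest in all six loop orders and checks that every order yields the same product. The function name `matmul_in_order` and the sample matrices are illustrative assumptions. Integer data is used so that reordering the accumulation is exact; with floating-point data the results could differ by rounding.

```python
from itertools import permutations


def matmul_in_order(a, b, order):
    """Multiply a (M x K) by b (K x N), visiting (i, j, k) in the given loop order.

    `order` is a permutation of "ijk"; its first letter is the outermost loop.
    """
    M, K, N = len(a), len(b), len(b[0])
    extent = {"i": M, "j": N, "k": K}
    c = [[0] * N for _ in range(M)]  # c[i,j] = 0, hoisted out of the loop nest
    for p in range(extent[order[0]]):
        for q in range(extent[order[1]]):
            for r in range(extent[order[2]]):
                idx = dict(zip(order, (p, q, r)))
                i, j, k = idx["i"], idx["j"], idx["k"]
                c[i][j] += a[i][k] * b[k][j]
    return c


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]

results = [matmul_in_order(a, b, "".join(o)) for o in permutations("ijk")]
all_equal = all(r == results[0] for r in results)
print(all_equal)
```

Each order only changes the sequence in which the partial products for a given c[i,j] are accumulated, which is why the dependence along k (dc) is the only true data dependence in the nest.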

  10. Systolic Mapping (space-time) of Matrix Multiplication [Figure: the 3-D DG (dependence graph) is mapped onto a 2-D processor array; figure labels: s, P]

  11. Systolic Mapping of Matrix Multiplication, cont. [Figure: the 3×3×3 dependence graph drawn as three planes, one per column j = 3, 2, 1; each plane shows the inputs a11 ... a33, the entries b1j, b2j, b3j, and the partial sums C1j, C2j, C3j, with initial values 0]
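A space-time mapping assigns every node of the dependence graph a processor and a time step. The slides do not give the actual schedule or projection, so the sketch below assumes a common textbook choice: schedule vector S = (1,1,1) and an allocation matrix P that projects along k, so that PE coordinates are (i, j). It checks that the mapping respects the dependence vectors from slide 9 and that no two index points share the same processor at the same time.

```python
from itertools import product

# Dependence vectors from slide 9.
DEPENDENCES = {"a": (0, 1, 0), "b": (1, 0, 0), "c": (0, 0, 1)}

S = (1, 1, 1)               # ASSUMED schedule vector (not given in the slides)
P = ((1, 0, 0), (0, 1, 0))  # ASSUMED allocation: PE coordinates = (i, j)


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def space_time(point):
    """Map an index point (i, j, k) to (processor coordinates, time step)."""
    pe = tuple(dot(row, point) for row in P)
    return pe, dot(S, point)


index_space = list(product(range(1, 4), repeat=3))  # M = N = K = 3
mapped = [space_time(pt) for pt in index_space]

# Causality: every dependence must advance time by at least one step.
causal = all(dot(S, d) >= 1 for d in DEPENDENCES.values())
# Conflict-freedom: distinct index points never share a (PE, time) slot.
conflict_free = len(set(mapped)) == len(index_space)

num_pes = len({pe for pe, _ in mapped})
times = [t for _, t in mapped]
total_steps = max(times) - min(times) + 1
print(causal, conflict_free, num_pes, total_steps)
```

Under these assumptions the 27 index points fold onto a 3×3 array of processing elements that finishes in 7 time steps; other valid choices of S and P trade array size against latency.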

  12. Why Is Space-Time Mapping Suitable for FPGAs? • It can bridge the nested Do loop signal/image processing algorithms to the processor array implementation. • The space-time array matches the modular and regular FPGA structure. • The localized/pipelined interprocessor links can overcome the long programmable interconnect delay. • The size of configuration storage can be significantly reduced because of the almost identical processing elements and interconnect structure.

  13. Problems with Existing Design Methodologies/Tools • The dependence graphs of many other algorithms are not uniform and must be predetermined by human designers. • Existing methodologies * cannot handle these complex algorithms * use unrealistic cost functions (metrics) • No built-in features of FPGAs have been incorporated. • Longer interconnect delay in deep-submicron CMOS technology • Much lower hardware utilization due to programmable interconnect delay in FPGAs There is another problem: speed.

  14. What Is Intra-PE Pipelining? • The interconnect delay of FPGAs results in an even longer clock period. • To enhance the overall throughput, intra-iteration parallelism must be exploited. • Example: a simple vector dot product array • It can be observed that the utilization of each operator is increased. • Of course, the control mechanism is more complex. Tech done example
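The utilization argument on slide 14 can be illustrated with a small behavioural model of a multiply-accumulate PE computing a vector dot product. This is a sketch under stated assumptions, not the microarchitecture from the slides: each operator is assumed to take one cycle, and the names `dot_unpipelined` and `dot_pipelined` are hypothetical.

```python
def dot_unpipelined(xs, ys):
    """Multiply and add take one cycle each and never overlap: 2 cycles per term."""
    acc, cycles = 0, 0
    for x, y in zip(xs, ys):
        product = x * y   # cycle 1: multiplier busy, adder idle
        cycles += 1
        acc += product    # cycle 2: adder busy, multiplier idle
        cycles += 1
    return acc, cycles


def dot_pipelined(xs, ys):
    """A register between multiplier and adder lets both operators work each cycle."""
    acc, cycles = 0, 0
    stage_reg = None                 # pipeline register holding the latest product
    terms = list(zip(xs, ys))
    for t in range(len(terms) + 1):  # one extra cycle drains the pipeline
        if stage_reg is not None:
            acc += stage_reg         # adder consumes the previous product
        if t < len(terms):
            x, y = terms[t]
            stage_reg = x * y        # multiplier produces the next product
        else:
            stage_reg = None
        cycles += 1
    return acc, cycles


xs, ys = [1, 2, 3, 4], [5, 6, 7, 8]
print(dot_unpipelined(xs, ys), dot_pipelined(xs, ys))
```

In this model both versions return the same dot product, but the pipelined PE needs n + 1 cycles instead of 2n, so each operator is busy in nearly every cycle. The price is the extra register and the drain cycle, which is the added control complexity the slide mentions.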

  15. Examples of Nested Do Loop Algorithms • Motion estimation * One of the most time-consuming operations (tasks) in digital video compression • Stereo matching * Used to build a disparity map for 3D robot/computer navigation • Matrix/vector multiplication * FFT, DCT, 2D/3D graphics, etc. • 2D linear transforms/operations * 2D FFT, 2D DCT, etc.

  16. Tennis frame 0

  17. Tennis frame 1

  18. Motion Vectors of 8x8-Pixel Blocks

  19. Reconstructed Frame 1 from Frame 0 and Motion Vectors

  20. Illustration of Full Search Block Matching Motion Estimation (6-level nested Do loop) Motion vector = (m,n)

  21. Example: A Simpler PE Microarchitecture • MAD(m,n) = MAD(m,n) + |x(hN+i, vN+j) - y(hN+i+m-p, vN+j+n-p)| • Xilinx Core Generator System • Critical path delay = 25 ns, based on Xilinx Virtex data • 1,500-2,000 equivalent gate count • The critical path (blue line) can be shortened further by intra-PE pipelining
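The accumulation on slide 21 is the innermost statement of the 6-level loop nest from slide 20. A minimal software sketch follows, with several assumptions that are not in the slides: x is treated as the current frame and y as the reference frame, candidates that would read outside y are skipped, and the accumulated sum is compared directly without normalisation (which does not change the best candidate). The function name `full_search` is hypothetical.

```python
def full_search(x, y, N, p):
    """Full-search block matching.

    x, y : 2-D lists of equal size (ASSUMED: current and reference frame).
    N    : block size; p : search range, so displacements lie in [-p, p].
    Returns {(h, v): (m - p, n - p)} for each N x N block of x.
    """
    rows, cols = len(x), len(x[0])
    vectors = {}
    for h in range(rows // N):                    # loop 1: block row
        for v in range(cols // N):                # loop 2: block column
            best = None
            for m in range(2 * p + 1):            # loop 3: vertical candidate
                for n in range(2 * p + 1):        # loop 4: horizontal candidate
                    r0, c0 = h * N + m - p, v * N + n - p
                    # ASSUMED boundary rule: skip candidates outside y.
                    if r0 < 0 or c0 < 0 or r0 + N > rows or c0 + N > cols:
                        continue
                    mad = 0
                    for i in range(N):            # loop 5: pixel row in block
                        for j in range(N):        # loop 6: pixel column in block
                            mad += abs(x[h * N + i][v * N + j] - y[r0 + i][c0 + j])
                    if best is None or mad < best[0]:
                        best = (mad, m - p, n - p)
            vectors[(h, v)] = (best[1], best[2])
    return vectors


# Synthetic frames: x is y shifted by 1 row and 2 columns (with wrap-around).
R = C = 12
y = [[r * C + c for c in range(C)] for r in range(R)]
x = [[y[(r + 1) % R][(c + 2) % C] for c in range(C)] for r in range(R)]
vectors = full_search(x, y, N=4, p=2)
print(vectors[(0, 0)])
```

For blocks whose shifted copy does not wrap around the frame edge, the search recovers the displacement (1, 2). The four inner loops are the part a PE array would execute in parallel.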

  22. Significance of the Contributions • The MODG representation for nested Do loop algorithms * The actual execution is not constrained to any predetermined order. * It keeps track of every variable instance so that there is no redundant memory access, saving I/O, bandwidth and power consumption. * It can be automated using memory. • Without the MODG * motion estimation and many other nested Do loop algorithms can be written as many different DGs, * a human must be involved to formulate a DG, and * the built-in ROM/RAM of the FPGA may not be exploited.

  23. Significance of the Contributions, cont. • Space-time mapping for the MODG can be applied to * any SRAM-based FPGA (architecture constraints and practical cost functions) * any coarse-grained architecture • Intra-PE pipelining * enhances/preserves the throughput rate in low-power mode.

  24. Conclusion • Users demand more communication/multimedia processing capabilities on resource-limited Internet appliances. • Reconfigurable SOC is the ultimate solution for designing the challenging low-power/high-performance platform. • Its success depends on the embedded high-density FPGA core as reconfigurable (programmable) accelerating hardware. • As technology (supply voltage) scales down, logic (transistors) is virtually free, while the interconnect becomes the bottleneck and the main consumer of power. • Parallel execution of nested Do loop algorithms by an array of localized processing elements at moderate clock frequency is a viable solution. • It offers a compromise among the three main issues: design time, power consumption, and performance.

  25. Future Trends • Memory (storage) organization should be investigated, because multiple reads per clock cycle are needed to sustain such high throughput. • The control mechanism of the entire array is one of the aspects that will determine its success. • A given MODG may need to be partitioned so that the resulting array fits the on-chip reconfigurable FPGA core.
