Hadi Afshar, Philip Brisk, Paolo Ienne EPFL

Scalable and Low Cost Design Approach for Variable Block Size Motion Estimation. Hadi Afshar, Philip Brisk, Paolo Ienne (EPFL). 30 April 2009.

Presentation Transcript


  1. Scalable and Low Cost Design Approach for Variable Block Size Motion Estimation. Hadi Afshar, Philip Brisk, Paolo Ienne (EPFL). 30 April 2009.

  2. Fixed Block Size Motion Estimation
     • Less compression
     • Few motion vectors
     [Figure labels: Reference Frame, Current Frame, MV, MB]
     MV: Motion Vector; MB: Macro Block

  3. Variable Block Size Motion Estimation
     • More compression
     • More motion vectors
     • More computation
     [Figure labels: Reference Frame, Current Frame, MV, MB]
     MV: Motion Vector; MB: Macro Block
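The cost that grows with "more motion vectors" is the block-matching search itself. A minimal sketch of a full search using the sum of absolute differences (SAD) might look as follows; the function names, the list-of-rows frame layout, and the square search window are illustrative assumptions, not taken from the slides.

```python
def sad(cur, ref, cx, cy, rx, ry, w, h):
    """Sum of absolute differences between a w x h block of the current
    frame at (cx, cy) and a block of the reference frame at (rx, ry)."""
    return sum(
        abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
        for j in range(h)
        for i in range(w)
    )


def full_search(cur, ref, cx, cy, w, h, search_range):
    """Return (best_sad, (dx, dy)) over all displacements within
    +/- search_range that keep the candidate block inside the frame."""
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = cx + dx, cy + dy
            inside = 0 <= rx and rx + w <= len(ref[0]) and 0 <= ry and ry + h <= len(ref)
            if not inside:
                continue
            cost = sad(cur, ref, cx, cy, rx, ry, w, h)
            if best is None or cost < best[0]:
                best = (cost, (dx, dy))
    return best
```

With variable block sizes, a search like this runs once per sub-block shape and position, which is why reusing partial SADs across sub-blocks matters.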

  4. Systolic Arrays and Motion Estimation
     • Data is shared, low memory bandwidth
     [Figure labels: Memory, FF, Pixel(s), Pix, Ref, PE0, PE1, PE2, PEn, C, S, ABS1 … ABS4, Comparator, Regfile, Reference Frame, Current Frame, MV, MB]

  5. Systolic Arrays for VBSME
     • Chen TCAS 2006
     • Li FPT 2006
     • Yap TCAS 2004
     • Song IEICE 2006
     [Figure labels: Memory, FF, PE0, PE1, PE2, PEn, Comparator, REUSE UNIT, Primitive Blocks, Regfile, 16-pixel SAD, SAD MERGE TREE, SAD BUS NETWORK]

  6. Outline
     • Proposed Design Approach
     • Array Organization
     • Processing Element Design
     • Scheduling
     • Related Work
     • Case Study: H.264 VBSME
     • Experimental Results
     • VLSI Implementation
     • FPGA Implementation
     • Conclusion

  7. Proposed Approach
     Basics:
     • Each PE is augmented with a comparator unit in addition to the reuse unit
     • Each PE computes the SADs of all sub-blocks within the MB against one specific reference MB
     • Each PE runs one clock cycle ahead of its neighbouring PE
     • Different PEs compute different SADs of the same MB with different reference MBs

  8. Proposed Approach
     [Figure: timing diagram; labels PE0, PE1, PE2, B0, B1, B2, R0 … R4]
     Ti:   SAD(B0,R0)
     Ti+1: SAD(B0,R1), SAD(B1,R1)
     Ti+2: SAD(B0,R2), SAD(B1,R2), SAD(B2,R2)
     Ti+3: SAD(B1,R3), SAD(B2,R3), SAD(B3,R3)
     Ti+4: SAD(B2,R4), SAD(B3,R4), SAD(B4,R4)

  9. Proposed Approach
     [Figure: the SADs produced by each PE feed MIN comparators]
            PE0      PE1      PE2
     Ti     SB0,R0
     Ti+1   SB0,R1   SB1,R1
     Ti+2   SB0,R2   SB1,R2   SB2,R2
     Ti+3            SB1,R3   SB2,R3
     Ti+4                     SB2,R4
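The staggered timing on this slide can be reproduced with a few lines of code. This is only a sketch of the pattern as it appears in the transcript: it assumes that PE p starts p cycles after PE0, stays busy for a fixed number of cycles, and consumes the data item indexed by the current cycle. The function name and the `(pe, r)` pair encoding are hypothetical.

```python
def staggered_schedule(num_pes, cycles_per_pe):
    """For each cycle t, list the (pe, r) pairs that are active, where
    PE `pe` starts at cycle `pe` and item R_t is shared by all active PEs."""
    schedule = []
    for t in range(num_pes + cycles_per_pe - 1):
        active = [(pe, t) for pe in range(num_pes) if pe <= t < pe + cycles_per_pe]
        schedule.append(active)
    return schedule


if __name__ == "__main__":
    for t, active in enumerate(staggered_schedule(3, 3)):
        cells = ", ".join(f"S(B{pe},R{r})" for pe, r in active)
        print(f"Ti+{t}: {cells}")
```

Each pair `(p, t)` corresponds to an entry S(Bp,Rt) in the table: every active PE sees the same R item in a given cycle, which is the data sharing that keeps memory bandwidth low.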

  10. Array Organization
      • MIN SADs move along the chain and are stored in the min-SAD register file
      • Each PE must compute more than one search region
      • (# of PEs) < (# of search regions)
      [Figure labels: Memory, FF, PE0, PE1, PEn, Compare, REUSE UNIT, SAD BUS NETWORK, Comparator, Min SAD Register File]
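A behavioural sketch of the min-SAD chain described on this slide is given below. It is not the authors' hardware: the round-robin assignment of search positions to PEs, the function name, and the `(sad, candidate_index)` tuples are assumptions made for illustration.

```python
def min_sad_chain(candidate_sads, num_pes):
    """candidate_sads[k] holds the SADs of all sub-blocks for search
    position k. Returns, per sub-block, (min_sad, index_of_best_position)."""
    num_sub_blocks = len(candidate_sads[0])
    # Min-SAD register file: one running minimum per sub-block.
    regfile = [(float("inf"), None)] * num_sub_blocks
    # Fewer PEs than search positions, so positions are handled in batches.
    for start in range(0, len(candidate_sads), num_pes):
        chain = regfile  # value entering the first compare stage
        for pe in range(num_pes):
            cand = start + pe
            if cand >= len(candidate_sads):
                break
            # Each compare stage keeps the smaller of the incoming minimum
            # and the SAD this PE just produced, then passes it on.
            chain = [
                min(prev, (s, cand))
                for prev, s in zip(chain, candidate_sads[cand])
            ]
        regfile = chain  # end of the chain writes back to the register file
    return regfile
```

For example, `min_sad_chain([[5, 9], [3, 9], [4, 1], [3, 0], [7, 7]], num_pes=2)` processes five search positions on two PEs in three batches and keeps one minimum per sub-block.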

  11. PE Design
      [Figure labels: FB, CU, RU, Pix, Ref, "CU output(s) of Previous PE", C, S, ABS1 … ABS4, +, MIN, Reg, Regfile]

  12. PE Design Optimization
      • To minimize the size of the RU register file, each PE should compare and transfer computed SADs ASAP
      • Parallel comparators are required when multiple SADs are produced in the same cycle
      • Transfer rate, with B: # of sub-blocks within MB and T: # of cycles required to compute MB SADs

  13. PE Design Optimization
      • To minimize the size of the RU register file, each PE should compare and transfer computed SADs ASAP
      • Parallel comparators are required when multiple SADs are produced in the same cycle
      • Transfer rate, with B: # of sub-blocks within MB and T: # of cycles required to compute MB SADs
      • Uniform generation of the B sub-block SADs within T cycles reduces the RU regfile
      • A regular workflow simplifies the controller

  14. SAD Scheduling
      • Primitive SAD computations need to be distributed over the T cycles
      • Non-primitive SADs: a SAD is generated as soon as its building SADs are ready
      • Proper scheduling frees SAD registers for other generated building SADs
      • We propose a zig-zag pattern for reuse, which also helps to distribute SAD computations evenly
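The effect of the visiting order on when larger SADs become available can be illustrated with a small experiment. The transcript does not preserve the exact zig-zag pattern, so the sketch below substitutes a Z-order (Morton) traversal as one plausible reading and compares it with a plain raster order. It assumes a 16x16 MB split into a 4x4 grid of 4x4 primitive blocks, one primitive SAD completing per step; all names are hypothetical.

```python
def z_order():
    """Visit the 4x4 grid of primitive blocks quadrant by quadrant.
    Returns a list of (row, col) positions."""
    order = []
    for i in range(16):
        col = (i & 1) | (((i >> 2) & 1) << 1)
        row = ((i >> 1) & 1) | (((i >> 3) & 1) << 1)
        order.append((row, col))
    return order


def raster_order():
    """Visit the 4x4 grid row by row."""
    return [(row, col) for row in range(4) for col in range(4)]


def ready_steps_8x8(order):
    """Step at which each 8x8 SAD can be generated, i.e. the step at which
    the last of its four primitive SADs arrives. Returned sorted."""
    arrival = {pos: step for step, pos in enumerate(order)}
    steps = []
    for qr in range(2):
        for qc in range(2):
            steps.append(max(
                arrival[(2 * qr + j, 2 * qc + i)]
                for j in range(2)
                for i in range(2)
            ))
    return sorted(steps)
```

Under these assumptions the Z-order traversal completes one 8x8 SAD every four steps, while the raster order produces them in two bursts, so building SADs have to be held in registers for longer.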

  15. SAD Scheduling

  16. Related Work
      • VLSI H.264 VBSME
        • Yap [TCAS 2004]: 1-D array with SAD bus network
        • Song [IEICE 2006]: 1-D array with SAD bus network
        • Chen [TCAS 2006]: 2-D array with SAD merge tree, used for HDTV applications
      • FPGA H.264 VBSME
        • Wei [2003]: 1-D array with SAD bus network
        • Lopez [ISCAS 2005]: 1-D array using SRAMs with SAD bus network
        • Li [FPT 2006]: Bit-serial architecture with SAD merge tree

  17. Case Study: H.264 VBSME
      • MB = 16x16 pixels, B = 41 sub-blocks, 4x4 primitive blocks
      • 4 PEs
      • Each PE computes 4 pixel SADs in each cycle
      • Search range: 16x16 pixels for each pixel
      • T = 64 cycles for each MB
      • Four identical and regular 16-cycle workflows
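The numbers on this slide can be checked directly. The sketch below enumerates the seven H.264 partition sizes, builds every sub-block SAD from a 4x4 grid of primitive 4x4 SADs, and confirms B = 41; the dictionary layout and names are illustrative assumptions. The cycle count follows from 16 x 16 = 256 pixels at 4 pixel differences per cycle, giving T = 64, or four workflows of 16 cycles.

```python
# (width, height) of every H.264 partition size, in pixels.
PARTITIONS = [(16, 16), (16, 8), (8, 16), (8, 8), (8, 4), (4, 8), (4, 4)]


def sub_block_sads(prim):
    """prim is a 4x4 grid (list of rows) of primitive 4x4-block SADs.
    Returns {(width, height, x, y): sad} for all sub-blocks of the MB."""
    out = {}
    for w, h in PARTITIONS:
        bw, bh = w // 4, h // 4  # sub-block size in primitive blocks
        for r in range(0, 4, bh):
            for c in range(0, 4, bw):
                out[(w, h, c * 4, r * 4)] = sum(
                    prim[r + j][c + i] for j in range(bh) for i in range(bw)
                )
    return out


MB_PIXELS = 16 * 16
PIXELS_PER_CYCLE = 4
T = MB_PIXELS // PIXELS_PER_CYCLE   # cycles per MB for one PE
WORKFLOW_CYCLES = T // 4            # four identical workflows

if __name__ == "__main__":
    sads = sub_block_sads([[1] * 4 for _ in range(4)])
    print(len(sads), T, WORKFLOW_CYCLES)
```

Because every larger SAD is a sum of primitive SADs, only the 16 primitive blocks need pixel-level computation; the remaining 25 sub-block SADs come from additions in the reuse unit.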

  18. SAD Scheduling

  19. Experimental Results
      • H.264 VBSME modelled in VHDL
      • VLSI implementations: Synopsys DC, CMOS libraries
        • 0.18 µm: 12k gates, 285 MHz
        • 0.13 µm: 18k gates, 400 MHz
      • FPGA implementations: Altera Quartus, Xilinx ISE
        • Devices: Altera APEX and STRATIX-II, Xilinx VIRTEX-II

  20. VLSI Implementation
      • MB Processing Time (MBPT)
      • SR: Search Range
      • T: MB SAD cycles
      • N: # of PEs
      Result: ~20-25% reduction

  21. VLSI Implementation
      • Gate count (k gates)
      Result: large area reduction

  22. FPGA Implementation
      • Throughput (MB / sec)
      Result: lower throughput than the best designs, but…

  23. FPGA Implementation
      Result: …best efficiency, up to 3/4 area reduction

  24. Scalability
      • Stratix-II: almost perfect scalability

  25. Conclusion
      • We improved scalability by redesigning the organization of the systolic array and the design of its PEs
        • Very low cost design, less area and delay
      • We proposed a zig-zag pattern for reusing the primitive SADs
        • Fewer registers for maintaining computed SADs
        • Very regular workflow
      • This approach can be exploited by existing architectures and can also be applied to future standards with different block sizes

  26. Thanks!

  27. SAD Scheduling
