1 / 17

Sum of Absolute Differences Hardware Accelerator

Sum of Absolute Differences Hardware Accelerator. Mark Lodermeier. Outline. Overview of Motion Estimation and MPEG4 – Part 10 - AVC Approach Tasks Performed To Do. Motion Estimation. Used for video compression – block matching between successive frames

tatum
Download Presentation

Sum of Absolute Differences Hardware Accelerator

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Sum of Absolute Differences Hardware Accelerator Mark Lodermeier

  2. Outline • Overview of Motion Estimation and MPEG4 – Part 10 - AVC • Approach • Tasks Performed • To Do

  3. Motion Estimation • Used for video compression – block matching between successive frames • Search for best matching block (find motion vectors) • Used to create model of current frame using reference frame(s) from either previous or future frames • Motion vectors found by determining the minimum SAD MV(r,s) = argmin[SAD(x,y,r,s)]

  4. Motion Estimation • Full Search algorithm produces the best results. • However, computationally expensive. • Motion Estimation accounts for 50-70% of computational complexity in MPEG-4 video encoding/decoding • For a small search range of [-8, +7] in each direction with 16x16 macroblocks, there are 16*16 pixel comparisons performed 16*16 times = 65,536 additions of absolute differences for a single 16x16 block • Real-time video of a 480x640

  5. MPEG-4 Part 10 – AVC • Variable Block Sizes • Each 16x16 Macroblock can be split in half into two 16x8 or 8x16 blocks or into four 8x8 sub-blocks • These sub-blocks can then be split in half into two 8x4 or 4x8 blocks or into four 4x4 blocks.

  6. MPEG-4 Part 10 – AVC • Previous 16x16 macroblock split into smaller blocks

  7. Purpose: Many small blocks requires large amount of bits to encode Few large blocks may produce poor quality Can produce higher efficiency with same quality Challenges: Generate motion vectors for all block sizes - Increase computation cost to an already intensive algorithm Choose the correct block size among many choices to balance bandwidth and quality MPEG4 Variable Block Sizes

  8. Approach • How to efficiently generate all MV’s for variable sized blocks? • Take full advantage of parallel nature of both Motion Estimation and the generation of variable sized blocks • Maintain high processor utilization

  9. Tasks Performed • Implemented 1-D systolic array in VHDL • 16 Processing Elements, each with: • Absolute Difference unit • 9 to 2 Compressor • 3 to 1 Compressor

  10. Absolute Difference Unit

  11. Absolute Difference • Math behind absolute difference unit: • Just check condition - A > B • B + B_not = 2n – 1  B_not = 2n – 1 – B • 2n-1+|A-B| is the value of the sum of the two outputs of the absolute difference unit. • Need to add a correction term of m to get rid of the 2n-1, where m is equal to the number of absolute difference units used.

  12. C B A Abs Diff Unit Abs Diff Unit Abs Diff Unit Abs Diff Unit 9 to 2 Adder Reduction Tree 3 to 1 Reduction Tree Latch Single Processing Element Correction term - 4

  13. Systolic Array C D D D D B A … PE0 PE1 PE2 PE15 4x4 SAD and MV 4x4 SAD and MV 4x4 SAD and MV 4x4 SAD and MV Shift Register Shift Register Shift Register Shift Register … control 16 4x4 SAD values and MV’s 8 4x8 SAD values and MV’s 8 8x4 SAD values and MV’s Back-End Adder Array for Variable Blocks 4 8x8 SAD values and MV’s 2 8x16 SAD values and MV’s 2 16x8 SAD values and MV’s 1 16x16 SAD value and MV

  14. Back-End Adder Array Diagram 12 13 14 15 8 9 10 11 4 5 6 7 0 1 2 3 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 4x4 SAD’s and MV’s 8 8x4 SAD’s and MV’s 8 4x8 SAD’s and MV’s 4 8x8 SAD’s and MV’s 2 8x16 and 2 16x8 SAD’s and MV’s 16x16 SAD and MV Macroblock with 16 4x4 sub-blocks - Each dot represents the following: Min Latch

  15. Another way to represent the generation of motion vectors for the variable sized blocks p x p Matrix • First you take the pxp matrices from the specified 4x4 blocks to generate the matrix for the 4x8 block (where p is the total search range) • To find the corresponding motion vector, you just search for the minimum value in the matrix SAD values for one 4x4 block in all search positions p x p Matrix SAD values for one 4x8 block in all search positions p x p Matrix SAD values for one 4x4 block in all search positions

  16. The top box is the current frame data and the bottom box is the reference frame data Every 4 clock cycles a PE will produce a 4x4 SAD value It takes 16 clock cycles to fill the systolic array and have 100% Processor Utilization After 64 clock cycles PE0 will have completed a 16x16 macroblock search for one position, Every other PE will then do the same on the following cycle So, to do a full search of (-8, 7) in the x and y positions, a total of 64x16 = 1024 cycles are needed. The Schedule for the Accelerator:

  17. Questions?

More Related