
High Speed Systolic Array Structure for Variable Block Size Motion Estimation Vinod Reddy 05/04/2009


Presentation Transcript


  1. High Speed Systolic Array Structure for Variable Block Size Motion Estimation. Vinod Reddy, 05/04/2009

  2. Nested Loop Structure for Fixed Size ME

       for m = 0 to 2p
         for n = 0 to 2p
           SAD(m,n) = 0
           for i = 0 to N-1
             for j = 0 to N-1
               SAD(m,n) = SAD(m,n) + |x(i,j) - y(i+m-p, j+n-p)|
             end j
           end i
           SAD(m,n) = SAD(m,n);
         end n
       end m
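The nested loops above can be sketched directly in Python. This is a minimal illustration, not the presenter's code: the function name `full_search_sad` is hypothetical, and the reference window `y` is assumed to be stored with a +p offset, so that `y[i + m][j + n]` plays the role of y(i+m-p, j+n-p) and the window has (N + 2p) x (N + 2p) entries.

```python
def full_search_sad(x, y, N, p):
    """Return sad[m][n] for m, n in 0..2p, mirroring the four nested loops.

    x: N x N current block (list of lists).
    y: (N + 2p) x (N + 2p) reference window; index [i + m][j + n] is
       assumed to correspond to y(i+m-p, j+n-p) in the slide's notation.
    """
    size = 2 * p + 1                      # m and n each take 2p + 1 values
    sad = [[0] * size for _ in range(size)]
    for m in range(size):
        for n in range(size):
            acc = 0
            for i in range(N):
                for j in range(N):
                    acc += abs(x[i][j] - y[i + m][j + n])
            sad[m][n] = acc
    return sad
```

Each of the (2p + 1)^2 candidate positions costs N^2 absolute differences, which is the work the systolic array structure later distributes across processing elements.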

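  3. For a 4x4 Block size and p = 2

       for m = 0 to 3
         for n = 0 to 3
           SAD(m,n) = 0
           for i = 0 to 3
             for j = 0 to 3
               SAD(m,n) = SAD(m,n) + |x(i,j) - y(i+m, j+n)|
             end j
           end i
         end n
       end m

     In matrix form, the inner two loops can be represented as a matrix difference; SAD(0,0) is the sum of the absolute values of its entries:

     $$\mathrm{SAD}(0,0):\quad
     \begin{bmatrix}
     x_{00} & x_{01} & x_{02} & x_{03} \\
     x_{10} & x_{11} & x_{12} & x_{13} \\
     x_{20} & x_{21} & x_{22} & x_{23} \\
     x_{30} & x_{31} & x_{32} & x_{33}
     \end{bmatrix}
     -
     \begin{bmatrix}
     y_{00} & y_{01} & y_{02} & y_{03} \\
     y_{10} & y_{11} & y_{12} & y_{13} \\
     y_{20} & y_{21} & y_{22} & y_{23} \\
     y_{30} & y_{31} & y_{32} & y_{33}
     \end{bmatrix}$$

     The corresponding DG:
     [Figure: dependence graph over axes i and j. Node (i,j) takes inputs x_ij and y_ij, and the 16 nodes accumulate into Sad(0,0).]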

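  4. Unrolling and interchanging the third loop for max reuse

       for n = 0 to 3
         for m = 0 to 3
           SAD(m,n) = 0
           for i = 0 to 3
             for j = 0 to 3
               SAD(m,n) = SAD(m,n) + |x(i,j) - y(i+m, j+n)|
             end j
           end i
         end m
       end n

     Representing the unrolled iterations of the third loop in matrix form (each SAD is the sum of the absolute values of the entries of the difference):

     $$\mathrm{SAD}(0,0):\quad
     \begin{bmatrix}
     x_{00} & x_{01} & x_{02} & x_{03} \\
     x_{10} & x_{11} & x_{12} & x_{13} \\
     x_{20} & x_{21} & x_{22} & x_{23} \\
     x_{30} & x_{31} & x_{32} & x_{33}
     \end{bmatrix}
     -
     \begin{bmatrix}
     y_{00} & y_{01} & y_{02} & y_{03} \\
     y_{10} & y_{11} & y_{12} & y_{13} \\
     y_{20} & y_{21} & y_{22} & y_{23} \\
     y_{30} & y_{31} & y_{32} & y_{33}
     \end{bmatrix}$$

     $$\mathrm{SAD}(1,0):\quad
     \begin{bmatrix}
     x_{00} & x_{01} & x_{02} & x_{03} \\
     x_{10} & x_{11} & x_{12} & x_{13} \\
     x_{20} & x_{21} & x_{22} & x_{23} \\
     x_{30} & x_{31} & x_{32} & x_{33}
     \end{bmatrix}
     -
     \begin{bmatrix}
     y_{10} & y_{11} & y_{12} & y_{13} \\
     y_{20} & y_{21} & y_{22} & y_{23} \\
     y_{30} & y_{31} & y_{32} & y_{33} \\
     y_{40} & y_{41} & y_{42} & y_{43}
     \end{bmatrix}$$

     $$\mathrm{SAD}(2,0):\quad
     \begin{bmatrix}
     x_{00} & x_{01} & x_{02} & x_{03} \\
     x_{10} & x_{11} & x_{12} & x_{13} \\
     x_{20} & x_{21} & x_{22} & x_{23} \\
     x_{30} & x_{31} & x_{32} & x_{33}
     \end{bmatrix}
     -
     \begin{bmatrix}
     y_{20} & y_{21} & y_{22} & y_{23} \\
     y_{30} & y_{31} & y_{32} & y_{33} \\
     y_{40} & y_{41} & y_{42} & y_{43} \\
     y_{50} & y_{51} & y_{52} & y_{53}
     \end{bmatrix}$$

     [Figure: three-dimensional DG over axes m, i and j, showing the node planes that produce Sad(0,0) and Sad(1,0).]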

  5. Grouping 4 pixels into a Single large pixel
     • Previous approaches used an overly pessimistic, pixel-level granularity [1,2].
     • We now work with large pixels, each a set of four pixels.
     • The same DG shown before can now be conveniently represented in 2D, where
       X00 = {x00,x01,x02,x03}, Y00 = {y00,y01,y02,y03}, ..., Y40 = {y40,y41,y42,y43}, ...
     [Figure: 2D DG. The large pixels X00, X10, X20, X30 and Y00 to Y60 are the inputs; the outputs are Sad(0,0), Sad(1,0), Sad(2,0) and Sad(3,0).]
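The large-pixel view can be sketched as follows: each large pixel is a row of four pixels, and Sad(m,0) is the sum over i of the row-level differences between X_i0 and Y_(i+m)0. The names `row_sad` and `sad_column` are hypothetical, and the loop order (each Y row visited once and shared by every Sad(m,0) that needs it) is only an illustration of the reuse idea, not the hardware schedule.

```python
def row_sad(X_row, Y_row):
    """Sum of absolute differences between two large pixels (rows of pixels)."""
    return sum(abs(a - b) for a, b in zip(X_row, Y_row))


def sad_column(X, Y):
    """Return [Sad(0,0), Sad(1,0), ...] from large pixels.

    X: list of large pixels X00, X10, ... (4 rows for a 4x4 block).
    Y: list of large pixels Y00, Y10, ... (7 rows, Y00..Y60, for 4 outputs).
    """
    n_rows = len(X)
    out = [0] * (len(Y) - n_rows + 1)
    for r, Y_row in enumerate(Y):          # every Y large pixel is visited once
        for i, X_row in enumerate(X):
            m = r - i                      # Y row r pairs with X row i in Sad(m,0)
            if 0 <= m < len(out):
                out[m] += row_sad(X_row, Y_row)
    return out
```

With four X rows and seven Y rows this yields the four values Sad(0,0) to Sad(3,0), matching the outputs named on the slide.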

  6. Systolic Array for 4x4 Block Size
     Choosing the
     • Projection direction d = [1 0] (i-axis)
     • Processor space vector P^T = [0 1] (j-axis)
     • Scheduling vector S^T = [1 1]
     The edge mapping of the DG onto the systolic array is given by

       Edge   e^T     P^T e   S^T e
       X      (1,0)   0       1
       Y      (1,1)   1       2
       Res    (0,1)   1       1

     [Figure: the 2D DG (inputs X00 to X30 and Y00 to Y60, outputs Sad(0,0) to Sad(3,0)) and the resulting linear systolic array. The PEs hold X00 to X30 and are connected through D and 2D delay elements; the Y large pixels stream in and Sad(0,0), Sad(1,0), Sad(2,0), Sad(3,0) stream out.]
     + A Sad4x4 is generated every cycle for each reference block.
     + The critical path is determined by the PE processing time.
     + X inputs are stored internally in the PEs.
     + Each Y vector is applied once, and the systolic array reuses it without reading the same input again.
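The edge-mapping table can be reproduced with two dot products per edge. The sketch below uses the vectors from the slide; the helper names `dot` and `map_edges` are hypothetical. It also checks the two usual feasibility conditions for a linear mapping, P^T d = 0 and S^T d != 0, which are standard in systolic design but are not stated on the slide.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def map_edges(edges, P, S):
    """Map each DG edge e to (P^T e, S^T e): PE displacement and delay count."""
    return {name: (dot(P, e), dot(S, e)) for name, e in edges.items()}


d = (1, 0)   # projection direction (i-axis)
P = (0, 1)   # processor space vector (j-axis)
S = (1, 1)   # scheduling vector
edges = {"X": (1, 0), "Y": (1, 1), "Res": (0, 1)}

mapping = map_edges(edges, P, S)

# Assumed feasibility checks (standard conditions, not taken from the slide):
# nodes along d must land on the same PE, at different time steps.
assert dot(P, d) == 0
assert dot(S, d) != 0
```

A PE displacement of 0 for X agrees with the X inputs staying inside a PE, while the delay counts of 2 for Y and 1 for Res agree with the 2D and D elements in the figure.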

  7. Variable Block Size for 8x8
     • Let's now consider variable block size motion estimation.
     • The X and Y 8x8 blocks can each be decomposed into four 4x4 blocks.
     [Figure: an 8x8 block split into four 4x4 blocks.]

  8. Extending DG for Variable Block Size 8x8
     [Figure: extended DG for the 8x8 case. The inputs are the large pixels X00 to X71 and Y00 to Y10,1; the outputs are the 4x4 SADs Sad(0,0) to Sad(3,3), together with Sad 4x8_00, Sad 4x8_01, Sad 8x4_00, Sad 8x4_01 and Sad 8x8_00.]

  9. Mapping DG of Variable ME to Systolic Array
     [Figure: four linear systolic arrays in parallel, each holding X large pixels, fed by a stream of Y large pixels through D and 2D delay elements. They produce Sad 4x4_00, Sad 4x4_01, Sad 4x4_10 and Sad 4x4_11, which are combined into Sad 4x8_00, Sad 4x8_01, Sad 8x4_00, Sad 8x4_01 and Sad 8x8_00.]
     + Generates four Sad4x4, two Sad8x4, two Sad4x8 and one Sad8x8 in a single cycle.
     + Four large Y pixels {Y00,Y01,Y02,Y03} are read only once and reused internally.
     + Four systolic arrays operate in parallel for variable block size ME of size 8x8.
     + Similarly, it can be extended to 16x16 by operating 16 systolic arrays in parallel.
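The combination of four 4x4 SADs into the larger block sizes can be sketched as plain additions. The names `block_sad` and `variable_block_sads` are hypothetical, and because the transcript does not preserve which pair of sub-blocks feeds each of Sad 4x8_00, Sad 4x8_01, Sad 8x4_00 and Sad 8x4_01, the sketch labels the sums neutrally as row pairs and column pairs.

```python
def block_sad(x, y, r0, c0, h, w):
    """SAD over the h x w region whose top-left corner is (r0, c0)."""
    return sum(abs(x[r][c] - y[r][c])
               for r in range(r0, r0 + h)
               for c in range(c0, c0 + w))


def variable_block_sads(x, y):
    """Derive all sub-block SADs of one 8x8 candidate from four 4x4 SADs.

    x, y: 8x8 current block and 8x8 reference block (lists of lists).
    """
    s = {(a, b): block_sad(x, y, 4 * a, 4 * b, 4, 4)
         for a in (0, 1) for b in (0, 1)}
    top, bottom = s[0, 0] + s[0, 1], s[1, 0] + s[1, 1]   # two 4-row by 8-column sums
    left, right = s[0, 0] + s[1, 0], s[0, 1] + s[1, 1]   # two 8-row by 4-column sums
    return {
        "4x4": s,
        "row_pairs": (top, bottom),
        "col_pairs": (left, right),
        "8x8": top + bottom,
    }
```

Only the four 4x4 SADs require absolute differences; the five larger values follow from additions, which is why they can be produced in the same cycle.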

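  10. Large PE Design
      [Figure: large PE. The inputs X30 = {x30,x31,x32,x33}, Y30 = {y30,y31,y32,y33} and PSAD_I are registered; absolute-difference units (ABS0, ...) operate on the pixel pairs and the PE outputs X30, Y30 and PSAD.]

      The absolute difference is computed as follows, where ' denotes the bitwise (ones') complement [3]:

      $$|x - y| = \begin{cases} x + y' + 1, & x > y \\ (x + y')', & x \le y \end{cases}$$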
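The absolute-difference identity can be checked in software with fixed-width arithmetic. This is a minimal sketch assuming unsigned 8-bit pixels; the function name `abs_diff` and the `bits` parameter are hypothetical, and the comparison `x > y` stands in for whatever select signal the hardware uses (the carry-out of x + y' is 1 exactly when x > y).

```python
def abs_diff(x, y, bits=8):
    """|x - y| for unsigned `bits`-wide values using only add and complement."""
    mask = (1 << bits) - 1
    t = x + (~y & mask)            # x + y', where y' is the ones' complement of y
    if x > y:                      # equivalently: t produced a carry-out
        return (t + 1) & mask      # x + y' + 1 = x - y (two's complement subtract)
    return ~t & mask               # (x + y')' = y - x


# Exhaustive check over all 8-bit input pairs.
assert all(abs_diff(a, b) == abs(a - b)
           for a in range(256) for b in range(256))
```

Both branches share the same adder result x + y', so only an increment or a final complement distinguishes the two cases.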

  11. Results
      • Synthesis results using Synopsys Design Compiler.
      • Target library: LSI 90 nm.

        Unit   Clock Freq   Area
        PE     333 MHz      7276 µm²
        PE     500 MHz      8505 µm²
