EE398 – Project Presentation

Performance-Complexity Tradeoff: H.264 Motion Search
Ionut Hristodorescu
ionuth@stanford.edu

Outline
• H.264 motion search algorithm
• Mapping of the motion compensation algorithm on a memory hierarchy subsystem
• Cache organization and its impact on the motion compensation speed
• Making the internal H.264 data structures more cache friendly

H.264 Motion Search Algorithm
• Block matching algorithm
• Computes the SAD (sum of absolute differences) for every candidate position in a given search area (exhaustive search)
• Computationally intensive
• Its complexity is equal to or greater than that of the rest of the encoding steps
• Takes most of the encoding time
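The exhaustive block matching described above can be sketched as follows. This is a minimal illustration, not the reference encoder's code: the names `sad16x16`, `full_search` and `MotionVector` are hypothetical, pels are assumed to be 1 byte, both pictures are assumed to be stored line by line, and only integer-pel displacements are tried.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

#define MB 16 /* macroblock width and height in pels */

/* SAD between two 16x16 blocks stored line by line with the given stride. */
static unsigned sad16x16(const uint8_t *cur, const uint8_t *ref, int stride)
{
    unsigned sad = 0;
    for (int y = 0; y < MB; y++)
        for (int x = 0; x < MB; x++)
            sad += (unsigned)abs(cur[y * stride + x] - ref[y * stride + x]);
    return sad;
}

typedef struct { int dx, dy; unsigned sad; } MotionVector; /* hypothetical type */

/* Exhaustive search: evaluate every displacement within +/-range pels that
 * keeps the candidate block inside the reference picture, keep the lowest SAD. */
static MotionVector full_search(const uint8_t *cur, const uint8_t *ref,
                                int width, int height,
                                int mbx, int mby, int range)
{
    assert(range >= 0);
    MotionVector best = { 0, 0, UINT_MAX };
    const uint8_t *cur_blk = cur + mby * width + mbx;
    for (int dy = -range; dy <= range; dy++) {
        for (int dx = -range; dx <= range; dx++) {
            int rx = mbx + dx, ry = mby + dy;
            if (rx < 0 || ry < 0 || rx + MB > width || ry + MB > height)
                continue; /* candidate falls outside the picture */
            unsigned sad = sad16x16(cur_blk, ref + ry * width + rx, width);
            if (sad < best.sad) {
                best.dx = dx;
                best.dy = dy;
                best.sad = sad;
            }
        }
    }
    return best;
}
```

Each macroblock costs (2*range+1)^2 SAD evaluations of 256 pels each, which is why this step dominates the encoding time.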

Mapping on a memory hierarchy subsystem
• The motion compensation algorithm represents the luma/chroma planes internally as line-by-line matrices
• So consecutive lines of a macroblock are separated by (size(pel)*width) bytes
• This means that stepping from one macroblock line to the next lands in a different cache line: accessing pels that sit in the same column can generate 1 cache miss per pel!
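To make the stride argument concrete, the sketch below counts how many distinct cache lines a 16x16 macroblock touches in the line-by-line layout. The function names are hypothetical, and the sketch assumes 1-byte pels, a plane that starts on a cache-line boundary, and a cache line size passed in by the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Byte offset of pel (x, y) in a line-by-line plane of 1-byte pels. */
static size_t raster_offset(int x, int y, int width)
{
    return (size_t)y * (size_t)width + (size_t)x;
}

/* Number of distinct cache lines touched when reading the 16x16 macroblock
 * whose top-left pel is (mbx, mby). Offsets grow monotonically from row to
 * row, so counting changes of the cache line index counts distinct lines. */
static int raster_lines_touched(int mbx, int mby, int width, size_t line_size)
{
    assert(line_size > 0);
    int count = 0;
    size_t last = (size_t)-1;
    for (int y = 0; y < 16; y++) {
        size_t first = raster_offset(mbx, mby + y, width) / line_size;
        size_t end = raster_offset(mbx + 15, mby + y, width) / line_size;
        for (size_t line = first; line <= end; line++) {
            if (line != last) {
                count++;
                last = line;
            }
        }
    }
    return count;
}
```

For a 352-pel-wide picture and 32-byte cache lines, every macroblock row lands in its own cache line, so an aligned macroblock touches 16 lines, and one starting at x = 24 touches 32.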

Mapping on a memory hierarchy subsystem
• To overcome this, we could arrange the data so that consecutive macroblock lines sit in consecutive memory locations

Mapping on a memory hierarchy subsystem
• So, a natural representation of the chroma/luma matrices is sequential, macroblock line by macroblock line
• This way, the needed data is loaded into the cache faster

Mapping on a memory hierarchy subsystem
• The advantages are immediate
• Each macroblock line is 16 pels
• So two consecutive 16-pel lines fit in one cache line (32 bytes, with 1-byte pels)
• The macroblock is now accessed in a natural, sequential order
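One way to build such a blocked plane is sketched below. The names `blocked_offset` and `raster_to_blocked` are hypothetical, pels are assumed to be 1 byte, macroblocks are assumed to be stored in raster order, and the picture dimensions are assumed to be multiples of 16.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MB 16 /* macroblock width and height in pels */

/* Offset of pel (x, y) in the blocked plane: macroblocks are stored one after
 * another, and inside each macroblock its 16 lines of 16 pels are contiguous. */
static size_t blocked_offset(int x, int y, int width)
{
    size_t mbs_per_row = (size_t)(width / MB);
    size_t mb_index = (size_t)(y / MB) * mbs_per_row + (size_t)(x / MB);
    return mb_index * (MB * MB) + (size_t)(y % MB) * MB + (size_t)(x % MB);
}

/* Rearrange a line-by-line plane into the blocked layout. */
static void raster_to_blocked(const uint8_t *raster, uint8_t *blocked,
                              int width, int height)
{
    assert(width % MB == 0 && height % MB == 0);
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            blocked[blocked_offset(x, y, width)] = raster[(size_t)y * width + x];
}
```

In this layout a whole macroblock occupies 256 contiguous bytes, that is 8 cache lines of 32 bytes when the plane is cache-line aligned.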


Mapping on a memory hierarchy subsystem
• The biggest problem that arises now is access that does not start on a macroblock line boundary
• Each macroblock line sits at a 16-pel boundary in our representation so far
• For macroblock line aligned access, this is great
• What about access that is not macroblock line aligned?

Mapping on a memory hierarchy subsystem
• We have a problem: imagine we want to access 16 pels, but starting from position 4 in a macroblock line
• In the original representation this is no problem, since the original picture lines are sequential in memory
• In our representation, we end up in the next consecutive macroblock line
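The effect can be checked by mapping a byte offset in the blocked plane back to picture coordinates. The helper below is a hypothetical illustration that assumes 1-byte pels and macroblocks stored in raster order, each as 16 contiguous lines of 16 pels.

```c
#include <assert.h>
#include <stddef.h>

#define MB 16 /* macroblock width and height in pels */

typedef struct { int x, y; } PelPos; /* hypothetical type */

/* Picture coordinates of the pel stored at a given byte offset of the
 * blocked plane. */
static PelPos blocked_pos(size_t offset, int width)
{
    assert(width > 0 && width % MB == 0);
    size_t mbs_per_row = (size_t)(width / MB);
    size_t mb_index = offset / (MB * MB);
    size_t in_mb = offset % (MB * MB);
    PelPos p;
    p.x = (int)(mb_index % mbs_per_row) * MB + (int)(in_mb % MB);
    p.y = (int)(mb_index / mbs_per_row) * MB + (int)(in_mb / MB);
    return p;
}
```

Reading 16 sequential bytes from offset 4 returns pels (4,0) to (15,0) and then continues with (0,1) to (3,1): `blocked_pos(16, width)` is (0,1), while the pel actually wanted, (16,0), sits at offset 256 in the next macroblock.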

Mapping on a memory hierarchy subsystem
• Solution 1: pretend we don’t know about this problem and let the encoder access the wrong pels
• Solution 2: check each time whether we are crossing a macroblock line boundary and proceed accordingly
• Solution 3: keep two blocked versions of the picture: the original picture blocked and the shifted-by-32pels blocked
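Solution 2 could look roughly like the sketch below, which splits a 16-pel read in two whenever it crosses a macroblock line boundary. The names are hypothetical, pels are assumed to be 1 byte, and the blocked plane is assumed to store macroblocks in raster order with 16 contiguous lines of 16 pels each.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MB 16 /* macroblock width and height in pels */

/* Offset of pel (x, y) in the blocked plane. */
static size_t blocked_offset(int x, int y, int width)
{
    size_t mbs_per_row = (size_t)(width / MB);
    size_t mb_index = (size_t)(y / MB) * mbs_per_row + (size_t)(x / MB);
    return mb_index * (MB * MB) + (size_t)(y % MB) * MB + (size_t)(x % MB);
}

/* Copy 16 horizontally adjacent pels starting at (x, y) into out. */
static void read16_checked(const uint8_t *blocked, int width,
                           int x, int y, uint8_t out[MB])
{
    assert(x >= 0 && x + MB <= width);
    int in_line = x % MB;     /* start position inside the macroblock line */
    int first = MB - in_line; /* pels available before the boundary */
    memcpy(out, blocked + blocked_offset(x, y, width), (size_t)first);
    if (in_line != 0) {
        /* Boundary crossed: the remaining pels live in the macroblock
         * to the right, at the start of the same line. */
        memcpy(out + first, blocked + blocked_offset(x + first, y, width),
               (size_t)in_line);
    }
}
```

The pels are correct, but every read pays for the extra test and, when unaligned, for a second access in a different macroblock.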

Mapping on a memory hierarchy subsystem
• We prefer solution 3 (even though it is more expensive in terms of memory) because this way the pels are accessed faster
• If pel_pos%32 < 16, we pick up the pels from the blocked version of the original picture
• Else, we pick up the pels from the blocked version of the original picture shifted-by-32
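The selection rule itself is easy to write down. The sketch below goes one step further and checks one possible reading of the layout, which is an assumption and not something the slides spell out: tiles are 32 pels wide (one cache line of 1-byte pels) and the second copy is offset horizontally by 16 pels. The slides call the second copy "shifted-by-32pels", so the `SHIFT` value and all names here are hypothetical.

```c
#include <assert.h>

#define TILE 32  /* assumed tile width in pels (one cache line of 1-byte pels) */
#define READ 16  /* pels fetched per macroblock-line read */
#define SHIFT 16 /* assumed horizontal offset of the second copy */

/* Rule from the slides: 0 selects the blocked original,
 * 1 selects the blocked shifted copy. */
static int select_copy(int pel_pos)
{
    assert(pel_pos >= 0);
    return (pel_pos % TILE < READ) ? 0 : 1;
}

/* Start of the read inside its tile in the selected copy,
 * under the assumed offset of the second copy. */
static int offset_in_tile(int pel_pos)
{
    int shift = select_copy(pel_pos) ? SHIFT : 0;
    return (pel_pos - shift) % TILE;
}
```

Under these assumptions `offset_in_tile(pel_pos) + READ` never exceeds `TILE`, so a 16-pel read always stays inside one tile of the selected copy and no per-access boundary handling is needed.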

Mapping on a memory hierarchy subsystem
• 32 pels fit exactly in one cache line (or even 64 pels on better processors)
• So, each time we access two macroblock lines, the second access causes no cache miss, since both macroblock lines fit into one cache line

Results
• MET time decreased by approx. 8% compared to the non-blocked exhaustive search
• Cache misses/pixel decreased from 600-800 to approx. 15
• Rate-distortion ratio was preserved*

Further optimizations
• Assembly language coding of the SAD computation, in particular using the PSADBW instruction (MMX/SSE)
• Multi-threading of the motion compensation algorithm
• By using the Performance API (PAPI), we could measure the runtime behavior of the cache and introduce the cache misses into the motion cost function, much like in [1]
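As an alternative to hand-written assembly, PSADBW is reachable from C through the SSE2 intrinsic `_mm_sad_epu8`. The sketch below computes the SAD of one 16-pel macroblock line; the function name `sad16_line` is hypothetical, and a scalar fallback is used when the compiler does not define `__SSE2__`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* SAD of one 16-pel macroblock line (1-byte pels, no alignment required). */
static unsigned sad16_line(const uint8_t *a, const uint8_t *b)
{
    assert(a != NULL && b != NULL);
#if defined(__SSE2__)
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    /* PSADBW: one partial sum per 8-byte half, stored in the low 16 bits
     * of each 64-bit lane. */
    __m128i s = _mm_sad_epu8(va, vb);
    return (unsigned)_mm_cvtsi128_si32(s) + (unsigned)_mm_extract_epi16(s, 4);
#else
    unsigned sad = 0;
    for (int i = 0; i < 16; i++)
        sad += (unsigned)abs(a[i] - b[i]);
    return sad;
#endif
}
```

With the blocked layout, the 16 lines of a macroblock are contiguous, so a full 16x16 SAD is 16 such calls on addresses 16 bytes apart.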

Further optimizations
• Intelligent prefetching of data
• Extend the blocking algorithm to the entire motion estimation engine
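A very simple form of software prefetching is sketched below: while the SAD of one blocked macroblock is computed, the next candidate macroblock is requested from memory. This is only an illustration of the idea, not a tuned scheme. It assumes 1-byte pels, a blocked layout with 256 contiguous bytes per macroblock, and 32-byte cache lines; the function name is hypothetical, and `__builtin_prefetch` is a GCC/Clang builtin, so the hint is compiled out elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define MB_BYTES 256  /* 16x16 pels of 1 byte, contiguous in the blocked layout */
#define CACHE_LINE 32 /* assumed cache line size in bytes */

/* SAD of two blocked macroblocks; optionally prefetches the next reference
 * macroblock (pass NULL to skip the hint). */
static unsigned sad_blocked_mb(const uint8_t *cur_mb, const uint8_t *ref_mb,
                               const uint8_t *next_ref_mb)
{
    assert(cur_mb != NULL && ref_mb != NULL);
#if defined(__GNUC__)
    if (next_ref_mb != NULL)
        for (int off = 0; off < MB_BYTES; off += CACHE_LINE)
            __builtin_prefetch(next_ref_mb + off, 0, 3); /* read, keep in cache */
#else
    (void)next_ref_mb;
#endif
    unsigned sad = 0;
    for (int i = 0; i < MB_BYTES; i++)
        sad += (unsigned)abs(cur_mb[i] - ref_mb[i]);
    return sad;
}
```

The prefetch is only a hint and never changes the result; whether it helps depends on the search order and on the hardware prefetcher, so it would have to be measured.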
