EE398 – Project Presentation
Presentation Transcript

  1. EE398 – Project Presentation: Performance–Complexity Tradeoff in H.264 Motion Search. Ionut Hristodorescu, ionuth@stanford.edu

  2. Outline • H.264 motion search algorithm • Mapping the motion compensation algorithm onto a memory hierarchy subsystem • Cache organization and its impact on motion compensation speed • Making the internal H.264 data structures more cache-friendly

  3. H.264 Motion Search Algorithm • Block-matching algorithm • Computes the SADs for all candidate positions in a given search area (exhaustive search) • Computationally intensive: its complexity equals or exceeds that of all the remaining encoding steps combined • Takes most of the encoding time
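The SAD kernel at the heart of exhaustive block matching can be sketched as below. This is a minimal scalar version, not the encoder's actual code; the function name and the fixed 16x16 block size are assumptions for illustration.

```c
#include <stdlib.h>

/* Sum of absolute differences between a 16x16 macroblock in the
 * current frame and a candidate block in the reference frame.
 * `stride` is the picture width in pels; both pointers address the
 * top-left pel of their block inside a raster-order 8-bit plane. */
static int sad16x16(const unsigned char *cur, const unsigned char *ref,
                    int stride)
{
    int sad = 0;
    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++)
            sad += abs(cur[x] - ref[x]);
        cur += stride;   /* advance to the next line of the current block */
        ref += stride;   /* advance to the next line of the candidate    */
    }
    return sad;
}
```

Exhaustive search evaluates this kernel once per candidate position, which is why its memory access pattern dominates the encoder's runtime.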

  4. Mapping on a memory hierarchy subsystem • The luma/chroma planes are represented internally in the motion compensation algorithm as a line-by-line (raster-order) matrix • So consecutive lines of a macroblock are separated by size(pel)*width bytes • This means that accessing pels that sit in the same column (one per picture line) can generate 1 cache miss per pel !!!
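The stride arithmetic behind that claim can be made concrete. In a raster layout, pel (x, y) of an 8-bit plane of width W lives at byte offset y*W + x, so vertically adjacent pels are W bytes apart, far more than a typical 32- or 64-byte cache line for any realistic picture width. A small sketch (the CIF width 352 below is an assumed example, not from the slides):

```c
#include <stddef.h>

/* Byte offset of pel (x, y) in a raster-order 8-bit plane of the
 * given width. Two pels one line apart differ by `width` bytes. */
static size_t raster_offset(int x, int y, int width)
{
    return (size_t)y * width + x;
}
```

With width = 352, walking down one column touches bytes 352 apart, so every pel of a 16-pel-tall macroblock column can land in a different cache line.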

  5. Mapping on a memory hierarchy subsystem • To overcome this, we can rearrange the data so that consecutive macroblock lines sit in consecutive memory locations

  6. Mapping on a memory hierarchy subsystem • So, a natural representation of the chroma/luma matrices is sequential, macroblock line by macroblock line • This way, the needed information is loaded into the cache faster
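The re-blocking step described above can be sketched as follows: each 16x16 macroblock of a raster plane is copied into 256 consecutive bytes. The function name is assumed, and width and height are assumed to be multiples of 16 for simplicity.

```c
#include <string.h>
#include <stddef.h>

/* Copy a raster-order 8-bit plane into a blocked layout in which the
 * 16 lines of each 16x16 macroblock occupy 256 consecutive bytes.
 * Macroblocks are stored left-to-right, top-to-bottom. */
static void block_luma(unsigned char *dst, const unsigned char *src,
                       int width, int height)
{
    int mbs_per_row = width / 16;
    for (int mby = 0; mby < height / 16; mby++)
        for (int mbx = 0; mbx < mbs_per_row; mbx++) {
            unsigned char *mb = dst + ((size_t)mby * mbs_per_row + mbx) * 256;
            for (int line = 0; line < 16; line++)
                memcpy(mb + line * 16,
                       src + (size_t)(mby * 16 + line) * width + mbx * 16,
                       16);
        }
}
```

After this transform, reading a whole macroblock is a single 256-byte sequential scan instead of 16 strided reads.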

  7. Mapping on a memory hierarchy subsystem • The advantages are immediate • Each macroblock line is 16 pels • So we can fit 2 consecutive 16-pel lines in one 32-byte cache line • The macroblock is now accessed in a natural, sequential order

  8. Mapping on a memory hierarchy subsystem

  9. Mapping on a memory hierarchy subsystem • The biggest remaining problem is access that is not aligned to a macroblock line boundary • Each macroblock line sits at a 16-pel boundary in our representation so far • For macroblock-line-aligned access, this is great • How about non-aligned access?

  10. Mapping on a memory hierarchy subsystem • We have a problem: imagine we want to access 16 pels starting from position 4 in a macroblock line • In the original representation this is no problem, since the original picture lines are sequential in memory • In our blocked layout, we would run into the next consecutive macroblock line

  11. Mapping on a memory hierarchy subsystem • Solution 1: pretend we don't know about this problem and let the encoder access the wrong pels • Solution 2: check each time whether we are crossing a macroblock line boundary and proceed accordingly • Solution 3: keep two blocked versions of the picture: the blocked original and a blocked copy shifted by 32 pels

  12. Mapping on a memory hierarchy subsystem • We prefer Solution 3 (even though it costs more memory) because the pels are accessed faster • If pel_pos % 32 < 16, we pick up the pels from the blocked version of the original picture • Otherwise, we pick them up from the blocked version of the picture shifted by 32 pels
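The selection rule from this slide can be sketched as a branchless lookup. The function and parameter names are assumed; `blocked` stands for the blocked original picture and `blocked_shifted` for the blocked copy of the picture shifted by 32 pels.

```c
/* Pick which blocked copy serves a read starting at horizontal pel
 * position `pel_pos`. A start whose offset within a 32-pel pair falls
 * in the second half would straddle a macroblock-line boundary in the
 * unshifted copy, so it is served from the shifted copy instead. */
static const unsigned char *pick_plane(const unsigned char *blocked,
                                       const unsigned char *blocked_shifted,
                                       int pel_pos)
{
    return (pel_pos % 32 < 16) ? blocked : blocked_shifted;
}
```

The test is a single modulo and compare per access, much cheaper than Solution 2's per-access boundary check with a conditional split copy.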

  13. Mapping on a memory hierarchy subsystem • 32 pels fit exactly in one 32-byte cache line (or even 64 pels on processors with 64-byte lines) • So each access to two consecutive macroblock lines costs at most one cache miss, since both lines fit in a single cache line

  14. Results • Motion estimation time decreased by approx. 8% compared to the non-blocked exhaustive search • Cache misses per pel decreased to approx. 15 from 600–800 !!! • Rate-distortion performance was preserved*

  15. Further optimizations • Assembly-language coding of the SAD computation, in particular using the PSADBW MMX/SSE instruction • Multi-threading the motion compensation algorithm • Using the Performance API (PAPI), we could measure the runtime behavior of the cache and introduce cache misses into the motion cost function, much like in [1]
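The PSADBW idea can be sketched with compiler intrinsics rather than hand-written assembly; the SSE2 form shown here (`_mm_sad_epu8`, requiring an SSE2-capable x86 target) processes one 16-pel line per instruction. The function name is an assumption for illustration.

```c
#include <emmintrin.h>

/* SAD of one 16-pel line using the psadbw instruction. psadbw produces
 * two 64-bit partial sums (one per 8-byte half), which are added to
 * give the full 16-byte SAD. */
static int sad16_line(const unsigned char *a, const unsigned char *b)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i s  = _mm_sad_epu8(va, vb);  /* two 64-bit partial SADs */
    return _mm_cvtsi128_si32(s)
         + _mm_cvtsi128_si32(_mm_srli_si128(s, 8));
}
```

One such instruction replaces 16 scalar subtract/abs/add sequences, so a full 16x16 SAD needs only 16 of them plus a few adds.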

  16. Further optimizations • Intelligent prefetching of data • Extend the blocking algorithm to the entire motion estimation engine