
Optimization of H.264 High Profile Decoder for Pentium 4 Processor

Tarun Bhatia, University of Texas at Arlington, tarun@fastvdo.com


Presentation Transcript


  1. Optimization of H.264 High Profile Decoder for Pentium 4 Processor
     Tarun Bhatia, University of Texas at Arlington, tarun@fastvdo.com

  2. H.264 Decoder (block diagram)
     Blocks shown: Bitstream Input, Entropy Decoding, Inverse Transform and Dequantization, adder (+), Deblocking, Video Output, Picture Buffering, Motion Compensation, Intra Prediction, Intra/Inter Mode Selection

  3. Optimization: Need
     • H.264/AVC video coding introduces substantially more coding tools and coding options than earlier standards. Achieving the highest possible coding gain therefore requires much more computation.
     • Aggressive optimization is typically required for H.264 implementations to meet cost and power targets and to provide real-time performance for applications.

  4. Sequences Used: Girl.264, Karate.264, Golf.264, Shore.264, Plane.264

  5. H.264 Profiles (diagram)
     • Profiles shown: Baseline, Main, Extended, High
     • High Profile additions: Adaptive Block Size Transform, Perceptual Quantization Matrices
     • Other coding tools shown: B slices, Weighted Prediction, CABAC, Data Partition, I slice, P slice, CAVLC, Arbitrary Slice Order (ASO), Flexible Macroblock Ordering (FMO), Redundant Slices, SP slice, SI slice

  6. H.264 High Profile: features
     • Main Profile + additional features
     • 8x8 Integer DCT
     • HVS matrices
     • 8x8 Intra Prediction modes

  7. Optimization: Levels
     • Algorithm Level, e.g. DCT implementation
     • Compiler Level (Microsoft Visual Studio .NET 2003 / Intel C++ compiler v8.0)
     • Implementation Level, e.g.
       Elimination of loops and conditions
       Using SIMD for implementation
       Multithreading

  8. Target Platform: Pentium 4 Processor, Intel SIMD Architecture (diagram)
     • Registers: 8 XMM registers [128 bit], MXCSR [32 bit], 8 MMX registers [64 bit], 8 GPRs [32 bit], x87 FP register file, EFLAGS [32 bit]
     • Execution units: FP, MMX, SSE/SSE2/SSE3, FP MOVE
     • L1 Data Cache (8 KB, 4-way)

  9. Intel HT (Hyper-Threading) Technology (diagram, including the system bus)
     Purpose: simultaneous execution of threads

  10. Optimization: Steps
      • Optimization during code development
      • Optimization after code development
        1) Searching for "hotspots" in the code
        2) Analysis of each hotspot, e.g. a high number of calls, cache misses, a slow implementation
        3) Optimization of the hotspots

  11. Performance Profiling
      • Intel VTune™ Performance Analyzer

  12. Intel VTune Performance Analysis - Results (FastVDO H.264 HD High Profile Decoder)

  13. Distribution of Decoder Time Consumption

  14. SIMD
      • Single Instruction Multiple Data instructions
      • Intel Pentium 4
        MMX (Multimedia Extension) from Pentium MMX onwards
        SSE (Streaming SIMD Extensions) from Pentium III onwards
        SSE2 (Streaming SIMD Extensions 2) from Pentium 4 onwards
      • AMD Athlon 64
        3DNow!

  15. SIMD Data Types (figure: packed data layouts in a 128-bit register)
      Captions: "Available in XMM registers in SSE Technology"; "Available in MMX and XMM registers"

  16. SIMD Instructions: Types
      • Packed Arithmetic (e.g. padd, pmul)
      • Packed Logical (e.g. pand, por)
      • Data Movement and Memory Access (mov)
      • General Support (pack, unpack)
      • Packed Shift (>>, <<)
      • Packed Comparison (<=, ==)
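The instruction classes above can be combined in a few lines. The following is a minimal sketch, not code from the presentation: it uses SSE2 compiler intrinsics rather than hand-written assembly, and the function name add_residual_8 and the reconstruction use case (adding a 16-bit residual to 8-bit predicted pixels) are assumptions chosen for illustration. It shows data movement (loads and stores), general support (unpack and pack) and packed arithmetic working together.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: out[i] = clip(pred[i] + residual[i], 0, 255) for 8 pixels. */
void add_residual_8(const uint8_t *pred, const int16_t *residual, uint8_t *out)
{
    const __m128i zero = _mm_setzero_si128();

    /* Data movement: load 8 predicted bytes and 8 residual words. */
    __m128i p8  = _mm_loadl_epi64((const __m128i *)pred);
    __m128i r16 = _mm_loadu_si128((const __m128i *)residual);

    /* General support (unpack): widen bytes to 16-bit words. */
    __m128i p16 = _mm_unpacklo_epi8(p8, zero);

    /* Packed arithmetic: eight signed saturating 16-bit additions at once. */
    __m128i sum = _mm_adds_epi16(p16, r16);

    /* General support (pack): narrow to bytes with unsigned saturation,
       which clips each result to the range 0..255. */
    __m128i res = _mm_packus_epi16(sum, sum);

    /* Data movement: store the low 8 bytes. */
    _mm_storel_epi64((__m128i *)out, res);
}
```

The pack instruction performs the clipping for free, which is one reason the pack/unpack pair appears so often in video kernels.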

  17. Case Study
      void interpolation4x4(const pixel_data *forward_block, const pixel_data *backward_block, pixel_data *result)
      {
          /* result must point to a caller-allocated buffer of 16 pixels */
          for (int i = 0; i <= 15; i++) {
              result[i] = (forward_block[i] + backward_block[i] + 1) / 2;
          }
      }

  18. MMX Code
      void interpolation(pixel_data *forward_block, pixel_data *backward_block, pixel_data *result)
      {
          __asm {
              pxor      mm7, mm7            // set mm7 to 0
              mov       EDX, 0x01010101     // EDX = 01 01 01 01
              mov       EAX, forward_block  // load forward block starting address
              movd      mm3, EDX            // mm3: 00 00 00 00 01 01 01 01
              mov       EBX, backward_block // load backward block starting address
              punpcklbw mm3, mm7            // mm3: 00 01 00 01 00 01 00 01
              mov       ECX, result         // load the address of result
              movd      mm0, [EAX]          // mm0: fb[1:4]
              movd      mm1, [EBX]          // mm1: bb[1:4]
              movd      mm4, [EAX+4]        // mm4: fb[5:8]
              movd      mm5, [EBX+4]        // mm5: bb[5:8]
              punpcklbw mm0, mm7            // widen bytes to words
              punpcklbw mm1, mm7            //
              punpcklbw mm4, mm7            //
              punpcklbw mm5, mm7            //
              paddw     mm0, mm1            // mm0: fb[1:4]+bb[1:4]
              paddw     mm4, mm5            // mm4: fb[5:8]+bb[5:8]
              paddw     mm0, mm3            // mm0: fb[1:4]+bb[1:4]+1
              paddw     mm4, mm3            // mm4: fb[5:8]+bb[5:8]+1
              psrlw     mm0, 1              // mm0: (fb[1:4]+bb[1:4]+1) >> 1
              psrlw     mm4, 1              // mm4: (fb[5:8]+bb[5:8]+1) >> 1
              packuswb  mm0, mm0            // mm0: r4 r3 r2 r1 in the low 32 bits
              packuswb  mm4, mm4            // mm4: r8 r7 r6 r5 in the low 32 bits
              movd      [ECX], mm0          // result[1:4] = mm0
              movd      [ECX+4], mm4        // result[5:8] = mm4
              // Repeat the same process for fb[9:16] and bb[9:16]
              emms                          // empty MMX state
          }
      }
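For comparison with the MMX listing, the same rounded average can be sketched with SSE2 intrinsics. This is not the code used in the presentation; it assumes that pixel_data is an unsigned 8-bit sample and that the 4x4 block is stored as 16 contiguous bytes, and the name interpolation4x4_sse2 is hypothetical. The PAVGB instruction (intrinsic _mm_avg_epu8) computes (a + b + 1) >> 1 per byte directly, so no widening to 16 bits is needed and all 16 pixels fit in one XMM register.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Assumption: pixel_data is an unsigned 8-bit sample. */
typedef uint8_t pixel_data;

/* result[i] = (forward_block[i] + backward_block[i] + 1) >> 1 for i = 0..15 */
void interpolation4x4_sse2(const pixel_data *forward_block,
                           const pixel_data *backward_block,
                           pixel_data *result)
{
    /* Unaligned loads: the blocks are not assumed to be 16-byte aligned. */
    __m128i fb = _mm_loadu_si128((const __m128i *)forward_block);
    __m128i bb = _mm_loadu_si128((const __m128i *)backward_block);

    /* PAVGB: rounded unsigned byte average of all 16 lanes at once. */
    __m128i avg = _mm_avg_epu8(fb, bb);

    _mm_storeu_si128((__m128i *)result, avg);
}
```

Because only XMM registers are touched, no emms instruction is required afterwards.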

  19. SIMD Application Results
      • Amdahl's Law: the Overall Speedup (O.S.) obtained by optimizing a portion p of the program by a factor s is
        \[ \text{O.S.} = \left( \frac{1}{(1 - p) + \frac{p}{s}} - 1 \right) \times 100\,\% \]
        p: fraction of the code being optimized
        s: speedup factor for that fraction of the code
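As a worked instance of the formula, the sketch below evaluates it in C. The function name overall_speedup_percent and the sample numbers are illustrative assumptions, not measurements from the presentation: if a module takes 20% of the decoding time (p = 0.2) and is made 4 times faster (s = 4), the overall speedup is (1 / (0.8 + 0.05) - 1) x 100%, which is about 17.6%.

```c
/* Overall speedup in percent according to Amdahl's Law.
   p: fraction of run time spent in the optimized code (0..1)
   s: speedup factor achieved for that fraction (s > 0) */
double overall_speedup_percent(double p, double s)
{
    return (1.0 / ((1.0 - p) + p / s) - 1.0) * 100.0;
}
```

The formula also makes the ceiling explicit: even with an unbounded s, a module taking 20% of the run time can yield at most 1 / 0.8 - 1 = 25% overall speedup.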

  20. Application to IDCT 4x4
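For reference, a scalar version of the 4x4 inverse integer transform is sketched below. It follows the butterfly structure defined in the H.264 standard, not the FastVDO implementation, whose code the slides do not show. The function name idct4x4 and the in-place raster-order layout are assumptions. The four independent rows (and then columns) go through identical add, subtract and shift steps, which is what makes this kernel a natural SIMD candidate.

```c
#include <stdint.h>

/* In-place 4x4 inverse transform on 16 dequantized coefficients in raster
   order. Relies on arithmetic right shift of negative values, which common
   compilers provide. */
void idct4x4(int16_t blk[16])
{
    int tmp[16];

    for (int i = 0; i < 4; i++) {            /* horizontal pass, one row at a time */
        const int16_t *d = blk + 4 * i;
        int e0 = d[0] + d[2];
        int e1 = d[0] - d[2];
        int e2 = (d[1] >> 1) - d[3];
        int e3 = d[1] + (d[3] >> 1);
        tmp[4 * i + 0] = e0 + e3;
        tmp[4 * i + 1] = e1 + e2;
        tmp[4 * i + 2] = e1 - e2;
        tmp[4 * i + 3] = e0 - e3;
    }

    for (int j = 0; j < 4; j++) {            /* vertical pass, one column at a time */
        int e0 = tmp[j] + tmp[8 + j];
        int e1 = tmp[j] - tmp[8 + j];
        int e2 = (tmp[4 + j] >> 1) - tmp[12 + j];
        int e3 = tmp[4 + j] + (tmp[12 + j] >> 1);
        /* Final rounding: add 32 and shift right by 6. */
        blk[j]      = (int16_t)((e0 + e3 + 32) >> 6);
        blk[4 + j]  = (int16_t)((e1 + e2 + 32) >> 6);
        blk[8 + j]  = (int16_t)((e1 - e2 + 32) >> 6);
        blk[12 + j] = (int16_t)((e0 - e3 + 32) >> 6);
    }
}
```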

  21. IDCT 4x4 Comparison of % Time Consumed Of the Total Decoding Time

  22. % Overall Speed up in Decoding Time with SIMD IDCT4x4

  23. Application to Motion Compensation
      The implementation of Motion Compensation can be divided as:
      • Data Manipulation (SIMD not used)
      • Interpolation (SIMD used)
        • Half Pel Interpolation
        • Quarter Pel Interpolation
        • Linear Interpolation for B frames
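To make the interpolation step concrete, here is a scalar sketch of luma half-pel interpolation using the 6-tap filter (1, -5, 20, 20, -5, 1) that the H.264 standard specifies. It is a reference form only, not the presentation's SIMD code; the names half_pel and clip255 are hypothetical, and the sketch covers a single horizontal half-sample position.

```c
#include <stdint.h>

/* Clip an intermediate value to the 8-bit pixel range. */
static uint8_t clip255(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Half-sample value between src[0] and src[1].
   The caller must guarantee that src[-2] .. src[3] are readable. */
uint8_t half_pel(const uint8_t *src)
{
    int sum = src[-2] - 5 * src[-1] + 20 * src[0]
            + 20 * src[1] - 5 * src[2] + src[3];
    return clip255((sum + 16) >> 5);   /* round and divide by 32 */
}
```

Quarter-pel samples are then formed as rounded averages of neighbouring integer and half-pel samples, the same (a + b + 1) >> 1 operation as in the case study.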

  24. Motion Compensation-% Time consumption (without MMX)

  25. SIMD Application to Motion Compensation - Results

  26. Motion Compensation: Results, Comparison of % Time Consumed

  27. % Overall Speed up in Decoding Time with SIMD MC

  28. Multithreading
      • Definition: Multithreading is the ability of a program to multitask within itself. The program can split itself into separate "threads" of execution that seem to run concurrently.
      • Waits are used to block a thread until a particular event hands over control
      • Release is used to unblock a thread
      • Semaphores: locking mechanisms / counters that control access to shared resources used by multiple processes

  29. Producer-Consumer Problem (Diagram)
      Elements shown: Producer Thread, Consumer Thread, Semaphores, Wait, Release, Serial Execution of a Thread

  30. Producer-Consumer Problem (Algorithm)
      • The Producer thread starts and initializes data
      • It waits for the Consumer thread
      • If the Consumer thread is ready, it releases control to the Consumer thread
      • The Producer thread completes one execution cycle in the meantime and waits for the Consumer thread
      • When control is passed back to the Producer thread, the process is repeated until the end condition is met.
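The wait/release hand-over described above can be sketched with two counting semaphores. This example assumes POSIX threads and unnamed POSIX semaphores (sem_wait for "Wait", sem_post for "Release") on a Linux-like system; the presentation does not name its threading API. All identifiers (shared_t, run_producer_consumer, and so on) are hypothetical, and the "data" is just a sequence of integers passed through a single slot.

```c
#include <pthread.h>
#include <semaphore.h>
#include <stddef.h>

typedef struct {
    sem_t slot_empty;  /* released by the consumer: slot may be overwritten */
    sem_t slot_full;   /* released by the producer: slot holds new data */
    int   slot;        /* one-item shared buffer */
    int   count;       /* number of items to transfer (end condition) */
    long  sum;         /* accumulated by the consumer */
} shared_t;

static void *producer(void *arg)
{
    shared_t *sh = arg;
    for (int i = 1; i <= sh->count; i++) {
        sem_wait(&sh->slot_empty);   /* Wait until the consumer is ready */
        sh->slot = i;                /* produce one item */
        sem_post(&sh->slot_full);    /* Release control to the consumer */
    }
    return NULL;
}

static void *consumer(void *arg)
{
    shared_t *sh = arg;
    for (int i = 0; i < sh->count; i++) {
        sem_wait(&sh->slot_full);    /* Wait for new data */
        sh->sum += sh->slot;         /* consume it */
        sem_post(&sh->slot_empty);   /* Release control back to the producer */
    }
    return NULL;
}

/* Runs both threads to completion and returns the consumer's total. */
long run_producer_consumer(int count)
{
    shared_t sh;
    pthread_t prod, cons;

    sh.slot = 0;
    sh.count = count;
    sh.sum = 0;
    sem_init(&sh.slot_empty, 0, 1);  /* slot starts out empty */
    sem_init(&sh.slot_full, 0, 0);

    pthread_create(&prod, NULL, producer, &sh);
    pthread_create(&cons, NULL, consumer, &sh);
    pthread_join(prod, NULL);
    pthread_join(cons, NULL);

    sem_destroy(&sh.slot_empty);
    sem_destroy(&sh.slot_full);
    return sh.sum;
}
```

On older toolchains this needs to be compiled and linked with -pthread. With a one-item slot the two threads strictly alternate; a deeper buffer would let the producer run ahead.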

  31. Multithreading in Video Coding
      The codec can be multithreaded in two ways:
      • Block Level
        • Independent blocks can be executed as separate threads, e.g. slices in H.264, motion estimation, deblocking of non-reference frames
      • GOP Level
        • Closed GOP: a group of frames that does not use any reference frames from outside its own GOP
        • Open GOP: a group of frames that can use reference frames from outside its own GOP

  32. Proposed Multithreading Architecture: features
      • GOP Level (Closed GOP)
      • 30 frames per GOP
      • IPPPPPPP…P
      • Each GOP begins with an I frame and otherwise contains P frames only (i.e. 1 I frame and 29 P frames in each GOP)
      • B frames are not used in the design, to maintain the closed GOP structure
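With a fixed closed-GOP length of 30 frames, the mapping from frames to GOPs is simple integer arithmetic. The sketch below shows one possible way to hand GOPs to decoder threads; the round-robin assignment is an assumption made for illustration (the slides do not state how GOPs are distributed), and all function names are hypothetical.

```c
#define FRAMES_PER_GOP 30   /* 1 I frame followed by 29 P frames */

/* Index of the GOP that contains the given frame (frames counted from 0). */
int gop_index(int frame)
{
    return frame / FRAMES_PER_GOP;
}

/* Non-zero if the frame is the I frame that opens its GOP. */
int is_gop_start(int frame)
{
    return frame % FRAMES_PER_GOP == 0;
}

/* Assumed round-robin policy: GOP g is decoded by thread g mod num_decoders. */
int decoder_for_frame(int frame, int num_decoders)
{
    return gop_index(frame) % num_decoders;
}
```

Because each GOP is closed, a decoder thread never needs reference pictures owned by another thread, so no synchronization is needed inside a GOP.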

  33. Proposed Multithreading Architecture (diagram)
      Threads shown: Main Thread, Get IDR Position, Decoder 0, Decoder 1, …, Decoder N

  34. Multithreaded Decoder: Threads
      • Main Thread
        • Creates all threads and semaphores
        • Gets SPS and PPS NALUs from the bitstream
        • Initializes multiple decoders with the SPS and PPS NALUs
      • Get IDR Frame Position Thread
        • Searches for IDR NALU positions in the bitstream
        • Manages Waits and Releases of semaphores
      • Decoder Threads
        • Decode H.264 GOPs
      SPS: Sequence Parameter Set
      PPS: Picture Parameter Set
      NALU: Network Abstraction Layer Unit
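The IDR search step can be sketched as a scan over the byte stream. This example assumes the bitstream is in H.264 Annex B byte-stream format, where each NAL unit is preceded by a 0x000001 start code and the low five bits of the first NAL byte hold nal_unit_type (5 for an IDR slice, 7 for SPS, 8 for PPS). The function name find_idr_positions is hypothetical, and the sketch reports only the first IDR slice NALU offsets it encounters, not full access-unit boundaries.

```c
#include <stddef.h>
#include <stdint.h>

/* Records the byte offset of the start code of each IDR NAL unit.
   Returns the number of offsets written (at most max_pos). */
size_t find_idr_positions(const uint8_t *buf, size_t len,
                          size_t *pos, size_t max_pos)
{
    size_t found = 0;

    for (size_t i = 0; i + 3 < len && found < max_pos; i++) {
        /* Look for the 3-byte start code 00 00 01. */
        if (buf[i] == 0 && buf[i + 1] == 0 && buf[i + 2] == 1) {
            /* nal_unit_type is the low 5 bits of the NAL header byte. */
            if ((buf[i + 3] & 0x1F) == 5)
                pos[found++] = i;
            i += 2;   /* skip the rest of the start code */
        }
    }
    return found;
}
```

Each recorded offset marks a point where a decoder thread can start independently, since an IDR picture resets the reference picture state.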

  35. Multithreading: Results (chart: % Speed up in Decoding Time vs. Number of Threads)

  36. Multithreading: Results (chart: Threading Overhead, time in seconds, vs. No. of Threads)

  37. Further Research
      • Optimization of the High Profile HD (720p) encoder to minimize hardware requirements
      • Testing of the H.264 encoder and decoder on multicore CPUs
      • Implementation of the time-consuming modules of the H.264 encoder and decoder on a GPU (Graphics Processing Unit)

  38. References
      • H.264: International Telecommunication Union, "Recommendation ITU-T H.264: Advanced Video Coding for Generic Audiovisual Services," ITU-T, 2005.
      • MPEG-2: ISO/IEC JTC1/SC29/WG11 and ITU-T, "ISO/IEC 13818-2: Information Technology-Generic Coding of Moving Pictures and Associated Audio Information: Video," ISO/IEC and ITU-T, 1994.
      • Soon-kak Kwon, A. Tamhankar and K. R. Rao, "Overview of MPEG-4 Part 10".
      • G. Sullivan, P. Topiwala and A. Luthra, "The H.264/AVC Advanced Video Coding Standard: Overview and Introduction to the Fidelity Range Extensions," SPIE Conference on Applications of Digital Image Processing XXVII, vol. 5558, pp. 53-74, Aug. 2004.
      • The Software Optimization Cookbook, Intel Press, 2002.
      • IA-32 Intel Architecture Optimization, Reference Manual, www.intel.com
      • Optimizing Applications with the Intel C++ and Fortran Compilers, White paper, http://developer.intel.com/design/pentium4/manuals/
      • J. Lee, S. Moon and W. Sung, "H.264 Decoder Optimization Exploiting SIMD Instructions," Seoul National University, http://sips03.snu.ac.kr/pub/conf/c67.pdf. Accepted at IEEE Asia-Pacific Conference on Circuits and Systems (APCCAS), December 2004.
      • G. M. Amdahl, "Validity of the single-processor approach to achieving large scale computing capabilities," in AFIPS Conference Proceedings, vol. 30 (Atlantic City, N.J., Apr. 18-20), AFIPS Press, Reston, Va., 1967, pp. 483-485.
      • M. Horowitz, A. Joch, F. Kossentini and A. Hallapuro, "H.264/AVC Baseline Profile Decoder Complexity Analysis," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 704-716, July 2003.

  39. References: Continued
      • http://www.blu-ray.com/
      • http://www.hddvd.org/hddvd/
      • http://www.fastvdo.com
      • http://www.intel.com
      • http://www.intel.com/software/products/vtune/
      • http://msdn.microsoft.com

  40. Thanks!!
