A Scalable Parallel H.264 Decoder on the Cell Broadband Engine Architecture

Michael A. Baker, Pravin Dalale, Karam S. Chatha, Sarma B. K. Vrudhula • Arizona State University • CODES+ISSS (The International Conference on Hardware-Software Codesign and System Synthesis) 2009

Outline • Introduction and Motivation • Opportunities for Parallelization in H.264 • Implementation • Performance Optimizations • Experimental Results • Conclusion

Motivation • Multicore Architectures • Scalability: more cores = more performance • H.264 • Standard for video applications including High Definition (HD) • Computationally expensive • Cell Broadband Engine (CBE) • Common and inexpensive thanks to the PS3 • Low-power, high-performance design gives a glimpse of future embedded architectures

IBM Cell Broadband Engine Architecture • 3.2 GHz • 9 cores, 10 threads • >200 Gflops (single precision) • >20 Gflops (double precision) • Up to 25 GB/s memory bandwidth • Up to 75 GB/s I/O bandwidth • >300 GB/s interconnect bus • Abbreviations: SPE: Synergistic Processor Element; SPU: Synergistic Processor Unit; SXU: SPU Core; LS: Local Storage; SMF: Synergistic Memory Flow Control; EIB: Element Interconnect Bus; PPE: PowerPC Processor Element; PPU: PowerPC Processor Unit; PXU: Power Processor Unit; MIC: Memory Interface Controller; BIC: Bus Interface Controller; L1: Memory Cache Internal to the CPU; L2: Memory Cache External to the CPU • http://domino.research.ibm.com/comm/research.nsf/pages/r.arch.innovation.htm

H.264 Advanced Video Coding • H.264 is a video compression standard • Version 1 completed May 2003 • ITU-T Video Coding Experts Group (H.264) • ISO/IEC Moving Picture Experts Group (MPEG-4 AVC) • Macroblock (MB) based codec closely related to MPEG-2 • Growing demand for HD and wireless video • 50% bit rate reduction over the previous standard • Computational complexity approximately 2.4x that of MPEG-2

H.264: Decoder

Reference Code: FFmpeg (H.264 Decoder) • Open source video and audio converter • Handles a multitude of formats • Codecs other than H.264 decoder removed • About 250K Lines of Code after paring to H.264 only • About 200 functions ported to SPU in our implementation http://www.ffmpeg.org/

H.264 Frame Level Relationships • I Frame: Independently Encoded • Intra Prediction • P Frame: Predicted from a Preceding frame • Intra and Inter Prediction • B Frame: Predicted from Both preceding and following frames • Intra and Inter Prediction

H.264 Opportunities for Parallelism: GOP and Frame Level • I, P, B Frames • Picture Sequence • IBBPBBP • Independent Group of Pictures (GOP) • Independent Frames within GOP

H.264 Opportunities for Parallelism: Slice and MB Level • Slices: Independently encoded groups of MBs within a frame • Intra Dependencies:

Data Partitioning Scheme • Our Scheme: One row of MBs issued to each SPU Possible Intra MB dependencies:
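The row-per-SPU scheme can be sketched as a small scheduling simulation. This is only an illustration under stated assumptions, not the authors' scheduler: it assumes the standard H.264 intra neighbourhood (left, top-left, top, top-right), a uniform cost of one time step per MB, and round-robin assignment of rows to SPUs. The names `schedule_rows` and `total_steps` are hypothetical.

```python
def schedule_rows(width, height, num_spus):
    """Return finish[r][c]: the time step at which MB (r, c) completes.

    Assumptions (not from the slides): every MB costs one time step, row r
    runs on SPU r % num_spus, and MB (r, c) needs its left neighbour and the
    top-right neighbour in the row above (which implies top and top-left).
    """
    finish = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            ready = 0
            if c > 0:
                ready = max(ready, finish[r][c - 1])          # left neighbour
            if r > 0:
                top_right = min(c + 1, width - 1)             # clamp at frame edge
                ready = max(ready, finish[r - 1][top_right])  # row above
            if r >= num_spus:
                # The SPU must finish its previous row before starting this one.
                ready = max(ready, finish[r - num_spus][width - 1])
            finish[r][c] = ready + 1
    return finish


def total_steps(width, height, num_spus):
    """Time steps needed to decode one frame of width x height MBs."""
    return schedule_rows(width, height, num_spus)[-1][-1]
```

Under these assumptions each row trails the row above it by two MBs, so neighbouring SPUs work in a staggered wavefront. A 1080p frame corresponds to `width = 120` and `height = 68` macroblocks (1920/16 and 1080/16, rounded up).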

Functional Partitioning CBE Architecture:

FFmpeg main MB decoding loop Intra Inter

Scalable Implementation

FFmpeg Data Structure Modification • Single threaded code: monolithic data structure • Entire structure needed to decode single MB but majority is static from one MB to the next • SPU only requires applicable subset for one row of MBs • Only MB specific data replicated in SPU LS Figure 10: Data structure modifications reducing memory requirements in the local store. W is the width of the video frame in macroblocks.
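The split between static and MB-specific data can be sketched as follows. All class and field names here are hypothetical and do not reflect FFmpeg's actual decoder context; the point is only that the per-SPU working set scales with W (one row of MBs) rather than with the whole frame.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StaticContext:
    """Data that stays constant from one MB to the next (hypothetical fields)."""
    width_mbs: int    # W: frame width in macroblocks
    height_mbs: int   # frame height in macroblocks
    qp: int           # an example of a slowly changing parameter


@dataclass
class MacroblockData:
    """Per-MB working data (hypothetical fields)."""
    mb_type: int = 0
    # 4:2:0 macroblock: 16x16 luma + two 8x8 chroma blocks = 384 coefficients
    coeffs: list = field(default_factory=lambda: [0] * 384)


@dataclass
class RowWorkUnit:
    """What one SPU receives: the shared context plus one row of MB data."""
    ctx: StaticContext
    row: int
    mbs: list


def make_row_work_unit(ctx, row):
    """Build the work unit for one MB row; only the MB data is replicated."""
    return RowWorkUnit(ctx, row, [MacroblockData() for _ in range(ctx.width_mbs)])
```

For example, `make_row_work_unit(StaticContext(120, 68, 26), 3)` holds 120 `MacroblockData` entries while referring to a single shared `StaticContext`.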

SPU LS(Local Store) Limitations

Code Overlay • A code segment contains one or more functions • A memory region is assigned one or more segments • At run time, a region contains exactly one segment
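The cost of an overlay layout can be estimated by replaying a function trace and counting how often a segment has to be loaded into its region. The sketch below is a simplified model, not the Cell SDK overlay manager; `count_overlay_loads` and the mapping names are hypothetical.

```python
def count_overlay_loads(trace, segment_of, region_of):
    """Count segment loads needed to execute a trace of function activations.

    trace      -- sequence of function names in the order they become active
    segment_of -- maps function name -> segment name
    region_of  -- maps segment name -> region name
    """
    resident = {}   # region -> segment currently loaded there
    loads = 0
    for func in trace:
        segment = segment_of[func]
        region = region_of[segment]
        if resident.get(region) != segment:
            resident[region] = segment   # evicts whatever was loaded before
            loads += 1
    return loads
```

With two functions that alternate in the trace, placing both segments in one region forces a load on every activation, while giving each segment its own region needs only the two initial loads.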

Designing an Overlay Scheme • Start with one flat region • 1. Identify key functions and assign to new regions • Profiling indicates f21() is most important with 50 calls • However, f11() is present 80 times in the trace • f11() is a key function • 2. Create new regions based on profiling data until memory is exhausted
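One way to read the two steps above is as a greedy selection: rank functions by how often they appear in the trace, then give each its own region while local-store memory remains. The sketch below assumes that ranking and a simple byte budget; `plan_regions`, the size table, and the budget are hypothetical and are not taken from the authors' tooling.

```python
from collections import Counter


def plan_regions(trace, size_of, budget):
    """Greedily give the most frequently active functions their own region.

    trace   -- sequence of function names in activation order
    size_of -- maps function name -> code size in bytes (hypothetical numbers)
    budget  -- bytes available for dedicated regions
    Returns (dedicated, shared): functions that get their own region, and
    functions left to share the original flat region.
    """
    counts = Counter(trace)
    dedicated = []
    remaining = budget
    for func, _ in counts.most_common():   # most frequent first
        if size_of[func] <= remaining:
            dedicated.append(func)
            remaining -= size_of[func]
    shared = [f for f in size_of if f not in dedicated]
    return dedicated, shared
```

Ranking by presence in the trace rather than by call count alone matters because a function also becomes active again each time a callee returns to it, which is why f11() can outrank f21() in the example above.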

Designing an Overlay Scheme

Overlay Performance

Additional Performance Optimizations

Experimental Results • Source videos from Microsoft’s WMV HD demonstration page [13] • The source videos were transcoded into H.264 1920x1080 (1080p) format • 5 different bitrates: 2.5, 4, 8, 12, and 16 Mbps; CAVLC and CABAC • Encoded with the x264 H.264 encoder integrated into ffmpeg • The videos were encoded using the x264 presets: baseline, normal, and hq • Decoder performance is measured on Sony’s PlayStation 3 with a 3.2 GHz Cell processor (access limited by Sony to six of the CBE’s eight SPUs) running Fedora 9 Linux [13] Microsoft Corporation. WMV HD Content Showcase. http://www.microsoft.com/windows/windowsmedia/musicandvideo/hdvideo/contentshowcase.aspx

Motion vector decoding and deblocking are the most expensive components • The white band at the bottom is the PPU (entropy decoder) contribution Figure 14: Breakdown of decoder performance by component using a single SPU.

Decoder Performance • Compared with [4], our implementation achieves an average of 25.23 fps, a 23% improvement, when decoding similarly encoded video streams on four SPUs. [4] H. Baik, K.-H. Sihn, Y. il Kim, S. Bae, N. Han, and H. J. Song. “Analysis and Parallelization of H.264 decoder on Cell Broadband Engine Architecture.” In Signal Processing and Information Technology, pages 791–795. Samsung Electron. Co., Ltd., Suwon, Korea, 2007.
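As a quick sanity check on the comparison, the two quoted numbers imply a baseline frame rate for the prior work. The value computed below is derived arithmetic only, not a figure reported on the slide, and the helper name is hypothetical.

```python
def implied_baseline(fps, improvement):
    """Frame rate that `fps` improves on by the fraction `improvement`."""
    return fps / (1.0 + improvement)


# 25.23 fps at a 23% improvement implies a baseline of roughly 20.5 fps.
baseline = implied_baseline(25.23, 0.23)
```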

Our implementation achieved a “best case” average frame rate of 34.94 fps on 2.5 Mbps modified-normal CAVLC encoded video streams on six SPUs • And a “worst case” average frame rate of 15.43 fps, limited by the entropy decoder, on 16 Mbps hq CABAC encoded video streams.

Conclusion • Demonstrated a scalable H.264 decoder for a multicore processor • 23% frame rate advantage over prior work [4] on similar videos using the same number of cores • Careful engineering is required to manage data structures and scratchpad memory efficiently
