A Scalable Parallel H.264 Decoder on the Cell Broadband Engine Architecture

Michael A. Baker, Pravin Dalale, Karam S. Chatha, Sarma B. K. Vrudhula

Arizona State University

CODES+ISSS (The International Conference on Hardware-Software Codesign and System Synthesis) 2009


Outline

  • Introduction and Motivation

  • Opportunities for Parallelization in H.264

  • Implementation

  • Performance Optimizations

  • Experimental Results

  • Conclusion


Introduction and Motivation

  • Multicore Architectures

    • Scalability:

      more cores = more performance

  • H.264

    • Standard for video applications including High Definition (HD)

    • Computationally expensive

  • Cell Broadband Engine (CBE)

    • Common and inexpensive thanks to the PS3

    • Low-power, high-performance design gives a glimpse of future embedded architectures

IBM Cell Broadband Engine Architecture

  • 3.2 GHz

  • 9 cores, 10 threads

  • >200 GFLOPS (single precision)

  • >20 GFLOPS (double precision)

  • Up to 25 GB/s memory bandwidth

  • Up to 75 GB/s I/O bandwidth

  • >300 GB/s interconnect bus
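The single-precision figure above can be sanity-checked with a back-of-the-envelope calculation. The sketch below is illustrative only: the per-SPE numbers (a 4-wide SIMD unit issuing one fused multiply-add per lane per cycle, so 2 flops per lane) are assumptions based on the commonly cited SPE design, not values taken from the slides, and `peak_sp_gflops` is a hypothetical helper name.

```c
#include <assert.h>

/* Peak single-precision GFLOPS for a group of identical SIMD cores.
 * Assumed inputs (not from the slides): 4 SIMD lanes per SPE and
 * 2 flops per lane per cycle (one fused multiply-add). */
double peak_sp_gflops(int cores, int simd_lanes, int flops_per_lane, double ghz)
{
    assert(cores > 0 && simd_lanes > 0 && flops_per_lane > 0 && ghz > 0.0);
    return cores * simd_lanes * flops_per_lane * ghz;
}
```

Under these assumptions, `peak_sp_gflops(8, 4, 2, 3.2)` gives 204.8 for the eight SPEs alone, which is consistent with the ">200 GFLOPS" bullet. With the six SPEs available on the PS3, the same formula gives 153.6.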

SPE: Synergistic Processor Element

SPU: Synergistic Processor Unit


LS: Local Storage

SMF: Synergistic Memory Flow Control

EIB: Element Interconnect Bus

PPE: PowerPC Processor Element

PPU: PowerPC Processor Unit

PXU: Power Processor Unit

MIC: Memory Interface Controller

BIC: Bus Interface Controller

L1: Memory Cache Internal to the CPU

L2: Memory Cache External to the CPU


H.264 Advanced Video Coding

  • H.264 is a video compression standard

    • Version 1 completed May 2003

    • ITU-T Video Coding Experts Group (H.264)

    • ISO/IEC Moving Picture Experts Group (MPEG-4 AVC)

  • Macroblock (MB) based codec closely related to MPEG-2

  • Growing demand for HD and Wireless video

  • 50% bit rate reduction over previous standard

  • Computational complexity approximately 2.4 times that of MPEG-2

H.264: Decoder

Reference Code: FFmpeg (H.264 Decoder)

  • Open source video and audio converter

  • Handles a multitude of formats

  • Codecs other than H.264 decoder removed

  • About 250K Lines of Code after paring to H.264 only

  • About 200 functions ported to SPU in our implementation


H.264 Frame Level Relationships

  • I Frame: Independently Encoded

    • Intra Prediction

  • P Frame: Predicted from a Preceding frame

    • Intra and Inter Prediction

  • B Frame: Predicted from Both preceding and following frames

    • Intra and Inter Prediction

H.264 Opportunities for Parallelism: GOP and Frame Level

  • I, P, B Frames

  • Picture Sequence


  • Independent Group of Pictures (GOP)

  • Independent Frames within GOP
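GOP-level parallelism amounts to cutting the picture sequence at GOP boundaries and handing each group to a different worker. The sketch below shows only the cutting step. It assumes, for illustration, that every 'I' in a frame-type string opens a closed GOP that references nothing outside itself; in a real H.264 stream only some I frames (IDR pictures) guarantee this. `split_gops` is a hypothetical helper, not part of FFmpeg.

```c
#include <assert.h>
#include <stddef.h>

/* Records the start index of each GOP in a frame-type string such as
 * "IBBPBBPIBBP". Assumption: every 'I' begins an independent (closed) GOP.
 * At most max_gops start indices are stored; the return value is the
 * total number of GOPs found, which may exceed max_gops. */
size_t split_gops(const char *types, size_t *starts, size_t max_gops)
{
    assert(types != NULL && starts != NULL);
    size_t n = 0;
    for (size_t i = 0; types[i] != '\0'; i++) {
        if (types[i] == 'I') {
            if (n < max_gops)
                starts[n] = i;   /* frame i opens GOP number n */
            n++;
        }
    }
    return n;
}
```

For the sequence "IBBPBBPIBBPIPP" this reports three GOPs starting at frames 0, 7, and 11, each of which could in principle be decoded independently.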

H.264 Opportunities for Parallelism: Slice and MB Level

  • Slices: Independently encoded groups of MBs within a frame

  • Intra Dependencies:

Data Partitioning Scheme

  • Our Scheme: One row of MBs issued to each SPU

    Possible Intra MB dependencies:
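The row-per-SPU scheme can be sketched as a small lock-step simulation. The dependency figure is not preserved in this transcript, so the sketch assumes the usual H.264 neighbour set: an MB needs its left neighbour plus the up-left, up, and up-right MBs of the row above. The round-robin assignment of rows to SPUs, the one-MB-per-slot timing, and the names `mb_ready` and `wavefront_slots` are likewise illustrative assumptions rather than the authors' implementation.

```c
#include <assert.h>
#include <string.h>

#define MAX_ROWS 128
#define MAX_SPUS 16

/* done[r] is the number of MBs already decoded in row r.
 * Assumed dependencies: left, up-left, up, and up-right neighbours. */
int mb_ready(const int *done, int row, int x, int width)
{
    if (x != done[row]) return 0;                 /* must be next in its own row */
    if (row == 0) return 1;                       /* top row has no upper neighbours */
    int need = (x + 2 < width) ? x + 2 : width;   /* row above done through up-right */
    return done[row - 1] >= need;
}

/* Counts the time slots needed to decode a width x height frame (in MBs)
 * when each SPU decodes whole rows, taken round-robin, one MB per slot. */
int wavefront_slots(int width, int height, int num_spus)
{
    assert(width > 0 && height > 0 && height <= MAX_ROWS);
    assert(num_spus > 0 && num_spus <= MAX_SPUS);

    int done[MAX_ROWS] = {0};
    int snap[MAX_ROWS];
    int cur[MAX_SPUS];
    int slots = 0;
    int remaining = width * height;

    for (int s = 0; s < num_spus; s++)
        cur[s] = s;                               /* first row of each SPU */

    while (remaining > 0) {
        memcpy(snap, done, sizeof done);          /* progress at start of slot */
        for (int s = 0; s < num_spus; s++) {
            int r = cur[s];
            if (r >= height) continue;            /* this SPU has no rows left */
            if (mb_ready(snap, r, snap[r], width)) {
                done[r]++;
                remaining--;
                if (done[r] == width)
                    cur[s] = r + num_spus;        /* move on to the next row */
            }
        }
        slots++;
    }
    return slots;
}
```

In this model a 4x2 MB frame takes 8 slots on one SPU and 6 slots on two: the second row must trail the first by two MBs because of the up-right dependency, which is what limits the speedup of row-level partitioning.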

Functional Partitioning

CBE Architecture:

FFmpeg main MB decoding loop



Scalable Implementation

FFmpeg Data Structure Modification

  • Single threaded code: monolithic data structure

  • Entire structure needed to decode a single MB, but the majority is static from one MB to the next

  • SPU only requires applicable subset for one row of MBs

  • Only MB specific data replicated in SPU LS

Figure 10: Data structure modifications reducing memory requirements in the local store. W is the width of the video frame in macroblocks.
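A minimal sketch of the idea follows: per-MB working data is split out of the monolithic context so that only W copies of the small part need to live in the local store. Every struct, field, and function name here is hypothetical and chosen for illustration; none of them are FFmpeg's actual definitions. The 256 KB local store size is the commonly documented SPE figure and is not stated on the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LOCAL_STORE_BYTES (256 * 1024)   /* assumed SPE local store size */

/* Hypothetical per-MB data: the part that changes from one MB to the next. */
typedef struct {
    int16_t mv[16][2];            /* motion vectors per 4x4 block */
    int8_t  ref_idx[4];           /* reference indices per 8x8 block */
    uint8_t intra_pred_mode[16];  /* intra 4x4 prediction modes */
    uint8_t non_zero_count[24];   /* coefficient counts (luma + chroma) */
} MBData;

/* Bytes needed in the local store for one row of W macroblocks. */
size_t row_buffer_bytes(int width_mbs)
{
    assert(width_mbs > 0);
    return (size_t)width_mbs * sizeof(MBData);
}

/* Returns 1 if code, stack, and the buffered MB rows fit in the local store. */
int row_fits(int width_mbs, size_t code_and_stack_bytes, int buffered_rows)
{
    assert(buffered_rows > 0);
    size_t data = row_buffer_bytes(width_mbs) * (size_t)buffered_rows;
    return code_and_stack_bytes + data <= LOCAL_STORE_BYTES;
}
```

For a 1080p frame, W is 120 macroblocks. With this toy layout, two buffered rows cost roughly 25 KB, so `row_fits(120, 200 * 1024, 2)` succeeds, while a 250 KB code-and-stack footprint leaves no room for them.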

SPU LS (Local Store) Limitations

Code Overlay

  • Code segment contains one or more functions

  • Memory region assigned one or more segments

  • At run time, region contains exactly one segment
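The three rules above can be modelled with a small bookkeeping structure. This is a simulation of the concept only, not the overlay manager that a Cell toolchain would generate; `OverlayState`, `overlay_init`, and `overlay_touch` are hypothetical names, and the number of regions is an arbitrary choice.

```c
#include <assert.h>

#define NUM_REGIONS 4        /* arbitrary number of regions for the sketch */
#define NO_SEGMENT  (-1)

typedef struct {
    int resident[NUM_REGIONS];   /* segment currently loaded in each region */
    int loads;                   /* segment loads performed so far */
} OverlayState;

void overlay_init(OverlayState *st)
{
    for (int r = 0; r < NUM_REGIONS; r++)
        st->resident[r] = NO_SEGMENT;
    st->loads = 0;
}

/* Call before running a function that lives in `segment`, where that
 * segment is assigned to `region`. Returns 1 if the segment had to be
 * loaded (replacing whatever the region held), 0 if already resident. */
int overlay_touch(OverlayState *st, int region, int segment)
{
    assert(region >= 0 && region < NUM_REGIONS);
    if (st->resident[region] == segment)
        return 0;
    st->resident[region] = segment;   /* a region holds exactly one segment */
    st->loads++;
    return 1;
}
```

Calling functions from two segments that share a region in alternation forces a load on every call, which is why the assignment of segments to regions matters for performance.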

Designing an Overlay Scheme

  • Start with one flat region

    • 1. Identify key functions and assign to new regions

      • Profiling indicates f21() is most important with 50 calls

      • However, f11() is present 80 times in the trace

      • f11() is a key function

    • 2. Create new regions based on profiling data until memory is exhausted
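One way to read the two steps above is as a greedy assignment: rank functions by how often they appear in the execution trace, then give the top-ranked ones their own regions while the memory budget allows. The sketch below is an interpretation under stated assumptions, not the authors' tool: each function is treated as its own segment, all functions left over share region 0 (whose size is that of its largest segment), and the sizes in the usage note are invented. Only the counts for f11() and f21() come from the slide.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    const char *name;
    int trace_count;   /* occurrences in the execution trace */
    int size;          /* code size in bytes (hypothetical values) */
    int region;        /* output: 0 = shared region, >0 = dedicated region */
} Func;

static int by_count_desc(const void *a, const void *b)
{
    const Func *fa = a, *fb = b;
    return (fb->trace_count > fa->trace_count) - (fb->trace_count < fa->trace_count);
}

/* Size of the largest segment among f[from..n-1]. */
static int largest_from(const Func *f, int from, int n)
{
    int best = 0;
    for (int i = from; i < n; i++)
        if (f[i].size > best) best = f[i].size;
    return best;
}

/* Sorts funcs by trace count and promotes them to dedicated regions until
 * the budget would be exceeded. Returns the number of dedicated regions. */
int assign_regions(Func *funcs, int n, int budget)
{
    assert(n > 0 && largest_from(funcs, 0, n) <= budget);
    qsort(funcs, (size_t)n, sizeof *funcs, by_count_desc);

    int used = 0, next_region = 1;
    for (int i = 0; i < n; i++)
        funcs[i].region = 0;                           /* start with one flat region */

    for (int i = 0; i < n; i++) {
        int shared = largest_from(funcs, i + 1, n);    /* shared region must still fit */
        if (used + funcs[i].size + shared > budget)
            break;                                     /* memory exhausted */
        funcs[i].region = next_region++;
        used += funcs[i].size;
    }
    return next_region - 1;
}
```

With f21 (50 occurrences, 300 bytes), f11 (80, 200), f31 (5, 400), and f32 (3, 100) and a 950-byte budget, f11 is promoted first, f21 second, and the remaining two functions stay in the shared region. This matches the slide's point that f11(), not f21(), is the key function once trace presence is taken into account.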

Designing an Overlay Scheme

Overlay Performance

Additional Performance Optimizations

Experimental Results

  • Source videos: Microsoft’s WMV HD demonstration page [13]

  • The source videos were transcoded into H.264 1920x1080 (1080p) format

    • 5 different bitrates: 2.5, 4, 8, 12, and 16 Mbps

    • CAVLC and CABAC entropy coding

    • Transcoded using the x264 H.264 encoder integrated into FFmpeg

  • The videos were encoded using the x264 presets: baseline, normal, and hq

  • Decoder performance is measured on Sony’s PlayStation 3 with a 3.2 GHz Cell processor (access limited by Sony to six of the CBE’s eight SPUs) running Fedora 9 Linux

[13] Microsoft Corporation. WMV HD Content Showcase. http://www.microsoft.com/windows/windowsmedia/musicandvideo/hdvideo/contentshowcase.aspx

  • Motion vector decoding and deblocking are the most expensive components

  • The white band at the bottom is the PPU (entropy decoder) contribution

Figure 14: Breakdown of decoder performance by component using a single SPU.

Decoder Performance

Compared with [4], our implementation achieves an average of 25.23 fps, a 23% improvement, when decoding similarly encoded video streams on four SPUs.

[4] H. Baik, K.-H. Sihn, Y. il Kim, S. Bae, N. Han, and H. J. Song. “Analysis and Parallelization of H.264 decoder on Cell Broadband Engine Architecture.” In Signal Processing and Information Technology, pages 791–795. Samsung Electron. Co., Ltd., Suwon, Korea, 2007.

  • Our implementation achieved a “best case” average frame rate of 34.94 fps on 2.5 Mbps modified-normal CAVLC-encoded video streams on six SPUs

  • The “worst case”, limited by the entropy decoder, was an average frame rate of 15.43 fps on 16 Mbps hq CABAC-encoded video streams


Conclusion

  • Demonstrated a scalable H.264 decoder for multicore processors

  • 23% frame rate advantage over prior work [4] on similar videos, using the same number of cores

  • Careful engineering required to efficiently manage data structures and scratchpad memory
