
A Scalable Parallel H.264 Decoder on the Cell Broadband Engine Architecture



A Scalable Parallel H.264 Decoder on the Cell Broadband Engine Architecture. Michael A. Baker, Pravin Dalale, Karam S. Chatha, Sarma B. K. Vrudhula, Arizona State University. CODES+ISSS (The International Conference on Hardware-Software Codesign and System Synthesis) 2009.

Presentation Transcript

A Scalable Parallel H.264 Decoder on the Cell Broadband Engine Architecture

Michael A. Baker, Pravin Dalale, Karam S. Chatha, Sarma B. K. Vrudhula

Arizona State University

CODES+ISSS (The International Conference on Hardware-Software Codesign and System Synthesis) 2009


Outline

  • Introduction and Motivation

  • Opportunities for Parallelization in H.264

  • Implementation

  • Performance Optimizations

  • Experimental Results

  • Conclusion


Motivation

  • Multicore Architectures

    • Scalability:

      more cores = more performance

  • H.264

    • Standard for video applications including High Definition (HD)

    • Computationally expensive

  • Cell Broadband Engine (CBE)

    • Common and inexpensive thanks to the PS3

    • Low-power, high-performance design gives a glimpse of future embedded architectures


IBM Cell Broadband Engine Architecture

  • 3.2 GHz

  • 9 cores, 10 threads

  • >200 Gflops (single precision)

  • >20 Gflops (double precision)

  • Up to 25 GB/s memory bandwidth

  • Up to 75 GB/s I/O bandwidth

  • >300 GB/s interconnect bus

SPE: Synergistic Processor Element

SPU: Synergistic Processor Unit

SXU: SPU Core

LS: Local Storage

SMF: Synergistic Memory Flow Control

EIB: Element Interconnect Bus

PPE: PowerPC Processor Element

PPU: PowerPC Processor Unit

PXU: Power Processor Unit

MIC: Memory Interface Controller

BIC: Bus Interface Controller

L1: Memory Cache Internal to the CPU

L2: Memory Cache External to the CPU

http://domino.research.ibm.com/comm/research.nsf/pages/r.arch.innovation.htm


H.264 Advanced Video Coding

  • H.264 is a video compression standard

    • Version 1 completed May 2003

    • ITU-T Video Coding Experts Group (H.264)

    • ISO/IEC Moving Picture Experts Group (MPEG-4 AVC)

  • Macroblock (MB) based codec closely related to MPEG-2

  • Growing demand for HD and Wireless video

  • 50% bit rate reduction over previous standard

  • Computational complexity approximately 2.4× that of MPEG-2


H.264: Decoder


Reference Code: FFmpeg (H.264 Decoder)

  • Open source video and audio converter

  • Handles a multitude of formats

  • All codecs other than the H.264 decoder were removed

  • About 250K lines of code remain after paring down to H.264 only

  • About 200 functions ported to SPU in our implementation

http://www.ffmpeg.org/


H.264 Frame Level Relationships

  • I Frame: Independently Encoded

    • Intra Prediction

  • P Frame: Predicted from a Preceding frame

    • Intra and Inter Prediction

  • B Frame: Predicted from Both preceding and following frames

    • Intra and Inter Prediction


H.264 Opportunities for Parallelism: GOP and Frame Level

  • I, P, B Frames

  • Picture Sequence

    • IBBPBBP

  • Independent Group of Pictures (GOP)

  • Independent Frames within GOP
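
A minimal Python sketch of GOP-level partitioning, assuming closed GOPs that each begin with an I frame. The function name `split_into_gops` and the two-GOP test sequence are illustrative assumptions, not taken from the slides.

```python
def split_into_gops(frame_types: str) -> list[str]:
    """Split a frame-type string into GOPs, starting a new GOP at every I frame."""
    gops: list[str] = []
    for frame_type in frame_types:
        if frame_type == "I" or not gops:
            gops.append(frame_type)   # an I frame opens a new group
        else:
            gops[-1] += frame_type    # P and B frames join the current group
    return gops


# Two repetitions of the IBBPBBP picture sequence from the slide.
sequence = "IBBPBBPIBBPBBP"
gops = split_into_gops(sequence)
```

Under the closed-GOP assumption, each string in `gops` refers only to frames inside itself, so the groups could be handed to different workers.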


H.264 Opportunities for Parallelism: Slice and MB Level

  • Slices: Independently encoded groups of MBs within a frame

  • Intra Dependencies:


Data Partitioning Scheme

  • Our Scheme: One row of MBs issued to each SPU

    Possible Intra MB dependencies:
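
A minimal Python sketch of how row-level partitioning interacts with intra dependencies, assuming the usual H.264 neighbor set (left, top-left, top, top-right). The function name `wavefront_schedule`, the one-time-unit-per-MB cost, and the unlimited number of row workers are simplifying assumptions, not details from the slides.

```python
def wavefront_schedule(width: int, height: int) -> dict[tuple[int, int], int]:
    """Earliest time step at which each MB (x, y) can be decoded.

    Assumes every MB takes one time step and one worker is available per row.
    """
    step: dict[tuple[int, int], int] = {}
    for y in range(height):
        for x in range(width):
            # Neighbors that must be finished first: left, top-left, top, top-right.
            deps = [(x - 1, y), (x - 1, y - 1), (x, y - 1), (x + 1, y - 1)]
            ready = [step[d] for d in deps if d in step]
            step[(x, y)] = max(ready, default=-1) + 1
    return step


schedule = wavefront_schedule(width=4, height=3)
total_steps = max(schedule.values()) + 1
```

In this model each row trails the row above it by two MBs, so a 4×3 grid finishes in 8 steps rather than the 12 a serial decoder would need.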


Functional Partitioning

CBE Architecture:


FFmpeg main MB decoding loop

Intra

Inter


Scalable Implementation


FFmpeg Data Structure Modification

  • Single threaded code: monolithic data structure

  • Entire structure needed to decode single MB but majority is static from one MB to the next

  • SPU only requires applicable subset for one row of MBs

  • Only MB specific data replicated in SPU LS

Figure 10: Data structure modifications reducing memory requirements in the local store. W is the width of the video frame in macroblocks.
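
A minimal Python sketch of the split described above: static fields are shared, and only per-MB data is replicated for one row of W macroblocks. All class and field names here are hypothetical and do not correspond to FFmpeg's actual structures. The 120×68 MB frame size assumes 1080p video with 16×16 macroblocks, and 384 is the number of coefficients in one 4:2:0 macroblock.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StaticContext:
    """Fields that do not change from one MB to the next (hypothetical names)."""
    width_mbs: int
    height_mbs: int
    qp_table_id: int


@dataclass
class MBData:
    """Per-macroblock state (hypothetical names)."""
    mb_type: int = 0
    # 16x16 luma + two 8x8 chroma blocks = 384 coefficients (4:2:0 sampling)
    coeffs: list[int] = field(default_factory=lambda: [0] * 384)


@dataclass
class RowContext:
    """What one SPU would keep in its local store for a single row of MBs."""
    static: StaticContext
    row: int
    mbs: list[MBData]


def make_row_context(static: StaticContext, row: int) -> RowContext:
    # Only the MB-specific part is allocated per row; the static part is shared.
    return RowContext(static, row, [MBData() for _ in range(static.width_mbs)])


static = StaticContext(width_mbs=120, height_mbs=68, qp_table_id=0)
ctx = make_row_context(static, row=5)
```

The per-row footprint in this sketch grows with W rather than with the full frame, which is the property Figure 10 refers to.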


SPU LS (Local Store) Limitations


Code Overlay

  • A code segment contains one or more functions

  • A memory region is assigned one or more segments

  • At run time, region contains exactly one segment
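
A minimal Python simulation of one overlay region, assuming a segment is swapped in whenever a function from a non-resident segment is called. The class `OverlayRegion`, the segment names, and the function names are hypothetical. This is a model of the bookkeeping only, not the Cell SDK overlay manager.

```python
class OverlayRegion:
    """A local-store region that holds exactly one code segment at a time."""

    def __init__(self, segments: dict[str, set[str]]):
        self.segments = segments   # segment name -> functions it contains
        self.resident = None       # segment currently loaded in the region
        self.loads = 0             # number of segment loads so far

    def call(self, func: str) -> None:
        # Find the segment that contains the called function.
        seg = next(name for name, funcs in self.segments.items() if func in funcs)
        if seg != self.resident:
            # The segment is not resident: swap it in (one load).
            self.resident = seg
            self.loads += 1


region = OverlayRegion({"seg_a": {"f1", "f2"}, "seg_b": {"f3"}})
for f in ["f1", "f2", "f3", "f1", "f1"]:
    region.call(f)
```

Calls that alternate between segments of the same region force repeated loads, which is why the assignment of functions to segments and regions matters.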


Designing an Overlay Scheme

  • Start with one flat region

    • 1. Identify key functions and assign to new regions

      • Profiling indicates f21() is most important with 50 calls

      • However, f11() is present 80 times in the trace

      • f11() is a key function

    • 2. Create new regions based on profiling data until memory is exhausted
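
A minimal Python sketch of one possible reading of step 1: a function that is called rarely can still appear most often in the execution trace, because it regains control every time one of its callees returns. The trace and call counts below are hypothetical and much smaller than the 50 and 80 quoted on the slide.

```python
from collections import Counter

# Hypothetical execution trace: f11 is called once, then calls f21 three times
# and regains control after each return.
trace = ["f11", "f21", "f11", "f21", "f11", "f21", "f11"]

# Hypothetical call counts, as a profiler would report them.
calls = {"f11": 1, "f21": 3}

# How often each function is present in the trace.
presence = Counter(trace)

most_called = max(calls, key=calls.get)
key_function = max(presence, key=presence.get)
```

Here the profile ranks `f21` first, while trace presence picks `f11`, mirroring the distinction the slide draws.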


Designing an Overlay Scheme


Overlay Performance



Experimental Results

  • Source videos were taken from Microsoft’s WMV HD demonstration page [13]

  • The source videos were transcoded into H.264 1920x1080 (1080p) format

    • 5 different bitrates: 2.5, 4, 8, 12, 16 Mbps; CAVLC and CABAC

    • Use the x264 H.264 encoder integrated into ffmpeg

  • The videos were encoded using the x264 presets: baseline, normal, and hq

  • Decoder performance is measured on Sony’s PlayStation 3 with a 3.2 GHz Cell processor (Sony limits access to six of the CBE’s eight SPUs), running Fedora 9 Linux

[13] Microsoft Corporation. WMV HD Content Showcase. http://www.microsoft.com/windows/windowsmedia/musicandvideo/hdvideo/contentshowcase.aspx



Figure 14: Breakdown of decoder performance by component using a single SPU.


Decoder Performance

Compared with [4], our implementation achieves an average of 25.23 fps, a 23% improvement, when decoding similarly encoded video streams on four SPUs.

[4] H. Baik, K.-H. Sihn, Y. il Kim, S. Bae, N. Han, and H. J. Song. “Analysis and Parallelization of H.264 decoder on Cell Broadband Engine Architecture.” In Signal Processing and Information Technology, pages 791–795. Samsung Electron. Co., Ltd., Suwon, Korea, 2007.




Conclusion

  • Frame rate of 34.94 fps on 2.5 Mbps modified-normal CAVLC-encoded video streams on six SPUs

  • Demonstrated scalable H.264 decoder for multicore processor

  • 23% frame rate advantage over prior work [4] on similar videos and using same number of cores

  • Careful engineering required to efficiently manage data structures and scratchpad memory

