A Scalable Parallel H.264 Decoder on the Cell Broadband Engine Architecture


A Scalable Parallel H.264 Decoder on the Cell Broadband Engine Architecture. Michael A. Baker, Pravin Dalale, Karam S. Chatha, Sarma B. K. Vrudhula. Arizona State University. CODES+ISSS (The International Conference on Hardware-Software Codesign and System Synthesis) 2009.

A Scalable Parallel H.264 Decoder on the Cell Broadband Engine Architecture

Michael A. Baker, Pravin Dalale, Karam S. Chatha, Sarma B. K. Vrudhula

Arizona State University

CODES+ISSS (The International Conference on Hardware-Software Codesign and System Synthesis) 2009


Outline

  • Introduction and Motivation

  • Opportunities for Parallelization in H.264

  • Implementation

  • Performance Optimizations

  • Experimental Results

  • Conclusion


Motivation

  • Multicore Architectures

    • Scalability:

      more cores = more performance

  • H.264

    • Standard for video applications including High Definition (HD)

    • Computationally expensive

  • Cell Broadband Engine (CBE)

    • Common and inexpensive thanks to the PS3

    • Low-power, high-performance design gives a glimpse of future embedded architectures


IBM Cell Broadband Engine Architecture

  • 3.2 GHz

  • 9 cores, 10 threads

  • >200 GFLOPS (single precision)

  • >20 GFLOPS (double precision)

  • Up to 25 GB/s memory bandwidth

  • Up to 75 GB/s I/O bandwidth

  • >300 GB/s interconnect bus
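The single-precision figure can be sanity-checked with a back-of-the-envelope calculation. The sketch below assumes (the slide does not state this) that each SPE can issue one 4-wide single-precision fused multiply-add per cycle, which counts as 8 floating-point operations per cycle, and that only the eight SPEs are counted. The function name and its parameters are illustrative.

```python
def peak_gflops(num_spes, clock_ghz, simd_lanes=4, flops_per_lane=2):
    """Theoretical peak in GFLOPS.

    Assumption: every SPE retires one SIMD fused multiply-add per cycle,
    i.e. simd_lanes * flops_per_lane floating-point operations per cycle.
    """
    return num_spes * clock_ghz * simd_lanes * flops_per_lane


# Full Cell: eight SPEs at 3.2 GHz, consistent with the ">200 GFLOPS" bullet.
print(f"8 SPEs: {peak_gflops(8, 3.2):.1f} GFLOPS")

# A PS3 application sees six SPEs, so its peak is proportionally lower.
print(f"6 SPEs: {peak_gflops(6, 3.2):.1f} GFLOPS")
```

These are upper bounds on arithmetic throughput only; a video decoder is usually limited by memory traffic and control flow well before it reaches them.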

SPE: Synergistic Processor Element

SPU: Synergistic Processor Unit

SXU: SPU Core

LS: Local Storage

SMF: Synergistic Memory Flow Control

EIB: Element Interconnect Bus

PPE: PowerPC Processor Element

PPU: PowerPC Processor Unit

PXU: Power Processor Unit

MIC: Memory Interface Controller

BIC: Bus Interface Controller

L1: Memory Cache Internal to the CPU

L2: Memory Cache External to the CPU

http://domino.research.ibm.com/comm/research.nsf/pages/r.arch.innovation.htm


H.264 Advanced Video Coding

  • H.264 is a video compression standard

    • Version 1 completed May 2003

    • ITU-T Video Coding Experts Group (H.264)

    • ISO/IEC Moving Picture Experts Group (MPEG-4 AVC)

  • Macroblock (MB) based codec closely related to MPEG-2

  • Growing demand for HD and wireless video

  • 50% bit rate reduction over previous standard

  • Computational complexity approximately 2.4× that of MPEG-2


H.264: Decoder


Reference Code: FFmpeg (H.264 Decoder)

  • Open source video and audio converter

  • Handles a multitude of formats

  • Codecs other than H.264 decoder removed

  • About 250K Lines of Code after paring to H.264 only

  • About 200 functions ported to SPU in our implementation

http://www.ffmpeg.org/


H.264 Frame Level Relationships

  • I Frame: Independently Encoded

    • Intra Prediction

  • P Frame: Predicted from a Preceding frame

    • Intra and Inter Prediction

  • B Frame: Predicted from Both preceding and following frames

    • Intra and Inter Prediction


H.264 Opportunities for Parallelism: GOP and Frame Level

  • I, P, B Frames

  • Picture Sequence

    • IBBPBBP

  • Independent Group of Pictures (GOP)

  • Independent Frames within GOP
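Frame-level dependencies mean that decode order differs from display order. The sketch below reorders a display-order picture sequence such as IBBPBBP into one valid decode order. It assumes the classic pattern in which every B frame references only the nearest preceding and following I/P frame; H.264 permits more flexible referencing, so this is an illustration, not the decoder's actual scheduler. The function name is hypothetical.

```python
def decode_order(display):
    """Return (display_index, frame_type) pairs in a valid decode order.

    Assumption: each B frame depends only on the nearest I/P frame before it
    and the nearest I/P frame after it in display order.
    """
    order = []
    pending_b = []  # B frames waiting for their following anchor frame
    for idx, ftype in enumerate(display):
        if ftype == "B":
            pending_b.append((idx, ftype))
        else:
            # I or P frame: decode it first, then the B frames it unblocks.
            order.append((idx, ftype))
            order.extend(pending_b)
            pending_b.clear()
    # Trailing B frames would need an anchor from the next GOP.
    order.extend(pending_b)
    return order


for index, ftype in decode_order("IBBPBBP"):
    print(index, ftype)
```

Under this assumption, the B frames that sit between the same pair of anchors do not depend on each other, which is the frame-level parallelism the slide refers to.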


H.264 Opportunities for Parallelism: Slice and MB Level

  • Slices: Independently encoded groups of MBs within a frame

  • Intra Dependencies:


Data Partitioning Scheme

  • Our Scheme: One row of MBs issued to each SPU

    Possible Intra MB dependencies:
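When each row of MBs goes to a different SPU, a row can only advance as far as the row above it allows. The sketch below models that constraint, assuming the standard H.264 neighbour set (left, top-left, top, and top-right), so the MB at column c needs the row above to have finished column c+1. It also assumes one worker per row and unit decode time per MB, which is more idealised than a six-SPU PS3. All names are hypothetical.

```python
def can_decode(row, col, done, width):
    """done[r] is the number of MBs already decoded in row r, left to right."""
    if done[row] != col:
        return False  # rows are decoded strictly left to right
    if row == 0:
        return True  # the top row has no upper neighbours
    # The top-right neighbour is column col + 1 of the row above,
    # clamped at the right edge of the frame.
    needed = min(col + 2, width)
    return done[row - 1] >= needed


def wavefront_steps(width, height):
    """Lock-step time steps to decode a frame with one worker per row."""
    done = [0] * height
    steps = 0
    while done[-1] < width:
        # Decide readiness from the state at the start of the step.
        ready = [r for r in range(height)
                 if done[r] < width and can_decode(r, done[r], done, width)]
        for r in ready:
            done[r] += 1
        steps += 1
    return steps


# A 1080p frame is 120 x 68 macroblocks.
print(wavefront_steps(120, 68), "steps, versus", 120 * 68, "serial MB decodes")
```

In this model each row starts two steps after the row above it, which is why a frame of W by H macroblocks needs W + 2(H - 1) steps when workers are unlimited.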


Functional Partitioning

CBE Architecture:


FFmpeg main MB decoding loop

Intra

Inter


Scalable Implementation


FFmpeg Data Structure Modification

  • Single threaded code: monolithic data structure

  • Entire structure needed to decode single MB but majority is static from one MB to the next

  • SPU only requires applicable subset for one row of MBs

  • Only MB specific data replicated in SPU LS

Figure 10: Data structure modifications reducing memory requirements in the local store. W is the width of the video frame in macroblocks.
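A rough calculation shows why working on one row of MBs at a time suits the local store. The sketch below computes W for a 1080p frame and the raw pixel footprint of one MB row. It assumes 8-bit 4:2:0 video (a 16x16 luma block plus two 8x8 chroma blocks per MB) and a 256 KB local store per SPE; neither figure appears on the slide, and the sketch does not model the actual FFmpeg structures.

```python
MB_SIZE = 16                    # macroblock edge length in luma pixels
LOCAL_STORE_BYTES = 256 * 1024  # assumed per-SPE local store size


def mb_dimensions(width_px, height_px):
    """Frame size in macroblocks, rounding partial macroblocks up."""
    w = -(-width_px // MB_SIZE)   # ceiling division
    h = -(-height_px // MB_SIZE)
    return w, h


def row_pixel_bytes(width_mbs):
    """Raw pixel bytes for one row of MBs, assuming 8-bit 4:2:0 sampling."""
    bytes_per_mb = 16 * 16 + 2 * (8 * 8)  # luma + Cb + Cr = 384 bytes
    return width_mbs * bytes_per_mb


w, h = mb_dimensions(1920, 1080)
row = row_pixel_bytes(w)
print("W =", w, "H =", h)
print("one MB row:", row, "bytes; fits in local store:", row < LOCAL_STORE_BYTES)
print("whole frame:", row * h, "bytes; fits in local store:", row * h < LOCAL_STORE_BYTES)
```

The local store also has to hold code, stack, and DMA buffers, so the real budget for per-row data is smaller than this comparison suggests.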


SPU LS(Local Store) Limitations


Code Overlay

  • A code segment contains one or more functions

  • A memory region is assigned one or more segments

  • At run time, each region contains exactly one segment
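The three rules above can be modelled in a few lines. The sketch below is a toy simulation, not the Cell SDK overlay manager: it counts how often a segment has to be loaded into its region for a given sequence of executed functions. The function, segment, and region names are hypothetical.

```python
def count_segment_loads(trace, segment_of, region_of):
    """Count segment loads needed to execute a trace of function names.

    trace      -- function names in the order they execute
    segment_of -- maps function name -> segment name
    region_of  -- maps segment name -> region name
    """
    resident = {}  # region -> the single segment currently loaded there
    loads = 0
    for func in trace:
        segment = segment_of[func]
        region = region_of[segment]
        if resident.get(region) != segment:
            resident[region] = segment  # evicts the previous occupant
            loads += 1
    return loads


trace = ["f11", "f21", "f11", "f21"]
segment_of = {"f11": "seg_a", "f21": "seg_b"}

# Both segments share one region: every alternation forces a reload.
print(count_segment_loads(trace, segment_of, {"seg_a": "r0", "seg_b": "r0"}))

# Each segment has its own region: each is loaded once and stays resident.
print(count_segment_loads(trace, segment_of, {"seg_a": "r0", "seg_b": "r1"}))
```

The trade-off is that every extra region reserves local store permanently, so functions that alternate frequently are the ones worth separating.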


Designing an Overlay Scheme

  • Start with one flat region

    • 1. Identify key functions and assign to new regions

      • Profiling indicates f21() is most important with 50 calls

      • However, f11() is present 80 times in the trace

      • f11() is a key function

    • 2. Create new regions based on profiling data until memory is exhausted
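One way to read the two steps above is as a greedy procedure, sketched below. It assumes functions are ranked by how often they appear in the profiling trace, that each key function gets a dedicated region, and that the remaining functions take turns in one shared region sized for the largest of them. Only the counts for f11() and f21() come from the slide; the other functions, all code sizes, and the budget are invented for illustration, and the authors' actual procedure may differ.

```python
def design_regions(trace_counts, sizes, budget):
    """Greedily give the most frequently seen functions their own region.

    trace_counts -- function -> occurrences in the profiling trace
    sizes        -- function -> code size in bytes (hypothetical values)
    budget       -- bytes of local store available for overlaid code
    Returns (dedicated, shared): functions with a region of their own, and
    functions left to share a single region.
    """
    ranked = sorted(trace_counts, key=trace_counts.get, reverse=True)
    dedicated = []
    shared = list(ranked)
    for func in ranked:
        rest = [f for f in shared if f != func]
        # The shared region must be big enough for its largest function.
        shared_size = max((sizes[f] for f in rest), default=0)
        used = sum(sizes[f] for f in dedicated) + sizes[func] + shared_size
        if used > budget:
            break  # memory exhausted: stop creating regions
        dedicated.append(func)
        shared = rest
    return dedicated, shared


counts = {"f11": 80, "f21": 50, "f12": 10, "f22": 5}        # f12, f22 invented
sizes = {"f11": 4000, "f21": 6000, "f12": 3000, "f22": 2000}  # all invented
print(design_regions(counts, sizes, budget=12000))
print(design_regions(counts, sizes, budget=13000))
```

With the smaller budget only f11() receives its own region; a slightly larger budget lets f21() have one as well.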


Designing an Overlay Scheme


Overlay Performance


Additional Performance Optimizations


Experimental Results

  • Microsoft’s WMV HD demonstration page [13]

  • The source videos were transcoded into H.264 1920x1080 (1080p) format

    • 5 different bitrates: 2.5, 4, 8, 12, and 16 Mbps

    • Both CAVLC and CABAC entropy coding

    • Use the x264 H.264 encoder integrated into ffmpeg

  • The videos were encoded using the x264 presets: baseline, normal, and hq

  • Decoder performance is measured on Sony’s PlayStation 3, with a 3.2 GHz Cell processor (limited by Sony to six of the CBE’s eight SPUs), running Fedora 9 Linux

[13] Microsoft Corporation. WMV HD Content Showcase. http://www.microsoft.com/windows/windowsmedia/musicandvideo/hdvideo/contentshowcase.aspx


  • Motion vector decoding and deblocking are the most expensive components

  • The white band at the bottom is the PPU (entropy decoder) contribution

Figure 14: Breakdown of decoder performance by component using a single SPU.


Decoder Performance

Compared with [4], our implementation achieves an average of 25.23 fps, a 23% improvement, when decoding similarly encoded video streams on four SPUs.

[4] H. Baik, K.-H. Sihn, Y. il Kim, S. Bae, N. Han, and H. J. Song. “Analysis and Parallelization of H.264 decoder on Cell Broadband Engine Architecture.” In Signal Processing and Information Technology, pages 791–795. Samsung Electron. Co., Ltd., Suwon, Korea, 2007.


  • Our implementation achieved a “best case” average frame rate of 34.94 fps on 2.5 Mbps modified-normal CAVLC-encoded video streams on six SPUs

  • And a “worst case”, entropy-decoder-limited average frame rate of 15.43 fps on 16 Mbps hq CABAC-encoded video streams
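The two frame rates can be restated as a time budget per frame and a macroblock throughput, which are easier to compare against per-component costs. The sketch below uses only the reported frame rates and the 1920x1080 resolution; the helper names are hypothetical, and partial macroblocks are assumed to be rounded up to whole ones.

```python
def frame_budget_ms(fps):
    """Average wall-clock time available per frame, in milliseconds."""
    return 1000.0 / fps


def macroblocks_per_second(fps, width_px=1920, height_px=1080, mb_size=16):
    """Average macroblock throughput implied by a frame rate."""
    width_mbs = -(-width_px // mb_size)    # ceiling division
    height_mbs = -(-height_px // mb_size)
    return fps * width_mbs * height_mbs


for label, fps in (("best case", 34.94), ("worst case", 15.43)):
    print(f"{label}: {frame_budget_ms(fps):.2f} ms per frame, "
          f"{macroblocks_per_second(fps):,.0f} macroblocks per second")
```

The best case corresponds to roughly 28.6 ms per frame and the worst case to roughly 64.8 ms per frame.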


Conclusion

  • Demonstrated scalable H.264 decoder for multicore processor

  • 23% frame rate advantage over prior work [4] on similar videos using the same number of cores

  • Careful engineering required to efficiently manage data structures and scratchpad memory

