A Scalable Parallel H.264 Decoder on the Cell Broadband Engine Architecture


A Scalable Parallel H.264 Decoder on the Cell Broadband Engine Architecture. Michael A. Baker, Pravin Dalale, Karam S. Chatha, Sarma B. K. Vrudhula, Arizona State University. CODES+ISSS (The International Conference on Hardware-Software Codesign and System Synthesis) 2009.

A Scalable Parallel H.264 Decoder on the Cell Broadband Engine Architecture

Michael A. Baker, Pravin Dalale, Karam S. Chatha, Sarma B. K. Vrudhula

Arizona State University

CODES+ISSS (The International Conference on Hardware-Software Codesign and System Synthesis) 2009

Outline
  • Introduction and Motivation
  • Opportunities for Parallelization in H.264
  • Implementation
  • Performance Optimizations
  • Experimental Results
  • Conclusion
Motivation
  • Multicore Architectures
    • Scalability: more cores = more performance

  • H.264
    • Standard for video applications including High Definition (HD)
    • Computationally expensive
  • Cell Broadband Engine (CBE)
    • Common and inexpensive thanks to PS3
    • Low-power, high-performance design gives a glimpse of future embedded architectures
IBM Cell Broadband Engine Architecture
  • 3.2 GHz
  • 9 cores, 10 threads
  • >200 Gflops (single precision)
  • >20 Gflops (double precision)
  • Up to 25 GB/s memory bandwidth
  • Up to 75 GB/s I/O bandwidth
  • >300 GB/s interconnect bus

SPE: Synergistic Processor Element

SPU: Synergistic Processor Unit

SXU: SPU Core

LS: Local Storage

SMF: Synergistic Memory Flow Control

EIB: Element Interconnect Bus

PPE: PowerPC Processor Element

PPU: PowerPC Processor Unit

PXU: Power Processor Unit

MIC: Memory Interface Controller

BIC: Bus Interface Controller

L1: Memory Cache Internal to the CPU

L2: Memory Cache External to the CPU

http://domino.research.ibm.com/comm/research.nsf/pages/r.arch.innovation.htm

H.264 Advanced Video Coding
  • H.264 is a video compression standard
    • Version 1 completed May 2003
    • ITU-T Video Coding Experts Group (H.264)
    • ISO/IEC Moving Picture Experts Group (MPEG-4 AVC)
  • Macroblock (MB) based codec closely related to MPEG-2
  • Growing demand for HD and Wireless video
  • 50% bit rate reduction over previous standard
  • Computational complexity approximately 2.4× that of MPEG-2
Reference Code: FFmpeg (H.264 Decoder)
  • Open source video and audio converter
  • Handles a multitude of formats
  • Codecs other than H.264 decoder removed
  • About 250K Lines of Code after paring to H.264 only
  • About 200 functions ported to SPU in our implementation

http://www.ffmpeg.org/

H.264 Frame Level Relationships
  • I Frame: Independently Encoded
    • Intra Prediction
  • P Frame: Predicted from a Preceding frame
    • Intra and Inter Prediction
  • B Frame: Predicted from Both preceding and following frames
    • Intra and Inter Prediction
H.264 Opportunities for Parallelism: GOP and Frame Level
  • I, P, B Frames
  • Picture Sequence
    • IBBPBBP
  • Independent Group of Pictures (GOP)
  • Independent Frames within GOP
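The frame-level opportunity can be sketched with a small dependency model. This is a simplified illustration, not the authors' scheduler: it assumes each P frame references only the nearest preceding I or P frame and each B frame references only its two surrounding I/P frames, whereas H.264 allows more flexible references. The function names are hypothetical.

```python
def frame_dependencies(sequence):
    """Map each frame index to the anchor (I/P) frames it references.

    Simplified model: P uses the previous anchor, B uses the previous
    and next anchors, I uses nothing.
    """
    anchors = [i for i, t in enumerate(sequence) if t in "IP"]
    deps = {}
    for i, t in enumerate(sequence):
        prev_anchor = max((a for a in anchors if a < i), default=None)
        next_anchor = min((a for a in anchors if a > i), default=None)
        if t == "I":
            wanted = []
        elif t == "P":
            wanted = [prev_anchor]
        else:  # B frame
            wanted = [prev_anchor, next_anchor]
        deps[i] = [a for a in wanted if a is not None]
    return deps


def decode_waves(sequence):
    """Group frames into waves; frames in one wave have no mutual dependencies."""
    deps = frame_dependencies(sequence)
    done, waves = set(), []
    while len(done) < len(sequence):
        ready = [i for i in range(len(sequence))
                 if i not in done and all(d in done for d in deps[i])]
        waves.append(ready)
        done.update(ready)
    return waves


print(decode_waves("IBBPBBP"))
```

For the picture sequence IBBPBBP, the two B frames between a pair of anchors land in the same wave, which is the independence within a GOP that the slide refers to.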
H.264 Opportunities for Parallelism: Slice and MB Level
  • Slices: Independently encoded groups of MBs within a frame
  • Intra Dependencies:
Data Partitioning Scheme
  • Our Scheme: One row of MBs issued to each SPU

Possible Intra MB dependencies:
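The dependency figure for this slide did not survive, so the following is a minimal sketch under stated assumptions rather than the authors' schedule. It assumes the usual H.264 neighbour set (left, upper-left, upper, upper-right), round-robin assignment of MB rows to SPUs, and one time unit per macroblock. The function name is hypothetical.

```python
def wavefront_schedule(rows, cols, num_spus):
    """Return {(row, col): finish_time} for row-per-SPU decoding.

    A macroblock starts once its neighbours have finished and the SPU
    that owns its row is free. Every macroblock costs one time unit.
    """
    finish = {}
    spu_free = [0] * num_spus  # time at which each SPU becomes idle
    for r in range(rows):
        spu = r % num_spus  # assumed round-robin row assignment
        for c in range(cols):
            neighbours = [(r, c - 1), (r - 1, c - 1), (r - 1, c), (r - 1, c + 1)]
            ready = max((finish[n] for n in neighbours if n in finish), default=0)
            start = max(ready, spu_free[spu])
            finish[(r, c)] = start + 1
            spu_free[spu] = start + 1
    return finish


serial = max(wavefront_schedule(2, 4, 1).values())
parallel = max(wavefront_schedule(2, 4, 2).values())
print(serial, parallel)  # 8 6
```

Under these assumptions each row trails the row above it by two macroblocks, because of the upper-right dependency, so adding SPUs shortens the schedule without making it perfectly linear.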

Functional Partitioning

CBE Architecture:

FFmpeg Data Structure Modification
  • Single threaded code: monolithic data structure
  • Entire structure needed to decode single MB but majority is static from one MB to the next
  • SPU only requires applicable subset for one row of MBs
  • Only MB specific data replicated in SPU LS

Figure 10: Data structure modifications reducing memory requirements in the local store. W is the width of the video frame in macroblocks.
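To make W concrete, the arithmetic below computes the frame size in macroblocks for the 1080p streams used later in the experiments and the size of a per-row buffer. It is only a sketch: the 512 bytes per macroblock is a made-up illustrative value, not a figure from the presentation, and the helper names are hypothetical.

```python
import math

MB_SIZE = 16                    # an H.264 macroblock covers 16x16 luma samples
LOCAL_STORE_BYTES = 256 * 1024  # each SPE local store holds 256 KB


def frame_size_in_macroblocks(width_px, height_px):
    """Return (W, H) in macroblocks, rounding partial macroblocks up."""
    return math.ceil(width_px / MB_SIZE), math.ceil(height_px / MB_SIZE)


def row_buffer_bytes(width_mbs, bytes_per_mb):
    """Size of a buffer holding MB-specific data for one row of macroblocks."""
    return width_mbs * bytes_per_mb


w, h = frame_size_in_macroblocks(1920, 1080)
print(w, h)  # 120 68
# bytes_per_mb=512 is a hypothetical value chosen only for illustration.
print(row_buffer_bytes(w, 512), "of", LOCAL_STORE_BYTES, "bytes")
```

Because the per-row data scales with W rather than with the whole frame, it can share the local store with code and working buffers.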

Code Overlay
  • A code segment contains one or more functions
  • A memory region is assigned one or more segments
  • At run time, region contains exactly one segment
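The three rules above can be modelled with a toy overlay manager. This is a simulation for illustration, not the SPU toolchain's overlay mechanism: the class and its fields are hypothetical, and a dictionary update stands in for transferring code from main memory into the local store.

```python
class OverlayManager:
    """Toy model: each region holds exactly one of its assigned segments."""

    def __init__(self, region_of_segment):
        self.region_of_segment = dict(region_of_segment)  # segment -> region
        self.resident = {}  # region -> segment currently loaded
        self.loads = 0      # number of segment loads performed

    def call(self, segment):
        """Ensure the segment is resident before one of its functions runs."""
        region = self.region_of_segment[segment]
        if self.resident.get(region) != segment:
            # Stand-in for copying the segment's code into the region.
            self.resident[region] = segment
            self.loads += 1


# Segments A and B share region 0; segment C has region 1 to itself.
manager = OverlayManager({"A": 0, "B": 0, "C": 1})
for segment in ["A", "B", "A", "C", "C", "A"]:
    manager.call(segment)
print(manager.loads)  # 4
```

Segments that share a region evict each other, so alternating between A and B costs a load each time, while C stays resident once loaded.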
Designing an Overlay Scheme
  • Start with one flat region
    • 1. Identify key functions and assign to new regions
      • Profiling indicates f21() is most important with 50 calls
      • However, f11() is present 80 times in the trace
      • f11() is a key function
    • 2. Create new regions based on profiling data until memory is exhausted
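One way to read the two steps above is as a greedy procedure: rank functions by how often they appear in the execution trace and give the highest-ranked ones their own regions until the memory budget is used up. The sketch below follows that reading only; it is not the authors' algorithm. The function sizes, the budget, and the name f31() are invented for illustration, and the footprint model (dedicated regions plus one shared region sized to its largest function) is an assumption.

```python
from collections import Counter


def design_overlay(trace, sizes, memory_budget):
    """Greedily promote the most frequently traced functions to own regions.

    Returns (dedicated, shared): functions with a region of their own, and
    functions left to share the original flat region.
    """
    presence = Counter(trace)
    ranked = sorted(sizes, key=lambda f: (-presence[f], f))

    def footprint(dedicated, shared):
        # Dedicated regions add up; the shared region needs only enough
        # room for its largest function.
        return (sum(sizes[f] for f in dedicated)
                + max((sizes[f] for f in shared), default=0))

    dedicated, shared = [], list(ranked)
    for f in ranked:
        trial_dedicated = dedicated + [f]
        trial_shared = [g for g in shared if g != f]
        if footprint(trial_dedicated, trial_shared) > memory_budget:
            break  # memory is exhausted
        dedicated, shared = trial_dedicated, trial_shared
    return dedicated, shared


# f11() appears 80 times in the trace and f21() 50 times, as on the slide.
trace = ["f11"] * 80 + ["f21"] * 50 + ["f31"] * 5
sizes = {"f11": 30, "f21": 40, "f31": 20}  # hypothetical sizes in KB
print(design_overlay(trace, sizes, memory_budget=80))
```

With these made-up numbers f11() gets its own region first, and f21() stays in the shared region because promoting it would exceed the budget.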
Experimental Results
  • Microsoft’s WMV HD demonstration page [13]
  • The source videos were transcoded into H.264 1920x1080 (1080p) format
    • 5 different bitrates: 2.5, 4, 8, 12, and 16 Mbps, with both CAVLC and CABAC
    • Use the x264 H.264 encoder integrated into ffmpeg
  • The videos were encoded using the x264 presets: baseline, normal, and hq
  • Decoder performance is measured on Sony’s PlayStation 3 with a 3.2 GHz Cell processor (Sony limits access to six of the CBE’s eight SPUs) running Linux Fedora 9

[13] Microsoft Corporation. WMV HD Content Showcase. http://www.microsoft.com/windows/windowsmedia/musicandvideo/hdvideo/contentshowcase.aspx


Motion vector decoding and deblocking are the most expensive components

  • The white band at the bottom is the PPU (entropy decoder) contribution

Figure 14: Breakdown of decoder performance by component using a single SPU.

Decoder Performance

Compared with [4], our implementation achieves an average of 25.23 fps, a 23% improvement, when decoding similarly encoded video streams on four SPUs.

[4] H. Baik, K.-H. Sihn, Y. il Kim, S. Bae, N. Han, and H. J. Song. “Analysis and Parallelization of H.264 decoder on Cell Broadband Engine Architecture.” In Signal Processing and Information Technology, pages 791–795. Samsung Electron. Co., Ltd., Suwon, Korea, 2007.


Our implementation achieved a “best case” average frame rate of 34.94 fps on 2.5 Mbps modified-normal CAVLC encoded video streams on six SPUs

  • And a “worst case”, entropy-decoder-limited average frame rate of 15.43 fps on 16 Mbps hq CABAC encoded video streams.
Conclusion
  • Demonstrated a scalable H.264 decoder for a multicore processor
  • 23% frame rate advantage over prior work [4] on similar videos using the same number of cores
  • Careful engineering required to efficiently manage data structures and scratchpad memory