A high-level simulator for the H.264/AVC decoding process in multi-core systems

Florian H. Seitner, Ralf M. Schreier, Michael Bleyer, Margrit Gelautz

Vienna University of Technology, Austria

SPIE IS&T Electronic Imaging Conference, Multimedia on Mobile Devices 2008


  • Introduction

  • Multi-processor decoding

  • High-level simulator

  • Simulation result

  • Conclusion


  • H.264 as a new-generation video coding algorithm is becoming increasingly important for international broadcasting standards such as DVB-H and DMB.

  • H.264 achieves high compression efficiency at the cost of increased computational complexity.

  • Mobile devices (embedded processor)

    • Low processing (computation) capability

    • Limited energy (power)

  • Multi-core systems provide an elegant and power-efficient solution to overcome the performance limitation.


  • Efficiently distributing the video algorithm among multiple processors is a non-trivial task.

    • The decoding load should be distributed equally

    • Data dependency

    • Synchronization

  • It requires detailed knowledge about the algorithmic complexity and inter-dependencies between functional blocks.

  • The objective of this paper is to investigate

    • the dynamic behavior of the H.264 decoding process

    • the interaction between the main decoding tasks in the multi-core environments

  • Figure 1. Dynamic variations in the execution times of individual macroblocks in the H.264 decoding process. Histograms are shown for six IPB-coded sequences of 100 frames with a Group of Pictures (GOP) size of 13.

    • Histogram bins plot the number of macroblocks having similar runtimes.

    • It is observed that the runtimes of macroblocks significantly vary within a sequence due to different image content.

    • The overall runtime of the decoder strongly depends on the content of the encoded video material.
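The binning behind such a histogram can be sketched as follows. This is a minimal illustration, not the authors' profiling tool: the function name `runtime_histogram`, the bin width, and the runtime values (in cycles) are all hypothetical.

```python
from collections import Counter


def runtime_histogram(runtimes, bin_width):
    """Count macroblocks per runtime bin; each key is the bin's lower edge."""
    counts = Counter(int(t // bin_width) * bin_width for t in runtimes)
    return dict(sorted(counts.items()))


# Hypothetical per-macroblock runtimes in cycles (not measured data).
runtimes = [1200, 1350, 1900, 2100, 2150, 4800, 5100, 1250]
hist = runtime_histogram(runtimes, bin_width=1000)
print(hist)
```

A wide spread of occupied bins, as in this made-up input, corresponds to the strong runtime variation between macroblocks described above.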

    Table 1. Six test sequences with normalization to 35 dB.

    We present a high-level simulator for multi-core implementations of the H.264 decoder in this paper.

    Figure 2. Concept of the simulator. (a) Profiling data. (b) Underlying hardware.

    (c) Simulation of a splitting that maps f1 and f2 to the first core and f3 to the second one.
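The splitting in Figure 2(c) can be illustrated with a small sketch that sums profiled runtimes per core. It assumes that the profiling data is a list of per-macroblock runtimes for the functions f1, f2, and f3, and it ignores data dependencies and synchronization, which a real simulator has to model. The name `core_loads` and all numbers are hypothetical.

```python
def core_loads(profile, mapping, n_cores):
    """Sum the profiled cycles of every function on the core it is mapped to."""
    loads = [0] * n_cores
    for macroblock in profile:
        for func, cycles in macroblock.items():
            loads[mapping[func]] += cycles
    return loads


# Hypothetical profiling data: cycles per function for three macroblocks.
profile = [
    {"f1": 300, "f2": 200, "f3": 400},
    {"f1": 500, "f2": 100, "f3": 450},
    {"f1": 250, "f2": 150, "f3": 700},
]

# Splitting from Figure 2(c): f1 and f2 on the first core, f3 on the second.
mapping = {"f1": 0, "f2": 0, "f3": 1}
loads = core_loads(profile, mapping, n_cores=2)
print(loads)
```

Changing only the `mapping` dictionary is enough to evaluate a different splitting against the same profiling data.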

    Parallel H.264 Decoding: The H.264 Decoder

    [Block diagram of the H.264 decoder. Blocks shown: Encoded Bitstream, Stream Parsing, Entropy Decoder, Inverse Quantization, Inverse DCT, Spatial Prediction, Motion Compensation, Reference Frames.]


    Data-Parallel Processing

    The H.264 decoding process http://www.powercam.cc/slide/1580


    Multi-processor decoding

    • The parser processor (1st CPU) performs all functions related to bitstream parsing

      • Entropy Decoding: the basic entropy decoding of picture data such as motion vectors and DCT residuals

      • Context Calculation: the prediction step of context-adaptive VLC coding for residuals and motion vector prediction

      • Init: the memory initialization of macroblock data structures

    Figure 3. Partitioning the H.264 decoder on a dual-core system.

    Multi-processor decoding

    • The reconstructor processor (2nd CPU) handles all pixel-based operations

      • Intra/Inter Prediction: the intra and inter prediction routines

      • IDCT: the inverse residual transformation, which is based on multiples of the 4 × 4 pixel block size

      • Strength Calculation: the filter strength coefficients for the deblocking process are calculated before the deblocking filter is applied

      • Deblocking: the deblocking filter is applied as the last step in the macroblock decoding process

    High-level simulator - Austrochip 2008, Invited Poster

    CHILI Vector Processor

    CHILI Design

    • CHILI Core with 32bit / 4 Slots / 8 SIMD

    • High performance for signal processing and control code

    • Compiler friendly instruction set

    • Fully programmable (C / Assembler)

    • C-Compiler (LLVM, GCC) and instruction set simulator available

    CHILI Processor Features

    • Separate instruction and data path

    • 16-bit SIMD operands

    • 64 32-bit general purpose registers

    • 128-bit core memory interface

    • 64 KB instruction cache

    • 64 KB data SRAM (core memory)

    • 64-channel data load and store DMA controller

    • 1.92 GMAC 16-bit operations (@ 240 MHz)

    SVENm Multimedia Engine

    • Video / multimedia companion

    • Targets H.264 encoding / decoding at SD resolution

    Simulation result

    • 6 test sequences

      • Foreman, Flowergarden, Barcelona, Paris, Bus, Mobile

    • Parameters

      • Test sequences are encoded in H.264 main profile using the JM12.2 encoder

      • GOP size = 13 frames

      • CIF, IPB, VLC, deblocking active, all prediction modes allowed

      • Search range (SR) = ±16 pixels

      • 3 reference frames

      • 1 slice per frame

    Simulation result – (1) Variation of partitioning

    Figure 5. Two methods for partitioning the H.264 decoder on a dual-core system.

    (a) Scenario 1: The function Strength Calc. is part of the parsing module.

    (b) Scenario 2: The function Strength Calc. is part of the reconstructor.

    Figure 6. The Foreman sequence at different bitrates. Macroblocks are classified based on the percentage of the overall time that is spent in the parsing module of the decoder.

    A percentage of 80 means that 80% of the runtime is spent in the parser, while 20% is consumed in the reconstructor. A value of 50% indicates a perfect balance.

    Figure 6(b) indicates that the workload balance between the two processors is significantly improved.
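The classification used in Figure 6 can be sketched as below: for each macroblock, compute the percentage of its runtime spent in the parser, then count macroblocks per percentage class. The 10% class width, the function names, and the cycle counts are assumptions for illustration only.

```python
def parser_share(parser_cycles, recon_cycles):
    """Percentage of a macroblock's runtime that is spent in the parser."""
    return 100.0 * parser_cycles / (parser_cycles + recon_cycles)


def classify(shares, bin_width=10):
    """Count macroblocks per percentage class; each key is the class's lower edge."""
    bins = {}
    for share in shares:
        # Clamp so that a share of exactly 100% falls into the top class.
        low = min(int(share // bin_width) * bin_width, 100 - bin_width)
        bins[low] = bins.get(low, 0) + 1
    return dict(sorted(bins.items()))


# Hypothetical (parser, reconstructor) cycles for four macroblocks.
macroblocks = [(800, 200), (500, 500), (300, 700), (450, 550)]
shares = [parser_share(p, r) for p, r in macroblocks]
classes = classify(shares)
print(shares, classes)
```

The closer the occupied classes cluster around 50%, the better the two cores are balanced.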

    • Figure 7(a): the reconstructor processor’s idle time is approximately 40% in all three test sequences and for all data rates.

    • Figure 7(b): the reconstructor idle time can be reduced below 15%.

    Figure 7. Idle time for parser and reconstructor core for three test sequences. (a) Filter strength calculation is done at the parser side. (b) Filter strength calculation is performed at the reconstructor side.
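The effect of moving the filter strength calculation can be illustrated with a toy two-stage pipeline model. It assumes that the reconstructor may start a macroblock only after the parser has finished it, and that the buffer between the cores is unbounded. The cycle counts are hypothetical and do not reproduce the percentages reported in Figure 7.

```python
def reconstructor_idle(parser_cycles, recon_cycles):
    """Idle cycles of the second core in a two-stage macroblock pipeline."""
    parser_done = 0  # time at which the parser finishes the current macroblock
    recon_done = 0   # time at which the reconstructor finishes it
    for p, r in zip(parser_cycles, recon_cycles):
        parser_done += p
        # The reconstructor waits for the parser and for its own previous job.
        recon_done = max(parser_done, recon_done) + r
    return recon_done - sum(recon_cycles)


# Hypothetical cycles per macroblock for three macroblocks.
parse = [400, 400, 400]
strength = [200, 200, 200]
recon = [300, 300, 300]

# Scenario 1: Strength Calc. runs on the parser core.
idle_1 = reconstructor_idle([p + s for p, s in zip(parse, strength)], recon)
# Scenario 2: Strength Calc. runs on the reconstructor core.
idle_2 = reconstructor_idle(parse, [r + s for r, s in zip(recon, strength)])
print(idle_1, idle_2)
```

In this made-up example the reconstructor waits much less in Scenario 2, which matches the direction of the result in Figure 7(b).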

    Simulation result – (2) Variation of buffers

    Figure 8. Average idle times of all system cores while decoding (a) intra-coded I-frames. For the simulations, the calculation of the filter strength was assigned to the reconstructor core [Figure 5(b)].

    • Increasing the PSNR value (and the bitrate) mainly raises the macroblock processing complexity at the parsing core, and performance decreases.

    • At a buffer size of one macroblock (1 MB), the Foreman sequence performs best at 35 dB.

    • At a buffer size of five macroblocks (5 MB), a continuous performance decrease with increasing PSNR values can be observed for the Foreman sequence.

    • For the Flowergarden and Barcelona sequences, the higher parsing complexity typically results in higher idle times and smaller performance improvements at larger buffer sizes.

    Figure 8. Average idle times of all system cores while decoding (b) inter-coded P- and (c) inter-coded B-frames of three test sequences.
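The influence of the buffer size can be sketched with a bounded-buffer version of a two-stage pipeline model. The buffer semantics are an assumption made for this illustration: the parser needs a free slot before it starts a macroblock, and a slot is released when the reconstructor finishes the macroblock stored in it. This is not the authors' simulator, and all cycle counts are hypothetical.

```python
def simulate(parser_cycles, recon_cycles, buffer_size):
    """Return (total time, parser idle, reconstructor idle) in cycles."""
    n = len(parser_cycles)
    parser_fin = [0] * n
    recon_fin = [0] * n
    for i in range(n):
        parser_start = parser_fin[i - 1] if i > 0 else 0
        if i >= buffer_size:
            # Wait until the reconstructor has released a buffer slot.
            parser_start = max(parser_start, recon_fin[i - buffer_size])
        parser_fin[i] = parser_start + parser_cycles[i]
        recon_start = max(parser_fin[i], recon_fin[i - 1] if i > 0 else 0)
        recon_fin[i] = recon_start + recon_cycles[i]
    total = recon_fin[-1]
    return total, total - sum(parser_cycles), total - sum(recon_cycles)


# Hypothetical macroblocks with strongly varying parsing cost.
parser_cycles = [100, 500, 100, 500]
recon_cycles = [300, 300, 300, 300]
results = {b: simulate(parser_cycles, recon_cycles, b) for b in (1, 2, 4)}
for size, (total, parser_idle, recon_idle) in results.items():
    print(size, total, parser_idle, recon_idle)
```

With these inputs, a larger buffer lets the parser run ahead during cheap macroblocks, so both cores idle less. The gain shrinks as the buffer grows, because the cost of the individual macroblocks then becomes the limiting factor.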


    • A simulator for mapping the H.264 decoding process onto a hardware architecture has been introduced.

    • We have demonstrated the simulator’s ability to analyze the efficiency of a multi-core architecture under various conditions.


    • [4] T.-T. Shih, C.-L. Yang, and Y.-S. Tung, “Workload characterization of the H.264/AVC decoder,” in Proc. of the 5th IEEE Pacific-Rim Conference on Multimedia, pp. 957–966, 2004.

    • [6] F. Seitner, R. Schreier, M. Bleyer, and M. Gelautz, “A macroblock-level analysis on the dynamic behaviour of an H.264 decoder,” in Proc. of ISCE 2007, (Dallas), June 2007.

    • [7] E. B. van der Tol, E. G. Jaspers, and R. H. Gelderblom, “Mapping of H.264 decoding on a multiprocessor architecture,” in Proc. of the SPIE, 5022, pp. 707–718, May 2003.

    • F. Seitner, R. Schreier, M. Bleyer, and M. Gelautz, “Evaluation of data-parallel splitting approaches for H.264 decoding,” in Proc. of the 6th International Conference on Advances in Mobile Computing and Multimedia, (Linz), November 2008. http://www.powercam.cc/slide/1580

    • Florian Seitner, Josef Meser, Gerold Schedelberger, Andreas Wasserbauer, Michael Bleyer, Margrit Gelautz, Markus Schutti, Ralf Schreier, Premysl Vaclavik, Gerald Krottendorfer, Günther Truhlar, Thomas Bauernfeind, Philipp Beham, “Design Methodology for the SVENm Multimedia Engine,” Austrochip 2008, Invited Poster.

    FFmpeg H.264 decoder

    • H264 benchmarks

      • JM Reference Codec

      • X264 encoder

      • FFmpeg H.264 decoder

    • FFmpeg H.264 decoder

      • FFmpeg includes an H.264/AVC decoder that implements most of the features of the main and high profiles of the standard.

      • The code is highly optimized and includes MMX/SSE and AltiVec SIMD instructions for the most time-consuming kernels.

      • It is widely used in free multimedia players such as MPlayer, VLC media player (VideoLAN), Xine, etc.

      • http://ffmpeg.org/
