Processor architectures for multimedia applications

Oguz Karacuka


What Is Multimedia Processing?

  • Desktop:

    – 3D graphics (games)

    – Speech recognition (voice input)

    – Video/audio decoding (MPEG/MP3 playback)

  • Servers:

    – Video/audio encoding (video servers, IP telephony)

    – Digital libraries and media mining (video servers)

    – Computer animation, 3D modeling & rendering (movies)

  • Embedded:

    – 3D graphics (game consoles)

    – Video/audio decoding & encoding (set top boxes, PVR...)

    – Image processing (digital cameras)

    – Signal processing (cellular phones)


Characteristics Of Multimedia Apps.

  • Requirement for real-time response

    – “Incorrect” result often preferred to slow result

    – Unpredictability can be bad (e.g. dynamic execution)

  • Narrow data-types

    – Typical width of data in memory: 8 to 16 bits

    – Typical width of data during computation: 16 to 32 bits

    – 64-bit data types rarely needed

    – Fixed-point arithmetic often replaces floating-point

  • Fine-grain (data) parallelism

    – Identical operation applied on streams of input data

    – Branches have high predictability

    – High instruction locality in small loops or kernels
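
The narrow data-type bullets above note that fixed-point arithmetic often replaces floating-point. A minimal sketch of what that can look like, assuming the common Q15 format (1 sign bit, 15 fractional bits, stored in 16 bits). The format choice and the helper names `to_q15`, `q15_mul`, and `from_q15` are illustrative assumptions, not taken from the slides:

```python
Q = 15            # fractional bits (Q15 is an assumed format)
ONE = 1 << Q      # the value 1.0 in Q15 (just out of range for 16 bits)


def saturate16(v: int) -> int:
    """Clamp to the signed 16-bit range instead of wrapping around."""
    return max(-ONE, min(ONE - 1, v))


def to_q15(x: float) -> int:
    """Convert a float in roughly [-1, 1) to a Q15 integer."""
    return saturate16(int(round(x * ONE)))


def from_q15(v: int) -> float:
    """Convert a Q15 integer back to a float."""
    return v / ONE


def q15_mul(a: int, b: int) -> int:
    """Multiply two Q15 values.

    The 16x16 product needs 32 bits during computation, matching the
    "8 to 16 bits in memory, 16 to 32 bits during computation" pattern.
    """
    prod = a * b
    prod += 1 << (Q - 1)          # round to nearest before dropping bits
    return saturate16(prod >> Q)  # shift back to 15 fractional bits
```

For example, `q15_mul(to_q15(0.5), to_q15(0.5))` yields the Q15 encoding of 0.25, using only integer operations.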


Characteristics Of Multimedia Apps. (cont.)

  • Coarse-grain parallelism

    – Most apps organized as a pipeline of functions

    – Multiple threads of execution can be used

  • Memory requirements

    – High bandwidth requirements but can tolerate high latency

    – High spatial locality (predictable pattern) but low temporal locality

    – Cache bypassing and prefetching can be crucial


Examples of Media Functions

  • Matrix transpose/multiply (3D graphics)

  • DCT/FFT (Video, audio, communications)

  • Motion estimation (Video encoding, deinterlacing)

  • Gamma correction (3D graphics)

  • Haar transform (Media mining)

  • Median filter (Image processing)

  • Separable convolution (Image processing)

  • Viterbi decode (Communications, speech)

  • Bit packing (Communications, cryptography)
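
To make one of these kernels concrete, here is a minimal sketch of motion estimation by exhaustive block matching, using the sum of absolute differences (SAD) as the cost. The function names, the list-of-rows frame layout, and the full-search strategy are assumptions chosen for illustration; real encoders typically use faster search patterns.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at
    (bx, by) and the block of `ref` displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def best_motion_vector(cur, ref, bx, by, n, search):
    """Try every displacement in [-search, search]^2 that stays inside
    `ref` and return (cost, dx, dy) for the lowest SAD."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            inside = (0 <= bx + dx and bx + dx + n <= width and
                      0 <= by + dy and by + dy + n <= height)
            if not inside:
                continue  # candidate block would fall outside the frame
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best
```

The inner loop applies the same subtract, absolute value, and accumulate steps to a stream of 8-bit pixels, which is the fine-grain data parallelism described earlier.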


Approaches to Media Processing

Multimedia processing can be approached with:

  • VLIW with SIMD extensions (aka mediaprocessors, Adapted Programmable Architectures)

  • ASICs/FPGAs (Dedicated/Function Specific Architectures)

  • DSPs (Flexible Programmable Architectures)

  • Vector Processors

  • General-purpose processors with SIMD extensions

coldfire:

Dedicated multimedia processors are typically custom-designed architectures intended to perform specific multimedia functions. These functions usually include video and audio compression and decompression, in which case the processors are referred to as video codecs. In addition to support for compression, some advanced multimedia processors support 2D and 3D graphics applications. Designs of dedicated multimedia processors range from fully custom architectures with minimal programmability, referred to as function specific architectures, to fully programmable architectures. Programmable architectures can in turn be classified into flexible programmable architectures, which provide moderate to high flexibility, and adapted programmable architectures, which provide increased efficiency and less flexibility [1]. Dedicated multimedia processors use a variety of architectural schemes, from multiple functional units with a RISC or DSP (digital signal processor) core processor to multiple-processor schemes. Furthermore, the latest dedicated processors use single-instruction-multiple-data (SIMD) and very-long-instruction-word (VLIW) architectures, as well as some hybrid schemes. These architectures are presented in Section 3.

General-purpose (GP) processors provide support for multimedia by including multimedia instructions in the instruction set. Instead of performing specific multimedia functions (such as compression and 2D/3D graphics), GP processors provide instructions specifically created to support generic operations in video processing. For example, these instructions include support for 8-bit data types (pixels), efficient data addressing and I/O instructions, and even instructions to support motion estimation. The latest multimedia extensions, such as MMX (Intel), VIS (Sun), and MAX-2 (HP), incorporate some form of SIMD architecture, which performs the same operation in parallel on multiple data elements.


Application Example: MPEG Dec.


MPEG Encoder & Decoder Complexity


Function Specific Architectures

  • Limited (if any) programmability

  • DSP or RISC core processor for main control

  • Special hardware accelerators for the DCT, quantization, entropy encoding, motion estimation...

  • High efficiency and speed: typically better compared to programmable architectures.

  • The silicon area optimization achieved by function-specific architectures allows lower production cost.


Programmable Dedicated Architectures

  • Increased flexibility: enables the processing of different tasks under software control.

  • Higher cost for design and manufacturing: additional hardware for program control is required.

  • Require software development for the application: parallelization strategies have to be applied


Flexible Programmable Architectures

TI’s Multimedia Video Processor (MVP) TMS320C80

coldfire:

The MVP combines a RISC master processor and four DSP processors in a crossbar-based SIMD shared-memory architecture, as shown. The master processor can be used for control, floating-point operations, audio processing, or 3D graphics transformations. Each DSP performs all the typical operations of a general-purpose DSP and can also perform bit-field and multiple-pixel operations. Each DSP has multiple functional elements (multiplier, ALU, local registers, a barrel shifter, address generators, and a program-control flow unit), all controlled by very long 64-bit instruction words (VLIW concept). The RISC processor, DSP processors, and the memory modules are fully interconnected through the global crossbar network that can be switched at an instruction clock rate of 20 ns. A 50 MHz MVP executes more than 2 GOPS.


Adapted Programmable Architectures

C-Cube’s VRP – VRP2

coldfire:

The VRP2 processor consists of a 32-bit RISC processor and two special functional units for variable-length coding and motion estimation, as shown in the block diagram in Figure 7. Specially designed instructions in the RISC processor provide an efficient implementation of the DCT and other video-related operations.


VLIW Advanced Architectures

  • Reduce the number of cycles per instruction required for execution of highly complex and parallel algorithms

  • Multiple independent functional units that are directly controlled by long instruction words.

  • Inefficient use of silicon: requires a giant routing network of buses and crossbar switches.

  • All functional units share a common large register file

  • Code compaction is typically done by a special compiler, which can predict branch outcomes by applying an algorithm known as trace scheduling

  • Can be combined with SIMD architectures for increased parallelism

    e.g.: Mitsubishi D30V and Philips Semiconductor’s TriMedia

coldfire:

The VLIW architectural model is used in the latest dedicated multimedia processors. A typical VLIW architecture uses long instruction words that can be hundreds of bits in length. The idea behind the VLIW concept is to reduce the number of cycles per instruction required for execution of highly complex and parallel algorithms by the use of multiple independent functional units that are directly controlled by long instruction words. This concept is illustrated in Figure 10, where multiple functional units operate in parallel under control of a long instruction. All functional units share a common large register file [11]. Different fields of the long instruction word contain opcodes to activate different functional units. Programs written for conventional 32-bit instruction word computers must be compacted to fit the VLIW instructions. This code compaction is typically done by a special compiler, which can predict branch outcomes by applying an algorithm known as trace scheduling.


Philips TriMedia CPU64 Arch.

  • 5-slot VLIW architecture with a 64-bit word size;

  • 27 functional units, offering a choice of operation types

  • in each slot in the instruction any operation can be guarded to provide conditional execution without branching;

  • All functional units provide vector-style subword parallelism on byte, half-word, or word entities.

  • instruction set and functional units optimized with respect to media processing;

  • a single multi-ported register file with bypass network, allowing 1-cycle latency operations;

  • 32 kB, 8-way instruction cache;

  • 16 kB, 8-way, quasi-dual ported, data cache;

  • a variable-length (compressed) instruction set design.

coldfire:

The TriMedia CPU64 architecture is a 5-slot VLIW machine, in principle launching a long instruction every clock cycle. It has a uniform 64-bit word size through all functional units, the register file, load/store units, on-chip highway and external memory. The 5 operations in a single instruction can in principle each read 2 register arguments and write one register result every clock cycle. In addition, each operation can be guarded with an optional (4th) register for conditional execution without branch penalty.

All functional units provide vector-style subword parallelism on byte, half-word, or word entities. This SIMD-style operation in each of the 5 slots in parallel allows for a very high media processing throughput. There is almost no support for arithmetic on 64-bit integers, 64-bit (double precision) floating point numbers, or 64-bit address ranges, since this was not considered important for the intended application area.

With the exception of floating point divide and square root, all functional units are pipelined, allowing a restart every cycle. The latencies vary from 1 (for operations like add, compare, bitand, bitshift, byteshuffle) to 4 (word multiply with round). A register-file bypass allows an operation result to be used as an argument for a next operation without having to wait for register file storage and retrieval.
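
The issue model described in these notes (up to 5 operations per instruction, each reading 2 registers, writing 1, and optionally guarded by a 4th register) can be sketched as a toy simulator. This is only an illustration under stated assumptions: the `Op` record and `issue` function are hypothetical names, all operations are given single-cycle latency, and treating the guard as "true when the least significant bit of the guard register is 1" is an assumption of this sketch rather than something the slides specify.

```python
from typing import Callable, List, NamedTuple, Optional

MASK64 = (1 << 64) - 1  # uniform 64-bit word size


class Op(NamedTuple):
    fn: Callable[[int, int], int]  # operation performed by the functional unit
    src1: int                      # first source register index
    src2: int                      # second source register index
    dst: int                       # destination register index
    guard: Optional[int] = None    # optional guard register index


def issue(regs: List[int], bundle: List[Op]) -> None:
    """Execute one long instruction (at most 5 operations) on `regs`."""
    if len(bundle) > 5:
        raise ValueError("a long instruction holds at most 5 operations")
    # Read phase: every slot sees the register values from before the
    # instruction, so the slots behave as if they ran in parallel.
    results = []
    for op in bundle:
        enabled = op.guard is None or (regs[op.guard] & 1) == 1
        if enabled:
            value = op.fn(regs[op.src1], regs[op.src2]) & MASK64
            results.append((op.dst, value))
    # Write phase: a guarded-off operation writes nothing, so no branch
    # is needed to skip it.
    for dst, value in results:
        regs[dst] = value
```

A guarded pair such as "add if r3, subtract if r4" replaces an if/else branch with two operations in the same instruction, only one of which commits its result.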


Multiple-instruction, multiple-data (MIMD) architectures

  • Offer 10 to 100 times more throughput than existing VLIW and SIMD architectures

  • Multiple instructions are executed in parallel on multiple data: a control unit for each data path.

  • Asynchronous nature increases the complexity of software development.


SIMD Extensions to General Purp. Processors

WHY ?

  • Performance

    – A 1.2GHz Athlon can do MPEG-4 encoding at 6.4fps

    – One 384Kbps W-CDMA channel requires 6.9 GOPS

  • Power consumption

    – A 1.2GHz Athlon consumes ~60W

    – Power consumption increases with clock frequency and complexity

  • Cost

    – A 1.2GHz Athlon costs ~$62 to manufacture and has a list price of ~$600 (module) (year 2000)

    – Cost increases with complexity

coldfire:

Real-time multimedia processing on PCs and workstations is still handled by dedicated multimedia processors. However, advanced GP processors provide efficient support for certain multimedia applications. These processors can provide software-only solutions for many multimedia functions, which may significantly reduce the cost of the system.

GP processors apply the SIMD approach, described in the previous section, by sharing their existing integer or floating-point data paths with a SIMD coprocessor.

Many microprocessor instruction sets include instructions for accelerating multimedia applications such as DVD playback, speech recognition and 3D graphics.

All leading processor vendors have recently designed GP processors that support multimedia, as shown in Figure 1. The main differences among these processors are in the way they reconfigure the internal register file structure to accommodate SIMD operations, and the multimedia instructions they choose to add.


SIMD Extensions to General Purp. Processors

  • Motivation

    – Low media-processing performance of GPPs

    – Cost and lack of flexibility of specialized ASICs for graphics/video

    – Underutilized datapaths and registers

  • Basic idea: sub-word parallelism

    – The mismatch between wide data paths and the relatively short data types found in multimedia applications

    – Treat a 64-bit register as a vector of two 32-bit, four 16-bit, or eight 8-bit values (short vectors)

    – Partition 64-bit datapaths to handle multiple narrow operations in parallel

  • Initial constraints

    – No additional architecture state (registers)

    – No additional exceptions

    – Minimum area overhead
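
The sub-word parallelism idea above can be imitated in software with ordinary integer operations. The sketch below treats a 64-bit integer as four 16-bit lanes and adds two such words with wraparound in each lane, masking so that carries never cross a lane boundary. It emulates what a partitioned datapath does in hardware and is not a model of any specific instruction set; `pack`, `unpack`, and `packed_add16` are illustrative names.

```python
LANES = 4
LANE_BITS = 16
LANE_MASK = 0xFFFF
HIGH_BITS = 0x8000_8000_8000_8000  # top bit of every 16-bit lane
LOW_BITS = 0x7FFF_7FFF_7FFF_7FFF   # remaining 15 bits of every lane


def pack(values):
    """Pack four 16-bit values into one 64-bit word (lane 0 lowest)."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & LANE_MASK) << (i * LANE_BITS)
    return word


def unpack(word):
    """Split a 64-bit word back into its four 16-bit lanes."""
    return [(word >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]


def packed_add16(a, b):
    """Add four 16-bit lanes at once, wrapping around within each lane.

    Adding only the low 15 bits of each lane cannot carry into the next
    lane; the top bit of each lane is then fixed up with XOR.
    """
    low_sum = (a & LOW_BITS) + (b & LOW_BITS)
    return low_sum ^ ((a ^ b) & HIGH_BITS)
```

One 64-bit addition plus a few logical operations thus performs four 16-bit additions, which is how an otherwise underutilized wide datapath gets put to work.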


Overview of SIMD Extensions


Intel’s MMX Example

  • Targeted to accelerate multimedia and communications applications, especially on the Internet.

  • The MMX system extends the basic integer instructions (add, subtract, multiply, compare, and shift) into SIMD versions.

  • Added DCT / IDCT kernels

  • With MMX, MPEG-1 video decompression speeds up by about 80%, while some other applications, such as image filtering, speed up by as much as 370%.


Summary of SIMD Instructions

  • Integer arithmetic

    – Addition and subtraction with saturation

    – Fixed-point rounding modes for multiply and shift

    – Sum of absolute differences

    – Multiply-add, multiplication with reduction

    – Min, max

  • Floating-point arithmetic

    – Packed floating-point operations

    – Square root, reciprocal

    – Exception masks

  • Data communication

    – Merge, insert, extract

    – Pack, unpack (width conversion)
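
Two of the integer operations listed above are easy to show element by element. The sketch below contrasts wraparound and saturating arithmetic on unsigned 8-bit pixels, and shows a pairwise multiply-add in which adjacent products are summed into wider results. The functions are plain-Python reference models with hypothetical names; they describe what such instructions compute per element, not how any particular extension encodes them.

```python
def add_wrap_u8(a, b):
    """Ordinary modular addition: overflow wraps around past 255."""
    return [(x + y) & 0xFF for x, y in zip(a, b)]


def add_sat_u8(a, b):
    """Saturating addition: overflow clamps at 255 (bright stays bright)."""
    return [min(x + y, 255) for x, y in zip(a, b)]


def sub_sat_u8(a, b):
    """Saturating subtraction: underflow clamps at 0."""
    return [max(x - y, 0) for x, y in zip(a, b)]


def multiply_add_pairs(a, b):
    """Multiply element-wise, then sum each adjacent pair of products.

    Four 16-bit inputs per operand give two wider results, the
    "multiplication with reduction" pattern used in dot products.
    """
    return [a[i] * b[i] + a[i + 1] * b[i + 1] for i in range(0, len(a), 2)]
```

With wraparound, brightening a pixel of value 250 by 10 gives 4 (nearly black); with saturation it gives 255, which is why saturating forms matter for image and audio data.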


Summary of SIMD Instructions

  • Comparisons

    – Integer and FP packed comparison

    – Compare absolute values

    – Element masks and bit vectors

  • Memory

    – No new load-store instructions for short vector

    – No support for strides or indexing

    – Short vectors handled with 64b load and store instructions

    – Pack, unpack, shift, rotate, shuffle to handle alignment of narrow data-types within a wider one

    – Prefetch instructions for utilizing temporal locality
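
Because short vectors are moved with ordinary 64-bit loads, data that does not start on an 8-byte boundary has to be assembled from two aligned words with shifts. A minimal sketch of that alignment step, assuming memory is modeled as a list of 64-bit integers with little-endian byte numbering (both are assumptions of this example, as is the name `load_unaligned`):

```python
MASK64 = (1 << 64) - 1


def load_unaligned(mem_words, byte_offset):
    """Return the 8 bytes starting at `byte_offset` as one 64-bit value.

    `mem_words[i]` holds bytes 8*i .. 8*i+7, lowest byte in the lowest
    bits. An unaligned access reads two aligned words and merges them.
    """
    index, rem = divmod(byte_offset, 8)
    if rem == 0:
        return mem_words[index]              # already aligned: one load
    low_part = mem_words[index] >> (8 * rem)  # tail of the first word
    high_part = (mem_words[index + 1] << (8 * (8 - rem))) & MASK64
    return low_part | high_part               # head of the second word on top
```

The extra load, two shifts, and OR are the kind of alignment overhead that reduces the speedup available from the packed operations themselves.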


SIMD Ext. for GPP Summary

  • Narrow vector extensions for GPPs

    – 64b or 128b registers as vectors of 32b, 16b, and 8b elements

  • Based on sub-word parallelism and partitioned datapaths

  • Instructions

    – Packed fixed- and floating-point, multiply-add, reductions

    – Pack, unpack, permutations

  • 2x to 4x performance improvement over base architecture

    – Limited by memory bandwidth

  • Difficult to use (no compilers)

  • Overhead of handling alignment and data width adjustment

  • Optimized shared libraries

    – Written in assembly, distributed by vendor

    – Need well defined API for data format and use


SUMMARY

  • Computationally intensive multimedia functions, such as MPEG encoding, HDTV codecs, 3D processing, and virtual reality, will still require dedicated processors

  • We should expect new generations of GP processors to devote more and more transistors, and thus more of the available chip real estate, to multimedia support.

