
Implementation of MPEG2 Codec with MMX/SSE/SSE2 Technology

Speaker: Rong Jiang, Xu Jin

Instructor: Yu-Hen Hu



Outline

  • Introduction

    • MMX/SSE/SSE2

    • MPEG 2 Video Compression

  • What we have done

  • Conclusion



MMX/SSE/SSE2

  • MMX

    • 57 new instructions;

    • 8 64-bit wide MMX registers;

    • 4 new data types (3 packed data types and one 64-bit quadword).

  • SSE

    • 8 new 128-bit SIMD floating-point registers;

    • 50 new instructions that work on packed floating-point data;

    • 8 new instructions to control data cacheability;

    • 12 new instructions that extend the MMX instruction set.

  • SSE2

    • Supports 64-bit (double-precision) floating-point values



MPEG 2 video compression


Project outline

  1. Dig out an MPEG2 encoder/decoder written in C

  2. Generate profiling information

  3. Identify the kernels

  4. Rewrite the kernels using SSE

  5. Performance results


Profiling results of the original code

[Figure: profiling charts for mpeg2decode and mpeg2encode; functions labeled in the charts: idct(), dist1(), fdct()]



Example 1 – optimizing dist1()

if ((v = p1[0] - p2[0])<0) v = -v; s+= v;
if ((v = p1[1] - p2[1])<0) v = -v; s+= v;
if ((v = p1[2] - p2[2])<0) v = -v; s+= v;
if ((v = p1[3] - p2[3])<0) v = -v; s+= v;
if ((v = p1[4] - p2[4])<0) v = -v; s+= v;
if ((v = p1[5] - p2[5])<0) v = -v; s+= v;
if ((v = p1[6] - p2[6])<0) v = -v; s+= v;
if ((v = p1[7] - p2[7])<0) v = -v; s+= v;
if ((v = p1[8] - p2[8])<0) v = -v; s+= v;
if ((v = p1[9] - p2[9])<0) v = -v; s+= v;
if ((v = p1[10] - p2[10])<0) v = -v; s+= v;
if ((v = p1[11] - p2[11])<0) v = -v; s+= v;
if ((v = p1[12] - p2[12])<0) v = -v; s+= v;
if ((v = p1[13] - p2[13])<0) v = -v; s+= v;
if ((v = p1[14] - p2[14])<0) v = -v; s+= v;
if ((v = p1[15] - p2[15])<0) v = -v; s+= v;

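asm volatile (
    "movdqu  (%1), %%xmm0    \n\t"  /* load 16 bytes from p1 */
    "movdqu  (%2), %%xmm1    \n\t"  /* load 16 bytes from p2 */
    "psadbw  %%xmm0, %%xmm1  \n\t"  /* one partial sum per 64-bit half */
    "movdq2q %%xmm1, %%mm0   \n\t"  /* low-half sum */
    "psrldq  $8, %%xmm1      \n\t"  /* shift the high half down to the low half */
    "movdq2q %%xmm1, %%mm1   \n\t"  /* high-half sum */
    "paddd   %%mm1, %%mm0    \n\t"  /* add the two partial sums */
    "movd    %%mm0, %0       \n\t"
    "emms"                          /* leave MMX state before any x87 code runs */
    : "=r"(s)
    : "r"(p1), "r"(p2)
    : "xmm0", "xmm1", "mm0", "mm1", "memory");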

4-5X speed-up, but it can be faster!

This code segment computes the sum of absolute differences over one row of 16 pixels. It is used in the prediction stage of the encoder.
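The same row computation can also be written with compiler intrinsics instead of inline assembly. The following is a minimal sketch, not the presenters' code: the function names sad16_scalar and sad16_sse2 are hypothetical, and it assumes an x86 compiler that provides <emmintrin.h> with SSE2 enabled.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Scalar reference: sum of absolute differences over 16 bytes. */
static int sad16_scalar(const unsigned char *p1, const unsigned char *p2)
{
    int s = 0;
    for (int i = 0; i < 16; i++) {
        int v = p1[i] - p2[i];
        s += (v < 0) ? -v : v;
    }
    return s;
}

/* SSE2 version (hypothetical name): one psadbw handles all 16 bytes. */
static int sad16_sse2(const unsigned char *p1, const unsigned char *p2)
{
    assert(p1 != 0 && p2 != 0);
    __m128i a   = _mm_loadu_si128((const __m128i *)p1);  /* unaligned load */
    __m128i b   = _mm_loadu_si128((const __m128i *)p2);
    __m128i sad = _mm_sad_epu8(a, b);       /* one partial sum per 64-bit half */
    __m128i hi  = _mm_srli_si128(sad, 8);   /* move the high half down */
    return _mm_cvtsi128_si32(_mm_add_epi32(sad, hi));
}
```

Because this version stays in the XMM registers, it never touches the MMX state and needs no emms instruction.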



Four ways to write super-fast code

  • Rearrange data fetching to maximize cache hits;

  • Unroll loops to eliminate unnecessary branches;

  • Utilize SSE instructions to take full advantage of parallelism;

  • Apply code scheduling to exploit the multiple-issue capability of the Pentium 4's superscalar microarchitecture.
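As an illustration of the unrolling and scheduling points, here is a minimal sketch of a byte-difference loop unrolled by four with independent accumulators. The function names are hypothetical, and the unrolled version assumes the length is a multiple of 4.

```c
#include <assert.h>
#include <stdlib.h>

/* Rolled loop: one compare-and-branch per element. */
static int sad_rolled(const unsigned char *a, const unsigned char *b, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += abs(a[i] - b[i]);
    return s;
}

/* Unrolled by 4 (hypothetical helper): a quarter of the loop branches,
 * and four independent accumulators that a superscalar core can
 * update in parallel. Assumes n is a multiple of 4. */
static int sad_unrolled4(const unsigned char *a, const unsigned char *b, int n)
{
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    assert(n % 4 == 0);
    for (int i = 0; i < n; i += 4) {
        s0 += abs(a[i]     - b[i]);
        s1 += abs(a[i + 1] - b[i + 1]);
        s2 += abs(a[i + 2] - b[i + 2]);
        s3 += abs(a[i + 3] - b[i + 3]);
    }
    return (s0 + s1) + (s2 + s3);
}
```

Whether this beats the rolled loop depends on the compiler and the CPU, so it should be measured rather than assumed.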



Example 2 – optimize idct()

Three nested loops form the kernel of the IDCT:

for (i=0; i<8; i++)
    for (j=0; j<8; j++)
    {
        partial_product = 0.0;
        for (k=0; k<8; k++)
            partial_product += c[k][j]*block[i][k];
        tmp[i][j] = partial_product;
    }



A verbatim translation from C to assembly doesn’t do much better. It misses the whole point of manually writing an assembly procedure.



We need parallelism!
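One way to expose that parallelism is to vectorize across j, so that each SSE instruction produces four output columns at once. The sketch below is an assumption about how this could be done, not the presenters' implementation: it uses single-precision floats and hypothetical function names, and computes the same product as the three-loop kernel.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Scalar reference, same loop nest as the three-loop kernel. */
static void rowpass_scalar(float c[8][8], float block[8][8], float tmp[8][8])
{
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++) {
            float partial_product = 0.0f;
            for (int k = 0; k < 8; k++)
                partial_product += c[k][j] * block[i][k];
            tmp[i][j] = partial_product;
        }
}

/* SSE version (hypothetical): the j loop is replaced by two
 * 4-wide vectors holding columns 0..3 and 4..7 of one output row. */
static void rowpass_sse(float c[8][8], float block[8][8], float tmp[8][8])
{
    for (int i = 0; i < 8; i++) {
        __m128 lo = _mm_setzero_ps();
        __m128 hi = _mm_setzero_ps();
        for (int k = 0; k < 8; k++) {
            __m128 b = _mm_set1_ps(block[i][k]);  /* broadcast one input */
            lo = _mm_add_ps(lo, _mm_mul_ps(b, _mm_loadu_ps(&c[k][0])));
            hi = _mm_add_ps(hi, _mm_mul_ps(b, _mm_loadu_ps(&c[k][4])));
        }
        _mm_storeu_ps(&tmp[i][0], lo);
        _mm_storeu_ps(&tmp[i][4], hi);
    }
}

/* Self-check: both versions agree on deterministic test data. */
static void rowpass_selfcheck(void)
{
    float c[8][8], block[8][8], t1[8][8], t2[8][8];
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++) {
            c[i][j] = 0.01f * (float)(i * 8 + j);
            block[i][j] = (float)(i - j);
        }
    rowpass_scalar(c, block, t1);
    rowpass_sse(c, block, t2);
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++) {
            float d = t1[i][j] - t2[i][j];
            assert(d < 1e-3f && d > -1e-3f);
        }
}
```

The inner loop has no horizontal additions: each row of c is loaded contiguously, which also suits the cache-friendly data fetching mentioned earlier.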


Results

  • 25X speed-up in idct()

  • 4X speed-up in dist1()

[Figure: profiling charts; values shown without recoverable labels: 68.72%, 34.39%, 13.04%, 9.99%; 50.1s, 16.34s, 2.45s, 3.83s]

Experimental results are averaged over 3 runs.



Platform Compatibility (1)

Algorithm for Checking Availability of MMX

bool isMMXSupported()
{
    int fSupported;
    // MSVC-style inline assembly (32-bit x86)
    __asm {
        mov eax, 1            // CPUID level 1
        cpuid                 // EDX = feature flags
        and edx, 0x800000     // test bit 23 of the feature flags
        mov fSupported, edx   // != 0 if MMX is supported
    }
    return fSupported != 0;
}


Platform Compatibility (2)

Routine selection (flowchart):

  • SSE? Y: SSE Routine

  • N: MMX? Y: MMX Routine

  • N: Normal Routine

  • END

Algorithm for Checking Availability of SSE

bool isISSESupported()
{
    int processor;
    int features;
    int extfeatures = 0;
    // MSVC-style inline assembly (32-bit x86)
    __asm {
        pusha
        mov eax, 1
        cpuid
        mov processor, eax      // Store processor family/model/step
        mov features, edx       // Store feature bits
        mov eax, 080000000h
        cpuid                   // Check which extended functions can be called
        cmp eax, 080000001h     // Extended feature bits
        jb nofeatures           // Jump if not supported
        mov eax, 080000001h     // Select function 0x80000001
        cpuid
        mov extfeatures, edx    // Store extended feature bits
    nofeatures:
        popa
    }
    if (((features >> 25) & 1) != 0) return true;      // standard SSE bit
    if (((extfeatures >> 22) & 1) != 0) return true;   // AMD extensions to MMX
    return false;
}
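With GCC or Clang, the same checks can be made without inline assembly through the <cpuid.h> helper, and the result can drive the SSE / MMX / normal selection shown in the flowchart. This is a minimal sketch under that assumption; the enum and function names are hypothetical.

```c
#include <assert.h>
#include <cpuid.h>  /* GCC/Clang wrapper around the CPUID instruction */

enum simd_level { LEVEL_NORMAL, LEVEL_MMX, LEVEL_SSE };

/* Map the EDX feature flags of CPUID function 1 to a routine level.
 * Bit 25 = SSE, bit 23 = MMX. SSE is tested first, as in the flowchart. */
static enum simd_level level_from_edx(unsigned int edx)
{
    if ((edx >> 25) & 1u) return LEVEL_SSE;
    if ((edx >> 23) & 1u) return LEVEL_MMX;
    return LEVEL_NORMAL;
}

/* Query the running CPU; fall back to the normal routine
 * if CPUID function 1 is not available. */
static enum simd_level detect_level(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return LEVEL_NORMAL;
    return level_from_edx(edx);
}

/* Name of the routine that would be dispatched for a given level. */
static const char *routine_name(enum simd_level level)
{
    switch (level) {
    case LEVEL_SSE:    return "SSE routine";
    case LEVEL_MMX:    return "MMX routine";
    case LEVEL_NORMAL: return "normal routine";
    }
    assert(!"unknown level");
    return "normal routine";
}
```

Calling detect_level() once at start-up and caching the result keeps the CPUID cost out of the kernels themselves.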



Thank you!

