Implementation of MPEG2 Codec with MMX/SSE/SSE2 Technology

Speaker: Rong Jiang, Xu Jin

Instructor: Yu-Hen Hu


Outline

  • Introduction

    • MMX/SSE/SSE2

    • MPEG 2 Video Compression

  • What we have done

  • Conclusion


MMX/SSE/SSE2

  • MMX

    • 57 new instructions;

    • 8 64-bit wide MMX registers;

    • 4 new data types (3 packed data types and one 64-bit quadword).

  • SSE

    • 8 new 128-bit SIMD floating-point registers;

    • 50 new instructions that work on packed floating-point data;

    • 8 new instructions to control data cacheability;

    • 12 new instructions that extend the MMX instruction set.

  • SSE2

    • Adds support for packed 64-bit (double-precision) floating-point values in the 128-bit registers.
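To make the register widths above concrete, here is a minimal sketch using compiler intrinsics rather than hand-written assembly. The function names `add4_floats` and `add2_doubles` are illustrative and do not come from the presentation; the sketch assumes an x86 compiler that provides `<emmintrin.h>`.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics; also pulls in the SSE header */

/* SSE: one 128-bit register holds four single-precision floats,
   so a single packed add handles all four lanes. */
void add4_floats(const float *a, const float *b, float *out)
{
    assert(a && b && out);
    __m128 va = _mm_loadu_ps(a);   /* unaligned load of 4 floats */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}

/* SSE2: the same register holds two double-precision values. */
void add2_doubles(const double *a, const double *b, double *out)
{
    assert(a && b && out);
    __m128d va = _mm_loadu_pd(a);  /* unaligned load of 2 doubles */
    __m128d vb = _mm_loadu_pd(b);
    _mm_storeu_pd(out, _mm_add_pd(va, vb));
}
```

The unaligned load and store variants are used so the sketch works for any pointer; aligned variants exist but require 16-byte aligned data.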


MPEG 2 video compression


Project outline

  1. Find an existing MPEG2 encoder/decoder implementation in C
  2. Generate profiling information
  3. Identify the kernels
  4. Rewrite the kernels using SSE
  5. Measure the performance results


Profiling results of the original code

[Figure: profiling charts for mpeg2decode and mpeg2encode, with the routines idct(), dist1(), and fdct() called out.]


Example 1 – optimizing dist1()

if ((v = p1[0] - p2[0])<0) v = -v; s+= v;

if ((v = p1[1] - p2[1])<0) v = -v; s+= v;

if ((v = p1[2] - p2[2])<0) v = -v; s+= v;

if ((v = p1[3] - p2[3])<0) v = -v; s+= v;

if ((v = p1[4] - p2[4])<0) v = -v; s+= v;

if ((v = p1[5] - p2[5])<0) v = -v; s+= v;

if ((v = p1[6] - p2[6])<0) v = -v; s+= v;

if ((v = p1[7] - p2[7])<0) v = -v; s+= v;

if ((v = p1[8] - p2[8])<0) v = -v; s+= v;

if ((v = p1[9] - p2[9])<0) v = -v; s+= v;

if ((v = p1[10] - p2[10])<0) v = -v; s+= v;

if ((v = p1[11] - p2[11])<0) v = -v; s+= v;

if ((v = p1[12] - p2[12])<0) v = -v; s+= v;

if ((v = p1[13] - p2[13])<0) v = -v; s+= v;

if ((v = p1[14] - p2[14])<0) v = -v; s+= v;

if ((v = p1[15] - p2[15])<0) v = -v; s+= v;

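asm volatile (
    "movdqu  (%1), %%xmm0\n\t"     /* load 16 pixels from p1 */
    "movdqu  (%2), %%xmm1\n\t"     /* load 16 pixels from p2 */
    "psadbw  %%xmm0, %%xmm1\n\t"   /* two partial sums, one per 64-bit half */
    "movdq2q %%xmm1, %%mm0\n\t"    /* low partial sum */
    "psrldq  $8, %%xmm1\n\t"       /* shift right so the high half moves down */
    "movdq2q %%xmm1, %%mm1\n\t"    /* high partial sum */
    "paddd   %%mm1, %%mm0\n\t"
    "movd    %%mm0, %0\n\t"
    "emms"                         /* leave MMX state before any x87 code runs */
    : "=r"(s)
    : "r"(p1), "r"(p2)
    : "xmm0", "xmm1", "mm0", "mm1", "memory");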

4-5X speed-up, but it can be faster!

This code segment computes the sum of absolute differences between two rows of 16 pixels. The encoder uses it in the prediction stage, during motion estimation.
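The same row computation can be sketched with SSE2 intrinsics, which lets the compiler handle register allocation and keeps everything in the XMM registers. This is an illustration under assumptions, not the presenters' code: the names `sad16_scalar` and `sad16_sse2` are made up here, and a compiler with `<emmintrin.h>` is assumed.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Scalar reference: sum of absolute differences over 16 pixels. */
int sad16_scalar(const unsigned char *p1, const unsigned char *p2)
{
    int s = 0;
    for (int i = 0; i < 16; i++) {
        int v = p1[i] - p2[i];
        s += (v < 0) ? -v : v;
    }
    return s;
}

/* SSE2 version: psadbw produces two partial sums, one in the low
   64 bits and one in the high 64 bits of the result register. */
int sad16_sse2(const unsigned char *p1, const unsigned char *p2)
{
    assert(p1 && p2);
    __m128i a   = _mm_loadu_si128((const __m128i *)p1);
    __m128i b   = _mm_loadu_si128((const __m128i *)p2);
    __m128i sad = _mm_sad_epu8(a, b);
    __m128i hi  = _mm_srli_si128(sad, 8);   /* bring the high sum down */
    return _mm_cvtsi128_si32(_mm_add_epi32(sad, hi));
}
```

Because no MMX register is touched, this variant needs no `emms` instruction afterwards.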


Four ways to write super-fast code

  • Rearrange data fetching to maximize cache hits;

  • Unroll loops to eliminate unnecessary branches;

  • Utilize SSE instructions to take full advantage of parallelism;

  • Apply code scheduling to exploit the multiple-issue capability of the Pentium 4's superscalar microarchitecture.
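As a small illustration of the unrolling and scheduling points, the sketch below sums an array with four independent accumulators, so consecutive additions do not depend on each other and the loop branch is taken a quarter as often. The function names are hypothetical, and modern compilers may apply a similar transformation on their own at higher optimization levels.

```c
#include <assert.h>
#include <stddef.h>

/* Straightforward loop: every addition depends on the previous one. */
long sum_simple(const int *x, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Unrolled by four with independent accumulators. */
long sum_unrolled4(const int *x, size_t n)
{
    assert(x != NULL || n == 0);
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++)      /* leftover elements when n is not a multiple of 4 */
        s0 += x[i];
    return s0 + s1 + s2 + s3;
}
```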


Example 2 – optimizing idct()

Three nested loops form the kernel of the DCT:

for (i=0; i<8; i++)
    for (j=0; j<8; j++)
    {
        partial_product = 0.0;
        for (k=0; k<8; k++)
            partial_product += c[k][j]*block[i][k];
        tmp[i][j] = partial_product;
    }


A verbatim translation from C to assembly doesn’t do much better. It misses the whole point of manually writing an assembly procedure.


We need parallelism!
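One way to expose that parallelism, sketched here with SSE intrinsics, is to compute a whole output row at once: broadcast each `block[i][k]` and multiply it by row `k` of `c`, four columns per register. This is an assumption-laden illustration, not the presenters' implementation: it uses single-precision floats, and the names `matmul8_scalar` and `matmul8_sse` are invented for the example.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Scalar reference, same loop structure as the kernel above. */
void matmul8_scalar(float c[8][8], float block[8][8], float tmp[8][8])
{
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++) {
            float partial_product = 0.0f;
            for (int k = 0; k < 8; k++)
                partial_product += c[k][j] * block[i][k];
            tmp[i][j] = partial_product;
        }
}

/* SSE version: each output row lives in two registers (columns 0-3
   and 4-7), so the j loop disappears. */
void matmul8_sse(float c[8][8], float block[8][8], float tmp[8][8])
{
    assert(c && block && tmp);
    for (int i = 0; i < 8; i++) {
        __m128 lo = _mm_setzero_ps();
        __m128 hi = _mm_setzero_ps();
        for (int k = 0; k < 8; k++) {
            __m128 b = _mm_set1_ps(block[i][k]);   /* broadcast one scalar */
            lo = _mm_add_ps(lo, _mm_mul_ps(b, _mm_loadu_ps(&c[k][0])));
            hi = _mm_add_ps(hi, _mm_mul_ps(b, _mm_loadu_ps(&c[k][4])));
        }
        _mm_storeu_ps(&tmp[i][0], lo);
        _mm_storeu_ps(&tmp[i][4], hi);
    }
}
```

Walking `c` by rows also keeps the memory accesses contiguous, which the column-wise access `c[k][j]` in the scalar inner loop does not.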


Results

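[Figure: before/after profiling charts. Labels shown: 25X speed-up in idct(), 4X speed-up in dist1(); percentages 68.72%, 34.39%, 13.04%, 9.99%; run times 50.1 s, 16.34 s, 2.45 s, 3.83 s.]

Experimental results are averaged over 3 runs.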


Platform Compatibility (1)

Algorithm for Checking Availability of MMX

bool isMMXSupported()
{
    int fSupported;
    asm
    {
        mov eax,1             // CPUID level 1
        cpuid                 // EDX = feature flag
        and edx,0x800000      // test bit 23 of feature flag
        mov fSupported,edx    // != 0 if MMX is supported
    }
    if (fSupported != 0)
        return true;
    else
        return false;
}
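With GCC or Clang, the same check can be sketched without inline assembly by using the `__get_cpuid` helper from the compiler's `<cpuid.h>` header. The function names `has_mmx`, `has_sse`, and `has_sse2` and the helper `leaf1_edx_bit` are illustrative; the bit positions 23, 25, and 26 are the EDX feature bits of CPUID leaf 1.

```c
#include <assert.h>
#include <cpuid.h>    /* GCC/Clang wrapper around the CPUID instruction */
#include <stdbool.h>

/* Returns one feature bit from EDX of CPUID leaf 1. */
static bool leaf1_edx_bit(unsigned int bit)
{
    unsigned int eax, ebx, ecx, edx;
    assert(bit < 32);
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;             /* leaf 1 unavailable: report no feature */
    return ((edx >> bit) & 1u) != 0;
}

bool has_mmx(void)  { return leaf1_edx_bit(23); }
bool has_sse(void)  { return leaf1_edx_bit(25); }
bool has_sse2(void) { return leaf1_edx_bit(26); }
```

`__get_cpuid` returns 0 when the requested leaf is not supported, which is why the helper treats that case as "feature absent".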


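Routine selection (flowchart):

  • SSE available? Yes: run the SSE routine.

  • Otherwise, MMX available? Yes: run the MMX routine.

  • Otherwise: run the normal routine.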
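The selection logic can be sketched as a one-time choice of function pointer. Everything named here is hypothetical: `dist16_c`, `dist16_mmx`, and `dist16_sse` stand in for the three routines (the latter two simply call the plain C version so the example stays short), and the feature test uses the GCC/Clang builtin `__builtin_cpu_supports` instead of the presentation's inline assembly.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*dist_fn)(const unsigned char *p1, const unsigned char *p2);

/* Normal routine: plain C sum of absolute differences over 16 pixels. */
static int dist16_c(const unsigned char *p1, const unsigned char *p2)
{
    assert(p1 && p2);
    int s = 0;
    for (int i = 0; i < 16; i++) {
        int v = p1[i] - p2[i];
        s += (v < 0) ? -v : v;
    }
    return s;
}

/* Placeholders for the optimized routines (hypothetical). */
static int dist16_mmx(const unsigned char *p1, const unsigned char *p2)
{
    return dist16_c(p1, p2);
}

static int dist16_sse(const unsigned char *p1, const unsigned char *p2)
{
    return dist16_c(p1, p2);
}

/* Prefer SSE, fall back to MMX, then to the normal routine. */
dist_fn select_dist16(void)
{
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse"))
        return dist16_sse;
    if (__builtin_cpu_supports("mmx"))
        return dist16_mmx;
    return dist16_c;
}
```

Calling `select_dist16()` once at start-up and storing the result avoids repeating the CPU check in the inner loops.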

Platform Compatibility (2)

Algorithm for Checking Availability of SSE

bool isISSESupported()
{
    int processor;
    int features;
    int extfeatures = 0;
    asm
    {
        pusha
        mov eax,1
        cpuid
        mov processor,eax       // Store processor family/model/step
        mov features,edx        // Store features bits
        mov eax,080000000h
        cpuid                   // Check which extended functions can be called
        cmp eax,080000001h      // Extended Feature Bits
        jb nofeatures           // Jump if not supported
        mov eax,080000001h      // Select function 0x80000001
        cpuid
        mov extfeatures,edx     // Store extended features bits
    nofeatures:
        popa
    }
    if (((features >> 25) & 1) != 0)          // standard feature bit 25: SSE
        return true;
    else if (((extfeatures >> 22) & 1) != 0)  // extended feature bit 22: AMD extensions to MMX
        return true;
    else
        return false;
}


Thank you!
