Dongyue Mou and Zeng Xing

cujpeg

A Simple JPEG Encoder With CUDA Technology


Outline

  • JPEG Algorithm

  • Traditional Encoder

  • What's new in cujpeg

  • Benchmark

  • Conclusion


JPEG Algorithm

JPEG is a commonly used method for image compression. The JPEG encoding algorithm consists of 7 steps:

  • Divide image into 8x8 blocks

  • [R,G,B] to [Y,Cb,Cr] conversion

  • Downsampling (optional)

  • FDCT (Forward Discrete Cosine Transform)

  • Quantization

  • Serialization in zig-zag style

  • Entropy encoding (Run Length Coding & Huffman coding)


JPEG Algorithm -- Example

[Figure: an example image, shown whole and then divided into 8x8 blocks]

RGB vs. YCC

The human eye is less sensitive to a loss of color precision than to a loss of contour precision (contours are based on luminance).

Color space conversion makes use of this.

Simple color space model: [R,G,B] per pixel

JPEG uses [Y, Cb, Cr] Model

Y = Brightness

Cb = Color blueness

Cr = Color redness
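
The conversion can be sketched per pixel as below. This is a minimal host-side sketch, assuming the JFIF (BT.601 full-range) coefficients and a chroma offset of 128; the type `YCbCr` and the function `rgb_to_ycbcr` are hypothetical names used for illustration, not part of cujpeg.

```c
#include <assert.h>

typedef struct { float y, cb, cr; } YCbCr;

/* Convert one 8-bit RGB pixel to YCbCr using the JFIF (BT.601 full-range)
 * coefficients. Chroma is offset by 128 so that gray maps to Cb = Cr = 128. */
YCbCr rgb_to_ycbcr(int r, int g, int b)
{
    assert(r >= 0 && r <= 255 && g >= 0 && g <= 255 && b >= 0 && b <= 255);
    YCbCr p;
    p.y  =  0.299f    * r + 0.587f    * g + 0.114f    * b;
    p.cb = -0.168736f * r - 0.331264f * g + 0.5f      * b + 128.0f;
    p.cr =  0.5f      * r - 0.418688f * g - 0.081312f * b + 128.0f;
    return p;
}
```

The three Y weights sum to 1 and each chroma row sums to 0, so any gray pixel keeps its brightness in Y and lands on the neutral chroma value.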


Convert RGB to YCC

8x8 pixels, 1 pixel = 3 components

MCU with sampling factor (1, 1, 1)


Downsampling

Y is taken for every pixel, and Cb, Cr are taken once per block of 2x2 pixels.

4 blocks, 16x16 pixels

MCU (minimum coded unit): the smallest group of data units that is coded.

Data size is immediately reduced to half.

MCU with sampling factor (2, 1, 1)
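
A small sketch of why the data size halves. It assumes each 2x2 chroma group is replaced by its average, which is one common choice (the slides do not say how the single value is picked); `downsample_2x2` and `samples_per_region` are hypothetical helper names.

```c
#include <assert.h>

/* Average each 2x2 group of a 16x16 chroma plane into one sample,
 * producing an 8x8 block. Both planes are stored row by row. */
void downsample_2x2(const float *src, float *dst)
{
    assert(src && dst);
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++) {
            const float *p = src + (2 * y) * 16 + 2 * x;
            dst[y * 8 + x] = (p[0] + p[1] + p[16] + p[17]) * 0.25f;
        }
}

/* Number of samples stored for one 16x16 pixel region. */
int samples_per_region(int downsampled)
{
    int luma = 16 * 16;                          /* Y is kept for every pixel */
    int chroma = downsampled ? 8 * 8 : 16 * 16;  /* per chroma plane */
    return luma + 2 * chroma;
}
```

Without downsampling a 16x16 region holds 3 x 256 = 768 samples; with it, 256 + 2 x 64 = 384, which is exactly half.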


Apply FDCT

2D DCT: the bottleneck; the complexity of the algorithm is O(n^4)

1D DCT: the 2-D transform is equivalent to the 1-D transform applied in each direction

The kernel uses 1-D transforms


Apply FDCT

Shift operation: from [0, 255] to [-128, 127]

Meaning of each position in the DCT result matrix

DCT result


Quantization

Quantization matrix (adjustable according to quality)

DCT result

Quantization result
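
A minimal sketch of the quantization step. It assumes the quality scaling convention used by the IJG libjpeg library (quality 1 to 100, table entries clamped to 1..255), which may differ from what cujpeg does; `scale_quant_entry` and `quantize` are hypothetical names.

```c
#include <assert.h>

/* Scale one base quantization table entry for a quality setting,
 * following the IJG convention (an assumption, see above). */
int scale_quant_entry(int base, int quality)
{
    assert(quality >= 1 && quality <= 100);
    int scale = (quality < 50) ? 5000 / quality : 200 - 2 * quality;
    int q = (base * scale + 50) / 100;
    if (q < 1) q = 1;
    if (q > 255) q = 255;
    return q;
}

/* Divide a DCT coefficient by its table entry and round to the nearest integer. */
int quantize(float coeff, int q)
{
    assert(q > 0);
    float x = coeff / (float)q;
    return (int)(x >= 0.0f ? x + 0.5f : x - 0.5f);
}
```

At quality 50 the table is used unchanged; at quality 80 a base entry of 16 shrinks to 6, so fewer coefficients are rounded to zero and the file grows.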


Zigzag reordering / Run Length Coding

Quantization result

[Number of zeros before me, my value]
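
The reordering and the [zeros before me, my value] pairs can be sketched as below. This is a simplified sketch: it assumes a row-major 8x8 block, encodes only the 63 AC coefficients (index 1 onward), and leaves out the JPEG rule that splits runs longer than 15 zeros. All function names are hypothetical.

```c
#include <assert.h>

/* Fill order[] so that order[i] is the row-major index of the i-th
 * coefficient visited by the zigzag scan. */
void build_zigzag(int order[64])
{
    int i = 0;
    for (int s = 0; s <= 14; s++) {              /* s = row + col of a diagonal */
        for (int t = 0; t <= s; t++) {
            int row = (s % 2 == 0) ? s - t : t;  /* even diagonals run upward */
            int col = s - row;
            if (row < 8 && col < 8)
                order[i++] = row * 8 + col;
        }
    }
    assert(i == 64);
}

/* Gather a quantized block into zigzag order. */
void zigzag_reorder(const int *block, const int order[64], int *zz)
{
    for (int i = 0; i < 64; i++)
        zz[i] = block[order[i]];
}

/* Emit [run of zeros, value] pairs for the AC coefficients zz[1..63].
 * Returns the number of pairs; trailing zeros are covered by EOB. */
int run_length_encode(const int *zz, int pairs[][2])
{
    int last = 63;
    while (last > 0 && zz[last] == 0)
        last--;
    int n = 0, run = 0;
    for (int i = 1; i <= last; i++) {
        if (zz[i] == 0) { run++; continue; }
        pairs[n][0] = run;
        pairs[n][1] = zz[i];
        n++;
        run = 0;
    }
    return n;
}
```

Because quantization leaves most high-frequency coefficients at zero, the zigzag scan pushes them to the end of the sequence, where a single EOB marker replaces them all.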


Huffman encoding

RLC result:

[0, -3] [0, 12] [0, 3] ...... EOB

After group number added:

[0,2,00b] [0,4,1100b] [0,2,11b] ...... EOB

First Huffman coding (i.e. for [0,2,00b]):

[0, 2, 00b] => [100b, 00b]

(look up e.g. the AC chrominance table)

Total input: 512 bits

Output: 113 bits
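
How the group number and the trailing bits are derived from a value can be sketched as follows. This follows the usual JPEG convention for coefficient magnitudes; the names `group_number` and `additional_bits` are hypothetical, and the Huffman table lookup itself is not shown.

```c
#include <assert.h>

/* Group number (size category): the number of bits needed for |v|. */
int group_number(int v)
{
    int a = v < 0 ? -v : v;
    int size = 0;
    while (a != 0) { size++; a >>= 1; }
    return size;
}

/* Additional bits that follow the Huffman code: v itself when positive,
 * v + 2^size - 1 when negative (so -3 in group 2 becomes 00b). */
unsigned additional_bits(int v)
{
    int size = group_number(v);
    assert(size <= 15);
    return (unsigned)(v >= 0 ? v : v + (1 << size) - 1);
}
```

For the values in the RLC example, -3 gives group 2 with bits 00b and 12 gives group 4 with bits 1100b. Only the pair (run, group) is Huffman coded; the additional bits are appended unchanged.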


Traditional Encoder

CPU: Image -> Load image -> Color conversion -> DCT -> Quantization -> Zigzag Reorder -> Encoding -> .jpg


Algorithm Analysis

1x full 2D DCT scan: O(N^4)

8x row 1D DCT scan + 8x column 1D DCT scan: O(N^3)

8 threads can work in parallel
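
The two complexity figures can be checked by counting multiply-adds. This is a rough count that assumes direct (non-fast) transforms; a butterfly-style 1-D DCT needs fewer operations per vector. The function names are hypothetical.

```c
#include <assert.h>

/* Direct 2-D DCT of an n x n block: each of the n*n outputs
 * sums over all n*n inputs. */
long direct_2d_macs(int n)
{
    assert(n > 0);
    return (long)n * n * n * n;
}

/* Row-column method: 2n one-dimensional transforms, each a direct
 * 1-D DCT that costs n*n multiply-adds. */
long row_column_macs(int n)
{
    assert(n > 0);
    return 2L * n * n * n;
}

/* Share of the row-column work done by each of n cooperating threads:
 * one row transform plus one column transform. */
long macs_per_thread(int n)
{
    return row_column_macs(n) / n;
}
```

For N = 8 this gives 4096 multiply-adds for the full 2-D scan against 1024 for the row-column method, or 128 per thread when 8 threads share one block.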



DCT In Place

    // C_a ... C_f and C_norm are DCT constants whose definitions are not shown on this slide.
    __device__ void vectorDCTInPlace(float *Vect0, int Step);

    // 2-D DCT of one 8x8 block (64 floats, row-major) done by a single thread.
    __device__ void blockDCTInPlace(float *block)
    {
        for (int row = 0; row < 64; row += 8)
            vectorDCTInPlace(block + row, 1);   // rows: elements are adjacent
        for (int col = 0; col < 8; col++)
            vectorDCTInPlace(block + col, 8);   // columns: elements are 8 floats apart
    }

    // 8-point 1-D DCT, in place, over elements spaced Step floats apart.
    __device__ void vectorDCTInPlace(float *Vect0, int Step)
    {
        float *Vect1 = Vect0 + Step, *Vect2 = Vect1 + Step;
        float *Vect3 = Vect2 + Step, *Vect4 = Vect3 + Step;
        float *Vect5 = Vect4 + Step, *Vect6 = Vect5 + Step;
        float *Vect7 = Vect6 + Step;

        // Butterfly stage: sums and differences of mirrored elements.
        float X07P = (*Vect0) + (*Vect7);
        float X16P = (*Vect1) + (*Vect6);
        float X25P = (*Vect2) + (*Vect5);
        float X34P = (*Vect3) + (*Vect4);
        float X07M = (*Vect0) - (*Vect7);
        float X61M = (*Vect6) - (*Vect1);
        float X25M = (*Vect2) - (*Vect5);
        float X43M = (*Vect4) - (*Vect3);

        float X07P34PP = X07P + X34P;
        float X07P34PM = X07P - X34P;
        float X16P25PP = X16P + X25P;
        float X16P25PM = X16P - X25P;

        // Even-indexed outputs.
        (*Vect0) = C_norm * (X07P34PP + X16P25PP);
        (*Vect2) = C_norm * (C_b * X07P34PM + C_e * X16P25PM);
        (*Vect4) = C_norm * (X07P34PP - X16P25PP);
        (*Vect6) = C_norm * (C_e * X07P34PM - C_b * X16P25PM);

        // Odd-indexed outputs.
        (*Vect1) = C_norm * (C_a * X07M - C_c * X61M + C_d * X25M - C_f * X43M);
        (*Vect3) = C_norm * (C_c * X07M + C_f * X61M - C_a * X25M + C_d * X43M);
        (*Vect5) = C_norm * (C_d * X07M + C_a * X61M + C_f * X25M - C_c * X43M);
        (*Vect7) = C_norm * (C_f * X07M + C_d * X61M + C_c * X25M + C_a * X43M);
    }

    // Same transform with 8 cooperating threads: each thread handles one row, then one column.
    __device__ void parallelDCTInPlace(float *block)
    {
        int col = threadIdx.x % 8;
        int row = col * 8;
        __syncthreads();
        vectorDCTInPlace(block + row, 1);
        __syncthreads();   // all rows must be finished before any column pass starts
        vectorDCTInPlace(block + col, 8);
        __syncthreads();
    }


Allocation

Desktop PC

  • CPU: 1 P4 core, 3.0 GHz

  • RAM: 2 GB

Graphics Card

  • GPU: 16 cores at 575 MHz, 8 SPs per core at 1.35 GHz

  • RAM: 768 MB


Binding

Huffman encoding

  • many conditions/branches

  • intensive bit operations

  • less computing

Color conversion, DCT, Quantize

  • intensive computing

  • fewer conditions/branches


Binding

Hardware: 16 KB shared memory

Problem: 1 MCU contains 702 bytes of data

Result: at most 21 MCUs per CUDA block

Hardware: 512 threads

Problem: 1 MCU contains 3 blocks, 1 block needs 8 threads

Result: 1 MCU needs 24 threads

1 CUDA block = 504 threads
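
The sizing above can be reproduced with a few integer divisions. This sketch only restates the arithmetic from the slide; `mcus_per_block` is a hypothetical helper name.

```c
#include <assert.h>

int min_int(int a, int b) { return a < b ? a : b; }

/* MCUs that fit in one CUDA block under both hardware limits:
 * shared memory capacity and the maximum thread count. */
int mcus_per_block(int shared_bytes, int bytes_per_mcu,
                   int max_threads, int threads_per_mcu)
{
    assert(bytes_per_mcu > 0 && threads_per_mcu > 0);
    int by_memory  = shared_bytes / bytes_per_mcu;   /* 16384 / 702 = 23 */
    int by_threads = max_threads / threads_per_mcu;  /* 512 / 24 = 21 */
    return min_int(by_memory, by_threads);
}
```

With the numbers from the slide, shared memory alone would allow 23 MCUs, but the 512-thread limit allows only 21, so the thread limit is the binding one. 21 MCUs use 21 x 24 = 504 threads and 21 x 702 = 14742 bytes of shared memory.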


cujpeg Encoder

CPU: Image -> Load image

GPU: Color conversion -> DCT -> Quantization -> Zigzag Reorder

CPU: Encoding -> .jpg


cujpeg Encoder

Data path: Image (Host Memory) -> Texture Memory -> Shared Memory -> Global Memory (Result) -> Host Memory -> .jpg

Load image (CPU):

    cudaMallocArray(&textureCache, &channel, scanlineSize, imgHeight);
    cudaMemcpy2DToArray(textureCache, 0, 0,
                        image, imageStride, imageWidth, imageHeight,
                        cudaMemcpyHostToDevice);
    cudaBindTextureToArray(TexSrc, textureCache, channel);
    cudaMalloc((void **)(&ResultDevice), ResultSize);

Color Conversion (GPU):

    int b = tex2D(TexSrc, TexPosX++, TexPosY);
    int g = tex2D(TexSrc, TexPosX++, TexPosY);
    int r = tex2D(TexSrc, TexPosX+=6, TexPosY);

    // Level shift by -128 is folded in; +0.5 is for rounding.
    float y  =  0.299f * r + 0.587f * g + 0.114f * b - 128.0f + 0.5f;
    float cb = -0.168736f * r - 0.331264f * g + 0.500f * b + 0.5f;
    float cr =  0.500f * r - 0.418688f * g - 0.081312f * b + 0.5f;

    myDCTLine[Offset + i]       = y;
    myDCTLine[Offset + 64 + i]  = cb;
    myDCTLine[Offset + 128 + i] = cr;

In Place DCT (GPU)

Quantize / Reorder (GPU):

    for (int i = 0; i < BLOCK_WIDTH; i++)
        myDestBlock[myZLine[i]] = (int)(myDCTLine[i] * myDivQLine[i] + 0.5f);

Result (copied back to the CPU for Encoding):

    cudaMemcpy(ResultHost, ResultDevice, ResultSize, cudaMemcpyDeviceToHost);


Scheduling

Data stages (Y, Cb, Cr; x24 threads each): RGB Data -> YCC Block -> DCT Block -> Quantized/Reordered Data

For each MCU:

  • 24 threads

    • Convert 2 pixels

  • 8 threads

    • Convert the remaining 2 pixels

  • 24 threads

    • Do 1x row vector DCT

    • Do 1x column vector DCT

    • Quantize 8x scalar values



Benchmark

(Q = 80, Sample = 1:1:1)



Benchmark

Each thread has 240 operations

24 threads process 1 MCU

A 4096x4096 image includes 262144 MCUs.

Total ops: 262144 * 24 * 240 = 1509949440 flops

Speed: (Total ops) / 0.043 s = 35.12 Gflops
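
The throughput estimate can be reproduced as below. This sketch assumes 8x8 MCUs (sampling 1:1:1, as in the benchmark setting) and that 0.043 is the kernel time in seconds; the function names are hypothetical.

```c
#include <assert.h>

/* Total operations for an image split into 8x8 MCUs. */
long long total_ops(int width, int height, int threads_per_mcu, int ops_per_thread)
{
    long long mcus = ((long long)width / 8) * (height / 8);
    return mcus * threads_per_mcu * ops_per_thread;
}

/* Throughput in Gflops for a given operation count and run time. */
double gflops(long long ops, double seconds)
{
    assert(seconds > 0.0);
    return (double)ops / seconds / 1e9;
}
```

For a 4096x4096 image this gives 512 x 512 = 262144 MCUs, 1509949440 operations, and about 35.1 Gflops at 0.043 s.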


Conclusion

CUDA can obviously accelerate JPEG compression.

The overall performance

  • Depends on the system speed

  • More bandwidth

  • Better encoding routine

  • Support downsampling