CUDA
Presentation Transcript



Assignment

  • Subject: DES using CUDA

  • Deliverables: des.c, des.cu, report

  • Due: 12/14, nai0315@snu.ac.kr


Index

  • What is GPU?

  • Programming model and Simple Example

  • The Environment for CUDA programming

  • What is DES?


What’s in a GPU?

  • A GPU is a heterogeneous chip multi-processor (highly tuned for graphics)


Slimming down


Idea #1:

Remove components that help a single instruction stream run fast


Parallel execution

  • Two cores

  • Four cores

  • Sixteen cores: 16 simultaneous instruction streams

  • → Be able to share an instruction stream


SIMD processing

Idea #2:

Amortize cost/complexity of managing an instruction stream across many ALUs

16 cores × 8 ALUs per core = 128 ALUs



Throughput!

Idea #3:

Interleave processing of many fragments on a single core to avoid stalls caused by high latency operations


Summary: three key ideas of GPU

  • Use many “slimmed down cores” to run in parallel

  • Pack cores full of ALUs (by sharing instruction stream across groups of fragments)

  • Avoid latency stalls by interleaving execution of many groups of fragments

    • When one group stalls, work on another group


Programming Model

  • GPU is viewed as a compute device operating as a coprocessor to the main CPU (host)

    • Data-parallel, compute-intensive functions should be offloaded to the device

    • Functions that are executed many times, but independently on different data, are prime candidates

      • E.g., the body of a for-loop (see the sketch after this list)

    • A function compiled for the device is called a kernel

    • The kernel is executed on the device as many different threads

    • Both host (CPU) and device (GPU) manage their own memory, host memory and device memory
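
A minimal sketch of this model (the kernel name and data are illustrative, not from the slides): the body of an element-wise for-loop becomes a kernel, and each thread handles one independent iteration.

#include <cuda_runtime.h>

// Hypothetical kernel: each thread computes one iteration of
// "for (i = 0; i < n; i++) c[i] = a[i] + b[i];" on the device.
__global__ void addKernel(const float *a, const float *b, float *c, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
  if (i < n)                                      // guard: the grid may have extra threads
    c[i] = a[i] + b[i];
}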


Block and Thread Allocation

  • Blocks are assigned to SMs (Streaming Multiprocessors)

  • Threads are assigned to PEs (Processing Elements)

  • Each thread executes the kernel

  • Each block has a unique block ID

  • Each thread has a unique thread ID within the block (see the ID sketch after this list)

  • Warp: a group of up to 32 threads that share one instruction stream

  • GTX 280: 30 SMs

    • 1 SM: 8 SPs

    • 1 SM: 32 warps → 1,024 threads

    • Total threads: 30 × 1,024 = 30,720
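
A small sketch of how these IDs look in code (the kernel is hypothetical; the built-in variables are standard CUDA):

// Each thread derives a unique global index from its block ID and
// its thread ID within the block.
__global__ void whoAmI(int *out) {
  // blockIdx.x  = block ID within the grid
  // threadIdx.x = thread ID within the block
  // blockDim.x  = number of threads per block
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  out[tid] = tid;  // record the global index
}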


Memory model

  • Memory types (see the sketch after this list)

    • Registers (r/w per thread)

    • Local mem (r/w per thread)

    • Shared mem (r/w per block)

    • Global mem (r/w per kernel)

    • Constant mem (r per kernel)

  • Separate from CPU

    • CPU can access global and constant mem via PCIe bus
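
A minimal sketch of where these memory types appear in a kernel (assuming a single block of at most 256 threads; all names are illustrative):

__constant__ float coeff[16];  // constant memory: read-only from the kernel

__global__ void memKernel(float *gdata) {
  __shared__ float tile[256];              // shared memory: r/w per block
  float tmp = gdata[threadIdx.x];          // gdata is global memory; tmp lives in a register
  tile[threadIdx.x] = tmp * coeff[0];      // combine global, constant, and shared accesses
  __syncthreads();                         // make shared writes visible across the block
  gdata[threadIdx.x] = tile[threadIdx.x];  // write the result back to global memory
}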


Simple Example (C to CUDA conversion)

// __global__ indicates a GPU kernel that the CPU can call
__global__ void ForceCalcKernel(int nbodies, struct Body *body, ...) {}

__global__ void AdvancingKernel(int nbodies, struct Body *body, ...) {}

int main(...) {
  Body *body, *body1;  // separate address spaces need two pointers

  cudaMalloc((void**)&body1, sizeof(Body)*nbodies);  // allocate memory on the GPU

  cudaMemcpy(body1, body, sizeof(Body)*nbodies, cudaMemcpyHostToDevice);  // copy CPU data to the GPU

  for (timestep = ...) {
    // call the GPU kernels with 1 block and 1 thread per block
    ForceCalcKernel<<<1, 1>>>(nbodies, body1, ...);
    AdvancingKernel<<<1, 1>>>(nbodies, body1, ...);
  }

  cudaMemcpy(body, body1, sizeof(Body)*nbodies, cudaMemcpyDeviceToHost);  // copy GPU data back to the CPU

  cudaFree(body1);
}
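
The <<<1, 1>>> launch above uses a single thread for simplicity. A sketch of a more typical launch for the same example (the 256-thread block size is an assumption, not from the slides):

int threadsPerBlock = 256;  // illustrative block size
int blocks = (nbodies + threadsPerBlock - 1) / threadsPerBlock;  // round up so every body is covered
ForceCalcKernel<<<blocks, threadsPerBlock>>>(nbodies, body1, ...);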


Environment

  • The NVCC compiler

    • CUDA kernels are typically stored in files ending with .cu

    • NVCC uses the host compiler (CL/G++) to compile the CPU code

    • NVCC automatically handles #includes and linking (see the example command after this list)

  • You can download CUDA toolkit from:

    • http://developer.nvidia.com/cuda-downloads
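
A typical invocation for the assignment files named above (the flags are illustrative):

nvcc -O2 -o des des.cu   # compiles the .cu file, invokes the host compiler for the CPU code, and links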


What is DES?

  • The archetypal block cipher

    • An algorithm that takes a fixed-length string of plaintext bits and, through a series of complicated operations, transforms it into a ciphertext bitstring of the same length

    • The block size is 64 bits (see the kernel sketch below)
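
If the 64-bit blocks are encrypted independently (e.g., in ECB mode), the assignment maps naturally onto CUDA: one thread per block of plaintext. A skeleton sketch only (the kernel name and parameters are hypothetical, and the DES rounds themselves are elided):

#include <stdint.h>

// Hypothetical kernel skeleton: each thread encrypts one independent
// 64-bit block, so nblocks blocks of plaintext run in parallel.
__global__ void desEncryptKernel(const uint64_t *plain, uint64_t *cipher, int nblocks) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < nblocks) {
    uint64_t block = plain[i];
    // ... initial permutation, 16 Feistel rounds, final permutation ...
    cipher[i] = block;  // placeholder: the DES rounds above remain to be implemented
  }
}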

