Using Constant Memory

ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 28, 2013 ConstantMemTiming PowerPoint PPT Presentation



Using Constant Memory

These notes will introduce:

How to declare and use constant memory

Results of an experiment using constant memory

ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 28, 2013

ConstantMemTiming.ppt



Global memory, shared memory, and registers

[Figure: CUDA memory hierarchy. Labels: Host, Host memory, Grid, Block, Threads, Registers, Shared memory, Local memory, Global memory, Constant memory.]

Constant memory is used for storing global constants.

There is also a read-only global memory called texture memory.



Constant memory programming

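Constant memory is part of global memory but is much faster because it is cached (reads are fastest when all threads of a warp read the same address, which the constant cache can broadcast). It is limited to 64 KB (for all compute capabilities up to 3.5).

Declared statically at global (file) scope using the __constant__ qualifier, optionally together with __device__, so it has application scope.

Has the lifetime of the application, like global memory.

Read-only from a GPU kernel, i.e., it cannot be altered by a kernel (which is what enables caching to work).

Read and written from the host using cudaMemcpyFromSymbol() (read) and cudaMemcpyToSymbol() (write).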
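A minimal sketch of the steps above, using a hypothetical polynomial-evaluation example (the names `d_coeffs`, `poly_kernel`, and `evaluate_poly` are illustrative, not from these notes). The coefficients are declared at file scope with `__constant__`, written from the host with `cudaMemcpyToSymbol()`, only read by the kernel, and read back with `cudaMemcpyFromSymbol()`. At each step every thread reads the same coefficient, which is the access pattern the constant cache handles best.

```cuda
#include <cassert>
#include <cuda_runtime.h>

#define NUM_COEFFS 4

// File-scope constant memory: lifetime of the application, read-only in kernels.
__constant__ float d_coeffs[NUM_COEFFS];

// Evaluates c0 + c1*x + c2*x^2 + c3*x^3 with Horner's rule.
// All threads read the same d_coeffs[k] at the same time.
__global__ void poly_kernel(const float *x, float *y, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        float acc = d_coeffs[NUM_COEFFS - 1];
        for (int k = NUM_COEFFS - 2; k >= 0; k--)
            acc = acc * x[tid] + d_coeffs[k];
        y[tid] = acc;
    }
}

// Hypothetical host helper: copies the coefficients to constant memory, runs
// the kernel, copies the results back, and reads the constants back as well.
// Returns the first CUDA error encountered (cudaSuccess if none).
cudaError_t evaluate_poly(const float *h_x, float *h_y, int n,
                          const float h_c[NUM_COEFFS], float h_c_back[NUM_COEFFS]) {
    assert(n > 0);
    float *d_x = nullptr, *d_y = nullptr;
    size_t bytes = n * sizeof(float);
    cudaError_t err = cudaMalloc(&d_x, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&d_y, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
    // Host write to constant memory (offset argument 0 given explicitly).
    if (err == cudaSuccess)
        err = cudaMemcpyToSymbol(d_coeffs, h_c, NUM_COEFFS * sizeof(float), 0,
                                 cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        poly_kernel<<<blocks, threads>>>(d_x, d_y, n);
        err = cudaGetLastError();
    }
    // cudaMemcpy on the default stream waits for the kernel to finish.
    if (err == cudaSuccess) err = cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);
    // Host read of constant memory.
    if (err == cudaSuccess)
        err = cudaMemcpyFromSymbol(h_c_back, d_coeffs, NUM_COEFFS * sizeof(float), 0,
                                   cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_y);
    return err;
}
```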



Sample Code and Experimental Results

The test program simply adds two vectors, A and B, to produce a third vector, C.

One version uses constant memory for A and B.

Another version uses regular global memory for A and B.

Note that the maximum constant memory available on the GPU (for all compute capabilities so far) is 64 KB in total.
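The 64 KB limit is what fixes N = 8192 in the code that follows: two int vectors of 8192 elements take 2 × 8192 × 4 = 65,536 bytes, exactly 64 KB. A sketch of checking this at compile time and at run time, assuming a hypothetical helper `fits_in_constant_memory`; `cudaDeviceProp::totalConstMem` is the constant-memory size the runtime reports for a device.

```cuda
#include <cassert>
#include <cstddef>
#include <cuda_runtime.h>

#define N 8192
#define CONST_MEM_LIMIT (64u * 1024u)  // 64 KB, as stated in the notes

// Compile-time check: two int vectors of N elements must fit in 64 KB.
static_assert(2 * N * sizeof(int) <= CONST_MEM_LIMIT,
              "two vectors of N ints do not fit in constant memory");

// Hypothetical helper: run-time check against what the device reports.
bool fits_in_constant_memory(int device, size_t bytes) {
    assert(device >= 0);
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess)
        return false;
    return bytes <= prop.totalConstMem;
}
```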



Code

Array declarations

#define N 8192 // max size allowed for two int vectors in const. mem (2 x 8192 x 4 bytes = 64 KB)

// Constants held in constant memory
__device__ __constant__ int dev_a_Cont[N];
__device__ __constant__ int dev_b_Cont[N];

// regular global memory for comparison
__device__ int dev_a[N];
__device__ int dev_b[N];

// result in device global memory
__device__ int dev_c[N];



// kernel routines
__global__ void add_Cont() { // using constant memory
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N) {
        dev_c[tid] = dev_a_Cont[tid] + dev_b_Cont[tid];
    }
}

__global__ void add() { // not using constant memory
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N) {
        dev_c[tid] = dev_a[tid] + dev_b[tid];
    }
}



/*----------- GPU using constant memory ------------------------*/
// Fragment of main(); assumes host arrays int a[N], b[N], c[N], launch sizes
// B (blocks) and T (threads per block), cudaEvent_t start and stop (already
// created), and float elapsed_time_Cont are declared elsewhere.
printf("GPU using constant memory\n");
for (int i = 0; i < N; i++) { // load arrays with some numbers
    a[i] = i;
    b[i] = i * 2;
}
// copy vectors to constant memory
cudaMemcpyToSymbol(dev_a_Cont, a, N * sizeof(int), 0, cudaMemcpyHostToDevice);
cudaMemcpyToSymbol(dev_b_Cont, b, N * sizeof(int), 0, cudaMemcpyHostToDevice);
cudaEventRecord(start, 0); // start time
add_Cont<<<B, T>>>(); // does not need array ptrs
cudaDeviceSynchronize(); // wait for all threads to complete (cudaThreadSynchronize() is deprecated)
cudaEventRecord(stop, 0); // end time
// Pass the symbol itself, not a string name: string symbols were removed in CUDA 5.0
cudaMemcpyFromSymbol(a, dev_a_Cont, N * sizeof(int), 0, cudaMemcpyDeviceToHost);
cudaMemcpyFromSymbol(b, dev_b_Cont, N * sizeof(int), 0, cudaMemcpyDeviceToHost);
cudaMemcpyFromSymbol(c, dev_c, N * sizeof(int), 0, cudaMemcpyDeviceToHost);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsed_time_Cont, start, stop);

Watch for the zero (the offset argument) in the cudaMemcpyToSymbol() and cudaMemcpyFromSymbol() calls. I missed it off and it took some time to spot.

Missed originally
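A short sketch of why the offset matters, assuming a hypothetical 8-element constant array `dev_v` and helper names `load_in_halves` and `missing_offset`. In the C++ runtime API, `cudaMemcpyToSymbol()` takes `(symbol, src, count, offset, kind)`, with `offset` (in bytes) defaulting to 0 and `kind` defaulting to `cudaMemcpyHostToDevice`. Dropping the `0` while keeping `cudaMemcpyHostToDevice` still compiles, because the enum converts to `size_t`: it becomes a 1-byte offset, so a full-size copy runs past the end of the symbol and the call is expected to return an error that goes unnoticed unless the return value is checked.

```cuda
#include <cassert>
#include <cuda_runtime.h>

#define LEN 8

__constant__ int dev_v[LEN];

// Hypothetical helper: fills dev_v in two halves using a nonzero byte offset
// for the second half, then reads the whole array back.
cudaError_t load_in_halves(const int *h_src, int *h_dst) {
    const size_t half = (LEN / 2) * sizeof(int);  // bytes in half the array
    cudaError_t err = cudaMemcpyToSymbol(dev_v, h_src, half, 0,
                                         cudaMemcpyHostToDevice);
    if (err != cudaSuccess) return err;
    // Second half: source pointer advances by LEN/2 ints, offset by `half` bytes.
    err = cudaMemcpyToSymbol(dev_v, h_src + LEN / 2, half, half,
                             cudaMemcpyHostToDevice);
    if (err != cudaSuccess) return err;
    return cudaMemcpyFromSymbol(h_dst, dev_v, LEN * sizeof(int), 0,
                                cudaMemcpyDeviceToHost);
}

// The mistake from the note: the 0 is missing, so cudaMemcpyHostToDevice (1)
// is taken as the offset and the copy would overrun dev_v; expected to fail.
cudaError_t missing_offset(const int *h_src) {
    return cudaMemcpyToSymbol(dev_v, h_src, LEN * sizeof(int),
                              cudaMemcpyHostToDevice);
}
```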



/*----------- GPU not using constant memory ------------------------*/
printf("GPU not using constant memory\n");
for (int i = 0; i < N; i++) { // load arrays with some numbers
    a[i] = i;
    b[i] = i * 2;
}
// copy vectors to regular global memory (dev_a, dev_b), which add() reads
cudaMemcpyToSymbol(dev_a, a, N * sizeof(int), 0, cudaMemcpyHostToDevice);
cudaMemcpyToSymbol(dev_b, b, N * sizeof(int), 0, cudaMemcpyHostToDevice);
cudaEventRecord(start, 0); // start time
add<<<B, T>>>(); // does not need array ptrs
cudaDeviceSynchronize(); // wait for all threads to complete
cudaEventRecord(stop, 0); // end time
cudaMemcpyFromSymbol(a, dev_a, N * sizeof(int), 0, cudaMemcpyDeviceToHost);
cudaMemcpyFromSymbol(b, dev_b, N * sizeof(int), 0, cudaMemcpyDeviceToHost);
cudaMemcpyFromSymbol(c, dev_c, N * sizeof(int), 0, cudaMemcpyDeviceToHost);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsed_time, start, stop);
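For reference, a self-contained sketch that assembles the declarations, kernels, and both timing fragments into one program, with the global-memory version copying A and B into dev_a and dev_b. The helper names (`launch`, `time_kernel`, `check_c`, `run_experiment`), the block size of 256, clearing C between runs, the single untimed warm-up launch, and averaging over `reps` launches are all assumptions, not part of the original experiment; a warm-up launch is a common way to keep one-time start-up costs out of kernel timings.

```cuda
#include <cassert>
#include <cuda_runtime.h>

#define N 8192  // two int vectors of N elements = 64 KB of constant memory

__constant__ int dev_a_Cont[N];  // A and B in constant memory
__constant__ int dev_b_Cont[N];
__device__ int dev_a[N];         // A and B in regular global memory
__device__ int dev_b[N];
__device__ int dev_c[N];         // result C

__global__ void add_Cont() {  // using constant memory
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N) dev_c[tid] = dev_a_Cont[tid] + dev_b_Cont[tid];
}

__global__ void add() {  // not using constant memory
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N) dev_c[tid] = dev_a[tid] + dev_b[tid];
}

// Hypothetical helper: launches one of the two kernels.
static void launch(bool use_const, int blocks, int threads) {
    if (use_const) add_Cont<<<blocks, threads>>>();
    else add<<<blocks, threads>>>();
}

// Hypothetical helper: one untimed warm-up launch, then the mean time in ms
// over `reps` timed launches, measured with CUDA events.
static float time_kernel(bool use_const, int blocks, int threads, int reps) {
    assert(reps > 0);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    launch(use_const, blocks, threads);  // warm-up (assumption)
    cudaDeviceSynchronize();
    cudaEventRecord(start, 0);
    for (int r = 0; r < reps; r++) launch(use_const, blocks, threads);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // wait until all timed launches finish
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / reps;
}

// Hypothetical helper: copies C back and checks C[i] == A[i] + B[i] == 3i.
static bool check_c() {
    static int c[N];
    if (cudaMemcpyFromSymbol(c, dev_c, N * sizeof(int), 0,
                             cudaMemcpyDeviceToHost) != cudaSuccess)
        return false;
    for (int i = 0; i < N; i++)
        if (c[i] != 3 * i) return false;
    return true;
}

// Hypothetical driver: loads A[i] = i and B[i] = 2i into both memories, times
// each kernel, and verifies each result. Returns false on any failure.
bool run_experiment(int reps, float *ms_const, float *ms_global) {
    static int a[N], b[N], zeros[N];  // static arrays are zero-initialized
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = i * 2; }
    const size_t bytes = N * sizeof(int);
    cudaMemcpyToSymbol(dev_a_Cont, a, bytes, 0, cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(dev_b_Cont, b, bytes, 0, cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(dev_a, a, bytes, 0, cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(dev_b, b, bytes, 0, cudaMemcpyHostToDevice);
    const int T = 256, B = (N + T - 1) / T;  // assumed launch configuration
    cudaMemcpyToSymbol(dev_c, zeros, bytes, 0, cudaMemcpyHostToDevice);
    *ms_const = time_kernel(true, B, T, reps);
    if (!check_c()) return false;
    cudaMemcpyToSymbol(dev_c, zeros, bytes, 0, cudaMemcpyHostToDevice);
    *ms_global = time_kernel(false, B, T, reps);
    if (!check_c()) return false;
    return cudaGetLastError() == cudaSuccess;
}
```

Dividing `ms_global` by `ms_const` gives a speedup figure of the same kind as those reported next.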



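Speedup of the constant-memory version over the global-memory version:

1st launch: 1.6

2nd run: 1.217

3rd run: 1.225

The speedup is around 1.2 (20%) after the first launch. No explanation was found for why the speedup is larger on the first launch.

The textbook reports a 50% improvement with a ray-tracing application.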
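The speedup figures can be read as the ratio of the two measured kernel times, with t_global from elapsed_time and t_const from elapsed_time_Cont in the code; that the slides computed them exactly this way is an assumption:

```latex
S = \frac{t_{\text{global}}}{t_{\text{const}}}, \qquad
\text{improvement} = (S - 1) \times 100\%
```

For example, S = 1.217 corresponds to a 21.7% improvement, S = 1.6 to 60%, and the textbook's 50% improvement to S = 1.5.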



Questions

