Lab Assignment #2 Data Parallel Reduction

Farhad Parsan

Data Parallel Reduction
  • Sum reduction kernel (with thread divergence)
Data Parallel Reduction
  • Sum reduction kernel (without thread divergence)
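The difference between the two kernels can be modeled on the host. The sketch below assumes the divergent kernel uses the common textbook predicate `t % (2*stride) == 0` (the slide's own kernel is not in this transcript) and that the non-divergent kernel uses `t < stride`. It counts, over all iterations, the warps in which only some of the 32 threads take the branch. The names `mixedWarps` and `totalMixedWarps` are hypothetical.

```cpp
#include <cassert>

const unsigned int kWarpSize = 32;  // threads per warp on NVIDIA GPUs

// Counts warps in which some, but not all, threads take the branch.
// 'interleaved' selects the assumed divergent predicate t % (2*stride) == 0;
// otherwise the contiguous predicate t < stride is used.
int mixedWarps(unsigned int n, unsigned int stride, bool interleaved) {
    int mixed = 0;
    for (unsigned int first = 0; first < n; first += kWarpSize) {
        unsigned int active = 0;
        for (unsigned int t = first; t < first + kWarpSize; ++t)
            active += interleaved ? (t % (2 * stride) == 0) : (t < stride);
        if (active != 0 && active != kWarpSize) ++mixed;  // warp diverges
    }
    return mixed;
}

// Sums the mixed-warp count over all log2(n) reduction iterations.
int totalMixedWarps(unsigned int n, bool interleaved) {
    assert(n >= kWarpSize && (n & (n - 1)) == 0);  // power of two, whole warps
    int total = 0;
    for (unsigned int stride = 1; stride < n; stride <<= 1)
        total += mixedWarps(n, stride, interleaved);
    return total;
}
```

With one thread per element and 512 elements, this model reports 95 mixed warps for the interleaved predicate and 5 for the contiguous one. The contiguous scheme only diverges once fewer than 32 threads remain active.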
Host Code

float computeOnDevice(float* h_data, int num_elements)
{
    int size = num_elements * sizeof(float);
    float result;
    float* hd_data;

    // 1. Allocate device memory and copy the input to it
    cudaMalloc((void**) &hd_data, size);
    cudaMemcpy(hd_data, h_data, size, cudaMemcpyHostToDevice);

    // 2. Kernel invocation: a single block with one thread per element
    dim3 dimBlock(num_elements, 1);
    dim3 dimGrid(1, 1);
    reduction<<<dimGrid, dimBlock>>>(hd_data);

    // 3. Copy the data back; the kernel leaves the sum in element 0
    cudaMemcpy(h_data, hd_data, size, cudaMemcpyDeviceToHost);
    result = h_data[0];

    // Free device memory
    cudaFree(hd_data);
    return result;
}

Device Code

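#define NUM_ELEMENTS 512

__global__ void reduction(float *hd_data)
{
    __shared__ float partialSum[NUM_ELEMENTS];

    // Each thread loads one element into shared memory
    unsigned int t = threadIdx.x;
    partialSum[t] = hd_data[t];

    // Halve the stride each iteration; the active threads stay contiguous (t < stride)
    for (unsigned int stride = blockDim.x >> 1; stride > 0; stride >>= 1)
    {
        __syncthreads();
        if (t < stride)
            partialSum[t] += partialSum[t + stride];
    }

    // Element 0 now holds the sum of all elements
    hd_data[t] = partialSum[t];
}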
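To check the kernel's result without a GPU, the same tree reduction can be replayed on the host. This is a minimal sketch, assuming the stride starts at half the element count and halves down to 1. `reduceOnHost` is a hypothetical helper, not part of the assignment. The inner loop stands in for the threads of one block, and running it sequentially is safe because each step reads only indices in [stride, 2*stride) and writes only indices in [0, stride).

```cpp
#include <cassert>
#include <vector>

// Replays the kernel's reduction loop on a copy of the input.
float reduceOnHost(std::vector<float> partialSum) {
    const unsigned int n = static_cast<unsigned int>(partialSum.size());
    assert(n >= 1 && (n & (n - 1)) == 0);  // the scheme needs a power of two

    for (unsigned int stride = n >> 1; stride > 0; stride >>= 1) {
        // Each value of t plays the role of one GPU thread with t < stride.
        for (unsigned int t = 0; t < stride; ++t)
            partialSum[t] += partialSum[t + stride];
    }
    return partialSum[0];  // the total ends up in element 0
}
```

Comparing `reduceOnHost(input)` with the value returned by `computeOnDevice` is one way to validate the device code; for inputs that are not small integers the two may differ by floating-point rounding.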

Question
  • How many times does your thread block synchronize to reduce the array of 512 elements to a single value?

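Number of synchronizations = Number of reduction iterations

= Log2N, because the kernel calls __syncthreads() once per iteration and reducing N values pairwise takes Log2N halving steps

Assuming N = 512 => Number of synchronizations = 9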
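The count can be checked by replaying only the loop header on the host. This sketch assumes the stride starts at N/2 and halves until it reaches 1, with one `__syncthreads()` per iteration. `countSynchronizations` is a hypothetical name.

```cpp
#include <cassert>

// Counts the iterations of the reduction loop, one barrier per iteration.
int countSynchronizations(unsigned int n) {
    assert(n >= 2 && (n & (n - 1)) == 0);  // power-of-two element count
    int syncs = 0;
    for (unsigned int stride = n >> 1; stride > 0; stride >>= 1)
        ++syncs;
    return syncs;
}
```

Under these assumptions `countSynchronizations(512)` returns 9, since the stride takes the values 256, 128, 64, 32, 16, 8, 4, 2, 1.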

Question
  • What is the minimum, maximum, and average number of "real" operations that a thread will perform? "Real" operations are those that directly contribute to the final reduction value.

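Maximum : Thread 0 adds in every iteration = Log2N

if N = 512 => Maximum = 9

Minimum : 0, for threads N/2 through N − 1, which never pass the t < stride test when there is one thread per element

Average : [ 1 + 2 + 4 + … + (N/2) ] / N = (N − 1) / N

if N = 512 => Average = 511/512 = 0.998 ≈ 1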
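These figures can be tallied by counting the additions each thread performs. This is a sketch under two assumptions: one thread per element, and the contiguous `t < stride` predicate with the stride halving from N/2 down to 1. `OpStats` and `countAdditions` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct OpStats {
    int minOps;    // fewest additions performed by any thread
    int maxOps;    // most additions performed by any thread
    int totalOps;  // additions summed over all threads
};

OpStats countAdditions(unsigned int n) {
    assert(n >= 2 && (n & (n - 1)) == 0);  // power-of-two element count
    std::vector<int> ops(n, 0);            // one counter per thread

    for (unsigned int stride = n >> 1; stride > 0; stride >>= 1)
        for (unsigned int t = 0; t < stride; ++t)
            ++ops[t];  // thread t performs one "real" addition

    OpStats s{ops[0], ops[0], 0};
    for (int c : ops) {
        s.minOps = std::min(s.minOps, c);
        s.maxOps = std::max(s.maxOps, c);
        s.totalOps += c;
    }
    return s;
}
```

For N = 512 this gives a minimum of 0, a maximum of 9, and 511 additions in total, so the average is 511/512 per thread.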