Lab Assignment #2 Data Parallel Reduction

Lab Assignment #2 Data Parallel Reduction Farhad Parsan

Data Parallel Reduction • Sum reduction kernel (with thread divergence)

Data Parallel Reduction • Sum reduction kernel (without thread divergence)

Host Code

float computeOnDevice(float* h_data, int num_elements)
{
    int size = num_elements * sizeof(float);
    float result;
    float* hd_data;

    // 1. Allocate device memory and load the input
    cudaMalloc((void**) &hd_data, size);
    cudaMemcpy(hd_data, h_data, size, cudaMemcpyHostToDevice);

    // 2. Kernel invocation: one block, one thread per element
    dim3 dimBlock(num_elements, 1);
    dim3 dimGrid(1, 1);
    reduction<<<dimGrid, dimBlock>>>(hd_data);

    // 3. Store result: the sum ends up in element 0
    cudaMemcpy(h_data, hd_data, size, cudaMemcpyDeviceToHost);
    result = h_data[0];

    // Free device memory
    cudaFree(hd_data);
    return result;
}

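Device Code

#define NUM_ELEMENTS 512

__global__ void reduction(float *hd_data)
{
    __shared__ float partialSum[NUM_ELEMENTS];

    // Each thread loads one element into shared memory
    unsigned int t = threadIdx.x;
    partialSum[t] = hd_data[t];

    // Start at half the block size so t + stride stays in bounds,
    // and halve the stride until the last pair has been added
    for (unsigned int stride = blockDim.x >> 1; stride > 0; stride >>= 1)
    {
        __syncthreads();  // make the previous step's sums visible
        if (t < stride)
            partialSum[t] += partialSum[t + stride];
    }

    // Element 0 now holds the total
    hd_data[t] = partialSum[t];
}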
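To check a device result without a GPU, the stride loop can be emulated sequentially on the host. The sketch below is only an illustration: `emulateReduction` is a hypothetical helper, not part of the lab code. It assumes a power-of-two input length and a loop that starts at half the array length and halves the stride until it reaches zero.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side helper (not part of the lab code).
// Emulates the non-divergent stride loop: at each step the first `stride`
// "threads" add the element `stride` positions to their right.
float emulateReduction(std::vector<float> partialSum)
{
    const std::size_t n = partialSum.size();
    assert(n >= 1 && (n & (n - 1)) == 0);  // assumption: power-of-two length

    for (std::size_t stride = n >> 1; stride > 0; stride >>= 1)
    {
        // On the GPU these additions run in parallel after __syncthreads();
        // here they run one after another, which gives the same result
        // because each step reads only elements that the step does not write.
        for (std::size_t t = 0; t < stride; ++t)
            partialSum[t] += partialSum[t + stride];
    }
    return partialSum[0];  // the total ends up in element 0
}
```

Calling `emulateReduction` on the same input that is passed to `computeOnDevice` gives a value to compare against; for general floating-point data the comparison should allow a small tolerance.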

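Question • How many times does your thread block synchronize to reduce the array of 512 elements to a single value? Each iteration halves the number of partial sums, so reducing N elements to a single value takes log2 N iterations, with one __syncthreads() per iteration. Number of synchronizations = Number of reduction iterations = log2 N. Assuming N = 512 => Number of synchronizations = 9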
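The iteration count can be checked by walking through the strides on the host. This is a minimal sketch, with `countReductionSyncs` as a hypothetical helper name. It assumes one `__syncthreads()` per loop iteration and a stride that starts at N/2 and halves until it reaches zero.

```cpp
#include <cassert>

// Hypothetical helper (not part of the lab code).
// Counts the iterations of the stride loop for n elements, which equals the
// number of __syncthreads() calls if the barrier is called once per iteration.
unsigned int countReductionSyncs(unsigned int n)
{
    assert(n >= 2 && (n & (n - 1)) == 0);  // assumption: power-of-two size

    unsigned int syncs = 0;
    for (unsigned int stride = n >> 1; stride > 0; stride >>= 1)
        ++syncs;  // one barrier per reduction step
    return syncs;
}
```

For 512 elements the strides are 256, 128, 64, 32, 16, 8, 4, 2, 1, so the helper returns 9.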

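Question • What is the minimum, maximum, and average number of "real" operations that a thread will perform? "Real" operations are those that directly contribute to the final reduction value.
Counting additions as the real operations, with one thread per element (N threads):
Maximum : Thread 0 adds in every iteration = log2 N; if N = 512 => Maximum = 9
Minimum : Threads with t ≥ N/2 never satisfy t < stride, so they perform 0 additions (they only load their element)
Average : [ 1 + 2 + 4 + … + (N/2) ] / N = (N − 1) / N; if N = 512 => Average = 511/512 = 0.998 ≈ 1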
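These counts can be reproduced by tallying, per thread, how often the condition `t < stride` holds. The sketch below is illustrative only: `OpStats` and `countRealOps` are hypothetical names. It assumes one thread per element, a power-of-two N, and that only additions count as real operations.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical summary type (not part of the lab code).
struct OpStats
{
    unsigned int minOps;
    unsigned int maxOps;
    double average;
};

// Hypothetical helper: tallies the additions each of n threads performs in
// the non-divergent stride loop (thread t adds whenever t < stride).
OpStats countRealOps(unsigned int n)
{
    assert(n >= 2 && (n & (n - 1)) == 0);  // assumption: power-of-two size

    std::vector<unsigned int> adds(n, 0);
    for (unsigned int stride = n >> 1; stride > 0; stride >>= 1)
        for (unsigned int t = 0; t < stride; ++t)
            ++adds[t];  // thread t performs one addition in this iteration

    unsigned long total = 0;
    for (unsigned int a : adds)
        total += a;

    OpStats stats;
    stats.minOps = *std::min_element(adds.begin(), adds.end());
    stats.maxOps = *std::max_element(adds.begin(), adds.end());
    stats.average = static_cast<double>(total) / n;
    return stats;
}
```

For N = 512 the tally is 511 additions in total. Thread 0 performs 9 of them, the upper half of the threads perform none, and the average is 511/512.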
