Automated Dynamic Analysis of CUDA Programs Michael Boyer, Kevin Skadron*, and Westley Weimer University of Virginia {boyer,skadron,weimer}@cs.virginia.edu * currently on sabbatical with NVIDIA Research
Outline • GPGPU • CUDA • Automated analyses • Correctness: race conditions • Performance: bank conflicts • Preliminary results • Future work • Conclusion
Why GPGPU? From: NVIDIA CUDA Programming Guide, Version 1.1
CPU vs. GPU Design [Figure: the CPU design minimizes single-thread latency; the GPU design maximizes aggregate throughput] From: NVIDIA CUDA Programming Guide, Version 1.1
GPGPU Programming • Traditional approach: graphics APIs • ATI/AMD: Close-to-the-Metal (CTM) • NVIDIA: Compute Unified Device Architecture (CUDA)
CUDA: Abstractions • Kernel functions • Scratchpad memory • Barrier synchronization
CUDA: Example Program

__host__ void example(int *cpu_mem) {
  int *gpu_mem;  // device copy of the buffer
  cudaMalloc((void **)&gpu_mem, mem_size);
  cudaMemcpy(gpu_mem, cpu_mem, mem_size, cudaMemcpyHostToDevice);
  kernel<<<grid, threads, mem_size>>>(gpu_mem);
  cudaMemcpy(cpu_mem, gpu_mem, mem_size, cudaMemcpyDeviceToHost);
}

__global__ void kernel(int *mem) {
  int thread_id = threadIdx.x;
  mem[thread_id] = thread_id;
}
CUDA: Hardware [Diagram: the GPU comprises multiprocessors 1…N sharing global device memory; each multiprocessor contains processing elements 1…M, per-element registers, a shared instruction unit, and per-block shared memory (PBSM)]
Outline • GPGPU • CUDA • Automated analyses • Correctness: race conditions • Performance: bank conflicts • Preliminary results • Future work • Conclusion
Race Conditions • Ordering of instructions among multiple threads is arbitrary • Relaxed memory consistency model • Synchronization: __syncthreads() • Barrier / memory fence
Race Conditions: Example

1  extern __shared__ int s[];
2
3  __global__ void kernel(int *out) {
4    int id = threadIdx.x;
5    int nt = blockDim.x;
6
7    s[id] = id;
8    out[id] = s[(id + 1) % nt];
9  }

[Diagram: each thread writes s[id] on line 7, then reads its neighbor's entry s[(id + 1) % nt] on line 8; without synchronization between the two lines, a thread may read before its neighbor has written]
Automatic Instrumentation [Flow: Original CUDA Source Code → (compile) → Intermediate Representation → (instrumentation) → Instrumented CUDA Source Code → (execute) → Output: Race Conditions Detected?]
Race Condition Instrumentation • Two global bookkeeping arrays: • Reads & writes of all threads • Two per-thread bookkeeping arrays: • Reads & writes of a single thread • After each shared memory access: • Update bookkeeping arrays • Detect & report race conditions
Race Condition Detection

Original code → RAW hazard at expression: #line 8 out[id] = s[(id + 1) % nt];

Add synchronization between lines 7 and 8 → No race conditions detected
Outline • GPGPU • CUDA • Automated analyses • Correctness: race conditions • Performance: bank conflicts • Preliminary results • Future work • Conclusion
Bank Conflicts • PBSM is fast • Much faster than global memory • Potentially as fast as register access • …assuming no bank conflicts • Bank conflicts cause serialized access
Non-Conflicting Access Patterns [Diagram: threads 0–7 and banks 0–7; with stride 1 and with stride 3, each thread maps to a distinct bank]
Conflicting Access Patterns [Diagram: threads 0–7 and banks 0–7; with stride 4, multiple threads collide on the same banks; with stride 16, every thread maps to a single bank]
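The conflict degree of a stride follows directly from the bank mapping: on compute capability 1.x hardware, successive 32-bit words are interleaved across 16 banks, so word index j falls in bank j mod 16. A quick host-side tally (my own illustration, not from the deck) shows the worst-case conflict for the strides pictured above:

#include <stdio.h>

#define NUM_BANKS 16   /* 32-bit words interleaved across 16 banks (cc 1.x) */
#define HALF_WARP 16   /* conflicts are resolved per half-warp */

int main(void) {
  int strides[] = { 1, 3, 4, 16 };
  for (int s = 0; s < 4; s++) {
    int accesses[NUM_BANKS] = { 0 };
    /* thread t touches word t * stride, which lives in bank (t * stride) % 16 */
    for (int t = 0; t < HALF_WARP; t++)
      accesses[(t * strides[s]) % NUM_BANKS]++;
    int worst = 0;
    for (int b = 0; b < NUM_BANKS; b++)
      if (accesses[b] > worst) worst = accesses[b];
    printf("stride %2d -> worst-case %2d-way conflict\n", strides[s], worst);
  }
  return 0;
}

Strides with gcd(stride, 16) = 1 touch all 16 banks exactly once, which is why odd strides are conflict-free while strides 4 and 16 serialize accesses 4 and 16 ways.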
Automatic Instrumentation [Flow: Original CUDA Source Code → (compile) → Intermediate Representation → (instrumentation) → Instrumented CUDA Source Code → (execute) → Output: Race Conditions Detected? Bank Conflicts Detected?]
Bank Conflict Instrumentation • Global bookkeeping array: • Tracks address accessed by each thread • After each PBSM access: • Each thread updates its entry • One thread computes and reports bank conflicts
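A minimal sketch of that per-access check (my own reconstruction with invented names such as mem_addrs and NUM_BANKS, not the tool's generated code; as elsewhere in the deck, the printf report assumes execution in emulation mode):

#include <cstdio>

#define NUM_BANKS 16           // compute capability 1.x
#define THREADS_PER_BLOCK 16

__global__ void kernel(int stride) {
  __shared__ int mem[THREADS_PER_BLOCK * NUM_BANKS];
  __shared__ int mem_addrs[THREADS_PER_BLOCK];  // bookkeeping: index per thread

  int tid = threadIdx.x;
  int j = tid * stride;

  mem[j] = tid;                // the instrumented PBSM access
  mem_addrs[tid] = j;          // instrumentation: each thread records its index
  __syncthreads();

  if (tid == 0) {              // instrumentation: one thread tallies the banks
    int accesses[NUM_BANKS] = { 0 };
    for (int t = 0; t < blockDim.x; t++)
      accesses[mem_addrs[t] % NUM_BANKS]++;
    for (int b = 0; b < NUM_BANKS; b++)
      if (accesses[b] > 1)
        printf("Bank %d: %d accesses (conflict)\n", b, accesses[b]);
  }
}

Launched as kernel<<<1, THREADS_PER_BLOCK>>>(16), this reports a 16-way conflict on bank 0; with a stride of 1 it stays silent.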
Bank Conflict Detection

CAUSE_BANK_CONFLICTS = true →
  Bank conflicts at: #line 14 mem[j]++
  Bank:     0  1  2  3  4  5  6  7  8  9 …
  Accesses: 16 0  0  0  0  0  0  0  0  0 …

CAUSE_BANK_CONFLICTS = false →
  No bank conflicts at: #line 14 mem[j]++
Preliminary Results • Scan • Included in CUDA SDK • All-prefix sums operation • 400 lines of code • Explicitly prevents race conditions and bank conflicts
Preliminary Results:Race Condition Detection • Original code: • No race conditions detected • Remove any synchronization calls: • Race conditions detected
Preliminary Results:Bank Conflict Detection • Original code: • Small number of minor bank conflicts • Enable bank conflict avoidance macro: • Bank conflicts increased! • Confirmed by manual analysis • Culprit: incorrect emulation mode
Instrumentation Overhead • Two sources: • Emulation • Instrumentation • Assumption: for debugging, programmers will already use emulation mode
Future Work • Find more types of bugs • Correctness: array bounds checking • Performance: memory coalescing • Reduce instrumentation overhead • Execute instrumented code natively
Conclusion • GPGPU: enormous performance potential • But parallel programming is challenging • Automated instrumentation can help • Find synchronization bugs • Identify inefficient memory accesses • And more…
Questions? Instrumentation tool will be available at: http://www.cs.virginia.edu/~mwb7w/cuda
Domain Mapping From: NVIDIA CUDA Programming Guide, Version 1.1
Coalesced Accesses From: NVIDIA CUDA Programming Guide, Version 1.1
Non-Coalesced Accesses From: NVIDIA CUDA Programming Guide, Version 1.1
Race Condition Detection Algorithm • A thread t knows a race condition exists at shared memory location m if: • Location m has been read from and written to • One of the accesses to m came from t • One of the accesses to m came from a thread other than t • Note that we are only checking for RAW and WAR hazards
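A compact sketch of that predicate (hypothetical names: reads/writes stand for the global bookkeeping arrays and my_reads/my_writes for thread t's per-thread arrays from the instrumentation slide):

// Returns nonzero if thread t's bookkeeping reveals a race at shared location m.
int race_at(int m, const char *reads, const char *writes,
            const char *my_reads, const char *my_writes) {
  if (!(reads[m] && writes[m]))        // m must be both read from and written to
    return 0;
  if (!(my_reads[m] || my_writes[m]))  // one access must come from t itself
    return 0;
  // ... and one from a thread other than t (only RAW and WAR hazards checked)
  return (reads[m] && !my_reads[m]) || (writes[m] && !my_writes[m]);
}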
Bank Conflicts: Example

extern __shared__ int mem[];

__global__ void kernel(int iters) {
  int min, stride, max, id = threadIdx.x;
  if (CAUSE_BANK_CONFLICTS) {
    // set stride to cause bank conflicts (example values: every thread's
    // index is a multiple of 16, so all accesses fall in bank 0)
    min = id * 16;  stride = 16;  max = min + 16;
  } else {
    // set stride to avoid bank conflicts (example values: thread i
    // stays in bank i % 16, distinct per thread of a half-warp)
    min = id;  stride = 1;  max = min + 1;
  }
  for (int i = 0; i < iters; i++)
    for (int j = min; j < max; j += stride)
      mem[j]++;
}
Instrumented Code Example

Original code:

  extern __shared__ int s[];

  __global__ void kernel() {
    int id = threadIdx.x;
    int nt = blockDim.x * blockDim.y * blockDim.z;
    s[id] = id;
    int temp = s[(nt+id-1) % nt];
  }

After instrumentation:

  extern __shared__ int s[];
  __global__ void kernel(void);
  void kernel(void) {
    // Instrumentation code
    int block_size = blockDim.x * blockDim.y * blockDim.z;
    int thread_id = threadIdx.x + (threadIdx.y * blockDim.x) +
                    (threadIdx.z * blockDim.x * blockDim.y);
    __shared__ char mem_reads[PUT_ARRAY_SIZE_HERE];
    __shared__ char mem_writes[PUT_ARRAY_SIZE_HERE];
    if (thread_id == 0) {
      for (int i = 0; i < block_size; i++) {
        mem_reads[i] = 0;
        mem_writes[i] = 0;
      }
    }
    __syncthreads();
    char hazard = 0;
    int id;
    int nt;
    int temp;
    {
      id = (int)threadIdx.x;
      nt = (int)((blockDim.x * blockDim.y) * blockDim.z);
      //#line 9
      s[id] = id;
      // Instrumentation code
      mem_writes[id] = 1;
      __syncthreads();
      if (thread_id == 0) {
        for (int i = 0; i < block_size; i++) {
          if (mem_reads[i] && mem_writes[i]) { hazard = 1; break; }
        }
        if (hazard)
          printf("WAR hazard at expression: #line 9 s[id] = id;\n");
        hazard = 0;
      }
      //#line 10
      temp = s[((nt + id) - 1) % nt];
      // Instrumentation code
      mem_reads[((nt + id) - 1) % nt] = 1;
      __syncthreads();
      if (thread_id == 0) {
        for (int i = 0; i < block_size; i++) {
          if (mem_reads[i] && mem_writes[i]) { hazard = 1; break; }
        }
        if (hazard)
          printf("RAW hazard at expression: #line 10 temp = s[((nt + id) - 1) %% nt];\n");
        hazard = 0;
      }
      //#line 11
      return;
    }
  }

Output:

  RAW hazard at expression: #line 10 temp = s[((nt + id) - 1) % nt];