Automated Dynamic Analysis of CUDA Programs Michael Boyer, Kevin Skadron*, and Westley Weimer University of Virginia {boyer,skadron,weimer}@cs.virginia.edu * currently on sabbatical with NVIDIA Research
Outline • GPGPU • CUDA • Automated analyses • Correctness: race conditions • Performance: bank conflicts • Preliminary results • Future work • Conclusion
Why GPGPU? From: NVIDIA CUDA Programming Guide, Version 1.1
CPU vs. GPU Design [Figure: the CPU design minimizes single-thread latency; the GPU design maximizes aggregate throughput] From: NVIDIA CUDA Programming Guide, Version 1.1
GPGPU Programming • Traditional approach: graphics APIs • ATI/AMD: Close-to-the-Metal (CTM) • NVIDIA: Compute Unified Device Architecture (CUDA)
CUDA: Abstractions • Kernel functions • Scratchpad memory • Barrier synchronization
CUDA: Example Program

__host__ void example(int *cpu_mem) {
  int *gpu_mem;  // device copy of the buffer
  cudaMalloc((void **)&gpu_mem, mem_size);
  cudaMemcpy(gpu_mem, cpu_mem, mem_size, cudaMemcpyHostToDevice);
  kernel<<<grid, threads, mem_size>>>(gpu_mem);
  cudaMemcpy(cpu_mem, gpu_mem, mem_size, cudaMemcpyDeviceToHost);
}

__global__ void kernel(int *mem) {
  int thread_id = threadIdx.x;
  mem[thread_id] = thread_id;
}
CUDA: Hardware [Diagram: the GPU comprises multiprocessors 1…N sharing global device memory; each multiprocessor contains processing elements 1…M, per-element registers, a shared instruction unit, and per-block shared memory (PBSM)]
Outline • GPGPU • CUDA • Automated analyses • Correctness: race conditions • Performance: bank conflicts • Preliminary results • Future work • Conclusion
Race Conditions • Ordering of instructions among multiple threads is arbitrary • Relaxed memory consistency model • Synchronization: __syncthreads() • Barrier / memory fence
Race Conditions: Example

1  extern __shared__ int s[];
2
3  __global__ void kernel(int *out) {
4    int id = threadIdx.x;
5    int nt = blockDim.x;
6
7    s[id] = id;
8    out[id] = s[(id + 1) % nt];
9  }

[Diagram: each thread writes s[id] on line 7, then reads its neighbor's entry s[(id + 1) % nt] on line 8; without synchronization between the two lines, a thread may read before its neighbor has written]
Automatic Instrumentation [Flow: Original CUDA Source Code → (compile) → Intermediate Representation → (instrumentation) → Instrumented CUDA Source Code → (execute) → Output: Race Conditions Detected?]
Race Condition Instrumentation • Two global bookkeeping arrays: • Reads & writes of all threads • Two per-thread bookkeeping arrays: • Reads & writes of a single thread • After each shared memory access: • Update bookkeeping arrays • Detect & report race conditions
Race Condition Detection

Original code → RAW hazard at expression: #line 8 out[id] = s[(id + 1) % nt];

Add synchronization between lines 7 and 8 → No race conditions detected
Outline • GPGPU • CUDA • Automated analyses • Correctness: race conditions • Performance: bank conflicts • Preliminary results • Future work • Conclusion
Bank Conflicts • PBSM is fast • Much faster than global memory • Potentially as fast as register access • …assuming no bank conflicts • Bank conflicts cause serialized access
Non-Conflicting Access Patterns [Diagram: threads 0–7 and banks 0–7; with stride 1 and with stride 3, each thread maps to a distinct bank]
Conflicting Access Patterns [Diagram: threads 0–7 and banks 0–7; with stride 4, multiple threads collide on the same banks; with stride 16, every thread maps to a single bank]
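The conflict degree of a stride follows directly from the bank mapping: on compute capability 1.x hardware, successive 32-bit words are interleaved across 16 banks, so word index j falls in bank j mod 16. A quick host-side tally (my own illustration, not from the deck) shows the worst-case conflict for the strides pictured above:

#include <stdio.h>

#define NUM_BANKS 16   /* 32-bit words interleaved across 16 banks (cc 1.x) */
#define HALF_WARP 16   /* conflicts are resolved per half-warp */

int main(void) {
  int strides[] = { 1, 3, 4, 16 };
  for (int s = 0; s < 4; s++) {
    int accesses[NUM_BANKS] = { 0 };
    /* thread t touches word t * stride, which lives in bank (t * stride) % 16 */
    for (int t = 0; t < HALF_WARP; t++)
      accesses[(t * strides[s]) % NUM_BANKS]++;
    int worst = 0;
    for (int b = 0; b < NUM_BANKS; b++)
      if (accesses[b] > worst) worst = accesses[b];
    printf("stride %2d -> worst-case %2d-way conflict\n", strides[s], worst);
  }
  return 0;
}

Strides with gcd(stride, 16) = 1 touch all 16 banks exactly once, which is why odd strides are conflict-free while strides 4 and 16 serialize accesses 4 and 16 ways.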
Automatic Instrumentation [Flow: Original CUDA Source Code → (compile) → Intermediate Representation → (instrumentation) → Instrumented CUDA Source Code → (execute) → Output: Race Conditions Detected? Bank Conflicts Detected?]
Bank Conflict Instrumentation • Global bookkeeping array: • Tracks address accessed by each thread • After each PBSM access: • Each thread updates its entry • One thread computes and reports bank conflicts
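A minimal sketch of that per-access check (my own reconstruction with invented names such as mem_addrs and NUM_BANKS, not the tool's generated code; as elsewhere in the deck, the printf report assumes execution in emulation mode):

#include <cstdio>

#define NUM_BANKS 16           // compute capability 1.x
#define THREADS_PER_BLOCK 16

__global__ void kernel(int stride) {
  __shared__ int mem[THREADS_PER_BLOCK * NUM_BANKS];
  __shared__ int mem_addrs[THREADS_PER_BLOCK];  // bookkeeping: index per thread

  int tid = threadIdx.x;
  int j = tid * stride;

  mem[j] = tid;                // the instrumented PBSM access
  mem_addrs[tid] = j;          // instrumentation: each thread records its index
  __syncthreads();

  if (tid == 0) {              // instrumentation: one thread tallies the banks
    int accesses[NUM_BANKS] = { 0 };
    for (int t = 0; t < blockDim.x; t++)
      accesses[mem_addrs[t] % NUM_BANKS]++;
    for (int b = 0; b < NUM_BANKS; b++)
      if (accesses[b] > 1)
        printf("Bank %d: %d accesses (conflict)\n", b, accesses[b]);
  }
}

Launched as kernel<<<1, THREADS_PER_BLOCK>>>(16), this reports a 16-way conflict on bank 0; with a stride of 1 it stays silent.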
Bank Conflict Detection

CAUSE_BANK_CONFLICTS = true →
  Bank conflicts at: #line 14 mem[j]++
  Bank:     0  1  2  3  4  5  6  7  8  9 …
  Accesses: 16 0  0  0  0  0  0  0  0  0 …

CAUSE_BANK_CONFLICTS = false →
  No bank conflicts at: #line 14 mem[j]++
Preliminary Results • Scan • Included in CUDA SDK • All-prefix sums operation • 400 lines of code • Explicitly prevents race conditions and bank conflicts
Preliminary Results:Race Condition Detection • Original code: • No race conditions detected • Remove any synchronization calls: • Race conditions detected
Preliminary Results:Bank Conflict Detection • Original code: • Small number of minor bank conflicts • Enable bank conflict avoidance macro: • Bank conflicts increased! • Confirmed by manual analysis • Culprit: incorrect emulation mode
Instrumentation Overhead • Two sources: • Emulation • Instrumentation • Assumption: for debugging, programmers will already use emulation mode
Future Work • Find more types of bugs • Correctness: array bounds checking • Performance: memory coalescing • Reduce instrumentation overhead • Execute instrumented code natively
Conclusion • GPGPU: enormous performance potential • But parallel programming is challenging • Automated instrumentation can help • Find synchronization bugs • Identify inefficient memory accesses • And more…
Questions? Instrumentation tool will be available at: http://www.cs.virginia.edu/~mwb7w/cuda
Domain Mapping From: NVIDIA CUDA Programming Guide, Version 1.1
Coalesced Accesses From: NVIDIA CUDA Programming Guide, Version 1.1
Non-Coalesced Accesses From: NVIDIA CUDA Programming Guide, Version 1.1
Race Condition Detection Algorithm • A thread t knows a race condition exists at shared memory location m if: • Location m has been read from and written to • One of the accesses to m came from t • One of the accesses to m came from a thread other than t • Note that we are only checking for RAW and WAR hazards
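A compact sketch of that predicate (hypothetical names: reads/writes stand for the global bookkeeping arrays and my_reads/my_writes for thread t's per-thread arrays from the instrumentation slide):

// Returns nonzero if thread t's bookkeeping reveals a race at shared location m.
int race_at(int m, const char *reads, const char *writes,
            const char *my_reads, const char *my_writes) {
  if (!(reads[m] && writes[m]))        // m must be both read from and written to
    return 0;
  if (!(my_reads[m] || my_writes[m]))  // one access must come from t itself
    return 0;
  // ... and one from a thread other than t (only RAW and WAR hazards checked)
  return (reads[m] && !my_reads[m]) || (writes[m] && !my_writes[m]);
}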
Bank Conflicts: Example

extern __shared__ int mem[];

__global__ void kernel(int iters) {
  int min, stride, max, id = threadIdx.x;
  if (CAUSE_BANK_CONFLICTS) {
    // set stride to cause bank conflicts (example values: every thread's
    // index is a multiple of 16, so all accesses fall in bank 0)
    min = id * 16;  stride = 16;  max = min + 16;
  } else {
    // set stride to avoid bank conflicts (example values: thread i
    // stays in bank i % 16, distinct per thread of a half-warp)
    min = id;  stride = 1;  max = min + 1;
  }
  for (int i = 0; i < iters; i++)
    for (int j = min; j < max; j += stride)
      mem[j]++;
}
Instrumented Code Example

Original code:

  extern __shared__ int s[];

  __global__ void kernel() {
    int id = threadIdx.x;
    int nt = blockDim.x * blockDim.y * blockDim.z;
    s[id] = id;
    int temp = s[(nt+id-1) % nt];
  }

After instrumentation:

  extern __shared__ int s[];
  __global__ void kernel(void);
  void kernel(void) {
    // Instrumentation code
    int block_size = blockDim.x * blockDim.y * blockDim.z;
    int thread_id = threadIdx.x + (threadIdx.y * blockDim.x) +
                    (threadIdx.z * blockDim.x * blockDim.y);
    __shared__ char mem_reads[PUT_ARRAY_SIZE_HERE];
    __shared__ char mem_writes[PUT_ARRAY_SIZE_HERE];
    if (thread_id == 0) {
      for (int i = 0; i < block_size; i++) {
        mem_reads[i] = 0;
        mem_writes[i] = 0;
      }
    }
    __syncthreads();
    char hazard = 0;
    int id;
    int nt;
    int temp;
    {
      id = (int)threadIdx.x;
      nt = (int)((blockDim.x * blockDim.y) * blockDim.z);
      //#line 9
      s[id] = id;
      // Instrumentation code
      mem_writes[id] = 1;
      __syncthreads();
      if (thread_id == 0) {
        for (int i = 0; i < block_size; i++) {
          if (mem_reads[i] && mem_writes[i]) { hazard = 1; break; }
        }
        if (hazard)
          printf("WAR hazard at expression: #line 9 s[id] = id;\n");
        hazard = 0;
      }
      //#line 10
      temp = s[((nt + id) - 1) % nt];
      // Instrumentation code
      mem_reads[((nt + id) - 1) % nt] = 1;
      __syncthreads();
      if (thread_id == 0) {
        for (int i = 0; i < block_size; i++) {
          if (mem_reads[i] && mem_writes[i]) { hazard = 1; break; }
        }
        if (hazard)
          printf("RAW hazard at expression: #line 10 temp = s[((nt + id) - 1) %% nt];\n");
        hazard = 0;
      }
      //#line 11
      return;
    }
  }

Output:

  RAW hazard at expression: #line 10 temp = s[((nt + id) - 1) % nt];