Peng Li , Guodong Li, and Ganesh Gopalakrishnan { peterlee , ligd , ganesh }@cs.utah

Parametric FlowsAutomated Behavior Equivalencingfor Symbolic Analysis of Racesin CUDA Programs Peng Li,Guodong Li, and Ganesh Gopalakrishnan {peterlee, ligd, ganesh}@cs.utah.edu School of Computing, University of Utah, Salt Lake City, UT 84112, USA

GPU-based Computing (courtesy of Intel) (courtesy of Microsoft) (courtesy of NVidia) (courtesy of AMD) CUDA OpenCL C++ AMP C/C++ Titan [AMD+NVidiaKepler] is ranked 1st in the latest top 500! Various of GPU Programming models exist 2

CUDA programs harbor insidious bugs! • Data Races • Caused by unsynchronized accesses tid = 1 tid = 2 … = a[tid] a[tid-1] = … • Can produce unpredictable results • Compilers can misbehave if given code with races • Deadlocks and other problems 3

CUDA Thread + Memory Organization Thread Warp Block Grid 4

Illustration of Race tid0 1 63 A ... t63: read A[0] __global__ void inc_gpu(int*A, intb, int N) { unsigned tid = threadIdx.x; A[tid] = A[(tid+1)% 64] + b; } t0: write A[0] RACE! t0 t63 5

Illustration of Deadlock t0 t1 t2 t3 tid %2 == 0 true false __syncthreads() t0 t2 t1 t3

Debugging CUDA Programs is hard!

Why Hard? t0 t1 t2 t3 t4 … E1 … E2 En … Read (Addr=10) Write (Addr=10) …

Why Hard? • Traditional Methods • bugs only w.r.t. current platforms + inputs + schedules • Formal Methods • bugs analyzed w.r.t. future / different platforms (PORTING ISSUE!) • all relevant inputs • all relevant schedules

Solution to relevant inputs: symbolic execution X X = x<3 x>=3 X < 3 X < 10 x>=3 & x>=10 x>=3 & x<10 Example Test Case 1 : x = 2 Example Test Case 2 : x = 3 Example Test Case 3 : x = 11 Path 1 : x < 3 Path 2 : 3 <= x < 10 Path 3 : x >= 10 Constraint Solver

Solution to relevant schedules: representative interleaving

Solution to relevant schedules: representative interleaving __device__ int d[64]; __global void foo(int *d) { 1: __shared__ int a[64]; 2: int tid= threadIdx.x; 3: a[tid] = d[tid]; 4: __syncthreads(); 5: a[tid]++; 6: if (tid % 2 == 0) { 7: a[tid] = a[tid]+2;8: } else { 9: a[tid] = a[tid%32];10:}11:__syncthreads(); }

Solution to relevant schedules: representative interleaving __device__ int d[64]; __global void foo(int *d) { 1: __shared__ int a[64]; 2: int tid= threadIdx.x; 3: a[tid] = d[tid]; 4: __syncthreads(); 5: a[tid]++; 6: if (tid % 2 == 0) { 7: a[tid] = a[tid]+2;8: } else { 9: a[tid] = a[tid%32];10:}11:__syncthreads(); } Barrier Barrier Interval Barrier

Solution to relevant schedules: representative interleaving __device__ int d[64]; __global void foo(int *d) { 1: __shared__ int a[64]; 2: int tid= threadIdx.x; 3: a[tid] = d[tid]; 4: __syncthreads(); 5: a[tid]++; 6: if (tid % 2 == 0) { 7: a[tid] = a[tid]+2;8: } else { 9: a[tid] = a[tid%32];10:}11:__syncthreads(); } t1 t2 t29 t30 t31 t0 Barrier Barrier Interval Barrier

Solution to relevant schedules: representative interleaving __device__ int d[64]; __global void foo(int *d) { 1: __shared__ int a[64]; 2: int tid= threadIdx.x; 3: a[tid] = d[tid]; 4: __syncthreads(); 5: a[tid]++; 6: if (tid % 2 == 0) { 7: a[tid] = a[tid]+2;8: } else { 9: a[tid] = a[tid] + 1;10:}11:__syncthreads(); } t1 t2 t29 t30 t31 t0 Barrier Barrier Interval t0 t2 … t30 Barrier

Solution to relevant schedules: representative interleaving __device__ int d[64]; __global void foo(int *d) { 1: __shared__ int a[64]; 2: int tid= threadIdx.x; 3: a[tid] = d[tid]; 4: __syncthreads(); 5: a[tid]++; 6: if (tid % 2 == 0) { 7: a[tid] = a[tid]+2;8: } else { 9: a[tid] = a[tid]+1;10:}11:__syncthreads(); } t1 t2 t29 t30 t31 t0 Barrier Barrier Interval t0 t2 … t30 t1 t3 … t31 Barrier

Solution to relevant schedules: representative interleaving SIMD-Aware Canonical Schedule __device__ int d[64]; __global void foo(int *d) { 1: __shared__ int a[64]; 2: int tid= threadIdx.x; 3: a[tid] = d[tid]; 4: __syncthreads(); 5: a[tid]++; 6: if (tid % 2 == 0) { 7: a[tid] = a[tid]+2;8: } else { 9: a[tid] = a[tid%32];10:}11:__syncthreads(); } t1 t2 t29 t30 t31 t33 t34 t61 t62 t63 t0 t32 Barrier t0 t2 … t30 t32 t34 … t62 t1 t3 … t31 t33 t35 … t63 Barrier

Solution to relevant schedules: representative interleaving SIMD-Aware Canonical Schedule __device__ int d[64]; __global void foo(int *d) { 1: __shared__ int a[64]; 2: int tid= threadIdx.x; 3: a[tid] = d[tid]; 4: __syncthreads(); 5: a[tid]++; 6: if (tid % 2 == 0) { 7: a[tid] = a[tid]+2;8: } else { 9: a[tid] = a[tid%32];10:}11:__syncthreads(); } t1 t2 t29 t30 t31 t33 t34 t61 t62 t63 t0 t32 Barrier t0 t2 … t30 t32 t34 … t62 t1 t3 … t31 t33 t35 … t63 Barrier Result in PPoPP’ 12: Guarantee to find races !! Around 16K pairs

Evolution of Formal Analysis Tools for CUDA in our group • Previous tool : GKLEE [PPoPP’12] • complete • does not scale, because every thread (e.g. 20K or more) explicitly modeled • This paper [SC’12] : GKLEEp • complete (in practice) • scales to 20k threads or more..

GKLEEp’s Flow • Data races • Deadlocks • Concrete test inputs • Bank conflicts • Warp divergences • Non-coalesced • Test Cases • Provide high coverage • Can be run on HW C++ CUDA Programs with Symbolic Variable Declarations Error Monitors LLVM byte-code instructions LLVM-GCC Symbolic Analyzer and Scheduler

Key Contributions • Parametric flows are the control-flow equivalence classes of threads that diverge in the same manner • GKLEEp found bugs missed by GKLEE (GKLEEp scales!) • GKLEE: upto 2K threads • GKLEEp: well beyond 20K threads • GKLEEp finds all races (except in contrived programs)

Key Idea: Branching on TDC (Thread-ID Dependent Conditional) __global void foo(int *d) { 1: __shared__ int a[64]; 2: int tid= threadIdx.x; 3: a[tid] = d[tid]; 4: __syncthreads(); 5: a[tid]++; 6: if (tid % 2 == 0) { 7: a[tid] = a[tid]+2;8: }else { 9: a[tid] = a[tid%32];10:}11: __syncthreads(); } Barrier Barrier

A Motivating Example • __shared__ unsigned b[2048]; • __global__ void test(unsigned * a) { • 1: unsigned tid = threadIdx.x; • 2: int x, y; • 3: if (tid < 1024) { • 4: b[tid] = a[tid] + 1; • 5: if (tid % 2 != 0) { • 6: b[tid] = 2; • 7: } else { • 8: if (tid > 0) • 9: b[tid] = b[tid-1]+1; • 10: if (x < y) … • 11: } • 12: } • 13: } else { • 14: b[tid] = b[tid-1]; • 15: } • }

A Motivating Example • __shared__ unsigned b[2048]; • __global__ void test(unsigned * a) { • 1: unsigned tid = threadIdx.x; • 2: int x, y; • 3: if (tid < 1024) { <<== TDC • 4: b[tid] = a[tid] + 1; • 5: if (tid % 2 != 0) { <<== TDC • 6: b[tid] = 2; • 7: } else { • 8: if (tid > 0){ <<== TDC • 9: b[tid] = b[tid-1]+1; • 10: if (x < y) … << == Not TDC • 11: } • 12: } • 13: } else { • 14: b[tid] = b[tid-1]; • 15: } • }

A Motivating Example tid < 1024 tid %2 != 0 tid >= 1024 tid == 0 b[tid] = b[tid-1]; 4 Parametric Flows tid < 1024 b[tid] = a[tid] + 1; tid %2 != 0 tid > 0 b[tid] = 2; tid % 2 == 0 Parametric Flow Tree b[tid] = b[tid-1]+1 tid > 0

Correctness of GKLEEp • No False Alarms • guaranteed - because of exact symbolic constraint solving!! • No Omissions • "no omissions" true in practice Details in paper!!

SDK Kernel Example: Symbolic race checking __global__ void histogram64Kernel(unsigned *d_Result, unsigned *d_Data, intdataN) { constintthreadPos = ((threadIdx.x & (~63)) >> 0) | ((threadIdx.x & 15) << 2) | ((threadIdx.x & 48) >> 4); ... __syncthreads(); for (intpos = IMUL(blockIdx.x, blockDim.x) + threadIdx.x; pos < dataN; pos += IMUL(blockDim.x, gridDim.x)) { unsigned data4 = d_Data[pos]; addData64(s_Hist, threadPos, (data4 >> 2) & 0x3FU); ... } __syncthreads(); ... } __device__ void addData64(unsigned char *s_Hist, intthreadPos, unsigned intdata) { s_Hist[threadPos + IMUL(data, THREAD_N)]++; } t1 t2 threadPos = … threadPos = … data = (data4>>2) & 0x3FU data = (data4>>2) & 0x3FU s_Hist[threadPos + data*THREAD_N]++; s_Hist[threadPos + data*THREAD_N]++;

SDK Kernel Example: Symbolic race checking t1 t2 RW set: t1: writes s_Hist((((t1 & (~63)) >> 0) | ((t1 & 15) << 2) | ((t1 & 48) >> 4)) + ((d_Data[t1] >> 26) & 0x3FU) * 32), … t2: writes s_Hist((((t2 & (~63)) >> 0) | ((t2 & 15) << 2) | ((t2 & 48) >> 4)) + ((d_Data[t2] >> 26) & 0x3FU) * 32), … threadPos = … threadPos = … data = (data4>>2) & 0x3FU data = (data4>>2) & 0x3FU ? s_Hist[threadPos + data*THREAD_N]++; s_Hist[threadPos + data*THREAD_N]++; t1,t2,d_Data: (t1 t2)  (((t1& (~63)) >> 0) | ((t1& 15) << 2) | ((t1& 48) >> 4)) + ((d_Data[t1] >> 2) & 0x3FU) * 32) == ((((t2& (~63)) >> 0) | ((t2& 15) << 2) | ((t2 & 48) >> 4)) + ((d_Data[t2]>> 2) & 0x3FU) * 32) Satisfiable! There is a race!!

SDK Kernel Example: race checking t1 t2 RW set: t1: writes s_Hist((((t1 & (~63)) >> 0) | ((t1 & 15) << 2) | ((t1 & 48) >> 4)) + ((d_Data[t1] >> 26) & 0x3FU) * 64), … t2: writes s_Hist((((t2 & (~63)) >> 0) | ((t2 & 15) << 2) | ((t2 & 48) >> 4)) + ((d_Data[t2] >> 26) & 0x3FU) * 64), … threadPos = … threadPos = … data = (data4>>2) & 0x3FU data = (data4>>2) & 0x3FU GKLEEpindicates that these two addresses are equalwhen t1 = 23, t2 = 31, d_data[23]= 0xfcfcfcfc, and d_data[31] = 0xf4f4f4f4 indicating a Write-Write race s_Hist[threadPos + data*THREAD_N]++; s_Hist[threadPos + data*THREAD_N]++;

Evaluation Timed Out!

Evaluation

GKLEEp in practice • Accepts host program with many kernel calls • Each kernel can be ~1K LOC, e.g., eigenvalues • Finds races as well as inputs causing them

Evaluation TABLE I SDK 2.0 KERNEL RESULTS. WE SET 7200 SECONDS AS THE THRESHOLD FOR TIME OUT (ABBREVIATED AS T.O.). A/B , A is the tool runtime (in seconds) and B is the number of control flow paths

Related formal methods based work: compare with other formal tools • [M.Zheng et al, PPoPP’11]: • Combination of static analysis and dynamic analysis • [A. Leung et al, PLDI’12]: • A single dynamic run can be used to learn much more information about a CUDA program’s behavior • [A. Betts et al, SPLASH’12]: • Two threads abstraction • Found errors in real SDK kernels GKLEEp scales more and finds races in real kernels!

Conclusion • New formal approach for analyzing CUDA kernels • Employs a “parametric” reasoning style which capitalizes on thread symmetry • Scales to over 10^5 threads on realistic CUDA programs • Finds races missed by • Traditional testing • Previous formal approaches • Tool will be released soon – check website http://www.cs.utah.edu/fv/GKLEE

Thanks! Questions?

Extra Slides • How to pick symbolic inputs? • taint analyzer being developed • help pick inputs that matter and make symbolic • Loops invariant • Static analysis to avoid loop unrolling

A Motivating Example • __global__ void test(unsigned * a) { • 1: unsigned bid = blockIdx.x; • 2: unsigned tid = threadIdx.x; • 3: • 4: if (bid % 2 != 0) { • 5: if (tid < 1024) { • 6: unsigned idx = bid * blockDim.x + tid; • 7: b[tid]= a[idx] + 1; • 8: if (tid % 2 != 0) { • 9: b[tid] = 2; • 10: } else { • 11: if (tid > 0) • 12: b[tid] = b[tid-1]+1; • 13: } • 14: } else { • 15: b[tid] = b[tid-1]; • 16: } • 17: } else { • 18: unsigned idx = bid * blockDim.x + tid; • 19: b[tid] = a[idx] + 1; • 20: } • } GKLEE: T1: <1,0,0><31,0,0> and T2: <1,0,0><32,0,0> incur the write-read race, needs 50.5ss GKLEEp: T1: <1,0,0><511,0,0> and T2: <1,0,0><512,0,0> incur the write-read race, needs 1.9ss

A Motivating Example • 7: b[tid]= a[idx] + 1; • 8: if (tid % 2 != 0) { • 9: b[tid] = 2; • 10: } else { • 11: if (tid > 0) • 12: b[tid] = b[tid-1]+1; • 13: } • 14: } • Constraint for race checking: • Configuration Constraint: • TDC Constraint from Parametric Flow Tree: • Thread Relation Constraint: Precondition

A Motivating Example • 7: b[tid]= a[idx] + 1; • 8: if (tid % 2 != 0) { • 9: b[tid] = 2; • 10: } else { • 11: if (tid > 0) • 12: b[tid] = b[tid-1]+1; • 13: } • 14: } • Constraint for race checking: • Configuration Constraint: • TDC Constraint from Parametric Flow Tree: • Thread Relation Constraint: • Race Constraint: GKLEEp: T1: <1,0,0><511,0,0> and T2: <1,0,0><512,0,0> incur the inter-warp write-read races Precondition

Peng Li , Guodong Li, and Ganesh Gopalakrishnan { peterlee , ligd , ganesh }@cs.utah

Peng Li , Guodong Li, and Ganesh Gopalakrishnan { peterlee , ligd , ganesh }@cs.utah

Presentation Transcript

Memory-Model-Sensitive Data Race Analysis

Formal Methods in System Design, Lec 3

Towards Formal Approaches to System Resilience

(1) Formal Verification for Message Passing and GPU Computing (2) XUM: An Experimental Multicore supporting MCAPI

http://www.cs.utah.edu/fv

On characterizing hardware platforms Ganesh Gopalakrishnan Lecture of 2-9-09, Week 5, CS 5966/6966

An Overview of Formal Mathematical Reasoning with applications to Digital System Verification

Verifying Transaction Ordering Properties in Unbounded Multi-Bus Networks

Scaling Formal Methods Toward Hierarchical Protocols in Shared Memory Processors

Hardware Design, Synthesis, and Verification of a Multicore Communication API

BCC Called Public Hearing on BZA # SE-07-12-011, 12/6/07 APPLICANT: Ganesh Bansrupan

JAY GANESH!!

Garuda

An Integration of Dynamic MPI Formal Verification Within Eclipse PTP

Lord Ganesh

* Work supported in part by SRC Contract 1031.001 and NSF Award 0219805

Dr. Santosh Yatam Ganesh MBBS, MPH.,

SHREE GANESH VANDANA Mantra

GKLEE: Concolic Verification and Test Generation for GPUs

Amazing Discount on Modak for Ganesh Festival - MM Mithaiwala

Beautify your Home with Shri Ganesh Interior Decor

festival products ganesh chaturthi, Ganesh Chaturthi Store, Ganesh Festival