1 / 41

Peng Li , Guodong Li, and Ganesh Gopalakrishnan { peterlee , ligd , ganesh }@cs.utah

Parametric Flows Automated Behavior Equivalencing for Symbolic Analysis of Races in CUDA Programs. Peng Li , Guodong Li, and Ganesh Gopalakrishnan { peterlee , ligd , ganesh }@cs.utah.edu School of Computing, University of Utah, Salt Lake City, UT 84112, USA. GPU-based Computing.

nyx
Download Presentation

Peng Li , Guodong Li, and Ganesh Gopalakrishnan { peterlee , ligd , ganesh }@cs.utah

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Parametric FlowsAutomated Behavior Equivalencingfor Symbolic Analysis of Racesin CUDA Programs Peng Li,Guodong Li, and Ganesh Gopalakrishnan {peterlee, ligd, ganesh}@cs.utah.edu School of Computing, University of Utah, Salt Lake City, UT 84112, USA

  2. GPU-based Computing (courtesy of Intel) (courtesy of Microsoft) (courtesy of NVidia) (courtesy of AMD) CUDA OpenCL C++ AMP C/C++ Titan [AMD+NVidiaKepler] is ranked 1st in the latest top 500! Various of GPU Programming models exist 2

  3. CUDA programs harbor insidious bugs! • Data Races • Caused by unsynchronized accesses tid = 1 tid = 2 … = a[tid] a[tid-1] = … • Can produce unpredictable results • Compilers can misbehave if given code with races • Deadlocks and other problems 3

  4. CUDA Thread + Memory Organization Thread Warp Block Grid 4

  5. Illustration of Race tid0 1 63 A ... t63: read A[0] __global__ void inc_gpu(int*A, intb, int N) { unsigned tid = threadIdx.x; A[tid] = A[(tid+1)% 64] + b; } t0: write A[0] RACE! t0 t63 5

  6. Illustration of Deadlock t0 t1 t2 t3 tid %2 == 0 true false __syncthreads() t0 t2 t1 t3

  7. Debugging CUDA Programs is hard!

  8. Why Hard? t0 t1 t2 t3 t4 … E1 … E2 En … Read (Addr=10) Write (Addr=10) …

  9. Why Hard? • Traditional Methods • bugs only w.r.t. current platforms + inputs + schedules • Formal Methods • bugs analyzed w.r.t. future / different platforms (PORTING ISSUE!) • all relevant inputs • all relevant schedules

  10. Solution to relevant inputs: symbolic execution X X = x<3 x>=3 X < 3 X < 10 x>=3 & x>=10 x>=3 & x<10 Example Test Case 1 : x = 2 Example Test Case 2 : x = 3 Example Test Case 3 : x = 11 Path 1 : x < 3 Path 2 : 3 <= x < 10 Path 3 : x >= 10 Constraint Solver

  11. Solution to relevant schedules: representative interleaving

  12. Solution to relevant schedules: representative interleaving __device__ int d[64]; __global void foo(int *d) { 1: __shared__ int a[64]; 2: int tid= threadIdx.x; 3: a[tid] = d[tid]; 4: __syncthreads(); 5: a[tid]++; 6: if (tid % 2 == 0) { 7: a[tid] = a[tid]+2;8: } else { 9: a[tid] = a[tid%32];10:}11:__syncthreads(); }

  13. Solution to relevant schedules: representative interleaving __device__ int d[64]; __global void foo(int *d) { 1: __shared__ int a[64]; 2: int tid= threadIdx.x; 3: a[tid] = d[tid]; 4: __syncthreads(); 5: a[tid]++; 6: if (tid % 2 == 0) { 7: a[tid] = a[tid]+2;8: } else { 9: a[tid] = a[tid%32];10:}11:__syncthreads(); }

  14. Solution to relevant schedules: representative interleaving __device__ int d[64]; __global void foo(int *d) { 1: __shared__ int a[64]; 2: int tid= threadIdx.x; 3: a[tid] = d[tid]; 4: __syncthreads(); 5: a[tid]++; 6: if (tid % 2 == 0) { 7: a[tid] = a[tid]+2;8: } else { 9: a[tid] = a[tid%32];10:}11:__syncthreads(); } Barrier Barrier Interval Barrier

  15. Solution to relevant schedules: representative interleaving __device__ int d[64]; __global void foo(int *d) { 1: __shared__ int a[64]; 2: int tid= threadIdx.x; 3: a[tid] = d[tid]; 4: __syncthreads(); 5: a[tid]++; 6: if (tid % 2 == 0) { 7: a[tid] = a[tid]+2;8: } else { 9: a[tid] = a[tid%32];10:}11:__syncthreads(); } t1 t2 t29 t30 t31 t0 Barrier Barrier Interval Barrier

  16. Solution to relevant schedules: representative interleaving __device__ int d[64]; __global void foo(int *d) { 1: __shared__ int a[64]; 2: int tid= threadIdx.x; 3: a[tid] = d[tid]; 4: __syncthreads(); 5: a[tid]++; 6: if (tid % 2 == 0) { 7: a[tid] = a[tid]+2;8: } else { 9: a[tid] = a[tid] + 1;10:}11:__syncthreads(); } t1 t2 t29 t30 t31 t0 Barrier Barrier Interval t0 t2 … t30 Barrier

  17. Solution to relevant schedules: representative interleaving __device__ int d[64]; __global void foo(int *d) { 1: __shared__ int a[64]; 2: int tid= threadIdx.x; 3: a[tid] = d[tid]; 4: __syncthreads(); 5: a[tid]++; 6: if (tid % 2 == 0) { 7: a[tid] = a[tid]+2;8: } else { 9: a[tid] = a[tid]+1;10:}11:__syncthreads(); } t1 t2 t29 t30 t31 t0 Barrier Barrier Interval t0 t2 … t30 t1 t3 … t31 Barrier

  18. Solution to relevant schedules: representative interleaving SIMD-Aware Canonical Schedule __device__ int d[64]; __global void foo(int *d) { 1: __shared__ int a[64]; 2: int tid= threadIdx.x; 3: a[tid] = d[tid]; 4: __syncthreads(); 5: a[tid]++; 6: if (tid % 2 == 0) { 7: a[tid] = a[tid]+2;8: } else { 9: a[tid] = a[tid%32];10:}11:__syncthreads(); } t1 t2 t29 t30 t31 t33 t34 t61 t62 t63 t0 t32 Barrier t0 t2 … t30 t32 t34 … t62 t1 t3 … t31 t33 t35 … t63 Barrier

  19. Solution to relevant schedules: representative interleaving SIMD-Aware Canonical Schedule __device__ int d[64]; __global void foo(int *d) { 1: __shared__ int a[64]; 2: int tid= threadIdx.x; 3: a[tid] = d[tid]; 4: __syncthreads(); 5: a[tid]++; 6: if (tid % 2 == 0) { 7: a[tid] = a[tid]+2;8: } else { 9: a[tid] = a[tid%32];10:}11:__syncthreads(); } t1 t2 t29 t30 t31 t33 t34 t61 t62 t63 t0 t32 Barrier t0 t2 … t30 t32 t34 … t62 t1 t3 … t31 t33 t35 … t63 Barrier Result in PPoPP’ 12: Guarantee to find races !! Around 16K pairs

  20. Evolution of Formal Analysis Tools for CUDA in our group • Previous tool : GKLEE [PPoPP’12] • complete • does not scale, because every thread (e.g. 20K or more) explicitly modeled • This paper [SC’12] : GKLEEp • complete (in practice) • scales to 20k threads or more..

  21. GKLEEp’s Flow • Data races • Deadlocks • Concrete test inputs • Bank conflicts • Warp divergences • Non-coalesced • Test Cases • Provide high coverage • Can be run on HW C++ CUDA Programs with Symbolic Variable Declarations Error Monitors LLVM byte-code instructions LLVM-GCC Symbolic Analyzer and Scheduler

  22. Key Contributions • Parametric flows are the control-flow equivalence classes of threads that diverge in the same manner • GKLEEp found bugs missed by GKLEE (GKLEEp scales!) • GKLEE: upto 2K threads • GKLEEp: well beyond 20K threads • GKLEEp finds all races (except in contrived programs)

  23. Key Idea: Branching on TDC (Thread-ID Dependent Conditional) __global void foo(int *d) { 1: __shared__ int a[64]; 2: int tid= threadIdx.x; 3: a[tid] = d[tid]; 4: __syncthreads(); 5: a[tid]++; 6: if (tid % 2 == 0) { 7: a[tid] = a[tid]+2;8: }else { 9: a[tid] = a[tid%32];10:}11: __syncthreads(); } Barrier Barrier

  24. A Motivating Example • __shared__ unsigned b[2048]; • __global__ void test(unsigned * a) { • 1: unsigned tid = threadIdx.x; • 2: int x, y; • 3: if (tid < 1024) { • 4: b[tid] = a[tid] + 1; • 5: if (tid % 2 != 0) { • 6: b[tid] = 2; • 7: } else { • 8: if (tid > 0) • 9: b[tid] = b[tid-1]+1; • 10: if (x < y) … • 11: } • 12: } • 13: } else { • 14: b[tid] = b[tid-1]; • 15: } • }

  25. A Motivating Example • __shared__ unsigned b[2048]; • __global__ void test(unsigned * a) { • 1: unsigned tid = threadIdx.x; • 2: int x, y; • 3: if (tid < 1024) { <<== TDC • 4: b[tid] = a[tid] + 1; • 5: if (tid % 2 != 0) { <<== TDC • 6: b[tid] = 2; • 7: } else { • 8: if (tid > 0){ <<== TDC • 9: b[tid] = b[tid-1]+1; • 10: if (x < y) … << == Not TDC • 11: } • 12: } • 13: } else { • 14: b[tid] = b[tid-1]; • 15: } • }

  26. A Motivating Example tid < 1024 tid %2 != 0 tid >= 1024 tid == 0 b[tid] = b[tid-1]; 4 Parametric Flows tid < 1024 b[tid] = a[tid] + 1; tid %2 != 0 tid > 0 b[tid] = 2; tid % 2 == 0 Parametric Flow Tree b[tid] = b[tid-1]+1 tid > 0

  27. Correctness of GKLEEp • No False Alarms • guaranteed - because of exact symbolic constraint solving!! • No Omissions • "no omissions" true in practice Details in paper!!

  28. SDK Kernel Example: Symbolic race checking __global__ void histogram64Kernel(unsigned *d_Result, unsigned *d_Data, intdataN) { constintthreadPos = ((threadIdx.x & (~63)) >> 0) | ((threadIdx.x & 15) << 2) | ((threadIdx.x & 48) >> 4); ... __syncthreads(); for (intpos = IMUL(blockIdx.x, blockDim.x) + threadIdx.x; pos < dataN; pos += IMUL(blockDim.x, gridDim.x)) { unsigned data4 = d_Data[pos]; addData64(s_Hist, threadPos, (data4 >> 2) & 0x3FU); ... } __syncthreads(); ... } __device__ void addData64(unsigned char *s_Hist, intthreadPos, unsigned intdata) { s_Hist[threadPos + IMUL(data, THREAD_N)]++; } t1 t2 threadPos = … threadPos = … data = (data4>>2) & 0x3FU data = (data4>>2) & 0x3FU s_Hist[threadPos + data*THREAD_N]++; s_Hist[threadPos + data*THREAD_N]++;

  29. SDK Kernel Example: Symbolic race checking t1 t2 RW set: t1: writes s_Hist((((t1 & (~63)) >> 0) | ((t1 & 15) << 2) | ((t1 & 48) >> 4)) + ((d_Data[t1] >> 26) & 0x3FU) * 32), … t2: writes s_Hist((((t2 & (~63)) >> 0) | ((t2 & 15) << 2) | ((t2 & 48) >> 4)) + ((d_Data[t2] >> 26) & 0x3FU) * 32), … threadPos = … threadPos = … data = (data4>>2) & 0x3FU data = (data4>>2) & 0x3FU ? s_Hist[threadPos + data*THREAD_N]++; s_Hist[threadPos + data*THREAD_N]++; t1,t2,d_Data: (t1 t2)  (((t1& (~63)) >> 0) | ((t1& 15) << 2) | ((t1& 48) >> 4)) + ((d_Data[t1] >> 2) & 0x3FU) * 32) == ((((t2& (~63)) >> 0) | ((t2& 15) << 2) | ((t2 & 48) >> 4)) + ((d_Data[t2]>> 2) & 0x3FU) * 32) Satisfiable! There is a race!!

  30. SDK Kernel Example: race checking t1 t2 RW set: t1: writes s_Hist((((t1 & (~63)) >> 0) | ((t1 & 15) << 2) | ((t1 & 48) >> 4)) + ((d_Data[t1] >> 26) & 0x3FU) * 64), … t2: writes s_Hist((((t2 & (~63)) >> 0) | ((t2 & 15) << 2) | ((t2 & 48) >> 4)) + ((d_Data[t2] >> 26) & 0x3FU) * 64), … threadPos = … threadPos = … data = (data4>>2) & 0x3FU data = (data4>>2) & 0x3FU GKLEEpindicates that these two addresses are equalwhen t1 = 23, t2 = 31, d_data[23]= 0xfcfcfcfc, and d_data[31] = 0xf4f4f4f4 indicating a Write-Write race s_Hist[threadPos + data*THREAD_N]++; s_Hist[threadPos + data*THREAD_N]++;

  31. Evaluation Timed Out!

  32. Evaluation

  33. GKLEEp in practice • Accepts host program with many kernel calls • Each kernel can be ~1K LOC, e.g., eigenvalues • Finds races as well as inputs causing them

  34. Evaluation TABLE I SDK 2.0 KERNEL RESULTS. WE SET 7200 SECONDS AS THE THRESHOLD FOR TIME OUT (ABBREVIATED AS T.O.). A/B , A is the tool runtime (in seconds) and B is the number of control flow paths

  35. Related formal methods based work: compare with other formal tools • [M.Zheng et al, PPoPP’11]: • Combination of static analysis and dynamic analysis • [A. Leung et al, PLDI’12]: • A single dynamic run can be used to learn much more information about a CUDA program’s behavior • [A. Betts et al, SPLASH’12]: • Two threads abstraction • Found errors in real SDK kernels GKLEEp scales more and finds races in real kernels!

  36. Conclusion • New formal approach for analyzing CUDA kernels • Employs a “parametric” reasoning style which capitalizes on thread symmetry • Scales to over 10^5 threads on realistic CUDA programs • Finds races missed by • Traditional testing • Previous formal approaches • Tool will be released soon – check website http://www.cs.utah.edu/fv/GKLEE

  37. Thanks! Questions?

  38. Extra Slides • How to pick symbolic inputs? • taint analyzer being developed • help pick inputs that matter and make symbolic • Loops invariant • Static analysis to avoid loop unrolling

  39. A Motivating Example • __global__ void test(unsigned * a) { • 1: unsigned bid = blockIdx.x; • 2: unsigned tid = threadIdx.x; • 3: • 4: if (bid % 2 != 0) { • 5: if (tid < 1024) { • 6: unsigned idx = bid * blockDim.x + tid; • 7: b[tid]= a[idx] + 1; • 8: if (tid % 2 != 0) { • 9: b[tid] = 2; • 10: } else { • 11: if (tid > 0) • 12: b[tid] = b[tid-1]+1; • 13: } • 14: } else { • 15: b[tid] = b[tid-1]; • 16: } • 17: } else { • 18: unsigned idx = bid * blockDim.x + tid; • 19: b[tid] = a[idx] + 1; • 20: } • } GKLEE: T1: <1,0,0><31,0,0> and T2: <1,0,0><32,0,0> incur the write-read race, needs 50.5ss GKLEEp: T1: <1,0,0><511,0,0> and T2: <1,0,0><512,0,0> incur the write-read race, needs 1.9ss

  40. A Motivating Example • 7: b[tid]= a[idx] + 1; • 8: if (tid % 2 != 0) { • 9: b[tid] = 2; • 10: } else { • 11: if (tid > 0) • 12: b[tid] = b[tid-1]+1; • 13: } • 14: } • Constraint for race checking: • Configuration Constraint: • TDC Constraint from Parametric Flow Tree: • Thread Relation Constraint: Precondition

  41. A Motivating Example • 7: b[tid]= a[idx] + 1; • 8: if (tid % 2 != 0) { • 9: b[tid] = 2; • 10: } else { • 11: if (tid > 0) • 12: b[tid] = b[tid-1]+1; • 13: } • 14: } • Constraint for race checking: • Configuration Constraint: • TDC Constraint from Parametric Flow Tree: • Thread Relation Constraint: • Race Constraint: GKLEEp: T1: <1,0,0><511,0,0> and T2: <1,0,0><512,0,0> incur the inter-warp write-read races Precondition

More Related