gklee concolic verification and test generation for gpus
Download
Skip this Video
Download Presentation
GKLEE: Concolic Verification and Test Generation for GPUs

Loading in 2 Seconds...

play fullscreen
1 / 52

GKLEE: Concolic Verification and Test Generation for GPUs - PowerPoint PPT Presentation


  • 182 Views
  • Uploaded on

GKLEE: Concolic Verification and Test Generation for GPUs. Guodong Li 1,2 , Peng Li 1 , Geof Sawaya 1 , Ganesh Gopalakrishnan 1 , Indradeep Ghosh 2 , Sreeranga P. Rajan 2. 1. 2. Fujitsu Labs of America. Feb. 2012. 1. GPUs are widely used!. (courtesy of Nvidia, www.engadget.com).

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' GKLEE: Concolic Verification and Test Generation for GPUs' - hilda-hewitt


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
gklee concolic verification and test generation for gpus

GKLEE: Concolic Verification and Test Generation for GPUs

Guodong Li1,2, Peng Li1, Geof Sawaya1, Ganesh Gopalakrishnan1, Indradeep Ghosh2, Sreeranga P. Rajan2

1

2

Fujitsu Labs of America

Feb. 2012

1

gpus are widely used
GPUs are widely used!

(courtesy of Nvidia,

www.engadget.com)

(courtesy of Nvidia)

(courtesy of AMD)

(courtesy of Intel)

In such application domains, it is important that GPU computations yield correct answers and are bug-free.

About 40 of the top 500 machines are GPU based

Personal supercomputers used for scientific research (biology, physics, …) increasingly based on GPUs

2

existing gpu testing methods are inadequate
Existing GPU Testing Methods are Inadequate
  • Insufficient branch-coverage and interleaving-coverage, leading to
    • Missed data races
existing gpu testing methods are inadequate1
Existing GPU Testing Methods are Inadequate
  • Insufficient branch-coverage and interleaving-coverage, leading to
    • Missed data races

Write(a)

Read(a)

Write(a)

Write(a)

existing gpu testing methods are inadequate2
Existing GPU Testing Methods are Inadequate
  • Data races are a huge problem
    • Testing is NEVER conclusive
    • One has to infer data race\'s ill effects indirectly through corrupted values
    • Even instrumented race checking gives results only for a specific platform, and not for future validations,
      • for example for a different warp scheduling, e.g. change over from old Tesla to New Fermi
existing gpu testing methods are inadequate3
Existing GPU Testing Methods are Inadequate
  • Insufficient branch-coverage and interleaving-coverage, leading to
    • Missed data races
    • Missed deadlocks
existing gpu testing methods are inadequate4
Existing GPU Testing Methods are Inadequate
  • Insufficient branch-coverage and interleaving-coverage, leading to
    • Missed data races
    • Missed deadlocks

__SyncThreads()

existing gpu testing methods are inadequate5
Existing GPU Testing Methods are Inadequate
  • Insufficient branch-coverage and interleaving-coverage, leading to
    • Missed data races
    • Missed deadlocks
  • Insufficient measurement of performance penalties due to
    • Warp Divergence
existing gpu testing methods are inadequate6
Existing GPU Testing Methods are Inadequate
  • Insufficient branch-coverage and interleaving-coverage, leading to
    • Missed data races
    • Missed deadlocks
  • Insufficient measurement of performance penalties due to
    • Warp Divergence
existing gpu testing methods are inadequate7
Existing GPU Testing Methods are Inadequate
  • Insufficient branch-coverage and interleaving-coverage, leading to
    • Missed data races
    • Missed deadlocks
  • Insufficient measurement of performance penalties due to
    • Warp Divergence
    • Non-coalesced memory accesses
existing gpu testing methods are inadequate8
Existing GPU Testing Methods are Inadequate
  • Insufficient branch-coverage and interleaving-coverage, leading to
    • Missed data races
    • Missed deadlocks
  • Insufficient measurement of performance penalties due to
    • Warp Divergence
    • Non-coalesced memory accesses

Memory

existing gpu testing methods are inadequate9
Existing GPU Testing Methods are Inadequate
  • Insufficient branch-coverage and interleaving-coverage, leading to
    • Missed data races
    • Missed deadlocks
  • Insufficient measurement of performance penalties due to
    • Warp Divergence
    • Non-coalesced memory accesses
    • Bank conflicts

Memory Banks

existing gpu testing methods are inadequate10
Existing GPU Testing Methods are Inadequate
  • CUDA GDB Debugger
    • Manually debug the code and check races and deadlocks
  • CUDA Profiler
    • Report numbers difficult to read
    • Low coverage (i.e. no all possible inputs)
  • GKLEE
    • Better tool for verification and testing
    • Can address all the previously mentioned points
      • e.g.has found bugs in real SDK kernels previously thought to be bug-free
      • give root causes of the bugs
our contributions
Our Contributions

GKLEE: a Symbolic Virtual GPU for

Verification, Analysis, and Test-generation

GKLEE reports

Races, Deadlocks, Bank Conflicts, Non-Coalesced Accesses, Warp Divergences

GKLEE generates Tests to Run on GPU Hardware

14

architecture of gklee
Architecture of GKLEE

C++ GPU Program (with Sym. Inputs)

GKLEE

(Executor, scheduler, checker, test generator)

LLVM GCC Compiler

LLVMcuda

NVCC

CUDA Syntax Handler

GPU configuration

Test Cases

Statistics /Bugs

Replay on Real GPU

15

rest of the talk
Rest of the Talk

Simple CUDA example

Details of Symbolic Virtual GPU

Analysis Details:

Races, Deadlocks

Degree of

Warp divergences, Bank Conflicts, Non-Coalesced Accesses

Functional Correctness

Automatic Test Generation

Coverage-directed test-case reduction

16

slide17
CUDA
  • A simple dialect of C++ with CUDA directives
  • Thread blocks / teams -- SIMD “warps”
  • Synchronization through barriers / atomics

(GKLEE being extended to handle atomics)

17

example increment array elements
Example: Increment Array Elements

Increment N-element array A by scalar b

tid 0 1 …

A

A[0]+b

A[1]+b

...

t0

t1

__global__ void inc_gpu(int*A, int b, intN) {

int idx = blockIdx.x * blockDim.x + threadIdx.x;

if (idx < N)

A[idx] = A[idx] + b;

}

18

illustration of race
Illustration of Race

Increment N-element vector A by scalar b

tid 0 1 63

A

t0:

read A[63]

...

t63:

write A[63]

__global__ void inc_gpu(int*A, int b, int N) {

int idx = blockIdx.x * blockDim.x + threadIdx.x;

if (idx < N)

A[idx] = A[(idx – 1) % N] + b;

}

RACE!

19

illustration of deadlock
Illustration of Deadlock

Increment N-element vector A by scalar b

tid 0 1 …

A

...

__global__ void inc_gpu(int*A, int b, int N) {

int idx = blockIdx.x * blockDim.x + threadIdx.x;

if (idx < N) {

A[idx] = A[idx] + b;

__syncthreads();

}

idx ≥ N

idx < N

DEADLOCK!

20

example of a race found by gklee
Example of a Race Found by GKLEE

__global__ void histogram64Kernel(unsigned *d_Result, unsigned *d_Data, int dataN) {

const int threadPos = ((threadIdx.x & (~63)) >> 0)

| ((threadIdx.x & 15) << 2)

| ((threadIdx.x & 48) >> 4);

...

__syncthreads();

for (int pos = IMUL(blockIdx.x, blockDim.x) + threadIdx.x; pos < dataN;

pos += IMUL(blockDim.x, gridDim.x)) {

unsigned data4 = d_Data[pos];

...

addData64(s_Hist, threadPos, (data4 >> 26) & 0x3FU); }

__syncthreads(); ...

}

inline void addData64(unsigned char *s_Hist, int threadPos, unsigned int data)

{ s_Hist[ threadPos + IMUL(data, THREAD_N) ]++; }

“GKLEE: Is there a Race ?”

21

example of a race found by gklee1
Example of a Race Found by GKLEE

__global__ void histogram64Kernel(unsigned *d_Result, unsigned *d_Data, int dataN) {

const int threadPos = ((threadIdx.x & (~63)) >> 0)

| ((threadIdx.x & 15) << 2)

| ((threadIdx.x & 48) >> 4);

...

__syncthreads();

for (int pos = IMUL(blockIdx.x, blockDim.x) + threadIdx.x; pos < dataN;

pos += IMUL(blockDim.x, gridDim.x)) {

unsigned data4 = d_Data[pos];

...

addData64(s_Hist, threadPos, (data4 >> 26) & 0x3FU); }

__syncthreads(); ...

}

inline void addData64(unsigned char *s_Hist, int threadPos, unsigned int data)

{ s_Hist[ threadPos + IMUL(data, THREAD_N) ]++; }

Threads 5 and and 13 have a WW race

when d_Data[5] = 0x04040404 and d_Data[13] = 0.

GKLEE

22

example of test coverage due to gklee
Example of Test Coverage due to GKLEE

__global__ void Bitonic_Sort(unsigned* values) {

unsigned int tid = tid.x;

shared[tid] = values[tid];

__syncthreads();

for (unsigned k = 2; k <= bdim.x; k *= 2)

for (unsigned j = k / 2; j > 0; j /= 2) {

unsigned ixj = tid ^ j;

if (ixj > tid) {

if ((tid & k) == 0)

if (shared[tid] > shared[ixj])

swap(shared[tid], shared[ixj]);

else

if (shared[tid] < shared[ixj])

swap(shared[tid], shared[ixj]);

}

__syncthreads();

}

values[tid] = shared[tid];

}

__shared__ unsigned shared[NUM];

inline void swap(unsigned& a, unsigned& b)

{ unsigned tmp = a; a = b; b = tmp; }

23

23

example of test coverage due to gklee1
Example of Test Coverage due to GKLEE

__global__ void Bitonic_Sort(unsigned* values) {

unsigned int tid = tid.x;

shared[tid] = values[tid];

__syncthreads();

for (unsigned k = 2; k <= bdim.x; k *= 2)

for (unsigned j = k / 2; j > 0; j /= 2) {

unsigned ixj = tid ^ j;

if (ixj > tid) {

if ((tid & k) == 0)

if (shared[tid] > shared[ixj])

swap(shared[tid], shared[ixj]);

else

if (shared[tid] < shared[ixj])

swap(shared[tid], shared[ixj]);

}

__syncthreads();

}

values[tid] = shared[tid];

}

__shared__ unsigned shared[NUM];

inline void swap(unsigned& a, unsigned& b)

{ unsigned tmp = a; a = b; b = tmp; }

“How do we test this?”

24

24

example of test coverage due to gklee2
Example of Test Coverage due to GKLEE

__global__ void Bitonic_Sort(unsigned* values) {

unsigned int tid = tid.x;

shared[tid] = values[tid];

__syncthreads();

for (unsigned k = 2; k <= bdim.x; k *= 2)

for (unsigned j = k / 2; j > 0; j /= 2) {

unsigned ixj = tid ^ j;

if (ixj > tid) {

if ((tid & k) == 0)

if (shared[tid] > shared[ixj])

swap(shared[tid], shared[ixj]);

else

if (shared[tid] < shared[ixj])

swap(shared[tid], shared[ixj]);

}

__syncthreads();

}

values[tid] = shared[tid];

}

__shared__ unsigned shared[NUM];

inline void swap(unsigned& a, unsigned& b)

{ unsigned tmp = a; a = b; b = tmp; }

Answer 1 :

“Random + “

25

25

example of test coverage due to gklee3
Example of Test Coverage due to GKLEE

__global__ void Bitonic_Sort(unsigned* values) {

unsigned int tid = tid.x;

shared[tid] = values[tid];

__syncthreads();

for (unsigned k = 2; k <= bdim.x; k *= 2)

for (unsigned j = k / 2; j > 0; j /= 2) {

unsigned ixj = tid ^ j;

if (ixj > tid) {

if ((tid & k) == 0)

if (shared[tid] > shared[ixj])

swap(shared[tid], shared[ixj]);

else

if (shared[tid] < shared[ixj])

swap(shared[tid], shared[ixj]);

}

__syncthreads();

}

values[tid] = shared[tid];

}

__shared__ unsigned shared[NUM];

inline void swap(unsigned& a, unsigned& b)

{ unsigned tmp = a; a = b; b = tmp; }

Here are 5 tests with

100% source code coverage

79% avg. thread + barrier interval

coverage

Answer 2 :

Ask GKLEE:

26

26

gklee symbolic virtual gpu
GKLEE: Symbolic Virtual GPU

Host

Device

Kernel 1

Kernel 2

Grid 1

Block

(0, 0)

Block

(0, 1)

Block

(1, 0)

Block

(1, 1)

Block

(2, 0)

Block

(2, 1)

Grid 2

Block (1, 1)

Thread

(0, 0)

Thread

(0, 2)

Thread

(0, 1)

Thread

(1, 0)

Thread

(1, 2)

Thread

(1, 1)

Thread

(2, 1)

Thread

(2, 2)

Thread

(2, 0)

Thread

(3, 2)

Thread

(3, 1)

Thread

(3, 0)

Thread

(4, 2)

Thread

(4, 1)

Thread

(4, 0)

  • GKLEE models a GPU using software
    • The virtual GPU represents the CUDA Programming Model (hence hide many hardware details)
    • Similar to the CUDA emulator in this aspect; but with many unique features
    • Can simulate CPU+GPU

GKLEE

virtual GPU

virtual CPU

27

concolic execution on the virtual gpu
Concolic Execution on the Virtual GPU
  • The values can be CONCrete or symbOLIC (CONCOLIC) in GKLEE
    • A value may be a complicated symbolic expression
    • Symbolic expressions are handled by constraint solvers
      • Determine satisfiability
      • Give concrete values as evidence
    • Constraint solving has become 1,000x faster over the last 10 years

28

comparing concrete and symbolic execution
Comparing Concrete and Symbolic Execution

All values are concrete

a

b

c

10

Program:

b = a * 2;

c = a + b;

if (c > 100)

assert(0);

10

20

30

10

20

unreachable

29

comparing concrete and symbolic execution1
Comparing Concrete and Symbolic Execution

The values can be concrete or symbolic

a

b

c

x(-,+ )

Program:

b = a * 2;

c = a + b;

if (c > 100)

assert(0);

else

x(-,+ )

2x

x(-,+ )

3x

2x

reachable, e.g. x = 40

reachable, e.g. x = 30

Now path condition is: 3x <= 100

30

gklee works on llvm bytecode
GKLEE Works on LLVM Bytecode
  • CUDA C++ programs are compiled to LLVM bytecode by LLVM-GCC with our CUDA syntax handler
  • Our online technical report contains detailed description
  • GKLEE extends KLEE to handle CUDA features

LLVMcuda Syntax and Semantics

31

thread scheduling in general an exp number of schedules
Thread Scheduling: In general, an Exp. Number of Schedules!

It is like shuffling decks of cards

> 13 trillion shuffles exist for 5 decks with 5 cards !!

> 13 trillion schedules exist for 5 threads with 5 instructions !!

More precisely, 25! / (5!)5

32

gklee avoids examining exp schedules
GKLEE Avoids Examining Exp. Schedules !!

Instead of considering all

Schedules and

All Potential Races…

33

gklee avoids examining exp schedules1
GKLEE Avoids Examining Exp. Schedules !!

Consider JUST THIS SINGLE

CANONICAL SCHEDULE !!

Folk Theorem (proved in our paper):

“We will find A RACE

If there is ANY race” !!

Instead of considering all

Schedules and

All Potential Races…

34

closer look canonical scheduling
Closer Look: canonical scheduling

Race-free operations can be exchanged

a valid schedule:

  • The scheduler:
  • Applies the canonical schedule;
  • Checks races upon the barriers;
  • If no race then continues; otherwise reports the race and terminate

t1:a3:

write x

t2:a4:

write y

t2:a6:

read y

t1:a5:

read x

t2:a2:

write y

t1:a1:

read x

another valid schedule (e.g. canonical schedule):

t1:a3:

write x

t2:a4:

write y

t1:a5:

read x

t2:a6:

read y

t1:a1:

read x

t2:a2:

write y

35

simd aware canonical scheduling in gklee
SIMD-aware Canonical Scheduling in GKLEE

SIMD/Barrier Aware Canonical scheduling within warp/block

t33

t34

t64

t1

t2

t32

Instr. 1

Instr. 1

Barrier

Interval (BI1)

Instr. 2

Instr. 2

Instr. 3

Instr. 3

Instr. 4

Instr. 4

Barrier

Interval (BI2)

Instr. 5

Instr. 5

Instr. 6

Instr. 6

Record accesses in canonical schedule

Check whether the accesses conflict (e.g. have the same address)

36

simd aware race checking in gklee
SIMD-aware Race Checking in GKLEE

Check races on the fly (in the canonical schedule)

t33

t34

t64

t1

t2

t32

Instr. 1

Instr. 1

Barrier

Interval (BI1)

Instr. 2

Instr. 2

Instr. 3

Instr. 3

Instr. 4

Instr. 4

Barrier

Interval (BI2)

Instr. 5

Instr. 5

Instr. 6

Instr. 6

intra-warp races

inter-warp and inter-block races

37

simd aware race checking in gklee1
SIMD-aware Race Checking in GKLEE

Check races on the fly (in the canonical schedule)

t33

t34

t64

t1

t2

t32

Instr. 1

Instr. 1

Barrier

Interval (BI1)

Instr. 2

Instr. 2

Instr. 3

Instr. 3

Instr. 4

Instr. 4

Barrier

Interval (BI2)

Instr. 5

Instr. 5

Instr. 6

Instr. 6

intra-warp races

inter-warp and inter-block races

38

sdk kernel example race checking
SDK Kernel Example: race checking

__global__ void histogram64Kernel(unsigned *d_Result, unsigned *d_Data, int dataN) {

const int threadPos = ((threadIdx.x & (~63)) >> 0)

| ((threadIdx.x & 15) << 2)

| ((threadIdx.x & 48) >> 4);

...

__syncthreads();

for (int pos = IMUL(blockIdx.x, blockDim.x) + threadIdx.x; pos < dataN; pos += IMUL(blockDim.x, gridDim.x)) {

unsigned data4 = d_Data[pos];

...

addData64(s_Hist, threadPos, (data4 >> 26) & 0x3FU); }

__syncthreads(); ...

}

inline void addData64(unsigned char *s_Hist, int threadPos, unsigned int data)

{ s_Hist[threadPos + IMUL(data, THREAD_N)]++; }

t1

t2

threadPos = …

threadPos = …

data = (data4>26) & 0x3FU

data = (data4>26) & 0x3FU

s_Hist[threadPos +

Data*THREAD_N]++;

s_Hist[threadPos + data*THREAD_N]++;

sdk kernel example race checking1
SDK Kernel Example: race checking

t1

t2

RW set:

t1: writes

s_Hist((((t1 & (~63)) >> 0) | ((t1 & 15) << 2) | ((t1 & 48) >> 4)) + ((d_Data[t1] >> 26) & 0x3FU) * 64), …

t2: writes

s_Hist((((t2 & (~63)) >> 0) | ((t2 & 15) << 2) | ((t2 & 48) >> 4)) + ((d_Data[t2] >> 26) & 0x3FU) * 64), …

threadPos = …

threadPos = …

data = (data4>26) & 0x3FU

data = (data4>26) & 0x3FU

s_Hist[threadPos + data*THREAD_N]++;

?

s_Hist[threadPos +

data*THREAD_N]++;

t1,t2,d_Data:

(t1  t2) 

(((t1 & (~63)) >> 0) | ((t1 & 15) << 2) | ((t1 & 48) >> 4)) + ((d_Data[t1] >> 26) & 0x3FU) * 64) ==

((((t2 & (~63)) >> 0) | ((t2 & 15) << 2) | ((t2 & 48) >> 4)) + ((d_Data[t2] >> 26) & 0x3FU) * 64)

sdk kernel example race checking2
SDK Kernel Example: race checking

t1

t2

RW set:

t1: writes

s_Hist((((t1 & (~63)) >> 0) | ((t1 & 15) << 2) | ((t1 & 48) >> 4)) + ((d_Data[t1] >> 26) & 0x3FU) * 64), …

t2: writes

s_Hist((((t2 & (~63)) >> 0) | ((t2 & 15) << 2) | ((t2 & 48) >> 4)) + ((d_Data[t2] >> 26) & 0x3FU) * 64), …

threadPos = …

threadPos = …

data = (data4>26) & 0x3FU

data = (data4>26) & 0x3FU

GKLEE indicates that these two addresses are equalwhen

t1 = 5, t2 = 13, d_data[5]= 0x04040404,

and d_data[13] = 0

indicating a Write-Write race

s_Hist[threadPos + data*THREAD_N]++;

s_Hist[threadPos +

data*THREAD_N]++;

experimental results part i check correctness and performance issues
Experimental Results, Part I (check correctness and performance issues)
  • The results of running GKLEE on CUDA SDK 2.0 kernels. GKLEE checks
  • well synchronized barriers; (2) races; (3) functional correctness; (4) bank conflicts; (5) memory coalescing; (6) warp divergence; (7) required volatile keyword.

42

automatic test generation
Automatic Test Generation

t1

t2

c1

¬c1

c3

¬c3

c3

c3

¬c3

¬c3

c2

¬c2

c4

¬c4

¬c4

¬c4

c4

c4

  • GKLEE guarantees to explore all paths w.r.t. given inputs
  • The path constraint at the end of each path is solved to generate concrete test cases
  • GKLEE supports many heuristic reduction techniques

t1+t2

c1

¬c1

c2

¬c2

c3

¬c3

¬c1  ¬c3

c4

¬c4

c1c2 c3 c4

solve this constraint to give a concrete test

43

sdk example comprehensive testing
SDK Example: comprehensive testing

__global__ void BitonicKernel(unsigned* values) {

unsigned int tid = tid.x;

shared[tid] = values[tid];

__syncthreads();

for (unsigned k = 2; k <= bdim.x; k *= 2)

for (unsigned j = k / 2; j > 0; j /= 2) {

unsigned ixj = tid ^ j;

if (ixj > tid) {

if ((tid & k) == 0)

if (shared[tid] > shared[ixj])

swap(shared[tid], shared[ixj]);

else

if (shared[tid] < shared[ixj])

swap(shared[tid], shared[ixj]);

}

__syncthreads();

}

values[tid] = shared[tid];

}

shared[0]≤shared[1]

shared[0]

> shared[1]

shared[1] <

shared[2]

shared[1] ≥shared[2]

shared[0]

≤ shared[2]

shared[0]

> shared[2]

Unsat:

shared[0] > shared[1] shared[1] ≥shared[2]  shared[0] ≤ shared[2]

44

44

sdk example comprehensive verification
SDK Example: comprehensive verification

Functional correctness: output values is sorted: values[0] ≤ values[1] ≤ … ≤ values[n]

values=…

values=…

values=…

values=…

values=…

values=…

45

45

experimental results part ii automatic test generation
Experimental Results, Part II… (Automatic Test Generation)

Coverage information about the generated tests for some CUDA kernels.

Covtand CovTBtmeasure bytecode coverage w.r.t threads. No test reductions used in generating this table. Exec. time on typical workstation.

46

experimental results part ii coverage directed test reduction
Experimental Results, Part II (Coverage Directed Test Reduction)

Results after applying reduction Heuristics

RedTB and RedBI cut the paths according to the coverage information of Thread+Barrier and Barrier respectively. Basically a path is pruned if it is unlikely to contribute new coverage.

47

additional gklee features
Additional GKLEE Features
  • GKLEE employs an efficient memory organization
  • Employs many expression evaluation optimizations
    • Simplify concolicexpressions on the fly
    • Dynamically cache results
    • Apply dependency analysis before constraint solving
    • Use manually optimized C/C++ Libraries
  • GKLEE also handles all of the C++ Syntax
  • GKLEE never generates false alarms

48

experimental results part iii performance comparison of two tools
Experimental Results, Part III(performance comparison of two tools)

Execution times (in seconds) of GKLEE and PUG [SIGSOFT FSE 2010] for functional correctness check.

#T is the number of threads. Time is reported in the format of GPU time(entire time); T.O means > 5 minutes.

49

other details
Other Details

50

  • Diverged warp scheduling, intra-warp, inter-warp/-block race checking, textual aligned barrier checking
  • Checking performance issues
    • warp divergence, bank conflicts, global memory coalescing
  • Path/Test reduction techniques
  • Volatile declaration checking
  • Handling symbolic aliasing and pointers
  • Drivers for the kernels and replaying on the real GPU
  • Other results, e.g. on CUDA SDK 4.0 programs
  • CUDA’s relaxed memory model and semantics
summary
Summary
  • GKLEE: symbolic virtual GPU
    • Identify correctness and performance issues
    • Produce concrete tests with high code coverage
    • Enable symbolic parallel debugging for CUDA programs
    • Good for other CUDA applications (e.g. compiler optimization verification, regression testing, etc.)
  • The tool is open source and available at:
    • www.cs.utah.edu/fv/GKLEE
    • with tutorial, manual, tech. report, liveDVD,, etc.
  • Future Work
    • Parameterized verification (e.g. equivalence checking)
    • Support for floating point numbers
    • Combination with runtime execution (on the real GPU)
thank you
Thank You!

Questions?

Obtain GKLEE from

www . cs . utah . edu / fv / GKLEE

ad