
CHESS: Systematic Testing of Concurrent Programs

Madan Musuvathi

Shaz Qadeer

Microsoft Research

Testing multithreaded programs is HARD
  • Specific thread interleavings expose subtle errors
    • Testing often misses these errors
  • Even when found, errors are hard to debug
    • No repeatable trace
    • Source of the bug is far away from where it manifests
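A tiny illustration (not from the talk): whether the final counter value below is correct depends entirely on how the two threads interleave, which is exactly why such bugs rarely show up in testing and are hard to reproduce.

    // Two threads increment a shared, intentionally unsynchronized counter.
    // Most interleavings print 2000000, but any interleaving that overlaps
    // the two load-add-store sequences loses an update.
    #include <cstdio>
    #include <thread>

    static int counter = 0;                 // shared, intentionally racy

    static void Increment() {
        for (int i = 0; i < 1000000; ++i)
            counter = counter + 1;          // load, add, store: not atomic
    }

    int main() {
        std::thread t1(Increment), t2(Increment);
        t1.join();
        t2.join();
        std::printf("counter = %d\n", counter);   // often less than 2000000
    }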
Concurrency is a real problem
  • Windows 2000 hot fixes
    • Concurrency errors most common defects among “detectable errors”
    • Incorrect synchronization and protocol errors most common defects among all coding errors
  • Windows Server 2003 late cycle defects
    • Synchronization errors second in the list, next to buffer overruns
  • Race conditions can result in security exploits
Current practice
  • Concurrency testing == Stress testing
  • Example: testing a concurrent queue
    • Create 100 threads performing queue operations
    • Run for days/weeks
    • Pepper the code with sleep ( random() )
  • Stress increases the likelihood of rare interleavings
    • Makes any error found hard to debug
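A sketch of the stress-testing style just described, assuming a hypothetical ConcurrentQueue under test (all names are placeholders): many threads, random sleeps, unbounded running time.

    #include <chrono>
    #include <cstdlib>
    #include <thread>
    #include <vector>

    struct ConcurrentQueue {                    // placeholder for the structure under test
        void Enqueue(int) {}
        bool TryDequeue(int*) { return false; }
    };

    int main() {
        ConcurrentQueue q;
        std::vector<std::thread> workers;
        for (int i = 0; i < 100; ++i) {         // "create 100 threads"
            workers.emplace_back([&q] {
                for (;;) {                      // "run for days/weeks"
                    q.Enqueue(std::rand());
                    std::this_thread::sleep_for(  // "pepper the code with sleep(random())"
                        std::chrono::milliseconds(std::rand() % 10));
                    int v;
                    q.TryDequeue(&v);
                }
            });
        }
        for (auto& t : workers) t.join();       // in practice: killed after days
    }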
CHESS: Unit testing for concurrency
  • Example: testing a concurrent queue
    • Create 1 reader thread and 1 writer thread
    • Exhaustively try all thread interleavings
  • Run the test repeatedly on a specialized scheduler
  • Explore a different thread interleaving each time
    • Use model checking techniques to avoid redundancy
  • Check for assertions and deadlocks in every run
    • The error-trace is repeatable
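A sketch of the corresponding CHESS-style unit test, again with a placeholder ConcurrentQueue: a small, deterministic scenario with one writer, one reader, an assertion checked on every run, and full cleanup so the scenario can be re-executed under a different interleaving. The driver loop that CHESS provides is only hinted at here.

    #include <cassert>
    #include <thread>

    struct ConcurrentQueue {                        // placeholder for the structure under test
        void Enqueue(int) {}
        bool TryDequeue(int* out) { *out = 42; return true; }
    };

    void TestScenario() {
        ConcurrentQueue* q = new ConcurrentQueue(); // fresh state on every run
        std::thread writer([q] { q->Enqueue(42); });
        std::thread reader([q] {
            int v;
            if (q->TryDequeue(&v))
                assert(v == 42);                    // checked in every interleaving
        });
        writer.join();
        reader.join();
        delete q;                                   // idempotent: release everything
    }

    int main() {
        // Under CHESS this scenario would be invoked in a loop, once per
        // explored interleaving; here it simply runs once.
        TestScenario();
    }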
Systematic Stress Testing Using CHESS

  • Tester provides a test scenario:

        TestScenario() {
        }

  • CHESS runs the scenario in a loop:

        While(not done) {
            TestScenario()
        }

    • Every run takes a different interleaving
    • Every run is repeatable

[Diagram: the program under test and CHESS run on top of the Win32 API, which runs on top of the kernel (threads, scheduler, synchronization objects)]

Conditions on Test Scenario
  • Test scenario should terminate in all interleavings
  • Test scenario should be idempotent
      • Free all resources (handles, memory, …)
      • Clear the hardware state
  • Key observation:
    • Existing stress tests already have these properties
    • Because they already run repeatedly, indefinitely
Perturb the System as Little as Possible
  • Run the system as is
    • On the actual OS, hardware
    • Using system threads, synchronization
  • Detour Win32 API calls
    • To control and introduce nondeterminism
  • Advantages
    • Avoid reporting false errors
    • Easy to add to existing test frameworks
    • Use existing debuggers

[Diagram: the same stack as before; the program and CHESS sit above the Win32 API, the kernel (threads, scheduler, synchronization objects) sits below, and CHESS intercepts at the Win32 API layer]

Implementation details
  • Handle all the Win32 synchronization mechanisms
      • Critical sections, locks, semaphores, events,…
      • Threadpools
      • Asynchronous procedure calls
      • Timers
      • IO Completions
  • No modification to the kernel scheduler / Win32 library
  • CHESS drives the system along a desired interleaving by ‘hijacking’ the scheduler
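As a rough illustration of the "detour" approach (a sketch, not CHESS's actual code): each Win32 synchronization call is redirected to a wrapper that first lets the tool's scheduler decide who runs next, then forwards to the real API. Chess_SchedulePoint is a hypothetical hook, and a real tool would install the redirection with binary interception (e.g., the Detours library) rather than source changes.

    #include <windows.h>

    void Chess_SchedulePoint();   // hypothetical: a possible context-switch point

    // Pointer to the original API; a real detour fills this in at install time.
    static void (WINAPI* Real_EnterCriticalSection)(LPCRITICAL_SECTION) =
        ::EnterCriticalSection;

    void WINAPI Detoured_EnterCriticalSection(LPCRITICAL_SECTION cs) {
        Chess_SchedulePoint();          // the scheduler may switch threads here
        Real_EnterCriticalSection(cs);  // then perform the real acquisition
    }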
Controlling the Scheduling Nondeterminism
  • Nondeterministic choices for the scheduler
    • Determine when to context switch
    • On context switch, pick the next runnable thread to run
    • On resource release, wake up one of the waiting threads
  • Hijack these choices from the scheduler
    • Ensure at most one thread is runnable
    • No thread is waiting on a resource
    • At chosen schedule points, block the current thread while waking the next thread
  • Emulate program execution on a uniprocessor with context switches only at synchronization points
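One way to realize "at most one runnable thread" is the token-passing scheme sketched below (an illustration of the technique, not CHESS's scheduler): each test thread waits for its turn, and at every schedule point the running thread hands the token to a chosen successor and blocks itself. The search over which successor to pick is omitted.

    #include <condition_variable>
    #include <mutex>
    #include <thread>

    class SerialScheduler {
        std::mutex m_;
        std::condition_variable cv_;
        int running_;                       // id of the only runnable thread
    public:
        explicit SerialScheduler(int first) : running_(first) {}

        // Block the calling thread until it is thread `me`'s turn.
        void WaitForTurn(int me) {
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [&] { return running_ == me; });
        }

        // Schedule point: make `next` runnable, then block until control returns.
        void SwitchTo(int me, int next) {
            { std::lock_guard<std::mutex> lk(m_); running_ = next; }
            cv_.notify_all();
            WaitForTurn(me);
        }

        // Thread exit: hand control to `next` without blocking.
        void Finish(int next) {
            { std::lock_guard<std::mutex> lk(m_); running_ = next; }
            cv_.notify_all();
        }
    };

    int main() {
        SerialScheduler sched(0);
        int x = 0;
        std::thread t0([&] { sched.WaitForTurn(0); x = 1;
                             sched.SwitchTo(0, 1);          // deterministic switch
                             x = 2; sched.Finish(1); });
        std::thread t1([&] { sched.WaitForTurn(1); x = 3; sched.Finish(0); });
        t0.join();
        t1.join();
        // With this schedule x always ends up 2: t0 runs to the switch, t1 runs
        // to completion, then t0 finishes -- the interleaving is fully repeatable.
    }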
Partial-order reduction
  • Many thread interleavings are equivalent
    • Accesses to separate memory locations by different threads can be reordered
  • Avoid exploring equivalent thread interleavings

Example: the two interleavings

    T1: x := 1        T2: y := 2
    T2: y := 2        T1: x := 1

are equivalent.

Partial-order reduction in CHESS
  • Algorithm:
    • Assume the program is data-race free
    • Context switch only at synchronization points
    • Check for data-races in each execution
  • Theorem:
    • If the algorithm terminates without reporting races,
      • then the program has no assertion failures
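The talk does not say which race detector CHESS uses. As one concrete illustration of "check for data-races in each execution", here is a sketch of the classic Eraser-style lockset heuristic; it is a stand-in, not CHESS's detector, and all identifiers are made up.

    // For every shared location, intersect the sets of locks held across all
    // accesses; if the location is touched by more than one thread and the
    // intersection becomes empty, flag a possible data race.
    #include <cstdint>
    #include <cstdio>
    #include <map>
    #include <set>

    using ThreadId = int;
    using LockId   = int;
    using Addr     = std::uintptr_t;

    class LocksetChecker {
        struct LocState {
            bool seen = false;              // any access recorded yet?
            bool shared = false;            // accessed by more than one thread?
            ThreadId firstThread = -1;
            std::set<LockId> candidate;     // locks held on every access so far
        };
        std::map<ThreadId, std::set<LockId>> held_;   // locks currently held per thread
        std::map<Addr, LocState> state_;
    public:
        void OnAcquire(ThreadId t, LockId l) { held_[t].insert(l); }
        void OnRelease(ThreadId t, LockId l) { held_[t].erase(l); }

        // Called for every shared-memory access observed during one execution.
        void OnAccess(ThreadId t, Addr a) {
            LocState& s = state_[a];
            const std::set<LockId>& locks = held_[t];
            if (!s.seen) { s.seen = true; s.firstThread = t; s.candidate = locks; return; }
            if (t != s.firstThread) s.shared = true;
            std::set<LockId> refined;                 // keep only locks held now too
            for (LockId l : s.candidate)
                if (locks.count(l)) refined.insert(l);
            s.candidate = refined;
            if (s.shared && s.candidate.empty())
                std::printf("possible race on address 0x%llx (thread %d)\n",
                            static_cast<unsigned long long>(a), t);
        }
    };

    int main() {
        LocksetChecker c;
        // Thread 1 writes location 0x1000 holding lock 1; thread 2 writes it
        // holding no lock, so the candidate lockset empties and is reported.
        c.OnAcquire(1, 1); c.OnAccess(1, 0x1000); c.OnRelease(1, 1);
        c.OnAccess(2, 0x1000);
    }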
Executions on Multi-cores
  • CHESS checks for data-races
  • If a Test Scenario manifests a bug on a multi-core machine, then CHESS will
    • Either report a data-race
    • Or find the bug
  • CHESS systematically enumerates all sequentially consistent executions
    • Any data-race free multi-core execution is equivalent to a sequentially consistent execution
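An aside (not from the talk) on why the data-race-free assumption matters: the classic store-buffering test below is racy, and on real multi-core hardware it can end with r1 == 0 and r2 == 0, an outcome no sequentially consistent interleaving produces. For such a program CHESS reports the race rather than trying to enumerate non-SC executions.

    // Intentionally racy illustration: under sequential consistency at least
    // one of r1, r2 must be 1, but hardware store buffers can make both 0.
    #include <cstdio>
    #include <thread>

    int x = 0, y = 0;        // shared, intentionally unsynchronized
    int r1 = 0, r2 = 0;

    int main() {
        std::thread t1([] { x = 1; r1 = y; });
        std::thread t2([] { y = 1; r2 = x; });
        t1.join();
        t2.join();
        std::printf("r1 = %d, r2 = %d\n", r1, r2);  // r1 == r2 == 0 is possible on real hardware
    }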
State space explosion

Thread 1: x = 1; y = 1;
Thread 2: x = 2; y = 2;

[Diagram: the tree of all interleavings of these four writes, branching on which thread takes the next step; nodes show the intermediate values of (x, y), starting from (0, 0)]
State space explosion
  • Number of executions = O( n^(nk) )
  • Exponential in both n and k
    • Typically: n < 10, k > 100
  • Limits scalability to large programs (large k)

[Diagram: n threads (Thread 1 … Thread n), each executing k steps such as x = 1; y = 1; … x = 2; y = 2;]

Bounding execution depth
  • Works very well for message-passing programs
    • Limit the number of message exchanges
  • Message processing code executed atomically
    • Can go ‘deep’ in the state space
  • Does not work for multithreaded programs
    • Even toy programs can have a large number of steps (shared-variable accesses)
Iterative context bounding
  • Prioritize executions with small number of preemptions
  • Two kinds of context switches:
    • Preemptions – forced by the scheduler
      • e.g. Time-slice expiration
    • Non-preemptions – a thread voluntarily yields
      • e.g. Blocking on an unavailable lock, thread end

Thread 1:                          Thread 2:

    x = 1;                             x = 1;
    if (p != 0) {                      p = 0;
        x = p->f;
    }

Example interleaving: Thread 1 executes x = 1; and the test if (p != 0), then a preemption switches to Thread 2, which runs and terminates (the switch back to Thread 1 is a non-preemption); Thread 1 resumes with x = p->f; and dereferences the null pointer.

Iterative context-bounding algorithm
  • The scheduler has a budget of c preemptions
    • Nondeterministically choose the preemption points
  • Resort to non-preemptive scheduling after c preemptions
  • Once all executions explored with c preemptions
    • Try with c+1 preemptions
  • Iterative context-bounding has desirable properties
    • Property 0: Easy to implement
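A sketch of the iterative search loop described above, written against a hypothetical ChessScheduler interface (the real CHESS entry points are not shown in the talk): ExploreAllWithBudget is assumed to run the scenario under every interleaving that uses at most the given number of preemptions, scheduling non-preemptively once the budget is spent.

    #include <cstdio>

    void TestScenario();                 // the tester-provided scenario

    struct ChessScheduler {              // hypothetical interface
        // Explore all interleavings of `scenario` with at most `budget`
        // preemptions; return true once that space is fully explored.
        bool ExploreAllWithBudget(int budget, void (*scenario)());
        bool ErrorFound() const;
    };

    void IterativeContextBounding(ChessScheduler& chess, int maxBudget) {
        for (int c = 0; c <= maxBudget; ++c) {          // grow the budget iteratively
            bool exhausted = chess.ExploreAllWithBudget(c, TestScenario);
            if (chess.ErrorFound()) {
                std::printf("bug reproduced with at most %d preemptions\n", c);
                return;                                  // smallest-preemption trace
            }
            if (exhausted)
                // Coverage guarantee: any remaining bug needs >= c + 1 preemptions.
                std::printf("all interleavings with <= %d preemptions explored\n", c);
        }
    }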
Property 1: Polynomial state space
  • Terminating program with fixed inputs and deterministic threads
    • n threads, k steps each, c preemptions
  • Number of executions <= C(nk, c) · (n+c)!  =  O( (n^2 k)^c · n! )

Exponential in n and c, but not in k

  • Choose c preemption points: C(nk, c) ways
  • Permute the resulting n + c atomic blocks: (n+c)! ways

[Diagram: Thread 1 (x = 1; y = 1;) and Thread 2 (x = 2; y = 2;) split into atomic blocks at the chosen preemption points; the blocks are then interleaved as units]
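A quick arithmetic check of the two counts (illustration only, for small n, k, c): the exact number of interleavings of n threads with k steps each is (nk)!/(k!)^n, which the earlier slide bounds by n^(nk), while the context-bounded search explores at most C(nk, c) · (n+c)! executions.

    #include <cstdio>

    static double Factorial(int m) {
        double r = 1.0;
        for (int i = 2; i <= m; ++i) r *= i;
        return r;
    }

    static double Choose(int m, int c) {          // C(m, c), computed iteratively
        double r = 1.0;
        for (int i = 0; i < c; ++i) r = r * (m - i) / (i + 1);
        return r;
    }

    int main() {
        const int n = 2, k = 10, c = 2;
        double all = Factorial(n * k);            // (nk)! / (k!)^n
        for (int i = 0; i < n; ++i) all /= Factorial(k);
        double bounded = Choose(n * k, c) * Factorial(n + c);
        std::printf("n=%d, k=%d: interleavings without a bound ~ %.0f\n", n, k, all);
        std::printf("n=%d, k=%d, c=%d: bound with <= %d preemptions ~ %.0f\n",
                    n, k, c, c, bounded);
        // Prints roughly 184756 vs 4560: the bounded count grows with k only polynomially.
    }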

Property 2: Deep exploration possible with small bounds
  • A context-bounded execution has unbounded depth
    • a thread may execute an unbounded number of steps within each context
  • Even a context-bound of zero yields complete terminating executions
Property 3: Finds the ‘simplest’ error trace
  • Finds smallest number of preemptions to the error
  • Number of preemptions better metric of error complexity than execution length
Property 4: Coverage metric
  • If search terminates with context-bound of c, then any remaining error must require at least c+1 preemptions
  • Intuitive estimate for
    • The complexity of the bugs remaining in the program
    • The chance of their occurrence in practice
Property 5: Lots of bugs with small number of preemptions
  • A non-blocking implementation of the work-stealing queue algorithm
    • bounded circular buffer accessed concurrently by readers and stealers
  • Developer provided
    • test harness
    • three buggy variations of the program
  • Each bug found with at most 2 preemptions
    • executions with 35 preemptions are possible!
Context-bounding + Partial-order reduction
  • Algorithm:
    • Assume the program is data-race free
    • Context switch only at synchronization points
    • Explore executions with c preemptions
    • Check for data-races in each execution
  • Theorem:
    • If the algorithm terminates without reporting races,
      • Then the program has no assertion failures reachable with c preemptions
    • Requires that a thread can block only at synchronization points
    • Proof in Musuvathi and Qadeer, PLDI 2007
Example: the Dryad error trace found by CHESS

    // Function called by the main thread
    void TestChannel(WorkQueue* workQueue, ...)
    {
        // Creating a channel
        // allocates worker threads
        RChannelReader* channel =
            new RChannelReaderImpl(..., workQueue);

        // ... do work here

        channel->Close();

        // wrong assumption that channel->Close()
        // waits for worker threads to be finished
        delete channel;

        // BUG: deleting the channel when
        // worker threads still have a valid
        // reference to the channel
    }

    // Function called by a worker thread
    // of RChannelReaderImpl
    void RChannelReaderImpl::AlertApplication(RChannelItem* item)
    {
        // Notify Application
        // XXX: Preempt here for the bug
        EnterCriticalSection(&m_baseCS);
        // process before exit
        LeaveCriticalSection(&m_baseCS);
    }

Facts about Dryad error trace
  • Long error trace but requires only one preemption
    • Depth-bounding cannot find it without a lot of luck
  • The error trace has 6 non-preempting context switches
    • It is important to leave unbounded the number of non-preempting context switches
  • This error (and the 6 others found in Dryad) remained in spite of careful regression testing and months of production use
Current CHESS applications (work in progress)
  • Dryad (library for distributed dataflow programming)
  • Singularity/Midori (OS in managed code)
  • User-mode drivers
  • Cosmos (distributed file system)
  • SQL database
Conclusion
  • Concurrency is important
    • Building robust concurrent software is still a challenge
  • Lack of debugging and testing tools
  • CHESS: Concurrency unit-testing
    • Exhaustively try all interleavings
    • Attempt to seamlessly integrate with existing test frameworks
    • Provide replay capability
  • Iterative context-bounding algorithm key to the design