
Chimera: Hybrid Program Analysis for Determinism

Dongyoon Lee, Peter Chen, Jason Flinn, Satish Narayanasamy

University of Michigan, Ann Arbor

* Chimera image from http://superpunch.blogspot.com/2009/02/chimera-sketch.html

Deterministic Replay

Goal: record and reproduce multithreaded execution

  • Debugging concurrency bugs
  • Offline heavyweight dynamic analysis
  • Forensics and intrusion detection
  • … and many more uses

Problem

  • Multithreaded record-and-replay is too slow (>2x) or requires custom hardware

Multithreaded Record-and-Replay is Slow

[Diagram: Thread 1, Thread 2, and Thread 3 issuing Write, Write, and Read accesses]

  • Checkpoint memory and register state
  • Log non-deterministic program input (interrupts, I/O values, DMA, etc.)
  • Log shared memory dependencies

Replay for Data-Race-Free Programs is Cheap

Data-race-free programs

  • Shared memory accesses are well ordered by synchronization ops.
  • Recording happens-before order of sync. ops. is sufficient

Problem: Programs with data races

[Diagram: threads T1, T2, and T3 write X, Y, and Z (values 0, 1, 2), with Unlock(l)/Lock(l) and Signal(c)/Wait(c) between them; legend: order of mem. ops., order of sync. ops.]
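The idea of recording only the order of synchronization operations can be sketched as below. This is a minimal single-object model, not the presenters' implementation: the names `sync_log`, `record_acquire`, `replay_may_acquire`, and `replay_advance`, and the fixed capacity `MAX_EVENTS`, are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_EVENTS 64  /* hypothetical fixed log capacity */

/* One log per synchronization object: the thread ids in the order in
 * which they acquired the object during recording. */
struct sync_log {
    int    tids[MAX_EVENTS];
    size_t len;     /* number of recorded acquisitions */
    size_t cursor;  /* next acquisition to reproduce during replay */
};

/* Recording: called right after thread `tid` acquires the object. */
void record_acquire(struct sync_log *log, int tid)
{
    assert(log->len < MAX_EVENTS);
    log->tids[log->len++] = tid;
}

/* Replay: thread `tid` may acquire only when it is next in the log. */
int replay_may_acquire(const struct sync_log *log, int tid)
{
    return log->cursor < log->len && log->tids[log->cursor] == tid;
}

/* Replay: called after the permitted thread has acquired the object. */
void replay_advance(struct sync_log *log)
{
    assert(log->cursor < log->len);
    log->cursor++;
}
```

During replay, a thread that is not next in the log would wait and retry. For a data-race-free program, reproducing this order is enough to reproduce the order of the shared memory accesses it protects.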

Our Contribution: A Hybrid Analysis

Sound static data race analysis

  • Add synchronizations for potential data races
  • Problem: Too many false positives

Profiling non-concurrent code regions

Symbolic bounds analysis

[Diagram: potentially racy program P → Chimera → data-race-free program P’]

Roadmap
  • Motivation
  • Chimera Analysis
    • Static data race analysis
    • Profiling non-concurrent code regions
    • Symbolic bounds analysis
  • Weak-lock Design
  • Evaluation
  • Conclusion
Static Data Race Analysis
  • Find potential data races using RELAY, a sound static data race detector [Voung et al., FSE’07]
  • Protect all potential data races using weak-locks
    • A new time-out lock which may be preempted (discussed later)
  • Record and replay the happens-before order of weak-locks
Protect Potential Races using Weak-locks

Static analysis helps avoid instrumentation for accesses to Z

    void foo() {
        X = 0;
        for (i = ...) {
            Y[tid][i] = 0;
        }
    }

    void bar() {
        X = 1;
        for (i = …) {
            Y[tid][i] = 1;
            Z = 1;
        }
    }

[Annotations: potential racy-pair (twice); no race report]
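As a rough sketch of what instruction-level instrumentation could look like, the code below wraps each reported access in a weak-lock and leaves Z alone. Everything beyond `foo`, `bar`, `X`, `Y`, `Z`, and `tid` is an assumption: the lock ids `WL_X` and `WL_Y`, the helpers `weak_lock` and `weak_unlock`, the trip count `N`, and the choice of one lock per racy pair are hypothetical. The stand-in lock only logs acquisitions; a real weak-lock would also block.

```c
#include <assert.h>

#define N        4     /* hypothetical loop trip count */
#define THREADS  2
#define MAX_LOG  64

enum { WL_X, WL_Y };   /* hypothetical: one weak-lock per reported racy pair */

static int X, Z, Y[THREADS][N];
static int wl_log[MAX_LOG][2];   /* (lock id, thread id) in acquisition order */
static int wl_log_len;

/* Stand-in for a weak-lock acquire: a real one would block (with a
 * time-out); this sketch only records the acquisition order. */
static void weak_lock(int lock_id, int tid)
{
    assert(wl_log_len < MAX_LOG);
    wl_log[wl_log_len][0] = lock_id;
    wl_log[wl_log_len][1] = tid;
    wl_log_len++;
}

/* Stand-in release: nothing to do in this logging-only model. */
static void weak_unlock(int lock_id, int tid) { (void)lock_id; (void)tid; }

void foo(int tid)
{
    weak_lock(WL_X, tid); X = 0; weak_unlock(WL_X, tid);
    for (int i = 0; i < N; i++) {
        weak_lock(WL_Y, tid); Y[tid][i] = 0; weak_unlock(WL_Y, tid);
    }
}

void bar(int tid)
{
    weak_lock(WL_X, tid); X = 1; weak_unlock(WL_X, tid);
    for (int i = 0; i < N; i++) {
        weak_lock(WL_Y, tid); Y[tid][i] = 1; weak_unlock(WL_Y, tid);
        Z = 1;   /* no race report, so no instrumentation */
    }
}
```

Each call performs 1 + N weak-lock acquisitions, which is the per-access cost that the coarser granularities aim to reduce.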

Sources of False Positives in RELAY
  • Sound data-race detector reports too many false data-races
    • 53x overhead
  • Source 1: Non-mutex synchronizations are ignored
    • Lockset-based analysis ignores fork-join, barrier, signal-wait, etc.
    • May report a false data race between memory instructions that can never execute concurrently
    • Solution: profiling non-concurrent code regions
  • Source 2: Conservative pointer analysis
    • Overestimates the variables accessed by a memory instruction
    • May report a false data race between memory instructions that can never access the same location
    • Solution: symbolic bounds analysis

Roadmap
  • Motivation
  • Chimera Analysis
    • Static data race analysis
    • Profiling non-concurrent code regions
    • Symbolic bounds analysis
  • Weak-lock Design
  • Evaluation
  • Conclusion
Profiling Non-concurrent Code Regions

Problem

  • Lockset-based analysis ignores non-mutex synchronization ops.

Solution

  • Profile non-concurrent code regions (e.g., functions)
  • Increase the granularity of weak-locks to protect a larger code region instead of each potential racy instruction
  • Parallelism is preserved unless mis-profiled

[Diagram: threads T1 and T2, with foo() and bar() separated by a BARRIER; the reported race is marked False Race]
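One way to picture the profiling step is an interval check over profiled invocations, sketched below. The slides do not say how the profiler decides, so this is an assumption throughout: the `interval` record, the global clock it presumes, and the function `likely_non_concurrent` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* One profiled invocation of a function: [enter, exit] timestamps taken
 * from a hypothetical global clock during a profiling run. */
struct interval {
    long enter;
    long exit;
};

/* Two invocations overlap when each starts before the other ends. */
static int overlaps(struct interval a, struct interval b)
{
    return a.enter <= b.exit && b.enter <= a.exit;
}

/* Returns 1 if no invocation of f overlapped any invocation of g in the
 * profile, i.e. the pair looks non-concurrent and is a candidate for a
 * coarse, function-level weak-lock. The intervals are assumed to come
 * from invocations on different threads. */
int likely_non_concurrent(const struct interval *f, size_t nf,
                          const struct interval *g, size_t ng)
{
    assert(f != NULL && g != NULL);
    for (size_t i = 0; i < nf; i++)
        for (size_t j = 0; j < ng; j++)
            if (overlaps(f[i], g[j]))
                return 0;
    return 1;
}
```

Because the answer comes from profiling runs, it is a prediction and can be wrong. A mis-profiled pair stays correct because the weak-lock still serializes it, but it loses parallelism.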

Function-Level Weak-Locks

If the profiler says foo() and bar() are not likely to run concurrently:

    void foo() {
        X = 0;
        for (i = …) {
            Y[tid][i] = 0;
        }
    }

    void bar() {
        X = 1;
        for (i = …) {
            Y[tid][i] = 1;
            Z = 1;
        }
    }

[Diagram: foo() and bar() separated by a BARRIER; the reported race is marked False Race]

Roadmap
  • Motivation
  • Chimera Analysis
    • Static data race analysis
    • Profiling non-concurrent code regions
    • Symbolic bounds analysis
  • Weak-lock Design
  • Evaluation
  • Conclusion
Imprecision in Conservative Pointer Analysis

[Diagram: threads T1 and T2 with bar(), foo(), and BARRIER; annotation: May run concurrently]

  • RELAY uses Steensgaard’s and Andersen’s pointer analyses
    • Flow-insensitive and context-insensitive (FICI) analysis
    • Naming heap objects is conservative
  • Overestimates the variables accessed by a memory instruction

    void foo() {
        for (i = 0 to N) {
            Y[tid][i] = 0;
        }
    }

    void bar() {
        for (i = 0 to N) {
            Y[tid][i] = 1;
        }
    }

[Diagram: Y[][] with regions for Thread 1 and Thread 2; the potential racy pair is marked False Race]

Symbolic Bounds Analysis

Our Solution

  • Derive the symbolic lower and upper bounds that a racy code region (e.g., a loop) may access [Rugina and Rinard, PLDI’00]
  • Increase the granularity of weak-locks to protect a larger code region for a set of addresses specified by a symbolic expression
  • Parallelism is preserved if the bounds are precise enough

    void foo() {
        for (i = 0 to N) {
            Y[tid][i] = 0;
        }
    }

Symbolic bounds analysis yields the bounds &Y[tid][0] to &Y[tid][N]
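To make the bounds concrete, the sketch below evaluates them for the loop above and checks whether two threads' ranges share an address. It is a simplified illustration and not the cited Rugina and Rinard algorithm: the array shape (`ROWS`, `N`), the `bounds` struct, and the helpers `loop_bounds` and `bounds_overlap` are assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define ROWS 4
#define N    7           /* loop runs i = 0 .. N inclusive, as on the slide */

static int Y[ROWS][N + 1];

/* Closed range [lo, hi] of element addresses that a code region may touch. */
struct bounds {
    uintptr_t lo;
    uintptr_t hi;
};

/* Bounds of `for (i = 0 to N) Y[tid][i] = ...`: the index is affine in i
 * and tid is loop-invariant, so the extremes occur at i = 0 and i = N. */
struct bounds loop_bounds(int tid)
{
    assert(tid >= 0 && tid < ROWS);
    struct bounds b = { (uintptr_t)&Y[tid][0], (uintptr_t)&Y[tid][N] };
    return b;
}

/* Two closed ranges conflict only if they share at least one address. */
int bounds_overlap(struct bounds a, struct bounds b)
{
    return a.lo <= b.hi && b.lo <= a.hi;
}
```

With distinct `tid` values the ranges fall in different rows of `Y` and do not overlap, so the two loops can run in parallel even though a conservative pointer analysis reports them as a racy pair.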

Loop-level Weak-locks

Symbolic bounds: &Y[tid][0] ~ &Y[tid][N]

    void foo() {
        X = 0;
        for (i = 0 to N) {
            Y[tid][i] = 0;
        }
    }

    void bar() {
        X = 1;
        for (i = 0 to N) {
            Y[tid][i] = 1;
            Z = 1;
        }
    }

[Annotations: the range (&Y[tid][0], &Y[tid][N]) is shown four times alongside the loops]
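A loop-level weak-lock keyed by an address range might behave like the table below: an acquire succeeds unless another thread already holds an overlapping range. This is a single-threaded model for illustration only. The table layout, the capacity `MAX_HELD`, and the functions `range_try_acquire` and `range_release` are assumptions, and the time-out and logging parts of a weak-lock are left out.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_HELD 8   /* hypothetical cap on simultaneously held ranges */

struct held_range {
    uintptr_t lo, hi;   /* closed address range guarded by the holder */
    int       tid;      /* holder thread id */
    int       in_use;
};

struct range_lock {
    struct held_range held[MAX_HELD];
};

/* Try to acquire the loop-level weak-lock for [lo, hi]. Returns a slot
 * index on success. Returns -1 if another thread holds an overlapping
 * range (the caller has to wait) or if the table is full. */
int range_try_acquire(struct range_lock *rl, uintptr_t lo, uintptr_t hi, int tid)
{
    int free_slot = -1;
    assert(lo <= hi);
    for (int s = 0; s < MAX_HELD; s++) {
        struct held_range *h = &rl->held[s];
        if (!h->in_use) {
            if (free_slot < 0)
                free_slot = s;
        } else if (h->tid != tid && lo <= h->hi && h->lo <= hi) {
            return -1;   /* conflicting range held by another thread */
        }
    }
    if (free_slot >= 0) {
        struct held_range taken = { lo, hi, tid, 1 };
        rl->held[free_slot] = taken;
    }
    return free_slot;
}

/* Release the range previously returned by range_try_acquire. */
void range_release(struct range_lock *rl, int slot)
{
    assert(slot >= 0 && slot < MAX_HELD);
    rl->held[slot].in_use = 0;
}
```

Threads whose loops cover disjoint ranges acquire without waiting, which is how parallelism is preserved when the bounds are precise.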

Imprecise Symbolic Bounds

Sources

  • Depend on the value computed inside the code region
  • Depend on arithmetic operations not supported in the analysis
    • e.g., modulo operations, logical AND/OR, etc.

Choosing the optimal granularity

  • If bounds are too imprecise and the loop body is long enough, resort to instruction (basic-block) level weak-locks for parallelism

    void qux() {
        for (i = 0 to N) {
            prev = Z[prev];
        }
    }

Symbolic bounds analysis yields the bounds -INF to +INF
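The granularity rule stated above can be written as a small decision function. The slides give no concrete threshold or measure of loop-body length, so `LONG_BODY_THRESHOLD`, the instruction-count measure, and the name `choose_granularity` are purely hypothetical.

```c
#include <assert.h>

enum granularity { LOOP_LEVEL, INSTR_LEVEL };

/* Hypothetical threshold: how many instructions a loop body needs before
 * serializing the whole loop is assumed to cost more than per-access
 * locking. */
#define LONG_BODY_THRESHOLD 32

/* bounds_precise: 0 when the analysis fell back to (-INF, +INF). */
enum granularity choose_granularity(int bounds_precise, int body_instrs)
{
    assert(body_instrs >= 0);
    if (!bounds_precise && body_instrs >= LONG_BODY_THRESHOLD)
        return INSTR_LEVEL;   /* keep parallelism, pay per-access cost */
    return LOOP_LEVEL;        /* one lock per loop execution */
}
```

For `qux()`, whose bounds are -INF to +INF, a loop-level lock would conflict with every other range, so a long loop body would fall back to instruction (basic-block) level weak-locks under this rule.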

Roadmap
  • Motivation
  • Chimera Analysis
  • Weak-lock Design
  • Evaluation
  • Conclusion
Deadlock due to Weak-locks

No deadlocks between weak-locks

  • function-level > loop-level > instruction-level

Deadlock between weak-locks and original sync. ops. is possible

[Diagram: T1 and T2 with wait(cv) and signal(cv); annotation: Time-out !!]

Weak-lock Time-out

A weak-lock might time-out

  • Invoke a special system call to handle it

[Diagram: T1 and T2 with signal(cv) and wait(cv); the current owner changes after a time-out, and the logged order of weak-locks is shown]

Weak-lock guarantee

  • Only one thread holds a given weak-lock at any given time
  • Mutual exclusion may be compromised, but the guarantee is still sufficient for replay
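The time-out behaviour can be modelled as a small state machine, sketched below under one reading of the slide: on time-out the current owner is preempted, ownership passes to the waiter, and the hand-off is logged like any other acquisition. The struct layout, the abstract `TIMEOUT` in ticks, and the function names are assumptions. The special system call mentioned above is not modelled.

```c
#include <assert.h>

#define NO_OWNER   (-1)
#define TIMEOUT    100   /* hypothetical time-out, in abstract ticks */
#define MAX_LOG    32

struct weak_lock {
    int owner;              /* at most one owner at any time */
    int log[MAX_LOG];       /* owners in the order they obtained the lock */
    int log_len;
};

/* Make `tid` the single owner and append the acquisition to the log. */
static void grant(struct weak_lock *wl, int tid)
{
    assert(wl->log_len < MAX_LOG);
    wl->owner = tid;
    wl->log[wl->log_len++] = tid;
}

/* Returns 1 if `tid` now owns the lock, 0 if it must keep waiting.
 * `waited` is how long the caller has been blocked so far. */
int weak_lock_acquire(struct weak_lock *wl, int tid, int waited)
{
    if (wl->owner == NO_OWNER) {
        grant(wl, tid);
        return 1;
    }
    if (waited >= TIMEOUT) {
        /* Time-out: the current owner is preempted and ownership moves
         * to the waiter. The hand-off is logged like any acquisition. */
        grant(wl, tid);
        return 1;
    }
    return 0;
}

/* Release is a no-op for a thread that was preempted in the meantime. */
void weak_lock_release(struct weak_lock *wl, int tid)
{
    if (wl->owner == tid)
        wl->owner = NO_OWNER;
}
```

The model keeps the stated guarantee, since `owner` names exactly one thread at a time, while the preempted thread may still be inside its critical region. Replaying the logged ownership order is what makes the execution reproducible.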
Roadmap
  • Motivation
  • Chimera Analysis
  • Weak-lock Design
  • Evaluation
  • Conclusion
Implementation

Source-to-source Instrumentation

  • Implemented in OCaml using CIL as a front end

Static analysis

  • Data race detection: RELAY [Voung et al., FSE’07]
    • Include all library source code for soundness (uClibc’s libc, libm, etc.)
  • Symbolic bounds analysis: [Rugina and Rinard, PLDI’00]
    • Intra-procedural analysis for racy loops only

Runtime system

  • Modified Linux kernel to record/replay program input
  • Modified pthread library to record/replay happens-before order of original synchronization operations and weak-locks
Evaluation Setup

Test Environment

  • 2.66 GHz 8-core Xeon processor with 4 GB of RAM
  • Different set of inputs for profiling and performance evaluation
  • Average of five trials with 4 worker threads
  • 2, 4, 8 threads for scalability results

Benchmarks

  • Desktop applications
    • aget, pfscan, and pbzip2
  • Server programs
    • knot and apache
  • SPLASH-2 suite
    • ocean, water-nsq, fft, and radix
Record and Replay Performance

[Chart callouts: 86% slowdown, 39%, 2.4% slowdown]

  • Recording: 39% on average
  • Replay: similar to recording; much lower for I/O-intensive programs
Effectiveness of Coarse-grained Weak-locks

[Chart callout: 1.39x]

  • Coarse-grained weak-locks reduce the cost of instrumentation
  • Exception: control-flow dependency (e.g., pfscan)
Breakdown of Recording Overhead

[Chart legend: func locks, loop locks, instr/bb locks, sync op & system log]

  • Weak-lock overhead = contention (waiting) cost + logging cost
Breakdown of Recording Overhead

[Chart legend: func wait, func log, loop wait, loop log, instr/bb wait, instr/bb log, sync op & system log]

  • Weak-lock overhead = contention (waiting) cost + logging cost
  • High loop-lock contention
  • High instr/bb-lock contention
Scalability
  • Scientific applications scale worse due to imprecise symbolic bounds analysis
Conclusion

Goal: Software-only deterministic multiprocessor replay systems

Chimera Analysis

  • Static data race analysis
    • Find and protect potential data races with weak-locks
    • Instruction/basic-block-level weak-locks
  • Profiling non-concurrent code regions
    • Address the inadequacy of the lockset-based algorithm
    • Function-level weak-locks
  • Symbolic bounds analysis
    • Address the imprecision of conservative pointer analysis
    • Loop-level weak-locks

Low Recording Overhead

  • 39% recording overhead for 4 worker threads