
What is a Data Race?

  • Two concurrent accesses to a shared location, at least one of them for writing.

    • Indicative of a bug

Thread 1    Thread 2
X++         T=Y
Z=2         T=X



How Can Data Races be Prevented?

  • Explicit synchronization between threads:

    • Locks

    • Critical Sections

    • Barriers

    • Mutexes

    • Semaphores

    • Monitors

    • Events

    • Etc.

Thread 1      Thread 2
Lock(m)       Lock(m)
X++           T=X
Unlock(m)     Unlock(m)
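The locked version of the two threads can be sketched in C++ as follows. This is a minimal illustration, assuming `std::mutex` stands in for the slide's Lock(m)/Unlock(m); the function names are chosen here for illustration and are not from the slides.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Shared state from the slide; the mutex m plays the role of Lock(m)/Unlock(m).
static int X = 0;
static int T = 0;
static std::mutex m;

// Thread 1: X++ inside the critical section.
void thread1_body() {
    std::lock_guard<std::mutex> guard(m);  // Lock(m); unlocks on scope exit
    X++;
}

// Thread 2: T = X inside the critical section.
void thread2_body() {
    std::lock_guard<std::mutex> guard(m);
    T = X;
}

// Runs both threads to completion and returns the value thread 2 observed.
int run_locked_example() {
    X = 0;
    T = -1;
    std::thread t1(thread1_body);
    std::thread t2(thread2_body);
    t1.join();
    t2.join();
    assert(X == 1);  // the increment always happens exactly once
    return T;        // 0 or 1, depending on which thread took the lock first
}
```

The lock removes the data race but not the nondeterminism: T may still end up as 0 or 1, which is why the later slides turn to recording the order of shared-memory accesses.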



Is This Sufficient?

  • Yes!

  • No!

    • Programmer dependent

      • Correctness – programmer may forget to synchronize

        • Need tools to detect data races

    • Expensive

      • Efficiency – to achieve correctness, programmer may oversynchronize

        • Need tools to remove excessive synchronization



Where is Waldo?

#define N 100

Type* g_stack = new Type[N];
int g_counter = 0;
Lock g_lock;

void push( Type& obj ) { lock(g_lock); ... unlock(g_lock); }
void pop( Type& obj )  { lock(g_lock); ... unlock(g_lock); }

void popAll( ) {
    lock(g_lock);
    delete[] g_stack;
    g_stack = new Type[N];
    g_counter = 0;
    unlock(g_lock);
}

int find( Type& obj, int number ) {
    int i;  // declared outside the loop so it is still in scope afterwards
    lock(g_lock);
    for (i = 0; i < number; i++)
        if (obj == g_stack[i]) break;  // Found!!!
    if (i == number) i = -1;           // Not found: return -1 to caller
    unlock(g_lock);
    return i;
}

int find( Type& obj ) {
    return find( obj, g_counter );
}


Can You Find the Race?

A similar problem was found in java.util.Vector.

  • write: popAll( ) sets g_counter = 0 while holding g_lock

  • read: find( Type& obj ) reads g_counter to pass it as the number argument, before find( Type& obj, int number ) acquires g_lock

The read is not protected by g_lock, so it can run concurrently with the write.
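One way to close this race is to read g_counter under the same lock that protects the writes. The sketch below is an illustration only: `Type` is assumed to be `int`, `std::mutex` replaces the slide's `Lock`, the body of `push` (elided on the slide) is filled in with a plausible guess, and `find_locked` is a hypothetical helper introduced so the lock is taken only once.

```cpp
#include <cassert>
#include <mutex>

// Stand-ins for the slide's declarations; Type is assumed to be int here.
using Type = int;
constexpr int N = 100;

static Type* g_stack = new Type[N];
static int g_counter = 0;
static std::mutex g_lock;

// Hypothetical helper: the caller must already hold g_lock.
static int find_locked(const Type& obj, int number) {
    assert(number <= N);  // never scan past the allocated array
    for (int i = 0; i < number; i++)
        if (obj == g_stack[i]) return i;  // found
    return -1;                            // not found
}

// Assumed body; the slide elides it with "...".
void push(const Type& obj) {
    std::lock_guard<std::mutex> guard(g_lock);
    if (g_counter < N) g_stack[g_counter++] = obj;
}

void popAll() {
    std::lock_guard<std::mutex> guard(g_lock);
    delete[] g_stack;
    g_stack = new Type[N];
    g_counter = 0;
}

int find(const Type& obj) {
    std::lock_guard<std::mutex> guard(g_lock);
    // g_counter is now read under the same lock that popAll() holds when
    // writing it, so the read and the write can no longer race.
    return find_locked(obj, g_counter);
}
```

Splitting out an unlocked helper avoids acquiring a non-recursive mutex twice, which the original two-overload layout would require if the outer find simply took the lock before delegating.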



Detecting Data Races?

  • NP-hard [Netzer & Miller 1990]

    • Input size = # instructions performed

    • Even for 3 threads only

    • Even with no loops/recursion

  • Execution orders/scheduling: (#threads)^(thread_length)

  • # inputs

  • Detection-code’s side-effects

  • Weak memory, instruction reorder, atomicity
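To get a feel for how fast the number of execution orders grows, the interleavings can be counted exactly for small cases. This is a minimal sketch, assuming straight-line threads of equal length whose instructions execute atomically; the function names are illustrative, and the multinomial count used here is a standard combinatorial result rather than a formula taken from the slides.

```cpp
#include <cassert>
#include <cstdint>

// Binomial coefficient C(n, k), computed incrementally so that every
// intermediate value is itself a binomial coefficient (the division is exact).
std::uint64_t binomial(std::uint64_t n, std::uint64_t k) {
    assert(k <= n);
    std::uint64_t result = 1;
    for (std::uint64_t i = 1; i <= k; ++i)
        result = result * (n - k + i) / i;
    return result;
}

// Number of distinct interleavings of `threads` threads that each execute
// `length` instructions: (threads*length)! / (length!)^threads.
// Overflows 64 bits quickly; intended for small inputs only.
std::uint64_t interleavings(std::uint64_t threads, std::uint64_t length) {
    std::uint64_t total = 0;
    std::uint64_t count = 1;
    for (std::uint64_t t = 0; t < threads; ++t) {
        total += length;
        count *= binomial(total, length);  // choose the slots for thread t
    }
    return count;
}
```

Two threads of two instructions each already give 6 schedules, and two threads of ten instructions give 184756, so exhaustively exploring schedules is impractical for realistic programs.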



Motivation

Run-time framework goals

  • Collect a complete trace of a program’s user-mode execution

  • Keep the tracing overhead for both space and time low

  • Re-simulate the traced execution deterministically based on the collected trace with full fidelity down to the instruction level

    • Full fidelity: user mode only, no tracing of kernel, only user-mode I/O callbacks

Advantages

  • Complete program trace that can be analyzed from multiple perspectives (replay analyzers: debuggers, locality, etc.)

  • Trace can be collected on one machine and replayed on other machines (or analyzed live by streaming)

Challenges: Trace Size and Performance



Original Record-Replay Approaches

  • InstantReplay ’87

    • Record order of memory accesses

    • Overhead may affect program behavior

  • RecPlay ’00

    • Record only synchronizations

    • Not deterministic if the program has data races

  • Netzer ’93

    • Record optimal trace

    • Too expensive to keep track of all memory locations

  • Bacon & Goldstein ’91

    • Record memory bus transactions with hardware

    • High logging bandwidth



Motivation

Increasing use and development for multi-core processors

  • Multithreaded (MT) program behavior is non-deterministic

  • To effectively debug software, developers must be able to replay executions that exhibit concurrency bugs

    • Shared memory updates happen in a different order from run to run



Related Concepts

  • Runtime interpretation/translation of binary instructions

    • Requires no static instrumentation or special symbol information

    • Handles dynamically generated code and self-modifying code

    • Recording/Logging: ~100-200x slowdown

  • More recent logging

    • Proposed hardware support (for MT domain)

    • FDR (Flight Data Recorder)

    • BugNet (cache bits set on first load)

    • RTR (Regulated Transitive Reduction)

    • DeLorean (ISCA 2008; chunks of instructions)

    • Strata (time layer across all the logs for the running threads)

    • iDNA (Diagnostic infrastructure using NirvanA; Microsoft)



Deterministic Replay

Re-execute the exact same sequence of instructions as recorded in a previous run

  • Single threaded programs

    • Record Load Values needed for reproducing behavior of a run (Load Log)

    • Registers updated by system calls and signal handlers (Reg Log)

    • Output of special instructions: RDTSC, CPUID (Reg Log)

    • System calls (virtualization: cloning arguments, updates)

    • Checkpointing (log summary ~10 million)

  • Multi-threaded programs

    • Log interleaving among threads (shared memory updates ordering – SMO Log)



PinSEL – System Effect Log (SEL)

Logging program load values needed for deterministic replay:

  • First access to a memory location

  • Values modified by the system (system effect) and read by program

  • Machine and time sensitive instructions (cpuid,rdtsc)

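Program execution:

Instruction     Value        Logged?
Store A         A ← 111      (store: no load value to log)
Store B         B ← 55       (store: no load value to log)
Load C          C = 9        Logged (first access)
Load D          D = 10       Logged (first access)
system call     modifies location (B -> 0) and (C -> 99)
Load A          A = 111      Not Logged
Load B          B = 0        Logged (modified by the system)
Load C          C = 99       Logged (modified by the system)
Load D          D = 10       Not Logged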

  • Trace size is ~4-5 bytes per instruction
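The logging rule on this slide can be sketched as a small class that keeps its own copy of the values the program has stored or loaded, and logs a load only when that copy cannot predict it. This is an illustrative sketch, not PinSEL's implementation: the class name `LoadLogger`, the symbolic one-character addresses, and the (address, value) log format are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using Addr = char;           // symbolic addresses 'A', 'B', ... as on the slide
using Value = std::int64_t;

// Logs a load only when the logger's own copy of memory cannot predict it.
class LoadLogger {
public:
    // A program store updates the copy; nothing is logged.
    void on_store(Addr a, Value v) { copy_[a] = v; }

    // A program load is logged on first access, or when the real value
    // differs from the copy (meaning the system changed it).
    void on_load(Addr a, Value actual) {
        auto it = copy_.find(a);
        if (it == copy_.end() || it->second != actual) {
            log_.push_back({a, actual});
        }
        copy_[a] = actual;
    }

    const std::vector<std::pair<Addr, Value>>& log() const { return log_; }

private:
    std::map<Addr, Value> copy_;
    std::vector<std::pair<Addr, Value>> log_;
};

// Feeds the slide's instruction sequence to the logger and returns the log.
std::vector<std::pair<Addr, Value>> run_slide_example() {
    LoadLogger logger;
    logger.on_store('A', 111);
    logger.on_store('B', 55);
    logger.on_load('C', 9);    // first access: logged
    logger.on_load('D', 10);   // first access: logged
    // The system call modifies B -> 0 and C -> 99 behind the program's back,
    // so the logger observes nothing at this point.
    logger.on_load('A', 111);  // matches the copy: not logged
    logger.on_load('B', 0);    // mismatch: logged
    logger.on_load('C', 99);   // mismatch: logged
    logger.on_load('D', 10);   // matches the copy: not logged
    assert(logger.log().size() == 4);  // 4 of the 6 loads are logged
    return logger.log();
}
```

Because the system call itself is never instrumented, its effects are captured indirectly: they show up as mismatches on the next load of each modified location.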


Optimization: Trace select reads

    i = 1;
    for (j = 0; j < 10; j++)
    {
        i = i + j;
    }
    k = i;          // value read is 46
    System_call();
    k = i;          // value read is 0 (not predicted)

  • Observation: Hardware caches eliminate most off-chip reads

  • Optimize logging:

    • Logger and replayer simulate identical cache memories

    • Simple cache (the memory copy structure) to decide which values to log. No tags or valid bits to check. If the values mismatch they are logged.

  • Average trace size is <1 bit per instruction

  • The only read not predicted and logged follows the system call
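The replay side of this scheme can be sketched in the same spirit: the replayer maintains the same memory copy and consults the log only for reads the copy could not have predicted. This is a hedged illustration, not PinPLAY's implementation; the `Replayer` class, the log format (dynamic load index plus value), and the treatment of `j` as a register rather than a memory location are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using Value = std::int64_t;

// One logged load: which dynamic load (by index) and the value it returned.
struct LoggedLoad {
    std::size_t load_index;
    Value value;
};

// Keeps the same memory copy the logger kept, and takes a value from the
// log only for loads that were logged during recording.
class Replayer {
public:
    explicit Replayer(std::vector<LoggedLoad> log) : log_(std::move(log)) {}

    void store(char addr, Value v) { copy_[addr] = v; }

    Value load(char addr) {
        if (next_ < log_.size() && log_[next_].load_index == loads_seen_) {
            copy_[addr] = log_[next_].value;  // unpredicted: use logged value
            ++next_;
        }
        ++loads_seen_;
        assert(copy_.count(addr) == 1);  // a first access must have been logged
        return copy_[addr];
    }

private:
    std::vector<LoggedLoad> log_;
    std::map<char, Value> copy_;
    std::size_t next_ = 0;
    std::size_t loads_seen_ = 0;
};

// Re-executes the slide's code fragment against a one-entry log and
// returns the two values assigned to k.
std::vector<Value> replay_slide_example() {
    // Loads of i: ten inside the loop (indices 0..9), then the two `k = i`
    // reads (indices 10 and 11). Only load 11, after the system call, is logged.
    std::vector<LoggedLoad> log = {LoggedLoad{11, 0}};
    Replayer r(log);
    std::vector<Value> k_values;

    r.store('i', 1);
    for (Value j = 0; j < 10; j++) {
        r.store('i', r.load('i') + j);
    }
    k_values.push_back(r.load('i'));  // predicted from the copy
    // System_call() is not re-executed during replay; its effect on i
    // arrives through the log.
    k_values.push_back(r.load('i'));  // taken from the log
    return k_values;
}
```

Twelve loads are replayed from a log holding a single entry, which is the effect behind the slide's sub-bit-per-instruction average.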



Example Overhead

  • PinSEL and PinPLAY

    • Initial work (2006) with single threaded programs:

      • SPEC2000 ref runs: 130x slowdown for PinSEL and ~80x for PinPLAY (without inlining)

    • Working with a subset of SPLASH2 benchmarks: 230x slowdown for PinSEL

  • Now: Geo-mean SPEC2006

    • Pin 1.4x

    • Logger 83.6x

    • Replayer 1.4x



Example: Microsoft iDNA Trace Writer Performance

  • Memchecker and Valgrind are in the 30-40x range on CPU 2006

  • iDNA: ~11x (does not log shared-memory dependences explicitly)

    • Uses a sequence number for every lock-prefixed memory operation, enabling offline data race analysis


Logging Shared Memory Ordering (Cristiano’s PinSEL/PLAY Overview)

  • Emulation of Directory Based Cache Coherence

    • Identifies RAW, WAR, WAW dependences

    • Indexed by hashing effective address

    • Each entry represents an address range

[Figure: during program execution, the effective address of each memory access (Store A, Load B) is hashed to select a Dir Entry in the Directory.]



Directory Entries

  • Every DirEntry maintains:

    • Thread id of the last_writer

    • A timestamp is the number of memory references the thread has executed

    • Vector of timestamps of last access for each thread to that entry

    • On Loads: update the timestamp for the thread in the entry

    • On Stores: update the timestamp and the last_writer fields

[Figure: the Directory after Thread T1 (1: Store A, 2: Load A, 3: Store F) and Thread T2 (1: Load F, 2: Store A, 3: Load F) have executed.]

DirEntry     Last writer id    Vector (last access)
[A:D]        T2                T1: 2, T2: 2
[E:H]        T1                T1: 3, T2: 3



Detecting Dependences

  • RAW dependency between threads T and T’ is established if:

    • T executes a load that maps to the directory entry A

    • T’ is the last_writer for the same entry

  • WAW dependency between T and T’ is established if:

    • T executes a store that maps to the directory entry A

    • T’ is the last_writer for the same entry

  • WAR dependency between T and T’ is established if:

    • T executes a store that maps to the directory entry A

    • T’ has accessed the same entry in the past and T is not the last_writer
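The three rules can be sketched as a small directory that emits one SMO record per detected dependence. This is an illustration under stated assumptions, not the PinSEL code: the type names, the trivial address-to-entry mapping (four symbolic addresses per entry, standing in for the hash), the choice to report the last writer as WAW only rather than also as WAR, and the global interleaving used in the example are all assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

constexpr int kNoThread = -1;

// `thread` may not execute its memory reference `ref` until `dep_thread`
// has executed its memory reference `dep_ref`.
struct SmoRecord {
    std::string kind;  // "RAW", "WAW" or "WAR"
    int thread;
    std::uint64_t ref;
    int dep_thread;
    std::uint64_t dep_ref;
};

struct DirEntry {
    int last_writer = kNoThread;
    std::map<int, std::uint64_t> last_access;  // thread id -> timestamp
};

class Directory {
public:
    void load(int t, char addr) {
        DirEntry& e = entry_for(addr);
        std::uint64_t ts = ++refs_[t];
        if (e.last_writer != kNoThread && e.last_writer != t)  // RAW rule
            log_.push_back({"RAW", t, ts, e.last_writer, e.last_access[e.last_writer]});
        e.last_access[t] = ts;
    }

    void store(int t, char addr) {
        DirEntry& e = entry_for(addr);
        std::uint64_t ts = ++refs_[t];
        if (e.last_writer != kNoThread && e.last_writer != t)  // WAW rule
            log_.push_back({"WAW", t, ts, e.last_writer, e.last_access[e.last_writer]});
        if (e.last_writer != t) {                              // WAR rule
            for (const auto& kv : e.last_access) {
                // Assumption: the last writer is already covered by WAW.
                if (kv.first != t && kv.first != e.last_writer)
                    log_.push_back({"WAR", t, ts, kv.first, kv.second});
            }
        }
        e.last_access[t] = ts;
        e.last_writer = t;
    }

    const std::vector<SmoRecord>& log() const { return log_; }

private:
    // Assumed mapping in place of a hash: 'A'..'D' share one entry,
    // 'E'..'H' the next, and so on.
    DirEntry& entry_for(char addr) {
        assert(addr >= 'A');
        return entries_[(addr - 'A') / 4];
    }

    std::map<int, DirEntry> entries_;
    std::map<int, std::uint64_t> refs_;  // per-thread memory reference count
    std::vector<SmoRecord> log_;
};

// One possible interleaving of the two threads (assumed for illustration).
std::vector<SmoRecord> run_two_thread_example() {
    Directory dir;
    const int T1 = 1, T2 = 2;
    dir.store(T1, 'A');  // T1 ref 1
    dir.load(T2, 'F');   // T2 ref 1
    dir.store(T2, 'A');  // T2 ref 2: WAW on T1 ref 1
    dir.load(T1, 'A');   // T1 ref 2: RAW on T2 ref 2
    dir.load(T2, 'F');   // T2 ref 3
    dir.store(T1, 'F');  // T1 ref 3: WAR on T2 ref 3
    return dir.log();
}
```

Tracking whole address ranges per entry keeps the directory small at the cost of false dependences: two accesses to different addresses in the same range are ordered even though they do not actually conflict.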


Example

Thread T1        Thread T2
1: Store A       1: Load F
2: Load A        2: Store A
3: Store F       3: Load F

Dependences:
  WAW: T1 1: Store A, then T2 2: Store A
  RAW: T2 2: Store A, then T1 2: Load A
  WAR: T2 3: Load F, then T1 3: Store F

DirEntry     Last_writer    Last access to the DirEntry
[A:D]        T2             T1: 2, T2: 2
[E:H]        T1             T1: 3, T2: 3

SMO logs:
  T2 2  T1 1    Thread T2 cannot execute memory reference 2 until T1 executes its memory reference 1
  T1 2  T2 2    Thread T1 cannot execute memory reference 2 until T2 executes its memory reference 2
  T1 3  T2 3



Ordering Memory Accesses (Reducing log size)

  • Preserving order will reproduce execution

    • a→b: “a happens-before b”

    • Ordering is transitive: a→b, b→c means a→c

  • Two instructions must be ordered if:

    • they both access the same memory, and

    • one of them is a write


Constraints: Enforcing Order

P1    P2
a     c
b     d

Figure label: overconstrained

To guarantee a→d, any one of these constraints is enough:

  • a→d

  • b→d

  • a→c

  • b→c

Suppose we need b→c:

  • b→c is necessary

  • a→d is redundant
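The redundancy argument on this slide reduces to a comparison of positions in program order. The sketch below is illustrative: the `Constraint` type and the `implies` helper are names made up for the example, and instructions are identified by their index within their processor (a = 0, b = 1 on P1; c = 0, d = 1 on P2).

```cpp
#include <cassert>

// A cross-processor ordering constraint: instruction `src` on P1 must
// happen before instruction `dst` on P2. Instructions are identified by
// their position in program order (0 = first).
struct Constraint {
    int src;
    int dst;
};

// `logged` implies `wanted` when wanted.src is at or before logged.src on P1
// and wanted.dst is at or after logged.dst on P2, because then
// wanted.src -> logged.src -> logged.dst -> wanted.dst holds by transitivity.
bool implies(const Constraint& logged, const Constraint& wanted) {
    return wanted.src <= logged.src && logged.dst <= wanted.dst;
}

// Checks the slide's claims for P1 = (a, b) and P2 = (c, d).
void check_slide_example() {
    const int a = 0, b = 1;  // program order on P1
    const int c = 0, d = 1;  // program order on P2
    const Constraint a_d{a, d}, b_d{b, d}, a_c{a, c}, b_c{b, c};

    // Each of the four constraints guarantees a -> d.
    assert(implies(a_d, a_d));
    assert(implies(b_d, a_d));
    assert(implies(a_c, a_d));
    assert(implies(b_c, a_d));

    // Once b -> c is logged, a -> d is redundant, but not the other way round.
    assert(!implies(a_d, b_c));
}
```

The same test is what lets a recorder drop log entries: a new dependence only needs to be written if no entry already in the log implies it.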


Problem Formulation

[Figure: the same execution of Thread I and Thread J shown twice, once during Recording and once during Replay, connected by the Log. Dependences are drawn in black, conflicts in red.]

Thread I    Thread J
ld A        add
st B        st C
st C        ld B
ld D        st A
sub         st C
ld B        st D

  • Reproduce exact same conflicts: no more, no less


Log All Conflicts

IC    Thread I    Thread J
1     ld A        add
2     st B        st C
3     st C        ld B
4     ld D        st A
5     sub         st C
6     ld B        st D

  • Assign IC (logical timestamps)

  • Detect conflicts → Write log

  • But too many conflicts

Dependence Log

Log J: 2→3 1→4 3→5 4→6
Log I: 2→3

An entry s→d in Log J orders instruction s of Thread I before instruction d of Thread J; Log I is the mirror image.

Log Size: 5*16=80 bytes (10 integers), at 16 bytes per entry


Netzer’s Transitive Reduction

[Figure: the Thread I / Thread J execution with IC 1-6, replayed with the dependence 1→4 marked "TR reduced".]

TR Reduced Log

Log J: 2→3 3→5 4→6
Log I: 2→3

1→4 is dropped because 2→3, together with program order in both threads, already implies it.

Log Size: 64 bytes (8 integers)
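The reduction of Log J can be reproduced with a one-pass filter. This is a simplified sketch, not Netzer's full algorithm: it assumes only two threads, considers the dependences arriving at one thread from a single source thread, and expects them sorted by destination instruction count; the names `Dep`, `transitive_reduce`, and `log_bytes` are made up for the example, and the 16 bytes per entry is the figure implied by the slides' log sizes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Dependence from instruction count `src` in the source thread to
// instruction count `dst` in the logging thread (counts start at 1).
struct Dep {
    int src;
    int dst;
};

// Drops every dependence already implied by one that was kept earlier.
// Assumes `deps` is sorted by increasing dst, the order in which the
// logging thread detects them.
std::vector<Dep> transitive_reduce(const std::vector<Dep>& deps) {
    std::vector<Dep> kept;
    int newest_src = 0;  // largest src already ordered before the current dst
    int prev_dst = 0;
    for (const Dep& d : deps) {
        assert(d.dst >= prev_dst);  // input must be sorted by dst
        prev_dst = d.dst;
        if (d.src <= newest_src) continue;  // implied by an earlier entry
        kept.push_back(d);
        newest_src = d.src;
    }
    return kept;
}

// Log size at 16 bytes per entry.
std::size_t log_bytes(const std::vector<Dep>& log_j, const std::vector<Dep>& log_i) {
    return (log_j.size() + log_i.size()) * 16;
}
```

Applied to Log J (2→3, 1→4, 3→5, 4→6), the filter keeps 2→3, 3→5 and 4→6; with the single entry in Log I that gives four entries, or 64 bytes.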


RTR (Regulated Transitive Reduction): Stricter Dependences to Aid Vectorization

[Figure: the Thread I / Thread J execution with IC 1-6; a stricter dependence 4→5 replaces 3→5 and 4→6, which are marked "Reduced".]

New Reduced Log

Log J: 2→3 4→5
Log I: 2→3

Log Size: 48 bytes (6 integers)

4% overhead for RTR+FDR (simulated on GEMS)

0.2 MB/core/second logging (Apache)

