Deterministic Logging/Replaying of Applications


Motivation

Run-time framework goals

  • Collect a complete trace of a program’s user-mode execution

  • Keep the tracing overhead for both space and time low

  • Re-simulate the traced execution deterministically based on the collected trace with full fidelity down to the instruction level

    • Full fidelity: user mode only; the kernel is not traced, only user-mode I/O callbacks are captured

      Advantages

  • Complete program trace that can be analyzed from multiple perspectives (replay analyzers: debuggers, locality, etc)

  • Trace can be collected on one machine and replayed on other machines (or analyzed live by streaming)

    Challenges: Trace Size and Performance


Original Record-Replay Approaches

  • InstantReplay ’87

    • Record the order of memory accesses

    • Overhead may affect program behavior

  • RecPlay ’00

    • Record only synchronizations

    • Not deterministic in the presence of data races

  • Netzer ’93

    • Record an optimal trace

    • Too expensive to keep track of all memory locations

  • Bacon & Goldstein ’91

    • Record memory bus transactions with hardware

    • High logging bandwidth


Motivation

Increasing use of, and development for, multi-core processors

  • Multi-threaded (MT) program behavior is non-deterministic

  • To effectively debug software, developers must be able to replay executions that exhibit concurrency bugs

    • Shared-memory updates can happen in a different order from run to run


Related Concepts

  • Runtime interpretation/translation of binary instructions

    • Requires no static instrumentation or special symbol information

    • Handles dynamically generated and self-modifying code

    • Recording/logging slowdown: ~100-200x

  • More recent logging

    • Proposed hardware support (for MT domain)

    • FDR (Flight Data Recorder)

    • BugNet (cache bits set on first load)

    • RTR (Regulated Transitive Reduction)

    • DeLorean (ISCA 2008: chunks of instructions)

    • Strata (time layer across all the logs for the running threads)

    • iDNA (Diagnostic infrastructure using Nirvana, Microsoft)


Deterministic Replay

Re-execute the exact same sequence of instructions as recorded in a previous run

  • Single threaded programs

    • Record the load values needed to reproduce the behavior of a run (Load Log)

    • Registers updated by system calls and signal handlers (Reg Log)

    • Output of special instructions: RDTSC, CPUID (Reg Log)

    • System call virtualization (cloning of arguments and updates)

    • Checkpointing (log summary ~10 million)

  • Multi-threaded programs

    • Log the interleaving among threads (shared-memory update ordering – SMO Log)
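As a rough illustration of what these logs might contain, here is a minimal C++ sketch; the record layouts and field names are assumptions for illustration, not the actual PinSEL/PinPLAY formats.

  #include <cstdint>
  #include <vector>

  // Hypothetical log record layouts (illustrative field names only).
  struct LoadLogEntry {        // value a load must return during replay
      uint64_t instrCount;     // per-thread instruction count of the load
      uint64_t address;        // effective address
      uint64_t value;          // value observed when recording
  };

  struct RegLogEntry {         // register state injected by system calls,
      uint64_t instrCount;     // signal handlers, RDTSC or CPUID
      uint32_t regId;
      uint64_t value;
  };

  struct SmoLogEntry {         // shared-memory-ordering constraint:
      uint32_t waitThread;     // waitThread may not issue its memory
      uint64_t waitRef;        // reference number waitRef until afterThread
      uint32_t afterThread;    // has completed its reference number afterRef
      uint64_t afterRef;
  };

  struct ThreadLogs {
      std::vector<LoadLogEntry> loadLog;   // Load Log
      std::vector<RegLogEntry>  regLog;    // Reg Log
      std::vector<SmoLogEntry>  smoLog;    // SMO Log
  };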


PinSEL – System Effect Log (SEL)

Logging program load values needed for deterministic replay:

  • First access from a memory location

  • Values modified by the system (system effect) and read by the program

  • Machine- and time-sensitive instructions (CPUID, RDTSC)

Program execution (example):

  Store A  (A ← 111)                          not logged
  Store B  (B ← 55)                           not logged
  Load C   (C = 9)        first access        logged
  Load D   (D = 10)       first access        logged
  ... system call modifies B ← 0 and C ← 99 ...
  Load A   (A = 111)      predicted           not logged
  Load B   (B = 0)        system effect       logged
  Load C   (C = 99)       system effect       logged
  Load D   (D = 10)       predicted           not logged

  • Trace size is ~4-5 bytes per instruction
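A simplified sketch of the logging rule above, assuming the logger keeps a shadow copy of the memory it has already seen (class and function names are hypothetical, not PinSEL's API): a load is written to the SEL only when its value cannot be predicted from that copy, i.e. on the first access to a location or after the system modified it.

  #include <cstdint>
  #include <unordered_map>
  #include <vector>

  struct LoggedLoad { uint64_t addr; uint64_t value; };

  class SystemEffectLogger {
      std::unordered_map<uint64_t, uint64_t> shadow;  // logger's copy of memory seen so far
  public:
      std::vector<LoggedLoad> sel;                    // the System Effect Log

      void onStore(uint64_t addr, uint64_t value) {
          shadow[addr] = value;          // program stores are reproducible: never logged
      }

      void onLoad(uint64_t addr, uint64_t value) {
          auto it = shadow.find(addr);
          if (it == shadow.end() || it->second != value) {
              sel.push_back({addr, value});   // first access, or value changed by the system
              shadow[addr] = value;
          }
          // otherwise the replayer can reproduce the value itself: nothing is logged
      }
  };

Feeding the example above through onStore/onLoad would log only the first reads of C and D and the post-syscall reads of B and C, matching the annotations.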


Optimization: Trace-Select Reads

  i = 1;
  for (j = 0; j < 10; j++)
  {
      i = i + j;
  }
  k = i;           // value read is 46 (predicted, not logged)
  System_call();
  k = i;           // value read is 0 (not predicted, logged)

  • Observation: Hardware caches eliminate most off-chip reads

  • Optimize logging:

    • Logger and replayer simulate identical cache memories

    • A simple cache (the memory-copy structure) decides which values to log; there are no tags or valid bits to check, and values are logged only when they mismatch

  • Average trace size is <1 bit per instruction

  • The only read not predicted and logged follows the system call
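A sketch of the optimization under stated assumptions: logger and replayer keep the same small direct-mapped array of values, with no tags or valid bits, and a load is logged only when the slot's value differs from the value actually read. The class, the slot count, and the index function are illustrative choices, not the actual Pin implementation.

  #include <cstddef>
  #include <cstdint>
  #include <utility>
  #include <vector>

  class TraceSelectPredictor {
      static constexpr size_t kSlots = 1 << 16;               // illustrative size
      std::vector<uint64_t> slot = std::vector<uint64_t>(kSlots, 0);

      static size_t index(uint64_t addr) { return (addr >> 3) & (kSlots - 1); }
  public:
      std::vector<std::pair<uint64_t, uint64_t>> loadLog;     // (address, value) pairs

      void onStore(uint64_t addr, uint64_t value) { slot[index(addr)] = value; }

      void onLoad(uint64_t addr, uint64_t value) {
          uint64_t& predicted = slot[index(addr)];
          if (predicted != value) {             // mismatch: the value must be logged
              loadLog.push_back({addr, value});
              predicted = value;
          }                                     // match: the replayer will predict it too
      }
  };

Because the replayer updates an identical structure with the same stream of values, both sides agree on which loads were logged; in the code fragment at the top of this slide, the only read that is not predicted, and hence logged, is the one following System_call().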


Example Overhead

  • PinSEL and PinPLAY

    • Initial work (2006) with single threaded programs:

      • SPEC2000 ref runs: 130x slowdown for PinSEL and ~80x for PinPLAY (w/o inlining)

    • Working with a subset of SPLASH2 benchmarks: 230x slowdown for PinSEL

  • Now: Geo-mean SPEC2006

    • Pin 1.4x

    • Logger 83.6x

    • Replayer 1.4x


Example: Microsoft iDNA Trace Writer Performance

  • Memcheck and Valgrind are in the 30-40x range on SPEC CPU2006

  • iDNA: ~11x (does not log shared-memory dependences explicitly)

    • Uses a sequence number for every lock-prefixed memory operation: offline data-race analysis


Logging Shared Memory Ordering

  • Emulation of Directory Based Cache Coherence

    • Identifies RAW, WAR, WAW dependences

    • Indexed by hashing effective address

    • Each entry represents an address range

Figure: each load and store of the program execution (e.g., Store A, Load B) is hashed by its effective address to a Dir Entry in the Directory; each Dir Entry represents an address range.
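A minimal sketch of the indexing, assuming each directory entry covers a fixed, aligned address range; the range size, table size, and hash are illustrative parameters, not the tool's.

  #include <cstddef>
  #include <cstdint>

  constexpr size_t   kDirEntries = 1 << 12;   // number of directory entries
  constexpr unsigned kRangeBits  = 6;         // each entry covers a 64-byte range

  // Hash the effective address of a load/store down to a directory entry index.
  size_t dirIndex(uint64_t effectiveAddr) {
      uint64_t range = effectiveAddr >> kRangeBits;     // address range id
      return (range ^ (range >> 17)) & (kDirEntries - 1);
  }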


Directory Entries

  • Every DirEntry maintains:

    • Thread id of the last_writer

    • A timestamp is the number of memory references the thread has executed

    • Vector of timestamps of the last access by each thread to that entry

    • On Loads: update the timestamp for the thread in the entry

    • On Stores: update the timestamp and the last_writer fields

Worked example: Thread T1 executes 1: Store A, 2: Load A, 3: Store F, while Thread T2 executes 1: Load F, 2: Store A, 3: Load F. Addresses A and F hash to DirEntry [A:D] and DirEntry [E:H] respectively. After this execution:

  • DirEntry [A:D]: last writer id T2; timestamp vector T1: 2, T2: 2

  • DirEntry [E:H]: last writer id T1; timestamp vector T1: 3, T2: 3
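A sketch of one directory entry, assuming a timestamp is simply the count of memory references a thread has executed (as stated above); names and the fixed thread bound are illustrative.

  #include <cstdint>
  #include <vector>

  constexpr int kMaxThreads = 64;   // illustrative bound
  constexpr int kNoWriter   = -1;

  struct DirEntry {
      int      last_writer    = kNoWriter;     // thread id of the last store to this range
      uint64_t last_writer_ts = 0;             // that store's timestamp
      std::vector<uint64_t> last_access =      // per-thread timestamp of the last access
          std::vector<uint64_t>(kMaxThreads, 0);

      // timestamp = number of memory references the thread has executed so far
      void onLoad(int tid, uint64_t timestamp) {
          last_access[tid] = timestamp;
      }
      void onStore(int tid, uint64_t timestamp) {
          last_access[tid]  = timestamp;
          last_writer       = tid;
          last_writer_ts    = timestamp;
      }
  };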


Detecting Dependences

  • RAW dependency between threads T and T’ is established if:

    • T executes a load that maps to the directory entry A

    • T’ is the last_writer for the same entry

  • WAW dependency between T and T’ is established if:

    • T executes a store that maps to the directory entry A

    • T’ is the last_writer for the same entry

  • WAR dependency between T and T’ is established if:

    • T executes a store that maps to the directory entry A

    • T’ has accessed the same entry in the past and T is not the last_writer
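Combining the entry fields with the three rules, a hedged sketch of how an SMO constraint could be emitted on each access; the constraint format and the details (e.g. skipping the last writer in the WAR scan to avoid duplicating the WAW edge) are assumptions for illustration.

  #include <cstdint>
  #include <vector>

  constexpr int kMaxThreads = 64;
  constexpr int kNoWriter   = -1;

  // "Thread tid may not execute its memory reference tidRef until thread
  //  otherTid has executed its memory reference otherRef."
  struct SmoConstraint { int tid; uint64_t tidRef; int otherTid; uint64_t otherRef; };

  struct DirEntry {
      int      last_writer    = kNoWriter;
      uint64_t last_writer_ts = 0;
      std::vector<uint64_t> last_access = std::vector<uint64_t>(kMaxThreads, 0);
  };

  std::vector<SmoConstraint> smoLog;

  // tid accesses an address mapping to entry e; ts counts its memory references (from 1).
  void access(DirEntry& e, int tid, uint64_t ts, bool isStore) {
      // RAW (for loads) / WAW (for stores): another thread is the last writer.
      if (e.last_writer != kNoWriter && e.last_writer != tid)
          smoLog.push_back({tid, ts, e.last_writer, e.last_writer_ts});

      if (isStore) {
          // WAR: other threads accessed this entry in the past (the last writer
          // is skipped here because the edge above already orders it).
          for (int t = 0; t < kMaxThreads; ++t)
              if (t != tid && t != e.last_writer && e.last_access[t] != 0)
                  smoLog.push_back({tid, ts, t, e.last_access[t]});
          e.last_writer    = tid;
          e.last_writer_ts = ts;
      }
      e.last_access[tid] = ts;
  }

Running the worked example from the previous slide through this sketch produces exactly the three SMO constraints shown on the next slide.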


Example

Thread T1                      Thread T2
1: Store A                     1: Load F
2: Load A                      2: Store A
3: Store F                     3: Load F

Detected dependences:

  • WAW on A: T1:1 (Store A) → T2:2 (Store A)

  • RAW on A: T2:2 (Store A) → T1:2 (Load A)

  • WAR on F: T2:3 (Load F) → T1:3 (Store F)

Final directory state:

  • DirEntry [A:D]: last writer id T2; last access to the DirEntry: T1: 2, T2: 2

  • DirEntry [E:H]: last writer id T1; last access to the DirEntry: T1: 3, T2: 3

SMO logs:

  • T1:1 → T2:2 (thread T2 cannot execute memory reference 2 until T1 executes its memory reference 1)

  • T2:2 → T1:2 (thread T1 cannot execute memory reference 2 until T2 executes its memory reference 2)

  • T2:3 → T1:3 (thread T1 cannot execute memory reference 3 until T2 executes its memory reference 3)
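At replay time each SMO entry becomes a wait condition. A simplified sketch, assuming one shared counter of completed memory references per thread (illustrative, not PinPLAY's actual mechanism):

  #include <atomic>
  #include <cstdint>

  constexpr int kMaxThreads = 64;

  // Number of memory references each replayed thread has completed (zero-initialized globals).
  std::atomic<uint64_t> completedRefs[kMaxThreads];

  // Called by a thread before a memory reference, once per SMO entry that
  // constrains it: block until the other thread has made enough progress.
  void waitForConstraint(int otherTid, uint64_t otherRef) {
      while (completedRefs[otherTid].load(std::memory_order_acquire) < otherRef)
          ;   // spin; a real replayer would block or yield
  }

  // Called by thread tid after it completes its memory reference number ref.
  void publishProgress(int tid, uint64_t ref) {
      completedRefs[tid].store(ref, std::memory_order_release);
  }

For the log above, T2 would call waitForConstraint(T1, 1) before its second reference, and T1 would wait on T2's progress before its own second and third references.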


Ordering Memory Accesses (Reducing log size)

  • Preserving order will reproduce execution

    • a→b: “a happens-before b”

    • Ordering is transitive: a→b, b→c means a→c

  • Two instructions must be ordered if:

    • they both access the same memory, and

    • one of them is a write


Constraints: Enforcing Order

Two threads: P1 executes a then b; P2 executes c then d.

To guarantee a→d, any one of the constraints a→d, b→d, a→c, or b→c is sufficient; the stricter ones over-constrain the execution.

Suppose we need b→c anyway. Then b→c is necessary, and a→d is redundant: it follows transitively from a→b (program order), b→c, and c→d (program order).


Problem Formulation

Recording and replay of the same two-thread execution (program-order dependences within each thread, and cross-thread conflicts, shown in red in the original figure):

Thread I          Thread J
1: ld A           1: add
2: st B           2: st C
3: st C           3: ld B
4: ld D           4: st A
5: sub            5: st C
6: ld B           6: st D

During recording the conflicts are detected and written to the log; during replay the log is used to reproduce them.

  • Reproduce the exact same conflicts: no more, no less
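Two references conflict when they come from different threads, touch the same address, and at least one is a store. Under the assumption that both instruction streams are known (as in the figure), the conflicting pairs can be enumerated as in this illustrative sketch; the direction of each edge is then given by the recorded interleaving.

  #include <cstdio>
  #include <vector>

  // One instruction per entry; addr == 0 means it touches no memory (add/sub).
  struct Instr { bool isStore; char addr; };

  struct ConflictPair { int refI, refJ; };   // 1-based instruction numbers

  std::vector<ConflictPair> findConflicts(const std::vector<Instr>& I,
                                          const std::vector<Instr>& J) {
      std::vector<ConflictPair> pairs;
      for (size_t i = 0; i < I.size(); ++i)
          for (size_t j = 0; j < J.size(); ++j)
              if (I[i].addr != 0 && I[i].addr == J[j].addr &&
                  (I[i].isStore || J[j].isStore))            // at least one write
                  pairs.push_back({(int)i + 1, (int)j + 1});
      return pairs;
  }

  int main() {
      // The two instruction streams from the figure.
      std::vector<Instr> I = {{false,'A'}, {true,'B'}, {true,'C'},      // ld A, st B, st C,
                              {false,'D'}, {false, 0}, {false,'B'}};    // ld D, sub,  ld B
      std::vector<Instr> J = {{false, 0},  {true,'C'}, {false,'B'},     // add,  st C, ld B,
                              {true,'A'},  {true,'C'}, {true,'D'}};     // st A, st C, st D
      for (ConflictPair p : findConflicts(I, J))
          std::printf("I:%d <-> J:%d\n", p.refI, p.refJ);   // the five conflicts of the example
  }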


Log All Conflicts

  • Detect conflicts → write the log

  • Assign an instruction count (IC) to each instruction as a logical timestamp

Dependence log (instruction streams and numbering as on the previous slide; an edge x→y means the destination thread's instruction y must wait for the source thread's instruction x, and each edge is a pair of integers, 16 bytes):

  Log I: 2→3

  Log J: 2→3  1→4  3→5  4→6

  Log size: 5 × 16 = 80 bytes (10 integers)

  • But too many conflicts are logged


Netzer’s Transitive Reduction

Drop every logged edge that is already implied by program order plus the remaining edges (instruction streams and numbering as before):

  TR-reduced log:

  Log I: 2→3

  Log J: 2→3  3→5  4→6      (1→4 is dropped: it is implied by I:1 → I:2, the logged edge I:2 → J:3, and J:3 → J:4)

  Log size: 64 bytes (8 integers)
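A hedged sketch of the reduction idea on the two-thread example: an edge can be dropped when program order plus another logged edge already implies it. The pairwise check below is a simplification that happens to cover this example; Netzer's algorithm handles the general case.

  #include <cstdio>
  #include <vector>

  // A logged edge src -> dst: the destination thread may not execute its
  // instruction dst until the source thread has executed its instruction src.
  struct Edge { int src, dst; };

  // Simplified reduction for edges that all go from thread I to thread J:
  // (src, dst) is implied if some other edge (src2, dst2) has src2 >= src and
  // dst2 <= dst, since program order then gives src -> src2 -> dst2 -> dst.
  std::vector<Edge> reduce(const std::vector<Edge>& edges) {
      std::vector<Edge> kept;
      for (size_t i = 0; i < edges.size(); ++i) {
          bool implied = false;
          for (size_t j = 0; j < edges.size(); ++j)
              if (j != i && edges[j].src >= edges[i].src && edges[j].dst <= edges[i].dst)
                  implied = true;
          if (!implied) kept.push_back(edges[i]);
      }
      return kept;
  }

  int main() {
      std::vector<Edge> logJ = {{2,3}, {1,4}, {3,5}, {4,6}};   // edges into thread J
      for (Edge e : reduce(logJ))
          std::printf("%d -> %d\n", e.src, e.dst);             // prints 2->3, 3->5, 4->6
  }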


RTR (Regulated Transitive Reduction): Stricter Dependences to Aid Vectorization

Replace several reduced edges by a single stricter edge that implies them (instruction streams and numbering as before):

  Stricter, reduced log:

  Log I: 2→3

  Log J: 2→3  4→5      (the stricter edge I:4 → J:5 implies both I:3 → J:5 and I:4 → J:6)

  Log size: 48 bytes (6 integers)

  • 4% overhead

  • 0.2 MB/core/second logging bandwidth (Apache)

