
Deterministic Logging/Replaying of Applications



Presentation Transcript


  1. Deterministic Logging/Replaying of Applications

  2. Motivation
  Run-time framework goals:
  • Collect a complete trace of a program’s user-mode execution
  • Keep the tracing overhead, for both space and time, low
  • Re-simulate the traced execution deterministically from the collected trace, with full fidelity down to the instruction level
  • Full fidelity means user mode only: the kernel is not traced, only user-mode I/O callbacks
  Advantages:
  • A complete program trace can be analyzed from multiple perspectives (replay analyzers: debuggers, locality tools, etc.)
  • A trace can be collected on one machine and replayed on other machines (or analyzed live by streaming)
  Challenges: trace size and performance

  3. Original Record-Replay Approaches
  • InstantReplay ’87: records the order of memory accesses; the overhead may affect program behavior
  • RecPlay ’00: records only synchronization operations; not deterministic in the presence of data races
  • Netzer ’93: records an optimal trace; too expensive to keep track of all memory locations
  • Bacon & Goldstein ’91: records memory bus transactions with hardware; high logging bandwidth

  4. Motivation
  Increasing use of and development for multi-core processors:
  • Multithreaded program behavior is non-deterministic: shared-memory updates can happen in a different order on each run
  • To debug such software effectively, developers must be able to replay executions that exhibit concurrency bugs

  5. Related Concepts
  Runtime interpretation/translation of binary instructions:
  • Requires no static instrumentation or special symbol information
  • Handles dynamically generated code and self-modifying code
  • Recording/logging overhead: ~100-200x
  More recent logging work, mostly proposed hardware support for the multithreaded domain:
  • FDR (Flight Data Recorder)
  • BugNet (cache bits set on first load)
  • RTR (Regulated Transitive Reduction)
  • DeLorean (ISCA 2008: chunks of instructions)
  • Strata (a time layer across all the logs of the running threads)
  • iDNA (diagnostic infrastructure using Nirvana, from Microsoft)

  6. Deterministic Replay
  Re-execute the exact same sequence of instructions as recorded in a previous run.
  Single-threaded programs:
  • Record the load values needed to reproduce the run’s behavior (Load Log)
  • Record registers updated by system calls and signal handlers (Reg Log)
  • Record the output of special instructions such as RDTSC and CPUID (Reg Log)
  • System call virtualization: cloning arguments and updates
  • Checkpointing (a log summary roughly every 10 million instructions)
  Multi-threaded programs:
  • Additionally log the interleaving among threads (shared-memory ordering – SMO Log)

  7. PinSEL – System Effect Log (SEL)
  Logs the program load values needed for deterministic replay:
  • The first access to each memory location
  • Values modified by the system (system effects) and then read by the program
  • Machine- and time-sensitive instructions (CPUID, RDTSC)
  Example program execution:
    Store A (A ← 111)
    Store B (B ← 55)
    Load C (C = 9)
    Load D (D = 10)
    system call, which modifies B (B ← 0) and C (C ← 99)
    Load A (A = 111) – not logged
    Load B (B = 0) – logged
    Load C (C = 99) – logged
    Load D (D = 10) – not logged
  • Trace size is ~4-5 bytes per instruction

  8. Optimization: Trace Select Reads
  • Observation: hardware caches eliminate most off-chip reads
  • Optimized logging: the logger and the replayer simulate identical cache memories
  • A simple cache (the memory-copy structure) decides which values to log; there are no tags or valid bits to check. A load value is logged only if it mismatches the copy.
  • Average trace size is <1 bit per instruction
  Example:
    i = 1;
    for (j = 0; j < 10; j++) { i = i + j; }
    k = i; // value read is 46
    System_call();
    k = i; // value read is 0 (not predicted)
  The only read that is not predicted, and therefore must be logged, is the one following the system call.
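The memory-copy predictor can be illustrated with a minimal Python sketch. This is illustrative only, not the actual PinSEL implementation; the class and method names are invented. The key idea is that the logger keeps its own copy of memory and records a load value only when it disagrees with that copy, which happens on a first access or after a system effect.

```python
# Minimal sketch (hypothetical names) of prediction-based load logging:
# the logger keeps a copy of memory; a load is logged only when its
# value mismatches the copy (first touch or system effect).

class LoadLogger:
    def __init__(self):
        self.shadow = {}    # the logger's memory copy (no tags, no valid bits)
        self.load_log = []  # (address, value) records the replayer consumes

    def on_store(self, addr, value):
        self.shadow[addr] = value           # program stores are always predictable

    def on_load(self, addr, value):
        if self.shadow.get(addr) != value:  # misprediction: first touch or system effect
            self.load_log.append((addr, value))
            self.shadow[addr] = value       # resynchronize the copy

logger = LoadLogger()
logger.on_store("i", 46)
logger.on_load("i", 46)   # k = i: predicted by the copy, not logged
# ... a system call silently overwrites i with 0 ...
logger.on_load("i", 0)    # k = i: mispredicted, logged
print(logger.load_log)    # [('i', 0)]
```

The replayer runs the same structure in lockstep: on each load it either takes the predicted value from its own copy or consumes the next logged record, which is why the average trace size can fall below one bit per instruction.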

  9. Example Overhead
  PinSEL and PinPLAY:
  • Initial work (2006) with single-threaded programs:
  • SPEC2000 ref runs: ~130x slowdown for PinSEL and ~80x for PinPLAY (without in-lining)
  • On a subset of the SPLASH2 benchmarks: ~230x slowdown for PinSEL
  Now (geometric mean over SPEC2006):
  • Pin alone: 1.4x
  • Logger: 83.6x
  • Replayer: 1.4x

  10. Example: Microsoft iDNA Trace Writer Performance
  • Memcheck and Valgrind are in the 30-40x range on SPEC CPU2006
  • iDNA: ~11x (it does not log shared-memory dependences explicitly)
  • Instead, it assigns a sequence number to every lock-prefixed memory operation, enabling offline data-race analysis

  11. Logging Shared Memory Ordering
  • Emulates directory-based cache coherence
  • Identifies RAW, WAR, and WAW dependences
  • The directory is indexed by hashing the effective address
  • Each directory entry represents an address range
  (Figure: stores and loads from the program execution hash into directory entries.)

  12. Directory Entries
  Every DirEntry maintains:
  • The thread id of the last_writer
  • A vector of timestamps of the last access by each thread to that entry (a timestamp is the number of memory references the thread has executed)
  • On loads: update the timestamp for the thread in the entry
  • On stores: update the timestamp and the last_writer fields
  (Figure: threads T1 and T2 accessing DirEntry [A:D] and DirEntry [E:H], each entry holding a last-writer id and a per-thread timestamp vector.)

  13. Detecting Dependences
  • A RAW dependence between threads T and T’ is established if T executes a load that maps to directory entry A, and T’ is the last_writer for that entry
  • A WAW dependence between T and T’ is established if T executes a store that maps to directory entry A, and T’ is the last_writer for that entry
  • A WAR dependence between T and T’ is established if T executes a store that maps to directory entry A, T’ has accessed that entry in the past, and T’ is not the last_writer
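A toy Python model of the directory scheme in slides 11-13 might look like the following. The class and field names are illustrative, not taken from the actual tool; dependences are emitted as (source thread, source timestamp, destination thread, destination timestamp) records, the shape the SMO log needs.

```python
# Illustrative sketch of directory-based dependence detection: one
# DirEntry per address range tracks the last writer and a per-thread
# timestamp vector; loads and stores yield RAW/WAW/WAR dependences.

class DirEntry:
    def __init__(self):
        self.last_writer = None   # thread id of the last store to this entry
        self.last_access = {}     # thread id -> timestamp of its last access

class Directory:
    def __init__(self, granularity=4):
        self.granularity = granularity  # addresses per entry (e.g. [A:D])
        self.entries = {}
        self.deps = []  # (src_tid, src_ts, dst_tid, dst_ts) SMO records

    def _entry(self, addr):
        return self.entries.setdefault(addr // self.granularity, DirEntry())

    def load(self, tid, ts, addr):
        e = self._entry(addr)
        # RAW: another thread was the last writer of this entry
        if e.last_writer is not None and e.last_writer != tid:
            w = e.last_writer
            self.deps.append((w, e.last_access[w], tid, ts))
        e.last_access[tid] = ts

    def store(self, tid, ts, addr):
        e = self._entry(addr)
        # WAW: another thread was the last writer
        if e.last_writer is not None and e.last_writer != tid:
            w = e.last_writer
            self.deps.append((w, e.last_access[w], tid, ts))
        # WAR: other threads accessed this entry and are not the last writer
        for other, other_ts in e.last_access.items():
            if other != tid and other != e.last_writer:
                self.deps.append((other, other_ts, tid, ts))
        e.last_writer = tid
        e.last_access[tid] = ts

d = Directory(granularity=4)
d.store("T1", 1, 0)  # T1's 1st reference writes address 0 (entry [0:3])
d.load("T2", 1, 0)   # T2 reads it: RAW dependence (T1, 1) -> (T2, 1)
d.store("T2", 2, 0)  # T2 overwrites it: WAW dependence (T1, 1) -> (T2, 2)
print(d.deps)        # [('T1', 1, 'T2', 1), ('T1', 1, 'T2', 2)]
```

Because entries cover address ranges and are found by hashing, unrelated addresses can alias into one entry; that only adds spurious (but safe) ordering constraints, never misses a real one.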

  14. Example
  (Figure: threads T1 and T2 accessing DirEntry [A:D] and DirEntry [E:H]; their stores and loads to A and F update the last_writer ids and per-thread timestamp vectors, establishing WAW, RAW, and WAR dependences.)
  The resulting SMO logs encode ordering constraints between per-thread reference counts, for example:
  • Thread T2 cannot execute its memory reference 2 until T1 has executed its memory reference 1
  • Thread T1 cannot execute its memory reference 2 until T2 has executed its memory reference 2

  15. Ordering Memory Accesses (Reducing Log Size)
  • Preserving the order of conflicting accesses reproduces the execution
  • a→b means “a happens-before b”
  • Ordering is transitive: a→b and b→c imply a→c
  • Two instructions must be ordered if:
  • they both access the same memory, and
  • one of them is a write
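The conflict rule above is small enough to state directly in code. A sketch, with each access modeled as an (address, is_write) pair; the function name is invented for illustration:

```python
def must_order(acc1, acc2):
    """Two accesses conflict, and so must be ordered for replay,
    exactly when they touch the same memory and one is a write."""
    (addr1, is_write1), (addr2, is_write2) = acc1, acc2
    return addr1 == addr2 and (is_write1 or is_write2)

assert must_order(("A", True), ("A", False))       # st A vs ld A: ordered
assert not must_order(("A", False), ("A", False))  # ld A vs ld A: no conflict
assert not must_order(("A", True), ("B", True))    # different addresses
```

Everything that follows (transitive reduction, RTR) is about logging as few of these mandatory orderings as possible while still implying all of them.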

  16. Constraints: Enforcing Order
  Suppose thread P1 executes a then b, and thread P2 executes c then d. To guarantee a→d, any one of the edges a→d, b→d, a→c, or b→c suffices.
  • If we need b→c anyway, then b→c is necessary and a→d is redundant: a→b (program order), b→c, and c→d (program order) already give a→d
  • Logging both edges would leave the replay over-constrained

  17. Problem Formulation
  Recording detects the cross-thread dependences (conflicts) between threads I and J; replay must reproduce exactly the same conflicts: no more, no less.
  (Figure: the instruction streams of threads I and J, shown side by side for recording and replay, with dependences in black and conflicts in red.)
  Thread I: ld A; st B; st C; ld D; sub; ld B
  Thread J: add; st C; ld B; st A; st C; st D

  18. Log All Conflicts
  Detect conflicts, assign instruction counts (logical timestamps), and write the log.
  Thread I: 1: ld A, 2: st B, 3: st C, 4: ld D, 5: sub, 6: ld B
  Thread J: 1: add, 2: st C, 3: ld B, 4: st A, 5: st C, 6: st D
  • Log J: 2→3, 1→4, 3→5, 4→6 (an entry i→j orders I’s reference i before J’s reference j)
  • Log I: 2→3 (J’s reference 2 before I’s reference 3)
  • Log size: 5 dependences × 16 bytes = 80 bytes (10 integers)
  • Problem: there are too many conflicts to log them all

  19. Netzer’s Transitive Reduction
  A dependence already implied by program order plus the other logged dependences need not be logged.
  • The edge 1→4 in Log J is implied by 2→3 (1 precedes 2 in thread I, and 3 precedes 4 in thread J), so it is dropped
  • TR-reduced Log J: 2→3, 3→5, 4→6
  • Log I: 2→3
  • Log size: 64 bytes (8 integers)
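Netzer-style transitive reduction can be sketched as a reachability check over program order plus the remaining logged edges. A simplified illustration, not the paper's algorithm: nodes are (thread, reference-count) pairs, and the edge list below is the five-dependence log from the preceding slides.

```python
# Sketch of transitive reduction on a cross-thread dependence log.
# An edge is redundant if its destination is already reachable from its
# source via program order (same thread, non-decreasing reference count)
# plus the other logged edges.

def reaches(src, dst, edges):
    """Happens-before reachability over program order + logged edges."""
    stack, seen = [src], set()
    while stack:
        tid, ts = stack.pop()
        if tid == dst[0] and ts <= dst[1]:
            return True        # dst follows this node in program order
        if (tid, ts) in seen:
            continue
        seen.add((tid, ts))
        for (s_tid, s_ts), d in edges:
            if s_tid == tid and s_ts >= ts:  # edge source reachable by program order
                stack.append(d)
    return False

def transitive_reduction(edges):
    kept = list(edges)
    for e in list(edges):
        rest = [x for x in kept if x != e]
        if reaches(e[0], e[1], rest):  # implied by the others: drop it
            kept = rest
    return kept

# The five-dependence log from slide 18: Log J = {2->3, 1->4, 3->5, 4->6},
# Log I = {2->3}, written here as ((src_thread, src_ref), (dst_thread, dst_ref)).
edges = [(("I", 2), ("J", 3)),
         (("I", 1), ("J", 4)),   # redundant: implied by I2->J3
         (("I", 3), ("J", 5)),
         (("I", 4), ("J", 6)),
         (("J", 2), ("I", 3))]
reduced = transitive_reduction(edges)
print(reduced)  # drops 1->4, matching the TR-reduced log on slide 19
```

This pairwise check is quadratic in the number of logged edges; the point of the slide is only that the surviving four edges carry the same happens-before information as the original five.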

  20. RTR (Regulated Transitive Reduction): Stricter Dependences to Aid Vectorization
  RTR may log a stricter dependence than the one observed when doing so subsumes several edges: enforcing 4→5 in Log J subsumes both 3→5 and 4→6.
  • New reduced Log J: 2→3, 4→5
  • Log I: 2→3
  • Log size: 48 bytes (6 integers)
  • 4% overhead, 0.2 MB/core/second logging rate (Apache)
