Deterministic Multiprocessing

Chris Fallin, David Lewis, Zongwei Zhou

What is Deterministic MP?
• Multiprocessor executes multiple threads
• Threads share resources (i.e., memory)
• Due to bus arbiters, memory controllers, etc., some orderings in shared resources are undefined
• Problem for: debugging (reproducibility), thorough testing (many possible cases)
• Deterministic: same input → same output
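To make the "undefined orderings" point concrete, here is a minimal sketch that enumerates every interleaving of two threads, each performing a non-atomic increment on a shared counter. The two-step load/store model and the helper names (`interleavings`, `run`, `possible_outcomes`) are assumptions made for illustration; they are not from the slides.

```python
from itertools import combinations

def interleavings(a, b):
    """Yield every merge of sequences a and b that keeps each one's internal order."""
    n = len(a) + len(b)
    for positions in combinations(range(n), len(a)):
        slots = set(positions)
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in slots else next(ib) for i in range(n)]

def run(schedule):
    """Execute a schedule of (thread id, op) steps against one shared counter."""
    shared = 0
    local = {}
    for tid, op in schedule:
        if op == "load":
            local[tid] = shared          # read shared value into a private copy
        else:                            # "store": write back private copy + 1
            shared = local[tid] + 1
    return shared

def possible_outcomes():
    t0 = [(0, "load"), (0, "store")]
    t1 = [(1, "load"), (1, "store")]
    return sorted({run(s) for s in interleavings(t0, t1)})

print(possible_outcomes())
```

The same program and input can end with the counter at 1 or 2 depending only on the interleaving, which is the reproducibility problem the slide describes.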

Types of Determinism
• Strong: same input → same output, regardless of race conditions
  - Must capture all communicating memory access pairs
• Weak: same input → same output, as long as locking is correct
  - Takes advantage of locks for low SW overhead

Types of Deterministic Execution
• Record/Replay: HW/SW keeps log of program input
  - Single-program: system calls, memory interleavings
  - Full-system: interrupts, I/O, etc.
  - Log allows later replay of a bug
  - However, several executions may still differ outside of replay
• Full-time
  - Ordering of memory accesses follows a statically defined deterministic order: for the same program and the same input, the output is always the same
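The record/replay idea can be sketched in a few lines: during recording, log which thread ran at each step; during replay, follow the log instead of the scheduler. This is a toy model under stated assumptions: the pseudo-random scheduler stands in for real nondeterminism, and `make_threads`, `record`, and `replay` are hypothetical names.

```python
import random

def make_threads():
    # Two hypothetical threads; each appends its own tag to shared output three times.
    return {0: ["a"] * 3, 1: ["b"] * 3}

def record(seed=None):
    """Run under a pseudo-random scheduler and log which thread ran each step."""
    rng = random.Random(seed)
    pending = make_threads()
    output, log = [], []
    while any(pending.values()):
        runnable = [tid for tid, ops in pending.items() if ops]
        tid = rng.choice(runnable)       # stand-in for an undefined hardware ordering
        output.append(pending[tid].pop(0))
        log.append(tid)
    return output, log

def replay(log):
    """Re-run the same threads, taking every scheduling decision from the log."""
    pending = make_threads()
    return [pending[tid].pop(0) for tid in log]

recorded_output, schedule_log = record()
assert replay(schedule_log) == recorded_output
```

Two separate calls to `record()` may still produce different outputs; only a replay of a given log is guaranteed to match, which is the limitation the slide notes.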

DMP: Deterministic Shared Memory Multiprocessing
Devietti, Lucia, Ceze, Oskin

Central Idea
• To guarantee deterministic behavior:
  - The direct way is to preserve the same global interleaving of instructions in every execution of a parallel program
  - This is unnecessary and has a significant performance impact
• Insight: only communicating pairs matter
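The "direct way" can be illustrated with a small simulation in which a token passes round-robin between threads and the holder runs a fixed number of operations (a quantum) before handing it on. This is a sketch, assuming threads are modeled as lists of operations; `run_serialized`, `inc`, `dbl`, and `PROGRAMS` are hypothetical names, not part of DMP.

```python
def inc(shared):
    shared["x"] += 1

def dbl(shared):
    shared["x"] *= 2

# Thread 0 increments three times; thread 1 doubles three times.
PROGRAMS = [[inc] * 3, [dbl] * 3]

def run_serialized(programs, quantum):
    """Round-robin token passing: the holder runs up to `quantum` ops, then passes on."""
    pcs = [0] * len(programs)            # per-thread program counters
    shared = {"x": 0}
    trace = []                           # which thread executed each step
    while any(pc < len(p) for pc, p in zip(pcs, programs)):
        for tid, prog in enumerate(programs):
            for _ in range(quantum):
                if pcs[tid] == len(prog):
                    break                # this thread has finished
                prog[pcs[tid]](shared)
                trace.append(tid)
                pcs[tid] += 1
    return shared["x"], trace

print(run_serialized(PROGRAMS, quantum=2))  # (18, [0, 0, 1, 1, 0, 1])
print(run_serialized(PROGRAMS, quantum=1))  # (14, [0, 1, 0, 1, 0, 1])
```

The result depends on the quantum size but never on timing, so every run with the same quantum is identical. The cost is that nothing runs in parallel, which motivates the refinements on the following slides.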

Improve a bit...
• Not all memory accesses are communicating
• Can parallelize the communication-free portion of each quantum
• Need to know when communication happens!
• MESI cache coherence protocol provides this for free
• DMP Sharing Table
  - Tracks info about memory ownership
  - Two ownership change possibilities:
    - Reading data owned by others
    - Writing data to shared memory
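A sharing table of this kind might be modeled as a map from address to owner, where an access that changes ownership counts as communication and would have to wait for the thread's deterministic turn. The sketch below is an assumption-laden illustration: the first-toucher-owns default, the treatment of a write to another thread's private data, and the `SharingTable` class itself are not taken from the slides.

```python
SHARED = "shared"

class SharingTable:
    """Tracks, per address, whether data is private to one thread or shared."""

    def __init__(self):
        self.owner = {}  # address -> thread id, or SHARED

    def access(self, tid, addr, is_write):
        """Return True if this access changes ownership, i.e. is communication."""
        # Assumption: the first thread to touch an address becomes its owner.
        owner = self.owner.setdefault(addr, tid)
        if owner == tid:
            return False                 # private data: free to read or write
        if owner == SHARED and not is_write:
            return False                 # reading shared data changes nothing
        # Reading another thread's data makes it shared;
        # writing makes it private to the writer.
        self.owner[addr] = tid if is_write else SHARED
        return True

table = SharingTable()
print(table.access(0, "A", is_write=True))   # False: thread 0 owns A
print(table.access(1, "A", is_write=False))  # True: A becomes shared
print(table.access(0, "A", is_write=True))   # True: A becomes private to thread 0
```

Accesses that return False could proceed in parallel within a quantum; only those returning True need to be ordered deterministically.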

Improve a bit more...
• Transactional Memory + deterministic commit order
• TM: atomicity and isolation of each quantum
• Speculation: find quanta not involved in communication
• If communication happens, squash + re-execute
• Potential optimization:
  - Forward uncommitted (or speculative) data between quanta
  - Could save a large number of squashes
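The speculate, commit-in-order, squash-on-conflict cycle can be sketched as follows. Quanta are modeled as functions that receive `read` and `write` callbacks; all quanta in a round run against the same snapshot, then commit in a fixed order. The names `execute` and `run_round`, and the read-set versus committed-write-set conflict check, are illustrative assumptions rather than the paper's hardware design.

```python
def execute(quantum, state):
    """Run one quantum in isolation; return its read set and buffered writes."""
    reads, writes = set(), {}

    def read(addr):
        if addr in writes:               # see our own buffered write first
            return writes[addr]
        reads.add(addr)
        return state.get(addr, 0)

    def write(addr, value):
        writes[addr] = value             # buffered until commit

    quantum(read, write)
    return reads, writes

def run_round(memory, quanta):
    """Speculate all quanta on one snapshot, then commit in list order."""
    snapshot = dict(memory)
    speculative = [execute(q, snapshot) for q in quanta]
    committed = set()                    # addresses written by earlier commits
    squashes = 0
    for q, (reads, writes) in zip(quanta, speculative):
        if reads & committed:            # read stale data: communication happened
            squashes += 1
            reads, writes = execute(q, memory)  # re-execute on committed state
        memory.update(writes)
        committed.update(writes)
    return squashes

def q0(read, write): write("a", read("a") + 1)
def q1(read, write): write("b", read("b") + 1)
def q2(read, write): write("b", read("a") + read("b"))

mem = {"a": 1, "b": 10}
print(run_round(mem, [q0, q1, q2]), mem)  # 1 {'a': 2, 'b': 13}
```

Here `q0` and `q1` do not communicate and commit without re-execution, while `q2` reads values written earlier in the commit order and is squashed once. The final memory matches a serial execution in commit order.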

Performance

Discussion
• Speculation
  - Similar idea to TLS (thread-level speculation), but used for the opposite purpose
  - Requires complex hardware
  - I/O or parts of the OS cannot execute speculatively
• Dealing with nondeterminism
  - Threads can use the OS to communicate
  - Nondeterministic OS API calls, e.g. read
• Better way of token-passing?

Kendo: Efficient Deterministic Multithreading in Software
Olszewski, Ansel, Amarasinghe

Definitions
• Strong Determinism
  - Deterministic order of memory accesses to shared data for a particular program input
  - ALWAYS produces the same output for every run with a particular input
  - Not easily providable without hardware support
• Weak Determinism
  - Deterministic order of lock acquisitions for a given program input
  - Produces the same output for every run if race-free
  - Can be guaranteed if all accesses to shared data are protected by locks
• If there are no data races, strong and weak determinism provide the same guarantees!

Introducing Kendo
• Software framework to enforce weak determinism of general lock-based C/C++ code for commodity shared-memory multiprocessors
  - No special hardware necessary!
• Deterministic Logical Time
  - Each thread has its own monotonically increasing deterministic logical clock
  - How to implement? Performance counter events?
• When is it a thread T's turn to use a lock?
  - All threads with tid < T have greater logical clocks
  - All threads with tid ≥ T have greater or equal logical clocks
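The turn rule can be written as a small predicate over the vector of logical clocks. This is a sketch; the function name `is_turn` and the list-of-integers clock representation are assumptions for illustration.

```python
def is_turn(tid, clocks):
    """True if thread `tid` may proceed, given every thread's logical clock.

    Threads with a smaller id must be strictly ahead; threads with a larger
    id may be tied. Ties are therefore broken in favor of the lowest id.
    """
    mine = clocks[tid]
    for other, clock in enumerate(clocks):
        if other < tid and not clock > mine:
            return False
        if other > tid and not clock >= mine:
            return False
    return True

clocks = [5, 3, 3]
print([is_turn(t, clocks) for t in range(len(clocks))])  # [False, True, False]
```

For any clock vector, exactly one thread satisfies the predicate: the one with the minimum clock, with ties going to the lowest thread id. That uniqueness is what makes the turn order deterministic.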

Simple Locking Mechanism
    function det_mutex_lock(l) {
        pause_logical_clock();
        wait_for_turn();
        lock(l);
        inc_logical_clock();
        resume_logical_clock();
    }
    function det_mutex_unlock(l) {
        unlock(l);
    }
• Simple algorithm for implementing locks
• Pause logical clock during acquisition and wait for turn to access lock (using the turn rule from the previous slide)
• Once in the critical section, resume the clock and continue
• Pros:
  - Easy to implement
• Problems?

Improved Lock
    function det_mutex_lock(l) {
        pause_logical_clock();
        while (true) {              // Loop until we have successfully acquired the lock
            wait_for_turn();        // Wait for our deterministic logical clock to be the unique global minimum
            if (try_lock(l)) {      // Check the state of the lock, acquiring it if it is free
                if (l.released_logical_time >= get_logical_clock()) {
                    // Lock is free in physical time, but still acquired in
                    // deterministic logical time, so we cannot acquire it yet
                    unlock(l);      // Release the lock
                } else {
                    // Lock is free in both physical and deterministic logical
                    // time, so it is safe to exit the spin loop
                    break;
                }
            }
            inc_logical_clock();    // Increment our deterministic logical clock and start over
        }
        inc_logical_clock();        // Increment our deterministic logical clock before exiting
        resume_logical_clock();
    }
    function det_mutex_unlock(l) {
        pause_logical_clock();
        l.released_logical_time = get_logical_clock();
        unlock(l);
        inc_logical_clock();
        resume_logical_clock();
    }
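The improved lock can be exercised with ordinary threads to check that the acquisition order repeats from run to run. The sketch below is a simplified model, not Kendo itself: logical clocks advance only through explicit increments (so pausing and resuming the clock is unnecessary), the per-iteration "work" is a made-up deterministic tick count, and finished threads set their clock to infinity so they never block the others. `DetRuntime`, `DetMutex`, `worker`, and `run_once` are hypothetical names.

```python
import threading
import time

class DetRuntime:
    """Per-thread logical clocks; each slot is written only by its own thread."""
    def __init__(self, n_threads):
        self.clocks = [0] * n_threads

    def is_turn(self, tid):
        mine = self.clocks[tid]
        for other, clock in enumerate(self.clocks):
            if other < tid and not clock > mine:
                return False
            if other > tid and not clock >= mine:
                return False
        return True

    def wait_for_turn(self, tid):
        while not self.is_turn(tid):
            time.sleep(0)                # yield while spinning

class DetMutex:
    def __init__(self, rt):
        self.rt = rt
        self.lock = threading.Lock()
        self.released_logical_time = -1  # earlier than any logical clock value

    def acquire(self, tid):
        clocks = self.rt.clocks
        while True:
            self.rt.wait_for_turn(tid)
            if self.lock.acquire(blocking=False):
                if self.released_logical_time >= clocks[tid]:
                    self.lock.release()  # free physically, still held logically
                else:
                    break                # free in physical and logical time
            clocks[tid] += 1             # advance and retry
        clocks[tid] += 1

    def release(self, tid):
        self.released_logical_time = self.rt.clocks[tid]
        self.lock.release()
        self.rt.clocks[tid] += 1

def worker(rt, mutex, tid, order, rounds):
    for i in range(rounds):
        rt.clocks[tid] += 1 + (tid + i) % 3  # stand-in for deterministic work
        mutex.acquire(tid)
        order.append(tid)                    # critical section
        mutex.release(tid)
    rt.clocks[tid] = float("inf")            # finished: never hold others back

def run_once(n_threads=3, rounds=4):
    rt = DetRuntime(n_threads)
    mutex = DetMutex(rt)
    order = []
    threads = [threading.Thread(target=worker, args=(rt, mutex, t, order, rounds))
               for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order

runs = [run_once() for _ in range(5)]
print(all(r == runs[0] for r in runs))
```

Because only the thread holding the turn ever attempts `try_lock`, and a lock released at logical time r is treated as held until the acquirer's clock exceeds r, the outcome of each attempt depends on logical clocks alone and not on physical timing.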

Optimizations
• Queuing
  - A queue for each lock guarantees first-come, first-served ordering
• Fast-forwarding
  - While waiting for a lock, a thread can set its logical time to lock.released_logical_time (or +1 if queuing)
• Lazy reads
  - If the application can tolerate reading out-of-date shared data, there is no need to lock on read (e.g. finding a "best" value)
  - Provide a read window (in logical time); if all threads are past the earliest allowable logical time, the read can succeed

Results

Capo: A Software-Hardware Interface for Practical Deterministic Multiprocessor Replay
Montesinos, Hicks, King, Torrellas

Capo: Motivation
• Record/replay system for debugging
  - Not intended to be deployed in the field
• Builds on DeLorean [1]
  - Chunk-based record/replay system
  - Terminate chunks at communicating pairs, record chunk commit order only
  - Only half the story
• Capo adds the software side as a Linux implementation:
  - Record syscall results
  - Provide infrastructure to record/replay multiple programs and multiplex hardware record/replay features
[1] P. Montesinos, L. Ceze, and J. Torrellas, “DeLorean: Recording and Deterministically Replaying Shared-Memory Multiprocessor Execution Efficiently,” in ISCA, June 2008.

Capo's Contributions
• Replay Spheres: distinct realms of record/replay
• Defining hardware-software interface
• Simulated DeLorean hardware (chunk-based recording)
• Linux kernel modifications

Capo Architecture
• Replay Sphere: set of R-threads; isolated environment
  - Arbitrary set of processes is inside sphere
• Replay Sphere Mgr: multiplexes HW support over spheres
• HW: records chunk commit order (DeLorean)
• SW: records system calls
• OS not inside sphere, except copy_to_user()
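The software half (recording system call results at the sphere boundary) can be sketched as a wrapper that logs results while recording and returns the logged results during replay. This is only an analogy: Capo does this inside the Linux kernel, whereas the hypothetical `SphereLog` class below wraps ordinary Python library calls, and `program` is an invented example workload.

```python
import os

class SphereLog:
    """Records results of nondeterministic calls, or replays them from a log."""

    def __init__(self, log=None):
        self.recording = log is None
        self.log = [] if log is None else list(log)
        self._cursor = 0

    def syscall(self, name, fn, *args):
        if self.recording:
            result = fn(*args)           # really perform the call
            self.log.append((name, result))
            return result
        logged_name, result = self.log[self._cursor]
        if logged_name != name:
            raise RuntimeError("replay diverged from the recorded execution")
        self._cursor += 1
        return result                    # inject the recorded result instead

def program(sphere):
    # Both calls return values that can differ from run to run.
    nonce = sphere.syscall("urandom", os.urandom, 4)
    pid = sphere.syscall("getpid", os.getpid)
    return nonce.hex() + ":" + str(pid)

recorder = SphereLog()
original = program(recorder)
replayed = program(SphereLog(recorder.log))
print(original == replayed)  # True
```

Code inside the sphere sees identical inputs on replay even though the operating system outside it would have answered differently, which is why the OS itself does not need to be recorded.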

Hardware Details

Performance (figure panels: Record, Replay)

Log Size

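Summary (Devietti et al.)
Helps with…   Capo (record/replay)   Kendo   DMP
debugging     [mark not preserved in transcript]
testing       [mark not preserved in transcript]
replicas      [mark not preserved in transcript]
deployment    [mark not preserved in transcript]
Needs HW      usually                no      yes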

Discussion
• Which is more useful: record/replay or full-time?
  - Debugging only, vs. system design philosophy
  - Tradeoff: cost (log size, overhead) vs. utility
• Strong vs. weak determinism
  - Race conditions are an important class of bugs
