Dynamic Verification of Sequential Consistency

Dynamic Verification of Sequential Consistency Albert Meixner and Daniel J. Sorin Presented by Peter Gilbert ECE259 Spring 2008

Introduction • Problem: How to detect errors in multithreaded memory systems? • Transient physical faults increasingly problematic as memory systems become more complex • Errors possible in many components: caches, memories, cache/memory controllers, interconnect

Potential approach • Tailored detection mechanisms for each component and each type of error • Example: for “single-bit-stuck-at-x” model for system bus, add a parity bit • Problems: • Lots of components… • Requires understanding error models and how they interact • Cannot detect design bugs • Some errors difficult to detect with localized mechanisms

Dynamic verification • Monitor system execution • Verify high-level invariants rather than considering individual components • Can detect transient faults, design bugs, fabrication defects • Any error that affects the high-level invariants • End-to-end correctness

DVSC • Verifying memory consistency == verifying memory system correctness • Goal of DVSC: verify that SC is enforced in a shared memory multiprocessor • SC defines end-to-end correctness

DVSC error model • Caches and memories • Bit corruption • Cache/memory controllers • Corruption of state or outputs • Interconnect • Messages corrupted, dropped, replicated, misrouted, reordered

First idea: DVSC-Direct • Dynamically construct total order of loads and stores • Verify that total order satisfies SC • Trigger system recovery when error is detected

DVSC-Direct Design • For every load and store: • Processor informs block’s home memory node • Inform message: • <address, load/store, data value, logical time> • Logical time: if A causes B, A has smaller logical time • Replay accesses in logical time order at home node • Uses priority queue for Informs and shadow copies of memory blocks • Verify that load gets value from most recent store

Cost of DVSC-Direct • Inform bandwidth proportional to the number of loads and stores • 8-53 times more bandwidth than unprotected system • Uses bandwidth like a system without caches!

Alternative: DVSC-Indirect • Verify sub-invariants proven to be equivalent to SC • Proof due to Plakal et al. • Terminology • coherence epoch - interval of logical time during which a processor has Shared or Exclusive access to a block • A memory access is bound to a coherence transaction T if permission is obtained via T

Constructing SC from sub-invariants • Fact 1: load of block B bound to T receives either: • Most recent store of B bound to T • Value of B received in response to T • Lemmas • Exclusive epochs for block B do not overlap with other Exclusive or Shared epochs for B • Every load or store occurs in some epoch and is bound to the transaction that epoch • Each word w of B received at the start of an epoch equals the most recent store to w

DVSC-Indirect approach • DIVA to verify that memory operations occur in program order (Fact 1) • Recall from Architecture I: DIVA dynamically verifies a speculative core • DIVA presents in-order abstraction to memory system • ECC added to each cache and memory line to detect silent corruptions • Hardware for verifying epoch invariants

Cache controller • Cache controller maintains Cache Epoch Table • CET entry (per-block): • Check that every load and store is performed in appropriate epoch (Lemma 2) • When epoch ends, send info to home memory controller in Inform-Epoch message (CET + end time and end data hash) S/E DRB Logical time at start Hash of data block at start of epoch 1 bit 1 bit 16 bits 16 bits S/E - type of epoch: shared or exclusive DRB - data ready bit

Memory controller • Memory controller maintains directory-like Memory Epoch Table • MET entry (per-block): • When Inform-Epoch is received from cache controller: • Sort in priority queue (VWB) • Process in logical time order, checking: • This epoch does not overlap with other epochs • Correct block data is transferred from epoch to epoch Latest end time of any S epoch Latest end time of any E epoch Hash of data block from latest E epoch 16 bits 16 bits 16 bits

DVSC-Indirect implementation CPU CPU CPU DIVA DIVA DIVA Cache CET Cache CET Cache CET Interconnect Memory VWB Memory VWB Memory VWB Verifier Verifier Verifier MET MET MET

DVSC-Indirect snooping example

DVSC-Indirect summary • Verifies SC through sub-invariants • Costs: • DIVA • Storage structures • CET at each cache controller, MET and VWB at each memory controller, ECC on cache and memory lines • Not large or complicated • Bandwidth usage • Proportional to coherence traffic (as opposed to number of loads and stores for DVSC-Direct)

Evaluation • Can DVSC detect the errors from the error model? • Corrupted, dropped, misrouted, reordered, duplicated messages • Corrupted cache and memory blocks • Don’t consider errors in processor core • DIVA handles this • How much does DVSC increase bandwidth usage? • How does it affect error-free performance?

Methodology • Full-system simulation with Simics • 8-node multiprocessor • Each processor implements SC, speculates for higher performance • Two levels of cache • Support for backward error recovery with SafetyNet at each node • DVSC-Indirect costs • Each CET 68 KB • Each MET 102 KB • VWB: 1024 entries • SPARC V9 running Solaris 8 • Interconnect: 2.5 GB/s links • Benchmarks • Four commercial workloads + barnes-hut from SPLASH-2

Error coverage • DVSC-Direct and DVSC-Indirect detected all injected errors in simulation • Small probability of false negatives in DVSC-Indirect • ECC fails to detect a bit error • hash collisions • False positives also possible in DVSC-Indirect • When VWB not large enough to prevent out-of-order processing of Inform-Epochs

Bottleneck link bandwidth • DVSC-Direct: uses 8-53 times more bandwidth • DVSC-Indirect directory: 8-25% increase • DVSC-Indirect snooping: 0-15% increase Bandwidth on most-utilized link - directory

DVSC-Indirect error-free performance • Performance impact minimal: usually equivalent to that of SafetyNet by itself • Similar results for snooping Runtime with DVSC-Indirect compared to unprotected system - directory

Conclusions • DVSC-Indirect is effective at detecting all memory system errors injected in the simulations • False negatives and false positives will occur with small probability • DVSC-Indirect imposes small error-free performance overhead • Bottleneck bandwidth usage with DVSC-Indirect is only 8-25% greater than unprotected case • Is the hardware cost justified? • Probably depends on the application reliability requirements

Dynamic Verification of Sequential Consistency