
Dynamic Verification of End-to-End Multiprocessor Invariants


Presentation Transcript


  1. Dynamic Verification of End-to-End Multiprocessor Invariants
  Daniel J. Sorin (1), Mark D. Hill (2), David A. Wood (2)
  (1) Department of Electrical & Computer Engineering, Duke University
  (2) Computer Sciences Department, University of Wisconsin-Madison

  2. My Talk in One Slide
  • Commercial server availability is important
  • System model: Symmetric Multiprocessor (SMP)
  • Fault model: Mostly transient, some permanent
  • Recent work developed efficient checkpoint/recovery
  • But we can only recover from hardware errors we detect
  • Many hardware errors are hard to detect
  • Proposal: Dynamic verification of invariants
  • Online checking of end-to-end system invariants
  • Checking performed with distributed signature analysis
  • Triggers recovery if invariant is violated

  3. Outline
  • Background
  • SMPs and availability
  • Existing hardware error detection
  • Invariant checking with distributed signature analysis
  • Two invariant checkers
  • Evaluation
  • Conclusions

  4. Symmetric Multiprocessor (SMP) System Model
  [Figure: processors (P) connected by a shared-wire bus]
  • Cache coherence transaction (I → M): issue request, wait for response, receive response

  5. Symmetric Multiprocessor (SMP) System Model
  [Figure: processors (P) connected by a network of switches]
  • Cache coherence transaction (I → M): issue request, wait for response, receive response
  • Broadcast request not delivered to subset of nodes
  • Broadcast requests delivered out of order to subset of nodes

  6. Symmetric Multiprocessor (SMP) System Model
  [Figure: cache coherence transaction (I → M) on the switched network; the request is issued at t1 and arrives at different nodes at t2 and t3, each arrival followed by a response]
  • Broadcast request not delivered to subset of nodes
  • Broadcast requests delivered out of order to subset of nodes
  • More chances for incorrect state transitions

  7. Backward Error Recovery
  • Can improve availability with backward error recovery
  • If error detected, then recover to pre-fault state
  • Backward error recovery (BER) requires:
  • Checkpoint/recovery mechanism
  • Error detection mechanisms

  8. SafetyNet Checkpoint/Recovery
  • SafetyNet: all-hardware scheme [ISCA 2002]
  • Periodically take logical checkpoint of multiprocessor
  • MP state: processor registers, caches, memory
  • Incrementally log changes to caches and memory
  • Consistent checkpointing performed in logical time
  • E.g., every 3000 broadcast cache coherence requests
  • Can tolerate >100,000 cycles of error detection latency
  [Figure: execution timeline with checkpoints CP1–CP4, divided into validated execution, execution pending validation (still detecting errors), and active execution]
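The checkpoint/log/recover cycle on this slide can be sketched in software. This is a toy model of the idea only (SafetyNet is an all-hardware scheme; the class and method names here are my own): stores log the old value incrementally, checkpoints bound the log, and recovery rolls unvalidated changes back to the last validated state.

```python
# Toy sketch (names mine, not SafetyNet's) of incremental logging with
# logical checkpoints: log the old value before every store, and on error
# undo everything that has not yet been validated.

class CheckpointedMemory:
    def __init__(self):
        self.mem = {}
        self.log = []          # (address, old_value) since the last checkpoint
        self.pending = []      # per-checkpoint logs still awaiting validation

    def store(self, addr, value):
        self.log.append((addr, self.mem.get(addr)))  # log old value first
        self.mem[addr] = value

    def checkpoint(self):
        # Close the current log; it stays pending until error detection
        # (with its >100,000-cycle latency budget) gives the all-clear.
        self.pending.append(self.log)
        self.log = []

    def validate_oldest(self):
        # No error detected in the window: the oldest checkpoint is safe.
        self.pending.pop(0)

    def recover(self):
        # Error detected: undo all unvalidated changes, newest first.
        for log in reversed(self.pending + [self.log]):
            for addr, old in reversed(log):
                if old is None:
                    self.mem.pop(addr, None)
                else:
                    self.mem[addr] = old
        self.pending, self.log = [], []

m = CheckpointedMemory()
m.store(0x40, 1)
m.checkpoint()
m.validate_oldest()      # this checkpoint survived error detection
m.store(0x40, 2)         # later store, then a fault is detected
m.recover()
assert m.mem[0x40] == 1  # rolled back to the validated state
```

The key property mirrored here is that recovery cost is bounded by the unvalidated window, so detection can lag execution by many cycles.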

  9. Error Detection
  • Error model: mostly due to transient faults
  • Example error detection mechanisms:
  • Parity bit on cache line
  • Checksum on incoming message
  • Timeout on cache coherence transaction
  • But error detection for servers is still weak
  • Why?
  • Error detection is often on critical path and must be fast
  • Fast error detection can't incorporate info from other nodes

  10. Why Local Information Isn't Sufficient
  [Figure: processors P1–P4 connected by switches; one cache holds the block Shared, another holds it Owned]

  11. Why Local Information Isn't Sufficient
  [Figure: a broadcast request for Exclusive is issued, but a fault prevents it from reaching the Owned node]

  12. Why Local Information Isn't Sufficient
  [Figure: despite the faulty broadcast, a data response is returned to the requester, which never learns the Owned node missed the request]

  13. Why Local Information Isn't Sufficient
  [Figure: resulting incoherent state, with the block held Modified at one cache and Shared at another]
  • Neither P1 nor P2 can detect that an error has occurred!

  14. Outline
  • Background
  • End-to-end invariant checking
  • Two invariant checkers
  • Evaluation
  • Conclusions

  15. Distributed Signature Analysis
  • Reduces long history of events into small signature
  • Signatures map almost-uniquely to event histories
  [Figure: P1 and P2 each fold their local event history (Event 1 … Event N) into a signature; a checker compares the signatures periodically in logical time (every 3000 requests)]

  16. Designing Signature Analysis Schemes
  • Must devise two functions: Update and Check
  • Signature(Pi) = Update(Signature(Pi), Event)
  • Check(Signature(P1), …, Signature(PN)) = true if error
  • Simple example: check that message inflow = outflow
  • Assume only unicast messages
  • Update: +1 for receive, -1 for send
  • Check: true if sum of all signatures doesn't equal 0
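The unicast inflow/outflow example on this slide is small enough to write out directly. A minimal sketch of the two functions as the slide defines them (the function names follow the slide; everything else is illustrative):

```python
# Sketch of the slide's simple scheme: each node's signature counts
# receives (+1) and sends (-1); a nonzero system-wide sum means a
# message was lost or duplicated somewhere.

def update(signature: int, event: str) -> int:
    """Fold one message event into a node's local signature."""
    if event == "receive":
        return signature + 1
    if event == "send":
        return signature - 1
    return signature

def check(signatures) -> bool:
    """Return True if an error is detected (inflow != outflow)."""
    return sum(signatures) != 0

# Two nodes: P1 sends one message and P2 receives it.
p1 = update(0, "send")
p2 = update(0, "receive")
assert check([p1, p2]) is False   # balanced: no error

# Same exchange, but the message is lost in transit.
assert check([update(0, "send"), 0]) is True   # unbalanced: error
```

Note that the Check function needs every node's signature, which is exactly why it runs off the critical path, at the periodic logical-time check.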

  17. Implementing Distributed Signature Analysis
  • All components cooperate to perform checking
  • Component = cache controller or memory controller
  • Each component contains:
  • Local signature register
  • Logic to compute signature updates
  • System contains:
  • System controller that performs check function
  • Use distributed signature analysis for dynamic verification
  • Verify end-to-end invariants

  18. Outline
  • Background
  • End-to-end invariant checking
  • Two invariant checkers
  • Message invariant
  • Cache coherence invariant
  • Evaluation
  • Conclusions

  19. A Message-Level Invariant Checker
  • Context: symmetric multiprocessor (SMP)
  • Cache coherence with broadcast snooping protocol
  • Invariant: all nodes see same total order of broadcast cache coherence requests
  • Update: for each incoming broadcast, "add" Address
  • Not quite this simple (e.g., doesn't detect reorderings)
  • Check: error if all signatures aren't equal
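The basic structure of this checker can be sketched as follows. As the slide itself warns, a plain sum cannot detect reorderings; the paper's fixes (hashing more message fields, LFSRs) are not shown here, only the add-the-address skeleton:

```python
# Skeleton of the message-level checker (illustrative; the paper's real
# scheme hardens this against aliasing): every node adds the address of
# each broadcast it observes into its signature, and the checker flags
# an error when any two signatures differ.

def update(signature: int, address: int, width: int = 32) -> int:
    """'Add' the broadcast's address into a b-bit signature."""
    return (signature + address) % (1 << width)

def check(signatures) -> bool:
    """Error if all per-node signatures are not equal."""
    return len(set(signatures)) != 1

sig_p1 = sig_p2 = 0
for addr in [0x40, 0x80, 0xC0]:          # broadcasts observed by P1
    sig_p1 = update(sig_p1, addr)
for addr in [0x40, 0x80]:                # P2 misses one broadcast
    sig_p2 = update(sig_p2, addr)
assert check([sig_p1, sig_p2]) is True   # divergence detected
```

A dropped broadcast, as in the slide-10 example, makes the signatures diverge at the next periodic check even though no single node could see the loss locally.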

  20. Aliasing
  • Aliasing occurs if two histories have same signature
  • 3 possible sources of aliasing
  • Finite resources – b bits can only distinguish 2^b histories
  • Fault in signature analysis hardware itself
  • Inherent flaw in scheme
  • Examples of inherent aliasing in previous scheme
  • Arrival of message with Address=0 doesn't change signature
  • Reordering of messages doesn't change signature
  • We solve aliasing issues in paper
  • Tricks: hash more than 1 field of message, use LFSRs, etc.
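One of the "tricks" mentioned above, using an LFSR, makes the signature depend on event order, closing the reordering alias. The polynomial, width, and structure below are my own choices for illustration, not the paper's hardware:

```python
# Illustrative order-sensitive signature (my parameters, not the paper's):
# step a Galois LFSR once per event, then XOR in the event's address, so
# the same events in a different order yield a different signature.

def lfsr_update(signature: int, address: int,
                width: int = 16, taps: int = 0xB400) -> int:
    lsb = signature & 1
    signature >>= 1
    if lsb:
        signature ^= taps          # feedback polynomial
    return (signature ^ address) & ((1 << width) - 1)

def signature_of(history) -> int:
    sig = 0
    for addr in history:
        sig = lfsr_update(sig, addr)
    return sig

# Reordered histories now alias no longer, unlike the additive scheme,
# and an Address=0 event still changes the signature via the shift.
assert signature_of([0x40, 0x80]) != signature_of([0x80, 0x40])
assert signature_of([0x40, 0x00]) != signature_of([0x40])
```

Finite-width aliasing of course remains: with b bits, only 2^b signatures exist, so distinct histories must eventually collide; the design goal is making collisions between a correct and a faulty history vanishingly unlikely.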

  21. A Cache Coherence Invariant Checker
  • Invariant: all coherence upgrades cause downgrades
  • Upgrade: increase permissions to block (e.g., none → read)
  • Downgrade: decrease permissions (e.g., write → read)
  • Update: add Address for upgrade, subtract Address for downgrade
  • Check: error if sum of all signatures doesn't equal 0
  • Challenges
  • Can be more than one downgrade per upgrade
  • Upgrader doesn't know how many downgraders exist
  • See paper for solutions to these challenges
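The Update and Check rules on this slide can be sketched directly; the handling of multiple downgrades per upgrade and the unknown downgrader count is left to the paper, so this shows only the one-downgrade-per-upgrade skeleton:

```python
# Skeleton of the coherence-invariant checker (illustrative): an upgrade
# adds the block address into the upgrader's signature, each matching
# downgrade subtracts it elsewhere, and a nonzero system-wide sum at
# check time means some upgrade went unmatched.

def upgrade(signature: int, address: int) -> int:
    return signature + address

def downgrade(signature: int, address: int) -> int:
    return signature - address

def check(signatures) -> bool:
    """Error if the sum of all signatures doesn't equal 0."""
    return sum(signatures) != 0

# P1 upgrades block 0x40 to write; P2, holding it read, downgrades.
sig_p1 = upgrade(0, 0x40)
sig_p2 = downgrade(0, 0x40)
assert check([sig_p1, sig_p2]) is False   # balanced: invariant holds

# Fault: P2 never sees the request and never downgrades, the slide-10
# scenario that no node could detect locally.
assert check([upgrade(0, 0x40), 0]) is True
```

This is the end-to-end check that catches the Modified/Shared incoherence from slide 13: locally every node behaved legally, but the upgrade left no matching downgrade in the global sum.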

  22. Outline
  • Background
  • End-to-end invariant checking
  • Two invariant checkers
  • Evaluation
  • Conclusions

  23. Methodology
  • Full-system simulation of 16-processor machine
  • Simics provides functional simulation of everything
  • We added timing simulation for memory system & SafetyNet
  • Commercial workloads running on Solaris 8
  • Database: IBM's DB2 running online transaction processing
  • Static web server: Apache
  • Dynamic web server: Slashdot
  • Java middleware

  24. Detection Coverage
  • How do we know if our checkers work?
  • Inject errors periodically
  • Corrupt messages
  • Drop messages
  • Reorder messages
  • Improperly process cache coherence messages
  Result: global invariant checkers detected all errors

  25. Performance
  • Error bars represent +/- one standard deviation

  26. Conclusions
  • Goal: improve multiprocessor availability
  • How? Dynamic verification of end-to-end invariants
  • Implemented with distributed signature analysis
  • Results
  • Detects previously undetectable hardware errors
  • Negligible performance overhead for error-free execution
  • Duke FaultFinder Project
  • http://www.ee.duke.edu/~sorin/faultfinder
  • Wisconsin Multifacet Project
  • http://www.cs.wisc.edu/multifacet/
