
Dongyoon Lee, Peter Chen, Jason Flinn, Satish Narayanasamy

Chimera: Hybrid Program Analysis for Determinism. Dongyoon Lee, Peter Chen, Jason Flinn, Satish Narayanasamy, University of Michigan, Ann Arbor. * Chimera image from http://superpunch.blogspot.com/2009/02/chimera-sketch.html


Presentation Transcript


  1. Chimera: Hybrid Program Analysis for Determinism
     Dongyoon Lee, Peter Chen, Jason Flinn, Satish Narayanasamy
     University of Michigan, Ann Arbor
     * Chimera image from http://superpunch.blogspot.com/2009/02/chimera-sketch.html

  2. Deterministic Replay
     Goal: record and reproduce multithreaded execution
     • Debugging concurrency bugs
     • Offline heavyweight dynamic analysis
     • Forensics and intrusion detection
     • … and many more uses
     Problem
     • Multithreaded record-and-replay is too slow (>2x) or requires custom hardware

  3. Multithreaded Record-and-Replay is Slow
     Checkpoint memory and register state
     Log non-deterministic program input: interrupts, I/O values, DMA, etc.
     Log shared memory dependencies
     [Figure: Thread 1, Thread 2, and Thread 3 issue two writes and a read to shared memory]

  4. Replay for Data-Race-Free Programs is Cheap
     Data-race-free programs
     • Shared memory accesses are well ordered by synchronization operations
     • Recording the happens-before order of synchronization operations is sufficient
     Problem: programs with data races
     [Figure: timeline of T1, T2, and T3 writing X, Y, and Z, ordered through Lock(l)/Unlock(l) and Signal(c)/Wait(c); legend: order of memory operations, order of synchronization operations]
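The idea that logging the order of synchronization operations is enough can be sketched as follows. This is a minimal single-threaded simulation, not the Chimera runtime: the `sync_log` type and the `log_acquire` and `replay_try_acquire` functions are hypothetical names, and the per-lock acquisition order stands in for the full happens-before order.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_EVENTS 64

/* One logged lock acquisition (hypothetical record layout). */
typedef struct {
    int lock_id;
    int thread_id;
    int consumed;   /* set once replay has re-executed this acquisition */
} sync_event;

typedef struct {
    sync_event events[MAX_EVENTS];
    size_t count;
} sync_log;

/* Recording: append each acquisition in the order it was observed. */
static void log_acquire(sync_log *log, int lock_id, int thread_id) {
    assert(log->count < MAX_EVENTS);
    sync_event *e = &log->events[log->count++];
    e->lock_id = lock_id;
    e->thread_id = thread_id;
    e->consumed = 0;
}

/* Replay: a thread may take lock_id only if the oldest unconsumed logged
 * acquisition of that lock is its own. Returns 1 on success, 0 if the
 * thread must wait (or if no acquisition of that lock remains). */
static int replay_try_acquire(sync_log *log, int lock_id, int thread_id) {
    for (size_t i = 0; i < log->count; i++) {
        sync_event *e = &log->events[i];
        if (e->consumed || e->lock_id != lock_id)
            continue;
        if (e->thread_id != thread_id)
            return 0;           /* another thread acquired it first */
        e->consumed = 1;
        return 1;
    }
    return 0;
}
```

During replay, a thread that gets 0 back would retry later; for a data-race-free program this forces shared memory accesses into the recorded order without logging each access.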

  5. Our Contribution: A Hybrid Analysis
     Sound static data race analysis
     • Add synchronization for potential data races
     • Problem: too many false positives
     Chimera combines it with
     • Profiling non-concurrent code regions
     • Symbolic bounds analysis
     [Figure: Chimera transforms a potentially racy program P into a data-race-free program P’]

  6. Roadmap
     • Motivation
     • Chimera Analysis
       • Static data race analysis
       • Profiling non-concurrent code regions
       • Symbolic bounds analysis
     • Weak-lock Design
     • Evaluation
     • Conclusion

  7. Roadmap
     • Motivation
     • Chimera Analysis
       • Static data race analysis
       • Profiling non-concurrent code regions
       • Symbolic bounds analysis
     • Weak-lock Design
     • Evaluation
     • Conclusion

  8. Static Data Race Analysis
     • Find potential data races using a sound static data race detector
       • RELAY [Voung et al., FSE’07]
     • Protect all potential data races using weak-locks
       • A new time-out lock which may be preempted (discussed later)
     • Record and replay the happens-before order of weak-locks

  9. Protect Potential Races using Weak-locks
     Static analysis helps avoid instrumentation for accesses to Z
       void foo() {
         X = 0;
         for(i = ... ){ Y[ tid ][ i ] = 0; }
       }
       void bar() {
         X = 1;
         for(i = … ){ Y[ tid ][ i ] = 1; Z = 1; }
       }
     • X = 0 and X = 1: potential racy pair
     • Y[tid][i] = 0 and Y[tid][i] = 1: potential racy pair
     • Z = 1: no race report
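A minimal sketch of what instruction-level instrumentation of this example might look like. The `weak_lock_t` type and the `weak_lock`/`weak_unlock` calls are hypothetical stubs that only track ownership and count acquisitions, and the loop bound `N` is an assumption, since the slide elides it. One lock per reported racy pair is also an assumption.

```c
#include <assert.h>

#define NTHREADS 2
#define N 4                     /* assumed loop bound; elided on the slide */

static int X, Z;
static int Y[NTHREADS][N];

/* Hypothetical weak-lock stub: tracks ownership and counts acquisitions. */
typedef struct { int held; int acquisitions; } weak_lock_t;
static weak_lock_t wl_X, wl_Y;  /* one lock per reported racy pair */

static void weak_lock(weak_lock_t *l)   { assert(!l->held); l->held = 1; l->acquisitions++; }
static void weak_unlock(weak_lock_t *l) { assert(l->held);  l->held = 0; }

void foo(int tid) {
    weak_lock(&wl_X); X = 0; weak_unlock(&wl_X);
    for (int i = 0; i < N; i++) {
        weak_lock(&wl_Y); Y[tid][i] = 0; weak_unlock(&wl_Y);
    }
}

void bar(int tid) {
    weak_lock(&wl_X); X = 1; weak_unlock(&wl_X);
    for (int i = 0; i < N; i++) {
        weak_lock(&wl_Y); Y[tid][i] = 1; weak_unlock(&wl_Y);
        Z = 1;                  /* no race report for Z, so no instrumentation */
    }
}
```

Each loop iteration pays for one acquire and one release, which is why the per-instruction scheme is expensive when the static detector reports many false races.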

  10. Sources of False Positives in RELAY
      • A sound data race detector reports too many false data races
        • 53x overhead
      • Source 1: Non-mutex synchronization is ignored
        • Lockset-based analysis ignores fork-join, barrier, signal-wait, etc.
        • May report a false data race between memory instructions that can never execute concurrently
        • Solution: profiling non-concurrent code regions
      • Source 2: Conservative pointer analysis
        • Overestimates the variables accessed by a memory instruction
        • May report a false data race between memory instructions that can never access the same location
        • Solution: symbolic bounds analysis

  11. Roadmap
      • Motivation
      • Chimera Analysis
        • Static data race analysis
        • Profiling non-concurrent code regions
        • Symbolic bounds analysis
      • Weak-lock Design
      • Evaluation
      • Conclusion

  12. Profiling Non-concurrent Code Regions
      Problem
      • Lockset-based analysis ignores non-mutex synchronization operations
      Solution
      • Profile non-concurrent code regions (e.g., functions)
      • Increase the granularity of weak-locks to protect a larger code region instead of each potentially racy instruction
      • Parallelism is preserved unless mis-profiled
      [Figure: T1 and T2 are separated by a barrier; foo() runs before it and bar() after it, so the reported race between them is false]

  13. Function-Level Weak-Locks
      Used if the profiler says foo() and bar() are not likely to run concurrently
        void foo() {
          X = 0;
          for(i = … ){ Y[ tid ][ i ] = 0; }
        }
        void bar() {
          X = 1;
          for(i = … ){ Y[ tid ][ i ] = 1; Z = 1; }
        }
      [Figure: foo() runs before the barrier and bar() after it; the race between them is false]
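A sketch of the same example with a function-level weak-lock, under the same assumptions as before: `weak_lock_t`, `weak_lock`, and `weak_unlock` are hypothetical stubs, `N` is an assumed loop bound, and sharing a single lock `wl_foo_bar` between the two functions is one plausible reading of the slide rather than the documented design.

```c
#include <assert.h>

#define NTHREADS 2
#define N 4                     /* assumed loop bound; elided on the slide */

static int X, Z;
static int Y[NTHREADS][N];

/* Hypothetical weak-lock stub: tracks ownership and counts acquisitions. */
typedef struct { int held; int acquisitions; } weak_lock_t;
static weak_lock_t wl_foo_bar;  /* assumed: one lock covering foo() and bar() */

static void weak_lock(weak_lock_t *l)   { assert(!l->held); l->held = 1; l->acquisitions++; }
static void weak_unlock(weak_lock_t *l) { assert(l->held);  l->held = 0; }

void foo(int tid) {
    weak_lock(&wl_foo_bar);     /* acquired once per call, not per access */
    X = 0;
    for (int i = 0; i < N; i++)
        Y[tid][i] = 0;
    weak_unlock(&wl_foo_bar);
}

void bar(int tid) {
    weak_lock(&wl_foo_bar);
    X = 1;
    for (int i = 0; i < N; i++) {
        Y[tid][i] = 1;
        Z = 1;
    }
    weak_unlock(&wl_foo_bar);
}
```

If the profile is right and the two functions never overlap in time, the lock is never contended, so the only cost is one logged acquisition per call.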

  14. Roadmap
      • Motivation
      • Chimera Analysis
        • Static data race analysis
        • Profiling non-concurrent code regions
        • Symbolic bounds analysis
      • Weak-lock Design
      • Evaluation
      • Conclusion

  15. Imprecision in Conservative Pointer Analysis
      [Figure: between two barriers, T1 and T2 run foo() and bar(), which may run concurrently]

  16. Imprecision in Conservative Pointer Analysis
      • RELAY uses Steensgaard’s and Andersen’s pointer analyses
        • Flow-insensitive and context-insensitive (FICI) analysis
        • Naming of heap objects is conservative
      • Overestimates the variables accessed by a memory instruction
        void foo() { … for(i = 0 to N){ Y[ tid ][ i ] = 0; … } }
        void bar() { … for(i = 0 to N){ Y[ tid ][ i ] = 1; … } }
      • The two writes to Y[tid][i] are reported as a potential racy pair, but the race is false
      [Figure: Thread 1 and Thread 2 each access their own part of Y[][]]

  17. Symbolic Bounds Analysis
      Our Solution
      • Derive the symbolic lower and upper bounds that a racy code region (e.g., a loop) may access [Rugina and Rinard, PLDI’00]
      • Increase the granularity of weak-locks to protect a larger code region for a set of addresses specified by a symbolic expression
      • Parallelism is preserved if the bounds are precise enough
        void foo() { … for(i = 0 to N){ Y[ tid ][ i ] = 0; } … }
      Result of symbolic bounds analysis: &Y[tid][0] to &Y[tid][N]

  18. Loop-level Weak-locks
      Symbolic bounds: &Y[tid][0] ~ &Y[tid][N]
        void foo() {
          X = 0;
          for(i = 0 to N){ Y[ tid ][ i ] = 0; }
        }
        void bar() {
          X = 1;
          for(i = 0 to N){ Y[ tid ][ i ] = 1; Z = 1; }
        }
      Each loop is wrapped in a weak-lock acquire and release on the range (&Y[tid][0], &Y[tid][N])
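A range-based weak-lock only needs to serialize two loops when their address ranges overlap. The sketch below shows that check for the bounds on this slide; `addr_range`, `make_range`, `ranges_conflict`, and `loop_bounds` are hypothetical helper names, and the array dimensions are assumptions chosen so that `i = 0 to N` is inclusive.

```c
#include <assert.h>
#include <stdint.h>

/* Inclusive address range protected by a loop-level weak-lock (hypothetical). */
typedef struct { uintptr_t lo, hi; } addr_range;

static addr_range make_range(const void *lo, const void *hi) {
    addr_range r = { (uintptr_t)lo, (uintptr_t)hi };
    assert(r.lo <= r.hi);
    return r;
}

/* Two loops must be ordered against each other only if their ranges overlap. */
static int ranges_conflict(addr_range a, addr_range b) {
    return a.lo <= b.hi && b.lo <= a.hi;
}

#define NTHREADS 2
#define N 8                           /* assumed; the loop runs i = 0..N inclusive */
static int Y[NTHREADS][N + 1];

/* Bounds from the slide, evaluated at loop entry for a concrete tid. */
static addr_range loop_bounds(int tid) {
    assert(tid >= 0 && tid < NTHREADS);
    return make_range(&Y[tid][0], &Y[tid][N]);
}
```

With these bounds, threads 0 and 1 get disjoint ranges and can run their loops in parallel, while two loops over the same row would conflict and be ordered by the weak-lock.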

  19. Imprecise Symbolic Bounds
      Sources
      • Bounds depend on a value computed inside the code region
      • Bounds depend on arithmetic operations not supported by the analysis
        • e.g., modulo operations, logical AND/OR, etc.
      Choosing the optimal granularity
      • If the bounds are too imprecise and the loop body is long enough, resort to instruction (basic-block) level weak-locks for parallelism
        void qux() { … for(i = 0 to N){ prev = Z[ prev ]; } … }
      Result of symbolic bounds analysis: -INF to +INF
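The granularity rule on this slide can be written as a small decision function. This is only a sketch: the `LONG_LOOP_BODY` cutoff, the instruction-count measure of "long enough", and the fallback to a loop-level lock in every other case are assumptions, not values or behavior taken from the talk.

```c
#include <assert.h>

typedef enum { GRAN_LOOP, GRAN_INSTRUCTION } granularity;

/* Hypothetical cutoff for "the loop body is long enough". */
#define LONG_LOOP_BODY 32

/* bounds_precise: nonzero if symbolic bounds analysis produced usable bounds
 * (i.e., not -INF to +INF). loop_body_instrs: size of the loop body. */
static granularity choose_granularity(int bounds_precise, int loop_body_instrs) {
    assert(loop_body_instrs >= 0);
    if (!bounds_precise && loop_body_instrs >= LONG_LOOP_BODY)
        return GRAN_INSTRUCTION;   /* keep parallelism inside a long loop */
    return GRAN_LOOP;              /* assumed default: one lock around the loop */
}
```

For a loop like the one in `qux()`, whose bounds come out as -INF to +INF, the result depends only on how long the loop body is relative to the assumed cutoff.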

  20. Roadmap
      • Motivation
      • Chimera Analysis
      • Weak-lock Design
      • Evaluation
      • Conclusion

  21. Deadlock due to Weak-locks
      No deadlocks between weak-locks
      • Ordering: function-level > loop-level > instruction-level
      Deadlock between weak-locks and original synchronization operations is possible
      [Figure: T1 executes … wait(cv) … and T2 executes … signal(cv) …; a weak-lock wait ends in a time-out]
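One common way to rule out deadlock among locks is to acquire them in a fixed global order. The sketch below reads the ordering on this slide as such an acquisition order, which is an assumption; the `wl_level` enum, the `weak_lock_ref` type, and the tie-break by `id` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Coarser levels get smaller values so they sort (and are acquired) first. */
typedef enum { WL_FUNCTION = 0, WL_LOOP = 1, WL_INSTRUCTION = 2 } wl_level;

typedef struct { wl_level level; int id; } weak_lock_ref;

/* Total order: by level first, then by id within a level (assumed tie-break). */
static int cmp_weak_lock(const void *a, const void *b) {
    const weak_lock_ref *x = a, *y = b;
    if (x->level != y->level)
        return (x->level > y->level) - (x->level < y->level);
    return (x->id > y->id) - (x->id < y->id);
}

/* Sort the locks a region needs so every thread acquires them in the same order. */
static void sort_for_acquisition(weak_lock_ref *locks, size_t n) {
    assert(locks != NULL);
    qsort(locks, n, sizeof *locks, cmp_weak_lock);
}
```

Because every thread follows the same total order, no cycle of threads waiting on each other's weak-locks can form; this says nothing about cycles that involve the program's own synchronization, which is the case the time-out addresses.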

  22. Weak-lock Time-out
      A weak-lock might time out
      • Invoke a special system call to handle it
      Weak-lock guarantee
      • Only one thread holds a given weak-lock at any given time
      • Mutual exclusion may be compromised, but this is sufficient for replay
      [Figure: T1 (… wait(cv) …) and T2 (… signal(cv) …); after the time-out the current owner changes, and the order of weak-lock owners is logged]
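A minimal single-threaded simulation of a preemptible time-out lock, using logical clock ticks instead of real time. Everything here is hypothetical: the `TIMEOUT_TICKS` value, the `weak_lock_try` interface, and the choice to transfer ownership directly to the waiter. The real system handles the time-out through a special system call, which this sketch does not model.

```c
#include <assert.h>

#define NO_OWNER (-1)
#define TIMEOUT_TICKS 10        /* hypothetical time-out, in logical ticks */

typedef struct {
    int  owner;                 /* thread id of the single current owner */
    long acquired_at;           /* tick at which the owner took the lock */
    int  preemptions;           /* how often ownership was taken by time-out */
} weak_lock_t;

static void weak_lock_init(weak_lock_t *l) {
    l->owner = NO_OWNER;
    l->acquired_at = 0;
    l->preemptions = 0;
}

/* Returns 1 if tid owns the lock after the call, 0 if it must keep waiting. */
static int weak_lock_try(weak_lock_t *l, int tid, long now) {
    assert(tid != NO_OWNER);
    if (l->owner == tid)
        return 1;
    if (l->owner == NO_OWNER) {
        l->owner = tid;
        l->acquired_at = now;
        return 1;
    }
    if (now - l->acquired_at >= TIMEOUT_TICKS) {
        /* Time-out: preempt the owner. There is still exactly one owner,
         * but the old owner may still be running inside its region. */
        l->preemptions++;
        l->owner = tid;
        l->acquired_at = now;
        return 1;
    }
    return 0;
}

/* A preempted thread no longer owns the lock, so its release is a no-op. */
static void weak_unlock(weak_lock_t *l, int tid) {
    if (l->owner == tid)
        l->owner = NO_OWNER;
}
```

The sketch keeps the single-owner property from the slide while giving up strict mutual exclusion: after a preemption the old owner can still be inside its region, and replay can stay deterministic as long as the sequence of owners is logged.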

  23. Roadmap
      • Motivation
      • Chimera Analysis
      • Weak-lock Design
      • Evaluation
      • Conclusion

  24. Implementation
      Source-to-source instrumentation
      • Implemented in OCaml using CIL as a front end
      Static analysis
      • Data race detection: RELAY [Voung et al., FSE’07]
        • Includes all library source code for soundness (uClibc’s libc, libm, etc.)
      • Symbolic bounds analysis: [Rugina and Rinard, PLDI’00]
        • Intra-procedural analysis for racy loops only
      Runtime system
      • Modified Linux kernel to record/replay program input
      • Modified pthread library to record/replay the happens-before order of original synchronization operations and weak-locks

  25. Evaluation Setup
      Test environment
      • 2.66 GHz 8-core Xeon processor with 4 GB of RAM
      • Different sets of inputs for profiling and performance evaluation
      • Average of five trials with 4 worker threads
      • 2, 4, and 8 threads for scalability results
      Benchmarks
      • Desktop applications: aget, pfscan, and pbzip2
      • Server programs: knot and apache
      • SPLASH-2 suite: ocean, water-nsq, fft, and radix

  26. Record and Replay Performance
      • Recording: 39% overhead on average
      • Replay: similar to recording; much lower for I/O-intensive programs
      [Chart: per-benchmark slowdown; labeled values are 86% slowdown, 39%, and 2.4% slowdown]

  27. Effectiveness of Coarse-grained Weak-locks
      [Chart label: 53x]

  28. Effectiveness of Coarse-grained Weak-locks
      • Coarse-grained weak-locks reduce the cost of instrumentation

  29. Effectiveness of Coarse-grained Weak-locks
      • Coarse-grained weak-locks reduce the cost of instrumentation
      • Exception: control-flow dependency (e.g., pfscan)

  30. Effectiveness of Coarse-grained Weak-locks
      • Coarse-grained weak-locks reduce the cost of instrumentation
      • Exception: control-flow dependency (e.g., pfscan)

  31. Effectiveness of Coarse-grained Weak-locks
      • Coarse-grained weak-locks reduce the cost of instrumentation
      • Exception: control-flow dependency (e.g., pfscan)
      [Chart label: 1.39x]

  32. Breakdown of Recording Overhead
      • Weak-lock overhead = contention (waiting) cost + logging cost
      [Chart legend: func locks, loop locks, instr/bb locks, sync op & system log]

  33. Breakdown of Recording Overhead
      • Weak-lock overhead = contention (waiting) cost + logging cost
      • High loop-lock contention
      • High instr/bb-lock contention
      [Chart legend: func wait, func log, loop wait, loop log, instr/bb wait, instr/bb log, sync op & system log]

  34. Scalability
      • Scientific applications scale worse due to imprecise symbolic bounds analysis

  35. Conclusion
      Goal: software-only deterministic multiprocessor replay systems
      Chimera Analysis
      • Static data race analysis
        • Find and protect potential data races with weak-locks
        • Instruction/basic-block-level weak-locks
      • Profiling non-concurrent code regions
        • Address the inadequacy of the lockset-based algorithm
        • Function-level weak-locks
      • Symbolic bounds analysis
        • Address the imprecision of conservative pointer analysis
        • Loop-level weak-locks
      Low recording overhead
      • 39% recording overhead for 4 worker threads

  36. Thank you
