html5-img
1 / 72

Distributed Dynamic Partial Order Reduction based Verification of Threaded Software

Distributed Dynamic Partial Order Reduction based Verification of Threaded Software. Yu Yang (PhD student; summer intern at CBL) Xiaofang Chen (PhD student; summer intern at IBM) Ganesh Gopalakrishnan Robert M. Kirby School of Computing University of Utah

gilead
Download Presentation

Distributed Dynamic Partial Order Reduction based Verification of Threaded Software

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Distributed Dynamic Partial Order Reduction based Verification of Threaded Software Yu Yang (PhD student; summer intern at CBL) Xiaofang Chen (PhD student; summer intern at IBM) Ganesh Gopalakrishnan Robert M. Kirby School of Computing University of Utah SPIN 2007 Workshop Presentation Supported by: Microsoft HPC Institutes NSF CNS 0509379

  2. Thread Programming will become more prevalentFV of thread programs will grow in importance

  3. Why FV for Threaded Programs > 80% of chips shipped will be multi-core (photo courtesy of Intel Corporation.)

  4. Model Checking will Increasingly be thru Dynamic Methods Also known as Runtimeor In-Situ methods

  5. Why Dynamic Verification Methods • Even after early life-cycle modeling and validation, • the final code will have far more details • Early life-cycle modeling is often impossible • Use of libraries (API) such as MPI, OpenMP, Shmem, … • Library function semantics can be tricky • The bug may be in the library function implementation

  6. Model Checking will often be “stateless”

  7. Why Stateless • One may not be able to access a lot of the state • e.g. state of the OS • . It is expensive to hash and lookup revisits • . Stateless is easier to parallelize

  8. Partial Order Reduction is Crucial !

  9. Why POR? Process P0: ------------------------------- 0: MPI_Init 1: MPI_Win_lock 2: MPI_Accumulate 3: MPI_Win_unlock 4: MPI_Barrier 5: MPI_Finalize Process P1: ------------------------------- 0: MPI_Init 1: MPI_Win_lock 2: MPI_Accumulate 3: MPI_Win_unlock 4: MPI_Barrier 5: MPI_Finalize ONLY DEPENDENT OPERATIONS • 504 interleavings without POR (2 * (10!)) / (5!)^2 • 2 interleavings with POR !!

  10. Dynamic POR is almost a “must” ! ( Dynamic POR as in Flanagan and Godefroid, POPL 2005)

  11. Why Dynamic POR ? a[ k ]-- a[ j ]++ • Ample Set depends on whether j == k • Can be very difficult to determine statically • Can determine dynamically

  12. Why Dynamic POR ? The notion of action dependence (crucial to POR methods) is a function of the execution

  13. Computation of “ample” sets in Static POR versus in DPOR { BT }, { Done } Add Red Process to “Backtrack Set” This builds the Ample set incrementally based on observed dependencies Blue is in “Done” set Ample determined using “local” criteria Nearest Dependent Transition Looking Back Current State Next move of Red process

  14. Putting it all together … • We target C/C++ PThread Programs • Instrument the given program (largely automated) • Run the concurrent program “till the end” • Record interleaving variants while advancing • When # recorded backtrack points reaches a soft limit, spill work to other nodes • In one larger example, a 11-hour run was finished in 11 minutes using 64 nodes • Heuristic to avoid recomputations was essential for speed-up. • First known distributed DPOR

  15. A Simple DPOR Example t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) {}, {}

  16. A Simple DPOR Example t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) {}, {} t0: lock

  17. A Simple DPOR Example t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) {}, {} t0: lock t0: unlock

  18. A Simple DPOR Example t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) {}, {} t0: lock t0: unlock t1: lock

  19. A Simple DPOR Example t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) {t1}, {t0} t0: lock t0: unlock t1: lock

  20. A Simple DPOR Example t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) {t1}, {t0} t0: lock t0: unlock {}, {} t1: lock t1: unlock t2: lock

  21. A Simple DPOR Example t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) {t1}, {t0} t0: lock t0: unlock {t2}, {t1} t1: lock t1: unlock t2: lock

  22. A Simple DPOR Example t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) {t1}, {t0} t0: lock t0: unlock {t2}, {t1} t1: lock t1: unlock t2: lock t2: unlock

  23. A Simple DPOR Example t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) {t1}, {t0} t0: lock t0: unlock {t2}, {t1} t1: lock t1: unlock t2: lock

  24. A Simple DPOR Example t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) {t1}, {t0} t0: lock t0: unlock {t2}, {t1}

  25. A Simple DPOR Example t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) {t1,t2}, {t0} t0: lock t0: unlock {}, {t1, t2} t2: lock

  26. A Simple DPOR Example t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) {t1,t2}, {t0} t0: lock t0: unlock {}, {t1, t2} t2: lock t2: unlock …

  27. A Simple DPOR Example t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) {t1,t2}, {t0} t0: lock t0: unlock {}, {t1, t2}

  28. A Simple DPOR Example t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) {t2}, {t0,t1}

  29. A Simple DPOR Example t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) {t2}, {t0, t1} t1: lock t1: unlock …

  30. For this example, all the paths explored during DPORFor others, it will be a proper subset

  31. Idea for parallelization: Explore computations from the backtrack set in other processes.“Embarrassingly Parallel” – it seems so, anyway !

  32. instrumentation compile request/permit request/permit We first built a sequential DPOR explorer for C / Pthreads programs, called “Inspect” Multithreaded C/C++ program executable instrumented program scheduler thread 1 thread n Thread library wrapper

  33. We then made the following observations • Stateless search does not maintain search history • Different branches of an acyclic space can be explored concurrently • Simple master-slave scheme can work here • one load balancer + workers

  34. Request unloading report result idle node id work description We then devised a work-distribution scheme… load balancer

  35. We got zero speedup! Why? Deeper investigation revealed that multiple nodes ended up exploring the same interleavings

  36. Illustration of the problem (1 of 5) {t1}, {t0} t0: lock t0: unlock {t2}, {t1} t1: lock t1: unlock t2: lock t2: unlock

  37. Illustration of the problem (2 of 5) {t1}, {t0} To Node 1 t0: lock t0: unlock {t2}, {t1} t1: lock t1: unlock Heuristic : Handoff DEEPEST backtrack point for another node to explore Reason : Largest number of paths emanate from there t2: lock t2: unlock

  38. Detail of (2 of 5) Node 0 {t1}, {t0} { }, {t0,t1} t0: lock t0: lock t0: unlock t0: unlock {t2}, {t1} {t2}, {t1} t1: lock t1: lock t1: unlock t1: unlock t2: lock t2: lock t2: unlock t2: unlock

  39. Detail of (2 of 5) Node 0 Node 1 {t1}, {t0} { }, {t0,t1} {t1}, {t0} t0: lock t0: lock t0: lock t0: unlock t0: unlock {t2}, {t1} {t2}, {t1} t1: lock t1: lock t1: unlock t1: unlock t2: lock t2: lock t2: unlock t2: unlock

  40. t1 is forced into DONE set before work handed to Node 1 Detail of (2 of 5) Node 0 Node 1 {t1}, {t0} { }, { t0,t1 } { t1 }, {t0} t0: lock t0: lock t0: lock t0: unlock t0: unlock Node 1 keeps t1 in backtrack set {t2}, {t1} {t2}, {t1} t1: lock t1: lock t1: unlock t1: unlock t2: lock t2: lock t2: unlock t2: unlock

  41. Illustration of the problem (3 of 5) {t1}, {t0} To Node 1 t0: lock Decide to do THIS work at Node 0 itself… t0: unlock {t2}, {t1} t1: lock t1: unlock t2: lock t2: unlock

  42. Illustration of the problem (4 of 5) {}, {t0,t1} {t1}, {t0} t0: lock Being expanded by Node 1 t0: unlock {t2}, {t1} Being expanded by Node 0

  43. Illustration of the problem (5 of 5) {t2}, {t0,t1} t0: lock t0: unlock {}, {t2} t2: lock t2: unlock

  44. Illustration of the problem (5 of 5) {t2}, {t0,t1} {t1}, {t0} t0: lock t1: lock t0: unlock t1: unlock {}, {t2} t2: lock t2: unlock

  45. Illustration of the problem (5 of 5) {t2}, {t0,t1} {t2}, {t0, t1} t0: lock t1: lock t0: unlock t1: unlock {}, {t2} {}, {t2} t2: lock t2: lock t2: unlock t2: unlock Redundancy!

  46. New Backtrack Set Computation: Aggressively mark up the stack! • Update the backtrack sets of ALL dependent operations! • Forms a good allocation scheme • Does not involve any synchronizations • Redundant work may still be performed • Likelihood is reduced because a node aggressively “owns” one operation and all its dependants {t1,t2}, {t0} t0: lock t0: unlock {t2}, {t1} t1: lock t1: unlock t2: lock t2: unlock

  47. Implementation and Evaluation • Using MPI for communication among nodes • Did experiments on a 72-node cluster • 2.4 GHz Intel XEON process, 2GB memory/node • Two (small) benchmarks Indexer & file system benchmark used in Flanagan and Godefoid’s DPOR paper • Aget -- a multithreaded ftp client • Bbuf – an implementation of bounded buffer

  48. Sequential Checking Time

  49. Speedup on indexer & fs (small exs);so diminishing returns > 40 nodes…

  50. Speedup on aget

More Related