1. 1 MPI Verification
Ganesh Gopalakrishnan and Robert M. Kirby
Students
Yu Yang, Sarvani Vakkalanka, Guodong Li,
Subodh Sharma, Anh Vo, Michael DeLisi, Geof Sawaya
(http://www.cs.utah.edu/formal_verification)
School of Computing
University of Utah
Supported by:
Microsoft HPC Institutes
NSF CNS 0509379
2. 2 “MPI Verification” or: How to exhaustively verify MPI programs without the pain of model building, considering only “relevant interleavings”
3. 3 Computing is at an inflection point
4. 4 Our work pertains to these:
MPI programs
MPI libraries
Shared Memory Threads based on Locks
5. 5 Name of the Game: Progress Through Precision Precision in Understanding
Precision in Modeling
Precision in Analysis
Doing Modeling and Analysis with Low Cost
6. 6 1. Need for Precision in Understanding: The “crooked barrier” quiz
13. 13 Would you rather explain each conceivable situation in a large API with an elaborate “bee dance” and informal English… or would you rather specify it mathematically and let the user calculate the outcomes?
16. 16 Executable Formal Specification can help validate our understanding of MPI …
17. 17 Subject the system (or a reduced version of the system) to a collection of inputs (and hence execution paths)
Concrete example: when codes are ported, they typically break
19. 19 Error-trace Visualization in VisualStudio
20. 20 2. Precision in Modeling: The “Byte-range Locking Protocol” Challenge. Asked to see if a new protocol using MPI 1-sided operations was OK…
21. 21 Precision in Modeling: The “Byte-range Locking Protocol” Challenge. Studied the code
Wrote Promela Verification Model (a week)
Applied the SPIN Model Checker
Found Two Deadlocks Previously Unknown
Wrote Paper (EuroPVM / MPI 2006) with Thakur and Gropp – won one of the three best-paper awards
With new insight, Designed Correct AND Faster Protocol!
Still, we felt lucky… what if we had missed the error while hand-modeling?
Also, hand-modeling was NO FUN – how about running the real MPI code “cleverly”?
22. 22 Measurement under Low Contention
23. 23 Measurement under High Contention
24. 24 4. Modeling and Analysis with Reduced Cost…
25. 25 What works for cards works for MPI (and for PThreads also)!!
26. 26 4. Modeling and Analysis with Reduced Cost: The “Byte-range Locking Protocol” Challenge. Studied code → DID NOT STUDY CODE
Wrote Promela Verification Model (a week) → NO MODELING
Applied the SPIN Model Checker → NEW ISP VERIFIER
Found Two Deadlocks Previously Unknown → FOUND SAME!
Wrote Paper (EuroPVM / MPI 2007) with Thakur and Gropp – won one of the three best-paper awards → DID NOT WIN?
Still, we felt lucky… what if we had missed the error while hand-modeling → NO NEED TO FEEL LUCKY (NO LOST INTERLEAVING – but also did not foolishly do ALL interleavings)
Also hand-modeling was NO FUN – how about running the real MPI code “cleverly”? → DIRECT RUNNING WAS FUN
27. 27 3. Precision in Analysis: The “crooked barrier” quiz again…
29. 29 Precision in Analysis POE Works Great (all 41 Umpire Test-Suites Run)
No need to “pad” delay statements to jiggle schedule and force “the other” interleaving
This is a very brittle trick anyway!
Prelim Version Under Submission
Detailed Version for EuroPVM…
Jitterbug uses this approach
We don’t need it
Siegel (MPI_SPIN): Modeling effort
Marmot : Different Coverage Guarantees..
30. 30 1-4: Finally! Precision and Low Cost in Modeling and Analysis, taking advantage of MPI semantics (in our heads…)
31. 31 Discover All Potential Senders by Collecting (but not issuing) operations at runtime…
32. 32 Rewrite “ANY” to ALL POTENTIAL SENDERS
33. 33 Rewrite “ANY” to ALL POTENTIAL SENDERS
34. 34 Recurse over all such configurations !
35. 35 If we now have P0-P2 doing this, and P3-5 doing the same computation between themselves, no need to interleave these groups…
36. 36 Why is all this worth doing ?
37. 37 MPI is the de-facto standard for programming cluster machines
39. 39 The Need for Formal Semantics for MPI Send
Receive
Send / Receive
Send / Receive / Replace
Broadcast
Barrier
Reduce
40. 40 Need Formal Semantics for MPI, because we can’t imitate any existing implementation…
41. 41 Look for commonly committed mistakes automatically
Deadlocks
Communication Races
Resource Leaks
We are only after “low-hanging” bugs…
42. 42 Deadlock pattern… Here is a small program snip, taken and modified from an example in one of Pacheco’s books. We have modified it so it has a deadlock. Can you find it? It’s a simple off-by-one. We can find it and others like it.
43. 43 Communication Race Pattern…
44. 44 Resource Leak Pattern…
45. 45 Bugs are hidden within huge state-spaces…
Partial order reduction is a technique that chooses a set of representative execution interleaving orders or “traces” for the model checker to explore.
Depending on which properties are to be proved by the model checker, more aggressive reduction can be achieved.
The execution trace to explore is determined by the independence of the transitions in the trace and the type of property to be verified.
Presently we are interested in (i) local assertions, (ii) deadlocks, and (iii) cycles.
When two transitions are independent and enabled in the same state you can defer the execution of one transition in preference to the other without missing any property violations.
46. 46 Partial Order Reduction Illustrated… With 3 processes, the size of an interleaved state space is s^p = 3^3 = 27
Partial-order reduction explores representative sequences from each equivalence class
Delays the execution of independent transitions
In this example, it is possible to “get away” with 7 states (one interleaving)
47. 47 // Add-up integrals calculated by each process
if (my_rank == 0) {
total = integral;
for (source = 0; source < p; source++) {
MPI_Recv(&integral, 1, MPI_FLOAT,source,
tag, MPI_COMM_WORLD, &status);
total = total + integral;
}
} else {
MPI_Send(&integral, 1, MPI_FLOAT, dest,
tag, MPI_COMM_WORLD);
}
48. 48 Organization of ISP
49. 49 Summary (have posters for each) Formal Semantics for a large subset of MPI 2.0
Executable semantics for about 150 MPI 2.0 functions
User interactions through VisualStudio API
Direct execution of user MPI programs to find issues
Downscale code, remove data that does not affect control, etc.
New Partial Order Reduction Algorithm
Explores only Relevant Interleavings
User can insert barriers to contain complexity
New Vector-Clock algorithm determines if barriers are safe
Errors detected
Deadlocks
Communication races
Resource leaks
Direct execution of PThread programs to find issues
Adaptation of Dynamic Partial Order Reduction reduces interleavings
Parallel implementation – scales linearly
50. 50 Also built POR explorer for C / Pthreads programs, called “Inspect”
51. 51 Dynamic POR is almost a “must”!
52. 52 Why Dynamic POR ?
53. 53 Why Dynamic POR ?
54. 54 Computation of “ample” sets in Static POR versus in DPOR
55. 55 We target C/C++ PThread Programs
Instrument the given program (largely automated)
Run the concurrent program “till the end”
Record interleaving variants while advancing
When the number of recorded backtrack points reaches a soft limit, spill work to other nodes
In one larger example, an 11-hour run finished in 11 minutes using 64 nodes
A heuristic to avoid recomputations was essential for speed-up
First known distributed DPOR
56. 56 A Simple DPOR Example
t0: lock(t); unlock(t)
t1: lock(t); unlock(t)
t2: lock(t); unlock(t)
58. 58 Idea for parallelization: Explore computations from the backtrack set in other processes. “Embarrassingly Parallel” – it seems so, anyway!
60. 60 Speedup on aget
61. 61 Speedup on bbuf
62. 62 Historical Note Model Checking
Proposed in 1981
2007 ACM Turing Award for Clarke, Emerson, and Sifakis
Bug discovery facilitated by
The creation of simplified models
Exhaustively checking the models
Exploring only relevant interleavings
63. 63 Looking ahead… Plans for one year out…
64. 64 Finish tool implementation for MPI and others… Static Analysis to reduce some cost
Inserting Barriers (to contain cost) using new vector-clocking algorithm for MPI
Demonstrate on meaningful apps (e.g. Parmetis)
Plug into MS VisualStudio
Development of PThread (“Inspect”) tool with same capabilities
Evolving these tools to Transaction Memory, Microsoft TPL, OpenMP, …
65. 65 Thanks, Microsoft! And Dennis Crain, Shahrokh Mortazavi. In these times of unpredictable NSF funding, the HPC Institute Program made it possible for us to produce a great cadre of Formal Verification Engineers
Robert Palmer (PhD – to join Microsoft soon), Sonjong Hwang (MS), Steve Barrus (BS), Salman Pervez (MS)
Yu Yang (PhD), Sarvani Vakkalanka (PhD), Guodong Li (PhD), Subodh Sharma (PhD), Anh Vo (PhD), Michael DeLisi (BS/MS), Geof Sawaya (BS)
(http://www.cs.utah.edu/formal_verification)
Microsoft HPC Institutes
NSF CNS 0509379
66. 66 Extra Slides
67. 67 Looking Further Ahead: Need to clear the “idea log-jam in multi-core computing…” There is no such thing as an architecture-only solution, or a compilers-only solution, to future problems in multi-core computing…
68. 68 Now you see it; now you don’t! On the menace of non-reproducible bugs. Deterministic replay must ideally be an option
User-programmable schedulers are greatly emphasized by expert developers
Runtime model-checking methods with state-space reduction hold promise in meshing with current practice…