
MPI Verification


Presentation Transcript


    1. 1 MPI Verification. Ganesh Gopalakrishnan and Robert M. Kirby. Students: Yu Yang, Sarvani Vakkalanka, Guodong Li, Subodh Sharma, Anh Vo, Michael DeLisi, Geof Sawaya (http://www.cs.utah.edu/formal_verification). School of Computing, University of Utah. Supported by: Microsoft HPC Institutes, NSF CNS 0509379

    2. 2 “MPI Verification” or How to exhaustively verify MPI programs without the pain of model building and considering only “relevant interleavings”

    3. 3 Computing is at an inflection point

    4. 4 Our work pertains to these: MPI programs; MPI libraries; shared-memory threads based on locks

    5. 5 Name of the Game: Progress Through Precision. Precision in Understanding; Precision in Modeling; Precision in Analysis; Doing Modeling and Analysis with Low Cost

    6. 6 1. Need for Precision in Understanding: The “crooked barrier” quiz

    7. 7 Need for Precision in Understanding: The “crooked barrier” quiz

    8. 8 Need for Precision in Understanding: The “crooked barrier” quiz

    9. 9 Need for Precision in Understanding: The “crooked barrier” quiz

    10. 10 Need for Precision in Understanding: The “crooked barrier” quiz

    11. 11 Need for Precision in Understanding: The “crooked barrier” quiz

    12. 12 Need for Precision in Understanding: The “crooked barrier” quiz

    13. 13 Would you rather explain each conceivable situation in a large API with an elaborate “bee dance” and informal English…. or would you rather specify it mathematically and let the user calculate the outcomes?

    14. 14

    15. 15

    16. 16 Executable Formal Specification can help validate our understanding of MPI …

    17. 17 Subject the system (or a reduced version of the system) to a collection of inputs (and hence execution paths). Concrete example: when codes are ported, they typically break.

    18. 18 Subject the system (or a reduced version of the system) to a collection of inputs (and hence execution paths). Concrete example: when codes are ported, they typically break.

    19. 19 Error-trace Visualization in VisualStudio

    20. 20 2. Precision in Modeling: The “Byte-range Locking Protocol” Challenge Asked to see if new protocol using MPI 1-sided was OK…

    21. 21 Precision in Modeling: The “Byte-range Locking Protocol” Challenge. Studied code. Wrote Promela verification model (a week). Applied the SPIN model checker. Found two previously unknown deadlocks. Wrote paper (EuroPVM / MPI 2006) with Thakur and Gropp, which won one of the three best-paper awards. With new insight, designed a correct AND faster protocol! Still, we felt lucky: what if we had missed the error while hand-modeling? Also, hand-modeling was NO FUN. How about running the real MPI code “cleverly”?

    22. 22 Measurement under Low Contention

    23. 23 Measurement under High Contention

    24. 24 4. Modeling and Analysis with Reduced Cost…

    25. 25 What works for cards works for MPI (and for PThreads also) !!

    26. 26 4. Modeling and Analysis with Reduced Cost: The “Byte-range Locking Protocol” Challenge. Studied code → DID NOT STUDY CODE. Wrote Promela verification model (a week) → NO MODELING. Applied the SPIN model checker → NEW ISP VERIFIER. Found two previously unknown deadlocks → FOUND SAME! Wrote paper (EuroPVM / MPI 2007) with Thakur and Gropp; won one of the three best-paper awards → DID NOT WIN. Still, we felt lucky: what if we had missed the error while hand-modeling? → NO NEED TO FEEL LUCKY (NO LOST INTERLEAVING, but also did not foolishly do ALL interleavings). Also, hand-modeling was NO FUN; how about running the real MPI code “cleverly”? → DIRECT RUNNING WAS FUN.

    27. 27 3. Precision in Analysis The “crooked barrier” quiz again …

    28. 28 3. Precision in Analysis The “crooked barrier” quiz again …

    29. 29 Precision in Analysis: POE works great (all 41 Umpire test suites run). No need to “pad” delay statements to jiggle the schedule and force “the other” interleaving; this is a very brittle trick anyway! Prelim version under submission; detailed version for EuroPVM… Jitterbug uses this approach; we don’t need it. Siegel (MPI_SPIN): modeling effort. Marmot: different coverage guarantees.

    30. 30 1-4: Finally! Precision and Low Cost in Modeling and Analysis, taking advantage of MPI semantics (in our heads…)

    31. 31 Discover All Potential Senders by Collecting (but not issuing) operations at runtime…

    32. 32 Rewrite “ANY” to ALL POTENTIAL SENDERS

    33. 33 Rewrite “ANY” to ALL POTENTIAL SENDERS

    34. 34 Recurse over all such configurations !
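The idea on slides 31 to 34 (collect the potential senders, rewrite each wildcard receive into one specific receive per sender, and recurse) can be sketched as a small search. This is a toy model under stated assumptions, not ISP's implementation: process 0 posts only wildcard receives, every other process posts at most one send to process 0, and the names `Outcome`, `explore`, and `explore_wildcards` are hypothetical.

```c
#include <assert.h>

#define MAX_PROCS 8

/* Result of exploring every way the wildcard receives can be matched. */
typedef struct {
    int completed;   /* executions in which every receive found a sender */
    int deadlocked;  /* executions in which a receive had no sender left */
} Outcome;

/* pending: bit i is set when process i has an unmatched send to process 0.
 * recvs_left: wildcard receives process 0 has still to complete. */
static void explore(unsigned pending, int recvs_left, Outcome *out)
{
    if (recvs_left == 0) { out->completed++; return; }
    if (pending == 0)    { out->deadlocked++; return; }

    /* Rewrite the wildcard into a specific receive for each potential
     * sender, and recurse over each resulting configuration. */
    for (int src = 1; src < MAX_PROCS; src++) {
        unsigned bit = 1u << src;
        if (pending & bit)
            explore(pending & ~bit, recvs_left - 1, out);
    }
}

/* Processes 1..nsenders each send once; process 0 posts nrecvs wildcards. */
Outcome explore_wildcards(int nsenders, int nrecvs)
{
    assert(nsenders >= 0 && nsenders < MAX_PROCS && nrecvs >= 0);
    unsigned pending = 0;
    for (int i = 1; i <= nsenders; i++)
        pending |= 1u << i;

    Outcome out = {0, 0};
    explore(pending, nrecvs, &out);
    return out;
}
```

In this model, three senders and three wildcard receives give six matchings to explore, while two senders and three receives leave the last receive stuck on every path.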

    35. 35 If we now have P0-P2 doing this, and P3-5 doing the same computation between themselves, no need to interleave these groups…

    36. 36 Why is all this worth doing ?

    37. 37 MPI is the de-facto standard for programming cluster machines

    38. 38

    39. 39 The Need for Formal Semantics for MPI: Send; Receive; Send / Receive; Send / Receive / Replace; Broadcast; Barrier; Reduce

    40. 40 Need Formal Semantics for MPI, because we can’t imitate any existing implementation…

    41. 41 Look for commonly committed mistakes automatically: deadlocks, communication races, resource leaks. We are only after “low hanging” bugs…

    42. 42 Deadlock pattern… Here is a small program snippet, taken and modified from an example in one of Pacheco’s books. We have modified it so it has a deadlock. Can you find it? It’s a simple off-by-one. We can find it and others like it.

    43. 43 Communication Race Pattern…

    44. 44 Resource Leak Pattern…

    45. 45 Bugs are hidden within huge state-spaces… Partial order reduction is a technique that chooses a set of representative execution interleaving orders or “traces” for the model checker to explore. Depending on which properties are to be proved by the model checker, more aggressive reduction can be achieved. The execution trace to explore is determined by the independence of the transitions in the trace and the type of property to be verified. Presently we are interested in (i) local assertions, (ii) deadlocks, and (iii) cycles. When two transitions are independent and enabled in the same state you can defer the execution of one transition in preference to the other without missing any property violations.

    46. 46 Partial Order Reduction Illustrated… With 3 processes, the size of an interleaved state space is p^s = 27. Partial-order reduction explores representative sequences from each equivalence class. It delays the execution of independent transitions. In this example, it is possible to “get away” with 7 states (one interleaving).
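The 27-versus-7 count can be reproduced with a small state-space search. The slide's figure is not in the transcript, so the sketch below assumes three processes with two mutually independent transitions each (three local states per process), which is consistent with both numbers. The function names are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define PROCS 3
#define LOCAL_STATES 3     /* assumed: 2 transitions per process */
#define GLOBAL_STATES 27   /* LOCAL_STATES raised to the power PROCS */

/* Encode the vector of program counters as one base-3 number. */
static int encode(const int pc[PROCS])
{
    int id = 0;
    for (int i = 0; i < PROCS; i++)
        id = id * LOCAL_STATES + pc[i];
    return id;
}

static void dfs(int pc[PROCS], int reduced, unsigned char seen[GLOBAL_STATES])
{
    int id = encode(pc);
    if (seen[id])
        return;
    seen[id] = 1;

    for (int i = 0; i < PROCS; i++) {
        if (pc[i] == LOCAL_STATES - 1)
            continue;              /* process i has finished */
        pc[i]++;
        dfs(pc, reduced, seen);
        pc[i]--;
        if (reduced)
            break;                 /* all transitions are independent, so
                                      one representative order suffices */
    }
}

/* Number of distinct global states visited; reduced != 0 defers every
 * enabled transition except the first one found. */
int count_states(int reduced)
{
    int pc[PROCS] = {0, 0, 0};
    unsigned char seen[GLOBAL_STATES];
    memset(seen, 0, sizeof seen);
    dfs(pc, reduced, seen);

    int n = 0;
    for (int i = 0; i < GLOBAL_STATES; i++)
        n += seen[i];
    return n;
}
```

Under these assumptions `count_states(0)` visits all 27 global states, while `count_states(1)` follows a single interleaving through 7 of them. The reduction is only sound here because no transition can enable, disable, or conflict with another.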

    47. 47 Here is a small program snippet, taken and modified from an example in one of Pacheco’s books. We have modified it so it has a deadlock. Can you find it? It’s a simple off-by-one. We can find it and others like it.

            // Add up integrals calculated by each process
            if (my_rank == 0) {
                total = integral;
                for (source = 0; source < p; source++) {
                    MPI_Recv(&integral, 1, MPI_FLOAT, source, tag, MPI_COMM_WORLD, &status);
                    total = total + integral;
                }
            } else {
                MPI_Send(&integral, 1, MPI_FLOAT, dest, tag, MPI_COMM_WORLD);
            }
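For readers who want the answer to the quiz: the loop starts at `source = 0`, so rank 0 posts a blocking `MPI_Recv` from itself, and no rank ever posts the matching send. The sketch below models that matching without MPI. It assumes ranks 1 to p-1 each send exactly once to rank 0 (that is, `dest` is 0), and the helper name `first_stuck_receive` is hypothetical.

```c
#include <assert.h>

/* Toy model of the slide's reduction, with no MPI required.
 * Rank 0 posts blocking receives from first_source, first_source + 1,
 * ..., p - 1, in that order. Ranks 1..p-1 each post one send to rank 0.
 * Returns the source rank of the first receive that can never be
 * matched, or -1 if every receive has a matching send. */
int first_stuck_receive(int first_source, int p)
{
    assert(p >= 1 && first_source >= 0);
    for (int source = first_source; source < p; source++) {
        /* Only ranks 1..p-1 send; rank 0 never sends to itself. */
        int has_matching_send = (source >= 1);
        if (!has_matching_send)
            return source;
    }
    return -1;
}
```

With four processes, `first_stuck_receive(0, 4)` reports rank 0 as the stuck receive, and starting the loop at 1 instead leaves no receive unmatched.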

    48. 48 Organization of ISP

    49. 49 Summary (have posters for each). Formal semantics for a large subset of MPI 2.0: executable semantics for about 150 MPI 2.0 functions; user interactions through VisualStudio API. Direct execution of user MPI programs to find issues: downscale code, remove data that does not affect control, etc.; new partial order reduction algorithm explores only relevant interleavings; user can insert barriers to contain complexity; new vector-clock algorithm determines if barriers are safe; errors detected: deadlocks, communication races, resource leaks. Direct execution of PThread programs to find issues: adaptation of dynamic partial order reduction reduces interleavings; parallel implementation scales linearly.

    50. 50 Also built POR explorer for C / Pthreads programs, called “Inspect”

    51. 51 Dynamic POR is almost a “must” !

    52. 52 Why Dynamic POR ?

    53. 53 Why Dynamic POR ?

    54. 54 Computation of “ample” sets in Static POR versus in DPOR

    55. 55 We target C/C++ PThread programs. Instrument the given program (largely automated). Run the concurrent program “till the end”. Record interleaving variants while advancing. When the number of recorded backtrack points reaches a soft limit, spill work to other nodes. In one larger example, an 11-hour run was finished in 11 minutes using 64 nodes. A heuristic to avoid recomputations was essential for speed-up. First known distributed DPOR.

    56. 56 A Simple DPOR Example. t0: lock(t); unlock(t). t1: lock(t); unlock(t). t2: lock(t); unlock(t).
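To see how many schedules this three-thread example admits, the sketch below enumerates them by brute force. It is a plain enumeration rather than DPOR itself, and the model (program counters plus a lock holder) and the names `count_from` and `count_schedules` are assumptions made for illustration. Since every `lock(t)` targets the same mutex, the acquisitions are mutually dependent and each acquisition order is a distinct schedule.

```c
#include <assert.h>

#define THREADS 3

/* pc[i]: 0 = before lock(t), 1 = holding t, 2 = finished.
 * holder: index of the thread holding t, or -1 when t is free.
 * Returns the number of complete schedules reachable from this state. */
static int count_from(int pc[THREADS], int holder)
{
    int enabled = 0, total = 0;

    for (int i = 0; i < THREADS; i++) {
        if (pc[i] == 0 && holder == -1) {   /* lock(t) is enabled */
            enabled = 1;
            pc[i] = 1;
            total += count_from(pc, i);
            pc[i] = 0;                      /* backtrack */
        } else if (pc[i] == 1) {            /* unlock(t) is enabled */
            assert(holder == i);
            enabled = 1;
            pc[i] = 2;
            total += count_from(pc, -1);
            pc[i] = 1;                      /* backtrack */
        }
    }
    /* No enabled step means all threads finished: one full schedule. */
    return enabled ? total : 1;
}

int count_schedules(void)
{
    int pc[THREADS] = {0, 0, 0};
    return count_from(pc, -1);
}
```

The enumeration finds 3! = 6 schedules, one per order in which the threads acquire `t`. Each state where more than one thread could take the lock is a point a dynamic explorer would have to revisit.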

    57. 57

    58. 58 Idea for parallelization: Explore computations from the backtrack set in other processes. “Embarrassingly Parallel” – it seems so, anyway !

    59. 59

    60. 60 Speedup on aget

    61. 61 Speedup on bbuf

    62. 62 Historical Note: Model checking proposed in 1981; 2007 ACM Turing Award for Clarke, Emerson, and Sifakis. Bug discovery facilitated by: the creation of simplified models; exhaustively checking the models; exploring only relevant interleavings.

    63. 63 Looking ahead… Plans for one year out…

    64. 64 Finish tool implementation for MPI and others… Static analysis to reduce some cost. Inserting barriers (to contain cost) using new vector-clocking algorithm for MPI. Demonstrate on meaningful apps (e.g. Parmetis). Plug into MS VisualStudio. Development of PThread (“Inspect”) tool with same capabilities. Evolving these tools to Transactional Memory, Microsoft TPL, OpenMP, …

    65. 65 Thanks Microsoft! And Dennis Crain, Shahrokh Mortazavi. In these times of unpredictable NSF funding, the HPC Institute Program made it possible for us to produce a great cadre of Formal Verification Engineers: Robert Palmer (PhD, to join Microsoft soon), Sonjong Hwang (MS), Steve Barrus (BS), Salman Pervez (MS), Yu Yang (PhD), Sarvani Vakkalanka (PhD), Guodong Li (PhD), Subodh Sharma (PhD), Anh Vo (PhD), Michael DeLisi (BS/MS), Geof Sawaya (BS) (http://www.cs.utah.edu/formal_verification). Microsoft HPC Institutes, NSF CNS 0509379

    66. 66 Extra Slides

    67. 67 Looking Further Ahead: Need to clear “idea log-jam in multi-core computing…” There isn’t such a thing as an architectural-only solution, or a compilers-only solution to future problems in multi-core computing…

    68. 68 Now you see it; now you don’t! On the menace of non-reproducible bugs. Deterministic replay must ideally be an option. User-programmable schedulers greatly emphasized by expert developers. Runtime model-checking methods with state-space reduction hold promise in meshing with current practice…
