(1) Formal Verification for Message Passing and GPU Computing (2) XUM: An Experimental Multicore supporting MCAPI

Presentation Transcript


  1. (1) Formal Verification for Message Passing and GPU Computing (2) XUM: An Experimental Multicore supporting MCAPI Ganesh Gopalakrishnan School of Computing, University of Utah and Center for Parallel Computing (CPU) http://www.cs.utah.edu/fv

  2. General Theme • Take FM where it hasn’t gone before • Only a handful of researchers work in these crucially important domains • Explore the space of concurrency based on message-passing APIs • A bit of a mid-life crisis for an FV person: learning which areas of SW design need help…

  3. Ideas I hope to present • L1: How to test message passing programs used in HPC • Recognize the ubiquity of certain APIs in critical areas (e.g. MPI, in HPC) • With proper semantic characterization, we can formally understand / teach, and formally test • No need to outright dismiss these APIs as “too hairy” • (to the contrary) be able to realize fundamental issues that will be faced in any attempt along the same lines • Long-lived APIs create a tool ecosystem around them that becomes harder to justify replacing • What it takes to print a line in a “real setting” • Need to build stack-walker to peel back profiling layers, and locate the actual source line • Expensive – and pointless – to roll your own stack walker

  4. Ideas I hope to present • L2: How to test at scale • The only practical way to detect communication non-determinism (in this domain) • Can form the backbone of future large-scale replay-based debugging tools

  5. Ideas I hope to present • Realize that the multicore landscape is rapidly changing • Accelerators (e.g. GPUs) are growing in use • Multicore CPUs and GPUs will be integrated more tightly • Energy is a first-rate currency • Lessons learned from the embedded systems world are very relevant

  6. Ideas I hope to present • L3: Creating dedicated verification tools for GPU kernels • How symbolic verification methods can be effectively used to analyze GPU kernel functions • Status of tool and future directions

  7. Ideas I hope to present • L4: Designing an experimental message-passing multicore • Implements an emerging message passing standard called MCAPI in silicon • How the design of special instructions can help with fast messaging • How features in the Network on Chip (NoC) can help support the semantics of MCAPI • Community involvement in the creation of such tangible artifacts can be healthy • Read “The Future of Microprocessors” in a recent CACM, by Shekhar Borkar and Andrew Chien

  8. Organization • Today • MPI and dyn. FV • Tomorrow • GPU computing and FV • XUM

  9. Context/Motivation: Demand for cycles! • Terascale • Petascale • Exascale • Zettascale

  10. More compute power enables new discoveries, solves new problems • Molecular dynamics simulations • Better drug design facilitated • Sanbonmatsu et al., FSE 2010 keynote • 290 days of computation to simulate the interactions of 2 million atoms over 2 nanoseconds • Better “oil caps” can be designed if we have the right compute infrastructure • Gropp, SC 2010 panel

  11. Commonality among different scales; also, “HPC” will increasingly go embedded • APIs: MPI, CUDA / OpenCL, OpenMP, Pthreads, Multicore Association APIs • Platforms: high-end machines for HPC / cloud, desktop servers and compute servers, embedded systems and devices

  12. Difficult Road Ahead w.r.t. Debugging • Concurrent software debugging is hard • Gets harder as the degree of parallelism in applications increases • Node level: Message Passing Interface (MPI) • Core level: threads, OpenMP, CUDA • Hybrid programming will be the future • MPI + threads • MPI + OpenMP • MPI + CUDA • Yet tools are lagging behind! • Many tools cannot operate at scale and give measurable coverage • (Figure labels: HPC Apps, HPC Correctness Tools)

  13. High-end Debugging Methods are often Expensive, Inconclusive • Expensive machines, resources • $3M of electricity a year (megawatt) • $1B to install hardware • Months of planning to get runtime on cluster • Debugging tools/methods are primitive • Extreme-Scale goal unrealistic w/o better approaches • Inadequate attention from “CS” • Little/no Formal Software Engineering methods • Almost zero critical mass

  14. Importance of Message Passing in HPC (MPI) • Born ~1994 • The world’s fastest CPU ran at 68 MHz • The Internet had 600 sites then! • Java was still not around • Still dominant in 2011 • Large investments in applications, tooling support • Credible FV research in HPC must include MPI • Use of message passing is growing • Erlang, actor languages, MCAPI, .NET async … (not yet for HPC) • Streams in CUDA, Queues in OpenCL,…

  15. Trend: Hybrid Concurrency • (Figure: the hardware/software stack – problem-solving-environment-based user applications and monolithic large-scale MPI-based user applications; problem-solving environments, e.g. Uintah, Charm++, ADLB; high-performance MPI libraries; concurrent data structures; InfiniBand-style interconnect; GeForce GTX 480 (NVIDIA), Sandy Bridge (courtesy anandtech.com), AMD Fusion APU)

  16. MPI Verification approach depends on type of determinism • Execution Deterministic • Basically one computation per input data • Value Deterministic • Multiple computations, but yield same “final answer” • Nondeterministic • Basically reactive programs built around message passing, possibly also using threads Examples to follow

  17.–20. An example of parallelizing matrix multiplication using message passing (figure, built up over four slides: the matrix product A × B, annotated with the MPI_Bcast, MPI_Send, and MPI_Recv calls that distribute rows to the workers and collect their results)

  21. Unoptimized Initial Version: Execution Deterministic • The master receives results in a fixed order: MPI_Recv(from: P0, P1, P2, P3, …) • It then sends the next row (MPI_Send) to the first worker, which by now must be free

  22. Later Optimized Version: Value Deterministic • Opportunistically send work to the processor that finishes first • MPI_Recv(from: *) • Send the next row (MPI_Send) to the first worker that returns an answer!

  23. Still More Optimized Value-Deterministic Versions • Communications are made non-blocking and software-pipelined (still expected to remain value-deterministic) • MPI_Recv(from: *) • Send the next row (MPI_Send) to the first worker that returns an answer!
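As an illustration of the optimized, value-deterministic master loop sketched on slides 21-23, here is a minimal C/MPI sketch. The matrix size, the use of the message tag to carry the row index, and the omission of the worker code and of the end-of-work message are all simplifying assumptions.

#include <mpi.h>
#include <string.h>

#define N 512   /* assumed matrix dimension */

/* Master (rank 0): broadcast B, hand out rows of A one at a time, and
 * give the next row to whichever worker replies first. The MPI tag
 * carries the row index so each result lands in the right row of C. */
void master(int nworkers, double A[N][N], double B[N][N], double C[N][N])
{
    int next = 0, done = 0;
    MPI_Bcast(B, N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Prime each worker with one row. */
    for (int w = 1; w <= nworkers && next < N; ++w) {
        MPI_Send(A[next], N, MPI_DOUBLE, w, next, MPI_COMM_WORLD);
        ++next;
    }
    while (done < N) {
        double row[N];
        MPI_Status st;
        /* The wildcard receive is what makes this version value-
         * deterministic but not execution-deterministic. */
        MPI_Recv(row, N, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &st);
        memcpy(C[st.MPI_TAG], row, N * sizeof(double));
        ++done;
        if (next < N) {
            MPI_Send(A[next], N, MPI_DOUBLE, st.MPI_SOURCE, next,
                     MPI_COMM_WORLD);
            ++next;
        }
    }
}

Replacing MPI_ANY_SOURCE and MPI_ANY_TAG with a fixed receive order gives the unoptimized, execution-deterministic version of slide 21.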

  24. Typical MPI Programs • Value-Nondeterministic MPI programs do exist • Adaptive Dynamic Load Balancing Libraries • But most are value deterministic or execution deterministic • Of course, one does not really know w/o analysis! • Detect replay non-determinism over schedule space • Races can creep into MPI programs • Forgetting to Wait for MPI non-blocking calls to finish • Floating point can make things non-deterministic

  25. Gist of bug-hunting story • MPI programs “die by the bite of a thousand mosquitoes” • No major vulnerabilities one can focus on • E.g. in Thread Programming, focusing on races • With MPI, we need comprehensive “Bug Monitors” • Building MPI bug monitors requires collaboration • Lucky to have collaborations with DOE labs • The lack of FV critical mass hurts

  26. A real-world bug • Every process runs: Send( (rank+1)%N ); Recv( (rank-1)%N ); • Expected “circular” msg passing • Found that P0’s Recv entirely vanished!! • REASON: ?? • In C, -1 % N is not N-1 but rather -1 itself • In MPI, “-1” is MPI_PROC_NULL • A Recv posted on MPI_PROC_NULL is ignored!

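A minimal C sketch of the bug and the usual fix; the variable names are illustrative, and, as the slide notes, the silent loss of the receive relies on the MPI implementation defining MPI_PROC_NULL as -1.

#include <mpi.h>

void ring_step(int rank, int N)
{
    int msg = rank, got;
    int left_buggy = (rank - 1) % N;      /* in C this is -1 when rank == 0 */
    int left_fixed = (rank - 1 + N) % N;  /* N-1 when rank == 0, as intended */
    (void)left_buggy;

    /* With left_buggy, rank 0 posts its Recv on source -1; where the MPI
     * library defines MPI_PROC_NULL as -1, that Recv completes immediately
     * without receiving anything, so the message from rank N-1 is never
     * consumed and the "circular" exchange silently breaks. */
    MPI_Send(&msg, 1, MPI_INT, (rank + 1) % N, 0, MPI_COMM_WORLD);
    MPI_Recv(&got, 1, MPI_INT, left_fixed, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
}

(Note that this Send-then-Recv ring also relies on Send buffering to complete, which is exactly the resource dependence discussed on slides 36-42.)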

  28. MPI Bugs – more anecdotal evidence • Bug encountered at large scale w.r.t. famous MPI library (Vo) • Bug was absent at a smaller scale • It was a concurrency bug • Attempt to implement collective communication (Thakur) • Bug exists for ranges of size parameters • Wrong assumption: that MPI barrier was irrelevant (Siegel) • It was not – a communication race was created • Other common bugs (we see it a lot; potentially concurrency dep.) • Forgetting to wait for non-blocking receive to finish • Forgetting to free up communicators and type objects • Some codes may be considered buggy if non-determinism arises! • Use of MPI_Recv(*) often does not result in non-deterministic execution • Need something more than “superficial inspection”

  29. Real bug stories in the MPI-land • Typing a[i][i] = init instead of a[i][j] = init • Communication races • Unintended send matches “wildcard receive” • Bugs that show up when ported • Runtime buffering changes; deadlocks erupt • Sometimes, bugs show up when buffering added! • Misunderstood “Collective” semantics • Broadcast does not have “barrier” semantics • MPI + threads • Royal troubles await the newbies

  30. Our Research Agenda in HPC • Solve FV of Pure MPI Applications “well” • Progress in non-determinism coverage for fixed test harness • MUST integrate with good error monitors • (Preliminary) Work on hybrid MPI + Something • Something = Pthreads and CUDA so far • Evaluated heuristics for deterministic replay of Pthreads + MPI • Work on CUDA/OpenCL Analysis • Good progress on Symbolic Static Analyzer for CUDA Kernels • (Prelim.) progress on Symbolic Test Generator for CUDA Pgms • (Future) Symbolic Test Generation to “crash” hybrid pgms • Finding lurking crashes may be a communicable value proposition • (Future) Intelligent schedule-space exploration • Focus on non-monolithic MPI programs

  31. Motivation for Coverage of Communication Nondeterminism

  32. Eliminating wasted search in message passing verif. • P0: MPI_Send(to P1, …); MPI_Send(to P1, data=22); • P1: MPI_Recv(from P0, …); MPI_Recv(from P2, …); MPI_Recv(*, x); if (x==22) then error1 else MPI_Recv(*, x); • P2: MPI_Send(to P1, …); MPI_Send(to P1, data=33);

  33. A frequently followed approach: “boil the whole schedule space” – often very wasteful • Richard Vuduc, Martin Schulz, Dan Quinlan, Bronis de Supinski, and Andreas Sæbjörnsen. Improving distributed memory applications testing by message perturbation. In Proc. 4th Parallel and Distributed Testing and Debugging (PADTAD) Workshop, at the International Symposium on Software Testing and Analysis (ISSTA), Portland, ME, USA, July 2006.

  34. Eliminating wasted work in message passing verif. • No need to play with schedules of deterministic actions • P0: MPI_Send(to P1, …); MPI_Send(to P1, data=22); • P1: MPI_Recv(from P0, …); MPI_Recv(from P2, …); MPI_Recv(*, x); if (x==22) then error1 else MPI_Recv(*, x); • P2: MPI_Send(to P1, …); MPI_Send(to P1, data=33); • But consider these two cases…
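A C rendering of the slide's example (tags and payloads are assumptions, and only ranks 0, 1, and 2 take part) shows why only the wildcard receives need schedule exploration:

#include <mpi.h>

void wildcard_example(int rank)
{
    int v = 0, x = 0;
    if (rank == 0) {                               /* P0 */
        MPI_Send(&v, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        v = 22;
        MPI_Send(&v, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 2) {                        /* P2 */
        MPI_Send(&v, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        v = 33;
        MPI_Send(&v, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {                        /* P1 */
        /* These two receives are deterministic: each has exactly one
         * possible matching send, so no schedule exploration is needed. */
        MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(&x, 1, MPI_INT, 2, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* Only this wildcard receive has two possible matches: it may take
         * P0's data=22 or P2's data=33. */
        MPI_Recv(&x, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        if (x == 22) {
            /* error1: reachable only under one of the two matchings */
        } else {
            MPI_Recv(&x, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
    }
}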

  35. Need to detect Resource-Dependent Bugs

  36. Example of Resource Dependent Bug • P0: Send(to:1); Recv(from:1); • P1: Send(to:0); Recv(from:0); • We know that this program may deadlock with less Send buffering

  37. Example of Resource Dependent Bug • P0: Send(to:1); Recv(from:1); • P1: Send(to:0); Recv(from:0); • … and this program may avoid a deadlock with more Send buffering
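Slides 36 and 37 rendered in C, assuming exactly two ranks; whether the exchange completes depends on whether MPI_Send buffers the message (eager protocol, typical for small messages) or waits for the matching receive (rendezvous, typical for large ones):

#include <mpi.h>

#define COUNT 1024   /* assumed message size */

void head_to_head(int rank)
{
    int sendbuf[COUNT] = {0}, recvbuf[COUNT];
    int peer = 1 - rank;   /* rank 0 talks to rank 1 and vice versa */

    /* Both ranks send first and receive second. With enough Send
     * buffering both sends return and the receives then match; with too
     * little buffering, both ranks block inside MPI_Send and the program
     * deadlocks. */
    MPI_Send(sendbuf, COUNT, MPI_INT, peer, 0, MPI_COMM_WORLD);
    MPI_Recv(recvbuf, COUNT, MPI_INT, peer, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
}

One portable way to remove the buffering dependence is MPI_Sendrecv (or reordering so that one side receives first); a dynamic verifier such as ISP instead hunts for the zero-buffering deadlock directly.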

  38. Example of Resource Dependent Bug • P0: Send(to:1); Send(to:2); • P1: Send(to:2); Recv(from:0); • P2: Recv(from:*); Recv(from:0); • … but this program deadlocks if Send(to:1) has more buffering!

  42. Example of Resource Dependent Bug • P0: Send(to:1); Send(to:2); • P1: Send(to:2); Recv(from:0); • P2: Recv(from:*); Recv(from:0); • Mismatched sends and receives – hence a deadlock • … this program deadlocks if Send(to:1) has more buffering!
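For contrast with slides 36-37, here are slides 38-42 in C (payloads and tags are assumptions, and only ranks 0, 1, and 2 take part); the comments spell out why extra buffering on Send(to:1) is what enables the deadlock:

#include <mpi.h>

void buffering_induced_deadlock(int rank)
{
    int m = rank;
    if (rank == 0) {
        MPI_Send(&m, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* Send(to:1) */
        MPI_Send(&m, 1, MPI_INT, 2, 0, MPI_COMM_WORLD);   /* Send(to:2) */
    } else if (rank == 1) {
        MPI_Send(&m, 1, MPI_INT, 2, 0, MPI_COMM_WORLD);   /* Send(to:2) */
        MPI_Recv(&m, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else if (rank == 2) {
        /* With zero buffering, P0 stays blocked in Send(to:1) until P1
         * reaches Recv(from:0), so this wildcard can only match P1's
         * Send(to:2) and the program completes. If Send(to:1) is buffered,
         * P0 races ahead to Send(to:2); if the wildcard matches that
         * message instead, P1's Send(to:2) is left without a receiver and
         * the Recv(from:0) below without a sender: a deadlock. */
        MPI_Recv(&m, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Recv(&m, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}

This is why testing only under zero buffering (the misunderstanding on the next slide) can miss deadlocks.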

  43. Widely Publicized Misunderstandings • “Your program is deadlock free if you have successfully tested it under zero buffering”

  44. MPI at fault? • Perhaps partly • Over 17 years of MPI, things have changed • Inevitable use of shared-memory cores, GPUs, … • Yet many of the issues seem fundamental: • The need for wide adoption across problems, languages, machines • The need to give the programmer a better handle on resource usage • How to evolve out of MPI? • Whom do we trust to reset the world? • Will they get it any better? • What about the train wreck meanwhile? • Must one completely evolve out of MPI?

  45. Our Impact So Far • ISP and DAMPI target large-scale MPI-based user applications • Useful formalizations help test problem-solving environments (e.g. Uintah, Charm++, ADLB), high-performance MPI libraries, and concurrent data structures • PUG and GKLEE target GPU kernels • (Figure: the stack of slide 15 – InfiniBand-style interconnect, GeForce GTX 480 (NVIDIA), Sandy Bridge (courtesy anandtech.com), AMD Fusion APU – annotated with these tools)

  46. Outline for L1 • Dynamic formal verification of MPI • It is basically testing which discovers all alternate schedules • Coverage of communication non-determinism • Also gives us a “predictive theory” of MPI behavior • Centralized approach : ISP • GEM: Tool Integration within Eclipse Parallel Tools Platform • Demo of GEM

  47. A Simple MPI Example • Process P0: Isend(1, req); Barrier; Wait(req); • Process P1: Irecv(*, req); Barrier; Recv(2); Wait(req); • Process P2: Barrier; Isend(1, req); Wait(req);

  48. A Simple MPI Example • Process P0: Isend(1, req); Barrier; Wait(req); • Process P1: Irecv(*, req); Barrier; Recv(2); Wait(req); • Process P2: Barrier; Isend(1, req); Wait(req); • Non-blocking send – the send lasts from Isend to Wait • The send buffer can be reclaimed only after Wait clears • Forgetting to issue Wait → MPI “request object” leak

  50. A Simple MPI Example • Process P0: Isend(1, req); Barrier; Wait(req); • Process P1: Irecv(*, req); Barrier; Recv(2); Wait(req); • Process P2: Barrier; Isend(1, req); Wait(req); • Non-blocking receive – lasts from Irecv to Wait • The receive buffer can be examined only after Wait clears • Forgetting to issue Wait → MPI “request object” leak
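A compilable C version of this example (payloads and tags are assumptions, and exactly three ranks are assumed) makes the Wait obligations concrete:

#include <mpi.h>

void simple_example(int rank)
{
    MPI_Request req;
    int buf = rank, x = 0, y = 0;

    if (rank == 0) {            /* P0: Isend(1, req); Barrier; Wait(req) */
        MPI_Isend(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Wait(&req, MPI_STATUS_IGNORE);  /* dropping this leaks the request
                                               object and makes reuse of buf unsafe */
    } else if (rank == 1) {     /* P1: Irecv(*, req); Barrier; Recv(2); Wait(req) */
        MPI_Irecv(&x, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &req);
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Recv(&y, 1, MPI_INT, 2, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Wait(&req, MPI_STATUS_IGNORE);  /* x may be examined only after this */
    } else if (rank == 2) {     /* P2: Barrier; Isend(1, req); Wait(req) */
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Isend(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }
}

Note also that the Irecv(*) may be matched by either P0's or P2's Isend; if it takes P2's message, P1's blocking Recv(2) is left without a sender and the program deadlocks, which is exactly the kind of nondeterministic matching that the dynamic verification approach (ISP) is built to cover.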
