(1) Formal Verification for Message Passing and GPU Computing (2) XUM: An Experimental Multicore supporting MCAPI

Presentation Transcript


  1. (1) Formal Verification for Message Passing and GPU Computing (2) XUM: An Experimental Multicore supporting MCAPI Ganesh Gopalakrishnan School of Computing, University of Utah and Center for Parallel Computing (CPU) http://www.cs.utah.edu/fv

  2. General Theme • Take FM where it hasn’t gone before • Only a handful of researchers work in these crucially important domains • Explore the space of concurrency based on message-passing APIs • A bit of a mid-life crisis for an FV person: learning which areas of SW design need help…

  3. Ideas I hope to present • L1: How to test message passing programs used in HPC • Recognize the ubiquity of certain APIs in critical areas (e.g. MPI, in HPC) • With proper semantic characterization, we can formally understand / teach, and formally test • No need to outright dismiss these APIs as “too hairy” • (to the contrary) be able to realize fundamental issues that will be faced in any attempt along the same lines • Long-lived APIs create a tool ecosystem around them that becomes harder to justify replacing • What it takes to print a line in a “real setting” • Need to build stack-walker to peel back profiling layers, and locate the actual source line • Expensive – and pointless – to roll your own stack walker

  4. Ideas I hope to present • L2: How to test at scale • The only practical way to detect communication non-determinism (in this domain) • Can form the backbone of future large-scale replay-based debugging tools

  5. Ideas I hope to present • Realize that the multicore landscape is rapidly changing • Accelerators (e.g. GPUs) are growing in use • Multicore CPUs and GPUs will be integrated more tightly • Energy is a first-rate currency • Lessons learned from the embedded systems world are very relevant

  6. Ideas I hope to present • L3: Creating dedicated verification tools for GPU kernels • How symbolic verification methods can be effectively used to analyze GPU kernel functions • Status of tool and future directions

  7. Ideas I hope to present • L4: Designing an experimental message-passing multicore • Implements an emerging message passing standard called MCAPI in silicon • How the design of special instructions can help with fast messaging • How features in the Network on Chip (NoC) can help support the semantics of MCAPI • Community involvement in the creation of such tangible artifacts can be healthy • Read “The Future of Microprocessors” in a recent CACM, by Shekhar Borkar and Andrew Chien

  8. Organization • Today • MPI and dyn. FV • Tomorrow • GPU computing and FV • XUM

  9. Context/Motivation: Demand for cycles! • Terascale • Petascale • Exascale • Zettascale

  10. More compute power enables new discoveries, solves new problems • Molecular dynamics simulations • Better drug design facilitated • Sanbonmatsu et al., FSE 2010 keynote • 290 days of computation to simulate the interactions of 2 million atoms over 2 nanoseconds • Better “oil caps” can be designed if we have the right compute infrastructure • Gropp, SC 2010 panel

  11. Commonality among different scales; also, “HPC” will increasingly go embedded • APIs: MPI, CUDA / OpenCL, OpenMP, Pthreads, Multicore Association APIs • Platforms: high-end machines for HPC / cloud, desktop servers and compute servers, embedded systems and devices

  12. Difficult Road Ahead w.r.t. Debugging • Concurrent software debugging is hard • Gets harder as the degree of parallelism in applications increases • Node level: Message Passing Interface (MPI) • Core level: threads, OpenMP, CUDA • Hybrid programming will be the future • MPI + threads • MPI + OpenMP • MPI + CUDA • Yet tools are lagging behind! • Many tools cannot operate at scale and give measurable coverage • (Figure labels: HPC Apps, HPC Correctness Tools)

  13. High-end Debugging Methods are often Expensive, Inconclusive • Expensive machines, resources • $3M of electricity a year (megawatt) • $1B to install hardware • Months of planning to get runtime on cluster • Debugging tools/methods are primitive • Extreme-Scale goal unrealistic w/o better approaches • Inadequate attention from “CS” • Little/no Formal Software Engineering methods • Almost zero critical mass

  14. Importance of Message Passing in HPC (MPI) • Born ~1994 • The world’s fastest CPU ran at 68 MHz • The Internet had 600 sites then! • Java was still not around • Still dominant in 2011 • Large investments in applications, tooling support • Credible FV research in HPC must include MPI • Use of message passing is growing • Erlang, actor languages, MCAPI, .NET async … (not yet for HPC) • Streams in CUDA, Queues in OpenCL,…

  15. Trend: Hybrid Concurrency • (Figure: the hardware/software stack – problem-solving-environment-based user applications and monolithic large-scale MPI-based user applications; problem-solving environments, e.g. Uintah, Charm++, ADLB; high-performance MPI libraries; concurrent data structures; InfiniBand-style interconnect; GeForce GTX 480 (NVIDIA), Sandy Bridge (courtesy anandtech.com), AMD Fusion APU)

  16. MPI Verification approach depends on type of determinism • Execution Deterministic • Basically one computation per input data • Value Deterministic • Multiple computations, but yield same “final answer” • Nondeterministic • Basically reactive programs built around message passing, possibly also using threads Examples to follow

  17.–20. An example of parallelizing matrix multiplication using message passing (figure, built up over four slides: the matrix product A × B, annotated with the MPI_Bcast, MPI_Send, and MPI_Recv calls that distribute rows to the workers and collect their results)

  21. Unoptimized Initial Version: Execution Deterministic • The master receives results in a fixed order: MPI_Recv(from: P0, P1, P2, P3, …) • It then sends the next row (MPI_Send) to the first worker, which by now must be free

  22. Later Optimized Version: Value Deterministic • Opportunistically send work to the processor that finishes first • MPI_Recv(from: *) • Send the next row (MPI_Send) to the first worker that returns an answer!

  23. Still More Optimized Value-Deterministic Versions • Communications are made non-blocking and software-pipelined (still expected to remain value-deterministic) • MPI_Recv(from: *) • Send the next row (MPI_Send) to the first worker that returns an answer!
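As an illustration of the optimized, value-deterministic master loop sketched on slides 21-23, here is a minimal C/MPI sketch. The matrix size, the use of the message tag to carry the row index, and the omission of the worker code and of the end-of-work message are all simplifying assumptions.

#include <mpi.h>
#include <string.h>

#define N 512   /* assumed matrix dimension */

/* Master (rank 0): broadcast B, hand out rows of A one at a time, and
 * give the next row to whichever worker replies first. The MPI tag
 * carries the row index so each result lands in the right row of C. */
void master(int nworkers, double A[N][N], double B[N][N], double C[N][N])
{
    int next = 0, done = 0;
    MPI_Bcast(B, N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Prime each worker with one row. */
    for (int w = 1; w <= nworkers && next < N; ++w) {
        MPI_Send(A[next], N, MPI_DOUBLE, w, next, MPI_COMM_WORLD);
        ++next;
    }
    while (done < N) {
        double row[N];
        MPI_Status st;
        /* The wildcard receive is what makes this version value-
         * deterministic but not execution-deterministic. */
        MPI_Recv(row, N, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &st);
        memcpy(C[st.MPI_TAG], row, N * sizeof(double));
        ++done;
        if (next < N) {
            MPI_Send(A[next], N, MPI_DOUBLE, st.MPI_SOURCE, next,
                     MPI_COMM_WORLD);
            ++next;
        }
    }
}

Replacing MPI_ANY_SOURCE and MPI_ANY_TAG with a fixed receive order gives the unoptimized, execution-deterministic version of slide 21.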

  24. Typical MPI Programs • Value-Nondeterministic MPI programs do exist • Adaptive Dynamic Load Balancing Libraries • But most are value deterministic or execution deterministic • Of course, one does not really know w/o analysis! • Detect replay non-determinism over schedule space • Races can creep into MPI programs • Forgetting to Wait for MPI non-blocking calls to finish • Floating point can make things non-deterministic

  25. Gist of bug-hunting story • MPI programs “die by the bite of a thousand mosquitoes” • No major vulnerabilities one can focus on • E.g. in Thread Programming, focusing on races • With MPI, we need comprehensive “Bug Monitors” • Building MPI bug monitors requires collaboration • Lucky to have collaborations with DOE labs • The lack of FV critical mass hurts

  26. A real-world bug • Every process runs: Send( (rank+1)%N ); Recv( (rank-1)%N ); • Expected “circular” msg passing • Found that P0’s Recv entirely vanished!! • REASON: ?? • In C, -1 % N is not N-1 but rather -1 itself • In MPI, “-1” is MPI_PROC_NULL • A Recv posted on MPI_PROC_NULL is ignored!

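A minimal C sketch of the bug and the usual fix; the variable names are illustrative, and, as the slide notes, the silent loss of the receive relies on the MPI implementation defining MPI_PROC_NULL as -1.

#include <mpi.h>

void ring_step(int rank, int N)
{
    int msg = rank, got;
    int left_buggy = (rank - 1) % N;      /* in C this is -1 when rank == 0 */
    int left_fixed = (rank - 1 + N) % N;  /* N-1 when rank == 0, as intended */
    (void)left_buggy;

    /* With left_buggy, rank 0 posts its Recv on source -1; where the MPI
     * library defines MPI_PROC_NULL as -1, that Recv completes immediately
     * without receiving anything, so the message from rank N-1 is never
     * consumed and the "circular" exchange silently breaks. */
    MPI_Send(&msg, 1, MPI_INT, (rank + 1) % N, 0, MPI_COMM_WORLD);
    MPI_Recv(&got, 1, MPI_INT, left_fixed, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
}

(Note that this Send-then-Recv ring also relies on Send buffering to complete, which is exactly the resource dependence discussed on slides 36-42.)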

  28. MPI Bugs – more anecdotal evidence • Bug encountered at large scale w.r.t. famous MPI library (Vo) • Bug was absent at a smaller scale • It was a concurrency bug • Attempt to implement collective communication (Thakur) • Bug exists for ranges of size parameters • Wrong assumption: that MPI barrier was irrelevant (Siegel) • It was not – a communication race was created • Other common bugs (we see it a lot; potentially concurrency dep.) • Forgetting to wait for non-blocking receive to finish • Forgetting to free up communicators and type objects • Some codes may be considered buggy if non-determinism arises! • Use of MPI_Recv(*) often does not result in non-deterministic execution • Need something more than “superficial inspection”

  29. Real bug stories in the MPI-land • Typing a[i][i] = init instead of a[i][j] = init • Communication races • Unintended send matches “wildcard receive” • Bugs that show up when ported • Runtime buffering changes; deadlocks erupt • Sometimes, bugs show up when buffering added! • Misunderstood “Collective” semantics • Broadcast does not have “barrier” semantics • MPI + threads • Royal troubles await the newbies

  30. Our Research Agenda in HPC • Solve FV of Pure MPI Applications “well” • Progress in non-determinism coverage for fixed test harness • MUST integrate with good error monitors • (Preliminary) Work on hybrid MPI + Something • Something = Pthreads and CUDA so far • Evaluated heuristics for deterministic replay of Pthreads + MPI • Work on CUDA/OpenCL Analysis • Good progress on Symbolic Static Analyzer for CUDA Kernels • (Prelim.) progress on Symbolic Test Generator for CUDA Pgms • (Future) Symbolic Test Generation to “crash” hybrid pgms • Finding lurking crashes may be a communicable value proposition • (Future) Intelligent schedule-space exploration • Focus on non-monolithic MPI programs

  31. Motivation for Coverage of Communication Nondeterminism

  32. Eliminating wasted search in message passing verif. • P0: MPI_Send(to P1, …); MPI_Send(to P1, data=22); • P1: MPI_Recv(from P0, …); MPI_Recv(from P2, …); MPI_Recv(*, x); if (x==22) then error1 else MPI_Recv(*, x); • P2: MPI_Send(to P1, …); MPI_Send(to P1, data=33);

  33. A frequently followed approach: “boil the whole schedule space” – often very wasteful • Richard Vuduc, Martin Schulz, Dan Quinlan, Bronis de Supinski, and Andreas Sæbjörnsen. Improving distributed memory applications testing by message perturbation. In Proc. 4th Parallel and Distributed Testing and Debugging (PADTAD) Workshop, at the International Symposium on Software Testing and Analysis (ISSTA), Portland, ME, USA, July 2006.

  34. Eliminating wasted work in message passing verif. • No need to play with schedules of deterministic actions • P0: MPI_Send(to P1, …); MPI_Send(to P1, data=22); • P1: MPI_Recv(from P0, …); MPI_Recv(from P2, …); MPI_Recv(*, x); if (x==22) then error1 else MPI_Recv(*, x); • P2: MPI_Send(to P1, …); MPI_Send(to P1, data=33); • But consider these two cases…
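A C rendering of the slide's example (tags and payloads are assumptions, and only ranks 0, 1, and 2 take part) shows why only the wildcard receives need schedule exploration:

#include <mpi.h>

void wildcard_example(int rank)
{
    int v = 0, x = 0;
    if (rank == 0) {                               /* P0 */
        MPI_Send(&v, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        v = 22;
        MPI_Send(&v, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 2) {                        /* P2 */
        MPI_Send(&v, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        v = 33;
        MPI_Send(&v, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {                        /* P1 */
        /* These two receives are deterministic: each has exactly one
         * possible matching send, so no schedule exploration is needed. */
        MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(&x, 1, MPI_INT, 2, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* Only this wildcard receive has two possible matches: it may take
         * P0's data=22 or P2's data=33. */
        MPI_Recv(&x, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        if (x == 22) {
            /* error1: reachable only under one of the two matchings */
        } else {
            MPI_Recv(&x, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
    }
}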

  35. Need to detect Resource-Dependent Bugs

  36. Example of Resource Dependent Bug • P0: Send(to:1); Recv(from:1); • P1: Send(to:0); Recv(from:0); • We know that this program may deadlock with less Send buffering

  37. Example of Resource Dependent Bug • P0: Send(to:1); Recv(from:1); • P1: Send(to:0); Recv(from:0); • … and this program may avoid a deadlock with more Send buffering
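Slides 36 and 37 rendered in C, assuming exactly two ranks; whether the exchange completes depends on whether MPI_Send buffers the message (eager protocol, typical for small messages) or waits for the matching receive (rendezvous, typical for large ones):

#include <mpi.h>

#define COUNT 1024   /* assumed message size */

void head_to_head(int rank)
{
    int sendbuf[COUNT] = {0}, recvbuf[COUNT];
    int peer = 1 - rank;   /* rank 0 talks to rank 1 and vice versa */

    /* Both ranks send first and receive second. With enough Send
     * buffering both sends return and the receives then match; with too
     * little buffering, both ranks block inside MPI_Send and the program
     * deadlocks. */
    MPI_Send(sendbuf, COUNT, MPI_INT, peer, 0, MPI_COMM_WORLD);
    MPI_Recv(recvbuf, COUNT, MPI_INT, peer, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
}

One portable way to remove the buffering dependence is MPI_Sendrecv (or reordering so that one side receives first); a dynamic verifier such as ISP instead hunts for the zero-buffering deadlock directly.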

  38. Example of Resource Dependent Bug • P0: Send(to:1); Send(to:2); • P1: Send(to:2); Recv(from:0); • P2: Recv(from:*); Recv(from:0); • … but this program deadlocks if Send(to:1) has more buffering!

  42. Example of Resource Dependent Bug • P0: Send(to:1); Send(to:2); • P1: Send(to:2); Recv(from:0); • P2: Recv(from:*); Recv(from:0); • Mismatched sends and receives – hence a deadlock • … this program deadlocks if Send(to:1) has more buffering!
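For contrast with slides 36-37, here are slides 38-42 in C (payloads and tags are assumptions, and only ranks 0, 1, and 2 take part); the comments spell out why extra buffering on Send(to:1) is what enables the deadlock:

#include <mpi.h>

void buffering_induced_deadlock(int rank)
{
    int m = rank;
    if (rank == 0) {
        MPI_Send(&m, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* Send(to:1) */
        MPI_Send(&m, 1, MPI_INT, 2, 0, MPI_COMM_WORLD);   /* Send(to:2) */
    } else if (rank == 1) {
        MPI_Send(&m, 1, MPI_INT, 2, 0, MPI_COMM_WORLD);   /* Send(to:2) */
        MPI_Recv(&m, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else if (rank == 2) {
        /* With zero buffering, P0 stays blocked in Send(to:1) until P1
         * reaches Recv(from:0), so this wildcard can only match P1's
         * Send(to:2) and the program completes. If Send(to:1) is buffered,
         * P0 races ahead to Send(to:2); if the wildcard matches that
         * message instead, P1's Send(to:2) is left without a receiver and
         * the Recv(from:0) below without a sender: a deadlock. */
        MPI_Recv(&m, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Recv(&m, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}

This is why testing only under zero buffering (the misunderstanding on the next slide) can miss deadlocks.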

  43. Widely Publicized Misunderstandings • “Your program is deadlock free if you have successfully tested it under zero buffering”

  44. MPI at fault? • Perhaps partly • Over 17 years of MPI, things have changed • Inevitable use of shared-memory cores, GPUs, … • Yet many of the issues seem fundamental: • The need for wide adoption across problems, languages, machines • The need to give the programmer a better handle on resource usage • How to evolve out of MPI? • Whom do we trust to reset the world? • Will they get it any better? • What about the train wreck meanwhile? • Must one completely evolve out of MPI?

  45. Our Impact So Far • ISP and DAMPI target large-scale MPI-based user applications • Useful formalizations help test problem-solving environments (e.g. Uintah, Charm++, ADLB), high-performance MPI libraries, and concurrent data structures • PUG and GKLEE target GPU kernels • (Figure: the stack of slide 15 – InfiniBand-style interconnect, GeForce GTX 480 (NVIDIA), Sandy Bridge (courtesy anandtech.com), AMD Fusion APU – annotated with these tools)

  46. Outline for L1 • Dynamic formal verification of MPI • It is basically testing which discovers all alternate schedules • Coverage of communication non-determinism • Also gives us a “predictive theory” of MPI behavior • Centralized approach : ISP • GEM: Tool Integration within Eclipse Parallel Tools Platform • Demo of GEM

  47. A Simple MPI Example • Process P0: Isend(1, req); Barrier; Wait(req); • Process P1: Irecv(*, req); Barrier; Recv(2); Wait(req); • Process P2: Barrier; Isend(1, req); Wait(req);

  48. A Simple MPI Example • Process P0: Isend(1, req); Barrier; Wait(req); • Process P1: Irecv(*, req); Barrier; Recv(2); Wait(req); • Process P2: Barrier; Isend(1, req); Wait(req); • Non-blocking send – the send lasts from Isend to Wait • The send buffer can be reclaimed only after Wait clears • Forgetting to issue Wait → MPI “request object” leak

  50. A Simple MPI Example • Process P0: Isend(1, req); Barrier; Wait(req); • Process P1: Irecv(*, req); Barrier; Recv(2); Wait(req); • Process P2: Barrier; Isend(1, req); Wait(req); • Non-blocking receive – lasts from Irecv to Wait • The receive buffer can be examined only after Wait clears • Forgetting to issue Wait → MPI “request object” leak
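A compilable C version of this example (payloads and tags are assumptions, and exactly three ranks are assumed) makes the Wait obligations concrete:

#include <mpi.h>

void simple_example(int rank)
{
    MPI_Request req;
    int buf = rank, x = 0, y = 0;

    if (rank == 0) {            /* P0: Isend(1, req); Barrier; Wait(req) */
        MPI_Isend(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Wait(&req, MPI_STATUS_IGNORE);  /* dropping this leaks the request
                                               object and makes reuse of buf unsafe */
    } else if (rank == 1) {     /* P1: Irecv(*, req); Barrier; Recv(2); Wait(req) */
        MPI_Irecv(&x, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &req);
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Recv(&y, 1, MPI_INT, 2, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Wait(&req, MPI_STATUS_IGNORE);  /* x may be examined only after this */
    } else if (rank == 2) {     /* P2: Barrier; Isend(1, req); Wait(req) */
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Isend(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }
}

Note also that the Irecv(*) may be matched by either P0's or P2's Isend; if it takes P2's message, P1's blocking Recv(2) is left without a sender and the program deadlocks, which is exactly the kind of nondeterministic matching that the dynamic verification approach (ISP) is built to cover.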
