
A Few Thoughts on Programming Models for Massively Parallel Systems


Presentation Transcript


  1. A Few Thoughts on Programming Models for Massively Parallel Systems Bill Gropp and Rusty Lusk Mathematics and Computer Science Division, www.mcs.anl.gov/~{gropp,lusk}

  2. Application Realities • The applications for massively parallel systems already exist • Because they take years to write • They are in a variety of models • MPI • Shared memory • Vector • Other • Challenges include expressing massive parallelism and giving natural expression to spatial and temporal locality.

  3. What is the hardest problem? • (Overly simplistic statement): Program difficulty is directly related to the relative gap in latency and overhead • The biggest relative gap is the remote (MPI) gap, right?

  4. Short Term • Transition existing applications • Compiler does it all • Model: vectorizing compilers (with feedback to retrain user) • Libraries (component software does it all) • Model: BLAS, CCA, “PETSc in PIM” • Take MPI or MPI/OpenMP codes only • Challenges • Remember history: Cray vs. STAR-100 vs. Attached Processors

  5. Mid Term • Use variations or extensions of familiar languages • E.g., CoArray Fortran, UPC, OpenMP, HPF, Brook • Issues: • Local vs. global. Where is the middle (for hierarchical algorithms)? • Dynamic software (see libraries, CCA above); adaptive algorithms. • Support for modular or component oriented software.

  6. Long Term • Performance • How much can we shield the user from managing memory? • Fault Tolerance • Particularly the impact on data distribution strategies • Debugging for performance and correctness • Intel lessons: lock-out makes it difficult to perform post-mortems on parallel systems

  7. Danger! Danger! Danger! • Massively parallel systems are needed for hard, not easy, problems • Programming models must make difficult problems possible; the focus must not be on making simple problems trivial. • E.g., fast dense matrix-matrix multiply isn’t a good measure of the suitability of a programming model.

  8. Don’t Forget the 90/10 Rule • 90% of the execution time is spent in 10% of the code • A performance focus emphasizes this 10% • The other 90% of the effort goes into the other 90% of the code • Modularity, expressivity, and maintainability are important here

  9. Supporting the Writing of Correct Programs • Deterministic algorithms should have an expression that is easy to prove deterministic • This doesn’t mean enforcing a particular execution order or preventing the use of non-deterministic algorithms • Races are just too hard to avoid; only “hero” programmers may be reliable • Will we have “structured parallel programming”? • Undisciplined access to shared objects is very risky • Like goto, access to shared objects is both powerful and (as was pointed out about goto) able to simplify programs • The challenge, repeated: what are the structured parallel programming constructs?
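
To make "undisciplined access to shared objects" concrete, here is a minimal C/OpenMP sketch (not from the original deck): the first loop races on a shared accumulator, while the reduction clause is a small example of a "structured" construct that makes the same update deterministic.

```c
#include <stdio.h>

#define N 1000000

int main(void)
{
    double sum_racy = 0.0, sum_safe = 0.0;

    /* Undisciplined access to a shared object: every thread updates
       sum_racy with no coordination, so updates can be lost. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        sum_racy += 1.0;                    /* data race */

    /* A structured alternative: the reduction clause names the shared
       update explicitly and the runtime makes it deterministic. */
    #pragma omp parallel for reduction(+:sum_safe)
    for (int i = 0; i < N; i++)
        sum_safe += 1.0;

    printf("racy = %.0f, safe = %.0f (expected %d)\n", sum_racy, sum_safe, N);
    return 0;
}
```

Built with OpenMP enabled (e.g., -fopenmp), the racy sum may silently lose updates while the reduction is exact; that asymmetry is the argument for structured constructs.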

  10. Concrete Challenges for Programming Models for Massively Parallel Systems • Completeness of Expression • How many advanced and emerging algorithms do we exclude? • How many legacy applications do we abandon? • Fault Tolerance • Expressing (or avoiding) problem decomposition • Correctness Debugging • Performance Debugging • I/O • Networking

  11. Completeness of Expression • Can you efficiently implement MPI? • No, MPI is not the best or even a great model for WIMPS. But … • It is well defined • The individual operations are relatively simple • Parallel implementation issues are relatively well understood • MPI is designed for scalability (apps already running on thousands of processors) • Thus, any programming model should be able to implement MPI with a reasonable amount of effort. Consider MPI a “null test” of the power of a programming model. • Side effect: gives insight into how to transition existing MPI applications onto massively parallel systems • Gives some insight into the performance of many applications because it factors the problem into local and non-local performance issues.
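
For concreteness, the operations the "null test" asks a candidate model to express are not exotic. A matched two-sided exchange such as the ring shift below (an illustrative C sketch, not part of the original slides) is representative of what an efficient hosting of MPI requires.

```c
#include <stdio.h>
#include <mpi.h>

/* A tiny piece of the "null test": a matched two-sided exchange around a
   ring.  The question for a candidate model is whether this pattern (and
   the rest of MPI) can be expressed and implemented efficiently. */
int main(int argc, char **argv)
{
    int rank, size, left, right, sendval, recvval;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    left  = (rank - 1 + size) % size;       /* ring neighbors */
    right = (rank + 1) % size;
    sendval = rank;

    /* Combined send/receive avoids the deadlock a naive send-then-receive
       ring can produce. */
    MPI_Sendrecv(&sendval, 1, MPI_INT, right, 0,
                 &recvval, 1, MPI_INT, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d received %d from rank %d\n", rank, recvval, left);
    MPI_Finalize();
    return 0;
}
```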

  12. Fault Tolerance • Do we require fault tolerance on every operation, or just on the application? • Checkpoints vs. “reliable computing” • Cost of fine vs. coarse grain guarantees • Software and performance costs! • What is the support for fault-tolerant algorithms? • Coarse-grain (checkpoint) vs. fine-grain (transactions) • Interaction with data decomposition • Regular decompositions vs. turning off dead processors
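
A coarse-grain (checkpoint) strategy is easy to sketch for a serial time-stepping loop. The sketch below is illustrative only, with hypothetical names (write_checkpoint, ckpt_*.dat), and it glosses over the hard part the slide raises: coordinating consistent checkpoints and data decomposition across many tasks.

```c
#include <stdio.h>
#include <stdlib.h>

#define NSTEPS     1000          /* hypothetical run length            */
#define CHECKPOINT 100           /* steps between checkpoints          */
#define STATE_SIZE (1 << 20)     /* doubles of simulation state        */

/* Write everything needed to restart from this step.  In a parallel code
   all tasks would have to agree on the step being saved. */
static void write_checkpoint(const double *state, size_t n, int step)
{
    char name[64];
    snprintf(name, sizeof(name), "ckpt_%06d.dat", step);
    FILE *f = fopen(name, "wb");
    if (!f) { perror("checkpoint"); return; }
    fwrite(&step, sizeof(step), 1, f);     /* header: which step this is */
    fwrite(state, sizeof(double), n, f);   /* the restart data           */
    fclose(f);
}

int main(void)
{
    double *state = calloc(STATE_SIZE, sizeof(double));
    for (int step = 0; step < NSTEPS; step++) {
        /* ... advance the simulation one step ... */
        if (step % CHECKPOINT == 0)
            write_checkpoint(state, STATE_SIZE, step);
    }
    free(state);
    return 0;
}
```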

  13. Problem Decomposition • Decomposition-centric (e.g., data-centric) programming models • Vectors and Streams are examples • Divide-and-conquer or recursive generation (Mou, Leiserson, many others) • More freedom in storage association (e.g., blocking to natural memory sizes; padding to eliminate false sharing)

  14. Problem Decomposition Approaches • Very fine grain (i.e., ignore it) • Individual words. Many think that this is the most general way. • You build a fast UMA-PRAM and I’ll believe it. • Low overhead and latency tolerance requires discovery of significant independent work • Special aggregates • Vectors, streams, tasks (object-based decompositions) • Implicit by user-visible specification • E.g., recursive subdivision
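
The "recursive subdivision" item can be made concrete with a short C sketch (illustrative; apply_kernel and CUTOFF are placeholders): the user writes the split, and a runtime or task system is free to map the independent pieces onto processors and levels of the memory hierarchy.

```c
#include <stddef.h>

#define CUTOFF 1024   /* hypothetical leaf size, e.g. tuned to cache */

/* Leaf computation: a simple scale, standing in for real work. */
static void apply_kernel(double *a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] *= 2.0;
}

/* Recursive subdivision as a user-visible decomposition. */
void recursive_apply(double *a, size_t n)
{
    if (n <= CUTOFF) {                       /* small enough: compute      */
        apply_kernel(a, n);
        return;
    }
    size_t half = n / 2;                     /* otherwise split; the two   */
    recursive_apply(a, half);                /* halves are independent and */
    recursive_apply(a + half, n - half);     /* could run as parallel tasks*/
}
```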

  15. Application Kernels • Are needed to understand, evaluate candidates • Risks • Not representative • Over-simplified • Implicit information exploited in solution • (give example) • Under-simplified • Too hard to work with • Wrong evaluation metric • Result in “fragile” results: small changes in specification cause large changes in results • Called “Ill-posed” in numerical analysis • Widely recognized: “the only real benchmark is your own application”

  16. Example Application Kernels • Bad: • Dense matrix-matrix multiply • Rarely a good algorithmic choice in practice • Too easy • (Even if most compilers don’t do a good job with this) • Fixed-length FFT • Jacobi sweeps • Getting better: • Sparse matrix-vector multiply
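
For reference, the "getting better" kernel is the usual compressed sparse row (CSR) product. The plain C sketch below (one possible data structure, not prescribed by the slides) shows why it stresses a model more than dense multiply: the access to x is indirect and irregular.

```c
#include <stddef.h>

/* y = A*x with A stored in compressed sparse row (CSR) form. */
typedef struct {
    size_t        nrows;
    const size_t *rowptr;   /* nrows+1 entries: where each row starts */
    const size_t *col;      /* column index of each stored nonzero    */
    const double *val;      /* value of each stored nonzero           */
} csr_matrix;

void csr_matvec(const csr_matrix *A, const double *x, double *y)
{
    for (size_t i = 0; i < A->nrows; i++) {
        double sum = 0.0;
        for (size_t k = A->rowptr[i]; k < A->rowptr[i + 1]; k++)
            sum += A->val[k] * x[A->col[k]];    /* indirect access to x */
        y[i] = sum;
    }
}
```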

  17. Reality Check • [Figure: performance of hand-tuned vs. compiler-generated code, from ATLAS] • Enormous effort is required to get good performance

  18. Better Application Kernels • Even Better: • Sparse matrix assembly followed by matrix-vector multiply, on q of p processing elements, matrix elements are r x r blocks • Assembly: often a disproportionate amount of coding, stresses expressivity • q<p: supports hierarchical algorithms • Sparse matrix: many aspects of PDE simulation (explicit variable coefficient problems, Krylov methods and some preconditioners, multigrid); r x r typical for real multi-component problems. • Freedoms: data structure for sparse matrix representation (but bounded spatial overhead) • Best: • Your description here (please!)
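
The "q of p" ingredient already has a direct expression in MPI: split off a subcommunicator and run the assembly and matrix-vector kernel only there. The sketch below is illustrative (q and run_on_subset are hypothetical names), not a prescription from the slides.

```c
#include <mpi.h>

/* Run a kernel on q of p processes by splitting off a subcommunicator;
   ranks outside the subset simply opt out. */
void run_on_subset(MPI_Comm world, int q)
{
    int rank;
    MPI_Comm subcomm;

    MPI_Comm_rank(world, &rank);
    /* Ranks 0..q-1 get color 0; the rest get MPI_UNDEFINED and receive
       MPI_COMM_NULL from the split. */
    int color = (rank < q) ? 0 : MPI_UNDEFINED;
    MPI_Comm_split(world, color, rank, &subcomm);

    if (subcomm != MPI_COMM_NULL) {
        /* ... assemble the sparse matrix and apply it on q processes ... */
        MPI_Comm_free(&subcomm);
    }
}
```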

  19. Some Other Comments • Is a general purpose programming model needed? • Domain-specific environments • Combine languages, libraries, static, and dynamic tools • JIT optimization • Tools to construct efficient special-purpose systems • First steps in this direction • OpenMP (warts like “lastprivate” and all) • Name the newest widely-accepted, non-derivative programming language • Not T, Java, Visual Basic, Python

  20. Challenges • The Processor in Memory (PIM) • Ignore the M(assive). How can we program the PIM? • Implicitly adopts the hybrid model; pragmatic if ugly • Supporting legacy applications • Implementing MPI efficiently at large scale • Reconsider SMP and DSM-style implementations (many current implementations are immature) • Supporting important classes of applications • Don’t pick a single model • Recall Dan Reed’s comment about losing half the users with each new architecture • Explicitly make tradeoffs between features • Massive virtualization vs. ruthless exploitation of compile-time knowledge • Interacting with the OS • Is the OS interface intrinsically nonscalable? • Is the OS interface scalable, but only with heroic levels of implementation effort?

  21. Scalable System Services • 100,000 independent tasks • Are they truly independent? One property of related tasks is that: • The probability that a significant number will make the same (or any!) nonlocal system call (e.g., an I/O request) in the same time interval is far greater than random chance • What is the programming model’s role in • Aggregating nonlocal operations? • Providing a framework in which it is natural to write programs that make scalable calls to system services?

  22. Cautionary Tales • Timers. The application programmer uses gettimeofday to time the program; each thread uses this to generate profiling data. • File systems. Some applications write one file per task (or one file per task per timestep), leading to zillions of files. How long does ls take? ls -lt? Don’t forget, all of the names are almost identical (worst-case sorting?) • Job startup. 100,000 tasks start from their local executable, then all access a shared object (e.g., MPI_Init). What happens to the file system?

  23. New OS Semantics? • Define value-return calls (e.g., file stat, gettimeofday) to allow on-the-fly aggregation • A defensive move for the OS • You can always write a nonscalable program • Define state-update calls with scalable semantics • Collective operations • Thread safe • Avoid seek; provide write_at, read_at
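
MPI-IO already offers a shape close to what this slide asks for: an explicit-offset, collective write that lets the library aggregate requests from all tasks into a single shared file. The sketch below is illustrative (the file name and write_shared_file are placeholders, and equal block sizes per task are assumed), not a claim about what the authors had in mind for OS-level semantics.

```c
#include <mpi.h>

/* Every task writes its block of one shared file at an explicit offset.
   The collective (_all) form lets the MPI-IO layer aggregate requests
   rather than presenting the file system with 100,000 separate calls. */
int write_shared_file(MPI_Comm comm, const double *local, int nlocal)
{
    int rank;
    MPI_File fh;
    MPI_Offset offset;

    MPI_Comm_rank(comm, &rank);
    MPI_File_open(comm, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Explicit offset instead of seek: no shared file pointer to contend
       on (assumes every task writes nlocal doubles). */
    offset = (MPI_Offset)rank * nlocal * sizeof(double);

    MPI_File_write_at_all(fh, offset, local, nlocal, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);
    return MPI_File_close(&fh);
}
```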
