
Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems



Presentation Transcript


  1. Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems Kei Davis and Fabrizio Petrini {kei,fabrizio}@lanl.gov Performance and Architectures Lab (PAL), CCS-3

  2. Schedule • 09.00-09.45 Introduction • 09.45-10.00 Break • 10.00-10.45 Existing Systems • 10.45-11.00 Break • 11.00-11.45 Case Study • 11.45-12.00 Break • 12.00-12.45 A New Approach

  3. Part 1: Introduction • The need for more capability • The big issues • A taxonomy of systems in three dimensions

  4. The Need for More Capability The most constant difficulty in contriving the engine has arisen from the desire to reduce the time in which the calculations were executed to the shortest which is possible. Charles Babbage, 1791-1871 Our interest is in scientific computing—large-scale, numerical, parallel applications run on large-scale parallel machines.

  5. Definitions • Computing capacity: total deliverable computing power from a system or set of systems. (Power—rate of delivery) • Computing capability: computing power available to a single application. Highest-end computing is primarily concerned with capability—why else build such machines?

  6. The Need for Large-Scale Parallel Machines • It is the insatiable demand for ever more computational capability that has driven the creation of many Tflop-scale parallel machines (Earth Simulator, LANL’s ASCI Q, LLNL’s Thunder and BlueGene/L, etc.) • Petaflop machines are on the horizon, for example in the DARPA High Productivity Computing Systems (HPCS) program

  7. One-upmanship? Is this merely one-upmanship with the Japanese? From The Roadmap for the Revitalization of High-End Computing, Computing Research Association: […] there is a growing recognition that a new set of scientific and engineering discoveries could be catalyzed by access to very-large-scale computer systems—those in the 100 teraflop to petaflop range.

  8. Requirements for ASCI In our own arena, Advanced Simulation and Computing (ASC) for stockpile stewardship, along with climate, ocean, and urban infrastructure modeling: within 10 years, estimates of the demand for Capability and general physics arguments indicate that a machine of 1000 TF = 1 PetaFlop (PF) will be needed to execute the most demanding jobs. Such demand is inevitable; it should not be viewed, however, as some plateau in required Capability: there are sound technical reasons to expect even greater Capability demand in the future.

  9. Large Component Count 133,120 processors 608,256 DRAM • Increases in performance will be achieved through single processor improvements and increases in component count • For example, BlueGene/L will have 133,120 processors and 608,256 memory modules • The large component count will make any assumption of complete reliability unrealistic

  10. Sensitivity to Failures • In a large-scale machine a failure of a single component usually causes a significant fraction of the system to fail because • Components are strongly coupled (e.g., a failure of a fan will lead to other failures due to overheating) • The state of the application is not stored redundantly, and loss of any state is catastrophic • In capability mode, many processing nodes are running the same application, and are tightly coupled together
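A tiny sketch of the last bullet: if an application's state is spread across N tightly coupled nodes with no redundancy, a run completes only if every node survives it. The per-node survival probability below is a hypothetical number, not from the slides.

```python
# Sketch: survival probability of a tightly coupled job with no redundancy.
# The job finishes only if all N nodes survive the run (independence assumed).

def job_survival_probability(node_survival_prob: float, n_nodes: int) -> float:
    """Probability that every one of n_nodes survives the run."""
    return node_survival_prob ** n_nodes

# Even a 99.99% per-node survival rate is fragile at scale:
print(job_survival_probability(0.9999, 10_000))  # ~0.37
```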

  11. The Need for Transparent Fault-Tolerance • System software must be resilient to failures, to allow applications to continue executing in the presence of failures • Most of the investment is in the application software ($250M/year for MPI software in the ASCI TriLabs) • Economic constraints impose a limited level of redundancy • Other considerations include cost of development, scalability, and efficiency
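One standard mechanism behind transparent fault tolerance is periodic checkpoint/restart. As an illustration (the slides do not name a specific scheme), Young's first-order approximation puts the near-optimal checkpoint interval at sqrt(2 × checkpoint cost × MTBF); the 5-minute checkpoint cost and 6-hour system MTBF below are hypothetical numbers.

```python
import math

# Sketch of Young's first-order approximation for the optimal checkpoint
# interval: tau ~ sqrt(2 * checkpoint_cost * MTBF). This balances the time
# lost writing checkpoints against the work lost when a failure strikes.

def optimal_checkpoint_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Near-optimal seconds of computation between checkpoints."""
    return math.sqrt(2 * checkpoint_cost_s * mtbf_s)

# Hypothetical numbers: 5-minute checkpoints, 6-hour system MTBF.
tau = optimal_checkpoint_interval(300, 6 * 3600)
print(f"checkpoint roughly every {tau / 60:.0f} minutes")  # 60 minutes
```

Note how the interval shrinks as system MTBF drops with scale: at some component count, the machine spends most of its time checkpointing rather than computing, which motivates the search for alternatives discussed later.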

  12. The JASONs Report • A recent report from the JASONs, a committee of distinguished scientists chartered by the US government, raised the sensitive question of whether ASCI machines can be used as capability engines • For that to be possible, major advances in fault tolerance are needed • The report recommends skipping one generation of supercomputers, owing to the lack of good technical/scientific solutions

  13. MTBF as a Function of System Size

  14. Failure Distribution (ASCI Blue Mountain)

  15. State of the Art in Large-Scale Supercomputers • We can assemble large-scale systems by wiring together hardware and “bolting together” software components • But we have almost no control over the machine: not only faults but also performance anomalies

  16. 1.2 The Big Issues From DoE Office of Science By the end of this decade petascale computers with thousands of times more computational power than any in current use will be vital tools for expanding the frontiers of science and for addressing vital national priorities. These systems will have tens to hundreds of thousands of processors, an unprecedented level of complexity, and will require significant new levels of scalability and fault management. [Emphasis added]

  17. Office of Science cont’d Current and future large-scale parallel systems require that such services be implemented in a fast and scalable manner so that the OS/R does not become a performance bottleneck. Without reliable, robust operating systems and runtime environments the computational science research community will be unable to easily and completely employ future generations of extreme-scale systems for scientific discovery.

  18. DARPA The Defense Advanced Research Projects Agency (DARPA) High Productivity Computing Systems (HPCS) mission: Provide economically viable high productivity systems for the national security and industrial user communities with the following design attributes in the latter part of this decade: • Performance • Programmability • Portability • Robustness

  19. Our Translation • Performance—achieving achievable performance (not, e.g., some percentage of theoretical peak) • Programmability/portability—standard interfaces, transparency of mechanisms for fault tolerance • Robustness—graceful failover

  20. 1.3 A Taxonomy of Systems Q: Is it a supercomputer or just a cluster? A: It is a continuum along multiple dimensions. A taxonomy of systems along three dimensions: • Degree of integration of the compute node; • Collective primitives provided by the network interface, programmability, global address space; • Degree of integration of system software.

  21. Note This taxonomy is useful for our explication, but we make no claim that it • is canonical, or • captures highly specialized architectures (for example, custom-designed special-purpose digital processors, vector processors, floating-point processors). We are concerned with the big ‘general purpose’ parallel machines.

  22. Compute Node Degree of integration of compute node between processors, memory, and network interface • Single processor—SMP—multiple CPU cores per chip • Number of levels of cache, proximity of caches to CPU core • Proximity of network interface to CPU core: on-chip—off-chip direct connection—separated by I/O interface

  23. Network Interface • Collective primitives provided by network interface: none—functionally rich; • Programmability of network interface: none—general purpose • Provision of virtual global address space

  24. System software • Degree of integration of system software: much more about this later…
