
A New Approach in the System Software Design for Large-Scale Parallel Computers








  1. A New Approach in the System Software Design for Large-Scale Parallel Computers
Juan Fernández (1,2), Eitan Frachtenberg (1), Fabrizio Petrini (1), Salvador Coll (1) and José C. Sancho (1)
(1) Performance and Architecture Lab, CCS-3 Modeling, Algorithms and Informatics, Los Alamos National Laboratory, NM 87545, USA. URL: http://www.c3.lanl.gov
(2) Grupo de Arquitectura y Computación Paralela (GACOP), Dpto. Ingeniería y Tecnología de Computadores, Universidad de Murcia, 30071 Murcia, SPAIN. URL: http://www.ditec.um.es
email: {juanf,eitanf,fabrizio,scoll,jcsancho}@lanl.gov

  2. Motivation
[Figure: a cluster of nodes, each running its own local OS, tied together by the system software]
• Hardware and local OSs are glued together by the system software:
  • Resource management
  • Communications
  • Parallel development and debugging tools
  • Parallel file system
  • Fault tolerance
System software is a key factor in maximizing usability, performance and scalability on large-scale systems!!!

  3. Motivation
• System software complexity stems from multiple factors:
  • Extremely complex global state
  • Non-deterministic behavior, inherent to computing systems and parallel applications
  • Local OSs lack global awareness of parallel applications
  • Independent design of the different components
• User-level applications rely on system software

  4. Outline
• Motivation
• Goals
• Core Primitives
• Resource Management
• Communication Libraries
• Ongoing and future work

  5. Goals
• Target
  • Simplify the design and implementation of the system software for large-scale parallel computers
  • Simplicity, performance, scalability, determinism
• Approach
  • Build atop a basic set of three primitives
  • Global synchronization/scheduling
• Vision
  • A SIMD system running MIMD applications (variable granularity on the order of hundreds of μs)

  6. Outline
• Motivation
• Goals
• Core Primitives
• Resource Management
• Communication Libraries
• Ongoing and future work

  7. Core Primitives
• System software built atop three primitives:
• Xfer-And-Signal
  • Transfer a block of data to a set of nodes
  • Optionally signal a local/remote event upon completion
• Compare-And-Write
  • Compare a global variable on a set of nodes
  • Optionally write the global variable on the same set of nodes
• Test-Event
  • Poll a local event
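As a rough illustration, the semantics of the three primitives can be sketched as a toy API. This is a plain-Python simulation under assumed semantics; the class and method names are illustrative, not the actual NIC interface:

```python
# Hypothetical sketch of the three core primitives over simulated nodes.
# Names and signatures are illustrative, not the real QsNET/Elan3 API.

class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.memory = {}     # globally addressable variables
        self.events = set()  # locally pollable events

class Interconnect:
    """Simulates the network-level primitives on a set of nodes."""

    def __init__(self, num_nodes):
        self.nodes = [Node(i) for i in range(num_nodes)]

    def xfer_and_signal(self, data, dests, var, event=None):
        """Transfer a block of data to a set of nodes (multicast/RDMA);
        optionally signal an event on each destination upon completion."""
        for i in dests:
            self.nodes[i].memory[var] = data
            if event is not None:
                self.nodes[i].events.add(event)

    def compare_and_write(self, var, expected, dests, write=None):
        """Compare a global variable on a set of nodes (global query);
        optionally write it on the same set when all nodes match."""
        all_match = all(self.nodes[i].memory.get(var) == expected
                        for i in dests)
        if all_match and write is not None:
            for i in dests:
                self.nodes[i].memory[var] = write
        return all_match

    def test_event(self, node_id, event):
        """Poll a local event; consume it if present."""
        node = self.nodes[node_id]
        if event in node.events:
            node.events.discard(event)
            return True
        return False
```

For example, multicasting a payload to four nodes and then running a global query:

```python
net = Interconnect(4)
net.xfer_and_signal(b"payload", range(4), "buf", event="done")
net.test_event(0, "done")                                  # True: event was signaled
net.compare_and_write("buf", b"payload", range(4), write=b"ack")
```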

  8. Core Primitives The proposed mechanisms simplify design and implementation!!!

  9. Core Primitives
• Implementation
  • Global, virtually addressable shared memory
  • Remote Direct Memory Access (RDMA)
  • Hardware-supported multicast
  • Hardware-supported global query
  • Computing capability in the NIC
• Portability
  • InfiniBand, BlueGene/L, QsNET

  10. Outline
• Motivation
• Goals
• Core Primitives
• Resource Management
• Communication Libraries
• Ongoing and future work

  11. Resource Management
• STORM: Scalable TOol for Resource Management [1,2]
• Job launching
  • Binary and data dissemination
  • Actual launching of a parallel job
  • Reporting of job termination
• Job scheduling
  • FCFS, gang scheduling, ... [3]
  • New scheduling algorithms can be "plugged in"
  • Heartbeat/strobe at regular intervals (time slices)
• Monitoring
• Built atop the three core primitives
[1] "Scalable Resource Management in High Performance Computers." E. Frachtenberg, J. Fernández, F. Petrini and S. Coll. Cluster'02.
[2] "STORM: Lightning-Fast Resource Management." E. Frachtenberg, J. Fernández, F. Petrini, S. Pakin and S. Coll. SC'02.
[3] "Flexible CoScheduling: Mitigating Load Imbalance and Improving Utilization of Heterogeneous Resources." E. Frachtenberg, D. G. Feitelson, F. Petrini and J. Fernández. IPDPS'03.
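The heartbeat-driven gang scheduling above can be sketched in a few lines. This is a deliberately simplified round-robin model (STORM itself allows arbitrary scheduling algorithms to be plugged in, and the per-slice strobe is a multicast built on Xfer-And-Signal); the function name and structure are illustrative only:

```python
# Hedged sketch of a gang-scheduling heartbeat: at every time slice a
# management strobe tells all nodes which job to context-switch to, so
# all of a job's processes run simultaneously across the machine.

def gang_schedule(jobs, num_slices):
    """Round-robin gang scheduler: one job occupies all nodes per slice."""
    timeline = []
    for slice_no in range(num_slices):
        job = jobs[slice_no % len(jobs)]  # simple round-robin rotation
        timeline.append(job)              # strobe: every node switches to `job`
        # ... each node context-switches all of `job`'s processes here ...
    return timeline
```

With two jobs and four slices this alternates them, `["A", "B", "A", "B"]`; responsiveness comes from how small the slice can be made.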

  12. Outline
• Motivation
• Goals
• Core Primitives
• Resource Management
• Communication Libraries
• Ongoing and future work

  13. Communication Libraries
• BCS-MPI: Buffered Coscheduled MPI [4]
• Global synchronization [5]
  • Heartbeat/strobe sent at regular intervals (time slices)
  • All system activities are tightly coupled
• Global scheduling
  • Exchange of communication requirements
  • Communication scheduling
  • Perform the real transmissions and reduce computations [6]
• Implementation in the NIC (Elan3, QsNET)
• Built atop the three core primitives
[4] "BCS-MPI: A New Approach in the System Software Design for Large-Scale Parallel Computers." J. Fernández, E. Frachtenberg and F. Petrini. SC'03.
[5] "Scalable Collective Communication on the ASCI Q Machine." J. Fernández, E. Frachtenberg and F. Petrini. HotI 11.
[6] "Scalable NIC-based Reduction on Large-Scale Clusters." A. Moody, J. Fernández, F. Petrini and D. K. Panda. SC'03.

  14. Communication Libraries
• BCS-MPI: real-time communication scheduling
[Figure: one time slice, on the order of hundreds of μs, delimited by global strobes]
  • Global strobe (time slice starts)
  • Exchange of communication requirements
  • Global synchronization
  • Communication scheduling
  • Global synchronization
  • Real transmission
  • Global strobe (time slice ends)
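The slice structure above can be sketched as control flow. This is a toy model under assumed semantics (descriptors are plain dictionaries, matching is exact source/destination pairing, and `deliver` stands in for the NIC transmission); it is not BCS-MPI's implementation:

```python
# Hedged sketch of one BCS-MPI time slice: nodes exchange message
# descriptors, a scheduling phase pairs matching send/recv descriptors,
# and only the scheduled messages are transmitted in this slice.

def schedule(pending_sends, pending_recvs):
    """Pair each posted send with a matching posted receive."""
    scheduled = []
    for send in pending_sends:
        for recv in pending_recvs:
            if recv["src"] == send["src"] and recv["dst"] == send["dst"]:
                scheduled.append(send)
                pending_recvs.remove(recv)
                break
    return scheduled

def time_slice(pending_sends, pending_recvs, deliver):
    # Global strobe: slice starts; communication requirements exchanged.
    scheduled = schedule(pending_sends, pending_recvs)  # scheduling phase
    for msg in scheduled:                               # transmission phase
        deliver(msg)
        pending_sends.remove(msg)
    # Global strobe: slice ends; unmatched sends wait for a later slice.
```

A send whose receive has not yet been posted is simply carried over to the next slice, which is what makes the global communication pattern deterministic and schedulable.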

  15. Ongoing and future work
• Improved system utilization
  • Scheduling multiple jobs
• QoS for different types of traffic
  • Scheduling messages may provide traffic segregation
• Transparent fault tolerance [7]
  • BCS-MPI simplifies the state of the machine
• Kernel-level implementation of BCS-MPI
  • A user-level solution is already working
• Deterministic replay of MPI programs
  • Ordered resource scheduling may enforce reproducibility
[7] "On the Feasibility of Incremental Checkpointing for Scientific Computing." J. C. Sancho, F. Petrini, G. Johnson, J. Fernández and E. Frachtenberg. IPDPS'04.

  16. A New Approach in the System Software Design for Large-Scale Parallel Computers
Juan Fernández (1,2), Eitan Frachtenberg (1), Fabrizio Petrini (1)
(1) Performance and Architecture Lab, CCS-3 Modeling, Algorithms and Informatics, Los Alamos National Laboratory, NM 87545, USA. URL: http://www.c3.lanl.gov
(2) Grupo de Arquitectura y Computación Paralela (GACOP), Dpto. Ingeniería y Tecnología de Computadores, Universidad de Murcia, 30071 Murcia, SPAIN. URL: http://www.ditec.um.es
email: {juanf,eitanf,fabrizio}@lanl.gov

  17. Motivation Growing gap between workstation and cluster usability!!!

  18. Motivation
• System software complexity stems from multiple factors:
• Extremely complex global state
  • Thousands of processes, threads, open files, pending messages, etc.
• Non-deterministic behavior
  • Inherent to computing systems → OS process scheduling
  • Induced by parallel applications → MPI_ANY_SOURCE
• Local OSs lack global awareness of parallel applications
  • Interference with fine-grain synchronization operations → non-scalable collective communication primitives
• Independent design of the different components
  • Redundancy of functionality → communication protocols
  • Missing functionality → QoS for user-level vs. system-level traffic
• User-level applications rely on system software
  • System software performance/scalability impacts user-application performance/scalability

  19. Resource Management
• Job launching: STORM is 40 times faster than the best previously reported result!!!

  20. Resource Management
• Job scheduling: STORM is able to use very small time slices: RESPONSIVENESS!!!

  21. Communication Libraries • Non-blocking primitives: MPI_Isend/Irecv

  22. Communication Libraries • Blocking primitives: MPI_Send/Recv

  23. Communication Libraries
• Global Synchronization Protocol
• Global Message Scheduling phase
  • Microphases: descriptor exchange + message scheduling
• Message Transmission phase
  • Microphases: point-to-point, barrier and broadcast, reduce
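The fixed ordering of phases and microphases can be expressed as a short driver loop. This is a sketch of the control flow only, with an assumed callback `run_microphase` standing in for the per-microphase NIC work:

```python
# Illustrative ordering of the Global Synchronization Protocol's phases.
# The phase/microphase names come from the slide; the driver is a sketch.

def global_synchronization_protocol(run_microphase):
    phases = [
        ("global message scheduling",
         ["descriptor exchange", "message scheduling"]),
        ("message transmission",
         ["point-to-point", "barrier and broadcast", "reduce"]),
    ]
    trace = []
    for phase, microphases in phases:
        for micro in microphases:
            run_microphase(phase, micro)  # per-microphase work on the NIC
            trace.append((phase, micro))
    return trace
```

Because every node walks the same phase sequence inside the same time slice, collective operations (barrier, broadcast, reduce) get their own dedicated microphases rather than contending with point-to-point traffic.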

  24. Communication Libraries
• SAGE, timing.input (IA32): 0.5% SPEEDUP !!!
