
Scalable Process Management and Interfaces for Clusters



  1. Scalable Process Management and Interfaces for Clusters Rusty Lusk representing also David Ashton, Anthony Chan, Bill Gropp, Debbie Swider, Rob Ross, Rajeev Thakur Mathematics and Computer Science Division Argonne National Laboratory

  2. Interfaces • High-level programming is the identification of components and interfaces • MPI is an interface • The new abstract device interface inside the next generation MPICH • An experimental process manager component - MPD • An interface between process managers and parallel libraries - BNR

  3. Projects at Argonne Related to MPI [Diagram: a map of related projects and collaborators, including MPICH and MPICH-2 (ADI3, BNR, multithreading), ROMIO and current MPI-IO implementations, PVFS, MPD and scalable system tools, MPICH-G2/Globus, Jumpshot/SLOG performance analysis, and MPI-2 C++ and collective operations work, linked to partners and users such as Sandia, LANL, LLNL, LBL/ASCI, IBM, HP, SGI, Myricom, Microsoft (NT cluster), Etnus (debugging), NCSA (large clusters), NIST (IMPI), and applications such as PETSc and Flash.]

  4. The MPICH Implementation of MPI • As a research project: exploring tradeoffs between performance and portability, and conducting research in implementation issues • As a software project: providing a free MPI implementation on most machines, and enabling vendors and others to build complete MPI implementations on top of their own communication services • MPICH 1.2.1 just released, with complete MPI-1, parts of MPI-2 (I/O and C++), and a port to Windows 2000 • Available at http://www.mcs.anl.gov/mpi/mpich

  5. Internal Interfaces in MPICH: The Abstract Device Interface • ADI-1 objectives: speed of implementation; performance of MPI point-to-point operations layered over vendor libraries (NX, CMMD, EUI, etc.); portability • ADI-2 objectives: portability; robustness; support for our own research into implementation issues; ease of outside implementations (by vendors and researchers)

  6. Experiences with ADI-1 & -2 • Vendors • could (and did) field complete MPI implementations quickly • could (and did) incrementally move up the interface levels, replacing lower-level code • could (and did) evolve upper-level code • Researchers • could (and did) experiment with new transport layers (e.g. BIP) • Enabled by interface design

  7. Internal Interfaces in Next Generation MPICH • Objectives for the third-generation Abstract Device Interface (ADI-3): support for full MPI-2; enable a thread-safe implementation; provide more tools for collaborators; support in the ADI for collective operations; enable high performance on new networks • Status: interfaces being finalized, comments welcome; exploratory thread-safe, multiprotocol implementation running; http://www.mcs.anl.gov/mpi/mpich.adi3 [Diagram: layering of calls through the device, e.g. MPI_Isend -> ADI_Isend -> write, and MPI_Reduce -> ADI_Rhc.]
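
The slide only names the device layer; as a rough illustration of the idea, the sketch below shows an abstract device as a small table of transport operations that the portable upper layer calls through. The struct, names, and signatures are invented for illustration and are not the ADI-3 definitions.

    /* Illustrative only, not the real ADI-3.  The portable MPI layer is
       written once against a table of device operations, and each
       transport (TCP, shared memory, Myrinet, ...) fills in the table. */
    #include <stddef.h>

    typedef struct ADI_Request ADI_Request;        /* opaque, device-defined */

    typedef struct ADI_Device {
        int (*isend)(const void *buf, size_t len, int dest, int tag,
                     ADI_Request **req);
        int (*irecv)(void *buf, size_t len, int src, int tag,
                     ADI_Request **req);
        int (*test)(ADI_Request *req, int *done);
    } ADI_Device;

    /* The upper layer sees only the table: */
    static const ADI_Device *dev;   /* set when the chosen device is initialized */

    int upper_layer_isend(const void *buf, size_t len, int dest, int tag,
                          ADI_Request **req)
    {
        return dev->isend(buf, len, dest, tag, req);
    }
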

  8. Interfaces in the Parallel Computing Environment [Diagram: the components of a parallel computing environment and the internal interfaces among them: system administration tools, system monitor, scheduler, queue manager, job submission, process manager, parallel library (PVM, MPI), user application, and file system.]

  9. What is a Process Manager? • A process management system is the software component that starts user processes (with command-line arguments and environment), ensures that they terminate cleanly, and manages their I/O • For simple jobs, this can be the shell (see the sketch below) • For parallel jobs, more is needed • Process management is different from scheduling and queuing • We focus for now on the Unix environment • Related projects: MPICH, PVM, LAM, Harness, PBS, LSF, DQS, LoadLeveler, Condor • An experimental system: MPD (the multipurpose daemon)
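
A minimal sketch of the single-process case: roughly what a shell does when it acts as the process manager for a simple job, assuming POSIX fork/exec/wait. It is not MPD code; the function name is made up.

    /* Start one user process with its arguments and environment, wait
       for it to finish, and report its exit code. */
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int run(char *const argv[], char *const envp[])
    {
        pid_t pid = fork();
        if (pid == 0) {                      /* child: become the user process */
            execve(argv[0], argv, envp);     /* command line and environment   */
            perror("execve");
            _exit(127);
        }
        int status;
        waitpid(pid, &status, 0);            /* ensure clean termination       */
        return WIFEXITED(status) ? WEXITSTATUS(status) : -1;  /* exit code     */
    }
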

  10. Goals for MPD • Original Goal: speed up MPICH startup evolved to: • Grandiose Goal: build an entire cluster computing environment evolved to: • Architectural Goal: design the components of a cluster computing environment and their interfaces evolved to: • Realistic Goal: make a start on understanding architectural requirements, and speed up MPICH startup

  11. Design Targets for MPD • Simplicity - transparent enough to convince system people to run it as root • Speed - startup of parallel jobs fast enough to provide interactive “feel” (1000 processes in a few seconds) • Robustness - no single point of failure, auto-repair of at least some failures • Scalability - complexity or size of any one component shouldn’t depend on the number of components • Service - provide parallel jobs with what they need, e.g. mechanism for precommunication

  12. Parallel Jobs • Individual process environments: each process in a parallel job should be able to have its own executable file, its own command-line arguments, its own environment variables, and its own exit code • Collective identity of a parallel job: a job should collectively be able to be signalled (suspended, restarted, killed, etc.), produce stdout and stderr and accept stdin scalably, and terminate as a whole, especially on abnormal exit
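
For comparison, the MPI-2 mpiexec syntax expresses this kind of per-process specification directly; the executable names below are made up for illustration.

    mpiexec -n 1 ./master : -n 16 -wdir /scratch ./worker

Each colon-separated group can name its own executable, arguments, and working directory, which is exactly the per-process individuality the slide asks the process manager to support.
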

  13. Architecture of MPD • Daemons, managers, clients, consoles • Experimental process manager, job manager, and scheduler interface for parallel jobs [Diagram: a ring of mpd daemons; a scheduler and a job console (mpirun) attach to the ring, and each daemon forks a manager for its client process.]

  14. Interesting Features • Security: a challenge-response system, using passwords in protected files and encryption of random numbers (sketched below); speed is not important, since daemon startup is separate from job startup • Fault tolerance: when a daemon dies, this is detected and the ring is re-knit, giving minimal fault tolerance; a new daemon can be inserted in the ring • Signals: signals can be delivered to clients by their managers
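
A toy sketch of a challenge-response handshake of the kind described above. The shared secret would come from a protected file; the XOR transform stands in for a real keyed cipher or MAC and provides no actual security. Everything here is illustrative, not MPD's protocol.

    /* Toy challenge-response: the daemon sends a random challenge, the
       console returns it transformed with the shared secret, and the
       daemon checks the reply.  XOR is a placeholder, not real crypto. */
    #include <stdint.h>
    #include <stdlib.h>

    static uint64_t transform(uint64_t challenge, uint64_t secret)
    {
        return challenge ^ secret;             /* stand-in for encryption */
    }

    /* Daemon side: ask_console sends the challenge and returns the reply. */
    int daemon_accepts(uint64_t secret,
                       uint64_t (*ask_console)(uint64_t challenge))
    {
        uint64_t challenge = ((uint64_t)rand() << 32) | (uint64_t)rand();
        return ask_console(challenge) == transform(challenge, secret);
    }
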

  15. More Interesting Features • Uses of signal delivery • signals delivered to a job-starting console process are propagated to the clients • so can suspend, resume, or kill an mpirun • one client can signal another • can be used in setting up connections dynamically • a separate console process can signal currently running jobs • can be used to implement a primitive gang scheduler • Support for parallel libraries via BNR

  16. Handling Standard I/O • Managers capture stdout and stderr (separately) from their clients • Managers forward stdout and stderr (separately) up a pair of binary trees to the console, optionally adding a rank identifier as a line label (sketched below) • The console's stdin is delivered to the stdin of client 0 by default, but can be controlled to broadcast or to go to a specific client [Diagram: the mpd ring, the manager ring with its clients, and the I/O tree leading up to the console.]
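
A small sketch of the line-labeling step, assuming the client's stdout arrives on a pipe the manager can read as a stream. This is an illustration, not MPD's actual forwarding code.

    /* Prefix each line of the client's output with its rank ("0: ...")
       before passing it up the I/O tree toward the console. */
    #include <stdio.h>

    void forward_labeled(FILE *from_client, FILE *to_parent, int rank)
    {
        char line[4096];
        while (fgets(line, sizeof line, from_client) != NULL)
            fprintf(to_parent, "%d: %s", rank, line);
    }
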

  17. Client Wrapping • Unix semantics for fork, exec, and process environments allow interposition of other processes that do not know about the client library • For example, mpirun -np 16 myprog can be replaced by mpirun -np 16 nice -5 myprog or mpirun -np 16 pty myprog
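
The point is that a wrapper only has to adjust something and then exec the real program; the manager's fork/exec path and the client library never notice. A hypothetical wrapper (call it mywrap) along the lines of nice might look like this; it is illustrative, not part of MPD.

    /* mywrap: do the wrapper's own work, then become the real client. */
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
        if (argc < 2) {
            fprintf(stderr, "usage: %s prog [args...]\n", argv[0]);
            return 2;
        }
        nice(5);                        /* the wrapper's own effect            */
        execvp(argv[1], &argv[1]);      /* replace ourselves with the client   */
        perror("execvp");
        return 127;
    }

It would be run as, for example, mpirun -np 16 mywrap myprog.
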

  18. Putting It All Together • The combination of • client wrapping • I/O management, especially redirection of stdin • line labels on stdout • ability to customize console can be surprisingly powerful.

  19. A Simple Parallel Debugger • The program mpigdb is a slightly modified version of the mpirun console program • Automatically wraps given client with gdb • Intercepts (gdb) prompts and counts them, issues own (mpigdb) prompt when enough have been received • Sets line label option on stdout and stderr • Sets “broadcast” behavior for stdin as default • Uses “z” command to modify stdin target • any specific rank, or broadcast to all

  20. Parallel Debugging with mpigdb

    donner% mpigdb -np 3 cpi
    (mpigdb) b 33
    0: Breakpoint 1 at 0x8049eac: file cpi.c, line 33.
    1: Breakpoint 1 at 0x8049eac: file cpi.c, line 33.
    2: Breakpoint 1 at 0x8049eac: file cpi.c, line 33.
    (mpigdb) r
    2: Breakpoint 1, main (argc=1, argv=0xbffffab4) at cpi.c:33
    1: Breakpoint 1, main (argc=1, argv=0xbffffac4) at cpi.c:33
    0: Breakpoint 1, main (argc=1, argv=0xbffffad4) at cpi.c:33
    (mpigdb) n
    2: 43  MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
    0: 39  if (n==0) n=100; else n=0;
    1: 43  MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
    (mpigdb) z 0
    (mpigdb) n
    0: 41  startwtime = MPI_Wtime();
    (mpigdb) n
    0: 43  MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
    (mpigdb)

  21. Continuing...

    (mpigdb) z
    (mpigdb) n
    ....
    (mpigdb) n
    0: 52  x = h * ((double)i - 0.5);
    1: 52  x = h * ((double)i - 0.5);
    2: 52  x = h * ((double)i - 0.5);
    (mpigdb) p x
    0: $2 = 0.0050000000000000001
    2: $2 = 0.025000000000000001
    1: $2 = 0.014999999999999999
    (mpigdb) c
    0: pi is approximately 3.1416009869231249,
    0: Error is 0.0000083333333318
    0: Program exited normally.
    1: Program exited normally.
    2: Program exited normally.
    (mpigdb) q
    donner%

  22. Experiences • Instantaneous startup of small MPICH jobs was wonderful after years of conditioning for slow startup • Not so critical for batch-scheduled jobs, but it allows writing parallel versions of short, quick Unix tools (cp, rm, find, etc.) as MPI jobs • Speed on big jobs (Chiba City, with fast Ethernet): mpdringtest 1 on 211 hosts - 0.13 sec.; mpirun -np 422 hostname - 3.5 sec. • Running the daemons as root required little extra effort • We really do use the mpigdb debugger

  23. Running the Daemons as Root • The daemon is run as root • The console runs as a setuid program, and becomes root only briefly while connecting to the daemon • The console transmits the uid, gid, and group membership of the real user to the daemon ring • Daemons fork managers, which assume the user's attributes before exec'ing the manager program (see the sketch below) • After job startup, console, managers, and clients are all running as the user • The daemon is then free to accept more console commands from the same or other users • In experimental use now on our Chiba City cluster
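
A minimal sketch of the privilege drop in the forked manager, assuming POSIX setgroups/setgid/setuid. The function name and error handling are invented for illustration; this is not MPD's code. The order matters: supplementary groups first, then the gid, and the uid (giving up root) last.

    /* Forked child of the root daemon: take on the submitting user's
       identity, then exec the manager program as that user. */
    #include <grp.h>
    #include <sys/types.h>
    #include <unistd.h>

    void become_user_and_exec(uid_t uid, gid_t gid,
                              const gid_t *groups, size_t ngroups,
                              char *const mgr_argv[])
    {
        if (setgroups(ngroups, groups) != 0) _exit(126);  /* group membership */
        if (setgid(gid) != 0) _exit(126);                 /* user's gid       */
        if (setuid(uid) != 0) _exit(126);                 /* drop root last   */
        execvp(mgr_argv[0], mgr_argv);                    /* run as the user  */
        _exit(127);
    }
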

  24. Status • The MPD system is open source and is available as a component of the MPICH distribution. • Configuring and installing MPICH with the ch_p4mpd device automatically builds and installs MPD. • MPD can be built and installed separately from MPICH • Serving as platform for study of process manager interfaces

  25. Motivation for a Process Manager Interface • Any device or method needs to interact with process manager (at least at startup, and probably later as well) • A parallel library (such as an MPI implementation) should be independent of any particular process manager. • Process managers abound, but few are equipped to manage the processes of parallel jobs. • Globus has a “meta” process manager that interacts with local ones. • Condor can start PVM and MPICH jobs (special for each) • PBS “TM” interface • MPD is a prototype process manager, intended to help us explore interface issues.

  26. One Interface in the ADI-3 Library: the BNR Process Manager Interface • Goals: simple enough to plausibly suggest for implementation by other process managers; provide startup information and precommunication; not specific to either a particular parallel library or a particular process manager • Status: the complete definition is still not frozen; some parts already implemented by GARA (the Globus process manager) and MPD [Diagram: the BNR interface sits between providers (MPD, GARA, PBS, and possibly Harness and Condor) and users such as MPICH's Globus device and others.]

  27. The BNR Interface • A “Data” part allows users (e.g. parallel libraries) to put key=value pairs into the (possibly distributed) database • A “Spawn” part allows users to request that processes be started, with hints • Groups, and a put/get/fence model for synchronization and communication preparation (the fence provides scalability) • mpirun, acting as a BNR user, can use precommunication to distribute command-line arguments and the environment
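
For orientation, the calls used on the following slides are collected below as one illustrative header. The argument lists are guesses based on the example on slide 29 and the text above, not the (still unfrozen) BNR definition.

    /* Illustrative BNR declarations -- the signatures are assumptions. */
    typedef int BNR_Group;

    int BNR_Init(BNR_Group *my_group);                 /* join, obtain own group  */
    int BNR_Put(const char *key, const char *value);   /* publish key=value       */
    int BNR_Fence(BNR_Group group);                    /* collective sync point   */
    int BNR_Get(BNR_Group group, const char *key,
                char *value_out);                      /* look up after the fence */
    int BNR_Spawn(void);                               /* start processes; hints
                                                          omitted in this sketch  */
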

  28. Plans • Support TotalView startup => a real parallel debugger • Performance tuning, alternate networks • Verification and simulation subproject • Full implementation of the BNR interface, including BNR_Spawn (needed for a more powerful mpiexec and for MPI_Spawn, MPI_Spawn_multiple) [Diagram: both mpirun and MPI_Spawn reach the process manager through BNR_Spawn in the BNR interface.]

  29. Example of Use in Setting Up Communication • Precommunication (communication needed before MPI communication is set up), in the TCP method:

    Process 26:
        BNR_Init( &grp );
        ...obtain own host and port...
        listen( port );
        BNR_Put( "host_26", host );
        BNR_Put( "port_26", port );
        BNR_Fence( grp );
        ...

    Process 345:
        BNR_Init( &grp );
        ...
        BNR_Fence( grp );
        ...
        ... decide to connect to 26 ...
        BNR_Get( grp, "host_26", host );
        BNR_Get( grp, "port_26", &port );
        connect( host, port );

  30. Multiple Providers of the BNR Interface • MPD: BNR_Puts deposit data in the managers; BNR_Fence is implemented by the manager ring; BNR_Get hunts around the ring for the data • Globus: BNR_Puts are local to the process; BNR_Fence is an all-to-all exchange; BNR_Get is then also local • MPICH-NT launcher: uses a global database in a dedicated process

  31. Summary • There is still much to do in creating a solid parallel computing environment for applications. • Process managers are one important component of the environment. MPD is an experimental process manager, focusing on providing needed services to parallel jobs (fast startup, stdio, etc.) • A widely used interface to process managers in general would help the development of process managers and parallel libraries alike. • Other components and interfaces are important topics for research and experimentation.
