
An Interoperability Approach to Systems Software, Tools, and Libraries


Presentation Transcript


  1. An Interoperability Approach to Systems Software, Tools, and Libraries Rusty Lusk Mathematics and Computer Science Division Argonne National Laboratory

  2. Acknowledgements • The work on system software for clusters was done with Narayan Desai, Rick Bradshaw, and Andrew Lusk. • The work on the MPD process manager was done with Ralph Butler. • The work on MPICH2 was done with Bill Gropp, Rob Ross, Rajeev Thakur, Brian Toonen, David Ashton, and Anthony Chan.

  3. Outline • Cluster system software • Traditional paradigm • New, experimental paradigm • A cluster system software “stack” • An SDK for developing cluster system software • MPI as a system software component • The MPICH2 implementation of MPI • Parallel system software • Experiences

  4. What Is Cluster Systems Software? • The collection of programs used in configuring and maintaining individual nodes, together with the software involved in submission, scheduling, management, and termination of parallel jobs. • Does not include single-node software • Compilers for sequential languages • OS kernels • Except for kernel modules related to parallelism • E.g., modules for PVFS or BLCR • Does include grid middleware (but not in this talk) • Does include parallel user tools • Does include the MPI implementation and the parallel file system • In general, includes what makes the cluster a cluster instead of a pile of separate machines

  5. Traditional Paradigm • Shell scripts to “parallelize” sequential utilities • Loop over rsh • Some general-purpose tools • pdsh • Monolithic resource management systems • PBS, LSF • Some component-like functionality • Maui scheduler • Packages of packages • OSCAR, ROCKS, others • Some non-traditional approaches • bproc • Software integration occurs “manually” • Depends on implementations, not just interfaces
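
As a concrete illustration of the "loop over rsh" style of the traditional paradigm, here is a minimal sketch (not code from the talk; the node names and command are hypothetical):

    #!/usr/bin/env python
    # Traditional-paradigm sketch: run the same command on every node, one rsh at a time.
    import subprocess

    nodes = ['ccn221', 'ccn222', 'ccn223']   # hypothetical node list
    command = 'uptime'                       # sequential utility to "parallelize"

    for node in nodes:
        # One remote shell per node, executed serially: simple, but slow and unscalable.
        subprocess.call(['rsh', node, command])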

  6. Experiences with the Traditional Paradigm • At Argonne, we operate a research cluster (Chiba City) as well as a production cluster (Jazz). • A year ago we were using OpenPBS/Maui on Chiba, along with other aspects of the traditional paradigm • PBSPro/Moab on Jazz • Not too bad • We were increasingly dissatisfied with the Chiba environment in particular • Administrative overhead • Unusual user requirements • Apply kernel mods for duration of job • Root access for some experiments • Many needs common to all clusters • Elimination of manual sequential tasks • Support for thinking collectively • Simplicity and “elasticity”

  7. The Scalable Systems Software SciDAC Project • Multiple Institutions (most national labs, plus NCSA) • Research goal: to develop a component-based architecture for systems software for scalable machines • Software goal: to demonstrate this architecture with some prototype open-source components • One powerful effect: forcing rigorous (and aggressive) definition of what each component should do and what should be encapsulated in other components • http://www.scidac.org//ScalableSystems • Argonne is participating in multiple component definitions and prototype implementations • Embracing the component concept

  8. Scalable Systems Software SciDAC Components [Architecture diagram of the proposed components: Meta Scheduler, Meta Monitor, Meta Manager, Access Control Security Manager, Meta Services (interacts with all components), Node Configuration & Build Manager, System Monitor, Accounting, Scheduler, Resource Allocation Management, Process Manager, Queue Manager, User DB, Data Migration, High Performance Communication & I/O, File System, Checkpoint / Restart, Usage Reports, User Utilities, Testing & Validation; grouped into Infrastructure, Resource Management, Process Mgmt, Application Environment, and Validation areas, with some components marked as outside the project's scope ("Not Us")]

  9. A New Paradigm • Simple, single-function components, with published interfaces to other components, replace monolithic ones. • Can be wrappers for existing programs • Can be all-new software for new functionality • Can be throwaway, once-only lightweight components, provided it is sufficiently easy to implement a component • A flexible communication architecture makes component interconnection simple, reliable, and secure. • A scalable process manager component provides support for parallel jobs, both system and user.

  10. A New Paradigm (continued) • A “software development kit” makes it easy to create components of all three types. • Parallel tools and components are written in MPI for scalability, flexibility, and performance • Systems management becomes easier: • Substitutability of individual components • Communication and other common code is written only once • Certain components can be adapted to local requirements

  11. Digression on Software Components in General • Clemens Szyperski: “A software component is a unit of composition with contractually specified interfaces and explicit context dependencies only. It can be deployed independently and is subject to composition by third parties.” • Expected benefits of components • Localization of functionality encourages reuse • Components may evolve or be replaced without affecting the overall system • Site customizations or reimplementations • “better” or more “well suited” • Functions of disparate components can be assembled in ways not envisioned by the component implementers • New tools • Different views of the system • Our project: apply this approach to cluster systems software in pursuit of precisely these benefits • Implementation strategy: the “stack”

  12. The “Stack” (upside down) • Multiple wire protocols • Extensible • Communication library • Multiple language bindings • XML syntax style • Designing for validation • Software Development Kit for creating components • Components • Some defined by SSS • Others we know we need, others we haven’t thought of yet • Throwaways • Process Manager component supports MPI well • Enables more use of MPI in system software • MPICH2

  13. Wire Protocols • The SSS project agreed on XML over sockets as the inter-component communication mechanism, but with no further detail • This still leaves open many decisions: • The precise XML (the schemas) for specific messages between specific components • The XML syntax style • What kinds of things are entities, attributes, value types, etc. • The way messages are framed: • What constitutes a complete XML message to be parsed • Security issues: passwords, encryption, etc. • Our communication library allows for multiple protocols • Basic (with challenge/response for security) • HTML • SSSRMAP • Easy to add others
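
For illustration only, here is one common way to frame a complete XML message over a socket (length-prefix framing). This is a generic sketch, not the actual SSS Basic, HTML, or SSSRMAP protocol, which define their own framing and security details:

    import socket
    import struct

    def send_framed(sock, xml_text):
        # Prefix the UTF-8 payload with a 4-byte big-endian length so the receiver
        # knows exactly how many bytes make up one complete XML message.
        data = xml_text.encode('utf-8')
        sock.sendall(struct.pack('!I', len(data)) + data)

    def recv_framed(sock):
        # Read the 4-byte length header, then exactly that many payload bytes.
        header = _recv_exactly(sock, 4)
        (length,) = struct.unpack('!I', header)
        return _recv_exactly(sock, length).decode('utf-8')

    def _recv_exactly(sock, n):
        chunks = []
        while n > 0:
            chunk = sock.recv(n)
            if not chunk:
                raise EOFError('connection closed mid-message')
            chunks.append(chunk)
            n -= len(chunk)
        return b''.join(chunks)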

  14. Communication Library • Components register with a service directory component • Any component can ask the SD for the host and port of any other component by name and find out what wire protocol to use. • Then it can compose and send XML messages to that component • Uses send/receive • Correct wire protocol will be used automatically. • Multiple language bindings for library, so components can be written in Python, Perl, C, C++, Java
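
Using the library looks much like the echo client shown later. Below is a sketch of querying a component registered under the (assumed) name 'node-state-manager'; the message schema here is invented for illustration, and only the comm_lib calls shown in the echo example are relied on:

    from sss.ssslib import comm_lib

    # ClientInit connects to the named component (the library handles the Service
    # Directory lookup and wire-protocol selection, per slide 14); SendMessage and
    # RecvMessage move raw XML text.
    c = comm_lib()
    handle = c.ClientInit('node-state-manager')   # component name, not host/port
    c.SendMessage(handle, "<get-node-state><node name='*'/></get-node-state>")
    reply = c.RecvMessage(handle)                 # XML reply, as a string
    c.ClientClose(handle)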

  15. Asynchronous Communication • Asynchronous messages (e.g. job completion) handled by Event Manager component • Any component can register with EM to receive notification of certain events. • Various components notify the EM when certain events occur. • E.g., QM submits job to PM, wants to know when job is complete • QM registers with EM for completion messages • PM notifies EM when job is complete • EM notifies all who registered for this event, including QM
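
The flow can be pictured with a toy in-process publish/subscribe object. This is purely illustrative: the real Event Manager is a separate component reached over the SSS wire protocols, and its message schema is not shown here.

    # Toy illustration of the Event Manager pattern (not the SSS component itself).
    class ToyEventManager:
        def __init__(self):
            self.subscribers = {}   # event name -> list of callbacks

        def register(self, event, callback):
            # e.g. the Queue Manager registers interest in job-completion events
            self.subscribers.setdefault(event, []).append(callback)

        def notify(self, event, data):
            # e.g. the Process Manager reports that a job finished
            for callback in self.subscribers.get(event, []):
                callback(data)

    def qm_on_job_complete(jobid):
        print('QM: job %s is complete' % jobid)

    em = ToyEventManager()
    em.register('job-complete', qm_on_job_complete)   # QM subscribes
    em.notify('job-complete', '1234')                 # PM reports completion; EM fans out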

  16. Things to Look for in Any Syntax for Commands • A command message does three things • Matches a set of objects in the target component’s data store • Either constructs them or identifies them • Can use similar syntax • Applies a function with arguments to that set of objects • Constructs a return message • May be complex, containing partial information about the objects identified

  17. Desirable Features of an XML Style • Completeness • If you can think it, you can write it • High value of validation • Function signature type checking • Protection of components from poorly formed messages • Simplified component code • Extent to which documentation is expressed in schema • Readability • Can a human understand the XML text? • Conciseness • Desirable, but might conflict with readability • Scalability • “Atomicity” • Can race conditions be avoided?

  18. An Example: Process Manager Manages Parallel Process Groups; can Signal Groups

    <signal-process-group signal='SIGINT' scope='single'>
      <process-group pgid='*' submitter='*'>
        <process host='ccn221' pid='*'/>
        <process host='ccn222' pid='*'/>
      </process-group>
    </signal-process-group>

  1. Since almost all fields are '*' and simple matching is employed, there is only one predicate in this example: we are looking for process groups that have at least one process on ccn221.
  2. The function signal-process-group, with the arguments signal='SIGINT' and scope='single', is applied to each matched process-group object.
  3. The pgid and submitter fields for each matched process group are returned, along with the hostname and pid of each of its processes on ccn221 or ccn222, not those of the entire process group:

    <process-groups>
      <process-group pgid='2232' submitter='user1'>
        <process host='ccn221' pid='2232'/>
        <process host='ccn222' pid='2542'/>
      </process-group>
      <process-group pgid='2240' submitter='user2'>
        <process host='ccn222' pid='2531'/>
        <process host='ccn221' pid='1432'/>
      </process-group>
    </process-groups>

  19. A Software Development Kit for Building Components • We envision frequent invention or replacement of components • Many functions can be shared beyond just the communication infrastructure components • Therefore we have constructed a small infrastructure to aid in building components • This SDK is for writing components in Python; other languages could be supported

  20. Lower Levels of SSS-SDK • Multiple wire protocols, with an easy way to add new ones • Some infrastructure components, including the Service Directory and Event Manager • The SSSlib communication library, with bindings for multiple languages

  21. Upper Level of SSS-SDK • Component services that are independent of the particular component • Registers/deregisters with the Service Directory • Sets up logging • Sets up error reporting • Select loop, error handling • Socket setup/cleanup • XML parsing (uses ElementTree) • XML validation • Parsing of messages (in RS format) • Provided in two classes available for subclassing: • Server • Event Receiver

  22. Echo Client

    #!/usr/bin/python
    from os import getpid
    from sss.ssslib import comm_lib

    c = comm_lib()
    h = c.ClientInit('echo')
    c.SendMessage(h, "<echo><pid id='%s'/></echo>" % (getpid()))
    response = c.RecvMessage(h)
    c.ClientClose(h)
    print response

  23. Echo Server

    #!/usr/bin/env python
    from sss.server import Server

    class Echo(Server):
        __implementation__ = 'echo'            # set log name
        __component__ = 'echo'                 # component answers to 'echo'
        __dispatch__ = {'echo': 'HandleEcho'}  # call HandleEcho method for echo messages
        __validate__ = 0                       # no schema for this component

        def HandleEcho(self, xml, (peer, port)):
            return xml

    if __name__ == '__main__':
        e = Echo()
        e.ServeForever()

  24. List of Components In Production Use on Chiba City Cluster • Service Directory • As described above, knows how to contact, and communicate with, each component • Event Manager • Relays events from sources to those components that have subscribed to them • (Communication Library) • Send/receive of character data using the underlying wire protocols • Node State Manager • Maintains the status of each node; used by the scheduler to allocate nodes to users • Node Configuration and Build Manager • Knows how to configure nodes and supervise their builds

  25. List of Components In Production Use on Chiba City (cont.) • Process Manager • Currently wraps MPD, starts parallel jobs upon request • From user (or root), interactively • From Queue Manager • From personal job submitter • Scheduler • FIFO plus backfill plus reservations • Queue Manager • Accepts user scripts • PBS compatibility mode • Can deal with requests to load custom kernel • Accounting Manager • Maintains usage data • System Diagnostics Manager • Maintains hardware status data

  26. Other Parallel Software for Cluster Users • Parallel File System • PVFS • High-level I/O libraries over MPI-IO • HDF-5 • pnetCDF • Scalable Unix Tools • Parallel versions of cp, rm, ls, ps, find, etc. • MPI-2 • System Tools written in MPI

  27. MPICH2 • Goals: same as MPICH • Research project, to explore scalability and performance, incorporate and test research results • Software project, to encourage use of MPI-2 • Scope: all of MPI-2 • I/O • Dynamic • One-sided • All the obscure parts, too • Useful optional features recommended by the Standard (full mpiexec, singleton-init, thread safety) • Other useful features (debugging, profiling libraries, correctness checking)

  28. MPICH2 • Incorporates latest research into MPI implementation • Our own • Collective operations • Optimizations for one-sided ops • Optimized datatype handling • I/O • Others • Collective operations, for example • See recent EuroPVM and Cluster Proceedings • In use by vendors • IBM on BG/L • Cray on Red Storm • Coming soon from another major vendor • Having vendors adapt MPICH2 into their products has helped make it efficient and robust

  29. Status • Available from http://www.mcs.anl.gov/mpi/mpich2 • Nearly complete • Has RMA, optimized as described in Monday’s talk. • I/O, dynamic process management • Linux and Windows, 32- and 64-bit, others coming soon • TCP and TCP/shmem • Infiniband via MVAPICH2 from OSU • Version 0.971 (Saving 1.0 designation for when totally complete) • Not beta; faster and more robust than MPICH 1.2.6 • Currently missing but coming soon: • Heterogeneity • MPI_Comm_join • external32 portable datatype for I/O

  30. Using MPI in System Software • Advantages • Easy to write parallel programs • Efficient algorithms for collective operations • Access to high-speed interconnects • Currently in use • File staging (see the sketch below) • For setting up the user environment in the absence of a shared file system • Parallel rsync • For maintaining the system software environment on nodes • MPISH • For running user scripts in the parallel systems environment • See “MPI Cluster Systems Software” in EuroPVM/MPI
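
A minimal sketch of the file-staging idea, written here with the mpi4py bindings rather than the C/MPI tools used in the talk; the source and destination paths are illustrative assumptions:

    from mpi4py import MPI

    comm = MPI.COMM_WORLD

    # Rank 0 reads the file once; the contents are broadcast to every node,
    # and each node writes a local copy -- no shared file system required.
    if comm.Get_rank() == 0:
        with open('/home/user1/input.dat', 'rb') as f:   # hypothetical source file
            data = f.read()
    else:
        data = None

    data = comm.bcast(data, root=0)

    with open('/tmp/input.dat', 'wb') as f:              # hypothetical per-node destination
        f.write(data)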

  31. Experiences on Chiba City Cluster • Running only “new paradigm” software for a year now. • System software maintenance has been greatly simplified. • At the same time, capabilities have been greatly expanded. • Loading experimental operating systems on the fly • Using MPI for system management tasks • System managers happy • See “Component-Based Cluster Systems Software Architecture: a Case Study” in Cluster 2004 • Small amount of code • < 18,000 lines total (most in communication library) • < 5,000 lines of Python • Queue manager, 600 lines, Scheduler 400 lines

  32. Ultimate Vision • From high-level, scalable system software components to a high-level, scalable language for controlling clusters

    boot_up_cluster(configfile);
    if (cluster_broken)
        fix_it();
    else
        main_loop();
    …

  • Think collectively (see the sketch below): • Collective operations on nodes • Allreduce on the return codes • Split nodes into “communicators” • Collectively handle those that succeeded • Collectively handle those that failed
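
A sketch of this collective style, again using mpi4py purely for illustration; run_local_task and the success test are hypothetical stand-ins for real node-management operations:

    from mpi4py import MPI

    def run_local_task():
        # Hypothetical per-node management operation; returns 0 on success.
        return 0

    comm = MPI.COMM_WORLD
    rc = run_local_task()

    # Allreduce on the return codes: every node learns whether anything failed.
    worst = comm.allreduce(rc, op=MPI.MAX)
    if worst != 0:
        pass   # at least one node failed; react collectively here

    # Split the nodes into "communicators" by outcome, so the nodes that
    # succeeded and the nodes that failed can each be handled collectively.
    group = comm.Split(color=0 if rc == 0 else 1, key=comm.Get_rank())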

  33. Summary • The use of a component architecture for cluster systems software promises to make management of clusters easier. • An SDK for systems software can aid in the creation of components. • Scalability and parallelism are crucial for systems software as well as applications. • MPICH2 is an MPI-2 implementation that supports both uses. • Many promising future directions for cluster systems software beckon.

  34. The End
