Using the Parallel Universe beyond MPI

Parallel Universe applications using Metronome
• Metronome’s support for running parallel jobs builds on Condor’s Parallel Universe
• Coordinated Metronome jobs can run on multiple machines at the same time, with communication available between them
• Provides advanced testing opportunities
• Some examples: client/server, cross-platform, compatibility, stress/scalability

Service testing challenges
• Starting multiple services on the same machine does not allow for testing across a network or across different platforms
• Deciding when to start the services and when to start the tests requires human intervention
• Setting up the services is usually a manual process, or testing is skipped altogether
• The same goes for tearing down the services to return the machines to their original state

Benefits of using Metronome
• Condor manages dynamic claiming of resources, communication between job nodes, and cleanup after the jobs run
• Metronome publishes basic information about each task to the job ad, where it is accessible by any node, acting as a “scratch space” for the job
• The hostnames of all job nodes, plus the start time, return code, and end time for each task on each node, are published to this shared job ad
• This information is useful for communication between nodes and for synchronization in the user’s glue scripts
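The synchronization described above usually comes down to a glue script waiting until another node has published a value. The following is a minimal sketch of that waiting loop only. It assumes a caller-supplied `lookup` function that returns the attribute value or `None`; that function, the `wait_for_attribute` name, and the `server_port` attribute are hypothetical and are not part of Metronome or Condor. How a glue script actually reads the shared job ad is not shown here.

```python
import time


def wait_for_attribute(lookup, name, timeout=60.0, interval=1.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll lookup(name) until it returns a value or the timeout expires.

    lookup is a hypothetical stand-in for whatever mechanism reads the
    shared job ad; it must return None while the attribute is unset.
    """
    deadline = clock() + timeout
    while True:
        value = lookup(name)
        if value is not None:
            return value
        if clock() >= deadline:
            raise TimeoutError(f"{name} was not published within {timeout}s")
        sleep(interval)


if __name__ == "__main__":
    # A plain dict plays the role of the shared scratch space here.
    scratch = {"server_port": "9000"}  # attribute name is illustrative only
    print(wait_for_attribute(scratch.get, "server_port", timeout=5))
```

Passing the lookup in as a parameter keeps the waiting logic separate from the transport, so the same loop can be exercised locally with a dict before it is wired to a real job.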

Client/server test example
[Diagram: a parallel job started from the submit node, with the server and the client each running on their own execute node (Execute Node 0 and Execute Node 1)]
Server:
• Start server
• Send port to client
• Handle client requests
• Poll for ALLDONE from client
• Exit
Client:
• Discover server hostname and port
• Start client
• Run queries against server
• Send ALLDONE message to server
• Exit
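The exchange in this example can be sketched with plain TCP sockets. This is an illustration under stated assumptions, not the glue scripts used in the talk: both roles run as threads on one machine, the port is handed over through an in-process event instead of the shared job ad, and the "query" handling (echoing the request in upper case) is invented for the demo. The function names `run_server`, `run_client`, and `demo` are hypothetical.

```python
import socket
import threading


def run_server(port_ready, port_box):
    """Start a server, publish its port, answer queries until ALLDONE."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind(("127.0.0.1", 0))           # let the OS pick a free port
        srv.listen(1)
        port_box.append(srv.getsockname()[1])
        port_ready.set()                      # "send port to client"
        conn, _ = srv.accept()
        with conn, conn.makefile("r") as reader:
            for line in reader:               # handle client requests
                message = line.strip()
                if message == "ALLDONE":      # client has finished
                    break
                conn.sendall((message.upper() + "\n").encode())


def run_client(host, port, queries):
    """Connect to the server, run the queries, then send ALLDONE."""
    replies = []
    with socket.create_connection((host, port)) as sock, \
            sock.makefile("r") as reader:
        for query in queries:
            sock.sendall((query + "\n").encode())
            replies.append(reader.readline().strip())
        sock.sendall(b"ALLDONE\n")
    return replies


def demo():
    port_ready = threading.Event()
    port_box = []
    server = threading.Thread(target=run_server, args=(port_ready, port_box))
    server.start()
    port_ready.wait(timeout=5)                # "discover server port"
    replies = run_client("127.0.0.1", port_box[0], ["ping", "status"])
    server.join(timeout=5)
    return replies


if __name__ == "__main__":
    print(demo())
```

In a real parallel job the two functions would live in separate glue scripts on separate execute nodes, and the client would learn the server's hostname and port from the information published for the job instead of from a shared Python object.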

How to submit a parallel job in Metronome
• Several minor modifications to the Metronome submit file are necessary for parallel jobs
• The list of platforms is comma separated and enclosed in parentheses:
  Platforms = (x86_rhas_3, x86_rhas_4)

Parallel job submit files continued
• Add a glue script for each task/node combination to be executed remotely:
  platform_pre_0 = client/platform_pre
  platform_pre_1 = server/platform_pre
  remote_declare_0 = client/remote_declare
  remote_declare_1 = server/remote_declare
  remote_task_0 = client/remote_task
  remote_task_1 = server/remote_task
  remote_task_args_0 = 9000
  remote_task_args_1 = 9001
• … and so forth for all glue scripts.
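Because the entries follow a regular `<task>_<node> = <role>/<task>` pattern, they can be generated instead of typed by hand when the number of nodes grows. The helper below is a small sketch of that idea; `glue_script_lines` is a hypothetical name, not a Metronome tool, and it only reproduces the naming pattern shown on this slide.

```python
def glue_script_lines(roles, tasks, task_args=None):
    """Build submit-file lines for every task/node combination.

    roles[i] is the directory holding the glue scripts for node i;
    task_args maps a node number to its remote_task arguments.
    """
    lines = []
    for task in tasks:
        for node, role in enumerate(roles):
            lines.append(f"{task}_{node} = {role}/{task}")
    for node, value in sorted((task_args or {}).items()):
        lines.append(f"remote_task_args_{node} = {value}")
    return lines


if __name__ == "__main__":
    for line in glue_script_lines(
        roles=["client", "server"],
        tasks=["platform_pre", "remote_declare", "remote_task"],
        task_args={0: 9000, 1: 9001},
    ):
        print(line)
```

Run with the roles and tasks above, the helper emits the same eight lines listed on the slide; for a one-server, many-clients stress test, the `roles` list would simply repeat `"client"`.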

Other parallel job use cases
• Cross-platform testing (Linux to Solaris)
• Scalability/stress testing (1 server, many clients)
• Compatibility testing (cross-version, stable vs. development series)

For more information
• Documentation is available on the NMI site
• See http://nmi.cs.wisc.edu/node/1001 for information on running parallel jobs using Metronome
• http://nmi.cs.wisc.edu/node/282 describes how to set up your own Metronome installation for running parallel jobs
