
Using the Parallel Universe beyond MPI

Presentation Transcript


  1. Using the Parallel Universe beyond MPI

  2. Parallel Universe applications using Metronome
     • Metronome’s support for running parallel jobs builds on Condor’s Parallel Universe
     • Coordinated Metronome jobs can run on multiple machines at the same time, with communication available between them
     • This provides advanced testing opportunities
     • Some examples: client/server, cross-platform, compatibility, stress/scalability

  3. Service testing challenges
     • Starting multiple services on the same machine does not allow for testing across a network or across different platforms
     • Deciding when to start the services and when to start the tests requires human intervention
     • Setup of the services is usually a manual process, or the testing is simply not done
     • The same goes for tearing down the services to return the machines to their original state

  4. Benefits of using Metronome
     • Condor manages dynamic claiming of resources, communication between job nodes, and cleanup after the jobs run
     • Metronome publishes basic information about each task to the job ad, where it is accessible by any node; the job ad acts as a “scratch space” for the job
     • The hostnames of all job nodes, and the start time, return code, and end time for each task on each node, are published to this shared job ad
     • This information is useful for communication between nodes and for synchronization in the user’s glue scripts
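
The synchronization idea on this slide can be sketched as a small polling helper. This is a minimal illustration, not Metronome's actual interface: the function `wait_for_attr`, the attribute name `node0_remote_task_start`, and the in-memory dictionary standing in for the shared job ad are all assumptions made for the example. In a real glue script the reader callable might wrap a tool such as `condor_chirp get_job_attr`, but that wiring is not shown here.

```python
import time
from typing import Callable, Optional


def wait_for_attr(
    read_attr: Callable[[str], Optional[str]],
    name: str,
    timeout: float = 60.0,
    interval: float = 1.0,
) -> str:
    """Poll until attribute `name` has a value, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while True:
        value = read_attr(name)
        if value is not None:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{name} was not published within {timeout}s")
        time.sleep(interval)


# In-memory stand-in for the shared job ad (hypothetical attribute name).
job_ad = {"node0_remote_task_start": "1200000000"}

# One node waits until another node's task start time has been published.
start = wait_for_attr(job_ad.get, "node0_remote_task_start", timeout=5, interval=0.1)
print(start)
```

Passing the reader in as a callable keeps the waiting logic separate from however the job ad is actually queried, so the same helper can be exercised locally with a dictionary.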

  5. Client/server test example
     [Diagram: the submit node launches a parallel job across two execute nodes]
     • Execute Node 0 (SERVER): start server → send port to client → handle client requests → poll for ALLDONE from client → exit
     • Execute Node 1 (CLIENT): discover server hostname and port → start client → run queries against server → send ALLDONE message to server → exit
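
The client/server handshake in this example can be tried out on a single machine with a rough sketch like the one below. It is only an analogy for the flow on the slide: two threads stand in for the two execute nodes, the dictionary `shared_ad` stands in for the shared job ad, and the echo protocol and every name in the code are assumptions made for illustration.

```python
import socket
import threading

shared_ad = {}                      # stand-in for the shared job ad
port_published = threading.Event()  # set once the server has published its port


def server() -> None:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        srv.listen(1)
        shared_ad["server_host"] = "127.0.0.1"
        shared_ad["server_port"] = srv.getsockname()[1]
        port_published.set()        # "send port to client"
        conn, _ = srv.accept()
        with conn, conn.makefile("r") as reader:
            for line in reader:     # handle requests until ALLDONE arrives
                msg = line.strip()
                if msg == "ALLDONE":
                    break
                conn.sendall(f"echo:{msg}\n".encode())


def client(queries):
    port_published.wait(timeout=10)  # "discover server hostname and port"
    addr = (shared_ad["server_host"], shared_ad["server_port"])
    replies = []
    with socket.create_connection(addr, timeout=10) as sock, sock.makefile("r") as reader:
        for query in queries:
            sock.sendall((query + "\n").encode())
            replies.append(reader.readline().strip())
        sock.sendall(b"ALLDONE\n")   # tell the server it may exit
    return replies


server_thread = threading.Thread(target=server)
server_thread.start()
replies = client(["q1", "q2"])
server_thread.join(timeout=10)
print(replies)
```

Binding to port 0 and publishing the chosen port mirrors the slide's "send port to client" step, which avoids hard-coding a port that may already be in use on the execute node.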

  6. How to submit a parallel job in Metronome
     • Several minor modifications to the Metronome submit file are necessary for parallel jobs
     • The list of platforms is comma-separated, with parentheses around the outside
     • Platforms = (x86_rhas_3, x86_rhas_4)

  7. Parallel job submit files continued
     • Add a glue script for each task/node combination to be executed remotely
     • platform_pre_0 = client/platform_pre
     • platform_pre_1 = server/platform_pre
     • remote_declare_0 = client/remote_declare
     • remote_declare_1 = server/remote_declare
     • remote_task_0 = client/remote_task
     • remote_task_1 = server/remote_task
     • remote_task_args_0 = 9000
     • remote_task_args_1 = 9001
     • … and so forth for all glue scripts
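
Because the per-node entries follow a regular `<task>_<node> = <directory>/<task>` pattern, they can be generated rather than typed by hand. The helper below is a hypothetical convenience, not part of Metronome; the function name `glue_script_lines` and its parameters are assumptions, and it only reproduces the entries shown on this slide.

```python
from typing import Dict, List, Optional, Sequence


def glue_script_lines(
    node_dirs: Sequence[str],
    tasks: Sequence[str],
    task_args: Optional[Dict[int, str]] = None,
) -> List[str]:
    """Build one '<task>_<node> = <dir>/<task>' line per task/node pair."""
    lines = []
    for task in tasks:
        for node, directory in enumerate(node_dirs):
            lines.append(f"{task}_{node} = {directory}/{task}")
    # Optional per-node arguments for the remote task.
    for node, args in sorted((task_args or {}).items()):
        lines.append(f"remote_task_args_{node} = {args}")
    return lines


lines = glue_script_lines(
    node_dirs=["client", "server"],  # node 0 and node 1, as on the slide
    tasks=["platform_pre", "remote_declare", "remote_task"],
    task_args={0: "9000", 1: "9001"},
)
print("\n".join(lines))
```

For a scalability test with one server and many clients, the same call with a longer `node_dirs` list would emit the additional per-node entries.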

  8. Other parallel job use cases
     • Cross-platform testing (Linux to Solaris)
     • Scalability/stress testing (1 server, many clients)
     • Compatibility testing (cross-version, stable vs. development series)

  9. For more information
     • Documentation is available on the NMI site
     • See http://nmi.cs.wisc.edu/node/1001 for information on running parallel jobs using Metronome
     • http://nmi.cs.wisc.edu/node/282 describes how to set up your own Metronome installation for running parallel jobs
