
Process Management & Monitoring WG

Process Management & Monitoring WG. Quarterly Report August 26, 2004. Components. Process Management Process Manager Checkpoint Manager Monitoring Job Monitor System/Node Monitors Meta Monitoring. Component Progress. Checkpoint Manager (LBNL) Monitoring (NCSA) Process Manager (ANL).



Presentation Transcript


  1. Process Management & Monitoring WG Quarterly Report August 26, 2004

  2. Components
     • Process Management
       • Process Manager
       • Checkpoint Manager
     • Monitoring
       • Job Monitor
       • System/Node Monitors
       • Meta Monitoring

  3. Component Progress
     • Checkpoint Manager (LBNL)
     • Monitoring (NCSA)
     • Process Manager (ANL)

  4. Checkpoint Manager: BLCR Status
     • Full save and restore of
       • CPU registers
       • Memory
       • Signals (handlers & pending signals)
       • PID, PGID, etc.
       • Files (w/ limitations)
       • Communication (via LAM/MPI)

  5. Checkpoint Manager: BLCR Status
     • Files
       • Files unmodified between checkpoint and restart
       • Files appended to between checkpoint and restart
       • Pipes between processes

  6. Checkpoint Manager: BLCR Status
     • LAM/MPI over TCP (and GM)
       • Handles in-flight data (drains)
       • Linear scaling
       • Migratable

  7. Checkpoint Manager: BLCR Status
     • Linux only
       • “Stock” 2.4.X
         • RedHat 7.2, 7.3, 8.0, 9
         • SuSE 7.? and 9
         • RHEL3/CentOS nearly ready
       • 2.6.x port has begun “in background”
     • X86 only
       • Alpha, PPC may be 95% ready
       • IA64 and X86_64 possible

  8. Checkpoint Manager: BLCR Future Work
     • More on files
       • Mutable files
       • Directories
     • Misc.
       • Process groups and sessions
       • Terminal characteristics

  9. Checkpoint Manager: SSS Work
     • Rudimentary Checkpoint Manager
       • Works with Bamboo and MPDPM
     • Long-delayed plans for “next gen”
       • Upgraded interface spec (what syntax?)
       • Management of “context files”
     • lampd
       • mpirun replacement for running LAM/MPI jobs under MPD

  10. Process Manager Progress
     • Continued daily use on Chiba City, along with other components
     • At Brett’s request, added an option to signal either the entire (Unix) process group of a user process or just the process itself.
       • Default is just the top-level user process
       • Example:
           <signal-process-group scope='global' signal='SIGINT'>
             <process-group user='desai'/>
           </signal-process-group>
     • Miscellaneous hardening of the MPD system, particularly in error conditions, prompted by Intel use.
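The difference between signalling only the top-level user process and signalling its whole Unix process group can be sketched with the standard POSIX calls. This is a minimal illustration in Python, not the MPD implementation: the names `signal_job` and `demo` and the `whole_group` parameter are hypothetical, and the sketch assumes a POSIX system.

```python
import os
import signal
import subprocess
import sys


def signal_job(pid: int, sig: int, whole_group: bool = False) -> None:
    """Send `sig` to one process, or to every process in its process group.

    Hypothetical helper: the default (whole_group=False) mirrors the slide's
    default of signalling just the top-level user process.
    """
    if whole_group:
        # Look up the group the process belongs to and signal all members.
        os.killpg(os.getpgid(pid), sig)
    else:
        os.kill(pid, sig)


def demo() -> int:
    """Start a child in its own session/process group, then signal the group."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"],
        start_new_session=True,  # child becomes leader of a new process group
    )
    signal_job(child.pid, signal.SIGTERM, whole_group=True)
    # A negative return code means the child was terminated by that signal.
    return child.wait(timeout=10)
```

Starting the child in its own session keeps the group signal from reaching the caller; a process manager that places each user job in its own process group gets the same isolation.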

  11. Monitoring Work at NCSA
     • A major fix has been implemented in warehouse. Before, there was a threshold of network bad-ness that, if exceeded, would cause none of the nodes to be monitored at all (due to messages being stacked up in the incoming sockets). The code has been fixed so that multiple messages can be monitored per pass, which means that if the above threshold is exceeded, the nodes will just be monitored more slowly.
     • This code was tested in the "good" realm against Dave, Scott, Brett in July, before having another release of the RMAP suite. It has not been tested in the "bad" realm, because that's a dedicated test.
     • The bad news is that upon coming back from vacation in Britain, the hard drive on my desktop had had a complete hardware failure. I had been backing up warehouse religiously, and since I had transported code down to xtorc to create new rpms, I lost nothing on warehouse.
     • I did, however, lose a bunch of work on the SSSRMAP wire protocol. Unfortunately, this included a bunch of annotated code that I would have liked to have had. Fortunately, most of what I'd done was figuring stuff out, and some of that carried over in memory, so that reconstructing the second time is much easier.
     • So I've been working feverishly on trying to get back on track with that project. I am at the point that it will be useful for me to sit down with Narayan and Dave and ask "where does this go" install/Makefile sorts of questions.
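The idea behind the warehouse fix, reading several queued messages from a socket on each monitoring pass instead of one, can be sketched as follows. This is not the warehouse code: the function name `drain_pending`, the `max_messages` cap, and the buffer size are assumptions chosen for illustration.

```python
import socket


def drain_pending(sock: socket.socket, max_messages: int = 64,
                  bufsize: int = 4096) -> list:
    """Read up to `max_messages` queued messages without blocking.

    Hypothetical sketch: the cap keeps one busy socket from starving the
    others in a pass, so a backlog slows monitoring down instead of
    stopping it.
    """
    messages = []
    sock.setblocking(False)
    while len(messages) < max_messages:
        try:
            data = sock.recv(bufsize)
        except BlockingIOError:
            break  # nothing more queued right now; try again next pass
        if not data:
            break  # empty read: peer closed (stream) or empty datagram
        messages.append(data)
    return messages
```

With a cap of one message per pass, a socket that receives messages faster than the pass rate falls further behind indefinitely; raising the cap lets each pass work through the backlog.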
