Resource Management and Accounting Working Group

Scalable Systems Software Center: Resource Management and Accounting Working Group Face-to-Face Meeting, October 10-11, 2002. Resource Management and Accounting Working Group: • Working Group Scope and Components • Progress made • Current issues being worked • Next steps

Presentation Transcript


  1. Scalable Systems Software Center: Resource Management and Accounting Working Group Face-to-Face Meeting, October 10-11, 2002

  2. Resource Management and Accounting Working Group • Working Group Scope and Components • Progress made • Current issues being worked • Next steps • Discussions involving larger group

  3. Working Group Scope: The Resource Management Working Group is involved in the areas of resource management, scheduling, and accounting. This working group will focus on the following software components: • Queue Manager (Job Manager) • Scheduler • Accounting and Allocation Manager • Meta Scheduler. Other critical resource management components are being developed in the Process Management and Monitoring Working Group: • Process Manager • Node Monitor

  4. Proposed Component Architecture (diagram; component labels only) • Infrastructure Services • Meta Scheduler • Discovery Service • Allocation Manager • Local Scheduler • Information Service • Queue Manager • Node Monitor • Event Manager • Security System • Process Manager • Node Manager

  5. Resource Management Prototype Demonstration: This demo runs a simple end-to-end test in which a submitted job runs past its wallclock limit. Components shown in the diagram: • Job Submission Client • Discovery Service • Queue Manager • Local Scheduler • Allocation Manager • Node Monitor • Process Manager. Numbered messages in the diagram: 0 Service-Lookup • 1 Submit-Job • 2 Query-Job • 3 Query-Node • 4 Create-Reservation • 5 Run-Job • 6 Exec-Process • 7 Query-Job • 8 Delete-Job • 9 Withdraw-Allocation
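The numbered messages on slide 5 appear in the transcript in diagram order rather than in step order. As a minimal sketch, the following Python snippet restores the step order from the numbers given on the slide. The transcript does not say which component sends or receives each message, so that is deliberately not modeled; the names `DEMO_MESSAGES` and `demo_sequence` are illustrative only.

```python
# Step numbers and message names are copied from slide 5.
# Sender/receiver pairs are NOT stated in the transcript, so none are assumed.
DEMO_MESSAGES = {
    4: "Create-Reservation",
    9: "Withdraw-Allocation",
    2: "Query-Job",
    7: "Query-Job",
    8: "Delete-Job",
    3: "Query-Node",
    5: "Run-Job",
    1: "Submit-Job",
    0: "Service-Lookup",
    6: "Exec-Process",
}


def demo_sequence(messages):
    """Return the message names ordered by their step number."""
    return [messages[step] for step in sorted(messages)]


if __name__ == "__main__":
    for step, name in enumerate(demo_sequence(DEMO_MESSAGES)):
        print(f"{step}: {name}")
```

Read in this order, the demo starts with a service lookup and job submission, and ends with the job being deleted and its allocation withdrawn, which matches the described case of a job exceeding its wallclock limit.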

  6. General Progress • Initial draft of Scalable Systems Software Resource Management and Accounting Protocol (SSSRMAP) completed • Requirements documents nearly complete for all components • All components under revision control

  7. Scheduler Progress • Extended internal XML Usage • Implemented SSSRMAP XML interface for queue manager, node monitor and allocation/accounting manager • Enhanced internal scalability to support up to 50,000 nodes • Added support for HTTP framing protocol • Added internal suspend/resume and checkpoint/requeue management code (interfaced to PBS, LSF, and LL) • Created subset of XML-based job control and state control clients for use with GUI tools • Significant testing and documentation of existing features (priority and QOS enhancements)

  8. Queue Manager Progress • Conformance to the SSSRMAP XML specification • Synchronization of the job attribute types with PBS SSS front-end • Full wire protocol compatibility with basic, challenge, and ANL versions of basic and challenge • Multiple server ports employed to allow multiple client protocols simultaneously • New interface with Event Manager • Added job signaling support with the Process Manager

  9. Allocation Manager Progress • Requirements and survey sent out to 15 sites and vendors • Allocation management component placed under BitKeeper • Implemented HTTP framing protocol and tested performance • Support for expression grouping in queries • Journaling implemented – undo and redo working • Got SHA1-HMAC security working with QBank/Maui • Reframed bank objects (accounts, users, allocations, etc.) as dynamically introduced objects • Object actions defined in metadata cache • Creation of dynamic web GUI using PHP and JavaScript (forms for object creation, querying, modification, deletion and undeletion)
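Slide 9 mentions SHA1-HMAC security between QBank and Maui. The slides do not describe the message layout or key handling, so the following is only a generic sketch of HMAC-SHA1 message authentication using Python's standard `hmac` and `hashlib` modules. The key, the request body, and the function names `sign` and `verify` are placeholders, not the actual QBank/Maui exchange.

```python
import hashlib
import hmac


def sign(key: bytes, message: bytes) -> str:
    """Return the hex-encoded HMAC-SHA1 tag of message under a shared key."""
    return hmac.new(key, message, hashlib.sha1).hexdigest()


def verify(key: bytes, message: bytes, signature: str) -> bool:
    """Recompute the tag and compare it in constant time."""
    return hmac.compare_digest(sign(key, message), signature)


if __name__ == "__main__":
    # Placeholder values; the real key exchange and payload are not
    # described in the slides.
    shared_key = b"hypothetical-shared-secret"
    request = b"example request body"

    tag = sign(shared_key, request)
    print("authentic:", verify(shared_key, request, tag))
    print("tampered: ", verify(shared_key, request + b"!", tag))
```

The receiver recomputes the tag from the shared key and the received bytes, so any change to the message or use of a different key causes verification to fail.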

  10. Meta Scheduler Progress • Development of submission client • Support for PBS ‘command file’ keywords and semantics • Ability to run jobs end-to-end • Fault tolerance improvements (Cluster scheduler reconnection and global JobId tracking) • Added interfaces to interoperate with grid systems (Globus) • Improved user interface • Partial XML local scheduler-meta scheduler language defined and implemented

  11. Current Issues • Job State Management for Queue Manager • Data staging • Job signaling • Support for Job steps • Integration with Node Monitor

  12. Next Work • Prepare for SC demos • Scalability Testing • Release v1.0 of Resource Management System for existing components • Basic documentation • Security authentication • Need to solidify RMS-wide standards for packaging, build procedure, revision control, and distribution home.

  13. Scheduler Future • Integrate SSS security protocols • Extend GUI support • Full support for XML allocation manager language • Extend SSS language to support suspend/resume and checkpoint/requeue • Test TM interface fault tolerance features (corrupt data, bad connections, etc.)

  14. Queue Manager Future • Add Epilogue/Prologue support • Add job submission verification script • Interface with Node Monitor • Full PBS qsub compatibility • Add interface with Node Manager to support job dependent node OS image installation

  15. Allocation Manager Future • Focus on getting QBank ready for bundling and release with SSS RMS system (security, use key, improved installation procedure) • Focus effort on open source of new Allocation Manager (gold) • Implementation of enhanced allocation, reservation mechanisms which utilize simple pricing engine and log job and usage data • Security authentication (gold) • Support for operations on returned fields (sort, sum, max, unique, group by, etc.) • Integrate SSSLIB connection protocol & discovery service
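Slide 15 lists planned support for operations on returned fields (sort, sum, max, unique, group by). As a rough illustration of what those operations mean over a query result, here is a small Python sketch. The row layout and the field names `user`, `account`, and `hours` are hypothetical; the slides do not define the allocation manager's schema or query syntax.

```python
from collections import defaultdict

# Hypothetical rows, standing in for a query result.
ROWS = [
    {"user": "alice", "account": "chem", "hours": 12.0},
    {"user": "bob", "account": "chem", "hours": 3.5},
    {"user": "alice", "account": "bio", "hours": 7.0},
]


def sort_by(rows, field):
    """Sort rows in ascending order of one field."""
    return sorted(rows, key=lambda row: row[field])


def total(rows, field):
    """Sum one numeric field over all rows."""
    return sum(row[field] for row in rows)


def maximum(rows, field):
    """Largest value of one field."""
    return max(row[field] for row in rows)


def unique(rows, field):
    """Distinct values of one field, in first-seen order."""
    seen = []
    for row in rows:
        if row[field] not in seen:
            seen.append(row[field])
    return seen


def group_sum(rows, key, field):
    """Group rows by `key` and sum `field` within each group."""
    groups = defaultdict(float)
    for row in rows:
        groups[row[key]] += row[field]
    return dict(groups)


if __name__ == "__main__":
    print(sort_by(ROWS, "hours"))
    print(total(ROWS, "hours"), maximum(ROWS, "hours"))
    print(unique(ROWS, "user"))
    print(group_sum(ROWS, "account", "hours"))
```

Performing these operations in the allocation manager rather than in each client would let tools request, for example, per-account usage totals directly instead of fetching every row.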

  16. Meta Scheduler Future • Fault tolerance improvements • Initial data management (data stage-in/stage-back) • Full XML local scheduler-meta scheduler language defined and implemented

  17. Issues requiring inter-group coordination • Resource controller for handling switch allocation, licenses, resource limit enforcement (logical partitioning) • How are checkpointing and suspend/resume routed? • Who manages node access control? • Dynamic jobs
