
HPC Profile BOF



Presentation Transcript


  1. HPC Profile BOF Marvin Theimer Microsoft Corporation Marty Humphrey University of Virginia

  2. Agenda • 11:00 – 11:15 Review of Charter (Humphrey) • 11:15 – 11:30 HPC Use Cases – Base Case and Common Cases (Theimer) • 11:30 – 11:45 Extensible Job Submission Design (Theimer) • 11:45 – 12:00 Comparative Analysis • Extensible Job Submission Design and JSDL/BES (Wasson) • ESI – Snelling/Foster (Theimer) • 12:00 – 12:30 Discussion

  3. Review of Charter (11:00 – 11:15)

  4. History • GGF14: Chicago Jul 14 2005 • “Minimal Web Services BOF” (aka WS-Management) • Newhouse, Theimer, Humphrey, Tollefsrud • GGF15: Boston Oct 6 2005 • UVa update on WS-Management use for OGSA (Wasson) • Specific technical thoughts on the support of dual stacks • Suspended given rumored “reconciliation” • GGF16: Athens Feb 14 2006 • “An evolutionary approach to realizing the Grid vision” • Theimer, Parastatidis, Hey, Humphrey, Fox • OGSA F2F Feb 17 2006 • Theimer gives detailed presentation of the “evolutionary” paper

  5. More History • Mar 15 2006 • “Toward Converging Web Service Standards for Resources, Events, and Management” • HP, IBM, Intel, Microsoft • OGSA F2F: Sunnyvale, CA April 5 2006 • Theimer presented the use-case document • Since March 2006, active engagement with the OGSA-WG mailing list to build consensus

  6. OGSA HPC Profile WG (Computing Area) • Objective: the profile and protocol specifications needed to realize the vertical use case of batch job scheduling of scientific/technical applications • “use case” = HPC use case • Output: HPCP (normative) • Scope • Identify any changes/extensions that are deemed necessary to existing protocol specifications and will work with the relevant working groups to try to effect the identified changes/extensions • Identify additional protocol specifications that need to be defined and will either work on their definition or spin them out to additionally defined working groups.

  7. OGSA HPC Profile WG (Computing Area) • “sub-profiles” • interface for specifying, submitting, and scheduling jobs • interface for bulk data staging • Evolutionary approach • A simple base case will be defined that we expect to be universally implemented by all batch job scheduling clients and schedulers. • All additional functionality will be defined in terms of optional extensions (which are anticipated to be widely applicable)

  8. Pre-existing Documents • JSDL • BES • “An evolutionary approach to realizing the Grid vision”

  9. Status • Use-case in final revisions • Resource reservation • Provisioning • Execution • Next: what the framework should be for defining extension profiles • Aggressive milestones to meet vendor deadlines

  10. Deliverables • OGSA HPC Use Cases – Base Case and Common Cases (GFD-I) • OGSA HPC profile specification (GFD-R.P) • OGSA HPC initial common cases extension profile specification (GFD-R.P)

  11. Milestones

  12. HPC Use Cases – Base Case and Common Cases [GFD-I] (11:15 – 11:30)

  13. Goals • BASE case: • ALL scheduling clients and services are expected to understand • HPC, not Grid (i.e., do NOT span administrative domains) • Common Cases: • Represent some significant fraction of implementors, not all implementors • NOT all cases – only common cases • capture client-visible functionality requirements rather than being technology/system-design-driven

  14. Base Case • High-throughput compute cluster used only within the enterprise • User requests: • Submit a job with a specification of resource requirements → unique jobID or fault • Query a specific job for its current state • Cancel a specific job • List jobs • State diagram: queued, running, finished
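
A minimal sketch of the four base-case requests as a client-side interface. All names here (BaseSchedulerClient, submit_job, the resource-dictionary shape) are illustrative assumptions; the slide itself only fixes the abstract operations listed above.

    import uuid
    from dataclasses import dataclass
    from enum import Enum

    class JobState(Enum):
        # The base state diagram from this slide.
        QUEUED = "queued"
        RUNNING = "running"
        FINISHED = "finished"

    @dataclass
    class Job:
        job_id: str
        resources: dict              # e.g. {"cpus": 4, "memory_mb": 2048}
        state: JobState = JobState.QUEUED

    class BaseSchedulerClient:
        """Hypothetical client exposing only the four base-case requests."""

        def __init__(self) -> None:
            self._jobs: dict[str, Job] = {}

        def submit_job(self, resources: dict) -> str:
            # Returns a unique jobID, or raises (the "fault" case).
            if resources.get("cpus", 1) < 1:
                raise ValueError("fault: invalid resource request")
            job_id = str(uuid.uuid4())
            self._jobs[job_id] = Job(job_id, resources)
            return job_id

        def query_job(self, job_id: str) -> JobState:
            return self._jobs[job_id].state

        def cancel_job(self, job_id: str) -> None:
            del self._jobs[job_id]

        def list_jobs(self) -> list[str]:
            return list(self._jobs)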

  15. Base Case (cont) • Only a small set of “standard” resources • number of CPUs/compute nodes needed, memory requirements, disk requirements, etc. • Only equality of string values and numeric relationships among pairs of numeric values are provided in the base use case. • Once a job has been submitted it can be cancelled, but its resource requests can't be modified
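
A sketch of the two comparison forms the base case allows. The (attribute, operator, value) requirement tuple and the matches helper are invented for illustration; only the restriction itself (string equality, numeric relations) comes from the slide.

    import operator

    # Numeric relations permitted by the base case; strings get equality only.
    NUMERIC_OPS = {"==": operator.eq, "!=": operator.ne,
                   "<": operator.lt, "<=": operator.le,
                   ">": operator.gt, ">=": operator.ge}

    def matches(node_attrs: dict, requirement: tuple) -> bool:
        attr, op, value = requirement
        actual = node_attrs.get(attr)
        if actual is None:
            return False
        if isinstance(value, str):
            return op == "==" and actual == value   # string: equality only
        return NUMERIC_OPS[op](actual, value)       # numeric: full relations

    # e.g. matches({"cpus": 8, "os": "linux"}, ("cpus", ">=", 4)) -> True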

  16. Base case: Out of Scope • Data access issues • Programs are assumed to be pre-installed • Creation and management of user security credentials • No need for directory services beyond something like DNS • Management of the system resources

  17. Base case: Fault tolerance model • Job fails because of “system problems” • Job must be resubmitted by client • the job scheduler will not automatically rerun the job • Failure of the scheduler may or may not cause currently running jobs to fail.

  18. Base case: Job Exits • Whether the job exited successfully, exited with an error code, or terminated due to some other fault situation • How long it ran in terms of wall-clock time

  19. Base case: scheduling policy • FIFO • Out-of-scope: • quotas and other forms of SLAs • Non-independent jobs • Infrastructure support for parallel, distributed programs (such as MPI) • Reservation of resources separate from allocation to a running job (e.g., reserve 3 cpus for future use) • Interactive access to running jobs

  20. Common Cases • Purpose of enumerating common cases: use as the basis for creating appropriate extension mechanisms • 13 cases

  21. 13 Common Cases • Exposing existing schedulers’ functionality • Condor, Globus, LSF, Maui, Microsoft-CCS, PBS, SGE, etc. • Polling vs. notification • notification “call-back” messages for significant changes in the state of a job • What are the semantics of message delivery? • At-Most-Once and Exactly-Once Submission Guarantees • The base use case allows the possibility that a client can’t directly tell whether its job submission request has been successfully received by a job scheduler or not • Types of Data Access • non-transparent staging of data between independent storage systems. • explicitly supports transparent data access within a virtual organization or across a federated set of organizations
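
The at-most-once guarantee mentioned above is commonly obtained by having the client attach a token it generates itself, so that a retry after a lost reply cannot create a duplicate job. That mechanism is an assumption for illustration (the use case only names the requirement), as is the submit_job_once operation on the scheduler stub.

    import uuid

    class IdempotentSubmitter:
        """Sketch of at-most-once submission via a client-chosen token."""

        def __init__(self, scheduler):
            self.scheduler = scheduler   # hypothetical scheduler client

        def submit(self, resources: dict, retries: int = 3) -> str:
            token = str(uuid.uuid4())    # the SAME token on every retry
            last_err = None
            for _ in range(retries):
                try:
                    # The scheduler is assumed to deduplicate on `token`.
                    return self.scheduler.submit_job_once(token, resources)
                except TimeoutError as err:   # reply lost; safe to retry
                    last_err = err
            raise TimeoutError("submission failed after retries") from last_err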

  22. 13 Common Cases (cont.) • Types of Programs to Install/Provision/Run • users may have programs that require explicit installation of some form. • Multiple Organization Use Cases • Submission of jobs requires additional security support (e.g., “foreign” credential) • Data currently resides outside of the enterprise in question • Additional sandboxing of non-local users • Extended Resource Descriptions • allow arbitrary resource types whose semantics are not understood by the HPC infrastructure • accounting information returned for a job

  23. 13 Common Cases (cont.) • Extended Client/System Administrator Operations • users may wish to modify the requirements for a job after it has already been submitted • Arrays of jobs • system administrators: suspension/resumption of jobs and migration of jobs among compute nodes • Extended Scheduling Policies • shortest/smallest-job-first, weighted-fair-share scheduling, etc. • multiple submission queues, job submission quotas, and various forms of SLAs, such as guarantees on how quickly a job will be scheduled to run.

  24. 13 Common Cases (cont.) • Parallel Programs and Workflows of Programs • instantiate such programs (e.g., MPI) across multiple compute nodes in a suitable manner, including provision of information that will allow the various program instances to find each other within the cluster • Programs may have execution dependencies on each other. • Advanced Reservations and Interactive Use Cases • reserve resources for use at a specific future time • communicate in real time with external client users • Cycle Scavenging • batch job scheduler dispatches jobs to machines that have dynamically indicated to it that they are currently available for running guest jobs.

  25. 13 Common Cases (cont.) • Multiple Schedulers • submit work to the whole of the computing infrastructure without having to manually select which facility to submit to

  26. Status • Need feedback • Is the base case sufficient? • Missing any “common” cases? • Any of the 13 “too uncommon”?

  27. Extensible Job Submission Design (11:30 – 11:45)

  28. Extensible Job Submission Design (EJS) • Main focus: extensibility • Philosophy: • Cover all the bases (resource reservation, provisioning, execution, data staging, etc.) • Keep it simple • Approach: • Minimalist base cases (overall and for each sub-component) • Optional extensions to enable both richer semantics and evolution

  29. What is a Job? • OGSA glossary: • Job: User-defined task that is scheduled to be carried out by an execution subsystem • Task: ??? • Single program instance? • Distributed MPI program? • What about data staging? BES defines simple workflows • Execution subsystem: ??? • Job queue? • Process? Compute node? Multiple compute nodes? • Workflow: • Focus is on business processes & services • No mention of executing multiple user-defined tasks or data staging steps • Batch job scheduling literature: • Job ~ accounting entity under which multiple user-defined steps are run

  30. Core Concepts • Task: execution of one or more program instances in one or more execution subsystems • Compute node: execution subsystem that actually executes a program • Resources: • Compute node CPUs, memory, disk space, etc. • Aggregates: # of compute nodes, all resources of a compute node, etc. • Scheduler: allocates resources to job and tasks • Resource allocation: 3 distinct phases • Clients query schedulers about available resources • Clients reserve resources • Schedulers allocate resources to tasks or to reservation requests • Job: reified resource reservation against which tasks can be run

  31. Examples of Jobs and Tasks

  32. Base Task States [state diagram: New → Pending → Running → Finished, with Canceled and Failed as additional terminal states]

  33. Base Job States [state diagram: New → Unsatisfied → Satisfied → Finished, with Canceled and Failed as additional terminal states]
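
A sketch of how the base task states and their legal transitions might be encoded. The state names come from slide 32; exactly which states may move to Canceled or Failed is an assumption read off the (lost) diagram.

    from enum import Enum

    class TaskState(Enum):
        NEW = "new"
        PENDING = "pending"
        RUNNING = "running"
        FINISHED = "finished"
        CANCELED = "canceled"
        FAILED = "failed"

    # Allowed transitions for the base profile (assumed, see above).
    TRANSITIONS = {
        TaskState.NEW:      {TaskState.PENDING, TaskState.CANCELED, TaskState.FAILED},
        TaskState.PENDING:  {TaskState.RUNNING, TaskState.CANCELED, TaskState.FAILED},
        TaskState.RUNNING:  {TaskState.FINISHED, TaskState.CANCELED, TaskState.FAILED},
        TaskState.FINISHED: set(),   # terminal
        TaskState.CANCELED: set(),   # terminal
        TaskState.FAILED:   set(),   # terminal
    }

    def advance(current: TaskState, nxt: TaskState) -> TaskState:
        if nxt not in TRANSITIONS[current]:
            raise ValueError(f"illegal transition {current.value} -> {nxt.value}")
        return nxt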

  34. Multiple Schedulers [architecture diagram: a Client submits work to a Meta-scheduler, which dispatches to per-cluster schedulers: Sched13 on Cluster13 (head node Cluster13-headnode; compute nodes Cluster13-1 and Cluster13-2, running Task1) and Sched42 on Cluster42 (compute nodes Cluster42-1 through Cluster42-8, running Task2 and Task3), plus a desktop machine, Desktop-foo]

  35. Other Topics Covered • Advertising resource information • Failure and recovery model • Security and credential delegation

  36. Types of Extensions • Purely additive extensions allowed (i.e. no changes to base semantics) • Additional WSDL operations (incl. for parameter overloading) • Array operations • Extended state diagrams • Extended resource descriptions • Extended information representations • Multiple, composable, extensible “micro”-protocols

  37. Specialization of States • Profile A: task state transition diagram for a scheduling profile that extends the base protocol to support task migration (adds a Migrate transition and a Running: Migrating sub-state to the base New → Pending → Running → Finished diagram) • Profile B: task state transition diagram for a scheduling profile that extends the base protocol to support the notion of staging in data to a compute node before a task runs and staging data out back to the client user after the task has finished execution (splits Running into Running: Stage-in, Running: Executing, and Running: Stage-out) • Profile C: task state transition diagram for a scheduling profile that extends the base protocol to support task suspension (adds a Suspend transition and a Running: Suspended sub-state) • Canceled and Failed remain terminal states in all three profiles
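
To make the "purely additive" idea concrete, a sketch of Profile B's specialization, in which the base Running state splits into three sub-states. The enum values and the base_view mapping are assumptions; the principle being illustrated is that a base-profile client can still collapse the new sub-states back to Running.

    from enum import Enum

    class ProfileBTaskState(Enum):
        # Base states, unchanged...
        NEW = "new"
        PENDING = "pending"
        FINISHED = "finished"
        CANCELED = "canceled"
        FAILED = "failed"
        # ...plus RUNNING specialized into three sub-states.
        RUNNING_STAGE_IN = "running:stage-in"
        RUNNING_EXECUTING = "running:executing"
        RUNNING_STAGE_OUT = "running:stage-out"

    def base_view(state: ProfileBTaskState) -> str:
        """Collapse Profile B states back to the base diagram's vocabulary."""
        if state.value.startswith("running"):
            return "running"
        return state.value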

  38. Base Interoperability Interface • Task interface: • CreateTask(schedulerEPR, resourceDescr, credentialsDescr, lifetime) → taskDescr • QueryTask(taskEPR, taskID, queryDescr) → taskDescr • CancelTask(taskEPR, taskID) • Scheduler interface: • QueryScheduler(schedulerEPR, queryDescr) → schedulerDescr
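
The four operations rendered as Python stubs, just to fix the shapes of the calls. The dataclass fields and dict-typed descriptors are stand-ins; the actual wire representations (EPRs, descriptor schemas) are not specified by the slide.

    from dataclasses import dataclass

    @dataclass
    class TaskDescr:
        task_epr: str    # endpoint reference used by QueryTask/CancelTask
        task_id: str
        state: str

    class BaseTaskInterface:
        """Stubs mirroring the slide's task operations; bodies left open."""

        def create_task(self, scheduler_epr: str, resource_descr: dict,
                        credentials_descr: dict, lifetime: float) -> TaskDescr:
            raise NotImplementedError

        def query_task(self, task_epr: str, task_id: str,
                       query_descr: dict) -> TaskDescr:
            raise NotImplementedError

        def cancel_task(self, task_epr: str, task_id: str) -> None:
            raise NotImplementedError

    class BaseSchedulerInterface:
        def query_scheduler(self, scheduler_epr: str,
                            query_descr: dict) -> dict:
            raise NotImplementedError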

  39. Generic Extensions • Array operations • Notifications • Query operation modifiers • Idempotent message delivery semantics • EPR resolution

  40. Task Interface Extensions • Re-execution of failed tasks • Additional & extended resource definitions • Additional operations • ModifyTask • … • Additional scheduling policies • Support for parallel/distributed programs • Data staging • Provisioning • Static workflow

  41. Resource Reservations • Job interface: • CreateJob(schedulerEPR, resourceDescr, credentialsDescr, lifetime) → rsrvDescr • QueryJob(rsrvEPR, rsrvID) → rsrvDescr • ModifyJob(rsrvEPR, rsrvID, resourceDescr) → rsrvDescr • CancelJob(rsrvEPR, rsrvID)
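
Putting slides 30 and 41 together, a sketch of the reserve-then-run flow in which the Job reifies a reservation and tasks are run against it. The call names follow the slides; the client stub, the field names on the returned descriptors, and binding a task to a reservation via the resource description are all assumptions.

    # `client` is a hypothetical stub implementing both the scheduler/job
    # interface (slide 41) and the task interface (slide 38).
    def run_against_reservation(client, scheduler_epr: str, creds: dict):
        # Phase 1: query the scheduler about available resources.
        avail = client.query_scheduler(scheduler_epr, {"select": "resources"})
        # (a real client would pick its resource request based on `avail`)

        # Phase 2: reserve resources; the returned Job reifies the reservation.
        rsrv = client.create_job(scheduler_epr,
                                 {"cpus": 4, "memory_mb": 4096},
                                 creds, lifetime=3600.0)

        # Phase 3: run a task against the reservation (binding assumed to
        # happen via a "reservation" entry in the resource description).
        return client.create_task(scheduler_epr,
                                  {"reservation": rsrv.rsrv_id,
                                   "executable": "/bin/hostname"},
                                  creds, lifetime=600.0)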

  42. Multiple Schedulers • Hierarchical information option • Client scheduler list • AnnounceScheduler (schedulerEPR, announcerDesc)

  43. Comparison of ESI to Extensible Job Submission Design • Focus of ESI: reconciliation/synthesis of Globus and Unicore • Focus of EJS: extensibility
