System-Directed Resilience for Exascale Platforms


System-Directed Resilience for Exascale Platforms (09-0016)
Ron Oldfield (1423), Neil Pundit (1423), FY09-11, Total $1500 Costs

  • Problem

  • Current apps cannot survive a node failure

  • Proposed Solution

  • Application-transparent resilience to node failures

  • Approach

  • Design/develop system software to support:

    • Application quiescence,

    • Efficient state management,

    • Automatic fault recovery

  • R&D Goals & Milestones

  • Investigate and develop new methods for quiescence that don’t hinder other apps.

  • Identify critical application state and develop efficient methods to manage state.

  • Identify system software requirements for

    • dynamic node allocation,

    • network/OS virtualization, and

    • MPI node recovery.

  • Relationship to Other Work

  • Scalability and efficient resource utilization, particularly memory and storage, are key issues for this effort.

  • Our team has R&D experience in:

    • Scalable system software (LWK, Portals, LWFS),

    • Smart memory management techniques (Smartmap),

    • RAS systems.

  • All efforts developed “lightweight” approaches that are both resource-efficient and scalable.

  • Significance of Results

  • Represents a fundamental change in the way HPC systems support resilience.

  • Significant impact on performance: less defensive I/O overhead for checkpoints.

  • Higher levels of reliability.

  • Improved productivity: developers worry less about resilience and focus more on core science.

Resilience Challenges for Exascale

  • Current Application characteristics

    • Require large fractions of systems

    • Long running

    • Resource constrained compute nodes

    • Cannot survive component failure

  • Current Options for fault tolerance

    • Application-directed checkpoints

    • System-directed checkpoints

    • System-directed incremental checkpoints

    • Checkpoint in memory

    • Others: virtualization, redundant computation, …

  • We propose to develop systems software resilient to node failure

    • Support for application quiescence,

    • Efficient (diskless) state management,

    • Fast methods for fault recovery.
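The cost of the checkpoint-based options listed above can be estimated with a rough model. The sketch below uses Young's first-order approximation for the checkpoint interval; it is not taken from the slides. The function names and any node MTBF or checkpoint-time values passed in are illustrative assumptions, and the model ignores restart time and is only reasonable while the checkpoint time is much smaller than the system MTBF.

```python
import math

def system_mtbf(node_mtbf_s, nodes):
    """System MTBF, assuming independent node failures (an assumption)."""
    return node_mtbf_s / nodes

def young_interval(checkpoint_time_s, mtbf_s):
    """Young's first-order estimate of the best time between checkpoints."""
    return math.sqrt(2.0 * checkpoint_time_s * mtbf_s)

def wasted_fraction(checkpoint_time_s, mtbf_s):
    """Approximate fraction of wall time lost to checkpoints plus rework."""
    tau = young_interval(checkpoint_time_s, mtbf_s)
    writing = checkpoint_time_s / tau   # time spent writing checkpoints
    rework = tau / (2.0 * mtbf_s)       # expected recomputation after a failure
    return writing + rework

if __name__ == "__main__":
    node_mtbf = 5 * 365 * 86400.0       # hypothetical: 5 years per node
    checkpoint_time = 600.0             # hypothetical: 10 minutes per checkpoint
    for nodes in (1_000, 10_000, 100_000):
        m = system_mtbf(node_mtbf, nodes)
        print(nodes, round(wasted_fraction(checkpoint_time, m), 3))
```

Under this model the lost fraction grows with the square root of the node count, because system MTBF shrinks as nodes are added while the time to write a full checkpoint to disk does not.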

Application Quiescence

Goal: Develop methods to suspend application activity without hindering progress of other applications

  • Requires

    • Methods for accurate and efficient fault detection

    • Mechanisms and interfaces for conveying node state to shared services (e.g., need a functional RAS system)

  • Approach

    • Integrated system software for cooperation among shared services and applications

      • Network layer: deal with messages in transit

      • File system: isolate and suspend in-progress I/O operations
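As a toy illustration of the network-layer idea above, the sketch below models each job's traffic as a separate channel, so that one job can be quiesced (new sends refused, messages in transit drained) while another job keeps running. The `JobChannel` class and its methods are hypothetical names invented for this example; a real implementation would live in the network stack and RAS system, not in application-level Python.

```python
class JobChannel:
    """Toy model of one job's message traffic on a shared network."""

    def __init__(self, job_id):
        self.job_id = job_id
        self.accepting = True
        self.in_flight = []    # sent but not yet delivered
        self.delivered = []

    def send(self, msg):
        """Queue a message; refuse it if the job is being quiesced."""
        if not self.accepting:
            return False       # caller must hold the message until resume()
        self.in_flight.append(msg)
        return True

    def deliver_one(self):
        """Deliver the oldest message in transit, if any."""
        if self.in_flight:
            self.delivered.append(self.in_flight.pop(0))

    def quiesce(self):
        """Stop accepting sends, then drain what is already in transit."""
        self.accepting = False
        while self.in_flight:
            self.deliver_one()

    def resume(self):
        self.accepting = True

    def is_quiescent(self):
        return not self.accepting and not self.in_flight


if __name__ == "__main__":
    job_a, job_b = JobChannel("A"), JobChannel("B")
    job_a.send("halo exchange")
    job_a.quiesce()                  # only job A is suspended
    print(job_a.is_quiescent())      # job A has nothing in transit
    print(job_b.send("still running"))
```

Keeping the quiescence state per job is what lets the second job continue to make progress; a global network pause would be simpler but would violate the stated goal.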

State Management

Goal: Efficient methods for extracting and managing state


  • Identify critical state

    • Characterize memory usage

    • Investigate resource-efficient methods for logging modified memory.

    • App guidance to identify unnecessary data (e.g., ghost cells, cache)

  • System guidance for when to extract state

  • Explore diskless methods to manage state

  • Explore state compression to reduce resource requirements
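The ideas of logging only modified memory, keeping state off disk, and compressing it can be combined in a small sketch. The version below is an assumption-laden illustration, not the project's method: it detects changed blocks by comparing SHA-256 digests (a real system would more likely use page protection or hardware dirty bits), uses a hypothetical 4096-byte block size, and keeps the compressed delta in an in-memory dictionary.

```python
import hashlib
import zlib

BLOCK = 4096  # assumed block size in bytes

def block_digests(buf, block=BLOCK):
    """Digest of each fixed-size block of the buffer."""
    return [hashlib.sha256(buf[i:i + block]).digest()
            for i in range(0, len(buf), block)]

def incremental_checkpoint(buf, prev_digests, block=BLOCK):
    """Return (delta, digests); delta maps block index to compressed
    bytes for every block that changed since prev_digests."""
    digests = block_digests(buf, block)
    delta = {}
    for i, digest in enumerate(digests):
        if i >= len(prev_digests) or prev_digests[i] != digest:
            start = i * block
            delta[i] = zlib.compress(bytes(buf[start:start + block]))
    return delta, digests

def apply_delta(base, delta, block=BLOCK):
    """Rebuild state by patching changed blocks into a same-sized base."""
    out = bytearray(base)
    for i, compressed in delta.items():
        out[i * block:(i + 1) * block] = zlib.decompress(compressed)
    return out

if __name__ == "__main__":
    state = bytearray(4 * BLOCK)
    full, digests = incremental_checkpoint(state, [])   # first checkpoint: all blocks
    snapshot = bytearray(state)
    state[BLOCK + 5] = 7                                # touch one block
    delta, digests = incremental_checkpoint(state, digests)
    print(sorted(delta))                                # indices of changed blocks
    print(apply_delta(snapshot, delta) == state)
```

Application guidance, such as marking ghost cells as unnecessary, would correspond to excluding those regions from `buf` before the checkpoint is taken.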

Fault Recovery

Goal: Dynamically recover a failed node without restarting the whole application


  • Explore changes to system software to support dynamic node allocation (for swap of failed node).

  • Develop network virtualization to abstract physical node ID from software.

  • Develop efficient methods for state recovery

    • Investigate roll-back, roll-forward techniques
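The first two points above, dynamic node allocation and hiding the physical node ID from software, can be pictured as a mapping from virtual ranks to physical nodes that is updated when a spare replaces a failed node. The `NodeMap` class and the node names below are hypothetical and serve only to illustrate the idea; they do not describe an existing MPI or runtime interface.

```python
class NodeMap:
    """Toy virtual-rank to physical-node table with a pool of spares."""

    def __init__(self, physical_nodes, spares):
        self.v2p = dict(enumerate(physical_nodes))  # virtual rank -> physical node
        self.spares = list(spares)

    def physical(self, vrank):
        """Physical node currently backing a virtual rank."""
        return self.v2p[vrank]

    def replace_failed(self, failed_physical):
        """Swap a spare in for a failed node; return the affected rank."""
        for vrank, node in self.v2p.items():
            if node == failed_physical:
                if not self.spares:
                    raise RuntimeError("no spare nodes available")
                self.v2p[vrank] = self.spares.pop(0)
                return vrank
        raise KeyError(failed_physical)


if __name__ == "__main__":
    nodes = NodeMap(["n0", "n1", "n2"], spares=["s0"])
    rank = nodes.replace_failed("n1")
    # The application keeps addressing the same virtual rank;
    # only the table entry behind it has changed.
    print(rank, nodes.physical(rank))
```

Because the application only ever sees virtual ranks, the surviving nodes need no restart; only the replaced rank has to have its state restored.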


  • Recovering from independent node failures is a critical issue for exascale systems

  • We address that problem through modifications to system software

    • Support for application quiescence,

    • Efficient (diskless) state management,

    • Fast methods for fault recovery.

      Our approach represents a fundamental change in how systems support resilience

Reviewer Questions

  • Programmatic

    • Firm commitments from team if LDRD goes forward?

    • Why is funding flat for FY10 and FY11?

  • Technical

    • Is the assertion that “checkpoint overhead will exceed 50% beyond 100K nodes” too modest?

    • Why use the term “components” instead of cores or processors?

  • Technical/Programmatic

    • Can the project really address all of the proposed work?

    • With 10-11 technical topics, have we identified all the technical risks?