
Scalable Fault Tolerance for Petascale Systems 3/20/2008


Presentation Transcript


  1. Performance Measures x.x, x.x, and x.x
     Scalable Fault Tolerance for Petascale Systems, 3/20/2008
     Greg Bronevetsky, Bronis de Supinski, Peter Lindstrom, Adam Moody, Martin Schulz
     CAR - CASC
     Science & Technology Principal Directorate - Computation Directorate

  2. Enabling Fault Tolerance for Petascale Systems
     • Problem:
       • Reliability is a key concern for petascale systems
       • Current fault tolerance approaches scale poorly and use significant I/O bandwidth
     • Deliverables:
       • Efficient application checkpointing software for upcoming petascale systems
       • High-performance I/O system designs for future petascale systems
     • Ultimate objective: Reliable software on unreliable petascale hardware

  3. Our team has extensive experience implementing scalable fault tolerance and compression techniques
     • Funding request: $500k/year (none from other directorates)
     • Team members:
       • Peter Lindstrom (0.25 FTE): floating-point compression
       • Adam Moody (0.5 FTE): checkpointing / HPC systems
       • Martin Schulz (0.25 FTE): checkpointing / HPC systems
       • Greg Bronevetsky (0.25 FTE): checkpointing / soft errors
     • External collaborators (anticipated):
       • Sally McKee (Cornell University)

  4. Checkpoints on current systems are limited by the I/O bottleneck
     • Checkpoint times today:
       • BG/L: 20 minutes per checkpoint (pre-upgrade)
       • Zeus: 26 minutes
       • Argonne BG/P: 30 minutes (target)
     • Current practice: drinking the ocean through a straw
     • Alternative: flash, disks on the compute network, I/O nodes
       • Extra level of cache between the compute nodes and the parallel file system (see the sketch below)
     • [Diagram: compute network -> I/O nodes -> parallel file system (current), vs. compute network -> I/O nodes -> storage elements (proposed)]
     • Thunder checkpoint: 80 minutes to the parallel file system vs. 1 minute to local disks
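The "extra level of cache" idea can be illustrated with a minimal two-level checkpointing sketch: each rank writes every checkpoint to fast node-local storage and only periodically copies one to the parallel file system. The paths, the flush policy, and the synchronous level-2 write are illustrative assumptions (a real design would drain asynchronously through the I/O nodes), not the project's actual implementation.

```c
/* Minimal sketch of two-level checkpointing: each rank writes its checkpoint
 * to fast node-local storage first, and only every flush_interval-th
 * checkpoint is also copied to the slower parallel file system.
 * Paths and the flush interval are illustrative assumptions. */
#include <mpi.h>
#include <stdio.h>

/* Write the application state buffer to the given file. */
static void write_file(const char *path, const void *buf, size_t len)
{
    FILE *f = fopen(path, "wb");
    if (!f) { perror(path); MPI_Abort(MPI_COMM_WORLD, 1); }
    fwrite(buf, 1, len, f);
    fclose(f);
}

static void checkpoint(const void *state, size_t len, int step, int flush_interval)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char local_path[256], pfs_path[256];

    /* Level 1: node-local storage (e.g. local disk or flash). Fast, absorbs
     * most checkpoints; survives application crashes but not node loss. */
    snprintf(local_path, sizeof(local_path), "/tmp/ckpt.%d.%d", step, rank);
    write_file(local_path, state, len);

    /* Level 2: occasionally copy a checkpoint to the parallel file system.
     * Slow, but survives node failures. */
    if (step % flush_interval == 0) {
        snprintf(pfs_path, sizeof(pfs_path), "/p/lscratch/ckpt.%d.%d", step, rank);
        write_file(pfs_path, state, len);
    }

    /* Make sure all ranks finish before the application resumes computing. */
    MPI_Barrier(MPI_COMM_WORLD);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    double state[1024] = {0};                  /* placeholder application state */
    for (int step = 1; step <= 10; step++)
        checkpoint(state, sizeof(state), step, 5 /* flush every 5th checkpoint */);
    MPI_Finalize();
    return 0;
}
```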

  5. Checkpoint scalability must be improved to support coming systems such as Sequoia
     • Checkpoint size reduction
       • Incremental checkpointing: save only state that changed since the last checkpoint; changes detected via the runtime or the compiler (see the sketch below)
       • Checkpoint compression: floating point-specific, sensitive to relationships between the data
     • Scalable checkpoint coordination
       • Subsets of processors checkpoint together
       • I/O pressure spread evenly over time
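A simplified illustration of runtime-detected incremental checkpointing: split the state into fixed-size blocks and rewrite only blocks whose checksum changed since the previous checkpoint. Production systems more often track dirty pages via memory protection or compiler instrumentation; the hash-based detection and the function names here are assumptions made to keep the sketch short.

```c
/* Incremental checkpointing sketch: write only blocks whose hash changed.
 * prev_hashes must hold one entry per block and be zero-initialized before
 * the first call, so that every block is written once initially. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define BLOCK_SIZE 4096

/* FNV-1a hash of one block. */
static uint64_t block_hash(const unsigned char *p, size_t len)
{
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; i < len; i++) { h ^= p[i]; h *= 1099511628211ULL; }
    return h;
}

/* Returns the number of blocks actually written to the checkpoint file. */
size_t incremental_checkpoint(const char *path, const void *state, size_t len,
                              uint64_t *prev_hashes)
{
    FILE *f = fopen(path, "r+b");
    if (!f) f = fopen(path, "w+b");          /* first checkpoint: create file */
    if (!f) { perror(path); exit(1); }

    const unsigned char *p = state;
    size_t nblocks = (len + BLOCK_SIZE - 1) / BLOCK_SIZE;
    size_t written = 0;

    for (size_t b = 0; b < nblocks; b++) {
        size_t off = b * BLOCK_SIZE;
        size_t n   = (off + BLOCK_SIZE <= len) ? BLOCK_SIZE : len - off;
        uint64_t h = block_hash(p + off, n);
        if (h != prev_hashes[b]) {           /* block changed since last checkpoint */
            fseek(f, (long)off, SEEK_SET);
            fwrite(p + off, 1, n, f);
            prev_hashes[b] = h;
            written++;
        }
    }
    fclose(f);
    return written;
}
```

The payoff depends on the application's write pattern: codes that update only a fraction of their state between checkpoints see a proportional reduction in checkpoint I/O, while codes that touch everything gain nothing and pay the hashing cost.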

  6. Application-specific APIs will enable novel fault tolerance solutions like those used in ddcMD
     • Application semantics improve performance
     • Programmers can identify (see the API sketch below):
       • Data that doesn't need to be saved
       • Types of data structures (key for high-performance compression)
       • Matrix relationships (recomputation vs. storage)
       • Fault detection algorithms
     • Critical for soft errors
       • Ex: ddcMD corrects cache errors on BG/L
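The kind of API the slide describes might look like the following sketch: the programmer registers each data region with hints about whether it must be saved, can be recomputed on restart, or is scratch space, along with its element type so the library can apply floating point-aware compression. All names here (ckpt_register, CKPT_SAVE, and so on) are invented for illustration and are not an actual LLNL interface.

```c
/* Hypothetical application-level checkpoint API (names are illustrative). */
#include <stdio.h>
#include <stddef.h>

typedef enum {
    CKPT_SAVE,        /* must be written to the checkpoint                  */
    CKPT_RECOMPUTE,   /* cheaper to recompute from saved data on restart    */
    CKPT_SCRATCH      /* temporary buffer, never needs to be saved          */
} ckpt_policy_t;

typedef enum {
    CKPT_TYPE_DOUBLE, /* enables floating point-specific compression        */
    CKPT_TYPE_INT,
    CKPT_TYPE_BYTES
} ckpt_type_t;

/* Stub: a real library would record the region in a table and use the type
 * and policy hints at checkpoint time (e.g. FP-aware compression for
 * CKPT_TYPE_DOUBLE, skipping CKPT_SCRATCH regions entirely). */
void ckpt_register(const char *name, void *data, size_t count,
                   ckpt_type_t type, ckpt_policy_t policy)
{
    (void)data; (void)type;
    printf("registered %-12s count=%zu policy=%d\n", name, count, (int)policy);
}

/* Example: a simple particle code exposes its semantics to the library. */
void register_state(double *pos, double *vel, double *force, int *cell_index,
                    size_t n)
{
    /* Positions and velocities define the physical state: save them, and mark
     * them as doubles so the library can compress them effectively. */
    ckpt_register("positions",  pos, 3 * n, CKPT_TYPE_DOUBLE, CKPT_SAVE);
    ckpt_register("velocities", vel, 3 * n, CKPT_TYPE_DOUBLE, CKPT_SAVE);

    /* Forces can be recomputed from positions after restart: don't save them. */
    ckpt_register("forces", force, 3 * n, CKPT_TYPE_DOUBLE, CKPT_RECOMPUTE);

    /* The cell index is a scratch acceleration structure rebuilt every step. */
    ckpt_register("cell_index", cell_index, n, CKPT_TYPE_INT, CKPT_SCRATCH);
}

int main(void)
{
    enum { N = 1000 };
    static double pos[3 * N], vel[3 * N], force[3 * N];
    static int cell_index[N];
    register_state(pos, vel, force, cell_index, N);
    return 0;
}
```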

  7. Our project will create a paradigm shift in LLNL application reliability
     • LLNL practice: users write their own checkpointing code
       • Wastes programmer time
       • Checkpointing at global barriers is unscalable
     • Current automated solutions do not scale
       • Very large checkpoints
       • No information about the application
     • This project will:
       • Match I/O demands to I/O capacity
       • Minimize programmer effort
       • Scale checkpointing to petascale systems
       • Enable application-specific fault tolerance solutions

  8. Fault tolerance is critical for Sequoia and all future platforms
     • CAR S&T Strategy 1.1: “Perform the research to develop new algorithms that can best exploit likely HPC hardware characteristics, including … fault-tolerant algorithms that can withstand processor failure”
     • Project enables application fault tolerance
       • Target audience: application developers
       • pf3d uses Adam Moody’s in-memory checkpointer (the general idea is sketched below)
       • ddcMD implements complex error tolerance schemes
     • Deliverables:
       • Efficient application checkpointing software for upcoming petascale systems (e.g. Sequoia)
       • High-performance I/O system designs for future petascale systems
       • Application-specific fault tolerance APIs
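For context, the general idea behind in-memory checkpointing is partner replication: each rank keeps a redundant copy of its checkpoint in another rank's memory, so a single failed node can be recovered from its partner without touching the parallel file system. The sketch below shows that generic pattern only; it is not the actual pf3d checkpointer, and the simple partner mapping and equal-size buffers are assumptions.

```c
/* Generic sketch of in-memory partner checkpointing (not the pf3d code).
 * Assumes an even number of ranks, equal checkpoint sizes on both partners,
 * and a checkpoint small enough that its byte count fits in an int. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

void partner_checkpoint(const void *my_state, size_t len, void **partner_copy)
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size % 2 != 0) {                       /* keep the mapping symmetric */
        if (rank == 0) fprintf(stderr, "sketch assumes an even number of ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* Partner is half the job away, so partners land on different nodes
     * under a block rank-to-node mapping. */
    int partner = (rank + size / 2) % size;

    /* Allocate (or reuse) space for the partner's checkpoint copy. */
    if (*partner_copy == NULL)
        *partner_copy = malloc(len);

    /* Exchange checkpoints: send mine to my partner, receive theirs. */
    MPI_Sendrecv(my_state, (int)len, MPI_BYTE, partner, 0,
                 *partner_copy, (int)len, MPI_BYTE, partner, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Each checkpoint now lives in two nodes' memory; after a single-node
     * failure the replacement rank pulls the copy from the surviving partner
     * instead of reading from the file system. */
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    double state[1024] = {0};                  /* placeholder checkpoint data */
    void *partner_copy = NULL;
    partner_checkpoint(state, sizeof(state), &partner_copy);
    free(partner_copy);
    MPI_Finalize();
    return 0;
}
```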
