
A Scalable Double In-memory Checkpoint and Restart Scheme Towards Exascale

Gengbin Zheng

Xiang Ni

Laxmikant V. Kale

Parallel Programming Lab

University of Illinois at Urbana-Champaign


Motivation

  • As machines grow in size

    • MTBF decreases

      • Jaguar had 2.33 average failures/day from 2008 to 2010

    • Applications have to tolerate faults

  • Challenges for exascale:

    • Disk-based (NFS reliable disk) checkpointing is slow

    • System-level checkpointing can be expensive

    • Scalable checkpointing/restart can be a communication intensive process

    • Job schedulers prevent fault tolerance support in the runtime

Charm++ Workshop 2012


Motivation (cont.)

  • Applications on future exascale machines need fast, low-cost, and scalable fault tolerance support

  • Previous work:

    • Double in-memory checkpoint/restart scheme

      • In the production version of Charm++ since 2004



Double in-memory Checkpoint/Restart Protocol

[Figure: objects A through J distributed across PE0, PE1, PE2, and PE3, with checkpoint 1 and checkpoint 2 copies of the objects. Second panel: PE1 crashed (lost 1 processor); PE0, PE2, and PE3 hold the objects, including the restored ones. Legend: object, restored object, checkpoint 1, checkpoint 2.]
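The protocol in the figure can be illustrated with a small toy model. This is a sketch only, not Charm++ code: the `PE` class, the ring-style `buddy_of` mapping, and the round-robin redistribution of lost objects are all assumptions made for illustration. It shows the core idea that each object's checkpoint is kept in two memories, so losing one processor still leaves a copy to restore from.

```python
import copy


class PE:
    """Hypothetical processing element with live objects and two checkpoint slots."""

    def __init__(self, rank):
        self.rank = rank
        self.objects = {}      # name -> live state
        self.local_ckpt = {}   # checkpoint copy 1: this PE's own objects
        self.buddy_ckpt = {}   # checkpoint copy 2: held on behalf of another PE


def buddy_of(rank, num_pes):
    """Assumed buddy mapping: the next PE in a ring."""
    return (rank + 1) % num_pes


def checkpoint(pes):
    """Store every PE's objects twice: locally and in its buddy's memory."""
    for pe in pes:
        snapshot = copy.deepcopy(pe.objects)
        pe.local_ckpt = snapshot
        pes[buddy_of(pe.rank, len(pes))].buddy_ckpt = copy.deepcopy(snapshot)


def crash_and_restart(pes, failed_rank):
    """Drop one PE, then roll all objects back to the last checkpoint."""
    # The buddy still holds the second copy of the failed PE's objects.
    lost = pes[buddy_of(failed_rank, len(pes))].buddy_ckpt
    survivors = [pe for pe in pes if pe.rank != failed_rank]
    for pe in survivors:
        pe.objects = copy.deepcopy(pe.local_ckpt)
    # Assumed policy: spread the lost objects over the survivors round-robin.
    for i, (name, state) in enumerate(sorted(lost.items())):
        survivors[i % len(survivors)].objects[name] = copy.deepcopy(state)
    return survivors
```

A real runtime would also re-establish two checkpoint copies for every object after the restart; the sketch omits that step.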



Runtime Support for FT

  • Automatically checkpointing threads

    • Including stack and heap (isomalloc)

  • User helper functions

    • To pack and unpack data

      • Checkpointing only the live variables
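As an illustration of user helper functions that pack and unpack only live variables, here is a minimal sketch. Charm++ expresses this in C++ through its PUP (pack/unpack) framework; the Python `Cell` class, its fields, and the use of `pickle` below are hypothetical stand-ins chosen only to show the idea.

```python
import pickle


class Cell:
    """Hypothetical application object with live data and scratch data."""

    def __init__(self, positions, velocities):
        self.positions = positions      # live: needed to resume the run
        self.velocities = velocities    # live: needed to resume the run
        # Scratch buffer, recomputed every step, so it is never checkpointed.
        self.force_scratch = [0.0] * len(positions)

    def pack(self):
        """Serialize only the live variables."""
        return pickle.dumps({"positions": self.positions,
                             "velocities": self.velocities})

    @classmethod
    def unpack(cls, blob):
        """Rebuild the object; scratch data is re-created, not restored."""
        live = pickle.loads(blob)
        return cls(live["positions"], live["velocities"])
```

Leaving the scratch buffer out keeps the checkpoint smaller than a full copy of the object, which matters when checkpoints are held in memory.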



Local Disk-Based Protocol

  • Double in-memory checkpointing

    • Memory concern

    • Pick a checkpoint time when the global state is small

      • MD, N-body, quantum chemistry

  • Double in-disk checkpointing

    • Make use of local disk (or SSD)

    • Also does not rely on any reliable storage

    • Useful for applications with very big memory footprint
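The double in-disk variant can be sketched the same way, with local directories standing in for node-local disks. Everything here is an assumption for illustration: the function names, the JSON file layout, the ring-style buddy choice, and the fact that the buddy's disk is written directly (a real system would send the data over the network).

```python
import json
import os
import tempfile


def write_checkpoint(node_dirs, rank, state):
    """Write state to this node's local disk and to its buddy's local disk."""
    buddy = (rank + 1) % len(node_dirs)  # assumed buddy mapping
    for node in (rank, buddy):
        path = os.path.join(node_dirs[node], f"ckpt_rank{rank}.json")
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        # Rename after a complete write so a reader never sees a partial file.
        os.replace(tmp, path)


def read_checkpoint(node_dirs, rank, failed=()):
    """Read rank's checkpoint from its own disk, else from its buddy's disk."""
    buddy = (rank + 1) % len(node_dirs)
    for node in (rank, buddy):
        if node in failed:
            continue  # that node's disk is gone along with the node
        path = os.path.join(node_dirs[node], f"ckpt_rank{rank}.json")
        if os.path.exists(path):
            with open(path) as f:
                return json.load(f)
    raise FileNotFoundError(f"no surviving checkpoint for rank {rank}")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        dirs = [os.path.join(root, f"node{r}") for r in range(3)]
        for d in dirs:
            os.mkdir(d)
        write_checkpoint(dirs, 1, {"step": 10})
        print(read_checkpoint(dirs, 1, failed={1}))
```

As with the in-memory scheme, no reliable shared storage is involved: the second copy on the buddy's disk is what survives a node failure.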



Previous Results: Performance Comparisons with Traditional Disk-based Checkpointing



Previous Results: Restart with Load Balancing

LeanMD, Apoa1, 128 processors



Previous Result: Recovery Performance

  • 10 crashes

  • 128 processors

  • Checkpoint every 10 time steps



FT on MPI-based Charm++

  • Practical challenge: job scheduler

    • Job scheduler kills the entire job when a process fails

  • MPI-based Charm++ is portable on major supercomputers

  • A fault injection scheme in the MPI machine layer

    • DieNow()

      • MPI process stops responding

      • Fault detection by keep-alive messages

    • Spare processors to replace failed ones

    • Demonstrated on 64K cores of BG/P machine
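Keep-alive fault detection and spare-processor replacement, as listed above, might look roughly like the following. This is a hedged sketch, not the Charm++ machine layer: `FailureDetector`, `replace_failed`, and the timeout policy are hypothetical names and choices, and timestamps are passed in explicitly so the logic is deterministic.

```python
class FailureDetector:
    """Marks a rank as failed when no keep-alive arrived within `timeout`."""

    def __init__(self, ranks, timeout):
        self.timeout = timeout
        self.last_seen = {rank: 0.0 for rank in ranks}

    def heartbeat(self, rank, now):
        """Record a keep-alive message from `rank` at time `now`."""
        self.last_seen[rank] = now

    def failed(self, now):
        """Return the ranks whose last keep-alive is older than the timeout."""
        return sorted(r for r, t in self.last_seen.items()
                      if now - t > self.timeout)


def replace_failed(active, spares, failed):
    """Swap each failed rank for a spare processor; error out if none is left."""
    active, spares = list(active), list(spares)
    for rank in failed:
        if not spares:
            raise RuntimeError("no spare processors left")
        active[active.index(rank)] = spares.pop(0)
    return active, spares
```

A process that has called something like `DieNow()` simply stops sending keep-alives, so it shows up in `failed()` once the timeout elapses and a spare takes its slot.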



Performance at Large Scale



Optimization for Scalability

  • Communication bottlenecks

    • Checkpoint/restart takes O(P) time

  • Optimizations:

    • Collectives (barriers)

      • Replace the O(P) barrier with a tree-based barrier

    • Stale message handling

      • Epoch number

      • A phase to discard stale messages as quickly as possible

    • Small messages

      • Streaming optimization
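Two of the optimizations above can be sketched briefly: a tree-shaped barrier whose critical path grows with the tree depth rather than with P, and epoch numbers used to discard stale messages after a restart. The k-ary tree numbering, the function names, and the `EpochFilter` class are assumptions for illustration, not the Charm++ implementation.

```python
def tree_parent(rank, k=2):
    """Parent of `rank` in a k-ary tree rooted at rank 0 (None for the root)."""
    return None if rank == 0 else (rank - 1) // k


def tree_children(rank, num_pes, k=2):
    """Children of `rank` in the same k-ary tree."""
    return [c for c in range(k * rank + 1, k * rank + k + 1) if c < num_pes]


def tree_depth(num_pes, k=2):
    """Hops from the deepest rank to the root: the barrier's critical path."""
    depth, rank = 0, num_pes - 1  # the highest rank sits on the deepest level
    while rank != 0:
        rank = (rank - 1) // k
        depth += 1
    return depth


class EpochFilter:
    """Stamps messages with an epoch and flags those from before a restart."""

    def __init__(self):
        self.epoch = 0

    def restart(self):
        """Advance the epoch; everything stamped earlier becomes stale."""
        self.epoch += 1

    def stamp(self, payload):
        return (self.epoch, payload)

    def is_stale(self, message):
        epoch, _payload = message
        return epoch < self.epoch
```

In a flat barrier the root handles P - 1 messages one after another, whereas in the tree each PE only waits for its own children, so for 65536 PEs and k = 2 the critical path in this sketch is 16 hops.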



LeanMD Checkpoint Time before/after Optimization



Checkpoint Time for Jacobi/AMPI

Kraken



LeanMD Restart Time



Conclusions and Future Work

  • After optimization, in-memory checkpointing scales towards exascale

  • A short paper was accepted at the 2nd Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS 2012)

  • Future work:

    • Non-blocking checkpointing
