A Scalable Double In-memory Checkpoint and Restart Scheme Towards Exascale

Gengbin Zheng

Xiang Ni

Laxmikant V. Kale

Parallel Programming Lab

University of Illinois at Urbana-Champaign


Motivation

  • As machines grow in size

    • MTBF decreases

      • Jaguar had 2.33 average failures/day from 2008 to 2010

    • Applications have to tolerate faults

  • Challenges for exascale:

    • Disk-based (NFS reliable disk) checkpointing is slow

    • System-level checkpointing can be expensive

    • Scalable checkpointing/restart can be a communication intensive process

    • Job schedulers prevent fault tolerance support in the runtime

Charm++ Workshop 2012


Motivation (cont.)

  • Applications on future exascale machines need fast, low-cost, and scalable fault tolerance support

  • Previous work:

    • Double in-memory checkpoint/restart scheme

      • In the production version of Charm++ since 2004

Double in-memory Checkpoint/Restart Protocol

[Figure: objects A through J distributed over four processors (PE0 to PE3), shown together with their two checkpoint copies (checkpoint 1 and checkpoint 2). A second panel, labeled "PE1 crashed (lost 1 processor)", shows the objects on the remaining processors PE0, PE2, and PE3. Legend: checkpoint 1, checkpoint 2, object, restored object.]
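The double in-memory protocol named on this slide can be illustrated with a small simulation. This is a minimal sketch, not the Charm++ implementation: the function names are hypothetical, and the ring-style buddy mapping (PE i stores its second copy on PE (i + 1) mod P) is an assumption made only for illustration.

```python
# Minimal simulation of double in-memory checkpointing (illustrative only).
# Assumption: PE i keeps its second checkpoint copy on buddy PE (i + 1) % P.

def buddy_of(pe, num_pes):
    """Hypothetical buddy mapping: the next PE in a ring."""
    return (pe + 1) % num_pes


def checkpoint(objects_by_pe):
    """Store two copies of each PE's objects: one locally, one on its buddy."""
    num_pes = len(objects_by_pe)
    local = {pe: dict(objs) for pe, objs in objects_by_pe.items()}
    # remote[b] holds (owner, copy) for the PE whose buddy is b.
    remote = {buddy_of(pe, num_pes): (pe, dict(objs))
              for pe, objs in objects_by_pe.items()}
    return local, remote


def restart_after_crash(crashed_pe, local, remote, num_pes):
    """Roll survivors back to their local copies and recover the crashed
    PE's objects from the copy held by its buddy."""
    survivors = [pe for pe in range(num_pes) if pe != crashed_pe]
    state = {pe: dict(local[pe]) for pe in survivors}
    owner, lost_objects = remote[buddy_of(crashed_pe, num_pes)]
    assert owner == crashed_pe
    # Spread the recovered objects round-robin over the surviving PEs.
    for i, (name, value) in enumerate(sorted(lost_objects.items())):
        state[survivors[i % len(survivors)]][name] = value
    return state


objects = {0: {"A": 1, "B": 2, "C": 3}, 1: {"D": 4, "E": 5},
           2: {"F": 6, "G": 7}, 3: {"H": 8, "I": 9, "J": 10}}
local, remote = checkpoint(objects)
objects[0]["A"] = 99          # progress made after the checkpoint is lost
state = restart_after_crash(1, local, remote, num_pes=4)
```

Because every object has a copy on two different processors, a single processor failure never destroys both copies, and no reliable disk is needed for recovery.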

Runtime Support for FT

  • Automatically checkpointing threads

    • Including stack and heap (isomalloc)

  • User helper functions

    • To pack and unpack data

      • Checkpointing only the live variables
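The idea of user helper functions that pack and unpack only the live variables can be sketched as follows. Charm++ itself provides this through its C++ PUP framework; the Python class below is a hypothetical stand-in written for illustration, and the names `JacobiBlock`, `pack`, and `unpack` are assumptions rather than any real API.

```python
import pickle


class JacobiBlock:
    """Hypothetical application object with user-written pack/unpack helpers."""

    def __init__(self, iteration, grid):
        self.iteration = iteration          # live: needed to resume
        self.grid = grid                    # live: needed to resume
        self.scratch = [0.0] * len(grid)    # not live: rebuilt on restart

    def pack(self):
        """Serialize only the live variables into a checkpoint blob."""
        return pickle.dumps({"iteration": self.iteration, "grid": self.grid})

    @classmethod
    def unpack(cls, blob):
        """Rebuild the object from a checkpoint; scratch space is re-created."""
        live = pickle.loads(blob)
        return cls(live["iteration"], live["grid"])


block = JacobiBlock(7, [1.0, 2.0, 3.0])
block.scratch[0] = 5.0                      # temporary data, not checkpointed
restored = JacobiBlock.unpack(block.pack())
```

Leaving temporary buffers out of the checkpoint is what keeps the in-memory copies small compared with a full stack-and-heap snapshot.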

Local Disk-Based Protocol

  • Double in-memory checkpointing

    • Memory concern

    • Pick checkpointing time where global state is small

      • MD, N-body, quantum chemistry

  • Double in-disk checkpointing

    • Make use of local disk (or SSD)

    • Also does not rely on any reliable storage

    • Useful for applications with very big memory footprint

Previous Results: Performance Comparisons with Traditional Disk-based Checkpointing

Previous Results: Restart with Load Balancing

LeanMD, Apoa1, 128 processors

Previous Result: Recovery Performance

  • 10 crashes

  • 128 processors

  • Checkpoint every 10 time steps

FT on MPI-based Charm++

  • Practical challenge: job scheduler

    • Job scheduler kills the entire job when a process fails

  • MPI-based Charm++ is portable across major supercomputers

  • A fault injection scheme in MPI machine layer

    • DieNow()

      • MPI process stops responding

      • Fault detection by keep-alive messages

    • Spare processors to replace failed ones

    • Demonstrated on 64K cores of BG/P machine
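Fault detection by keep-alive messages and replacement by spare processors can be sketched roughly as below. This is an illustrative sketch only: the function names, the timeout value, and the rank numbers are assumptions, and nothing here reflects the actual MPI machine layer code.

```python
def detect_failed(last_heard, now, timeout):
    """Return ranks whose last keep-alive is older than `timeout` seconds."""
    return sorted(rank for rank, t in last_heard.items() if now - t > timeout)


def replace_with_spares(failed, spares):
    """Map each failed rank to a spare rank; raise if spares run out."""
    if len(failed) > len(spares):
        raise RuntimeError("not enough spare processors")
    return dict(zip(failed, spares))


# Timestamps (in seconds) of the last keep-alive received from each rank.
last_heard = {0: 10.0, 1: 2.0, 2: 9.5}
failed = detect_failed(last_heard, now=10.5, timeout=5.0)
replacement = replace_with_spares(failed, spares=[64, 65])
```

Keeping the failed process's slot filled by a spare lets the rest of the job continue with the same number of working processors, instead of the job scheduler killing the whole job.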

Performance at Large Scale

Optimization for Scalability

  • Communication bottlenecks

    • Checkpoint/restart takes O(P) time

  • Optimizations:

    • Collectives (barriers)

      • Switch O(P) barrier to a tree-based barrier

    • Stale message handling

      • Epoch number

      • A phase to discard stale messages as quickly as possible

    • Small messages

      • Streaming optimization
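Two of the optimizations listed above lend themselves to a short illustration: replacing an O(P) barrier with a tree-based one, and using an epoch number to discard stale messages. The sketch below is hedged: the message layout (a dict with an `epoch` field), the rule that the epoch increases at each restart, and all function names are assumptions for illustration.

```python
def flat_barrier_steps(num_pes):
    """A root that hears from every other PE in turn handles P - 1 messages."""
    return num_pes - 1


def tree_barrier_depth(num_pes, fanout=2):
    """Depth of a complete `fanout`-ary tree large enough to hold num_pes PEs."""
    depth, level_size, total = 0, 1, 1
    while total < num_pes:
        level_size *= fanout
        total += level_size
        depth += 1
    return depth


def discard_stale(messages, current_epoch):
    """Keep only messages stamped with the current epoch or later.

    Assumption: the epoch is incremented at every restart, so anything
    carrying a smaller epoch was sent before the rollback.
    """
    return [m for m in messages if m["epoch"] >= current_epoch]


inbox = [{"epoch": 0, "data": "pre-crash"}, {"epoch": 1, "data": "post-restart"}]
fresh = discard_stale(inbox, current_epoch=1)
```

For 65,536 processors the flat scheme has the root process 65,535 messages in sequence, while a binary tree needs only 16 levels, which is the kind of gap that matters at this scale.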

LeanMD Checkpoint Time before/after Optimization

Checkpoint Time for Jacobi/AMPI

Kraken

LeanMD Restart Time

Conclusions and Future Work

  • In-memory checkpointing after optimization is scalable towards Exascale

  • A short paper has been accepted at the 2nd Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS 2012)

  • Future work:

    • Non-blocking checkpointing
