
Scalable Failure Recovery for Tree-based Overlay Networks


Presentation Transcript


1. Scalable Failure Recovery for Tree-based Overlay Networks
Dorian C. Arnold, University of Wisconsin
Paradyn/Condor Week, April 30 – May 3, 2007, Madison, WI

2. Overview
• Motivation
  • Address the likely frequent failures in extreme-scale systems
• State Compensation
  • Tree-based overlay network (TBŌN) failure recovery
  • Use surviving state to compensate for lost state
  • Leverage TBŌN properties: inherent information redundancies
  • Weak data consistency model: convergent recovery
    • Final output stream converges to the non-failure case
    • Intermediate output packets may differ
    • Preserves all output information

3. HPC System Trends
• 60% of systems are larger than 10^3 processors
• 10 systems are larger than 10^4 processors

4. Large Scale System Reliability
[Figure: reliability data for BlueGene/L and Purple]

5. Current Reliability Approaches
• Fail-over (hot backup)
  • Replace the failed primary with a backup replica
  • Extremely high overhead: 100% minimum!
• Rollback recovery
  • Roll back to a checkpoint after failure
  • May require dedicated resources and lead to overloaded network/storage resources

6. Our Approach: State Compensation
• Leverage inherent TBŌN redundancies
  • Avoid explicit replication
  • No overhead during normal operation
• Rapid recovery
  • Limited process participation
• General recovery model
  • Applies to broad classes of computations

7. Background: TBŌN Model
[Figure: an application front-end (FE) at the root, a tree of communication processes (CP) beneath it, and application back-ends (BE) at the leaves; each CP holds filter state and applies a packet filter. TBŌN input enters at the back-ends and TBŌN output emerges at the front-end.]
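The model on this slide can be captured in a few lines of code. The sketch below only illustrates the structure as described (a front-end at the root, a tree of communication processes, back-ends at the leaves, with filter state and input channels at each CP); all type and field names are invented for this example and are not the MRNet API.

```cpp
// Illustrative sketch of the TBON structure described above; not the MRNet API.
#include <memory>
#include <vector>

struct Packet { std::vector<int> values; };              // data flowing up the tree

struct CommProcess {
    std::vector<std::unique_ptr<CommProcess>> children;  // empty at back-end leaves
    std::vector<Packet> channel_state;                   // packets received but not yet filtered
    std::vector<int>    filter_state;                    // state held by the aggregation filter
};

// The application front-end sits at the root of the tree of communication
// processes; TBON input enters at the back-ends and TBON output leaves the root.
using FrontEnd = CommProcess;
```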

8. A Trip Down Theory Lane
• Formal treatment provides confidence in the recovery model
  • Reason about semantics before/after recovery
  • Recovery model doesn't change the computation
• Prove algorithmic soundness
• Understand recovery model characteristics
• System reliability cannot depend upon intuition and ad hoc reasoning

9. Theory Overview
• TBŌN end-to-end argument: output depends only on state at the end-points
• Can recover from the loss of any internal filter and channel states
[Figure: tree with CP_i at the root, CP_j below it, and leaves CP_k and CP_l, annotated with internal filter state and channel state]

10. Theory Overview (cont'd)
• TBŌN Output Theorem: output depends only on the channel states and the root's filter state
• All-encompassing Leaf State Theorem: the state at the leaves subsumes the channel state (all state throughout the TBŌN)
• Result: only leaf state is needed to recover from root/internal failures
[Figure: tree of CP_i, CP_j, CP_k, CP_l annotated with filter state and channel state]

11. Theory Overview (cont'd)
• TBŌN Output Theorem
• All-encompassing Leaf State Theorem
• Builds on the Inherent Redundancy Theorem

12. Background: Notation
[Figure: tree with CP_i at the root above CP_j, whose children are CP_k and CP_l, labeled with filter states fs_n(CP_i) and fs_p(CP_j) and channel states cs_{n,p}(CP_i) and cs(CP_j)]

13. Background: Data Aggregation
Filter function: f( in_n(CP_i), fs_n(CP_i) ) → ( out_n(CP_i), fs_{n+1}(CP_i) )
The filter takes the packets from the input channels and the current filter state, and produces an output packet and an updated filter state.
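A minimal sketch of this interface, assuming a max-aggregation filter purely as an example; the type names and `max_filter` are hypothetical, chosen only to mirror the reconstructed signature (packets from the input channels and the current filter state in, an output packet and the updated filter state out).

```cpp
// Sketch of f( in_n(CP_i), fs_n(CP_i) ) -> ( out_n(CP_i), fs_{n+1}(CP_i) ),
// using "maximum seen so far" as an example aggregation. Illustrative only.
#include <algorithm>
#include <utility>
#include <vector>

struct Packet { std::vector<int> values; };
using FilterState = std::vector<int>;   // here: empty, or one running maximum

std::pair<Packet, FilterState>
max_filter(const std::vector<Packet>& inputs, FilterState state)
{
    // Fold every value waiting on the input channels into the running maximum.
    for (const Packet& p : inputs) {
        for (int v : p.values) {
            if (state.empty())
                state.push_back(v);
            else
                state[0] = std::max(state[0], v);
        }
    }
    Packet out{state};     // out_n: the current maximum
    return {out, state};   // fs_{n+1}: state carried into the next round
}
```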

14. Background: Filter Function
• Built on state join and difference operators
• State join operator ⊔ updates the current state by merging inputs: fs_n(CP_i) ⊔ in_n(CP_i) → fs_{n+1}(CP_i)
• Commutative: a ⊔ b = b ⊔ a
• Associative: (a ⊔ b) ⊔ c = a ⊔ (b ⊔ c)
• Idempotent: a ⊔ a = a
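To make the three properties concrete, the sketch below uses set union as the join; set union is just one operator that happens to be commutative, associative, and idempotent, not the join any particular filter is required to use.

```cpp
// Set union as an example state join: commutative, associative, idempotent.
#include <cassert>
#include <set>

using State = std::set<int>;

State join(const State& a, const State& b)
{
    State out = a;
    out.insert(b.begin(), b.end());
    return out;
}

int main()
{
    State a{1, 2}, b{2, 3}, c{4};
    assert(join(a, b) == join(b, a));                    // commutative
    assert(join(join(a, b), c) == join(a, join(b, c)));  // associative
    assert(join(a, a) == a);                             // idempotent
    return 0;
}
```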

15. Background: Descendant Notation
[Figure: desc^0(CP_i) is CP_i itself, desc^1(CP_i) its children, and so on down through desc^{k-1}(CP_i) to the leaves desc^k(CP_i)]
• fs( desc^k(CP_i) ): join of the filter states of the specified processes
• cs( desc^k(CP_i) ): join of the channel states of the specified processes
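A small sketch of this notation, again using set union as the join: `desc` collects the processes k levels below a node and `fs_of_level` joins their filter states. The node layout and function names are invented for illustration.

```cpp
// desc^k(root): the processes k levels below `root`.
// fs(desc^k(root)): the join of their filter states (set union here).
#include <set>
#include <vector>

using State = std::set<int>;

struct CP {
    State filter_state;
    std::vector<const CP*> children;   // non-owning, for illustration only
};

State join(const State& a, const State& b)
{
    State out = a;
    out.insert(b.begin(), b.end());
    return out;
}

std::vector<const CP*> desc(const CP& root, int k)
{
    if (k == 0) return {&root};
    std::vector<const CP*> level;
    for (const CP* child : root.children)
        for (const CP* cp : desc(*child, k - 1))
            level.push_back(cp);
    return level;
}

State fs_of_level(const CP& root, int k)
{
    State acc;
    for (const CP* cp : desc(root, k))
        acc = join(acc, cp->filter_state);
    return acc;
}

int main()
{
    CP leaf1{{1, 2}, {}}, leaf2{{3}, {}};
    CP root{{}, {&leaf1, &leaf2}};
    return fs_of_level(root, 1) == State{1, 2, 3} ? 0 : 1;  // fs(desc^1(root))
}
```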

16. TBŌN Properties: Inherent Redundancy Theorem
The join of a CP's filter state with its pending channel state equals the join of the CP's children's filter states:
fs_n(CP_0) ⊔ cs_{n,p}(CP_0) ⊔ cs_{n,q}(CP_0) = fs_p(CP_1) ⊔ fs_q(CP_2)
[Figure: CP_0 with children CP_1 and CP_2]
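This invariant can be checked on a toy case. The sketch below simulates one parent (CP0) with two children (CP1, CP2) using the set-union join from the earlier sketch; the concrete states are made up solely to illustrate the theorem: whatever a child has folded into its filter state is either still in flight to the parent or already folded into the parent's filter state.

```cpp
// Toy check of the Inherent Redundancy Theorem with set union as the join.
#include <cassert>
#include <set>

using State = std::set<int>;

State join(const State& a, const State& b)
{
    State out = a;
    out.insert(b.begin(), b.end());
    return out;
}

int main()
{
    State fs_cp1{1, 2, 3};   // child CP1's filter state
    State fs_cp2{4, 5};      // child CP2's filter state

    // CP1's output has reached CP0 and been filtered; CP2's output is still
    // pending on CP0's input channel.
    State fs_cp0{1, 2, 3};   // parent CP0's filter state
    State cs_cp0{4, 5};      // CP0's pending channel state

    // fs(CP0) joined with cs(CP0) equals fs(CP1) joined with fs(CP2).
    assert(join(fs_cp0, cs_cp0) == join(fs_cp1, fs_cp2));
    return 0;
}
```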

17. TBŌN Properties: Inherent Redundancy Theorem
The join of a CP's filter state with its pending channel state equals the join of the CP's children's filter states:
fs( desc^0(CP_0) ) ⊔ cs( desc^0(CP_0) ) = fs( desc^1(CP_0) )
[Figure: CP_0 with children CP_1 and CP_2]

18. TBŌN Properties: All-encompassing Leaf State Theorem
The join of the states from a sub-tree's leaves equals the join of the states at the sub-tree's root and all in-flight data.
[Figure: three-level tree with root CP_0, internal nodes CP_1 and CP_2, and leaves CP_3, CP_4, CP_5, CP_6]

19. TBŌN Properties: All-encompassing Leaf State Theorem
The join of the states from a sub-tree's leaves equals the join of the states at the sub-tree's root and all in-flight data:
fs( desc^k(CP_0) ) = fs( desc^0(CP_0) ) ⊔ cs( desc^0(CP_0) ) ⊔ … ⊔ cs( desc^{k-1}(CP_0) )
From the Inherent Redundancy Theorem, applied level by level:
fs( desc^1(CP_0) ) = fs( desc^0(CP_0) ) ⊔ cs( desc^0(CP_0) )
fs( desc^2(CP_0) ) = fs( desc^1(CP_0) ) ⊔ cs( desc^1(CP_0) )
…
fs( desc^k(CP_0) ) = fs( desc^{k-1}(CP_0) ) ⊔ cs( desc^{k-1}(CP_0) )

20. State Composition
• Motivated by the previous theory
  • State at the leaves of a sub-tree subsumes state throughout the higher levels
• Compose state below failure zones to compensate for lost state
  • Addresses root and internal failures
• State decomposition for leaf failures
  • Generate the child's state from the parent's and siblings' state

21. State Composition
• If CP_j fails, all state associated with CP_j is lost
• TBŌN Output Theorem: output depends only on the channel states and the root filter state
• All-encompassing Leaf State Theorem: state at the leaves subsumes the channel state (all state throughout the TBŌN)
• Therefore, leaf states can replace the lost channel state without changing the computation's semantics
[Figure: CP_i above failed CP_j and its children CP_k and CP_l, annotated with cs(CP_i, m), fs(CP_j), and cs(CP_j)]

22. State Composition Algorithm
if detect child failure
    remove failed child from input list
    resume filtering from non-failed children
endif
if detect parent failure
    do
        determine / connect to new parent
    while failure to connect
    propagate filter state to new parent
endif
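Below is a hedged C++ rendering of the pseudocode above. Every type and helper (`ChildId`, `ParentConnection`, `pick_new_parent`, `try_connect`, and so on) is a placeholder invented for this sketch rather than part of MRNet or the authors' implementation; the intent is only to show the shape of the two failure-handling paths.

```cpp
// Sketch of the state composition algorithm; all names are hypothetical stubs.
#include <optional>
#include <vector>

struct ChildId          { int rank; };
struct ParentConnection { /* transport handle omitted */ };
struct FilterState      { std::vector<int> values; };

class CommProcess {
public:
    // "if detect child failure": drop the dead channel and keep filtering
    // over the surviving children; the theory above guarantees their
    // descendants' state compensates for the loss.
    void on_child_failure(ChildId failed) {
        remove_from_input_list(failed);
    }

    // "if detect parent failure": retry connecting to a new parent until one
    // accepts, then propagate this process's filter state to it.
    void on_parent_failure() {
        std::optional<ParentConnection> parent;
        do {
            parent = try_connect(pick_new_parent());
        } while (!parent);
        send_filter_state(*parent, filter_state_);
    }

private:
    FilterState filter_state_;

    // Stubs standing in for real membership, connection, and transport code.
    void remove_from_input_list(ChildId) {}
    ChildId pick_new_parent() { return ChildId{0}; }
    std::optional<ParentConnection> try_connect(ChildId) { return ParentConnection{}; }
    void send_filter_state(ParentConnection&, const FilterState&) {}
};
```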

23. Summary: Theory can be Good!
• Allows us to make recovery guarantees
• Yields sensible, understandable results
• Better-informed implementation
  • What needs to be implemented
  • What does not need to be implemented

24. References
• Arnold and Miller, "State Compensation: A Scalable Failure Recovery Model for Tree-based Overlay Networks", UW-CS Technical Report, February 2007.
• Other papers and software: http://www.paradyn.org/mrnet

25. Bonus Slides!

26. State Composition Performance
recovery latency = (connection establishment × max adoptees) + output overhead
LAN connection establishment: ~1 millisecond
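A worked instance of the formula as reconstructed above: only the ~1 ms LAN connection-establishment figure comes from the slide; the adoptee count and output overhead below are assumed numbers, included just to show how the terms combine.

```cpp
// Worked example of: recovery latency =
//   (connection establishment x max adoptees) + output overhead.
#include <cstdio>

int main()
{
    const double connection_establishment_ms = 1.0;  // LAN figure from the slide
    const int    max_adoptees                = 4;    // assumed: most orphans adopted by one new parent
    const double output_overhead_ms          = 0.5;  // assumed

    const double recovery_latency_ms =
        connection_establishment_ms * max_adoptees + output_overhead_ms;

    std::printf("estimated recovery latency: %.1f ms\n", recovery_latency_ms);
    return 0;
}
```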

27. Failure Model
• Fail-stop
• Multiple, simultaneous failures
  • Failure zones: regions of contiguous failure
• Application process failures
  • May be viewed as sequential data sources/sinks
  • Amenable to basic reliability mechanisms: simple restart, sequential checkpointing
