DRAIN: Distributed Recovery Architecture for Fault-Tolerant Multi-core Chips

DRAIN: Distributed Recovery Architecture for Inaccessible Nodes in Multi-core Chips Andrew DeOrio†, KonstantinosAisopos‡§Valeria Bertacco†, Li-ShiuanPeh§ †University of Michigan‡Princeton University§Massachusetts Institute of Technology DAC 2011

Reliable Networks on Chip fault-tolerant routing Drain up $ recovery detection diagnosis recon-figuratio R Diagnose what fault has occurred Detect if fault has occurred Reconfigure network to account for fault Recover and resume normal operation processor nodes are disconnected, state is lost! cache router

Recovery Approaches • Checkpoint/recovery approaches • Drain takes a reactive approach, incurring performance overhead only when errors occur uP uP $ $ checkpoint buffers R R MEM high performance overhead! data stuck in checkpoint buffer!

Data Recovery with Drain • Recover data lost during reconfiguration • Emergency links provide alternate path • Transfers cache contents and architectural state processor u P u P u P . . . . core . . . . . . . . . . . . . . local cache ... ... $ $ $ DRAIN emergency link ... ... . . . . . . . . . Router Router Router primary link . . . memory . . . Mem . . . controller

Drain Example up up $ $ up up $ $ M

Drain Example up up $ $ up up link failure X $ $ M

Drain Example up up $ $ reconfigure interconnect up up $ $ M

Drain Example up up $ $ up up $ $ X M link failure

Drain Example up up $ $ node isolated! up up $ $ M

Drain Example up up $ $ drain connected nodes via primary links up up $ $ M

Drain Example up up $ $ up up $ $ drain disconnected node via emergency link M

Drain Example up up $ $ up up $ $ drain connected node again M

Drain Example up up $ $ resume normal operation up up $ $ M

Drain Performance as Links Fail decreasing functional network size increasing emergency link time

Memory Latency Before and After

Conclusions • DRAIN is a lightweight recovery mechanism for CMPs • 5,000 gates per node • Recoup cache data and architectural state from disconnected nodes • Performance overhead only during recovery • ~3ms at 1GHz

DRAIN: Distributed Recovery Architecture for Fault-Tolerant Multi-core Chips

DRAIN: Distributed Recovery Architecture for Fault-Tolerant Multi-core Chips

Presentation Transcript

AnyGL: A Large Scale Hybrid Distributed Graphics System

Unit Nine

Snacks - Fried Potato Chips

DATA MANAGEMENT FOR THE ALL-DOD CORE ARCHITECTURE DATA MODEL (All_CADM)

J2EE Technologies and Distributed Multi-Tiered Application Development

Multi-Way search Trees

Best Practices in Software Architecture

Making Domino Clusters a Key Part of Your Enterprise Disaster Recovery Architecture

AFFYMETRIX chips

Chapter 23

Project title : DISTRIBUTED AND CAPE-OPEN COMPLIANT PLATFORM FOR PLANNING AND SCHEDULING MULTI-SITE MANUFACTURING SYSTEM

Memory Systems in the Multi-Core Era Lecture 1: DRAM Basics and DRAM Scaling

Lecture 3 (Complexities of Parallelism)

Chapter 22: Distributed Databases

Multi-Agent Systems (Chapter 9) Adapted with permission from Adina Magda Florea adina@wpi

Distributed Systems

RoboCup: An Application Domain for Distributed Planning and Sensoring in Multi-robot Systems

Outline

Pattern-Oriented Distributed Software Architectures

Active Directory Disaster Recovery

Distributed Resilient Consensus Nitin Vaidya University of Illinois at Urbana-Champaign