On modeling the lifetime reliability of homogeneous manycore systems


Lin Huang and Qiang Xu

CUhk REliable computing laboratory (CURE)

The Chinese University of Hong Kong


Integrated Circuit (IC) Product Reliability

IC errors can be broadly classified into two categories

Soft errors

Do not fundamentally damage the circuits

Hard errors

Permanent once manifest

E.g., time dependent dielectric breakdown (TDDB) in the gate oxides, electromigration (EM) and stress migration (SM) in the interconnects, and thermal cycling (TC)


Manycore Systems

State-of-the-art computing systems have started to employ multiple cores on a single die

General-purpose processors, multi-digital signal processor systems

Power-efficiency

Short time-to-market

Source: Intel

Source: Nvidia


Problem Formulation

To model the lifetime reliability of homogeneous manycore systems using a load-sharing nonrepairable k-out-of-n: G system with general failure distributions

Key features

k-out-of-n: G systems: to provide fault tolerance

Load-sharing: each embedded core carries only part of the load assigned by the operating system

Nonrepairable: embedded cores are integrated on a single silicon die

General failure distribution: embedded cores age in operation
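As a baseline for the k-out-of-n: G idea, the following minimal Python sketch computes the probability that at least k of n cores are good. It assumes independent cores that are each good with the same probability `p`, which deliberately ignores the load-sharing and aging effects that the model above is designed to capture. The function name and parameters are illustrative and do not come from the slides.

```python
from math import comb


def k_out_of_n_reliability(k: int, n: int, p: float) -> float:
    """Probability that at least k of n independent cores are good.

    Assumption (not from the slides): every core is good with the same
    probability p, independently of the others and of the load.
    """
    if not 0 <= k <= n:
        raise ValueError("need 0 <= k <= n")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    # Sum the binomial probabilities of having exactly m good cores, m = k..n.
    return sum(comb(n, m) * p**m * (1 - p) ** (n - m) for m in range(k, n + 1))
```

With load sharing, the per-core survival probability depends on how many cores remain, so this closed form no longer applies directly.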


Queueing Model for Task Allocation

Embedded cores execute tasks independently and one core can perform at most one task at a time

Consider a manycore system composed of a set of identical embedded cores

The cores are divided into a set of active cores, a set of spare cores, and a set of faulty cores


Queueing Model for Task Allocation

A general-purpose parallel processing system with a central queue and bulk arrivals is modeled as a queueing system

The probability that a certain active core is occupied by tasks (also called utilization) is computed as

Target system

Gracefully degrading systems

Standby redundant systems
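The utilization expression itself did not survive in this transcript. As a rough illustration only, the sketch below uses the standard offered-load argument for a multi-server queue with bulk arrivals: batches arrive at rate `arrival_rate` with mean size `mean_batch_size`, and each active core serves tasks at rate `service_rate`. All names and the specific queue type are assumptions, not the authors' formulation.

```python
def utilization(arrival_rate: float, mean_batch_size: float,
                service_rate: float, active_cores: int) -> float:
    """Fraction of time an active core is busy, under assumed queue parameters.

    Assumption: work arrives at arrival_rate * mean_batch_size tasks per unit
    time and is shared evenly among active_cores identical cores.
    """
    if active_cores <= 0:
        raise ValueError("need at least one active core")
    offered_load = arrival_rate * mean_batch_size / service_rate
    rho = offered_load / active_cores
    if rho >= 1.0:
        raise ValueError("queue is unstable: utilization would reach 1 or more")
    return rho
```

Under these assumptions, the utilization of each surviving core rises as the number of active cores falls, which is the load-sharing effect.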


Lifetime Reliability of Entire System – Gracefully Degrading System

A functioning manycore system may contain anywhere from k to n good cores

Consider the probability that the system has a given number of active cores at a given time

The system reliability can therefore be expressed as

Thus, the Mean Time to Failure (MTTF) of the system can be written as
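The two expressions are missing from this transcript. A plausible reconstruction, in assumed notation, is given below: n is the total number of cores, k is the minimum number of good cores required, and P_m(t) is the probability that the system has exactly m good cores at time t. These symbols are illustrative and may differ from the ones on the original slide.

```latex
R_{\mathrm{sys}}(t) = \sum_{m=k}^{n} P_m(t),
\qquad
\mathrm{MTTF} = \int_{0}^{\infty} R_{\mathrm{sys}}(t)\,\mathrm{d}t
```

The sum is valid because the events "exactly m good cores" are mutually exclusive, and the integral is the standard identity for the mean of a nonnegative lifetime.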


Lifetime Reliability of Entire System – Gracefully Degrading System

To determine

Conditional probability

For any

Conditional probability

The remaining is how to compute


Behavior of Single Processor Core

States of cores

Spare mode – cold standby

Active mode

Processing state

Wait state – warm standby

The reliability functions in the processing and wait states have the same shape but different scale parameters


Lifetime Reliability of a Single Core – Gracefully Degrading System

Define a core's accumulated time in a certain state at a given time as how long the core has spent in that state up to that time

Calculation [figure: per-core illustration]


Lifetime Reliability of a Single Core – Gracefully Degrading System

Theorem 1 Suppose a manycore system with a gracefully degrading scheme has experienced a number of core failures, listed in order of occurrence time. For any core that has survived until a given time, the theorem gives

its accumulated time in the processing state up to that time

its accumulated time as warm standby up to that time
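The closed-form expressions of the theorem are not preserved here. The sketch below shows one way such accumulated times could be computed, assuming that between two consecutive failures a surviving core is busy for a fraction of the time equal to the utilization in that interval, and is in warm standby for the rest. The function name, the per-interval utilization list, and this expected-value treatment are assumptions for illustration.

```python
from typing import Sequence, Tuple


def accumulated_times(failure_times: Sequence[float],
                      utilizations: Sequence[float],
                      t: float) -> Tuple[float, float]:
    """Return (processing_time, wait_time) accumulated by a surviving core up to t.

    failure_times: failure instants in increasing order, all <= t.
    utilizations:  assumed utilization in each interval between failures,
                   one more entry than failure_times.
    """
    if len(utilizations) != len(failure_times) + 1:
        raise ValueError("need one utilization per interval")
    boundaries = [0.0, *failure_times, t]
    if any(end < start for start, end in zip(boundaries, boundaries[1:])):
        raise ValueError("failure times must be sorted and not exceed t")
    processing = 0.0
    wait = 0.0
    for rho, start, end in zip(utilizations, boundaries, boundaries[1:]):
        span = end - start
        processing += rho * span        # busy share of the interval
        wait += (1.0 - rho) * span      # warm-standby share of the interval
    return processing, wait
```

For example, with one failure at time 10, utilizations 0.5 before and 0.75 after it, and an observation time of 30, the core has accumulated 20 time units of processing and 10 of warm standby.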


Lifetime Reliability of a Single Core – Gracefully Degrading System

Recall that the reliability functions in the wait and processing states have the same shape but different scale parameters

The general reliability function, the reliability function in the processing state, and the reliability function in the wait state are related through their scale parameters


Lifetime Reliability of a Single Core – Gracefully Degrading System

A subdivision of the time:

By the continuity of the reliability function, we have

[figure: time intervals labeled processing and wait, with the accumulated time in the processing state and the accumulated time in the wait state]
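One common way to combine states that share a shape but differ in scale, while keeping the reliability function continuous at each state switch, is a cumulative-exposure model: time spent in each state is divided by that state's scale parameter and the results are added into a single normalized age. The sketch below applies this to a Weibull lifetime. The choice of Weibull, the cumulative-exposure form, and all names are assumptions rather than the slide's own equations.

```python
import math


def weibull_reliability(t: float, scale: float, shape: float) -> float:
    """Weibull reliability R(t) = exp(-(t / scale) ** shape)."""
    return math.exp(-((t / scale) ** shape))


def core_reliability(processing_time: float, wait_time: float,
                     scale_processing: float, scale_wait: float,
                     shape: float) -> float:
    """Reliability of a core given its accumulated time in each state.

    Assumption: aging in each state is normalized by that state's scale
    parameter, and the normalized ages simply add up.
    """
    normalized_age = processing_time / scale_processing + wait_time / scale_wait
    return math.exp(-(normalized_age ** shape))
```

When a core never waits, the result reduces to the plain Weibull reliability in the processing state, and when both scales are equal it depends only on the total elapsed time.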


Lifetime Reliability of a Single Core – Gracefully Degrading System

Theorem 2 Given a gracefully degrading manycore system that has experienced a number of core failures at known times, the probability that a certain core survives at a given time, provided that it has survived until an earlier time, is given by

where
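The formula of the theorem is not preserved in this transcript. By the definition of conditional probability, a conditional survival probability of this kind is a ratio of two reliability values. The sketch below shows that ratio for a generic reliability function of a core's normalized age. The helper `weibull` and the notion of normalized age are illustrative assumptions.

```python
import math
from typing import Callable


def conditional_survival(reliability: Callable[[float], float],
                         age_now: float, age_then: float) -> float:
    """P(core survives to age_now | it survived to age_then), with age_then <= age_now."""
    if age_now < age_then:
        raise ValueError("age_now must not be smaller than age_then")
    r_then = reliability(age_then)
    if r_then <= 0.0:
        raise ValueError("conditioning event has zero probability")
    return reliability(age_now) / r_then


def weibull(shape: float) -> Callable[[float], float]:
    """Unit-scale Weibull reliability as a function of normalized age (assumed form)."""
    return lambda age: math.exp(-(age ** shape))
```

With shape 1 the lifetime is exponential and the result depends only on the age difference. With shape above 1 an older core is less likely to survive the same additional interval, which is the aging effect.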


Lifetime Reliability of Entire System – Standby Redundant System

A standby redundant system is functioning if it contains at least the required number of good cores, some of which are configured as active cores while the remaining are spares

To determine

Again, the key point is to compute


Lifetime Reliability of a Single Core – Standby Redundant System

Define a core’s birth time as the time point when it is configured as an active one

Theorem 3 In a standby redundant manycore system, for any core with a given birth time that has survived until a given time, the theorem gives

its accumulated time in the processing state up to that time

its accumulated time as warm standby up to that time
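The expressions of the theorem are missing here. A minimal sketch follows, assuming that the number of active cores, and therefore the utilization, stays constant in a standby redundant system, so a core splits its time since birth between processing and warm standby in fixed proportions. The function name and the constant-utilization assumption are illustrative.

```python
from typing import Tuple


def standby_accumulated_times(birth_time: float, t: float,
                              utilization: float) -> Tuple[float, float]:
    """Return (processing_time, wait_time) for a core activated at birth_time.

    Assumption: utilization is constant while the core is active; before
    birth_time the core is a cold spare and accumulates neither.
    """
    if t < birth_time:
        raise ValueError("t must not precede the birth time")
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must lie in [0, 1]")
    active_span = t - birth_time
    return utilization * active_span, (1.0 - utilization) * active_span
```

For a core activated at time 10 and observed at time 30 with utilization 0.75, this gives 15 time units of processing and 5 of warm standby.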


Lifetime Reliability of a Single Core – Standby Redundant System

Theorem 4 In a manycore system with a standby redundant scheme, the probability that a certain core with a given birth time survives at a given time is given by

where


Experimental Setup

Lifetime distributions

Exponential

Weibull

Linear failure rate

System parameters

Consider a manycore system consisting of multiple cores
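For reference, the three lifetime distributions listed above have the following textbook reliability functions. The parameter values used in the experiments are not preserved in this transcript, so the sketch only defines the functional forms. The parameter names (`rate`, `scale`, `shape`, `a`, `b`) are illustrative.

```python
import math


def exponential_reliability(t: float, rate: float) -> float:
    """Constant failure rate: R(t) = exp(-rate * t)."""
    return math.exp(-rate * t)


def weibull_reliability(t: float, scale: float, shape: float) -> float:
    """Weibull: R(t) = exp(-(t / scale) ** shape)."""
    return math.exp(-((t / scale) ** shape))


def linear_failure_rate_reliability(t: float, a: float, b: float) -> float:
    """Failure rate h(t) = a + b * t, so R(t) = exp(-(a * t + b * t**2 / 2))."""
    return math.exp(-(a * t + 0.5 * b * t * t))
```

The exponential case is the special case of the Weibull with shape 1 and of the linear failure rate with `b = 0`, which is why it cannot represent cores that age in operation.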


Misleading Results Caused by the Exponential Assumption

[figure legend: expected lifetime of the system for a given number of cores]


Lifetime Reliability for Non-Exponential Lifetime Distribution

(a) Weibull Distribution

(b) Linear Failure Rate Distribution





Conclusion

  • State-of-the-art CMOS technology enables chip-level manycore processors

  • The lifetime reliability of such large circuits is a major concern

  • We propose a comprehensive analytical model to estimate the lifetime reliability of manycore systems

  • Some experimental results are shown to demonstrate the effectiveness of the proposed model


