
Olay: Combat the Signs of Aging with Introspective Reliability Management Authors: Shuguang Feng Shantanu Gupta

Olay: Combat the Signs of Aging with Introspective Reliability Management. Authors: Shuguang Feng, Shantanu Gupta, Scott Mahlke. W-QUAD (ISCA-35), June 21, 2008.


Presentation Transcript


  1. Olay: Combat the Signs of Aging with Introspective Reliability Management • Authors: Shuguang Feng, Shantanu Gupta, Scott Mahlke • W-QUAD (ISCA-35), June 21, 2008

  2. Motivation [Srinivasan, DSN‘04] [Borkar, MICRO‘05] • “Designing Reliable Systems from Unreliable Components…” - Shekhar Borkar (Intel) • Failures will be wearout induced • More failures to come

  3. Approaches to Reliability • Tolerate Faults (reactive) or… Prevent Faults (proactive) • Architecture-level • Circuit-level • Detect • Diagnose • Repair/reconfigure/recover • High-K dielectrics • Passivation • Dynamic thermal mgmt (DTM) • Introspective reliability mgmt (IRM) • Margining • Robust cell topologies • Labeled examples: Diva, WDU, Reliability Banking, Heat-and-Run, Argus, RAMP • Targeted management based on wearout monitoring

  4. Not All Cores Are Created Equal • Chip-multiprocessors will be subject to severe process variation • Dynamic thermal/power budgeting can be suboptimal • Temperature is only part of the picture • Need low-level reliability awareness • Low-level sensors measure physical changes • Wearout-aware management improves reliability enhancement • System reconfiguration • Dynamic voltage and frequency scaling (DVFS) • Job assignment

  5. Introspective Reliability Management (IRM) • WDU [MICRO`07]: measure propagation delay; track statistical trends • Olay: track the progression of wearout; profile workload behavior; generate wearout-aware job schedules • Low-level Sensors: delay, leakage, temperature, etc.
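
The WDU bullets on slide 5 (measure propagation delay, track statistical trends) can be illustrated with a small sketch. The slide does not say which statistic is tracked, so the exponential moving average, the `alpha` smoothing factor, the `margin` threshold, and every function name below are assumptions made for illustration only.

```python
def ema(samples, alpha=0.2):
    """Exponential moving average of delay samples (assumed trend statistic)."""
    smoothed = []
    avg = None
    for x in samples:
        # The first sample seeds the average; later samples are blended in.
        avg = x if avg is None else alpha * x + (1 - alpha) * avg
        smoothed.append(avg)
    return smoothed


def wearout_flag(samples, baseline, margin=0.05, alpha=0.2):
    """True once the smoothed delay exceeds the baseline by `margin` (a fraction).

    `baseline` is a hypothetical fresh-silicon delay; `margin` is a
    hypothetical guard band, not a value taken from the slides.
    """
    limit = baseline * (1 + margin)
    return any(s > limit for s in ema(samples, alpha))
```

Smoothing is the point of the design choice here: a single noisy reading such as `[100, 110, 100, 100]` does not trip the flag against a baseline of 100, while a sustained upward drift such as `[100, 120, 130, 140]` does.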

  6. Wearout-aware Scheduling • [Figure labels: Per-module Reliability Profile; Activity (per-job percentages for T0, T1, T2, T3, … Tn); Active Jobs; Available Cores; Job Schedule]

  7. Wearout-aware Scheduling • [Figure: jobs T0, T1, T2, T3, … Tn]

  8. Wearout-aware Policies • GreedyE • Optimizes for early life performance • Minimizes premature failures with wear-leveling • [Figure: Jobs (T0 … Tn) ordered Light to Heavy, Cores (C0 … Cn) ordered Weak to Strong, and the resulting Schedule]

  9. Wearout-aware Policies • GreedyE • Optimizes for early life performance • Minimizes premature failures with wear-leveling • GreedyL • Optimizes for end of life performance • Victimizes weak cores to maximize the life of stronger cores • GreedyA • Hybrid of GreedyE and GreedyL • Adapts behavior based on system utilization
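
The contrast between GreedyE and GreedyL on slide 9 can be sketched as two sort-and-pair routines. This is a guess at the general shape, not the authors' algorithm: the health scores, the stress scores, the pairing rules, and the names `greedy_e` and `greedy_l` are all assumptions. GreedyA is left out because the slides do not say how it switches between the two behaviors.

```python
def greedy_e(core_health, job_stress):
    """Wear-leveling sketch: heaviest jobs go to the strongest cores.

    core_health: dict core -> health score (higher = stronger), hypothetical.
    job_stress:  dict job  -> stress score (higher = heavier), hypothetical.
    With fewer jobs than cores, the weakest cores stay idle.
    """
    cores = sorted(core_health, key=core_health.get, reverse=True)  # strongest first
    jobs = sorted(job_stress, key=job_stress.get, reverse=True)     # heaviest first
    return dict(zip(jobs, cores))


def greedy_l(core_health, job_stress):
    """Victimization sketch: heaviest jobs go to the weakest cores.

    With fewer jobs than cores, the strongest cores stay idle and are
    preserved for later in the system's life (an assumed reading of
    "victimizes weak cores").
    """
    cores = sorted(core_health, key=core_health.get)                # weakest first
    jobs = sorted(job_stress, key=job_stress.get, reverse=True)     # heaviest first
    return dict(zip(jobs, cores))
```

For example, with cores `{"C0": 0.9, "C1": 0.4, "C2": 0.7}` and jobs `{"T0": 0.8, "T1": 0.2}`, `greedy_e` places the heavy job T0 on the strongest core C0, while `greedy_l` places it on the weakest core C1.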

  10. Lifetime Reliability Simulation (FACE) • Offline Characterization • SPEC2000 (INT & FP) • Synthetic Benchmarks: representative of SPEC2000 suite; reduces online profiling complexity • Temperature Trace • Execution Trace • Power Trace

  11. Lifetime Reliability Simulation (FACE) • Offline Characterization • Online Simulation • Reliability Management: monitors CMP health; wearout-aware scheduling; profiling; intelligent heuristics • Parameter Specification: device lifetimes; utilization pattern • Simulate CMP Aging: tracks progression of wearout mechanisms; hierarchical design • Workload Generation: emulates OS scheduler; temperature traces; power traces

  12. Wearout Modeling • Mean time to failure (MTTF): defines distribution of device lifetimes • Damage accumulation: [equation not preserved in transcript], where α is the degradation rate
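
The damage-accumulation equation on slide 12 did not survive in the transcript, so the sketch below does not reproduce it. It shows one simple way such a model could work, assuming damage grows linearly at a degradation rate α within each interval and that a device fails when accumulated damage reaches a threshold of 1.0. The linear form, the threshold, and the function name `accumulate_damage` are all assumptions.

```python
def accumulate_damage(intervals, threshold=1.0):
    """Accumulate damage over (duration, alpha) intervals.

    Assumed model: each interval adds alpha * duration to the damage,
    where alpha is the degradation rate under that interval's stress.
    Returns (damage, failure_time); failure_time is None if the device
    survives every interval.
    """
    damage = 0.0
    elapsed = 0.0
    for duration, alpha in intervals:
        step = alpha * duration
        if damage + step >= threshold:
            # Failure happens partway through this interval.
            elapsed += (threshold - damage) / alpha
            return threshold, elapsed
        damage += step
        elapsed += duration
    return damage, None
```

Under these assumptions, a device that spends 10 time units at α = 0.05 and is then kept at the same rate fails at time 20, whereas dropping to a lower rate after the first interval postpones the failure.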

  13. CMP Reliability Simulation • CMPs: variable number of cores; model systematic variation • Cores: Alpha 21264-type processor • Modules: experience load-dependent stress; smallest granularity of temperature modeling • Transistors: multiple mechanisms evolve independently

  14. Evaluation • Policies: Random (baseline), GreedyE, GreedyL, GreedyA • Figures of merit: failure distribution; useful work performed prior to system failure • Varied system parameters: CMP size; system utilization; sensor error

  15. Failure Distribution w/ 16-cores

  16. Sensitivity to System Utilization w/ 16-cores

  17. Sensitivity to CMP Size w/ 100% utilization & GreedyE

  18. Sensitivity to Sensor Error w/ 16-cores, 100% utilization, & GreedyE

  19. Conclusions • Heterogeneity exists in both CMPs and their workloads • Wearout-aware job assignments effectively exploit this heterogeneity • Real-time health monitoring (low-level sensors) • CMPs augmented with Olay perform up to 20% more useful work • Proper high-level analysis and profiling is essential for enhancing lifetime reliability.

  20. Questions?
