AgeSim: A Simulation Framework for Evaluating the Lifetime Reliability of Processor-Based SoCs

Presenter: Lin Huang. Lin Huang and Qiang Xu, CUhk REliable computing laboratory (CURE), The Chinese University of Hong Kong.

Presentation Transcript


  1. AgeSim: A Simulation Framework for Evaluating the Lifetime Reliability of Processor-Based SoCs
     Presenter: Lin Huang
     Lin Huang and Qiang Xu
     CUhk REliable computing laboratory (CURE), The Chinese University of Hong Kong

  2. Lifetime Reliability Becomes A Serious Concern
     • Failure mechanisms: electromigration, NBTI, TDDB
     • Reliability-related factors: temperature, supply voltage, frequency
     [Figure: failure rate vs. time, with infant mortality, useful life, and wearout phases; curves labeled 90nm, 130nm, 180nm; annotations < 7 years, ~ 7 years, ~ 10 years [T. M. Mak]]

  3. Design-Stage Decisions Affect Lifetime Reliability
     • DPM / DTM: DVFS, timeout, thermal throttling, power gating, …
     • Redundancy: level, quantity, …
     • Task allocation: round-robin, optimized, …
     • IC specification: functionality, power consumption, area constraint, thermal issue, expected service life, …
     Without an efficient yet accurate lifetime reliability simulation framework, making good decisions is extremely difficult, if not impossible!

  4. The Challenges in Simulation-Based Lifetime Reliability Analysis
     • Increasing failure rate
     • Exponential distribution assumption in previous work
     [Figure: failure rate vs. time, with infant mortality, useful life, and wearout phases]
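The tension between the two bullets on this slide can be written out with the standard textbook forms of the two distributions. The symbols λ (failure rate), θ (scale) and β (shape) are notation chosen here for illustration and do not come from the slides.

```latex
\begin{align*}
\text{Exponential:}\quad R(t) &= e^{-\lambda t}, &
  h(t) &= -\frac{R'(t)}{R(t)} = \lambda
  \quad\text{(constant in time)} \\
\text{Weibull:}\quad R(t) &= e^{-(t/\theta)^{\beta}}, &
  h(t) &= \frac{\beta}{\theta}\left(\frac{t}{\theta}\right)^{\beta-1}
  \quad\text{(increasing when } \beta > 1\text{)}
\end{align*}
```

An exponential model therefore cannot represent the rising failure rate of the wearout phase, whereas a Weibull model with β > 1 can.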

  5. The Challenges in Simulation-Based Lifetime Reliability Analysis
     • Operational temperature varies significantly and rapidly
     [Figure obtained with HotSpot 4.0 [Huang-ieeetc08]]
     How can we achieve efficient yet accurate lifetime reliability simulation with such limited information, when failure mechanisms follow arbitrary failure distributions?

  6. Key Idea
     • General failure distribution with a general scale parameter by which time is divided
       • Example: Weibull failure distribution
     • Suppose we can express the reliability function as [equation] and [quantity] can be computed according to limited tracing information
       • Example: reliability function [equation]
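As a concrete instance of a failure distribution whose scale parameter divides time, here is a minimal sketch of the textbook two-parameter Weibull reliability function and its mean time to failure. The function names and parameter values are hypothetical and chosen only for illustration; the slide's own equations are not preserved in this transcript.

```python
import math

def weibull_reliability(t, scale, shape):
    """R(t) = exp(-(t / scale) ** shape); time enters only as t / scale."""
    return math.exp(-((t / scale) ** shape))

def weibull_mttf(scale, shape):
    """Mean time to failure of a Weibull distribution: scale * Gamma(1 + 1/shape)."""
    return scale * math.gamma(1.0 + 1.0 / shape)

# Hypothetical parameters (not from the slides).
scale_years = 10.0
shape = 2.0

for year in (1, 5, 10):
    print(f"t = {year:>2} years, R = {weibull_reliability(year, scale_years, shape):.3f}")
print(f"MTTF = {weibull_mttf(scale_years, shape):.2f} years")
```

Because time appears only through the ratio `t / scale`, any effect that shortens the scale parameter (for example, a higher temperature) acts like a speed-up of time.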

  7. Key Idea
     • Aging rate
       • Captures the impact of a certain usage strategy
     • Reliability-related usage strategy
       • A combination of …
         • Dynamic power/thermal management
         • Trigger mechanism
         • Load-sharing strategy
         • …
       given the application flow with certain characteristics

  8. Key Idea
     [Figure labels: usage strategy; temperature, supply voltage, frequency; aging rate; representative workload; past, future]

  9. Key Idea
     [Figure labels: representative workload; past, future]

  10. Proposed Simulation Framework: AgeSim (Step One: Simulation and Tracing)
      [Diagram labels: Power State Machine, Trigger Mechanism, Application Flow, Load-sharing Strategy, Redundancy Scheme; Power / Thermal Manager; Execution Mode; Power Simulator; Power (Data); Temperature Simulator; Temperature (Data); time step]

  11. Proposed Simulation Framework: AgeSim (Step One: Simulation and Tracing)
      [Diagram labels: Power State Machine, Trigger Mechanism, Application Flow, Load-sharing Strategy, Redundancy Scheme; Power / Thermal Manager; Execution Mode; Power Simulator; Power (Data); Temperature Simulator; Temperature (Data); Reliability-Related Factors Trace File]

  12. Proposed Simulation Framework: AgeSim (Step Two: Aging Rate Calculation)
      [Diagram labels: Reliability-Related Factors Trace File; Aging rate]
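The step from a trace file to an aging rate can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation. The trace is assumed to be a list of (duration, temperature) intervals, the Weibull scale parameter is assumed to follow an Arrhenius-style temperature dependence with made-up constants, and the aging rate is taken to be the time-weighted average of the inverse scale parameter, so that R(t) = exp(-(rate · t)^shape).

```python
import math

BOLTZMANN_EV = 8.617333262e-5  # Boltzmann constant in eV/K

def scale_parameter(temp_k, prefactor_years=1.0e-9, activation_ev=0.7):
    """Assumed Arrhenius-style Weibull scale parameter; constants are hypothetical."""
    return prefactor_years * math.exp(activation_ev / (BOLTZMANN_EV * temp_k))

def aging_rate(trace):
    """Time-weighted average of 1 / scale over (duration, temperature_K) intervals."""
    total_time = sum(duration for duration, _ in trace)
    damage = sum(duration / scale_parameter(temp_k) for duration, temp_k in trace)
    return damage / total_time

def mttf(rate, shape=2.0):
    """MTTF for R(t) = exp(-(rate * t) ** shape), i.e. Gamma(1 + 1/shape) / rate."""
    return math.gamma(1.0 + 1.0 / shape) / rate

# Hypothetical trace: (duration in arbitrary units, temperature in kelvin).
trace = [(2.0, 340.0), (1.0, 360.0), (3.0, 330.0)]
rate = aging_rate(trace)
print(f"aging rate = {rate:.4f} per year, MTTF = {mttf(rate):.2f} years")
```

Only the relative durations matter here, so a short trace of a representative workload is enough to fix the rate under these assumptions.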

  13. Model Validation
      • By average temperature: 28.3% error in MTTF
      • By AgeSim: almost identical results
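Why an average-temperature estimate can be far off is easy to see in a toy model. The sketch below does not reproduce the 28.3% figure from the slide. It assumes an Arrhenius-style scale parameter with hypothetical constants and a made-up two-interval trace, and compares the MTTF implied by the mean temperature with the MTTF implied by interval-wise accumulation.

```python
import math

BOLTZMANN_EV = 8.617333262e-5  # Boltzmann constant in eV/K

def inverse_scale(temp_k, prefactor_years=1.0e-9, activation_ev=0.7):
    """1 / scale parameter under an assumed Arrhenius model (hypothetical constants)."""
    return 1.0 / (prefactor_years * math.exp(activation_ev / (BOLTZMANN_EV * temp_k)))

def rate_from_trace(trace):
    """Time-weighted average of the per-interval rates."""
    total = sum(duration for duration, _ in trace)
    return sum(duration * inverse_scale(temp) for duration, temp in trace) / total

def rate_from_average_temperature(trace):
    """Rate evaluated once, at the time-weighted mean temperature."""
    total = sum(duration for duration, _ in trace)
    mean_temp = sum(duration * temp for duration, temp in trace) / total
    return inverse_scale(mean_temp)

def mttf_error_percent(trace):
    """Relative MTTF error of the mean-temperature shortcut.

    For a fixed shape parameter MTTF is proportional to 1 / rate,
    so the Gamma factor cancels in the ratio.
    """
    reference = 1.0 / rate_from_trace(trace)
    shortcut = 1.0 / rate_from_average_temperature(trace)
    return 100.0 * (shortcut - reference) / reference

# Hypothetical trace: equal time at 330 K and 370 K.
trace = [(1.0, 330.0), (1.0, 370.0)]
print(f"MTTF error from using the mean temperature: {mttf_error_percent(trace):+.1f}%")
```

In this model the rate grows faster than linearly with temperature over the range shown, so the mean-temperature shortcut underestimates aging and overestimates MTTF whenever the temperature actually varies.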

  14. Case Study I: Dynamic Voltage and Frequency Scaling
      • Configurations: No DVFS; DVFS1 (low voltage: 90% Vdd); DVFS2 (low voltage: 80% Vdd)
      [State diagram: states HV Run, HV Idle, LV Run, LV Idle; transitions labeled Task arrival, Task departure, T>TH, T<TL]
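The four-state diagram on this slide can be sketched as a small transition function. The wiring is an assumption, since the transcript keeps only the labels: task arrival is taken to move Idle to Run, task departure Run to Idle, T>TH high voltage to low voltage, and T<TL low voltage back to high voltage, with thermal transitions allowed from any state. The threshold values are hypothetical.

```python
from enum import Enum

class State(Enum):
    HV_RUN = "HV Run"
    HV_IDLE = "HV Idle"
    LV_RUN = "LV Run"
    LV_IDLE = "LV Idle"

def next_state(state, event=None, temp=None, t_high=85.0, t_low=70.0):
    """Apply one step. event is 'arrival', 'departure' or None; temp is optional.

    Thresholds t_high / t_low are hypothetical; the slide only names TH and TL.
    """
    high_voltage = state in (State.HV_RUN, State.HV_IDLE)
    running = state in (State.HV_RUN, State.LV_RUN)

    # Thermal trigger: drop to low voltage when hot, return when cool.
    if temp is not None:
        if high_voltage and temp > t_high:
            high_voltage = False
        elif not high_voltage and temp < t_low:
            high_voltage = True

    # Workload trigger: arrival starts execution, departure ends it.
    if event == "arrival":
        running = True
    elif event == "departure":
        running = False

    if high_voltage:
        return State.HV_RUN if running else State.HV_IDLE
    return State.LV_RUN if running else State.LV_IDLE

state = State.HV_IDLE
for event, temp in [("arrival", 60.0), (None, 90.0), ("departure", 80.0), (None, 65.0)]:
    state = next_state(state, event=event, temp=temp)
    print(state.value)
```

The gap between the two thresholds gives hysteresis, so the voltage level does not oscillate when the temperature hovers near a single trip point.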

  15. Case Study I: Dynamic Voltage and Frequency Scaling
      • System load: the ratio between task arrival rate and service rate

  16. Case Study I: Dynamic Voltage and Frequency Scaling
      • System load: the ratio between task arrival rate and service rate

  17. Case Study II: Task Allocation on Multi-Core Processors
      [Figure: example chip frequency map]
      • Random allocation
      • Performance-aware allocation
        • Always choose the available core with the highest frequency [Sarangi-ieeetsm08]
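The two allocation policies named on this slide can be sketched in a few lines. The per-core frequency map and the set of available cores below are hypothetical stand-ins for the chip frequency map shown in the figure.

```python
import random

def random_allocation(available, rng):
    """Pick any available core uniformly at random."""
    return rng.choice(sorted(available))

def performance_aware_allocation(available, freq_ghz):
    """Pick the available core with the highest frequency."""
    return max(available, key=lambda core: freq_ghz[core])

# Hypothetical frequency map (core id -> GHz) and availability.
freq_ghz = {0: 2.8, 1: 3.1, 2: 2.6, 3: 3.0}
available = {0, 2, 3}  # core 1 is busy

rng = random.Random(0)
print("random choice:", random_allocation(available, rng))
print("performance-aware choice:", performance_aware_allocation(available, freq_ghz))
```

Performance-aware allocation concentrates work on the fastest cores, which is the kind of usage-strategy difference a lifetime reliability simulation is meant to expose.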

  18. Case Study II: Task Allocation on Multi-Core Processors
      • System load: the ratio between task arrival rate and service rate

  19. Discussion on the Flexibility of AgeSim
      • Task allocation and scheduling for MPSoC under lifetime reliability constraint
      • Multiprocessor with different redundancy schemes
        • Example: gracefully degrading redundancy, standby redundancy

  20. Conclusion
      • Lifetime reliability has become a serious concern for high-performance ICs
      • Design-stage decisions significantly affect system reliability
      • We propose an efficient yet accurate simulation framework to evaluate system reliability under various usage strategies
      • Arbitrary failure distribution
      • Fine-grained tracing for representative workloads
      • AgeSim is effective and flexible

  21. AgeSim: A Simulation Framework for Evaluating the Lifetime Reliability of Processor-Based SoCs
      Thank you for your attention!

  22. Backup Slides
      • Multiple representative workloads
      • Aging rate
      • Accuracy
      • Key idea

  23. Multiple Representative Workloads
      • The proposed method could be easily extended to analyze the system with multiple representative workloads
      • We can organize the workloads into a hyper-workload with their occurrence probabilities
      • We can extract the aging rate and occurrence probability for each workload and then compute the unified aging rate by [equation]
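The combining formula is not preserved in this transcript. One natural reading, assumed here and not confirmed by the slides, is that each occurrence probability is the fraction of operating time spent in that workload, in which case the unified aging rate is the probability-weighted sum of the per-workload rates. The function name and the numbers below are hypothetical.

```python
def unified_aging_rate(workloads):
    """Combine (occurrence_probability, aging_rate) pairs into one rate.

    Assumes the probabilities are fractions of operating time and sum to 1.
    """
    total_probability = sum(p for p, _ in workloads)
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("occurrence probabilities must sum to 1")
    return sum(p * rate for p, rate in workloads)

# Hypothetical hyper-workload: (occurrence probability, aging rate per year).
workloads = [(0.5, 0.08), (0.3, 0.12), (0.2, 0.20)]
print(f"unified aging rate = {unified_aging_rate(workloads):.3f} per year")
```

If the probabilities instead describe how often each workload is launched, they would first need to be weighted by workload duration before this sum applies.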

  24. Aging Rate
      • Aging rate is independent of time
      [Figure: failure rate vs. time]

  25. Accuracy

  26. Key Idea
      [Diagram labels: Power State Machine, Trigger Mechanism, Application Flow, Load-sharing Strategy, Redundancy Scheme; Processor usage strategy; Aging rate; Reliability function]
