1 / 25

Efficient Scrub Mechanisms for Error-Prone Emerging Memories

Efficient Scrub Mechanisms for Error-Prone Emerging Memories. Manu Awasthi ǂ , Manjunath Shevgoor⁺, Kshitij Sudan⁺, Rajeev Balasubramonian⁺, Bipin Rajendran ‡ , Viji Srinivasan ‡ ǂ Micron, ⁺ University of Utah, ‡ IBM Research. Executive Summary.

beate
Download Presentation

Efficient Scrub Mechanisms for Error-Prone Emerging Memories

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Efficient Scrub Mechanisms for Error-Prone Emerging Memories Manu Awasthiǂ, Manjunath Shevgoor⁺, Kshitij Sudan⁺, Rajeev Balasubramonian⁺, Bipin Rajendran‡, Viji Srinivasan‡ ǂMicron, ⁺University of Utah, ‡IBM Research

  2. Executive Summary • Future memory hierarchies will likely include NVMs • Focus of this work: Phase Change Memory (PCM) • Multi level cells in PCM appear imminent • A number of proposals exist to handle hard errors and lifetime issues of PCM devices • Resistance Drift is a lesser explored phenomenon • Will become primary cause of “soft errors” -- Need to explore holistic solutions to counter drift

  3. Phase Change Memory - MLC (01) (11) (10) (00) Crystalline Amorphous Resistance (110) (101) (100) (011) (010) (001) (111) (000) • Chalcogenide material can exist in crystalline or amorphous states • The material can also be programmed into intermediate states • Leads to many intermediate states, paving way for Multi Level Cells (MLCs)

  4. What is Resistance Drift? Time 11 10 01 00 ERROR!! Tn B T0 A Resistance

  5. Resistance Drift - Issues Rdrift(t) = R0х (t)α • Programmed resistance drifts according to power law equation - • R0, α usually follow a Gaussian distribution • Time to drift (error) depends on • Programmed resistance (R0), and • Drift Coefficient (α) • Is highly unpredictable!!

  6. Resistance Drift - How it happens ERROR!! 11 10 01 00 Number of Cells R0 R0 Rt Rt • Median case cell • Typical R0 • Typical α • Worst case cell • High R0 • High α Scrub rate will be dictated by the Worst Case R0 and Worst Case α

  7. Resistance Drift Data (11) (01) (10) (00)

  8. Uncorrectable Errors

  9. Problem Severity

  10. Inadequacy of Device Level Solutions • Precise writes –> write – compare – write • Can be effective, requires multiple iterations • Increases write time, total write energy, decreases lifetime • Correcting values during read not effective • Coding techniques • Can help, but cannot eliminate the problem by themselves

  11. Naïve Solution : Scrubbing Refresh should be reactionary NOT precautionary! • Drift resets with every cell reprogram (write) • Leverage existing error correction mechanisms, e.g., ECC - has its own drawbacks • A Full Refresh/Scrub (read-compare-write) is extremely costly in PCM • Each PCM write takes 100 - 1000ns • Writing to a 2-bit cell may consume as much as 1.6nJ • Requires 600 refreshes in parallel

  12. Solution Outline • A three pronged approach • Incorporate stronger ECC codes • Incorporate low-cost error detection mechanisms • Adapt scrub algorithms that are conservative and adaptive

  13. Additional ECC support • With stronger ECC support, scrub operations can be made infrequent • If E errors can be corrected and E+1 detected, then scrub operation can be postponed till the Eth error • Can provide support for hard error tolerance as well • We advocate ECC-8 (E-8) • Correct eight errors, detect nine • Leads to 14.25% storage overhead, for 512 bits

  14. LARDD Read Line Check for Errors True Errors < E After N cycles False Scrub Line • Light Array Reads for Drift Detection • Support for E Error-correcting, E+1 error detecting codes assumed • Lines are read periodically and checked for correctness • Only after the number of errors reaches a threshold, scrubbingis performed • Drift detection has to be simple, on-chip

  15. Read Pipeline with LARDD & Parity (11) (01) (10) (00) 15 15 • Simple error detection mechanism to bypass complex BCH circuit • 256 cell line divided up into eight 32-cell fields • Count number of drift prone states, store as parity information • For every LARDD, check if parity has changed • If true, invoke BCH circuitry

  16. Headroom Read Line Check for Errors True Errors < E-h After N cycles False Scrub Line • Headroom-h scheme – scrub is triggered if E-h errors are detected • Decreases probabilityof errors slipping through • Increases frequency of full scrub and hence decreases life time • Presents trade-off between Hard and Soft errors

  17. Headroom Gradual Read Line After N cycles Check for Errors True Errors <= E-h-g False Double LARDD rate • Start with a small LARDD frequency • Increase frequency as drift-based errors increase • Decreases energy overhead • Might cause slight increase in error rate

  18. Adaptive Scrub • Dynamic events can affect reliability • Temperature increases can increase α and decrease drift time • Cell lifetime/wearout is also an issue • Soft error rate depends on prevalence of drift prone states • Hard errors show up with aging cells • Start with headroom h policy, get rid of headroom when at least E-h-1 hard errors • Can increase LARDD frequency when total number of hard errors increases preset threshold

  19. Results Stronger ECC + LARDD support leads to decrease in unrecoverable errors

  20. Results - Headroom Schemes • Compared to ECC-1, for 512 s LARDD interval • ECC-8-headroom-3 leads to 99.6% reduction in error rate, 35.4% reduction in scrub energy • ECC-8-headroom-3-gradual-1 leads to 96.5% reduction in error rate, 37.8% reduction in scrub energy

  21. Device Level Solution – Non Uniform Banding Before Mean R0 11 10 01 00 After Resistance

  22. Comparing with Device Level Solutions ECC-8-headroom-3 provides for least number of uncorrectable errors

  23. Conclusions • Resistance drift will exacerbate with MLC scaling • Naïve solutions based on ECC support are costly for PCM • Increased write energy, decreased lifetimes • Holistic solutions need to be explored to counter drift at device, architectural and system levels • 39% reduction in energy, 4x less errors, 102x increase in lifetime

  24. Thanks!!www.cs.utah.edu/arch-research

  25. Backup Slides

More Related