
Virtualized and Flexible ECC for Main Memory


Presentation Transcript


  1. Virtualized and Flexible ECC for Main Memory Doe Hyun Yoon and Mattan Erez Dept. of Electrical and Computer Engineering The University of Texas at Austin ASPLOS 2010

  2. Memory Error Protection • Applying ECC uniformly – ECC DIMMs • Simple and transparent to programmers • Error protection level • Fixed, design-time decision • Chipkill-correct used in high-end servers • Constrain memory module design space • Allow only x4 DRAMs • Lower energy efficiency than x8 DRAMs • Virtualized ECC – objectives • To provide flexible memory error protection • To relax design constraints of chipkill

  3. Virtualized ECC • Two-tiered error protection • Tier-1 Error Code (T1EC) • Simple error code for detection or light-weight correction • Tier-2 Error Code (T2EC) • Strong error correcting code • Store T2EC within the memory namespace itself • OS manages T2EC • Flexible memory error protection • Different T2EC for different data pages • Stronger protection for more important data

  4. Virtualized ECC – Example [Figure: virtual pages map to physical page frames, each with a per-page error protection level (low to high); each physical frame maps in turn to an ECC page holding its T2EC – e.g., a smaller T2EC for chipkill vs. a larger T2EC for double chipkill – while every data line carries a uniform T1EC.]

  5. Virtualized ECC

  6. Observations on Memory Errors • Per-system error rate is still low • Most of the time, error detection finds no error • Detecting errors is the common-case operation • Need a low-latency, low-complexity error detection mechanism → T1EC • Correcting errors is the uncommon-case operation • Correction can be complex and take a long time • But error correction information still needs to be stored somewhere → virtualized T2EC

  7. Uniform ECC [Figure: conventional virtual-to-physical translation – a VA (VPN + offset) maps to a PA (PFN + offset); each page frame in physical memory stores data together with its ECC.]

  8. Virtualized ECC [Figure: the VA maps to a PA as usual; in addition, the OS manages a PFN-to-ECC-page-number translation, producing an ECC address (EA = ECC page number + ECC page offset) that scales according to the T2EC size. Data lines carry the T1EC; the T2EC lives in the ECC page.]
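The PFN-to-ECC-page mapping can be sketched in a few lines of Python. This is an illustrative model, not the paper's hardware: `pfn_to_epn` stands in for the OS-managed translation structure, and the page and line sizes are assumptions.

```python
PAGE_SIZE = 4096   # bytes per page (assumed)
LINE_SIZE = 64     # bytes per cache line

def ecc_address(pa, pfn_to_epn, t2ec_size):
    """Translate a physical data address to the address of its T2EC.

    pfn_to_epn -- dict mapping page frame number -> ECC page number
                  (the OS maintains this mapping in the real design)
    t2ec_size  -- T2EC bytes per 64B data line (e.g. 2 for the
                  ECC x4 chipkill configuration)
    """
    pfn, offset = divmod(pa, PAGE_SIZE)
    line = offset // LINE_SIZE
    # Consecutive data lines pack their T2ECs into the same ECC page,
    # so the ECC-page offset scales with the T2EC size.
    return pfn_to_epn[pfn] * PAGE_SIZE + line * t2ec_size
```

With a 2-byte T2EC, the T2ECs of 32 consecutive data lines land in a single 64B line of the ECC page, which is what lets one T2EC line cover many data lines.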

  9. Virtualized ECC Operation • Read: fetch data and T1EC • Don't need T2EC in most cases • Write: update data, T1EC, and T2EC • ECC Address Translation Unit: fast PA-to-EA translation • T2ECs of consecutive data lines map to the same T2EC line • T2EC lines can be partially valid • Update only valid T2ECs to DRAM [Figure: example – (1) read data at 0x00c0; (2) write back line B0 to 0x0200; (3) the ECC Address Translation Unit translates PA 0x0200 to (4) EA 0x0540; (5) write the T2EC at 0x0540. T2EC for rank-0 data is stored in rank 1 and vice versa.]

  10. Penalty with V-ECC • Increased data miss rate • T2EC lines in the LLC reduce effective LLC size • Increased traffic due to T2EC write-back • One-way write-back traffic • Not on the critical path

  11. Chipkill-Correct

  12. Chipkill-correct • Single Device-error Correct, Double Device-error Detect • Can tolerate a DRAM failure • Can detect a second DRAM failure • Chipkill requires x4 DRAMs • x8 chipkill is impractical • But x8 DRAM is more energy efficient

  13. Baseline x4 Chipkill • Two x4 ECC DIMMs • 128-bit data + 16-bit ECC (redundancy overhead: 12.5%) • 4-check-symbol error code using 4-bit symbols • Access granularity • 64B in DDR2 (min. burst 4 × 128 bits) • 128B in DDR3 (min. burst 8 × 128 bits) [Figure: 36 x4 DRAMs on a 144-bit wide data bus.]
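The numbers on this slide can be verified with simple arithmetic; the snippet below just recomputes them (all values are from the slide).

```python
# Baseline x4 chipkill: 32 data chips + 4 ECC chips, 4 bits each.
data_bits, ecc_bits = 128, 16
overhead = ecc_bits / data_bits         # redundancy overhead
assert overhead == 0.125                # 12.5%

# Access granularity = data-bus width (bits) * min. burst / 8 bits-per-byte
ddr2_bytes = 128 * 4 // 8               # min. burst 4
ddr3_bytes = 128 * 8 // 8               # min. burst 8
assert (ddr2_bytes, ddr3_bytes) == (64, 128)
```

The same arithmetic explains slide 15: widening to 280 data-bus bits doubles the per-access granularity to 128B/256B.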

  14. x8 Chipkill • x8 chipkill with the same access granularity • 152-bit wide data path • 128-bit data + 24-bit ECC • Redundancy overhead: 18.75% • Needs a custom-designed DIMM • Significantly increases system cost [Figure: 19 x8 DRAMs on a 152-bit wide data bus.]

  15. x8 Chipkill w/ Standard DIMMs • Increased access granularity • 128B in DDR2 (min. burst 4 × 256 bits) • 256B in DDR3 (min. burst 8 × 256 bits) [Figure: 35 x8 DRAMs on a 280-bit wide data bus.]

  16. V-ECC for Chipkill • Use a 3-check-symbol error code • Single Symbol-error Correct, Double Symbol-error Detect • T1EC • 2 check symbols • Detects up to 2 symbol errors • T2EC • 3rd check symbol • Combined T1EC/T2EC provides chipkill

  17. V-ECC: ECC x4 configuration • Use an 8-bit-symbol error code • Two bursts from an x4 DRAM form an 8-bit symbol • Modern DRAMs have a minimum burst of 4 or 8 • 1 x4 ECC DIMM + 1 x4 non-ECC DIMM • Each DRAM access in DDR2 (burst 4) • 64B data, 4B T1EC • 2B T2EC is virtualized within the memory namespace • 32 T2ECs per 64B cache line [Figure: 34 x4 DRAMs on a 136-bit wide data bus; T2EC virtualized within memory.]

  18. V-ECC: ECC x8 configuration • Use an 8-bit-symbol error code • Two x8 ECC DIMMs • Each DRAM access in DDR2 (burst 4) • 64B data, 8B T1EC • 4B T2EC is virtualized • 16 T2ECs per 64B cache line [Figure: 18 x8 DRAMs on a 144-bit wide data bus; T2EC virtualized within memory.]
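The T2EC packing arithmetic for the two configurations above is easy to check; the snippet below reproduces it (the per-access T2EC sizes come from the slides, the loop is ours).

```python
LINE = 64  # bytes per cache line
# T2EC bytes per 64B data line in each configuration (from the slides)
t2ec_bytes = {"ECC x4": 2, "ECC x8": 4}
t2ecs_per_line = {name: LINE // sz for name, sz in t2ec_bytes.items()}
# 32 T2ECs share one 64B line in the x4 configuration, 16 in the x8
# configuration -- so a single T2EC line covers 2KB or 1KB of data.
assert t2ecs_per_line == {"ECC x4": 32, "ECC x8": 16}
```

This packing density is why T2EC write-backs add so little traffic when writes have spatial locality.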

  19. Flexible Error Protection • Single HW with V-ECC can provide • Chipkill-detect, Chipkill-correct, and Double chipkill-correct • Use different T2EC for different pages • Reliability – Performance tradeoff • Maximize performance/power efficiency with Chipkill-Detect • Stronger protection at the cost of additional T2EC access

  20. Evaluation

  21. Simulator/Workload • GEMS + DRAMsim • An out-of-order SPARC V9 core • Exclusive two-level cache hierarchy • DDR2 800MHz – 12.8GB/s (128-bit wide data path) • 1 channel, 4 ranks • Power model • WATTCH for processor power – scaled to 45nm • CACTI for cache power – at 45nm • Micron power model for commodity DRAMs • Workloads • 12 data-intensive applications from SPEC CPU 2006 and PARSEC • Microbenchmarks: STREAM and GUPS

  22. Normalized Execution Time • Less than 1% penalty on average • Performance penalty grows as spatial locality decreases and write-back traffic increases

  23. System Energy Efficiency • Energy Delay Product (EDP) gain • ECC x4: 1.1% on average • ECC x8: 12.0% on average [Figure: per-benchmark EDP gains; chart labels: 1.23, 10%, 12%, 17%, 20%.]

  24. Flexible Error Protection [Figure: results for Chipkill-Detect, Chipkill-Correct, and Double Chipkill-Correct.]

  25. Conclusion • Virtualized ECC • Two-tiered error protection, virtualized T2EC • Improved system energy efficiency with chipkill • Reduce DRAM power consumption by 27% • Improve system EDP by 12% • Performance penalty – 1% on average • Error protection even for Non-ECC DIMMs • Can be used for GPU memory error protection • Flexibility in error protection • Adaptive error protection level by user/system demand • Cost of error protection is proportional to protection level

  26. Virtualized and Flexible ECC for Main Memory Doe Hyun Yoon and Mattan Erez Dept. of Electrical and Computer Engineering The University of Texas at Austin

  27. Backup

  28. Virtualized ECC Operations • DRAM read • Fetch data and T1EC – detect errors • Don’t need T2EC in most cases • DRAM write-back • Update data, T1EC, and T2EC • Cache T2EC for locality on T2EC access • Need to translate PA to EA • On-chip ECC address translation unit • TLB-like structure for fast PA to EA translation • Error correction • Need to read T2EC; maybe in the LLC or DRAM

  29. [Figure: the write-back example from slide 9 repeated – read 0x00c0, write back to 0x0200, translate PA 0x0200 to EA 0x0540, write the T2EC at 0x0540.]

  30. RECAP: V-ECC • Two-tiered error protection • Uniform T1EC • Virtualized T2EC • V-ECC for chipkill • ECC x4 configuration: saves 8 data pins • ECC x8 configuration: more energy efficient • Flexible error protection • Different T2EC for different pages • Stronger protection for important data • No protection for unimportant data

  31. Power Consumption • DRAM power saving • ECC x4: 4.2% • ECC x8: 27.8% • Total power saving • ECC x4: 2.1% • ECC x8: 13.2%

  32. Caching T2EC • T2EC occupancy: less than 10% on average • MPKI overhead: very small • The higher the spatial locality, the smaller the impact on caching behavior

  33. Traffic • Traffic increase: less than 10% on average • Due to increased demand misses and T2EC traffic • Spatial locality matters, as does the amount of write-back traffic

  34. Virtualized ECC • Uniform T1EC • Low-cost error detection or light-weight correction • Virtualized T2EC • Corrects errors that T1EC detects but cannot correct • Cacheable and memory mapped • A read accesses data and T1EC • Don't need T2EC in most cases • Simpler common-case read operations • A write updates data, T1EC, and T2EC

  35. Flexible Error Protection • ECC x8 DRAM configuration • Stronger error protection at the cost of more T2EC accesses • The additional cost of double chipkill (relative to chipkill) is quite small • Adaptation is at per-page granularity

  36. What if BW is limited? • Half DRAM BW – 6.4GB/s • Emulates a CMP where bandwidth is scarcer

  37. Virtualized ECC for Non-ECC DIMMs

  38. ECC for non-ECC DIMMs • Virtualize ECC in the memory namespace • Not a two-tiered error protection scheme • No uniform ECC storage (for T1EC) • But we keep calling this ECC 'T2EC' for consistent notation • Virtualized T2EC both detects and corrects errors • Now a DRAM read also triggers a T2EC access • Increased T2EC traffic, increased T2EC occupancy, and more penalty • But we can detect and correct errors with non-ECC DIMMs

  39. [Figure: non-ECC DIMM example – both reads (e.g., 0x0180 → EA 0x0550) and write-backs (e.g., 0x00c0 → EA 0x0510) go through the ECC Address Translation Unit and access T2EC lines; T2EC for rank-0 data resides in rank 1 and vice versa.]

  40. DIMM configurations • Use a 2-check-symbol error code • Can detect and correct up to 1 symbol error • No 2-symbol error detection • Weaker protection than chipkill, but better than nothing • DIMM configurations • Can even use x16 DRAMs (far more energy efficient than x4 DRAMs)

  41. Performance and Energy Efficiency • More performance degradation (compared to ECC DIMMs) • Every read accesses T2EC • More T2EC traffic → more T2EC occupancy in the LLC • Energy efficiency is sometimes better • x16 DRAMs save a lot of DRAM power • Performance degradation is low if spatial locality is good

  42. Flexible error protection • A page can have different T2EC sizes • The error protection level of a page can be • No protection • Chipkill-detect • Chipkill-correct (but cannot detect a second chip failure) • Double chipkill-correct • Penalty is proportional to protection level • T2EC size per 64B cache line

  43. [Figure: results for the Non-ECC x8 and Non-ECC x16 configurations.]

  44. Managing T2EC

  45. OS manages T2EC • PA to EA translation structure • T2EC storage • Only dirty pages require T2EC (with ECC DIMMs) • Can use Copy-On-Write T2EC allocation • Every data page needs T2EC in non-ECC implementation • Free T2EC when a data page is freed/evicted
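The lazy, copy-on-write-style T2EC allocation described above might be sketched as follows. This is a schematic model under assumed interfaces (`on_write` and `on_free` are hypothetical names), not the OS code from the paper.

```python
class T2ECAllocator:
    """Allocate T2EC pages only for dirty data pages (ECC-DIMM case)."""

    def __init__(self):
        self.pfn_to_epn = {}   # OS-managed data-page -> ECC-page mapping
        self.free_epns = []    # recycled ECC pages
        self.next_epn = 0

    def on_write(self, pfn):
        # The first write to a clean page triggers T2EC allocation,
        # analogous to copy-on-write fault handling.
        if pfn not in self.pfn_to_epn:
            epn = self.free_epns.pop() if self.free_epns else self._grow()
            self.pfn_to_epn[pfn] = epn
        return self.pfn_to_epn[pfn]

    def on_free(self, pfn):
        # Release the T2EC page when its data page is freed/evicted.
        epn = self.pfn_to_epn.pop(pfn, None)
        if epn is not None:
            self.free_epns.append(epn)

    def _grow(self):
        epn, self.next_epn = self.next_epn, self.next_epn + 1
        return epn
```

In the non-ECC-DIMM variant, every data page would need a T2EC page up front, since reads also consult T2EC.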

  46. PA to EA Translation • Every write-back (with ECC DIMMs) or read/write (with non-ECC DIMMs) needs to access T2EC • Translation is similar to VA-to-PA translation • The OS manages a single translation structure

  47. Example Translation [Figure: a three-level ECC page table walk – the physical page number, starting from a base register, indexes level-1, level-2, and level-3 table entries; the page offset, shifted by log2(T2EC size), becomes the ECC page offset; the ECC page number plus this offset forms the ECC address (EA).]
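One plausible reading of the three-level walk on this slide, in Python. The level widths, page size, and table representation are assumptions for illustration; only the shape of the walk and the T2EC-scaled offset come from the slide.

```python
PAGE_BITS = 12        # 4KB pages (assumed)
LEVEL_BITS = 9        # index bits per table level (assumed)
LINE_SIZE = 64        # bytes per cache line

def walk_ecc_table(root, pa, t2ec_size):
    """Walk a 3-level ECC page table from `root` and return the EA."""
    ppn = pa >> PAGE_BITS
    entry = root
    for level in range(3):                 # levels 1, 2, 3
        idx = (ppn >> ((2 - level) * LEVEL_BITS)) & ((1 << LEVEL_BITS) - 1)
        entry = entry[idx]                 # next-level table, then leaf
    # The leaf entry holds the ECC page base; the data-page offset is
    # scaled by the T2EC size per cache line to get the ECC page offset.
    line = (pa & ((1 << PAGE_BITS) - 1)) // LINE_SIZE
    return entry + line * t2ec_size
```

Nested dicts stand in for the in-memory table levels here; hardware would follow physical pointers instead.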

  48. Accelerating Translation • ECC address translation unit • Cache PA to EA translation • Like TLBs • Hierarchical caching – 2 levels • 1st level manages consistency with TLB • 2nd level as a victim cache • Read triggered translation • 100% hit; L1 EA cache is consistent with TLB • Only occurs with non-ECC DIMMs • Write triggered translation • Probably hit; L2 EA cache can be relatively large
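The two-level EA cache described above can be modeled as a small LRU cache with a victim cache behind it. A toy sketch with illustrative sizes and names; the real unit also tracks TLB consistency and has an MSHR, which are omitted here.

```python
from collections import OrderedDict

class EATranslationUnit:
    def __init__(self, l1_entries=16, l2_entries=4096):
        self.l1 = OrderedDict()            # small, kept TLB-consistent
        self.l2 = OrderedDict()            # larger victim cache
        self.l1_cap, self.l2_cap = l1_entries, l2_entries
        self.walks = 0                     # full translations performed

    def lookup(self, pfn, walk):
        """Return the ECC page for `pfn`; `walk` is the fallback
        full translation (the OS-managed ECC page table)."""
        if pfn in self.l1:
            self.l1.move_to_end(pfn)       # refresh LRU position
            return self.l1[pfn]
        if pfn in self.l2:
            epn = self.l2.pop(pfn)         # promote from victim cache
        else:
            self.walks += 1
            epn = walk(pfn)
        self.l1[pfn] = epn
        if len(self.l1) > self.l1_cap:     # evict L1's LRU entry to L2
            v_pfn, v_epn = self.l1.popitem(last=False)
            self.l2[v_pfn] = v_epn
            if len(self.l2) > self.l2_cap:
                self.l2.popitem(last=False)
        return epn
```

Promoting L1 victims into L2 is what makes the larger second level behave as a victim cache, matching the hierarchy on this slide.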

  49. ECC Address Translation Unit [Figure: block diagram – a two-level EA cache (the L1 EA cache kept consistent with the TLB, the L2 EA cache acting as a victim cache), control logic, an EA MSHR, and external EA translation on a miss.]

  50. Possible Impacts • TLB miss penalty • VA-to-PA translation, then PA-to-EA translation • Seems negligible – the evaluation already assumed a doubled TLB miss penalty • Design alternative: translate VA to EA directly • Requires a per-process translation structure • But potentially less impact on TLB miss penalty • EA cache misses per 1000 instructions • Configuration: 16-entry fully-associative L1 EA cache, 4K-entry 8-way L2 EA cache • ~3 in omnetpp and canneal • ~12 in GUPS • Less than 1 in other apps • Could get complicated with a software TLB handler
