Virtualized and Flexible ECC for Main Memory

Doe Hyun Yoon and Mattan Erez

Dept. of Electrical and Computer Engineering

The University of Texas at Austin

ASPLOS 2010

Memory Error Protection
  • Applying ECC uniformly – ECC DIMMs
    • Simple and transparent to programmers
  • Error protection level
    • Fixed, design-time decision
  • Chipkill-correct used in high-end servers
    • Constrains the memory module design space
      • Allows only x4 DRAMs
      • Lower energy efficiency than x8 DRAMs
  • Virtualized ECC – objectives
    • To provide flexible memory error protection
    • To relax design constraints of chipkill
Virtualized ECC
  • Two-tiered error protection
  • Tier-1 Error Code (T1EC)
    • Simple error code for detection or light-weight correction
  • Tier-2 Error Code (T2EC)
    • Strong error correcting code
  • Store T2EC within the memory namespace itself
    • OS manages T2EC
  • Flexible memory error protection
    • Different T2EC for different data pages
    • Stronger protection for more important data
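
To make the two-tier split concrete, here is a minimal, self-contained sketch (not the paper's implementation): the T1EC is modeled as a toy detection checksum and the T2EC as a stored replica standing in for the real symbol-based codes, and the PA-to-EA mapping constants are hypothetical.

```python
def t1ec(data: bytes) -> int:
    return sum(data) & 0xFF                      # toy detect-only code

DRAM = {}   # pa -> (data, t1ec): uniform part, accessed on every read
T2EC = {}   # ea -> correction info: virtualized part, lives in data memory

def pa_to_ea(pa: int, base: int = 0x8000_0000, t2ec_size: int = 16) -> int:
    # Consecutive 64B data lines map to consecutive T2EC entries (toy layout).
    return base + (pa // 64) * t2ec_size

def write_line(pa: int, data: bytes) -> None:
    DRAM[pa] = (data, t1ec(data))                # data + T1EC, as on an ECC DIMM
    T2EC[pa_to_ea(pa)] = data                    # T2EC updated on writes only

def read_line(pa: int) -> bytes:
    data, code = DRAM[pa]
    if t1ec(data) == code:                       # common case: detection only
        return data
    return T2EC[pa_to_ea(pa)]                    # rare case: consult T2EC to correct

write_line(0x0200, b"A" * 64)
DRAM[0x0200] = (b"B" + b"A" * 63, DRAM[0x0200][1])   # inject a data error
assert read_line(0x0200) == b"A" * 64
```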
Virtualized ECC – Example

[Figure: each virtual page maps to a physical page frame holding data and a uniform T1EC. Frames j and k additionally map, through a physical-frame-to-ECC-page mapping, to ECC pages j and k; ECC page j holds a T2EC for chipkill and ECC page k a larger T2EC for double chipkill, while page i has no T2EC, illustrating error protection levels from low to high.]

Observations on Memory Errors
  • Per-system error rate is still low
    • Most of the time, error detection finds no error
  • Detecting errors is the common-case operation
    • Needs a low-latency, low-complexity error detection mechanism

→ T1EC

  • Correcting errors is the uncommon-case operation
    • Correction can be complex and take a long time
    • But error correction information still needs to be kept somewhere

→ Virtualized T2EC

Uniform ECC

[Figure: a virtual address (VPN + offset) translates to a physical address (PFN + offset) that selects a page frame in physical memory; every frame stores data together with its ECC in dedicated storage.]

Virtualized ECC

[Figure: the virtual address translates to a physical address as usual, and data is stored with a uniform T1EC. In addition, the OS manages a PFN-to-EPN translation that maps the physical frame to an ECC page holding its T2EC; the ECC address (ECC page number + offset) is scaled according to the T2EC size.]

Virtualized ECC Operation
  • Read: fetch data and T1EC
    • Don’t need T2EC in most cases
  • Write: update data, T1EC, and T2EC
    • ECC Address Translation Unit: fast PA to EA translation
    • T2ECs of consecutive data lines map to a T2EC line
    • T2EC lines can be partially valid
    • Update only valid T2EC to DRAM

[Figure: a read of 0x00c0 fetches data and T1EC only; a write to 0x0200 also updates its T2EC, which the ECC address translation unit locates at EA 0x0540 in the T2EC region of DRAM; the update is merged into a cached T2EC line in the LLC and written back to DRAM later.]

Penalty with V-ECC
  • Increased data miss rate
    • T2EC lines in LLC reduce effective LLC size
  • Increased traffic due to T2EC write-back
    • One-way write-back traffic
      • Not on the critical path
Chipkill-correct
  • Single Device-error Correct, Double Device-error Detect
    • Can tolerate a DRAM failure
    • Can detect a second DRAM failure
  • Chipkill requires x4 DRAMs
  • x8 chipkill is impractical
    • But, x8 DRAM is more energy efficient
Baseline x4 Chipkill
  • Two x4 ECC DIMMs
    • 128-bit data + 16-bit ECC (redundancy overhead: 12.5%)
    • 4-check-symbol error code using 4-bit symbols
  • Access granularity
    • 64B in DDR2 (min. burst 4 x 128 bit)
    • 128B in DDR3 (min. burst 8 x 128 bit)
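
A quick sketch of the arithmetic behind these granularity and overhead numbers (and the x8 cases on the following slides); the helper names are only for illustration.

```python
def access_granularity_bytes(data_bus_bits: int, burst_length: int) -> int:
    # Each burst beat transfers the full data-bus width.
    return data_bus_bits * burst_length // 8

def redundancy_overhead(data_bits: int, ecc_bits: int) -> float:
    return ecc_bits / data_bits

# Baseline x4 chipkill: 128-bit data + 16-bit ECC over two x4 ECC DIMMs
print(access_granularity_bytes(128, 4))   # DDR2, burst 4 -> 64 bytes
print(access_granularity_bytes(128, 8))   # DDR3, burst 8 -> 128 bytes
print(redundancy_overhead(128, 16))       # 0.125   (12.5%)

# x8 chipkill at the same granularity needs a custom 152-bit DIMM
print(redundancy_overhead(128, 24))       # 0.1875  (18.75%)

# x8 chipkill with standard ECC DIMMs doubles the data path instead
print(access_granularity_bytes(256, 4))   # DDR2 -> 128 bytes
print(access_granularity_bytes(256, 8))   # DDR3 -> 256 bytes
print(redundancy_overhead(256, 24))       # 0.09375 (9.375%)
```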

[Figure: two x4 ECC DIMMs (36 x4 DRAMs) form a 144-bit wide data bus.]

x8 Chipkill
  • x8 chipkill with the same access granularity
    • 152-bit wide data path
      • 128-bit data + 24-bit ECC
      • Redundancy overhead: 18.75%
    • Need a custom-designed DIMM
      • Significantly increases system cost

[Figure: 19 x8 DRAMs form a custom 152-bit wide data bus.]

x8 Chipkill with Standard DIMMs
  • Increases access granularity
    • 128B in DDR2 (min. burst 4 x 256 bit)
    • 256B in DDR3 (min. burst 8 x 256 bit)

[Figure: 35 x8 DRAMs (256-bit data + 24-bit ECC) form a 280-bit wide data bus.]

V-ECC for Chipkill
  • Use a 3-check-symbol error code
    • Single Symbol-error Correct and Double Symbol-error Detect
  • T1EC
    • 2 check symbols
    • Detects up to 2 symbol errors
  • T2EC
    • 3rd check symbol
    • Combined T1EC/T2EC provides Chipkill
V-ECC: ECC x4 configuration
  • Use an 8-bit-symbol error code
    • Two bursts from an x4 DRAM form an 8-bit symbol
      • Modern DRAMs have a minimum burst length of 4 or 8
  • 1 x4 ECC DIMM + 1 x4 Non-ECC DIMM
  • Each DRAM access in DDR2 (burst 4)
    • 64B data, 4B T1EC
    • 2B T2EC is virtualized within memory namespace
      • 32 T2ECs per 64B cache line
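
A small sketch of the packing arithmetic behind the "32 T2ECs per 64B cache line" here (and the 16 per line on the next slide); the base address of the T2EC region is omitted, and the layout follows the earlier observation that T2ECs of consecutive data lines share one T2EC line.

```python
CACHE_LINE = 64

def t2ecs_per_line(t2ec_bytes: int) -> int:
    return CACHE_LINE // t2ec_bytes

def t2ec_location(pa: int, t2ec_bytes: int):
    line_index = pa // CACHE_LINE              # which data cache line
    per_line = t2ecs_per_line(t2ec_bytes)
    ecc_line = line_index // per_line          # which 64B T2EC line it shares
    offset = (line_index % per_line) * t2ec_bytes
    return ecc_line, offset

print(t2ecs_per_line(2))           # ECC x4: 32 T2ECs per 64B line
print(t2ecs_per_line(4))           # ECC x8: 16 T2ECs per 64B line
print(t2ec_location(0x0200, 4))    # data line 8 -> (T2EC line 0, byte offset 32)
```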

[Figure: one x4 ECC DIMM plus one x4 non-ECC DIMM (34 x4 DRAMs) form a 136-bit wide data bus carrying data and T1EC; the T2EC is virtualized within memory.]

V-ECC: ECC x8 configuration
  • Use an 8-bit-symbol error code
  • 2 x8 ECC DIMMs
  • Each DRAM access in DDR2 (burst 4)
    • 64B data, 8B T1EC
    • 4B T2EC is virtualized
      • 16 T2ECs per 64B cache line

[Figure: two x8 ECC DIMMs (18 x8 DRAMs) form a 144-bit wide data bus carrying data and T1EC; the T2EC is virtualized within memory.]

Flexible Error Protection
  • A single hardware design with V-ECC can provide
    • Chipkill-detect, Chipkill-correct, and Double chipkill-correct
    • Use different T2EC for different pages
  • Reliability – Performance tradeoff
  • Maximize performance/power efficiency with Chipkill-Detect
  • Stronger protection at the cost of additional T2EC access
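
A hedged sketch of what this per-page flexibility could look like at the OS interface; the Protection enum, the set_protection() call, and the per-level T2EC sizes are hypothetical (only the 4B chipkill-correct T2EC of the ECC x8 configuration appears in the talk).

```python
from enum import Enum

class Protection(Enum):
    NONE             = 0   # no T2EC allocated for the page
    CHIPKILL_DETECT  = 1   # T1EC alone already detects a failed chip
    CHIPKILL_CORRECT = 2
    DOUBLE_CHIPKILL  = 3

# Hypothetical T2EC bytes per 64B cache line for each level (ECC x8 style);
# only the 4B chipkill-correct value is stated in the talk.
T2EC_BYTES = {
    Protection.NONE: 0,
    Protection.CHIPKILL_DETECT: 0,
    Protection.CHIPKILL_CORRECT: 4,
    Protection.DOUBLE_CHIPKILL: 8,
}

page_protection = {}                        # virtual page -> Protection, OS-managed

def set_protection(vpage: int, level: Protection) -> None:
    page_protection[vpage] = level          # e.g., stronger level for critical data

def t2ec_bytes_for_page(vpage: int, page_size: int = 4096, line: int = 64) -> int:
    level = page_protection.get(vpage, Protection.CHIPKILL_CORRECT)
    return (page_size // line) * T2EC_BYTES[level]

set_protection(0x10, Protection.DOUBLE_CHIPKILL)            # important data
set_protection(0x11, Protection.CHIPKILL_DETECT)            # scratch data
print(t2ec_bytes_for_page(0x10), t2ec_bytes_for_page(0x11))  # 512 0
```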
Simulator/Workload
  • GEMS + DRAMsim
    • An out-of-order SPARC V9 core
    • Exclusive two-level cache hierarchy
    • DDR2 800MHz – 12.8GB/s (128-bit wide data path)
      • 1 channel, 4 ranks
  • Power model
    • WATTCH for processor power – scaled to 45nm
    • CACTI for cache power – 45nm
    • Micron model for DRAM power – commodity DRAMs
  • Workloads
    • 12 data intensive applications from SPEC CPU 2006 and PARSEC
    • Microbenchmarks: STREAM and GUPS
Normalized Execution Time
  • Less than 1% penalty on average
  • Performance penalty depends on spatial locality and the amount of write-back traffic
System Energy Efficiency
  • Energy Delay Product (EDP) gain
    • ECC x4: 1.1% on average
    • ECC x8: 12.0% on average

[Chart: system EDP improvement for each workload.]

Flexible Error Protection

[Chart: results for the three protection levels – Chipkill-Detect, Chipkill-Correct, and Double Chipkill-Correct.]

Conclusion
  • Virtualized ECC
    • Two-tiered error protection, virtualized T2EC
  • Improved system energy efficiency with chipkill
    • Reduce DRAM power consumption by 27%
    • Improve system EDP by 12%
  • Performance penalty – 1% on average
  • Error protection even for Non-ECC DIMMs
    • Can be used for GPU memory error protection
  • Flexibility in error protection
    • Adaptive error protection level by user/system demand
    • Cost of error protection is proportional to protection level
Virtualized ECC Operations
  • DRAM read
    • Fetch data and T1EC – detect errors
    • Don’t need T2EC in most cases
  • DRAM write-back
    • Update data, T1EC, and T2EC
    • Cache T2EC in the LLC to exploit locality in T2EC accesses
    • Need to translate PA to EA
  • On-chip ECC address translation unit
    • TLB-like structure for fast PA to EA translation
  • Error correction
    • Need to read the T2EC, which may be in the LLC or in DRAM
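
A minimal, self-contained sketch of the write-back path described above (a toy model, not the authors' hardware): the dirty line goes to DRAM with its T1EC, its T2EC is merged into a cached 64B T2EC line in the LLC, and that line only reaches DRAM when it is itself written back. The codes, addresses, and layout constants are placeholders.

```python
LINE = 64
dram_data = {}     # pa -> (data, t1ec)
dram_t2ec = {}     # ea -> 64B T2EC line
llc_t2ec  = {}     # ea -> bytearray(64): cached, possibly partially valid, T2EC lines

def t1ec(data: bytes) -> int:
    return sum(data) & 0xFF                          # toy detection code

def t2ec(data: bytes) -> bytes:
    return bytes(x ^ 0xFF for x in data[:4])         # toy 4-byte correction info

def pa_to_ea(pa: int, base: int = 0x10000, t2ec_bytes: int = 4):
    per_line = LINE // t2ec_bytes
    line = pa // LINE
    return base + (line // per_line) * LINE, (line % per_line) * t2ec_bytes

def writeback(pa: int, data: bytes) -> None:
    dram_data[pa] = (data, t1ec(data))               # data + T1EC go to DRAM
    ea, offset = pa_to_ea(pa)                        # ECC address translation unit
    line = llc_t2ec.setdefault(ea, bytearray(LINE))  # allocate/merge in the LLC
    line[offset:offset + 4] = t2ec(data)

def evict_t2ec_line(ea: int) -> None:
    dram_t2ec[ea] = bytes(llc_t2ec.pop(ea))          # one-way write-back traffic only

writeback(0x1240, b"Z" * LINE)
evict_t2ec_line(pa_to_ea(0x1240)[0])
```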
[Figure: backup slide repeating the read/write operation example – a read of 0x00c0 fetches data and T1EC only, while a write to 0x0200 also updates its T2EC at EA 0x0540 via the ECC address translation unit, merging it into a cached T2EC line in the LLC before write-back.]

RECAP: V-ECC
  • Two-tiered error protection
    • Uniform T1EC
    • Virtualized T2EC
  • V-ECC for chipkill
    • ECC x4 configuration: saves 8 data pins
    • ECC x8 configuration: more energy efficient
  • Flexible error protection
    • Different T2EC for different pages
    • Stronger protection for important data
    • No protection for unimportant data
Power Consumption
  • DRAM power saving
    • ECC x4: 4.2%
    • ECC x8: 27.8%
  • Total power saving
    • ECC x4: 2.1%
    • ECC x8: 13.2%
Caching T2EC
  • T2EC occupancy: Less than 10% on average
  • MPKI overhead: Very small
  • The higher the spatial locality, the smaller the impact on caching behavior
Traffic
  • Traffic increase – less than 10% on average
    • Increased demand misses
    • T2EC traffic
  • Spatial locality is important, and so is the amount of write-back traffic
Virtualized ECC
  • Uniform T1EC
    • Low-cost error detection or light-weight correction
  • Virtualized T2EC
    • Corrects errors that the T1EC detects but cannot correct
    • Cacheable and memory mapped
  • Read accesses data and T1EC
    • No T2EC needed in most cases
    • Simpler common-case read operations
  • Write updates data, T1EC, and T2EC
Flexible Error Protection
  • ECC x8 DRAM configuration
  • Stronger error protection at the cost of more T2EC accesses
    • Additional cost of double chipkill (relative to chipkill) is quite small
  • Adaptation is at per-page granularity
What if BW is limited?
  • Half DRAM BW – 6.4GB/s
  • Emulates a CMP environment where bandwidth is scarcer
ECC for non-ECC DIMMs
  • Virtualize ECC in the memory namespace
    • Not a two-tiered error protection
    • No uniform ECC storage (for T1EC)
    • But we still call this ECC ‘T2EC’ to keep the notation consistent
  • Virtualized T2EC both detects and corrects errors
    • Now, a DRAM read also triggers a T2EC access
    • Increased T2EC traffic, increased T2EC occupancy, and more penalty
    • But, we can detect and correct errors with non-ECC DIMMs
[Figure: operation example with non-ECC DIMMs – DRAM holds only data plus the per-rank T2EC regions (no T1EC), so both reads and write-backs translate their PA to an EA through the ECC address translation unit and access the corresponding T2EC line (e.g., PA 0x0180 maps to EA 0x0550 and PA 0x00c0 to EA 0x0510).]

DIMM configurations
  • Use a 2-check-symbol error code
    • Can detect and correct up to 1 symbol error
    • No 2-symbol error detection
    • Weaker protection than chipkill, but better than nothing
  • DIMM configurations
    • Can even use x16 DRAMs (much more energy efficient than x4 DRAMs)
Performance and Energy Efficiency
  • More performance degradation (compared to ECC DIMMs)
    • Every read accesses T2EC
    • More T2EC traffic and more T2EC occupancy in the LLC
  • Energy efficiency is sometimes better
    • x16 DRAMs save a lot of DRAM power
    • Performance degradation is low if spatial locality is good
Flexible error protection
  • A page can have different T2EC sizes
  • Error protection level of a page can be
    • No protection
    • 1 chipkill detect
    • 1 chipkill correct (but cannot detect 2 failed chips)
    • 2 chipkill correct
  • Penalty is proportional to protection level
  • T2EC size per 64B cache line

[Table: T2EC size per 64B cache line for each protection level.]

[Chart: results for the Non-ECC x8 and Non-ECC x16 configurations.]

OS manages T2EC
  • PA to EA translation structure
  • T2EC storage
    • Only dirty pages require T2EC (with ECC DIMMs)
      • Can use Copy-On-Write T2EC allocation
    • Every data page needs T2EC in non-ECC implementation
    • Free T2EC when a data page is freed/evicted
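
A hedged sketch of the copy-on-write style allocation mentioned above: with ECC DIMMs a data page only needs T2EC backing once it becomes dirty, and the ECC page returns to the pool when the data page is freed. The names and the free-list representation are hypothetical.

```python
free_ecc_pages = list(range(0x100, 0x200))    # pool of free ECC page frames
ecc_table = {}                                # data PFN -> ECC page number (EPN)

def on_first_dirty(pfn: int) -> int:
    # Allocate T2EC backing lazily, only when the data page first becomes dirty.
    return ecc_table.setdefault(pfn, free_ecc_pages.pop())

def on_page_free(pfn: int) -> None:
    epn = ecc_table.pop(pfn, None)
    if epn is not None:
        free_ecc_pages.append(epn)            # reclaim the T2EC storage

epn = on_first_dirty(42)                      # first write-back to page 42 allocates
assert on_first_dirty(42) == epn              # later write-backs reuse the same EPN
on_page_free(42)
```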
PA to EA Translation
  • Every write-back (with ECC DIMMs) or read/write (with non-ECC DIMMs) needs to access T2EC
    • Translation is similar to VA to PA translation
  • OS manages a single translation structure
example translation
Example Translation

Physical address (PA)

Level 1

Level 2

Level 3

Page offset

>>

ECC page table

Base register

+

log2(T2EC)

ECC table entry

+

ECC table entry

+

ECC table entry

ECC page number

ECC Page offset

ECC address (EA)
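
The walk sketched below mirrors the figure, modeled after a radix page-table walk; the 3-level split, the index widths, and the way several data pages would share one ECC page are illustrative assumptions, not the OS's actual layout.

```python
LEVEL_BITS  = (9, 9, 9)        # hypothetical index widths for levels 1..3
OFFSET_BITS = 12               # 4KB data pages

def pa_to_ea(root: dict, pa: int, t2ec_bytes: int = 4, line: int = 64) -> int:
    table = root                                     # reached via a base register
    shift = OFFSET_BITS + sum(LEVEL_BITS)
    for bits in LEVEL_BITS:                          # three ECC page table lookups
        shift -= bits
        table = table[(pa >> shift) & ((1 << bits) - 1)]
    ecc_page_number = table                          # leaf holds the ECC page number
    page_offset = pa & ((1 << OFFSET_BITS) - 1)
    ecc_offset = (page_offset // line) * t2ec_bytes  # scale the offset by the T2EC size
    return (ecc_page_number << OFFSET_BITS) | ecc_offset

ecc_root = {0: {0: {0: 0x80}}}                       # toy table covering one data page
print(hex(pa_to_ea(ecc_root, 0x0200)))               # -> 0x80020
```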

Accelerating Translation
  • ECC address translation unit
    • Caches PA to EA translations
      • Like a TLB
    • Hierarchical caching – 2 levels
      • 1st level manages consistency with the TLB
      • 2nd level acts as a victim cache
    • Read-triggered translation
      • Always hits; the L1 EA cache is kept consistent with the TLB
      • Only occurs with non-ECC DIMMs
    • Write-triggered translation
      • Usually hits; the L2 EA cache can be relatively large
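
A hedged sketch of the lookup order implied above; the capacities, the replacement policy, and the fallback walk are placeholders rather than the evaluated design.

```python
class EACache:
    """Two-level PA-to-EA translation cache (illustrative only)."""
    def __init__(self, walk_fn, l1_entries: int = 16, l2_entries: int = 4096):
        self.walk_fn = walk_fn                  # falls back to the ECC page table walk
        self.l1, self.l1_cap = {}, l1_entries   # kept consistent with the TLB
        self.l2, self.l2_cap = {}, l2_entries   # acts as a victim cache

    def translate(self, ppn: int) -> int:
        if ppn in self.l1:                      # read-triggered lookups hit here
            return self.l1[ppn]
        if ppn in self.l2:                      # victim hit: promote back to L1
            epn = self.l2.pop(ppn)
        else:
            epn = self.walk_fn(ppn)             # miss: walk the ECC page table
        self._fill_l1(ppn, epn)
        return epn

    def _fill_l1(self, ppn: int, epn: int) -> None:
        if len(self.l1) >= self.l1_cap:         # demote an entry into L2
            victim_ppn, victim_epn = self.l1.popitem()
            if len(self.l2) < self.l2_cap:
                self.l2[victim_ppn] = victim_epn
        self.l1[ppn] = epn

ea_cache = EACache(walk_fn=lambda ppn: ppn + 0x100)   # stand-in for the table walk
print(hex(ea_cache.translate(0x42)))                  # -> 0x142
```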
ECC Address Translation Unit

[Figure: the ECC address translation unit sits beside the TLB and contains a two-level EA cache (the L1 EA cache kept consistent with the TLB, the L2 EA cache behind it), control logic, and an MSHR; given a PA it returns the EA, falling back to an external EA translation (table walk) on a miss.]

Possible Impacts
  • TLB miss penalty
    • VA to PA translation, then PA to EA translation
      • Seems negligible – the evaluation already assumes a doubled TLB miss penalty
      • Design alternative: translate VA to EA directly
        • Need to manage a per-process translation structure
        • But potentially less impact on TLB miss penalty
  • EA cache misses per 1000 instructions
    • Configuration
      • 16-entry fully associative L1 EA cache
      • 4K-entry, 8-way L2 EA cache
    • ~3 in omnetpp and canneal
    • ~12 in GUPS
    • Less than 1 in other apps
    • Things could get more complicated with a software TLB handler
Chip-Kill-Correct
  • Single device-error correct, double device-error detect
    • Other names: DRAM RAID, Extended ECC, Advanced ECC, …
    • Can tolerate a DRAM device failure
  • Using x1 DRAMs
    • SEC-DED effectively provides chipkill-correct
    • But x1 DRAMs are no longer available

[Figure: a rank built from x1 DRAMs – 64 data bits plus 8 ECC bits, one bit per device.]

Interleaved SEC-DED
  • 4 interleaved SEC-DED codes – x4 chipkill
    • 256-bit data width
    • Works with old DRAMs
    • Modern DRAMs use burst access
      • Granularity – DDR2: 128B, DDR3: 256B

[Figure: four interleaved (72,64) SEC-DED codes spread over 64 data DRAMs and 8 ECC DRAMs, so each x4 DRAM contributes one bit to each code word.]

[Figure: comparison of the two V-ECC layouts over a burst of 4 – ECC x8 uses two x8 ECC DIMMs, while ECC x4 uses an x4 ECC DIMM plus an x4 non-ECC DIMM; in both, each access carries data and T1EC, and the T2EC is virtualized.]

Why is x8 chipkill impractical?
  • With the same access granularity
    • Higher redundancy overhead
      • 128-bit data + 24-bit ECC (18.75%)
    • Need custom-designed DIMMs
  • Using standard ECC DIMMs
    • Wider data-path
      • 256-bit data + 24-bit ECC (9.375%)
    • Increase access granularity
      • 128B in DDR2
      • 256B in DDR3
DRAM Modules
  • Non-ECC DIMMs
    • 64-bit wide data path
  • ECC DIMMs
    • 72-bit wide data path
    • Additional DRAMs dedicated to storing ECC
    • Additional pins to transfer ECC
  • SEC-DED
    • Single-bit Error Correction, Double-bit Error Detection
    • 64-bit data + 8-bit ECC
[Figure: standard DIMM organizations – x4 non-ECC DIMM (16 x4 DRAMs, 64-bit wide), x8 non-ECC DIMM (8 x8 DRAMs, 64-bit wide), x4 ECC DIMM (18 x4 DRAMs, 72-bit wide), and x8 ECC DIMM (9 x8 DRAMs, 72-bit wide).]

High-end Servers
  • Need BOTH reliability and energy efficiency
  • Reliability
    • Chipkill-correct
  • But, chipkill requires x4 configurations
    • Using more energy-efficient x8 configurations is impractical with chipkill
High-level Memory Models

[Figure: in the conventional architecture, a VA maps to a PA, and each physical location holds data plus dedicated ECC. In the Virtualized ECC architecture, data is stored with a uniform T1EC, and the T2EC resides in the same PA space, reached through an additional PA-to-EA mapping.]

Example

[Figure: three applications’ VA spaces map through the usual VA-to-PA mapping onto DRAM holding data and T1EC; a separate PA-to-EA mapping locates each page’s T2EC within the same DRAM.]

Standard DIMMs
  • x4 Non-ECC DIMMs
    • 16 x4 DRAMs per rank
  • x4 ECC DIMMs
    • 18 x4 DRAMs per rank

[Figure: an x4 non-ECC DIMM (16 x4 DRAMs, 64-bit wide data bus) and an x4 ECC DIMM (18 x4 DRAMs, 72-bit wide data bus).]

Standard DIMMs – Cont’d
  • 8 x8 DRAMs per rank in Non-ECC DIMMs
  • 9 x8 DRAMs per rank in ECC DIMMs
  • x8 consumes 30% less power than x4

[Figure: an x8 non-ECC DIMM (8 x8 DRAMs, 64-bit wide data bus) and an x8 ECC DIMM (9 x8 DRAMs, 72-bit wide data bus).]

Standard DIMMs – Cont’d
  • 4 x16 DRAMs per rank in Non-ECC DIMMs
  • No x16 ECC DIMMs
  • More power efficient than x8 DRAMs

[Figure: an x16 non-ECC DIMM (4 x16 DRAMs, 64-bit wide data bus); there is no x16 ECC DIMM.]

Configurations
  • Baseline x4
    • Traditional uniform Chip-Kill
    • Note: x8 Chip-Kill is not practical
  • Virtualized ECC
    • ECC x4
      • Save 8 data pins
    • ECC x8
      • Use more energy efficient x8 DRAM

  Configuration   Per-access layout            DIMMs
  Baseline x4     128-bit data + 16-bit ECC    2 x4 ECC DIMMs
  ECC x4          128-bit data + 8-bit ECC     1 x4 ECC DIMM + 1 x4 non-ECC DIMM
  ECC x8          128-bit data + 16-bit ECC    2 x8 ECC DIMMs

Symbol-based Error Codes
  • b-bit symbols
  • GF(2^b)-based arithmetic
  • Simple rules
    • 1 check symbol
      • 1 symbol error detect
    • 2 check symbols
      • 1 symbol error correct
      • 2 symbol error detect
    • 3 check symbols
      • 1 symbol error correct + 2 symbol error detect
      • 3 symbol error detect
    • 4 check symbols
      • 2 symbol error correct + 2 symbol error detect
      • 4 symbol error detect
  • A 3-check-symbol error code provides chipkill-correct
    • Max codeword length: 2^b + 2 symbols
      • b=4: 60-bit data + 12-bit ECC
      • b=8: 2040-bit data + 24-bit ECC
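
To make the GF(2^b) arithmetic and the codeword-length rule tangible, here is a small sketch: symbol addition is XOR, multiplication is carry-less multiplication reduced by a primitive polynomial (the b = 8 polynomial below is a common choice, not necessarily the one used in the paper), and the data/ECC splits follow the 2^b + 2 rule stated above.

```python
def gf256_mul(a: int, b: int, poly: int = 0x11D) -> int:
    # Multiply two GF(2^8) symbols; poly is x^8 + x^4 + x^3 + x^2 + 1.
    result = 0
    while b:
        if b & 1:
            result ^= a            # addition in GF(2^8) is XOR
        b >>= 1
        a <<= 1
        if a & 0x100:              # reduce modulo the primitive polynomial
            a ^= poly
    return result & 0xFF

assert gf256_mul(0x02, 0x80) == 0x1D   # x * x^7 wraps around the polynomial

def max_data_bits(b: int, check_symbols: int = 3) -> int:
    # Maximum codeword length from the slide's rule: 2^b + 2 symbols in total.
    total_symbols = 2**b + 2
    return (total_symbols - check_symbols) * b

print(max_data_bits(4))   # 60   -> 60-bit data + 12-bit ECC
print(max_data_bits(8))   # 2040 -> 2040-bit data + 24-bit ECC
```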