Versatile Refresh: Low Complexity Refresh Scheduling for High–throughput Multi-banked eDRAM


Mohammad Alizadeh, Adel Javanmard, Da Chuang, Sundar Iyer, Yi Lu

(alizade, adelj)@stanford.edu, (dachuang, sundaes)@memoir-systems.com, yilu4@illinois.edu

ACM Sigmetrics/Performance 2012

What Is Embedded DRAM?

- 2nd Most Common Embedded Memory
  - Consists of 1 Transistor, 1 Capacitor cell
  - 2X-3X denser than SRAM
  - 2X-4X slower than SRAM

- Supported by Key ASIC and IP Vendors
  - IBM, TSMC, NEC, Mosys, ST

- Used in a Number of Applications
  - Servers, Networking, Storage, Gaming, Mobile

- Industry Examples
  - IBM's P7
  - Sony Playstations, Nintendo GameCube, Wii
  - Apple iPhone, Microsoft Zune HD, Xbox 360
  - Cisco Catalyst 3K-10K

[Figure: eDRAM 1T1C Memory Cell, showing the Select line, the Data line, and the Storage Capacitor]

Problem: eDRAM Refresh Causes Memory Bandwidth Loss

- DRAM Capacitor has Finite Retention Time (W = Tref)
  - Example: W = 18 us @ 100C = 4050 cycles @ 225 MHz
  - Example: R = 64 rows
  - All 64 rows will lose data in 4050 cycles!

[Figure: a single bank with rows 1 to R, an R/W port, and a refresh port]

Solution: Periodic Refresh. Reserve Refresh Cycles for Every Cell in Memory.

Causes Bandwidth Loss = R/W = 64 rows / 4050 cycles ~ 1.58%
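The periodic-refresh arithmetic above can be checked with a few lines of Python. This is a minimal sketch; the function names `retention_cycles` and `periodic_refresh_overhead` are illustrative and do not come from the slides.

```python
def retention_cycles(retention_us: float, clock_mhz: float) -> int:
    """Convert a retention time in microseconds to clock cycles (us * MHz = cycles)."""
    return round(retention_us * clock_mhz)


def periodic_refresh_overhead(rows: int, w_cycles: int) -> float:
    """Fraction of cycles reserved for refresh when each of `rows` rows
    needs one reserved refresh cycle per retention window of `w_cycles`."""
    return rows / w_cycles


w = retention_cycles(18, 225)                 # 18 us at 225 MHz
overhead = periodic_refresh_overhead(64, w)   # R = 64 rows
print(w, f"{overhead:.2%}")                   # prints: 4050 1.58%
```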

Trend: Higher Density Multi-banked Macros (Mb/mm2)

- (1) More Rows are Packed Together and Need to be Refreshed
- (2) More Banks are Packed Together and Need to be Refreshed
- (3) Smaller Capacitor with Lower Geometry → Smaller W
- (4) Smaller W with Higher Temperature
- (5) Low Clock Speed Mode Decreases ‘Clock-time’ to Refresh

[Figure: memory banks 1 to B, each with rows 1 to R, served by shared refresh and R/W ports 1 to M; circuitry is shared to conserve area]

Periodic Refresh Bandwidth Loss ~ RB/W (Note: W ≥ RB)

Does not Scale with Larger Macros, Geometry & Low Power Modes

M = Number of Memory Ports, B = Number of Banks, R = Number of Rows, W = Cell Retention Time

The Problem is Only Getting Worse Over Time …
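To see how the RB/W loss grows with the number of banks, the sketch below evaluates it for a fixed row count and retention time. The values R = 1024 and W = 6825 cycles are the ones quoted later in the deck; the function name `multibank_refresh_loss` and the sweep over B are illustrative assumptions.

```python
def multibank_refresh_loss(rows: int, banks: int, w_cycles: int) -> float:
    """Periodic-refresh bandwidth loss ~ RB/W for a macro with shared ports."""
    cells = rows * banks
    if w_cycles < cells:
        # The slide notes W >= RB; otherwise not every row can be refreshed in time.
        raise ValueError("retention time W must be at least R*B cycles")
    return cells / w_cycles


for banks in (1, 2, 4, 6):
    loss = multibank_refresh_loss(rows=1024, banks=banks, w_cycles=6825)
    print(f"B = {banks}: loss ~ {loss:.1%}")
```

With these numbers the loss rises from about 15% for one bank to about 90% for six banks, which is the scaling problem the slide points at.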

Vendor Solution: Concurrent Refresh

[Figure: memory banks 1 to B, each with rows 1 to R, served by R/W ports 1 to M and a separate concurrent refresh port]

Concurrent Refresh++: Refresh a Bank Which is Not Being Concurrently Accessed

++T. Kirihata et al., An 800-MHz embedded DRAM with a concurrent refresh mode. IEEE Journal of Solid-State Circuits, 40(6):1377–1387, June 2005.

[Figure: concurrent refresh example. Memory banks 1 to B with per-bank refresh pointers (RP1, RP2, RP3, RP4, RP16), a refresh port, the accessed bank (Bank 2), the next bank for concurrent refresh (3), and a deficit register holding a bank and a count]

Deficit Register Tracks Non-refreshed Bank(s)

Standard Observation: N-1 out of N Banks Get Refreshed for Any Pattern

- Concurrent Refresh Overhead is Proportional to 1 bank
- Concurrent Refresh Overhead = R/W = 64 rows / 4050 cycles = ~1.58%

Our Observation: This is Incorrect! In Some Cases, Refresh Overhead can be Very Bad, Close to 100% for ANY Concurrent Refresh Scheduler

- Design a Concurrent Refresh Scheduler that can
  - Provide Deterministic Memory Performance Guarantees
  - Maximize Memory Throughput (Optimality)
  - Be Universally Applicable
    - For any eDRAM macro with B banks, R Rows, M memory ports
    - For any characteristics of cell retention time W++, and Clock speed
  - Maximize Memory Burst Tolerance
  - Have Low Implementation Overhead

++Note that W is itself a function of temperature, process, and the micro-architecture of the eDRAM

- We consider a general class of algorithms that require X refresh (idle) timeslots in every Y consecutive timeslots.

[Figure: Fixed TDM Constraint, with a refresh slot in each of the fixed Refresh Windows 1, 2, 3, 4, ...]

[Figure: Sliding Window Constraint, with refresh slots falling in any refresh window]

Sliding Window Constraint: Supports X idle cycles in any (t, t+Y)

Sliding Window Constraint Gives Maximum Flexibility for Handling Bursts, and When to Provide Idle Cycles

- Refresh Overhead = X / Y
  - Memory bandwidth wasted on refresh

- Burst Tolerance = Y – X
  - Maximum number of consecutive memory accesses without interruption for refresh

We’ll Consider the Simple Case When the User is Required to Send X = 1 Idle in Y Cycles, and M = 1
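The sliding window constraint can be made concrete with a small checker. The sketch below is illustrative only: it represents a trace as a list of booleans (True for an idle slot), and the function name `meets_sliding_window` is an assumption, not something defined in the slides.

```python
from typing import Sequence


def meets_sliding_window(idle: Sequence[bool], x: int, y: int) -> bool:
    """True if every run of y consecutive slots contains at least x idle slots."""
    if len(idle) < y:
        return True  # no complete window to check
    count = sum(idle[:y])  # idle slots in the first window
    if count < x:
        return False
    for t in range(y, len(idle)):
        count += idle[t] - idle[t - y]  # slide the window forward by one slot
        if count < x:
            return False
    return True


I, A = True, False  # idle slot, memory access slot
x, y = 1, 4

flexible = [A, A, I, A, A, A, I, A, I, A, A, A]  # idles are irregular but frequent enough
too_bursty = [I, A, A, A, A, I, A, A]            # slots 1..4 contain no idle

print(meets_sliding_window(flexible, x, y))
print(meets_sliding_window(too_bursty, x, y))
print("refresh overhead:", x / y, "burst tolerance:", y - x)
```

The first trace passes even though its idle slots are not evenly spaced; the second fails because it contains four consecutive accesses, one more than the burst tolerance Y – X = 3.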

[Figure: Versatile Refresh example. Memory banks 1 to B with per-bank refresh pointers (RP1 to RPB), a refresh pointer, a deficit register (pointer and count), a max register (count), and the next bank for concurrent refresh]

Bank with deficit has priority for refresh.

Maximum Allowed Deficit Register Controls Burst Tolerance (Y)

- At each time, the BR memory cells have distinct ages, bounded below by (0, …, BR-1) in sorted order
- An adversary keeps reading from a particular bank; idle slots are needed to refresh cells in that bank.
- A total of BR inequalities to ensure cells are refreshed in time
- Interestingly, only two of these inequalities matter
  - The one corresponding to the oldest cell
  - The one corresponding to the oldest “youngest cell in each bank”

- How much can the adversary age the oldest cell?
  - Current age is at least BR-1
  - Must wait for at least 1 idle before it is picked up: (BR-1) + Y ≤ W

- How much can the adversary age the oldest “youngest cell in each bank”?
  - Current age is at least B-1
  - Must wait for at least R idles before it is picked up: (B-1) + YR ≤ W
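For X = 1, the two inequalities above bound the largest window Y that any scheduler could allow. The sketch below solves them for Y; it encodes only these two necessary conditions, so the result is an upper bound on Y and not a statement of what the VR algorithm guarantees. The function name `max_window_x1` is an assumption made for illustration.

```python
from typing import Optional


def max_window_x1(banks: int, rows: int, w: int) -> Optional[int]:
    """Largest Y allowed by the two necessary conditions for X = 1, or None if no Y >= 1 fits."""
    y_oldest = w - (banks * rows - 1)        # from (BR-1) + Y <= W
    y_youngest = (w - (banks - 1)) // rows   # from (B-1) + Y*R <= W
    y = min(y_oldest, y_youngest)
    return y if y >= 1 else None


# (B, R, W) triples: the single-bank example, a case with W = RB, and the
# B = 16, R = 128, W = 2500 example quoted in the deck.
for banks, rows, w in [(1, 64, 4050), (16, 128, 2048), (16, 128, 2500)]:
    y = max_window_x1(banks, rows, w)
    print(f"B={banks}, R={rows}, W={w}: Y <= {y}, overhead >= {1 / y:.2%}")
```

When W equals RB, the first inequality forces Y = 1, meaning every slot must be idle. This matches the earlier observation that overhead can approach 100% for any concurrent refresh scheduler.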

- Necessity: Result for any Algorithm
- Sufficiency: Result for VR Algorithm (with parameter X):

Nearly Optimal Refresh For X=1

Performance Guarantees of Versatile Refresh Algorithm

[Figure: Worst-case Refresh Overhead (X/Y) versus Cell Retention Time (W), with curves for increasing X. Vertical axis marks: 0, R/W, 1/B, 1. Horizontal axis marks: RB and Wc = RB + B-1. Annotated with a “Bad” Region with High Overhead.]

- Near-optimal Refresh Overhead for X = 1
- Refresh Overhead ~ R/W, for W large

Why Would We Ever Use Large X?

- Because of Burst Tolerance (large X → large Y – X)
  - If memory accesses are bursty, refreshes can be hidden

- There is a Critical Value of X for Max Burst Tolerance
  - Example: B = 16, R = 128, W = 2500

R = 1024, B = 6 Banks, W = 18.2us @ 100C = 6825 cycles @ 375MHz

(++Note that these numbers have been sanitized)

- Enhancement:
  - No-conflict slot: A timeslot where the bank the VR scheduler wants to refresh is not being accessed.
  - Any idle slot is a no-conflict slot, but not vice versa
  - For VR, no-conflict slots are as good as idle slots.

- Observation:
  - This allows lower refresh overhead (possibly zero) for non-adversarial memory access patterns
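The no-conflict idea can be illustrated with a toy model. In the sketch below, a trace lists the bank accessed in each timeslot (None for an idle slot), and a plain round-robin refresh target stands in for the scheduler. That round-robin rule is an assumption made for illustration; it is not the VR algorithm, which also uses a deficit register.

```python
from typing import Optional, Sequence


def is_no_conflict(accessed_bank: Optional[int], refresh_bank: int) -> bool:
    """A slot is no-conflict if it is idle or the access goes to a different bank."""
    return accessed_bank is None or accessed_bank != refresh_bank


def count_refreshes(trace: Sequence[Optional[int]], num_banks: int) -> int:
    """Count slots in which a hypothetical round-robin refresher can refresh its target bank."""
    target = 0
    refreshed = 0
    for accessed in trace:
        if is_no_conflict(accessed, target):
            refreshed += 1
            target = (target + 1) % num_banks  # move on only after a successful refresh
    return refreshed


print(count_refreshes([0, 0, 0, 0], num_banks=2))     # adversarial: always hits the target bank
print(count_refreshes([1, 0, 1, 0], num_banks=2))     # benign: every slot is no-conflict
print(count_refreshes([0, None, 0, 0], num_banks=2))  # one idle slot unblocks the refresher
```

In the benign trace every slot carries a memory access and a refresh at the same time, so no idle slots are needed at all.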

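[Figure: Versatile Refresh with no-conflict feedback. Memory banks 1 to B with per-bank refresh pointers (RP1 to RPB), a bank pointer, a deficit register (pointer and count), a max register (count), and the next bank to refresh. An Enforcer Module in the user logic supplies X idles in Y timeslots and receives no-conflict feedback. The structure repeats for multiple memory ports (M).]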

- Parameter Alpha Controls Degree of Temporal Locality
  - alpha ~ 0 → always read from bank 1 (adversarial)
  - alpha ~ 1 → read from random banks (benign)
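The slides do not spell out how alpha generates the access pattern. As one possible reading, the sketch below draws a uniformly random bank with probability alpha and otherwise reads from the first bank (index 0 here, bank 1 in the slides). Both this mixing rule and the name `generate_accesses` are assumptions, chosen only so that the two endpoints match the bullets above.

```python
import random
from typing import List


def generate_accesses(alpha: float, num_banks: int, length: int, seed: int = 0) -> List[int]:
    """Hypothetical access-pattern generator: alpha = 0 always reads bank 0,
    alpha = 1 reads a uniformly random bank in every slot."""
    rng = random.Random(seed)  # seeded for reproducible traces
    trace = []
    for _ in range(length):
        if rng.random() < alpha:
            trace.append(rng.randrange(num_banks))  # benign: random bank
        else:
            trace.append(0)                         # adversarial: same bank every time
    return trace


adversarial = generate_accesses(alpha=0.0, num_banks=16, length=10)
benign = generate_accesses(alpha=1.0, num_banks=16, length=10)
print(adversarial)
print(benign)
```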

- VR with X = 4: Min worst-case overhead (best for adversarial)
- VR with X = 128: Max burst tolerance (best for benign)

Refresh Overhead has Disappeared Completely!

- With Versatile Refresh A Designer Can …
  - Exactly Calculate Available Memory Bandwidth
    - For any eDRAM macro with B banks, R Rows, M memory ports
    - For any characteristics of Temperature, W = Tref and Clock speed
  - Achieve Optimal Worst-case Memory Bandwidth
  - Design for Large Burst Tolerance
    - Potentially Eliminate Back-pressure
    - Simplify associated complex design and verification
  - Maximize Best-case Memory Bandwidth
  - Avail of a Formally Verified VR Controller
    - On a suitably reduced memory instance