
Versatile Refresh: Low Complexity Refresh Scheduling for High–throughput Multi-banked eDRAM

Mohammed Alizadeh, Adel Javanmard, Da Chuang, Sundar Iyer, Yi Lu

(alizade, adelj)@stanford.edu, (dachuang, sundaes)@memoir-systems.com, [email protected]

ACM Sigmetrics/Performance 2012



What Is Embedded DRAM?

  • 2nd Most Common Embedded Memory

    • Consists of 1 Transistor, 1 Capacitor cell

    • 2X-3X denser than SRAM

    • 2X-4X slower than SRAM

  • Supported by Key ASIC and IP Vendors

    • IBM, TSMC, NEC, Mosys, ST

  • Used in a Number of Applications

    • Servers, Networking, Storage, Gaming, Mobile

  • Industry Examples

    • IBM's POWER7

    • Sony Playstations, Nintendo GameCube, Wii

    • Apple iPhone, Microsoft Zune HD, Xbox 360

    • Cisco Catalyst 3K-10K

[Figure: eDRAM 1T1C memory cell: a select transistor and a storage capacitor connected to a data line.]



Problem: eDRAM Refresh Causes Memory Bandwidth Loss

DRAM Capacitor has Finite Retention Time (W = Tref)

[Figure: an eDRAM bank of R rows, with an R/W port and a refresh port.]

Example: W = 18us @ 100C = 4050 cycles @ 225 MHz; R = 64 rows. All 64 rows will lose data in 4050 cycles!

Solution: Periodic Refresh. Reserve Refresh Cycles for Every Cell in Memory. Causes Bandwidth Loss = R/W = 64 rows / 4050 cycles ~ 1.58%
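To sanity-check this arithmetic, here is a minimal Python sketch of the periodic-refresh bandwidth loss, using the example numbers from this slide:

# Periodic refresh overhead for a single eDRAM bank.
# Each of the R rows must be refreshed once per retention window W (in cycles),
# and every refresh steals one cycle from the R/W port.
R = 64      # rows per bank (slide example)
W = 4050    # retention time in cycles (18us @ 100C at 225 MHz)

overhead = R / W
print(f"periodic refresh bandwidth loss = {overhead:.2%}")   # ~1.58%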



Trend: Higher Density Multi-banked Macros (Mb/mm2)

(1) More Rows are Packed Together and Need to be Refreshed

(2) More Banks are Packed Together and Need to be Refreshed

(3) Smaller Capacitor with Lower Geometry → Smaller W

(4) Smaller W with Higher Temperature

(5) Low Clock Speed Mode Decreases ‘Clock-time’ to Refresh

[Figure: a macro of B memory banks, each with R rows, sharing refresh and R/W ports and circuitry to conserve area.]

Periodic Refresh Bandwidth Loss ~ RB/W (Note: W ≥ RB)

Does not Scale with Larger Macros, Geometry & Low Power Modes



Examples of Periodic Refresh with Multi-banked Macros

M = Number of Memory Ports, B = Number of Banks, R = Number of Rows, W = Cell Retention Time

The Problem is Only Getting Worse Over Time …



Vendor Solution: Concurrent Refresh

[Figure: B memory banks of R rows each, with M R/W ports and a dedicated concurrent refresh port.]

Concurrent Refresh++: Refresh a Bank Which is Not Being Concurrently Accessed

++ T. Kirihata et al., "An 800-MHz embedded DRAM with a concurrent refresh mode," IEEE Journal of Solid-State Circuits, 40(6):1377–1387, June 2005.




How is Concurrent Refresh Used Today?

[Figure: banks 1..B with per-bank refresh row pointers (RP1..RPB), the bank currently being accessed, a "next concurrent refresh" pointer, and a deficit register holding a bank and a count (e.g. Bank 2, count 3).]

Deficit Register Tracks Non-refreshed Bank(s)

Standard Observation: N-1 out of N Banks Get Refreshed for Any Pattern. Concurrent Refresh Overhead is Proportional to 1 Bank: Overhead = R/W = 64 rows / 4050 cycles ~ 1.58%
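The transcript describes this scheme only at the block-diagram level; the Python sketch below is one plausible rendering of a single round-robin concurrent-refresh timeslot with a deficit register that records the skipped bank. It is illustrative, not the exact vendor logic, and the function and variable names are invented for this sketch.

def concurrent_refresh_step(accessed_bank, rr_bank, deficit, num_banks):
    """One timeslot of plain concurrent refresh (illustrative sketch).

    accessed_bank : bank index being read/written this slot, or None if idle
    rr_bank       : round-robin refresh pointer (bank to refresh next)
    deficit       : (bank, missed_count) for a bank that missed refresh turns, or None
    Returns (bank_refreshed_or_None, new_rr_bank, new_deficit).
    """
    if rr_bank != accessed_bank:
        # The pointed-to bank is free this slot: refresh it concurrently.
        return rr_bank, (rr_bank + 1) % num_banks, deficit
    # Conflict: the accessed bank misses its refresh turn,
    # and the deficit register records it.
    missed = deficit[1] if deficit is not None else 0
    return None, (rr_bank + 1) % num_banks, (rr_bank, missed + 1)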

Our Observation: This is Incorrect! In Some Cases, Refresh Overhead can be Very Bad, Close to 100% for ANY Concurrent Refresh Scheduler



Goals of Our Work: An Industry Outlook

  • Design a Concurrent Refresh Scheduler that can

    • Provide Deterministic Memory Performance Guarantees

      • Maximize Memory Throughput (Optimality)

    • Be Universally Applicable

      • For any eDRAM macro with B banks, R Rows, M memory ports

      • For any characteristics of cell retention time W++, and Clock speed

    • Maximize Memory Burst Tolerance

    • Have Low Implementation Overhead

++Note that W is itself a function of temperature, process, and the micro-architecture of the eDRAM



Problem Formulation

  • We consider a general class of algorithms that require X refresh (idle) timeslots in every Y consecutive timeslots.

[Diagram: a fixed TDM constraint reserves a refresh slot inside each fixed refresh window; a sliding window constraint instead supports X idle cycles in any window (t, t+Y).]

The Sliding Window Constraint Gives Maximum Flexibility for Handling Bursts, and for When to Provide Idle Cycles
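As a concrete illustration, here is a minimal Python sketch of a user-side enforcer for the sliding-window constraint; the class and method names are invented here, not taken from the paper.

from collections import deque

class SlidingWindowEnforcer:
    """Guarantee at least X idle (refresh) slots in any Y consecutive timeslots (sketch)."""

    def __init__(self, X, Y):
        self.X, self.Y = X, Y
        self.history = deque(maxlen=Y - 1)   # True = that slot was idle

    def must_idle(self):
        # Force an idle when all remaining slots of the current window
        # (including this one) are needed to reach X idles.
        idles_so_far = sum(self.history)
        slots_left = self.Y - len(self.history)
        return idles_so_far + slots_left <= self.X

    def record(self, was_idle):
        self.history.append(was_idle)

Each timeslot the user logic calls must_idle(); if it returns True the slot is left idle for refresh, and in either case record() is called with the slot's outcome.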



Key Performance Metrics

  • Refresh Overhead = X / Y

    • Memory bandwidth wasted on refresh

  • Burst Tolerance = Y – X

    • Maximum number of consecutive memory accesses without interruption for refresh

We’ll Consider the Simple Case When the User is Required to Send X = 1 Idle in Y Cycles, and M = 1



Our Solution: Versatile Refresh Algorithm

[Figure: Versatile Refresh scheduler. Each of the B banks keeps a refresh pointer (RP1..RPB) to its next row; a deficit register (count and bank pointer) records a bank that has fallen behind, and a max register holds the maximum allowed deficit.]

Bank with deficit has priority for refresh.

Maximum Allowed Deficit Register Controls Burst Tolerance (Y)
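The precise update rules are in the paper; the transcript gives only the block diagram and the two statements above. The Python sketch below is one plausible reading of them for X = 1 (deficit bank has refresh priority, max register bounds the deficit); treat it as illustrative rather than as the authors' exact algorithm.

class VersatileRefresh:
    """Illustrative VR scheduler sketch (one refresh opportunity per timeslot)."""

    def __init__(self, num_banks, rows_per_bank, max_deficit):
        self.B, self.R = num_banks, rows_per_bank
        self.max_deficit = max_deficit        # "max register"
        self.refresh_ptr = [0] * num_banks    # RP1..RPB: next row to refresh per bank
        self.rr_bank = 0                      # round-robin refresh pointer
        self.deficit_bank = None              # "deficit register": bank ...
        self.deficit_count = 0                # ... and count

    def tick(self, accessed_bank):
        """accessed_bank is None on an idle slot supplied by the user."""
        # Rule 1: a bank carrying a deficit is refreshed first whenever it is free.
        if self.deficit_bank is not None and self.deficit_bank != accessed_bank:
            self._refresh(self.deficit_bank)
            self.deficit_count -= 1
            if self.deficit_count == 0:
                self.deficit_bank = None
            return
        # Otherwise refresh the round-robin bank if it is not being accessed.
        if self.rr_bank != accessed_bank:
            self._refresh(self.rr_bank)
            self.rr_bank = (self.rr_bank + 1) % self.B
            return
        # Conflict: accumulate deficit for the skipped bank.  The max register
        # bounds the deficit, which (together with the user's X-idles-in-Y
        # guarantee) is what controls the burst tolerance Y - X.
        self.deficit_bank = self.rr_bank
        self.deficit_count = min(self.deficit_count + 1, self.max_deficit)
        self.rr_bank = (self.rr_bank + 1) % self.B

    def _refresh(self, bank):
        self.refresh_ptr[bank] = (self.refresh_ptr[bank] + 1) % self.R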



Necessary Refresh Overhead for any Algorithm: Intuition, X=1

  • At any time, the BR memory cells have distinct ages; sorted in increasing order, the ages are at least (0, 1, …, BR-1)

  • An adversary keeps reading from a particular bank; idle slots are needed to refresh cells in that bank.

  • A total of BR inequalities to ensure cells are refreshed in time

  • Interestingly, only two of these inequalities matter

    • The one corresponding to the oldest cell

    • The one corresponding to the oldest “youngest cell in each bank”



Necessary Refresh Overhead for any Algorithm: Derivation, X=1

  • How much can the adversary age the oldest cell?

    • Current age is at least BR-1

    • Must wait for at least 1 idle before it is picked up: (BR -1) + Y ≤ W

  • How much can the adversary age the oldest “youngest cell in each bank”?

    • Current age is at least B-1

    • Must wait for at least R idles before it is picked up: (B-1) + YR ≤ W
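Combining the two inequalities above (for X = 1) gives the bound on Y, and hence on the refresh overhead, that any algorithm must respect; restated in LaTeX, derived directly from this slide:

% (BR - 1) + Y <= W  and  (B - 1) + Y R <= W  together imply
\[
  Y \le \min\!\left( W - BR + 1,\; \frac{W - B + 1}{R} \right)
  \quad\Longrightarrow\quad
  \text{overhead} = \frac{1}{Y} \ge \max\!\left( \frac{1}{W - BR + 1},\; \frac{R}{W - B + 1} \right).
\]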



Optimality for Versatile Refresh Overhead: Results, X =1

  • Necessity: Result for any Algorithm

  • Sufficiency: Result for VR Algorithm (with parameter X):

Nearly Optimal Refresh For X=1



Performance Guarantees of Versatile Refresh Algorithm

[Plot: worst-case refresh overhead (X/Y) versus cell retention time W, for increasing X. The overhead is close to 1 in a "bad" region where W is just above RB, drops to about 1/B at the critical retention time Wc = RB + B - 1, and approaches R/W for large W. VR achieves near-optimal refresh overhead for X = 1.]

Why Would We Ever Use Large X?



Why Would We Ever Use Large X?

  • Because of Burst Tolerance (large X → large Y – X)

    • If memory accesses are bursty, refreshes can be hidden

  • There is a Critical Value of X for Max Burst Tolerance

  • Example: B = 16, R = 128, W = 2500



Calculations for Customer ASIC++

R = 1024, B = 6 Banks, W = 18.2us @ 100C = 6825 cycles @ 375MHz

(++Note that these numbers have been sanitized)
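The result table from this slide did not survive the transcript. As a hedged illustration, applying the formulas from the earlier slides (periodic refresh loses ~RB/W of the bandwidth; ~R/W is roughly the best any scheduler can do once W is comfortably above Wc) to these sanitized parameters in Python:

# Sanitized customer ASIC parameters from this slide.
R, B, W = 1024, 6, 6825        # rows per bank, banks, retention time in cycles

periodic = R * B / W           # periodic refresh must cover every row of every bank
best_case = R / W              # approximate floor for any concurrent scheduler

print(f"periodic refresh overhead ~ {periodic:.1%}")    # ~90%
print(f"achievable overhead       ~ {best_case:.1%}")   # ~15%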



Versatile Refresh Enhancement

  • Enhancement:

    • No-conflict slot: A timeslot where the bank the VR scheduler wants to refresh is not being accessed.

    • Any idle slot is a no-conflict slot; but not vice versa

    • For VR, no-conflict slots are as good as idle slots.

  • Observation:

    • This allows lower refresh overhead (possibly zero) for non-adversarial memory access patterns
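A small sketch of how the observation could be wired up, reusing the attribute names from the VR sketch earlier; both the helper name and the wiring are illustrative. A slot counts like an idle slot whenever the bank the scheduler wants to refresh is not the bank being accessed.

def is_no_conflict(accessed_bank, vr):
    """True if this slot is as good as an idle slot for the VR scheduler (sketch)."""
    wanted = vr.deficit_bank if vr.deficit_bank is not None else vr.rr_bank
    return accessed_bank is None or accessed_bank != wanted

The user-side enforcer can then count no-conflict slots toward its X-in-Y requirement, which is why benign access patterns can end up needing no forced idle slots at all.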



Fully Enhanced Versatile Refresh Algorithm

[Figure: fully enhanced Versatile Refresh. Per-bank refresh pointers (RP1..RPB), the deficit register, and the max register as before, repeated for multiple memory ports (M). An enforcer module in the user logic supplies X idles in Y timeslots, and a no-conflict feedback signal from the scheduler lets no-conflict slots count toward that requirement.]



Simulation: Synthetic Statistical Workload

  • Parameter Alpha Controls Degree of Temporal Locality

    • alpha ~ 0 → always read from bank 1 (adversarial)

    • alpha ~ 1 → read from random banks (benign)
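The exact traffic generator is not given in the transcript; the Python sketch below is one plausible reconstruction that matches the two endpoints described above.

import random

def synthetic_workload(alpha, num_banks, length, seed=0):
    """Access trace whose temporal locality is controlled by alpha (sketch).

    With probability alpha each access picks a uniformly random bank;
    otherwise it repeats the previous bank (starting from bank 0).
    alpha ~ 0 gives the adversarial single-bank pattern,
    alpha ~ 1 gives uniformly random (benign) accesses.
    """
    rng = random.Random(seed)
    bank, trace = 0, []
    for _ in range(length):
        if rng.random() < alpha:
            bank = rng.randrange(num_banks)
        trace.append(bank)
    return trace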

[Plot: refresh overhead versus alpha for two VR configurations.]

VR with X = 4: min worst-case overhead (best for adversarial)

VR with X = 128: max burst tolerance (best for benign)

Refresh Overhead has Disappeared Completely!



Conclusion

  • With Versatile Refresh A Designer Can …

    • Exactly Calculate Available Memory Bandwidth

      • For any eDRAM macro with B banks, R Rows, M memory ports

      • For any characteristics of Temperature, W= Tref and Clock speed

    • Achieve Optimal Worst-case Memory Bandwidth

    • Design for Large Burst Tolerance

    • Potentially Eliminate Back-pressure

      • Simplify associated complex design and verification

    • Maximize Best-case Memory Bandwidth

    • Avail of a Formally Verified VR Controller

      • On a suitably reduced memory instance

