Versatile Refresh: Low Complexity Refresh Scheduling for High–throughput Multi-banked eDRAM


Mohammed Alizadeh, Adel Javanmard, Da Chuang, Sundar Iyer, Yi Lu

(alizade, adelj)@stanford.edu, (dachuang, sundaes)@memoir-systems.com, [email protected]



ACM Sigmetrics/Performance 2012


What Is Embedded DRAM?

  • 2nd Most Common Embedded Memory
    • Consists of 1 Transistor, 1 Capacitor cell
    • 2X-3X denser than SRAM
    • 2X-4X slower than SRAM
  • Supported by Key ASIC and IP Vendors
    • IBM, TSMC, NEC, Mosys, ST
  • Used in a Number of Applications
    • Servers, Networking, Storage, Gaming, Mobile
  • Industry Examples
    • IBM's P7
    • Sony Playstations, Nintendo GameCube, Wii
    • Apple iPhone, Microsoft Zune HD, Xbox 360
    • Cisco Catalyst 3K-10K

[Figure: eDRAM 1T1C memory cell, showing the select line, the storage capacitor, and the data line]

Problem: eDRAM Refresh Causes Memory Bandwidth Loss

[Figure: a single bank with rows 1 to R, an R/W port, and a refresh port]

  • DRAM capacitor has a finite retention time (W = Tref)
    • Example: W = 18 us @ 100C = 4050 cycles @ 225 MHz
    • Example: R = 64 rows; all 64 rows will lose data in 4050 cycles!
  • Solution: Periodic Refresh. Reserve refresh cycles for every cell in memory
    • Causes bandwidth loss = R/W = 64 rows / 4050 cycles ≈ 1.58%
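
The arithmetic on this slide can be checked with a short Python sketch. The function names are illustrative and do not come from the presentation; the inputs (18 us, 225 MHz, 64 rows) are the slide's example values.

```python
def retention_cycles(retention_us: float, clock_mhz: float) -> int:
    """Convert a retention time in microseconds to clock cycles."""
    # us * MHz = cycles; round() absorbs floating-point error
    return round(retention_us * clock_mhz)


def periodic_refresh_loss(rows: int, w_cycles: int) -> float:
    """Fraction of cycles spent refreshing one bank of `rows` rows: R / W."""
    return rows / w_cycles


w = retention_cycles(18, 225)        # slide example: 18 us at 225 MHz
loss = periodic_refresh_loss(64, w)  # slide example: R = 64 rows
print(f"W = {w} cycles, bandwidth loss = {loss:.2%}")
# prints "W = 4050 cycles, bandwidth loss = 1.58%"
```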


Trend: Higher Density Multi-banked Macros (Mb/mm2)

[Figure: multi-banked macro with memory banks 1 to B, rows 1 to R per bank, shared refresh and R/W ports 1 to M, and shared circuitry to conserve area]

  • (1) More rows are packed together and need to be refreshed
  • (2) More banks are packed together and need to be refreshed
  • (3) Smaller capacitor with lower geometry → smaller W
  • (4) Smaller W with higher temperature
  • (5) Low clock speed mode decreases ‘clock-time’ to refresh

Periodic Refresh Bandwidth Loss ~ RB/W (Note: W ≥ RB)

Does not Scale with Larger Macros, Geometry & Low Power Modes
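
To see how the RB/W loss scales, the sketch below evaluates it for two operating points. Only R = 64 and the 18 us / 225 MHz retention example appear in the presentation; the bank count B = 8 and the halved clock are assumed values chosen purely for illustration.

```python
def periodic_loss(rows: int, banks: int, retention_us: float, clock_mhz: float) -> float:
    """Periodic refresh bandwidth loss R*B / W, with W expressed in cycles."""
    w_cycles = retention_us * clock_mhz
    if w_cycles < rows * banks:
        # The slide notes W >= RB; otherwise cells cannot all be refreshed in time.
        raise ValueError("retention time is shorter than R*B cycles")
    return rows * banks / w_cycles


# Assumed macro: R = 64 rows, B = 8 banks (B is hypothetical).
full_speed = periodic_loss(64, 8, 18, 225)    # W = 4050 cycles
low_power = periodic_loss(64, 8, 18, 112.5)   # halved clock: W = 2025 cycles
print(f"full speed: {full_speed:.1%}, low-power clock: {low_power:.1%}")
```

Halving the clock halves W in cycles and therefore doubles the loss, which is item (5) above.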

Examples of Periodic Refresh with Multi-banked Macros

M = Number of Memory Ports, B = Number of Banks, R = Number of Rows, W = Cell Retention Time

The Problem is Only Getting Worse Over Time …


Vendor Solution: Concurrent Refresh

[Figure: multi-banked macro with memory banks 1 to B, rows 1 to R per bank, R/W ports 1 to M, and a separate concurrent refresh port]

Concurrent Refresh++: Refresh a Bank Which is Not Being Concurrently Accessed

++T. Kirihata et al., An 800-MHz embedded DRAM with a concurrent refresh mode. IEEE Journal of Solid-State Circuits, 40(6):1377–1387, June 2005.

How is Concurrent Refresh Used Today?

[Figure: memory banks 1 to B with per-bank refresh pointers (RP), a deficit register holding a bank and a count, the next concurrent refresh pointer, and the accessed bank]

Deficit Register Tracks Non-refreshed Bank(s)

Standard Observation:
  • N-1 out of N banks get refreshed for any pattern
  • Concurrent refresh overhead is proportional to 1 bank
  • Concurrent refresh overhead = R/W = 64 rows / 4050 cycles ≈ 1.58%

Our Observation: This is Incorrect! In Some Cases, Refresh Overhead can be Very Bad, Close to 100% for ANY Concurrent Refresh Scheduler

Goals of Our Work: An Industry Outlook
  • Design a Concurrent Refresh Scheduler that can
    • Provide Deterministic Memory Performance Guarantees
      • Maximize Memory Throughput (Optimality)
    • Be Universally Applicable
      • For any eDRAM macro with B banks, R Rows, M memory ports
      • For any characteristics of cell retention time W++, and Clock speed
    • Maximize Memory Burst Tolerance
    • Have Low Implementation Overhead

++Note that W is itself a function of temperature, process, and the micro-architecture of the eDRAM

Problem Formulation
  • We consider a general class of algorithms that require X refresh (idle) timeslots in every Y consecutive timeslots.


[Figure: two timelines of refresh slots. Fixed TDM constraint: refresh slots within consecutive refresh windows 1, 2, 3, 4, ... Sliding window constraint: refresh slots counted over any refresh window]

  • Sliding window constraint: supports X idle cycles in any (t, t+Y)
  • The sliding window constraint gives maximum flexibility for handling bursts and for deciding when to provide idle cycles
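
The sliding window constraint can be sketched as a check on an access pattern. In this minimal Python sketch, the function name and the three example patterns are hypothetical; a pattern is a list with 1 for an idle slot and 0 for a memory access.

```python
from typing import Sequence


def satisfies_sliding_window(idle: Sequence[int], x: int, y: int) -> bool:
    """Return True if every y consecutive slots contain at least x idle slots."""
    if len(idle) < y:
        return True  # no complete window to check
    count = sum(idle[:y])  # idle slots in the first window
    if count < x:
        return False
    for t in range(y, len(idle)):
        # Slide the window one slot: add the new slot, drop the oldest one.
        count += idle[t] - idle[t - y]
        if count < x:
            return False
    return True


periodic = [1, 0, 0, 0] * 3                       # idle at a fixed position
flexible = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0]   # idles move around
too_bursty = [1, 0, 0, 0, 0, 0, 0, 1]             # six accesses in a row

for name, pattern in [("periodic", periodic), ("flexible", flexible), ("too_bursty", too_bursty)]:
    print(name, satisfies_sliding_window(pattern, x=1, y=4))
```

With X = 1 and Y = 4, the first two patterns pass and the third fails, because the slots after its first idle form a window of four accesses with no idle slot.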

Key Performance Metrics
  • Refresh Overhead = X / Y
    • Memory bandwidth wasted on refresh
  • Burst Tolerance = Y – X
    • Maximum number of consecutive memory accesses without interruption for refresh

We’ll Consider the Simple Case When the User is Required to Send X = 1 Idle in Y Cycles, and M = 1

Our Solution: Versatile Refresh Algorithm

[Figure: memory banks 1 to B with per-bank refresh pointers (RP), a max register, a deficit register (pointer and count), and the next concurrent refresh pointer]

  • Bank with deficit has priority for refresh.
  • Maximum allowed deficit register controls burst tolerance (Y)

Necessary Refresh Overhead for any Algorithm: Intuition, X=1
  • At each time the BR memory cells have distinct ages ≥ (0, …, BR-1)
  • An adversary keeps reading from a particular bank; idle slots are needed to refresh cells in that bank.
  • A total of BR inequalities to ensure cells are refreshed in time
  • Interestingly, only two of these inequalities matter
    • The one corresponding to the oldest cell
    • The one corresponding to the oldest “youngest cell in each bank”
Necessary Refresh Overhead for any Algorithm: Derivation, X=1
  • How much can the adversary age the oldest cell?
    • Current age is at least BR-1
    • Must wait for at least 1 idle before it is picked up: (BR-1) + Y ≤ W
  • How much can the adversary age the oldest “youngest cell in each bank”?
    • Current age is at least B-1
    • Must wait for at least R idles before it is picked up: (B-1) + YR ≤ W
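
The two inequalities above bound the window length Y from above, and so bound the refresh overhead 1/Y from below when X = 1. The Python sketch below solves them for the largest integer Y. The function name is illustrative, and the example reuses the single-bank numbers from the earlier slide (R = 64, W = 4050) with B = 1 as an assumption.

```python
from typing import Optional


def max_window_x1(banks: int, rows: int, w: int) -> Optional[int]:
    """Largest integer Y satisfying both necessary inequalities for X = 1.

    Returns None if no Y >= 1 satisfies them.
    """
    y_oldest = w - (banks * rows - 1)        # from (BR-1) + Y <= W
    y_youngest = (w - (banks - 1)) // rows   # from (B-1) + Y*R <= W
    y = min(y_oldest, y_youngest)
    return y if y >= 1 else None


y = max_window_x1(banks=1, rows=64, w=4050)  # Y <= 63
print(f"Y <= {y}, overhead >= 1/{y} = {1 / y:.2%}")
```

For these numbers the second inequality is the binding one, and the resulting bound is close to the R/W figure quoted for a single bank.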
Optimality for Versatile Refresh Overhead: Results, X =1
  • Necessity: Result for any Algorithm
  • Sufficiency: Result for VR Algorithm (with parameter X):

Nearly Optimal Refresh For X=1


Performance Guarantees of Versatile Refresh Algorithm

[Figure: worst-case refresh overhead (X/Y) versus cell retention time (W), with curves for increasing X. The y-axis is marked at 0, R/W, 1/B, and 1; the x-axis is marked at RB and Wc = RB + B-1. A “bad” region with high overhead is indicated.]

  • Near-optimal refresh overhead for X = 1
  • Refresh overhead ~ R/W, for W large

Why Would We Ever Use Large X?
  • Because of Burst Tolerance (large X → large Y – X)
    • If memory accesses are bursty, refreshes can be hidden
  • There is a Critical Value of X for Max Burst Tolerance
  • Example: B = 16, R = 128, W = 2500
Calculations for Customer ASIC++

R = 1024, B = 6 Banks, W = 18.2us @ 100C = 6825 cycles @ 375MHz

(++Note that these numbers have been sanitized)
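
As a rough check on these parameters, the Python sketch below applies two formulas from earlier slides: the periodic refresh loss RB/W, and the two necessary inequalities for X = 1. It shows only what those formulas imply for these numbers. The function name is illustrative.

```python
def max_window_x1(banks: int, rows: int, w: int) -> int:
    """Largest Y allowed by (BR-1) + Y <= W and (B-1) + Y*R <= W (X = 1)."""
    return min(w - (banks * rows - 1), (w - (banks - 1)) // rows)


R, B = 1024, 6
W = round(18.2 * 375)  # 18.2 us at 375 MHz, in cycles

periodic_loss = R * B / W     # periodic refresh: RB / W
y_max = max_window_x1(B, R, W)

print(f"W = {W} cycles")
print(f"periodic refresh loss = {periodic_loss:.1%}")
print(f"X = 1: Y <= {y_max}, so overhead >= {1 / y_max:.1%}")
```

With RB = 6144 close to W = 6825, periodic refresh would consume most of the bandwidth, while the bound for concurrent refresh with X = 1 is far lower.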

Versatile Refresh Enhancement
  • Enhancement:
    • No-conflict slot: A timeslot where the bank the VR scheduler wants to refresh is not being accessed.
    • Any idle slot is a no-conflict slot; but not vice versa
    • For VR, no-conflict slots are as good as idle slots.
  • Observation:
    • This allows lower refresh overhead (possibly zero) for non-adversarial memory access patterns
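
The distinction between idle and no-conflict slots can be illustrated with a small Python sketch. The trace and the bank the scheduler wants to refresh are hypothetical; `None` marks an idle slot and an integer marks the accessed bank.

```python
from typing import Optional


def is_idle(accessed_bank: Optional[int]) -> bool:
    """A slot is idle when no bank is accessed."""
    return accessed_bank is None


def is_no_conflict(accessed_bank: Optional[int], refresh_bank: int) -> bool:
    """A slot is no-conflict when the bank to be refreshed is not being accessed."""
    return accessed_bank is None or accessed_bank != refresh_bank


trace = [2, 2, None, 1, 3, 2]  # hypothetical access pattern
refresh_bank = 2               # bank the scheduler wants to refresh (assumed)

idle_slots = sum(is_idle(a) for a in trace)
no_conflict_slots = sum(is_no_conflict(a, refresh_bank) for a in trace)
print(idle_slots, no_conflict_slots)  # prints "1 3"
```

Every idle slot counts as no-conflict, but accesses to banks 1 and 3 also leave bank 2 free, so the scheduler gets three usable slots where the user supplied only one idle.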
Fully Enhanced Versatile Refresh Algorithm

[Figure: memory banks 1 to B with per-bank refresh pointers (RP), a max register, a deficit register (pointer and count), a next-refresh bank pointer, no-conflict feedback, and an enforcer module (user logic) that provides X idles in Y timeslots]

  • Repeat for multiple memory ports (M)

Simulation: Synthetic Statistical Workload
  • Parameter Alpha Controls Degree of Temporal Locality
    • alpha ~ 0 → always read from bank 1 (adversarial)
    • alpha ~ 1 → read from random banks (benign)
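
The slide gives only the two endpoints of the alpha parameter, so the generator below is one possible model, not the workload used in the original simulation. It assumes that in each slot the reader jumps to a uniformly random bank with probability alpha and otherwise stays on the current bank, starting at bank 1. The function name and the seed are hypothetical.

```python
import random
from typing import List


def synthetic_trace(num_banks: int, length: int, alpha: float, seed: int = 0) -> List[int]:
    """Generate bank accesses with temporal locality controlled by alpha.

    Assumed model: with probability alpha pick a uniformly random bank,
    otherwise keep reading from the current bank (starting at bank 1).
    """
    rng = random.Random(seed)
    bank = 1
    trace = []
    for _ in range(length):
        if rng.random() < alpha:
            bank = rng.randint(1, num_banks)  # inclusive on both ends
        trace.append(bank)
    return trace


adversarial = synthetic_trace(num_banks=16, length=1000, alpha=0.0)
benign = synthetic_trace(num_banks=16, length=1000, alpha=1.0)
print(set(adversarial))  # only bank 1 is ever read
print(len(set(benign)))  # number of distinct banks touched
```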

  • VR with X = 4: min worst-case overhead (best for adversarial)
  • VR with X = 128: max burst tolerance (best for benign)

Refresh Overhead has Disappeared Completely!

Conclusion
  • With Versatile Refresh A Designer Can …
    • Exactly Calculate Available Memory Bandwidth
      • For any eDRAM macro with B banks, R Rows, M memory ports
      • For any characteristics of Temperature, W = Tref, and Clock speed
    • Achieve Optimal Worst-case Memory Bandwidth
    • Design for Large Burst Tolerance
    • Potentially Eliminate Back-pressure
      • Simplify associated complex design and verification
    • Maximize Best-case Memory Bandwidth
    • Avail of a Formally Verified VR Controller
      • On a suitably reduced memory instance