Versatile Refresh: Low Complexity Refresh Scheduling for High–throughput Multi-banked eDRAM

Mohammed Alizadeh, Adel Javanmard, Da Chuang, Sundar Iyer, Yi Lu
(alizade, adelj)@stanford.edu, (dachuang, sundaes)@memoir-systems.com, yilu4@illinois.edu
ACM Sigmetrics/Performance 2012

What Is Embedded DRAM?
• 2nd Most Common Embedded Memory
  • Consists of 1 Transistor, 1 Capacitor cell
  • 2X-3X denser than SRAM
  • 2X-4X slower than SRAM
• Supported by Key ASIC and IP Vendors
  • IBM, TSMC, NEC, Mosys, ST
• Used in a Number of Applications
  • Servers, Networking, Storage, Gaming, Mobile
• Industry Examples
  • IBM's P7
  • Sony Playstations, Nintendo GameCube, Wii
  • Apple iPhone, Microsoft Zune HD, Xbox 360
  • Cisco Catalyst 3K-10K
[Figure: eDRAM 1T1C memory cell, showing the select line, data line, and storage capacitor]

Problem: eDRAM Refresh Causes Memory Bandwidth Loss
• DRAM Capacitor has Finite Retention Time (W = Tref)
  • Example: W = 18 us @ 100C = 4050 cycles @ 225 MHz
  • Example: R = 64 rows
  • All 64 rows will lose data in 4050 cycles!
• Solution: Periodic Refresh
  • Reserve Refresh Cycles for Every Cell in Memory
  • Causes Bandwidth Loss = R/W = 64 rows / 4050 cycles ~ 1.58%
[Figure: a single bank (1) with R rows, an R/W port, and a refresh port]
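The arithmetic on this slide can be reproduced with a short Python sketch. The function name `periodic_refresh_loss` is illustrative and not from the original deck; the sketch assumes one refresh cycle per row per retention period, as the slide's R/W formula implies.

```python
def periodic_refresh_loss(rows: int, retention_cycles: int) -> float:
    """Fraction of cycles reserved when each row is refreshed once per W cycles."""
    return rows / retention_cycles

# Slide example: W = 18 us at 225 MHz, R = 64 rows, single bank.
retention_cycles = round(18e-6 * 225e6)   # retention time expressed in clock cycles
loss = periodic_refresh_loss(rows=64, retention_cycles=retention_cycles)
print(retention_cycles, f"{loss:.2%}")    # 4050 1.58%
```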

Trend: Higher Density Multi-banked Macros (Mb/mm2)
(1) More Rows are Packed Together and Need to be Refreshed
(2) More Banks are Packed Together and Need to be Refreshed
(3) Smaller Capacitor with Lower Geometry → Smaller W
(4) Smaller W with Higher Temperature
(5) Low Clock Speed Mode Decreases ‘Clock-time’ to Refresh
Periodic Refresh Bandwidth Loss ~ RB/W (Note: W ≥ RB)
Does not Scale with Larger Macros, Geometry & Low Power Modes
[Figure: B memory banks of R rows each, with shared refresh and R/W ports (1 to M) and shared circuitry to conserve area]

Examples of Periodic Refresh with Multi-banked Macros
M = Number of Memory Ports, B = Number of Banks, R = Number of Rows, W = Cell Retention Time
The Problem is Only Getting Worse Over Time …

Vendor Solution: Concurrent Refresh
Concurrent Refresh++: Refresh a Bank Which is Not Being Concurrently Accessed
[Figure: B memory banks of R rows each, with R/W ports (1 to M) and a concurrent refresh port]
++T. Kirihata et al., An 800-MHz embedded DRAM with a concurrent refresh mode. Solid-State Circuits, IEEE Journal of, 40(6):1377–1387, June 2005.

How is Concurrent Refresh Used Today?
[Figure: B memory banks with refresh pointers (RP); a deficit register (bank, count) tracks the non-refreshed bank(s), shown alongside the accessed bank and the next concurrent refresh pointer]
Standard Observation: N-1 out of N Banks Get Refreshed for Any Pattern
• Concurrent Refresh Overhead is Proportional to 1 bank
• Concurrent Refresh Overhead = R/W = 64 rows / 4050 cycles = ~1.58%
Our Observation: This is Incorrect! In Some Cases, Refresh Overhead can be Very Bad, Close to 100% for ANY Concurrent Refresh Scheduler

Goals of Our Work: An Industry Outlook
• Design a Concurrent Refresh Scheduler that can
  • Provide Deterministic Memory Performance Guarantees
  • Maximize Memory Throughput (Optimality)
  • Be Universally Applicable
    • For any eDRAM macro with B banks, R Rows, M memory ports
    • For any characteristics of cell retention time W++, and Clock speed
  • Maximize Memory Burst Tolerance
  • Have Low Implementation Overhead
++Note that W is itself a function of temperature, process, and the micro-architecture of the eDRAM

Problem Formulation
• We consider a general class of algorithms that require X refresh (idle) timeslots in every Y consecutive timeslots.
[Figure: two timelines. Fixed TDM Constraint: refresh slots within consecutive fixed Refresh Windows 1, 2, 3, 4, ... Sliding Window Constraint: refresh slots counted within any refresh window.]
Sliding Window Constraint Gives Maximum Flexibility for Handling Bursts, and When to Provide Idle Cycles
Sliding Window Constraint Supports X idle cycles in any (t, t+Y)
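As a rough illustration of the sliding window constraint, the sketch below checks whether an access trace provides at least X idle slots in every run of Y consecutive slots. The trace encoding (a bank number per slot, `None` for idle) and the function name are assumptions made for this example, not part of the original deck.

```python
from typing import Optional, Sequence

def satisfies_sliding_window(trace: Sequence[Optional[int]], x: int, y: int) -> bool:
    """True if every window of y consecutive slots contains at least x idle slots.

    Each entry is the bank accessed in that slot, or None for an idle slot.
    """
    if len(trace) < y:
        return True  # no complete window to check
    idle = sum(1 for slot in trace[:y] if slot is None)
    if idle < x:
        return False
    # Slide the window one slot at a time, updating the idle count incrementally.
    for t in range(y, len(trace)):
        idle += (trace[t] is None) - (trace[t - y] is None)
        if idle < x:
            return False
    return True

ok = satisfies_sliding_window([0, 0, 0, None, 0, 0, 0, None], x=1, y=4)
bad = satisfies_sliding_window([0, 0, 0, None, 0, 0, 0, 0, None], x=1, y=4)
print(ok, bad)  # True False
```

In the second trace, four consecutive accesses follow the first idle slot, so one window of Y = 4 slots contains no idle slot and the constraint is violated.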

Key Performance Metrics
• Refresh Overhead = X / Y
  • Memory bandwidth wasted on refresh
• Burst Tolerance = Y – X
  • Maximum number of consecutive memory accesses without interruption for refresh
We’ll Consider the Simple Case When the User is Required to Send X = 1 Idle in Y Cycles, and M = 1

Our Solution: Versatile Refresh Algorithm
[Figure: B memory banks with refresh pointers (RP1 ... RPB); a max register (count), a deficit register (pointer, count), and a next concurrent refresh pointer]
Bank with deficit has priority for refresh.
Maximum Allowed Deficit Register Controls Burst Tolerance (Y)

Necessary Refresh Overhead for any Algorithm: Intuition, X=1
• At each time the BR memory cells have distinct ages ≥ (0, …, BR-1)
• An adversary keeps reading from a particular bank; idle slots are needed to refresh cells in that bank.
• A total of BR inequalities to ensure cells are refreshed in time
• Interestingly, only two of these inequalities matter
  • The one corresponding to the oldest cell
  • The one corresponding to the oldest “youngest cell in each bank”

Necessary Refresh Overhead for any Algorithm: Derivation, X=1
• How much can the adversary age the oldest cell?
  • Current age is at least BR-1
  • Must wait for at least 1 idle before it is picked up: (BR-1) + Y ≤ W
• How much can the adversary age the oldest “youngest cell in each bank”?
  • Current age is at least B-1
  • Must wait for at least R idles before it is picked up: (B-1) + YR ≤ W
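The two inequalities can be turned into a small calculation of the largest Y they allow when X = 1. This is only a sketch of the necessary conditions stated on this slide; the function name `max_idle_period` is illustrative, and the result says nothing about what a particular scheduler achieves.

```python
def max_idle_period(banks: int, rows: int, retention: int) -> int:
    """Largest Y (with X = 1) permitted by the two necessary conditions.

    (B*R - 1) + Y   <= W   (oldest cell)
    (B - 1)   + Y*R <= W   (oldest "youngest cell in each bank")
    Returns 0 if no Y >= 1 satisfies both.
    """
    y_oldest = retention - (banks * rows - 1)
    y_youngest = (retention - (banks - 1)) // rows
    return max(0, min(y_oldest, y_youngest))

# Single-bank example from the earlier slide: B = 1, R = 64, W = 4050 cycles.
y = max_idle_period(banks=1, rows=64, retention=4050)
print(y, f"{1 / y:.2%}")  # 63 1.59%
```

For these inputs the second condition is the binding one: Y can be at most 63, so at least one slot in 63 must be idle, in line with the R/W ~ 1.58% figure quoted for the single-bank case.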

Optimality for Versatile Refresh Overhead: Results, X=1
• Necessity: Result for any Algorithm
• Sufficiency: Result for VR Algorithm (with parameter X)
Nearly Optimal Refresh For X=1

Performance Guarantees of Versatile Refresh Algorithm
[Figure: worst-case refresh overhead (X/Y, from 0 to 1) versus cell retention time (W), with curves for increasing X. Marked on the W axis: RB and Wc = RB + B-1; marked on the overhead axis: 1/B and R/W. Annotations: “Bad” region with high overhead; near-optimal refresh overhead for X = 1; refresh overhead ~ R/W for W large.]
Why Would We Ever Use Large X?

Why Would We Ever Use Large X?
• Because of Burst Tolerance (large X → large Y – X)
  • If memory accesses are bursty, refreshes can be hidden
• There is a Critical Value of X for Max Burst Tolerance
  • Example: B = 16, R = 128, W = 2500

Calculations for Customer ASIC++
R = 1024, B = 6 Banks, W = 18.2 us @ 100C = 6825 cycles @ 375 MHz
(++Note that these numbers have been sanitized)
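The slide's own table of results did not survive in this transcript. As a hedged stand-in, the sketch below applies two formulas from elsewhere in the deck to these parameters: the periodic refresh loss RB/W, and the X = 1 necessary conditions (BR-1) + Y ≤ W and (B-1) + YR ≤ W. The variable names are illustrative, and the printed values are computed here rather than taken from the original slide.

```python
R, B = 1024, 6
W = round(18.2e-6 * 375e6)        # retention time in clock cycles

periodic_loss = R * B / W         # periodic refresh bandwidth loss ~ RB/W

# Largest Y (X = 1) permitted by both necessary conditions.
y_max = min(W - (B * R - 1), (W - (B - 1)) // R)
necessary_overhead = 1 / y_max    # lower bound on X/Y for any algorithm with X = 1

print(W, f"{periodic_loss:.1%}", y_max, f"{necessary_overhead:.1%}")
# 6825 90.0% 6 16.7%
```

Under these assumptions, periodic refresh would consume about 90% of the bandwidth, while the necessary condition for X = 1 alone requires only one idle slot in every six.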

Versatile Refresh Enhancement
• Enhancement:
  • No-conflict slot: A timeslot where the bank the VR scheduler wants to refresh is not being accessed.
  • Any idle slot is a no-conflict slot; but not vice versa
  • For VR, no-conflict slots are as good as idle slots.
• Observation:
  • This allows lower refresh overhead (possibly zero) for non-adversarial memory access patterns
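The distinction between idle and no-conflict slots can be made concrete with a minimal sketch. It assumes a single memory port (M = 1), a trace that records the accessed bank per slot (`None` for idle), and a fixed refresh target for the duration of the example; the function names are hypothetical.

```python
from typing import Optional

def is_idle(accessed_bank: Optional[int]) -> bool:
    """An idle slot has no memory access at all."""
    return accessed_bank is None

def is_no_conflict(accessed_bank: Optional[int], refresh_target: int) -> bool:
    """A no-conflict slot: the bank the scheduler wants to refresh is not accessed."""
    return accessed_bank is None or accessed_bank != refresh_target

trace = [0, 2, None, 1]   # accessed bank per slot
target = 2                # bank the scheduler wants to refresh (held fixed here)

idle_slots = [is_idle(b) for b in trace]
no_conflict_slots = [is_no_conflict(b, target) for b in trace]
print(idle_slots)         # [False, False, True, False]
print(no_conflict_slots)  # [True, False, True, True]
```

Every idle slot is also a no-conflict slot, but three of the four slots here are usable for refresh even though only one is idle.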

Fully Enhanced Versatile Refresh Algorithm
[Figure: B memory banks with refresh pointers (RP1 ... RPB); a max register (count), a next refresh bank pointer, and a deficit register (pointer, count), repeated for multiple memory ports (M); no-conflict feedback goes to an Enforcer Module (User Logic) that enforces X idles in Y timeslots]

Simulation: Synthetic Statistical Workload
• Parameter Alpha Controls Degree of Temporal Locality
  • alpha ~ 0 → always read from bank 1 (adversarial)
  • alpha ~ 1 → read from random banks (benign)
VR with X = 4: Min worst-case overhead (best for adversarial)
VR with X = 128: Max burst tolerance (best for benign)
Refresh Overhead has Disappeared Completely!
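The slide gives only the two extremes of alpha, so the following generator is a guess at one workload consistent with them, not the model used in the original simulation. It assumes that with probability alpha a uniformly random bank is read and otherwise bank 1 is read; `synthetic_trace` and `conflict_fraction` are hypothetical names.

```python
import random

def synthetic_trace(alpha: float, banks: int, length: int, seed: int = 0) -> list[int]:
    """Hypothetical workload: with probability alpha read a uniformly random bank
    (numbered 1..banks), otherwise read bank 1. The mixing rule is an assumption."""
    rng = random.Random(seed)
    return [rng.randrange(1, banks + 1) if rng.random() < alpha else 1
            for _ in range(length)]

def conflict_fraction(trace: list[int], refresh_target: int = 1) -> float:
    """Fraction of slots in which the accessed bank equals the refresh target."""
    return sum(bank == refresh_target for bank in trace) / len(trace)

for alpha in (0.0, 0.5, 1.0):
    trace = synthetic_trace(alpha, banks=16, length=10_000)
    print(alpha, round(conflict_fraction(trace), 3))
```

With alpha = 0 every access hits bank 1, so a refresh aimed at that bank always conflicts; as alpha approaches 1 the conflicts become rare, which is the regime where no-conflict slots can hide refreshes.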

Conclusion
• With Versatile Refresh A Designer Can …
  • Exactly Calculate Available Memory Bandwidth
    • For any eDRAM macro with B banks, R Rows, M memory ports
    • For any characteristics of Temperature, W = Tref and Clock speed
  • Achieve Optimal Worst-case Memory Bandwidth
  • Design for Large Burst Tolerance
    • Potentially Eliminate Back-pressure
    • Simplify associated complex design and verification
  • Maximize Best-case Memory Bandwidth
  • Avail of a Formally Verified VR Controller
    • On a suitably reduced memory instance
