Advanced Microarchitecture Lecture 14: DRAM and Prefetching
SRAM vs. DRAM
• DRAM = Dynamic RAM
• SRAM: 6T per bit
  • built with normal high-speed CMOS technology
• DRAM: 1T per bit
  • built with a special DRAM process optimized for density
Hardware Structures
[Figure: SRAM cell (wordline, bitlines b and b̄) vs. DRAM cell (wordline, single bitline b)]
Implementing the Capacitor
• You can use a “dead” transistor gate: but this wastes area because we now have two transistors
• And the “dummy” transistor may need to be bigger to hold enough charge
Implementing the Capacitor (2)
• There are other advanced structures, e.g., the “Trench Cell”
[Figure: trench capacitor cross-section – cell plate Si, capacitor insulator, refilling poly, storage node poly, Si substrate, field oxide]
DRAM figures on this slide and the previous one were taken from Prof. Nikolic’s EECS141/2003 lecture notes from UC-Berkeley
DRAM Chip Organization
[Figure: row address → row decoder → memory cell array → sense amps → row buffer; column address → column decoder → data bus]
DRAM Chip Organization (2)
• High-level organization is very similar to SRAM
• cells are only single-ended
  • changes the precharging and sensing circuits
  • makes reads destructive: contents are erased after reading
• row buffer
  • reads lots of bits all at once, then parcels them out based on different column addresses
  • similar to reading a full cache line, but only accessing one word at a time
• “Fast Page Mode” (FPM) DRAM organizes the DRAM row to contain the bits for a complete page
  • row address held constant, then fast reads from different locations in the same page (see the address-split sketch below)
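To make the row/column split concrete, here is a minimal C sketch (not from the lecture; the line size and the row/column bit widths are illustrative assumptions) of how an FPM-style controller might decompose a physical address. Two addresses that differ only in the column bits land in the same row, so the second access can be served from the already-open row buffer:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical geometry: 64-byte lines, 10 column bits, 13 row bits. */
    #define LINE_BITS  6
    #define COL_BITS   10
    #define ROW_BITS   13

    typedef struct { uint32_t row, col; } dram_addr_t;

    /* Split a physical address into the row (opens the page into the row
     * buffer) and the column (selects a word within the open row). */
    static dram_addr_t split_addr(uint64_t paddr) {
        dram_addr_t a;
        a.col = (uint32_t)((paddr >> LINE_BITS) & ((1u << COL_BITS) - 1));
        a.row = (uint32_t)((paddr >> (LINE_BITS + COL_BITS)) & ((1u << ROW_BITS) - 1));
        return a;
    }

    int main(void) {
        /* Same row, different columns: FPM serves the second access fast. */
        dram_addr_t a = split_addr(0x12345040), b = split_addr(0x123450C0);
        printf("row %u col %u / row %u col %u\n", a.row, a.col, b.row, b.col);
        return 0;
    }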
Destructive Read
[Figure: bitline and storage-cell voltage over time when reading a 1 and a 0 – wordline enabled, then sense amp enabled; Vdd marked on both plots]
• After a read of a 0 or a 1, the cell contains something close to ½ Vdd
Refresh
• So after a read, the contents of the DRAM cell are gone
• The values are stored in the row buffer
• Write them back into the cells for the next read in the future
[Figure: row buffer values written back through the sense amps into the DRAM cells]
Refresh (2)
• Fairly gradually, the DRAM cell will lose its contents even if it’s not accessed
  • due to leakage (e.g., through the access transistor’s gate)
  • this is why it’s called “dynamic”
• Contrast with SRAM, which is “static” in that once written, it maintains its value forever (as long as power remains on)
• All DRAM rows need to be regularly read and re-written
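As a rough worked example (the numbers are typical assumptions, not from the lecture): if cells must be refreshed at least every 64 ms and the device has 8192 rows, the controller has to refresh one row about every 64 ms / 8192 ≈ 7.8 µs on average, stealing a small fraction of the available bandwidth.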
DRAM Read Timing
• Accesses are asynchronous: triggered by RAS and CAS signals, which can in theory occur at arbitrary times (subject to DRAM timing constraints)
SDRAM Read Timing
• Double-Data Rate (DDR) DRAM transfers data on both the rising and falling edge of the clock
  • command frequency does not change
[Figure: SDRAM and DDR read timing, showing the burst length]
Timing figures taken from “A Performance Comparison of Contemporary DRAM Architectures” by Cuppu, Jacob, Davis, and Mudge
More Latency
[Figure: CPU → memory controller → memory chips]
• Significant wire delay just getting from the CPU to the memory controller
• More wire delay getting to the memory chips
• Width/speed of the channel varies depending on memory type
• (plus the return trip…)
Memory Controller
[Figure: read queue, write queue, and response queue for commands and data to/from the CPU; a scheduler and buffer issue requests to DRAM banks 0 and 1]
• Like the Write-Combining Buffer, the Scheduler may coalesce multiple accesses together, or re-order them to reduce the number of row accesses
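One way such reordering is often done is an FR-FCFS-style policy (prefer requests that hit the currently open row, otherwise take the oldest). The C sketch below is only an illustration of that idea; the queue layout, ROW_SHIFT, and the assumption that the queue is kept in age order are all made up for the example, not taken from the lecture:

    #include <stdint.h>
    #include <stddef.h>

    #define ROW_SHIFT 16            /* assumed: bits above this select the DRAM row */

    typedef struct {
        uint64_t addr;
        int      valid;
    } mem_req_t;

    /* Among queued requests, prefer one that hits the currently open row
     * (no precharge/activate needed); otherwise fall back to the oldest.
     * Returns the chosen index, or -1 if the queue is empty. */
    int schedule(const mem_req_t *q, size_t n, uint64_t open_row) {
        int oldest = -1;
        for (size_t i = 0; i < n; i++) {
            if (!q[i].valid)
                continue;
            if (oldest < 0)
                oldest = (int)i;                 /* queue assumed to be in age order */
            if ((q[i].addr >> ROW_SHIFT) == open_row)
                return (int)i;                   /* row-buffer hit: issue it now     */
        }
        return oldest;                           /* no hit: issue the oldest request */
    }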
Wire-Dominated Latency
• CPUs
  • frequency has increased at about 60% per year
• DRAM
  • end-to-end latency has decreased only about 10% per year
• Number of cycles for a memory access keeps increasing
  • a.k.a. the “memory wall”
  • note: the absolute latency of memory is decreasing, just not nearly as fast as the CPU is speeding up
Wire-Dominated Latency (2)
• Access latency is dominated by wire delay
  • mostly in the wordlines and bitlines/sense amps
  • plus PCB traces between chips
• Process technology improvements provide smaller and faster transistors
  • DRAM density doubles at about the same rate as Moore’s Law
  • DRAM latency improves very slowly because wire delay has not improved as fast as logic delay
So what do we do about it?
• Caching
  • reduces average memory instruction latency by avoiding DRAM altogether
• Limitations
  • capacity: programs keep increasing in size
  • compulsory misses
Faster DRAM Speed
• Clock the FSB faster
  • DRAM chips may not be able to keep up
  • latency is dominated by wire delay
• Bandwidth may be improved (DDR vs. regular), but latency doesn’t change much
  • instead of 2 cycles for a row access, it may take 3 cycles at a faster bus speed
• Doesn’t address the latency of the memory access
On-Chip Memory Controller
• All on the same chip: no slow PCB wires to drive
• Memory controller can run at CPU speed instead of FSB clock speed
• Disadvantage: the memory type is now tied to the CPU implementation
Prefetching
• If memory takes a long time, start accessing it earlier
[Figure: timelines through L1, L2, and DRAM – a plain load pays the total load-to-use latency; a prefetch issued well before the load gives much improved load-to-use latency; a later prefetch gives somewhat improved latency]
• May cause resource contention due to extra cache/DRAM activity
Software Prefetching
[Figure: three versions of a control-flow graph with blocks A, B, and C. Original: block C has the cache-missing load R1 = [R2] (shown in red) followed by the consumer R3 = R1+4, while block B computes R1 = R1-1. Hoisting the load into block A starts the miss early, so hopefully it is serviced by the time we get to the consumer – but reordering can mess up your code, since the hoisted R1 = [R2] now conflicts with B’s R1 = R1-1. Using a prefetch instruction (or a load to $zero, i.e., R0 = [R2]) in block A avoids the data-dependence problem.]
Software Prefetching (2)
• Pros:
  • can leverage compiler-level information
  • no hardware modifications
• Cons:
  • prefetch instructions increase code footprint
    • may cause more I$ misses, code alignment issues
  • hard to hoist prefetches early enough to cover main-memory latency
    • if memory is 100 cycles and the CPU can sustain 2 instructions per cycle, the load needs to be moved 200 instructions earlier in the code
  • aggressive hoisting leads to many useless prefetches
    • control flow may go somewhere else (like block B on the previous slide)
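As a concrete illustration (not from the lecture), here is a minimal C sketch of compiler-style software prefetching inside a loop. __builtin_prefetch is a GCC/Clang intrinsic; the prefetch distance of 16 iterations is an arbitrary assumption and would have to be tuned against the real memory latency:

    #include <stddef.h>

    /* Sum an array, prefetching ahead so the miss is (hopefully) serviced by
     * the time the consumer needs the data.  Like a load to $zero, the
     * prefetch has no data dependence with the rest of the code. */
    long sum(const long *a, size_t n) {
        const size_t dist = 16;                 /* assumed prefetch distance */
        long total = 0;
        for (size_t i = 0; i < n; i++) {
            if (i + dist < n)
                __builtin_prefetch(&a[i + dist], 0 /* read */, 1 /* low locality */);
            total += a[i];
        }
        return total;
    }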
Hardware Prefetching
[Figure: CPU → DRAM, with a HW prefetcher observing the miss traffic]
• Hardware monitors the miss traffic to DRAM
• Depending on the prefetch algorithm/miss patterns, the prefetcher injects additional memory requests
• Cannot be overly aggressive, since prefetches may contend for memory bandwidth and may pollute the cache (evict other useful cache lines)
Next-Line Prefetching
• Very simple: if a request for cache line X goes to DRAM, also request X+1
  • assumes spatial locality
    • often a good assumption
  • low chance of tying up the memory bus for too long
    • FPM DRAM will already have the correct page open for the request for X, so X+1 will likely be available in the row buffer
• Can optimize by doing Next-Line-Unless-Crossing-A-Page-Boundary prefetching
Next-N-Line Prefetching
• Obvious extension: fetch the next N lines
  • X+1, X+2, …, X+N
• Need to carefully tune N
  • larger N makes it:
    • more likely to prefetch something useful
    • more likely to evict something useful
    • more likely to stall a useful load due to bus contention
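A minimal sketch (not from the lecture) of how a next-N-line prefetcher might generate candidate addresses; N = 1 degenerates to plain next-line prefetching, the page-boundary check implements the optimization from the previous slide, and the line and page sizes are illustrative assumptions:

    #include <stdint.h>
    #include <stddef.h>

    #define LINE_SIZE 64u          /* assumed cache-line size */
    #define PAGE_SIZE 4096u        /* assumed DRAM page size  */

    /* Hypothetical hook called on every demand miss to 'miss_addr'.  Emits
     * up to n prefetch addresses, stopping at the page boundary so we never
     * force a new row to be opened. */
    size_t next_n_line_prefetch(uint64_t miss_addr, unsigned n,
                                uint64_t *out, size_t out_len) {
        uint64_t line = miss_addr & ~(uint64_t)(LINE_SIZE - 1);
        size_t emitted = 0;
        for (unsigned i = 1; i <= n && emitted < out_len; i++) {
            uint64_t cand = line + (uint64_t)i * LINE_SIZE;
            if ((cand / PAGE_SIZE) != (line / PAGE_SIZE))
                break;                         /* would cross a page boundary */
            out[emitted++] = cand;             /* inject prefetch request */
        }
        return emitted;
    }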
Stream Buffers
[Figure: stream buffer organization]
Figures from Jouppi, “Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers,” ISCA ’90
Stream Buffers (2)
[Figure from Jouppi, ISCA ’90]
Stream Buffers (3)
• Can independently track multiple “intertwined” sequences/streams of accesses
• Separate buffers prevent prefetch streams from polluting the cache until a line is used at least once
  • similar effect to filter/promotion caches
• Can extend to a “Quasi-Sequential” stream buffer
  • add a comparator to all entries, and skip ahead (partial flush) on a hit to a non-head entry
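A rough C sketch of the basic (head-compare-only) stream-buffer lookup described above. The buffer count, depth, line size, and the issue_prefetch() stub are all assumptions for illustration, not Jouppi’s actual design parameters:

    #include <stdint.h>
    #include <stdbool.h>

    #define NBUF   4      /* number of stream buffers (assumed) */
    #define DEPTH  4      /* entries per buffer (assumed)       */
    #define LINE   64u    /* cache-line size (assumed)          */

    typedef struct {
        uint64_t addr[DEPTH];   /* FIFO of prefetched line addresses */
        bool     valid;
        unsigned head;          /* index of the oldest entry         */
        uint64_t next;          /* next line address to prefetch     */
    } stream_buf_t;

    static stream_buf_t bufs[NBUF];
    static unsigned lru;        /* trivial round-robin replacement   */

    static void issue_prefetch(uint64_t line_addr) {
        (void)line_addr;        /* stub: would enqueue a memory request */
    }

    /* Called on an L1 miss: if the head of some stream buffer matches, the
     * line is supplied from the buffer and the stream advances; otherwise a
     * new stream is allocated starting just past the missing line. */
    bool stream_buffer_lookup(uint64_t miss_addr) {
        uint64_t line = miss_addr & ~(uint64_t)(LINE - 1);
        for (int i = 0; i < NBUF; i++) {
            if (bufs[i].valid && bufs[i].addr[bufs[i].head] == line) {
                bufs[i].addr[bufs[i].head] = bufs[i].next;   /* refill freed slot */
                issue_prefetch(bufs[i].next);
                bufs[i].next += LINE;
                bufs[i].head = (bufs[i].head + 1) % DEPTH;
                return true;                                  /* stream-buffer hit */
            }
        }
        /* Miss in all buffers: allocate a new stream. */
        stream_buf_t *b = &bufs[lru];
        lru = (lru + 1) % NBUF;
        b->valid = true;
        b->head = 0;
        for (int j = 0; j < DEPTH; j++) {
            b->addr[j] = line + (uint64_t)(j + 1) * LINE;
            issue_prefetch(b->addr[j]);
        }
        b->next = line + (uint64_t)(DEPTH + 1) * LINE;
        return false;
    }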
Stride Prefetching
[Figure: column traversal of a matrix, and its layout in linear memory]
• If the array starts at address A, we are accessing the kth column, each element is B bytes large, and each row of the matrix is N bytes (i.e., N = B × the number of elements per row), then the addresses accessed are:
  • A+Bk, A+Bk+N, A+Bk+2N, A+Bk+3N, …
• Or, more simply: if you miss on address X, prefetch X+N
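The C snippet below (illustrative sizes, not from the lecture) walks down one column of a row-major matrix and prints the byte distance between consecutive accesses, making the constant stride visible – exactly the pattern a stride prefetcher learns from the miss addresses:

    #include <stdio.h>
    #include <stdint.h>

    enum { ROWS = 4, COLS = 6 };   /* illustrative matrix dimensions */

    int main(void) {
        static int m[ROWS][COLS];
        int k = 2;                          /* walking down the kth column */
        uintptr_t prev = 0;
        for (int i = 0; i < ROWS; i++) {
            uintptr_t addr = (uintptr_t)&m[i][k];
            if (i > 0)
                printf("stride = %ld bytes\n", (long)(addr - prev));
            prev = addr;
        }
        /* Prints "stride = 24 bytes" three times
           (6 elements per row * 4 bytes each, assuming 4-byte int). */
        return 0;
    }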
Stride Prefetching (2)
• Like Next-N-Line prefetching, need to limit how far ahead the stride is allowed to go
  • previous example: no point in prefetching past the end of the array
• How can you tell the difference between:
  • a genuine strided pattern (A[i], then A[i+1]), and
  • two unrelated accesses X and Y that just happen to be some fixed distance apart?
• Typically only do a stride prefetch if the same stride has been observed at least a few times
Stride Prefetching (3)
• What if we’re doing Y = A + X?
• Miss traffic now looks like:
  • A+Bk, X+Bk, Y+Bk, A+Bk+N, X+Bk+N, Y+Bk+N, A+Bk+2N, X+Bk+2N, Y+Bk+2N, …
• Consecutive deltas are (X−A), (Y−X), (A+N−Y), … – no detectable stride!
PC-Based Stride
[Figure: each load/store PC indexes a table of (Tag, Last Addr, Stride, Count) entries]
  • PC 0x409A34: Load R1 = 0[R2]  → entry (0x409A34, A+Bk+3N, N, 2)
  • PC 0x409A50: Load R3 = 0[R4]  → entry (0x409A50, X+Bk+3N, N, 2)   ← program is here
  • PC 0x409A5C: Store R5 = 0[R6] → entry (0x409A5C, Y+Bk+2N, N, 1)
• If the same stride has been seen enough times (count > q), prefetch last addr + stride, e.g., A+Bk+4N for the first entry
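A minimal C sketch of how such a PC-indexed stride table could be updated on each load/store. The table size, direct-mapped indexing, and the threshold value are assumptions made for the example:

    #include <stdint.h>
    #include <stdbool.h>

    #define TABLE_SIZE 256      /* assumed number of entries              */
    #define THRESHOLD  2        /* "q": repeats needed before prefetching */

    typedef struct {
        uint64_t tag;           /* PC of the load/store                 */
        uint64_t last_addr;     /* last address this PC accessed        */
        int64_t  stride;        /* last observed stride                 */
        unsigned count;         /* how many times this stride repeated  */
    } stride_entry_t;

    static stride_entry_t table[TABLE_SIZE];

    /* Called for every load/store.  Returns true and sets *prefetch_addr
     * when the same stride has been seen often enough. */
    bool stride_update(uint64_t pc, uint64_t addr, uint64_t *prefetch_addr) {
        stride_entry_t *e = &table[pc % TABLE_SIZE];
        if (e->tag != pc) {                      /* new PC: (re)allocate entry */
            e->tag = pc; e->last_addr = addr; e->stride = 0; e->count = 0;
            return false;
        }
        int64_t stride = (int64_t)(addr - e->last_addr);
        if (stride == e->stride && stride != 0)
            e->count++;                          /* same stride seen again */
        else {
            e->stride = stride;                  /* new candidate stride   */
            e->count = 0;
        }
        e->last_addr = addr;
        if (e->count >= THRESHOLD) {
            *prefetch_addr = addr + (uint64_t)e->stride;
            return true;
        }
        return false;
    }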
Other Patterns
• Linked-List Traversal: A → B → C → D → E → F
[Figure: the actual memory layout has the nodes scattered (e.g., D, F, B, A, C, E) – no chance for a stride prefetcher to get this right]
Context-Sensitive Prefetching
• Similar to history-based branch predictors: “Last time I saw X, Y happened”
  • Ex 1: X = taken branch, Y = not-taken
  • Ex 2: X = missed A, Y = missed B
[Figure: the linked list A → B → C → D → E → F (scattered in memory) next to a “what to prefetch next” table learned from the miss history, e.g., A→B, B→C, C→D, D→E, E→F, F→?]
Context-Sensitive Prefetching (2)
• Like branch predictors, longer history enables learning more complex patterns
  • and increases training time
• Example: a DFS traversal of a binary tree (A with children B and C, B with children D and E, C with children F and G) produces the miss sequence A B D B E B A C F C G C A
  • a one-miss history cannot disambiguate what follows B (D, E, or A), but a two-miss history can
[Figure: prefetch prediction table indexed by the last two misses, e.g., (A,B)→D, (B,D)→B, (D,B)→E, (B,E)→B, (E,B)→A, (B,A)→C, …]
Markov Prefetching
• Alternative to explicitly remembering the patterns: remember multiple next-states per miss address
[Figure: Markov table built from the same DFS traversal – A → {B, C}, B → {D, E, A}, C → {F, G, A}, D → {B}, E → {B}, F → {C}, G → {C}]
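A rough C sketch of a Markov-style prefetch table: on each miss it records the transition from the previous miss and returns the remembered successors of the current miss as prefetch candidates. The table size, the two-successor limit, and the simple fill policy are assumptions for illustration:

    #include <stdint.h>
    #include <string.h>

    #define TABLE_SIZE  1024    /* assumed number of tracked miss addresses */
    #define NEXT_STATES 2       /* assumed successors remembered per entry  */

    typedef struct {
        uint64_t miss_addr;               /* address that missed            */
        uint64_t next[NEXT_STATES];       /* addresses that followed it     */
        unsigned n;                       /* how many successors are valid  */
    } markov_entry_t;

    static markov_entry_t table[TABLE_SIZE];
    static uint64_t last_miss;

    /* Called on each cache miss.  Returns the number of prefetch candidates
     * written to out[]. */
    unsigned markov_miss(uint64_t addr, uint64_t out[NEXT_STATES]) {
        /* 1. Learn: append addr to the successor list of the previous miss. */
        markov_entry_t *prev = &table[last_miss % TABLE_SIZE];
        if (prev->miss_addr != last_miss) {
            memset(prev, 0, sizeof *prev);
            prev->miss_addr = last_miss;
        }
        if (prev->n < NEXT_STATES)
            prev->next[prev->n++] = addr;  /* simple fill; a real design would
                                              also age/replace old successors */
        last_miss = addr;

        /* 2. Predict: prefetch the known successors of the current miss. */
        markov_entry_t *cur = &table[addr % TABLE_SIZE];
        if (cur->miss_addr != addr)
            return 0;
        for (unsigned i = 0; i < cur->n; i++)
            out[i] = cur->next[i];
        return cur->n;
    }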
Pointer Prefetching
• On a miss to DRAM, when the cache line comes back, scan it for anything that looks like a pointer (is the value within the heap range?)
[Figure: the returned line contains 1 (nope), 4128 (nope), 900120230 (maybe!), 900120758 (maybe!) – go ahead and prefetch the “maybe” ones]

    struct bintree_node_t {
        int data1;
        int data2;
        struct bintree_node_t *left;
        struct bintree_node_t *right;
    };

• This allows you to walk the tree (or other pointer-based data structures, which are typically hard to prefetch)
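A software model (not a hardware description) of that scan is sketched below; the heap bounds, the 64-byte line, and the word granularity are assumptions chosen for the example:

    #include <stdint.h>
    #include <stddef.h>

    #define LINE_BYTES 64
    #define WORDS_PER_LINE (LINE_BYTES / sizeof(uint64_t))

    /* Assumed heap range known to the prefetcher (e.g., provided by the OS
     * or runtime); anything in [heap_lo, heap_hi) "looks like a pointer". */
    static uint64_t heap_lo = 0x900000000ull;
    static uint64_t heap_hi = 0x940000000ull;

    /* Scan a returned cache line word by word; every value in the heap range
     * becomes a prefetch candidate ("maybe!"), everything else is ignored
     * ("nope").  Returns the number of candidates written to out[]. */
    size_t scan_line_for_pointers(const uint64_t *line, uint64_t *out) {
        size_t n = 0;
        for (size_t i = 0; i < WORDS_PER_LINE; i++) {
            if (line[i] >= heap_lo && line[i] < heap_hi)
                out[n++] = line[i];          /* prefetch candidate */
        }
        return n;
    }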
Pointer Prefetching (2)
• Don’t necessarily need extra hardware to store patterns
• Prefetch speed is slower:
[Figure: a stride prefetcher can issue X, X+N, and X+2N back-to-back, overlapping their DRAM latencies; pointer prefetching must wait for A to return before it can issue B, and for B before C, so the DRAM latencies are serialized]
• See “Pointer-Cache Assisted Prefetching” by Collins et al., MICRO-2002, for reducing this serialization effect
Value-Prediction-Based Prefetching
[Figure: the load PC indexes a value predictor used for the address only; the predicted address is prefetched through L1/L2 to DRAM]
• Takes advantage of value locality
• Mispredictions are less painful
  • a normal VPred misprediction causes a pipeline flush
  • a misprediction of the address just causes spurious memory accesses
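As a very simple sketch of the idea (a last-value predictor; a real design might use a stride or context value predictor instead, and the table size here is an arbitrary assumption): the load PC looks up the address the load used last time and prefetches it, long before the real address is computed.

    #include <stdint.h>
    #include <stdbool.h>

    #define VP_ENTRIES 512      /* assumed predictor size */

    /* Last-value predictor indexed by load PC, used only to guess the load's
     * address for prefetching; a wrong guess just wastes some bandwidth. */
    typedef struct { uint64_t tag; uint64_t last_addr; } vp_entry_t;
    static vp_entry_t vp[VP_ENTRIES];

    /* At fetch/decode time: if we have seen this load before, prefetch the
     * address it used last time. */
    bool vp_predict(uint64_t load_pc, uint64_t *prefetch_addr) {
        vp_entry_t *e = &vp[load_pc % VP_ENTRIES];
        if (e->tag != load_pc)
            return false;
        *prefetch_addr = e->last_addr;
        return true;
    }

    /* At execute time: train the predictor with the address actually used. */
    void vp_train(uint64_t load_pc, uint64_t actual_addr) {
        vp_entry_t *e = &vp[load_pc % VP_ENTRIES];
        e->tag = load_pc;
        e->last_addr = actual_addr;
    }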
Evaluating Prefetchers
• Compare to simply increasing the LLC size
  • complex prefetcher vs. a simpler one with a slightly larger cache
• Metrics: performance, power, area, bus utilization
• Key is balancing prefetch aggressiveness against resource utilization (reduce pollution, cache port contention, DRAM bus contention)
Where to Prefetch?
• Prefetching can be done at any level of the cache hierarchy
• The prefetching algorithm may vary as well
  • depends on why you’re having misses: capacity, conflict, or compulsory
    • prefetching may make capacity misses worse
    • a simpler technique (a victim cache) may be better for conflict misses
    • prefetching has a better chance than other techniques against compulsory misses
  • behaviors vary by cache level, and I$ vs. D$