
CENG 450 Computer Systems and Architecture Lecture 15



  1. CENG 450 Computer Systems and Architecture, Lecture 15. Amirali Baniasadi, amirali@ece.uvic.ca

  2. Announcements • Last Quiz scheduled for March 31st.

  3. Cache Write Policy: Write Through versus Write Back • Cache read is much easier to handle than cache write: • Instruction cache is much easier to design than data cache • Cache write: • How do we keep data in the cache and memory consistent? • Two options: • Write Back: write to cache only. Write the cache block to memory when that cache block is being replaced on a cache miss. • Need a “dirty” bit for each cache block • Greatly reduce the memory bandwidth requirement • Control can be complex • Write Through: write to cache and memory at the same time. • Isn’t memory too slow for this?
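To make the write-policy trade-off concrete, here is a minimal C sketch of the two options for a single cache line; the cache_line_t structure and the memory_write_block stub are illustrative assumptions, not something defined on the slides.

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t tag;
    bool     valid;
    bool     dirty;                        /* needed only by write-back */
    uint8_t  data[64];                     /* one cache block */
} cache_line_t;

/* Stub for the sketch: a real controller would issue the memory write here. */
static void memory_write_block(uint32_t tag, const uint8_t *data) {
    (void)tag; (void)data;
}

/* Write-through: update the cache and memory on every store. */
static void store_write_through(cache_line_t *line, int offset, uint8_t value) {
    line->data[offset] = value;
    memory_write_block(line->tag, line->data);          /* memory always consistent */
}

/* Write-back: update the cache only and mark the line dirty. */
static void store_write_back(cache_line_t *line, int offset, uint8_t value) {
    line->data[offset] = value;
    line->dirty = true;                                 /* memory updated later */
}

/* A write-back cache writes a dirty block to memory only when it is replaced. */
static void evict_line(cache_line_t *line) {
    if (line->valid && line->dirty)
        memory_write_block(line->tag, line->data);
    line->valid = false;
    line->dirty = false;
}

The dirty bit is what lets write-back skip the memory write for clean blocks; write-through never needs it but pays a memory access on every store, which is why it needs the write buffer on the next slide.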

  4. Write Buffer for Write-Through Cache [Figure: Processor -> Cache and Write Buffer -> DRAM] • A Write Buffer is needed between the Cache and Memory • Processor: writes data into the cache and the write buffer • Memory controller: writes contents of the buffer to memory • Write buffer is just a FIFO: • Typical number of entries: 4 • Works fine if: Store frequency (w.r.t. time) << 1 / DRAM write cycle • Memory system designer’s nightmare: • Store frequency (w.r.t. time) -> 1 / DRAM write cycle • Write buffer saturation
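A small C sketch of the write buffer as a FIFO, under the slide's assumptions (4 entries); the type names and circular-buffer layout are illustrative, not part of the slides. The processor enqueues completed stores and stalls only when the buffer is full; the memory controller drains one entry per DRAM write cycle.

#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES 4                       /* typical number of entries per the slide */

typedef struct { uint32_t addr; uint32_t data; } wb_entry_t;

typedef struct {
    wb_entry_t entry[WB_ENTRIES];
    int head, tail, count;                 /* simple circular FIFO */
} write_buffer_t;

/* Processor side: returns false (processor must stall) when the buffer is full. */
bool wb_enqueue(write_buffer_t *wb, uint32_t addr, uint32_t data) {
    if (wb->count == WB_ENTRIES) return false;          /* write buffer saturation */
    wb->entry[wb->tail] = (wb_entry_t){ addr, data };
    wb->tail = (wb->tail + 1) % WB_ENTRIES;
    wb->count++;
    return true;
}

/* Memory-controller side: drain one entry per DRAM write cycle. */
bool wb_drain_one(write_buffer_t *wb, wb_entry_t *out) {
    if (wb->count == 0) return false;
    *out = wb->entry[wb->head];                         /* caller performs the DRAM write */
    wb->head = (wb->head + 1) % WB_ENTRIES;
    wb->count--;
    return true;
}

As long as stores arrive more slowly than the drain rate the buffer stays nearly empty; if stores arrive faster for long enough, wb_enqueue fails no matter how many entries the buffer has, which is exactly the saturation case on the next slide.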

  5. Write Buffer Saturation [Figure: Processor -> Cache and Write Buffer -> DRAM] • Store frequency (w.r.t. time) -> 1 / DRAM write cycle • If this condition exists for a long period of time (CPU cycle time too quick and/or too many store instructions in a row): • Store buffer will overflow no matter how big you make it • The CPU Cycle Time <= DRAM Write Cycle Time • Solutions for write buffer saturation: • Use a write-back cache • Install a second-level (L2) cache between the write buffer and memory [Figure: Processor -> L1 Cache and Write Buffer -> L2 Cache -> DRAM]

  6. Improving Cache Performance Average Memory Access Time = Hit Time + Miss Rate * Miss Penalty 1. Reduce the miss rate, 2. Reduce the miss penalty, or 3. Reduce the time to hit in the cache.
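A quick worked example (the numbers are illustrative, not from the slides): with a 1-cycle hit time, a 5% miss rate, and a 50-cycle miss penalty, AMAT = 1 + 0.05 * 50 = 3.5 cycles. Halving the miss rate gives 1 + 0.025 * 50 = 2.25 cycles, and halving the miss penalty gives 1 + 0.05 * 25 = 2.25 cycles, so all three terms in the equation are worth attacking; the next slides go after each in turn.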

  7. 1. Reduce Misses via Larger Block Size

  8. 2. Reduce Misses via Higher Associativity • 2:1 Cache Rule: • Miss Rate of a direct-mapped cache of size N is about the same as the Miss Rate of a 2-way set-associative cache of size N/2 • Beware: Execution time is the only final measure! • Will Clock Cycle time increase? • Hill [1988] suggested hit time for 2-way vs. 1-way: external cache +10%, internal +2%

  9. 3. Reducing Misses via a “Victim Cache” • How to combine the fast hit time of direct mapped and still avoid conflict misses? • Add a small buffer that holds data recently discarded from the cache • Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct mapped data cache • Used in Alpha, HP machines [Figure: 4-entry victim cache, each entry one cache line of data plus a tag and comparator, placed between the direct-mapped cache and the next lower level in the hierarchy]
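A hedged C sketch of the probe path on a direct-mapped miss; the structures and the sequential loop are illustrative (real hardware compares all four tags in parallel), and the function names are assumptions.

#include <stdbool.h>
#include <stdint.h>

#define VC_ENTRIES 4                       /* Jouppi's 4-entry victim cache */

typedef struct { uint32_t block_addr; bool valid; } vc_entry_t;

typedef struct { vc_entry_t entry[VC_ENTRIES]; int next_victim; } victim_cache_t;

/* On a direct-mapped miss, probe the small fully associative victim cache.
 * A hit here turns a conflict miss into a fast swap instead of a trip to
 * the next lower level of the hierarchy. */
bool victim_cache_probe(const victim_cache_t *vc, uint32_t block_addr) {
    for (int i = 0; i < VC_ENTRIES; i++)
        if (vc->entry[i].valid && vc->entry[i].block_addr == block_addr)
            return true;
    return false;
}

/* The block just evicted from the direct-mapped cache goes into the victim
 * cache, replacing the oldest entry (simple FIFO replacement). */
void victim_cache_insert(victim_cache_t *vc, uint32_t evicted_block_addr) {
    vc->entry[vc->next_victim] = (vc_entry_t){ evicted_block_addr, true };
    vc->next_victim = (vc->next_victim + 1) % VC_ENTRIES;
}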

  10. 4. Reducing Misses via “Pseudo-Associativity” • How to combine the fast hit time of Direct Mapped with the lower conflict misses of a 2-way associative cache? • Divide the cache: on a miss, check the other half of the cache to see if the data is there; if so, it is a pseudo-hit (slow hit) • Drawback: difficult to build a CPU pipeline if a hit may take either 1 or 2 cycles • Better for caches not tied directly to the processor (L2) • Used in the MIPS R10000 L2 cache; similar in UltraSPARC [Figure: access time grows from Hit Time to Pseudo Hit Time to Miss Penalty]
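A minimal sketch of the pseudo-associative probe order, assuming the common scheme of inverting the most-significant index bit to find the “other half” of the cache; the sizes, arrays, and helper are illustrative assumptions.

#include <stdbool.h>
#include <stdint.h>

#define INDEX_BITS 10                      /* illustrative: 1024-set direct-mapped cache */
#define NUM_SETS   (1u << INDEX_BITS)

static uint32_t tag_array[NUM_SETS];       /* toy tag store for the sketch */
static bool     valid[NUM_SETS];

static bool probe_set(uint32_t index, uint32_t tag) {
    return valid[index] && tag_array[index] == tag;
}

typedef enum { FAST_HIT, SLOW_HIT, MISS } lookup_result_t;

/* Fast hit in the primary set; slow (pseudo) hit in the alternate set obtained
 * by flipping the high-order index bit; otherwise a real miss. */
lookup_result_t pseudo_assoc_lookup(uint32_t index, uint32_t tag) {
    if (probe_set(index, tag))
        return FAST_HIT;                                 /* normal 1-cycle hit */
    uint32_t alt_index = index ^ (1u << (INDEX_BITS - 1));
    if (probe_set(alt_index, tag))
        return SLOW_HIT;                                 /* extra cycle: pseudo hit */
    return MISS;                                         /* go to the next level */
}

The variable hit latency (FAST_HIT vs. SLOW_HIT) is exactly the pipeline complication mentioned above, which is why the technique suits an L2 better than an L1.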

  11. 5. Reducing Misses by Compiler Optimizations • Instructions • Not discussed here. • Data • Merging Arrays: improve spatial locality by single array of compound elements vs. 2 arrays • Loop Interchange: change nesting of loops to access data in order stored in memory • Loop Fusion: Combine 2 independent loops that have same looping and some variables overlap

  12. Merging Arrays Example
  /* Before: 2 sequential arrays */
  int val[SIZE];
  int key[SIZE];

  /* After: 1 array of structures */
  struct merge {
    int val;
    int key;
  };
  struct merge merged_array[SIZE];
  Reducing conflicts between val & key; improves spatial locality

  13. Loop Interchange Example
  [Figure: 2-D array traversal order (i, j) vs. row-major memory addresses]
  /* Before */
  for (k = 0; k < 100; k = k+1)
    for (j = 0; j < 100; j = j+1)
      for (i = 0; i < 5000; i = i+1)
        x[i][j] = 2 * x[i][j];

  /* After */
  for (k = 0; k < 100; k = k+1)
    for (i = 0; i < 5000; i = i+1)
      for (j = 0; j < 100; j = j+1)
        x[i][j] = 2 * x[i][j];
  Sequential accesses instead of striding through memory every 100 words; improved spatial locality

  14. Loop Fusion Example
  /* Before */
  for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
      a[i][j] = 1/b[i][j] * c[i][j];
  for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
      d[i][j] = a[i][j] + c[i][j];

  /* After */
  for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
      a[i][j] = 1/b[i][j] * c[i][j];
      d[i][j] = a[i][j] + c[i][j];
    }
  Before: 2 misses per access to a & c; after: one miss per access. Improves temporal locality.

  15. Summary of Compiler Optimizations (by hand)

  16. Summary: Miss Rate Reduction • 3 Cs: Compulsory, Capacity, Conflict 1. Reduce Misses via Larger Block Size 2. Reduce Misses via Higher Associativity 3. Reducing Misses via Victim Cache 4. Reducing Misses via Pseudo-Associativity 5. Reducing Misses by Compiler Optimizations

  17. Improving Cache Performance 1. Reduce the miss rate, 2. Reduce the miss penalty, or 3. Reduce the time to hit in the cache.

  18. 1. Reduce Miss Penalty: Early Restart and Critical Word First • Don’t wait for the full block to be loaded before restarting the CPU • Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution • Critical Word First: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first • Generally useful only when cache line > bus width • Spatial locality is a complication: the CPU tends to want the next sequential word soon, so it is not clear how much early restart of the block actually helps
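A short sketch of the wrapped (critical-word-first) fill order, assuming an 8-word block; the helper and its stub are placeholders for the real memory read, not an API from the slides.

#include <stdint.h>

#define WORDS_PER_BLOCK 8                  /* illustrative block size */

/* Stub for the sketch: a real controller would issue the DRAM/L2 read here. */
static uint32_t fetch_word(uint32_t block_addr, unsigned word_in_block) {
    return block_addr + word_in_block;                   /* placeholder data */
}

/* Fill the block starting at the word the CPU actually asked for, wrapping
 * around the end of the block. The requested word can be forwarded to the
 * CPU after the first iteration (early restart) while the fill continues. */
void fill_block_critical_word_first(uint32_t block_addr,
                                    unsigned requested_word,
                                    uint32_t block_buf[WORDS_PER_BLOCK]) {
    for (unsigned i = 0; i < WORDS_PER_BLOCK; i++) {
        unsigned w = (requested_word + i) % WORDS_PER_BLOCK;   /* wrapped fetch order */
        block_buf[w] = fetch_word(block_addr, w);
    }
}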

  19. 2. Reduce Miss Penalty: Non-blocking Caches to reduce stalls on misses • A non-blocking cache or lockup-free cache allows the data cache to continue to supply cache hits during a miss • “hit under miss” reduces the effective miss penalty by working during the miss instead of ignoring CPU requests • “hit under multiple miss” or “miss under miss” may further lower the effective miss penalty by overlapping multiple misses • Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses • Pentium Pro allows 4 outstanding memory misses

  20. 3: Use a multi-level cache • L2 Equations: AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1, where Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2, so AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2) • Definitions: • Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2) • Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 x Miss Rate_L2) • Global Miss Rate is what matters
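For instance (illustrative numbers, not from the slides): Hit Time_L1 = 1 cycle, Miss Rate_L1 = 5%, Hit Time_L2 = 10 cycles, local Miss Rate_L2 = 40%, Miss Penalty_L2 = 100 cycles. Then Miss Penalty_L1 = 10 + 0.40 x 100 = 50 cycles and AMAT = 1 + 0.05 x 50 = 3.5 cycles, while the global L2 miss rate is 0.05 x 0.40 = 2% of all CPU references; the 40% local rate looks alarming, but the 2% global rate is what the CPU actually sees.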

  21. Reducing Misses: Which apply to L2 Cache? • Reducing Miss Rate 1. Reduce Misses via Larger Block Size 2. Reduce Conflict Misses via Higher Associativity 3. Reducing Conflict Misses via Victim Cache 4. Reducing Conflict Misses via Pseudo-Associativity 5. Reducing Capacity/Conf. Misses by Compiler Optimizations

  22. L2 cache block size & A.M.A.T. • 32KB L1, 8 byte path to memory

  23. Reducing Miss Penalty Summary • Three techniques • Early Restart and Critical Word First on miss • Non-blocking Caches (Hit under Miss, Miss under Miss) • Second Level Cache • Can be applied recursively to Multilevel Caches • Danger is that time to DRAM will grow with multiple levels in between

  24. Summary: The Cache Design Space • Several interacting dimensions • cache size • block size • associativity • replacement policy • write-through vs. write-back • The optimal choice is a compromise • depends on access characteristics • workload • use (I-cache, D-cache, TLB) • depends on technology / cost • Simplicity often wins [Figure: trade-off curves over cache size, block size, and associativity, showing a sweet spot between the bad extremes]

  25. IBM POWER4 Memory Hierarchy • L1 (Instr.): 64 KB, direct mapped, 128-byte blocks divided into 32-byte sectors • L1 (Data): 32 KB, 2-way, FIFO replacement; 4 cycles to load to a floating-point register • L2 (Instr. + Data): 1440 KB, 3-way, pseudo-LRU, shared by two processors; write allocate; 128-byte blocks; 14 cycles to load to a floating-point register • L3 (Instr. + Data): 128 MB, 8-way, shared by two processors; 512-byte blocks divided into 128-byte sectors; ~340 cycles

  26. Intel Itanium Processor • L1 (Instr.): 16 KB, 4-way; 32-byte blocks; 2 cycles • L1 (Data): 16 KB, 4-way, dual-ported, write through; 32-byte blocks; 2 cycles • L2 (Instr. + Data): 96 KB, 6-way; 64-byte blocks; write allocate; 12 cycles • L3: 4 MB (on package, off chip); 64-byte blocks; 128-bit bus at 800 MHz (12.8 GB/s); 20 cycles

  27. 3rd Generation Itanium • 1.5 GHz • 410 million transistors • 6MB 24-way set associative L3 cache • 6-level copper interconnect, 0.13 micron • 130W (i.e. lasts 17s on an AA NiCd)

  28. Cache performance • Miss-oriented approach to memory access: CPU time = IC x (CPI_Execution + Memory accesses per instruction x Miss Rate x Miss Penalty) x Clock cycle time • CPI_Execution includes ALU and Memory instructions • Separating out the memory component entirely: CPU time = IC x (CPI_ALUOps x ALU ops per instruction + AMAT x Memory accesses per instruction) x Clock cycle time • AMAT = Average Memory Access Time = Hit Time + Miss Rate x Miss Penalty • CPI_ALUOps does not include memory instructions

  29. Impact on Performance • Suppose a processor executes at • Clock Rate = 1 GHz (1 ns per cycle), Ideal (no misses) CPI = 1.1 • 50% arith/logic, 30% ld/st, 20% control • Suppose that 10% of memory operations get 100 cycle miss penalty • Suppose that 1% of instructions get same miss penalty 78% of the time the proc is stalled waiting for memory!
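Working the numbers from this slide: memory operations are 30% of instructions, so memory stalls per instruction = 0.30 x 10% x 100 (data misses) + 1% x 100 (instruction misses) = 3.0 + 1.0 = 4.0 cycles. Total CPI = 1.1 + 4.0 = 5.1, and 4.0 / 5.1 is about 78%, which is where the figure above comes from.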

  30. Example: Harvard Architecture [Figure: split I-Cache-1 and D-Cache-1 (Harvard) vs. a single Unified Cache-1, each backed by a Unified Cache-2] • Unified vs. Separate I&D (Harvard) • 16KB I&D: Inst miss rate = 0.64%, Data miss rate = 6.47% • 32KB unified: Aggregate miss rate = 1.99% • Which is better (ignore the L2 cache)? • Assume 33% data ops, so 75% of accesses come from instructions (1.0/1.33) • hit time = 1, miss time = 50 • Note that a data hit has 1 extra stall in the unified cache (only one port) • AMAT_Harvard = 75% x (1 + 0.64% x 50) + 25% x (1 + 6.47% x 50) = 2.05 • AMAT_Unified = 75% x (1 + 1.99% x 50) + 25% x (1 + 1 + 1.99% x 50) = 2.24 • The split (Harvard) organization wins despite its higher data miss rate

  31. Summary: • The Principle of Locality: • Programs access a relatively small portion of the address space at any instant of time. • Temporal Locality: Locality in Time • Spatial Locality: Locality in Space • Three Major Categories of Cache Misses: • Compulsory Misses: sad facts of life. Example: cold start misses. • Conflict Misses: increase cache size and/or associativity. Nightmare Scenario: ping pong effect! • Capacity Misses: increase cache size • Write Policy: • Write Through: needs a write buffer. Nightmare: write buffer saturation • Write Back: control can be complex • Cache Performance
