CSE 502 Graduate Computer Architecture Lec 15 – MidTerm Review

CSE 502 Graduate Computer Architecture Lec 15 – MidTerm Review Larry Wittie Computer Science, StonyBrook University http://www.cs.sunysb.edu/~cse502 and ~lw Slides adapted from David Patterson, UC-Berkeley cs252-s06 CSE502-F09, Lec 15 - MTRevu

Review: Some Basic Unit Definitions
• Kilobyte (KB): 2^10 (1,024) or 10^3 (1,000 or "thousand") bytes (a 500-page book)
• Megabyte (MB): 2^20 (1,048,576) or 10^6 ("million") bytes (1 wall of 1000 books)
• Gigabyte (GB): 2^30 (1,073,741,824) or 10^9 ("billion") bytes (a 1000-wall library)
• Terabyte (TB): 2^40 (1.100 x 10^12) or 10^12 ("trillion") bytes (1000 big libraries)
• Petabyte (PB): 2^50 (1.126 x 10^15) or 10^15 ("quadrillion") bytes (½ hr of satellite data)
• Exabyte: 2^60 (1.153 x 10^18) or 10^18 ("quintillion") bytes (40 days of 1 satellite's data)
Remember that 8 bits = 1 byte.
• millisec (ms): 10^-3 ("a thousandth of a") second; light goes 300 kilometers
• microsec (µs): 10^-6 ("a millionth of a") second; light goes 300 meters
• nanosec (ns): 10^-9 ("a billionth of a") second; light goes 30 cm, 1 foot
• picosec (ps): 10^-12 ("a trillionth of a") second; light goes 300 µm, 6 hairs
• femtosec (fs): 10^-15 ("one quadrillionth of a") second; light goes 300 nm, 1 cell
• attosec: 10^-18 ("one quintillionth of a") second; light goes 0.3 nm, 1 atom

CSE 502 Graduate Computer Architecture Lec 1-2 - Introduction Larry Wittie Computer Science, StonyBrook University http://www.cs.sunysb.edu/~cse502 and ~lw Slides adapted from David Patterson, UC-Berkeley cs252-s06 CSE502-F09, Lec 01-3 - intro

Crossroads: Uniprocessor Performance
From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, October, 2006
• VAX: 25%/year 1978 to 1986
• RISC + x86: 52%/year 1986 to 2002
• RISC + x86: ??%/year 2002 to 2006

1) Taking Advantage of Parallelism
• Increasing throughput of a server computer via multiple processors or multiple disks
• Detailed HW design
  • Carry-lookahead adders use parallelism to speed up computing sums from linear to logarithmic in the number of bits per operand
  • Multiple memory banks are searched in parallel in set-associative caches
• Pipelining: overlap instruction execution to reduce the total time to complete an instruction sequence
  • Not every instruction depends on its immediate predecessor => executing instructions completely/partially in parallel is possible
  • Classic 5-stage pipeline: 1) Instruction Fetch (Ifetch), 2) Register Read (Reg), 3) Execute (ALU), 4) Data Memory Access (Dmem), 5) Register Write (Reg)

Pipelined Instruction Execution Is Faster
[Figure: four overlapped instructions, each passing through Ifetch, Reg, ALU, DMem, Reg; horizontal axis: Time (clock cycles), Cycle 1 to Cycle 7; vertical axis: Instr. Order]

Limits to Pipelining
• Hazards prevent the next instruction from executing during its designated clock cycle
  • Structural hazards: attempt to use the same hardware to do two different things at once
  • Data hazards: instruction depends on the result of a prior instruction still in the pipeline
  • Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
[Figure: four overlapped instructions, each passing through Ifetch, Reg, ALU, DMem, Reg; axes: Time (clock cycles) vs. Instr. Order]

2) The Principle of Locality => Caches ($)
• The Principle of Locality:
  • Programs access a relatively small portion of the address space at any instant of time.
• Two Different Types of Locality:
  • Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
  • Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
• For 30 years, HW has relied on locality for memory performance
[Figure: processor (P), cache ($), memory (MEM)]

Levels of the Memory Hierarchy (upper levels are faster, lower levels are larger)
• CPU Registers: 100s Bytes; 300-500 ps (0.3-0.5 ns)
  • Staging to/from cache: by prog./compiler; transfer unit: instr. operands, 1-8 bytes
• L1 and L2 Cache: 10s-100s K Bytes; ~1 ns to ~10 ns; $1000s/GByte
  • Staging L1 to/from L2: by cache cntlr; transfer unit: blocks, 32-64 bytes
  • Staging L2 to/from main memory: by cache cntlr; transfer unit: blocks, 64-128 bytes
• Main Memory: G Bytes; 80 ns-200 ns; ~$100/GByte
  • Staging to/from disk: by OS; transfer unit: pages, 4K-8K bytes
• Disk: 10s T Bytes; 10 ms (10,000,000 ns); ~$0.25/GByte
  • Staging to/from tape: by user/operator; transfer unit: files, Mbytes
• Tape Vault: semi-infinite; sec-min; ~$1/GByte

3) Focus on the Common Case: "Make Frequent Case Fast and Rest Right"
• Common sense guides computer design
  • Since it's engineering, common sense is valuable
• In making a design trade-off, favor the frequent case over the infrequent case
  • E.g., the instruction fetch and decode unit is used more frequently than the multiplier, so optimize it first
  • E.g., if a database server has 50 disks / processor, storage dependability dominates system dependability, so optimize it 1st
• The frequent case is often simpler and can be done faster than the infrequent case
  • E.g., overflow is rare when adding 2 numbers, so improve performance by optimizing the more common case of no overflow
  • May slow down overflow, but overall performance is improved by optimizing for the normal case
• What is the frequent case, and how much is performance improved by making that case faster => Amdahl's Law

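4) Amdahl's Law - Partial Enhancement Limits
\[ \text{Speedup}_{\text{overall}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}} \]
Best to ever achieve:
\[ \text{Speedup}_{\text{maximum}} = \frac{1}{1 - \text{Fraction}_{\text{enhanced}}} \]
Example: An I/O bound server gets a new CPU that is 10X faster, but 60% of server time is spent waiting for I/O, so only 40% of the time is enhanced:
\[ \text{Speedup}_{\text{overall}} = \frac{1}{(1 - 0.4) + \dfrac{0.4}{10}} = \frac{1}{0.64} = 1.56 \]
A 10X faster CPU allures, but the server is only 1.6X faster.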
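The 1.6X figure in the server example can be checked with Amdahl's Law, Speedup = 1 / ((1 - F) + F / S), where F is the fraction of time enhanced and S is the speedup of that fraction. A minimal Python sketch; the function name `amdahl_speedup` is illustrative and not from the slides:

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when only part of the run time is enhanced."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# 60% of the time is I/O wait, so only 40% benefits from the 10X faster CPU.
overall = amdahl_speedup(0.4, 10)

# Upper bound even with an infinitely fast CPU: 1 / (1 - 0.4).
limit = 1.0 / (1.0 - 0.4)

print(round(overall, 2), round(limit, 2))  # 1.56 1.67
```

Even an infinitely fast CPU could not push this server past about 1.67X, because the I/O wait is untouched.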

5) Processor performance equation
\[ \text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}} \]
CPU time = Inst Count x CPI x Clock Cycle
What affects each factor:
                 Inst Count    CPI    Clock Cycle (cycle time)
Program              X
Compiler             X         (X)
Inst. Set            X          X
Organization                    X          X
Technology                                 X
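As a quick illustration of CPU time = Inst Count x CPI x Clock Cycle, the sketch below evaluates the product for made-up numbers. The instruction count, CPI values, and clock period are hypothetical, chosen only to show how a CPI improvement feeds through:

```python
def cpu_time(inst_count, cpi, clock_cycle_s):
    """CPU time in seconds = instructions x cycles/instruction x seconds/cycle."""
    return inst_count * cpi * clock_cycle_s

# Hypothetical program: 1e9 instructions on a 1 ns clock.
base = cpu_time(1e9, 1.5, 1e-9)        # CPI 1.5 -> 1.5 s
better_cpi = cpu_time(1e9, 1.2, 1e-9)  # CPI 1.2 -> 1.2 s

print(base, better_cpi, base / better_cpi)
```

Because the three factors multiply, a change that lowers CPI but lengthens the clock cycle has to be judged on the product, not on either factor alone.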

What Determines a Clock Cycle?
• At transition edge(s) of each clock pulse, state devices sample and save their present input signals
• Past: 1 cycle = time for signals to pass 10 levels of gates
• Today: determined by numerous time-of-flight issues + gate delays
  • clock propagation, wire lengths, drivers
[Figure: combinational logic between latches or registers]

Latency Lags Bandwidth (for last 20 yrs)
Performance Milestones (latency improvement, bandwidth improvement):
• Processor: '286, '386, '486, Pentium, Pentium Pro, Pentium 4 (21x, 2250x)
• Ethernet: 10 Mb, 100 Mb, 1000 Mb, 10000 Mb/s (16x, 1000x)
• Memory Module: 16-bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x)
• Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)
• CPU high, Memory low ("Memory Wall")
(Latency = simple operation w/o contention, BW = best-case)

Summary of Technology Trends
• For disk, LAN, memory, and microprocessor, bandwidth improves by more than the square of latency improvement
  • In the time that bandwidth doubles, latency improves by no more than 1.2X to 1.4X
• Lag of gains for latency vs bandwidth is probably even larger in real systems, as bandwidth gains are multiplied by replicated components
  • Multiple processors in a cluster or even on a chip
  • Multiple disks in a disk array
  • Multiple memory modules in a large memory
  • Simultaneous communication in switched local area networks (LANs)
• HW and SW developers should innovate assuming Latency Lags Bandwidth
  • If everything improves at the same rate, then nothing really changes
  • When rates vary, good designs require real innovation

Define and quantify power (1/2)
• For CMOS chips, the traditional dominant energy use has been in switching transistors, called dynamic power
\[ \text{Power}_{\text{dynamic}} = \tfrac{1}{2} \times \text{Capacitive load} \times \text{Voltage}^2 \times \text{Frequency switched} \]
• For mobile devices, energy is a better metric
\[ \text{Energy}_{\text{dynamic}} = \text{Capacitive load} \times \text{Voltage}^2 \]
• For a fixed task, slowing the clock rate (the switching frequency) reduces power, but not energy
• Capacitive load is a function of the number of transistors connected to an output and of the technology, which determines the capacitance of wires and transistors
• Dropping voltage helps both, so ICs went from 5V to 1V
• To save energy & dynamic power, most CPUs now turn off the clock of inactive modules (e.g., Fltg. Pt. Arith. Unit)
• If a 15% voltage reduction causes a 15% reduction in frequency, what is the impact on dynamic power?
  • New power/old = 0.85^2 x 0.85 = 0.85^3 = 0.614, a "39% reduction"
• Because leakage current flows even when a transistor is off, static power is now important too
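The 39% figure can be reproduced from the usual CMOS relations, dynamic power = 1/2 x capacitive load x voltage^2 x frequency and dynamic energy = capacitive load x voltage^2. A small Python sketch, using normalized units (load, voltage, and frequency all set to 1.0 as a baseline, which is an assumption made only for the ratio):

```python
def dynamic_power(cap_load, voltage, freq):
    """Dynamic power = 1/2 x C x V^2 x f."""
    return 0.5 * cap_load * voltage ** 2 * freq

def dynamic_energy(cap_load, voltage):
    """Dynamic energy per transition pair = C x V^2 (independent of f)."""
    return cap_load * voltage ** 2

old_power = dynamic_power(1.0, 1.0, 1.0)
new_power = dynamic_power(1.0, 0.85, 0.85)  # 15% lower voltage and frequency
power_ratio = new_power / old_power         # 0.85^3

energy_ratio = dynamic_energy(1.0, 0.85) / dynamic_energy(1.0, 1.0)  # 0.85^2

print(round(power_ratio, 3), round(energy_ratio, 4))  # 0.614 0.7225
```

Power drops with the cube of the scaling factor while energy drops only with the square, which is why lowering frequency alone saves power but not energy for a fixed task.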

Define and quantify dependability (2/3)
• Module reliability = measure of continuous service accomplishment (or time to failure)
  • Mean Time To Failure (MTTF) measures reliability
  • Failures In Time (FIT) = 1/MTTF, the failure rate
  • Usually reported as failures per billion hours of operation

Definition: Performance
• Performance is in units of things-done per second
  • bigger is better
• If we are primarily concerned with response time:
\[ \text{performance}(x) = \frac{1}{\text{execution\_time}(x)} \]
• "X is N times faster than Y" means
\[ N = \frac{\text{Performance}(X)}{\text{Performance}(Y)} = \frac{\text{Execution\_time}(Y)}{\text{Execution\_time}(X)} \]
• The Speedup = N ("mushroom"): the BIG time divided by the little time
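Both definitions reduce to one-line ratios. The sketch below converts an MTTF into FIT (failures per billion hours) and computes "N times faster" as the big time over the little time. The MTTF and the execution times are hypothetical values picked for round results:

```python
def fit_from_mttf(mttf_hours):
    """Failure rate in FIT: failures per billion (1e9) hours of operation."""
    return 1e9 / mttf_hours

def speedup(time_y, time_x):
    """How many times faster X is than Y: Execution_time(Y) / Execution_time(X)."""
    return time_y / time_x

fit = fit_from_mttf(1_000_000)  # MTTF of 1,000,000 hours -> 1000 FIT
n = speedup(15.0, 10.0)         # Y takes 15 s, X takes 10 s -> X is 1.5 times faster

print(fit, n)  # 1000.0 1.5
```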

And in conclusion ...
• Computer Science at the crossroads from sequential to parallel computing
  • Salvation requires innovation in many fields, including computer architecture
• An architect must track & extrapolate technology
• Bandwidth in disks, DRAM, networks, and processors improves by at least as much as the square of the improvement in latency
• Quantify dynamic and static power
  • Capacitance x Voltage^2 x frequency, Energy vs. power
• Quantify dependability
  • Reliability (MTTF, FIT), Availability (99.9...)
• Quantify and summarize performance
  • Ratios, Geometric Mean, Multiplicative Standard Deviation
• Read Chapter 1, then Appendix A

CSE 502 Graduate Computer Architecture Lec 3-5 – Performance + Instruction Pipelining Review Larry Wittie Computer Science, StonyBrook University http://www.cs.sunysb.edu/~cse502 and ~lw Slides adapted from David Patterson, UC-Berkeley cs252-s06 CSE502-F09, Lec 03+4+5-perf & pipes

A "Typical" RISC ISA • 32-bit fixed format instruction (3 formats) • 32 32-bit GPR (R0 contains zero, DP take pair) • 3-address, reg-reg arithmetic instruction • Single address mode for load/store: base + displacement • no indirection (since it needs another memory access) • Simple branch conditions (e.g., single-bit: 0 or not?) • (Delayed branch - ineffective in deep pipelines) see: SPARC, MIPS, HP PA-Risc, DEC Alpha, IBM PowerPC, CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3 CSE502-F09, Lec 03+4+5-perf & pipes

Example: MIPS instruction formats (bit positions 31 down to 0)
• Register-Register (R Format), arithmetic operations:
    Op [31:26] | Rs1 [25:21] | Rs2 [20:16] | Rd [15:11] | Opx [10:0]
• Register-Immediate (I Format), all immediate arithmetic ops:
    Op [31:26] | Rs1 [25:21] | Rd [20:16] | immediate [15:0]
• Branch (I Format), moderate relative distance conditional branches:
    Op [31:26] | Rs1 [25:21] | Rs2/Opx [20:16] | immediate [15:0]
• Jump / Call (J Format), long distance jumps:
    Op [31:26] | target [25:0]
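The fixed field positions are what make decoding cheap: every field is a shift and a mask. The Python sketch below extracts the R-format and I-format fields using the bit ranges on this slide (Op in bits 31-26, Rs1 in 25-21, and so on). The function names are illustrative, and sign-extending the 16-bit immediate is an assumption based on the Sign Extend unit in the datapath figure:

```python
def decode_r_format(word):
    """Split a 32-bit R-format word into Op, Rs1, Rs2, Rd, Opx."""
    return {
        "op":  (word >> 26) & 0x3F,   # bits 31-26
        "rs1": (word >> 21) & 0x1F,   # bits 25-21
        "rs2": (word >> 16) & 0x1F,   # bits 20-16
        "rd":  (word >> 11) & 0x1F,   # bits 15-11
        "opx": word & 0x7FF,          # bits 10-0
    }

def decode_i_format(word):
    """Split a 32-bit I-format word into Op, Rs1, Rd, sign-extended immediate."""
    imm = word & 0xFFFF               # bits 15-0
    if imm & 0x8000:                  # assumed: immediate is sign-extended
        imm -= 0x10000
    return {
        "op":  (word >> 26) & 0x3F,
        "rs1": (word >> 21) & 0x1F,
        "rd":  (word >> 16) & 0x1F,
        "imm": imm,
    }

# Build test words from their fields rather than relying on a real encoding.
r_word = (0 << 26) | (1 << 21) | (2 << 16) | (3 << 11) | 0x20
i_word = (8 << 26) | (1 << 21) | (2 << 16) | 0xFFFC

print(decode_r_format(r_word))
print(decode_i_format(i_word))
```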

5-Stage MIPS Datapath (has pipeline latches), Figure A.3, Page A-9
[Figure: datapath with stages Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back, separated by pipeline latches IF/ID, ID/EX, EX/MEM, MEM/WB]
• Data stationary control
  • local decode for each instruction phase / pipeline stage

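Code SpeedUp Equation for Pipelining
\[ \text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Average stall cycles per instruction} \]
\[ \text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}} \]
For simple RISC pipeline, Ideal CPI = 1:
\[ \text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}} \]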
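With an ideal CPI of 1, pipeline speedup is usually written as depth / (1 + stall CPI), times the ratio of unpipelined to pipelined cycle time. A minimal sketch of that relation; the function name and the 0.45 stall CPI are illustrative values, not results from the slides:

```python
def pipeline_speedup(depth, stall_cpi, cycle_ratio=1.0):
    """Speedup over an unpipelined machine, assuming ideal CPI = 1.

    cycle_ratio = unpipelined cycle time / pipelined cycle time.
    """
    return depth / (1.0 + stall_cpi) * cycle_ratio

ideal = pipeline_speedup(5, 0.0)         # no stalls: speedup equals depth
with_stalls = pipeline_speedup(5, 0.45)  # 0.45 stall cycles per instruction

print(ideal, round(with_stalls, 2))  # 5.0 3.45
```

Stalls eat directly into the depth-sized gain, which is why the hazard-handling slides that follow matter so much.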

Data Hazard on Register R1 (If No Forwarding), Figure A.6, Page A-17
    add r1,r2,r3
    sub r4,r1,r3
    and r6,r1,r7
    or  r8,r1,r9
    xor r10,r1,r11
[Figure: the five instructions overlapped through stages IF, ID/RF, EX, MEM, WB; axes: Time (clock cycles) vs. Instr. Order]
No forwarding is needed when the register write (1st half cycle) and the register read (2nd half cycle) fall in the same clock cycle.

Three Generic Data Hazards
• Read After Write (RAW): InstrJ tries to read operand before InstrI writes it
    I: add r1,r2,r3
    J: sub r4,r1,r3
• Caused by a "Dependence" (in compiler nomenclature). This hazard results from an actual need for communicating a new data value.

Three Generic Data Hazards
• Write After Read (WAR): InstrJ writes operand before InstrI reads it
    I: sub r4,r1,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
• Called an "anti-dependence" by compiler writers. This results from reuse of the name "r1".
• Cannot happen in MIPS 5 stage pipeline because:
  • All instructions take 5 stages, and
  • Register reads are always in stage 2, and
  • Register writes are always in stage 5

Three Generic Data Hazards
• Write After Write (WAW): InstrJ writes operand before InstrI writes it
    I: sub r1,r4,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
• Called an "output dependence" by compiler writers. This also results from the reuse of name "r1".
• Cannot happen in MIPS 5 stage pipeline because:
  • All instructions take 5 stages, and
  • Register writes are always in stage 5
• Will see WAR and WAW in more complicated pipes
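The three hazard types can be told apart by comparing the destination and source registers of an earlier instruction I and a later instruction J. The sketch below is a simplified classifier, assuming each instruction is given as a (dest, sources) pair. It reports the dependence that could cause a hazard; whether a real pipeline stalls depends on its stage timing, as the bullets above note for the MIPS 5-stage pipe:

```python
def classify_hazards(instr_i, instr_j):
    """Return the set of potential hazards between I (earlier) and J (later).

    Each instruction is a (dest_register, [source_registers]) tuple.
    """
    dest_i, srcs_i = instr_i
    dest_j, srcs_j = instr_j
    hazards = set()
    if dest_i in srcs_j:
        hazards.add("RAW")  # J reads what I writes (true dependence)
    if dest_j in srcs_i:
        hazards.add("WAR")  # J overwrites what I reads (anti-dependence)
    if dest_i == dest_j:
        hazards.add("WAW")  # both write the same register (output dependence)
    return hazards

# The I/J pairs from the three hazard slides.
raw = classify_hazards(("r1", ["r2", "r3"]), ("r4", ["r1", "r3"]))
war = classify_hazards(("r4", ["r1", "r3"]), ("r1", ["r2", "r3"]))
waw = classify_hazards(("r1", ["r4", "r3"]), ("r1", ["r2", "r3"]))

print(raw, war, waw)  # {'RAW'} {'WAR'} {'WAW'}
```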

Forwarding to Avoid Data Hazard, Figure A.7, Page A-19
    add r1,r2,r3
    sub r4,r1,r3
    and r6,r1,r7
    or  r8,r1,r9
    xor r10,r1,r11
[Figure: the five instructions overlapped through Ifetch, Reg, ALU, DMem, Reg; axes: Time (clock cycles) vs. Instr. Order]
• Forwarding of ALU outputs is needed as ALU inputs 1 & 2 cycles later.
• Forwarding of LW MEM outputs to SW MEM or ALU inputs 1 or 2 cycles later.
• No forwarding is needed when the register write (1st half cycle) and the register read (2nd half cycle) fall in the same clock cycle.

HW Datapath Changes (in red) for Forwarding, Figure A.23, Page A-37
[Figure: ALU with input muxes fed from the ID/EX, EX/MEM, and MEM/WR pipeline registers; also shows NextPC, Registers, Immediate, and Data Memory; path labels: (From ALU), (From LW Data Memory)]
• To forward ALU output 1 cycle to ALU inputs
• To forward ALU, MEM 2 cycles to ALU
• To forward MEM 1 cycle to SW MEM input
What circuit detects and resolves this hazard?

Forwarding Avoids ALU-ALU & LW-SW Data Hazards, Figure A.8, Page A-20
    add r1,r2,r3
    lw  r4, 0(r1)
    sw  r4,12(r1)
    or  r8,r6,r9
    xor r10,r9,r11
[Figure: the five instructions overlapped through Ifetch, Reg, ALU, DMem, Reg; axes: Time (clock cycles) vs. Instr. Order]

LW-ALU Data Hazard Even with Forwarding, Figure A.9, Page A-21
    lw  r1, 0(r2)
    sub r4,r1,r6
    and r6,r1,r7
    or  r8,r1,r9
[Figure: the four instructions overlapped through Ifetch, Reg, ALU, DMem, Reg; axes: Time (clock cycles) vs. Instr. Order]
No forwarding is needed when the register write (1st half cycle) and the register read (2nd half cycle) fall in the same clock cycle.

Data Hazard Even with Forwarding (Similar to Figure A.10, Page A-21)
    lw  r1, 0(r2)
    sub r4,r1,r6
    and r6,r1,r7
    or  r8,r1,r9
[Figure: the four instructions overlapped through Ifetch, Reg, ALU, DMem, Reg, with a Bubble inserted in sub, and, and or; axes: Time (clock cycles) vs. Instr. Order]
No forwarding is needed when the register write (1st half cycle) and the register read (2nd half cycle) fall in the same clock cycle.
How is this hazard detected?

Software Scheduling to Avoid Load Hazards
Try producing fast code with no stalls for
    a = b + c;
    d = e - f;
assuming a, b, c, d, e, and f are in memory.
Slow code:
                LW  Rb,b
                LW  Rc,c
    Stall ===>  ADD Ra,Rb,Rc
                SW  a,Ra
                LW  Re,e
                LW  Rf,f
    Stall ===>  SUB Rd,Re,Rf
                SW  d,Rd
Fast code (no stalls):
                LW  Rb,b
                LW  Rc,c
                LW  Re,e
                ADD Ra,Rb,Rc
                LW  Rf,f
                SW  a,Ra
                SUB Rd,Re,Rf
                SW  d,Rd
Compiler optimizes for performance. Hardware checks for safety.
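The two stalls in the slow code are load-use stalls: an instruction needs a register in its EX stage that the immediately preceding LW has not yet fetched. The sketch below counts them under a simplified model, which is an assumption of this example: full forwarding, one stall per back-to-back LW-to-EX use, and SW store data excluded because MEM-to-MEM forwarding covers it. Each instruction is an (opcode, dest, ex_sources) tuple:

```python
def count_load_use_stalls(program):
    """Count stalls caused by using a loaded register in the very next instruction.

    program: list of (opcode, dest, ex_sources), where ex_sources are the
    registers the instruction needs at the start of its EX stage.
    """
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, curr_sources = curr
        if prev_op == "LW" and prev_dest in curr_sources:
            stalls += 1
    return stalls

slow = [
    ("LW", "Rb", []), ("LW", "Rc", []),
    ("ADD", "Ra", ["Rb", "Rc"]), ("SW", None, []),
    ("LW", "Re", []), ("LW", "Rf", []),
    ("SUB", "Rd", ["Re", "Rf"]), ("SW", None, []),
]
fast = [
    ("LW", "Rb", []), ("LW", "Rc", []), ("LW", "Re", []),
    ("ADD", "Ra", ["Rb", "Rc"]), ("LW", "Rf", []), ("SW", None, []),
    ("SUB", "Rd", ["Re", "Rf"]), ("SW", None, []),
]

print(count_load_use_stalls(slow), count_load_use_stalls(fast))  # 2 0
```

The fast schedule works because each load is separated from its first use by at least one independent instruction.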

5-Stage MIPS Datapath (has pipeline latches), Figure A.3, Page A-9
[Figure: datapath with stages Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back, separated by pipeline latches IF/ID, ID/EX, EX/MEM, MEM/WB]
• Simple design puts branch completion in stage 4 (Mem)

Control Hazard on Branch - Three Cycle Stall
    10: beq r1,r3,36
    14: and r2,r3,r5
    18: or  r6,r1,r7
    22: add r8,r1,r9
    36: xor r10,r1,r11
[Figure: the five instructions overlapped through Ifetch, Reg, ALU, DMem, Reg; stage labels: ID/RF, MEM]
What do you do with the 3 instructions in between? How do you do it? Where is the "commit"?

Branch Stall Impact if Commit in Stage 4
• If CPI = 1 and 15% of instructions are branches, a 3-cycle stall => new CPI = 1 + 0.15 x 3 = 1.45!
• Two-part solution:
  • Determine sooner whether the branch is taken or not, AND
  • Compute the taken branch address earlier
• MIPS branch tests if register = 0 or ≠ 0
• MIPS Solution:
  • Move zero_test to the ID/RF (Instr Decode & Register Fetch) stage (stage 2, instead of stage 4 = MEM)
  • Add an extra adder to calculate the new PC (Program Counter) in the ID/RF stage
  • Result is a 1 clock cycle penalty for a branch versus 3 when decided in MEM
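The CPI arithmetic on this slide generalizes to CPI = base CPI + branch fraction x branch penalty. A short sketch comparing the 3-cycle penalty (branch resolved in MEM) with the 1-cycle penalty (branch resolved in ID/RF); the function name is illustrative:

```python
def cpi_with_branch_stalls(base_cpi, branch_fraction, penalty_cycles):
    """Effective CPI when every branch costs a fixed number of stall cycles."""
    return base_cpi + branch_fraction * penalty_cycles

cpi_mem = cpi_with_branch_stalls(1.0, 0.15, 3)  # branch decided in stage 4 (MEM)
cpi_id = cpi_with_branch_stalls(1.0, 0.15, 1)   # branch decided in stage 2 (ID/RF)

print(round(cpi_mem, 2), round(cpi_id, 2))  # 1.45 1.15
```

This comparison ignores the longer stage-2 cycle time that the early-branch design may require, so it bounds the benefit rather than measuring it.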

Pipelined MIPS Datapath, Figure A.24, page A-38
[Figure: datapath with stages Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back, separated by pipeline latches IF/ID, ID/EX, EX/MEM, MEM/WB; includes a Zero? test and a second Adder]
• The fast_branch design needs a longer stage 2 cycle time, so the clock is slower for all stages.
• Interplay of instruction set design and cycle time.

Four Branch Hazard Alternatives
#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
• Execute the next instructions in sequence
• PC+4 already calculated, so use it to get next instruction
• Nullify bad instructions in pipeline if branch is actually taken
• Nullify easier since pipeline state updates are late (MEM, WB)
• 47% MIPS branches not taken on average
#3: Predict Branch Taken
• 53% MIPS branches taken on average
• But have not calculated branch target address in MIPS
  • MIPS still incurs 1 cycle branch penalty
  • Other machines: branch target known before outcome

Four Branch Hazard Alternatives
#4: Delayed Branch
• Define branch to take place AFTER a following instruction
    branch instruction
    sequential successor_1
    sequential successor_2
    ........
    sequential successor_n     (branch delay of length n)
    branch target if taken
• 1 slot delay allows proper decision and branch target address in 5 stage pipeline
• MIPS 1st used this (later versions of MIPS did not; pipeline deeper)

And In Conclusion: Control and Pipelining
• Quantify and summarize performance
  • Ratios, Geometric Mean, Multiplicative Standard Deviation
• F&P: Benchmarks age, disks fail, single-point failure
• Control via State Machines and Microprogramming
• Just overlap tasks; easy if tasks are independent
• Speed Up ≤ Pipeline Depth; if ideal CPI is 1, then:
\[ \text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}} \]
• Hazards limit performance on computers:
  • Structural: need more HW resources
  • Data (RAW, WAR, WAW): need forwarding, compiler scheduling
  • Control: delayed branch or branch (taken/not-taken) prediction
• Exceptions and interrupts add complexity
• Next time: Read Appendix C
• No class Tuesday 9/29/09, when Monday classes will run.

CSE 502 Graduate Computer Architecture Lec 6-7 – Memory Hierarchy Review Larry Wittie Computer Science, StonyBrook University http://www.cs.sunysb.edu/~cse502 and ~lw Slides adapted from David Patterson, UC-Berkeley cs252-s06

Since 1980, CPU has outpaced DRAM ...
[Figure: Performance (1/latency), scale marks 10, 100, 1000, vs. Year (1980, 1990, 2000). CPU: 60% per yr, 2X in 1.5 yrs. DRAM: 9% per yr, 2X in 10 yrs. Gap grew 50% per year.]
Q. How do architects address this gap?
A. Put smaller, faster "cache" memories between CPU and DRAM. Create a "memory hierarchy".
CSE502-F09, Lec 06+7-cache VM TLB

1977: DRAM faster than microprocessors
Apple II (1977) latencies:
• CPU: 1000 ns
• DRAM: 400 ns
[Photo: Steve Wozniak, Steve Jobs]

Memory Hierarchy: Apple iMac G5
iMac G5, 1.6 GHz: 1600x (memory: 7.3x) faster than the Apple II
[Figure labels: Managed by compiler; Managed by hardware; Managed by OS, hardware, application]
Goal: Illusion of large, fast, cheap memory
Let programs address a memory space that scales to the disk size, at a speed that is usually nearly as fast as register access

iMac's PowerPC 970 (G5): All caches on-chip
[Die photo labels: Registers (1/2 KB); L1 (64K Instruction); L1 (32K Data); 512K L2]

The Principle of Locality
• The Principle of Locality:
  • Programs access a relatively small portion of the address space at any instant of time.
• Two Different Types of Locality:
  • Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
  • Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
• For the last 15 years, HW has relied on locality for speed
Locality is a property of programs which is exploited in machine design.

Programs with locality cache well ...
[Figure: Memory Address (one dot per access) vs. Time; regions labeled Temporal Locality, Spatial Locality, and Bad locality behavior]
Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)

Memory Hierarchy: Terminology
• Hit: data appears in some block in the upper level (example: Block X)
  • Hit Rate: the fraction of memory accesses found in the upper level
  • Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
• Miss: data needs to be retrieved from a block in the lower level (Block Y)
  • Miss Rate = 1 - (Hit Rate)
  • Miss Penalty: time to replace a block in the upper level + time to deliver the block to the upper level
• Hit Time << Miss Penalty (= 500 instructions on 21264!)
[Figure: Upper Level Memory holding Blk X, exchanging data to and from the processor; Lower Level Memory holding Blk Y]

Cache Measures
• Hit rate: fraction found in that level
  • So high that we usually talk about miss rate
  • Miss rate fallacy: miss rate is to average memory access time as MIPS is to CPU performance (an indirect measure that can mislead)
• Average memory-access time = Hit time + Miss rate x Miss penalty (ns or clocks)
• Miss penalty: time to replace a block from lower level, including time to replace in CPU
  • {replacement time: time to make upper-level room for block}
  • access time: time to lower level = f(latency to lower level)
  • transfer time: time to transfer block = f(BW between upper & lower levels)
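Average memory-access time = Hit time + Miss rate x Miss penalty is easy to evaluate directly. The sketch below uses hypothetical numbers (1-cycle hit, 100-cycle miss penalty, two miss rates) to show why a small change in miss rate matters when the penalty is large:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory-access time, in the same units as hit_time and miss_penalty."""
    return hit_time + miss_rate * miss_penalty

# Hypothetical cache: 1-cycle hit, 100-cycle miss penalty.
amat_5pct = amat(1.0, 0.05, 100.0)  # 5% miss rate -> 6 cycles
amat_2pct = amat(1.0, 0.02, 100.0)  # 2% miss rate -> 3 cycles

print(amat_5pct, amat_2pct)
```

Dropping the miss rate from 5% to 2% halves the average access time here, even though the hit rate only moves from 95% to 98%.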

4 Questions for Memory Hierarchy
• Q1: Where can a block be placed in the upper level? (Block placement)
• Q2: How is a block found if it is in the upper level? (Block identification)
• Q3: Which block should be replaced on a miss? (Block replacement)
• Q4: What happens on a write? (Write strategy)
