

EENG 449bG/CPSC 439bG Computer Systems
Lecture 17
Memory Hierarchy Design Part I

April 7, 2005

Prof. Andreas Savvides

Spring 2005

[Figure: Processor-memory performance gap. µProc performance ("Moore's Law") grows ~60%/year, DRAM ("Less' Law?") lags far behind, so the gap grows ~50%/year.]
Who Cares About the Memory Hierarchy?


  • 1980: no cache in µproc; 1995: 2-level cache on chip (1989: first Intel µproc with an on-chip cache)

Review of Caches

  • Cache is the name given to the first level of the memory hierarchy encountered once the address leaves the CPU

    • Cache hit / cache miss when data is found / not found

    • Block – a fixed size collection of data containing the requested word

    • Spatial / temporal localities

    • Latency – determines the time to retrieve the first word of the block

    • Bandwidth – determines the time to retrieve the rest of the block

    • Address space is broken into fixed-size blocks called pages

    • Page fault – when the CPU references a page that is not in main memory

Generations of Microprocessors

  • Time of a full cache miss in instructions executed:

    1st Alpha: 340 ns / 5.0 ns =  68 clks × 2 = 136
    2nd Alpha: 266 ns / 3.3 ns =  80 clks × 4 = 320
    3rd Alpha: 180 ns / 1.7 ns = 108 clks × 6 = 648

  • 1/2× latency × 3× clock rate × 3× instr/clock ⇒ ~5× miss cost
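The table's arithmetic can be re-derived with a short sketch (Python, not part of the original deck; the slide quotes 108 clks for the 3rd Alpha where the raw division gives ~106):

```python
# Recompute the "full cache miss cost" figures: miss latency (ns) divided
# by cycle time (ns) gives stall cycles; multiplying by issue width gives
# instruction-issue slots lost per miss.
alphas = [
    # (name, miss latency ns, cycle time ns, instructions issued per clock)
    ("1st Alpha", 340, 5.0, 2),
    ("2nd Alpha", 266, 3.3, 4),
    ("3rd Alpha", 180, 1.7, 6),
]

results = {}
for name, miss_ns, cycle_ns, issue_width in alphas:
    stall_clks = int(miss_ns / cycle_ns)        # truncate to whole cycles
    results[name] = (stall_clks, stall_clks * issue_width)
    print(name, results[name])
```

The trend is the point: each generation loses more issue slots per miss, even though absolute miss latency halves.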

Processor-Memory Performance Gap “Tax”

    Processor          % Area (≈cost)   % Transistors (≈power)
  • Alpha 21164            37%                 77%
  • StrongArm SA110        61%                 94%
  • Pentium Pro            64%                 88%

    • 2 dies per package: Proc/I$/D$ + L2$

  • Caches have no “inherent value”, only try to close performance gap

What is a cache?

  • Small, fast storage used to improve average access time to slow memory.

  • Exploits spatial and temporal locality

  • In computer architecture, almost everything is a cache!

    • Registers “a cache” on variables – software managed

    • First-level cache a cache on second-level cache

    • Second-level cache a cache on memory

    • Memory a cache on disk (virtual memory)

    • TLB a cache on page table

    • Branch-prediction a cache on prediction information?







[Figure: memory hierarchy levels, down to disk, tape, etc.]

Review: Cache Performance

  • Separating out the memory component entirely:

    • AMAT = Average Memory Access Time = Hit time + Miss rate × Miss penalty

    • CPI_ALUOps does not include memory instructions

Impact on Performance

  • Suppose a processor executes at

    • Clock Rate = 200 MHz (5 ns per cycle), Ideal (no misses) CPI = 1.1

    • 50% arith/logic, 30% ld/st, 20% control

  • Suppose that 10% of memory operations get 50 cycle miss penalty

  • Suppose that 1% of instructions get same miss penalty

  • CPI = ideal CPI + average stalls per instruction
       = 1.1 (cycles/ins)
         + [0.30 (DataMops/ins) × 0.10 (miss/DataMop) × 50 (cycles/miss)]
         + [1 (InstMop/ins) × 0.01 (miss/InstMop) × 50 (cycles/miss)]
       = (1.1 + 1.5 + 0.5) cycles/ins = 3.1

  • Memory stalls are 2.0 of the 3.1 cycles/ins: the processor is stalled waiting for memory ~65% of the time!
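The CPI calculation above can be checked with a quick sketch (Python; all parameter values come straight from the slide):

```python
# CPI = ideal CPI + memory stall cycles per instruction.
ideal_cpi    = 1.1
data_ops     = 0.30   # fraction of instructions that are loads/stores
data_miss    = 0.10   # miss rate for those memory operations
inst_miss    = 0.01   # instruction-fetch miss rate
miss_penalty = 50     # cycles per miss

data_stalls = data_ops * data_miss * miss_penalty   # 1.5 cycles/ins
inst_stalls = 1.0 * inst_miss * miss_penalty        # 0.5 cycles/ins
cpi = ideal_cpi + data_stalls + inst_stalls         # 3.1

stall_fraction = (data_stalls + inst_stalls) / cpi
print(f"CPI = {cpi:.1f}, stalled {stall_fraction:.0%} of the time")
```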

Traditional Four Questions for Memory Hierarchy Designers

  • Q1: Where can a block be placed in the upper level? (Block placement)

    • Fully Associative, Set Associative, Direct Mapped

  • Q2: How is a block found if it is in the upper level? (Block identification)

    • Tag/Block

  • Q3: Which block should be replaced on a miss? (Block replacement)

    • Random, LRU

  • Q4: What happens on a write? (Write strategy)

    • Write Back or Write Through (with Write Buffer)

Set Associativity

  • Direct mapped = one-way set associative

    • (block address) MOD (Number of blocks in cache)

  • Fully associative = set associative with 1 set

    • Block can be placed anywhere in the cache

  • Set associative – a block can be placed in a restricted set of places

    • Block first mapped to a set and then placed anywhere in the set

      • (Block address) MOD (Number of sets in cache)

    • If there are n blocks in a set, then the cache is n-way set associative

  • Most popular cache configurations in today’s processors

    • Direct mapped, 2-way set associative, 4-way set associative
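The three placement policies above differ only in how many frames a block may occupy. A minimal sketch (Python; the 8-block cache and block address 12 are hypothetical values for illustration):

```python
# Where may block address 12 go, in a hypothetical 8-block cache?
num_blocks = 8
block_addr = 12

# Direct mapped (one-way set associative): exactly one legal frame.
direct_frame = block_addr % num_blocks        # (block address) MOD (blocks)

# n-way set associative: map to one set, then anywhere within that set.
ways = 2
num_sets = num_blocks // ways
set_index = block_addr % num_sets             # (block address) MOD (sets)

# Fully associative: a single set holding all frames; any frame is legal.
print(f"direct-mapped frame {direct_frame}, 2-way set {set_index}")
```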

Q2: How is a block found if it is in the cache?

The address is split into three fields:

  • Index – selects the set

  • Block offset – selects the desired data from the block

  • Tag – compared against the stored tags for a hit

  • If the cache size remains the same, increasing associativity increases the number of blocks per set => fewer sets, so the index shrinks and the tag grows

Q3: Which Block Should be Replaced on a Cache Miss?

  • Direct-mapped cache

    • No choice – a single block is checked for a hit. If there is a miss, data is fetched into that block

  • Fully associative and set associative

    • Random

    • Least Recently Used (LRU) – exploits locality principles

    • First In, First Out (FIFO) – approximates LRU
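LRU for one set can be sketched with an ordered map (Python; the `LRUSet` class and 2-way set are hypothetical illustrations, tags only, no data storage):

```python
from collections import OrderedDict

class LRUSet:
    """One cache set with LRU replacement; tracks tags only."""
    def __init__(self, ways):
        self.ways = ways
        self.tags = OrderedDict()          # least recently used tag first

    def access(self, tag):
        """Return True on hit; on a miss, evict the LRU tag if the set is full."""
        if tag in self.tags:
            self.tags.move_to_end(tag)     # mark as most recently used
            return True
        if len(self.tags) == self.ways:
            self.tags.popitem(last=False)  # evict least recently used
        self.tags[tag] = None
        return False

s = LRUSet(ways=2)
hits = [s.access(t) for t in ["A", "B", "A", "C", "B"]]
print(hits)  # A miss, B miss, A hit, C miss (evicts B, the LRU tag), B miss
```

A FIFO variant would simply skip the `move_to_end` on a hit, evicting the oldest-inserted tag instead of the least recently used one.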

Q4: What Happens on a Write?

  • Cache accesses dominated by reads

  • E.g., on MIPS, 10% stores and 37% loads ⇒ writes are 10/47 ≈ 21% of data cache traffic

  • Writes are much slower than reads

    • The data block is read from the cache at the same time the tag is read and compared

      • If the write is a hit the block is passed to the CPU

    • Writing cannot begin unless the address is a hit

  • Write through – information is written to both the cache and the lower-level memory

  • Write back – information only written to cache. Written to memory only on block replacement

    • Dirty bit – indicates whether a block has been modified while in the cache

Write Through vs. Write Back

  • Write back – writes occur at cache speed; multiple writes within a block require only one write to lower-level memory, when the block is replaced

  • Write through – slower, BUT cache is always clean

    • Cache read misses never result in writes at the lower level

    • Next lower level of the cache has the most current copy of the data
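The traffic difference between the two policies can be made concrete (Python; the 10-write stream hitting a single cached block is a hypothetical scenario for illustration):

```python
# Lower-level memory writes generated by 10 write hits to one cached block.
writes_to_block = 10

# Write through: every write is propagated to the lower level immediately.
wt_lower_level_writes = writes_to_block

# Write back: the block is marked dirty and written once, on replacement.
dirty = writes_to_block > 0
wb_lower_level_writes = 1 if dirty else 0

print(wt_lower_level_writes, wb_lower_level_writes)  # 10 vs 1
```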

Example: Alpha 21264 Data Cache

  • 2-way set associative

  • Write back

  • Each block has 64 bytes of data

    • Offset points to the data we want

    • Total cache size 65,536 bytes

  • Index: 2^9 = 512; the 9-bit index selects the set

  • Tag comparison determines if we have a hit

  • Victim buffer helps with write back

Address Breakdown

  • Physical address is 44 bits wide, 38-bit block address and 6-bit offset (2^6=64)

  • Calculating the cache index field size:

    Index bits = log2(Cache size / (Block size × Set associativity)) = log2(65,536 / (64 × 2)) = log2(512) = 9

  • Blocks are 64 bytes, so the offset needs 6 bits

  • Tag size = 38 – 9 = 29 bits
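The 21264 field sizes above follow mechanically from the cache parameters (Python sketch using the slide's numbers):

```python
from math import log2

# Alpha 21264 data cache parameters, as given on the slide.
phys_addr_bits = 44
cache_bytes    = 65536   # 64 KB total
block_bytes    = 64
ways           = 2

offset_bits = int(log2(block_bytes))                 # 6 bits (2^6 = 64)
num_sets    = cache_bytes // (block_bytes * ways)    # 512 sets
index_bits  = int(log2(num_sets))                    # 9 bits
block_addr_bits = phys_addr_bits - offset_bits       # 38 bits
tag_bits    = block_addr_bits - index_bits           # 38 - 9 = 29 bits

print(f"offset={offset_bits}  index={index_bits}  tag={tag_bits}")
```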

Writing to Cache

  • If word to be written is in the cache, first 3 steps are the same as read

  • The 21264 processor uses write back, so a block cannot simply be discarded on a miss

  • If the “victim” was modified its data and address are sent to the victim buffer.

  • A single data cache cannot supply all processor needs

    • Separate Instruction and Data Caches may be needed

Unified vs Split Caches

  • Unified vs Separate Instruction and Data caches

  • Example:

    • 16KB I&D: Inst miss rate=0.64%, Data miss rate=6.47%

    • 32KB unified: Aggregate miss rate=1.99%

    • Using miss rate in the evaluation may be misleading!

  • Which is better (ignore L2 cache)?

    • Assume 25% data ops ⇒ 75% of accesses are instruction fetches (1.0/1.33)

    • hit time=1, miss time=50

    • Note that data hit has 1 stall for unified cache (only one port)

      AMAT_Harvard = 75% × (1 + 0.64% × 50) + 25% × (1 + 6.47% × 50) = 2.05

      AMAT_Unified = 75% × (1 + 1.99% × 50) + 25% × (1 + 1 + 1.99% × 50) = 2.24
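The two AMAT figures can be reproduced directly (Python; the unified cache's extra data-access stall models its single port, and the exact unified value is 2.245, which the slide truncates to 2.24):

```python
# AMAT comparison: split (Harvard) vs unified cache, per the slide.
f_inst, f_data = 0.75, 0.25   # fraction of accesses: instructions vs data
penalty = 50                  # miss penalty in cycles; hit time = 1

amat_harvard = (f_inst * (1 + 0.0064 * penalty) +      # 16KB I-cache
                f_data * (1 + 0.0647 * penalty))       # 16KB D-cache

amat_unified = (f_inst * (1 + 0.0199 * penalty) +      # 32KB unified
                f_data * (1 + 1 + 0.0199 * penalty))   # +1 port-conflict stall

print(f"Harvard: {amat_harvard:.2f}  Unified: {amat_unified:.3f}")
```

Note that the unified cache wins on miss rate (1.99% vs the split caches' rates) yet loses on AMAT because of the structural hazard on its single port.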

Impact of Caches on Performance

  • Consider an in-order execution computer

    • Cache miss penalty 100 clock cycles, CPI=1

    • Average miss rate 2% and an average of 1.5 memory references per instruction

    • Average # of cache misses: 30 per 1000 instructions

      Performance with cache misses: CPI = 1 + (30/1000) × 100 = 4.0

Impact of Caches on Performance

  • Calculating the performance using miss rate:

    CPI = 1 + 1.5 (mem refs/ins) × 2% (miss rate) × 100 (miss penalty) = 4.0

  • 4× increase in CPU time relative to a "perfect cache"

  • No cache: CPI = 1.0 + 100 × 1.5 = 151 – a factor of ~40 compared to the system with cache

  • Minimizing memory accesses does not always imply a reduction in CPU time
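The with-cache, no-cache comparison above can be reproduced from the slide's parameters (Python sketch):

```python
# Slide parameters: base CPI 1, 2% miss rate, 1.5 memory refs/instruction,
# 100-cycle miss penalty.
base_cpi     = 1.0
mem_refs     = 1.5
miss_rate    = 0.02
miss_penalty = 100

misses_per_inst = mem_refs * miss_rate                       # 0.03 = 30/1000
cpi_with_cache  = base_cpi + misses_per_inst * miss_penalty  # 4.0

# Without any cache, every memory reference pays the full penalty.
cpi_no_cache = base_cpi + mem_refs * miss_penalty            # 151

print(cpi_with_cache, cpi_no_cache, cpi_no_cache / cpi_with_cache)
```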

How to Improve Cache Performance?

Four main categories of optimizations:

  1. Reducing the miss penalty
     – multilevel caches, critical word first, read miss before write miss, merging write buffers, victim caches

  2. Reducing the miss rate
     – larger block size, larger cache size, higher associativity, way prediction and pseudoassociativity, compiler optimizations

  3. Reducing the miss penalty or miss rate via parallelism
     – non-blocking caches, hardware prefetching, compiler prefetching

  4. Reducing the time to hit in the cache
     – small and simple caches, avoiding address translation, pipelined cache access

Where Do Misses Come From?

  • Classifying Misses: 3 Cs

    • Compulsory – the first access to a block is not in the cache, so the block must be brought into the cache. Also called cold-start misses or first-reference misses. (Misses in even an infinite cache)

    • Capacity – if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. (Misses in a fully associative cache of size X)

    • Conflict – if the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory & capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses. (Misses in an N-way associative cache of size X)

Cache Organization?

  • Assume total cache size is not changed

  • What happens if we:

    1) Change block size

    2) Change associativity

    3) Change compiler

  • Which of the 3Cs is obviously affected?

Increasing Block Size

  • Larger block sizes reduce compulsory misses

    • Takes advantage of spatial locality

  • Larger blocks increase miss penalty

    • Our goal: reduce miss rate and miss penalty!

  • Block size selection depends on latency and bandwidth of lower level memory

    • High latency & high BW => large block sizes

    • Low latency & low BW => smaller block sizes

Larger Block Size (fixed size&assoc)

What else drives up block size?

Cache Size

  • Old rule of thumb: 2x size => 25% cut in miss rate

  • What does it reduce?

Increase Associativity

  • Higher set-associativity improves miss rates

  • Rule of thumb for caches

    • A direct-mapped cache of size N has about the same miss rate as a two-way set-associative cache of size N/2

    • Holds for cache sizes < 128 KB

  • Disadvantages

    • Reduces the miss rate, but greater associativity can come at the cost of increased hit time



3Cs Relative Miss Rate


Flaws: for fixed block size

Good: insight => invention

Associativity vs Cycle Time

  • Beware: execution time is the only final measure!

  • Why is cycle time tied to hit time?

  • Will Clock Cycle time increase?

Next Time

  • Cache Tradeoffs for Performance

  • Reducing Hit Times