EENG 449bG/CPSC 439bG Computer Systems
Lecture 17: Memory Hierarchy Design Part I

April 7, 2005

Prof. Andreas Savvides

Spring 2005

http://www.eng.yale.edu/courses/2005s/eeng449b

Who Cares About the Memory Hierarchy?

[Figure: processor vs. DRAM performance, 1980-2000, log scale. µProc performance grows 60%/yr ("Moore's Law"); DRAM performance grows 7%/yr ("Less' Law?"); the processor-memory performance gap (the CPU-DRAM gap) grows 50% per year.]
  • 1980: no cache in µproc; 1995: 2-level cache on chip (1989: first Intel µproc with an on-chip cache)
Review of Caches
  • Cache is the name given to the first level of the memory hierarchy encountered once the address leaves the CPU
    • Cache hit / cache miss when data is found / not found
    • Block – a fixed size collection of data containing the requested word
    • Spatial / temporal localities
    • Latency – determines the time to retrieve the first word of the block
    • Bandwidth – determines the time to retrieve the rest of the block
    • Address space is broken into fixed-size blocks called pages
    • Page fault – when the CPU references something that is in neither the cache nor main memory
Generations of Microprocessors
  • Time of a full cache miss in instructions executed:

    1st Alpha: 340 ns / 5.0 ns =  68 clks × 2 instr/clk, or 136 instructions
    2nd Alpha: 266 ns / 3.3 ns =  80 clks × 4 instr/clk, or 320 instructions
    3rd Alpha: 180 ns / 1.7 ns = 108 clks × 6 instr/clk, or 648 instructions

  • 1/2X latency × 3X clock rate × 3X instr/clock => ~5X
Processor-Memory Performance Gap “Tax”

Processor           % Area (~cost)   % Transistors (~power)
Alpha 21164         37%              77%
StrongArm SA110     61%              94%
Pentium Pro         64%              88%
  (Pentium Pro: 2 dies per package: Proc/I$/D$ + L2$)
  • Caches have no "inherent value"; they only try to close the performance gap
What is a cache?
  • Small, fast storage used to improve average access time to slow memory.
  • Exploits spatial and temporal locality
  • In computer architecture, almost everything is a cache!
    • Registers “a cache” on variables – software managed
    • First-level cache a cache on second-level cache
    • Second-level cache a cache on memory
    • Memory a cache on disk (virtual memory)
    • TLB a cache on page table
    • Branch-prediction a cache on prediction information?

[Figure: the memory hierarchy. Proc/Regs at the top, then L1-Cache, L2-Cache, Memory, and Disk/Tape at the bottom; levels get bigger toward the bottom and faster toward the top.]

Review: Cache Performance

  • Miss-oriented approach to memory access:

    CPUtime = IC × (CPI_Execution + MemAccess/Inst × MissRate × MissPenalty) × CycleTime

    • CPI_Execution includes ALU and memory instructions
  • Separating out the memory component entirely:

    CPUtime = IC × (CPI_ALUOps × ALUOps/Inst + MemAccess/Inst × AMAT) × CycleTime
    AMAT = Average Memory Access Time = HitTime + MissRate × MissPenalty

    • CPI_ALUOps does not include memory instructions
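These formulas translate directly into code. A minimal sketch (the function and parameter names are mine, not the lecture's):

```python
def amat(hit_time, miss_rate, miss_penalty):
    """AMAT = HitTime + MissRate x MissPenalty (in clock cycles)."""
    return hit_time + miss_rate * miss_penalty

def cpu_time(ic, cpi_execution, mem_per_inst, miss_rate, miss_penalty, cycle_time):
    """Miss-oriented form: IC x (CPI_Execution + MemAccess/Inst x MissRate x MissPenalty) x CycleTime."""
    return ic * (cpi_execution + mem_per_inst * miss_rate * miss_penalty) * cycle_time

print(amat(1, 0.02, 100))  # e.g. hit time 1, 2% miss rate, 100-cycle penalty -> 3.0
```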
Impact on Performance
  • Suppose a processor executes at:
    • Clock Rate = 200 MHz (5 ns per cycle), ideal (no misses) CPI = 1.1
    • 50% arith/logic, 30% ld/st, 20% control
  • Suppose that 10% of memory operations incur a 50-cycle miss penalty
  • Suppose that 1% of instructions incur the same miss penalty
  • CPI = ideal CPI + average stalls per instruction
        = 1.1 (cycles/ins)
        + [0.30 (DataMops/ins) × 0.10 (miss/DataMop) × 50 (cycles/miss)]
        + [1 (InstMop/ins) × 0.01 (miss/InstMop) × 50 (cycles/miss)]
        = (1.1 + 1.5 + 0.5) cycles/ins = 3.1
  • About 65% of the time (2.0 of 3.1 cycles per instruction) the processor is stalled waiting for memory! (See the sketch below.)
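The same arithmetic in a few lines of Python, mirroring the slide's numbers:

```python
ideal_cpi = 1.1
data_stalls = 0.30 * 0.10 * 50  # ld/st per inst x miss rate x penalty = 1.5 cycles/inst
inst_stalls = 1.00 * 0.01 * 50  # fetches per inst x miss rate x penalty = 0.5 cycles/inst
cpi = ideal_cpi + data_stalls + inst_stalls
print(f"CPI = {cpi:.1f}")                                  # 3.1
print(f"stalled {(data_stalls + inst_stalls) / cpi:.0%}")  # ~65% of cycles are memory stalls
```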
Traditional Four Questions for Memory Hierarchy Designers
  • Q1: Where can a block be placed in the upper level? (Block placement)
    • Fully Associative, Set Associative, Direct Mapped
  • Q2: How is a block found if it is in the upper level? (Block identification)
    • Tag/Block
  • Q3: Which block should be replaced on a miss? (Block replacement)
    • Random, LRU
  • Q4: What happens on a write? (Write strategy)
    • Write Back or Write Through (with Write Buffer)
Set Associativity
  • Direct mapped = one-way set associative
    • (block address) MOD (Number of blocks in cache)
  • Fully associative = set associative with 1 set
    • Block can be placed anywhere in the cache
  • Set associative – a block can be placed in a restricted set of places
    • Block first mapped to a set and then placed anywhere in the set
      • (Block address) MOD (Number of sets in cache)
    • If there are n blocks in a set, then the cache is n-way set associative
  • Most popular cache configurations in today’s processors
    • Direct mapped, 2-way set associative, 4-way set associative
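The mapping rules above are easy to see in code. A minimal sketch (names are illustrative, not from the lecture):

```python
def set_index(block_addr, num_blocks, assoc):
    """A block first maps to a set: (block address) MOD (number of sets)."""
    num_sets = num_blocks // assoc
    return block_addr % num_sets

# An 8-block cache: direct mapped (1-way), 2-way, and fully associative (8 ways, 1 set)
for assoc in (1, 2, 8):
    print(f"{assoc}-way:", [set_index(b, 8, assoc) for b in (0, 5, 13)])
```

With one way every block address forces a single location; with 8 ways there is a single set, so any block can go anywhere, exactly the direct-mapped and fully associative special cases above.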
Q2: How is a block found if it is in the cache?

[Figure: the address is divided into tag, index, and block offset. The index selects the set; the block offset selects the desired data from the block; the tag is compared against for a hit.]

  • If the cache size remains the same, increasing associativity increases the number of blocks per set => decreased index size and increased tag size
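That index/tag tradeoff can be checked with a few lines. A sketch using the cache parameters from the Alpha 21264 example later in this lecture:

```python
from math import log2

CACHE_BYTES, BLOCK_BYTES, ADDR_BITS = 65536, 64, 44  # the Alpha 21264 D-cache numbers
offset_bits = int(log2(BLOCK_BYTES))                 # 6 bits select the byte in a block
for assoc in (1, 2, 4, 8):
    sets = CACHE_BYTES // (BLOCK_BYTES * assoc)
    index_bits = int(log2(sets))                     # selects the set
    tag_bits = ADDR_BITS - index_bits - offset_bits  # compared against for a hit
    print(f"{assoc}-way: index = {index_bits} bits, tag = {tag_bits} bits")
```

Doubling the associativity halves the number of sets, shaving one bit off the index and adding it to the tag.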
Q3: Which Block Should be Replaced on a Cache Miss?
  • Direct-mapped cache
    • No choice: a single block is checked for a hit; on a miss, data is fetched into that block
  • Fully associative and set associative
    • Random
    • Least Recently Used (LRU) – exploits the locality principles
    • First In, First Out (FIFO) – approximates LRU (see the sketch below)
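A toy sketch of the three policies acting on a single cache set (illustrative code, not the lecture's):

```python
import random
from collections import OrderedDict

def misses(policy, refs, ways=4):
    """Count misses in one cache set under a given replacement policy."""
    blocks = OrderedDict()  # insertion order doubles as FIFO/LRU order
    count = 0
    for tag in refs:
        if tag in blocks:
            if policy == "lru":
                blocks.move_to_end(tag)  # refresh recency on a hit; FIFO does not
        else:
            count += 1
            if len(blocks) == ways:      # set is full: evict a victim
                victim = (random.choice(list(blocks)) if policy == "random"
                          else next(iter(blocks)))  # oldest entry for LRU/FIFO
                del blocks[victim]
            blocks[tag] = True
    return count

refs = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
for p in ("lru", "fifo", "random"):
    print(p, misses(p, refs))
```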
Q4: What Happens on a Write?
  • Cache accesses are dominated by reads
  • E.g., on MIPS: 10% stores and 37% loads, so writes are 10 / (10 + 37) ≈ 21% of data cache traffic
  • Writes are much slower than reads
    • A block can be read from the cache at the same time the tag is read and compared
      • If the read is a hit, the requested data is passed to the CPU; if it is a miss, the data is ignored
    • Writing cannot begin unless the address is a hit
  • Write through – information is written to both the cache and the lower-level memory
  • Write back – information is written only to the cache, and written to memory only on block replacement
    • Dirty bit – indicates whether a block has been changed while in the cache

Write Through vs. Write Back
  • Write back – writes occur at cache speed; all writes to a block are propagated once, when the block is written back
  • Write through – slower, BUT the cache is always clean
    • Cache read misses never result in writes at the lower level
    • The next lower level of the cache has the most current copy of the data
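To make the traffic difference concrete, here is a minimal sketch (a hypothetical helper, not from the lecture) counting how many writes reach the next memory level under each policy:

```python
def lower_level_writes(stores_to_block, policy):
    """Writes that reach the next memory level for one cached block."""
    if policy == "write_through":
        return stores_to_block               # every store also updates memory
    if policy == "write_back":
        return 1 if stores_to_block else 0   # one write-back of the dirty block on eviction
    raise ValueError(policy)

for n in (1, 4, 16):
    print(n, "stores:", lower_level_writes(n, "write_through"), "(WT) vs",
          lower_level_writes(n, "write_back"), "(WB)")
```

The more often a block is rewritten before eviction, the bigger write back's bandwidth advantage; write through's advantage is the always-clean cache noted above.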
Example: Alpha 21264 Data Cache
  • 2-way set associative
  • Write back
  • Each block has 64 bytes of data
    • The block offset selects the data we want
    • Total cache size: 65,536 bytes
  • Index: 2^9 = 512 sets; the index selects the set
  • Tag comparison determines whether we have a hit
  • A victim buffer helps with write back
Address Breakdown
  • Physical address is 44 bits wide: a 38-bit block address and a 6-bit offset (2^6 = 64)
  • Calculating the cache index field size:
    2^index = cache size / (block size × associativity) = 65,536 / (64 × 2) = 512, so the index is 9 bits
  • Blocks are 64 bytes, so the offset needs 6 bits
  • Tag size = 38 − 9 = 29 bits
Writing to cache
  • If the word to be written is in the cache, the first 3 steps are the same as for a read
  • The 21264 uses write back, so the block cannot simply be discarded on a miss
  • If the "victim" block was modified, its data and address are sent to the victim buffer
  • A data cache cannot satisfy all processor needs
    • Separate instruction and data caches may be needed

[Figure: two cache organizations. Split (Harvard): the processor has separate I-Cache-1 and D-Cache-1, backed by a Unified Cache-2. Unified: the processor has a single Unified Cache-1, backed by a Unified Cache-2.]

Unified vs Split Caches
  • Unified vs Separate Instruction and Data caches
  • Example:
    • 16KB I&D: Inst miss rate=0.64%, Data miss rate=6.47%
    • 32KB unified: Aggregate miss rate=1.99%
    • Using miss rate in the evaluation may be misleading!
  • Which is better (ignoring the L2 cache)?
    • Assume 25% data ops => 75% of accesses are instruction fetches (1.0/1.33)
    • hit time = 1, miss penalty = 50
    • Note that a data hit incurs 1 extra stall cycle in the unified cache (it has only one port)

AMAT_Harvard = 75% × (1 + 0.64% × 50) + 25% × (1 + 6.47% × 50) = 2.05
AMAT_Unified = 75% × (1 + 1.99% × 50) + 25% × (1 + 1 + 1.99% × 50) = 2.24
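Both figures can be reproduced directly, a quick check on the slide's arithmetic:

```python
hit, penalty = 1, 50
inst_frac, data_frac = 0.75, 0.25

amat_split = inst_frac * (hit + 0.0064 * penalty) + data_frac * (hit + 0.0647 * penalty)
# The unified cache has one port, so a data access pays one extra stall cycle.
amat_unified = (inst_frac * (hit + 0.0199 * penalty)
                + data_frac * (hit + 1 + 0.0199 * penalty))
print(f"Harvard: {amat_split:.2f}, Unified: {amat_unified:.2f}")  # ~2.05 vs ~2.24
```

So despite the unified cache's lower aggregate miss rate, the split (Harvard) organization wins on average memory access time, which is why miss rate alone is misleading.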

Impact of Caches on Performance
  • Consider an in-order execution computer
    • Cache miss penalty: 100 clock cycles; CPI = 1 when all accesses hit
    • Average miss rate 2% and an average of 1.5 memory references per instruction
    • Average number of cache misses: 30 per 1000 instructions (2% × 1.5 × 1000)

Performance with cache misses:

CPI = 1 + (30 / 1000) × 100 = 4.0 clock cycles per instruction

Impact of Caches on Performance
  • Calculating the performance using the miss rate gives the same result:
    CPI = 1 + 1.5 (refs/inst) × 2% (miss rate) × 100 (cycles/miss) = 4.0
  • A 4x increase in CPU time relative to a "perfect cache"
  • With no cache at all: CPI = 1.0 + 100 × 1.5 = 151, roughly 38 times slower than the system with a cache
  • Minimizing memory accesses does not always imply a reduction in CPU time
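The same comparison in code, mirroring the slide's numbers:

```python
miss_penalty, refs_per_inst, miss_rate = 100, 1.5, 0.02

cpi_perfect  = 1.0
cpi_cache    = cpi_perfect + refs_per_inst * miss_rate * miss_penalty  # 4.0
cpi_no_cache = cpi_perfect + refs_per_inst * miss_penalty              # 151.0
print(cpi_cache / cpi_perfect)    # 4.0x a perfect cache
print(cpi_no_cache / cpi_cache)   # ~37.8x the system with a cache
```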
How to Improve Cache Performance?

Four main categories of optimizations:

1. Reduce the miss penalty
   - multilevel caches, critical word first, read miss before write miss, merging write buffers, and victim caches
2. Reduce the miss rate
   - larger block size, larger cache size, higher associativity, way prediction and pseudo-associativity, and compiler optimizations
3. Reduce the miss penalty or miss rate via parallelism
   - non-blocking caches, hardware prefetching, and compiler prefetching
4. Reduce the time to hit in the cache
   - small and simple caches, avoiding address translation, and pipelined cache access

Where do misses come from?
  • Classifying Misses: 3 Cs
    • Compulsory – the first access to a block is not in the cache, so the block must be brought into the cache. Also called cold-start misses or first-reference misses. (Misses in even an infinite cache)
    • Capacity – if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur because blocks are discarded and later retrieved. (Misses in a fully associative cache of size X)
    • Conflict – if the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses. (Misses in an N-way associative cache of size X)
Cache Organization?
  • Assume total cache size is not changed
  • What happens if we:
    1) Change block size
    2) Change associativity
    3) Change compiler
  • Which of the 3Cs is obviously affected?

Increasing Block Size
  • Larger block sizes reduce compulsory misses
    • Takes advantage of spatial locality
  • Larger blocks increase miss penalty
    • Our goal: reduce miss rate and miss penalty!
  • Block size selection depends on latency and bandwidth of lower level memory
    • High latency & high BW => large block sizes
    • Low latency & low BW => smaller block sizes
Larger Block Size (fixed size & assoc)

[Figure: miss rate vs. block size for a fixed cache size and associativity. Larger blocks reduce compulsory misses, but beyond a point they increase conflict misses.]

What else drives up block size?

Cache Size
  • Old rule of thumb: 2x size => 25% cut in miss rate
  • What does it reduce?
Increase Associativity
  • Higher set associativity improves miss rates
  • Rule of thumb for caches:
    • A direct-mapped cache of size N has about the same miss rate as a two-way set-associative cache of size N/2
    • Holds for caches smaller than 128 KB
  • Disadvantages
    • Reducing the miss rate this way can increase the miss penalty
    • Greater associativity can come at the cost of increased hit time
3Cs Relative Miss Rate

[Figure: 3Cs relative miss rate vs. cache size and associativity; higher associativity shrinks the conflict-miss component.]

  • Flaws: for fixed block size
  • Good: insight => invention

Associativity vs Cycle Time
  • Beware: Execution time is only final measure!
  • Why is cycle time tied to hit time?
  • Will Clock Cycle time increase?
Next Time
  • Cache Tradeoffs for Performance
  • Reducing Hit Times