
EENG 449bG/CPSC 439bG Computer Systems
Lecture 17: Memory Hierarchy Design Part I

April 7, 2005

Prof. Andreas Savvides

Spring 2005

http://www.eng.yale.edu/courses/2005s/eeng449b


Who Cares About the Memory Hierarchy?

[Figure: CPU-DRAM performance gap, 1980-2000. Processor performance ("Moore's Law") improves about 60%/year, while DRAM performance ("Less' Law?") improves about 7%/year, so the processor-memory performance gap grows about 50%/year.]

  • 1980: no cache in µproc; 1995: 2-level cache on chip (1989: first Intel µproc with an on-chip cache)



Review of Caches

  • Cache is the name given to the first level of the memory hierarchy encountered once the address leaves the CPU

    • Cache hit / cache miss when data is found / not found

    • Block – a fixed size collection of data containing the requested word

    • Spatial / temporal localities

    • Latency – determines the time to retrieve the first word of the block

    • Bandwidth – determines the time to retrieve the rest of the block

    • Address space is broken into fixed-size blocks called pages

    • Page fault – when the CPU references something that is not in main memory and must be fetched from disk



Generations of Microprocessors

  • Time of a full cache miss in instructions executed:

    1st Alpha: 340 ns / 5.0 ns = 68 clks × 2 instr/clk = 136 instructions

    2nd Alpha: 266 ns / 3.3 ns = 80 clks × 4 instr/clk = 320 instructions

    3rd Alpha: 180 ns / 1.7 ns = 108 clks × 6 instr/clk = 648 instructions

  • 1/2X latency × 3X clock rate × 3X instr/clock => ~5X more instructions lost per miss
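
A quick check of this arithmetic as a minimal sketch (the latency, cycle-time, and issue-width figures are from the table above; rounding differs slightly from the slide's totals):

```python
# Cost of a full cache miss measured in lost instruction slots:
# (miss latency / cycle time) = stall cycles; times the issue width
# = instruction slots that could have executed during the miss.
alphas = [
    # (name, miss latency in ns, cycle time in ns, instructions/clock)
    ("1st Alpha", 340.0, 5.0, 2),
    ("2nd Alpha", 266.0, 3.3, 4),
    ("3rd Alpha", 180.0, 1.7, 6),
]

for name, miss_ns, cycle_ns, issue_width in alphas:
    stall_clks = miss_ns / cycle_ns
    lost_slots = stall_clks * issue_width
    print(f"{name}: {stall_clks:.0f} clks x {issue_width} = {lost_slots:.0f} instructions")
```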



Processor-Memory Performance Gap “Tax”

  Processor          % Area (≈cost)   % Transistors (≈power)

  • Alpha 21164            37%                77%

  • StrongARM SA110        61%                94%

  • Pentium Pro            64%                88%

    • 2 dies per package: Proc/I$/D$ + L2$

  • Caches have no “inherent value”; they only try to close the performance gap



What is a cache?

  • Small, fast storage used to improve average access time to slow memory.

  • Exploits spatial and temporal locality

  • In computer architecture, almost everything is a cache!

    • Registers “a cache” on variables – software managed

    • First-level cache a cache on second-level cache

    • Second-level cache a cache on memory

    • Memory a cache on disk (virtual memory)

    • TLB a cache on page table

    • Branch-prediction a cache on prediction information?

[Figure: the memory hierarchy – Proc/Regs at the top, then L1-Cache, L2-Cache, Memory, and Disk/Tape at the bottom; levels get bigger toward the bottom and faster toward the top.]


Review: Cache Performance

  • Miss-oriented approach to memory access:

    CPU time = IC × (CPI_Execution + MemAccess/Inst × MissRate × MissPenalty) × CycleTime

    • CPI_Execution includes ALU and memory instructions

  • Separating out the memory component entirely:

    AMAT = HitTime + MissRate × MissPenalty

    CPU time = IC × (AluOps/Inst × CPI_AluOps + MemAccess/Inst × AMAT) × CycleTime

    • AMAT = Average Memory Access Time

    • CPI_AluOps does not include memory instructions
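
The two accounting schemes above give the same CPU time when applied consistently; a minimal sketch (function and variable names are mine, not from the lecture):

```python
def cpu_time_miss_oriented(ic, cpi_execution, mem_per_inst,
                           miss_rate, miss_penalty, cycle_time):
    # CPI_Execution already charges the hit time of every access;
    # each memory access adds MissRate x MissPenalty stall cycles.
    return ic * (cpi_execution
                 + mem_per_inst * miss_rate * miss_penalty) * cycle_time

def cpu_time_amat(ic, alu_per_inst, cpi_alu_ops, mem_per_inst,
                  hit_time, miss_rate, miss_penalty, cycle_time):
    # Here every memory access is charged its full average access time.
    amat = hit_time + miss_rate * miss_penalty
    return ic * (alu_per_inst * cpi_alu_ops
                 + mem_per_inst * amat) * cycle_time
```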



Impact on Performance

  • Suppose a processor executes at

    • Clock Rate = 200 MHz (5 ns per cycle), Ideal (no misses) CPI = 1.1

    • 50% arith/logic, 30% ld/st, 20% control

  • Suppose that 10% of memory operations get 50 cycle miss penalty

  • Suppose that 1% of instructions get the same miss penalty

  • CPI = ideal CPI + average stalls per instruction
    = 1.1 (cycles/inst)
    + [0.30 (data mops/inst) × 0.10 (miss/data mop) × 50 (cycles/miss)]
    + [1 (inst mop/inst) × 0.01 (miss/inst mop) × 50 (cycles/miss)]
    = (1.1 + 1.5 + 0.5) cycles/inst = 3.1

  • Roughly 65% of the time (2.0 of the 3.1 cycles per instruction) the processor is stalled waiting for memory!
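
The same computation as a runnable sketch (variable names are illustrative):

```python
ideal_cpi = 1.1
data_ops_per_inst = 0.30   # 30% of instructions are loads/stores
data_miss_rate = 0.10      # 10% of memory operations miss
inst_miss_rate = 0.01      # 1% of instruction fetches miss
miss_penalty = 50          # cycles per miss

stalls = (data_ops_per_inst * data_miss_rate * miss_penalty
          + 1.0 * inst_miss_rate * miss_penalty)
cpi = ideal_cpi + stalls
print(f"CPI = {cpi:.1f}, stalled fraction = {stalls / cpi:.0%}")
# CPI = 3.1, stalled fraction = 65%
```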



Traditional Four Questions for Memory Hierarchy Designers

  • Q1: Where can a block be placed in the upper level? (Block placement)

    • Fully Associative, Set Associative, Direct Mapped

  • Q2: How is a block found if it is in the upper level? (Block identification)

    • Tag/Block

  • Q3: Which block should be replaced on a miss? (Block replacement)

    • Random, LRU

  • Q4: What happens on a write? (Write strategy)

    • Write Back or Write Through (with Write Buffer)



Q1: Where can a Block Be Placed in a Cache?



Set Associativity

  • Direct mapped = one-way set associative

    • (block address) MOD (Number of blocks in cache)

  • Fully associative = set associative with 1 set

    • Block can be placed anywhere in the cache

  • Set associative – a block can be placed in a restricted set of places

    • Block first mapped to a set and then placed anywhere in the set

      • (Block address) MOD (Number of sets in cache)

    • If there are n blocks in a set, then the cache is n-way set associative

  • Most popular cache configurations in today’s processors

    • Direct mapped, 2-way set associative, 4-way set associative
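
A minimal sketch of the placement arithmetic above (function names are mine):

```python
def direct_mapped_frame(block_addr, num_blocks):
    # One-way set associative: exactly one candidate block frame.
    return block_addr % num_blocks

def set_index(block_addr, num_sets):
    # n-way set associative: the block maps to one set and may occupy
    # any of the n frames within that set; fully associative is the
    # special case num_sets == 1 (any frame in the cache).
    return block_addr % num_sets

block_addr = 1000
print(direct_mapped_frame(block_addr, 512))  # 488: the only legal frame
print(set_index(block_addr, 512 // 2))       # 232: a set of 2 candidate frames
```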



Q2: How is a block found if it is in the cache?

[Figure: the address is split into a tag, an index, and a block offset – the index selects the set, the block offset selects the desired data from the block, and the tag is compared against for a hit.]

  • If the cache size remains the same, increasing associativity increases the number of blocks per set => the index field shrinks and the tag field grows



Q3: Which Block Should be Replaced on a Cache Miss?

  • Direct-mapped cache

    • No choice – a single block frame is checked for a hit; on a miss, the data is fetched into that frame

  • Fully associative and set associative

    • Random

    • Least Recently Used (LRU) – exploits locality principles

    • First In, First Out (FIFO) – approximates LRU
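
A common way to model LRU bookkeeping in software is an ordered map; a minimal sketch of one set with LRU replacement (class and method names are mine, assuming Python's collections.OrderedDict):

```python
from collections import OrderedDict

class LRUSet:
    """One set of an n-way set-associative cache with LRU replacement."""

    def __init__(self, num_ways):
        self.num_ways = num_ways
        self.blocks = OrderedDict()  # tag -> data; least recent first

    def access(self, tag):
        if tag in self.blocks:
            self.blocks.move_to_end(tag)     # mark as most recently used
            return True                      # hit
        if len(self.blocks) >= self.num_ways:
            self.blocks.popitem(last=False)  # evict least recently used
        self.blocks[tag] = None
        return False                         # miss

s = LRUSet(num_ways=2)
print([s.access(t) for t in [1, 2, 1, 3, 2]])
# [False, False, True, False, False] – tag 2 was evicted to make room for tag 3
```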



Q4: What Happens on a Write?

  • Cache accesses are dominated by reads

  • E.g., on MIPS, 10% stores and 37% loads means stores are 10/(10+37) ≈ 21% of data cache traffic

  • Writes are much slower than reads

    • On a read, the block can be read from the cache at the same time its tag is read and compared

      • If the read is a hit, the requested block is passed to the CPU immediately

    • Writing cannot begin until the tag check confirms the address is a hit

  • Write through – information is written to both the cache and the lower-level memory

  • Write back – information is written only to the cache; it is written to memory only on block replacement

    • Dirty bit – used to indicate whether a block has been changed while in the cache



Write Through vs. Write Back

  • Write back – writes occur at cache speed, and each block is written to lower-level memory only once, when it is replaced

  • Write through – slower, BUT cache is always clean

    • Cache read misses never result in writes at the lower level

    • Next lower level of the cache has the most current copy of the data
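
A minimal sketch contrasting the two policies, using the dirty bit from the previous slide (class and method names are mine; real caches implement this in hardware):

```python
class WriteBackBlock:
    def __init__(self):
        self.dirty = False  # has the block been changed while in cache?

    def write(self, cache, memory, addr, data):
        cache[addr] = data
        self.dirty = True   # lower-level memory is now stale

    def evict(self, cache, memory, addr):
        if self.dirty:      # written to memory only on block replacement
            memory[addr] = cache[addr]
        del cache[addr]

class WriteThroughBlock:
    def write(self, cache, memory, addr, data):
        cache[addr] = data
        memory[addr] = data  # memory is always current; no dirty bit

    def evict(self, cache, memory, addr):
        del cache[addr]      # nothing to write back; cache is clean

cache, memory = {}, {}
blk = WriteBackBlock()
blk.write(cache, memory, 0x40, "data")  # fast: cache only
blk.evict(cache, memory, 0x40)          # dirty, so memory is updated here
```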



Example: Alpha 21264 Data Cache

  • 2-way set associative

  • Write back

  • Each block has 64 bytes of data

    • Offset points to the data we want

    • Total cache size 65,536 bytes

  • The 9-bit index selects one of 2^9 = 512 sets

  • Tag comparison determines if we have a hit

  • A victim buffer helps with write back



Address Breakdown

  • The physical address is 44 bits wide: a 38-bit block address and a 6-bit block offset (2^6 = 64)

  • Calculating the cache index field size:

    Index bits = log2(Cache size / (Block size × Set associativity)) = log2(65,536 / (64 × 2)) = log2(512) = 9

  • Blocks are 64 bytes, so the offset needs 6 bits

  • Tag size = 38 – 9 = 29 bits (see the sketch below)
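
The same field-width arithmetic as a sketch (the function and parameter names are mine; the 21264 numbers are from the slide):

```python
import math

def address_fields(cache_bytes, block_bytes, ways, block_addr_bits):
    offset_bits = int(math.log2(block_bytes))
    num_sets = cache_bytes // (block_bytes * ways)
    index_bits = int(math.log2(num_sets))
    tag_bits = block_addr_bits - index_bits
    return offset_bits, index_bits, tag_bits

# Alpha 21264 data cache: 64 KB, 64-byte blocks, 2-way, 38-bit block address
print(address_fields(65536, 64, 2, 38))  # (6, 9, 29)
```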



Writing to cache

  • If the word to be written is in the cache, the first three steps are the same as for a read

  • The 21264 uses write back, so a block cannot simply be discarded on a miss

  • If the “victim” was modified its data and address are sent to the victim buffer.

  • A data cache alone cannot supply all processor needs (instructions must be fetched as well)

    • Separate instruction and data caches may be needed


Unified vs Split Caches

[Figure: two organizations – a processor with split first-level caches (I-Cache-1 and D-Cache-1) backed by a Unified Cache-2, versus a processor with a Unified Cache-1 backed by a Unified Cache-2.]

  • Unified vs Separate Instruction and Data caches

  • Example:

    • 16KB I&D: Inst miss rate=0.64%, Data miss rate=6.47%

    • 32KB unified: Aggregate miss rate=1.99%

    • Using miss rate in the evaluation may be misleading!

  • Which is better (ignore L2 cache)?

    • Assume 25% data ops => 75% of accesses are instruction fetches (1.0/1.33)

    • hit time=1, miss time=50

    • Note that data hit has 1 stall for unified cache (only one port)

      AMAT_Harvard = 75% × (1 + 0.64% × 50) + 25% × (1 + 6.47% × 50) = 2.05

      AMAT_Unified = 75% × (1 + 1.99% × 50) + 25% × (1 + 1 + 1.99% × 50) = 2.24
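
The comparison as a runnable sketch, with the split ("Harvard") design charged no structural stall and the unified cache charged one extra cycle on data accesses:

```python
def amat(inst_frac, inst_miss, data_miss, hit_time, penalty, data_stall=0):
    data_frac = 1.0 - inst_frac
    return (inst_frac * (hit_time + inst_miss * penalty)
            + data_frac * (hit_time + data_stall + data_miss * penalty))

harvard = amat(0.75, 0.0064, 0.0647, hit_time=1, penalty=50)
unified = amat(0.75, 0.0199, 0.0199, hit_time=1, penalty=50, data_stall=1)
print(f"Harvard: {harvard:.3f}  Unified: {unified:.3f}")
# Harvard ≈ 2.05, Unified ≈ 2.24 – split wins despite its higher data miss rate
```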



Impact of Caches on Performance

  • Consider an in-order execution computer

    • Cache miss penalty of 100 clock cycles, base CPI = 1

    • Average miss rate of 2% and an average of 1.5 memory references per instruction

    • Average number of cache misses: 2% × 1.5 × 1000 = 30 per 1000 instructions

      Performance with cache misses:

      CPI = 1 + (30 / 1000) × 100 = 4.0



Impact of Caches on Performance

  • Calculating the same performance using the miss rate:

    CPI = 1 + 1.5 (refs/inst) × 2% × 100 = 4.0

  • A 4x increase in CPU time relative to a "perfect cache"

  • With no cache at all: CPI = 1.0 + 1.5 × 100 = 151 – roughly a factor of 40 worse than the system with a cache (151 vs. 4.0)

  • Minimizing memory accesses does not always imply a reduction in CPU time
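
Both formulations as a sketch (variable names are mine):

```python
base_cpi = 1.0
miss_penalty = 100     # clock cycles
miss_rate = 0.02
refs_per_inst = 1.5

# Misses-per-instruction view: 0.02 x 1.5 = 30 misses per 1000 instructions.
cpi_with_cache = base_cpi + miss_rate * refs_per_inst * miss_penalty  # 4.0

# No cache at all: every reference pays the full memory latency.
cpi_no_cache = base_cpi + refs_per_inst * miss_penalty                # 151.0

print(cpi_with_cache, cpi_no_cache, cpi_no_cache / cpi_with_cache)
```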



How to Improve Cache Performance?

Four main categories of optimizations:

  1. Reduce the miss penalty

     - multilevel caches, critical word first, read miss before write miss, merging write buffers, victim caches

  2. Reduce the miss rate

     - larger block size, larger cache size, higher associativity, way prediction and pseudoassociativity, compiler optimizations

  3. Reduce the miss penalty or miss rate via parallelism

     - non-blocking caches, hardware prefetching, compiler prefetching

  4. Reduce the time to hit in the cache

     - small and simple caches, avoiding address translation, pipelined cache access


Where Do Misses Come From?

  • Classifying Misses: 3 Cs

    • Compulsory – the first access to a block can never be in the cache, so the block must be brought into the cache. Also called cold-start misses or first-reference misses. (Misses that occur even in an infinite cache.)

    • Capacity – if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur as blocks are discarded and later retrieved. (The additional misses in a fully associative cache of size X.)

    • Conflict – if the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved when too many blocks map to its set. Also called collision misses or interference misses. (The additional misses in an N-way set-associative cache of size X.)
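
One way to make these definitions concrete is to classify each miss in a trace by simulating three caches (a sketch of mine, not from the lecture): an infinite cache, whose misses are compulsory; a fully associative LRU cache of the same capacity, whose additional misses are capacity; and the actual cache, whose remaining misses are conflict:

```python
from collections import OrderedDict

def classify_misses(trace, num_blocks, num_sets):
    seen = set()                       # infinite cache: first-touch tracking
    full = OrderedDict()               # fully associative LRU, same capacity
    sets = [OrderedDict() for _ in range(num_sets)]
    ways = num_blocks // num_sets
    counts = {"compulsory": 0, "capacity": 0, "conflict": 0}

    for block in trace:
        compulsory = block not in seen
        seen.add(block)

        fa_hit = block in full         # would a fully associative cache hit?
        if fa_hit:
            full.move_to_end(block)
        else:
            if len(full) >= num_blocks:
                full.popitem(last=False)
            full[block] = None

        s = sets[block % num_sets]     # the actual set-associative cache
        if block in s:
            s.move_to_end(block)
            continue                   # real hit: nothing to classify
        if len(s) >= ways:
            s.popitem(last=False)
        s[block] = None

        if compulsory:
            counts["compulsory"] += 1
        elif not fa_hit:
            counts["capacity"] += 1
        else:
            counts["conflict"] += 1    # fully associative would have hit
    return counts

# Blocks 0 and 4 collide in a 4-set direct-mapped cache:
print(classify_misses([0, 4, 0, 4, 0], num_blocks=4, num_sets=4))
# {'compulsory': 2, 'capacity': 0, 'conflict': 3}
```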


3Cs Absolute Miss Rate (SPEC92)

[Figure: absolute miss rate vs. cache size for SPEC92, broken down into compulsory, capacity, and conflict components.]



Cache Organization?

  • Assume the total cache size is not changed

  • What happens if we:

    1) Change the block size?

    2) Change the associativity?

    3) Change the compiler?

    Which of the 3Cs is obviously affected?



Increasing Block Size

  • Larger block sizes reduce compulsory misses

    • Takes advantage of spatial locality

  • Larger blocks increase miss penalty

    • Our goal: reduce miss rate and miss penalty!

  • Block size selection depends on latency and bandwidth of lower level memory

    • High latency & high BW => large block sizes

    • Low latency & low BW => smaller block sizes
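
The latency/bandwidth point can be sketched with a simple model, miss penalty ≈ access latency + block size / transfer bandwidth; the numbers below are invented for illustration:

```python
def miss_penalty(latency_cycles, bytes_per_cycle, block_bytes):
    # Fixed access latency plus the transfer time for the whole block.
    return latency_cycles + block_bytes / bytes_per_cycle

for block in (16, 32, 64, 128):
    hi = miss_penalty(latency_cycles=80, bytes_per_cycle=16, block_bytes=block)
    lo = miss_penalty(latency_cycles=10, bytes_per_cycle=2, block_bytes=block)
    print(f"{block:3d}B block: high-lat/high-BW {hi:5.1f} cyc,"
          f" low-lat/low-BW {lo:5.1f} cyc")
# Big blocks barely raise the high-latency penalty (81 -> 88 cycles) but
# quadruple the low-bandwidth one (18 -> 74), matching the rule above.
```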


Larger Block Size (fixed size & assoc)

[Figure: miss rate vs. block size for fixed cache size and associativity – compulsory misses are reduced as blocks grow, but conflict misses increase for the largest blocks.]

What else drives up block size?



Cache Size

  • Old rule of thumb: 2x size => 25% cut in miss rate

  • What does it reduce?



Increase Associativity

  • Higher set-associativity improves miss rates

  • Rule of thumb for caches

    • A direct-mapped cache of size N has about the same miss rate as a two-way set-associative cache of size N/2

    • Holds for caches smaller than 128 KB

  • Disadvantages

    • Reduces the miss rate but can increase the miss penalty

    • Greater associativity can come at the cost of increased hit time


Associativity

[Figure: miss rate vs. cache size at different associativities, highlighting the conflict miss component.]

3Cs Relative Miss Rate

[Figure: the same 3Cs data plotted as miss rates relative to the total, with the conflict component on top.]

  • Flaws: for a fixed block size

  • Good: insight => invention



Associativity vs Cycle Time

  • Beware: execution time is the only final measure!

  • Why is cycle time tied to hit time?

  • Will Clock Cycle time increase?



Next Time

  • Cache Tradeoffs for Performance

  • Reducing Hit Times

