
CMPE 421 Parallel Computer Architecture

PART 4

Caching with Associativity


Fully Associative Cache: Reducing Cache Misses by More Flexible Block Placement

  • Instead of direct mapping, we allow any memory block to be placed in any cache slot.

    • There are many different potential addresses that map to each index

    • Use any available entry to store memory elements

    • Remember: direct-mapped caches are more rigid; any cache data goes directly where the index says, even if the rest of the cache is empty

    • But in a fully associative cache, nothing gets “thrown out” until the cache is completely full.

  • It’s harder to check for a hit (hit time will increase).

  • Requires lots more hardware (a comparator for each cache slot).

  • Each tag will be a complete block address (No index bits are used).


Fully Associative Cache

  • Must compare tags of all entries in parallel to find the desired one (if there is a hit)

    • A direct-mapped cache, by contrast, only needs to look in one place

  • No conflict misses, only capacity misses

  • Practical only for caches with a small number of blocks, since the parallel search increases the hardware cost (see the sketch below)
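To make the parallel tag comparison concrete, here is a minimal sketch (Python, not from the slides) of a fully associative lookup: the cache is just a list of (valid, tag, data) entries, and every stored tag is compared against the full block address. The class name and sizes are illustrative assumptions.

```python
# Minimal sketch of a fully associative cache (illustrative only).
# In hardware the tag comparison is one comparator per entry working in
# parallel; in this sketch it is simply a loop over all entries.

class FullyAssociativeCache:
    def __init__(self, num_entries):
        # Each entry holds [valid, tag, data]; the tag is the whole block address.
        self.entries = [[False, None, None] for _ in range(num_entries)]

    def lookup(self, block_address):
        for valid, tag, data in self.entries:
            if valid and tag == block_address:
                return data            # hit
        return None                    # miss

    def fill(self, block_address, data):
        # Use any free entry; nothing is "thrown out" until the cache is full.
        for entry in self.entries:
            if not entry[0]:
                entry[:] = [True, block_address, data]
                return
        # Cache full: a replacement policy (e.g. LRU) would pick a victim here.
        self.entries[0][:] = [True, block_address, data]
```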



Direct Mapped vs Fully Associative

[Figure: a 16-entry direct-mapped cache (indexes 0 through 15, each entry holding V, Tag, Data) next to a fully associative cache with the same entries but no index.]

  • Direct mapped: each address has only one possible location; Address = Tag | Index | Block offset

  • Fully associative: no index; Address = Tag | Block offset
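As a worked illustration of the two address layouts above (not from the slides), the sketch below splits a byte address into the fields a 16-entry direct-mapped cache would use, versus the tag-only form a fully associative cache would use. The 4-byte block size is an assumption.

```python
# Illustrative address decomposition (assumed: 16 entries, 4-byte blocks).
BLOCK_OFFSET_BITS = 2   # 4 bytes per block
INDEX_BITS = 4          # 16 entries in the direct-mapped cache

def direct_mapped_fields(addr):
    # Address = Tag | Index | Block offset
    offset = addr & ((1 << BLOCK_OFFSET_BITS) - 1)
    index  = (addr >> BLOCK_OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag    = addr >> (BLOCK_OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

def fully_associative_fields(addr):
    # Address = Tag | Block offset (no index; the whole block address is the tag)
    offset = addr & ((1 << BLOCK_OFFSET_BITS) - 1)
    tag    = addr >> BLOCK_OFFSET_BITS
    return tag, offset

print(direct_mapped_fields(0x1234))       # (72, 13, 0): tag, index, offset
print(fully_associative_fields(0x1234))   # (1165, 0): tag, offset
```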


Trade Off

  • Fully associative is much more flexible, so the miss rate will be lower.

  • Direct Mapped requires less hardware (cheaper).

    – will also be faster!

  • Tradeoff of miss rate vs. hit time.

  • Therefore we might compromise to find the best solution between a direct-mapped cache and a fully associative cache

  • We can also provide more flexibility without going to a fully associative placement policy.

  • For each memory location, provide a small number of cache slots that can hold the memory element.

  • This is much more flexible than direct-mapped, but requires less hardware than fully associative.

    SOLUTION: Set Associative


Set Associative Cache

  • A fixed number of locations where each block can be placed.

  • N-way set associative means there are N places (slots) where each block can be placed.

  • Divide the cache into a number of sets, where each set contains N “ways” (N-way set associative)

  • Therefore, a memory block maps to a unique set (specified by the index field) and can be placed in any “way” of that set

    • So there are N choices

    • In a set-associative cache, a memory block is mapped to the set

      • (Block address) modulo (Number of sets in the cache)

      • Remember that in a direct-mapped cache the position of a memory block is given by

        (Block address) modulo (Number of cache blocks)

      (A minimal sketch of both mapping rules follows below.)
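A minimal sketch of the two mapping rules (illustrative, not from the slides); the cache geometry used in the example is an assumption.

```python
# Illustrative mapping rules for set-associative vs direct-mapped placement.

def set_index(block_address, num_sets):
    # Set associative: the block may go in any way of this one set.
    return block_address % num_sets

def direct_mapped_index(block_address, num_blocks):
    # Direct mapped: the block has exactly one possible location.
    return block_address % num_blocks

# Example (assumed): an 8-block cache organised as 4 sets x 2 ways.
print(direct_mapped_index(12, 8))   # 4: only one legal location
print(set_index(12, 4))             # 0: either way of set 0 may be used
```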


A Compromise

[Figure: the same 16-entry cache organised as 2-way set associative (8 sets of 2) and as 4-way set associative (4 sets of 4), each entry holding V, Tag, Data.]

  • 2-way set associative: each address has two possible locations with the same index; one fewer index bit (1/2 the indexes); Address = Tag | Index | Block offset

  • 4-way set associative: each address has four possible locations with the same index; two fewer index bits (1/4 the indexes); Address = Tag | Index | Block offset


Range of Set Associative Caches

[Figure: address fields Tag | Index | Block offset | Byte offset. The tag is used for the tag compare, the index selects the set, and the block offset selects the word in the block.]

  • The index is the set number; it determines which set the block can be placed in

  • Increasing associativity: fewer sets and larger tags; fully associative is the extreme (only one set), where the tag is all the bits except the block and byte offsets

  • Decreasing associativity: more sets and smaller tags; direct mapped is the extreme (only one way)


Range of Set Associative Caches

  • For a fixed-size cache, each increase in associativity by a factor of two

    • doubles the number of blocks per set (i.e., the number of ways),

    • halves the number of sets,

    • decreases the size of the index by 1 bit,

    • and increases the size of the tag by 1 bit (see the sketch below)

Address fields: Tag | Index | Block offset | Byte offset
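The sketch below (illustrative, not from the slides) computes these field widths for a fixed-size cache as the associativity doubles; the 4 KB cache size, 16-byte blocks, and 32-bit addresses are assumed numbers.

```python
import math

# Illustrative field-width calculation for a fixed-size cache
# (assumed: 4 KB cache, 16-byte blocks, 32-bit byte addresses).
CACHE_BYTES = 4 * 1024
BLOCK_BYTES = 16
ADDR_BITS = 32

def field_widths(ways):
    num_blocks  = CACHE_BYTES // BLOCK_BYTES       # 256 blocks in total
    num_sets    = num_blocks // ways
    offset_bits = int(math.log2(BLOCK_BYTES))      # block offset + byte offset
    index_bits  = int(math.log2(num_sets))
    tag_bits    = ADDR_BITS - index_bits - offset_bits
    return index_bits, tag_bits

for ways in (1, 2, 4, 8):
    index_bits, tag_bits = field_widths(ways)
    print(f"{ways}-way: index = {index_bits} bits, tag = {tag_bits} bits")
# Each doubling of associativity removes one index bit and adds one tag bit:
# 1-way: index=8, tag=20;  2-way: index=7, tag=21;  4-way: index=6, tag=22; ...
```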


Set Associative Cache

[Figure: a 2-way set-associative cache with 2 sets (four one-word blocks in total), each entry holding V, Tag, Data, alongside a 16-word main memory with block addresses 0000xx through 1111xx. The two low-order bits define the byte in the word (32-bit words); blocks are one word.]

Q1: How do we find it?

Use the next low-order memory address bit to determine which cache set, i.e., (block address) modulo (# of sets in the cache)

Q2: Is it there?

Compare all the cache tags in the set to the high-order 3 memory address bits to tell if the memory block is in the cache

Valid bit indicates whether an entry contains valid information – if the bit is not set, there cannot be a match for this block


Set Associative Cache Organization

FIGURE 7.17 The implementation of a four-way set-associative cache requires four comparators and a 4-to-1 multiplexor. The comparators determine which element of the selected set (if any) matches the tag. The output of the comparators is used to select the data from one of the four blocks of the indexed set, using a multiplexor with a decoded select signal. In some implementations, the Output enable signals on the data portions of the cache RAMs can be used to select the entry in the set that drives the output. The Output enable signal comes from the comparators, causing the element that matches to drive the data outputs.


Set Associative Cache Organization

  • This is called a 4-way set associative cache because there are four cache entries for each cache index. Essentially, you have four direct-mapped caches working in parallel.

  • This is how it works: the cache index selects a set from the cache. The four tags in the set are compared in parallel with the upper bits of the memory address.

  • If no tags match the incoming address tag, we have a cache miss.

  • Otherwise, we have a cache hit, and we select the data from the way where the tag match occurs.

  • This is simple enough (a lookup sketch follows below). What are its disadvantages?
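As an illustration of that flow (not the textbook's circuit), the sketch below models the four parallel comparators and the 4-to-1 multiplexor as a loop over the ways of the indexed set; the set count and block size are assumptions.

```python
# Illustrative 4-way set-associative lookup: the index selects a set, the four
# tags are compared (in hardware, in parallel), and a hit selects that way's
# data, which is what the 4-to-1 multiplexor does.

NUM_SETS = 256                         # assumed
WAYS = 4
BLOCK_OFFSET_BITS = 4                  # assumed 16-byte blocks
INDEX_BITS = 8                         # log2(NUM_SETS)

# cache[set_index] is a list of WAYS entries, each (valid, tag, data)
cache = [[(False, 0, None)] * WAYS for _ in range(NUM_SETS)]

def lookup(addr):
    index = (addr >> BLOCK_OFFSET_BITS) & (NUM_SETS - 1)   # selects the set
    tag   = addr >> (BLOCK_OFFSET_BITS + INDEX_BITS)       # upper address bits
    for valid, stored_tag, data in cache[index]:           # four comparators
        if valid and stored_tag == tag:
            return "hit", data                             # mux selects this way
    return "miss", None
```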


N-way Set Associative Cache versus Direct Mapped Cache

  • An N-way set-associative cache will also be slower than a direct-mapped cache because

    • N comparators vs. 1

    • Extra MUX delay for the data

    • Data comes AFTER Hit/Miss decision and set selection

    • In a direct-mapped cache, the cache block is available BEFORE the Hit/Miss decision:

    • Possible to assume a hit and continue. Recover later if miss.



Remember the Example for Direct Mapping (ping pong effect)

  • Consider the main memory word reference string

    0 4 0 4 0 4 0 4

Start with an empty cache - all blocks initially marked as not valid

[Figure: every one of the eight references misses; the single direct-mapped block that addresses 0 and 4 share alternates between holding 00 Mem(0) and 01 Mem(4) on each access.]

  • 8 requests, 8 misses

  • Ping pong effect due to conflict misses: two memory locations that map into the same cache block keep evicting each other (simulated in the sketch below)
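A minimal simulation of this ping-pong behaviour (illustrative; the 4-block direct-mapped cache is an assumption consistent with the 2-bit tags shown above):

```python
# Illustrative: block addresses 0 and 4 collide in a 4-block direct-mapped cache.
NUM_BLOCKS = 4
cache = [None] * NUM_BLOCKS             # each entry holds a tag, or None if invalid

hits = misses = 0
for block_addr in [0, 4, 0, 4, 0, 4, 0, 4]:
    index = block_addr % NUM_BLOCKS     # 0 and 4 both map to index 0
    tag = block_addr // NUM_BLOCKS      # tag alternates between 0 and 1
    if cache[index] == tag:
        hits += 1
    else:
        misses += 1
        cache[index] = tag              # evict whatever was there (ping-pong)

print(hits, misses)                     # 0 hits, 8 misses
```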


Solution: Use a Set Associative Cache

  • Consider the main memory word reference string

    0 4 0 4 0 4 0 4

Start with an empty cache - all blocks initially marked as not valid

[Figure: the reference sequence now gives miss, miss, hit, hit, hit, hit, hit, hit; after the two compulsory misses the set holds both 000 Mem(0) and 010 Mem(4).]

  • 8 requests, 2 misses

  • Solves the ping pong effect caused by conflict misses in a direct-mapped cache, since two memory locations that map into the same cache set can now co-exist (see the sketch below)!
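The same reference string through a 2-way set-associative cache (again an illustrative sketch; 2 sets of 2 ways and LRU replacement are assumptions consistent with the slide):

```python
# Illustrative: with 2 sets x 2 ways and LRU replacement, only the first two
# references (the compulsory misses) miss; blocks 0 and 4 then co-exist in set 0.
NUM_SETS, WAYS = 2, 2
cache = [[] for _ in range(NUM_SETS)]    # each set: list of tags, LRU at the front

hits = misses = 0
for block_addr in [0, 4, 0, 4, 0, 4, 0, 4]:
    index = block_addr % NUM_SETS        # 0 and 4 both map to set 0
    tag = block_addr // NUM_SETS
    ways = cache[index]
    if tag in ways:
        hits += 1
        ways.remove(tag)                 # refresh LRU order on a hit
    else:
        misses += 1
        if len(ways) == WAYS:
            ways.pop(0)                  # evict the least recently used way
    ways.append(tag)                     # most recently used goes to the back

print(hits, misses)                      # 6 hits, 2 misses
```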


Set Associative Example

Address fields: Tag (3-5 bits) | Index (1-3 bits) | Block offset (2 bits) | Byte offset (2 bits)

[Figure: the reference addresses 0100111000, 1100110100, 0100111100, 0110110000, and 1100111000 are traced through a direct-mapped cache, a 2-way set-associative cache, and a 4-way set-associative cache, showing for each organization which references hit and which miss.]
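A sketch (illustrative, not the slide's worked answer) that decodes the example's 10-bit addresses into the index and tag each organization would use, following the field breakdown above; the 8-block total cache size is an assumption.

```python
# Illustrative decoding of the example's 10-bit byte addresses for an assumed
# 8-block cache organised as direct-mapped (8 sets), 2-way (4 sets) or 4-way (2 sets).
# Low-order fields: byte offset (2 bits) then block offset (2 bits).
ADDRS = ["0100111000", "1100110100", "0100111100", "0110110000", "1100111000"]

def decode(addr_bits, num_sets):
    addr = int(addr_bits, 2)
    block_addr = addr >> 4               # strip byte offset and block offset
    index = block_addr % num_sets        # set (or block) index
    tag = block_addr // num_sets         # remaining high-order bits
    return index, tag

for a in ADDRS:
    fields = [decode(a, num_sets) for num_sets in (8, 4, 2)]
    print(a, fields)                     # (index, tag) for DM, 2-way, 4-way
```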


New Performance Numbers

Miss rates for DEC 3100 (MIPS machine)

Separate 64KB Instruction/Data Caches

Benchmark   Associativity   Instruction miss rate   Data miss rate   Combined miss rate
gcc         Direct          2.0%                    1.7%             1.9%
gcc         2-way           1.6%                    1.4%             1.5%
gcc         4-way           1.6%                    1.4%             1.5%
spice       Direct          0.3%                    0.6%             0.4%
spice       2-way           0.3%                    0.6%             0.4%
spice       4-way           0.3%                    0.6%             0.4%


Benefits of Set Associative Caches

  • The choice of direct mapped or set associative depends on the cost of a miss versus the cost of implementation

Data from Hennessy & Patterson, Computer Architecture, 2003

  • Largest gains are in going from direct mapped to 2-way (20%+ reduction in miss rate)


Benefits of Set Associative Caches

  • As the cache size grows, the relative improvement from associativity increases only slightly

  • Since the overall miss rate of a larger cache is lower, the opportunity for improving the miss rate decreases

  • And the absolute improvement in miss rate from associativity shrinks significantly


Cache Block Replacement Policy

For deciding which block to replace when a new entry comes in:

  • Random Replacement:

    • Hardware randomly selects a cache entry and throws it out

  • First in First Out (FIFO)

    • Equally fair / equally unfair to all frames

  • Least Recently Used (LRU) strategy:

    • Use idea of temporal locality to select the entry that has not been accessed recently

    • Additional bit(s) required in the cache entry to track access order

      • Must update on each access, must scan all on a replace

    • For a two-way set-associative cache, one bit per set is needed for LRU replacement.

  • A common approach is to use a pseudo-LRU strategy

  • Example of a Simple “Pseudo” Least Recently Used Implementation:

  • Assume 64 Fully Associative Entries

  • Hardware replacement pointer points to one cache entry

  • Whenever an access is made to the entry the pointer points to:

    - Move the pointer to the next entry

    - Otherwise: do not move the pointer

    (A sketch of this scheme follows below.)
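A minimal sketch of this pointer-based pseudo-LRU scheme (illustrative; the 64-entry fully associative organisation matches the example above, and advancing the pointer after a replacement is an added assumption, not stated on the slide):

```python
# Illustrative pseudo-LRU: one replacement pointer for a 64-entry fully
# associative cache. The pointer only moves when the entry it points at is
# accessed, so it tends to rest on an entry that has not been used recently.
NUM_ENTRIES = 64
tags = [None] * NUM_ENTRIES
pointer = 0

def access(tag):
    global pointer
    if tag in tags:                                  # hit
        if tags.index(tag) == pointer:
            pointer = (pointer + 1) % NUM_ENTRIES    # move pointer to next entry
        # otherwise: do not move the pointer
        return "hit"
    # Miss: replace the entry the pointer points to, then advance it
    # (advancing after a replacement is an assumption, not from the slide).
    tags[pointer] = tag
    pointer = (pointer + 1) % NUM_ENTRIES
    return "miss"
```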



Designing a Cache

Note: If you are running “billions” of instructions, compulsory misses are insignificant.





Multilevel Caches

  • Two level cache structure allows the primary cache (L1) to focus on reducing hit time to yield a shorter clock cycle.

  • The second level cache (L2) focuses on reducing the penalty of long memory access time.

  • Compared to the cache of a single cache machine, L1 on a multilevel cache machine is usually smaller, has a smaller block size, and has a higher miss rate.

  • Compared to the cache of a single cache machine, L2 on a multilevel cache machine is often larger with a larger block size.

  • The access time of L2 is less critical than that of the cache of a single-cache machine (see the worked AMAT sketch below).
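To see the effect in numbers, here is a worked average memory access time (AMAT) sketch; the 1-cycle L1 hit, 10-cycle L2 access, 100-cycle memory penalty, and the miss rates are assumed, illustrative values, not from the slides.

```python
# Illustrative two-level AMAT calculation (all numbers assumed).
l1_hit_time    = 1      # cycles: L1 keeps the hit time (and clock cycle) short
l1_miss_rate   = 0.05
l2_hit_time    = 10     # cycles to reach L2 on an L1 miss
l2_miss_rate   = 0.20   # fraction of L1 misses that also miss in L2
memory_penalty = 100    # cycles to reach main memory

# AMAT = L1 hit time + L1 miss rate * (L2 hit time + L2 miss rate * memory penalty)
amat = l1_hit_time + l1_miss_rate * (l2_hit_time + l2_miss_rate * memory_penalty)
print(amat)             # 2.5 cycles, versus 1 + 0.05 * 100 = 6.0 with no L2
```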

