Cache hierarchy

Presentation Transcript



Cache Hierarchy

J. Nelson Amaral

University of Alberta



Address Translation(flashback)

valid bit = 0 implies a page fault (there is no frame in memory for this page).

Baer, p. 62



Should the Cache be Accessed with Physical or Virtual Addresses?

Baer, p. 209



Instruction Cache

  • Instructions are fetched either:

    • sequentially (same cache line)

    • with an address from a branch target buffer (BTB)

      • BTB contains physical addresses

  • When needed, translation is:

    • done in parallel with delivery of previous instruction

  • Thus the instruction cache can be physically addressed

Baer, p. 209



Data Cache

For a 2^k page size, the last k bits are identical in the virtual and physical addresses.

(Figure: a cache accessed with a physical index and physical tags vs. one accessed with a virtual index.)

If the cache index fits within these k bits, then these two schemes are identical.

Baer, p. 209



Parallel TLB and Cache Access

Page size = 2^k. Only works if index + displacement ≤ k.

Baer, p. 210



Pipeline Stages

Saves a pipeline stage when there is a hit in both the TLB and the cache.

Stage 1: send the data to the register.

Stage 2: if the tag from the TLB ≠ the tag in the cache:

  - void the data in the register

  - start replay

Baer, p. 210


Page sizes are typically 4 KB or 8 KB. An 8 KB L1 cache is too small.

Two solutions:

  • Increase cache associativity.

  • Increase the number of bits that are not translated.

Baer, p. 210



Limits on Associativity

Limit: the time to do the tag comparisons.

Solution: do the comparisons in parallel. Even then, time is still needed for latches/multiplexors.

Solution: don't compare with all tags. How? Use a set predictor. For L1, the predictor must be fast.

Baer, p. 210



Page Coloring

  • Goal: increase number of non-translated bits

  • Idea: Restrict mapping of pages into frames

    • Divide both pages and frames into colors.

    • A page must map to a frame of the same color.

    • For l additional non-translated bits, 2^l colors are needed. This restricts the mapping between virtual and physical addresses (see the sketch after this list).

  • Alternative to coloring:

    • Use a predictor for the l bits.

Baer, p. 211
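A minimal sketch of the coloring constraint (not from Baer; constants and names are illustrative): with 2^l colors, the l address bits just above the page offset must agree between the virtual and physical address, so they behave as untranslated bits.

    #include <stdio.h>
    #include <stdint.h>

    #define PAGE_BITS  12            /* 4 KB pages (assumption for the sketch) */
    #define COLOR_BITS  2            /* l = 2 additional non-translated bits   */

    /* The color is given by the l address bits just above the page offset. */
    static unsigned color_of(uint64_t addr) {
        return (addr >> PAGE_BITS) & ((1u << COLOR_BITS) - 1);
    }

    /* A frame may back a virtual page only if the colors match, so these
       l bits are the same in the virtual and the physical address.        */
    static int mapping_allowed(uint64_t vaddr, uint64_t paddr) {
        return color_of(vaddr) == color_of(paddr);
    }

    int main(void) {
        printf("%d\n", mapping_allowed(0x0000a000, 0x0004a000));  /* same color: 1  */
        printf("%d\n", mapping_allowed(0x0000a000, 0x0004b000));  /* different:  0  */
        return 0;
    }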



Virtual Cache

  • Virtual index and virtual tags allow for fast access to cache.

  • However…

    • Page protection and recency-of-use information (stored in TLB) must be accessed.

      • TLB must be accessed anyway

      • TLB access can be in parallel with cache access

    • Context switches activate a new virtual address space

      • Entire cache content becomes stale. Either:

        • flush the cache

        • append a PID to tag in cache

          • must flush part of cache when recycling PIDs

    • Synonym Problem

Baer, p. 211



Synonym Problem

Two virtual addresses, A and B, map to the same physical address.

Occurs when data is shared among processes.

What happens in a virtual cache if two synonyms are cached simultaneously, and one of them is modified? The other becomes inconsistent!

Baer, p. 211



Avoiding Stale Synonyms

  • Variation on Page Coloring: Require that the bits used to index the cache be the same for all synonyms

  • Software must be aware of potential synonyms

    • Easier for instruction caches

    • Tricky for data caches

      • The Sun UltraSPARC has a virtual instruction cache.

Baer, p. 212



Example

  • Page size is 4 Kbytes

    • How many bits for page number and page offset?

      • 20 bits for page number and 12 bits for page offset

  • Direct-mapped D-Cache has 16 bytes per line, 512 lines:

    • How many bits for tag, index, and displacement?

      • 16 × 512 = 8192 bytes = 2^13 bytes

      • displacement = 4 bits

      • index = 13 − 4 = 9 bits

      • tag = 32 − 13 = 19 bits

The lowest bit of the page number is part of the index (see the sketch after this slide).

Baer, p. 212
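A small sketch of the field split worked out above, assuming the same geometry (4 KB pages, direct-mapped 8 KB cache, 16-byte lines); the code and names are illustrative, not part of the original example.

    #include <stdio.h>

    #define PAGE_OFFSET_BITS 12      /* 4 KB pages                 */
    #define DISPL_BITS        4      /* 16-byte lines              */
    #define INDEX_BITS        9      /* 512 lines, direct-mapped   */

    int main(void) {
        unsigned addr  = 0x00012345u;                     /* an arbitrary 32-bit address */
        unsigned displ = addr & ((1u << DISPL_BITS) - 1);
        unsigned index = (addr >> DISPL_BITS) & ((1u << INDEX_BITS) - 1);
        unsigned tag   = addr >> (DISPL_BITS + INDEX_BITS);

        /* index + displacement = 13 bits, one more than the 12-bit page offset,
           so bit 12 (the lowest page-number bit) falls inside the cache index. */
        printf("tag=0x%x  index=0x%x  displ=0x%x\n", tag, index, displ);
        printf("index uses a translated bit? %s\n",
               (DISPL_BITS + INDEX_BITS > PAGE_OFFSET_BITS) ? "yes" : "no");
        return 0;
    }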


(Figure: a virtual address split into page number and page offset over bits 28..0, and into cache tag, cache index, and cache displacement; Process A's virtual page 4 and Process B's virtual page 17 both map to physical page 14.)

Baer, p. 212


Process A reads line 8 of its page 4 (line 8 of physical page 14).

Process B reads line 8 of its page 17 (line 8 of physical page 14).

(Figure: the two virtual addresses split into cache tag, cache index, and displacement. The lowest page-number bit differs, so A's copy is cached at line 8 while B's copy is cached at line 264.)

Baer, p. 212


Now processor B writes to its cache line 8.

(Figure: the same addresses as above, with B's cached copy now modified.)

Baer, p. 212


We have a synonym problem: two copies of the same physical line are in the cache, and they are inconsistent.

How can we avoid the synonym?

Baer, p. 212



Checks on a cache miss:

(i) check the virtual tags to ensure that there is a miss (in the example, this would be the tag at line 264);

(ii) compare the physical page number of the missing item (page 14 in the example) with the physical tag(s) of all other locations in the cache that could be potential synonyms (the physical page number for the virtual tag of line 8 in the example).

Baer, p. 212



Other drawbacks of Virtual Caches

I/O addresses are physical.

Cache coherence in multiprocessors uses physical addresses.

Virtual caches are not currently used in practice.

Baer, p. 212



Virtual Index and Physical Tags

Idea: limit the mapping in a similar way to page coloring, but apply it only to the cache.

Example: consider an m-way set-associative cache with capacity m × 2^k.

Constraint: each line in a set has a pattern of l bits that is different from all other lines in the set. The pattern lies above the lower k bits, and l < m.

Baer, p. 212



Virtual Index and Physical Tags (Cont.)

On an access:

The set is determined by the k untranslated bits. A prediction selects among the m patterns of the l virtual bits.

If the prediction does not match the full translation obtained from the TLB, the access must be repeated.

Drawback: lines mapped to the same set with the same l-bit pattern cannot be in the cache simultaneously.

Baer, p. 212



“Faking” Associativity

Column-associative caches: treat a direct-mapped cache as two independent halves.

First access: access the cache using the usual index.

Second access: if a miss occurs, rehash the address and perform a second access to a different line.

Swap: if the second access is a hit, swap the entries for the first and second accesses.

Baer, p. 213


Column-Associative Cache (example)

(Animation: the processor issues the reference sequence a, b, c, b, c, b, … to a cache split into Half 1 and Half 2. The high-order index bits of a and b make them be looked up first in Half 1, while c's high-order bit is the opposite, so c is looked up first in Half 2. Each access misses and displaces an earlier entry between the two halves.)


Thus the sequence a b c b c b c … results in a miss on every access. The same sequence would result in close to a 100% hit rate in a 2-way set-associative cache.

Solution: add a rehash bit that indicates that the entry is not in its original location.

If there is a miss on the first access and the rehash bit is on, there will also be a miss on the second access. The entry with the rehash bit on is the LRU and should be evicted.

Baer, p. 214


Operation of Column-Assoc. Cache

tag(index) = tag_ref?

  • yes: it is a hit; serve the entry to the processor.

  • no: check rehash_bit(index).

    • on: it is a miss; enter the new line at index with its rehash bit set to 0.

    • off: index1 ← flip_high_bit(index); tag(index1) = tag_ref?

      • yes: it is a secondary hit; swap[entry(index), entry(index1)], set rehash_bit(index1) ← 1, and serve the entry to the processor.

      • no: it is a secondary miss; enter the new line, swap[entry(index), entry(index1)], and set rehash_bit(index1) ← 1.

(A C sketch of this flow follows.)

Baer, p. 214
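A C sketch of the flow above, assuming a direct-mapped tag array with one rehash bit per line and a rehash function that flips the high-order index bit; the structure and names (ca_lookup, flip_high_bit) are illustrative, not Baer's.

    #include <stdint.h>

    #define INDEX_BITS 10
    #define NLINES     (1 << INDEX_BITS)

    struct ca_line { uint32_t tag; int valid; int rehash; /* data omitted */ };
    static struct ca_line cache[NLINES];

    static unsigned flip_high_bit(unsigned idx) { return idx ^ (1u << (INDEX_BITS - 1)); }

    static void swap_lines(unsigned i, unsigned j) {
        struct ca_line t = cache[i]; cache[i] = cache[j]; cache[j] = t;
    }

    /* Returns 1 on a first or secondary hit, 0 on a miss served by memory. */
    static int ca_lookup(uint32_t tag_ref, unsigned index) {
        if (cache[index].valid && cache[index].tag == tag_ref)
            return 1;                               /* first-access hit            */

        if (cache[index].rehash) {                  /* displaced entry lives here: */
            cache[index].tag    = tag_ref;          /* it is the LRU, evict it and */
            cache[index].valid  = 1;                /* enter the new line with     */
            cache[index].rehash = 0;                /* rehash bit 0                */
            return 0;
        }

        unsigned index1 = flip_high_bit(index);
        if (cache[index1].valid && cache[index1].tag == tag_ref) {
            swap_lines(index, index1);              /* secondary hit: bring the    */
            cache[index].rehash  = 0;               /* line to its natural slot,   */
            cache[index1].rehash = 1;               /* mark the displaced one      */
            return 1;
        }

        swap_lines(index, index1);                  /* secondary miss: displaced   */
        cache[index1].rehash = 1;                   /* entry moves to index1, new  */
        cache[index].tag    = tag_ref;              /* line enters at index        */
        cache[index].valid  = 1;
        cache[index].rehash = 0;
        return 0;
    }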



Performance of Column-Associative Caches (CAC)

  • Comparing CAC with Direct-Mapped Cache (DMC) of same capacity:

    • Miss ratio(CAC) < Miss ratio(DMC)

    • Access time to the second half (CAC) > access time (DMC)

  • Comparing with a Two-Way Set Associative Cache (2SAC) of same capacity:

    • miss ratio(CAC) approaches miss ratio(2SAC)

Baer, p. 215



Design Question

When should a column-associative cache be chosen over a 2-way set-associative cache?

When the 2-way set-associative cache would require a longer processor clock cycle.

Baer, p. 215



Design Question #2

Can a column-associative cache be expanded to higher associativity? How?

The hashing function has to be replaced; the single-bit hashing does not work. One option is some XOR combination of PC bits.

Baer, p. 215



Victim Caches

Baer, p. 215


Operation of Victim Cache

tag(index) = tag_ref?

  • yes: it is a hit; serve the entry to the processor.

  • no: VCTag = concat[index, tag(index)]; associatively search the victim cache for VCTag.

    • Tag(VC[i]) = VCTag for some entry i: swap(victim, VC[i]), update the VC LRU information, and serve the entry to the processor.

    • no match: let VC[j] be the LRU entry in the VC; if VC[j] is dirty, writeback(VC[j]); then VC[j] ← victim.

(A C sketch of this flow follows.)

Baer, p. 216
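A C sketch of the victim-cache flow, assuming a small fully associative victim cache with an age-based LRU and a caller that passes in the line evicted from L1; all names are illustrative and the writeback is left as a stub.

    #include <stdint.h>

    #define VC_ENTRIES 8

    struct vc_entry { uint64_t line_addr; int valid; int dirty; unsigned age; /* data omitted */ };
    static struct vc_entry vc[VC_ENTRIES];

    /* Called on an L1 miss.  `miss_addr` is the requested line address and
       `victim` is the line being evicted from L1.  Returns 1 if the victim
       cache supplies the line (which is swapped with the L1 victim).        */
    static int victim_cache_lookup(uint64_t miss_addr, struct vc_entry *victim) {
        int i, hit = -1, lru = 0;

        for (i = 0; i < VC_ENTRIES; i++) {
            if (vc[i].valid) vc[i].age++;                 /* age every entry        */
            if (vc[i].valid && vc[i].line_addr == miss_addr) hit = i;
            if (vc[i].age >= vc[lru].age) lru = i;        /* oldest entry = LRU     */
        }

        if (hit >= 0) {
            struct vc_entry found = vc[hit];
            vc[hit] = *victim;                            /* swap(victim, VC[hit])    */
            vc[hit].age = 0;
            *victim = found;                              /* caller installs it in L1 */
            return 1;
        }

        if (vc[lru].valid && vc[lru].dirty) {
            /* writeback(vc[lru]) -- assumed helper that writes the dirty line back */
        }
        vc[lru] = *victim;                                /* L1 victim replaces LRU   */
        vc[lru].age = 0;
        vc[lru].valid = 1;
        return 0;                                         /* line comes from next level */
    }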



History: First Victim Cache

HP 7100, introduced in 1992.

HP 7200, introduced in 1995: 64-entry victim cache, 120 MHz clock.

Baer, p. 216



Code Reordering

  • Procedure reordering

  • Basic Block Reordering

Baer, p. 217



Data Reordering

  • Cache-conscious algorithms

  • Pool allocation

  • Structure reorganization

  • Loop tiling (cache tiling?)

Baer, p. 218



Hiding Memory Latencies

  • Prefetching

    • Software instructions

    • Hardware assisted

    • Hardware only (stream prefetching)

  • Between any two levels of memory hierarchy

  • Prefetching is predictive, can we use the same predictor modeling as for branches?

Baer, p. 218



Prefetching Prediction × Branch Prediction

Baer, p. 219



Assessment Criteria for Prefetching

Timeliness:

  • too early: displaces useful data that have to be reloaded before prefetched data is needed

  • too late: data is not there when needed

Baer, p. 219



Prefetching

  • Why? To hide memory latency by increasing the hit ratio.

  • What? Ideally semantic objects; in practice, cache lines.

  • When? In a timely manner, when a trigger happens.

  • Where? A given cache level or a special prefetch storage buffer.

Baer, p. 219



Disadvantages of Prefetching

  • Compete for resources with regular memory operations

    • E.g., might need an extra port in the cache to check tags before prefetching

    • Competition for memory bus with regular loads and stores

Baer, p. 220



Software Prefetching

  • Non-binding loads: a load that does not write to a register

  • More sophisticated instructions: designate the level in the cache hierarchy where the prefetched line should stop (Itanium)

Baer, p. 220



Software Prefetching Example

Original loop:

    for (i=0 ; i<n ; i++)
        inner = inner + a[i]*b[i];

Code with prefetching:

    for (i=0 ; i<n ; i++){
        prefetch(&a[i+1]);
        prefetch(&b[i+1]);
        inner = inner + a[i]*b[i];
    }

What is the drawback now? Each prefetch instruction brings an entire cache line, so the same line is fetched several times, and the prefetch may cause an exception on the last iteration.

Baer, p. 221



Software Prefetching Example

Original loop:

    for (i=0 ; i<n ; i++)
        inner = inner + a[i]*b[i];

Prefetching with a predicate:

    for (i=0 ; i<n ; i++){
        if (i != n-1 && i % 4 == 0){
            prefetch(&a[i+1]);
            prefetch(&b[i+1]);
        }
        inner = inner + a[i]*b[i];
    }

What is the drawback now? This branch is not easy to predict correctly.

Baer, p. 221



Software Prefetching Example

Original loop:

    for (i=0 ; i<n ; i++)
        inner = inner + a[i]*b[i];

With loop unrolling:

    prefetch(&a[0]);
    prefetch(&b[0]);
    for (i=0 ; i<n-4 ; i += 4){
        prefetch(&a[i+4]);
        prefetch(&b[i+4]);
        inner = inner + a[i]*b[i];
        inner = inner + a[i+1]*b[i+1];
        inner = inner + a[i+2]*b[i+2];
        inner = inner + a[i+3]*b[i+3];
    }
    for ( ; i<n ; i++)
        inner = inner + a[i]*b[i];

Issues: register pressure and code growth.

Baer, p. 221



Sequential Prefetching

  • or one-block lookahead (OBL) prefetching:

    • prefetch the next cache line

      • makes the line size look larger

  • Strategies

    • always-prefetch: high coverage, low accuracy

    • prefetch-on-miss: good for I-caches

    • tagged-prefetching: good for D-caches (see the sketch after this list)

      • one-bit tag in the cache indicates if next line should be prefetched

        • when line is prefetched: tag ← 0

        • when line is referenced: tag ← 1

Baer, p. 222
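A small sketch of the tagged-prefetching policy described in the last bullet, assuming one tag bit per line and stub fetch/prefetch helpers; the names are illustrative, not a real cache interface.

    #define NLINES 1024

    static struct { int valid; int tag_bit; /* address tag and data omitted */ } cache_line[NLINES];

    /* Stand-ins for the real cache machinery (assumed helpers). */
    static void fetch_line(unsigned idx)    { cache_line[idx].valid = 1; }
    static void prefetch_line(unsigned idx) { fetch_line(idx); cache_line[idx].tag_bit = 0; }

    static void reference(unsigned idx) {
        unsigned next = (idx + 1) % NLINES;

        if (!cache_line[idx].valid) {                 /* demand miss                      */
            fetch_line(idx);
            cache_line[idx].tag_bit = 1;              /* referenced line: tag <- 1        */
            prefetch_line(next);                      /* prefetch next line, its tag <- 0 */
        } else if (cache_line[idx].tag_bit == 0) {    /* first touch of a prefetched line */
            cache_line[idx].tag_bit = 1;              /* tag <- 1 on reference            */
            prefetch_line(next);                      /* trigger the next-line prefetch   */
        }
        /* tag_bit already 1: an ordinary hit, no prefetch is issued. */
    }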



One-Block Ahead Prefetching

Problem: the timeliness of OBL is usually poor (the prefetch arrives too late).

Solution: prefetch multiple lines ahead, but then accuracy can be low.

The number of lines ahead could be adaptive, based on previous success (a predictor!), but the feedback is unreliable: the timeliness of the feedback information is also poor.

Baer, p. 222



Stream Prefetching

  • Bring sequential lines into a FIFO stream buffer (see the sketch after this list).

    • On a miss, check the head of the stream buffer

      • If the head of the stream buffer is also a miss, flush the buffer

    • Works well for I-caches (not as good for D-caches)

      • A stream buffer with four entries yields significant performance improvement.

Baer, p. 223
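A minimal sketch of a four-entry stream buffer as described above (a FIFO of sequential lines, checked at its head on a miss, flushed and refilled when the head also misses); the names and refill details are illustrative assumptions.

    #include <stdint.h>

    #define SB_DEPTH 4

    static struct { uint64_t line_addr; int valid; } sbuf[SB_DEPTH];
    static int sb_head = 0;

    /* (Re)fill the buffer with SB_DEPTH sequential lines starting at `first`.
       In hardware the fetches would be issued to the next level here.          */
    static void sb_refill(uint64_t first) {
        for (int i = 0; i < SB_DEPTH; i++) {
            sbuf[(sb_head + i) % SB_DEPTH].line_addr = first + i;
            sbuf[(sb_head + i) % SB_DEPTH].valid     = 1;
        }
    }

    /* Called on a cache miss for `line_addr`; returns 1 if the head of the
       stream buffer supplies the line.                                         */
    static int sb_on_miss(uint64_t line_addr) {
        if (sbuf[sb_head].valid && sbuf[sb_head].line_addr == line_addr) {
            int freed = sb_head;                          /* head hit: move the line   */
            sbuf[freed].valid = 0;                        /* into the cache            */
            sb_head = (sb_head + 1) % SB_DEPTH;
            sbuf[freed].line_addr = line_addr + SB_DEPTH; /* top up the FIFO with the  */
            sbuf[freed].valid     = 1;                    /* next sequential line      */
            return 1;
        }
        sb_refill(line_addr + 1);                         /* head also missed: flush   */
        return 0;                                         /* and restart the stream    */
    }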



Stride Prefetching

  • Types of access in a loop nest:

    • Scalar: reference to scalar variables

    • Zero stride: access to arrays in outer loops

    • Constant stride: regular accesses to arrays

    • Irregular: access through pointers, or to arrays using complex expressions for index;

      • Example: A[i1,B[i2]]

Baer, p. 223



Stride Prefetching - Example

  • Consider the string of references a, b, c, … such that the distance between consecutive references is constant

    • How many references must be observed to detect that this is a strided access?

      • three

    • What is the value of the stride?

      b-a

  • Long cache lines do not help if stride is large.

Baer, p. 223



Stride Prefetching – Design Decisions

  • Where?

    • Cache × Stream Buffers

  • When?

    • How much lookahead?

  • How much?

    • How many references to prefetch?

Baer, p. 223



Reference Prediction Table

Goal: to indicate whether prefetching should be done for a given memory access operation.

(Figure: the RPT is indexed by the instruction address; each entry holds a state-transition field.)

Baer, p. 224


Reference Prediction Table

Let i be the index in the RPT of the entry that matches the tag of a memory reference instruction M, and let

    stride(M,i) = Op.address(M) − Op.address(RPT[i])

Decoding an M that is not in the table allocates a new entry in the Initial state. On each subsequent execution of M:

  • Initial: if stride(M,i) = stride[i], go to Steady; otherwise set stride[i] ← stride(M,i) and go to Transient.

  • Transient: if stride(M,i) = stride[i], go to Steady; otherwise go to No-Prediction.

  • Steady: stay in Steady while stride(M,i) = stride[i]; on a mismatch go back to Initial.

  • No-Prediction: if stride(M,i) = stride[i], go to Transient; otherwise stay in No-Prediction.

(A C sketch of this state machine follows.)

Baer, p. 224
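A C sketch of the four-state transition logic above; the entry layout, the stride-update rule, and the decision to issue prefetches only in the Steady state are assumptions of this sketch.

    #include <stdint.h>

    enum rpt_state { INITIAL, TRANSIENT, STEADY, NO_PREDICTION };

    struct rpt_entry {
        uint64_t pc_tag;      /* address of the memory instruction M           */
        uint64_t last_addr;   /* operand address of its previous execution     */
        int64_t  stride;
        enum rpt_state state;
    };

    /* Update the RPT entry when instruction `pc` accesses operand `addr`.
       Returns 1 when a prefetch of (addr + stride) should be issued.          */
    static int rpt_update(struct rpt_entry *e, uint64_t pc, uint64_t addr) {
        if (e->pc_tag != pc) {                        /* M not in the table: allocate */
            e->pc_tag = pc;  e->last_addr = addr;
            e->stride = 0;   e->state = INITIAL;
            return 0;
        }

        int64_t new_stride = (int64_t)(addr - e->last_addr);
        int match = (new_stride == e->stride);
        enum rpt_state prev = e->state;

        switch (prev) {
        case INITIAL:       e->state = match ? STEADY    : TRANSIENT;     break;
        case TRANSIENT:     e->state = match ? STEADY    : NO_PREDICTION; break;
        case STEADY:        e->state = match ? STEADY    : INITIAL;       break;
        case NO_PREDICTION: e->state = match ? TRANSIENT : NO_PREDICTION; break;
        }
        if (!match && prev != STEADY)
            e->stride = new_stride;                   /* stride[i] <- stride(M,i)     */
        e->last_addr = addr;

        return e->state == STEADY;                    /* prefetch only when steady    */
    }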



RPT Table - Example

    int A[100,100], B[100,100], C[100,100];

    for( i=0 ; i<100 ; i++)
        for( j=0 ; j<100 ; j++){
            A[i,j] = 0;
            for( k=0 ; k<100 ; k++)
                A[i,j] = A[i,j] + B[i,k] × C[k,j];
        }

Memory instructions in the inner loop:

    Address        Instruction
    ⋅⋅⋅            ⋅⋅⋅
    0x0000 0100    load B[i,k]
    0x0000 0104    load C[k,j]
    0x0000 0108    ⋅⋅⋅
    0x0000 010c    load A[i,j]

Memory layout: A starts at 0x0001 0000, B at 0x0004 0000, and C at 0x0008 0000.

Accesses for iteration i=0, j=0, k=1:

    &(A[0,0]) = 0x0001 0000
    &(B[0,1]) = 0x0004 0004
    &(C[1,0]) = 0x0008 0190

What are the RPT contents for these instructions after iterations 0, 1, and 2 of the inner loop?

Baer p. 257


RPT Table - Example (cont.)

(Figures: the RPT contents for the three load instructions before iteration (i=0, j=0, k=0) and after iterations k=0, k=1, and k=2 of the inner loop, for the code and memory layout shown above.)

Baer p. 257



Increasing RPT’s Timeliness: Lookahead RPT

Use a lookahead program counter (LA-PC). Non-memory instructions reached through the LA-PC are ignored, and the branch predictor steers the LA-PC.

The RPT must record how many instructions ahead of the PC the LA-PC is: use a counter incremented by the LA-PC and decremented by the PC.

If the LA-PC gets too far ahead, stop prefetching with the LA-PC. A branch misprediction forces LA-PC ← PC.

Baer, p. 225



Hardware Complexity of LA-PC

Two-port RPT: one port for the PC and one port for the LA-PC.

Need to check whether the line to be prefetched is already in the cache: an additional port in the cache.

Baer, p. 225



Stream Buffer Extensions

  • Multiple buffers

    • IBM POWER4 has eight

    • When all are busy, replace LRU buffer

  • Filtering:

    • Only prefetch on misses to consecutive blocks

      • An extension of one-block lookahead

  • Nonunit stride:

    • Use finite-state machines for different regions of state space

Baer, p. 226



Correlation Prefetches

  • Use a Markov model to find correlations between misses

  • Dead line: a line that is still in the cache but that will no longer be used.

  • Alternative to miss-based Markov model: Use a predictor to predict when a line becomes dead.

Baer, p. 227



Prefetching for Pointer-Based Structures

Artour Stoutchinin, José Nelson Amaral, Guang R. Gao, Jim Dehnert, Suneel Jain, and Alban Douillet, “Speculative Pointer Prefetching of Induction Pointers,” Compiler Construction, Genova, Italy, April, 2001, pp. 289-303.

(Figure: a singly linked list; each node has a key field and a next pointer.)

Motivation: often the elements of a linked list are allocated either contiguously in memory, or with a constant amount of allocation between consecutive elements.



Prefetching for Pointer-Based Structures - Example

Pointer-chasing loop:

    max = 0;
    current = head;
    while(current != NULL) {
        if(current != NULL) {
            max = current->key;
            current = current->next;
        }
    }

With prefetch:

    max = 0;
    current = head;
    tmp = current;
    while(current != NULL) {
        if(current != NULL) {
            max = current->key;
            current = current->next;
            stride = current - tmp;
            prefetch(current + stride*k);
            tmp = current;
        }
    }

Stoutchinin et al., CC 2001



Prefetching for Pointer-Based Structures – Instruction Examples

Source loop:

    for(…){
        node = ptr->next;
        ptr = node->ptr;
    }

generated loads:

    for(…){
        r1 ← load(r2, offset_next);
        r2 ← load(r1, offset_ptr);
    }

Source loop:

    for(…){
        father = father->pred;
    }

generated loads:

    for(…){
        r2 ← r1;
        r1 ← load(r2, offset_pred);
    }

20% performance improvement for mcf on a MIPS R10000 machine with a 32-KB, 2-way associative, non-blocking L1 and a 1-MB L2.

Stoutchinin et al., CC 2001



Lockup-free Caches

  • Hit-under-miss policy: allows subsequent references to proceed while handling a miss.

  • Lockup-free or nonblocking caches allow several concurrent misses.

Assuming that write misses and dirty-line replacements are handled by write buffers, we can focus only on read misses.

Baer, p. 229



Implementation of Lockup-Free Caches

  • Associate a missing status holding register (MSHR) with each read miss.

If there is no free MSHR to give to a miss, there is a structural hazard → stall the processor.

On a miss, the line addresses of all MSHRs are checked to see whether a request for that miss has already been issued.

Baer, p. 229



Implementation of Lockup-Free Caches

  • Associate a missing status holding register (MSHR) with each read miss.

When an entry arrives from the next level, its address is compared with all MSHRs; one must match the entry. This comparison is also necessary for cache coherence protocols. (A C sketch of the MSHR check follows.)

Baer, p. 229
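A C sketch of the MSHR check on a read miss described in the last two slides; the MSHR fields and the merging behaviour are simplified assumptions of this sketch.

    #include <stdint.h>

    #define N_MSHR 8

    struct mshr {
        int      valid;
        uint64_t line_addr;     /* address of the outstanding missing line          */
        int      dest_reg;      /* destination of the word once the line arrives    */
    };
    static struct mshr mshrs[N_MSHR];

    /* Called on a read miss.  Returns 0 when the processor must stall
       (structural hazard: no free MSHR for a new outstanding miss).               */
    static int handle_read_miss(uint64_t line_addr, int dest_reg) {
        int i, free_slot = -1;

        for (i = 0; i < N_MSHR; i++) {
            if (mshrs[i].valid && mshrs[i].line_addr == line_addr)
                return 1;       /* a request for this line is already outstanding;
                                   a real MSHR would also record this destination   */
            if (!mshrs[i].valid && free_slot < 0)
                free_slot = i;
        }
        if (free_slot < 0)
            return 0;           /* all MSHRs busy: structural hazard, stall          */

        mshrs[free_slot].valid     = 1;          /* allocate an MSHR and issue the   */
        mshrs[free_slot].line_addr = line_addr;  /* request to the next level        */
        mshrs[free_slot].dest_reg  = dest_reg;
        /* issue_request_to_next_level(line_addr);  -- assumed helper                */
        return 1;
    }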



Critical Word First

  • L2 first sends to L1 the word requested by the processor.

    • If this word is in the middle of a line, the line must be rotated by L2

  • Only requires a buffer and a shifter

Baer, p. 231



Multiple write misses

  • When a large structure or array is initialized or copied

    • A large number of write misses occur

    • With a write-back/write-allocate strategy the write buffer will fill

      • processor has to stall because of structural hazard

    int A[1000000], B[1000000];
    ⋅⋅⋅
    for(i=0 ; i<N ; i++){
        A[i] = B[i];
    }

    int A[1000000];
    for(i=0 ; i<N ; i++){
        A[i] = 0;
    }

Baer, p. 231



Write Validate Policy

  • Upon a write miss, write directly into the cache.

    • Requires a valid bit per word in the line

    • If one word is written, all other words in the line must be invalidated (the tag of the entry changed)

    • Best for write-through caches

Baer, p. 232



Multilevel Inclusion Property

The contents of L2 are a superset of the contents of L1.

For a write-back policy: space has been allocated in L2 to store the contents of L1.

Baer, p. 232



Multilevel Inclusion Property (example)

  • Assumptions for the example:

    • single processor

    • write-back write policy

    • L1 and L2 have the same line sizes

    • Both L1 and L2 are 2-way set associative with LRU

    • a, a’ and b are lines in L1

    • A, A’ and B are lines in L2

Baer, p. 232


Multilevel Inclusion Property (example)

Sequence of reads: a, b, a', b, a. On each miss, the new line is allocated in both caches.

(Figure: the contents and LRU bits of L1 — lines a, a', b — and of L2 — lines A, A', B — after the sequence, with b marked invalid.)

B was replaced in L2; therefore b has to be invalidated in L1.

Baer, p. 233



Multilevel Inclusion Property – Support for Invalidation

When an invalidation request for a line is received, a check is done in L2. If the line is not in L2, L1 does not need to be disturbed. If the line is in L2, a bit can be used to indicate whether the entry is also in L1.

If the line of L2 is larger than the line of L1, then multiple inclusion bits are necessary in each L2 line.

The associativity of L2 must be high enough that it can simultaneously map all the lines in L1 (see the example in the book).

Baer, p. 233



Multilevel Exclusion

  • A block is in either L1 or L2, but not in both.

    • Alternative when multilevel inclusion is not enforced.

    • L2 is simply a huge victim cache for L1

    • To prevent coherency operations from disturbing the use of L1 by the processor, L1 must be dual tagged.

Baer, p. 234



Replacement Algorithms – Variations on LRU

Baer, p. 235



Tree-Based Algorithm

  • Example: 4-way set-associative cache, as used in the L2 cache of the IBM Power4.

(Figure: the tree bits of one set after the reference sequence b, c, a, d to ways a, b, c, d.)

Requires n−1 bits per set for an n-way set-associative cache. (A C sketch follows.)

Baer, p. 235
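A sketch of tree-based pseudo-LRU for one 4-way set, using the n−1 = 3 bits mentioned above; the particular bit encoding is an assumption of this sketch.

    #include <stdint.h>

    /* Tree pseudo-LRU for one 4-way set: 3 bits (n-1 bits for n ways).
       bit0 selects the pair holding the next victim ({0,1} or {2,3});
       bit1 selects the victim inside {0,1}; bit2 inside {2,3}.          */

    static void plru_touch(uint8_t *bits, int way) {
        if (way < 2) {
            *bits |= 1u;                         /* victim now on the {2,3} side   */
            if (way == 0) *bits |= 2u;           /* inside {0,1}, way 1 is older   */
            else          *bits &= ~2u;          /* inside {0,1}, way 0 is older   */
        } else {
            *bits &= ~1u;                        /* victim now on the {0,1} side   */
            if (way == 2) *bits |= 4u;           /* inside {2,3}, way 3 is older   */
            else          *bits &= ~4u;          /* inside {2,3}, way 2 is older   */
        }
    }

    static int plru_victim(uint8_t bits) {
        if (bits & 1u)  return (bits & 4u) ? 3 : 2;
        else            return (bits & 2u) ? 1 : 0;
    }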



Half-Random Selection

  • Divide entries in a set into two halves

    • Use a single bit to remember the LRU half

    • Replace randomly from the LRU half

  • Overall performance is as good as LRU



Impact of L2 Misses - Example

What is the IPC with L2 misses? What is the percentage IPC improvement with 10% fewer L2 misses?

CPI(with L2) = CPI(orig) + CPI(L2)

With the original miss rate:  CPI(with L2) = 0.5 + 0.01 × 100 = 1.5, so IPC(with L2) = 1/1.5 = 0.67.

With 10% fewer L2 misses:  CPI(with L2) = 0.5 + 0.009 × 100 = 1.4, so IPC(with L2) = 1/1.4 = 0.71, roughly a 7% IPC improvement.

Baer, p. 236



Sector Caches

On a miss, replace one subblock and invalidate all the others.

Smaller area for tags for a given capacity. Smaller miss penalty.

Baer, p. 239


Sector Caches Today

In the L3 of the IBM Power4: an 8-way set-associative cache with 512-byte lines and four 128-byte subblocks per line.

The tags of the L3 are kept inside the chip: less area for tags.

The subblock is the unit of coherence ⇒ reduces false sharing.

Baer, p. 239



Non-Uniform Cache Access

With very large L2 caches, the time to access entries close to the processor is shorter than for entries far away from it. Divide L2 into banks and connect the banks through an interconnection network.

Baer, p. 240



Non-Uniform Cache Access (Example)

Each column corresponds to one way in an n-way associative cache. Within a column, banks are searched sequentially.

It is best to have the requested (MRU) entry in the top bank, closest to the processor. On a hit, the line is swapped with the bank above it. On a miss, the LRU line will be at the bottom bank, so a new entry ends up far from the processor.

Baer, p. 240


IBM Power4

  • Power4 (1.7 GHz) and Power5 (1.9 GHz) are 8-way out-of-order superscalars. Power6 (5 GHz, 2007) is in-order, with the same energy dissipation.

  • Each of the two processors can run two threads simultaneously.

  • 80 physical integer registers, 72 physical FP registers.

  • 80-instruction ROB; load speculation.

  • Branch predictor: a local predictor, a global predictor, and a predictor of predictors.

  • Distributed windows: one queue per instruction type, 11 FIFO queues, up to 80 instructions may queue.

Baer, p. 241


IBM Power4/Power5 (parameters listed as Power4/Power5)

  • L1 I-Cache: 64 KB, direct-mapped/2-way, 128-byte lines, LRU, 4 sectors; 1/1 cycle.

  • L1 D-Cache: 32 KB, 2-way/4-way, 128-byte lines, write-through, LRU; 1/1 cycle.

  • L2: 1.5 MB/2 MB, 8-way/10-way, 128-byte lines, writeback, pseudo-LRU; 12/13 cycles.

  • L3: 32 MB/36 MB, 8-way/12-way, 512-byte lines, writeback, replacement ???, 4 sectors; 123/87 cycles. The L3 data is in memory in the Power4 and in the processor in the Power5.

  • Memory: 351/220 cycles.

Baer, p. 242


IBM Power4/Power5

128-byte lines divided into four 32-byte sectors.

The I1 cache is single-ported: 32 bytes/cycle ≡ 1 sector ≡ 8 instructions.

The D1 cache has three ports: two 8-byte reads and one 8-byte write per cycle.

4-line prefetch buffer: fetched lines go into the buffer and the critical sector is sent to the pipeline; in later cycles (on other misses) the line is transferred into the cache.

Prefetch: a hit in the prefetch buffer ⇒ prefetch the next line; a miss in the prefetch buffer ⇒ prefetch the two subsequent lines.

Baer, p. 242


IBM Power4/Power5 – L2

Multilevel inclusion: L2 tags contain bits to indicate in which L1 cache the entry is.

8-way set associative; 512 KB of data per slice; 4 banks of SRAM per slice; each bank may supply 32 bytes/cycle to L1.

Replacement: tree-based pseudo-LRU. The tag array is parity-protected; the data is ECC-protected. A duplicated tag array is kept for coherence (snooping).

  • Coherence:

    • with the L1s (4 coherency processors per slice)

    • with other chip multiprocessors (4 coherency processors per slice)

  • Queues in each slice:

    • one store queue for each processor (stores from L1)

    • a writeback queue for dirty lines (stores to L3)

    • several MSHRs for pending requests to L3

    • a queue for snooping requests

Two noncacheable units (one per processor), e.g. to handle memory-mapped I/O.

Baer, p. 243


IBM Power4/Power5 – L3

4 quadrants; 2 banks/quadrant; 4 MB of embedded DRAM per bank. Sector cache with 4 sectors.

Controller: 4 quadrants; 2 coherence processors per quadrant and 1 simple processor per quadrant for writebacks and DMA.

Does not enforce inclusion.

Baer, p. 244



Prefetching

IBM Power4/Power5

Hardware prefetching at all levels. Eight streams of sequential data can be prefetched concurrently; prefetching of a stream starts after four consecutive cache misses in various levels. A touch instruction can accelerate stream detection.

Baer, p. 245



Stream Prefetching

IBM Power4/Power5

L1: the first access to a prefetched line B0 ⇒ prefetch line B1 = B0 + 128 from L2 to L1.

⇒ L2: prefetch line B5 = B0 + 512 from L3 to L2 (accounts for the longer latency of L2).

⇒ L3: every 4th time, prefetch lines B17, B18, B19, B20 from memory to L3 (accounts for the longer latency of L3 and the longer lines in L3).

Prefetching stops at the end of the virtual memory page (the machine has 4 KB and 16 MB pages).

Baer, p. 245

