
Advanced Microarchitecture

Lecture 12: Caches and Memory

SRAM {Over|Re}view
  • Chained inverters maintain a stable state
  • Access gates provide access to the cell
  • Writing to a cell involves over-powering the two small storage inverters

Figure: the "6T SRAM" cell: two cross-coupled inverters (2T per inverter) plus 2 access gates connecting the cell to the bitlines b and b̄.


64×1-bit SRAM Array Organization

Figure: the 64 cells are arranged in a grid. One 1-of-8 decoder drives a "wordline" to select a row, and another 1-of-8 decoder controls the "column mux" over the "bitlines".

Why are we reading both b and b̄?
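A minimal sketch (mine, not from the slides) of how a 6-bit address into this 64×1 array would split across the two 1-of-8 decoders; which bits feed the row decoder versus the column mux is an assumption for illustration:

    #include <stdio.h>

    int main(void) {
        unsigned addr = 42;              /* any address in 0..63 */
        unsigned row  = (addr >> 3) & 7; /* upper 3 bits: 1-of-8 wordline decoder */
        unsigned col  = addr & 7;        /* lower 3 bits: 1-of-8 column mux select */
        printf("addr %u -> wordline %u, column %u\n", addr, row, col);
        return 0;
    }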


SRAM Density vs. Speed
  • The 6T cell must be as small as possible to allow dense storage
    • Bigger caches
    • Smaller transistors → slower transistors
  • So the dinky cell inverters cannot drive their outputs very quickly, especially onto a *long* bitline (a metal line with a lot of parasitic loading)


Sense Amplifiers
  • A type of differential amplifier
    • Two inputs (X and Y); amplifies the difference: output ≈ a × (X – Y) + Vbias

Figure: with the wordline enabled, the small cell discharges its bitline very slowly, but the sense amp "sees" the difference between b and b̄ quickly and outputs b's value. The bitlines are precharged to Vdd; sometimes they are precharged to Vdd/2 instead, which makes a bigger "delta" for faster sensing.


Multi-Porting

Figure: a two-ported SRAM cell with a wordline per port (Wordline1, Wordline2) and a bitline pair per port (b1/b̄1, b2/b̄2), so 2 wordlines and 4 bitlines for this cell.
  • Wordline and bitline counts grow with the number of ports
  • Area = O(ports²)


Port Requirements
  • ARF, PRF, RAT all need many read and write ports to support superscalar execution
    • Luckily, these have a limited number of entries/bytes
  • Caches also need multiple ports
    • Not as many ports
    • But the overall size is much larger


Delay Of Regular Caches
  • I$
    • low port requirement (one fetch group/$-line per cycle)
    • latency only exposed on branch mispredict
  • D$
    • higher port requirement (multiple LD/ST per cycle)
    • latency often on critical path of execution
  • L2
    • lower port requirement (most accesses hit in L1)
    • latency less important (only observed on L1 miss)
    • optimizing for hit rate usually more important than latency
      • difference between L2 latency and DRAM latency is large


Banking

Figure: a big 4-ported L1 data cache (one large SRAM array with four decoders) is slow due to quadratic area growth. Splitting it into 4 banks, 1 port each, where every bank has its own decoder, SRAM array, sense amps, and column muxing, makes each bank much faster.


Bank Conflicts
  • Banking provides high bandwidth
  • But only if all accesses are to different banks
  • Banks are typically address-interleaved (see the sketch below)
    • For N banks
    • Addr → bank[Addr % N]
      • Addr on cache line granularity
    • For 4 banks, 2 accesses, chance of conflict is 25%
    • Need to match # banks to access patterns/BW
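A minimal sketch of the interleaving above. The 64-byte line size and 4 banks are assumptions for illustration; the bank index is taken from the cache-line address, and two simultaneous accesses conflict when they pick the same bank.

    #include <stdio.h>

    #define LINE_SIZE 64   /* assumed cache-line size */
    #define N_BANKS   4

    /* Addr -> bank[Addr % N], on cache-line granularity. */
    static unsigned bank_of(unsigned long addr) {
        return (addr / LINE_SIZE) % N_BANKS;
    }

    int main(void) {
        unsigned long a = 0x1000, b = 0x1040;  /* two example load addresses */
        if (bank_of(a) == bank_of(b))
            printf("bank conflict: both map to bank %u\n", bank_of(a));
        else
            printf("no conflict: banks %u and %u\n", bank_of(a), bank_of(b));
        return 0;
    }

With 4 banks, a second independent access lands in the same bank as the first with probability 1/4, which is the 25% conflict chance quoted above.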


Associativity
  • You should know this already

Figure: looking up tag "foo" to get foo's value in three organizations: direct mapped (RAM), fully associative (CAM), and set associative (a CAM/RAM hybrid?).


Set-Associative Caches
  • Set-associativity good for reducing conflict misses
  • Cost: slower cache access
    • often dominated by the tag array comparisons
    • Basically mini-CAM logic
  • Must trade off:
    • Smaller cache size
    • Longer latency
    • Lower associativity
  • Every option hurts performance

Figure: the per-way tag comparison ("=" for each way) is a 40-50 bit comparison!
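A minimal sketch of the set-associative lookup described above; the 4 ways, 64 sets, and 64-byte lines are assumptions, not values from the slides. The tag equality test inside the loop is the wide comparison the figure points out.

    #include <stdint.h>
    #include <stdbool.h>

    #define WAYS      4
    #define SETS      64
    #define LINE_BITS 6    /* 64-byte lines */
    #define SET_BITS  6    /* 64 sets */

    typedef struct {
        bool     valid;
        uint64_t tag;
    } tag_entry_t;

    static tag_entry_t tags[SETS][WAYS];

    /* Returns the matching way, or -1 on a miss. */
    static int cache_lookup(uint64_t addr) {
        uint64_t set = (addr >> LINE_BITS) & (SETS - 1);
        uint64_t tag = addr >> (LINE_BITS + SET_BITS);
        for (int w = 0; w < WAYS; w++)
            if (tags[set][w].valid && tags[set][w].tag == tag)
                return w;   /* the wide (40-50 bit) tag compare hit */
        return -1;
    }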


Way-Prediction
  • If figuring out the way takes too long, then just guess!

Figure: a way predictor ("Way Pred") indexed by the load's PC supplies a guessed way so the data array can be read immediately; the tag check still occurs to validate the way prediction.

  • May be hard to predict way if the same load accesses different addresses
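A minimal sketch of a PC-indexed way predictor; the table size and the simple last-way policy are assumptions. The predicted way is read first, and the full tag check later confirms the guess or forces a replay.

    #include <stdint.h>

    #define PRED_ENTRIES 1024   /* assumed predictor size */

    static uint8_t way_pred[PRED_ENTRIES];   /* predicted way per load PC */

    static unsigned predict_way(uint64_t load_pc) {
        return way_pred[(load_pc >> 2) % PRED_ENTRIES];
    }

    /* Train once the full tag comparison has identified the real way.
     * A wrong guess means the speculatively read data must be replayed. */
    static void train_way_pred(uint64_t load_pc, unsigned actual_way) {
        way_pred[(load_pc >> 2) % PRED_ENTRIES] = (uint8_t)actual_way;
    }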


Way-Prediction (2)
  • Organize the data array such that the left-most way holds the MRU block
  • Way-predict the MRU way: as long as accesses keep hitting the MRU position, way-prediction continues to hit
  • On a way-miss (cache hit), move the block to the MRU position, as sketched below
  • Complication: the data array needs a datapath for swapping blocks (maybe 100's of bits); normally you would just update a few LRU bits in the tag array (< 10 bits?)
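A minimal sketch of that MRU promotion, assuming a 4-way set with way 0 as the MRU position; the block shuffle below is the 100's-of-bits swap the slide warns about.

    #include <stdint.h>

    #define WAYS      4
    #define LINE_SIZE 64

    typedef struct {
        uint64_t tag;
        uint8_t  data[LINE_SIZE];
    } line_t;

    /* Way-miss but cache hit: move the hitting block into way 0 (MRU) and
     * shift the others down, so the predictor can keep guessing way 0. */
    static void promote_to_mru(line_t set[WAYS], int hit_way) {
        line_t tmp = set[hit_way];
        for (int w = hit_way; w > 0; w--)
            set[w] = set[w - 1];
        set[0] = tmp;
    }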


Partial Tagging
  • Like BTBs, just use part of the tag
  • The tag array lookup is now much faster!
  • Partial tags lead to false hits: tag 0x45120001 looks like a hit for address 0x3B120001
  • Similar to way-prediction, the full tag comparison is still needed to verify the "real" hit, but it is not on the critical path (see the sketch below)
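A minimal sketch of the two-step check; the 20-bit partial tag width is an assumption chosen so the slide's example produces a false hit.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define PARTIAL_BITS 20
    #define PARTIAL_MASK ((1u << PARTIAL_BITS) - 1)

    /* Fast, narrow comparison used to declare a likely hit early. */
    static bool partial_match(uint64_t stored_tag, uint64_t lookup_tag) {
        return (stored_tag & PARTIAL_MASK) == (lookup_tag & PARTIAL_MASK);
    }

    /* Slow full comparison, off the critical path, that catches false hits. */
    static bool full_match(uint64_t stored_tag, uint64_t lookup_tag) {
        return stored_tag == lookup_tag;
    }

    int main(void) {
        uint64_t stored = 0x45120001, lookup = 0x3B120001;
        printf("partial: %d, full: %d\n",
               partial_match(stored, lookup), full_match(stored, lookup));
        /* prints "partial: 1, full: 0": a false hit caught by the full check */
        return 0;
    }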


… in the LSQ
  • Partial tagging can be used in the LSQ as well

  • Do the address check on partial addresses only
  • On a partial hit, forward the data
  • The slower, complete tag check verifies the match/no-match; replay or flush as needed
  • If a store finds a later partially-matched load, don't do the pipeline flush right away: the penalty is too severe, so wait for the slow check before flushing the pipe


Interaction With Scheduling
  • Bank conflicts, way-mispredictions, partial-tag false hits
    • All change the latency of the load instruction
    • Increases frequency of replays
      • more “replay conditions” exist/encountered
    • Need careful tradeoff between
      • performance (reducing effective cache latency)
      • performance (frequency of replaying instructions)
      • power (frequency of replaying instructions)


Alternatives to Adding Associativity
  • More set-associativity is needed when the number of items mapping to the same cache set > the number of ways
  • Not all sets suffer from high conflict rates
  • Idea: provide a little extra associativity, but not for each and every set


Victim Cache

Figure: a stream of accesses to blocks A, B, C, D, E (all in one set) and J, K, L, M, N (all in another set) of a 4-way set-associative cache, with a small victim cache alongside holding blocks evicted from the main cache (other sets hold X, Y, Z and P, Q, R).
  • Without it, every access is a miss! ABCED and JKLMN do not "fit" in a 4-way set-associative cache
  • The victim cache provides a "fifth way", so long as only four sets overflow into it at the same time
  • It can even provide 6th or 7th … ways
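A minimal sketch of the victim-cache behavior on a main-cache miss, with an assumed 4-entry fully associative victim cache and a simple rotating replacement policy.

    #include <stdint.h>
    #include <stdbool.h>

    #define VC_ENTRIES 4   /* assumed victim-cache size */

    typedef struct { bool valid; uint64_t tag; } vc_entry_t;
    static vc_entry_t victim[VC_ENTRIES];

    /* Probe the victim cache after a main-cache miss; on a hit the block
     * would be swapped back into the main cache (swap omitted here). */
    static int victim_lookup(uint64_t line_addr) {
        for (int i = 0; i < VC_ENTRIES; i++)
            if (victim[i].valid && victim[i].tag == line_addr)
                return i;
        return -1;
    }

    /* Install a block evicted from the main cache; it becomes the "fifth way"
     * for whichever set it came from. */
    static void victim_insert(uint64_t line_addr) {
        static int next;
        victim[next].valid = true;
        victim[next].tag   = line_addr;
        next = (next + 1) % VC_ENTRIES;
    }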


Skewed Associativity

Figure: the same blocks A, B, C, D, W, X, Y, Z placed in a regular set-associative cache and in a skewed-associative cache. In the regular cache the conflicting blocks pile into the same sets: lots of misses. In the skewed-associative cache, each way indexes the array with a different hash of the address, so the same blocks spread out: fewer misses.
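A minimal sketch of per-way index hashing; the hash below is a placeholder for illustration, not the function from any particular skewed-cache proposal.

    #include <stdint.h>

    #define SETS 64

    /* Each way uses a different index hash, so two lines that conflict in
     * way 0 usually land in different sets of way 1. */
    static unsigned skew_index(uint64_t line_addr, int way) {
        uint64_t x = line_addr;
        if (way == 1)
            x ^= x >> 6;   /* extra mixing for way 1 only */
        return (unsigned)(x % SETS);
    }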


Required Associativity Varies
  • Program stack needs very little associativity
    • spatial locality
      • stack frame is laid out sequentially
      • a function usually only refers to its own stack frame

Figure: a call stack of frames f(), g(), h(), j(), k() and their layout in a 4-way cache. The frame addresses are laid out in a linear organization (MRU through LRU), so the associativity is not being used effectively.


Stack Cache

Figure: splitting the L1 into a "regular" cache and a stack cache. The "nice" stack accesses (frames f(), g(), h(), j(), k()) go to the stack cache, while the disorganized heap accesses go to the regular cache; mixed into a single cache they cause lots of conflicts!


Stack Cache (2)
  • Stack cache portion can be a lot simpler due to direct-mapped structure
    • relatively easily prefetched for by monitoring call/retn’s
  • “Regular” cache portion can have lower associativity
    • doesn’t have conflicts due to stack/heap interaction


Stack Cache (3)
  • Which cache does a load access?
    • Many ISA’s have a “default” stack-pointer register

Figure: routing example. LDQ 0[$sp], LDQ 12[$sp], and LDQ 24[$sp] name the stack pointer directly, but after MOV $t3 = $sp the load LDQ 8[$t3] is also a stack access while LDQ 0[$t1] is not, so routing on the register name alone can send an access to the wrong cache (X).
  • Need stack base and offset information, and then need to check each cache access against these bounds
  • Wrong cache → replay
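A minimal sketch of that bounds check; stack_base and stack_limit are assumed to be tracked by the hardware from the stack-pointer value, and a mispredicted route results in a replay.

    #include <stdint.h>
    #include <stdbool.h>

    static uint64_t stack_limit;   /* lowest stack address (stack grows down) */
    static uint64_t stack_base;    /* highest stack address */

    typedef enum { STACK_CACHE, REGULAR_CACHE } route_t;

    /* Route an access by checking its effective address against the bounds. */
    static route_t route_access(uint64_t effective_addr) {
        bool in_stack = effective_addr >= stack_limit &&
                        effective_addr <  stack_base;
        return in_stack ? STACK_CACHE : REGULAR_CACHE;
    }

    /* If an early guess disagrees with the bounds check, replay the access. */
    static bool needs_replay(route_t guessed, uint64_t effective_addr) {
        return guessed != route_access(effective_addr);
    }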


Multi-Lateral Caches
  • Normal cache is “uni-lateral” in that everything goes into the same place
  • Stack cache is an example of “multi-lateral” caches
    • multiple cache structures with disjoint contents
    • I$ vs. D$ could be considered multi-lateral


Access Patterns
  • Stack cache showed how different loads exhibit different access patterns

  • Stack: multiple push/pop's of frames
  • Heap: heavily data-dependent access patterns
  • Streaming: linear accesses with low/no reuse


Low-Reuse Accesses
  • Streaming
    • once you're done decoding an MPEG frame, no need to revisit it
  • Other: for example, a field like parent->valid below is accessed once and then not used again, and the fields map to different cache lines

    struct tree_t {
        int valid;
        int other_fields[24];
        int num_children;
        struct tree_t * children;
    };

    while(some condition) {
        struct tree_t * parent = getNextRoot(…);
        if(parent->valid) {
            doTreeTraversalStuff(parent);
            doMoreStuffToTree(parent);
            pickFruitFromTree(parent);
        }
    }


Filter Caches
  • Several proposed variations
    • annex cache, pollution control cache, etc.
  • Fill on miss: first-time misses are placed in the small filter cache
  • If a line is accessed again, promote it to the main cache; if not, it is eventually LRU'd out of the filter cache
  • The main cache then only contains lines with proven reuse; one-time-use lines have been filtered out
  • Can be thought of as the "dual" of the victim cache
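A minimal sketch of the fill/promote policy described above; the direct-mapped 64-entry filter and the second-touch promotion rule are assumptions, and the main-cache and memory interfaces are stubs.

    #include <stdint.h>
    #include <stdbool.h>

    #define FILTER_SETS 64   /* assumed small, direct-mapped filter cache */

    typedef struct { bool valid; uint64_t tag; } fline_t;
    static fline_t filter[FILTER_SETS];

    static void main_cache_insert(uint64_t line) { (void)line; /* stub */ }
    static void memory_fetch(uint64_t line)      { (void)line; /* stub */ }

    /* Called on a main-cache miss. */
    static void filter_access(uint64_t line_addr) {
        fline_t *e = &filter[line_addr % FILTER_SETS];
        if (e->valid && e->tag == line_addr) {
            /* Second touch: reuse proven, promote into the main cache. */
            main_cache_insert(line_addr);
            e->valid = false;
        } else {
            /* First-time miss: fetch, but place only in the filter cache. */
            memory_fetch(line_addr);
            e->tag   = line_addr;
            e->valid = true;
        }
    }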


Trouble w/ Multi-Lateral Caches
  • More complexity
    • load may need to be routed to different places
      • may require some form of prediction to pick the right one
        • guessing wrong can cause replays
      • or accessing multiple in parallel increases power
        • no bandwidth benefit
    • more sources to bypass from
      • costs both latency and power in bypass network


Memory-Level Parallelism (MLP)
  • What if memory latency is 10000 cycles?
    • Not enough traditional ILP to cover this latency
    • Runtime dominated by waiting for memory
    • What matters is overlapping memory accesses
  • MLP: “number of outstanding cache misses [to main memory] that can be generated and executed in an overlapped manner.”
  • ILP is a property of a DFG; MLP is a metric
    • ILP is independent of the underlying execution engine
    • MLP is dependent on the microarchitecture assumptions
    • You can measure MLP for uniprocessor, CMP, etc.


uArchs for MLP
  • WIB – Waiting Instruction Buffer

Figure: a scheduler without and with a WIB. Without it, after a load miss no instructions in the load's forward slice can execute; eventually all independent insts issue and the scheduler contains only insts in the forward slice… stalled. With a WIB, the forward slice is moved to the separate buffer, independent insts continue to issue, new insts keep the scheduler busy, and other independent load misses are eventually exposed (MLP).


WIB Hardware
  • Similar to replay – continue issuing dependent instructions, but need to shunt to the WIB
  • WIB hardware can potentially be large
    • WIB doesn’t do scheduling – no CAM logic needed
  • Need to redispatch from WIB back into RS when load comes back from memory
    • like redispatching from replay-queue
