The Story so far:

  • Instruction Set Architectures

  • Performance issues

  • ALUs

  • Single Cycle CPU

  • Multicycle CPU: datapath, control, exceptions

  • Pipelining

  • Memory systems

    • Static/Dynamic RAM technologies

    • Cache structures: Direct mapped, Associative

    • Virtual Memory


Memory systems


Memory Systems

[Diagram: the five classic components of a computer: Control, Datapath, Memory, Input, Output.]


Technology Trends (from 1st lecture)

            Capacity           Speed (latency)
Logic:      2x in 3 years      2x in 3 years
DRAM:       4x in 3 years      2x in 10 years
Disk:       4x in 3 years      2x in 10 years

DRAM generations:

Year    Size      Cycle Time
1980    64 Kb     250 ns
1983    256 Kb    220 ns
1986    1 Mb      190 ns
1989    4 Mb      165 ns
1992    16 Mb     145 ns
1995    64 Mb     120 ns

Capacity grew 1000:1 over this period; cycle time improved only 2:1!


Who Cares About the Memory Hierarchy?

Processor-DRAM Memory Gap (latency)

[Chart: performance (log scale, 1 to 1000) vs. time, 1980-2000. µProc ("Moore's Law"): 60%/yr (2X/1.5 yr). DRAM: 9%/yr (2X/10 yrs). The Processor-Memory Performance Gap grows 50% / year.]


Today’s Situation: Microprocessor

  • Processor Speeds

    • Intel Pentium III: 4 GHz = 0.25 ns per clock

  • Memory Speeds

    • Mac/PC/Workstation DRAM: 50 ns

    • Disks are even slower....

  • How can we span this “access time” gap?

    • 1 instruction fetch per instruction

    • 15 of every 100 instructions also do a data read or write (load or store)

  • Rely on caches to bridge gap

  • Microprocessor-DRAM performance gap

    • time of a full cache miss in instructions executed

      1st Alpha (7000):   340 ns / 5.0 ns =  68 clks x 2, or 136 instructions

      2nd Alpha (8400):   266 ns / 3.3 ns =  80 clks x 4, or 320 instructions

      3rd Alpha (t.b.d.): 180 ns / 1.7 ns = 108 clks x 6, or 648 instructions

    • 1/2X latency x 3X clock rate x 3X Instr/clock ≈ 5X


Impact on Performance

  • Suppose a processor executes at

    • Clock Rate = 200 MHz (5 ns per cycle)

    • CPI = 1.1

    • 50% arith/logic, 30% ld/st, 20% control

  • Suppose that 10% of memory ops get 50 cycle miss penalty

  • CPI = ideal CPI + average stalls per instruction = 1.1 (cycles/instr) + (0.30 data-mem ops/instr x 0.10 misses/op x 50 cycles/miss) = 1.1 + 1.5 = 2.6 cycles

  • 58% of the time the processor is stalled waiting for memory!

  • A 1% instruction miss rate would add an additional 0.5 cycles to the CPI!
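
A minimal sketch of this stall model in plain Python (all numbers are the slide's own assumptions):

    # Stall-cycle model from this slide.
    base_cpi = 1.1        # ideal CPI with a perfect memory system
    data_ops = 0.30       # fraction of instructions that are loads/stores
    miss_rate = 0.10      # fraction of memory ops that miss
    miss_penalty = 50     # cycles per miss

    mem_stalls = data_ops * miss_rate * miss_penalty   # 1.5 cycles/instr
    total_cpi = base_cpi + mem_stalls                  # 2.6
    print(total_cpi, mem_stalls / total_cpi)           # 2.6, ~0.58 stalled

    # Adding a 1% instruction-fetch miss rate (1 fetch per instruction):
    print(1.0 * 0.01 * miss_penalty)                   # +0.5 cycles of CPI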


Memory system: hierarchical

[Diagram: the processor (control + datapath) backed by a chain of memories. Moving away from the processor, Speed goes from fastest to slowest, Size from smallest to biggest, and Cost from highest to lowest.]


Why hierarchy works

  • The Principle of Locality:

    • Programs access a relatively small portion of the address space at any instant of time.

[Plot: probability of reference vs. address space (0 to 2^n - 1), concentrated over a small region.]


Locality

  • Property of memory references in "typical programs"

  • Tendency to favor a portion of their address space at any given time

  • Temporal

    • Tendency to reference locations recently referenced

  • Spatial

    • Tendency to reference locations "near" those recently referenced

[Figure: references x plotted against address space and time t; each reference makes another reference to x likely soon (temporal) and marks a "likely reference zone" around it (spatial).]


Memory Hierarchy: How Does it Work?

  • Temporal Locality (Locality in Time):

    => Keep most recently accessed data items closer to the processor

  • Spatial Locality (Locality in Space):

    => Move blocks consisting of contiguous words to the upper levels

[Figure: the processor exchanges block X with upper-level memory; block Y is brought up from lower-level memory.]


Memory Hierarchy: How Does it Work?

  • Memory hierarchies exploit locality by caching (keeping close to the processor) data likely to be used again.

  • This is done because we can build large, slow memories and small, fast memories, but we can’t build large, fast memories.

  • If it works, we get the illusion of SRAM access time with disk capacity

    SRAM access times are 2 - 25ns at cost of $100 to $250 per Mbyte.

    DRAM access times are 60-120ns at cost of $5 to $10 per Mbyte.

    Disk access times are 10 to 20 million ns at cost of $.10 to $.20 per Mbyte.


Memory Hierarchy: Terminology

  • Hit: data appears in some block in the upper level (example: Block X)

    • Hit Rate: the fraction of memory access found in the upper level

    • Hit Time: Time to access the upper level which consists of

      RAM access time + Time to determine hit/miss

  • Miss: data needs to be retrieved from a block in the lower level (Block Y)

    • Miss Rate= 1 - (Hit Rate)

    • Miss Penalty: Time to replace a block in the upper level +

      Time to deliver the block to the processor

  • Hit Time << Miss Penalty

[Figure: as before, block X sits in upper-level memory next to the processor; block Y is in lower-level memory.]


Memory Hierarchy of a Modern Computer System

  • By taking advantage of the principle of locality:

    • Present the user with as much memory as is available in the cheapest technology.

    • Provide access at the speed offered by the fastest technology.

[Diagram: hierarchy from registers through on-chip cache, second-level cache (SRAM), and main memory (DRAM) to secondary storage (disk) and tertiary storage (disk/tape). Speed (ns): 1s, 10s, 100s, 10,000,000s (10s ms), 10,000,000,000s (10s sec). Size (bytes): 100s, Ks, Ms, Gs, Ts.]


Main Memory Background

  • Performance of Main Memory:

    • Latency: Cache Miss Penalty

      • Access Time: time between request and word arrives

      • Cycle Time: time between requests

    • Bandwidth: I/O & Large Block Miss Penalty (L2)

  • Main Memory is DRAM: Dynamic Random Access Memory

    • Dynamic since needs to be refreshed periodically (8 ms)

    • Addresses divided into 2 halves (Memory as a 2D matrix):

      • RAS or Row Access Strobe

      • CAS or Column Access Strobe

  • Cache uses SRAM: Static Random Access Memory

    • No refresh (6 transistors/bit vs. 1 transistor /bit)

    • Address not divided

  • Size: DRAM/SRAM ≈ 4-8; Cost & cycle time: SRAM/DRAM ≈ 8-16


Static RAM Cell

6-Transistor SRAM Cell

[Figure: cross-coupled inverters storing complementary 0/1 values, with a word (row select) line and complementary bit / bit-bar lines; in some designs the pull-ups are replaced with resistive pullups to save area.]

  • Write:

    • 1. Drive bit lines (bit = 1, bit-bar = 0)

    • 2. Select row

  • Read:

    • 1. Precharge bit and bit-bar to Vdd

    • 2. Select row

    • 3. Cell pulls one line low

    • 4. Sense amp on column detects difference between bit and bit-bar


Typical SRAM Organization: 16-word x 4-bit

[Figure: a 16 x 4 SRAM cell array. An address decoder on A0-A3 drives word lines Word 0 ... Word 15; each of the four bit columns has a write driver & precharger on Din 3-0 (controlled by WrEn and Precharge) and a sense amp producing Dout 3-0.]


Problems with SRAM

[Figure: 6-T cell storing a "zero" with Select = 1; pull-ups P1 (off) / P2 (on) and pull-downs N1 (on) / N2 (off) drive bit = 1 and bit-bar = 0.]

  • Six transistors use up a lot of area

  • Consider a "Zero" is stored in the cell:

    • Transistor N1 will try to pull "bit" to 0

    • Transistor P2 will try to pull "bit bar" to 1

  • But bit lines are precharged to high: Are P1 and P2 necessary?


Logic Diagram of a Typical SRAM

[Figure: a 2^N-word x M-bit SRAM with N address pins (A), M data pins (D), and active-low WE_L and OE_L.]

  • Write Enable is usually active low (WE_L)

  • Din and Dout combined (D) to save pins, with a new output-enable signal (OE_L):

    • WE_L asserted, OE_L deasserted: D => data input pin

    • WE_L deasserted, OE_L asserted: D => data output pin

[Timing diagrams: a write cycle (write address on A, data in on D, WE_L pulsed, with write setup and hold times around WE_L) followed by reads (read address on A, OE_L asserted, D going from high-Z to data out after the read access time).]


1-Transistor Memory Cell (DRAM)

[Figure: a single access transistor gated by the row-select line connects the bit line to a storage capacitor.]

  • Write:

    • 1. Drive bit line

    • 2. Select row

  • Read:

    • 1. Precharge bit line to Vdd

    • 2. Select row

    • 3. Cell and bit line share charges

      • Very small voltage changes on the bit line

    • 4. Sense (fancy sense amp)

      • Can detect changes of ~1 million electrons

    • 5. Write: restore the value

  • Refresh

    • 1. Just do a dummy read to every cell.


Classical DRAM Organization (square)

[Figure: a square RAM cell array. A row decoder driven by the row address selects a word (row) line; the column selector & I/O circuits, driven by the column address, pick data off the bit (data) lines. Each intersection represents a 1-T DRAM cell.]

  • Row and Column Address together:

    • Select 1 bit at a time


Logic Diagram of a Typical DRAM

[Figure: a 256K x 8 DRAM with 9 multiplexed address pins (A), 8 data pins (D), and active-low RAS_L, CAS_L, WE_L, OE_L.]

  • Control Signals (RAS_L, CAS_L, WE_L, OE_L) are all active low

  • Din and Dout are combined (D):

    • WE_L is asserted (Low), OE_L is deasserted (High)

      • D serves as the data input pin

    • WE_L is deasserted (High), OE_L is asserted (Low)

      • D is the data output pin

  • Row and column addresses share the same pins (A)

    • RAS_L goes low: Pins A are latched in as row address

    • CAS_L goes low: Pins A are latched in as column address

    • RAS/CAS edge-sensitive


Key DRAM Timing Parameters

  • tRAC: minimum time from RAS line falling to the valid data output.

    • Quoted as the speed of a DRAM

    • A fast 4Mb DRAM tRAC = 60 ns

  • tRC: minimum time from the start of one row access to the start of the next.

    • tRC = 110 ns for a 4Mbit DRAM with a tRAC of 60 ns

  • tCAC: minimum time from CAS line falling to valid data output.

    • 15 ns for a 4Mbit DRAM with a tRAC of 60 ns

  • tPC: minimum time from the start of one column access to the start of the next.

    • 35 ns for a 4Mbit DRAM with a tRAC of 60 ns


DRAM Performance

  • A 60 ns (tRAC) DRAM can

    • perform a row access only every 110 ns (tRC)

    • perform column access (tCAC) in 15 ns, but time between column accesses is at least 35 ns (tPC).

      • In practice, external address delays and turning around buses make it 40 to 50 ns

  • These times do not include the time to drive the addresses off the microprocessor nor the memory controller overhead.

    • Drive parallel DRAMs, external memory controller, bus to turn around, SIMM module, pins…

    • 180 ns to 250 ns latency from processor to memory is good for a “60 ns” (tRAC) DRAM
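
A back-of-the-envelope way to see what these parameters buy, sketched in plain Python (tRC and tPC come from the previous slide; the x1 part width is an illustrative assumption):

    # Peak access rates implied by the "60 ns" 4 Mb DRAM's timing.
    t_rc = 110e-9   # row cycle time: a new random row every 110 ns
    t_pc = 35e-9    # page-mode column cycle time

    print("random rows: %.1f M accesses/s" % (1 / t_rc / 1e6))  # ~9.1
    print("page mode:   %.1f M accesses/s" % (1 / t_pc / 1e6))  # ~28.6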


DRAM Write Timing

  • Every DRAM access begins at:

    • The assertion of the RAS_L

    • 2 ways to write: early or late v. CAS

[Timing diagram: one DRAM write cycle on the 256K x 8 DRAM. RAS_L falls with the row address on A, then CAS_L falls with the column address; data in is presented on D, and the WR access time and full DRAM WR cycle time are marked. Early write cycle: WE_L asserted before CAS_L. Late write cycle: WE_L asserted after CAS_L.]


DRAM Read Timing

  • Every DRAM access begins at:

    • The assertion of the RAS_L

    • 2 ways to read: early or late v. CAS

[Timing diagram: one DRAM read cycle on the 256K x 8 DRAM. RAS_L falls with the row address on A, then CAS_L falls with the column address; D goes from high-Z to data out after the read access time, with an output-enable delay from OE_L. Early read cycle: OE_L asserted before CAS_L. Late read cycle: OE_L asserted after CAS_L.]


Cycle Time versus Access Time

[Timeline: successive accesses are separated by the cycle time; the (shorter) access time runs from the start of each access to data out.]

  • DRAM (Read/Write) Cycle Time >> DRAM (Read/Write) Access Time

    • ≈ 2:1; why?

  • DRAM (Read/Write) Cycle Time:

    • How frequently can you initiate an access?

    • Analogy: A little kid can only ask his father for money on Saturday

  • DRAM (Read/Write) Access Time:

    • How quickly will you get what you want once you initiate an access?

    • Analogy: As soon as he asks, his father will give him the money

  • DRAM Bandwidth Limitation analogy:

    • What happens if he runs out of money on Wednesday?


Increasing Bandwidth - Interleaving

Access Pattern without Interleaving:

[Figure: the CPU talks to a single memory; the access for D2 cannot start until D1 is available, a full cycle time after the access for D1 started.]

Access Pattern with 4-way Interleaving:

[Figure: the CPU talks to Memory Banks 0-3. Accesses to banks 0, 1, 2, 3 start on successive cycles; by the time bank 3 has been started, we can access Bank 0 again.]
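
A toy simulation of the idea in plain Python (the round-robin word-to-bank mapping is standard; the bank busy time and issue rate are illustrative assumptions):

    # 4-way interleaving: word i lives in bank i % 4. A bank stays busy
    # for CYCLE_TIME after an access; different banks can overlap.
    CYCLE_TIME = 8    # bank cycle time, in CPU cycles (illustrative)
    NUM_BANKS = 4

    def finish_time(words):
        free_at = [0] * NUM_BANKS          # when each bank is next free
        t = 0
        for w in words:
            bank = w % NUM_BANKS
            t = max(t + 1, free_at[bank])  # issue 1/cycle; wait if bank busy
            free_at[bank] = t + CYCLE_TIME
        return t + CYCLE_TIME

    print(finish_time(range(16)))          # sequential: banks overlap
    print(finish_time([0, 4, 8, 12] * 4))  # all map to bank 0: serialized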


Fewer DRAMs/System over Time

[Table: number of DRAM chips needed for the minimum PC memory size, by DRAM generation: '86 (1 Mb), '89 (4 Mb), '92 (16 Mb), '96 (64 Mb), '99 (256 Mb), '02 (1 Gb), for minimum PC memory sizes from 4 MB up to 256 MB. Memory per DRAM grows @ 60% / year, while memory per system grows only @ 25%-30% / year.]

  • Increasing DRAM => fewer chips => harder to have banks


Page Mode DRAM: Motivation

  • Regular DRAM Organization:

    • N rows x N columns x M-bit

    • Read & Write M bits at a time

    • Each M-bit access requires a RAS / CAS cycle

  • Fast Page Mode DRAM

    • N x M "register" to save a row

[Figure: N-row x N-column DRAM with row address, column address, and M-bit output. Timing: the 1st and 2nd M-bit accesses each need a full RAS_L/CAS_L cycle with a row address then a column address on A.]


Fast Page Mode Operation

  • Fast Page Mode DRAM

    • N x M "SRAM" to save a row

  • After a row is read into the register

    • Only CAS is needed to access other M-bit blocks on that row

    • RAS_L remains asserted while CAS_L is toggled

[Figure: the N x M "SRAM" row register sits between the N-row array and the M-bit output. Timing: after one row address, the 1st through 4th M-bit accesses are made by toggling CAS_L with a new column address each time, with RAS_L held low throughout.]


DRAMs over Time

DRAM Generation

1st Gen. Sample           '84      '87     '90     '93     '96     '99
Memory Size               1 Mb     4 Mb    16 Mb   64 Mb   256 Mb  1 Gb
Die Size (mm2)            55       85      130     200     300     450
Memory Area (mm2)         30       47      72      110     165     250
Memory Cell Area (µm2)    28.84    11.1    4.26    1.64    0.61    0.23


Memory - Summary

  • Two Different Types of Locality:

    • Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon.

    • Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon.

  • By taking advantage of the principle of locality:

    • Present the user with as much memory as is available in the cheapest technology.

    • Provide access at the speed offered by the fastest technology.

  • DRAM is slow but cheap and dense:

    • Good choice for presenting the user with a BIG memory system

  • SRAM is fast but expensive and not very dense:

    • Good choice for providing the user FAST access time.


Cache Fundamentals

  • cache hit -- an access where the data is found in the cache.

  • cache miss -- an access which isn't.

  • hit time -- time to access the cache.

  • miss penalty -- time to move data from the further level to the closer, then to the CPU.

  • hit ratio -- percentage of time the data is found in the cache.

  • miss ratio -- (1 - hit ratio)

  • cache block size or cache line size -- the amount of data that gets transferred on a cache miss.

  • instruction cache -- cache that only holds instructions.

  • data cache -- cache that only caches data.

  • unified cache -- cache that holds both.

[Figure: the CPU above the lowest-level cache, which sits above the next-level memory/cache.]


Caching Issues

  • On a memory access -

    • How do I know if this is a hit or miss?

  • On a cache miss -

    • where to put the new data?

    • what data to throw out?

    • how to remember what data this is?

[Figure: an access goes from the CPU to the lowest-level cache; a miss goes on to the next-level memory/cache.]


A simple cache: similar to the branch prediction table in pipelining?

address string: 4 (00000100), 8 (00001000), 12 (00001100), 4 (00000100), 8 (00001000), 20 (00010100), 4 (00000100), 8 (00001000), 20 (00010100), 24 (00011000), 12 (00001100), 8 (00001000), 4 (00000100)

[Figure: a 4-entry direct-mapped cache drawn as a tag column and a data column; an index (bits of the address) is used to determine which line an address might be found in.]

4 entries, each block holds one word, each word in memory maps to exactly one cache location.

  • A cache that can put a line of data in exactly one place is called direct-mapped

  • Conflict Misses are misses caused by:

    • Different memory locations mapped to the same cache index

      • Solution 1: make the cache size bigger

      • Solution 2: Multiple entries for the same Cache Index

    • Conflict misses = 0 by definition, for fully associative caches.
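
Simulating this 4-entry direct-mapped cache on the address string above makes the conflict misses visible. A minimal sketch in plain Python (word addresses, one word per block, index = word number mod 4, as on this slide):

    # 4-entry direct-mapped cache, one word per block.
    NUM_LINES = 4
    tags = [None] * NUM_LINES

    for a in [4, 8, 12, 4, 8, 20, 4, 8, 20, 24, 12, 8, 4]:
        word = a // 4                # word number (byte address / 4)
        index = word % NUM_LINES     # the one line this word may occupy
        tag = word // NUM_LINES      # remaining high-order bits
        hit = tags[index] == tag
        print(f"addr {a:2}: line {index}, {'hit' if hit else 'miss'}")
        tags[index] = tag            # a miss simply replaces the line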


Fully associative cache

address string: 4 (00000100), 8 (00001000), 12 (00001100), 4 (00000100), 8 (00001000), 20 (00010100), 4 (00000100), 8 (00001000), 20 (00010100), 24 (00011000), 12 (00001100), 8 (00001000), 4 (00000100)

[Figure: a 4-entry fully associative cache drawn as a tag column and a data column; the tag identifies the address of the cached data.]

4 entries, each block holds one word, any block can hold any word.

  • A cache that can put a line of data anywhere is called fully associative

  • The most popular replacement strategy is LRU (least recently used).

  • How do you find the data?


Fully associative cache

  • Fully Associative Cache

    • Forget about the Cache Index

    • Compare the Cache Tags of all cache entries in parallel

    • Example: Block Size = 32 B blocks, we need N 27-bit comparators

  • By definition: Conflict Miss = 0 for a fully associative cache

[Figure: a 32-bit address split into a Cache Tag (27 bits, bits 31-5) and a Byte Select (bits 4-0, e.g. 0x01). Each entry holds a Valid Bit, a Cache Tag, and a 32-byte block of Cache Data (Byte 0 ... Byte 31; Byte 32 ... Byte 63; ...), with a comparator (X) on every tag.]


An n-way set associative cache

address string: 4 (00000100), 8 (00001000), 12 (00001100), 4 (00000100), 8 (00001000), 20 (00010100), 4 (00000100), 8 (00001000), 20 (00010100), 24 (00011000), 12 (00001100), 8 (00001000), 4 (00000100)

[Figure: a 2-way version of the 4-entry cache, drawn as two tag/data columns side by side.]

4 entries, each block holds one word, each word in memory maps to one of a set of n cache lines

  • A cache that can put a line of data in exactly n places is called n-way set-associative.

  • The cache lines that share the same index are a cache set.
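
Extending the direct-mapped sketch above to n ways with LRU replacement (arranging the same 4 lines as 2 sets x 2 ways is an illustrative assumption):

    # n-way set-associative cache with LRU replacement.
    NUM_SETS, WAYS = 2, 2                  # 4 lines total: 2 sets x 2 ways

    sets = [[] for _ in range(NUM_SETS)]   # per set: tags, most recent last

    def access(addr):
        word = addr // 4
        index, tag = word % NUM_SETS, word // NUM_SETS
        s = sets[index]
        hit = tag in s
        if hit:
            s.remove(tag)                  # refresh LRU position
        elif len(s) == WAYS:
            s.pop(0)                       # evict the least recently used way
        s.append(tag)
        return hit

    for a in [4, 8, 12, 4, 8, 20, 4, 8, 20, 24, 12, 8, 4]:
        print(a, "hit" if access(a) else "miss")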


An n-way set associative cache

  • N-way set associative: N entries for each Cache Index

    • N direct mapped caches operate in parallel

  • Example: Two-way set associative cache

    • Cache Index selects a "set" from the cache

    • The two tags in the set are compared in parallel

    • Data is selected based on the tag result

[Figure: the Cache Index selects one set; two valid/tag/data ways (Cache Block 0 of each) are read out, each stored tag is compared against the address tag (Adr Tag), the compare results (Sel1/Sel0) drive a mux over the two cache blocks, and their OR forms the Hit signal.]


Direct vs. Set Associative Caches

[Figure: the same two-way set-associative datapath as above: the index selects a set, parallel tag compares drive a mux and the Hit signal.]

  • N-way Set Associative Cache versus Direct Mapped Cache:

    • N comparators vs. 1

    • Extra MUX delay for the data

    • Data comes AFTER Hit/Miss decision and set selection

  • In a direct mapped cache, Cache Block is available BEFORE Hit/Miss:

    • Possible to assume a hit and continue. Recover later if miss.


Longer cache-blocks

address string: 4 (00000100), 8 (00001000), 12 (00001100), 4 (00000100), 8 (00001000), 20 (00010100), 4 (00000100), 8 (00001000), 20 (00010100), 24 (00011000), 12 (00001100), 8 (00001000), 4 (00000100)

[Figure: a 4-entry cache with two words per block, drawn as a tag column and a two-word data column.]

4 entries, each block holds two words, each word in memory maps to exactly one cache location (this cache is twice the total size of the prior caches).

  • Large cache blocks take advantage of spatial locality.

  • Too large of a block size can waste cache space.

  • Longer cache blocks require less tag space


Increasing block size

  • In general, a larger block size takes advantage of spatial locality, BUT:

    • Larger block size means larger miss penalty:

      • Takes longer time to fill up the block

    • If block size is too big relative to cache size, miss rate will go up

      • Too few cache blocks

  • In general, Average Access Time:

    • = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate
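
A small worked instance of this formula in plain Python (the 1-cycle hit time and 50-cycle miss penalty are illustrative assumptions):

    # Average access time = hit_time * (1 - miss_rate) + miss_penalty * miss_rate
    def avg_access_time(hit_time, miss_rate, miss_penalty):
        return hit_time * (1 - miss_rate) + miss_penalty * miss_rate

    for miss_rate in (0.01, 0.05, 0.10):
        print(miss_rate, avg_access_time(1, miss_rate, 50))  # 1.49, 3.45, 5.9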


Sources of Cache Misses

  • Compulsory (cold start or process migration, first reference): first access to a block

    • “Cold” fact of life: not a whole lot you can do about it

    • Note: If you are going to run “billions” of instruction, Compulsory Misses are insignificant

  • Conflict (collision):

    • Multiple memory locations mappedto the same cache location

    • Solution 1: increase cache size

    • Solution 2: increase associativity

  • Capacity:

    • Cache cannot contain all blocks accessed by the program

    • Solution: increase cache size

  • Invalidation: other process (e.g., I/O) updates memory


Sources of Cache Misses

                     Direct Mapped      N-way Set Associative      Fully Associative
Cache Size           Big                Medium                     Small
Compulsory Miss      Same               Same                       Same
Conflict Miss        High               Medium                     Zero
Capacity Miss        Low                Medium                     High
Invalidation Miss    Same               Same                       Same


Accessing a cache

[Figure: the five-stage pipeline (IF ID EX MEM WB) with the ICache feeding instruction fetch and the DCache accessed in the MEM stage, register file reads/writes on either side of the ALU.]

1. Use index and tag to access cache and determine hit/miss.

2. If hit, return requested data.

3. If miss, select a cache block to be replaced, and access memory or next lower cache (possibly stalling the processor).

   - load entire missed cache line into cache

   - return requested data to CPU (or higher cache)

4. If next lower memory is a cache, go to step 1 for that cache.


Accessing a cache

  • 64 KB cache, direct-mapped, 32-byte cache block size

[Figure: a 32-bit address split into tag (16 bits, 31-16), index (11 bits, 15-5), and block offset (5 bits, 4-0, of which 3 bits select a word). The index selects one of 2048 valid/tag/data entries (0 ... 2047); the stored tag is compared (=) with the address tag to produce hit/miss, and the offset selects 32 bits of the 256-bit block.]

64 KB / 32 bytes = 2 K cache blocks/sets
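
These field widths follow mechanically from the cache geometry. A hedged Python helper that reproduces them (and the 2-way example on the next slide):

    # Tag/index/offset widths for a set-associative cache (ways=1 => direct-mapped).
    def cache_fields(addr_bits, cache_bytes, block_bytes, ways=1):
        num_sets = cache_bytes // (block_bytes * ways)
        offset_bits = (block_bytes - 1).bit_length()  # log2(block size)
        index_bits = (num_sets - 1).bit_length()      # log2(number of sets)
        return addr_bits - index_bits - offset_bits, index_bits, offset_bits

    print(cache_fields(32, 64 * 1024, 32))          # (16, 11, 5): this slide
    print(cache_fields(32, 32 * 1024, 16, ways=2))  # (18, 10, 4): next slide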


Accessing a cache

  • 32 KB cache, 2-way set-associative, 16-byte block size (cache lines)

[Figure: a 32-bit address split into tag (18 bits), index (10 bits), and word offset (4 bits). The index selects one of 1024 sets (0 ... 1023), each with two valid/tag/data ways; both stored tags are compared (=) with the address tag to produce hit/miss.]

32 KB / 16 bytes / 2 = 1 K cache sets


Cache Alignment

memory address: tag | index | block offset

  • The data that gets moved into the cache on a miss are all data whose addresses share the same tag and index (regardless of which data gets accessed first).

  • This results in

    • no overlap of cache lines

    • easy mapping of addresses to cache lines (no additions)

    • data at address X always being present in the same location in the cache block (at byte X mod blocksize) if it is there at all.

  • Think of memory as organized into cache-line sized pieces (because in reality, it is!).

  • Recall DRAM page mode architecture!!

[Figure: memory drawn as numbered cache-line-sized pieces (0, 1, 2, ...).]


Basic Memory Hierarchy questions

  • Q1: Where can a block be placed in the upper level? (Block placement)

  • Q2: How is a block found if it is in the upper level? (Block identification)

  • Q3: Which block should be replaced on a miss? (Block replacement)

  • Q4: What happens on a write? (Write strategy)

  • Placement: Cache structure determines the "where":

    • Fully associative, direct mapped, 2-way set associative

    • S.A. Mapping = Block Number Modulo Number of Sets

  • Identification: Tag on each block

    • No need to check index or block offset

  • Replacement: Easy for Direct Mapped

    • Set Associative or Fully Associative:

      • Random

      • LRU (Least Recently Used)


What about writes?

  • Stores must be handled differently than loads, because...

    • they don’t necessarily require the CPU to stall.

    • they change the content of cache/memory (creating memory consistency issues)

    • they may require a load and a store to complete


What about writes?

  • Keep memory and cache identical?

    • write-through => all writes go to both cache and main memory

    • write-back => writes go only to cache. Modified cache lines are written back to memory when the line is replaced.

  • Make room in cache for store miss?

    • write-allocate => on a store miss, bring written line into the cache

    • write-around => on a store miss, ignore cache

  • On a store hit, write the new data to cache. In a write-through cache, write the data immediately to memory. In a write-back cache, mark the line as dirty.

  • On a store miss, initiate a cache block load from memory for a write-allocate cache. Write directly to memory for a write-around cache.

  • On any kind of cache miss in a write-back cache, if the line to be replaced in the cache is dirty, write it back to memory.
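
The four policy combinations can be modeled in a few lines. A hedged sketch in plain Python (the dirty-bit bookkeeping is the standard mechanism; the single-line model is illustrative only):

    # Write policies for one cache line; returns main-memory writes caused.
    class Line:
        def __init__(self):
            self.valid = self.dirty = False
            self.tag = None

    def store(line, tag, write_through, write_allocate):
        hit = line.valid and line.tag == tag
        if not hit and not write_allocate:
            return 1                   # write-around miss: memory only
        writes = 0
        if not hit and line.valid and line.dirty:
            writes += 1                # write-back: flush the dirty victim
        line.valid, line.tag = True, tag
        if write_through:
            writes += 1                # write-through: memory sees every store
        else:
            line.dirty = True          # write-back: defer until replacement
        return writes

    line = Line()
    for t in (1, 1, 2):                # two stores to block 1, then block 2
        print(store(line, t, write_through=False, write_allocate=True))
    # prints 0, 0, 1: dirty block 1 is written back only when 2 evicts it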


Cache Performance

  • TCPI = BCPI + MCPI

    • where TCPI = Total CPI;

    • BCPI = base CPI, which means the CPI assuming perfect memory;

    • MCPI = the memory CPI, the number of cycles (per instruction) the processor is stalled waiting for memory.

  • MCPI = accesses/instruction * miss rate * miss penalty

    • this assumes we stall the pipeline on both read and write misses, that the miss penalty is the same for both, and that cache hits require no stalls.


Example -- DEC Alpha 21164 Caches

[Figure: the 21164 CPU core with split Instruction and Data Caches, backed by a Unified L2 Cache on chip and an Off-Chip L3 Cache.]

  • ICache and DCache -- 8 KB, DM, 32-byte lines

  • L2 cache -- 96 KB, 3-way SA, 32-byte lines

  • L3 cache -- 1 MB, DM, 32-byte lines


memory address: tag | index | offset

DEC Alpha 21164's L2 cache: 96 KB, 3-way set associative, 32-byte lines, 64-bit addresses

How many "offset" bits?

How many "index" bits?

How many "tag" bits?

Draw cache picture – how do you tell if it's a hit?


  • 8 KB cache, direct-mapped, 32-byte lines

[Figure: a 32-bit address split into tag (19 bits), index (8 bits), and offset (5 bits, of which 3 bits select a word). The index selects one of 256 valid/tag/data entries; the stored tag is compared (=) with the address tag to produce hit/miss.]

8 KB / 32 bytes = 256 cache blocks/sets


  • 96 KB cache, 3-way set-associative, 32-byte block size (cache lines)

  • => Each block (line) holds 32 bytes

  • => 3 bits (plus 2 byte-offset bits) to locate the right 32-bit word within a block

  • => each "index" has 3 blocks (3-way): in effect, three places the data could be placed

  • => 96 bytes per "row" (set)

  • => 96 KB / 96 B = 1 K rows

  • => index = 10 bits

  • => tag = 32 - (10 + 3 + 2) = 17 bits
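
The cache_fields helper sketched earlier reproduces this split, using the 32-bit addresses this slide assumes (it reports the 5 offset bits that the slide counts as 3 word-select + 2 byte-select):

    print(cache_fields(32, 96 * 1024, 32, ways=3))   # (17, 10, 5)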


Pentium 4: Trace Cache and Instruction Decode

Remember Micro-Instructions??


Power4

L3: 128 MB, 512-byte lines, 8-way SA, Write Back


Real Caches

  • Often DM at highest level (closest to CPU), associative further away

  • split I and D close to the processor, unified further away (for throughput rather than miss rate).

  • write-through and write-back both common, but never write-through all the way to memory.

  • 32-byte cache lines very common

  • Non-blocking

    • processor doesn't stall on a miss, but only on the use of the missed data (if even then)

    • this means the cache must be able to handle multiple outstanding accesses.


Recall: Levels of the Memory Hierarchy

(Upper level: faster; lower level: larger.)

Level           Capacity, Access Time, Cost            Staging Xfer Unit            Managed by
CPU Registers   100s Bytes, <10s ns                    Instr. Operands, 1-8 bytes   prog./compiler
Cache           K Bytes, 10-100 ns, $.01-.001/bit      Blocks, 8-128 bytes          cache cntl
Main Memory     M Bytes, 100 ns-1 us, $.01-.001        Pages, 512-4K bytes          OS
Disk            G Bytes, ms, 10^-4 - 10^-3 cents       Files, Mbytes                user/operator
Tape            infinite, sec-min, 10^-6 cents         --                           --


Virtual Memory

  • Main memory can act as a cache for the secondary storage (disk)

  • Advantages:

    • illusion of having more physical memory

    • program relocation

    • protection


Virtual Memory

  • is just caching, but uses different terminology:

      cache          VM
      block          page
      cache miss     page fault
      address        virtual address
      index          physical address (sort of)

  • What happens if another program in the processor uses the same addresses that yours does?

  • What happens if your program uses addresses that don’t exist in the machine?

  • What happens to “holes” in the address space your program uses?

  • So, virtual memory provides

    • performance (through the caching effect)

    • protection

    • ease of programming/compilation

    • efficient use of memory


Virtual Memory

  • is just a mapping function from virtual memory addresses to physical memory locations, which allows caching of virtual pages in physical memory.

  • Why is it different ?

  • MUCH higher miss penalty (millions of cycles)!

  • Therefore

    • large pages [equivalent of cache line] (4 KB to MBs)

    • associative mapping of pages (typically fully associative)

    • software handling of misses (but not hits!!)

    • write-through not an option, only write-back


Page Tables

  • Page table maps virtual memory addresses to physical memory locations

    • Page may be in main memory or may be on disk.

      “Valid bit” says which.

[Figure: a page table indexed by the virtual page number (the high-order bits of the virtual address); each entry's valid bit says whether the page is in main memory or on disk.]


Page Tables

[Figure: the virtual address splits into a virtual page number and a page offset. The page table register (hardware that tells the VM system where to find the page table) points to the page table; the virtual page number indexes it, and a valid entry yields the physical page number, which is concatenated with the unchanged page offset to form the physical address.]

  • Why don’t we need a “tag” field?

  • How does operating system switch to new process???
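
A hedged sketch of this translation path in plain Python (the 4 KB page size and the tiny page-table contents are illustrative assumptions):

    # One-level page table walk: virtual address -> physical address.
    PAGE_BITS = 12                     # 4 KB pages (illustrative)

    # virtual page number -> (valid, physical page number)
    page_table = {0: (True, 7), 1: (True, 3), 2: (False, None)}

    def translate(vaddr):
        vpn = vaddr >> PAGE_BITS                   # high-order bits
        offset = vaddr & ((1 << PAGE_BITS) - 1)    # offset passes through
        valid, ppn = page_table.get(vpn, (False, None))
        if not valid:
            raise RuntimeError("page fault")       # OS fetches page from disk
        return (ppn << PAGE_BITS) | offset

    print(hex(translate(0x1234)))      # vpn 1 -> ppn 3, giving 0x3234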


Virtual Address and a Cache

[Figure: the CPU issues a VA; translation produces a PA, which accesses the cache; on a miss, main memory supplies the data.]

It takes an extra memory access to translate a VA to a PA. This makes cache access very expensive, and this is the "innermost loop" that you want to go as fast as possible.

ASIDE: Why access the cache with the PA at all? VA caches have a problem, the synonym / alias problem: two different virtual addresses can map to the same physical address, so two different cache entries hold data for the same physical address! On an update, you must update all cache entries with the same physical address, or memory becomes inconsistent. Determining this requires significant hardware, essentially an associative lookup on the physical address tags to see if you have multiple hits; or a software-enforced alias boundary: aliases must agree in their low-order VA & PA bits up to the cache size.


TLBs

A way to speed up translation is to use a special cache of recently used page table entries -- this has many names, but the most frequently used is Translation Lookaside Buffer or TLB.

[TLB entry fields: Virtual Address | Physical Address | Dirty | Ref | Valid | Access]

TLB access time is comparable to cache access time (much less than main memory access time).


Translation lookaside buffer (TLB)

A hardware cache for the page table

Typical size: 256 entries, 1- to 4-way set associative


Translation Look-Aside Buffers

Just like any other cache, the TLB can be organized as fully associative, set associative, or direct mapped.

TLBs are usually small, typically not more than 128-256 entries even on high-end machines. This permits fully associative lookup on these machines. Most mid-range machines use small n-way set-associative organizations.

[Figure: translation with a TLB. The CPU's VA goes to the TLB lookup; on a hit, the PA goes straight to the cache; on a miss, the full translation runs first, then the cache and, on a cache miss, main memory are accessed. Annotated latencies: cache ~t, TLB lookup ~1/2 t, translation ~20 t.]


Translation Look-Aside Buffers

Machines with TLBs go one step further to reduce cycles per cache access: overlap the cache access with the TLB access. This works because the high-order bits of the VA are used to look in the TLB while the low-order bits are used as the index into the cache.

Overlapped access only works as long as the address bits used to index into the cache do not change as the result of VA translation. This usually limits things to small caches, large page sizes, or high n-way set-associative caches if you want a large cache.
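
The constraint fits in one line: the cache index plus block offset must fit within the page offset. A hedged check in plain Python (the page and cache geometries are illustrative):

    # Overlapped TLB/cache access is safe iff index + offset <= page offset.
    import math

    def overlap_ok(page_bytes, cache_bytes, block_bytes, ways):
        page_offset = int(math.log2(page_bytes))
        index = int(math.log2(cache_bytes // (block_bytes * ways)))
        offset = int(math.log2(block_bytes))
        return index + offset <= page_offset

    print(overlap_ok(4096, 4 * 1024, 32, 1))    # small DM cache: True
    print(overlap_ok(4096, 64 * 1024, 32, 1))   # large DM cache: False
    print(overlap_ok(4096, 64 * 1024, 32, 16))  # high associativity: True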


All together!


  • Virtual memory provides:

    • protection

    • sharing

    • performance

    • illusion of large main memory

  • Virtual Memory requires twice as many memory accesses, so we cache page table entries in the TLB.

  • Four things can go wrong on a memory access: cache miss, TLB miss (but page table in cache), TLB miss and page table not in cache, page fault.


Summary

  • The Principle of Locality:

    • Program likely to access a relatively small portion of the address space at any instant of time.

      • Temporal Locality: Locality in Time

      • Spatial Locality: Locality in Space

  • Three Major Categories of Cache Misses:

    • Compulsory Misses: sad facts of life. Example: cold start misses.

    • Conflict Misses: increase cache size and/or associativity. Nightmare Scenario: ping-pong effect!

    • Capacity Misses: increase cache size

  • Cache Design Space

    • total size, block size, associativity

    • replacement policy

    • write-hit policy (write-through, write-back)

    • write-miss policy


Summary

  • Caches, TLBs, Virtual Memory all understood by examining how they deal with 4 questions: 1) Where can a block be placed? 2) How is a block found? 3) What block is replaced on a miss? 4) How are writes handled?

  • Page tables map virtual address to physical address

  • TLBs are important for fast translation

  • TLB misses are significant in processor performance: (funny times, as most systems can’t access all of 2nd level cache without TLB misses!)

  • Virtual memory was controversial at the time: can SW automatically manage 64 KB across many programs?

    • 1000X DRAM growth removed the controversy

  • Today VM allows many processes to share single memory without having to swap all processes to disk; VM protection is more important than memory hierarchy

  • Today CPU time is a function of (ops, cache misses) vs. just f(ops): What does this mean to Compilers, Data structures, Algorithms?

Tarun Soni, Summer’03