CENG 3420 Computer Design, Spring 2011
Lecture 13: Memory Hierarchy

XU, Qiang (Johnny) 徐強

[Adapted from UC Berkeley’s D. Patterson’s and from PSU’s Mary J. Irwin’s slides, with additional credits to Y. Xie]

Review: Major Components of a Computer

[Figure: Processor (Control, Datapath) connected to Memory and to Devices (Input, Output); the memory side is built from a Cache, Main Memory, and Secondary Memory (Disk).]

Processor-Memory Performance Gap

[Figure: processor performance (“Moore’s Law”) improves ~55%/year (2X/1.5yr) while DRAM performance improves only ~7%/year (2X/10yrs), so the Processor-Memory Performance Gap grows ~50%/year.]

The “Memory Wall”
  • Logic vs DRAM speed gap continues to grow

[Figure: clocks per instruction and clocks per DRAM access plotted over time.]

Memory Performance Impact on Performance
  • Suppose a processor executes with
    • ideal CPI = 1.1
    • 50% arith/logic, 30% ld/st, 20% control
    • and that 10% of data memory operations miss with a 50 cycle miss penalty
  • CPI = ideal CPI + average stalls per instruction
    = 1.1 (cycles) + 0.30 (data mem ops/instr) x 0.10 (miss/data mem op) x 50 (cycles/miss)
    = 1.1 cycles + 1.5 cycles = 2.6
  • So 1.5/2.6 ≈ 58% of the time the processor is stalled waiting for memory!
  • A 1% instruction miss rate would add an additional 1.0 x 0.01 x 50 = 0.5 to the CPI! (A small sketch of this arithmetic follows below.)
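The stall-cycle arithmetic above can be checked in a few lines of Python. This is a minimal sketch using the slide's example values; it is not part of the original deck.

```python
# Sketch of the slide's CPI calculation (example values from the slide).
ideal_cpi      = 1.1    # cycles per instruction with a perfect memory system
ld_st_fraction = 0.30   # fraction of instructions that are loads/stores
miss_rate      = 0.10   # fraction of data memory operations that miss
miss_penalty   = 50     # cycles per miss

stall_cycles = ld_st_fraction * miss_rate * miss_penalty   # 1.5 cycles/instruction
cpi = ideal_cpi + stall_cycles                              # 2.6

print(f"CPI = {cpi:.1f}, stalled fraction = {stall_cycles / cpi:.0%}")
# -> CPI = 2.6, stalled fraction = 58%

# Adding a 1% instruction-fetch miss rate (one fetch per instruction):
cpi_with_ifetch_misses = cpi + 1.0 * 0.01 * miss_penalty    # +0.5 -> 3.1
```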
The Memory Hierarchy Goal
  • Fact: Large memories are slow and fast memories are small
  • How do we create a memory that gives the illusion of being large, cheap and fast (most of the time)?
    • With hierarchy
    • With parallelism
Memory Hierarchy Technologies
  • Random Access Memories (RAMs)
    • “Random” is good: access time is the same for all locations
    • DRAM: Dynamic Random Access Memory
      • High density (1 transistor cells), low power, cheap, slow
      • Dynamic: need to be “refreshed” regularly (~ every 4 ms)
    • SRAM: Static Random Access Memory
      • Low density (6 transistor cells), high power, expensive, fast
      • Static: content will last “forever” (until power turned off)
    • Size: DRAM/SRAM ratio of 4 to 8
    • Cost/Cycle time: SRAM/DRAM ratio of 8 to 16
  • “Not-so-random” Access Technology
    • Access time varies from location to location and from time to time (e.g., disk, CDROM)
RAM Memory Uses and Performance Metrics
  • Caches use SRAM for speed
  • Main Memory is DRAM for density
    • Addresses divided into 2 halves (row and column)
      • RAS or Row Access Strobe triggering the row decoder
      • CAS or Column Access Strobe triggering the column selector
  • Memory performance metrics
    • Latency: Time to access one word
      • Access Time: time between request and when word is read or written (read access and write access times can be different)
      • Cycle Time: time between successive (read or write) requests
      • Usually cycle time > access time
    • Bandwidth: How much data can be supplied per unit time
      • width of the data channel * the rate at which it can be used
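As a small illustration of the bandwidth formula above, the sketch below plugs in assumed numbers for the channel width and transfer rate; neither figure comes from the slides.

```python
# Peak bandwidth = width of the data channel * the rate at which it can be used.
# The numbers below are illustrative assumptions, not figures from the slides.
width_bytes   = 8        # assumed 64-bit-wide data channel
transfer_rate = 200e6    # assumed 200 million transfers per second

peak_bandwidth = width_bytes * transfer_rate           # bytes per second
print(f"peak bandwidth = {peak_bandwidth / 1e9:.1f} GB/s")   # -> 1.6 GB/s
```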
A Typical Memory Hierarchy
  • By taking advantage of the principle of locality
    • Can present the user with as much memory as is available in the cheapest technology
    • at the speed offered by the fastest technology

[Figure: on-chip components (Control, Datapath with RegFile, ITLB, DTLB, Instruction Cache, Data Cache, and possibly eDRAM), a Second Level Cache (SRAM), Main Memory (DRAM), and Secondary Memory (Disk).]

Speed (# cycles): ½’s   1’s   10’s   100’s   1,000’s
Size (bytes):     100’s K’s   10K’s  M’s     G’s to T’s
Cost:             highest on the left, decreasing to lowest on the right
(columns run left to right from the register file out to secondary memory)

Memory Hierarchy Technologies
  • Caches use SRAM for speed and technology compatibility
    • Low density (6 transistor cells), high power, expensive, fast
    • Static: content will last “forever” (until power turned off)

[Figure: a 2M x 16 SRAM chip with a 21-bit Address, Chip select, Output enable, Write enable, 16-bit Din[15-0], and 16-bit Dout[15-0].]

  • Main Memory uses DRAM for size (density)
    • High density (1 transistor cells), low power, cheap, slow
    • Dynamic: needs to be “refreshed” regularly (~ every 8 ms)
      • refresh consumes 1% to 2% of the active cycles of the DRAM
    • Addresses divided into 2 halves (row and column)
      • RAS or Row Access Strobe triggering the row decoder
      • CAS or Column Access Strobe triggering the column selector
Classical RAM Organization (~Square)

[Figure: a square RAM Cell Array; each intersection of a word (row) line and a bit (data) line represents a 6-T SRAM cell or a 1-T DRAM cell. The row address drives a Row Decoder that activates one word line; the column address drives the Column Selector & I/O Circuits attached to the bit lines.]

One memory row holds a block of data, so the column address selects the requested data bit or word from that block.
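To make the row/column split concrete, here is a small Python sketch of how a word address could be divided for a ~square array. The 1024 x 1024 geometry and the helper name split_address are illustrative assumptions, not details from the slides.

```python
# Sketch: splitting an address into row and column for a ~square RAM array.
# Assumes a 2^20-word array organized as 1024 rows x 1024 columns (illustrative).
ROW_BITS = 10
COL_BITS = 10

def split_address(addr: int) -> tuple[int, int]:
    """Return (row, column) driven onto the row decoder and column selector."""
    row = (addr >> COL_BITS) & ((1 << ROW_BITS) - 1)   # high-order bits -> row decoder
    col = addr & ((1 << COL_BITS) - 1)                 # low-order bits  -> column selector
    return row, col

row, col = split_address(0x2A5F3)
# The row decoder activates one word line; the column selector then picks
# the requested bit or word out of that row.
```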

Classical DRAM Organization (~Square Planes)

[Figure: the RAM Cell Array is replicated in M planes; each intersection of a word (row) line and a bit (data) line represents a 1-T DRAM cell. A shared Row Decoder, driven by the row address, selects one word line across all planes; the column address drives each plane’s Column Selector & I/O Circuits, which select the requested bit from the row in that plane. The M data bits, one per plane, together form the data word.]

Classical DRAM Operation
  • DRAM Organization:
    • N rows x N columns x M-bit
    • Read or Write M-bit at a time
    • Each M-bit access requires a RAS / CAS cycle

[Figure: timing diagram for two M-bit accesses. For each access the row address is presented with RAS and then the column address with CAS; the 1st and 2nd M-bit accesses are separated by a full cycle time. The array is shown as N rows x N cols with M bit planes and an M-bit output.]

Page Mode DRAM Operation
  • Page Mode DRAM
    • N x M SRAM to save a row
  • After a row is read into the SRAM “register”
    • Only CAS is needed to access other M-bit words on that row
    • RAS remains asserted while CAS is toggled

[Figure: timing diagram. The row address is presented once with RAS; successive column addresses presented with CAS then deliver the 1st, 2nd, 3rd, and 4th M-bit accesses within a single cycle time. The array is N rows x N cols with M bit planes, an N x M SRAM row register, and an M-bit output.]

Synchronous DRAM (SDRAM) Operation
  • After a row is read into the SRAM register
    • Input CAS as the starting “burst” address along with a burst length
    • Transfers a burst of data from a series of sequential addresses within that row
      • A clock controls transfer of successive words in the burst – 300MHz in 2004

[Figure: timing diagram. The row address is presented with RAS and a single starting column address with CAS; the column address is then incremented internally (+1) to deliver the 1st through 4th M-bit accesses of the burst within a single cycle time. The array is N rows x N cols with M bit planes, an N x M SRAM row register, and an M-bit output.]

Other DRAM Architectures
  • Double Data Rate SDRAMs – DDR-SDRAMs (and DDR-SRAMs)
    • Double data rate because they transfer data on both the rising and falling edge of the clock
    • Are the most widely used form of SDRAMs
  • DDR2-SDRAMs

http://www.corsairmemory.com/corsair/products/tech/memory_basics/153707/main.swf

DRAM Memory Latency & Bandwidth Milestones
  • In the time that the memory-to-processor bandwidth doubles, the memory latency improves by a factor of only 1.2 to 1.4
  • To deliver such high bandwidth, the internal DRAM has to be organized as interleaved memory banks

Patterson, CACM Vol 47, #10, 2004

Memory Systems that Support Caches
  • The off-chip interconnect and memory architecture can affect overall system performance in dramatic ways

[Figure: one word wide organization (one word wide bus and one word wide memory): the on-chip CPU and Cache connect over a bus carrying 32-bit data & a 32-bit address per cycle to the off-chip Memory.]

  • Assume
    • 1 clock cycle to send the address
    • 25 clock cycles for DRAM cycle time, 8 clock cycles access time
    • 1 clock cycle to return a word of data
  • Memory-Bus to Cache bandwidth
    • number of bytes accessed from memory and transferred to cache/CPU per clock cycle

One Word Wide Memory Organization
  • If the block size is one word, then for a memory access due to a cache miss, the pipeline will have to stall the number of cycles required to return one data word from memory
    • 1 cycle to send the address
    • 25 cycles to read DRAM
    • 1 cycle to return the data
    • 27 total clock cycles miss penalty
  • Number of bytes transferred per clock cycle (bandwidth) for a single miss is
    • 4/27 = 0.148 bytes per clock

One Word Wide Memory Organization, con’t
  • What if the block size is four words?
    • 1 cycle to send 1st address
    • 4 x 25 = 100 cycles to read DRAM (each of the four 25-cycle accesses is done in turn)
    • 1 cycle to return the last data word
    • 102 total clock cycles miss penalty
  • Number of bytes transferred per clock cycle (bandwidth) for a single miss is
    • (4 x 4)/102 = 0.157 bytes per clock

One Word Wide Memory Organization, con’t
  • What if the block size is four words and a fast page mode DRAM is used?
    • 1 cycle to send 1st address
    • 25 + 3 x 8 = 49 cycles to read DRAM (one full access, then three 8-cycle page-mode accesses)
    • 1 cycle to return the last data word
    • 51 total clock cycles miss penalty
  • Number of bytes transferred per clock cycle (bandwidth) for a single miss is
    • (4 x 4)/51 = 0.314 bytes per clock

Interleaved Memory Organization
  • For a block size of four words, with the words spread across four memory banks (bank 0 to bank 3)
    • 1 cycle to send 1st address
    • 25 + 3 = 28 cycles to read DRAM (the banks are accessed in parallel and deliver one word per cycle after the first)
    • 1 cycle to return the last data word
    • 30 total clock cycles miss penalty
  • Number of bytes transferred per clock cycle (bandwidth) for a single miss is
    • (4 x 4)/30 = 0.533 bytes per clock
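The four miss-penalty and bandwidth figures from the preceding slides can be reproduced with a short Python sketch. The timing constants are the slides' example values; the function name miss_stats is just an illustrative label.

```python
# Sketch reproducing the miss-penalty / bandwidth numbers of the last four slides.
# Per the slides: 1 cycle to send an address, 25-cycle DRAM cycle time,
# 8-cycle fast-page-mode access, 1 cycle to return a word (4 bytes).

def miss_stats(read_cycles: int, bytes_moved: int) -> tuple[int, float]:
    """Return (miss penalty in cycles, bandwidth in bytes per cycle)."""
    total = 1 + read_cycles + 1          # send address + read DRAM + return last word
    return total, bytes_moved / total

cases = {
    "one word wide, 1-word block":      miss_stats(25,         4),   #  27 cycles, 0.148 B/cycle
    "one word wide, 4-word block":      miss_stats(4 * 25,     16),  # 102 cycles, 0.157
    "fast page mode, 4-word block":     miss_stats(25 + 3 * 8, 16),  #  51 cycles, 0.314
    "4-bank interleaved, 4-word block": miss_stats(25 + 3,     16),  #  30 cycles, 0.533
}
for name, (cycles, bw) in cases.items():
    print(f"{name:34s} {cycles:3d} cycles  {bw:.3f} bytes/cycle")
```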

DRAM Memory System Summary
  • It’s important to match the cache characteristics
    • caches access one block at a time (usually more than one word)
  • with the DRAM characteristics
    • use DRAMs that support fast multiple word accesses, preferably ones that match the block size of the cache
  • with the memory-bus characteristics
    • make sure the memory-bus can support the DRAM access rates and patterns
    • with the goal of increasing the Memory-Bus to Cache bandwidth
The Cache
  • Two questions to answer (in hardware):
    • Q1: How do we know if a data item is in the cache?
    • Q2: If it is, how do we find it?
  • Direct mapped
    • For each item of data at the lower level, there is exactly one location in the cache where it might be - so lots of items at the lower level must share locations in the upper level
    • Address mapping:

(block address) modulo (# of blocks in the cache)

    • First consider block sizes of one word
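A minimal sketch of this direct mapped placement rule, assuming a 4-block cache of one-word (4-byte) blocks as in the simple example that follows; the helper name map_block is illustrative.

```python
# Sketch of direct-mapped placement: (block address) modulo (# of blocks in the cache).
# Illustrative parameters: a 4-block cache of one-word (4-byte) blocks.
NUM_BLOCKS  = 4
BLOCK_BYTES = 4

def map_block(byte_address: int) -> tuple[int, int]:
    """Return (index, tag) for a byte address in a direct-mapped cache."""
    block_address = byte_address // BLOCK_BYTES
    index = block_address % NUM_BLOCKS      # the one cache block it may occupy
    tag   = block_address // NUM_BLOCKS     # identifies which memory block is resident
    return index, tag

# Memory blocks 0, 4, 8, ... all share cache index 0, so they must share that location:
assert map_block(0 * 4)[0] == map_block(4 * 4)[0] == 0
```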
Caching: A Simple First Example

[Figure: a four-block Cache (Index 00, 01, 10, 11, each entry holding a Valid bit, a Tag, and Data) in front of a 16-word Main Memory whose word addresses run from 0000xx to 1111xx; the two low order bits define the byte in the word (32b words).]

  • Q2: How do we find it?
    • Use the next 2 low order memory address bits – the index – to determine which cache block (i.e., (block address) modulo (# of blocks in the cache))
  • Q1: Is it there?
    • Compare the cache tag to the high order 2 memory address bits to tell if the memory block is in the cache

Direct Mapped Cache
  • Consider the main memory word reference string 0 1 2 3 4 3 4 15

Start with an empty cache - all blocks initially marked as not valid

    0  miss    1  miss    2  miss    3  miss
    4  miss (01 Mem(4) replaces 00 Mem(0) in block 00)
    3  hit     4  hit
    15 miss (11 Mem(15) replaces 00 Mem(3) in block 11)
  • 8 requests, 6 misses
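The trace above can be reproduced with a tiny direct mapped cache model. This sketch assumes one-word blocks and a 4-block cache, as on the slide; it is illustrative code, not part of the original deck.

```python
# Sketch: replaying the reference string on a 4-block direct-mapped cache
# (one word per block), reproducing the slide's 8 requests / 6 misses.
NUM_BLOCKS = 4
cache = [None] * NUM_BLOCKS          # each entry holds the tag of the resident block, or None

def access(word_address: int) -> bool:
    """Return True on a hit; on a miss, install the new tag."""
    index, tag = word_address % NUM_BLOCKS, word_address // NUM_BLOCKS
    if cache[index] == tag:
        return True
    cache[index] = tag               # miss: fetch the block and overwrite whatever was there
    return False

reference_string = [0, 1, 2, 3, 4, 3, 4, 15]
misses = sum(not access(addr) for addr in reference_string)
print(f"{len(reference_string)} requests, {misses} misses")   # -> 8 requests, 6 misses
```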
MIPS Direct Mapped Cache Example
  • One word/block, cache size = 1K words

[Figure: the 32-bit address (bits 31 30 . . . 13 12 11 . . . 2 1 0) is split into a 20-bit Tag, a 10-bit Index, and a 2-bit Byte offset. The Index selects one of the 1024 cache entries (0 to 1023), each holding a Valid bit, a 20-bit Tag, and 32 bits of Data; Hit is asserted when the entry is valid and its stored tag matches the address tag, and the 32-bit Data is returned.]

What kind of locality are we taking advantage of?

Handling Cache Hits
  • Read hits (I$ and D$)
    • this is what we want!
  • Write hits (D$ only)
    • allow cache and memory to be inconsistent
      • write the data only into the cache (then write-back the cache contents to the memory when that cache block is “evicted”)
      • need a dirty bit for each cache block to tell if it needs to be written back to memory when it is evicted
    • require the cache and memory to be consistent
      • always write the data into both the cache and the memory (write-through)
      • don’t need a dirty bit
      • writes run at the speed of the main memory - slow! – or can use a write buffer, so only have to stall if the write buffer is full
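A minimal sketch contrasting the two write-hit policies described above; the dictionaries and function names below are illustrative stand-ins under assumed structures, not a real cache implementation.

```python
# Minimal sketch of the two write-hit policies (data structures are illustrative).
cache_data   = {}     # index -> value held in the cache
dirty        = {}     # index -> True if the cache copy differs from memory (write-back only)
memory       = {}     # simplified main memory
write_buffer = []     # used by write-through to avoid stalling on every store

def write_hit_write_through(index: int, value: int) -> None:
    cache_data[index] = value
    write_buffer.append((index, value))    # memory is updated from the buffer later

def write_hit_write_back(index: int, value: int) -> None:
    cache_data[index] = value
    dirty[index] = True                    # memory updated only when the block is evicted

def evict_write_back(index: int) -> None:
    if dirty.get(index):
        memory[index] = cache_data[index]  # write the dirty block back on eviction
        dirty[index] = False
```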
Review: Why Pipeline? For Throughput!

[Figure: pipeline diagram with Time (clock cycles) on the horizontal axis and instruction order (Inst 0 through Inst 4) on the vertical axis; each instruction passes through the I$, Reg, ALU, D$, and Reg stages in successive cycles, so in steady state one instruction is fetching from the I$ while another is accessing the D$.]

  • To avoid a structural hazard need two caches on-chip: one for instructions (I$) and one for data (D$)

To keep the pipeline running at its maximum rate both I$ and D$ need to satisfy a request from the datapath every cycle.

What happens when they can’t do that?

Another Reference String Mapping
  • Consider the main memory word reference string 0 4 0 4 0 4 0 4

Start with an empty cache - all blocks initially marked as not valid

    0 miss (00 Mem(0) loaded into block 00)    4 miss (01 Mem(4) replaces it)
    0 miss (00 Mem(0) replaces it)             4 miss (01 Mem(4) replaces it)
    0 miss    4 miss    0 miss    4 miss
  • 8 requests, 8 misses
  • Ping pong effect due to conflict misses - two memory locations that map into the same cache block
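For reference, feeding this string (0 4 0 4 0 4 0 4) into the small direct mapped cache sketch shown earlier also reports 8 misses: word addresses 0 and 4 both map to cache block 0, so each access evicts the other.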
Sources of Cache Misses
  • Compulsory (cold start or process migration, first reference):
    • First access to a block, “cold” fact of life, not a whole lot you can do about it
    • If you are going to run “millions” of instructions, compulsory misses are insignificant
  • Conflict (collision)
    • Multiple memory locations mapped to the same cache location
      • Solution 1: increase cache size
      • Solution 2: increase associativity
  • Capacity
    • Cache cannot contain all blocks accessed by the program
      • Solution: increase cache size
Handling Cache Misses
  • Read misses (I$ and D$)
    • stall the pipeline, fetch the block from the next level in the memory hierarchy, write the word+tag in the cache and send the requested word to the processor, let the pipeline resume
  • Write misses (D$ only)
    • stall the pipeline, fetch the block from next level in the memory hierarchy, install it in the cache (may involve having to evict a dirty block if using a write-back cache), write the word+tag in the cache, let the pipeline resume

or (normally used in write-back caches)

    • Write allocate – just write the word+tag into the cache (may involve having to evict a dirty block), no need to check for cache hit, no need to stall

or (normally used in write-through caches with a write buffer)

    • No-write allocate – skip the cache write (but must invalidate that cache block since it will now hold stale data) and just write the word to the write buffer (and eventually to the next memory level), no need to stall if the write buffer isn’t full
Multiword Block Direct Mapped Cache
  • Four words/block, cache size = 1K words

[Figure: the 32-bit address (bits 31 30 . . . 13 12 11 . . . 4 3 2 1 0) is split into a 20-bit Tag, an 8-bit Index, a 2-bit Block offset, and a 2-bit Byte offset. The Index selects one of the 256 cache entries (0 to 255), each holding a Valid bit, a 20-bit Tag, and a four-word data block; the Block offset selects the requested 32-bit word and Hit is asserted on a valid tag match.]

What kind of locality are we taking advantage of?

Taking Advantage of Spatial Locality
  • Let the cache block hold more than one word: 0 1 2 3 4 3 4 15

Start with an empty cache - all blocks initially marked as not valid

    0 miss (00 Mem(1) Mem(0) loaded)    1 hit    2 miss (00 Mem(3) Mem(2) loaded)    3 hit
    4 miss (01 Mem(5) Mem(4) replaces 00 Mem(1) Mem(0))    3 hit    4 hit
    15 miss (11 Mem(15) Mem(14) replaces 00 Mem(3) Mem(2))
  • 8 requests, 4 misses
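The same toy model extended to multiword blocks reproduces this count. The sketch below assumes a two-block cache with two-word blocks, matching the example above (the parameters are the example's, not a realistic cache size).

```python
# Sketch: the same trace with two-word blocks in a two-block direct-mapped cache,
# reproducing the slide's 8 requests / 4 misses.
NUM_BLOCKS, WORDS_PER_BLOCK = 2, 2
cache = [None] * NUM_BLOCKS             # tag of the resident block, or None

def access(word_address: int) -> bool:
    block_address = word_address // WORDS_PER_BLOCK   # neighboring words share a block
    index, tag = block_address % NUM_BLOCKS, block_address // NUM_BLOCKS
    hit = cache[index] == tag
    if not hit:
        cache[index] = tag              # miss: fetch the whole multiword block
    return hit

reference_string = [0, 1, 2, 3, 4, 3, 4, 15]
misses = sum(not access(a) for a in reference_string)
print(f"{len(reference_string)} requests, {misses} misses")   # -> 8 requests, 4 misses
```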
Miss Rate vs Block Size vs Cache Size
  • Miss rate goes up if the block size becomes a significant fraction of the cache size because the number of blocks that can be held in the same size cache is smaller (increasing capacity misses)
Block Size Tradeoff

[Figure: three curves plotted against Block Size. Miss Rate first falls as larger blocks exploit spatial locality, then rises once having fewer blocks compromises temporal locality; Miss Penalty grows steadily with block size; Average Access Time therefore has a minimum, rising at large block sizes from the increased miss penalty & miss rate.]

  • Larger block sizes take advantage of spatial locality but
    • If the block size is too big relative to the cache size, the miss rate will go up
  • Larger block size means larger miss penalty
    • Latency to first word in block + transfer time for remaining words
  • In general, Average Memory Access Time
    = Hit Time + Miss Penalty x Miss Rate
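A minimal sketch of the average memory access time formula; the hit time, miss rate, and miss penalty values below are assumed for illustration, not taken from the slides.

```python
# Average Memory Access Time = Hit Time + Miss Rate x Miss Penalty.
# The numbers below are illustrative assumptions, not figures from the slides.
def amat(hit_time: float, miss_rate: float, miss_penalty: float) -> float:
    return hit_time + miss_rate * miss_penalty

small_blocks = amat(hit_time=1, miss_rate=0.05, miss_penalty=20)   # 2.0 cycles
large_blocks = amat(hit_time=1, miss_rate=0.03, miss_penalty=60)   # 2.8 cycles
# A lower miss rate does not help if the larger blocks inflate the miss penalty enough.
```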

Multiword Block Considerations
  • Read misses (I$ and D$)
    • Processed the same as for single word blocks – a miss returns the entire block from memory
    • Miss penalty grows as block size grows
      • Early restart – datapath resumes execution as soon as the requested word of the block is returned
      • Requested word first – requested word is transferred from the memory to the cache (and datapath) first
    • Nonblocking cache – allows the datapath to continue to access the cache while the cache is handling an earlier miss
  • Write misses (D$)
    • Can’t use write allocate or will end up with a “garbled” block in the cache (e.g., for 4 word blocks, a new tag, one word of data from the new block, and three words of data from the old block), so must fetch the block from memory first and pay the stall time
Other Ways to Reduce Cache Miss Rates
  • Allow more flexible block placement
    • In a direct mapped cache a memory block maps to exactly one cache block
    • At the other extreme, could allow a memory block to be mapped to any cache block – fully associative cache
    • A compromise is to divide the cache into sets, each of which consists of n “ways” (n-way set associative); a sketch of the set lookup follows below
  • Use multiple levels of caches
    • Add a second level of cache on chip – normally a unified L2 cache (i.e., it holds both instructions and data)
      • The L1 caches focus on minimizing hit time in support of a shorter clock cycle (smaller, with smaller block sizes)
      • The L2 cache focuses on reducing miss rate to reduce the penalty of long main memory access times (larger, with larger block sizes)
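As referenced above, here is a small sketch of an n-way set associative lookup; the 4-set/2-way geometry and the FIFO-style replacement are illustrative assumptions only.

```python
# Sketch of n-way set-associative placement: a block maps to one set
# ((block address) modulo (# of sets)) and may sit in any of that set's n ways.
# The parameters and the FIFO-style replacement are illustrative assumptions.
NUM_SETS, NUM_WAYS = 4, 2
sets = [[] for _ in range(NUM_SETS)]     # each set holds up to NUM_WAYS tags

def access(block_address: int) -> bool:
    index, tag = block_address % NUM_SETS, block_address // NUM_SETS
    ways = sets[index]
    if tag in ways:                      # all ways of the selected set are searched
        return True
    if len(ways) == NUM_WAYS:
        ways.pop(0)                      # evict the oldest block in the set (FIFO stand-in)
    ways.append(tag)
    return False

# Blocks 0 and 4 map to the same set but can now coexist, removing the ping pong conflict:
hits = [access(b) for b in (0, 4, 0, 4, 0, 4)]
print(hits)   # -> [False, False, True, True, True, True]
```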
Cache Summary
  • The Principle of Locality:
    • Program likely to access a relatively small portion of the address space at any instant of time
      • Temporal Locality: Locality in Time
      • Spatial Locality: Locality in Space
  • Three major categories of cache misses:
    • Compulsory misses: sad facts of life, e.g., cold start misses
    • Conflict misses: increase cache size and/or associativity (nightmare scenario: ping pong effect!)
    • Capacity misses: increase cache size
  • Cache design space
    • total size, block size, associativity (replacement policy)
    • write-hit policy (write-through, write-back)
    • write-miss policy (write allocate, write buffers)
Improving Cache Performance

Reduce the miss rate

  • bigger cache
  • associative cache
  • larger blocks (16 to 64 bytes)
  • use a victim cache – a small buffer that holds the most recently discarded blocks

Reduce the miss penalty

  • smaller blocks
    • for large blocks fetch critical word first
  • use a write buffer
    • check write buffer (and/or victim cache) on read miss – may get lucky
  • use multiple cache levels – L2 cache not tied to CPU clock rate
  • faster backing store / improved memory bandwidth
    • wider buses
    • SDRAMs

Reduce the hit time

  • smaller cache
  • direct mapped cache
  • smaller blocks
  • for writes
    • no write allocate – just write to write buffer
    • write allocate – write to a delayed write buffer that then writes to the cache