Computer Organization and Architecture
Chapter 7: Large and Fast: Exploiting Memory Hierarchy

Yu-Lun Kuo

Computer Science and Information Engineering

University of Tunghai, Taiwan

sscc6991@gmail.com

Major Components of a Computer

[Diagram: the processor (control and datapath), memory, and devices (input and output).]

Processor-Memory Performance Gap

[Chart: processor performance ("Moore's Law") grows about 55%/year (2X/1.5yr), while DRAM performance grows about 7%/year (2X/10yrs); the resulting processor-memory performance gap grows about 50%/year.]

Introduction
  • The Principle of Locality
    • Programs access a relatively small portion of the address space at any instant of time.
  • Two Different Types of Locality (both illustrated in the sketch below)
    • Temporal Locality (Locality in Time)
      • If an item is referenced, it will tend to be referenced again soon
        • e.g., loops, subroutines, stacks, counter variables
    • Spatial Locality (Locality in Space)
      • If an item is referenced, items whose addresses are close by will tend to be referenced soon
        • e.g., array elements accessed sequentially
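
A minimal C sketch (not from the slides) showing both kinds of locality at once: `sum` and `i` are re-referenced on every iteration (temporal locality), while `a[i]` walks consecutive addresses (spatial locality).

```c
#include <stdio.h>

int main(void) {
    int a[1024];
    for (int i = 0; i < 1024; i++)
        a[i] = i;

    long sum = 0;
    for (int i = 0; i < 1024; i++)   /* 'sum' and 'i' reused each pass: temporal locality */
        sum += a[i];                 /* consecutive elements of a[]: spatial locality */

    printf("%ld\n", sum);
    return 0;
}
```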
Memory Hierarchy
  • Memory Hierarchy
    • A structure that uses multiple levels of memories; as the distance from the CPU increases, the size of the memories and the access time both increase
    • Locality + smaller HW is faster = memory hierarchy
  • Levels
    • each smaller, faster, and more expensive per byte than the level below
  • Inclusive
    • data found in the top is also found in the bottom
Three Primary Technologies
  • Technologies used in building memory hierarchies
    • Main Memory
      • DRAM (dynamic random access memory)
    • Caches (closer to the processor)
      • SRAM (static random access memory)
  • DRAM vs. SRAM
    • Speed: DRAM < SRAM
    • Cost: DRAM < SRAM
Introduction
  • Cache memory
    • Built from SRAM (static RAM)
    • A small amount of very fast memory
    • Sits between normal main memory and the CPU
    • May be located on the CPU chip or module
A Typical Memory Hierarchy c. 2008

[Diagram: the CPU with its multiported register file (RF) feeds split L1 instruction and data primary caches (on-chip SRAM), which feed a large unified L2 cache (on-chip SRAM), which is backed by multiple interleaved memory banks (off-chip DRAM).]

A Typical Memory Hierarchy
  • By taking advantage of the principle of locality
    • Can present the user with as much memory as is available in the cheapest technology
    • at the speed offered by the fastest technology

[Diagram: on-chip components (control, datapath, RegFile, ITLB/DTLB, instruction and data caches) backed by a second-level cache (SRAM or eDRAM), main memory (DRAM), and secondary memory (disk).]

Speed (cycles):  ½'s      1's      10's     100's    1,000's
Size (bytes):    100's    K's      10K's    M's      G's to T's
Cost:            highest                             lowest

Characteristics of Memory Hierarchy
  • Inclusive: what is in L1$ is a subset of what is in L2$, which is a subset of what is in main memory (MM), which is a subset of what is in secondary memory (SM)
  • Typical transfer sizes between levels
    • Processor ↔ L1$: 4-8 bytes (word)
    • L1$ ↔ L2$: 8-32 bytes (block)
    • L2$ ↔ Main Memory: 1 to 4 blocks
    • Main Memory ↔ Secondary Memory: 1,024+ bytes (disk sector = page)
  • Access time increases with distance from the processor, and the (relative) size of the memory at each level increases as well

Memory Hierarchy List
  • Registers
  • L1 Cache
  • L2 Cache
  • L3 cache
  • Main memory
  • Disk cache
  • Disk (RAID)
  • Optical (DVD)
  • Tape
The Memory Hierarchy: Terminology

[Diagram: the processor exchanges data with an upper-level memory holding Blk X; a lower-level memory holds Blk Y.]

  • Hit: data is in some block in the upper level (Blk X)
    • Hit Rate: the fraction of memory accesses found in the upper level
    • Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss

The Memory Hierarchy: Terminology (continued)
  • Miss: data is not in the upper level, so it must be retrieved from a block in the lower level (Blk Y)
    • Miss Rate = 1 - (Hit Rate)
    • Miss Penalty
      • Time to replace a block in the upper level + time to deliver the block to the processor
      • Hit Time << Miss Penalty
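
These definitions combine into the usual average memory access time formula (not stated on the slide, but standard): AMAT = Hit Time + Miss Rate x Miss Penalty. For example, with a 1-cycle hit time, a 5% miss rate, and a 100-cycle miss penalty, AMAT = 1 + 0.05 x 100 = 6 cycles.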
How is the Hierarchy Managed?
  • registers ↔ memory
    • by the compiler (programmer?)
  • cache ↔ main memory
    • by the cache controller hardware
  • main memory ↔ disks
    • by the operating system (virtual memory)
    • virtual-to-physical address mapping assisted by the hardware (TLB)
    • by the programmer (files)
7.2 The Basics of Caches
  • Simple cache
    • The processor requests are each one word
    • The block size is one word of data
  • Two questions to answer (in hardware):
    • Q1: How do we know if a data item is in the cache?
    • Q2: If it is, how do we find it?
Caches
  • Direct Mapped
    • Assign the cache location based on the address of the word in memory
    • Address mapping:

(block address) modulo (# of blocks in the cache)

      • First consider block sizes of one word
Caches
  • Tag
    • Contains the address information required to identify whether a word in the cache corresponds to the requested word
  • Valid bit
    • After executing many instructions, some of the cache entries may still be empty
    • Indicates whether an entry contains a valid address
      • If valid bit = 0, there cannot be a match for this block (see the lookup sketch below)
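
A minimal C sketch of these ideas (assumed layout, not from the slides): a direct-mapped cache of one-word blocks, where the index is (block address) modulo (# of blocks) and a lookup is a hit only when the valid bit is set and the tags match. `NUM_BLOCKS` and the struct names are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_BLOCKS 4              /* illustrative size; a power of 2 */

struct cache_line {
    bool     valid;               /* 0 => this entry cannot match */
    uint32_t tag;                 /* upper address bits identifying the word */
    uint32_t data;                /* one-word block */
};

static struct cache_line cache[NUM_BLOCKS];

/* Q1/Q2 from the slides: is the word in the cache, and where do we find it? */
bool lookup(uint32_t block_addr, uint32_t *data_out) {
    uint32_t index = block_addr % NUM_BLOCKS;   /* (block address) modulo (# of blocks) */
    uint32_t tag   = block_addr / NUM_BLOCKS;   /* remaining upper bits */

    if (cache[index].valid && cache[index].tag == tag) {
        *data_out = cache[index].data;          /* hit */
        return true;
    }
    return false;                               /* miss */
}
```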
Direct Mapped Cache
  • Consider the main memory word reference string

0 1 2 3 4 3 4 15

Start with an empty cache - all blocks initially marked as not valid. With 4 one-word blocks, word address X maps to index X modulo 4; the two-bit tag is shown before each cached word.

Ref | Result | Cache contents after the access (index 0..3)
 0  | miss   | 00 Mem(0), -, -, -
 1  | miss   | 00 Mem(0), 00 Mem(1), -, -
 2  | miss   | 00 Mem(0), 00 Mem(1), 00 Mem(2), -
 3  | miss   | 00 Mem(0), 00 Mem(1), 00 Mem(2), 00 Mem(3)
 4  | miss   | 01 Mem(4), 00 Mem(1), 00 Mem(2), 00 Mem(3)   (4 replaces 0 at index 0)
 3  | hit    | unchanged
 4  | hit    | unchanged
15  | miss   | 01 Mem(4), 00 Mem(1), 00 Mem(2), 11 Mem(15)  (15 replaces 3 at index 3)

  • 8 requests, 6 misses
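
A small C driver (assumed, self-contained) that replays this reference string through a 4-block direct-mapped cache and reproduces the 6 misses:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_BLOCKS 4

struct line { bool valid; uint32_t tag; };

int main(void) {
    struct line cache[NUM_BLOCKS] = {0};          /* all blocks start not valid */
    uint32_t refs[] = {0, 1, 2, 3, 4, 3, 4, 15};  /* the slide's reference string */
    int misses = 0;

    for (int i = 0; i < 8; i++) {
        uint32_t index = refs[i] % NUM_BLOCKS;    /* direct-mapped placement */
        uint32_t tag   = refs[i] / NUM_BLOCKS;

        if (cache[index].valid && cache[index].tag == tag) {
            printf("%2u: hit\n", refs[i]);
        } else {
            printf("%2u: miss\n", refs[i]);
            cache[index].valid = true;            /* fetch the block on a miss */
            cache[index].tag   = tag;
            misses++;
        }
    }
    printf("8 requests, %d misses\n", misses);    /* prints 6 misses */
    return 0;
}
```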
Hits vs. Misses
  • Read hits
    • this is what we want!
  • Read misses
    • stall the CPU, fetch the block from memory, deliver it to the cache, restart the access
  • Write hits
    • can replace the data in cache and memory (write-through)
    • or write the data only into the cache (and write it back to memory later)
  • Write misses
    • read the entire block into the cache, then write the word
What happens on a write?
  • Writes work somewhat differently
    • Suppose a store instruction
      • writes the data only into the data cache
      • Memory would then have a different value
        • The cache & memory are "inconsistent"
    • Keeping the main memory & cache consistent
      • Always write the data into both the memory and the cache
      • Called write-through
What happens on a write?
  • Although this design handles writes simply
    • It does not provide very good performance
      • Every write causes the data to be written to main memory
      • These writes take a long time
      • Ex. suppose 10% of the instructions are stores, the CPI without cache misses is 1.0, and every write spends 100 extra cycles:

CPI = 1.0 + 100 x 10% = 11

reducing performance by more than a factor of 10

Write Buffer for Write Through

[Diagram: Processor → Cache, with a Write Buffer between the Cache and DRAM.]

  • A Write Buffer is needed between the cache and memory
    • A queue that holds data while the data are waiting to be written to memory
    • Processor:
      • writes data into the cache and the write buffer
    • Memory controller:
      • writes the contents of the buffer to memory
What happens on a write?
  • Write-back
    • The new value is written only to the block in the cache
    • The modified block is written to the lower level of the hierarchy when it is replaced
What happens on a write?
  • Write Through
    • All writes go to main memory as well as the cache
    • Multiple CPUs can monitor main memory traffic to keep local (to CPU) caches up to date
    • Lots of traffic
    • Slows down writes
  • Write Back
    • Updates are initially made in the cache only
    • An update bit for the cache slot is set when the update occurs
    • If a block is to be replaced, write it to main memory only if its update bit is set
    • Other caches can get out of sync
  • (Both policies are sketched in code below)
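
A minimal C sketch (assumed: one-word blocks, no write buffer, illustrative names) contrasting the two write-hit policies; `mem_write` stands in for the slow path to DRAM:

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_BLOCKS 4

struct line { bool valid; bool dirty; uint32_t tag; uint32_t data; };
static struct line cache[NUM_BLOCKS];

/* Stand-in for the slow write path to DRAM. */
static void mem_write(uint32_t addr, uint32_t value) { (void)addr; (void)value; }

/* Write-through: on a write hit, update both the cache and main memory,
   so the two never become inconsistent (but every store pays the memory cost). */
static void write_through(uint32_t addr, uint32_t value) {
    uint32_t i = addr % NUM_BLOCKS;
    cache[i].data = value;
    mem_write(addr, value);
}

/* Write-back: on a write hit, update only the cache and set the update
   (dirty) bit; main memory is brought up to date only at replacement. */
static void write_back_store(uint32_t addr, uint32_t value) {
    uint32_t i = addr % NUM_BLOCKS;
    cache[i].data  = value;
    cache[i].dirty = true;
}

/* Replacement in a write-back cache: flush the victim only if it is dirty. */
static void replace(uint32_t i, uint32_t new_tag, uint32_t new_data) {
    if (cache[i].valid && cache[i].dirty)
        mem_write(cache[i].tag * NUM_BLOCKS + i, cache[i].data);
    cache[i] = (struct line){ true, false, new_tag, new_data };
}

int main(void) {
    write_through(5, 42);       /* store to word address 5, write-through policy */
    write_back_store(6, 43);    /* store to word address 6, write-back policy */
    replace(6 % NUM_BLOCKS, 9 / NUM_BLOCKS, 44);  /* evicting the dirty block flushes it */
    return 0;
}
```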
Memory System to Support Caches
  • It is difficult to reduce the latency to fetch the first word from memory
    • but we can reduce the miss penalty if we increase the bandwidth from the memory to the cache

[Diagram: three organizations, each CPU → Cache → bus → memory: (a) one-word-wide memory; (b) wide memory, with a multiplexor between the cache and the CPU; (c) four interleaved one-word-wide memory banks (bank 0 to bank 3).]

One-word-wide memory organization
  • Assume
    • A cache block of 4 words
    • 1 memory bus clock cycle to send the address
    • 15 clock cycles for each DRAM access initiated
    • 1 memory bus clock cycle to return a word of data
    • Miss penalty: 1 + 4 x 15 + 4 x 1 = 65 clock cycles
    • Bytes transferred per bus clock cycle for a single miss
      • 4 x 4 / 65 ≈ 0.25

Wide memory organization
  • Assume (same parameters as above)
    • A cache block of 4 words
    • 1 memory bus clock cycle to send the address
    • 15 clock cycles for each DRAM access initiated
    • 1 memory bus clock cycle to return a word of data
    • Two-word-wide memory and bus
      • Miss penalty: 1 + 2 x 15 + 2 x 1 = 33 clock cycles
      • Bandwidth: 4 x 4 / 33 ≈ 0.48 bytes/clock
    • Four-word-wide memory and bus
      • Miss penalty: 1 + 1 x 15 + 1 x 1 = 17 clock cycles
      • Bandwidth: 4 x 4 / 17 ≈ 0.94 bytes/clock

Interleaved memory organization
  • Assume (same parameters as above)
    • A cache block of 4 words
    • 1 memory bus clock cycle to send the address
    • 15 clock cycles for each DRAM access initiated
    • 1 memory bus clock cycle to return a word of data
    • Four memory banks, each 1 word wide
      • Advantage: the banks overlap their accesses, so only one DRAM latency is paid
    • Miss penalty: 1 + 1 x 15 + 4 x 1 = 20 clock cycles
    • Bandwidth: 4 x 4 / 20 = 0.8 bytes/clock
      • about 3 times that of the one-word-wide organization
  • (A small program checking these numbers follows)
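
A quick C check (assumed helper and parameter names) of the miss penalties and bandwidths computed on the last three slides:

```c
#include <stdio.h>

/* Miss penalty for a 4-word block: one cycle to send the address, one
   15-cycle DRAM latency per non-overlapped access, one bus cycle per
   bus-width transfer. */
static int miss_penalty(int dram_latencies, int bus_transfers) {
    const int addr_cycles = 1, dram_latency = 15, bus_cycle = 1;
    return addr_cycles + dram_latencies * dram_latency + bus_transfers * bus_cycle;
}

int main(void) {
    const double block_bytes = 4 * 4;     /* 4 words x 4 bytes */

    int one_word   = miss_penalty(4, 4);  /* 65 cycles: one word at a time   */
    int two_wide   = miss_penalty(2, 2);  /* 33 cycles: two-word-wide        */
    int four_wide  = miss_penalty(1, 1);  /* 17 cycles: four-word-wide       */
    int interleave = miss_penalty(1, 4);  /* 20 cycles: 4 overlapped banks   */

    printf("one-word-wide:  %2d cycles, %.2f bytes/clock\n", one_word,   block_bytes / one_word);
    printf("two-word-wide:  %2d cycles, %.2f bytes/clock\n", two_wide,   block_bytes / two_wide);
    printf("four-word-wide: %2d cycles, %.2f bytes/clock\n", four_wide,  block_bytes / four_wide);
    printf("interleaved:    %2d cycles, %.2f bytes/clock\n", interleave, block_bytes / interleave);
    return 0;
}
```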