Chapter 5 Memory Hierarchy Design

Introduction

  • The need for a memory hierarchy in computer system design follows from two factors:

    • Locality of reference: The nature of program behavior

    • Large gap in speed between the CPU and slower memory devices such as DRAM.

  • Level of memory hierarchy

    • High level <--- --> Low level

    • CPU Register, Cache, Main-memory, Disk

    • The levels of the hierarchy subset one another: all data in one level is also found in the level below.

Memory Hierarchy Difference between Desktops and Embedded Processors

  • Memory hierarchy for desktops

    • Speed

  • Memory hierarchy for Embedded Processors

    • Real-time applications need to care about worst-case performance.

    • Power consumption is a concern.

    • No memory hierarchy is actually needed for simple, fixed applications running on embedded processors.

    • Main memory itself may be quite small.

ABCs of Caches

  • Recalling some terms

    • Cache: The name given to the first level of the memory hierarchy encountered once the address leaves the CPU.

    • Miss rate: The fraction of accesses not in the cache.

    • Miss penalty: The additional time to service the miss.

    • Block: The minimum unit of information that can be present in the cache.

  • Four questions about any level of the hierarchy:

    • Q1: Where can a block be placed in the upper level? (Block placement)

    • Q2: How is a block found if it is in the upper level? (Block identification)

    • Q3: Which block should be replaced on a miss? (Block replacement)

    • Q4: What happens on a write? (Write strategy)

Cache Performance

  • Formula for performance evaluation

    • CPU execution time = (CPU clock cycles + Memory stall cycles) * Clock cycle time = IC * (CPI_execution + Memory stall clock cycles / IC) * Clock cycle time

    • Memory stall cycles = IC * Memory references per instruction * Miss rate * Miss penalty

    • Measure of memory-hierarchy performance

      Average memory access time = Hit time + Miss rate * Miss penalty

  • Example on page 395.

  • Example on page 396.
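The three formulas on this slide can be evaluated directly. The following is a minimal sketch; the function names and the input values are illustrative assumptions and are not taken from the textbook examples cited above.

```python
def memory_stall_cycles(ic, refs_per_instr, miss_rate, miss_penalty):
    """Memory stall cycles = IC * references per instruction * miss rate * miss penalty."""
    return ic * refs_per_instr * miss_rate * miss_penalty


def cpu_execution_time(ic, cpi_execution, stall_cycles, clock_cycle_time):
    """CPU time = IC * (CPI_execution + stall cycles / IC) * clock cycle time."""
    return ic * (cpi_execution + stall_cycles / ic) * clock_cycle_time


def average_memory_access_time(hit_time, miss_rate, miss_penalty):
    """AMAT = hit time + miss rate * miss penalty (all in clock cycles here)."""
    return hit_time + miss_rate * miss_penalty


# Illustrative inputs only (assumed values, not from the book).
stalls = memory_stall_cycles(ic=1_000_000, refs_per_instr=1.5,
                             miss_rate=0.02, miss_penalty=100)
seconds = cpu_execution_time(ic=1_000_000, cpi_execution=1.0,
                             stall_cycles=stalls, clock_cycle_time=1e-9)
amat = average_memory_access_time(hit_time=1, miss_rate=0.02, miss_penalty=100)
print(stalls, seconds, amat)
```

With these assumed inputs, memory stalls contribute three times as many cycles as instruction execution itself, which is why the miss rate and miss penalty receive so much attention in the rest of the chapter.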

Four Memory Hierarchy Questions

Q1: Where can a block be placed in the upper level? (Block placement)

Q2: How is a block found if it is in the upper level? (Block identification)

Q3: Which block should be replaced on a miss? (Block replacement)

Q4: What happens on a write? (Write strategy)

Block Placement (1)

  • Q1: Where can a block be placed in a cache?

    • Direct mapped: Each block has only one place it can appear in the cache. The mapping is usually

      (Block address) MOD (Number of blocks in cache)

    • Fully associative: A block can be placed anywhere in the cache.

    • Set associative: A block can be placed in a restricted set of places in the cache. A set is a group of blocks in the cache. A block is first mapped onto a set, and then the block can be placed anywhere within that set. The set is usually obtained by

      (block address) MOD (Number of sets in a cache)

      • If there are n blocks in a set, the cache is called n-way set associative.
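The three placement schemes are the same MOD mapping with a different number of blocks per set. The sketch below assumes that the frames of a set are numbered consecutively, which is only a convention chosen for this illustration; the function name is hypothetical.

```python
def candidate_frames(block_address, num_blocks, ways):
    """Return the cache frame numbers where a memory block may be placed.

    ways == 1          -> direct mapped
    ways == num_blocks -> fully associative
    otherwise          -> ways-way set associative
    Assumption for this sketch: set s owns frames s*ways .. s*ways + ways - 1.
    """
    num_sets = num_blocks // ways
    set_index = block_address % num_sets  # (block address) MOD (number of sets)
    first = set_index * ways
    return list(range(first, first + ways))


# Block address 12 in an 8-block cache under the three schemes.
print(candidate_frames(12, 8, 1))  # direct mapped
print(candidate_frames(12, 8, 2))  # two-way set associative
print(candidate_frames(12, 8, 8))  # fully associative
```

Direct mapped gives exactly one candidate frame, fully associative gives all eight, and the two-way case gives the two frames of set 0.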

Block Identification

  • Q2: How is a block found if it is in the cache?

    • Each cache block consists of

      • Address tag: Gives the block address.

      • Valid bit: Indicates whether or not the associated entry contains a valid address.

      • Data

    • Relationship of a CPU address to the cache

      • Address presented by CPU

      • Block address ## Block offset

        • Tag: Compared against the address tags of the blocks in the selected set.

        • Index: Selects the set.

        • Block offset: Selects the desired data from the block.

Identification Steps

  • Index field of the CPU address is used to select a set.

  • Tag field presented by the CPU is compared in parallel to all address tags of the blocks in the selected set.

  • If any address tag matches the tag field of the CPU address and its valid bit is true, it is a cache hit.

  • Offset field is used to select the desired data.
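The four identification steps can be sketched as a software lookup. This is only an illustration: hardware compares the tags of a set in parallel, whereas the loop below checks them one after another. The `Line` class, the function names, and the field widths passed in are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Line:
    valid: bool = False
    tag: int = 0
    data: bytes = b""


def split_address(addr, offset_bits, index_bits):
    """Split a CPU address into (tag, index, block offset)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


def lookup(sets, addr, offset_bits, index_bits):
    """Return the addressed byte on a hit, or None on a miss."""
    tag, index, offset = split_address(addr, offset_bits, index_bits)
    for line in sets[index]:               # step 1: the index selects the set
        if line.valid and line.tag == tag:  # steps 2-3: tag match and valid bit
            return line.data[offset]        # step 4: the offset selects the data
    return None
```

A cache here is a list of sets, each set a list of `Line` objects; a two-way cache with four sets would be `[[Line(), Line()] for _ in range(4)]`.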

Associativity versus Index Field

  • If the total cache size is kept the same,

    • Increasing associativity increases the number of blocks per set, thereby decreasing the size of the index and increasing the size of the tag.

  • The following formula characterizes this property:

    2^index = (cache size) / (block size * set associativity)
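As a quick check of the index relation, the sketch below computes the number of index bits for a fixed cache size while the associativity varies. It assumes all sizes are powers of two; the function name is hypothetical.

```python
def index_bits(cache_size, block_size, associativity):
    """Number of index bits, from 2^index = cache size / (block size * associativity).

    Assumes the resulting number of sets is a power of two.
    """
    num_sets = cache_size // (block_size * associativity)
    return num_sets.bit_length() - 1  # log2 of a power of two


# A 64 KB cache with 64-byte blocks at increasing associativity.
for ways in (1, 2, 4):
    print(ways, index_bits(64 * 1024, 64, ways))
```

Each doubling of associativity removes one index bit, and that bit moves into the tag.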

Block Replacement

  • Q3: Which block should be replaced on a cache miss?

    • For direct mapped cache, the answer is obvious.

    • For set associative or fully associative cache, the following three strategies can be used:

      • Random

      • Least-recently used (LRU)

      • First in, first out (FIFO)
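The difference between LRU and FIFO shows up only when a resident block is reused before an eviction. The following sketch counts misses for a single cache set under either policy; the trace and function name are made up for illustration, and random replacement is left out so that the result is deterministic.

```python
from collections import OrderedDict


def count_misses(trace, ways, policy):
    """Count misses for one cache set under 'lru' or 'fifo' replacement."""
    resident = OrderedDict()  # the first entry is the next victim
    misses = 0
    for tag in trace:
        if tag in resident:
            if policy == "lru":
                resident.move_to_end(tag)  # a hit refreshes recency under LRU only
            continue
        misses += 1
        if len(resident) == ways:
            resident.popitem(last=False)   # evict the entry at the front
        resident[tag] = True
    return misses


trace = [1, 2, 1, 3, 1, 4]  # block 1 is reused repeatedly
print(count_misses(trace, ways=2, policy="lru"))
print(count_misses(trace, ways=2, policy="fifo"))
```

On this trace LRU keeps the frequently reused block 1 resident, while FIFO evicts it simply because it was loaded first.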

Comparison of Miss Rate between Random and LRU

  • Fig. 5.6 on page 400

Write Strategy

  • Q4: What happens on a write?

    • Traffic patterns

      • “Writes” account for about 7% of the overall memory traffic and about 25% of the data cache traffic.

      • Though “read” dominates processor cache traffic, “write” still cannot be ignored in a high-performance design.

    • “Read” can be done faster than “write”

      • In reading, the block data can be read at the same time that the tag is read and compared.

      • In writing, modifying a block cannot begin until the tag is checked to see if the address is a hit.

Write Policies and Write Miss Options

  • Write policies

    • Write through (or store through)

      • Write to both the block in the cache and the block in the lower-level memory.

    • Write back

      • Write only to the block in the cache. A dirty bit, attached to each block in the cache, is set when the block is modified. When a block is being replaced and the dirty bit is set, the block is copied back to main memory. This can reduce bus traffic.

  • Common options on a write miss

    • Write allocate

      • The block is loaded on a write miss, followed by the write-hit.

    • No-write allocate (write around)

      • The block is modified in the lower level and not loaded into the cache.

  • Either write miss option can be used with write through or write back, but write-back caches generally use write allocate and write-through caches often use no-write allocate.
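To see why write back can reduce traffic to the lower level, the sketch below counts the writes that reach memory for a stream of stores. It is a deliberately tiny model under stated assumptions: the cache holds a single block, the trace contains only stores, write through is paired with no-write allocate, and write back is paired with write allocate. The function name and trace are hypothetical.

```python
def memory_writes(trace, policy):
    """Count writes reaching lower-level memory for a one-block cache.

    trace:  block addresses stored to by the CPU, in order.
    policy: 'write_through' (with no-write allocate) or
            'write_back' (with write allocate).
    """
    cached, dirty, writes = None, False, 0
    for block in trace:
        if policy == "write_through":
            writes += 1              # every store also updates memory
        else:
            if block != cached:      # write miss
                if dirty:
                    writes += 1      # copy the dirty victim back first
                cached, dirty = block, False  # write allocate: load the block
            dirty = True             # the store now hits and sets the dirty bit
    if policy == "write_back" and dirty:
        writes += 1                  # final flush so both counts are comparable
    return writes


stores = [7, 7, 7, 9, 9]
print(memory_writes(stores, "write_through"))
print(memory_writes(stores, "write_back"))
```

Repeated stores to the same block are absorbed by the cache under write back, so only one write per evicted dirty block reaches memory.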

Comparison between Write Through and Write Back

  • Write back can reduce bus traffic, but the content of cache blocks can be inconsistent with that of the blocks in main memory at some moment.

  • Write through increases bus traffic, but the content is consistent all the time.

  • Reduce write stall

    • Use a write buffer. As soon as the CPU places the write data into the write buffer, the CPU is allowed to continue.

  • Example on page 402

An Example: The Alpha 21264 Data Cache

  • Features

    • 64K bytes of data in 64-byte blocks.

    • Two-way set associative.

    • Write back with a dirty bit.

    • Write allocate on a write miss.

  • The CPU address

    • 48-bit virtual address

    • 44-bit physical address

      • 38-bit block address

        • 29-bit tag address

        • 9-bit index, obtained from 2^index = 512 = 65536 / (64 * 2)

      • 6-bit block offset

  • FIFO replacement strategy

  • What happens on a miss?

    • The 64-byte block is fetched from main memory in four transfers, each taking 5 clock cycles.

Unified versus Split Caches

  • Unified cache: A single cache that contains both instructions and data.

  • Split caches: Data is held only in the data cache, while instructions are held in the instruction cache.

    • Fig. 5.8 on page 406.

Cache Performance

  • Average memory access time for processors with in-order execution

    Average memory access time = Hit time + Miss rate * Miss penalty

    • Examples on pages 408 and 409

  • Miss penalty and out-of-order execution processors

    Memory stall cycles / instruction = Misses/instruction * (Total miss latency – Overlapped miss latency)

    • Length of memory latency: Time between the start and the end of a memory reference in an out-of-order processor.

    • Length of latency overlap: A time period of memory latency overlapping the operations of the processor.

Improving Cache Performance

  • Reduce the miss rate

  • Reduce the miss penalty

  • Reduce the hit time

  • Reduce the miss penalty or miss rate via parallelism

Reducing Cache Miss Penalty

  • Multilevel caches

  • Critical word first and early restart

  • Giving priority to read misses over writes

  • Merging write buffers

  • Victim caches

Multilevel Caches

  • Question:

    • Larger cache or faster cache? A contradictory scenario.

    • Solution:

      • Adding another level of cache.

    • Second level cache complicates performance evaluation of cache memory.

      Average memory access time = Hit time_L1 + Miss rate_L1 * Miss penalty_L1

      Miss penalty_L1 = Hit time_L2 + Miss rate_L2 * Miss penalty_L2
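Substituting the second equation into the first gives the two-level average memory access time. A minimal sketch follows, with assumed illustrative numbers (in clock cycles) and a hypothetical function name; the L2 miss rate used here is the local miss rate.

```python
def amat_two_level(hit_time_l1, miss_rate_l1,
                   hit_time_l2, local_miss_rate_l2, miss_penalty_l2):
    """AMAT with a second-level cache.

    The L1 miss penalty is itself the access time of L2:
    hit time of L2 plus the L2 (local) miss rate times the L2 miss penalty.
    """
    miss_penalty_l1 = hit_time_l2 + local_miss_rate_l2 * miss_penalty_l2
    return hit_time_l1 + miss_rate_l1 * miss_penalty_l1


# Assumed values for illustration only.
with_l2 = amat_two_level(1, 0.04, 10, 0.5, 100)
without_l2 = 1 + 0.04 * 100  # same L1, misses go straight to memory
print(with_l2, without_l2)
```

With these assumed values the second-level cache cuts the effective L1 miss penalty from 100 cycles to 60 cycles.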

Local and Global Miss Rates

  • The second-level miss rate is measured on the leftovers from the first-level cache.

    • Local miss rate (Miss rate_L2)

      • The number of misses in the cache divided by the total number of memory accesses to this cache.

    • Global miss rate (Miss rate_L1 * Miss rate_L2)

      • The number of misses in the cache divided by the total number of memory accesses generated by the CPU.
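The two definitions differ only in the denominator. The sketch below derives both from raw event counts; the counts and the function name are assumptions chosen for illustration.

```python
def miss_rates(cpu_accesses, l1_misses, l2_misses):
    """Return (L1 miss rate, local L2 miss rate, global L2 miss rate)."""
    l1 = l1_misses / cpu_accesses
    local_l2 = l2_misses / l1_misses      # L2 is accessed only on L1 misses
    global_l2 = l2_misses / cpu_accesses  # relative to all CPU accesses
    return l1, local_l2, global_l2


# Assumed counts: 1000 CPU accesses, 40 miss in L1, 20 of those also miss in L2.
l1, local_l2, global_l2 = miss_rates(1000, 40, 20)
print(l1, local_l2, global_l2)
```

The local L2 miss rate looks high (half of its accesses miss) because L1 has already filtered out the easy hits, while the global rate shows that only 2% of CPU accesses go all the way to memory.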

Two Insights and Questions

  • Two insights from the observation of the results shown above:

    • The global cache miss rate is very similar to the single cache miss rate of the second-level cache, provided that the second-level cache is much larger than the first-level cache.

    • The local cache miss rate is not a good measure of secondary caches; the global cache miss rate should be used because the effectiveness of the second-level cache is a function of the miss rate of the first-level cache.

  • Two questions for the design of the second-level cache:

    • Will it lower the average memory access time portion of the CPI?

    • How much does it cost?

Example (P417)

Early Restart and Critical Word First

  • Basic idea: Don’t wait for the full block to be loaded before sending the requested word and restarting the CPU.

    • Two strategies:

      • Early restart: As soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.

      • Critical word first: Request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block.

    • Example on page 419.

Giving Priority to Read Misses over Writes

  • A write buffer can free the CPU from waiting for the completion of write, but it could hold the updated value of a location needed on a read miss. This complicates memory access, i.e., it may cause a RAW hazard.

    • Two solutions:

      • The read miss waits until the write buffer is empty. This certainly increases miss penalty. Or,

      • Check the contents of the write buffer on a read miss, and let the read miss fetch the data from the write buffer.

    • Example on page 419
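The second solution, checking the write buffer on a read miss, can be sketched as follows. The write buffer is modeled as a list of (address, value) pairs with the oldest entry first, and memory as a dictionary; both representations and the function name are assumptions made for this illustration.

```python
def read_on_miss(address, write_buffer, memory):
    """Service a read miss without waiting for the write buffer to drain.

    write_buffer: list of (address, value) pairs, oldest first.
    memory:       dict mapping address -> value in the lower level.
    """
    # Search newest to oldest so the most recent pending store wins.
    for buffered_address, value in reversed(write_buffer):
        if buffered_address == address:
            return value          # forward the pending value: no RAW hazard
    return memory[address]        # no conflict: read the lower level directly


memory = {512: 1, 1024: 2}
pending = [(512, 5), (1024, 7), (512, 9)]
print(read_on_miss(512, pending, memory))
```

Reading location 512 returns the newest buffered value rather than the stale copy in memory, and the read does not have to wait for the buffer to empty.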

Victim Caches (1)

  • Victim cache

    • A small fully associative cache that contains only blocks that are discarded from a cache because of a miss (“victims”).

      • The blocks of the victim cache are checked on a miss to see if they have the desired data before going to the next lower-level memory. If the data is found there, the victim block and cache block are swapped.

      • A four entry victim cache can remove 20% to 95% of the conflict misses in a 4-KB direct mapped data cache.

Victim Caches (2)

Reducing Miss Rate

  • Larger block size

  • Larger caches

  • Higher associativity

  • Way prediction and pseudoassociative caches

  • Compiler optimizations

Miss Categories

  • Compulsory miss

    • The first access to a block is not in the cache.

  • Capacity miss

    • Occurs because blocks are discarded and later retrieved when the cache cannot contain all the blocks needed during execution of a program.

  • Conflict miss

    • Occurs in direct-mapped or set-associative caches because a block can be discarded and later retrieved if too many blocks map to its set.

  • What can a designer do with the miss rate?

    • Reducing conflict misses is conceptually the easiest: use full associativity, but this is very expensive in hardware.

    • Reduce capacity misses: Use a larger cache.

    • Reduce compulsory misses: Use larger blocks.

Larger Block Size

  • Reduces compulsory misses by taking advantage of spatial locality.

  • Increases miss penalty.

  • Increases capacity misses if the cache is small.

  • The selection of block size depends on both the latency and bandwidth of the lower-level memory:

    • High latency and high bandwidth encourage larger block sizes.

    • Low latency and low bandwidth encourage smaller block sizes.

  • Example on page 426.

Example (P426)

Larger Caches

  • Drawbacks

    • Longer hit time

    • Higher cost

Higher Associativity

  • Two general rules of thumb

    • 8-way set associative is for practical purposes as effective in reducing misses as fully associative.

    • 2:1 cache rule of thumb

      • A direct mapped cache of size N has about the same miss rate as a 2-way set-associative cache of size N/2.

  • The pressure of a fast processor clock cycle encourages simple caches, but the increasing miss penalty rewards associativity.

  • Example on page 429.

Average Memory Access Time versus Associativity

  • Fig. 5.19 on page 430

Way Prediction

  • Reduce conflict misses and yet maintain the hit speed of a direct-mapped cache.

  • Way prediction

    • Extra bits are kept in the cache to predict the way, or block within the set, of the next cache access.

    • It means the MUX can be set early to select the desired block.

    • A misprediction results in checking the other blocks for matches.

    • The Alpha 21264 uses this technique.

      • Correctly predicted hits take 1 cycle

      • Mispredicted hits take 3 cycles

    • Can also be used to reduce power consumption.

Pseudoassociative Caches

  • Accesses proceed just as in the direct-mapped cache for a hit.

  • On a miss, a second cache entry is checked to see if it matches there.

Compiler Optimizations

  • Loop interchange

    • Reduces misses by improving spatial locality

  • Blocking

    • Reduces capacity misses
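Both transformations only reorder loops; the computed result is unchanged. The sketch below uses Python purely to show the loop structure. The locality benefit applies to arrays stored in row-major order, as in C, and does not carry over to Python lists of lists. The function names and the blocking factor are illustrative assumptions.

```python
def scale_before_interchange(x, rows, cols):
    """Inner loop walks down a column: consecutive accesses are a full row apart."""
    for j in range(cols):
        for i in range(rows):
            x[i][j] = 2 * x[i][j]


def scale_after_interchange(x, rows, cols):
    """Loops interchanged: the inner loop walks along one row (sequential access)."""
    for i in range(rows):
        for j in range(cols):
            x[i][j] = 2 * x[i][j]


def matmul_blocked(a, b, n, block):
    """Blocked n x n matrix multiply: work on block-wide strips of b and c so
    the data touched by the inner loops can stay in the cache while reused."""
    c = [[0] * n for _ in range(n)]
    for jj in range(0, n, block):
        for kk in range(0, n, block):
            for i in range(n):
                for j in range(jj, min(jj + block, n)):
                    total = 0
                    for k in range(kk, min(kk + block, n)):
                        total += a[i][k] * b[k][j]
                    c[i][j] += total  # accumulate partial sums across kk strips
    return c
```

In the blocked version the blocking factor would be chosen so that the sub-matrices being reused fit in the cache, which is what reduces capacity misses.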

Blocking

Reducing Cache Miss Penalty or Miss Rate via Parallelism

  • Nonblocking caches to reduce stalls on cache misses

  • Hardware prefetching of instructions and data

  • Compiler-controlled prefetching

Nonblocking Caches to Reduce Stalls on Cache Misses

  • For pipeline machines that implement Tomasulo’s algorithm, allowing out-of-order completion, the CPU need not stall on a cache miss.

  • A nonblocking cache increases the potential benefits of such a scheme by allowing the data cache to continue to supply cache hits during a miss. This is called hit under miss. When more than one outstanding miss is allowed, it is called hit under multiple misses.

  • Example on page 436.

Hardware Prefetching of Instructions and Data

  • A processor fetches two (consecutive) blocks on a miss.

    • The requested block is placed in the instruction (data) cache when it returns.

    • The prefetched block is placed into the instruction (data) stream buffer.

    • When the requested block can be found and read from the stream buffer, the next prefetch request is issued.

  • With four instruction (data) stream buffers, the hit rate improves to 50% (43%).

Compiler-Controlled Prefetching

  • Compiler inserts prefetch instructions to request the data before they are needed.

    • Register prefetch

    • Cache prefetch

Reducing Hit Time

  • Hit time is critical because it affects the clock rate of the processor.

  • Strategies to reduce hit time

    • Small and simple cache: direct mapped

    • Avoid address translation during indexing of the cache

    • Pipelined cache access

    • Trace cache

Summary of Cache Optimizations

  • Fig. 5.26

Main Memory and Organization for Improving Performance

  • Performance measures of main memory emphasize both latency and bandwidth.

    • Traditionally, latency is the primary concern of the cache, while bandwidth is the primary concern of I/O. However, with second-level caches and their larger block sizes, bandwidth becomes important to caches as well.

    • It is easier to improve memory bandwidth with new organizations than to reduce latency.

Techniques for Improving Bandwidth

  • Techniques

    • Wider main memory

    • Simple interleaved memory

    • Independent memory banks

  • Assume the performance of the basic organization is

    • 4 clock cycles to send address

    • 56 clock cycles for the access time per word (8 bytes)

    • 4 clock cycles to send a word of data

      • Given a cache block of four words, the miss penalty is 4 * (4 + 56 + 4) = 256 clock cycles.

Wider Main Memory (1)

  • With a main memory width of two words, the miss penalty for the above example would drop from 256 cycles to 128 cycles.

    • Drawbacks:

      • Increase the critical path timing by introducing a multiplexer in between the CPU and the cache.

      • Memory with error correction has difficulties with writes to a portion of the protected block (e.g. a write of a byte).

Simple Interleaved Memory

  • Basic concept

    • Memory chips can be organized in banks to read or write multiple words at a time rather than a single word. Sending the address to several banks at once permits them all to read at the same time.

      • The miss penalty with this scheme becomes 4 + 56 + 4 * 4 = 76 cycles.

    • The mapping of addresses to banks affects the behavior of the memory system. Usually, the addresses are interleaved at the word level.

  • Example on page 452.
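The miss penalties quoted for the basic, wider, and interleaved organizations all follow from the same three timing parameters (4 cycles to send the address, 56 cycles per access, 4 cycles to transfer a word). The sketch below reproduces them; the function names are hypothetical, and the interleaved case assumes at least as many banks as words in the block.

```python
SEND_ADDRESS = 4   # clock cycles to send the address
ACCESS = 56        # clock cycles of access time per word
TRANSFER = 4       # clock cycles to send one word of data


def penalty_basic(words):
    """One-word-wide memory: every word pays the full address/access/transfer cost."""
    return words * (SEND_ADDRESS + ACCESS + TRANSFER)


def penalty_wide(words, width_words):
    """Memory and bus width_words wide: one full cost per wide access."""
    accesses = -(-words // width_words)  # ceiling division
    return accesses * (SEND_ADDRESS + ACCESS + TRANSFER)


def penalty_interleaved(words):
    """Interleaved banks: the accesses overlap, only the transfers are serialized.
    Assumes at least `words` banks."""
    return SEND_ADDRESS + ACCESS + words * TRANSFER


print(penalty_basic(4), penalty_wide(4, 2), penalty_interleaved(4))
```

Interleaving gets most of the benefit of a very wide memory while keeping the bus one word wide, because the long access time is paid only once per block.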

Independent Memory Banks

  • Multiple memory controllers allow banks to operate independently. Each bank needs separate address lines and possibly a separate data bus.

    • Such a design enables the use of nonblocking cache.

Memory Technology

  • Performance metrics

    • Latency: two measures

      • Access time: The time between when a read is requested and when the desired word arrives.

      • Cycle time: The minimum time between requests to memory.

    • Usually cycle time > access time

DRAM

  • Time spent on refresh is kept below 5% of the total time; DRAM speed increases only slowly.

SRAM, ROM and Flash Technology

  • SRAM

    • No refresh

    • 8 to 16 times faster than DRAM

    • 8 to 16 times more expensive than DRAM

    • Suitable for embedded applications

  • ROM and flash

    • Non-volatile

    • Best suited to embedded processors

Improving Memory Performance in a Standard DRAM Chip

  • Use of multi-bank organization provides larger bandwidth

  • Three other methods to increase bandwidth

    • Fast page mode

      • Allows repeated accesses to a row without another row access time.

    • Synchronous DRAM

      • Has a programmable register to hold the number of bytes requested, and hence can send many bytes over several cycles per request without the overhead of synchronizing with the controller.

    • Double Data Rate (DDR) DRAM

      • Uses both the falling and rising edges of the clock for transferring data.

Rambus DRAM (RDRAM)

  • Each chip has interleaved memory and a high-speed interface and acts more like a memory system.

  • RDRAM: First generation RAMBUS DRAM

    • Drops RAS/CAS, replacing them with a bus that allows other accesses over the bus between the sending of the address and return of the data (called packet-switched bus or split-transaction bus).

    • Uses both edges of the clock.

    • Runs at 300 MHz.

  • Direct RDRAM (DRDRAM): Second generation

    • Separate data, row, column buses such that three transactions on these buses can be performed simultaneously.

    • Runs at 400 MHz.

  • Comparing RAMBUS and DDR SDRAM

    • Both increase memory bandwidth.

    • Neither helps in reducing latency.

Virtual Memory (VM)

  • VM divides physical memory into blocks and allocates them to different processes, each of which has its own address space.

  • Need a protection scheme that restricts a process to the blocks belonging only to that process.

  • With VM, not all code and data need to be in physical memory before a program can begin.

  • VM provides process (program) relocation.

  • Virtual address

    • Given by CPU

  • Physical address

    • Used to access main memory

  • Address translation

    • Convert a virtual address to a physical address.

    • Can easily form the critical path that limits the clock cycle time.

Types of VM

  • Paged

  • Segmented

  • Paged segments

  • Fig. 5.34 on page 463

Differences between Caches and VM

  • Replacement

    • In a cache, it is managed by hardware, while

    • In VM, it is managed by the OS.

  • The size

    • Of VM is determined by the size of the processor address.

    • Of the cache is independent of the processor address size.

  • Secondary storage is also used for the file system, which is not normally part of the address space.

Parameter Ranges of Caches versus VM

  • Fig. 5.32 on page 462.

Four Memory Hierarchy Questions

  • Q1: Where can a block be placed in main memory?

    • Anywhere (fully associative)

  • Q2: How is a block found if it is in main memory?

    • Use page table, or

    • Use inverted page table to reduce the size of page table by hashing.

    • Problems

      • Need two memory accesses to obtain requested data.

      • Solution is to use translation lookaside buffer.

  • Q3: Which block should be replaced on a VM miss?

    • LRU

  • Q4: What happens on a write?

    • Write back

Techniques for Fast Address Translation

  • Problem with pure page table translation

    • Needs to have two memory accesses

  • Translation Lookaside Buffer (TLB) solves the problem.

    • Use locality of page table references.

    • A fully associative memory whose entries record the most recently used base addresses of the pages.

    • Each entry consists of

      • Tag

      • Physical page frame number

      • Protection field

      • Valid bit, dirty bit, and used bit

      • ASN (Address Space Number): to identify which process owns the corresponding page.
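The role of the TLB can be sketched as a small lookup table consulted before the page table. This is a simplified model under stated assumptions: the page size of 8 KB is assumed, the TLB is an unbounded dictionary keyed by (ASN, virtual page number), and the protection, valid, dirty, and used bits listed above are omitted. The names `translate`, `tlb`, and `page_table` are hypothetical.

```python
PAGE_BITS = 13  # assumption for this sketch: 8 KB pages


def translate(vaddr, asn, tlb, page_table):
    """Translate a virtual address to a physical address.

    tlb:        dict mapping (asn, virtual page number) -> physical frame number.
    page_table: dict mapping asn -> {virtual page number -> physical frame number}.
    """
    vpn = vaddr >> PAGE_BITS
    offset = vaddr & ((1 << PAGE_BITS) - 1)
    frame = tlb.get((asn, vpn))
    if frame is None:                 # TLB miss: pay the extra page table access
        frame = page_table[asn][vpn]
        tlb[(asn, vpn)] = frame       # remember the translation for next time
    return (frame << PAGE_BITS) | offset


page_table = {1: {3: 42}}
tlb = {}
paddr = translate((3 << PAGE_BITS) | 100, asn=1, tlb=tlb, page_table=page_table)
print(hex(paddr), tlb)
```

Tagging each entry with the ASN lets translations from different processes coexist in the TLB, so it does not have to be flushed on every context switch.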

Alpha 21264 TLB

Protection and Examples of VM

  • Process

    • A running program plus any state needed to continue running it.

  • Process (context) switch

    • One process stops executing and another process is brought into execution.

  • Requirements for context switches

    • Be able to save the CPU state so that execution can continue later

      • A computer designer’s responsibility

    • Protect a process from interference by another process

      • OS’s responsibility

      • OS’s responsibility

  • Computer designers can make protection easily implemented by the OS via VM design.

Protecting Processes

  • The simplest mechanism

    • Use base and bound registers

      • An access is valid if Base <= Address <= Bound

    • To enable this protection, computer designers have the following three responsibilities:

      • Provide at least two execution modes: user or kernel (OS, supervisor) modes

      • Provide a portion of the CPU state that a user process can use but not write.

      • Provide mechanisms whereby the CPU can go from user mode to kernel mode and vice versa.

  • More Sophisticated Mechanisms

    • Ring

    • Capabilities

      • A program can’t unlock access to the data unless it has keys (capabilities).

The Alpha Memory Management and the 21264 TLB

  • Alpha VM architecture

    • A combination of segmentation and paging, providing protection while minimizing page table size

      • 64-bit address space, but with 48-bit virtual addresses

      • Three segments, each of which is paged

        • seg0 (bits 63~46 = 0…00): hold user processes

        • seg1 (bits 63~46 = 1…11):

        • kseg (bits 63~46 = 0…10): reserved for operating system kernel

Mapping of an Alpha Virtual Address

  • Each page table is held in a page.

Memory Protection in Alpha 21264

  • Each page table entry (PTE) has 64 bits

    • The first 32 bits contain the physical page frame number.

    • The other half includes the following protection fields:

      • Valid

      • User read enable

      • Kernel read enable

      • User write enable

      • Kernel write enable

  • The Alpha obeys only the protection requirements imposed by the bottom-level PTEs.

Concluding Remarks

  • The primary challenge for the memory hierarchy designer is in choosing parameters that work well together, not in inventing new techniques (there are already enough).