
Memory Hierarchies: Bringing It All Together


Presentation Transcript


  1. Memory Hierarchies: Bringing It All Together. Fortunate is he who can understand the causes of things.

  2. Use and Distribution Notice • Possession of any of these files implies understanding and agreement to this policy. • The slides are provided for the use of students enrolled in Jeff Six's Computer Architecture class (CMSC 411) at the University of Maryland Baltimore County. They are the creation of Mr. Six and he reserves all rights as to the slides. These slides are not to be modified or redistributed in any way. All of these slides may only be used by students for the purpose of reviewing the material covered in lecture. Any other use, including but not limited to, the modification of any slides or the sale of any slides or material, in whole or in part, is expressly prohibited. • Most of the material in these slides, including the examples, is derived from Computer Organization and Design, Second Edition. Credit is hereby given to the authors of this textbook for much of the content. This content is used here for the purpose of presenting this material in CMSC 411, which uses this textbook.

  3. Design Decisions in Memory Hierarchies • Now that we have seen caches and virtual memory in detail, it should be obvious that the different types of memory hierarchies share a great deal in common. • Many policies and features are common to all levels of a hierarchy – we can derive four questions that apply to each level… • Where can a block be placed? • How is a block found? • What block should be replaced on a cache miss? • What happens on a write?

  4. Typical Design Parameters • The three levels of memory hierarchy we have discussed vary in their typical values…

  5. Question 1: Where can a Block be Placed? • The entire range of schemes can be thought of as variations on a set-associative scheme where the number of sets and the number of blocks per set vary…
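
  To make that "one scheme, varying set count" view concrete, here is a minimal Python sketch (the function and numbers are illustrative, not from the slides): the same modulo rule covers direct-mapped (one block per set), n-way set-associative, and fully associative (a single set) placement.

      def candidate_set(block_address, num_blocks, blocks_per_set):
          # A cache with num_blocks blocks and blocks_per_set ways has
          # num_blocks / blocks_per_set sets; a block may go anywhere within its set.
          num_sets = num_blocks // blocks_per_set
          return block_address % num_sets

      # Where can block address 12 go in an 8-block cache?
      print(candidate_set(12, 8, 1))   # direct mapped:       set 4 (exactly one candidate block)
      print(candidate_set(12, 8, 2))   # 2-way set assoc.:    set 0 (two candidate blocks)
      print(candidate_set(12, 8, 8))   # fully associative:   set 0 (any block in the cache)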

  6. Block Placement • The key advantage of increasing the degree of associativity is that it usually decreases the miss rate by reducing misses that compete for the same location. • Let’s look at how much it helps for caches of different sizes. • The following graph is based on a SPEC92 benchmark set running on a machine with caches from 1K to 128K, ranging from direct mapped to 8-way set associative.

  7. Associativity and Performance [Figure: miss rate (0% to 15%) versus associativity (one-way, two-way, four-way, eight-way) for SPEC92 caches from 1 KB to 128 KB]

  8. How Associativity Affects Performance • The largest performance gains came from going from direct mapped to 2-way set associative. • As the cache size grows, the miss rate is lower and the opportunity to improve it by increasing the associativity decreases. • Keep in mind that by increasing the associativity, we decrease the miss rate but also increase the cost of the system and increase the access time (hit time).
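
  The trade-off can be quantified with the usual average memory access time formula; the numbers below are purely illustrative design points, not measurements from the SPEC92 graph.

      def amat(hit_time, miss_rate, miss_penalty):
          # Average memory access time, in cycles.
          return hit_time + miss_rate * miss_penalty

      # Hypothetical example: going 2-way cuts the miss rate but stretches the hit time.
      print(amat(1.0, 0.040, 40))   # direct mapped: 1.0 + 0.040 * 40 = 2.6 cycles
      print(amat(1.2, 0.032, 40))   # 2-way:         1.2 + 0.032 * 40 = 2.48 cycles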

  9. Question 2: How is a Block Found? • Each associativity strategy has a corresponding block location method…
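
  A sketch of the common case, a set-associative lookup, with made-up field widths (16-byte blocks, 128 sets): the index selects a set, and the tag is then compared against every block in that set.

      OFFSET_BITS, INDEX_BITS = 4, 7            # 16-byte blocks, 128 sets (assumed)

      def split_address(addr):
          offset = addr & ((1 << OFFSET_BITS) - 1)
          index  = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
          tag    = addr >> (OFFSET_BITS + INDEX_BITS)
          return tag, index, offset

      def is_hit(sets, addr):
          # sets[i] is a list of (valid, tag) pairs, one per way of set i.
          tag, index, _ = split_address(addr)
          return any(valid and stored_tag == tag for valid, stored_tag in sets[index])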

  10. Costs of Higher Associativity • Generally, the choice between the three mapping strategies will depend on the cost of a miss versus the cost of implementing associativity, both in time and extra hardware. • A high associativity is generally not worthwhile because the cost of the comparators grows while the miss rate does not improve at a comparable rate. • Fully associative caches are not practical except in very small designs, where the cost of the comparators is not overwhelming and the absolute miss rate improvements are greatest.

  11. Virtual Memory and Fully Associative Mapping • That being said, virtual memory systems introduce a separate mapping table (the page table) to index the entire memory – this requires storage for the table plus an extra memory access for each reference. • Why is this done in VM systems?

  12. Why use the Page Table in Virtual Memory Systems? • Four reasons… • Misses are VERY expensive, so full associativity is quite beneficial. • Full associativity allows the OS to use sophisticated and complex replacement strategies to reduce the miss rate. • The full page table can be indexed and accessed with no additional hardware and no searching is required. • Page sizes are large – this allows the page table size overhead to be relatively small (the small block size of caches makes this approach impractical for anything except paged memory).
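
  A toy model of that fully associative mapping, assuming 4 KB pages: the flat table is indexed directly by the virtual page number, so no comparators or searching are needed, at the cost of one extra memory reference per access.

      PAGE_SIZE = 4096                           # 4 KB pages (assumed)

      def translate(page_table, virtual_address):
          vpn    = virtual_address // PAGE_SIZE  # index straight into the table
          offset = virtual_address %  PAGE_SIZE
          frame  = page_table[vpn]               # the extra memory access per reference
          if frame is None:
              raise LookupError("page fault: the OS may place the page in any free frame")
          return frame * PAGE_SIZE + offset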

  13. Design Decisions for the Different Levels of the Hierarchy • The choice of full associativity for virtual memory systems is clear. • For caches and TLBs, set-associative designs are typically used – this combines indexing and the search of a small set. • Some systems use direct mapped caches due to their simplicity and small access times.

  14. Hardware Considerations in Cache Design • Some design choices are motivated by hardware considerations… • How important is the cache access time in determining the processor cycle time? • Is the cache located on-chip or off-chip? • Is there more than one level of cache? • What technology is used to implement the cache(s) and how fast is it?

  15. Question 3: What Block Should Be Replaced? • Direct-mapped caches are easy – replace the only candidate block. • There are two primary strategies for associative caches… • Least Recently Used (LRU) – The block replaced is the one (from the set of candidate blocks) that has been unused for the longest time. • Random – Random selection from candidate blocks, possibly using some hardware assistance.
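
  A small software model of one cache set under both policies; the class and names are illustrative only, and hardware would keep the equivalent state in a few bits per block rather than an ordered structure.

      import random
      from collections import OrderedDict

      class CacheSet:
          def __init__(self, ways, policy="lru"):
              self.ways, self.policy = ways, policy
              self.blocks = OrderedDict()               # tag -> data, least recently used first

          def access(self, tag):
              if tag in self.blocks:                    # hit: mark block most recently used
                  self.blocks.move_to_end(tag)
                  return True
              if len(self.blocks) == self.ways:         # miss in a full set: choose a victim
                  victim = (next(iter(self.blocks)) if self.policy == "lru"
                            else random.choice(list(self.blocks)))
                  del self.blocks[victim]
              self.blocks[tag] = None                   # fill the block
              return False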

  16. LRU versus Random in Caches • In practice, LRU is impractical for more than a small degree of associativity (2 to 4, normally) due to the amount of hardware needed to track usage. • For anything bigger, LRU is either approximated or a random replacement strategy is used. • For 2-way associative caches, random has a miss rate about 1.1 times higher than LRU. • As caches become larger, the difference in miss rates between the two strategies becomes very small.

  17. LRU versus Random in VM • In virtual memory, LRU is always used or approximated, since even a tiny reduction in miss rate matters when the cost of a miss is enormous. • The hardware normally provides reference bits or equivalent features to assist the operating system in tracking the set of less recently used pages.

  18. Question 4: What Happens on a Write? • There are two primary strategies for dealing with a write… • Write-Through – The data is written to both the block in the cache (the higher-level memory) and the block in the lower-level memory. • Write-Back – The information is written to the block in the cache. The modified block is written into the lower level only when it is replaced. This strategy is sometimes called copy-back.
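
  In pseudocode form (the object names are placeholders, not a real cache controller), the two policies differ only in when the lower level sees the new data.

      def write_hit_write_through(block, lower_level, addr, data):
          block.data = data
          lower_level.write(addr, data)     # every write also updates the next level
                                            # (usually buffered in a write buffer)

      def write_hit_write_back(block, addr, data):
          block.data = data
          block.dirty = True                # the lower level is updated only when this
                                            # block is eventually replaced (copy-back)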

  19. Write-Through Advantages • The key advantages of a write-through strategy are… • Misses are simpler to handle and cheaper in miss time. A miss never requires a block to be written back to the lower level. • Write-through is simpler to implement. However, to be practical, a write-through system will require a write buffer, which makes this system somewhat more complex.

  20. Write-Back Advantages • The key advantages of a write-back strategy are… • Individual words can be written by the processor at the rate of the cache, not the (slower) main memory. • Multiple writes within a block require only one write to the lower and slower level. • When blocks are written back, the system can make effective use of high-bandwidth transfer since the entire block is written at once.

  21. Complications of Write Misses with Write-Through Caches • Writes introduce several complications that do not apply to reads. • For a miss on a write to a write-through cache (we write to a block that is not in the cache), we could follow one of many strategies… • Fetch-on-miss – allocates a cache block to the address that missed and fetches the rest of the block before writing the data and continuing execution. • No-fetch-on-write – We write the data into a newly allocated cache block but do not fetch the rest of the block. • Write-around – write the data to the main memory but not into the cache (no cache block allocation).
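
  Side by side, the three options look roughly like the sketch below; the cache and memory interfaces here are assumed, purely for illustration.

      def handle_write_miss(cache, memory, addr, data, strategy):
          if strategy == "fetch-on-miss":
              block = cache.allocate(addr)              # claim a block for this address
              block.fill(memory.read_block(addr))       # fetch the rest of the block first
              block.write_word(addr, data)
          elif strategy == "no-fetch-on-write":
              block = cache.allocate(addr)              # claim a block but skip the fetch
              block.write_word(addr, data)
          elif strategy == "write-around":
              pass                                      # no cache block allocated at all
          memory.write_word(addr, data)                 # write-through: memory is always updated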

  22. Complications of Write Misses with Write-Through Caches • Why deal with these more advanced and complex strategies? • Programs sometimes write entire blocks of data before reading them. If so, the fetch associated with the initial miss is avoided. • There are a number of issues when such a strategy is used in a system with multiword blocks – they generally lead to the addition of mechanisms similar to those used in write-back caches. • The DECStation 3100 cache we have seen is a special case with one-word blocks. It can do a fetch-on-write without actually doing a fetch.

  23. Complications of Write Misses with Write-Back Caches • Write-back caches make things even more difficult. • When a miss occurs, we cannot simply overwrite a block as that block may be dirty and need to be written back into main memory. • Stores could require two cycles - one to check for a hit in the cache and one to actually write the data. • Alternatively, a store buffer could hold the data. Here, the processor does the cache lookup and places the data in the buffer at the same time. If the cache hits, the data is written from the buffer into the cache on the next unused cache access cycle.

  24. Types of Misses • So, what causes a miss? There are three types… • Compulsory Misses – these are cache misses caused by the first access to a block that has never been in the cache. • Capacity Misses – these are misses caused when a cache cannot hold all of the blocks needed during execution of a program. These occur when blocks are replaced and then retrieved later when accessed. • Conflict Misses – these are misses that occur in set-associative or direct-mapped caches when multiple blocks compete for the same set. Here, the cache is not full but misses still occur.
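
  The three categories are usually attributed by simulation rather than measured directly; one common bookkeeping (assumed here, not described in the slides) compares the real cache against an equally sized fully associative cache and an infinite cache.

      def three_c_breakdown(misses_real_cache, misses_fully_assoc_same_size, misses_infinite_cache):
          compulsory = misses_infinite_cache                               # first-touch misses
          capacity   = misses_fully_assoc_same_size - compulsory           # size-limited misses
          conflict   = misses_real_cache - misses_fully_assoc_same_size    # placement-limited misses
          return compulsory, capacity, conflict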

  25. Miss Rates and Their Sources [Figure: miss rate broken down by type (0% to 14%) versus cache size from 1 KB to 128 KB, for one-way, two-way, four-way, and eight-way associativity; the capacity component is labeled]

  26. Compulsory Misses • Compulsory misses are generated by the first reference to a block. The best way to reduce these is to increase the block size. • This reduces the number of first-time references needed to touch all of the program’s data. • This works because the entire program is now composed of fewer cache blocks. • When we do this, we need to worry about increasing the block size too much, causing an increase in the miss penalty.
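
  A back-of-the-envelope example (the numbers are made up) of why larger blocks help: touching a fixed amount of data once costs one compulsory miss per block it occupies.

      footprint = 256 * 1024                     # bytes the program touches at least once

      for block_size in (16, 32, 64, 128):
          compulsory_misses = footprint // block_size
          print(f"{block_size:4d}-byte blocks -> {compulsory_misses} compulsory misses")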

  27. Capacity Misses • Capacity misses can be dealt with by increasing the size of the cache. • When we do this, we need to worry about increasing the access time, which could lead to lower overall performance.

  28. Conflict Misses • These can be handled by increasing the associativity of the cache. • These misses arise from contention for the same cache set. By increasing the number of locations each block can be placed in, conflict misses can be avoided. • When we do this, we need to worry about slowing the access time, leading to lower overall performance.

  29. The Challenge in Memory Hierarchy Design • Every change that potentially improves the miss rate can also reduce overall performance.

  30. Case Study: The Pentium Pro and the PowerPC 604 • The Pentium Pro and the PowerPC 604 (both around 1997) offer a secondary (L2) cache that is either 256K or 512K, depending on which model you bought. • For the Pentium Pro, the L1 cache is on the same die as the microprocessor. The L2 cache is on a separate die but in the same package as the microprocessor. • For the PowerPC 604, the L1 cache is on the same die as the microprocessor. The L2 cache is implemented using off-die SRAM.

  31. First-Level (L1) Caches

  32. Additional Optimizations • Both chips implement special optimizations to make the caches faster… • On a miss, both chips have their cache return the requested word first and start processing once it’s back – they do not wait for the entire block to load. • The PowerPC 604 continues to fetch and execute instructions while the instruction that caused a miss waits for its data, but it does not allow later instructions that access the data cache to proceed. • The PPro allows similar execution, but allows the later instructions to access the data cache as well. It allows hit under miss (which allows cache hits during a miss) and miss under miss (which allows additional cache misses during a miss).

  33. Address Translation • The address translation differs between the processors, as the PowerPC has a 52-bit virtual address and the PPro has a 32-bit virtual address.
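
  Assuming 4 KB pages (12 offset bits) on both machines, the difference shows up directly in the size of the virtual page number that the TLBs and page tables must handle.

      def virtual_page_number_bits(virtual_address_bits, page_offset_bits=12):
          return virtual_address_bits - page_offset_bits

      print(virtual_page_number_bits(32))   # Pentium Pro:  20-bit virtual page number
      print(virtual_page_number_bits(52))   # PowerPC 604:  40-bit virtual page number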

  34. Simultaneous Instruction Execution Complications • Recall that the Pentium Pro’s pipeline can allow both a load and a store operation to take place on the same clock cycle. • The cache must be designed to support this… • One alternative is to make the cache multiported (as the MIPS register file is). This gets expensive and is only practical for small (tens of entries) caches. • Instead, the PPro divides the cache into multiple banks and allows two simultaneous accesses only when they go to different banks. When a conflict occurs, loads take priority over stores. A buffer holds the pending store operations to avoid processor stalls.
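
  A sketch of that banking rule (the bank count and block size are assumed, not the PPro’s actual parameters): two accesses may proceed in the same cycle only if they map to different banks.

      NUM_BANKS  = 8            # assumed
      BLOCK_SIZE = 32           # bytes, assumed

      def bank_of(addr):
          return (addr // BLOCK_SIZE) % NUM_BANKS

      def can_issue_together(load_addr, store_addr):
          # On a conflict the load proceeds; the store waits in the store buffer.
          return bank_of(load_addr) != bank_of(store_addr)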
