  1. Computer Organization: Improving Performance of the Microarchitecture Level (Tanenbaum 4.5) • Shannon Tauro/Jerry Lebowitz • Portions provided by Ellen Spertus, Mills College

  2. Looking into CPU Design • Looking at alternative ways to improve performance • Two rough categories • Implementation Improvements • A new way to build a CPU or memory without changing the architecture • i.e., you can still run older programs • Architectural Improvements • Add new instructions • Add new registers • Need to modify the compiler

  3. Previously Computer = CPU + Memory + I/O • Reducing Execution Path Length (merging Main loop) • Added A-Bus (Reduced several micro-instruction sequences) • Instruction Fetch Unit (retrieved opcodes and operands from memory ahead of time) • Pipelining the data path (increased throughput by adding registers and speeding up clock)

  4. Now… Focusing on Memory • Computer = CPU + Memory + I/O • Remember the memory hierarchy… • Initially, we will focus on the top portion: caching • Next… virtual memory (a combination of main memory and the hard drive)

  5. Characteristics of Memory • Registers: on the processor chip, fast (one cycle), small • Main memory: off the processor chip, slow (4-50 cycles), big

  6. Memory Demand • Modern processors place overwhelming demands on a memory system in terms of • Latency • The delay in supplying an operand • Bandwidth • The amount of data supplied per unit of time • Latency and bandwidth • Competing metrics • Increasing bandwidth usually increases latency

  7. Cache Memory • Helps solve both latency and bandwidth metrics • Holds recently used memory in a small, fast memory, speeding up access • If a large percentage of the needed memory words are in cache, latency is reduced • An effective way to improve latency and bandwidth is to use multiple caches

  8. Solution: Cache Memory • Include some extra memory (the cache) on the processor chip, between the CPU and main memory • Store data that will be needed soon in the cache so it’s easy to access • Effective to use multiple levels of cache

  9. Cache Memory • When the processor needs to read from or write to a location in main memory, it first checks whether a copy of that data is in the cache • If so, the processor immediately reads from or writes to the cache, which is much faster than reading from or writing to main memory

  10. Types of Cache • Most modern computers have at least three independent caches • An instruction cache to speed up executable instruction fetch • A data cache to speed up data fetch and store • A translation lookaside buffer (TLB) used to speed up virtual-to-physical address translation for both executable instructions and data

  11. Cache Levels • Three levels of cache • Level 1: split instruction and data caches, so memory operations can be initiated independently, effectively doubling the bandwidth of the memory system • Level 2: generally unified; typical size 512 KB to 1 MB • Level 3: unified; several megabytes

  12. Cache Properties Predicting memory usage • Assume: Location n accessed at time t • Temporal locality • Location n may be accessed again soon • Spatial locality • Locations near n may be accessed soon

  13. Using Cache (1) • When the processor needs to read or write a location in main memory, it first checks for a corresponding entry in the cache

  14. Using Cache (2) • The cache checks for the contents of the requested memory location in any cache lines that might contain that address • If the processor finds that the memory location is in the cache, a cache hit has occurred • The processor immediately reads or writes the data in the cache line • If the processor does not find the memory location in the cache, a cache miss has occurred • The cache allocates a new entry and copies in data from main memory; the request is then fulfilled from the contents of the cache
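
A minimal C sketch of the hit/miss behaviour described above, assuming for concreteness a direct-mapped layout with illustrative constants (LINE_SIZE, NUM_LINES) and a pretend main_memory array; real hardware does this with comparators rather than code.

#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define LINE_SIZE 32                    /* bytes per cache line (illustrative) */
#define NUM_LINES 2048                  /* number of cache entries (illustrative) */

typedef struct {
    bool     valid;                     /* does this entry hold real data? */
    uint32_t tag;                       /* identifies which memory block is cached */
    uint8_t  data[LINE_SIZE];           /* copy of the memory block */
} cache_line_t;

static cache_line_t cache[NUM_LINES];
static uint8_t main_memory[1 << 20];    /* pretend backing store (1 MB for the sketch) */

/* Read one byte: return it from the cache on a hit,
   fill the line from main memory on a miss. */
uint8_t cache_read_byte(uint32_t addr)
{
    uint32_t offset = addr % LINE_SIZE;
    uint32_t block  = addr / LINE_SIZE;
    uint32_t line   = block % NUM_LINES;        /* direct-mapped placement */
    uint32_t tag    = block / NUM_LINES;

    if (cache[line].valid && cache[line].tag == tag)
        return cache[line].data[offset];        /* cache hit */

    /* cache miss: allocate the entry and copy the block in from memory */
    memcpy(cache[line].data, &main_memory[block * LINE_SIZE], LINE_SIZE);
    cache[line].tag   = tag;
    cache[line].valid = true;
    return cache[line].data[offset];
}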

  15. Cache Performance • The proportion of accesses that result in a cache hit is known as the hit rate • A measure of the effectiveness of the cache • Read misses • Delay execution, because they require data to be transferred from main memory, which is much slower than the cache itself • Write misses • May occur without such a penalty, since the processor can continue execution while data is copied to main memory in the background
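
A tiny sketch of the hit-rate definition implied above (names are illustrative): the hit rate is simply hits divided by total accesses, and the miss rate is its complement.

/* hit rate = cache hits / total accesses; miss rate = 1 - hit rate */
double hit_rate(unsigned long hits, unsigned long accesses)
{
    return accesses ? (double)hits / (double)accesses : 0.0;
}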

  16. Replacement Policy • In order to make room for a new entry on a cache miss, the cache may have to evict one of the existing entries • Which entry to evict is decided by the replacement policy, which depends on the type of cache • The fundamental problem with any replacement policy is that it must predict which existing cache entry is least likely to be used in the future • One common replacement policy, least-recently used (LRU), replaces the least recently accessed entry

  17. Write Policies • If data is written to the cache, at some point it must also be written to main memory

  18. Write-Through • Update cache and main memory simultaneously on every write • Keeps cache and main memory consistent • All writes require a main memory access (bus transaction) • Slows down the system: if there is another read request to main memory due to a cache miss, the read request has to wait until the earlier write has been serviced

  19. Write Back or Copy Back • Data that is modified is written back to main memory only when the cache block is about to be removed from the cache • Faster than write-through • Time is not spent accessing main memory on every write • Writes to multiple words within a block require only one write to main memory • Needs an extra bit (dirty bit) in the cache to indicate which block has been modified • Adds to the size of the cache
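
A sketch contrasting the two write policies from slides 18 and 19, with illustrative types: write-through pays a memory access on every store, while the write-back dirty bit defers the memory write until eviction.

#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define LINE_SIZE 32
static uint8_t main_memory[1 << 20];   /* pretend backing store */

typedef struct {
    bool     valid;
    bool     dirty;                    /* write-back only: cached copy differs from memory */
    uint32_t base;                     /* address of the first byte of the cached block */
    uint8_t  data[LINE_SIZE];
} line_t;

/* Write-through: update cache and main memory on every write. */
void wt_write(line_t *l, uint32_t offset, uint8_t v)
{
    l->data[offset] = v;
    main_memory[l->base + offset] = v;          /* bus transaction every time */
}

/* Write-back: only the cache is updated; the line is marked dirty. */
void wb_write(line_t *l, uint32_t offset, uint8_t v)
{
    l->data[offset] = v;
    l->dirty = true;
}

/* On eviction, a dirty write-back line is flushed to memory exactly once. */
void wb_evict(line_t *l)
{
    if (l->valid && l->dirty)
        memcpy(&main_memory[l->base], l->data, LINE_SIZE);
    l->valid = l->dirty = false;
}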

  20. Associativity • The placement policy decides where in the cache a copy of a particular entry of main memory will go • If each entry in main memory can go in just one place in the cache, the cache is direct mapped • Best (fastest) hit times • The best tradeoff for "large" caches • If the policy is free to choose any entry in the cache to hold the copy, the cache is called fully associative • Many caches implement a compromise in which each entry in main memory can go to any one of k places in the cache, known as k-way set associative

  21. Cache • Main memory is divided up into fixed-size blocks called cache lines • A cache line typically consists of 4 to 64 consecutive bytes • Each cache entry contains: • Valid bit: indicates whether there is any valid data in the entry • Tag: a 16-bit value identifying the corresponding line of memory from which the data came • Data: a copy of the data in memory; data is transferred between memory and cache in blocks of fixed size • The example cache contains 2048 entries (2048 x 32 bytes = 64 KB)

  22. Direct Mapped Cache • For storing and retrieving data from cache, the memory address is divided into four components • TAG – corresponds to the TAG stored in the cache entry (65,536 possible values) • LINE – indicates which cache entry holds the corresponding data (2048 entries) • Word – which word within the line is referenced • Bytes – not normally used; if a single byte is requested, it tells which byte within the word is needed (for a cache supplying 32-bit words, this field is always 0) • Address layout (# of bits): TAG (16) | LINE (11) | Word (3) | Bytes (2)
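
A small sketch of the address split described on this slide — 2 Bytes bits, 3 Word bits, 11 LINE bits, and a 16-bit TAG — extracted with shifts and masks (the function name is illustrative).

#include <stdint.h>
#include <stdio.h>

void split_address(uint32_t addr)
{
    uint32_t bytes = addr & 0x3;             /* bits 1..0:  byte within the word  */
    uint32_t word  = (addr >> 2) & 0x7;      /* bits 4..2:  word within the line  */
    uint32_t line  = (addr >> 5) & 0x7FF;    /* bits 15..5: one of 2048 entries   */
    uint32_t tag   = (addr >> 16) & 0xFFFF;  /* bits 31..16: one of 65,536 tags   */
    printf("tag=%u line=%u word=%u bytes=%u\n", tag, line, word, bytes);
}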

  23. Direct Mapped Cache • When the CPU generates an address • The LINE field (11 bits) determines the cache entry • The two TAG fields are compared (address vs. cache) • If they agree, a cache hit occurs (no need to read memory) • If not, a cache miss occurs • The 32-byte cache line is fetched from memory and stored in the cache

  24. Direct Mapped Cache • Let "x" be the block number in cache, "y" be the block number in memory, and "n" be the number of blocks in cache; then the mapping is given by x = y mod n • If we had 10 blocks of cache, block 7 of cache may hold blocks 7, 17, 27, or 37, … of main memory • If a program accesses data at location x and x + 65,536 (or a multiple of 65,536), the second access forces the cache entry to be reloaded, since both addresses have the same LINE value • The cache line is repeatedly swapped in and out of the cache • This can result in poor performance
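
A short check of the aliasing claim above, reusing the field widths from slide 22: adding 65,536 (2^16) leaves the low 16 bits, and hence the LINE field, unchanged, so both addresses compete for the same cache entry while their TAGs differ.

#include <stdint.h>
#include <assert.h>

static uint32_t line_of(uint32_t addr) { return (addr >> 5) & 0x7FF; }

void alias_demo(uint32_t x)
{
    /* Same LINE -> same cache entry; the TAG differs, so each access
       evicts the other block and both keep missing. */
    assert(line_of(x) == line_of(x + 65536));
}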

  25. Direct Mapped Cache This means that if two locations map to the same entry, they may continually knock each other out

  26. Direct Mapped Example • Suppose memory consists of 2^14 (16,384) locations or words, and the cache has 2^4 = 16 cache lines, each holding 8 (2^3) words of data • Main memory is divided into 2^14 / 2^3 = 2^11 blocks (each the size of a cache line) • Of the 14 address bits, we need 7 bits for the TAG, 4 bits for the LINE, and 3 bits for the Word • Address layout (# of bits): TAG (7) | LINE (4) | Word (3)

  27. Direct Mapped Cache with 16 Entries

  28. Direct Mapped Example • Suppose a program generates address 1AA • In 14-bit binary, this address is 00 0001 1010 1010 • The first seven bits go in the TAG, the next 4 in the LINE, and the final three in the Word • TAG = 0000011, LINE = 0101, Word = 010

  29. Direct Mapped Example

  30. Direct Mapped Example • However, if the program generates the address 3AB (00 0011 1010 1011) • TAG will be 0000111 • LINE will be 0101 (same as 1AA) • Word will be 011 • The block loaded for 1AA would be removed from the cache and replaced by the block associated with the 3AB reference
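
A quick, throwaway way to verify the 1AA and 3AB numbers from slides 28 and 30 (not part of the lecture): both addresses decompose to LINE 0101, so the second reference evicts the first.

#include <stdio.h>

int main(void)
{
    unsigned addrs[] = { 0x1AA, 0x3AB };
    for (int i = 0; i < 2; i++) {
        unsigned a    = addrs[i];
        unsigned word = a & 0x7;           /* low 3 bits        */
        unsigned line = (a >> 3) & 0xF;    /* next 4 bits       */
        unsigned tag  = (a >> 7) & 0x7F;   /* top 7 of 14 bits  */
        printf("%03X -> tag %u, line %u, word %u\n", a, tag, line, word);
    }
    /* Prints: 1AA -> tag 3 (0000011), line 5 (0101), word 2 (010)
               3AB -> tag 7 (0000111), line 5 (0101), word 3 (011) */
    return 0;
}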

  31. Direct Mapped Example

  32. Address • The address breakup is done this way because of spatial locality • Data from consecutive addresses is brought into the cache • If the higher-order bits were used for the LINE, then values from consecutive addresses would map to the same location in cache • Using the middle bits causes less thrashing • Address layout (# of bits): TAG (7) | LINE (4) | Word (3)

  33. Fully Associative Cache • Another scheme is placing memory blocks in any location in cache • The cache has to fill up before any cache entries are evicted • Slow • Costly compared to a direct-mapped cache • The memory address is partitioned into only two fields • For a 14-bit memory address: TAG (11 bits) | Word (3 bits)

  34. Fully Associative Cache • When cache is searched, all tags are searched in parallel to retrieve data quickly • Need “n” comparators where n = number of cache lines
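
In software terms, the parallel tag search reduces to scanning every line; hardware compares all n tags simultaneously with n comparators. A minimal sketch with illustrative types:

#include <stdint.h>
#include <stdbool.h>

typedef struct { bool valid; uint16_t tag; } fa_line_t;

/* Returns the index of the matching cache line, or -1 on a miss. */
int fa_lookup(const fa_line_t *lines, int n, uint16_t tag)
{
    for (int i = 0; i < n; i++)
        if (lines[i].valid && lines[i].tag == tag)
            return i;
    return -1;
}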

  35. Evicting Blocks • A block that is evicted is called a victim block • The replacement policy depends upon the locality that is being optimized • If one is interested in temporal locality (referenced memory is likely to be referenced again) • Keep the most recently used blocks • A common replacement policy, least-recently used (LRU), replaces the least recently accessed entry • Requires maintaining an access history, which slows down the cache
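
One common way to realize LRU, sketched below with illustrative names: keep a per-line timestamp (the access history the slide mentions), bump it on every access, and evict the line with the oldest value.

#include <stdint.h>

#define N_LINES 16
static uint64_t last_used[N_LINES];   /* access history: one timestamp per line   */
static uint64_t now;                  /* monotonically increasing access counter  */

void touch(int line) { last_used[line] = ++now; }   /* called on every access */

int choose_victim(void)               /* called on a miss when the cache is full */
{
    int victim = 0;
    for (int i = 1; i < N_LINES; i++)
        if (last_used[i] < last_used[victim])
            victim = i;
    return victim;                    /* least recently used line */
}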

  36. Fully Associative Cache

  37. Set-Associative Cache • Set associative cache combines the ideas of direct mapped cache and fully associative cache • Similar to a direct mapped cache in that a memory reference maps to a particular location in cache, but that cache location can hold more than one main memory block • The cache location is then called a set • Instead of mapping anywhere in the entire cache (fully associative), a memory reference can map only to a subset of the cache

  38. Set-Associative Cache • The number of blocks per set in a set associative cache varies according to overall system design • For example, in a 2-way set associative cache, each set can hold two different memory blocks

  39. Set-Associative Cache • For example, in a 2-way set associative cache, each set contains two different memory blocks

  40. Set-Associative Cache • Like a direct-mapped cache, except the middle bits of the main memory address indicate the set in cache • Address layout: TAG | SET | Word

  41. Advantage of Set Associative • Unlike a direct mapped cache, if an address maps to a set, there is a choice of where to place the new block • If both slots are filled, then we need an algorithm that decides which old block to evict (as in a fully associative cache) • Two-way and four-way caches perform well

  42. Disadvantage of Set Associative • The tags of each block in a set need to be matched (in parallel) to figure out whether the data is present in cache • Need k comparators • The hardware cost for matching is less than for a fully associative cache (which needs n comparators, where n = # of blocks), but more than for a direct mapped cache (which needs only one comparator)

  43. A k-way set associative cache is like having n/k different direct-mapped caches, where n = the number of entries (figure: 4-way set associative cache)
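
A sketch of a k-way lookup with illustrative sizes (K = 4 ways, 512 sets): the SET field selects one set and only its k tags are compared, which is why k comparators suffice.

#include <stdint.h>
#include <stdbool.h>

#define K        4      /* ways per set (4-way set associative) */
#define NUM_SETS 512    /* illustrative number of sets */

typedef struct { bool valid; uint32_t tag; } way_t;
static way_t cache[NUM_SETS][K];

bool sa_hit(uint32_t block_number)
{
    uint32_t set = block_number % NUM_SETS;   /* middle bits pick the set    */
    uint32_t tag = block_number / NUM_SETS;   /* remaining bits form the tag */
    for (int w = 0; w < K; w++)
        if (cache[set][w].valid && cache[set][w].tag == tag)
            return true;                      /* only K comparisons needed */
    return false;
}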

  44. What Affects Performance of Cache? • Programs that exhibit bad locality • E.g., spatial locality with matrix operations • Suppose matrix data is kept in memory by rows (known as row-major), i.e., offset = row*NUMCOLS + column • Poor code: for (j = 0; j < numcols; j++) for (i = 0; i < numrows; i++) …, i.e., x[i][j] followed by x[i + 1][j] • The array is being accessed by column, so we are going to miss in the cache every time • Solution: switch the for loops • C/C++ are row-major; FORTRAN & MATLAB are column-major
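
The loop-order fix spelled out (the array name and sizes are illustrative): C stores matrices row-major, so iterating down a column jumps NUMCOLS elements between accesses and misses constantly, while iterating along a row stays within the cached line.

#define NUMROWS 1024
#define NUMCOLS 1024
static double x[NUMROWS][NUMCOLS];

void column_order_poor(void)   /* x[i][j] then x[i+1][j]: a new cache line almost every access */
{
    for (int j = 0; j < NUMCOLS; j++)
        for (int i = 0; i < NUMROWS; i++)
            x[i][j] += 1.0;
}

void row_order_better(void)    /* consecutive accesses fall in the same cache line */
{
    for (int i = 0; i < NUMROWS; i++)
        for (int j = 0; j < NUMCOLS; j++)
            x[i][j] += 1.0;
}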

  45. Cache Performance • Total cache size

  46. Summary of… Cache Variables • Total size (typically 128K to 2M) • Block size (typically 16-64 words) • Replacement strategy (typically least recently used (LRU)) • Write policy • Write-through (write changes immediately) • Write-back (write changes on flush) • Separate or unified code and data caches

  47. Summary…Improving cache performance • Multi-level caches • Level 1 (L1) cache on-chip (32K-256K) • Level 2 (L2) cache on-chip (64K-512K) • Associativity

  48. Backup

  49. Analogies • Baking ingredients • On counter (registers) • On shelf (cache) • In pantry (main memory) • Library Books • On desk (registers) • On bookshelves (cache) • In library (main memory)

  50. Direct-mapped Cache (1) • Let k be the number of blocks in the cache • Address n can only be stored in location n mod k • Examples: • 1010₂ • 1111₂
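
Working the backup slide's mapping with an assumed k = 8 blocks (the slide does not fix k): 1010₂ = 10 maps to block 10 mod 8 = 2, and 1111₂ = 15 maps to block 15 mod 8 = 7.

#include <stdio.h>

int main(void)
{
    int k = 8;                     /* assumed number of cache blocks (illustrative) */
    int addrs[] = { 10, 15 };      /* 1010 and 1111 in binary */
    for (int i = 0; i < 2; i++)
        printf("address %d -> cache block %d\n", addrs[i], addrs[i] % k);
    return 0;
}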
