55:035 Computer Architecture and Organization

55:035Computer Architecture and Organization Lecture 8 55:035 Computer Architecture and Organization

Outline • Virtual Memory • Basics • Address Translation • Cache vs VM • Paging • Replacement • TLBs • Segmentation • Page Tables 55:035 Computer Architecture and Organization

The Full Memory Hierarchy Capacity Access Time Cost Upper Level Staging Xfer Unit faster CPU Registers 100s Bytes <10s ns Registers prog./compiler 1-8 bytes Instr. Operands Cache K Bytes 10-100 ns 1-0.1 cents/bit Cache cache cntl 8-128 bytes Blocks Main Memory M Bytes 200ns- 500ns $.0001-.00001 cents /bit Memory OS 4K-16K bytes Pages Disk G Bytes, 10 ms (10,000,000 ns) 10 - 10 cents/bit Disk -5 -6 user/operator Mbytes Files Larger Tape infinite sec-min 10 Tape Lower Level -8 55:035 Computer Architecture and Organization

Virtual Memory • Some facts of computer life… • Computers run lots of processes simultaneously • No full address space of memory for each process • Must share smaller amounts of physical memory among many processes • Virtual memory is the answer! • Divides physical memory into blocks, assigns them to different processes 55:035 Computer Architecture and Organization

Virtual Memory • Virtual memory (VM) allows main memory (DRAM) to act like a cache for secondary storage (magnetic disk). • VM address translation a provides a mapping from the virtual address of the processor to the physical address in main memory or on disk. Compiler assigns data to a “virtual” address. VA translated to a real/physical somewhere in memory… (allows any program to run anywhere; where is determined by a particular machine, OS) 55:035 Computer Architecture and Organization

VM Benefit • VM provides the following benefits • Allows multiple programs to share the same physical memory • Allows programmers to write code as though they have a very large amount of main memory • Automatically handles bringing in data from disk 55:035 Computer Architecture and Organization

Virtual Memory Basics • Programs reference “virtual” addresses in a non-existent memory • These are then translated into real “physical” addresses • Virtual address space may be bigger than physical address space • Divide physical memory into blocks, called pages • Anywhere from 512 to 16MB (4k typical) • Virtual-to-physical translation by indexed table lookup • Add another cache for recent translations (the TLB) • Invisible to the programmer • Looks to your application like you have a lot of memory! 55:035 Computer Architecture and Organization

VM: Page Mapping Process 1’s Virtual Address Space Page Frames Process 2’s Virtual Address Space Disk Physical Memory 55:035 Computer Architecture and Organization

VM: Address Translation 20 bits 12 bits Log2 of pagesize Virtual page number Page offset Per-process page table Valid bit Protection bits Dirty bt Reference bit Page Table base Physical page number Page offset To physical memory 55:035 Computer Architecture and Organization

Virtual Address Physical Address Physical Main Memory 0 4 8 12 A B C D 0 4K 8K 12K C 16K 20K 24K 28K A B Virtual Memory D Disk Example of virtual memory • Relieves problem of making a program that was too large to fit in physical memory – well….fit! • Allows program to run in any location in physical memory • (called relocation) • Really useful as you might want to run same program on lots machines… Logical program is in contiguous VA space; here, consists of 4 pages: A, B, C, D; The physical location of the 3 pages – 3 are in main memory and 1 is located on the disk 55:035 Computer Architecture and Organization

Cache terms vs. VM terms So, some definitions/“analogies” • A “page” or “segment” of memory is analogous to a “block” in a cache • A “page fault” or “address fault” is analogous to a cache miss “real”/physical memory so, if we go to main memory and our data isn’t there, we need to get it from disk… 55:035 Computer Architecture and Organization

More definitions and cache comparisons • These are more definitions than analogies… • With VM, CPU produces “virtual addresses” that are translated by a combination of HW/SW to “physical addresses” • The “physical addresses” access main memory • The process described above is called “memory mapping” or “address translation” 55:035 Computer Architecture and Organization

Cache VS. VM comparisons (1/2) 55:035 Computer Architecture and Organization

Cache VS. VM comparisons (2/2) • Replacement policy: • Replacement on cache misses primarily controlled by hardware • Replacement with VM (i.e. which page do I replace?) usually controlled by OS • Because of bigger miss penalty, want to make the right choice • Sizes: • Size of processor address determines size of VM • Cache size independent of processor address size 55:035 Computer Architecture and Organization

Virtual Memory • Timing’s tough with virtual memory: • AMAT = Tmem + (1-h) * Tdisk • = 100nS + (1-h) * 25,000,000nS • h (hit rate) had to be incredibly (almost unattainably) close to perfect to work • so: VM is a “cache” but an odd one. 55:035 Computer Architecture and Organization

Physical Memory 32 32 CPU page offset frame offset page table page frame Paging Hardware How big is a page? How big is the page table? 55:035 Computer Architecture and Organization

Virtual Address Page # Offset Frame # Offset Register Page Table Ptr Page Table Offset Page Frame P# + Frame # Program Paging Main Memory Address Translation in a Paging System 55:035 Computer Architecture and Organization

0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 How big is a page table? • Suppose • 32 bit architecture • Page size 4 kilobytes • Therefore: Page Number 220 Offset 212 55:035 Computer Architecture and Organization

PTR 0x100 VPN OFFSET Memory Reference 0x010 0x020 Test Yourself A processor asks for the contents of virtual memory address 0x10020. The paging scheme in use breaks this into a VPN of 0x10 and an offset of 0x020. PTR (a CPU register that holds the address of the page table) has a value of 0x100 indicating that this process’s page table starts at location 0x100. The machine uses word addressing and the page table entries are each one word long. 55:035 Computer Architecture and Organization

PTR 0x100 VPN OFFSET Memory Reference 0x010 0x020 Test Yourself ADDR CONTENTS 0x00000 0x00000 0x00100 0x00010 0x00110 0x00022 0x00120 0x00045 0x00130 0x00078 0x00145 0x00010 0x10000 0x03333 0x10020 0x04444 0x22000 0x01111 0x22020 0x02222 0x45000 0x05555 0x45020 0x06666 • What is the physical address calculated? • 10020 • 22020 • 45000 • 45020 • none of the above 55:035 Computer Architecture and Organization

PTR 0x100 VPN OFFSET Memory Reference 0x010 0x020 Test Yourself ADDR CONTENTS 0x00000 0x00000 0x00100 0x00010 0x00110 0x00022 0x00120 0x00045 0x00130 0x00078 0x00145 0x00010 0x10000 0x03333 0x10020 0x04444 0x22000 0x01111 0x22020 0x02222 0x45000 0x05555 0x45020 0x06666 • What is the physical address calculated? • What is the contents of this address returned to the processor? • How many memory accesses in total were required to obtain the contents of the desired address? 55:035 Computer Architecture and Organization

00 01 10 11 101 110 001 010 01 01 110 01 Another Example Physical memory 0 1 2 3 4 i 5 j 6 k 7 l 8 m 9 n 10 o 11 p 12 13 14 15 16 17 18 19 20 a 21 b 22 c 23 d 24 e 25 f 26 g 27 h 28 29 30 31 Logical memory 0 a 1 b 2 c 3 d 4 e 5 f 6 g 7 h 8 i 9 j 10 k 11 l 12 m 13 n 14 o 15 p Page Table 5 6 1 2 0 1 2 3 55:035 Computer Architecture and Organization

Replacement policies 55:035 Computer Architecture and Organization

Block replacement • Which block should be replaced on a virtual memory miss? • Again, we’ll stick with the strategy that it’s a good thing to eliminate page faults • Therefore, we want to replace the LRU block • Many machines use a “use” or “reference” bit • Periodically reset • Gives the OS an estimation of which pages are referenced 55:035 Computer Architecture and Organization

Writing a block • What happens on a write? • We don’t even want to think about a write through policy! • Time with accesses, VM, hard disk, etc. is so great that this is not practical • Instead, a write back policy is used with a dirty bit to tell if a block has been written 55:035 Computer Architecture and Organization

Mechanism vs. Policy • Mechanism: • paging hardware • trap on page fault • Policy: • fetch policy: when should we bring in the pages of a process? • 1. load all pages at the start of the process • 2. load only on demand: “demand paging” • replacement policy: which page should we evict given a shortage of frames? 55:035 Computer Architecture and Organization

Replacement Policy • Given a full physical memory, which page should we evict?? • What policy? • Random • FIFO: First-in-first-out • LRU: Least-Recently-Used • MRU: Most-Recently-Used • OPT: (will-not-be-used-farthest-in-future) 55:035 Computer Architecture and Organization

Replacement Policy Simulation • example sequence of page numbers • 0 1 2 3 42 2 37 1 2 3 • FIFO? • LRU? • OPT? • How do you keep track of LRU info? (another data structure question) 55:035 Computer Architecture and Organization

Page tables and lookups… • 1. it’s slow! We’ve turned every access to memory into two accesses to memory • solution: add a specialized “cache” called a “translation lookaside buffer (TLB)” inside the processor • 2. it’s still huge! • even worse: we’re ultimately going to have a page table for every process. Suppose 1024 processes, that’s 4GB of page tables! 55:035 Computer Architecture and Organization

Paging/VM (1/3) Disk Operating System Physical Memory CPU 356 42 356 page table i 55:035 Computer Architecture and Organization

page table i Paging/VM (2/3) Disk Operating System Physical Memory CPU 42 356 356 Place page table in physical memory However: this doubles the time per memory access!! 55:035 Computer Architecture and Organization

page table i Paging/VM (3/3) Operating System Disk Physical Memory CPU 42 356 356 Cache! Special-purpose cache for translations Historically called the TLB: Translation Lookaside Buffer 55:035 Computer Architecture and Organization

Translation Cache • Just like any other cache, the TLB can be organized as fully associative, • set associative, or direct mapped • TLBs are usually small, typically not more than 128 - 256 entries even on • high end machines. This permits fully associative lookup on these machines. • Most mid-range machines use small n-way set associative organizations. • Note: 128-256 entries times 4KB-16KB/entry is only 512KB-4MB • the L2 cache is often bigger than the “span” of the TLB. hit miss VA PA TLB Lookup Cache Main Memory CPU miss hit Translation with a TLB Trans- lation data 55:035 Computer Architecture and Organization

Translation Cache A way to speed up translation is to use a special cache of recently used page table entries -- this has many names, but the most frequently used is Translation Lookaside Buffer or TLB Virtual Page # Physical Frame # Dirty Ref Valid Access tag Really just a cache (a special-purpose cache) on the page table mappings TLB access time comparable to cache access time (much less than main memory access time) 55:035 Computer Architecture and Organization

An example of a TLB Page frame address Page Offset Read/write policies and permissions… <30> <13> 1 2 V <1> R <2> W <2> Tag <30> Phys. Addr. <21> (Low-order 13 bits of addr.) <13> ... … 4 34-bit physical address … (High-order 21 bits of addr.) 32:1 Mux 3 <21> 55:035 Computer Architecture and Organization

The “big picture” and TLBs • Address translation is usually on the critical path… • …which determines the clock cycle time of the mP • Even in the simplest cache, TLB values must be read and compared • TLB is usually smaller and faster than the cache-address-tag memory • This way multiple TLB reads don’t increase the cache hit time • TLB accesses are usually pipelined b/c its so important! 55:035 Computer Architecture and Organization

The “big picture” and TLBs Virtual Address TLB access No Yes TLB Hit? No Yes Try to read from page table Write? Try to read from cache Set in TLB Page fault? Cache/buffer memory write Yes No No Yes Cache hit? Replace page from disk TLB miss stall Cache miss stall Deliver data to CPU 55:035 Computer Architecture and Organization

Pages are Cached in a Virtual Memory System • Can Ask the Same Four Questions we did about caches • Q1: Block Placement • choice: lower miss rates and complex placement or vice versa • miss penalty is huge • so choose low miss rate ==> place page anywhere in physical memory • similar to fully associative cache model • Q2: Block Addressing - use additional data structure • fixed size pages - use a page table • virtual page number ==> physical page number and concatenate offset • tag bit to indicate presence in main memory 55:035 Computer Architecture and Organization

Normal Page Tables • Size is number of virtual pages • Purpose is to hold the translation of VPN to PPN • Permits ease of page relocation • Make sure to keep tags to indicate page is mapped • Potential problem: • Consider 32bit virtual address and 4k pages • 4GB/4KB = 1MW required just for the page table! • Might have to page in the page table… • Consider how the problem gets worse on 64bit machines with even larger virtual address spaces! • Might have multi-level page tables 55:035 Computer Architecture and Organization

Inverted Page Tables • Similar to a set-associative mechanism • Make the page table reflect the # of physical pages (not virtual) • Use a hash mechanism • virtual page number ==> HPN index into inverted page table • Compare virtual page number with the tag to make sure it is the one you want • if yes • check to see that it is in memory - OK if yes - if not page fault • If not - miss • go to full page table on disk to get new entry • implies 2 disk accesses in the worst case • trades increased worst case penalty for decrease in capacity induced miss rate since there is now more room for real pages with smaller page table 55:035 Computer Architecture and Organization

Inverted Page Table Page Offset • Only store entries • for pages in physical • memory Hash Page Frame V = OK Frame Offset 55:035 Computer Architecture and Organization

Address Translation Reality • The translation process using page tables takes too long! • Use a cache to hold recent translations • Translation Lookaside Buffer • Typically 8-1024 entries • Block size same as a page table entry (1 or 2 words) • Only holds translations for pages in memory • 1 cycle hit time • Highly or fully associative • Miss rate < 1% • Miss goes to main memory (where the whole page table lives) • Must be purged on a process switch 55:035 Computer Architecture and Organization

Back to the 4 Questions • Q3: Block Replacement (pages in physical memory) • LRU is best • So use it to minimize the horrible miss penalty • However, real LRU is expensive • Page table contains a use tag • On access the use tag is set • OS checks them every so often, records what it sees, and resets them all • On a miss, the OS decides who has been used the least • Basic strategy: Miss penalty is so huge, you can spend a few OS cycles to help reduce the miss rate 55:035 Computer Architecture and Organization

Last Question • Q4: Write Policy • Always write-back • Due to the access time of the disk • So, you need to keep tags to show when pages are dirty and need to be written back to disk when they’re swapped out. • Anything else is pretty silly • Remember – the disk is SLOW! 55:035 Computer Architecture and Organization

Page Sizes • An architectural choice • Large pages are good: • reduces page table size • amortizes the long disk access • if spatial locality is good then hit rate will improve • Large pages are bad: • more internal fragmentation • if everything is random each structure’s last page is only half full • Half of bigger is still bigger • if there are 3 structures per process: text, heap, and control stack • then 1.5 pages are wasted for each process • process start up time takes longer • since at least 1 page of each type is required to prior to start • transfer time penalty aspect is higher 55:035 Computer Architecture and Organization

More on TLBs • The TLB must be on chip • otherwise it is worthless • small TLB’s are worthless anyway • large TLB’s are expensive • high associativity is likely • ==> Price of CPU’s is going up! • OK as long as performance goes up faster 55:035 Computer Architecture and Organization

Selecting a Page Size • Reasons for larger page size • Page table size is inversely proportional to the page size; therefore memory saved • Fast cache hit time easy when cache size < page size (VA caches); bigger page makes this feasible as cache size grows • Transferring larger pages to or from secondary storage, possibly over a network, is more efficient • Number of TLB entries are restricted by clock cycle time, so a larger page size maps more memory, thereby reducing TLB misses • Reasons for a smaller page size • Want to avoid internal fragmentation: don’t waste storage; data must be contiguous within page • Quicker process start for small processes - don’t need to bring in more memory than needed 55:035 Computer Architecture and Organization

Memory Protection • With multiprogramming, a computer is shared by several programs or processes running concurrently • Need to provide protection • Need to allow sharing • Mechanisms for providing protection • Provide Base and Bound registers: Base ฃAddress ฃBound • Provide both user and supervisor (operating system) modes • Provide CPU state that the user can read, but cannot write • Branch and bounds registers, user/supervisor bit, exception bits • Provide method to go from user to supervisor mode and vice versa • system call : user to supervisor • system return : supervisor to user • Provide permissions for each flag or segment in memory 55:035 Computer Architecture and Organization

Pitfall: Address space to small • One of the biggest mistakes than can be made when designing an architecture is to devote to few bits to the address • address size limits the size of virtual memory • difficult to change since many components depend on it (e.g., PC, registers, effective-address calculations) • As program size increases, larger and larger address sizes are needed • 8 bit: Intel 8080 (1975) • 16 bit: Intel 8086 (1978) • 24 bit: Intel 80286 (1982) • 32 bit: Intel 80386 (1985) • 64 bit: Intel Merced (1998) 55:035 Computer Architecture and Organization

Virtual Memory Summary • Virtual memory (VM) allows main memory (DRAM) to act like a cache for secondary storage (magnetic disk). • The large miss penalty of virtual memory leads to different stategies from cache • Fully associative, TB + PT, LRU, Write-back • Designed as • paged: fixed size blocks • segmented: variable size blocks • hybrid: segmented paging or multiple page sizes • Avoid small address size 55:035 Computer Architecture and Organization

55:035 Computer Architecture and Organization