
Linux Memory Management


Presentation Transcript


  1. Linux Memory Management Dan Furlani, Justin Polchlopek, & John de Raismes

  2. Basics • The Memory Management Subsystem Provides: • Large Address Spaces • Protection • Memory Mapping • Fair Physical Memory Allocation • Shared Virtual Memory • Most important Code rule: OPTIMIZE!!!!! • Large use of #define's and inline functions • if-then-else, loops, break, goto used excessively to optimize the generated asm code. /* Goto-purists beware: the only reason for goto's here is that it results * in better assembly code.. The "default" path will see no jumps at all. */ Tasty spaghetti code follows... very obfuscated.

  3. VM Abstract

  4. VM Abstract • In a virtual memory system, the addresses a program uses are virtual addresses, not physical addresses. • These virtual addresses are converted into physical addresses by the processor, based on information held in a set of tables maintained by the operating system. • To make this translation easier, virtual and physical memory are divided into handy sized chunks called pages. These pages are all the same size. • Each of these pages is given a unique number; the page frame number (PFN).
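
To make the page/offset split concrete, here is a small stand-alone C sketch (not kernel code; the 4 KB page size and the example address are assumptions chosen for illustration):

#include <stdio.h>

/* Hypothetical 4 KB pages: 12 offset bits, the rest select the page. */
#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PAGE_MASK  (~(PAGE_SIZE - 1))

int main(void)
{
    unsigned long vaddr  = 0x20421abcUL;         /* an example virtual address */
    unsigned long vpn    = vaddr >> PAGE_SHIFT;  /* virtual page number */
    unsigned long offset = vaddr & ~PAGE_MASK;   /* byte offset within the page */

    /* The page tables map vpn to a physical page frame number (PFN);
     * the byte offset is carried over unchanged into the physical address. */
    printf("vaddr 0x%lx -> virtual page %lu, offset 0x%lx\n", vaddr, vpn, offset);
    return 0;
}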

  5. VM Abstract • Each page table entry holds: • A valid flag indicating whether the entry is valid, • The physical page frame number that this entry is describing, • Access control information describing how the page may be used. • Page fault: • If the entry is invalid, the process has accessed a non-existent area of its virtual memory. • This is known as a page fault and the operating system is notified of the faulting virtual address and the reason for the page fault.
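
A toy translation step showing the valid-flag check (the struct, names and values are invented for illustration; this is not the kernel's page-table layout):

#include <stdio.h>

/* A toy page table entry: a valid bit, an access bit, and a frame number. */
struct toy_pte {
    unsigned valid : 1;
    unsigned writable : 1;
    unsigned long pfn;
};

/* Translate a virtual page number, or report a page fault. */
static int translate(const struct toy_pte *table, unsigned long vpn,
                     unsigned long *pfn_out)
{
    const struct toy_pte *pte = &table[vpn];

    if (!pte->valid) {
        printf("page fault: virtual page %lu has no valid mapping\n", vpn);
        return -1;        /* the OS would now decide how to handle the fault */
    }
    *pfn_out = pte->pfn;  /* valid: hand back the physical frame number */
    return 0;
}

int main(void)
{
    struct toy_pte table[4] = { { 1, 1, 7 } };  /* only virtual page 0 is mapped */
    unsigned long pfn;

    if (translate(table, 0, &pfn) == 0)
        printf("virtual page 0 -> physical frame %lu\n", pfn);
    translate(table, 3, &pfn);                  /* hits the toy "page fault" path */
    return 0;
}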

  6. Demand Paging • Linux uses demand paging to load executable images into a process's virtual memory • Whenever a command is executed, the file containing it is opened and its contents are mapped into the process's virtual memory • This is done by modifying the data structures describing this process's memory map and is known as memory mapping • However, only the first part of the image is actually brought into physical memory. The rest of the image is left on disk • As the image executes, it generates page faults and Linux uses the process's memory map in order to determine which parts of the image to bring into memory for execution
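
The same behaviour is visible from user space through mmap(2): the mapping is set up immediately, but file pages are only read from disk when they are first touched and fault. A minimal sketch (the file path is only an example; error handling kept short):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/etc/hostname";  /* any readable file */
    int fd = open(path, O_RDONLY);
    struct stat st;

    if (fd < 0 || fstat(fd, &st) < 0 || st.st_size == 0) {
        perror("open/fstat");
        return 1;
    }

    /* The mapping exists as soon as mmap() returns; no file data is read yet. */
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Touching the first byte causes a page fault; the kernel then reads
     * just that page of the file into physical memory. */
    printf("first byte of %s: %c\n", path, p[0]);

    munmap(p, st.st_size);
    close(fd);
    return 0;
}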

  7. Memory Mapping

  8. Memory Mapping • Every process's virtual memory is represented by an mm_struct data structure • Each vm_area_struct data structure describes the start and end of the area of virtual memory, the process's access rights to that memory and a set of operations for that memory • Each vm_area_struct data structure represents a part of the executable image; the executable code, initialized data (variables), uninitialized data and so on

  9. Memory mapping: some details... vm_area_struct describes an area of a process's virtual memory. A VM area is any part of the virtual memory space that has a special rule for the page-fault handlers (i.e. a shared library, the executable area, etc.). The annotated fields give: the start and end of the area of virtual memory, the access rights to this area of memory, and the set of operations for this area of memory (see next slide).

struct vm_area_struct {
	struct mm_struct * vm_mm;	/* VM area parameters */
	unsigned long vm_start;
	unsigned long vm_end;

	/* linked list of VM areas per task, sorted by address */
	struct vm_area_struct *vm_next;

	pgprot_t vm_page_prot;
	unsigned short vm_flags;

	/* AVL tree of VM areas per task, sorted by address */
	short vm_avl_height;
	struct vm_area_struct * vm_avl_left;
	struct vm_area_struct * vm_avl_right;

	/* For areas with inode, the list inode->i_mmap, for shm areas,
	 * the list of attaches, otherwise unused. */
	struct vm_area_struct *vm_next_share;
	struct vm_area_struct **vm_pprev_share;

	struct vm_operations_struct * vm_ops;
	unsigned long vm_offset;
	struct file * vm_file;
	unsigned long vm_pte;		/* shared mem */
};
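
To illustrate how such a sorted list of areas is used, here is a stand-alone sketch of the lookup (the toy_* names are simplified stand-ins, not kernel symbols; the search is roughly what the kernel's find_vma() does):

#include <stdio.h>

/* Simplified stand-in for struct vm_area_struct: just the fields needed
 * to show the sorted-by-address lookup. */
struct toy_vma {
    unsigned long vm_start, vm_end;  /* covers [vm_start, vm_end) */
    struct toy_vma *vm_next;         /* next area, sorted by address */
};

/* Return the first area whose end lies above addr; the caller still has to
 * check that addr >= vm_start, otherwise the address is unmapped. */
static struct toy_vma *toy_find_vma(struct toy_vma *head, unsigned long addr)
{
    for (struct toy_vma *vma = head; vma; vma = vma->vm_next)
        if (addr < vma->vm_end)
            return vma;
    return NULL;
}

int main(void)
{
    struct toy_vma data = { 0x600000, 0x604000, NULL };   /* e.g. initialized data */
    struct toy_vma text = { 0x400000, 0x401000, &data };  /* e.g. executable code */
    unsigned long addr = 0x600123;

    struct toy_vma *hit = toy_find_vma(&text, addr);
    if (hit && addr >= hit->vm_start)
        printf("0x%lx falls in area [0x%lx, 0x%lx)\n",
               addr, hit->vm_start, hit->vm_end);
    return 0;
}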

  10. Memory mapping: some details... vm_operations_struct describes the virtual MM functions: opening of an area, closing and unmapping it, functions to be called when a no-page or a wp-page exception occurs, etc.

struct vm_operations_struct {
	void (*open)(struct vm_area_struct * area);
	void (*close)(struct vm_area_struct * area);
	void (*unmap)(struct vm_area_struct *area, unsigned long, size_t);
	void (*protect)(struct vm_area_struct *area, unsigned long, size_t, unsigned int newprot);
	int (*sync)(struct vm_area_struct *area, unsigned long, size_t, unsigned int flags);
	void (*advise)(struct vm_area_struct *area, unsigned long, size_t, unsigned int advise);
	unsigned long (*nopage)(struct vm_area_struct * area, unsigned long address, int write_access);
	unsigned long (*wppage)(struct vm_area_struct * area, unsigned long address, unsigned long page);
	int (*swapout)(struct vm_area_struct *, struct page *);
	pte_t (*swapin)(struct vm_area_struct *, unsigned long, unsigned long);
};

These are all pointers to functions that will be called when this area of VM needs to be manipulated. Different uses of this area of VM will need different functions to be called when a page fault occurs. E.g. if it is discovered that a page is not in physical memory, the nopage operation needs to be called -- the actual implementation of nopage may differ for different mappings.

/* Shared mappings need to be able to do the right thing at
 * close/unmap/sync. They will also use the private file as
 * backing-store for swapping.. */
static struct vm_operations_struct file_shared_mmap = {
	NULL,			/* no special open */
	NULL,			/* no special close */
	filemap_unmap,		/* unmap - we need to sync the pages */
	NULL,			/* no special protect */
	filemap_sync,		/* sync */
	NULL,			/* advise */
	filemap_nopage,		/* nopage */
	NULL,			/* wppage */
	filemap_swapout,	/* swapout */
	NULL,			/* swapin */
};

There is also a set of operations defined for file_private_map (less interesting). In an OO implementation of the kernel (god forbid), a class could be defined for these operations, and subclasses could be defined where operations that need a specific implementation could override the inherited method.
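
The table of function pointers above is essentially hand-rolled virtual dispatch. A stand-alone toy showing the idea (the toy_* names and handlers are invented; only the shape matches the kernel's vm_operations_struct):

#include <stdio.h>

/* A toy operations table: each kind of mapping supplies its own "nopage". */
struct toy_vm_ops {
    const char *name;
    void (*nopage)(unsigned long address);
};

static void file_nopage(unsigned long address)
{
    printf("file mapping: read the page covering 0x%lx from the file\n", address);
}

static void anon_nopage(unsigned long address)
{
    printf("anonymous mapping: hand out a zeroed page for 0x%lx\n", address);
}

static const struct toy_vm_ops file_ops = { "file", file_nopage };
static const struct toy_vm_ops anon_ops = { "anon", anon_nopage };

/* The fault path only sees the ops pointer, so the same code handles every
 * kind of mapping -- much like a virtual method call. */
static void handle_fault(const struct toy_vm_ops *ops, unsigned long address)
{
    printf("[%s] page fault at 0x%lx -> ", ops->name, address);
    ops->nopage(address);
}

int main(void)
{
    handle_fault(&file_ops, 0x401000);
    handle_fault(&anon_ops, 0x7fff0000);
    return 0;
}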

  11. Swapping • Linux uses a Least Recently Used (LRU) page aging technique to fairly choose pages which might be removed from the system • This scheme involves every page in the system having an age which changes as the page is accessed • The more that a page is accessed, the younger it is; the less that it is accessed the older and more stale it becomes. • Old pages are good candidates for swapping
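
A toy model of the aging idea (my own simplification, not the kernel's code): touching a page makes it young again, every sweep of the ager makes all pages older, and the oldest page is the swap candidate.

#include <stdio.h>

#define NPAGES 4

static unsigned age[NPAGES];

static void touch(int page)  { age[page] = 0; }   /* recently used: young again */

static void age_sweep(void)                       /* periodic ager: everyone gets older */
{
    for (int i = 0; i < NPAGES; i++)
        age[i]++;
}

static int pick_victim(void)                      /* oldest page = best swap candidate */
{
    int victim = 0;
    for (int i = 1; i < NPAGES; i++)
        if (age[i] > age[victim])
            victim = i;
    return victim;
}

int main(void)
{
    for (int sweep = 0; sweep < 5; sweep++) {
        touch(0);              /* page 0 is used constantly */
        if (sweep < 2)
            touch(2);          /* page 2 was only used early on */
        age_sweep();
    }
    printf("swap candidate: page %d (the oldest)\n", pick_victim());
    return 0;
}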

  12. More on Memory • Shared Virtual Memory • All memory accesses are made via page tables and each process has its own separate page table. • For two processes sharing a physical page of memory, its physical page frame number must appear in a page table entry in both of their page tables. • Physical and Virtual Addressing Modes • It does not make much sense for the operating system itself to run in virtual memory. • Physical addressing mode requires no page tables and the processor does not attempt to perform any address translations in this mode. • The Linux kernel is linked to run in physical address space
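
Shared virtual memory can be seen from user space with an anonymous MAP_SHARED mapping across fork(): both processes end up with the same physical page mapped through their own page tables. A minimal demo (error handling kept short):

#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* One anonymous, shared page: after fork() parent and child both map the
     * same physical frame, each through its own page tables. */
    int *shared = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    *shared = 0;

    if (fork() == 0) {     /* child writes through its own mapping */
        *shared = 42;
        _exit(0);
    }
    wait(NULL);            /* parent sees the child's write: same physical page */
    printf("parent reads %d from the shared page\n", *shared);

    munmap(shared, sizeof(int));
    return 0;
}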

  13. Access Control (the bits listed here are those of an Alpha AXP page table entry) • V • Valid, if set this PTE is valid, • FOE • "Fault on Execute". Whenever an attempt to execute instructions in this page occurs, the processor reports a page fault and passes control to the operating system, • FOW • "Fault on Write", as above but page fault on an attempt to write to this page, • FOR • "Fault on Read", as above but page fault on an attempt to read from this page, • ASM • Address Space Match. This is used when the operating system wishes to clear only some of the entries from the Translation Buffer, • KRE • Code running in kernel mode can read this page,

  14. Access Control • URE • Code running in user mode can read this page, • GH • Granularity hint used when mapping an entire block with a single Translation Buffer entry rather than many, • KWE • Code running in kernel mode can write to this page, • UWE • Code running in user mode can write to this page, • page frame number • For PTEs with the V bit set, this field contains the physical page frame number (PFN) for this PTE. For invalid PTEs, if this field is not zero, it contains information about where the page is in the swap file. • _PAGE_DIRTY • If set, the page needs to be written out to the swap file, • _PAGE_ACCESSED • Used by Linux to mark a page as having been accessed.
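
A toy illustration of how such per-page flag bits are tested and set with bit masks (the TOY_* bit positions are invented for the example; the real values are architecture specific):

#include <stdio.h>

#define TOY_PAGE_VALID    (1u << 0)
#define TOY_PAGE_RW       (1u << 1)
#define TOY_PAGE_DIRTY    (1u << 2)
#define TOY_PAGE_ACCESSED (1u << 3)

int main(void)
{
    unsigned pte = TOY_PAGE_VALID | TOY_PAGE_RW;   /* a writable, present page */

    pte |= TOY_PAGE_ACCESSED;                      /* set on any access */
    pte |= TOY_PAGE_DIRTY;                         /* set when the page is written */

    if (pte & TOY_PAGE_DIRTY)
        printf("page is dirty: it must be written to swap before being reused\n");
    if (!(pte & TOY_PAGE_RW))
        printf("a write would fault (e.g. a copy-on-write page)\n");
    return 0;
}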

  15. Access Control: COW • COW refers to Copy On Write. If 20 users are accessing one data file, they will share an area of virtual memory which contains the memory-mapped file - there is no need to have 20 separate copies in memory. • Now suppose one user modifies the file. All users can no longer share one copy in memory. By marking a page COW, a write to that page triggers a page fault, and the kernel then creates a private copy of the page for that user. • A simple example demonstrating COW:

/*
 * We special-case the C-O-W ZERO_PAGE, because it's such
 * a common occurrence (no need to read the page to know
 * that it's zero - better for the cache and memory subsystem).
 */
static inline void copy_cow_page(unsigned long from, unsigned long to)
{
	if (from == ZERO_PAGE(to)) {
		clear_page(to);
		return;
	}
	copy_page(to, from);
}
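
The effect of copy-on-write is also visible from user space: after fork(), parent and child initially share their data pages, and whoever writes first gets a private copy. A tiny demo:

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static int value = 1;   /* this data page is shared copy-on-write after fork() */

int main(void)
{
    if (fork() == 0) {
        /* The write faults, the kernel copies the page, and only the
         * child's copy of the page changes. */
        value = 99;
        _exit(0);
    }
    wait(NULL);
    printf("parent still sees value = %d\n", value);  /* prints 1 */
    return 0;
}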

  16. Access Control

  17. Caches • Buffer Cache • The buffer cache contains data buffers that are used by the block device drivers • Page Cache • This is used to speed up access to images and data on disk. • Swap Cache • Only modified (or dirty) pages are saved in the swap file; the swap cache tracks pages that already have an unchanged copy in the swap file, so they need not be written out again the next time they are swapped. • Hardware Caches • The processor does not always read the page table directly but instead caches translations for pages as it needs them. These are the Translation Look-aside Buffers and contain cached copies of the page table entries from one or more processes in the system
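
As a rough model of the hardware TLB idea (purely illustrative; nothing below is real hardware or kernel code), a direct-mapped cache of translations could look like this:

#include <stdio.h>

#define TLB_SLOTS 8

struct tlb_entry { int valid; unsigned long vpn, pfn; };
static struct tlb_entry tlb[TLB_SLOTS];

/* Stand-in for the full page table walk: made-up mapping for the demo. */
static unsigned long walk_page_tables(unsigned long vpn) { return vpn + 100; }

static unsigned long lookup(unsigned long vpn)
{
    struct tlb_entry *e = &tlb[vpn % TLB_SLOTS];   /* direct-mapped slot */

    if (e->valid && e->vpn == vpn) {
        printf("TLB hit  for page %lu\n", vpn);
        return e->pfn;
    }
    printf("TLB miss for page %lu -> walking the page tables\n", vpn);
    e->valid = 1;
    e->vpn = vpn;
    e->pfn = walk_page_tables(vpn);                /* cache the translation */
    return e->pfn;
}

int main(void)
{
    lookup(3);    /* miss: filled from the page tables */
    lookup(3);    /* hit: served from the cached entry */
    lookup(11);   /* maps to the same slot, evicts page 3 */
    lookup(3);    /* miss again */
    return 0;
}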

  18. Page Tables

  19. Page Tables • Linux assumes that there are three levels of page tables • To translate a virtual address into a physical one, the processor must take the contents of each level field, convert it into an offset into the physical page containing the Page Table and read the page frame number of the next level of Page Table • This is repeated three times until the page frame number of the physical page containing the virtual address is found. Now the final field in the virtual address, the byte offset, is used to find the data inside the page

  20. Page Tables: some details... The Linux memory manager always uses a 3-tier page table interface, although the implementation varies by platform. On the i386, the middle level is folded into the top-level page table to fit the machine's architecture.

/*
 * The "pgd_xxx()" functions here are trivial for a folded two-level
 * setup: the pgd is never bad, and a pmd always exists (as it's folded
 * into the pgd entry)
 */
extern inline int pgd_none(pgd_t pgd)		{ return 0; }
extern inline int pgd_bad(pgd_t pgd)		{ return 0; }
extern inline int pgd_present(pgd_t pgd)	{ return 1; }
extern inline void pgd_clear(pgd_t * pgdp)	{ }

The levels of page table translation are implemented as preprocessor macros or inline functions so that their execution is as fast as possible. The relevant code is shown on the next slide...

  21. Page table translation code

/*
 * Conversion functions: convert a page and protection to a page entry,
 * and a page entry and page directory to the page they refer to.
 */
#define mk_pte(page, pgprot) \
	({ pte_t __pte; pte_val(__pte) = __pa(page) + pgprot_val(pgprot); __pte; })

/* This takes a physical page address that is used by the remapping functions */
#define mk_pte_phys(physpage, pgprot) \
	({ pte_t __pte; pte_val(__pte) = physpage + pgprot_val(pgprot); __pte; })

extern inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
{
	pte_val(pte) = (pte_val(pte) & _PAGE_CHG_MASK) | pgprot_val(newprot);
	return pte;
}

#define pte_page(pte) \
	((unsigned long) __va(pte_val(pte) & PAGE_MASK))

#define pmd_page(pmd) \
	((unsigned long) __va(pmd_val(pmd) & PAGE_MASK))

/* to find an entry in a page-table-directory */
#define pgd_offset(mm, address) \
	((mm)->pgd + ((address) >> PGDIR_SHIFT))

/* to find an entry in a kernel page-table-directory */
#define pgd_offset_k(address) pgd_offset(&init_mm, address)

/* Find an entry in the second-level page table.. */
extern inline pmd_t * pmd_offset(pgd_t * dir, unsigned long address)
{
	return (pmd_t *) dir;
}

/* Find an entry in the third-level page table.. */
#define pte_offset(pmd, address) \
	((pte_t *) (pmd_page(*pmd) + ((address>>10) & ((PTRS_PER_PTE-1)<<2))))
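
The same three-step walk can be modelled in ordinary user-space C. The sketch below is not kernel code: the field widths (3 index bits per level, a 7-bit offset) are made up so everything fits in a few small arrays, but the pgd -> pmd -> pte structure mirrors the macros above.

#include <stdio.h>

#define BITS     3                    /* index bits per level (toy value) */
#define ENTRIES  (1 << BITS)
#define OFF_BITS 7                    /* byte-offset bits (toy value) */

static unsigned long  pte_tab[ENTRIES];      /* level 3: page table */
static unsigned long *pmd_tab[ENTRIES];      /* level 2: middle directory */
static unsigned long **pgd_tab[ENTRIES];     /* level 1: page directory */

static unsigned long translate(unsigned long vaddr)
{
    unsigned pgd_i = (vaddr >> (OFF_BITS + 2 * BITS)) & (ENTRIES - 1);
    unsigned pmd_i = (vaddr >> (OFF_BITS + BITS)) & (ENTRIES - 1);
    unsigned pte_i = (vaddr >> OFF_BITS) & (ENTRIES - 1);
    unsigned off   = vaddr & ((1 << OFF_BITS) - 1);

    unsigned long **pmd = pgd_tab[pgd_i];    /* read the next level's table... */
    unsigned long *pte  = pmd[pmd_i];        /* ...three times in a row... */
    unsigned long pfn   = pte[pte_i];        /* ...until the PFN is found */

    return (pfn << OFF_BITS) | off;          /* frame number + byte offset */
}

int main(void)
{
    pgd_tab[1] = pmd_tab;                    /* wire up one path through */
    pmd_tab[2] = pte_tab;                    /* the three levels */
    pte_tab[5] = 42;                         /* that virtual page -> frame 42 */

    unsigned long vaddr = (1UL << (OFF_BITS + 2 * BITS))
                        | (2UL << (OFF_BITS + BITS))
                        | (5UL << OFF_BITS) | 0x13;
    printf("vaddr 0x%lx -> paddr 0x%lx\n", vaddr, translate(vaddr));
    return 0;
}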

  22. Page Allocation and Deallocation • All of the physical pages in the system are described by the mem_map data structure which is a list of mem_map_t • count - This is a count of the number of users of this page. The count is greater than one when the page is shared between many processes, • age -This field describes the age of the page and is used to decide if the page is a good candidate for discarding or swapping, • map_nr - This is the physical page frame number that this mem_map_t describes. • The free_area vector is used by the page allocation code to find and free pages

  23. Page Allocation

  24. Page Allocation Source code comment about the Buddy Algorithm:

/*
 * Buddy system. Hairy. You really aren't expected to understand this
 *
 * Hint: -mask = 1+~mask
 */

• Linux uses the Buddy algorithm to effectively allocate and deallocate blocks of pages • The allocation algorithm first searches for blocks of pages of the size requested. It follows the chain of free pages that is queued on the list element of the free_area data structure. If no blocks of pages of the requested size are free, blocks of the next size (which is twice that of the size requested) are looked for. This process continues until all of the free_area has been searched or until a block of pages has been found. If the block of pages found is larger than that requested it must be broken down until there is a block of the right size. Because the blocks are each a power of 2 pages big then this breaking down process is easy as you simply break the blocks in half. The free blocks are queued on the appropriate queue and the allocated block of pages is returned to the caller
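
A stripped-down model of the splitting step described above (free-list counters only; the real allocator also tracks the actual pages, bitmaps and buddy addresses):

#include <stdio.h>

#define MAX_ORDER 4                     /* block sizes of 1, 2, 4, 8, 16 pages */

static int free_blocks[MAX_ORDER + 1] = { 0, 0, 0, 0, 1 };   /* one free 16-page block */

/* Allocate a block of 2^wanted pages, splitting a larger block if needed. */
static int alloc_order(int wanted)
{
    int order = wanted;

    while (order <= MAX_ORDER && free_blocks[order] == 0)
        order++;                        /* look for progressively larger blocks */
    if (order > MAX_ORDER)
        return -1;                      /* nothing big enough is free */

    free_blocks[order]--;
    while (order > wanted) {            /* split: keep one half, free its buddy */
        order--;
        free_blocks[order]++;
        printf("split a %d-page block, kept one half, freed its buddy\n", 2 << order);
    }
    return wanted;
}

int main(void)
{
    if (alloc_order(1) == 1)            /* ask for a 2-page block */
        printf("allocated a 2-page block; free counts: ");
    for (int i = 0; i <= MAX_ORDER; i++)
        printf("%d-page:%d ", 1 << i, free_blocks[i]);
    printf("\n");
    return 0;
}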

  25. Page Deallocation • Allocating blocks of pages tends to fragment memory with larger blocks of free pages being broken down into smaller ones. • The page deallocation code recombines pages into larger blocks of free pages whenever it can. • Whenever a block of pages is freed, the adjacent or buddy block of the same size is checked to see if it is free. If it is, then it is combined with the newly freed block of pages to form a new free block of pages for the next size block of pages
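
The buddy of a block is cheap to find because blocks are aligned to their own size: flipping the bit that corresponds to the block size in the page index gives the neighbour to check. A small sketch of that rule (the indices and orders are invented for the example):

#include <stdio.h>

/* For a free block starting at page_index with size 2^order pages, its buddy
 * starts at page_index ^ (1 << order). */
static unsigned long buddy_of(unsigned long page_index, unsigned long order)
{
    return page_index ^ (1UL << order);
}

int main(void)
{
    unsigned long freed = 8;    /* page index of a 1-page block being freed */

    for (unsigned long order = 0; order < 3; order++) {
        unsigned long buddy = buddy_of(freed, order);
        printf("order %lu: block at page %lu, buddy at page %lu -> "
               "if the buddy is free, merge into a block at page %lu\n",
               order, freed, buddy, freed & ~(1UL << order));
        freed &= ~(1UL << order);   /* the merged block starts at the lower buddy */
    }
    return 0;
}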

  26. References References used for slides and paper: • Cross-Referencing Linux • http://lxr.linux.no/ • Virtual Memory Tutorial • http://cne.gmu.edu/modules/VM/ • The Linux Kernel, Chapter 3 • http://www.linuxdoc.org/LDP/tlk/mm/memory.html • Linux Memory Management • http://linux.dn.ua/docs/dp/KHG/node93.html
