Virtual Machines

Virtual Machines Arkaprava Basu

What is a (system) virtual machine?
• Virtual machine: a complete compute environment with its own isolated processing capabilities, memory, and communication channels
• A virtual machine is an efficient, isolated duplicate of the physical machine [Popek, Goldberg]
• Virtual Machine Monitor (VMM)/Hypervisor: system software that creates and manages virtual machines
• Desirable qualities of a VMM [Popek, Goldberg]:
  • Equivalence: the virtual machine interface is similar to that of the real machine
  • Safety/Isolation: each VM should be isolated from the others
  • Low performance overhead: performance close to that of the real machine

Without virtualization (bare metal)
[Figure: Application 1 and Application 2 run on a single OS, which runs directly on the hardware. Slide credit: Sorav Bansal, IIT-Delhi]

With virtualization
[Figure: Guest OS 1 and Guest OS 2 each run their own applications on their own virtual hardware; the Virtual Machine Monitor (VMM)/hypervisor sits between the virtual hardware and the physical hardware. Slide credit: Sorav Bansal, IIT-Delhi]

Why Virtual Machines?
• Operating system diversity: can run both Linux and Windows on the same hardware

Different OSes on the same hardware
[Figure: Windows and Linux guests, each with its own applications and virtual hardware, run on one Virtual Machine Monitor (VMM)/hypervisor on top of the same physical hardware. Slide credit: Sorav Bansal, IIT-Delhi]

Why Virtual Machines?
• Operating system diversity: can run both Linux and Windows on the same hardware
• Security/Isolation: the hypervisor separates the VMs from each other and isolates the VMs from the hardware
• Rapid provisioning/Cloud/Server consolidation: on-demand provisioning of hardware resources

Demand is dynamic
[Figure omitted. Slide credit: Sorav Bansal, IIT-Delhi]

Underutilization wastes resources
[Figure omitted. Slide credit: Sorav Bansal, IIT-Delhi]

Loss of revenue if under-provisioned
[Figure omitted. Slide credit: Sorav Bansal, IIT-Delhi]

Economics of Cloud
• Ideal: elastically increase resources on demand
• The ideal can be approximated if multiple demand streams share the same physical resources → move to the cloud (e.g., Amazon EC2)
• Transfer the risk of under/over-provisioning to the cloud provider
• The cloud provider (e.g., Amazon) can consolidate demands onto its physical resources
• Need to run computations of different enterprises on the same physical machine → needs isolation/protection → needs VMs

Why Virtual Machines?
• Operating system diversity: can run both Linux and Windows on the same hardware
• Security/Isolation: the hypervisor separates the VMs from each other and isolates the VMs from the hardware
• Rapid provisioning/Server consolidation: on-demand provisioning of hardware resources
• High availability/Load balancing: ability to live-migrate a VM to another physical server
• Encapsulation: the execution environment of an application is encapsulated within the VM

Other "virtual machines(?)" we will not discuss
• Language VM: a language runtime focused on running a single application
  • e.g., Java Virtual Machine, Microsoft Common Language Runtime, JavaScript engines
• "Lightweight" VM: does not run a guest OS; isolates applications from each other
  • e.g., Docker, FreeBSD's Jail, Google's Native Client
• Our focus is the system virtual machine: it presents a "copy" of the whole machine

Types of system virtual machine
• Type 1 (bare metal): the VMM runs on bare metal and directly controls the physical machine
  • e.g., Xen, VMware ESX Server
• Type 2 (hosted): the VMM runs as part of, or on top of, an OS
  • e.g., KVM, VMware Workstation

Implementing a VMM
• Three pieces of a system:
  • Instructions, defined by the ISA (e.g., x86, ARM)
  • Memory
  • I/O (network, disk)
• We will see how each piece is virtualized!

Techniques to virtualize the instruction set architecture

1. Software emulation
• Interpret each instruction and emulate it in software
• e.g., each instruction is implemented by a C function
  • incl (%eax):
      r = regs[EAX];
      tmp = read_mem(r);
      tmp++;
      write_mem(r, tmp);
• Memory is emulated as an array
• I/O is emulated in software
• Slowdown: 50x or more
• Example: Bochs
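
The emulation of incl (%eax) above can be sketched as a tiny interpreter. This is a minimal illustration, not how Bochs is written: the class EmulatedCPU, its method names, and the word-addressed memory are assumptions made for the example.

```python
EAX = 0  # index of EAX in the emulated register file (assumed layout)


class EmulatedCPU:
    """Toy machine state: registers and memory are ordinary Python lists."""

    def __init__(self, mem_size=64):
        self.regs = [0] * 8          # emulated general-purpose registers
        self.mem = [0] * mem_size    # guest memory emulated as an array of words

    def read_mem(self, addr):
        return self.mem[addr]

    def write_mem(self, addr, value):
        self.mem[addr] = value & 0xFFFFFFFF  # keep values 32 bits wide

    def incl_indirect(self, reg):
        """Emulate `incl (%reg)`: increment the word the register points to."""
        r = self.regs[reg]
        tmp = self.read_mem(r)
        tmp += 1
        self.write_mem(r, tmp)


cpu = EmulatedCPU()
cpu.regs[EAX] = 10       # guest address held in EAX
cpu.mem[10] = 41
cpu.incl_indirect(EAX)   # cpu.mem[10] is now 42
```

Every guest instruction costs several host function calls and memory operations here, which is where the large slowdown comes from.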

2. Binary translation
• Translate each VM instruction to a minimal set of host instructions on the fly
• Example: incl (%eax)
  » leal mem0(%eax), %esi
  » incl (%esi)
• CPU state is kept in a software structure (e.g., VCPU)
• Memory is emulated/relocated
• Privileged instructions get translated to calls to emulation routines
• Example: QEMU

Optimizing binary translation
• Observation: code is executed in basic blocks
  • Basic block: straight-line code that ends with a branch/jump
• Optimization: cache translated basic blocks → no need to translate every instruction each time it runs
• Optimization: direct jumps/function calls go straight to the translated basic block
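
The caching optimization can be sketched as a lookup table keyed by the guest address of each basic block. The names TranslationCache and toy_translate are hypothetical, and the "translated code" is just a string standing in for host machine code.

```python
class TranslationCache:
    """Cache of translated basic blocks, keyed by guest program counter."""

    def __init__(self, translate_block):
        self.translate_block = translate_block  # guest_pc -> translated block
        self.cache = {}
        self.translations = 0   # how many times the translator actually ran

    def lookup(self, guest_pc):
        if guest_pc not in self.cache:          # miss: translate once
            self.cache[guest_pc] = self.translate_block(guest_pc)
            self.translations += 1
        return self.cache[guest_pc]             # hit: reuse the translation


def toy_translate(guest_pc):
    # Stand-in translator; a real one would emit host instructions.
    return f"host_code_for_block_{guest_pc:#x}"


tc = TranslationCache(toy_translate)
for _ in range(1000):        # a hot loop re-enters the same basic block
    code = tc.lookup(0x400)
```

After the loop the translator has run only once, even though the block was entered 1000 times.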

3. Direct execution
• Idea: run most instructions of the VM directly on the hardware
• Advantage: performance close to native execution
• Challenge: how to ensure isolation/protection?
  • If all instructions of the VM are directly executed, then the VMM has no control!

Direct Execution + Trap and Emulate
• Idea:
  • Trap to the hypervisor when the VM tries to execute an instruction that could change the state of the system or take control (i.e., impact safety/isolation)
  • Emulate the execution of these instructions in the hypervisor
  • Directly execute on the hardware all other, innocuous instructions that cannot impact other VMs or the hypervisor

How to do Trap and Emulate?
• Generally two categories of instructions:
  • User instructions:
    • Typically compute instructions
    • e.g., add, mult, ld, store, jmp
  • System instructions:
    • Typically for system management
    • e.g., iret, invlpg, hlt, in, out
• Two modes of CPU operation:
  • User mode (in x86-64, typically ring 3)
  • Privileged mode (in x86-64, ring 0)
• An attempt to execute a system instruction in user mode generates a trap/general protection fault (GPF)

How to do Trap and Emulate?
• System state:
  • For example, control registers like cr3 (remember?)
  • Access to system state in user mode triggers a GPF
• Idea:
  • Run the VM in user mode (ring 3) and the hypervisor in privileged mode (ring 0)
  • Any time the VM tries to execute a system instruction or access system state, trap to the VMM
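
The idea above can be sketched with a toy CPU that faults on system instructions in ring 3 and a toy hypervisor that catches the fault. All names here (ToyCPU, ToyHypervisor, the "mov_to_cr3" mnemonic) are invented for illustration; a real hypervisor decodes the faulting instruction from guest memory instead of receiving it as a tuple.

```python
class GeneralProtectionFault(Exception):
    """Raised by the toy CPU when user mode attempts a system operation."""


class ToyCPU:
    def __init__(self):
        self.ring = 3              # the guest always runs deprivileged
        self.regs = {"eax": 0}
        self.cr3 = 0x1000          # real system state, owned by the hypervisor

    def execute(self, instr, operand):
        if instr == "add":                     # user instruction: runs directly
            self.regs["eax"] += operand
        elif instr == "mov_to_cr3":            # system instruction
            if self.ring != 0:
                raise GeneralProtectionFault(instr)
            self.cr3 = operand
        else:
            raise ValueError(f"unknown instruction: {instr}")


class ToyHypervisor:
    def __init__(self, cpu):
        self.cpu = cpu
        self.virtual_cr3 = 0       # the guest's private view of cr3
        self.traps = 0

    def run(self, program):
        for instr, operand in program:
            try:
                self.cpu.execute(instr, operand)   # direct execution
            except GeneralProtectionFault:
                self.traps += 1
                self.emulate(instr, operand)       # trap and emulate

    def emulate(self, instr, operand):
        if instr == "mov_to_cr3":
            self.virtual_cr3 = operand   # update virtual state only


cpu = ToyCPU()
hv = ToyHypervisor(cpu)
hv.run([("add", 5), ("mov_to_cr3", 0x2000), ("add", 1)])
```

The two add instructions never involve the hypervisor; only the write to cr3 traps, and the real cr3 stays untouched.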

Formalizing direct execution + trap & emulate
• Popek & Goldberg theorem:
  • Control-sensitive instructions: instructions that can update system state
  • Behavior-sensitive instructions: instructions whose behavior depends on system state
• Requirement for an architecture/ISA to be virtualizable via trap & emulate:
  $\{\text{control sensitive}\} \cup \{\text{behavior sensitive}\} \subseteq \{\text{privileged}\}$

x86 is not virtualizable via trap & emulate
• Example: popf (pop flags)
  • Can be used in user mode to change ALU flags
  • Can be used in privileged mode to change system state flags (e.g., the interrupt delivery flag)
• Trouble: no trap occurs if popf attempts to alter the interrupt flag in user mode; the CPU just ignores it! → a behavior-sensitive instruction that is not privileged
• There are about 17 such instructions in x86
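
The requirement that every sensitive instruction be privileged is a plain subset test, which can be sketched with Python sets. The instruction classification below is a hand-picked toy subset chosen for illustration, not a complete or authoritative listing for x86.

```python
def virtualizable(control_sensitive, behavior_sensitive, privileged):
    """Return (ok, offenders): ok is True when all sensitive instructions trap."""
    sensitive = control_sensitive | behavior_sensitive
    offenders = sensitive - privileged   # sensitive but silently executed
    return not offenders, offenders


# Toy classification (assumption for the example).
privileged = {"hlt", "invlpg", "mov_to_cr3"}
control_sensitive = {"hlt", "invlpg", "mov_to_cr3"}
behavior_sensitive = {"popf"}   # behaves differently in user vs. privileged mode

ok, offenders = virtualizable(control_sensitive, behavior_sensitive, privileged)
# ok is False and offenders is {"popf"}: popf never traps, so the VMM never sees it
```

If popf were added to the privileged set, the same check would pass, which is the effect that binary translation, paravirtualization, and hardware assist each achieve by different means.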

4. x86-32 workaround in VMware Workstation
• Idea: direct execution + dynamic binary translation
  • Direct execution of most user-level instructions
  • Dynamic binary translation for most system code
• Dynamically decide whether to use direct execution or binary translation
[Figure omitted. Picture credit: Sorav Bansal/Scott Devine]

4. Workaround for x86-32: Paravirtualization
• Co-design of the guest OS and the hypervisor
• Advantage: simplicity, works around corner cases
• Disadvantage: needs a modified guest OS
• Example: Xen hypervisor
• Trick to virtualize x86-32:
  • Block all 17 instructions that are sensitive but not privileged
  • Just modify the guest OS to #undef them
  • Instead, provide hypercalls from the guest OS to the VMM
  • Hypercalls are like system calls, but from the guest OS to the VMM
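
A hypercall interface can be sketched as a dispatch table in the VMM, much like a system call table in an OS. The class ToyVMM and the hypercall name "set_interrupt_flag" are made up for this example and do not correspond to Xen's actual hypercall numbers or ABI.

```python
class ToyVMM:
    """Dispatches named hypercalls from guests to handler functions."""

    def __init__(self):
        self.virtual_if = {}   # per-VM virtual interrupt flag
        self.handlers = {"set_interrupt_flag": self._set_interrupt_flag}

    def hypercall(self, vm_id, name, *args):
        handler = self.handlers.get(name)
        if handler is None:
            raise ValueError(f"unknown hypercall: {name}")
        return handler(vm_id, *args)

    def _set_interrupt_flag(self, vm_id, enabled):
        # Only the calling VM's virtual flag changes; the real flag is untouched.
        self.virtual_if[vm_id] = bool(enabled)
        return 0


vmm = ToyVMM()
# A modified guest OS calls this instead of using popf to disable interrupts.
status = vmm.hypercall(1, "set_interrupt_flag", False)
```

The guest asks explicitly for the effect it wants, so the VMM no longer depends on the problematic instruction trapping.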

5. Hardware-assisted virtualization: Intel VT-x/AMD-V for x86-64
• Two challenges of virtualizing the x86-64 ISA:
  • How to hide system/privileged state from the VM?
  • How to ensure a VM cannot directly change the system state (e.g., interrupt flags) of the processor?

Hardware-assisted virtualization: Intel VT-x/AMD-V for x86-64
• Solution idea:
  • Two new modes of operation: root and non-root
  • Each mode has a complete set of execution rings (0-3)
  • New instructions to switch between modes
  • Hardware state is duplicated for each operation mode
  • The hypervisor runs in root mode and the VMs run in non-root mode
  • When a "sensitive" instruction is executed in non-root mode, it is either (1) executed by the processor on the duplicated state or (2) trapped to the hypervisor

Hardware-assisted virtualization: Intel VT-x/AMD-V for x86-64
[Figure: in non-root mode, VM 0 and VM 1 each have rings 0-3, with the guest OS in ring 0 and the guest app in ring 3. In root mode, the hypervisor/VMM runs in ring 0 and a (legacy/un-virtualized) user program runs in ring 3.]

Using Intel VT-x/AMD-V (the KVM way)
1. The VMM initializes/modifies the VMCS
2. The VMM executes the vmlaunch/vmresume instruction
3. The VM runs until it hits a vmexit condition
4. The VMM analyzes the vmexit and takes action
[Figure: the guest app (ring 3) and guest OS (ring 0) run in non-root mode; the VMM runs in root mode, ring 0. vmlaunch/vmresume enters the guest, vmexit returns to the VMM, and the VMM accesses the VMCS in memory with vmread/vmwrite.]
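
The four steps form a loop, which can be sketched as a small simulation. Nothing here touches real VT-x: ToyVMCS, vm_enter, run_vm, and the set of exiting operations are assumptions for illustration, and the guest "program" is just a list of mnemonics that must end in hlt.

```python
from dataclasses import dataclass

# Illustrative set of guest operations that force an exit in this toy model.
EXITING_OPS = {"cpuid", "out", "hlt"}


@dataclass
class ToyVMCS:
    guest_rip: int = 0       # where the guest resumes
    exit_reason: str = ""    # filled in on every exit


def vm_enter(vmcs, guest_program):
    """Stand-in for vmlaunch/vmresume: run the guest until an exit condition."""
    while True:
        op = guest_program[vmcs.guest_rip]
        if op in EXITING_OPS:
            vmcs.exit_reason = op    # record why the guest stopped
            return
        vmcs.guest_rip += 1          # innocuous instruction, no VMM involvement


def run_vm(guest_program):
    vmcs = ToyVMCS()                     # step 1: set up the control structure
    handled = []
    while True:
        vm_enter(vmcs, guest_program)    # steps 2 and 3: enter, run until exit
        reason = vmcs.exit_reason        # step 4: inspect the exit (like vmread)
        handled.append(reason)
        if reason == "hlt":
            return handled
        vmcs.guest_rip += 1              # skip the emulated op (like vmwrite)


handled = run_vm(["add", "cpuid", "add", "out", "hlt"])
```

Only the exiting operations reach the VMM; the add instructions run without any exit.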

Current status of hardware assist
• Almost all VMMs make use of hardware assist today
• KVM is an integrated part of Linux and makes heavy use of hardware assist
• Even Xen and VMware Workstation use it

Virtualizing memory

Recap: Virtual memory in an un-virtualized system
[Figure: process A's and process B's virtual address spaces are each mapped through their own page table, managed by the OS, onto physical memory.]

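Recap: Performing address translation in an un-virtualized machine
[Figure: the virtual address (VA) is looked up in the TLB. On a hit, the physical address (PA) is returned. On a miss, the page table walker (PTW) walks the page table rooted at CR3, taking up to 4 memory accesses.]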

Requirement for memory virtualization
• No direct access to physical memory from the VM
• Only the hypervisor should manage physical memory
  • Why?
• But fool the guest OS into thinking that it is accessing physical memory

Virtualizing virtual memory
[Figure: in Virtual Machine 1 and Virtual Machine 2, App 1 and App 2 are mapped through guest page tables, managed by the guest OS, onto guest physical memory. Guest physical memory is mapped through nested page tables, managed by the hypervisor, onto system (real) physical memory.]

Virtualizing virtual memory
[Figure: (1) the guest page table maps a guest virtual address (gVA) to a guest physical address (gPA); (2) the nested page table maps the gPA to a system physical address (sPA).]
• Two levels of address translation on each memory access by an application running inside a VM
Based on "Efficient Memory Virtualization", Gandhi et al., MICRO'14
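
The two translation steps can be sketched with flat, single-level tables. Real page tables are multi-level radix trees; the dictionaries, the 4 KiB page size, and the specific mappings below are assumptions made to keep the example short.

```python
PAGE_SIZE = 4096


def translate(page_table, addr):
    """Map an address through a flat {page number: frame number} table."""
    page, offset = divmod(addr, PAGE_SIZE)
    return page_table[page] * PAGE_SIZE + offset


guest_page_table = {0: 7, 1: 3}       # gVA page -> gPA page (managed by guest OS)
nested_page_table = {3: 42, 7: 19}    # gPA page -> sPA page (managed by hypervisor)


def gva_to_spa(gva):
    gpa = translate(guest_page_table, gva)      # step 1: gVA -> gPA
    return translate(nested_page_table, gpa)    # step 2: gPA -> sPA


gva = 1 * PAGE_SIZE + 0x10
spa = gva_to_spa(gva)   # gVA page 1 -> gPA page 3 -> sPA page 42, offset kept
```

The page offset passes through both steps unchanged; only the page number is remapped twice.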

Implementing two-level address translation
1. Shadow page table
  • Software-only technique
2. Nested/extended page table
  • Hardware support

Shadow page table
• Idea: let the hypervisor "create" a shadow page table that maps guest VA to system PA directly
  • Built by combining the guest page table with the system (nested) page table
• The hypervisor makes cr3 point to the shadow page table

Shadow page table
[Figure: the guest page table maps gVA to gPA (1) and the nested page table maps gPA to sPA (2); the shadow page table maps gVA directly to sPA.]

Challenge of shadow page table
• How to create a shadow page table?
• Any time the guest OS modifies the guest page table, the hypervisor needs to update the shadow page table
• Solution: write-protect the guest page table
  • Any write access to the guest page table generates a page fault → trap to the hypervisor
• Drawback: many page faults for applications that alter page tables
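
Keeping the shadow page table in sync can be sketched as follows. The write-protection fault is modeled as a method call that the guest's page table write is forced through; ShadowPager and its fields are hypothetical names, and the page tables are flat dictionaries rather than multi-level structures.

```python
class ShadowPager:
    """Toy hypervisor state for one guest address space."""

    def __init__(self, nested_pt):
        self.nested_pt = nested_pt   # gPA page -> sPA page (hypervisor-owned)
        self.guest_pt = {}           # gVA page -> gPA page (write-protected)
        self.shadow_pt = {}          # gVA page -> sPA page (what cr3 points to)
        self.traps = 0

    def guest_write_pte(self, gva_page, gpa_page):
        """A guest write to its page table faults and lands here."""
        self.traps += 1
        self.guest_pt[gva_page] = gpa_page                    # emulate the write
        self.shadow_pt[gva_page] = self.nested_pt[gpa_page]   # resync the shadow


pager = ShadowPager(nested_pt={3: 42, 7: 19})
pager.guest_write_pte(0, 7)   # guest maps gVA page 0 -> gPA page 7
pager.guest_write_pte(1, 3)   # guest maps gVA page 1 -> gPA page 3
# pager.shadow_pt now maps gVA pages straight to sPA pages: {0: 19, 1: 42}
```

Each guest page table update costs one trap, which is the drawback noted above for applications that modify page tables often.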

Challenge of shadow page table
• There is one shadow page table for every guest application
• Every time the guest context-switches between applications, it traps to the hypervisor, which changes cr3 to point to the new shadow page table

Nested/Extended page table
• Idea: make the hardware aware of the two levels of address translation and of both the guest and nested page tables
• Make the hardware page walker walk both the guest page table and the nested page table on a TLB miss → two-dimensional page walk
• Two cr3s: gCR3 → gPT; nCR3 → nPT
• Eliminates the need for a shadow page table

Two-dimensional page walk in hardware
[Figure: two-dimensional page walk from gVA to sPA, starting at gcr3 and crossing guest physical memory and system physical memory.]
• Maximum possible memory accesses for one two-dimensional page walk: 5 + 5 + 5 + 5 + 4 = 24
Based on "Efficient Memory Virtualization", Gandhi et al., MICRO'14
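
The count of 24 can be reproduced with a short calculation, assuming 4-level guest and nested page tables and no TLB or page-walk-cache hits. Each guest level needs a full nested walk to locate the guest page table page plus one access to read the guest entry, and the final guest physical address needs one more nested walk. The function names are made up for the example.

```python
def native_walk_accesses(levels=4):
    """Memory accesses for a page walk on an un-virtualized machine."""
    return levels


def two_dimensional_walk_accesses(guest_levels=4, nested_levels=4):
    """Worst-case memory accesses for a two-dimensional page walk."""
    per_guest_level = nested_levels + 1   # nested walk + read of the guest entry
    final_translation = nested_levels     # nested walk for the final gPA
    return guest_levels * per_guest_level + final_translation


accesses = two_dimensional_walk_accesses()   # 4 * 5 + 4 = 24
```

With n guest levels and m nested levels this is n*m + n + m accesses, compared with n on bare metal.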

Two-dimensional page walk
[Figure omitted.]
Based on "Accelerating Two-Dimensional Page Walks for Virtualized Systems", ASPLOS'08

Advantages and challenges of two-dimensional page walks
• Advantages:
  • No shadow page table per guest application
  • No trap when the guest page table is updated
• Challenge:
  • Two-dimensional page walks are costly: up to 24 memory accesses!

Comparison of shadow page tables and two-dimensional page walks
• Shadow page tables have a lower address translation cost but a high cost (traps) whenever a page table is modified
• Two-dimensional page walks generally have a high TLB miss latency but work well when page tables are modified often

Comparison of shadow page tables and two-dimensional page walks
• Can you get the best of both worlds?
  • Agile Paging [ISCA 2016]

Paging and memory management under virtualization
