VMMs / Hypervisors

Intel Corporation, 21 July 2008

Agenda
• Xen internals
  • High level architecture
  • Paravirtualization
  • HVM
• Others
  • KVM
  • VMware
  • OpenVZ

Xen Overview

Xen Project bio
• The Xen project was created in 2003 at the University of Cambridge Computer Laboratory, in what is known as the Xen Hypervisor project.
• It was led by Ian Pratt, with team members Keir Fraser, Steven Hand, and Christian Limpach.
• This team, along with Silicon Valley technology entrepreneurs Nick Gault and Simon Crosby, founded XenSource, which was acquired by Citrix Systems in October 2007.
• The Xen® hypervisor is an open source technology, developed collaboratively by the Xen community and engineers from AMD, Cisco, Dell, HP, IBM, Intel, Mellanox, Network Appliance, Novell, Red Hat, SGI, Sun, Unisys, Veritas, Voltaire, and Citrix.
• Xen is licensed under the GNU General Public License.
• Xen supports Linux 2.4, 2.6, Windows and NetBSD 2.0.

Xen Components
• A Xen virtual environment consists of several modules that together provide the virtualization environment:
  • Xen Hypervisor (VMM)
  • Domain 0
  • Domain Management and Control
  • Domain U (user domain), which can be one of:
    • Paravirtualized Guest: the kernel is aware of virtualization
    • Hardware Virtual Machine Guest: the kernel runs natively
[Figure: the Hypervisor (VMM) as the base layer, with Domain 0 (including Domain Management and Control) and several Domain U Paravirtual and Domain U HVM guests on top.]

Xen Hypervisor - VMM
• The hypervisor is Xen itself.
• It sits between the hardware and the operating systems of the various domains.
• The hypervisor is responsible for:
  • Checking page tables
  • Allocating resources for new domains
  • Scheduling domains
  • Booting the machine far enough that it can start dom0
• It presents the domains with a virtual machine that looks similar, but not identical, to the native architecture.
• Just as applications interact with an OS by issuing syscalls, domains interact with the hypervisor by issuing hypercalls. The hypervisor responds by sending the domain an event, which fulfills the same function as an IRQ on real hardware.
• A hypercall is to a hypervisor what a syscall is to a kernel.

Restricting operations with Privilege Rings
• The hypervisor executes privileged instructions, so it must run at the right privilege level:
  • The x86 architecture provides 4 privilege levels (rings)
  • Most OSs were created before this implementation, so they use only 2 levels
• Xen provides 2 modes:
  • On x86, applications run at ring 3, the guest kernel at ring 1 and Xen at ring 0
  • On x86 with VT-x, applications run at ring 3, the guest kernel at non-root ring 0 and Xen at root ring 0 (informally "ring -1")
[Figure: ring usage for Native x86, Paravirtual x86 and HVM x86. Applications stay at ring 3 in all cases. In the paravirtual case the guest kernel (dom0 and dom U) is moved to ring 1 and the hypervisor runs at ring 0. In the HVM case the guest kernel stays at ring 0 and the hypervisor is moved to ring -1.]

Domain 0
• Domain 0 is a virtual machine required by Xen. It runs a modified Linux kernel with special rights to:
  • Access physical I/O devices
    • Two drivers are included in Domain 0 to serve requests from Domain U PV or HVM guests
  • Interact with the other virtual machines (Domain U)
• It provides the command line interface for the Xen daemons
• Because of its importance, it should provide only the minimum functionality and be properly secured
• Some Domain 0 responsibilities can be delegated to a Domain U (isolated driver domain)
Domain 0 components:
• PV, Network backend driver: communicates directly with the local networking hardware to process all virtual machine requests
• PV, Block backend driver: communicates with the local storage disk to read and write data from the drive based on Domain U requests
• HVM, Qemu-DM: supports HVM guests for networking and disk access requests

Domain Management and Control - Daemons
• Domain Management and Control is composed of Linux daemons and tools:
  • Xm
    • Command line tool that passes user input to Xend through XML-RPC
  • Xend
    • Python application that is considered the system manager for the Xen environment
  • Libxenctrl
    • A C library that allows Xend to talk with the Xen hypervisor via Domain 0 (the privcmd driver delivers the request to the hypervisor)
  • Xenstored
    • Maintains a registry of information, including memory and event channel links between Domain 0 and all other domains
  • Qemu-dm
    • Supports HVM guests for networking and disk access requests

Domain U – Paravirtualized guests
• The Domain U PV Guest is a modified Linux, Solaris, FreeBSD or other UNIX system that is aware of virtualization
• It has no rights to directly access hardware resources, unless specially granted
• It accesses hardware through front-end drivers, using the split device driver model
• It usually contains XenStore, console, network and block device drivers
• There can be multiple Domain U guests in a Xen configuration
Domain U - PV components:
• Console driver
• XenStore driver: similar to a registry
• Network front-end driver: communicates with the Network backend driver in Domain 0
• Block front-end driver: communicates with the Block backend driver in Domain 0

Domain U – HVM guests
• The Domain U HVM Guest is a native OS with no notion of virtualization (it is unaware that it shares CPU time or that other VMs are running)
• An unmodified OS does not support the Xen split device drivers, so Xen emulates devices by borrowing code from QEMU
• HVM guests begin in real mode and get configuration information from an emulated BIOS
• For an HVM guest to use Xen features, it must use CPUID and then access the hypercall page
Domain U - HVM components:
• Xen virtual firmware: simulates the BIOS for the unmodified operating system to read during startup

Pseudo-Physical to Memory Model
• In an operating system with protected memory, each application has its own address space. A hypervisor has to do something similar for guest operating systems.
• The triple indirection model is not strictly required, but it is more convenient in terms of performance and of the modifications needed in the guest kernel.
• If the guest kernel needs to know anything about the machine pages, it has to use the translation table provided through the shared info page (rare)
[Figure: three address-space layers. The application sees virtual addresses, the kernel sees pseudo-physical addresses, and the hypervisor sees machine addresses.]
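The three layers can be illustrated with a toy translation. This is only a sketch under stated assumptions, not Xen code: the tables `guest_page_table` and `p2m`, the four-page guest, and all frame numbers are hypothetical values invented for the example.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT     12u                      /* 4 KiB pages, as on x86 */
#define PAGE_MASK      ((1u << PAGE_SHIFT) - 1u)
#define NR_GUEST_PAGES 4u                       /* hypothetical tiny guest */

/* Hypothetical guest page table: virtual page -> pseudo-physical frame. */
static const uint32_t guest_page_table[NR_GUEST_PAGES] = { 2, 0, 3, 1 };

/* Hypothetical P2M table: pseudo-physical frame -> machine frame.
 * The guest sees frames 0..3 as contiguous; the machine frames are not. */
static const uint32_t p2m[NR_GUEST_PAGES] = { 0x1a40, 0x0093, 0x2b07, 0x1a41 };

/* Application view -> kernel view. */
static uint32_t virt_to_pseudo_phys(uint32_t vaddr)
{
    uint32_t vpn = vaddr >> PAGE_SHIFT;
    assert(vpn < NR_GUEST_PAGES);
    return (guest_page_table[vpn] << PAGE_SHIFT) | (vaddr & PAGE_MASK);
}

/* Kernel view -> hypervisor view. */
static uint32_t pseudo_phys_to_machine(uint32_t paddr)
{
    uint32_t pfn = paddr >> PAGE_SHIFT;
    assert(pfn < NR_GUEST_PAGES);
    return (p2m[pfn] << PAGE_SHIFT) | (paddr & PAGE_MASK);
}

/* Full triple indirection: virtual -> pseudo-physical -> machine. */
static uint32_t virt_to_machine(uint32_t vaddr)
{
    return pseudo_phys_to_machine(virt_to_pseudo_phys(vaddr));
}
```

With these made-up tables, virtual address 0x0010 lies in virtual page 0, which maps to pseudo-physical frame 2 and then to machine frame 0x2b07; the page offset is preserved at every step.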

Pseudo-Physical to Memory Model • There are variables at various places in the code identified as MFN, PFN, GMFN and GPFN

Virtual Ethernet interfaces
• Xen creates, by default, seven pairs of "connected virtual ethernet interfaces" for use by dom0
• For each new domU, it creates a new pair of "connected virtual ethernet interfaces", with one end in domU and the other in dom0
• Virtualized network interfaces in domains are given Ethernet MAC addresses (by default xend selects a random address)
• The default Xen configuration uses bridging (xenbr0) within domain 0 to allow all domains to appear on the network as individual hosts
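Random MAC selection can be sketched as follows. This is an illustrative approximation, not xend's code: it assumes the address is drawn under the Xen OUI prefix 00:16:3e and that the fourth octet is limited to 0x00..0x7f; the function names `random_xen_mac` and `format_mac` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Fill mac[6] with a random address under the Xen OUI 00:16:3e.
 * Restricting the fourth octet to 0x00..0x7f is an assumption of this sketch. */
static void random_xen_mac(uint8_t mac[6])
{
    assert(mac != NULL);
    mac[0] = 0x00;
    mac[1] = 0x16;
    mac[2] = 0x3e;
    mac[3] = (uint8_t)(rand() & 0x7f);
    mac[4] = (uint8_t)(rand() & 0xff);
    mac[5] = (uint8_t)(rand() & 0xff);
}

/* Format the address as "aa:bb:cc:dd:ee:ff"; buf must hold 18 bytes. */
static void format_mac(const uint8_t mac[6], char buf[18])
{
    assert(mac != NULL && buf != NULL);
    snprintf(buf, 18, "%02x:%02x:%02x:%02x:%02x:%02x",
             mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
}
```

A caller would seed the generator once with `srand`, call `random_xen_mac`, and pass the formatted string to whatever configures the virtual interface. A fixed address in the domain configuration avoids the address changing between boots.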

The Virtual Machine lifecycle
• Xen provides 3 mechanisms to boot a VM:
  • Booting from scratch (Turn on)
  • Restoring the VM from a previously saved state (Wake)
  • Cloning a running VM (only in XenServer)
[Figure: VM state diagram with the states OFF, RUNNING, PAUSED and SUSPENDED and the transitions Turn on, Turn off, Start (paused), Pause, Resume, Stop, Sleep, Wake and Migrate.]

A project: provide VMs for instantaneous/isolated execution
• Goal: determine a mechanism for instantaneous execution of applications in sandboxed VMs
• Approach:
  • Analyze current capabilities in Xen
  • Implement a prototype that addresses the specified goal: VM-Pool
• Technical specification of HW and SW used:
  • Intel® Core™ Duo T2400 @ 1.83GHz 1828 MHz
  • Motherboard Properties
    • Motherboard ID: <DMI>
    • Motherboard Name: LENOVO 1952D89
  • 2048 MB RAM
  • Software:
    • Linux Fedora Core 8 Kernel 2.6.3.18
    • Xen 3.1
    • For the Windows images, Windows XP SP2

Analyzing Xen spawning mechanisms
• Booting from scratch
  [Charts: HVM WinXP varying the #CPU; PV Fedora 8 varying the #CPU]
• Restoring from a saved state
  [Charts: HVM WinXP 4GB disk / 1CPU; PV Fedora 8 varying the #CPU]
• Cloning a running VM
  [Chart: HVM WinXP 4GB disk / 1CPU]

Dynamic Spawning with a VM-Pool
• The idea is to keep a pool of virtual machines already booted and ready for execution, but in a "paused" state
• These virtual machines keep their RAM, but they do not use processor time, interrupts or other resources
• A simple interface is defined:
  • get: retrieves and unpauses a virtual machine from the pool
  • release: gives a virtual machine back to the pool and restarts the VM
• High level description:
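The get/release interface can be sketched as a small in-memory model. This is a hypothetical illustration, not the prototype's code: booting, pausing and restarting are only simulated by state changes, and the names `vm_pool`, `pool_init`, `pool_get`, `pool_release` and `POOL_SIZE` are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SIZE 4   /* hypothetical pool size */

enum vm_state { VM_PAUSED, VM_RUNNING };

struct vm {
    int id;
    enum vm_state state;
    int in_pool;              /* 1 while the VM is available in the pool */
};

struct vm_pool {
    struct vm vms[POOL_SIZE];
};

/* Boot every VM once and leave it paused (booting is only simulated). */
static void pool_init(struct vm_pool *pool)
{
    for (int i = 0; i < POOL_SIZE; i++) {
        pool->vms[i].id = i + 1;
        pool->vms[i].state = VM_PAUSED;
        pool->vms[i].in_pool = 1;
    }
}

/* get: take a paused VM out of the pool and unpause it; NULL if none is left. */
static struct vm *pool_get(struct vm_pool *pool)
{
    for (int i = 0; i < POOL_SIZE; i++) {
        struct vm *vm = &pool->vms[i];
        if (vm->in_pool) {
            vm->in_pool = 0;
            vm->state = VM_RUNNING;   /* stands in for unpausing the domain */
            return vm;
        }
    }
    return NULL;
}

/* release: restart the VM to a clean state and park it, paused, in the pool. */
static void pool_release(struct vm_pool *pool, struct vm *vm)
{
    assert(vm >= pool->vms && vm < pool->vms + POOL_SIZE);
    assert(!vm->in_pool);
    vm->state = VM_PAUSED;            /* stands in for restart followed by pause */
    vm->in_pool = 1;
}
```

The point of the design is that the expensive step (booting) happens before a request arrives, so `get` only has to unpause a domain.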

Results with the VM-Pool
[Chart: "VMPool Initialization Time" in seconds (axis from 0 to 300) by VM booting mode: From scratch vs. Resume.]
• The VM is ready to run in less than half a second (~350 milliseconds)
• The preferred spawning method is resuming, although it uses additional disk storage

Virtual Machines Scheduling
• The hypervisor is responsible for ensuring that every running guest receives some CPU time.
• Most used scheduling mechanisms in Xen:
  • Simple Earliest Deadline First, SEDF (being deprecated):
    • Each domain runs for an n ms slice every m ms (n and m are configured per domain)
  • Credit Scheduler:
    • Each domain has two properties: a cap and a weight
    • Weight: determines the share of the physical CPU time that the domain gets; weights are relative to each other
    • Cap: represents the maximum; it is an absolute value
    • Work-conserving by default: if no other VM needs the CPU, the running one is given more time to execute
    • Uses a fixed-size 30 ms quantum, and ticks every 10 ms
• Xen provides a simple abstract interface to schedulers:
    struct scheduler {
        char *name;             /* full name for this scheduler */
        char *opt_name;         /* option name for this scheduler */
        unsigned int sched_id;  /* ID for this scheduler */
        void (*init) (void);
        int (*init_domain) (struct domain *);
        void (*destroy_domain) (struct domain *);
        int (*init_vcpu) (struct vcpu *);
        void (*destroy_vcpu) (struct vcpu *);
        void (*sleep) (struct vcpu *);
        void (*wake) (struct vcpu *);
        struct task_slice (*do_schedule) (s_time_t);
        int (*pick_cpu) (struct vcpu *);
        int (*adjust) (struct domain *, struct xen_domctl_scheduler_op *);
        void (*dump_settings) (void);
        void (*dump_cpu_state) (int);
    };
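How weight and cap interact can be sketched with a small calculation. This is a simplified model, not the credit scheduler's algorithm: it assumes a single physical CPU, that every domain is always runnable, that the cap is a percentage of one CPU, and that a cap of 0 means "no cap". The names `dom_params` and `credit_shares` are hypothetical.

```c
#include <assert.h>

#define MAX_DOMS 8

struct dom_params {
    unsigned weight;  /* relative share, e.g. 256 */
    unsigned cap;     /* maximum percent of one CPU; 0 = uncapped (assumption) */
};

/* Compute the percent of CPU each domain receives. Time is split by weight;
 * a domain that would exceed its cap is pinned to the cap, and the time it
 * cannot use is redistributed to the others (work-conserving). */
static void credit_shares(const struct dom_params *d, int n, double *share)
{
    int capped[MAX_DOMS] = {0};
    double remaining = 100.0;

    assert(n > 0 && n <= MAX_DOMS);
    for (int i = 0; i < n; i++)
        share[i] = 0.0;

    for (;;) {
        unsigned total_weight = 0;
        int changed = 0;

        for (int i = 0; i < n; i++)
            if (!capped[i])
                total_weight += d[i].weight;
        if (total_weight == 0)
            break;

        /* Split what is left among the uncapped domains by weight. */
        for (int i = 0; i < n; i++)
            if (!capped[i])
                share[i] = remaining * d[i].weight / total_weight;

        /* Pin every domain that exceeds its cap, then redistribute. */
        for (int i = 0; i < n; i++) {
            if (!capped[i] && d[i].cap != 0 && share[i] > d[i].cap) {
                share[i] = d[i].cap;
                capped[i] = 1;
                remaining -= d[i].cap;
                changed = 1;
            }
        }
        if (!changed)
            break;
    }
}
```

Under these assumptions, two domains with equal weight get 50% each; if one of them is capped at 20, it gets 20% and the other gets the remaining 80%.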

Xen Para-Virtual functionality

Paravirtualized architecture
• We'll review the PV mechanisms that support this architecture:
  • Kernel initialization
  • Hypercall creation
  • Event channels
  • XenStore (a kind of registry)
  • Memory transfers between VMs
  • Split device drivers
[Figure: a Paravirtual Guest with a frontend device driver and Domain 0 with a backend device driver and the real device driver, connected through shared ring buffers, on top of the Hypervisor and the Hardware (block devices).]

Initial information for booting a PV OS
• The first things an OS needs to know when it boots:
  • Available RAM, connected peripherals, access to the machine clock.
• An OS running in a PV Xen environment does not have access to real firmware
  • The required information is provided by the shared info pages.
  • The "domain builder" is in charge of mapping the shared info pages into the guest's address space prior to boot.
  • Example: launching dom0 on an i386 architecture:
    • Refer to the function construct_dom0 in xen/arch/x86/domain_build.c
• The shared info pages do not completely replace a BIOS
  • The console device is available via the start info page for debugging purposes; debugging output from the kernel should be available as early as possible.
  • Other devices must be found using the XenStore

The start info page
• The start info page is loaded into the guest's address space at boot time. The way its address is passed is architecture-dependent; x86 uses the ESI register.
• The content of this page is defined by the C structure start_info, which is declared in xen/include/public/xen.h
• A portion of the fields in the start info page are always available to the guest domain and are updated every time the virtual machine is resumed, because some of them contain machine addresses (which are subject to change)

start_info structure overview
    struct start_info {
        /* THE FOLLOWING ARE FILLED IN BOTH ON INITIAL BOOT AND ON RESUME. */
        char magic[32];             /* "xen-<version>-<platform>". */
        unsigned long nr_pages;     /* Total pages allocated to this domain. */
        unsigned long shared_info;  /* MACHINE address of shared info struct. */
        uint32_t flags;             /* SIF_xxx flags. */
        xen_pfn_t store_mfn;        /* MACHINE page number of shared page. */
        uint32_t store_evtchn;      /* Event channel for store communication. */
        union {
            struct {
                xen_pfn_t mfn;      /* MACHINE page number of console page. */
                uint32_t evtchn;    /* Event channel for console page. */
            } domU;
            struct {
                uint32_t info_off;  /* Offset of console_info struct. */
                uint32_t info_size; /* Size of console_info struct from start. */
            } dom0;
        } console;
        /* THE FOLLOWING ARE ONLY FILLED IN ON INITIAL BOOT (NOT RESUME). */
        unsigned long pt_base;      /* VIRTUAL address of page directory. */
        unsigned long nr_pt_frames; /* Number of bootstrap p.t. frames. */
        unsigned long mfn_list;     /* VIRTUAL address of page-frame list. */
        unsigned long mod_start;    /* VIRTUAL address of pre-loaded module. */
        unsigned long mod_len;      /* Size (bytes) of pre-loaded module. */
        int8_t cmd_line[MAX_GUEST_CMDLINE];
    };
    typedef struct start_info start_info_t;

start_info fields
    char magic[32];  /* "xen-<version>-<platform>" */
• The magic string is the first thing the guest domain must check in its start info page.
• If the magic string does not start with "xen-", something is seriously wrong and the best thing to do is abort.
• The major and minor versions should also be checked, to determine whether the guest kernel has been tested on this Xen version.
    unsigned long nr_pages;  /* Total pages allocated to this domain. */
• The amount of available RAM is determined by this field. It contains the number of memory pages available to the domain.
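The magic check described above can be sketched as follows. This is a minimal illustration, assuming the version part has the form `<major>.<minor>` (for example "xen-3.0-x86_32p"); the function name `parse_xen_magic` and the constant `MAGIC_LEN` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define MAGIC_LEN 32   /* size of the magic field in start_info */

/* Parse a magic string of the assumed form "xen-<major>.<minor>-<platform>".
 * Returns 0 and fills major/minor on success, -1 if the string does not
 * look like a Xen magic string (in which case the guest should abort). */
static int parse_xen_magic(const char *magic, int *major, int *minor)
{
    char buf[MAGIC_LEN + 1];
    size_t i;

    assert(magic != NULL && major != NULL && minor != NULL);

    /* Copy at most MAGIC_LEN bytes so the local copy is always terminated. */
    for (i = 0; i < MAGIC_LEN && magic[i] != '\0'; i++)
        buf[i] = magic[i];
    buf[i] = '\0';

    if (strncmp(buf, "xen-", 4) != 0)
        return -1;
    if (sscanf(buf + 4, "%d.%d", major, minor) != 2)
        return -1;
    return 0;
}
```

A guest would call this on `start_info->magic` before touching any other field, and compare the returned version against the versions it was tested with.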

start_info fields (2)
    unsigned long shared_info;  /* MACHINE address of shared info struct. */
• Contains the address of the machine page where the shared info structure is located. The guest kernel should map it to retrieve useful information for its initialization process.
    uint32_t flags;  /* SIF_xxx flags. */
• Contains any optional settings for this domain (defined in xen.h)
  • SIF_PRIVILEGED, SIF_INITDOMAIN
    xen_pfn_t store_mfn;  /* MACHINE page number of shared page. */
• Machine page number of the shared memory page used for communication with the XenStore.
    uint32_t store_evtchn;  /* Event channel for store communication. */
• Event channel used for XenStore notifications.

start_info fields (3)
    union {
        struct {
            xen_pfn_t mfn;      /* MACHINE page number of console page. */
            uint32_t evtchn;    /* Event channel for console page. */
        } domU;
        struct {
            uint32_t info_off;  /* Offset of console_info struct. */
            uint32_t info_size; /* Size of console_info struct from start. */
        } dom0;
    } console;
• Domain 0 guests use the dom0 part, which contains the memory offset and size of the structure used to define the Xen console.
• Unprivileged domains use the domU part of the union. Its fields identify the shared memory page and the event channel used by the console device.

The shared info page
• The shared info page contains information that is dynamically updated as the system runs.
• It is explicitly mapped by the guest.
• The content of this page is defined by the C structure shared_info, which is declared in xen/include/public/xen.h

shared_info fields
    struct vcpu_info vcpu_info[MAX_VIRT_CPUS];
• This array contains one entry per virtual CPU assigned to the domain. Each array element is a vcpu_info structure containing CPU-specific information:
  • Each virtual CPU has 3 flags relating to virtual interrupts (asynchronously delivered events).
    • uint8_t evtchn_upcall_pending: used by Xen to notify the running system that there are upcalls currently waiting for delivery on this virtual CPU.
    • uint8_t evtchn_upcall_mask: the mask for the previous field. When set, it prevents any upcalls from being delivered to the running virtual CPU.
    • unsigned long evtchn_pending_sel: indicates where an event is waiting. The event bitmap is an array of machine words, and each bit of this value indicates which word in the evtchn_pending field of the parent structure contains the raised event.
  • arch
    • Architecture-specific information.
    • On x86, this includes the virtual CR2 register. The real CR2 contains the linear address of the last page fault but can only be read from ring 0, so the hypervisor's page fault handler copies it here automatically before raising the event with the guest domain.
  • time
    • This field, along with a number of fields sharing the wc_ (wall clock) prefix, is used to implement timekeeping in paravirtualized Xen guests.

shared_info fields (2)
    unsigned long evtchn_pending[sizeof(unsigned long) * 8];
• This is a bitmap that indicates which event channels have events waiting. The array has as many words as a word has bits, which gives 1024 event channels on 32-bit systems (32 words of 32 bits) and 4096 on 64-bit systems (64 words of 64 bits).
• Bits are set by the hypervisor and cleared by the guest.
    unsigned long evtchn_mask[sizeof(unsigned long) * 8];
• This bitmap determines whether an event on a particular channel should be delivered asynchronously
• Every time an event is generated, the corresponding bit in evtchn_pending is set to 1. If the corresponding bit in evtchn_mask is 0, the hypervisor issues an upcall and delivers the event asynchronously. This allows the guest kernel to switch between interrupt-driven and polling mechanisms on a per-channel basis.
    struct arch_shared_info arch;
• On x86, the arch_shared_info structure contains two fields, max_pfn and pfn_to_mfn_frame_list_list, related to the pseudo-physical to machine memory mapping.
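The interplay of the pending bitmap, the mask bitmap and the per-VCPU selector can be sketched with a user-space model. This is a simplified, single-threaded illustration with hypothetical names (`toy_shared_info`, `toy_raise_event`, `toy_take_event`); the real structures are updated with atomic operations and have more fields.

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define NR_CHANNELS   (BITS_PER_LONG * BITS_PER_LONG)

/* Toy stand-in for the event-related parts of the shared info page. */
struct toy_shared_info {
    unsigned long pending[sizeof(unsigned long) * CHAR_BIT];
    unsigned long mask[sizeof(unsigned long) * CHAR_BIT];
    unsigned long pending_sel;      /* one bit per word of pending[] */
    unsigned char upcall_pending;
};

/* Hypervisor side: always set the pending bit; only an unmasked channel
 * also sets the selector bit and requests an upcall. */
static void toy_raise_event(struct toy_shared_info *s, unsigned port)
{
    unsigned word = (unsigned)(port / BITS_PER_LONG);
    unsigned bit = (unsigned)(port % BITS_PER_LONG);

    assert(port < NR_CHANNELS);
    s->pending[word] |= 1UL << bit;
    if (!(s->mask[word] & (1UL << bit))) {
        s->pending_sel |= 1UL << word;
        s->upcall_pending = 1;
    }
}

/* Guest side: return the lowest pending, unmasked port and clear its
 * pending bit, or -1 when nothing deliverable is left. */
static long toy_take_event(struct toy_shared_info *s)
{
    for (unsigned word = 0; word < BITS_PER_LONG; word++) {
        unsigned long ready;
        unsigned bit = 0;

        if (!(s->pending_sel & (1UL << word)))
            continue;                       /* first level: nothing in this word */
        ready = s->pending[word] & ~s->mask[word];
        if (ready == 0) {
            s->pending_sel &= ~(1UL << word);
            continue;
        }
        while (!(ready & (1UL << bit)))
            bit++;
        s->pending[word] &= ~(1UL << bit);  /* the guest clears the bit */
        return (long)(word * BITS_PER_LONG + bit);
    }
    s->upcall_pending = 0;
    return -1;
}
```

A masked channel still records its event in the pending bitmap, which is what lets a guest poll that channel instead of taking an upcall for it.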

An exercise: The simplest Xen kernel

The simplest Xen kernel
• Bootstrap
  • Each Xen guest kernel must start with a __xen_guest section for the bootloader, containing key-value pairs:
    • GUEST_OS: name of the running kernel
    • XEN_VER: the Xen version for which the guest was implemented
    • VIRT_BASE: the address in the guest's address space where this allocation is mapped (0 for kernels)
    • ELF_PADDR_OFFSET: value subtracted from addresses in ELF headers (0 for kernels)
    • HYPERCALL_PAGE: the page number where the hypercall trampolines will be placed
    • LOADER: special boot loaders (currently only generic is available)
  • After mapping everything into memory at the right places, Xen passes control to the guest kernel
  • A trampoline is defined at _start
    • It clears the direction flag, sets up the stack and calls the kernel start routine, passing the start info page address (received in the ESI register) as a parameter
  • A guest kernel is expected to set up handlers to receive events at boot time; otherwise the kernel cannot respond to the outside world (this step is skipped in the book's example)
• Kernel.c
  • The start_kernel routine takes the start info page as its parameter (passed through ESI)
  • The stack is reserved in this file, although it is also referenced in the bootstrap code to create the trampoline routine
  • If the hypervisor was compiled with debugging, HYPERVISOR_console_io sends the string to the console (otherwise the hypercall fails)
• Debug.h
  • The hypercall takes three arguments: the command (write), the length of the string and the string pointer
  • The hypercall number is 18 (xen/include/public/xen.h)

Hypercalls

Executing Privileged instructions from apps
Because guest kernels do not run at ring 0, they are not allowed to execute privileged instructions, so a mechanism is needed to execute them in the right ring. Suppose an application calls exit(0):
    push dword 0
    mov eax, 1
    push eax
    int 80h
[Figure: Native vs. Paravirtualized. Native: the application at ring 3 reaches the kernel at ring 0 through a system call. Paravirtualized: the application runs at ring 3, the kernel at ring 1 and the hypervisor, which has the interrupts table, at ring 0; the arrows are labeled System Call, Hypercall and Direct System Call (Xen specific).]

Replacing Privileged instructions with Hypercalls
• Unmodified guests use privileged instructions, which require a transition to ring 0 and incur a performance penalty when they have to be resolved by the hypervisor
• Paravirtual guests replace their privileged instructions with hypercalls
• Xen uses 2 mechanisms for hypercalls:
  • An int 82h is used as the channel, similar to system calls (deprecated after Xen 3.0)
  • Hypercalls are issued indirectly through the hypercall page provided when the guest is started
• For the second mechanism, macros are provided to write hypercalls
    #define _hypercall2(type, name, a1, a2) \
    ({ \
        long __res, __ign1, __ign2; \
        asm volatile ( \
            "call hypercall_page + ("STR(__HYPERVISOR_##name)" * 32)" \
            : "=a" (__res), "=b" (__ign1), "=c" (__ign2) \
            : "1" ((long)(a1)), "2" ((long)(a2)) \
            : "memory" ); \
        (type)__res; \
    })
• A PV Xen guest uses the HYPERVISOR_sched_op function with the SCHEDOP_yield argument instead of the privileged instruction HLT, in order to relinquish CPU time to guests with running tasks
    static inline int HYPERVISOR_sched_op(int cmd, void *arg)
    {
        return _hypercall2(int, sched_op, cmd, arg);
    }
• Defined in extras/mini-os/include/x86/x86_32/hypercall-x86_32.h, implemented in xen/common/schedule.c
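The `* 32` in the macro means that each hypercall has a 32-byte slot in the hypercall page, so the entry point is just the hypercall number times 32. The arithmetic can be sketched as below; the helper names and constants are hypothetical, and the number 18 for console_io is the one quoted earlier in these slides.

```c
#include <assert.h>
#include <stdint.h>

#define HYPERCALL_SLOT_SIZE      32u    /* bytes per trampoline, from the "* 32" in the macro */
#define PAGE_SIZE_BYTES          4096u  /* x86 page size */
#define HYPERVISOR_CONSOLE_IO_NR 18u    /* hypercall number quoted in the slides */

/* Byte offset of a hypercall's trampoline inside the hypercall page. */
static uint32_t hypercall_offset(uint32_t nr)
{
    /* A 4 KiB page holds 4096 / 32 = 128 slots. */
    assert(nr < PAGE_SIZE_BYTES / HYPERCALL_SLOT_SIZE);
    return nr * HYPERCALL_SLOT_SIZE;
}

/* Address to call, given where the hypercall page is mapped. */
static uintptr_t hypercall_entry(uintptr_t page_base, uint32_t nr)
{
    return page_base + hypercall_offset(nr);
}
```

For console_io this gives an offset of 18 * 32 = 576 bytes (0x240) from the start of the page.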

Event Channels

Event Channels
• Event channels are the basic primitive provided by Xen for event notifications; for paravirtualized OSs they are the equivalent of a hardware interrupt
• An event is one bit of information, signaled by a transition from 0 to 1
• Types of events:
  • Physical IRQs: mapped from real IRQs, used to communicate with hardware devices
  • Virtual IRQs: similar to PIRQs, but related to virtual devices such as the timer or the debug console
  • Interdomain events: bidirectional interrupts that notify domains about certain events
  • Intradomain events: a special case of interdomain events
[Figure: Domain 0 (with Domain Management and Control) and a Domain U Paravirtual Guest, each with an Event Channel driver, on top of the Hypervisor (VMM) and the Hardware.]

Event Channel Interface
• Guests configure the event channel and send interrupts by issuing a specific hypercall: HYPERVISOR_event_channel_op(...)
• Guests are notified about pending events through callbacks installed during initialization, using HYPERVISOR_set_callbacks(…); these events can be masked dynamically
[Figure: Domain 0 (with Domain Management and Control) and a Domain U Paravirtual Guest, each with an Event Channel driver, on top of the Hypervisor (VMM) and the Hardware; the arrows are labeled HYPERVISOR_event_channel_op and Callback.]

HYPERVISOR_event_channel_op – 1/2
• HYPERVISOR_event_channel_op(int cmd, void *arg) /* defined at xen-3.1.0-src\linux-2.6-xen-sparse\include\asm-i386\mach-xen\asm\hypercall.h */
• EVTCHNOP_alloc_unbound: Allocate a new event channel port, ready to be connected to by a remote domain
  • Specified domain must exist.
  • A free port must exist in that domain.
• EVTCHNOP_bind_interdomain: Bind an event channel for interdomain communications
  • Caller domain must have a free port to bind.
  • Remote domain must exist.
  • Remote port must be allocated and currently unbound.
  • Remote port must be expecting the caller domain as the remote.
• EVTCHNOP_bind_virq: Allocate a port and bind a VIRQ to it
  • Caller domain must have a free port to bind.
  • VIRQ must be valid.
  • VCPU must exist.
  • VIRQ must not currently be bound to an event channel.
• EVTCHNOP_bind_ipi: Allocate and bind a port for notifying other virtual CPUs
  • Caller domain must have a free port to bind.
  • VCPU must exist.
• EVTCHNOP_bind_pirq: Allocate and bind a port to a real IRQ
  • Caller domain must have a free port to bind.
  • PIRQ must be within the valid range.
  • Another binding for this PIRQ must not exist for this domain.

HYPERVISOR_event_channel_op – 2/2
• HYPERVISOR_event_channel_op(int cmd, void *arg) /* defined at xen-3.1.0-src\linux-2.6-xen-sparse\include\asm-i386\mach-xen\asm\hypercall.h */
• EVTCHNOP_close: Close an event channel (no more events will be received)
  • Port must be valid (currently allocated).
• EVTCHNOP_send: Send a notification on an event channel attached to a port
  • Port must be valid.
• EVTCHNOP_status: Query the status of a port: what kind of port it is, whether it is bound, what remote domain is expected, what PIRQ or VIRQ it is bound to, what VCPU will be notified, etc.
  • Unprivileged domains may only query the state of their own ports.
  • Privileged domains may query any port.

Issuing event channel hypercalls
• Structures defined at xen-3.1.0-src\xen\include\public\event_channel.h
• Hypervisor handlers defined at xen-3.1.0-src\xen\common\event_channel.c
• Allocating an unbound event channel
    evtchn_alloc_unbound_t op;
    op.dom = DOMID_SELF;
    op.remote_dom = remote_domain; /* an integer representing the domain */
    if (HYPERVISOR_event_channel_op(EVTCHNOP_alloc_unbound, &op) != 0)
    {
        /* Error */
    }
    /* On success, op.port holds the newly allocated port */
• Binding an event channel for interdomain communication
    evtchn_bind_interdomain_t op;
    op.remote_dom = remote_domain;
    op.remote_port = remote_port;
    if (HYPERVISOR_event_channel_op(EVTCHNOP_bind_interdomain, &op) != 0)
    {
        /* Error */
    }
    /* On success, op.local_port holds the local end of the channel */

HYPERVISOR_set_callbacks
• Hypercall to configure the notification handlers
    HYPERVISOR_set_callbacks(
        unsigned long event_selector, unsigned long event_address,
        unsigned long failsafe_selector, unsigned long failsafe_address)
    /* defined at xen-3.1.0-src\linux-2.6-xen-sparse\include\asm-i386\mach-xen\asm\hypercall.h */
  • event_selector + event_address: together form the address of the callback for notifications
  • failsafe_selector + failsafe_address: together form the address of the callback used if anything goes wrong with the event
• Notifications can be prevented at the VCPU level or at the event level, because both masks are contained in the shared info page:
    struct shared_info {
        …
        struct vcpu_info vcpu_info[MAX_VIRT_CPUS]; /* each entry contains: uint8_t evtchn_upcall_mask; */
        unsigned long evtchn_mask[sizeof(unsigned long) * 8];
        …
    };

Setting the notifications handler
Handler and masks configuration:
    /* Locations in the bootstrapping code */
    extern volatile shared_info_t shared_info;
    void hypervisor_callback(void);
    void failsafe_callback(void);

    static evtchn_handler_t handlers[NUM_CHANNELS];

    /* Default handler: ignore the event */
    void EVT_IGN(evtchn_port_t port, struct pt_regs *regs) {}

    /* Initialise the event handlers */
    void init_events(void)
    {
        /* Set the event delivery callbacks */
        HYPERVISOR_set_callbacks(
            FLAT_KERNEL_CS, (unsigned long)hypervisor_callback,
            FLAT_KERNEL_CS, (unsigned long)failsafe_callback);
        /* Set all handlers to ignore, and mask them */
        for (unsigned int i = 0; i < NUM_CHANNELS; i++)
        {
            handlers[i] = EVT_IGN;
            SET_BIT(i, shared_info.evtchn_mask[0]);
        }
        /* Allow upcalls. */
        shared_info.vcpu_info[0].evtchn_upcall_mask = 0;
    }

Implementing the callback function
    /* Dispatch events to the correct handlers */
    void do_hypervisor_callback(struct pt_regs *regs)
    {
        unsigned int pending_selector, next_event_offset;
        vcpu_info_t *vcpu = &shared_info.vcpu_info[0];
        /* Make sure we don't lose the edge on new events... */
        vcpu->evtchn_upcall_pending = 0;
        /* Set the pending selector to 0 and get the old value atomically */
        pending_selector = xchg(&vcpu->evtchn_pending_sel, 0);
        while (pending_selector != 0)
        {
            /* Get the first bit of the selector and clear it */
            next_event_offset = first_bit(pending_selector);
            pending_selector &= ~(1 << next_event_offset);
            unsigned int event;
            /* While there are events pending on unmasked channels in the
             * bitmap word chosen by the selector bit (the index must be
             * next_event_offset, not the remaining selector value) */
            while ((event = (shared_info.evtchn_pending[next_event_offset]
                             & ~shared_info.evtchn_mask[next_event_offset])) != 0)
            {
                /* Find the first waiting event */
                unsigned int event_offset = first_bit(event);
                /* Combine the two offsets to get the port */
                evtchn_port_t port = (next_event_offset << 5) + event_offset; /* 5 -> 32 bits per word */
                /* Handle the event */
                handlers[port](port, regs);
                /* Clear the pending flag in the same word */
                CLEAR_BIT(shared_info.evtchn_pending[next_event_offset], event_offset);
            }
        }
    }
Note: the port computation maps a pending bit to an index in the handlers array.

XenStore

XenStore
• XenStore is a hierarchical namespace (similar to sysfs or Open Firmware) which is shared between domains
• The interdomain communication primitives exposed by Xen are very low-level (virtual IRQs and shared memory)
• XenStore is implemented on top of these primitives and provides some higher level operations (read a key, write a key, enumerate a directory, notify when a key changes value)
• General format
  • There are three main paths in XenStore:
    • /vm - stores configuration information about the domain
    • /local/domain - stores information about the domain on the local node (domid, etc.)
    • /tool - stores information for the various tools
• Detailed information at http://wiki.xensource.com/xenwiki/XenStoreReference

Ring buffers for split driver model
• The ring buffer is a fairly standard lockless data structure for producer-consumer communications
• Xen uses free-running counters
• Each ring contains two kinds of data, requests and responses, updated by the two halves of the driver
• Xen only allows responses to be written in a way that overwrites requests
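The free-running counter idea can be sketched with a minimal single-producer, single-consumer ring. This is a generic illustration with hypothetical names (`toy_ring`, `ring_put`, `ring_get`), not Xen's ring macros: it holds one kind of item only, and it omits the memory barriers and event notifications a real shared ring needs.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE 8u   /* must be a power of two */

struct toy_ring {
    uint32_t prod;           /* free-running producer counter, never reset */
    uint32_t cons;           /* free-running consumer counter, never reset */
    int slots[RING_SIZE];
};

/* Unsigned subtraction stays correct even after the counters wrap around. */
static uint32_t ring_used(const struct toy_ring *r)
{
    return r->prod - r->cons;
}

/* Producer: returns 0 on success, -1 if the ring is full. */
static int ring_put(struct toy_ring *r, int value)
{
    assert((RING_SIZE & (RING_SIZE - 1u)) == 0);
    if (ring_used(r) == RING_SIZE)
        return -1;
    r->slots[r->prod & (RING_SIZE - 1u)] = value;  /* mask only when indexing */
    r->prod++;
    return 0;
}

/* Consumer: returns 0 on success, -1 if the ring is empty. */
static int ring_get(struct toy_ring *r, int *value)
{
    if (ring_used(r) == 0)
        return -1;
    *value = r->slots[r->cons & (RING_SIZE - 1u)];
    r->cons++;
    return 0;
}
```

Because the counters are only masked when indexing, "full" (difference equals the ring size) and "empty" (difference is zero) are distinguishable without sacrificing a slot.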

Xen Split Device Driver Model (for PV guests)
• Xen typically delegates hardware support to Domain 0, and device drivers typically consist of four main components:
  • The real driver
  • The back end split driver
  • A shared ring buffer (shared memory pages and event notifications)
  • The front end split driver
[Figure: a Paravirtual Guest with a frontend device driver and Domain 0 with a backend device driver and the real device driver, connected through shared ring buffers, on top of the Hypervisor and the Hardware (block devices).]

Xen HVM functionality
