An Introduction to the V3VEE Project and the Palacios Virtual Machine Monitor


Presentation Transcript


  1. An Introduction to the V3VEE Project and the Palacios Virtual Machine Monitor Peter A. Dinda Department of Electrical Engineering and Computer Science Northwestern University http://v3vee.org

  2. Overview • We are building a new, public, open-source VMM for modern architectures • You can download it now! • We are leveraging it for research in HPC, systems, architecture, and teaching • You can join us

  3. Outline • V3VEE Project • Why a new VMM? • Palacios VMM • Research leveraging Palacios • Virtualization for HPC • Virtual Passthrough I/O • Adaptation for Multicore and HPC • Overlay Networking for HPC • Alternative Paging • Participation

  4. V3VEE Overview • Goal: Create a new, public, open source virtual machine monitor (VMM) framework for modern x86/x64 architectures (those with hardware virtualization support) that permits the compile-time/run-time composition of VMMs with structures optimized for… • High performance computing research and use • Computer architecture research • Experimental computer systems research • Teaching • Community resource development project (NSF CRI) • X-Stack exascale systems software research (DOE)

  5. Example Structures [diagram: alternative VMM compositions, layering applications, library OSes, and guest OSes over a VMM API, VMM, multiplexer, and drivers on the hardware]

  6. V3VEE People and Organizations • Northwestern University • Peter Dinda, Jack Lange, Russ Joseph, Lei Xia, Chang Bae, Giang Hoang, Fabian Bustamante, Steven Jaconette, Andy Gocke, Mat Wojick, Peter Kamm, Robert Deloatch, Yuan Tang, Steve Chen, Brad Weinberger, Mahdav Suresh, and more… • University of New Mexico • Patrick Bridges, Zheng Cui, Philip Soltero, Nathan Graham, Patrick Widener, and more… • Sandia National Labs • Kevin Pedretti, Trammell Hudson, Ron Brightwell, and more… • Oak Ridge National Lab • Stephen Scott, Geoffroy Vallee, and more… • University of Pittsburgh • Jack Lange, and more…

  7. Support (2006-2013)

  8. Why a New VMM? [diagram: a virtual machine monitor at the center of research in systems, research in computer architecture, research and use in high performance computing, teaching, the commercial space, and an open source community with a broad user and developer base]

  9. Why a New VMM? [same diagram: the commercial space is marked as well-served by existing VMMs]

  10. Why a New VMM? [same diagram: the research, HPC, and teaching constituencies are marked as ill-served by existing VMMs]

  11. Why a New VMM? • Code Scale • Compact codebase by leveraging hardware virtualization assistance • Code decoupling • Make virtualization support self-contained • Linux/Windows kernel experience unneeded • Make it possible for researchers and students to come up to speed quickly

  12. Why a New VMM? • Freedom • BSD license for flexibility • Raw performance at large scales

  13. Palacios VMM • Under development since early 2007 • Third major release (1.2) in January 2010 • Public git access to current development branches • Successfully used on Cray XT supercomputers, clusters (Infiniband and Ethernet), servers, desktops, etc.

  14. What Does It Do? • System virtualization • No paravirtualization necessary • Run an OS as an application • Run multiple OS environments on a single machine • Start, stop, pause • Trap and emulate model • Privileged instructions/events are trapped by VMM using hardware mechanisms • Emulated in software • Palacios acts as an event handler loop • Minimal Linux boot: ~2 million events (“exits”) [diagram: guest applications and guest OSes, each with their own page tables and CPU state, running over the host OS/VMM, which emulates page tables, CPU state, and hardware]
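
The event-handler-loop view of trap and emulate can be pictured with a short sketch. This is a minimal illustration in C, assuming invented names (vm_enter, exit_reason_t, the handle_* functions); it is not Palacios's actual internal API, and the back-end handlers are left as prototypes.

```c
/* Sketch of a trap-and-emulate VMM core loop. Names are illustrative,
 * not Palacios's real interface. */
typedef enum { EXIT_IO, EXIT_PAGE_FAULT, EXIT_INTR, EXIT_HLT } exit_reason_t;

struct guest_ctx;                              /* guest register/paging state  */
exit_reason_t vm_enter(struct guest_ctx *g);   /* run guest until next exit    */

void handle_io(struct guest_ctx *g);           /* emulate in/out instruction   */
void handle_page_fault(struct guest_ctx *g);   /* fix up shadow/nested paging  */
void inject_interrupt(struct guest_ctx *g);    /* deliver pending virtual IRQ  */

void run_guest(struct guest_ctx *g) {
    for (;;) {
        exit_reason_t why = vm_enter(g);       /* hardware (SVM/VT-x) traps here */
        switch (why) {
        case EXIT_IO:         handle_io(g);         break;
        case EXIT_PAGE_FAULT: handle_page_fault(g); break;
        case EXIT_INTR:       inject_interrupt(g);  break;
        case EXIT_HLT:        return;                /* guest halted */
        }
    }
}
```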

  15. Leveraging Modern Hardware • Requires and makes extensive use of modern x86/x64 virtualization support • AMD SVM (AMD-V) • Intel VT-x • Nested paging support in addition to several varieties of shadow paging • Straightforward passthrough PCI • IOMMU/PCI-SIG in progress
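
For context on what "hardware virtualization support" means in practice, a user-space probe can check the relevant CPUID feature bits (Intel VT-x: CPUID.1:ECX bit 5; AMD SVM: CPUID.8000_0001h:ECX bit 2). The sketch below uses GCC/Clang's <cpuid.h>; it is a detection example only, not Palacios code.

```c
#include <stdio.h>
#include <cpuid.h>   /* GCC/Clang __get_cpuid() */

int main(void) {
    unsigned int eax, ebx, ecx, edx;

    /* Intel VT-x (VMX): CPUID leaf 1, ECX bit 5 */
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) && (ecx & (1u << 5)))
        printf("Intel VT-x (VMX) available\n");

    /* AMD SVM (AMD-V): CPUID leaf 0x80000001, ECX bit 2 */
    if (__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx) && (ecx & (1u << 2)))
        printf("AMD SVM (AMD-V) available\n");

    return 0;
}
```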

  16. Design Choices • Minimal interface • Suitable for use with a lightweight kernel (LWK) • Compile- and run-time configurability • Create a VMM tailored to specific environments • Low noise • No deferred work • Contiguous memory pre-allocation • No surprising memory system behavior • Passthrough resources and resource partitioning

  17. Embeddability • Palacios compiles to a static library • Clean, well-defined host OS interface allows Palacios to be embedded in different Host OSes • Palacios adds virtualization support to an OS • Palacios + lightweight kernel = traditional “type-I” “bare metal” VMM • Current embeddings • Kitten, GeekOS, Minix 3, Linux [partial]
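
A rough picture of what makes such embedding possible: the host OS fills in a table of callbacks (memory allocation, printing, interrupt hooking) and hands it to the VMM library at initialization. The structure below is a minimal sketch with illustrative field names; it is not Palacios's exact host interface.

```c
#include <stddef.h>

/* Hypothetical host-OS hook table a VMM library could be handed at init. */
struct host_os_hooks {
    void *(*allocate_pages)(int num_pages);        /* contiguous physical pages */
    void  (*free_pages)(void *addr, int num_pages);
    void *(*malloc)(size_t size);                  /* host kernel heap          */
    void  (*free)(void *ptr);
    void  (*print)(const char *fmt, ...);          /* console/log output        */
    int   (*hook_interrupt)(int irq,               /* route host IRQ to VMM     */
                            void (*handler)(int irq, void *priv), void *priv);
    unsigned long long (*get_cycles)(void);        /* timestamping              */
};

/* The embedding host OS (e.g., Kitten, Linux) would call something like: */
void vmm_init(struct host_os_hooks *hooks);        /* provided by the VMM library */
```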

  18. Sandia Lightweight Kernel Timeline • 1991 – Sandia/UNM OS (SUNMOS), nCube-2 • 1991 – Linux 0.02 • 1993 – SUNMOS ported to Intel Paragon (1800 nodes) • 1993 – SUNMOS experience used to design Puma; first implementation of Portals communication architecture • 1994 – Linux 1.0 • 1995 – Puma ported to ASCI Red (4700 nodes); renamed Cougar, productized by Intel • 1997 – Stripped-down Linux used on Cplant (2000 nodes); difficult to port Puma to COTS Alpha server; included Portals API • 2002 – Cougar ported to ASC Red Storm (13000 nodes); renamed Catamount, productized by Cray; host and NIC-based Portals implementations • 2004 – IBM develops LWK (CNK) for BG/L/P (106000 nodes) • 2005 – IBM & ETI develop LWK (C64) for Cyclops64 (160 cores/die)

  19. Kitten: An Open Source LWK http://software.sandia.gov/trac/kitten • Better match for user expectations • Provides mostly Linux-compatible user environment • Including threading • Supports unmodified compiler toolchains and ELF executables • Better match for vendor expectations • Modern codebase with familiar Linux-like organization • Drop-in compatible with Linux • Infiniband support • End goal is deployment on future capability system

  20. A Compact Type-I VMM [code size comparison] • KVM: 50k-60k lines + kernel dependencies (??) + user-level devices (180k) • Xen: 580k lines (50k-80k core)

  21. Outline • V3VEE Project • Why a new VMM? • Palacios VMM • Research leveraging Palacios • Virtualization for HPC • Virtual Passthrough I/O • Adaptation for Multicore and HPC • Overlay Networking for HPC • Alternative Paging • Participation

  22. HPC Performance Evaluation • Virtualization is very useful for HPC, but only if it doesn’t hurt performance • Virtualizing a supercomputer? Are we crazy? • Virtualized RedStorm with Palacios • Evaluated with Sandia’s system evaluation benchmarks • Initial (48 node) evaluation in IPDPS 2010 paper • Large scale (>4096 node) evaluation in submission

  23. Cray XT3 • 38208 cores • ~3500 sq ft • 2.5 megawatts • $90 million • 17th fastest supercomputer

  24. Virtualized Performance (Catamount) • Within 5% of native • Scalable • HPCCG: conjugate gradient solver

  25. Different Guests Need Different Paging Models [chart: shadow paging performance for Compute Node Linux and Catamount guests] • HPCCG: conjugate gradient solver

  26. Different Guests Need Different Paging Models [chart: Compute Node Linux and Catamount guests] • CTH: multi-material, large deformation, strong shockwave simulation

  27. Large Scale Study • Evaluation on full RedStorm system • 12 hours of dedicated system time on full machine • Largest virtualization performance scaling study to date • Measured performance at exponentially increasing scales • 4096 nodes • Publicity • New York Times • Slashdot • HPCWire • Communications of the ACM • PC World

  28. Scalability at Large Scale (Catamount) • Within 3% of native • Scalable • CTH: multi-material, large deformation, strong shockwave simulation

  29. Infiniband on Commodity Linux (Linux guest on IB cluster) • 2-node Infiniband ping-pong bandwidth measurement

  30. Observations • Virtualization can scale on the fastest machines in the world running tightly coupled parallel applications • Best virtualization approach depends on the guest OS and the workload • Paging models, I/O models, scheduling models, etc… • Useful for VMM to have asynchronous and synchronous access to guest knowledge • Symbiotic Virtualization (Lange’s thesis)

  31. VPIO: Virtual Passthrough I/O • A modeling-based approach to high performance I/O virtualization for commodity devices • Intermediate option between passthrough I/O, self-virtualizing devices, and emulated I/O • Security of emulated I/O • …with some of the performance of self-virtualizing devices or passthrough I/O • …on commodity devices [WIOV 2008, Operating Systems Review 2009]

  32. I/O virtualization – fully emulated I/O • No guest software change • All new device drivers • High performance overhead [Sugerman01]

  33. I/O virtualization – paravirtualized I/O • Good performance • Reuses device drivers • Requires changing the guest device driver [Barham03, Levasseur04]

  34. I/O virtualization – self-virtualizing devices • Native performance • Guest responsible for device driver • Requires specialized hardware support (self-virtualizing devices) [Liu06, Raj07, Shafer07]

  35. I/O virtualization – passthrough I/O [diagram: VMM maps the device directly into the guest OS] • Reusability/multiplexing • Security issue • Makes sense in HPC contexts, not so much in other contexts

  36. Core Idea of VPIO • VMM maintains a formal model of the device • Keeps track of the physical device status • Driven by guest/device interactions • Simpler than a device driver/virtual device • Model determines • Reusable state – whether the device is currently serially reusable • DMA – whether a DMA is about to start and what memory locations will be used • Other interesting states

  37. VPIO in the Context of Palacios [diagram: guest OSes with unmodified drivers; hooked I/O and interrupts are routed to the Device Modeling Monitor (DMM) in the Palacios VMM, while unhooked I/O and DMA go directly to the physical device]

  38. Example Model: NE2K NIC • Model is small: ~700 LOC • Checking function • Transitions driven by guest’s interactions with hooked I/O resources and interrupts
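
To make the idea of a device model concrete, the sketch below shows the shape such a model can take: a small state machine advanced by hooked port writes and interrupts, plus checking functions for serial reusability and pending DMA. The states, port numbers, and bit meanings are invented for illustration (only loosely NE2K-like); this is not the actual ~700-line Palacios model.

```c
#include <stdint.h>
#include <stdbool.h>

enum dev_state { DEV_IDLE, DEV_CONFIGURED, DEV_DMA_PENDING, DEV_TRANSMITTING };

struct dev_model {
    enum dev_state state;
    uint16_t dma_addr;   /* DMA target address, as programmed by the guest */
    uint16_t dma_len;
};

/* Transition function: called by the VMM on every hooked I/O port write. */
void model_io_write(struct dev_model *m, uint16_t port, uint8_t val) {
    switch (port) {
    case 0x00:                                        /* command register      */
        if (val & 0x01) m->state = DEV_CONFIGURED;    /* start bit             */
        if (val & 0x08) m->state = DEV_DMA_PENDING;   /* begin remote DMA      */
        if (val & 0x04) m->state = DEV_TRANSMITTING;  /* transmit packet       */
        break;
    case 0x08: m->dma_addr = (uint16_t)((m->dma_addr & 0xFF00u) | val);        break;
    case 0x09: m->dma_addr = (uint16_t)((m->dma_addr & 0x00FFu) | (val << 8)); break;
    case 0x0a: m->dma_len  = (uint16_t)((m->dma_len  & 0xFF00u) | val);        break;
    case 0x0b: m->dma_len  = (uint16_t)((m->dma_len  & 0x00FFu) | (val << 8)); break;
    }
}

/* Transitions are also driven by device interrupts (e.g., DMA/TX complete). */
void model_interrupt(struct dev_model *m) {
    if (m->state == DEV_DMA_PENDING || m->state == DEV_TRANSMITTING)
        m->state = DEV_CONFIGURED;
}

/* Checking functions consulted by the VMM before reassigning the device. */
bool model_is_reusable(const struct dev_model *m) {
    return m->state == DEV_IDLE || m->state == DEV_CONFIGURED;
}

bool model_dma_pending(const struct dev_model *m, uint16_t *addr, uint16_t *len) {
    if (m->state != DEV_DMA_PENDING) return false;
    *addr = m->dma_addr;
    *len  = m->dma_len;
    return true;
}
```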

  39. Comparisons

  40. Virtuoso Project (2002-2007) virtuoso.cs.northwestern.edu • “Infrastructure as a Service” distributed/grid/cloud virtual computing system • Particularly for HPC and multi-VM scalable apps • First adaptive virtual computing system • Drives virtualization mechanisms to increase the performance of existing, unmodified apps running in collections of VMs • Focus on wide-area computation • R. Figueiredo, P. Dinda, J. Fortes, A Case for Grid Computing on Virtual Machines, Proceedings of the 23rd International Conference on Distributed Computing Systems (ICDCS 2003), May 2003. Tech report version: August 2002.

  41. Virtuoso: Adaptive Virtual Computing • Providers sell computational and communication bandwidth • Users run collections of virtual machines (VMs) that are interconnected by overlay networks • Replacement for buying machines • That continuously adapts…to increase the performance of your existing, unmodified applications and operating systems See virtuoso.cs.northwestern.edu for many papers, talks, and movies

  42. Core Results (Patent Covered) • Monitor application traffic: Use application’s own traffic to automatically and cheaply produce a view of the application’s network and CPU demands and of parallel load imbalance • Monitor physical network using the application’s own traffic to automatically and cheaply probe it, and then use the probes to produce characterizations • Formalize performance optimization problems in clean, simple ways that facilitate understanding their asymptotic difficulty. • Adapt the application to the network to make it run faster or more cost-effectively with algorithms that make use of monitoring information to drive mechanisms like VM->host mapping, scheduling of VMs, overlay network topology and routing, etc. • Adapt the network to the application through automatic reservations of CPU (incl. gang scheduling) and optical net paths • Transparently add network services to unmodified applications and OSes to fix design problems
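
As a concrete picture of the first point (inferring demands from the application's own traffic), the toy sketch below accumulates a VM-to-VM traffic matrix and reports the dominant flows, which is the kind of view an overlay adaptation algorithm can act on. It is an illustration of the idea only, not the Virtuoso/VTTIF implementation.

```c
#include <stdio.h>
#include <stdint.h>

#define MAX_VMS 8

static uint64_t traffic[MAX_VMS][MAX_VMS];   /* bytes sent from VM i to VM j */

/* Called for every observed packet (e.g., by the overlay network layer). */
void observe_packet(int src_vm, int dst_vm, uint32_t bytes) {
    if (src_vm >= 0 && src_vm < MAX_VMS && dst_vm >= 0 && dst_vm < MAX_VMS)
        traffic[src_vm][dst_vm] += bytes;
}

/* Report pairs whose traffic is within a fraction of the busiest pair:
 * these define the communication topology worth optimizing for. */
void report_dominant_flows(double threshold) {
    uint64_t max = 0;
    for (int i = 0; i < MAX_VMS; i++)
        for (int j = 0; j < MAX_VMS; j++)
            if (traffic[i][j] > max) max = traffic[i][j];

    for (int i = 0; i < MAX_VMS; i++)
        for (int j = 0; j < MAX_VMS; j++)
            if (max && traffic[i][j] >= threshold * (double)max)
                printf("VM %d -> VM %d: %llu bytes (dominant)\n",
                       i, j, (unsigned long long)traffic[i][j]);
}
```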

  43. Extending Virtuoso: Adaptation at the Small Scale • Targeting current and future multicores • Example: specialized multicore guest environment suitable for parallel language run-time • Avoid general-purpose Intel SMP-spec model or NUMA models • Specific assists for run-times implemented in Palacios • Run parallel components on the “bare metal”, while sequential components run on top of the traditional stack • Example: joint software/hardware inference of communication/memory system activity • Extending hardware performance counters • Example: automatic configuration of routing/forwarding on a network-on-chip multicore processor • “Dial in” topology suited for application

  44. Extending Virtuoso: Adaptation at the Large Scale • Targeting a parallel supercomputer • Extending Virtuoso-style adaptation into tightly coupled, performance-critical computing environments • Example: seamless, performance-neutral migration of parallel application from WAN to LAN to Cluster to Super depending on communication behavior • Leverage our VNET overlay network concept

  45. Illustration of VNET Overlay Topology Adaptation in Virtuoso [diagram: VMs 1-4 on VNET hosts in four foreign LANs, connected over the IP network by a resilient star backbone through a proxy + VNET on the user’s LAN, with fast-path links added between the VNETs hosting VMs according to the merged traffic matrix inferred by VTTIF]

  46. VNET Overlay Networking for HPC • Virtuoso project built “VNET/U” • User-level implementation • Overlay appears to the host to be a simple L2 network (Ethernet) • Ethernet in UDP encapsulation (among others) • Global control over topology and routing • Works with any VMM • Plenty fast for WAN (200 Mbps), but not HPC • V3VEE project is building “VNET/P” • VMM-level implementation in Palacios • Goal: Near-native performance on Clusters and Supers for at least 10 Gbps • Also: easily migrate existing apps/VMs to specialized interconnect networks
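
The "Ethernet in UDP encapsulation" point can be illustrated with a small sketch: the guest's raw L2 frame is carried as the payload of a UDP datagram to the peer VNET endpoint, so the overlay looks like a plain Ethernet segment. The socket handling and naming below are illustrative; VNET/U and VNET/P add their own headers and control traffic.

```c
#include <stdint.h>
#include <stddef.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>

/* Send one guest Ethernet frame to a peer VNET endpoint over UDP. */
int send_encapsulated_frame(const uint8_t *eth_frame, size_t frame_len,
                            const char *peer_ip, uint16_t peer_port) {
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (sock < 0)
        return -1;

    struct sockaddr_in peer = { 0 };
    peer.sin_family = AF_INET;
    peer.sin_port   = htons(peer_port);
    if (inet_pton(AF_INET, peer_ip, &peer.sin_addr) != 1) {
        close(sock);
        return -1;
    }

    /* The whole L2 frame (dst MAC, src MAC, ethertype, payload) rides as
     * UDP data; the receiver hands it to the destination VM's virtual NIC. */
    ssize_t n = sendto(sock, eth_frame, frame_len, 0,
                       (struct sockaddr *)&peer, sizeof(peer));
    close(sock);
    return (n == (ssize_t)frame_len) ? 0 : -1;
}
```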

  47. Current VNET/P Node Architecture

  48. Current Performance (2 node, ttcp) (4x speedup of VNET/P over VNET/U, native bandwidth on 1 Gbps Ethernet)

  49. Current Performance (2 node, latency) (<0.5 ms latency, but room for improvement)

  50. Current Performance (2 node, ttcp) (We are clearly not done yet!)
