
Virtualisation Front Side Buses SMP systems


Presentation Transcript


  1. Virtualisation Front Side Buses SMP systems COMP311 2007 Jamie Curtis

  2. Virtualisation • Allows a “Host” OS to execute a “Guest” OS as a task • No need for the “Guest” OS’s cooperation • Host is often called a Hypervisor or VMM • VMM controls devices • Often presents virtual devices to guest OSes

  3. Virtualisation cont. • PC architecture makes this tricky • Protected mode introduced privilege rings • 4 levels – 0, 1, 2, 3 with decreasing privilege • Operating systems assume full control (ring 0), with userspace at ring 3 • Some privileged CPU state can be read from any ring without causing a fault • Some instructions silently fail if not run at the correct privilege level instead of causing a fault
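
A concrete illustration of the “no control on reading this state” problem (my own example, not from the slides): the C program below executes SGDT from ring 3. On x86 CPUs without the later UMIP feature enabled this does not fault; it quietly reveals the kernel’s GDT base, which is exactly the sensitive-but-unprivileged behaviour that breaks classic trap-and-emulate virtualisation. Assumes x86-64 and GCC/Clang inline assembly.

/* sgdt_demo.c - runs SGDT from user space (ring 3).
 * On CPUs/kernels without UMIP enabled this does NOT fault; it quietly
 * reveals the kernel's GDT base instead of trapping, which is the kind
 * of behaviour that defeats classic trap-and-emulate virtualisation. */
#include <stdio.h>
#include <stdint.h>

struct __attribute__((packed)) descriptor_table {
    uint16_t limit;
    uint64_t base;
};

int main(void)
{
    struct descriptor_table gdtr = {0};
    __asm__ volatile("sgdt %0" : "=m"(gdtr));   /* privileged state, read without a fault */
    printf("GDT base = %#llx, limit = %u\n",
           (unsigned long long)gdtr.base, gdtr.limit);
    return 0;
}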

  4. Virtualisation cont. • 4 approaches to fix this • Emulation • Paravirtualisation • Binary Translation • Hardware Assisted • Emulation • Fake the entire machine • Very slow but doesn’t require the same host architecture as the guest
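
As a sketch of what “fake the entire machine” means in practice, here is a fetch/decode/execute loop over a tiny made-up instruction set (the opcodes and registers are invented for illustration). Every guest instruction costs many host instructions, which is why pure emulation is slow but works across host architectures.

/* toy_emulator.c - interpret a made-up two-register ISA to show the
 * shape of an emulator's fetch/decode/execute loop. The opcodes are
 * hypothetical; real emulators (QEMU, Bochs) are the same idea with
 * far more state and decoding work. */
#include <stdio.h>
#include <stdint.h>

enum { OP_HALT = 0, OP_LOADI = 1, OP_ADD = 2, OP_PRINT = 3 };

int main(void)
{
    /* "Guest" program: r0 = 2; r1 = 40; r0 += r1; print r0; halt */
    uint8_t mem[] = { OP_LOADI, 0, 2, OP_LOADI, 1, 40,
                      OP_ADD, 0, 1, OP_PRINT, 0, OP_HALT };
    uint32_t reg[2] = {0};
    size_t pc = 0;

    for (;;) {                              /* fetch */
        uint8_t op = mem[pc++];
        switch (op) {                       /* decode + execute */
        case OP_LOADI: { uint8_t r = mem[pc++]; reg[r] = mem[pc++]; break; }
        case OP_ADD:   { uint8_t d = mem[pc++]; uint8_t s = mem[pc++]; reg[d] += reg[s]; break; }
        case OP_PRINT: { uint8_t r = mem[pc++]; printf("r%u = %u\n", r, reg[r]); break; }
        case OP_HALT:  return 0;
        default:       return 1;            /* unknown opcode */
        }
    }
}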

  5. Paravirtualisation • Alter the guest so that it doesn’t use the “bad” instructions • Guest instead calls out to the VMM • Very fast with minimal overhead • Requires support from all guest OSes • Effectively limited to open source OSes • Championed by Xen
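
A rough sketch of the paravirtual idea: instead of executing a privileged operation directly (which the VMM would otherwise have to trap or rewrite), the modified guest kernel asks the hypervisor to do it. The hypercall() function and HCALL_SET_PTE number below are hypothetical placeholders, not a real ABI; real Xen guests issue hypercalls such as HYPERVISOR_mmu_update through a hypercall page.

/* pv_sketch.c - illustrative only: the *shape* of a paravirtualised
 * guest replacing a privileged MMU write with a hypercall.
 * hypercall() and HCALL_SET_PTE are hypothetical, not a real ABI. */
#include <stdint.h>

#define HCALL_SET_PTE 7   /* hypothetical hypercall number */

/* In a real PV guest this would transfer control to the VMM; here it
 * is just a stub so the sketch compiles. */
static long hypercall(long nr, uint64_t a1, uint64_t a2)
{
    (void)nr; (void)a1; (void)a2;
    return 0;
}

/* Unmodified kernel: writes the page-table entry directly (privileged).
 * Paravirtualised kernel: asks the VMM to do it on its behalf. */
static void set_pte(uint64_t *pte, uint64_t val)
{
#ifdef PARAVIRT
    hypercall(HCALL_SET_PTE, (uint64_t)pte, val);
#else
    *pte = val;            /* would have to be trapped or rewritten under a VMM */
#endif
}

int main(void)
{
    uint64_t pte = 0;
    set_pte(&pte, 0x1003);
    return 0;
}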

  6. Binary Translation • The VMM dynamically translates the guest’s instruction stream before it’s executed, replacing “bad” instructions as it goes • Lots of optimisations make this not nearly as bad as it sounds • Allows running unmodified OSes • Performance hit can be anything from 5–60% depending on workload • Championed by VMware
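
A very rough sketch of dynamic binary translation over a made-up one-byte “ISA”: the translator copies safe bytes through unchanged and rewrites each “bad” opcode into a call-out to the VMM before the block runs. Real translators such as VMware’s work on x86 basic blocks and cache and chain the translated code, which is where most of the optimisation comes from.

/* bt_sketch.c - toy binary translator over a hypothetical one-byte
 * ISA: 0xF4 stands in for a "sensitive" instruction that must not run
 * natively; everything else is copied through unchanged. */
#include <stdio.h>
#include <stdint.h>

#define OP_SENSITIVE 0xF4   /* hypothetical "bad" opcode */
#define OP_VMM_CALL  0xCC   /* hypothetical "call into the VMM" opcode */

static size_t translate(const uint8_t *in, size_t n, uint8_t *out)
{
    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        if (in[i] == OP_SENSITIVE)
            out[o++] = OP_VMM_CALL;   /* rewrite: emulate in the VMM */
        else
            out[o++] = in[i];         /* safe: execute natively */
    }
    return o;   /* real translators also cache and chain translated blocks */
}

int main(void)
{
    uint8_t guest[] = { 0x90, 0xF4, 0x90 };
    uint8_t cache[sizeof guest * 2];
    size_t n = translate(guest, sizeof guest, cache);
    for (size_t i = 0; i < n; i++) printf("%02X ", (unsigned)cache[i]);
    printf("\n");
    return 0;
}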

  7. Hardware Virtualisation • Adds another privilege level for the VMM • Hardware maintains state for each guest • VMM gets very flexible control of what causes faults • Intel and AMD again have similar specs (but incompatible!) • AMD – “Pacifica” (AMD-V), Intel – “Vanderpool” (VT-x)

  8. Hardware Virtualisation cont. • Initial solutions don’t deal with enough in hardware, often making them slower than binary translation • Transition into and out of guests is very slow • VMware and Xen have both added support • Hardware support is getting more complete in newer versions

  9. Virtual I/O Devices • Typically VMMs provide virtual hardware drivers to the guests • These may then map onto real hardware inside the VMM • Allowing secure and separated direct access to hardware is difficult

  10. AMD 10h Family • AMD’s latest architecture adds a number of important virtualisation enhancements • Primarily focused on making fewer calls into the VMM • Three main enhancements • Nested Page Tables • Tagged TLB • Device Exclusion Vector (DEV)

  11. Nested Page Tables • Current designs use shadow page tables • The MMU is set up for the current guest / process • Changes to the MMU are trapped by the VMM • Nested page tables instead give each VM its own second-level page table (guest-physical to host-physical) • Looks similar to the virtual address space provided by the OS to processes, just one more level up • Makes lookups through the page tables slower
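
The “slower lookups” point can be made concrete with a little arithmetic: with 4-level guest page tables and 4-level nested tables, each of the guest’s own page-table references is a guest-physical address that must itself be walked through the nested tables. The commonly quoted worst case is n*m + n + m = 24 memory references per TLB miss instead of 4; the sketch below just evaluates that formula (the 4-and-4 level assumption is mine, matching x86-64).

/* npt_cost.c - back-of-envelope cost of a 2D (nested) page walk,
 * assuming 4-level guest tables and 4-level nested tables as on
 * x86-64; n*m + n + m is the usual worst-case count. */
#include <stdio.h>

int main(void)
{
    int guest_levels = 4;   /* guest-virtual -> guest-physical */
    int host_levels  = 4;   /* guest-physical -> host-physical */

    int native = guest_levels;                      /* bare metal          */
    int nested = guest_levels * host_levels         /* translate each      */
               + guest_levels + host_levels;        /* guest PTE + the data address */

    printf("native walk : %d memory references\n", native);   /* 4  */
    printf("nested walk : %d memory references\n", nested);   /* 24 */
    return 0;
}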

  12. Tagged TLB • TLBs cache lookups through the page tables • Typically they are flushed each time you switch process or VM • Tagged TLBs add a tag to each entry recording which VM it belongs to • No longer have to flush TLBs when switching VMs • Makes VM – VMM – VM transitions cheaper • Reduces the performance hit of nested page tables
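
A sketch of what the tag buys (the structure and field names are mine, not AMD’s): each TLB entry carries a VM tag, a lookup hits only if both the tag and the virtual page number match, and switching VMs just changes the current tag instead of flushing every entry.

/* tagged_tlb.c - toy tagged TLB: entries are keyed by (vm_tag, vpn),
 * so a world switch changes current_vm instead of flushing. */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 8

struct tlb_entry { bool valid; uint16_t vm_tag; uint64_t vpn, pfn; };

static struct tlb_entry tlb[TLB_ENTRIES];
static uint16_t current_vm = 1;

static bool tlb_lookup(uint64_t vpn, uint64_t *pfn)
{
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].valid && tlb[i].vm_tag == current_vm && tlb[i].vpn == vpn) {
            *pfn = tlb[i].pfn;
            return true;            /* hit: tag and VPN both match */
        }
    return false;                   /* miss: walk the page tables */
}

int main(void)
{
    tlb[0] = (struct tlb_entry){ true, 1, 0x1234, 0xABCD };
    uint64_t pfn;
    printf("VM1 lookup: %s\n", tlb_lookup(0x1234, &pfn) ? "hit" : "miss");
    current_vm = 2;                 /* "switch VM": no flush needed */
    printf("VM2 lookup: %s\n", tlb_lookup(0x1234, &pfn) ? "hit" : "miss");
    return 0;
}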

  13. Device Exclusion Vector • Contains a mapping of what pages a device can access • Allows the VMM to give DMA access to specific pages • Allows a device to do DMA directly into a specific VM’s memory space
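
A sketch of the bookkeeping the DEV implies: one bit per 4KB physical page, consulted before a device’s DMA is let through. The code follows the slide’s framing (bit set = access allowed); in AMD’s actual DEV the bit sense is exclusion, i.e. a set bit blocks the access. All names here are mine.

/* dev_sketch.c - toy Device Exclusion Vector: a bitmap with one bit
 * per 4KB physical page, checked before a device DMA is allowed.
 * Bit set = allowed here (the slide's framing); the hardware DEV uses
 * the opposite (exclusion) sense. */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define PAGE_SHIFT 12
#define NUM_PAGES  1024                 /* 4MB of toy physical memory */

static uint8_t dev_bitmap[NUM_PAGES / 8];

static void allow_page(uint64_t paddr)
{
    uint64_t pfn = paddr >> PAGE_SHIFT;
    dev_bitmap[pfn / 8] |= 1u << (pfn % 8);
}

static bool dma_permitted(uint64_t paddr)
{
    uint64_t pfn = paddr >> PAGE_SHIFT;
    return dev_bitmap[pfn / 8] & (1u << (pfn % 8));
}

int main(void)
{
    allow_page(0x5000);                 /* VMM grants one guest page to the device */
    printf("DMA to 0x5000: %s\n", dma_permitted(0x5000) ? "ok" : "blocked");
    printf("DMA to 0x9000: %s\n", dma_permitted(0x9000) ? "ok" : "blocked");
    return 0;
}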

  14. Virtualisation Future • Linux now has 4 full virtualisation packages • VMware, Xen, KVM, Lguest • Intel and AMD working on next-gen hardware support • Dense, multi-core solutions making virtualisation very attractive • Power and space efficient • Keeps getting more and more important
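
Since KVM is mentioned above, here is a minimal view of hardware-assisted virtualisation from Linux userspace: open /dev/kvm, check the API version, and create an empty VM. It assumes a kernel with KVM enabled and access to /dev/kvm, and stops well short of a usable guest (no memory, no vCPUs, no run loop).

/* kvm_probe.c - minimal probe of Linux's KVM interface: open /dev/kvm,
 * check the API version, and create an (empty) VM. Requires a kernel
 * with KVM enabled and read/write access to /dev/kvm. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

int main(void)
{
    int kvm = open("/dev/kvm", O_RDWR | O_CLOEXEC);
    if (kvm < 0) { perror("open /dev/kvm"); return 1; }

    int version = ioctl(kvm, KVM_GET_API_VERSION, 0);
    printf("KVM API version: %d\n", version);        /* 12 on current kernels */

    int vmfd = ioctl(kvm, KVM_CREATE_VM, 0);          /* VT-x / AMD-V sits behind this */
    if (vmfd < 0) { perror("KVM_CREATE_VM"); close(kvm); return 1; }
    printf("created VM fd: %d\n", vmfd);

    close(vmfd);
    close(kvm);
    return 0;
}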

  15. Traditional Intel Architecture • (Diagram: CPU connected over the Front Side Bus to the Northbridge, with the Southbridge attached to the Northbridge)

  16. Clock vs. Data rates • Clock rates no longer equal data rates • Watch specifications as both can look identical • e.g. Intel’s FSB is normally referred to as a 1333MHz bus • It is actually a 333MHz clock with a quad-pumped data rate • Should really be quoted as 1333MT/s

  17. Frontside Bus & Dual Core • 64 bit, quad-pumped 333MHz • 8 * 4 * 333 = 10.6GB/s • Half duplex • Traditional P4 FSB has been point-to-point • Xeon SMP requires chipsets that support a proper multi-drop bus for the FSB • How did a dual core Pentium D work?
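
The MT/s-vs-MHz distinction and the 10.6GB/s figure both fall out of the same arithmetic, sketched below with the numbers from the two slides above (nominal 333⅓ MHz clock, quad-pumped, 64-bit wide).

/* fsb_bw.c - clock rate vs transfer rate vs bandwidth for the
 * quad-pumped Intel FSB described above. */
#include <stdio.h>

int main(void)
{
    double clock_mhz   = 333.33;  /* nominal 333 1/3 MHz bus clock */
    int    pump        = 4;       /* quad-pumped: 4 transfers per clock */
    int    width_bytes = 8;       /* 64-bit data bus */

    double mts = clock_mhz * pump;                 /* ~1333 MT/s */
    double gbs = mts * width_bytes / 1000.0;       /* ~10.66 GB/s, usually quoted as 10.6 */

    printf("transfer rate : %.0f MT/s (marketed as a \"1333MHz\" FSB)\n", mts);
    printf("peak bandwidth: %.2f GB/s (half duplex)\n", gbs);
    return 0;
}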

  18. Intel Dual Core • Slap two cores on the same die! • How do they communicate? • Over the FSB like a Xeon! • Requires a new chipset to support it.

  19. Caches • There are two sets of L1 and L2 cache, one for each core. • What happens if both cores are caching the same memory location? • Need a protocol to make sure this doesn’t cause problems • Cache Coherency Protocol

  20. Cache Coherency • Intel use the MESI protocol • Modified • 1 cache contains a modified copy of the location • Exclusive • 1 cache contains an un-modified copy of the location • Shared • 2 or more caches contain un-modified copies of the location • Invalid • This cache’s copy must not be used (e.g. because another cache holds a modified copy)

  21. MESI • What happens if a core has an Invalid entry but tries to access it? • The cache with the Modified entry needs to write the entry back to memory, and both copies become Shared • This means an Intel Pentium D has to involve the FSB in all of its cache coherency updates even though the two cores are on the same die
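
A toy version of the interaction just described: core 1 reads a line it holds as Invalid while core 0 holds it Modified, so the Modified copy is written back and both copies end up Shared. The states follow the slides; the two-cache model and code structure are mine.

/* mesi_sketch.c - two private caches, one line, MESI states only.
 * Shows the read-miss-on-Invalid case from the slide: the Modified
 * copy is written back (over the FSB on a Pentium D) and both caches
 * end up Shared. */
#include <stdio.h>

enum mesi { INVALID, SHARED, EXCLUSIVE, MODIFIED };
static const char *name[] = { "Invalid", "Shared", "Exclusive", "Modified" };

static enum mesi cache[2];          /* state of one line in core 0 and core 1 */

static void read_line(int core)
{
    int other = 1 - core;
    if (cache[core] != INVALID)
        return;                              /* hit: no bus traffic */
    if (cache[other] == MODIFIED) {
        /* snoop hit on a dirty line: write it back to memory, then share it */
        cache[other] = SHARED;
        cache[core]  = SHARED;
    } else {
        cache[core] = (cache[other] == INVALID) ? EXCLUSIVE : SHARED;
    }
}

int main(void)
{
    cache[0] = MODIFIED;             /* core 0 has written the line   */
    cache[1] = INVALID;              /* core 1's copy was invalidated */
    read_line(1);                    /* core 1 now reads it           */
    printf("core 0: %s, core 1: %s\n", name[cache[0]], name[cache[1]]);
    return 0;
}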

  22. Intel Core • (Diagram: Core 1 and Core 2 sharing an L2 cache, with bus control / L2 control logic connecting to the FSB) • Core architecture is designed for dual-core • Cache is now shared between cores • Cache coherency is now between each core’s L1 and the shared L2

  23. Intel Core 2 Quad • Current quad core design combines two dual core dies in one package • Cache coherency between the dies again happens over the FSB

  24. AMD K8 Architecture • Integrated Memory Controller • Very low latency • CPU determines memory technology • Requires both CPU and Motherboard to be changed for a new type of memory • Athlon 64, Socket 754 • Single channel, 64bit DDR at 200MHz • 8 * 2 * 200 = 3.2GB/s

  25. AMD K8 Memory Interfaces • Athlon 64, Socket 939 • Dual channel, 64bit DDR at 200MHz • 2 * 8 * 2 * 200 = 6.4 GB/s • Athlon 64, Socket AM2 • Dual channel, 64bit DDR2 at 400MHz • 2 * 8 * 2 * 400 = 12.8 GB/s • Opteron, Socket 940 • Dual channel, 64bit DDR at 200MHz • 2 * 8 * 2 * 200 = 6.4 GB/s
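
All of the figures on the last two slides come from the same channels × width × data-rate product; a quick sketch using the numbers from the slides:

/* k8_membw.c - peak memory bandwidth for the K8 socket variants
 * listed above: channels * 8 bytes * 2 (DDR) * clock_MHz. */
#include <stdio.h>

static void membw(const char *socket, int channels, int clock_mhz)
{
    double gbs = channels * 8 * 2 * clock_mhz / 1000.0;   /* 64-bit DDR per channel */
    printf("%-12s %d x DDR @ %dMHz = %4.1f GB/s\n",
           socket, channels, clock_mhz, gbs);
}

int main(void)
{
    membw("Socket 754", 1, 200);   /*  3.2 GB/s */
    membw("Socket 939", 2, 200);   /*  6.4 GB/s */
    membw("Socket AM2", 2, 400);   /* 12.8 GB/s (DDR2 at 400MHz, i.e. DDR2-800) */
    membw("Socket 940", 2, 200);   /*  6.4 GB/s */
    return 0;
}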

  26. AMD K8 FSB • AMD use the HyperTransport technology • Don’t confuse it with the very different Intel Hyper-Threading Technology • Packet-based, point-to-point link that provides the lowest possible latency • Full duplex, bi-directional link • Available in 2, 4, 8, 16 or 32 bits wide • 50MHz – 1.4GHz clock rates • Clock rates and bit widths can be asymmetric • Double-pumped data rate • 1.4GHz * 2 * 4 = 11.2GB/s per direction

  27. Athlon 64 FSB • The Athlon 64 has a single HyperTransport link to connect to the I/O subsystem • 16bit, bi-directional, DDR at 800MHz (754) or 1GHz (939) • 2 * 2 * 2 * 1000 = 8GB/s • Can have either a single-chip solution or stay with the two-chip northbridge / southbridge combination
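
The HyperTransport numbers follow the same pattern, but are counted per direction because the link is full duplex. A sketch with the two link configurations quoted on the last two slides (my arithmetic, the slides’ numbers):

/* ht_bw.c - HyperTransport link bandwidth: width_bytes * 2 (DDR) * clock,
 * counted per direction because the link is full duplex. */
#include <stdio.h>

static void htbw(const char *what, int width_bits, double clock_ghz)
{
    double per_dir = (width_bits / 8.0) * 2 * clock_ghz;   /* GB/s per direction */
    printf("%-26s %4.1f GB/s per direction, %4.1f GB/s aggregate\n",
           what, per_dir, 2 * per_dir);
}

int main(void)
{
    htbw("32-bit link @ 1.4GHz", 32, 1.4);     /* 11.2 GB/s per direction          */
    htbw("Athlon 64 16-bit @ 1GHz", 16, 1.0);  /*  4.0 GB/s per direction, 8 total */
    return 0;
}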

  28. Opteron • Uses HT for both the I/O interconnect and the CPU interconnect • The CPU interconnect requires an additional cache coherency extension to HT • Comes in three variants • 1xx – Memory controller and 3 HT links • 2xx – Memory controller and 3 HT links • 8xx – Memory controller and 3 HT links

  29. Opteron • What makes them different? • The number of HT links that support the CC protocol • 1xx – Zero • 2xx – One • 8xx – Three • This allows you to scale to different numbers of processors • 1xx – 1-way, 2xx – 2-way, 8xx – 4- or 8-way

  30. AMD MOESI • AMD’s cache coherency protocol is slightly different to Intel’s: it adds an Owned state • Owned – this cache owns the memory location and it (not memory) services all requests for it from other caches • The request goes across the high-speed dedicated HT link
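
A small sketch of what the Owned state changes relative to the earlier MESI example: when another core reads a dirty line, the owning cache supplies the data directly (over the coherent HT link between sockets) and keeps it, with no write-back to memory. The states are MOESI’s; the two-cache model is mine.

/* moesi_sketch.c - the same read-miss scenario as the MESI sketch,
 * but with MOESI: the dirty line is served cache-to-cache and memory
 * is NOT updated. */
#include <stdio.h>

enum moesi { INVALID, SHARED, EXCLUSIVE, OWNED, MODIFIED };
static const char *name[] = { "Invalid", "Shared", "Exclusive", "Owned", "Modified" };

static enum moesi cache[2];

static void read_line(int core)
{
    int other = 1 - core;
    if (cache[core] != INVALID)
        return;
    if (cache[other] == MODIFIED || cache[other] == OWNED) {
        cache[other] = OWNED;        /* keeps the dirty data and services requests */
        cache[core]  = SHARED;       /* supplied cache-to-cache, no memory write   */
    } else {
        cache[core] = (cache[other] == INVALID) ? EXCLUSIVE : SHARED;
    }
}

int main(void)
{
    cache[0] = MODIFIED;
    cache[1] = INVALID;
    read_line(1);
    printf("core 0: %s, core 1: %s\n", name[cache[0]], name[cache[1]]);
    return 0;
}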

  31. AMD Dual Core • In the dual core situation it becomes even better, as the next slide shows

  32. AMD Dual Core • Requests now go across the System Request Interface • This runs at the CPU core frequency • The System Request Interface also controls HT–memory access as well as HT–HT communication in 2xx and 8xx Opterons

  33. AMD Quad Core • AMD again targeted a native quad core design instead of a dual-die, dual-core package • Introduced a shared L3 cache
