Disco: Running Commodity Operating Systems on Scalable Multiprocessors
FLASH • a cache-coherent non-uniform memory access (NUMA) multiprocessor • developed at Stanford • not yet available at the time the paper was written
Problems with other approaches • many other experimental systems require large changes to commodity uniprocessor OSes • OS development lags behind delivery of the hardware • the high cost of such changes means the system will likely introduce instabilities, which may break legacy applications • often the hardware vendor is not the software vendor (think Intel/Microsoft)
Virtual Machine Monitors (nanokernel) • a software layer between the hardware and the heavyweight OS • by running multiple copies of traditional OSes, scalability issues are confined to the much smaller VM monitor • VMs are natural software fault boundaries, and again the small size of the monitor makes hardware fault tolerance easier to implement
Virtual Machine Monitors (nanokernel) • the monitor handles all the NUMA-related issues, so UMA OSes do not need to be made aware of non-uniformity • multiple OSes allow legacy applications to continue to run while newer versions are phased in, and could allow more experimentation with new technologies
Virtual Machine Challenges • overhead • runtime • privileged instructions must be emulated inside the monitor (see the sketch below) • I/O requests must be intercepted and remapped to virtual devices by the monitor • memory • code and data must be replicated for the different copies of the OS • each OS may have its own file system buffer cache, for instance
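To make the runtime overhead concrete, here is a minimal C sketch of trap-and-emulate: a privileged instruction executed by the guest OS traps into the monitor, which updates the virtual CPU's privileged state instead of the real hardware. The opcode names and vcpu fields are illustrative assumptions, not Disco's actual structures.

```c
/* Hedged sketch: how a monitor might emulate a privileged instruction that
 * trapped out of a guest OS. The opcodes and vcpu fields are illustrative. */
#include <stdint.h>
#include <stdio.h>

enum priv_op { OP_DISABLE_INTR, OP_ENABLE_INTR, OP_READ_STATUS };

struct vcpu {                 /* privileged state kept per virtual CPU */
    uint32_t status;          /* emulated status register */
    int      intr_enabled;    /* emulated interrupt-enable flag */
};

/* Called from the monitor's trap handler after decoding the faulting
 * instruction; the real CPU's privileged state is never touched directly. */
static void emulate_priv_op(struct vcpu *v, enum priv_op op)
{
    switch (op) {
    case OP_DISABLE_INTR: v->intr_enabled = 0; break;
    case OP_ENABLE_INTR:  v->intr_enabled = 1; break;
    case OP_READ_STATUS:  printf("status=%u\n", v->status); break;
    }
}

int main(void)
{
    struct vcpu v = { .status = 0, .intr_enabled = 1 };
    emulate_priv_op(&v, OP_DISABLE_INTR);   /* guest ran "disable interrupts" */
    printf("virtual interrupts enabled: %d\n", v.intr_enabled);
    return 0;
}
```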
Virtual Machine Challenges • resource management • the monitor does not have all the information that the OS does, so it may make poor decisions; think of descheduling a virtual CPU whose OS is holding a spin lock • sound familiar? This is the same as the argument against kernel-level threads.
Virtual Machine Challenges • communication and sharing • if OSes are separated by virtual machine boundaries, how do they share resources, and how does information cross those VM boundaries? VMs are not aware they are actually on the same machine • sound familiar? This issue motivated LRPC and URPC, which specialize communication protocols for the case where server and client reside on the same physical machine.
Disco implementation • Disco emulates the MMU and the trap architecture, allowing unmodified applications and OSes to run on the VM • frequently used kernel operations can be optimized: for instance, interrupt disabling is done by the OSes with loads and stores to special addresses
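A minimal sketch of the special-address optimization mentioned above: assuming the guest kernel has been rewritten to load from and store to a monitor-provided page instead of executing privileged instructions, common operations such as disabling interrupts become plain memory accesses. The names (special_regs, intr_enable) are hypothetical.

```c
/* Hedged sketch of the special-address optimization: the patched guest writes
 * to a page the monitor has mapped into its address space, avoiding a trap
 * for frequent operations. Names and layout are illustrative assumptions. */
#include <stdint.h>
#include <stdio.h>

/* Page shared between monitor and guest, standing in for the special
 * addresses the rewritten kernel loads from / stores to. */
struct special_regs {
    volatile uint32_t intr_enable;   /* 0 = interrupts "disabled" */
    volatile uint32_t pending_intr;  /* monitor sets bits here */
};

static struct special_regs special_page;   /* monitor maps this into the VM */

/* What the patched guest kernel does instead of a privileged instruction. */
static void guest_disable_interrupts(void) { special_page.intr_enable = 0; }
static void guest_enable_interrupts(void)  { special_page.intr_enable = 1; }

/* Monitor side: only deliver a virtual interrupt if the guest has them on. */
static void monitor_post_interrupt(uint32_t bit)
{
    special_page.pending_intr |= bit;
    if (special_page.intr_enable)
        printf("deliver virtual interrupt 0x%x now\n", bit);
    else
        printf("interrupt 0x%x held until guest re-enables\n", bit);
}

int main(void)
{
    guest_disable_interrupts();
    monitor_post_interrupt(0x1);
    guest_enable_interrupts();
    monitor_post_interrupt(0x2);
    return 0;
}
```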
Disco implementation • all I/O devices are virtualized, including network connections and disks, and all access to them must pass through Disco to be translated or emulated.
Disco implementation • at only about 13,000 lines of code, there is greater opportunity to hand-tune the code • the small image size, only 72KB, also means a copy of Disco can reside in every node's local memory, so Disco text never has to be fetched at the slower remote (non-uniform) latency • machine-wide data structures are partitioned so the parts currently being used by a processor can reside in that processor's local memory
Disco implementation • scheduling VMs is similar to a traditional kernel scheduling processes, e.g., quantum size considerations, saving state in data structures, processor affinity, etc.
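Purely to illustrate the analogy with process scheduling, here is a toy C sketch of picking the next virtual CPU using a time quantum and processor affinity; the policy and structures are assumptions, not Disco's actual scheduler.

```c
/* Hedged sketch: scheduling virtual CPUs the way a kernel schedules
 * processes, with a quantum and simple processor affinity. Illustrative only. */
#include <stdio.h>

#define NCPU  4
#define NVCPU 6

struct vcpu {
    int id;
    int last_cpu;     /* affinity hint: prefer the processor it last ran on */
    int quantum_left; /* ticks remaining in current quantum */
};

static struct vcpu vcpus[NVCPU];

/* Pick the next virtual CPU for a physical processor, preferring one that
 * last ran here so its cached and node-local state is still warm. */
static struct vcpu *pick_next(int cpu)
{
    struct vcpu *fallback = NULL;
    for (int i = 0; i < NVCPU; i++) {
        if (vcpus[i].quantum_left <= 0)
            continue;                     /* needs a fresh quantum */
        if (vcpus[i].last_cpu == cpu)
            return &vcpus[i];             /* affinity match */
        if (!fallback)
            fallback = &vcpus[i];
    }
    return fallback;
}

int main(void)
{
    for (int i = 0; i < NVCPU; i++)
        vcpus[i] = (struct vcpu){ .id = i, .last_cpu = i % NCPU,
                                  .quantum_left = 10 };
    struct vcpu *v = pick_next(2);
    if (v)
        printf("cpu 2 runs vcpu %d (last ran on cpu %d)\n", v->id, v->last_cpu);
    return 0;
}
```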
Disco implementation: Virtual Physical Memory • Disco maintains a physical-to-machine address mapping • machine addresses are FLASH's 40-bit addresses
Disco implementation: Virtual Physical Memory • when a heavyweight OS tries to update the TLB, Disco steps in and applies the physical-to-machine translation; subsequent memory accesses can then go straight through the TLB • each VM has an associated pmap in the monitor • each pmap entry also has a back pointer to its virtual address, to help invalidate mappings in the TLB
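A minimal C sketch of this extra level of indirection: on an emulated TLB write, the monitor consults the VM's pmap to turn the guest's physical page into a machine page and records a back pointer for later invalidation. Field and function names (insert_real_tlb, backmap_vaddr) are illustrative assumptions, not Disco's interfaces.

```c
/* Hedged sketch of the physical-to-machine indirection applied when the
 * guest OS inserts a virtual-to-physical TLB entry. */
#include <stdint.h>
#include <stdio.h>

#define VM_PAGES 256            /* size of this toy VM's "physical" memory */

struct pmap_entry {
    uint64_t machine_page;      /* machine page backing this physical page */
    uint64_t backmap_vaddr;     /* virtual address using it, for invalidation */
};

struct vm {
    struct pmap_entry pmap[VM_PAGES];   /* physical page -> machine page */
};

/* Stand-in for writing an entry into the real hardware TLB. */
static void insert_real_tlb(uint64_t vaddr, uint64_t mpage)
{
    printf("TLB: va 0x%llx -> machine page 0x%llx\n",
           (unsigned long long)vaddr, (unsigned long long)mpage);
}

/* Emulate the guest's TLB write: rewrite physical to machine before the
 * entry reaches the real TLB; later accesses hit the TLB directly. */
static void emulate_tlb_write(struct vm *vm, uint64_t vaddr, uint64_t ppage)
{
    struct pmap_entry *e = &vm->pmap[ppage];
    e->backmap_vaddr = vaddr;                 /* remember who maps this page */
    insert_real_tlb(vaddr, e->machine_page);
}

int main(void)
{
    struct vm vm = {0};
    vm.pmap[5].machine_page = 0x12345;        /* set when the page was allocated */
    emulate_tlb_write(&vm, 0x400000, 5);
    return 0;
}
```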
Disco implementation: Virtual Physical Memory • MIPS has a tagged TLB, using address space identifiers (ASIDs) • ASIDs are not virtualized, so the TLB must be flushed on VM context switches • a second-level software TLB caches recent translations to lessen the cost of these flushes
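A sketch of what a second-level software TLB might look like: a small direct-mapped cache of recent virtual-to-machine translations consulted on a hardware TLB miss, before falling back to the slower full translation. The size and hashing are assumptions for illustration.

```c
/* Hedged sketch of a second-level software TLB that shortens TLB refill
 * after a flush. Sizes, hashing, and the walk function are illustrative. */
#include <stdint.h>
#include <stdio.h>

#define L2TLB_SIZE 1024

struct l2tlb_entry {
    uint64_t vpage;      /* virtual page number (0 = empty slot in this toy) */
    uint64_t mpage;      /* machine page number */
};

static struct l2tlb_entry l2tlb[L2TLB_SIZE];

/* Stand-in for the full virtual -> physical -> machine translation. */
static uint64_t slow_pmap_walk(uint64_t vpage)
{
    return vpage + 0x1000;
}

static uint64_t translate(uint64_t vpage)
{
    struct l2tlb_entry *e = &l2tlb[vpage % L2TLB_SIZE];
    if (e->vpage == vpage)
        return e->mpage;             /* hit: avoid the full walk after a flush */
    e->vpage = vpage;                /* miss: walk, then cache the result */
    e->mpage = slow_pmap_walk(vpage);
    return e->mpage;
}

int main(void)
{
    printf("first lookup : 0x%llx\n", (unsigned long long)translate(0x42));
    printf("second lookup: 0x%llx (served from software TLB)\n",
           (unsigned long long)translate(0x42));
    return 0;
}
```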
Disco implementation: Hiding NUMA • cache misses are served faster from local memory than from remote memory • pages heavily accessed by only one node are migrated to that node • read-shared pages are replicated to the nodes that frequently access them • write-shared pages are not moved, since maintaining consistency requires remote accesses anyway • the migration and replication policy is driven by cache miss counting
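A toy sketch of a miss-driven placement policy in this spirit: per-page, per-node miss counters trigger migration when one node dominates and replication when the page is read-shared. The threshold values and decision rule are illustrative assumptions, not the paper's actual policy.

```c
/* Hedged sketch of a cache-miss-counting placement policy. All constants
 * and the decision rule are illustrative. */
#include <stdio.h>

#define NNODES        4
#define HOT_THRESHOLD 64

struct page_stats {
    int misses[NNODES];   /* remote misses observed from each node */
    int writable;         /* write-shared pages are left in place */
};

static void on_remote_miss(struct page_stats *p, int node)
{
    if (++p->misses[node] < HOT_THRESHOLD || p->writable)
        return;                                   /* not hot, or not movable */

    int sharers = 0;
    for (int n = 0; n < NNODES; n++)
        if (p->misses[n] > HOT_THRESHOLD / 4)
            sharers++;

    if (sharers <= 1)
        printf("migrate page to node %d\n", node);      /* one dominant user */
    else
        printf("replicate page on node %d\n", node);    /* read-shared */

    for (int n = 0; n < NNODES; n++)
        p->misses[n] = 0;                               /* restart counting */
}

int main(void)
{
    struct page_stats p = {0};
    for (int i = 0; i < HOT_THRESHOLD; i++)
        on_remote_miss(&p, 2);                          /* node 2 hammers the page */
    return 0;
}
```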
Disco implementation: Hiding NUMA • memmap tracks, for each machine page, the virtual addresses that reference it; it is used during TLB shootdowns when pages are migrated or replicated
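A sketch of how a memmap-style structure could drive TLB shootdown: each machine page lists the virtual mappings that reference it, and all of them are invalidated before the page is moved or replicated. The fixed-size sharer list is a simplifying assumption.

```c
/* Hedged sketch of memmap-driven invalidation before page migration or
 * replication. Structure sizes and names are illustrative. */
#include <stdint.h>
#include <stdio.h>

#define MAX_SHARERS 8

struct memmap_entry {
    int      nsharers;
    int      vm_id[MAX_SHARERS];       /* which virtual machine */
    uint64_t vaddr[MAX_SHARERS];       /* which virtual address maps the page */
};

/* Stand-in for removing one stale entry from a processor's TLB. */
static void invalidate_tlb_entry(int vm, uint64_t vaddr)
{
    printf("shootdown: vm %d, va 0x%llx\n", vm, (unsigned long long)vaddr);
}

/* Called before moving or replicating a machine page. */
static void shootdown_page(struct memmap_entry *e)
{
    for (int i = 0; i < e->nsharers; i++)
        invalidate_tlb_entry(e->vm_id[i], e->vaddr[i]);
    e->nsharers = 0;
}

int main(void)
{
    struct memmap_entry e = { .nsharers = 2,
                              .vm_id = { 0, 1 },
                              .vaddr = { 0x400000, 0x7fff0000 } };
    shootdown_page(&e);
    return 0;
}
```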
Disco implementation: Virtualizing I/O • all device access is intercepted by the monitor • disk reads can be serviced by the monitor, and if the request size is a multiple of the machine's page size, the monitor only has to remap machine pages into the VM's physical memory address space • the remapped pages are marked read-only and generate a copy-on-write fault if written to
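A sketch of this copy-on-write sharing of disk reads, under the assumption of a simple block cache: if the requested block is already in machine memory, the monitor maps the existing page into the requesting VM read-only instead of copying it; the private copy is only made if a write faults later (not shown).

```c
/* Hedged sketch of remapping disk reads copy-on-write. The block cache and
 * mapping call are illustrative stand-ins, not Disco's device interface. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 4096
#define NBLOCKS   16

static uint8_t machine_pages[NBLOCKS][PAGE_SIZE];
static int     block_cached[NBLOCKS];      /* block already read by some VM? */

/* Map an existing machine page into the VM's physical space, read-only, so a
 * later write will fault and can be handled copy-on-write. */
static void map_readonly(int vm, int ppage, uint8_t *mpage)
{
    printf("vm %d: physical page %d -> shared machine page %p (read-only)\n",
           vm, ppage, (void *)mpage);
}

static void disk_read(int vm, int block, int ppage)
{
    if (!block_cached[block]) {
        memset(machine_pages[block], 0, PAGE_SIZE);   /* pretend DMA from disk */
        block_cached[block] = 1;
    }
    map_readonly(vm, ppage, machine_pages[block]);    /* no copy until a write */
}

int main(void)
{
    disk_read(0, 3, 10);    /* first VM reads block 3: goes to "disk" */
    disk_read(1, 3, 22);    /* second VM reads the same block: page is shared */
    return 0;
}
```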
IRIX • only small changes were required to the IRIX kernel, and those were due to a MIPS peculiarity • no new device drivers were needed • the hardware abstraction layer is where the trap, zeroed-page, unused-page, and VM descheduling optimizations were implemented
SPLASHOS • a thin OS, supported by Disco • used for parallel scientific applications
Experimental Results • since FLASH was not available, experiments were run on SimOS, a machine simulator • the simulator was too slow, compared to the actual machine, to allow long workloads to be studied
Single VM • ran the four workloads on plain IRIX inside the simulator and with a single VM running IRIX; the VM showed a 3%-16% slowdown
Memory Overhead • ran pmake on 8 physical processors with six different configurations: plain IRIX, 1 VM, 2 VMs, 4 VMs, 8 VMs, and 8 VMs communicating over NFS • demonstrates • 8 VMs required less than twice the physical memory of plain IRIX • sharing machine pages through the physical-to-machine mapping is a useful optimization
Scalability tests • compared the performance of pmake under the previously described configurations • summary: while 1 VM showed a significant slowdown (36%), using 8 VMs showed a significant speedup (40%) • also ran a radix sorting algorithm on plain IRIX and on SPLASHOS/Disco; run time was reduced by two thirds
Page Migration and Replication • the engineering workload ran on 8 processors, raytrace on 16 • a UMA machine is the theoretical lower bound
Conclusion • a nanokernel is several orders of magnitude smaller than a heavyweight OS, yet can run virtually unmodified OSes inside virtual machine monitors • the problems of execution overhead and memory footprint were addressed