http://lmncp.uniparthenope.it http://dsa.uniparthenope.it A GPGPU transparent virtualization component for high performance computing clouds G. Giunta, Raffaele Montella, G. Agrillo, G. Coviello University of Napoli Parthenope, Department of Applied Science {giunta,montella,agrillo,coviello}@uniparthenope.it
uniParthenope • One of the five Universities in Napoli (Italy) • 20K students • 5 faculties • Science and Technologies • Engineering • Economy • Law • Sports & Health http://www.uniparthenope.it
Summary • Introduction • System Architecture and Design • Performance evaluation • Conclusions and developments gVirtuS: GPGPU virtualization service
Introduction & Contextualization • High Performance Computing: • Stack of technologies enabling software that demands high performance computing resources • Grid computing: • Stack of technologies enabling resource sharing and aggregation • Manycore: • The "enforcement" of Moore's law • GPGPUs: • Efficient and cost-effective high performance computing using manycore graphics processing units • Virtualization: • Hardware and software resource abstraction • One of the killer applications of manycore CPUs • Cloud computing: • Stack of technologies enabling hosting on virtualized resources • On-demand resource virtualization • Pay as you go
High Performance Cloud Computing • Hardware: • High performance computing cluster • Multicore / multiprocessor computing nodes • GPGPUs • Software: • Linux • Virtualization hypervisor • Private cloud management software • + Special ingredients…
gVirtuS • GPU Virtualization Service • Built on top of the nVidia CUDA APIs • Hypervisor independent • Uses a front-end (FE) / back-end (BE) approach • FE/BE communicator independent The key properties of the proposed system are: 1. Enabling CUDA kernel execution in a virtualized environment 2. With overall performance not too far from un-virtualized machines
System Architecture and Design • The CUDA device is under the control of the hypervisor • Interface between guest and host machine • Any GPU access is routed via the FE/BE • The management component controls invocation and data movement
The Communicator • Provides high performance communication between virtual machines and their hosts. • The choice of the hypervisor deeply affects the efficiency of the communication.
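As a rough illustration, a communicator can be thought of as a small transport-neutral interface with one concrete implementation per channel (TCP, AF_UNIX, VMCI, vmSocket). The class and method names below are a minimal sketch, not the actual gVirtuS API:

    // Hypothetical transport-neutral channel between FE and BE; one
    // concrete subclass per transport (TCP, AF_UNIX, VMCI, vmSocket),
    // so the rest of the stack never depends on the hypervisor in use.
    #include <cstddef>

    class Communicator {
    public:
        virtual ~Communicator() {}
        virtual void Connect() = 0;                        // FE side
        virtual void Accept() = 0;                         // BE side
        virtual void Write(const void *buf, size_t n) = 0; // send marshalled calls
        virtual void Read(void *buf, size_t n) = 0;        // receive replies
        virtual void Close() = 0;
    };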
How gVirtuS works • CUDA library: • deals directly with the hardware accelerator • interacts with a GPU virtualization front end • The front end: • packs the library function invocation • sends it to the back end • The back end: • deals with the hardware using the CUDA driver • unpacks the library function invocation • maps memory pointers • executes the CUDA operation • retrieves the results • sends them to the front end using the communicator • The front end: • completes the GPU operation on behalf of the CUDA library • provides the results to the calling program • This design is: • hypervisor independent • communicator independent • accelerator independent • The same approach could be followed to implement different kinds of virtualization.
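To make the FE/BE round trip concrete, here is a minimal sketch of how a single runtime call such as cudaMalloc might be intercepted on the guest. Buffer, Frontend::Execute and the wire format are illustrative assumptions, not the actual gVirtuS code:

    // The guest links against this fake libcudart instead of nVidia's:
    // the stub marshals the call, ships it to the BE, and unpacks the
    // reply. Buffer and Frontend::Execute are hypothetical names.
    #include <cuda_runtime.h>   // for cudaError_t; no GPU needed on the guest

    extern "C" cudaError_t cudaMalloc(void **devPtr, size_t size) {
        Buffer request;
        request.AddString("cudaMalloc");            // routine to run on the BE
        request.Add(size);                          // marshalled input argument
        Buffer reply = Frontend::Execute(request);  // round trip over the communicator
        *devPtr = reply.GetPointer();               // opaque handle, valid only on the BE
        return reply.GetExitCode();                 // the BE's cudaError_t
    }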
Choices and Motivations • We focused on the VMware and KVM hypervisors. • vmSocket is the component we designed to obtain a high performance communicator. • vmSocket exposes Unix Sockets on virtual machine instances thanks to a QEMU device connected to the virtual PCI bus.
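From the guest's point of view the channel is just an ordinary AF_UNIX socket, even though the vmSocket QEMU device carries the traffic over the virtual PCI bus. A minimal sketch of how the FE side might open it (the socket path is a made-up example):

    // Plain POSIX code: nothing here reveals that the peer lives on
    // the host rather than inside the guest.
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <string.h>
    #include <unistd.h>

    int ConnectToBackend(const char *path /* e.g. "/tmp/gvirtus" */) {
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (fd < 0) return -1;
        sockaddr_un addr;
        memset(&addr, 0, sizeof(addr));
        addr.sun_family = AF_UNIX;
        strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
        if (connect(fd, (sockaddr *)&addr, sizeof(addr)) < 0) {
            close(fd);
            return -1;
        }
        return fd;   // handed to the communicator's Read()/Write()
    }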
vmSocket: virtual PCI device • Programming interface: • Unix Socket • Communication between guest and host: • Virtual PCI interface • QEMU has been modified • GPU based high performance computing applications usually require massive data transfers between host (CPU) memory and device (GPU) memory… • FE/BE interaction efficiency: • there is no mapping between guest memory and device memory • device memory pointers are never de-referenced on the host (CPU) side • CUDA kernels are executed on the BE, where the pointers are fully consistent.
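The consequence is that data crosses the FE/BE boundary only on explicit copies. A sketch of how cudaMemcpy might be forwarded, using the same hypothetical Buffer/Frontend names as before (only the two host/device kinds are shown):

    // The host buffer's bytes travel with the request or the reply;
    // the device pointer travels only as an opaque handle that the
    // real CUDA driver de-references on the BE.
    #include <cuda_runtime.h>

    extern "C" cudaError_t cudaMemcpy(void *dst, const void *src,
                                      size_t count, cudaMemcpyKind kind) {
        Buffer request;
        request.AddString("cudaMemcpy");
        request.Add(kind);
        if (kind == cudaMemcpyHostToDevice) {
            request.AddPointer(dst);       // device handle, meaningful only on the BE
            request.AddBytes(src, count);  // payload crosses the boundary here
        } else {                           // cudaMemcpyDeviceToHost
            request.AddPointer(src);       // device handle to read from
            request.Add(count);
        }
        Buffer reply = Frontend::Execute(request);
        if (kind == cudaMemcpyDeviceToHost)
            reply.GetBytes(dst, count);    // payload comes back with the reply
        return reply.GetExitCode();
    }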
Performance Evaluation • CUDA workstation • Genesis GE-i940 Tesla • i7-940 CPU (2.93 GHz, 133 MHz FSB, quad core, hyper-threaded, 8 MB cache) and 12 GB RAM • 1 nVIDIA Quadro FX5800 video card with 4 GB RAM • 2 nVIDIA Tesla C1060 with 4 GB RAM each • The testing system: • Fedora 12 Linux • nVIDIA CUDA driver and SDK/Toolkit version 2.3 • VMware vs. KVM/QEMU (using different communicators)
…from CUDA SDK… • ScalarProd computes k scalar products of two real vectors of length m. Each product is executed by a single CUDA thread on the GPU, so no synchronization is required. • MatrixMul computes a matrix multiplication. The matrices are m×n and n×p, respectively. It partitions the input matrices in blocks and associates a CUDA thread to each block. As in the previous case, there is no need for synchronization. • Histogram returns the histogram of a set of m uniformly distributed real random numbers in 64 bins. The set is distributed among the CUDA threads, each computing a local histogram. The final result is obtained through synchronization and reduction techniques.
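For a feel of the kind of code being pushed through gVirtuS, here is a simplified kernel in the spirit of the ScalarProd sample (a sketch, not the SDK source): thread i computes the whole i-th product, so threads never interact:

    // k independent scalar products of length-m vector pairs; one
    // CUDA thread per product, hence no synchronization at all.
    __global__ void scalarProd(float *out, const float *a, const float *b,
                               int k, int m) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < k) {
            float sum = 0.0f;
            for (int j = 0; j < m; ++j)
                sum += a[i * m + j] * b[i * m + j];
            out[i] = sum;
        }
    }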
Test cases • host/cpu: CPU without virtualization (no gVirtuS) • host/gpu: GPU without virtualization (no gVirtuS) • host/afunix: GPU without virtualization (with gVirtuS); measures the impact of the gVirtuS stack • host/tcp: GPU without virtualization (with gVirtuS); measures the impact of the communication stack • */cpu: CPU in a virtualized environment (no gVirtuS) • */tcp: GPU in a virtualized environment (with gVirtuS) • vmware/vmci: GPU in a VMware virtual machine with gVirtuS using the VMCI based communicator • kvm/vmsocket: GPU in a KVM/QEMU virtual machine with gVirtuS using the vmSocket based communicator
About Results • Virtualization does not heavily affect computing performance • gVirtuS kvm/vmsocket gives the best efficiency, with the least impact with respect to the raw host/gpu setup • The tcp based communicator could be used in a production scenario: • The problem size and the computing speed-up justify the poorer communication performance
HPCC: High Performance Cloud Computing • Intel based 12-node computing cluster • Each node: • quad core 64 bit CPU / 4 GB of RAM • nVIDIA GeForce 9400 GT video card with 16 CUDA cores and 1 GB of memory • Software stack: • Fedora 12 • Eucalyptus • KVM/QEMU • gVirtuS
HPCC Performance Evaluation • Ad hoc benchmark • Matrix multiplication algorithm • Classic distributed memory parallel approach • The first matrix is distributed by rows, the second one by columns • Each process performs a local matrix multiplication • MPICH2 as the message passing interface among processes • Each process uses the CUDA library to perform the local matrix multiplication
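A hypothetical skeleton of such a benchmark (assumptions: N divisible by the number of ranks, and a gpuLocalMatMul() stand-in for the local CUDA multiplication that gVirtuS forwards from the guest to the physical GPU). The column blocks rotate around a ring so each rank completes its row block of C:

    #include <mpi.h>

    // rows x N row-major block of A times N x cols block of B into a
    // rows x cols block of C with leading dimension ldc; hypothetical
    // routine backed by a CUDA kernel (and hence by gVirtuS in a VM).
    void gpuLocalMatMul(const float *A, const float *B, float *C,
                        int rows, int N, int cols, int ldc);

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int N = 1024;                   // assume size divides N
        const int rows = N / size, cols = N / size;
        float *Arows = new float[rows * N];   // my rows of A
        float *Bcols = new float[N * cols];   // my columns of B
        float *Crows = new float[rows * N];   // my rows of the result
        // ... root scatters A by rows and B by columns (MPI_Scatter) ...

        const int right = (rank + 1) % size, left = (rank + size - 1) % size;
        for (int step = 0; step < size; ++step) {
            int owner = (rank - step + size) % size;   // whose columns we hold now
            gpuLocalMatMul(Arows, Bcols, &Crows[owner * cols],
                           rows, N, cols, N);          // local product on the GPU
            MPI_Sendrecv_replace(Bcols, N * cols, MPI_FLOAT, right, 0,
                                 left, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        // ... root gathers the row blocks of C (MPI_Gather) ...

        delete[] Arows; delete[] Bcols; delete[] Crows;
        MPI_Finalize();
        return 0;
    }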
Future Directions • Enable shared memory communication between host and guest machines in order to speed up host-to-device and device-to-host memory copies from the virtual machine. • Implement OpenGL interoperability to integrate gVirtuS and VMGL for 3D graphics virtualization. • Integrate MPICH2 with vmSocket in order to implement a high performance message passing standard interface.
Conclusions • The gVirtuS GPU virtualization and sharing system enables thin Linux based virtual machines to be accelerated by the computing power provided by nVIDIA GPUs. • The gVirtuS stack makes it possible to accelerate virtual machines with a small impact on overall performance with respect to a pure host/gpu setup. • gVirtuS can be easily extended to other CUDA enabled devices. • This approach is based on highly proprietary and closed-source nVIDIA products. Download, Try & Contribute! http://osl.uniparthenope.it/projects/gvirtus/
gVirtuS implementation (1/2) • The BE runs on the host device • The FE runs on the virtual machine • gVirtuS is implemented in C++ • BE and FE run as daemons
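A rough sketch of the BE daemon's dispatch loop, using the same illustrative Communicator/Buffer names as the earlier sketches (not the actual gVirtuS classes):

    #include <map>
    #include <string>

    typedef Buffer (*Handler)(Buffer &request);        // one per CUDA routine

    void BackendLoop(Communicator &comm,
                     std::map<std::string, Handler> &handlers) {
        comm.Accept();                                 // wait for a front end
        for (;;) {
            Buffer request = ReadBuffer(comm);         // e.g. "cudaMalloc" + args
            std::string routine = request.GetString(); // which routine was called
            Buffer reply = handlers[routine](request); // runs on the real GPU
            WriteBuffer(comm, reply);                  // result + exit code to FE
        }
    }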
gVirtuS implementation (2/2) • The FE class diagram