
Harmony: A Run-Time for Managing Accelerators Sponsor: LogicBlox Inc.

This presentation discusses the software challenges that heterogeneity poses for programming and execution models when managing accelerators. The Harmony run-time aims to provide portability, performance, and a pooled-accelerator execution model for heterogeneous systems. The key idea is to deploy accelerator kernels based on inter-kernel dependencies and to supply multiple kernel implementations, one per target accelerator. A preliminary performance evaluation shows low overhead and good scalability. The presentation also explores extensions to FPGAs and virtualization of accelerator resources.


Presentation Transcript


  1. Harmony: A Run-Time for Managing Accelerators. Sponsor: LogicBlox Inc. Gregory Diamos and Sudhakar Yalamanchili

  2. Software Challenges of Heterogeneity • Programming model • Execution model • Portability • Performance

  3. Pooled Accelerator Execution Model Instance • Heterogeneous multiprocessor systems are viewed as a pool of processors, each potentially with a unique ISA and system interface • Applications that make full use of these systems must include binaries compatible with each accelerator ISA

  4. Execution Model [diagram: a source program with a control thread is compiled, through the HVM compilation environment, into accelerator-based code segments, each compiled for a specific device/driver combination; kernels and stream elements flow over a memory stream to devices such as a multicore processor and an accelerator with local memory, FIFO, cache, and DMA, configured by a system architecture description] The architecture description specifies the configuration of accelerators and processors and communicates QoS requirements.
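The system architecture description on this slide, which specifies the configuration of accelerators and processors along with QoS requirements, might be sketched as a simple data structure. All field names below are illustrative, not Harmony's actual format:

```python
# Hypothetical sketch of a Harmony-style machine-model description.
# Field names are illustrative, not Harmony's actual format.
machine_model = {
    "processors": [
        {"name": "cpu0", "type": "multicore", "isa": "x86_64", "cores": 4},
        {"name": "gpu0", "type": "accelerator", "isa": "ptx",
         "local_memory_mb": 512, "interface": "pcie"},
    ],
    "qos": {"max_latency_ms": 10},
}

def accelerators(model):
    """Return the names of all accelerator devices in the model."""
    return [p["name"] for p in model["processors"]
            if p["type"] == "accelerator"]

print(accelerators(machine_model))  # -> ['gpu0']
```

A description like this lets the run-time enumerate devices at startup instead of baking the system configuration into the application.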

  5. Goals of Harmony • Low overhead: comparable to or better than hand-tuned applications • System-configuration agnostic: correct execution on a system with any number and type of heterogeneous architectures, with no code modification required • Scalable: EP application performance should scale with the number of devices • Familiar: requires no more than the current programming model of threaded applications for homogeneous architectures

  6. Key Idea • Accelerator kernel deployment based on static and dynamic inter-kernel dependencies • Inspired by ILP scheduling techniques • Kernels are "issued" to accelerators and their execution is "committed" to release dependent kernels [diagram: kernel operations from the application enter a buffer; dependence resolution marks them ready for issue]
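The issue/commit model on this slide can be sketched as a small dependence-driven scheduler: a kernel becomes ready once every kernel it depends on has committed, mirroring how an out-of-order processor releases dependent instructions. This is a minimal illustration, not Harmony's implementation:

```python
from collections import deque

class Kernel:
    """A unit of work with data dependencies on other kernels."""
    def __init__(self, name, deps=()):
        self.name = name
        self.deps = set(deps)   # names of kernels this one waits on

def issue_order(kernels):
    """Issue kernels whose dependencies have committed, in program order."""
    committed = set()
    pending = deque(kernels)
    order = []
    while pending:
        progressed = False
        for _ in range(len(pending)):
            k = pending.popleft()
            if k.deps <= committed:        # all producers have committed
                order.append(k.name)
                committed.add(k.name)       # commit releases dependents
                progressed = True
            else:
                pending.append(k)           # not ready yet; retry later
        if not progressed:
            raise ValueError("cyclic kernel dependencies")
    return order

# Example: c depends on a and b; d depends on c.
ks = [Kernel("a"), Kernel("b"), Kernel("c", {"a", "b"}), Kernel("d", {"c"})]
print(issue_order(ks))  # -> ['a', 'b', 'c', 'd']
```

In the real run-time, "issue" dispatches a kernel to a device and "commit" happens when its results are validated, but the dependence bookkeeping follows the same shape.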

  7. Harmony Architecture & Operation

  8. Harmony Runtime Operation • Accelerator kernels are mapped to specific architectures based on: the architectures in the system, the available implementations, and performance • Results are forwarded to waiting functions; can support speculation • Results are committed in order
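The mapping decision described on this slide, choosing a device from the architectures present, the implementations available, and observed performance, might be sketched as follows. The data shapes and names are illustrative assumptions, not Harmony's API:

```python
def map_kernel(kernel, system_devices, implementations, perf_history):
    """Pick a device for `kernel` (illustrative sketch).

    implementations: {kernel_name: {device_name: callable}}
    perf_history:    {(kernel_name, device_name): runtime_seconds}
    """
    # Only devices that are both present and have an implementation qualify.
    candidates = [d for d in system_devices
                  if d in implementations.get(kernel, {})]
    if not candidates:
        raise RuntimeError("no implementation of %s for this system" % kernel)
    # Prefer the device with the lowest observed runtime; devices without
    # a measurement default to 0.0 so they get tried at least once.
    return min(candidates,
               key=lambda d: perf_history.get((kernel, d), 0.0))

devices = ["cpu0", "gpu0"]
impls = {"gemm": {"cpu0": None, "gpu0": None}}
history = {("gemm", "cpu0"): 0.90, ("gemm", "gpu0"): 0.12}
print(map_kernel("gemm", devices, impls, history))  # -> gpu0
```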

  9. Application Development • Programmer-supplied (Harmony) checks on entry to and exit from accelerator kernels • Marshalling of operands when an accelerator kernel is invoked • May employ multiple (static) implementations corresponding to multiple accelerators
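Supplying multiple static implementations per kernel, as this slide describes, could look like the following sketch: one variant is registered per device type, and an invoke path marshals operands and checks on entry before dispatching. The registry and decorator are hypothetical, not Harmony's actual interface:

```python
# Illustrative registry mapping kernel name -> device -> implementation.
KERNEL_REGISTRY = {}

def kernel_impl(name, device):
    """Decorator registering `fn` as kernel `name`'s variant for `device`."""
    def register(fn):
        KERNEL_REGISTRY.setdefault(name, {})[device] = fn
        return fn
    return register

@kernel_impl("saxpy", "cpu")
def saxpy_cpu(a, x, y):
    return [a * xi + yi for xi, yi in zip(x, y)]

@kernel_impl("saxpy", "gpu")
def saxpy_gpu(a, x, y):
    # Stand-in for a GPU kernel launch; same semantics as the CPU path.
    return [a * xi + yi for xi, yi in zip(x, y)]

def invoke(name, device, *args):
    """Check on entry that a variant exists, then dispatch to it."""
    assert name in KERNEL_REGISTRY and device in KERNEL_REGISTRY[name]
    return KERNEL_REGISTRY[name][device](*args)

print(invoke("saxpy", "gpu", 2.0, [1.0, 2.0], [3.0, 4.0]))  # -> [5.0, 8.0]
```

The point of the pattern is that the calling code names only the kernel; the run-time is free to pick whichever registered variant matches the chosen device.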

  10. Preliminary Performance Evaluation [chart: matrix multiplication under Harmony shows 3.1% and 3.8% runtime overhead]

  11. Scheduling Overhead

  12. Extensions to FPGAs • Maintain the base Harmony deployment model (accelerator pools) • Associate a Harmony thread with each FPGA-based accelerator • Virtualize the FPGA fabric: demand-driven vs. static configuration of the fabric; adapt existing register-allocation-based scheduling techniques • Example: virtualized packet schedulers (sponsor: RNET Technologies); see the poster session
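The slide's analogy between fabric virtualization and register allocation can be illustrated with a toy allocator: accelerator bitstreams compete for a fixed number of reconfigurable regions, and a region is reclaimed (the analogue of spilling a register) when none is free. This is a sketch of the idea with an assumed LRU policy, not the scheduling technique the authors adapted:

```python
from collections import OrderedDict

class FabricAllocator:
    """Demand-driven FPGA slot allocation, sketched as LRU replacement."""
    def __init__(self, num_slots):
        self.num_slots = num_slots
        self.resident = OrderedDict()   # bitstream -> slot, in LRU order

    def acquire(self, bitstream):
        """Return the slot holding `bitstream`, configuring it on demand."""
        if bitstream in self.resident:           # already configured
            self.resident.move_to_end(bitstream)
            return self.resident[bitstream]
        if len(self.resident) < self.num_slots:  # a free region exists
            slot = len(self.resident)
        else:                                    # evict least recently used
            _, slot = self.resident.popitem(last=False)
        self.resident[bitstream] = slot          # "configure" the region
        return slot

fabric = FabricAllocator(num_slots=2)
print(fabric.acquire("fft"))      # -> 0
print(fabric.acquire("encrypt"))  # -> 1
print(fabric.acquire("decrypt"))  # -> 0  (evicts "fft")
```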

  13. FPGA-Based Accelerator Architecture [diagram: a PCIe/HyperTransport/CSI interface and a memory controller for volatile (DRAM) and nonvolatile (FLASH) memory connect through network interfaces (NIs) and switches to a PowerPC core and to FFT, Encrypt, and Decrypt accelerators]

  14. Accelerator Configuration [diagram: Harmony threads on the host (DRAM) drive accelerators through the host driver and the PCIe/HyperTransport/CSI interface; the on-chip network of NIs and switches connects the PowerPC, FFT, Encrypt, and Decrypt units to volatile (DRAM) and nonvolatile (FLASH) memory] Address translation in the NI allows isolated paths between accelerators and memory.

  15. Heterogeneous Virtual Machines (Looking Ahead) PIs: A. Gavrilovska, K. Schwan, S. Yalamanchili • Virtualization of accelerator resources • Consolidation and sharing of accelerators [diagram: guest OSes and user software run on a virtual machine monitor that manages CPU and accelerator hardware resources (FIFOs, local memories, caches, DMA engines) over a network, providing isolation, security, and support for legacy systems]

  16. Questions?
