This paper presents an execution model tailored for heterogeneous multicore architectures, addressing the software challenges of portability and performance. The Harmony runtime manages execution across varying system sizes and configurations, providing transparent scheduling and management of kernel chunks. By balancing workload across multiple GPUs and multicore CPUs, and by relying on runtime translation of a data-parallel intermediate representation, the model aims to minimize application retuning as systems change. Optimization techniques such as speculation and kernel fusion are also discussed, building on emerging parallel programming environments such as CUDA and OpenCL.
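To make the kernel-level model concrete, here is a minimal host-side sketch of dependency-driven kernel dispatch: each kernel names the buffers it reads and writes, and the runtime launches a kernel only once its inputs have been produced. The Kernel and Runtime names and the serialized dispatch are illustrative assumptions, not the Harmony API, and the sketch omits the speculation and kernel-fusion optimizations mentioned above.

// Sketch only: kernels declare the buffers they read/write; the runtime
// dispatches a kernel once every buffer it reads has been produced.
#include <functional>
#include <set>
#include <string>
#include <vector>

struct Kernel {
    std::string name;
    std::set<std::string> reads;    // input buffers
    std::set<std::string> writes;   // output buffers
    std::function<void()> body;     // CPU or accelerator implementation
};

class Runtime {
public:
    // Buffers supplied by the application (e.g., readInputs()) are ready up front.
    void declareInput(const std::string& b) { produced_.insert(b); }
    void submit(Kernel k) { pending_.push_back(std::move(k)); }

    // Repeatedly launch every kernel whose inputs are already available.
    void run() {
        bool progress = true;
        while (!pending_.empty() && progress) {
            progress = false;
            for (auto it = pending_.begin(); it != pending_.end();) {
                if (ready(*it)) {
                    it->body();   // a real runtime would enqueue this on a CPU or GPU worker
                    for (const auto& b : it->writes) produced_.insert(b);
                    it = pending_.erase(it);
                    progress = true;
                } else {
                    ++it;
                }
            }
        }
    }

private:
    bool ready(const Kernel& k) const {
        for (const auto& b : k.reads)
            if (!produced_.count(b)) return false;
        return true;
    }
    std::vector<Kernel> pending_;
    std::set<std::string> produced_;
};

Because dependencies are expressed through buffers rather than processor types, the same submission order can be replayed on a single GPU, multiple GPUs, or CPUs only, which is the portability property the summary above describes.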
An Execution Model for Heterogeneous Multicore Architectures
Gregory Diamos, Andrew Kerr, and Sudhakar Yalamanchili
Computer Architecture and Systems Laboratory
Center for Experimental Research in Computer Systems
School of Electrical and Computer Engineering, Georgia Institute of Technology
Software Challenges of Heterogeneity
• Programming Model
• Execution Model
• Portability
• Performance
System Space
[Figure: level of abstraction versus system size and configuration. The runtime execution model (Harmony) and runtime translation of a data-parallel IR (Ocelot) span configurations ranging from a single GPU with a multicore CPU, to multiple GPUs with multicore CPUs, to multi-node systems.]
Scalable Portable Execution – Harmony Runtime
Example application (Cap Model 3):
readInputs();
computeInvariants();
for all chunks { simulateChunk(); }
generateResults();
[Figure: application chunks become kernels with explicit inputs and outputs in memory; the Harmony run-time provides transparent scheduling and execution management of these chunks across CPUs and accelerators, each fed by its own FIFO and equipped with local memory, cache, and DMA, connected over a network (e.g., HyperTransport, QPI, PCIe); binary compatibility is preserved across system sizes.]
• Minimize/avoid retuning and porting applications as accelerators are added (see the sketch below)
• Advanced optimizations: speculation, performance prediction, kernel fusion
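A minimal sketch of how the chunk loop above could be spread across processors, assuming plain threads stand in for CPU cores and accelerators and a shared atomic counter stands in for the per-processor FIFOs in the figure; the function and parameter names are illustrative, not Harmony's interface.

#include <atomic>
#include <functional>
#include <thread>
#include <vector>

// Distribute "for all chunks { simulateChunk(); }" over several workers. Each
// worker stands in for a CPU core or an accelerator; chunks are claimed from a
// shared counter, so faster processors naturally take more of the work.
void runChunks(int numChunks, int numWorkers,
               const std::function<void(int)>& simulateChunk) {
    std::atomic<int> next{0};                 // index of the next unclaimed chunk
    std::vector<std::thread> workers;
    for (int w = 0; w < numWorkers; ++w) {
        workers.emplace_back([&] {
            for (int c = next.fetch_add(1); c < numChunks; c = next.fetch_add(1))
                simulateChunk(c);             // a GPU worker would launch a device kernel here
        });
    }
    for (auto& t : workers) t.join();         // all chunks finish before generateResults()
}

Self-scheduling of this kind is one way to get the load balancing described above without retuning the application for each system size.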
Emerging Environment
[Figure: tool-chain overview. Language front ends for Datalog (status: Summer 2009, with Prof. Nate Clark) and CUDA/OpenCL lower to a common Kernel IR. The Harmony run-time (status: single node/multi-GPU) executes kernels through Ocelot back ends: an emulator (status: test and debug), an LLVM interface targeting supported ISAs such as MIPS, SPARC, and x86 (status: in progress, Fall 2009), a CUDA JIT (Prof. H. Kim), and a GPGPU simulator.]
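The Ocelot layer in this stack can be pictured as a dispatch over back ends for the kernel IR. The sketch below shows only the shape of that decision; the enum, struct, and printf stubs are illustrative assumptions, not the Ocelot API, and the real translation work is elided.

#include <cstdio>
#include <string>

// Stand-in for a kernel expressed in the data-parallel IR (e.g., PTX text).
struct KernelIR { std::string name; std::string ptx; };

// Back ends mirroring the figure: an emulator for test/debug, an LLVM path for
// CPU ISAs (x86, SPARC, MIPS, ...), and a GPU JIT. Bodies are stubs.
enum class Backend { Emulator, LLVMCpu, GpuJit };

void launch(const KernelIR& k, Backend b) {
    switch (b) {
        case Backend::Emulator:
            std::printf("emulating %s instruction by instruction\n", k.name.c_str());
            break;
        case Backend::LLVMCpu:
            std::printf("translating %s for the host ISA via LLVM\n", k.name.c_str());
            break;
        case Backend::GpuJit:
            std::printf("handing %s to the device driver/JIT\n", k.name.c_str());
            break;
    }
}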
Emerging HVM Platform Architecture (with K. Schwan and A. Gavrilovska)
Problem Scaling – Risk Analysis Application
[Figure: measured execution times; GPU interactive overhead dominates.]
With the latest CPUs (2x faster) and GPUs (4x faster), the GPU advantage should grow by 2x.
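The 2x projection is simple scaling arithmetic: if a newer GPU cuts kernel time by 4x while a newer CPU cuts its time by 2x, the CPU-to-GPU ratio grows by 4/2 = 2 (ignoring the interaction overhead noted above). A small worked example follows; the baseline times are placeholders, and only the stated 2x and 4x factors come from the slide.

#include <cstdio>

int main() {
    double cpuTime = 100.0, gpuTime = 10.0;   // placeholder baseline measurements
    double cpuGain = 2.0, gpuGain = 4.0;      // hardware speedups stated above
    double oldAdvantage = cpuTime / gpuTime;                          // 10x here
    double newAdvantage = (cpuTime / cpuGain) / (gpuTime / gpuGain);  // 20x here
    std::printf("GPU advantage: %.1fx -> %.1fx (grew by %.1fx)\n",
                oldAdvantage, newAdvantage, newAdvantage / oldAdvantage);
    return 0;
}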
GPU Compilation Flow
[Figure: the abstract syntax tree (Datalog clauses) is lowered in stages: clauses to execution units (grouped into an execution group of GPU EUs), predicates to data structures, and execution units to algorithms (compute kernels); the runtime then maps kernels to GPU or CPU cores.]
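The last stage, runtime mapping of kernels to cores, could be driven by a heuristic like the one sketched below: send a kernel to a GPU core only when its data-parallel work is large enough to amortize transfer and launch overhead. The thresholds and names are assumptions for illustration; the actual policy is not specified here, and an earlier slide points to performance prediction as the intended mechanism.

#include <cstddef>

enum class ProcessorKind { CpuCore, GpuCore };

// Minimal description of a compiled kernel for the purpose of placement.
struct CompiledKernel {
    std::size_t parallelElements;   // independent work items in the kernel
    std::size_t bytesToTransfer;    // data that must be staged into device memory
};

// Hypothetical placement heuristic: prefer the GPU for wide, transfer-cheap kernels.
ProcessorKind mapToCore(const CompiledKernel& k,
                        std::size_t minParallelism = 1 << 14,
                        std::size_t maxTransferBytes = std::size_t(1) << 26) {
    if (k.parallelElements >= minParallelism && k.bytesToTransfer <= maxTransferBytes)
        return ProcessorKind::GpuCore;
    return ProcessorKind::CpuCore;
}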