
Qilin: Exploiting Parallelism on Heterogeneous Multiprocessors with Adaptive Mapping


Presentation Transcript


  1. Qilin: Exploiting Parallelism on Heterogeneous Multiprocessors with Adaptive Mapping
  Chi-Keung (CK) Luk, Technology Pathfinding and Innovation, Software Solutions and Services Group, Intel
  Sunpyo Hong, Electrical and Computer Engineering, Georgia Institute of Technology
  Hyesoon Kim, School of Computer Science, College of Computing, Georgia Institute of Technology

  2. Heterogeneous Architectures
  Heterogeneous architectures are increasingly popular. Platforms in use:
  • Intel Core2 + Nvidia GPU
  • IBM's Cell processor
  • NHM (Nehalem) + Larrabee

  3. Software Challenge
  [Diagram: a CPU + GPU system; the CPU has Core-0 through Core-3, and the GPU is an array of SIMD processors]
  • The Mapping Problem: map computations to PEs to optimize an objective function, which could be:
  • Performance
  • Energy
  • Performance / Energy

  4. Existing Solutions to the Mapping Problem
  The programmer performs the mapping manually and statically. Examples:
  • IBM XL compiler extension that supports OpenMP on the Cell
  • Intel CTG's ExoCHI/Merge framework for programming the CPU and GPU
  Disadvantages:
  • Labor intensive
  • Not adaptable to changes in the runtime environment

  5. Outline
  • Introduction
  • Case Study
  • Adaptive Mapping
  • Experimental Evaluation
  • Conclusions

  6. Case Study: Matrix Multiplication
  • Heterogeneous machine used:
  • CPU: dual-socket quad-core (max = 8 cores)
  • GPU: Nvidia GTX 8800
  • Three configurations tested:
  1. Small problem size, max CPU cores used
  2. Big problem size, max CPU cores used
  3. Big problem size, fewer CPU cores used
  • In each configuration: perform cooperative matrix multiplication, varying the distribution of work between the CPU and GPU

  7. Cooperative Matrix Multiplication
  [Diagram: C = A x B. A is split row-wise into A1 and A2; the CPU computes C1 = A1 x B while the GPU computes C2 = A2 x B]
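Written out, the row-wise split shown in the diagram is:

```latex
A = \begin{bmatrix} A_1 \\ A_2 \end{bmatrix}, \qquad
C = AB = \begin{bmatrix} A_1 B \\ A_2 B \end{bmatrix}
       = \begin{bmatrix} C_1 \\ C_2 \end{bmatrix}
```

Since each half of A multiplies the full B, the two products C1 (on the CPU) and C2 (on the GPU) are independent and need no synchronization until the results are concatenated.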

  8. Cooperative Matrix Multiplication Results
  [Charts for the three configurations: (1) matrix dimension = 1000, 8 CPU cores; (2) matrix dimension = 6000, 8 CPU cores; (3) matrix dimension = 6000, 2 CPU cores]
  • Lessons learned:
  • The optimal PE mapping depends on the application, the input size, and the hardware resources available
  • We need an automatic and dynamic technique that takes all these factors into account
  • Our contribution: ADAPTIVE MAPPING

  9. Adaptive Mapping
  A technique to automatically find a near-optimal mapping for the given program, problem size, and hardware.
  Each <program, hardware> configuration involves one training run and many reference runs:
  • Training run: find the execution-time projections of the CPU and the GPU for the given configuration
  • Reference run: compute the near-optimal distribution of work for the current problem size

  10. Training Run
  During the training run, the kernel K is executed on the CPU with training input sizes N1,1, ..., N1,m and on the GPU with input sizes N2,1, ..., N2,m, and the times taken, TC(N1,1), ..., TC(N1,m) and TG(N2,1), ..., TG(N2,m), are recorded. Curve fitting over these (input size, runtime) samples yields linear projections, which are stored in a database:
  • T'C(N) = ac + bc * N : the projected time to execute the kernel with problem size N on the CPU
  • T'G(N) = ag + bg * N : the projected time to execute the kernel with problem size N on the GPU
  [Plot: Runtime vs. Input size, showing the fitted lines T'C(N) and T'G(N)]
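As a concrete illustration of the curve-fitting step, here is a minimal ordinary-least-squares sketch in C++; the names (LinearFit, fit_linear) are invented for illustration and are not Qilin's actual code:

```cpp
#include <cstddef>
#include <vector>

// Linear model T'(N) = a + b*N fitted to training samples.
struct LinearFit { double a, b; };

// Ordinary least-squares fit of measured kernel times against input sizes.
// sizes[i] is a training input size N_i; times[i] is the measured time T(N_i).
LinearFit fit_linear(const std::vector<double>& sizes,
                     const std::vector<double>& times) {
    const std::size_t m = sizes.size();
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (std::size_t i = 0; i < m; ++i) {
        sx  += sizes[i];
        sy  += times[i];
        sxx += sizes[i] * sizes[i];
        sxy += sizes[i] * times[i];
    }
    const double b = (m * sxy - sx * sy) / (m * sxx - sx * sx);
    const double a = (sy - b * sx) / m;
    return {a, b};  // T'(N) = a + b*N
}
```

Fitting the CPU samples gives (ac, bc) and the GPU samples gives (ag, bg); these two fits are what the database stores per <program, hardware> configuration.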

  11. Reference Run
  • β = fraction of work mapped to the CPU
  • p = number of CPU cores
  • N = problem size
  • T'β(N) = the projected time to execute βN work on the CPU and (1-β)N work on the GPU
    = Max( (p/(p-1)) * T'C(βN), T'G((1-β)N) )
  (One CPU core is dedicated to driving the GPU, leaving p-1 cores for CPU work; this is why the projected CPU time is scaled by p/(p-1).)
  Once N is fixed to the actual problem size Nr, fetch the fitted projections from the database and find the β that minimizes T'β(Nr). We consider where the two curves (p/(p-1)) * T'C(βNr) and T'G((1-β)Nr) intersect. There are 3 possible cases (see next slide).

  12. Three Possible Cases of β
  The CPU curve (p/(p-1)) * T'C(βNr) increases with β, while the GPU curve T'G((1-β)Nr) decreases with β, and T'β(Nr) is the larger of the two:
  • Case i: the curves intersect at β <= 0. T'β(Nr) is minimized at β = 0, i.e., by mapping all work to the GPU.
  • Case ii: the curves intersect at β >= 1. T'β(Nr) is minimized at β = 1, i.e., by mapping all work to the CPU.
  • Case iii: the curves intersect at some βmin with 0 < βmin < 1. T'β(Nr) is minimized by mapping a fraction βmin of the work to the CPU.
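A sketch of how a reference run can pick β from the fitted coefficients, assuming the linear projections above: solving (p/(p-1)) * T'C(βNr) = T'G((1-β)Nr) for β and clamping to [0, 1] covers all three cases. The helper choose_beta is hypothetical, not Qilin's actual code:

```cpp
#include <algorithm>

struct LinearFit { double a, b; };  // as in the training-run sketch

// Fraction of work to map to the CPU for actual problem size Nr, given the
// fitted CPU and GPU projections and p CPU cores. One core drives the GPU,
// so CPU work runs on p-1 cores, scaling projected CPU time by k = p/(p-1).
double choose_beta(LinearFit cpu, LinearFit gpu, int p, double Nr) {
    const double k = static_cast<double>(p) / (p - 1);
    // Intersection of k*(a_c + b_c*beta*Nr) and a_g + b_g*(1-beta)*Nr,
    // solved for beta:
    const double beta =
        (gpu.a + gpu.b * Nr - k * cpu.a) / (Nr * (k * cpu.b + gpu.b));
    // Case i (beta <= 0): all work to the GPU.
    // Case ii (beta >= 1): all work to the CPU.
    // Case iii: split at the intersection point.
    return std::clamp(beta, 0.0, 1.0);
}
```

As a sanity check: with identical CPU and GPU fits and many CPU cores (k near 1), the intersection lands at β = 0.5, an even split.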

  13. Outline
  • Introduction
  • Case Study
  • Adaptive Mapping
  • Experimental Evaluation
  • Conclusions

  14. Prototype Implementation
  Adaptive mapping could be implemented as:
  • Off-line optimization for static compilation
  • On-line optimization for dynamic compilation
  Our prototype: a dynamic compilation system called Qilin.
  • Qilin API: both stream-based and thread-based
  • Dynamic code generation:
  • Generates TBB source code for the CPU
  • Generates CUDA source code for the GPU
  • Generates glue code to copy data back and forth between the CPU and GPU, stage computations onto the GPU to satisfy its memory limitation, and divide the work according to Adaptive Mapping (a sketch of this glue logic follows below)
  [Diagram: C++ application -> Qilin API -> Qilin system -> CPU and GPU]
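A hypothetical sketch of the generated glue logic, assuming the split-and-join structure described above; run_on_cpu and run_on_gpu are invented stand-ins (stubbed here) for the generated TBB and CUDA paths:

```cpp
#include <cstddef>
#include <thread>

// Hypothetical stand-ins for the code Qilin generates:
void run_on_cpu(std::size_t begin, std::size_t end) {
    // generated TBB code: parallel loop over elements [begin, end)
}
void run_on_gpu(std::size_t begin, std::size_t end) {
    // generated CUDA path: copy inputs in, launch kernel(s) over
    // [begin, end) (staged if the range exceeds GPU memory), copy results out
}

// Glue logic: execute one kernel call over [0, n), with a fraction beta
// of the elements mapped to the CPU as computed by Adaptive Mapping.
void run_split(std::size_t n, double beta) {
    const std::size_t split = static_cast<std::size_t>(beta * n);
    std::thread gpu(run_on_gpu, split, n);  // one host thread drives the GPU
    run_on_cpu(0, split);                   // remaining cores run the CPU part
    gpu.join();
}
```

Launching the GPU portion on its own host thread matches the p/(p-1) model above: one core is occupied driving the GPU while the rest execute the CPU share.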

  15. Heterogeneous PC Used
  [Specification table not captured in the transcript; per the case study, the machine pairs a dual-socket quad-core Intel CPU (8 cores total) with an Nvidia GTX 8800 GPU]

  16. Benchmarks
  [Benchmark table not captured in the transcript; the suite spans financial, image-processing, and scientific workloads]

  17. Performance of Adaptive Mapping
  (Note: the y-axis is in logarithmic scale)
  ✓ Adaptive mapping achieves 94% of the speedup of manual mapping

  18. Energy Consumption
  (Total system power measured by an Extech 38080 Power Analyzer)
  ✓ Adaptive mapping is nearly as good as manual mapping in energy consumption

  19. Distribution of Computations
  ✓ Adaptive mapping and manual mapping have similar distributions

  20. Related Work
  Hardware:
  • Kumar et al. demonstrate the advantages of heterogeneous over homogeneous CMPs in terms of power and throughput; similar observations from Hill and Marty. Both studies point out the importance of the mapping problem.
  Software:
  • GPGPU: Brook, Accelerator, PeakStream, RapidMind, Brook+, CUDA (all GPU-only)
  • Intel's TBB and Ct (currently CPU-only)
  • IBM's OpenMP extension for the Cell and Intel's ExoCHI/Merge: use both the CPU and GPU, but based on static manual mapping
  • OpenCL: does not seem to have any automatic mapping technique, based on the initial specification
  Autotuning:
  • Generates many variants of a computation kernel and benchmarks each variant on the target platform
  • Adaptive mapping can be regarded as an autotuning technique that tunes the distribution of work on heterogeneous platforms

  21. Conclusions
  • Qilin automates the mapping from computations to heterogeneous multicores
  • Encouraging results: performance and energy consumption close to manual mapping
  • Adapts to changes in input size, hardware, and software configurations (see our paper)
  • Applicable to other heterogeneous systems, e.g., OpenCL or Ct on NHM + Larrabee
  • Future work: extend it to handle irregular computations
  ✓ Adaptive mapping could be an important technique in the multicore software stack

  22. Acknowledgments
  • Michael Linderman, Jamison Collins, and Hong Wang, for sharing their Merge benchmarks
  • Geoff Lowney and Mark Abel, for supporting this work
  • Geoff Lowney and Robert Cohn, for suggestions and feedback

  23. Impact of Training Input Size
  (Note: the y-axis is in logarithmic scale; the x-axis is the training input size as a percentage of the reference input size)
  ✓ Most of the performance benefit of Adaptive Mapping is preserved when the training input size is at least 30% of the reference input size

  24. Adapting to Hardware Changes (1)
  Using a less powerful GPU (GTX 8800 with 128 cores => GTS 8800 with 96 cores), compared against the original result:
  ✓ Adaptive mapping automatically shifts work to the CPU, recovering part of the performance lost on the GPU

  25. Adapting to Hardware Changes (2)
  Using a less powerful CPU (8 cores => 2 cores), compared against the original result:
  ✓ Adaptive mapping shifts most of the work to the GPU

  26. Adapting to Software Changes
  Using a different compiler on the CPU (ICC => GCC, for both the serial and parallel cases), compared against the original result:
  • GCC does not use SSE-x as well as ICC does
  ✓ Adaptive mapping biases the work toward the GPU
