
Qilin: Exploiting Parallelism on Heterogeneous Multiprocessors with Adaptive Mapping

Chi-Keung (CK) Luk

Technology Pathfinding and Innovation

Software Solutions and Services Group

Intel

Sunpyo Hong

Electrical and Computer

Engineering

Georgia Institute of Technology

Hyesoon Kim

College of Computing

School of Computer Science

Georgia Institute of Technology

Heterogeneous Architectures

Heterogeneous architectures are increasingly popular. Platforms in use include:

  • Intel Core2 + Nvidia GPU
  • IBM's Cell processor
  • NHM (Nehalem) + Larrabee

Software Challenge

[Figure: a CPU + GPU system: a multicore CPU (Core-0 through Core-3) alongside a SIMD GPU]

  • The Mapping Problem:
  • Map computations to processing elements (PEs) to optimize an objective function, which could be:
      • Performance
      • Energy
      • Performance / Energy
Existing Solutions to the Mapping Problem

Programmer performs the mapping manually and statically

Examples:

IBM XL compiler extension that supports OpenMP on the Cell

Intel CTG’s ExoCHI/Merge framework for programming the CPU and GPU

Disadvantages:

Labor intensive

Not adaptable to changes in runtime environments

Outline

Introduction

Case Study

Adaptive Mapping

Experimental Evaluation

Conclusions

Case Study: Matrix Multiplication
  • Heterogeneous machine used:
    • CPU: dual-socket quad-core (max = 8 cores)
    • GPU: Nvidia GTX-8800
  • Three configurations tested:
      • Small problem size, max CPU cores used
      • Big problem size, max CPU cores used
      • Big problem size, fewer CPU cores used
  • In each configuration:
      • Perform cooperative matrix multiplication

(varying the distribution of work between the CPU and GPU)

Cooperative Matrix Multiplication

[Figure: C = A x B, with A split row-wise into A1 and A2; the CPU computes C1 = A1 x B while the GPU computes C2 = A2 x B]
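As a concrete (hypothetical) illustration of this split, the sketch below computes the two row blocks on two plain C++ threads; the second thread merely stands in for the GPU side, which Qilin would actually drive through generated CUDA code. The matrix size and the 40/60 split are made up.

```cpp
// Hypothetical sketch of the cooperative split: compute the first beta*n
// rows of C = A x B on one thread (the "CPU" part) and the remaining rows
// on a second thread standing in for the GPU.
#include <functional>
#include <thread>
#include <vector>

// Multiply rows [r0, r1) of A by B into the matching rows of C (all n x n).
static void matmul_rows(const std::vector<double>& A,
                        const std::vector<double>& B,
                        std::vector<double>& C,
                        int n, int r0, int r1) {
    for (int i = r0; i < r1; ++i)
        for (int k = 0; k < n; ++k)
            for (int j = 0; j < n; ++j)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}

int main() {
    const int n = 512;          // illustrative matrix dimension
    const double beta = 0.4;    // illustrative fraction of rows mapped to the CPU
    std::vector<double> A(n * n, 1.0), B(n * n, 1.0), C(n * n, 0.0);

    const int split = static_cast<int>(beta * n);
    // A1 x B -> C1 on the "CPU", A2 x B -> C2 on the "GPU" stand-in;
    // the two threads write disjoint rows of C, so no synchronization is needed.
    std::thread cpu(matmul_rows, std::cref(A), std::cref(B), std::ref(C), n, 0, split);
    std::thread gpu(matmul_rows, std::cref(A), std::cref(B), std::ref(C), n, split, n);
    cpu.join();
    gpu.join();
}
```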

Cooperative Matrix Multiplication Results
  • Lessons Learned:
  • The optimal PE mapping depends on the application, the input size, and hardware resources available
  • Need an automatic and dynamic technique that takes all these factors into account

Our contribution: ADAPTIVE MAPPING

Configuration 1:

Matrix dimension size = 1000

#CPU cores = 8

Configuration 2:

Matrix dimension size = 6000

#CPU cores = 8

Configuration 3:

Matrix dimension size = 6000

#CPU cores = 2

Adaptive Mapping

A technique to automatically find the near-optimal mapping for the given program, problem size and hardware

Each <program, hardware> configuration involves one training run and many reference runs:

Training run:

Find the execution-time projections of the CPU and the GPU for the given configuration

Reference run:

Compute the near-optimal distribution of work for the current problem size

Training Run

[Figure: the training input of size Nt is split between the CPU and the GPU; the kernel K is run on CPU subparts of sizes N1,1 … N1,m and GPU subparts of sizes N2,1 … N2,m, and the times TC(N1,1) … TC(N1,m) and TG(N2,1) … TG(N2,m) are measured. Curve fitting over these samples yields the projections stored in the database]

The curve fitting produces linear execution-time projections for kernel K:

T'C(N) = the projected time to execute the kernel of problem size N on the CPU
       = aC + bC * N

T'G(N) = the projected time to execute the kernel of problem size N on the GPU
       = aG + bG * N

[Figure: measured CPU and GPU runtimes plotted against input size, with the fitted lines T'C(N) and T'G(N)]
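The talk does not show the fitting code itself; below is a minimal sketch, assuming ordinary least-squares over (problem size, measured time) samples. The sample numbers in main are hypothetical.

```cpp
// Minimal sketch of the curve-fitting step: ordinary least-squares fit of
// t = a + b * n over the training samples. Illustrative only, not Qilin code.
#include <cstdio>
#include <vector>

struct LinearModel { double a, b; };   // T'(N) = a + b * N

static LinearModel fit(const std::vector<double>& n, const std::vector<double>& t) {
    const std::size_t m = n.size();
    double sn = 0, st = 0, snn = 0, snt = 0;
    for (std::size_t i = 0; i < m; ++i) {
        sn += n[i]; st += t[i];
        snn += n[i] * n[i]; snt += n[i] * t[i];
    }
    const double b = (m * snt - sn * st) / (m * snn - sn * sn);
    const double a = (st - b * sn) / m;
    return {a, b};
}

int main() {
    // Hypothetical training samples: problem sizes and measured CPU times (ms).
    std::vector<double> sizes = {100, 200, 400, 800};
    std::vector<double> cpu_ms = {215, 412, 818, 1630};
    const LinearModel cpu = fit(sizes, cpu_ms);   // T'C(N) = aC + bC * N
    std::printf("T'C(N) = %.2f + %.3f * N\n", cpu.a, cpu.b);
}
```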

Reference Run

β = fraction of work mapped to the CPU
p = number of CPU cores
N = problem size

T'β(N) = the projected time to execute βN work on the CPU and (1-β)N work on the GPU
       = Max( (p/(p-1)) * T'C(βN), T'G((1-β)N) )

(The factor p/(p-1) accounts for the CPU core that is dedicated to driving the GPU, leaving p-1 cores for CPU work.)

Once N is fixed to the actual problem size Nr, the runtime fetches T'C and T'G from the database and finds the β that minimizes T'β(Nr), by considering where the two curves (p/(p-1)) * T'C(βNr) and T'G((1-β)Nr) intersect. There are three possible cases (see next slide).
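A minimal sketch of this step, assuming the linear projections above; the coefficients in main are hypothetical, not measured Qilin data. Clamping the intersection point to [0, 1] covers the first two cases on the next slide.

```cpp
// Minimal sketch of the beta computation for the reference run: pick the
// CPU fraction beta minimizing max((p/(p-1))*T'C(beta*Nr), T'G((1-beta)*Nr)).
#include <algorithm>
#include <cstdio>

struct LinearModel { double a, b; };   // T'(N) = a + b * N

// Find where (p/(p-1)) * T'C(beta*Nr) meets T'G((1-beta)*Nr), then clamp to
// [0, 1]; the clamped endpoints correspond to cases i and ii.
static double choose_beta(LinearModel cpu, LinearModel gpu, int p, double Nr) {
    const double s = static_cast<double>(p) / (p - 1);  // one core drives the GPU
    const double beta =
        (gpu.a + gpu.b * Nr - s * cpu.a) / (Nr * (s * cpu.b + gpu.b));
    return std::clamp(beta, 0.0, 1.0);
}

int main() {
    const LinearModel cpu = {10.0, 2.0};   // hypothetical: T'C(N) = 10 + 2N
    const LinearModel gpu = {50.0, 0.5};   // hypothetical: T'G(N) = 50 + 0.5N
    const double beta = choose_beta(cpu, gpu, /*p=*/8, /*Nr=*/100.0);
    std::printf("map %.0f%% of the work to the CPU\n", 100.0 * beta);  // ~32%
}
```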

Three Possible Cases of β

  • Case i: the CPU and GPU curves intersect at β <= 0. T'β(Nr) is minimized when mapping all work to the GPU (β = 0).
  • Case ii: the two curves intersect at β >= 1. T'β(Nr) is minimized when mapping all work to the CPU (β = 1).
  • Case iii: the two curves intersect at some 0 < β < 1. T'β(Nr) is minimized when mapping the fraction βmin at the intersection to the CPU.

[Figure: time versus β plots of CPU: (p/(p-1)) * T'C(βNr) and GPU: T'G((1-β)Nr) over [0, 1] for each of the three cases]
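As a worked instance of case iii, using the hypothetical coefficients from the sketch on the previous slide: with T'C(N) = 10 + 2N, T'G(N) = 50 + 0.5N, p = 8, and Nr = 100, the intersection condition (8/7)(10 + 200β) = 100 - 50β simplifies to 80 + 1600β = 700 - 350β, giving β = 620/1950 ≈ 0.32, so roughly a third of the work is mapped to the CPU.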

Outline

Introduction

Case Study

Adaptive Mapping

Experimental Evaluation

Conclusions

Prototype Implementation

Adaptive mapping could be implemented as:

Off-line optimization for static compilation

On-line optimization for dynamic compilation

Our prototype:

A dynamic compilation system called Qilin

Qilin API:

Both stream-based and thread-based

Dynamic code generation:

Generate TBB source code for the CPU

Generate CUDA source code for the GPU

Generate glue code to:

Copy data back and forth between CPU and GPU

Stage computations onto GPU to satisfy GPU memory limitation

Divide work according to Adaptive Mapping

[Figure: software stack: a C++ application calls the Qilin API; the Qilin system maps the work onto the CPU and GPU]
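As a rough illustration of the staging step, the sketch below processes an input larger than a fixed device-memory budget in chunks; the scratch buffer and the doubling loop are stand-ins for the CUDA buffers and kernel that Qilin would generate, and all sizes are made up.

```cpp
// Illustrative sketch of staging a computation to fit a device-memory budget:
// process the input in chunks, "copying" each chunk to a scratch buffer that
// stands in for GPU memory, then copying the results back.
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t n = 1'000'000;      // total elements to process
    const std::size_t budget = 200'000;   // elements that fit in "GPU" memory
    std::vector<float> in(n, 1.0f), out(n);
    std::vector<float> device(budget);    // stand-in for a GPU buffer

    for (std::size_t off = 0; off < n; off += budget) {
        const std::size_t len = std::min(budget, n - off);
        std::copy_n(in.begin() + off, len, device.begin());   // host -> device
        for (std::size_t i = 0; i < len; ++i) device[i] *= 2.0f;  // kernel stand-in
        std::copy_n(device.begin(), len, out.begin() + off);  // device -> host
    }
    std::printf("processed %zu elements in chunks of %zu\n", n, budget);
}
```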

Benchmarks

[Table: benchmark suite spanning financial, image-processing, and scientific workloads]

Performance of Adaptive Mapping

(Note: the y-axis is in logarithmic scale)

Adaptive mapping achieves 94% of the speedup of manual mapping

Energy Consumption

(Total system power measured by an Extech 38080 Power Analyser)

Adaptive mapping is nearly as good as manual mapping in energy consumption

Distribution of Computations

Adaptive mapping and manual mapping have similar distributions

Related Work

Hardware

Kumar et al. demonstrate advantages of heterogeneous over homogeneous CMPs in terms of power and throughput

Similar observations from Hill and Marty

Both studies point out the importance of the mapping problem

Software

GPGPU:

Brook, Accelerator, PeakStream, RapidMind, Brook+, CUDA (all of them GPU-only)

Intel’s TBB and Ct (currently CPU only)

IBM’s OpenMP extension for Cell and Intel’s ExoCHI/Merge

Use both the CPU and the GPU, but rely on static, manual mapping

OpenCL:

The initial specification does not appear to include any automatic mapping technique

Autotuning

Generating many variants of a computation kernel and benchmarking each variant on the target platform

Adaptive mapping can be regarded as an autotuning technique that tunes the distribution of work on heterogeneous platforms

Conclusions

Qilin automates the mapping of computations to heterogeneous multicores

Encouraging results:

Performance and energy consumption close to manual mapping

Adapts to changes in input size and in hardware & software configurations (see our paper)

Applicable to other heterogeneous systems

OpenCL or Ct on NHM + Larrabee

Future work:

Extend it to handle irregular computations

Adaptive mapping could be an important technique in the multicore software stack

Acknowledgments

Michael Linderman, Jamison Collins, Hong Wang

For sharing their Merge benchmarks

Geoff Lowney and Mark Abel

For supporting this work

Geoff Lowney and Robert Cohn

For suggestions and feedback

Impact of Training Input Size

(Note: the y-axis is in logarithmic scale; the x-axis is the training input size as a percentage of the reference input size)

Most of the performance benefit of adaptive mapping is preserved when the training input size is at least 30% of the reference input size

Adapting to Hardware Changes (1)

(Using a less powerful GPU: GTX8800 with 128 cores => GTS8800 with 96 cores, compared against the original result)

Adaptive mapping automatically recovers part of the performance lost on the GPU by shifting work to the CPU

Adapting to Hardware Changes (2)

(Using a less powerful CPU: 8 cores => 2 cores, compared against the original result)

Adaptive mapping shifts most of the work to the GPU

Adapting to Software Changes

(Using a different compiler on the CPU: ICC => GCC, for both the serial and parallel cases, compared against the original result)

GCC doesn't use SSE-x as well as ICC does

Adaptive mapping biases work toward the GPU