Introduction to Accelerators and GPGPU
Dan Ernst
Cray, Inc.
Conventional Wisdom (CW) in Computer Architecture
• Old CW: Transistors expensive
• New CW: "Power wall": power expensive, transistors free (can put more on a chip than you can afford to turn on)
• Old CW: Multiplies are slow, memory access is fast
• New CW: "Memory wall": memory slow, multiplies fast (200-600 clocks to DRAM, 4 clocks for an FP multiply)
• Old CW: Increasing Instruction-Level Parallelism (ILP) via compilers and innovation (out-of-order, speculation, VLIW, ...)
• New CW: "ILP wall": diminishing returns on more ILP
• New CW: Power Wall + Memory Wall + ILP Wall = Brick Wall
• Old CW: Uniprocessor performance 2x / 1.5 years
• New CW: Uniprocessor performance only 2x / 5 years?
Credit: D. Patterson, UC-Berkeley
The Ox Analogy
• It turns out that sacrificing uniprocessor performance for power savings can save you a lot.
• Example:
• Scenario One: a one-core processor with power budget W
• Increase frequency/ILP by 20%
• Power increases substantially, by more than 50%
• But performance increases by only about 13%
• Scenario Two: Decrease frequency by 20% with a simpler core
• Power decreases by about 50%
• Can now add another core (one more ox!) within the same budget W
"If one ox could not do the job, they did not try to grow a bigger ox, but used two oxen." - Admiral Grace Murray Hopper
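A back-of-the-envelope sketch of where these numbers can come from (the cubic power model is my assumption; the slide does not state it). Dynamic power scales as

    P \propto C V^2 f

and since supply voltage scales roughly with frequency, P \propto f^3. Raising frequency by 20% then costs about 1.2^3 \approx 1.73, i.e. more than 50% more power, while performance grows by well under 20% once the memory wall is factored in. Lowering frequency by 20% costs only 0.8^3 \approx 0.51, i.e. roughly half the power, which is the headroom used to add the second core.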
The Ox Analogy Extended
• Chickens are gaining momentum these days:
• For certain classes of applications (not including field plowing...), you can run many cores at lower frequency and come out ahead (big time) at the speed game
• Molecular Dynamics codes (VMD, NAMD, etc.) have reported speedups of 25x to 100x!
"If one ox could not do the job, they did not try to grow a bigger ox, but used two oxen." - Admiral Grace Murray Hopper
"If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?" - Seymour Cray
Ox vs. Chickens
• Oxen are good at plowing
• Chickens pick up feed
• Which do I use if I want to catch mice?
• I'd much rather have a couple of cats
Moral: Finding the most appropriate tool for the job brings savings in efficiency.
Addendum: That tool will only exist and be affordable if someone can make money on it.
Example of Efficiency: Cray High Density Custom Compute System
• "Same" performance on Cray's 2-cabinet custom solution as on a 200-cabinet off-the-shelf x86 system
• Engineered to achieve application performance at less than 1/100th the space, weight, and power cost of an off-the-shelf system
• Cray designed, developed, integrated, and deployed
The Energy-Flexibility Gap
[Figure: energy efficiency (log scale) plotted against flexibility (coverage). From most efficient and least flexible to least efficient and most flexible: dedicated hardware (ASICs), reconfigurable processors/logic, ASPs, DSPs, embedded processors. Annotations: GPUs were near the dedicated-hardware end 7-10 years ago; now they sit in the reconfigurable processor/logic space.]
GPGPU: General Purpose computing on Graphics Processing Units
[Figure: the programmable fragment-shader pipeline: per-thread input registers, fragment program, temp registers, and output registers, with per-shader texture access, per-context constants, and framebuffer (FB) memory.]
• Previous GPGPU constraint:
• To get general-purpose code working, you had to use the corner cases of the graphics API
• Essentially, re-write the entire program as a collection of shaders and polygons
CUDA: "Compute Unified Device Architecture"
• General purpose programming model
• User kicks off batches of threads on the GPU
• GPU = dedicated super-threaded, massively data-parallel co-processor
• Targeted software stack
• Compute-oriented drivers, language, and tools
• Driver for loading computational programs onto the GPU
NVIDIA Tesla C2090 Card Specs
• 512 GPU cores
• 1.30 GHz
• Single precision floating point performance: 1331 GFLOPs (2 single precision flops per clock per core)
• Double precision floating point performance: 665 GFLOPs (1 double precision flop per clock per core)
• Internal RAM: 6 GB GDDR5
• Internal RAM speed: 177 GB/sec (compared to roughly 30 GB/sec for regular system RAM)
• Has to be plugged into a PCIe slot (at most 8 GB/sec)
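As a sanity check, the peak figures follow directly from the core count and clock listed above:

    512 \text{ cores} \times 1.30 \text{ GHz} \times 2 \text{ flops/clock} \approx 1331 \text{ GFLOPs (single precision)}
    512 \text{ cores} \times 1.30 \text{ GHz} \times 1 \text{ flop/clock} \approx 665 \text{ GFLOPs (double precision)}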
Why GPGPU Processing?
• Calculation: TFLOPS (GPU) vs. ~150 GFLOPS (high-end CPU)
• Memory Bandwidth: ~5-10x that of a CPU
• Cost Benefit: a GPU in every PC means massive volume
GPUs
• The Good:
• Performance: focused silicon use
• High bandwidth for streaming applications
• Similar power envelope to high-end CPUs
• High volume = affordable
• The Bad:
• Programming: streaming languages (CUDA, OpenCL, etc.)
• Requires significant application intervention / development
• Sensitive to hardware knowledge: memories, banking, resource management, etc.
• Not good at certain operations or applications: integer performance, irregular data, pointer logic, low compute intensity
• Questions about reliability / error
• Many of these have been addressed in the most recent hardware models
Intel Many Integrated Core (MIC)
• Knights Ferry
• 32 cores
• Wide vector units
• x86 ISA
• Mostly a test platform at this point
• Knights Corner will be the first real product (2012)
FPGAs – Generated Accelerators
• Configurable logic blocks
• Interconnection mesh
• Can be incorporated into cards or integrated inline
FPGAs
• The Good:
• Performance: good silicon use (do only what you need, maximize parallel ops/cycle)
• Rapid growth: cells, speed, I/O
• Power: 1/10th that of CPUs
• Flexible: tailor to the application
• The Bad:
• Programming: VHDL, Verilog, etc. (advances have been made in translating high-level code such as C and Fortran to hardware)
• Compile time: place-and-route for the FPGA layout can take multiple hours
• Clock rate: FPGAs are typically clocked at about 1/10th to 1/5th the rate of an ASIC
• Cost: they're actually not cheap
Accelerators in a System
• External – entire application offloading
• "Appliances" – DataPower, Azul
• Attached – targeted offloading
• PCIe cards – CUDA/FireStream GPUs, FPGA cards
• Integrated – tighter connection
• On-chip – AMD Fusion, Cell BE, network processing chips
• Incorporated – CPU instructions
• Vector instructions, FMA, crypto-acceleration
Purdy Pictures
• Cray XK6 integrated hybrid blade
• AMD "Fusion"
• IBM "CloudBurst" (DataPower)
• Nvidia M2090
Programming Accelerators
[Figure credit: C. Cascaval, et al., IBM Journal of R&D, 2010]
Programming Accelerators
• Programming accelerators requires describing:
• What portions of code will run on the accelerator (as opposed to on the CPU)
• How that code maps to the architecture of the accelerator, both compute elements and memories
• The first is typically done on a function-by-function basis (i.e., a GPU kernel)
• The second is much more variable: parallel directives, SIMT block descriptions, VHDL/Verilog, ...
• Integrating these is not very mature at this point, but it is coming
CUDA SAXPY

// GPU kernel: each thread computes one element of y = a*x + y
__global__ void saxpy_cuda(int n, float a, float *x, float *y)
{
    int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

…

// invoke the kernel with 256 threads per block
int nblocks = (n + 255) / 256;
saxpy_cuda<<<nblocks, 256>>>(n, 2.0f, x, y);
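For context, a minimal host-side harness around this kernel might look like the following sketch. The array size and initial values are arbitrary choices of mine, but cudaMalloc / cudaMemcpy / cudaFree are the standard CUDA runtime pattern for an attached PCIe accelerator:

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// (saxpy_cuda kernel as defined above)

int main(void)
{
    const int n = 1 << 20;                 // 1M elements (arbitrary size)
    size_t bytes = n * sizeof(float);

    // allocate and initialize host arrays
    float *hx = (float *)malloc(bytes);
    float *hy = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { hx[i] = 1.0f; hy[i] = 2.0f; }

    // allocate device arrays
    float *x, *y;
    cudaMalloc((void **)&x, bytes);
    cudaMalloc((void **)&y, bytes);

    // move inputs across the PCIe bus (the 8 GB/sec link noted earlier)
    cudaMemcpy(x, hx, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(y, hy, bytes, cudaMemcpyHostToDevice);

    int nblocks = (n + 255) / 256;
    saxpy_cuda<<<nblocks, 256>>>(n, 2.0f, x, y);

    // copy the result back and spot-check it
    cudaMemcpy(hy, y, bytes, cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", hy[0]);          // expect 2.0 + 2.0 * 1.0 = 4.0

    cudaFree(x); cudaFree(y);
    free(hx); free(hy);
    return 0;
}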
Integrating Accelerators More Tightly
• There are several efforts (mostly libraries and directive methods) to lower the entry point for accelerator programming
• Library example: Thrust, an STL-like interface for GPUs
• Directive example: OpenACC, which is like OpenMP

thrust::host_vector<int> H(10);            // a vector in host memory
thrust::device_vector<int> D(10, 1);       // 10 elements on the GPU, all set to 1
thrust::fill(D.begin(), D.begin() + 7, 9); // first 7 elements become 9
thrust::sequence(H.begin(), H.end());      // H = 0, 1, 2, ..., 9
…
#pragma acc parallel [clauses]
{
    structured block
}

http://www.openacc-standard.org/
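To make the directive style concrete, here is a sketch of the earlier SAXPY written with OpenACC. The function name saxpy_acc is mine, but the parallel loop directive and the data clauses are standard OpenACC:

// SAXPY via OpenACC: the compiler generates the accelerator kernel
// and the data movement from the directive; no explicit kernel is written.
void saxpy_acc(int n, float a, float *x, float *y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}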
Developing with Accelerators
• Profile your code
• What code is heavily used (and amenable to acceleration)?
• Write accelerator kernels for the heavily used code (Amdahl)
• Replace the CPU version with an accelerator offload
• Play "chase the bottleneck" around the accelerator (a timing sketch follows below)
• AKA re-write the kernel a dozen times
• Profit!
• Faster science/engineering/finance/whatever! ???
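One concrete way to chase the bottleneck is to time the kernel separately from the PCIe transfers. A minimal sketch using the standard CUDA event API, reusing n, nblocks, x, and y from the SAXPY example above:

cudaEvent_t start, stop;
float ms;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// time only the kernel, excluding host-device transfers
cudaEventRecord(start);
saxpy_cuda<<<nblocks, 256>>>(n, 2.0f, x, y);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&ms, start, stop);
printf("kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);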
A Story About Acceleration • Brandon’s stuff
Big Picture
• Architectures are moving towards "effective use of space" (or power)
• Focusing architectures on a specific task (at the expense of others) can make for very efficient/effective tools (for that task)
• HPC systems are beginning to integrate acceleration at numerous levels, but the "PCIe card GPU" is the most common
• Exploiting the most popular accelerators requires intervention by application programmers to map codes to the architecture
• Developing for accelerators can be challenging, as significantly more hardware knowledge is needed to get good performance
• There are major efforts at improving this
Other Sessions of Interest
• Tomorrow, 2 - 3 pm: CUDA Programming Part I; 3:30 - 5 pm: CUDA Programming Part II (WSCC 2A/2B)
• Tomorrow at 5:30 pm: BOF: Broad-based Efforts to Expand Parallelism Preparedness in the Computing Workforce (WSCC 611/612, here)
• Wednesday at 10:30 am: Panel/Discussion: Parallelism, the Cloud, and the Tools of the Future for the next generation of practitioners (WSCC 2A/2B)