Introduction to Accelerators and GPGPU
Dan Ernst
Cray, Inc.
Conventional Wisdom (CW) in Computer Architecture
• Old CW: Transistors expensive
• New CW: "Power wall": power expensive, transistors free (can put more on a chip than you can afford to turn on)
• Old CW: Multiplies are slow, memory access is fast
• New CW: "Memory wall": memory slow, multiplies fast (200-600 clocks to DRAM, 4 clocks for an FP multiply)
• Old CW: Increasing Instruction-Level Parallelism (ILP) via compilers and innovation (out-of-order, speculation, VLIW, ...)
• New CW: "ILP wall": diminishing returns on more ILP
• New CW: Power Wall + Memory Wall + ILP Wall = Brick Wall
• Old CW: Uniprocessor performance 2x / 1.5 years
• New CW: Uniprocessor performance only 2x / 5 years?
Credit: D. Patterson, UC-Berkeley
The Ox Analogy
• It turns out that sacrificing uniprocessor performance for power savings can save you a lot.
• Example:
• Scenario One: a one-core processor with power budget W
• Increase frequency/ILP by 20%
• Power increases substantially, by more than 50%
• But performance increases by only about 13%
• Scenario Two: Decrease frequency by 20% with a simpler core
• Power decreases by about 50%
• Can now add another core (one more ox!) within the same budget W
"If one ox could not do the job, they did not try to grow a bigger ox, but used two oxen." - Admiral Grace Murray Hopper
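A back-of-the-envelope sketch of where these numbers can come from (the cubic power model is my assumption; the slide does not state it). Dynamic power scales as

    P \propto C V^2 f

and since supply voltage scales roughly with frequency, P \propto f^3. Raising frequency by 20% then costs about 1.2^3 \approx 1.73, i.e. more than 50% more power, while performance grows by well under 20% once the memory wall is factored in. Lowering frequency by 20% costs only 0.8^3 \approx 0.51, i.e. roughly half the power, which is the headroom used to add the second core.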
The Ox Analogy Extended
• Chickens are gaining momentum these days:
• For certain classes of applications (not including field plowing...), you can run many cores at lower frequency and come out ahead (big time) at the speed game
• Molecular Dynamics codes (VMD, NAMD, etc.) have reported speedups of 25x to 100x!
"If one ox could not do the job, they did not try to grow a bigger ox, but used two oxen." - Admiral Grace Murray Hopper
"If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?" - Seymour Cray
Ox vs. Chickens
• Oxen are good at plowing
• Chickens pick up feed
• Which do I use if I want to catch mice?
• I'd much rather have a couple of cats
Moral: Finding the most appropriate tool for the job brings savings in efficiency.
Addendum: That tool will only exist and be affordable if someone can make money on it.
Example of Efficiency: Cray High Density Custom Compute System
• "Same" performance on Cray's 2-cabinet custom solution as on a 200-cabinet off-the-shelf x86 system
• Engineered to achieve application performance at less than 1/100th the space, weight, and power cost of an off-the-shelf system
• Cray designed, developed, integrated, and deployed
The Energy-Flexibility Gap
[Figure: energy efficiency (log scale) plotted against flexibility (coverage). From most efficient and least flexible to least efficient and most flexible: dedicated hardware (ASICs), reconfigurable processors/logic, ASPs, DSPs, embedded processors. Annotations: GPUs were near the dedicated-hardware end 7-10 years ago; now they sit in the reconfigurable processor/logic space.]
GPGPU: General Purpose computing on Graphics Processing Units
[Figure: the programmable fragment-shader pipeline: per-thread input registers, fragment program, temp registers, and output registers, with per-shader texture access, per-context constants, and framebuffer (FB) memory.]
• Previous GPGPU constraint:
• To get general-purpose code working, you had to use the corner cases of the graphics API
• Essentially, re-write the entire program as a collection of shaders and polygons
CUDA: "Compute Unified Device Architecture"
• General purpose programming model
• User kicks off batches of threads on the GPU
• GPU = dedicated super-threaded, massively data-parallel co-processor
• Targeted software stack
• Compute-oriented drivers, language, and tools
• Driver for loading computational programs onto the GPU
NVIDIA Tesla C2090 Card Specs
• 512 GPU cores
• 1.30 GHz
• Single precision floating point performance: 1331 GFLOPs (2 single precision flops per clock per core)
• Double precision floating point performance: 665 GFLOPs (1 double precision flop per clock per core)
• Internal RAM: 6 GB GDDR5
• Internal RAM speed: 177 GB/sec (compared to roughly 30 GB/sec for regular system RAM)
• Has to be plugged into a PCIe slot (at most 8 GB/sec)
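As a sanity check, the peak figures follow directly from the core count and clock listed above:

    512 \text{ cores} \times 1.30 \text{ GHz} \times 2 \text{ flops/clock} \approx 1331 \text{ GFLOPs (single precision)}
    512 \text{ cores} \times 1.30 \text{ GHz} \times 1 \text{ flop/clock} \approx 665 \text{ GFLOPs (double precision)}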
Why GPGPU Processing?
• Calculation: TFLOPS (GPU) vs. ~150 GFLOPS (high-end CPU)
• Memory Bandwidth: ~5-10x that of a CPU
• Cost Benefit: a GPU in every PC means massive volume
GPUs
• The Good:
• Performance: focused silicon use
• High bandwidth for streaming applications
• Similar power envelope to high-end CPUs
• High volume = affordable
• The Bad:
• Programming: streaming languages (CUDA, OpenCL, etc.)
• Requires significant application intervention / development
• Sensitive to hardware knowledge: memories, banking, resource management, etc.
• Not good at certain operations or applications: integer performance, irregular data, pointer logic, low compute intensity
• Questions about reliability / error
• Many of these have been addressed in the most recent hardware models
Intel Many Integrated Core (MIC)
• Knights Ferry
• 32 cores
• Wide vector units
• x86 ISA
• Mostly a test platform at this point
• Knights Corner will be the first real product (2012)
FPGAs – Generated Accelerators
• Configurable logic blocks
• Interconnection mesh
• Can be incorporated into cards or integrated inline
FPGAs
• The Good:
• Performance: good silicon use (do only what you need, maximize parallel ops/cycle)
• Rapid growth: cells, speed, I/O
• Power: 1/10th that of CPUs
• Flexible: tailor to the application
• The Bad:
• Programming: VHDL, Verilog, etc. (advances have been made in translating high-level code such as C and Fortran to hardware)
• Compile time: place-and-route for the FPGA layout can take multiple hours
• Clock rate: FPGAs are typically clocked at about 1/10th to 1/5th the rate of an ASIC
• Cost: they're actually not cheap
Accelerators in a System
• External – entire application offloading
• "Appliances" – DataPower, Azul
• Attached – targeted offloading
• PCIe cards – CUDA/FireStream GPUs, FPGA cards
• Integrated – tighter connection
• On-chip – AMD Fusion, Cell BE, network processing chips
• Incorporated – CPU instructions
• Vector instructions, FMA, crypto-acceleration
Purdy Pictures
• Cray XK6 integrated hybrid blade
• AMD "Fusion"
• IBM "CloudBurst" (DataPower)
• Nvidia M2090
Programming Accelerators
[Figure credit: C. Cascaval, et al., IBM Journal of R&D, 2010]
Programming Accelerators
• Programming accelerators requires describing:
• What portions of code will run on the accelerator (as opposed to on the CPU)
• How that code maps to the architecture of the accelerator, both compute elements and memories
• The first is typically done on a function-by-function basis (i.e., a GPU kernel)
• The second is much more variable: parallel directives, SIMT block descriptions, VHDL/Verilog, ...
• Integrating these is not very mature at this point, but it is coming
CUDA SAXPY

// GPU kernel: each thread computes one element of y = a*x + y
__global__ void saxpy_cuda(int n, float a, float *x, float *y)
{
    int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

…

// invoke the kernel with 256 threads per block
int nblocks = (n + 255) / 256;
saxpy_cuda<<<nblocks, 256>>>(n, 2.0f, x, y);
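For context, a minimal host-side harness around this kernel might look like the following sketch. The array size and initial values are arbitrary choices of mine, but cudaMalloc / cudaMemcpy / cudaFree are the standard CUDA runtime pattern for an attached PCIe accelerator:

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// (saxpy_cuda kernel as defined above)

int main(void)
{
    const int n = 1 << 20;                 // 1M elements (arbitrary size)
    size_t bytes = n * sizeof(float);

    // allocate and initialize host arrays
    float *hx = (float *)malloc(bytes);
    float *hy = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { hx[i] = 1.0f; hy[i] = 2.0f; }

    // allocate device arrays
    float *x, *y;
    cudaMalloc((void **)&x, bytes);
    cudaMalloc((void **)&y, bytes);

    // move inputs across the PCIe bus (the 8 GB/sec link noted earlier)
    cudaMemcpy(x, hx, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(y, hy, bytes, cudaMemcpyHostToDevice);

    int nblocks = (n + 255) / 256;
    saxpy_cuda<<<nblocks, 256>>>(n, 2.0f, x, y);

    // copy the result back and spot-check it
    cudaMemcpy(hy, y, bytes, cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", hy[0]);          // expect 2.0 + 2.0 * 1.0 = 4.0

    cudaFree(x); cudaFree(y);
    free(hx); free(hy);
    return 0;
}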
Integrating Accelerators More Tightly
• There are several efforts (mostly libraries and directive methods) to lower the entry point for accelerator programming
• Library example: Thrust, an STL-like interface for GPUs
• Directive example: OpenACC, which is like OpenMP

thrust::host_vector<int> H(10);            // a vector in host memory
thrust::device_vector<int> D(10, 1);       // 10 elements on the GPU, all set to 1
thrust::fill(D.begin(), D.begin() + 7, 9); // first 7 elements become 9
thrust::sequence(H.begin(), H.end());      // H = 0, 1, 2, ..., 9
…
#pragma acc parallel [clauses]
{
    structured block
}

http://www.openacc-standard.org/
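To make the directive style concrete, here is a sketch of the earlier SAXPY written with OpenACC. The function name saxpy_acc is mine, but the parallel loop directive and the data clauses are standard OpenACC:

// SAXPY via OpenACC: the compiler generates the accelerator kernel
// and the data movement from the directive; no explicit kernel is written.
void saxpy_acc(int n, float a, float *x, float *y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}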
Developing with Accelerators
• Profile your code
• What code is heavily used (and amenable to acceleration)?
• Write accelerator kernels for the heavily used code (Amdahl)
• Replace the CPU version with an accelerator offload
• Play "chase the bottleneck" around the accelerator (a timing sketch follows below)
• AKA re-write the kernel a dozen times
• Profit!
• Faster science/engineering/finance/whatever! ???
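One concrete way to chase the bottleneck is to time the kernel separately from the PCIe transfers. A minimal sketch using the standard CUDA event API, reusing n, nblocks, x, and y from the SAXPY example above:

cudaEvent_t start, stop;
float ms;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// time only the kernel, excluding host-device transfers
cudaEventRecord(start);
saxpy_cuda<<<nblocks, 256>>>(n, 2.0f, x, y);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&ms, start, stop);
printf("kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);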
A Story About Acceleration • Brandon’s stuff
Big Picture
• Architectures are moving towards "effective use of space" (or power)
• Focusing architectures on a specific task (at the expense of others) can make for very efficient/effective tools (for that task)
• HPC systems are beginning to integrate acceleration at numerous levels, but the "PCIe card GPU" is the most common
• Exploiting the most popular accelerators requires intervention by application programmers to map codes to the architecture
• Developing for accelerators can be challenging, as significantly more hardware knowledge is needed to get good performance
• There are major efforts at improving this
Other Sessions of Interest
• Tomorrow, 2 - 3 pm: CUDA Programming Part I; 3:30 - 5 pm: CUDA Programming Part II (WSCC 2A/2B)
• Tomorrow at 5:30 pm: BOF: Broad-based Efforts to Expand Parallelism Preparedness in the Computing Workforce (WSCC 611/612, here)
• Wednesday at 10:30 am: Panel/Discussion: Parallelism, the Cloud, and the Tools of the Future for the next generation of practitioners (WSCC 2A/2B)