Introduction to High Performance Computing
Susan Thomas Brown, Ph.D.
Summer 2010
What is High Performance Computing?
• Why do we do HPC?
• What differentiates HPC from just "C"?
• What are the different types of HPC?
• Name some applications of HPC…
Concepts
• Serial vs. parallel computing
• Types of parallel computers
• Terminology – not to be feared!
Computing Basics: Von Neumann Architecture (1945)
Every computer is composed of four main components:
• Memory
• Control Unit
• Arithmetic Logic Unit
• Input/Output
Serial Computing
A problem is broken down into a discrete number of computations (discretized) that can be executed one after another on one CPU.
[Diagram: Input → Oper 1 → Oper 2 → … → Output]
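The pipeline in the diagram can be sketched in a few lines of Python; the operations here are illustrative placeholders, not from the slides.

```python
# Serial computing sketch: a problem discretized into operations
# that run one after another on a single CPU.
# oper_1 and oper_2 are made-up example computations.

def oper_1(x):
    return x + 10          # first discrete computation

def oper_2(x):
    return x * 2           # second discrete computation

def run_serial(data):
    result = data
    for oper in (oper_1, oper_2):   # Input -> Oper 1 -> Oper 2 -> Output
        result = oper(result)
    return result

print(run_serial(5))  # -> 30
```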
SISD – Single Instruction, Single Data
A serial (non-parallel) computer
• Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle
• Single data: only one data stream is being used as input during any one clock cycle
• Deterministic execution
• This is the oldest and, even today, the most common type of computer
• Examples: older-generation mainframes, minicomputers and workstations; most modern-day PCs
SIMD – Single Instruction, Multiple Data
A type of parallel computer
• Single instruction: all processing units execute the same instruction at any given clock cycle
• Multiple data: each processing unit can operate on a different data element
• Best suited for specialized problems characterized by a high degree of regularity, such as graphics/image processing
• Synchronous (lockstep) and deterministic execution
• Two varieties: Processor Arrays and Vector Pipelines
• Examples:
  • Processor Arrays: Connection Machine CM-2, MasPar MP-1 & MP-2, ILLIAC IV
  • Vector Pipelines: IBM 9000, Cray X-MP, Y-MP & C90, Fujitsu VP, NEC SX-2, Hitachi S820, ETA10
• Most modern computers, particularly those with graphics processing units (GPUs), employ SIMD instructions and execution units
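The lockstep idea can be illustrated conceptually in Python; note that real SIMD is performed by hardware vector units, not an interpreted loop, so this is only a sketch of the programming pattern.

```python
# Conceptual SIMD sketch: one instruction ("multiply by 2.0") applied
# to every element of a data array. In real SIMD hardware, the vector
# unit applies the instruction to all lanes in the same clock cycle.

data = [1.0, 2.0, 3.0, 4.0]           # multiple data elements
scaled = [x * 2.0 for x in data]      # single instruction, all elements
print(scaled)  # -> [2.0, 4.0, 6.0, 8.0]
```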
SIMD Examples
• IBM/9000: from IBM's brochure for the Enterprise IBM/9000, printed in 1990
• Cray C90 ("antero"), 1996–1999: LLNL Archives and Research Center
• ILLIAC IV (Burroughs and University of Illinois, 1965): photo courtesy of "Lexikon's History of Computing Encyclopedia on CD ROM"
MISD – Multiple Instruction, Single Data
• A single data stream is fed into multiple processing units
• Each processing unit operates on the data independently via independent instruction streams
• Few actual examples of this class of parallel computer have ever existed; one is the experimental Carnegie-Mellon C.mmp computer (1971)
• Some conceivable uses:
  • multiple frequency filters operating on a single signal stream
  • multiple cryptography algorithms attempting to crack a single coded message
MIMD – Multiple Instruction, Multiple Data
Currently the most common type of parallel computer; most modern computers fall into this category
• Multiple instruction: every processor may be executing a different instruction stream
• Multiple data: every processor may be working with a different data stream
• Execution can be synchronous or asynchronous, deterministic or non-deterministic
• Examples: most current supercomputers, networked parallel computer clusters and "grids", multi-processor SMP computers, multi-core PCs
• Note: many MIMD architectures also include SIMD execution sub-components
MIMD Examples
• Pingo at ARSC – Cray XT5, 3456 compute cores
• Midnight at ARSC – Sun cluster with Opteron processors, 2280 compute cores
• SunFire workstations at ARSC – 4 compute cores
GPGPU Architectures (~2000)
• Use of graphics processing units (GPUs) to boost performance
• Time-intensive calculations are off-loaded to the GPU
Parallel Computing Memory Architectures
Shared Memory
• All CPUs access the same central memory space
• Uniform Memory Access (UMA) vs. Non-Uniform Memory Access (NUMA)
Parallel Computing Memory Architectures (2)
Distributed Memory
• Each processor has its own local memory
• One processor's memory is not directly accessible by the other CPUs
Hybrid Distributed-Shared Memory
• The largest computers in use today employ this model
• Why?
Parallel Programming Models
• Shared Memory
• Threads
• Message Passing
• Data Parallel
• Hybrid
• Higher Level Models
Shared Memory Programming Model
• All tasks share a common address space, which they read and write to asynchronously
• Communication of data between tasks is simplified for the programmer because no single task "owns" the data
• Controlling the locality of the data is difficult, however, which makes programming for performance harder
• Possible slow-downs due to:
  • location of memory
  • "traffic jams"
Threads Model
• A single process can have multiple, concurrent execution paths
• Analogy: a.out is the main program; the subroutines are the threads
• The threads share all the common resources of a.out but execute concurrently
• Examples:
  • POSIX Threads (Pthreads)
  • OpenMP
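A minimal sketch of the threads model, using Python's standard `threading` module rather than Pthreads or OpenMP: one process spawns threads (the "subroutines") that share its memory, so updates to shared data must be protected. The worker function and data split are illustrative.

```python
# Threads sketch: threads share the process's memory (here, `total`),
# so a lock protects the shared update.
import threading

total = 0
lock = threading.Lock()

def worker(chunk):
    global total
    s = sum(chunk)        # each thread computes on its own piece
    with lock:            # shared data must be updated safely
        total += s

chunks = [range(0, 50), range(50, 100)]
threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(total)  # -> 4950 (the sum of 0..99)
```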
Message Passing Model
• Tasks use their own local memory during computation
• Multiple tasks can reside on the same physical machine as well as across an arbitrary number of machines
• Tasks exchange data through communications, by sending and receiving messages
• MPI Forum formed in 1992
• Message Passing Interface (MPI) released in 1994 – an industry standard
• MPI-2 released in 1997
• Question: name areas where the programmer could look for slow performance issues
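The send/receive pattern can be mimicked in plain Python with threads and queues; this is only an analogy for MPI-style semantics (the channel names and data are made up), since the tasks share no data and communicate only through explicit messages.

```python
# Message-passing sketch: two "tasks" exchange data only via explicit
# send (put) and receive (get), mimicking MPI semantics with queues.
import threading
import queue

to_task1 = queue.Queue()   # channel: task 0 -> task 1
to_task0 = queue.Queue()   # channel: task 1 -> task 0
results = []

def task0():
    to_task1.put(list(range(10)))   # "send" the work to task 1
    results.append(to_task0.get())  # "receive" the partial result

def task1():
    data = to_task1.get()           # "receive" the work
    to_task0.put(sum(data))         # "send" back the partial sum

t0 = threading.Thread(target=task0)
t1 = threading.Thread(target=task1)
t0.start(); t1.start()
t0.join(); t1.join()
print(results)  # -> [45]
```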
Data Parallel Model
• Most of the parallel work focuses on performing operations on a data set
• A set of tasks works collectively on the same data structure, but each on a different partition of the structure
• Tasks perform the same operation on their partition of the work
• Examples: Fortran 90 and 95
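A sketch of the same idea in Python rather than Fortran: one data structure is partitioned, and every task applies the identical operation to its own partition. The partitioning scheme and the squaring operation are illustrative choices.

```python
# Data-parallel sketch: the SAME operation applied by each task to a
# DIFFERENT partition of one data structure.
from concurrent.futures import ThreadPoolExecutor

data = list(range(12))
nparts = 3
partitions = [data[i::nparts] for i in range(nparts)]  # split the structure

def operate(part):
    return [x ** 2 for x in part]   # identical operation on each partition

with ThreadPoolExecutor(max_workers=nparts) as pool:
    squared_parts = list(pool.map(operate, partitions))

print(squared_parts)
```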
Higher Level Models
Model hierarchy:
• Assembly or machine language – simple, fast, low overhead; knowledge burden on the programmer; not very portable
• First level of portable programming languages – C, C++, Fortran (F77, F90), MPI
• Higher-level models – intuitive, GUI-based, little burden on the programmer, high overhead
Typical higher-level models for parallel computing:
• Single Program Multiple Data (SPMD)
• Multiple Program Multiple Data (MPMD)
• Python
• Matlab
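The SPMD pattern can be sketched briefly: every "process" runs the same program text but branches on its rank to work on its own share of the data. Ranks, the data range, and the use of threads as stand-ins for processes are all illustrative here.

```python
# SPMD sketch: one program, run by every rank; behavior differs only
# through the rank each instance is given.
import threading

size = 4
results = [None] * size

def program(rank):
    # same program for all ranks; each rank takes its share of 0..99
    my_data = range(rank * 25, (rank + 1) * 25)
    results[rank] = sum(my_data)

threads = [threading.Thread(target=program, args=(r,)) for r in range(size)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sum(results))  # -> 4950
```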
Top500 List
• Started in 1993
• Based on the LINPACK benchmark
• Published twice a year, in June and in November at the Supercomputing Conference (SC)
• Freely available: http://www.top500.org/lists/2011/11
Group Exercise
• Break up into 4 groups of 2
• Each group takes an envelope
• You have 15 minutes to build your "computer" using the architecture defined in your envelope
• The components are given
• Don't actually compute; just indicate what each component will do
• Use the equation in the problem statement
• Ask for help if you need it
Discussion:
• Which architecture do you think would be best to solve this problem?
• Can any architecture be used to solve any problem?
• What "cost" is incurred in using a non-ideal architecture for the problem at hand?
Sources
• Barney, Blaise, "Introduction to Parallel Computing," Lawrence Livermore National Laboratory, https://computing.llnl.gov/tutorials/parallel_comp/#Abstract
• http://www.nvidia.com/object/what-is-gpu-computing.html
• http://www.top500.org/list