Parallel and High Performance Computing Burton Smith Technical Fellow Microsoft
Agenda • Introduction • Definitions • Architecture and Programming • Examples • Conclusions
“Parallel and High Performance”? • “Parallel computing is a form of computation in which many calculations are carried out simultaneously” G.S. Almasi and A. Gottlieb, Highly Parallel Computing. Benjamin/Cummings, 1994 • A High Performance (Super) Computer is: • One of the 500 fastest computers as measured by HPL, the High Performance Linpack benchmark • A computer that costs 200,000,000 rubles or more • Necessarily parallel, at least since the 1970s
Recent Developments • For 20 years, parallel and high performance computing have been the same subject • Parallel computing is now mainstream • It reaches well beyond HPC into client systems: desktops, laptops, mobile phones • HPC software once had to stand alone • Now, it can be based on parallel PC software • The result: better tools and new possibilities
The Emergence of the Parallel Client • Uniprocessor performance is leveling off • Instruction-level parallelism nears a limit (ILP Wall) • Power is getting painfully high (Power Wall) • Caches show diminishing returns (Memory Wall) • Logic density continues to grow (Moore’s Law) • So uniprocessors will collapse in area and cost • Cores per chip need to increase exponentially • We must all learn to write parallel programs • So new “killer apps” will enjoy more speed
The ILP Wall • Instruction-level parallelism preserves the serial programming model • While getting speed from “undercover” parallelism • For example, see HPS†: out-of-order issue, in-order retirement, register renaming, branch prediction, speculation, … • At best, we get a few instructions/clock † Y.N. Patt et al., “Critical Issues Regarding HPS, a High Performance Microarchitecture,” Proc. 18th Ann. ACM/IEEE Int'l Symp. on Microarchitecture, 1985, pp. 109−116.
The Power Wall • In the old days, power was kept roughly constant • Dynamic power, equal to CV²f, dominated • Every shrink of .7 in feature size halved transistor area • Capacitance C and voltage V also decreased by .7 • Even with the clock frequency f increased by 1.4, power per transistor was cut in half • Now, shrinking no longer reduces V very much • So even at constant frequency, power density doubles • Static (leakage) power is also getting worse • Simpler, slower processors are more efficient • And to conserve power, we can turn some of them off
The Memory Wall • We can get bigger caches from more transistors • Does this suffice, or is there a problem scaling up? • To speed up 2X without changing bandwidth below the cache, the miss rate must be halved • How much bigger does the cache have to be?† • For dense matrix multiply or dense LU, 4x bigger • For sorting or FFTs, the square of its former size • For sparse or dense matrix-vector multiply, impossible • Deeper interconnects increase miss latency • Latency tolerance needs memory access parallelism † H.T. Kung, “Memory requirements for balanced computer architectures,” 13th International Symposium on Computer Architecture, 1986, pp. 49−54.
Overcoming the Memory Wall • Provide more memory bandwidth • Increase DRAM I/O bandwidth per gigabyte • Increase microprocessor off-chip bandwidth • Use architecture to tolerate memory latency • More latency ⇒ more threads or longer vectors • No change in programming model is needed • Use caches for bandwidth as well as latency • Let compilers control locality • Keep cache lines short • Avoid mis-speculation
The End of The von Neumann Model “Instructions are executed one at a time…” • We have relied on this idea for 60 years • Now it (and things it brought) must change • Serial programming is easier than parallel programming, at least for the moment • But serial programs are now slow programs • We need parallel programming paradigms that will make all programmers successful • The stakes for our field’s vitality are high • Computing must be reinvented
Asymptotic Notation • Quantities are often meaningful only within a constant factor • Algorithm performance analyses, for example • f(n) = O(g(n)) means there exist constants c and n0 such that n ≥ n0 implies |f(n)| ≤ |c g(n)| • f(n) = Ω(g(n)) means there exist constants c and n0 such that n ≥ n0 implies |f(n)| ≥ |c g(n)| • f(n) = Θ(g(n)) means both f(n) = O(g(n)) and f(n) = Ω(g(n))
Speedup, Time, and Work • The speedup of a computation is how much faster it runs in parallel compared to serially • If one processor takes T1 and p of them take Tp, then the p-processor speedup is Sp = T1/Tp • The work done is the number of operations performed, either serially or in parallel • W1 = O(T1) is the serial work, Wp the parallel work • We say a parallel computation is work-optimal if Wp = O(W1) = O(T1) • We say a parallel computation is time-optimal if Tp = O(W1/p) = O(T1/p)
Latency, Bandwidth, & Concurrency • In any system that moves items from input to output without creating or destroying them, latency × bandwidth = concurrency • Queueing theory calls this result Little’s law • For example, a latency of 3 and a bandwidth of 2 imply a concurrency of 6 items in flight
Parallel Processor Architecture • SIMD: Each instruction operates concurrently on multiple data items • MIMD: Multiple instruction sequences execute concurrently • Concurrency is expressible in space or time • Spatial: the hardware is replicated • Temporal: the hardware is pipelined
Trends in Parallel Processors • Today’s chips are spatial MIMD at top level • To get enough performance, even in PCs • Temporal MIMD is also used • SIMD is tending back toward spatial • Intel’s Larrabee combines all three • Temporal concurrency is easily “adjusted” • Vector length or number of hardware contexts • Temporal concurrency tolerates latency • Memory latency in the SIMD case • For MIMD, branches and synchronization also
Parallel Memory Architecture • A shared memory system is one in which any processor can address any memory location • Quality of access can be either uniform (UMA) or nonuniform (NUMA), in latency and/or bandwidth • A distributed memory system is one in which processors can’t address most of memory • The disjoint memory regions and their associated processors are usually called nodes • A cluster is a distributed memory system with more than one processor per node • Nearly all HPC systems are clusters
Parallel Programming Variations • Data Parallelism and Task Parallelism • Functional Style and Imperative Style • Shared Memory and Message Passing • …and more we won’t have time to look at • A parallel application may use all of them
Data Parallelism and Task Parallelism • A computation is data parallel when similar independent sub-computations are done simultaneously on multiple data items • Applying the same function to every element of a data sequence, for example • A computation is task parallel when dissimilar independent sub-computations are done simultaneously • Controlling the motions of a robot, for example • It sounds like SIMD vs. MIMD, but isn’t quite • Some kinds of data parallelism need MIMD
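To make the distinction concrete, here is a minimal C++ sketch (not from the talk; the helper functions and the use of std::async are illustrative assumptions). The data-parallel routine applies the same operation to every element, while the task-parallel routine runs two dissimilar computations at the same time.

    #include <cmath>
    #include <future>
    #include <vector>

    // Data parallel: the same operation on every element.  The iterations are
    // independent, so they could be split across processors.
    void scale_all(std::vector<double>& v, double a) {
      for (std::size_t i = 0; i < v.size(); i++) { v[i] *= a; }
    }

    double norm(const std::vector<double>& v) {   // illustrative helper
      double s = 0; for (double x : v) s += x * x; return std::sqrt(s);
    }
    double sum(const std::vector<double>& v) {    // illustrative helper
      double s = 0; for (double x : v) s += x; return s;
    }

    // Task parallel: two dissimilar, independent sub-computations run concurrently.
    void norm_and_sum(const std::vector<double>& v, double& n, double& s) {
      auto f1 = std::async(std::launch::async, norm, std::cref(v));
      auto f2 = std::async(std::launch::async, sum,  std::cref(v));
      n = f1.get();
      s = f2.get();
    }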
Functional and Imperative Programs • A program is said to be written in (pure) functional style if it has no mutable state • Computing = naming and evaluating expressions • Programs with mutable state are usually called imperative because the state changes must be done when and where specified:

    while (z < x) { int t = f(x, y); x = y; y = z; z = t; }
    return y;

• Often, programs can be written either way:

    let w(x, y, z) = if (z < x) then w(y, z, f(x, y)) else y;
Shared Memory and Message Passing • Shared memory programs access data in a shared address space • When to access the data is the big issue • Subcomputations therefore must synchronize • Message passing programs transmit data between subcomputations • The sender computes a value and then sends it • The receiver receives a value and then uses it • Synchronization can be built into communication • Message passing can be implemented very well on shared memory architectures
Barrier Synchronization • A barrier synchronizes multiple parallel sub-computations by letting none proceed until all have arrived • It is named after the barrier used to start horse races • It guarantees everything before the barrier finishes before anything after it begins • It is a central feature in several data-parallel languages such as OpenMP
Mutual Exclusion • This type of synchronization ensures only one subcomputation can do a thing at any time • If the thing is a code block, it is a critical section • It classically uses a lock: a data structure with which subcomputations can stop and start • Basic operations on a lock object L might be • Acquire(L): blocks until other subcomputations are finished with L, then acquires exclusive ownership • Release(L): yields L and unblocks some Acquire(L) • A lot has been written on these subjects
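As a minimal sketch (not from the talk), here is the Acquire/Release pattern with a standard C++ mutex guarding a critical section; the shared counter is an assumed example.

    #include <mutex>

    std::mutex L;           // the lock object
    long shared_count = 0;  // state protected by the critical section

    void add_one() {
      L.lock();             // Acquire(L): blocks until we own L exclusively
      shared_count++;       // critical section: one subcomputation at a time
      L.unlock();           // Release(L): yields L, unblocking a waiting Acquire(L)
    }

    // Idiomatic C++ uses a scoped guard so the release cannot be forgotten:
    void add_one_guarded() {
      std::lock_guard<std::mutex> g(L);
      shared_count++;
    }  // g's destructor releases L, even if an exception is thrown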
Non-Blocking Synchronization • The basic idea is to achieve mutual exclusion using memory read-modify-write operations • Most commonly used is compare-and-swap: • CAS(addr, old, new) reads memory at addr and if it contains old then old is replaced by new • Arbitrary update operations at an addr require {read old; compute new; CAS(addr, old, new);}be repeated until the CAS operation succeeds • If there is significant updating contention at addr, the repeated computation of new may be wasteful
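A sketch of that retry loop in portable C++ (an assumed example, not the talk's code), with std::atomic's compare-exchange playing the role of CAS:

    #include <atomic>

    std::atomic<int> counter{0};

    // Arbitrary update: read old, compute new, then CAS; repeat until it succeeds.
    void add_saturating(int delta, int limit) {
      int old_val = counter.load();
      int new_val;
      do {
        new_val = old_val + delta;            // the "compute new" step
        if (new_val > limit) new_val = limit;
        // compare_exchange_weak performs the CAS; on failure it refreshes old_val
      } while (!counter.compare_exchange_weak(old_val, new_val));
    }

Under heavy contention the "compute new" step may be repeated many times, which is exactly the wasted work noted above.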
Load Balancing • Some processors may be busier than others • To balance the workload, subcomputations can be scheduled on processors dynamically • A technique for parallel loops is self-scheduling: processors repetitively grab chunks of iterations • In guided self-scheduling, the chunk sizes shrink • Analogous imbalances can occur in memory • Overloaded memory locations are called hot spots • Parallel algorithms and data structures must be designed to avoid them • Imbalanced messaging is sometimes seen
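A hedged sketch of self-scheduling (assumed code, not from the talk): worker threads repeatedly grab fixed-size chunks of iterations from a shared atomic counter until the iteration space is exhausted.

    #include <algorithm>
    #include <atomic>
    #include <functional>
    #include <thread>
    #include <vector>

    // Run body(0..n-1) on nworkers threads, self-scheduled in chunks.
    void self_scheduled_for(int n, int chunk, int nworkers,
                            const std::function<void(int)>& body) {
      std::atomic<int> next{0};
      std::vector<std::thread> workers;
      for (int w = 0; w < nworkers; w++) {
        workers.emplace_back([&] {
          for (;;) {
            int begin = next.fetch_add(chunk);  // claim the next chunk
            if (begin >= n) break;              // nothing left to grab
            int end = std::min(begin + chunk, n);
            for (int i = begin; i < end; i++) body(i);
          }
        });
      }
      for (auto& t : workers) t.join();
    }

Guided self-scheduling would shrink the chunk size as next approaches n rather than keeping it fixed.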
A Data Parallel Example: Sorting

    void sort(int *src, int *dst, int size, int nvals) {
      int i, j, t1[nvals], t2[nvals];
      for (j = 0; j < nvals; j++) { t1[j] = 0; }
      for (i = 0; i < size; i++) { t1[src[i]]++; }
      // t1[] now contains a histogram of the values
      t2[0] = 0;
      for (j = 1; j < nvals; j++) { t2[j] = t2[j-1] + t1[j-1]; }
      // t2[j] now contains the origin for value j
      for (i = 0; i < size; i++) { dst[t2[src[i]]++] = src[i]; }
    }
When Is a Loop Parallelizable? • The loop instances must safely interleave • A way to do this is to only read the data • Another way is to isolate data accesses • Look at the first loop:

    for (j = 0; j < nvals; j++) { t1[j] = 0; }

• The accesses to t1[] are isolated from each other • This loop can run in parallel “as is”
Isolating Data Updates • The second loop seems to have a problem:

    for (i = 0; i < size; i++) { t1[src[i]]++; }

• Two iterations may access the same t1[src[i]] • If both reads precede both increments, oops! • A few ways to isolate the iteration conflicts: • Use an “isolated update” (lock prefix) instruction • Use an array of locks, perhaps as big as t1[] • Use non-blocking updates • Use a transaction
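For instance, the non-blocking option could look like this in C++ (a sketch under the assumption that t1 is redeclared as an array of atomics; it is not the talk's code):

    #include <atomic>
    #include <vector>

    // Histogram loop with isolated updates: each increment is atomic, so two
    // iterations hitting the same t1[src[i]] can no longer lose an update.
    void histogram(const int *src, int size, std::vector<std::atomic<int>>& t1) {
      for (int i = 0; i < size; i++) {
        t1[src[i]].fetch_add(1, std::memory_order_relaxed);
      }
    }

With the update isolated, the loop's iterations can safely be divided among processors.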
Dependent Loop Iterations • The third loop is an interesting challenge:

    for (j = 1; j < nvals; j++) { t2[j] = t2[j-1] + t1[j-1]; }

• Each iteration depends on the previous one • This loop is an example of a prefix computation • If • is an associative binary operation on a set S, the •-prefixes of the sequence x0, x1, x2, … of values from S are x0, x0•x1, x0•x1•x2, … • Prefix computations are often known as scans • Scan can be done efficiently in parallel
Cyclic Reduction • Each column below is one loop iteration, with its current value shown; each row is one step of the scan • On step k of the scan, iteration j prefixes its own value with the value from iteration j − 2^k

    Initially:      a   b    c     d      e       f        g
    After step 0:   a   ab   bc    cd     de      ef       fg
    After step 1:   a   ab   abc   abcd   bcde    cdef     defg
    After step 2:   a   ab   abc   abcd   abcde   abcdef   abcdefg
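A compact C++ sketch of the picture above (an illustration, not the talk's code): on each step every element reads the previous step's value at a fixed distance, so all the updates within a step are independent and could run in parallel.

    #include <vector>

    // Cyclic-reduction (all-prefix-sums) inclusive scan with + as the operation.
    std::vector<int> scan_inclusive(std::vector<int> a) {
      std::vector<int> b(a.size());
      for (std::size_t d = 1; d < a.size(); d *= 2) {
        // Each j is updated independently here -- this inner loop is the parallel step.
        for (std::size_t j = 0; j < a.size(); j++) {
          b[j] = (j >= d) ? a[j - d] + a[j] : a[j];
        }
        a.swap(b);
      }
      return a;  // a[j] now holds x0 + x1 + ... + xj
    }

Done this way over all n elements the work is O(n log n); the blocked scheme on the following slides brings it back to about 3n when n is much larger than p.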
Applications of Scan • Linear recurrences like the third loop • Polynomial evaluation • String comparison • High-precision addition • Finite automata • Each xi is the next-state function given the ith input symbol and • is function composition • APL compress • When only the final value is needed, the computation is called a reduction instead • It’s a little bit cheaper than a full scan
More Iterations n Than Processors p • Split the iterations into p blocks: each processor reduces its own block, a scan across the p block sums supplies each block's offset, and each processor then scans its block locally • Wp = 3n + O(p log p), Tp = 3n/p + O(log p)
OpenMP • OpenMP is a widely-implemented extension to C++ and Fortran for data† parallelism • It adds directives to serial programs • A few of the more important directives:

    #pragma omp parallel for <modifiers>
    <for loop>
    #pragma omp atomic
    <binary op=, ++ or -- statement>
    #pragma omp critical <name>
    <structured block>
    #pragma omp barrier

†And perhaps task parallelism soon
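As a small illustration (an assumption for this writeup, not code from the talk), the first two directives applied to the earlier sorting loops:

    // Initialize the histogram: the iterations are independent.
    #pragma omp parallel for
    for (int j = 0; j < nvals; j++) { t1[j] = 0; }

    // Build the histogram: the shared increment is isolated with an atomic directive.
    #pragma omp parallel for
    for (int i = 0; i < size; i++) {
      #pragma omp atomic
      t1[src[i]]++;
    }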
The Sorting Example in OpenMP • Only the third “scan” loop is a problem • We can at least do this loop “manually”:

    nt = omp_get_num_threads();
    int ta[nt], tb[nt];
    #pragma omp parallel for
    for (myt = 0; myt < nt; myt++) {
      // Set ta[myt] = local sum of nvals/nt elements of t1[]
      #pragma omp barrier
      for (k = 1; k < nt; k *= 2) {
        tb[myt] = ta[myt];
        #pragma omp barrier
        if (myt >= k) { ta[myt] += tb[myt - k]; }
        #pragma omp barrier
      }
      fix = (myt > 0) ? ta[myt - 1] : 0;
      // Set nvals/nt elements of t2[] to fix + local scan of t1[]
    }
Parallel Patterns Library (PPL) • PPL is a Microsoft C++ library built on top of the ConcRT user-mode scheduling runtime • It supports mixed data- and task-parallelism: • parallel_for, parallel_for_each, parallel_invoke • agent, send, receive, choice, join, task_group • Parallel loops use C++ lambda expressions:

    parallel_for(0, nvals, [&t1](int j) { t1[j] = 0; });

• Updates can be isolated using intrinsic functions:

    (void)_InterlockedIncrement(&t1[src[i]]);

• Microsoft and Intel plan to unify PPL and TBB
Dynamic Resource Management • PPL programs are written for an arbitrary number of processors, could be just one • Load balancing is mostly done by work stealing • There are two kinds of work to steal: • Work that is unblocked and waiting for a processor • Work that is not yet started and is potentially parallel • Work of the latter kind will be done serially unless it is first stolen by another processor • This makes recursive divide and conquer easy • There is no concern about when to stop parallelism
A Quicksort Example

    void quicksort(vector<int>::iterator first,
                   vector<int>::iterator last) {
      if (last - first < 2) { return; }
      int pivot = *first;
      auto mid1 = partition(first, last, [=](int e) { return e < pivot; });
      auto mid2 = partition(mid1, last, [=](int e) { return e == pivot; });
      parallel_invoke(
        [=] { quicksort(first, mid1); },
        [=] { quicksort(mid2, last); }
      );
    }
LINQ and PLINQ • LINQ (Language Integrated Query) extends the .NET languages C#, Visual Basic, and F# • A LINQ query is really just a functional monad • It queries databases, XML, or any IEnumerable • PLINQ is a parallel implementation of LINQ • Non-isolated functions must be avoided • Otherwise it is hard to tell the two apart
A LINQ Example

    var q = from n in names
            where n.Name == queryInfo.Name &&
                  n.State == queryInfo.State &&
                  n.Year >= yearStart && n.Year <= yearEnd
            orderby n.Year ascending
            select n;

• The PLINQ version differs only in applying .AsParallel() to the data source: from n in names.AsParallel()
Message Passing Interface (MPI) • MPI is a widely used message passing library for distributed memory HPC systems • Some of its basic functions: MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Send, MPI_Recv • A few of its “collective communication” functions: MPI_Barrier, MPI_Gather, MPI_Allgather, MPI_Alltoall, MPI_Reduce, MPI_Allreduce, MPI_Scan, MPI_Exscan
Sorting in MPI • Roughly, it could work like this on n nodes: • Run the first two loops locally • Use MPI_Allreduce to build a global histogram • Run the third loop (redundantly) at every node • Allocate n value intervals to nodes (redundantly) • Balancing the data per node as well as possible • Run the fourth loop using the local histogram • Use MPI_Alltoall to redistribute the data • Merge the n sorted subarrays on each node • Collective communication is expensive • But sorting needs it (see the Memory Wall slide)
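For the second step, a hedged sketch of the MPI_Allreduce call that combines the per-node histograms into a global one (illustrative; the variable names are assumed, and t1 is the histogram from the earlier sort code):

    #include <mpi.h>

    // t1_local[] holds this node's histogram of its share of the data.
    // Afterwards every node holds the same global histogram in t1_global[].
    void combine_histograms(const int *t1_local, int *t1_global, int nvals) {
      MPI_Allreduce(t1_local, t1_global, nvals, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    }

Each node can then run the third (scan) loop redundantly over t1_global, as the outline above suggests.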
Another Way to Sort in MPI • The Samplesort algorithm is like Quicksort • It works like this on n nodes: • Sort the local data on each node independently • Take s samples of the sorted data on each node • Use MPI_Allgather to send all nodes all samples • Compute n − 1 splitters (redundantly) on all nodes • Balancing the data per node as well as possible • Use MPI_Alltoall to redistribute the data • Merge the n sorted subarrays on each node
Parallel Computing Has Arrived • We must rethink how we write programs • And we are definitely doing that • Other things will also need to change • Architecture • Operating systems • Algorithms • Theory • Application software • We are seeing the biggest revolution in computing since its very beginnings