CS 140 : Feb 4, 2013 Multicore (and Shared Memory) Programming with Cilk++

• Multicore and Shared Memory
• Cilk++ as a concurrency platform
• Divide and conquer paradigm for Cilk++
• Merge Sort
Thanks to Charles E. Leiserson for some of these slides.

Multicore Architecture
[Figure: Chip Multiprocessor (CMP). Several cores (core, core, core, …), each with its own cache ($), connected by a network to Memory and I/O.]

Desktop Multicores Today
This is your AMD Shanghai or Intel Core i7 (Nehalem)!
• On-chip interconnect
• Private cache: cache coherence is required

Intel MIC

Cilk++
• Cilk++ is a faithful extension of C++.
• Programmers implement algorithms mostly in the divide-and-conquer paradigm.
Two hints to the compiler:
• cilk_spawn: the following function can run in parallel with the caller.
• cilk_sync: all spawned children must return before program execution can continue.
Third keyword for programmer convenience only (the compiler converts it to spawns and syncs under the covers):
• cilk_for

QUICKSORT 14 13 45 56 34 31 32 21 78 Partition around Pivot 56 13 31 21 45 34 32 14 78

QUICKSORT
Quicksort recursively: 56 13 31 21 45 34 32 14 78 → 13 14 21 31 32 34 45 56 78

Nested Parallelism
Example: Quicksort
    template <typename T>
    void qsort(T begin, T end) {
        if (begin != end) {
            T middle = partition(
                begin, end,
                bind2nd(less<typename iterator_traits<T>::value_type>(), *begin));
            cilk_spawn qsort(begin, middle);      // the named child function may execute in parallel with the parent caller
            qsort(max(begin + 1, middle), end);
            cilk_sync;                            // control cannot pass this point until all spawned children have returned
        }
    }

Cilk++ Loops
Example: Matrix transpose
    cilk_for (int i = 1; i < n; ++i) {
        cilk_for (int j = 0; j < i; ++j) {
            B[i][j] = A[j][i];
        }
    }
• A cilk_for loop's iterations execute in parallel.
• The index must be declared in the loop initializer.
• The end condition is evaluated exactly once at the beginning of the loop.
• Loop increments should be a const value.
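The earlier slide says the compiler converts cilk_for into spawns and syncs. A minimal sketch of what such a lowering could look like is recursive halving of the iteration range. The helper name loop_dc and its grain parameter are hypothetical and not part of Cilk++, and the keywords are erased by macros so the block compiles with a plain C++ compiler.

```cpp
#include <cassert>
#include <vector>

// Assumption: built without Cilk++, so the keywords expand to nothing.
#ifndef CILKPAR
#define cilk_spawn
#define cilk_sync
#endif

// Hypothetical helper: calls body(i) for every i in [lo, hi) by recursive
// halving. Ranges of at most `grain` iterations are run as a serial loop.
template <typename Body>
void loop_dc(int lo, int hi, int grain, Body body) {
    if (hi - lo <= grain || hi - lo <= 1) {
        for (int i = lo; i < hi; ++i) body(i);
        return;
    }
    int mid = lo + (hi - lo) / 2;
    cilk_spawn loop_dc(lo, mid, grain, body);  // left half may run in parallel
    loop_dc(mid, hi, grain, body);             // right half runs in the caller
    cilk_sync;                                 // wait for the spawned half
}

// The slide's transpose of the strictly lower triangle:
// B[i][j] = A[j][i] for 0 <= j < i < n. Other entries of B stay 0.
std::vector<std::vector<int>> transpose_lower(
        const std::vector<std::vector<int>>& A) {
    int n = static_cast<int>(A.size());
    std::vector<std::vector<int>> B(n, std::vector<int>(n, 0));
    loop_dc(1, n, 2, [&](int i) {
        for (int j = 0; j < i; ++j) B[i][j] = A[j][i];
    });
    return B;
}
```

Each outer iteration writes a different row of B, so the iterations are independent of one another.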

Serial Correctness
The serialization is the code with the Cilk++ keywords replaced by null or C++ keywords.
Cilk++ source:
    int fib (int n) {
        if (n<2) return (n);
        else {
            int x,y;
            x = cilk_spawn fib(n-1);
            y = fib(n-2);
            cilk_sync;
            return (x+y);
        }
    }
Serialization:
    int fib (int n) {
        if (n<2) return (n);
        else {
            int x,y;
            x = fib(n-1);
            y = fib(n-2);
            return (x+y);
        }
    }
Serial correctness can be debugged and verified by running the multithreaded code on a single processor.
[Figure: toolchain diagram with the labels Cilk++ source, Serialization, Cilk++ Compiler, Conventional Compiler, Linker, Cilk++ Runtime Library, Binary, Conventional Regression Tests, Reliable Single-Threaded Code.]

Serialization
How to seamlessly switch between serial C++ and parallel Cilk++ programs?
Add to the beginning of your program:
    #ifdef CILKPAR
    #include <cilk.h>
    #else
    #define cilk_for for
    #define cilk_main main
    #define cilk_spawn
    #define cilk_sync
    #endif
Compile:
• cilk++ -DCILKPAR -O2 -o parallel.exe main.cpp
• g++ -O2 -o serial.exe main.cpp
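Putting the macro block together with the fib example from the Serial Correctness slide gives one file that builds either way. This is a sketch that assumes CILKPAR is defined only for the Cilk++ build, as on the slide; only the two macros that fib needs are repeated here.

```cpp
#include <cassert>

// Assumption: CILKPAR is defined only when compiling with cilk++.
// Without it, the keywords below are erased and this is ordinary C++.
#ifdef CILKPAR
#include <cilk.h>
#else
#define cilk_spawn
#define cilk_sync
#endif

int fib(int n) {
    if (n < 2) return n;
    int x, y;
    x = cilk_spawn fib(n - 1);  // serial build: a plain call
    y = fib(n - 2);
    cilk_sync;                  // serial build: an empty statement
    return x + y;
}
```

The serial build is the serialization of the program, so it can be checked with conventional regression tests.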

Parallel Correctness
Cilk++ source:
    int fib (int n) {
        if (n<2) return (n);
        else {
            int x,y;
            x = cilk_spawn fib(n-1);
            y = fib(n-2);
            cilk_sync;
            return (x+y);
        }
    }
Parallel correctness can be debugged and verified with the Cilkscreen race detector, which guarantees to find inconsistencies with the serial code quickly.
[Figure: toolchain diagram with the labels Cilk++ source, Cilk++ Compiler, Conventional Compiler, Linker, Binary, Cilkscreen Race Detector, Parallel Regression Tests, Reliable Multi-Threaded Code.]

Race Bugs
Definition. A determinacy race occurs when two logically parallel instructions access the same memory location and at least one of the instructions performs a write.
Example:
    int x = 0;                            // A
    cilk_for (int i = 0; i < 2; ++i) {
        x++;                              // B and C
    }
    assert(x == 2);                       // D
[Figure: dependency graph. A (int x = 0;) precedes B (x++;) and C (x++;), which are logically parallel; both precede D (assert(x == 2);).]

Race Bugs
Definition. A determinacy race occurs when two logically parallel instructions access the same memory location and at least one of the instructions performs a write.
Each x++ expands into a load, an increment, and a store. One possible execution order (steps numbered 1 to 8):
    1  x = 0;
    2  r1 = x;
    3  r1++;
    4  r2 = x;
    5  r2++;
    6  x = r2;
    7  x = r1;
    8  assert(x == 2);
In this order both strands load x = 0, so x ends up as 1 and the assertion fails.
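The numbered interleaving can be replayed deterministically on one thread to see the lost update. This is only an illustration: the function name is hypothetical, and r1 and r2 stand in for the registers of the two logically parallel strands.

```cpp
#include <cassert>

// Replays steps 1-8 of the slide in numeric order on a single thread.
int replay_racy_interleaving() {
    int x = 0;    // 1
    int r1 = x;   // 2: first strand loads x (0)
    r1++;         // 3
    int r2 = x;   // 4: second strand loads x (still 0)
    r2++;         // 5
    x = r2;       // 6: second strand stores 1
    x = r1;       // 7: first strand stores 1, overwriting instead of adding
    return x;     // 8: assert(x == 2) would fail here, since x is 1
}
```

Other orders (for example 2, 3, 7, 4, 5, 6) give x == 2, which is why the bug shows up only in some runs.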

Types of Races Suppose that instruction A and instruction B both access a location x, and suppose that A∥B (A is parallel to B). Two sections of code are independent if they have no determinacy races between them.

Avoiding Races
• All the iterations of a cilk_for should be independent.
• Between a cilk_spawn and the corresponding cilk_sync, the code of the spawned child should be independent of the code of the parent, including code executed by additional spawned or called children.
Ex.
    cilk_spawn qsort(begin, middle);
    qsort(max(begin + 1, middle), end);
    cilk_sync;
Note: The arguments to a spawned function are evaluated in the parent before the spawn occurs.

Cilk++ Reducers
• Hyperobjects: reducers, holders, splitters
• Primarily designed as a solution to global variables, but they have broader application
Data race!
    int result = 0;
    cilk_for (size_t i = 0; i < N; ++i) {
        result += MyFunc(i);
    }
Race free!
    #include <reducer_opadd.h>
    …
    cilk::hyperobject<cilk::reducer_opadd<int> > result;
    cilk_for (size_t i = 0; i < N; ++i) {
        result += MyFunc(i);
    }
This uses one of the predefined reducers, but you can also write your own reducer easily.
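One way to picture why the reducer version is race free: each strand accumulates into its own private partial result (a "view"), and the views are combined afterwards with the associative + operator. The sketch below imitates that idea in plain C++. It is not how the Cilk++ runtime implements hyperobjects, and MyFunc, sum_with_views and num_views are placeholders made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the slide's MyFunc.
int MyFunc(std::size_t i) { return static_cast<int>(i % 7); }

// Sums MyFunc(i) over [0, N) with one private partial sum per chunk.
// Chunks write to disjoint views, so they could run in parallel without a
// determinacy race; they run serially here to keep the sketch portable.
int sum_with_views(std::size_t N, std::size_t num_views) {
    if (num_views == 0) num_views = 1;
    std::vector<int> views(num_views, 0);  // 0 is the identity of +
    for (std::size_t v = 0; v < num_views; ++v) {
        std::size_t lo = N * v / num_views;
        std::size_t hi = N * (v + 1) / num_views;
        for (std::size_t i = lo; i < hi; ++i) views[v] += MyFunc(i);
    }
    int result = 0;
    for (int partial : views) result += partial;  // reduce: combine the views
    return result;
}
```

Because + is associative, the combined result matches the serial sum regardless of how the range is split.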

Cilkscreen
• Cilkscreen runs off the binary executable:
  • Compile your program with the -fcilkscreen option to include debugging information.
  • Go to the directory with your executable and execute: cilkscreen your_program [options]
• Cilkscreen prints information about any races it detects.
• For a given input, Cilkscreen mathematically guarantees to localize a race if there exists a parallel execution that could produce results different from the serial execution.
• It runs about 20 times slower than real time.

Complexity Measures
$T_P$ = execution time on $P$ processors
$T_1$ = work
$T_\infty$ = span (also called critical-path length or computational depth)
WORK LAW: $T_P \ge T_1/P$
SPAN LAW: $T_P \ge T_\infty$

Speedup
Def. $T_1/T_P$ = speedup on $P$ processors.
• If $T_1/T_P = \Theta(P)$, we have linear speedup;
• if $T_1/T_P = P$, we have perfect linear speedup;
• if $T_1/T_P > P$, we have superlinear speedup, which is not possible in this performance model because of the Work Law $T_P \ge T_1/P$.

(Potential) Parallelism
Because the Span Law dictates that $T_P \ge T_\infty$, the maximum possible speedup given $T_1$ and $T_\infty$ is
$T_1/T_\infty$ = parallelism = the average amount of work per step along the span.
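The two laws together bound what any scheduler can achieve. A small sketch, with hypothetical function names, that turns measured work and span into those bounds:

```cpp
#include <algorithm>
#include <cassert>

// Lower bound on T_P implied by the Work Law (T_1/P) and the Span Law (T_inf).
double min_time(double work, double span, int P) {
    return std::max(work / P, span);
}

// Upper bound on the speedup T_1/T_P: at most P (Work Law) and at most the
// parallelism T_1/T_inf (Span Law).
double max_speedup(double work, double span, int P) {
    return std::min(static_cast<double>(P), work / span);
}
```

For example, with work 1000 and span 10 the parallelism is 100: on 4 processors the speedup is capped at 4, and on 1000 processors it is capped at 100, so the extra processors cannot help.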

Three Tips on Parallelism
• Minimize the span to maximize parallelism. Try to generate 10 times more parallelism than processors for near-perfect linear speedup.
• If you have plenty of parallelism, try to trade some of it off for reduced work overheads.
• Use divide-and-conquer recursion or parallel loops rather than spawning one small thing off after another.
Do this:
    cilk_for (int i = 0; i < n; ++i) {
        foo(i);
    }
Not this:
    for (int i = 0; i < n; ++i) {
        cilk_spawn foo(i);
    }
    cilk_sync;

Three Tips on Overheads
• Make sure that work/#spawns is not too small.
  • Coarsen by using function calls and inlining near the leaves of recursion rather than spawning.
• Parallelize outer loops if you can, not inner loops. If you must parallelize an inner loop, coarsen it, but not too much.
  • 500 iterations should be plenty coarse for even the most meager loop.
  • Fewer iterations should suffice for “fatter” loops.
• Use reducers only in sufficiently fat loops.
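Coarsening near the leaves could be sketched as follows: below a cutoff the recursion stops spawning and falls back to a plain loop, so each spawn pays for a reasonable amount of work. The function name and the cutoff value kGrain are illustrative assumptions, not tuned numbers, and the keywords are erased by macros so this builds as plain C++.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: built without Cilk++, so the keywords expand to nothing.
#ifndef CILKPAR
#define cilk_spawn
#define cilk_sync
#endif

// Illustrative cutoff; a real value should come from measurement.
const std::size_t kGrain = 512;

long long sum_range(const int* a, std::size_t n) {
    if (n <= kGrain) {  // coarsened leaf: serial loop, no spawn overhead
        long long s = 0;
        for (std::size_t i = 0; i < n; ++i) s += a[i];
        return s;
    }
    std::size_t half = n / 2;
    long long left, right;
    left = cilk_spawn sum_range(a, half);   // may run in parallel
    right = sum_range(a + half, n - half);  // runs in the caller
    cilk_sync;
    return left + right;
}
```

A larger cutoff lowers the spawn overhead but also lowers the parallelism, which is the trade-off the tips describe.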

Sorting
• Sorting is possibly the most frequently executed operation in computing!
• Quicksort is often the fastest sorting algorithm in practice, with an average running time of $O(N \log N)$ (but $O(N^2)$ worst-case performance).
• Mergesort has worst-case performance of $O(N \log N)$ for sorting $N$ elements.
• Both are based on the recursive divide-and-conquer paradigm.

MERGESORT
• Mergesort is an example of a recursive sorting algorithm.
• It is based on the divide-and-conquer paradigm.
• It uses the merge operation as its fundamental component (which takes in two sorted sequences and produces a single sorted sequence).
• Drawback of mergesort: not in-place (uses an extra temporary array).

Merging Two Sorted Arrays
Example inputs: 3 12 19 46 and 4 14 21 23
    template <typename T>
    void Merge(T *C, T *A, T *B, int na, int nb) {
        while (na>0 && nb>0) {
            if (*A <= *B) {
                *C++ = *A++; na--;
            } else {
                *C++ = *B++; nb--;
            }
        }
        while (na>0) {
            *C++ = *A++; na--;
        }
        while (nb>0) {
            *C++ = *B++; nb--;
        }
    }
Time to merge $n$ elements = $\Theta(n)$.
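As a usage sketch, the routine can be run on the two example arrays from the slide. The Merge body is repeated so the block stands alone; the wrapper merge_example is a hypothetical name.

```cpp
#include <cassert>
#include <vector>

// Serial merge from the slide: C receives the na + nb merged elements.
template <typename T>
void Merge(T* C, T* A, T* B, int na, int nb) {
    while (na > 0 && nb > 0) {
        if (*A <= *B) { *C++ = *A++; na--; }
        else          { *C++ = *B++; nb--; }
    }
    while (na > 0) { *C++ = *A++; na--; }  // copy the rest of A
    while (nb > 0) { *C++ = *B++; nb--; }  // copy the rest of B
}

// Hypothetical wrapper: merges the slide's two sorted example arrays.
std::vector<int> merge_example() {
    int A[] = {3, 12, 19, 46};
    int B[] = {4, 14, 21, 23};
    std::vector<int> C(8);
    Merge(C.data(), A, B, 4, 4);
    return C;  // 3 4 12 14 19 21 23 46
}
```

Each loop iteration writes exactly one output element, which is where the $\Theta(n)$ bound comes from.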

Parallel Merge Sort
A: input (unsorted), B: output (sorted), C: temporary
    template <typename T>
    void MergeSort(T *B, T *A, int n) {
        if (n==1) {
            B[0] = A[0];
        } else {
            T* C = new T[n];
            cilk_spawn MergeSort(C, A, n/2);
            MergeSort(C+n/2, A+n/2, n-n/2);
            cilk_sync;
            Merge(B, C, C+n/2, n/2, n-n/2);
            delete[] C;
        }
    }
[Figure: merge tree, bottom to top. 19 3 12 46 33 4 21 14 → (merge) 3 19 12 46 4 33 14 21 → (merge) 3 12 19 46 4 14 21 33 → (merge) 3 4 12 14 19 21 33 46]

Work of Merge Sort
    template <typename T>
    void MergeSort(T *B, T *A, int n) {
        if (n==1) {
            B[0] = A[0];
        } else {
            T* C = new T[n];
            cilk_spawn MergeSort(C, A, n/2);
            MergeSort(C+n/2, A+n/2, n-n/2);
            cilk_sync;
            Merge(B, C, C+n/2, n/2, n-n/2);
            delete[] C;
        }
    }
Work: $T_1(n) = 2T_1(n/2) + \Theta(n) = \Theta(n \lg n)$
CASE 2: $n^{\log_b a} = n^{\log_2 2} = n$, $f(n) = \Theta(n^{\log_b a} \lg^0 n)$

Span of Merge Sort
    template <typename T>
    void MergeSort(T *B, T *A, int n) {
        if (n==1) {
            B[0] = A[0];
        } else {
            T* C = new T[n];
            cilk_spawn MergeSort(C, A, n/2);
            MergeSort(C+n/2, A+n/2, n-n/2);
            cilk_sync;
            Merge(B, C, C+n/2, n/2, n-n/2);
            delete[] C;
        }
    }
Span: $T_\infty(n) = T_\infty(n/2) + \Theta(n) = \Theta(n)$
CASE 3: $n^{\log_b a} = n^{\log_2 1} = 1$, $f(n) = \Theta(n)$

Parallelism of Merge Sort
Work: $T_1(n) = \Theta(n \lg n)$
Span: $T_\infty(n) = \Theta(n)$
Parallelism: $T_1(n)/T_\infty(n) = \Theta(\lg n)$ PUNY!
We need to parallelize the merge!

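Parallel Merge
[Figure: arrays A (indices 0 to na) and B (indices 0 to nb), with na ≥ nb. A is split at ma = na/2 into a part ≤ A[ma] and a part ≥ A[ma]. A binary search finds the position mb that splits B the same way: elements up to mb-1 are ≤ A[ma], elements from mb on are ≥ A[ma]. The two "≤" parts are merged by one recursive P_Merge and the two "≥" parts by another. Throw away at least na/2 ≥ n/4.]
Key Idea: If the total number of elements to be merged in the two arrays is n = na + nb, the total number of elements in the larger of the two recursive merges is at most (3/4)n.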

Parallel Merge
    template <typename T>
    void P_Merge(T *C, T *A, T *B, int na, int nb) {
        if (na < nb) {
            P_Merge(C, B, A, nb, na);
        } else if (na==0) {
            return;
        } else {
            int ma = na/2;
            int mb = BinarySearch(A[ma], B, nb);
            C[ma+mb] = A[ma];
            cilk_spawn P_Merge(C, A, B, ma, mb);
            P_Merge(C+ma+mb+1, A+ma+1, B+mb, na-ma-1, nb-mb);
            cilk_sync;
        }
    }
Coarsen base cases for efficiency.
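The slides call BinarySearch without defining it. The sketch below fills it in with std::lower_bound, on the assumption that it should return how many elements of the sorted array B are strictly less than the key. With that helper and the serialization macros, P_Merge compiles and runs as ordinary C++.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Assumption: built without Cilk++, so the keywords expand to nothing.
#ifndef CILKPAR
#define cilk_spawn
#define cilk_sync
#endif

// Hypothetical helper (not defined on the slides): the number of elements
// in sorted B[0..nb) that are strictly less than key.
template <typename T>
int BinarySearch(const T& key, const T* B, int nb) {
    return static_cast<int>(std::lower_bound(B, B + nb, key) - B);
}

template <typename T>
void P_Merge(T* C, T* A, T* B, int na, int nb) {
    if (na < nb) {
        P_Merge(C, B, A, nb, na);            // keep A as the larger array
    } else if (na == 0) {
        return;                              // both arrays are empty
    } else {
        int ma = na / 2;
        int mb = BinarySearch(A[ma], B, nb);
        C[ma + mb] = A[ma];                  // A[ma] lands at its final slot
        cilk_spawn P_Merge(C, A, B, ma, mb); // elements before A[ma]
        P_Merge(C + ma + mb + 1, A + ma + 1, B + mb,
                na - ma - 1, nb - mb);       // elements after A[ma]
        cilk_sync;
    }
}
```

The two recursive calls write to disjoint parts of C (before and after index ma+mb), so the spawned child and the continuation are independent.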

Span of Parallel Merge
    template <typename T>
    void P_Merge(T *C, T *A, T *B, int na, int nb) {
        if (na < nb) {
            ⋮
            int mb = BinarySearch(A[ma], B, nb);
            C[ma+mb] = A[ma];
            cilk_spawn P_Merge(C, A, B, ma, mb);
            P_Merge(C+ma+mb+1, A+ma+1, B+mb, na-ma-1, nb-mb);
            cilk_sync;
        }
    }
Span: $T_\infty(n) = T_\infty(3n/4) + \Theta(\lg n) = \Theta(\lg^2 n)$
CASE 2: $n^{\log_b a} = n^{\log_{4/3} 1} = 1$, $f(n) = \Theta(n^{\log_b a} \lg^1 n)$

Work of Parallel Merge
    template <typename T>
    void P_Merge(T *C, T *A, T *B, int na, int nb) {
        if (na < nb) {
            ⋮
            int mb = BinarySearch(A[ma], B, nb);
            C[ma+mb] = A[ma];
            cilk_spawn P_Merge(C, A, B, ma, mb);
            P_Merge(C+ma+mb+1, A+ma+1, B+mb, na-ma-1, nb-mb);
            cilk_sync;
        }
    }
Work: $T_1(n) = T_1(\alpha n) + T_1((1-\alpha)n) + \Theta(\lg n)$, where $1/4 \le \alpha \le 3/4$. HAIRY!
Claim: $T_1(n) = \Theta(n)$.

Analysis of Work Recurrence
Work: $T_1(n) = T_1(\alpha n) + T_1((1-\alpha)n) + \Theta(\lg n)$, where $1/4 \le \alpha \le 3/4$.
Substitution method: Inductive hypothesis is $T_1(k) \le c_1 k - c_2 \lg k$, where $c_1, c_2 > 0$. Prove that the relation holds, and solve for $c_1$ and $c_2$.
$T_1(n) = T_1(\alpha n) + T_1((1-\alpha)n) + \Theta(\lg n)$
$\le c_1(\alpha n) - c_2 \lg(\alpha n) + c_1(1-\alpha)n - c_2 \lg((1-\alpha)n) + \Theta(\lg n)$

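Analysis of Work Recurrence
Work: $T_1(n) = T_1(\alpha n) + T_1((1-\alpha)n) + \Theta(\lg n)$, where $1/4 \le \alpha \le 3/4$.
$$\begin{aligned}
T_1(n) &= T_1(\alpha n) + T_1((1-\alpha)n) + \Theta(\lg n) \\
&\le c_1(\alpha n) - c_2 \lg(\alpha n) + c_1(1-\alpha)n - c_2 \lg((1-\alpha)n) + \Theta(\lg n) \\
&\le c_1 n - c_2 \lg(\alpha n) - c_2 \lg((1-\alpha)n) + \Theta(\lg n) \\
&\le c_1 n - c_2 \left( \lg(\alpha(1-\alpha)) + 2 \lg n \right) + \Theta(\lg n) \\
&\le c_1 n - c_2 \lg n - \left( c_2 (\lg n + \lg(\alpha(1-\alpha))) - \Theta(\lg n) \right) \\
&\le c_1 n - c_2 \lg n
\end{aligned}$$
by choosing $c_2$ large enough. Choose $c_1$ large enough to handle the base case.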

Parallelism of P_Merge
Work: $T_1(n) = \Theta(n)$
Span: $T_\infty(n) = \Theta(\lg^2 n)$
Parallelism: $T_1(n)/T_\infty(n) = \Theta(n/\lg^2 n)$

Parallel Merge Sort
    template <typename T>
    void P_MergeSort(T *B, T *A, int n) {
        if (n==1) {
            B[0] = A[0];
        } else {
            T C[n];    // variable-length array: a compiler extension, not standard C++
            cilk_spawn P_MergeSort(C, A, n/2);
            P_MergeSort(C+n/2, A+n/2, n-n/2);
            cilk_sync;
            P_Merge(B, C, C+n/2, n/2, n-n/2);
        }
    }
Work: $T_1(n) = 2T_1(n/2) + \Theta(n) = \Theta(n \lg n)$
CASE 2: $n^{\log_b a} = n^{\log_2 2} = n$, $f(n) = \Theta(n^{\log_b a} \lg^0 n)$

Parallel Merge Sort
    template <typename T>
    void P_MergeSort(T *B, T *A, int n) {
        if (n==1) {
            B[0] = A[0];
        } else {
            T C[n];    // variable-length array: a compiler extension, not standard C++
            cilk_spawn P_MergeSort(C, A, n/2);
            P_MergeSort(C+n/2, A+n/2, n-n/2);
            cilk_sync;
            P_Merge(B, C, C+n/2, n/2, n-n/2);
        }
    }
Span: $T_\infty(n) = T_\infty(n/2) + \Theta(\lg^2 n) = \Theta(\lg^3 n)$
CASE 2: $n^{\log_b a} = n^{\log_2 1} = 1$, $f(n) = \Theta(n^{\log_b a} \lg^2 n)$

Parallelism of P_MergeSort
Work: $T_1(n) = \Theta(n \lg n)$
Span: $T_\infty(n) = \Theta(\lg^3 n)$
Parallelism: $T_1(n)/T_\infty(n) = \Theta(n/\lg^2 n)$
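To get a feel for the difference, here is a rough worked example that ignores the constants hidden by the $\Theta$ notation (an assumption, so the numbers are only orders of magnitude). Take $n = 2^{20}$ elements, so $\lg n = 20$:

```latex
\begin{align*}
\text{MergeSort (serial merge):} \quad
  \frac{T_1(n)}{T_\infty(n)} &\approx \lg n = 20 \\
\text{P\_MergeSort (parallel merge):} \quad
  \frac{T_1(n)}{T_\infty(n)} &\approx \frac{n}{\lg^2 n}
  = \frac{1\,048\,576}{400} \approx 2621
\end{align*}
```

By the tip of generating about 10 times more parallelism than processors, the first figure is too small for even a handful of cores, while the second leaves ample slack.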
