
Basic Cilk Programming



  1. Basic Cilk Programming

  2. Multithreaded Programming in Cilk, LECTURE 1. Adapted from Charles E. Leiserson, Supercomputing Technologies Research Group, Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology.

  3. Multi-core processors
[Figure: P processors, each with a cache ($), connected by a network to shared memory and I/O.] MIMD, shared memory.

  4. Cilk Overview
• Cilk extends the C language with just a handful of keywords: cilk, spawn, sync.
• Every Cilk program has a serial semantics.
• Not only is Cilk fast, it provides performance guarantees based on performance abstractions.
• Cilk is processor-oblivious.
• Cilk’s provably good runtime system automatically manages low-level aspects of parallel execution, including protocols, load balancing, and scheduling.
• Cilk supports speculative parallelism.

  5. Fibonacci
C elision:
    int fib (int n) {
        if (n<2) return (n);
        else {
            int x, y;
            x = fib(n-1);
            y = fib(n-2);
            return (x+y);
        }
    }
Cilk code:
    cilk int fib (int n) {
        if (n<2) return (n);
        else {
            int x, y;
            x = spawn fib(n-1);
            y = spawn fib(n-2);
            sync;
            return (x+y);
        }
    }
Cilk is a faithful extension of C. A Cilk program’s serial elision is always a legal implementation of Cilk semantics. Cilk provides no new data types.
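A side note on the serial elision (not on the slides): in MIT Cilk-5 the elision can be produced mechanically by defining the three keywords away, e.g.:

    /* Sketch: eliding the Cilk keywords yields the plain C program. */
    #define cilk          /* a Cilk procedure becomes an ordinary C function */
    #define spawn         /* a spawn becomes an ordinary function call       */
    #define sync          /* "sync;" becomes the empty statement ";"         */

With these definitions in effect, the Cilk fib above compiles exactly as the C elision.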

  6. Basic Cilk Keywords
    cilk int fib (int n) {
        if (n<2) return (n);
        else {
            int x, y;
            x = spawn fib(n-1);
            y = spawn fib(n-2);
            sync;
            return (x+y);
        }
    }
cilk: identifies a function as a Cilk procedure, capable of being spawned in parallel.
spawn: the named child Cilk procedure can execute in parallel with the parent caller.
sync: control cannot pass this point until all spawned children have returned.

  7. Dynamic Multithreading
    cilk int fib (int n) {
        if (n<2) return (n);
        else {
            int x, y;
            x = spawn fib(n-1);
            y = spawn fib(n-2);
            sync;
            return (x+y);
        }
    }
Example: fib(4). The computation dag unfolds dynamically; the program is “processor oblivious.”
[Figure: the dag for fib(4), with nodes labeled by argument values 4, 3, 2, 2, 1, 1, 1, 0, 0.]

  8. Multithreaded Computation
• The dag G = (V, E) represents a parallel instruction stream.
• Each vertex v ∈ V represents a (Cilk) thread: a maximal sequence of instructions not containing parallel control (spawn, sync, return).
• Every edge e ∈ E is either a spawn edge, a return edge, or a continue edge.
[Figure: a dag from initial thread to final thread, with spawn, return, and continue edges.]

  9. Algorithmic Complexity Measures TP = execution time on P processors

  10. Algorithmic Complexity Measures
TP = execution time on P processors
T1 = work

  11. Algorithmic Complexity Measures
TP = execution time on P processors
T1 = work
T∞ = span*
* Also called critical-path length or computational depth.

  12. Algorithmic Complexity Measures
TP = execution time on P processors
T1 = work
T∞ = span*
LOWER BOUNDS
• TP ≥ T1/P
• TP ≥ T∞
* Also called critical-path length or computational depth.

  13. Speedup
Definition: T1/TP = speedup on P processors.
If T1/TP = Θ(P) ≤ P, we have linear speedup;
if T1/TP = P, we have perfect linear speedup;
if T1/TP > P, we have superlinear speedup, which is not possible in our model because of the lower bound TP ≥ T1/P.

  14. Parallelism
Because we have the lower bound TP ≥ T∞, the maximum possible speedup given T1 and T∞ is
T1/T∞ = parallelism = the average amount of work per step along the span.

  15. Example: fib(4)
Assume for simplicity that each Cilk thread in fib() takes unit time to execute.
Work: T1 = 17
Span: T∞ = 8
[Figure: the fib(4) dag; the critical path consists of 8 threads.]

  16. Example: fib(4)
Assume for simplicity that each Cilk thread in fib() takes unit time to execute.
Work: T1 = 17
Span: T∞ = 8
Parallelism: T1/T∞ = 2.125
Using many more than 2 processors makes little sense.

  17. Parallelizing Vector Addition
C:
    void vadd (real *A, real *B, int n) {
        int i;
        for (i=0; i<n; i++) A[i] += B[i];
    }

  18. Parallelizing Vector Addition
C:
    void vadd (real *A, real *B, int n) {
        int i;
        for (i=0; i<n; i++) A[i] += B[i];
    }
C, divide and conquer:
    void vadd (real *A, real *B, int n) {
        if (n <= BASE) {
            int i;
            for (i=0; i<n; i++) A[i] += B[i];
        } else {
            vadd (A, B, n/2);
            vadd (A+n/2, B+n/2, n-n/2);
        }
    }
Parallelization strategy:
1. Convert loops to recursion.

  19. Parallelizing Vector Addition
C:
    void vadd (real *A, real *B, int n) {
        int i;
        for (i=0; i<n; i++) A[i] += B[i];
    }
Cilk:
    cilk void vadd (real *A, real *B, int n) {
        if (n <= BASE) {
            int i;
            for (i=0; i<n; i++) A[i] += B[i];
        } else {
            spawn vadd (A, B, n/2);
            spawn vadd (A+n/2, B+n/2, n-n/2);
            sync;
        }
    }
Parallelization strategy:
1. Convert loops to recursion.
2. Insert Cilk keywords.
Side benefit: divide and conquer is generally good for caches!

  20. Vector Addition
    cilk void vadd (real *A, real *B, int n) {
        if (n <= BASE) {
            int i;
            for (i=0; i<n; i++) A[i] += B[i];
        } else {
            spawn vadd (A, B, n/2);
            spawn vadd (A+n/2, B+n/2, n-n/2);
            sync;
        }
    }
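For completeness, a minimal driver sketch (not on the slides; the `real` typedef, the array size, and the BASE value are assumptions — in Cilk-5, main is itself a Cilk procedure, and Cilk procedures must be invoked with spawn):

    typedef double real;      /* assumption: the slides never define real */
    #define BASE 256          /* assumption: an arbitrary base-case size  */

    cilk int main (void) {
        real A[1024], B[1024];
        int i;
        for (i = 0; i < 1024; i++) { A[i] = i; B[i] = 2*i; }
        spawn vadd (A, B, 1024);   /* invoke the Cilk procedure */
        sync;                      /* wait for it to complete   */
        return 0;
    }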

  21. Vector Addition Analysis
To add two vectors of length n, where BASE = Θ(1):
Work: T1 = Θ(n)
Span: T∞ = Θ(lg n)
Parallelism: T1/T∞ = Θ(n/lg n)
[Figure: the recursion tree, with BASE-sized leaves.]
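The Θ-bounds on the slide follow from the standard divide-and-conquer recurrences; a short derivation (not on the slides):

    T_1(n) = 2\,T_1(n/2) + \Theta(1) = \Theta(n)             % work: both halves execute
    T_\infty(n) = T_\infty(n/2) + \Theta(1) = \Theta(\lg n)  % span: the halves run in parallel, so only one counts
    T_1(n)/T_\infty(n) = \Theta(n/\lg n)                     % parallelism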

  22. Another Parallelization
C:
    void vadd1 (real *A, real *B, int n) {
        int i;
        for (i=0; i<n; i++) A[i] += B[i];
    }
    void vadd (real *A, real *B, int n) {
        int j;
        for (j=0; j<n; j+=BASE) {
            vadd1(A+j, B+j, min(BASE, n-j));
        }
    }
Cilk:
    cilk void vadd1 (real *A, real *B, int n) {
        int i;
        for (i=0; i<n; i++) A[i] += B[i];
    }
    cilk void vadd (real *A, real *B, int n) {
        int j;
        for (j=0; j<n; j+=BASE) {
            spawn vadd1(A+j, B+j, min(BASE, n-j));
        }
        sync;
    }

  23. Analysis
To add two vectors of length n, where BASE = Θ(1):
Work: T1 = Θ(n)
Span: T∞ = Θ(n), since the spawning loop itself is serial
Parallelism: T1/T∞ = Θ(1). PUNY!
[Figure: the vector divided into BASE-sized chunks, each spawned in turn.]

  24. Optimal Choice of BASE
To add two vectors of length n using an optimal choice of BASE to maximize parallelism:
Work: T1 = Θ(n)
Span: T∞ = Θ(BASE + n/BASE)
Choosing BASE = Θ(√n) ⇒ T∞ = Θ(√n)
Parallelism: T1/T∞ = Θ(√n)
[Figure: the vector divided into BASE-sized chunks.]
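Why √n is the right choice (a one-line derivation, not on the slides): the two terms of the span trade off against each other, and by the AM–GM inequality their sum is minimized when they are equal:

    \mathrm{BASE} + \frac{n}{\mathrm{BASE}} \;\ge\; 2\sqrt{\mathrm{BASE}\cdot\frac{n}{\mathrm{BASE}}} \;=\; 2\sqrt{n},
    \quad\text{with equality iff } \mathrm{BASE} = \sqrt{n}.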

  25. Weird! Don’t we want to remove recursion?
Parallel Programming = Sequential Program + Decomposition + Mapping + Communication and Synchronization

  26. Scheduling
• Cilk allows the programmer to express potential parallelism in an application.
• The Cilk scheduler maps Cilk threads onto processors dynamically at runtime.
• Since on-line schedulers are complicated, we’ll illustrate the ideas with an off-line scheduler.
[Figure: P processors with caches ($) connected by a network to shared memory and I/O.]

  27. Greedy Scheduling
IDEA: Do as much as possible on every step.
Definition: A thread is ready if all its predecessors have executed.

  28. Greedy Scheduling
IDEA: Do as much as possible on every step.
Definition: A thread is ready if all its predecessors have executed.
Complete step (P = 3):
• ≥ P threads ready.
• Run any P.

  29. Greedy Scheduling
IDEA: Do as much as possible on every step.
Definition: A thread is ready if all its predecessors have executed.
Complete step (P = 3):
• ≥ P threads ready.
• Run any P.
Incomplete step:
• < P threads ready.
• Run all of them.

  30. Greedy-Scheduling Theorem
Theorem [Graham ’68 & Brent ’75]. Any greedy scheduler achieves TP ≤ T1/P + T∞.
Proof.
• # complete steps ≤ T1/P, since each complete step performs P work.
• # incomplete steps ≤ T∞, since each incomplete step reduces the span of the unexecuted dag by 1. ■

  31. Optimality of Greedy
Corollary. Any greedy scheduler achieves within a factor of 2 of optimal.
Proof. Let TP* be the execution time produced by the optimal scheduler. Since TP* ≥ max{T1/P, T∞} (lower bounds), we have
TP ≤ T1/P + T∞ ≤ 2·max{T1/P, T∞} ≤ 2TP*. ■

  32. Linear Speedup
Corollary. Any greedy scheduler achieves near-perfect linear speedup whenever P ≪ T1/T∞.
Proof. Since P ≪ T1/T∞ is equivalent to T∞ ≪ T1/P, the Greedy-Scheduling Theorem gives us
TP ≤ T1/P + T∞ ≈ T1/P.
Thus, the speedup is T1/TP ≈ P. ■
Definition. The quantity (T1/T∞)/P is called the parallel slackness.
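To make the slackness condition concrete, here is a hypothetical plug-in of numbers (the values are illustrative, not from the slides):

    % assume T_1 = 10^6, T_\infty = 10^3, P = 10, so slackness = (10^6/10^3)/10 = 100 \gg 1
    T_P \le T_1/P + T_\infty = 10^5 + 10^3 = 1.01\times 10^5 \approx T_1/P
    % speedup: T_1/T_P \ge 10^6/(1.01\times 10^5) \approx 9.9 \approx P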

  33. Lessons
Work and span can predict performance on large machines better than running times on small machines can.
Focus on improving parallelism (i.e., maximize T1/T∞). This will allow you to use larger processor counts effectively.

  34. Cilk Performance
• Cilk’s “work-stealing” scheduler achieves
  – TP = T1/P + O(T∞) expected time (provably);
  – TP ≈ T1/P + T∞ time (empirically).
• Near-perfect linear speedup if P ≪ T1/T∞.
• Instrumentation in Cilk allows the user to determine accurate measures of T1 and T∞.
• The average cost of a spawn in Cilk-5 is only 2–6 times the cost of an ordinary C function call, depending on the platform.

  35–42. Cilk’s Work-Stealing Scheduler
Each processor maintains a work deque of ready threads, and it manipulates the bottom of the deque like a stack: a spawn pushes a new thread onto the bottom, and a return pops one off. When a processor runs out of work, it steals a thread from the top of a random victim’s deque.
[Figure: an eight-frame animation (Spawn!, Spawn!, Return!, Return!, Steal!, Steal!, resume, Spawn!) of four processors pushing, popping, and stealing across their deques.]

  43. Performance of Work-Stealing
Theorem: Cilk’s work-stealing scheduler achieves an expected running time of
TP ≤ T1/P + O(T∞)
on P processors.
Pseudoproof. A processor is either working or stealing. The total time all processors spend working is T1. Each steal has a 1/P chance of reducing the span by 1. Thus, the expected cost of all steals is O(PT∞). Since there are P processors, the expected time is
(T1 + O(PT∞))/P = T1/P + O(T∞). ■

  44. Space Bounds
Theorem. Let S1 be the stack space required by a serial execution of a Cilk program. Then, the space required by a P-processor execution is at most SP ≤ P·S1.
Proof (by induction). The work-stealing algorithm maintains the busy-leaves property: every extant procedure frame with no extant descendants has a processor working on it. ■
[Figure: P = 3 processors, each holding a stack of at most S1 space.]

  45. Linguistic Implications
Code like the following executes properly without any risk of blowing out memory:
    for (i=1; i<1000000000; i++) {
        spawn foo(i);
    }
    sync;
MORAL: Better to steal parents than children!

  46. Summary
• Cilk is simple: cilk, spawn, sync
• Recursion, recursion, recursion, …
• Work & span, work & span, work & span, …

  47. Sudoku
• A game where you fill in a grid with numbers.
• A number cannot appear more than once in any column.
• A number cannot appear more than once in any row.
• A number cannot appear more than once in any “region.”
• Typically presented with a 9-by-9 grid, but for simplicity we’ll consider a 4-by-4 grid.
[Figure: a 4 × 4 Sudoku puzzle with 11 open positions; three solution steps are shown — a 1 placed because it is the only number missing in its column, a 3 placed because it is the only number missing in its row, and a 2 placed because 3 already appears in its region.]

  48. Sudoku Algorithm
• The two-dimensional Sudoku grid is flattened into a vector, with unsolved locations filled with zeros.
• The first two rows of the initial 4 × 4 puzzle: 3 0 0 4 0 0 0 2 …
• The current working location (loc = 0) is shown in red; for this 4 × 4 puzzle the subgrid size is 2 (size*size = 4 numbers, so the grid has 16 cells).
• Initially call spawn solve(size=2, grid, loc=0).
• The first location already has a solution, so move to the next location: recursively call spawn solve(size=2, grid, loc=loc+1).

  49. Exhaustive Search
• The next location (loc = 1) has no solution (a ‘0’ in the current cell), so create 4 new grids and try each of the 4 possibilities (1, 2, 3, 4) concurrently.
• Note: the search goes much faster if each guess is first tested for legality; here the guesses 3 and 4 are illegal, since 3 and 4 already appear in the same row.
• Spawn a new search tree for each guess k: call spawn solve(size=2, grids[k], loc=loc+1).
New grids (first two rows):
    3 1 0 4  0 0 0 2 …
    3 2 0 4  0 0 0 2 …
    3 3 0 4  0 0 0 2 …  (illegal)
    3 4 0 4  0 0 0 2 …  (illegal)
Source: Mattson and Keutzer, UCB CS294

  50. Cilk Sudoku solution (part 1 of 3)
    cilk int solve(int size, int* grid, int loc) {
        int i, k, solved, solution[MAX_NUM];
        int* grids[MAX_NUM];
        int numNumbers = size*size;
        int gridLen = numNumbers*numNumbers;
        if (loc == gridLen) {
            /* maximum depth; reached the end of the puzzle */
            return check_solution(size, grid);
        }
        /* if this node has a solution (given by the puzzle) at this location, */
        /* move to the next node location */
        if (grid[loc] != 0) {
            solved = spawn solve(size, grid, loc+1);
            sync;
            return solved;
        }
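Parts 2 and 3 of the solution are not in the transcript. Based purely on the description on slide 49, a hypothetical sketch of the missing branching step might look like the following (is_legal_guess and copy_grid are assumed helpers, not from the slides; frees omitted for brevity):

        /* Hypothetical continuation (not from the slides): branch on each guess. */
        for (k = 1; k <= numNumbers; k++) {
            if (is_legal_guess(size, grid, loc, k)) {   /* assumed helper: prune illegal guesses */
                grids[k-1] = copy_grid(grid, gridLen);  /* assumed helper: duplicate the grid    */
                grids[k-1][loc] = k;                    /* try guess k at this location          */
                solution[k-1] = spawn solve(size, grids[k-1], loc+1);
            } else {
                solution[k-1] = 0;
            }
        }
        sync;   /* wait for all speculative subsearches to finish */
        for (k = 0; k < numNumbers; k++)
            if (solution[k]) return 1;
        return 0;
    }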
