
Basic Cilk Programming



  1. Basic Cilk Programming

  2. Multithreaded Programming in Cilk, LECTURE 1. Adapted from Charles E. Leiserson, Supercomputing Technologies Research Group, Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology.

  3. Multi-core processors
[Figure: P processors, each with a cache ($), connected by a network to shared memory and I/O.] MIMD, shared memory.

  4. Cilk Overview
• Cilk extends the C language with just a handful of keywords: cilk, spawn, sync.
• Every Cilk program has a serial semantics.
• Not only is Cilk fast, it provides performance guarantees based on performance abstractions.
• Cilk is processor-oblivious.
• Cilk’s provably good runtime system automatically manages low-level aspects of parallel execution, including protocols, load balancing, and scheduling.
• Cilk supports speculative parallelism.

  5. Fibonacci
C elision:
    int fib (int n) {
        if (n<2) return (n);
        else {
            int x, y;
            x = fib(n-1);
            y = fib(n-2);
            return (x+y);
        }
    }
Cilk code:
    cilk int fib (int n) {
        if (n<2) return (n);
        else {
            int x, y;
            x = spawn fib(n-1);
            y = spawn fib(n-2);
            sync;
            return (x+y);
        }
    }
Cilk is a faithful extension of C. A Cilk program’s serial elision is always a legal implementation of Cilk semantics. Cilk provides no new data types.
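A side note on the serial elision (not on the slides): in MIT Cilk-5 the elision can be produced mechanically by defining the three keywords away, e.g.:

    /* Sketch: eliding the Cilk keywords yields the plain C program. */
    #define cilk          /* a Cilk procedure becomes an ordinary C function */
    #define spawn         /* a spawn becomes an ordinary function call       */
    #define sync          /* "sync;" becomes the empty statement ";"         */

With these definitions in effect, the Cilk fib above compiles exactly as the C elision.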

  6. Basic Cilk Keywords
    cilk int fib (int n) {
        if (n<2) return (n);
        else {
            int x, y;
            x = spawn fib(n-1);
            y = spawn fib(n-2);
            sync;
            return (x+y);
        }
    }
cilk: identifies a function as a Cilk procedure, capable of being spawned in parallel.
spawn: the named child Cilk procedure can execute in parallel with the parent caller.
sync: control cannot pass this point until all spawned children have returned.

  7. Dynamic Multithreading
    cilk int fib (int n) {
        if (n<2) return (n);
        else {
            int x, y;
            x = spawn fib(n-1);
            y = spawn fib(n-2);
            sync;
            return (x+y);
        }
    }
Example: fib(4). The computation dag unfolds dynamically; the program is “processor oblivious.”
[Figure: the dag for fib(4), with nodes labeled by argument values 4, 3, 2, 2, 1, 1, 1, 0, 0.]

  8. Multithreaded Computation
• The dag G = (V, E) represents a parallel instruction stream.
• Each vertex v ∈ V represents a (Cilk) thread: a maximal sequence of instructions not containing parallel control (spawn, sync, return).
• Every edge e ∈ E is either a spawn edge, a return edge, or a continue edge.
[Figure: a dag from initial thread to final thread, with spawn, return, and continue edges.]

  9. Algorithmic Complexity Measures TP = execution time on P processors

  10. Algorithmic Complexity Measures
TP = execution time on P processors
T1 = work

  11. Algorithmic Complexity Measures
TP = execution time on P processors
T1 = work
T∞ = span*
* Also called critical-path length or computational depth.

  12. Algorithmic Complexity Measures
TP = execution time on P processors
T1 = work
T∞ = span*
LOWER BOUNDS
• TP ≥ T1/P
• TP ≥ T∞
* Also called critical-path length or computational depth.

  13. Speedup
Definition: T1/TP = speedup on P processors.
If T1/TP = Θ(P) ≤ P, we have linear speedup;
if T1/TP = P, we have perfect linear speedup;
if T1/TP > P, we have superlinear speedup, which is not possible in our model because of the lower bound TP ≥ T1/P.

  14. Parallelism
Because we have the lower bound TP ≥ T∞, the maximum possible speedup given T1 and T∞ is
T1/T∞ = parallelism = the average amount of work per step along the span.

  15. Example: fib(4)
Assume for simplicity that each Cilk thread in fib() takes unit time to execute.
Work: T1 = 17
Span: T∞ = 8
[Figure: the fib(4) dag; the critical path consists of 8 threads.]

  16. Example: fib(4)
Assume for simplicity that each Cilk thread in fib() takes unit time to execute.
Work: T1 = 17
Span: T∞ = 8
Parallelism: T1/T∞ = 2.125
Using many more than 2 processors makes little sense.

  17. Parallelizing Vector Addition
C:
    void vadd (real *A, real *B, int n) {
        int i;
        for (i=0; i<n; i++) A[i] += B[i];
    }

  18. Parallelizing Vector Addition
C:
    void vadd (real *A, real *B, int n) {
        int i;
        for (i=0; i<n; i++) A[i] += B[i];
    }
C, divide and conquer:
    void vadd (real *A, real *B, int n) {
        if (n <= BASE) {
            int i;
            for (i=0; i<n; i++) A[i] += B[i];
        } else {
            vadd (A, B, n/2);
            vadd (A+n/2, B+n/2, n-n/2);
        }
    }
Parallelization strategy:
1. Convert loops to recursion.

  19. Parallelizing Vector Addition
C:
    void vadd (real *A, real *B, int n) {
        int i;
        for (i=0; i<n; i++) A[i] += B[i];
    }
Cilk:
    cilk void vadd (real *A, real *B, int n) {
        if (n <= BASE) {
            int i;
            for (i=0; i<n; i++) A[i] += B[i];
        } else {
            spawn vadd (A, B, n/2);
            spawn vadd (A+n/2, B+n/2, n-n/2);
            sync;
        }
    }
Parallelization strategy:
1. Convert loops to recursion.
2. Insert Cilk keywords.
Side benefit: divide and conquer is generally good for caches!

  20. Vector Addition
    cilk void vadd (real *A, real *B, int n) {
        if (n <= BASE) {
            int i;
            for (i=0; i<n; i++) A[i] += B[i];
        } else {
            spawn vadd (A, B, n/2);
            spawn vadd (A+n/2, B+n/2, n-n/2);
            sync;
        }
    }
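For completeness, a minimal driver sketch (not on the slides; the `real` typedef, the array size, and the BASE value are assumptions — in Cilk-5, main is itself a Cilk procedure, and Cilk procedures must be invoked with spawn):

    typedef double real;      /* assumption: the slides never define real */
    #define BASE 256          /* assumption: an arbitrary base-case size  */

    cilk int main (void) {
        real A[1024], B[1024];
        int i;
        for (i = 0; i < 1024; i++) { A[i] = i; B[i] = 2*i; }
        spawn vadd (A, B, 1024);   /* invoke the Cilk procedure */
        sync;                      /* wait for it to complete   */
        return 0;
    }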

  21. Vector Addition Analysis
To add two vectors of length n, where BASE = Θ(1):
Work: T1 = Θ(n)
Span: T∞ = Θ(lg n)
Parallelism: T1/T∞ = Θ(n/lg n)
[Figure: the recursion tree, with BASE-sized leaves.]
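The Θ-bounds on the slide follow from the standard divide-and-conquer recurrences; a short derivation (not on the slides):

    T_1(n) = 2\,T_1(n/2) + \Theta(1) = \Theta(n)             % work: both halves execute
    T_\infty(n) = T_\infty(n/2) + \Theta(1) = \Theta(\lg n)  % span: the halves run in parallel, so only one counts
    T_1(n)/T_\infty(n) = \Theta(n/\lg n)                     % parallelism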

  22. Another Parallelization
C:
    void vadd1 (real *A, real *B, int n) {
        int i;
        for (i=0; i<n; i++) A[i] += B[i];
    }
    void vadd (real *A, real *B, int n) {
        int j;
        for (j=0; j<n; j+=BASE) {
            vadd1(A+j, B+j, min(BASE, n-j));
        }
    }
Cilk:
    cilk void vadd1 (real *A, real *B, int n) {
        int i;
        for (i=0; i<n; i++) A[i] += B[i];
    }
    cilk void vadd (real *A, real *B, int n) {
        int j;
        for (j=0; j<n; j+=BASE) {
            spawn vadd1(A+j, B+j, min(BASE, n-j));
        }
        sync;
    }

  23. Analysis
To add two vectors of length n, where BASE = Θ(1):
Work: T1 = Θ(n)
Span: T∞ = Θ(n), since the spawning loop itself is serial
Parallelism: T1/T∞ = Θ(1). PUNY!
[Figure: the vector divided into BASE-sized chunks, each spawned in turn.]

  24. Optimal Choice of BASE
To add two vectors of length n using an optimal choice of BASE to maximize parallelism:
Work: T1 = Θ(n)
Span: T∞ = Θ(BASE + n/BASE)
Choosing BASE = Θ(√n) ⇒ T∞ = Θ(√n)
Parallelism: T1/T∞ = Θ(√n)
[Figure: the vector divided into BASE-sized chunks.]
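Why √n is the right choice (a one-line derivation, not on the slides): the two terms of the span trade off against each other, and by the AM–GM inequality their sum is minimized when they are equal:

    \mathrm{BASE} + \frac{n}{\mathrm{BASE}} \;\ge\; 2\sqrt{\mathrm{BASE}\cdot\frac{n}{\mathrm{BASE}}} \;=\; 2\sqrt{n},
    \quad\text{with equality iff } \mathrm{BASE} = \sqrt{n}.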

  25. Weird! Don’t we want to remove recursion?
Parallel Programming = Sequential Program + Decomposition + Mapping + Communication and Synchronization

  26. Scheduling
• Cilk allows the programmer to express potential parallelism in an application.
• The Cilk scheduler maps Cilk threads onto processors dynamically at runtime.
• Since on-line schedulers are complicated, we’ll illustrate the ideas with an off-line scheduler.
[Figure: P processors with caches ($) connected by a network to shared memory and I/O.]

  27. Greedy Scheduling
IDEA: Do as much as possible on every step.
Definition: A thread is ready if all its predecessors have executed.

  28. Greedy Scheduling
IDEA: Do as much as possible on every step.
Definition: A thread is ready if all its predecessors have executed.
Complete step (P = 3):
• ≥ P threads ready.
• Run any P.

  29. Greedy Scheduling
IDEA: Do as much as possible on every step.
Definition: A thread is ready if all its predecessors have executed.
Complete step (P = 3):
• ≥ P threads ready.
• Run any P.
Incomplete step:
• < P threads ready.
• Run all of them.

  30. Greedy-Scheduling Theorem
Theorem [Graham ’68 & Brent ’75]. Any greedy scheduler achieves TP ≤ T1/P + T∞.
Proof.
• # complete steps ≤ T1/P, since each complete step performs P work.
• # incomplete steps ≤ T∞, since each incomplete step reduces the span of the unexecuted dag by 1. ■

  31. Optimality of Greedy
Corollary. Any greedy scheduler achieves within a factor of 2 of optimal.
Proof. Let TP* be the execution time produced by the optimal scheduler. Since TP* ≥ max{T1/P, T∞} (lower bounds), we have
TP ≤ T1/P + T∞ ≤ 2·max{T1/P, T∞} ≤ 2TP*. ■

  32. Linear Speedup
Corollary. Any greedy scheduler achieves near-perfect linear speedup whenever P ≪ T1/T∞.
Proof. Since P ≪ T1/T∞ is equivalent to T∞ ≪ T1/P, the Greedy-Scheduling Theorem gives us
TP ≤ T1/P + T∞ ≈ T1/P.
Thus, the speedup is T1/TP ≈ P. ■
Definition. The quantity (T1/T∞)/P is called the parallel slackness.
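To make the slackness condition concrete, here is a hypothetical plug-in of numbers (the values are illustrative, not from the slides):

    % assume T_1 = 10^6, T_\infty = 10^3, P = 10, so slackness = (10^6/10^3)/10 = 100 \gg 1
    T_P \le T_1/P + T_\infty = 10^5 + 10^3 = 1.01\times 10^5 \approx T_1/P
    % speedup: T_1/T_P \ge 10^6/(1.01\times 10^5) \approx 9.9 \approx P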

  33. Lessons
Work and span can predict performance on large machines better than running times on small machines can.
Focus on improving parallelism (i.e., maximize T1/T∞). This will allow you to use larger processor counts effectively.

  34. Cilk Performance
• Cilk’s “work-stealing” scheduler achieves
  – TP = T1/P + O(T∞) expected time (provably);
  – TP ≈ T1/P + T∞ time (empirically).
• Near-perfect linear speedup if P ≪ T1/T∞.
• Instrumentation in Cilk allows the user to determine accurate measures of T1 and T∞.
• The average cost of a spawn in Cilk-5 is only 2–6 times the cost of an ordinary C function call, depending on the platform.

  35–42. Cilk’s Work-Stealing Scheduler
Each processor maintains a work deque of ready threads, and it manipulates the bottom of the deque like a stack: a spawn pushes a new thread onto the bottom, and a return pops one off. When a processor runs out of work, it steals a thread from the top of a random victim’s deque.
[Figure: an eight-frame animation (Spawn!, Spawn!, Return!, Return!, Steal!, Steal!, resume, Spawn!) of four processors pushing, popping, and stealing across their deques.]

  43. Performance of Work-Stealing
Theorem: Cilk’s work-stealing scheduler achieves an expected running time of
TP ≤ T1/P + O(T∞)
on P processors.
Pseudoproof. A processor is either working or stealing. The total time all processors spend working is T1. Each steal has a 1/P chance of reducing the span by 1. Thus, the expected cost of all steals is O(PT∞). Since there are P processors, the expected time is
(T1 + O(PT∞))/P = T1/P + O(T∞). ■

  44. Space Bounds
Theorem. Let S1 be the stack space required by a serial execution of a Cilk program. Then, the space required by a P-processor execution is at most SP ≤ P·S1.
Proof (by induction). The work-stealing algorithm maintains the busy-leaves property: every extant procedure frame with no extant descendants has a processor working on it. ■
[Figure: P = 3 processors, each holding a stack of at most S1 space.]

  45. Linguistic Implications
Code like the following executes properly without any risk of blowing out memory:
    for (i=1; i<1000000000; i++) {
        spawn foo(i);
    }
    sync;
MORAL: Better to steal parents than children!

  46. Summary
• Cilk is simple: cilk, spawn, sync
• Recursion, recursion, recursion, …
• Work & span, work & span, work & span, …

  47. Sudoku
• A game where you fill in a grid with numbers.
• A number cannot appear more than once in any column.
• A number cannot appear more than once in any row.
• A number cannot appear more than once in any “region.”
• Typically presented with a 9-by-9 grid, but for simplicity we’ll consider a 4-by-4 grid.
[Figure: a 4 × 4 Sudoku puzzle with 11 open positions; three solution steps are shown — a 1 placed because it is the only number missing in its column, a 3 placed because it is the only number missing in its row, and a 2 placed because 3 already appears in its region.]

  48. Sudoku Algorithm
• The two-dimensional Sudoku grid is flattened into a vector, with unsolved locations filled with zeros.
• The first two rows of the initial 4 × 4 puzzle: 3 0 0 4 0 0 0 2 …
• The current working location (loc = 0) is shown in red; for this 4 × 4 puzzle the subgrid size is 2 (size*size = 4 numbers, so the grid has 16 cells).
• Initially call spawn solve(size=2, grid, loc=0).
• The first location already has a solution, so move to the next location: recursively call spawn solve(size=2, grid, loc=loc+1).

  49. Exhaustive Search
• The next location (loc = 1) has no solution (a ‘0’ in the current cell), so create 4 new grids and try each of the 4 possibilities (1, 2, 3, 4) concurrently.
• Note: the search goes much faster if each guess is first tested for legality; here the guesses 3 and 4 are illegal, since 3 and 4 already appear in the same row.
• Spawn a new search tree for each guess k: call spawn solve(size=2, grids[k], loc=loc+1).
New grids (first two rows):
    3 1 0 4  0 0 0 2 …
    3 2 0 4  0 0 0 2 …
    3 3 0 4  0 0 0 2 …  (illegal)
    3 4 0 4  0 0 0 2 …  (illegal)
Source: Mattson and Keutzer, UCB CS294

  50. Cilk Sudoku solution (part 1 of 3)
    cilk int solve(int size, int* grid, int loc) {
        int i, k, solved, solution[MAX_NUM];
        int* grids[MAX_NUM];
        int numNumbers = size*size;
        int gridLen = numNumbers*numNumbers;
        if (loc == gridLen) {
            /* maximum depth; reached the end of the puzzle */
            return check_solution(size, grid);
        }
        /* if this node has a solution (given by the puzzle) at this location, */
        /* move to the next node location */
        if (grid[loc] != 0) {
            solved = spawn solve(size, grid, loc+1);
            sync;
            return solved;
        }
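Parts 2 and 3 of the solution are not in the transcript. Based purely on the description on slide 49, a hypothetical sketch of the missing branching step might look like the following (is_legal_guess and copy_grid are assumed helpers, not from the slides; frees omitted for brevity):

        /* Hypothetical continuation (not from the slides): branch on each guess. */
        for (k = 1; k <= numNumbers; k++) {
            if (is_legal_guess(size, grid, loc, k)) {   /* assumed helper: prune illegal guesses */
                grids[k-1] = copy_grid(grid, gridLen);  /* assumed helper: duplicate the grid    */
                grids[k-1][loc] = k;                    /* try guess k at this location          */
                solution[k-1] = spawn solve(size, grids[k-1], loc+1);
            } else {
                solution[k-1] = 0;
            }
        }
        sync;   /* wait for all speculative subsearches to finish */
        for (k = 0; k < numNumbers; k++)
            if (solution[k]) return 1;
        return 0;
    }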
