Cilk-5

Presentation Transcript


  1. Cilk-5: The Implementation of the Cilk-5 Multithreaded Language. By: Matteo Frigo, Charles E. Leiserson, Keith H. Randall, MIT, 1998. Presented by: Roni Licher, Technion, 2013. Based on Charles E. Leiserson (2006): http://supertech.csail.mit.edu/cilk/lecture-1.ppt

  2. Cilk. A C language for programming dynamic multithreaded applications on shared-memory multiprocessors. Example applications:
  • virus shell assembly
  • graphics rendering
  • n-body simulation
  • heuristic search
  • dense and sparse matrix computations
  • friction-stir welding simulation
  • artificial evolution

  3. Shared-Memory Multiprocessor. [Diagram: processors P, each with a cache $, connected through a network to shared memory and I/O.] In particular, over the next decade, chip multiprocessors (CMPs) will be an increasingly important platform!

  4. Cilk Is Simple
  • Cilk extends the C language with just a handful of keywords.
  • Every Cilk program has a serial semantics.
  • Not only is Cilk fast, it provides performance guarantees based on performance abstractions.
  • Cilk is processor-oblivious.
  • Cilk's provably good runtime system automatically manages low-level aspects of parallel execution, including protocols, load balancing, and scheduling.
  • Cilk supports speculative parallelism.

  5. Fibonacci
  C elision:
      int fib (int n) {
          if (n < 2) return (n);
          else {
              int x, y;
              x = fib(n-1);
              y = fib(n-2);
              return (x+y);
          }
      }
  Cilk code:
      cilk int fib (int n) {
          if (n < 2) return (n);
          else {
              int x, y;
              x = spawn fib(n-1);
              y = spawn fib(n-2);
              sync;
              return (x+y);
          }
      }
  Cilk is a faithful extension of C. A Cilk program's serial elision is always a legal implementation of Cilk semantics. Cilk provides no new data types.

  6. Basic Cilk Keywords
      cilk int fib (int n) {
          if (n < 2) return (n);
          else {
              int x, y;
              x = spawn fib(n-1);
              y = spawn fib(n-2);
              sync;
              return (x+y);
          }
      }
  • cilk: identifies a function as a Cilk procedure, capable of being spawned in parallel.
  • spawn: the named child Cilk procedure can execute in parallel with the parent caller.
  • sync: control cannot pass this point until all spawned children have returned.

  7. Dynamic Multithreading. Example: fib(4), with the fib code from the previous slide. [Diagram: the spawn tree of fib(4) unfolding into instances 4, 3, 2, 2, 1, 1, 1, 0, 0.] The computation dag unfolds dynamically; the program is "processor oblivious".

  8. Multithreaded Computation
  [Diagram: a dag with an initial thread, a final thread, and continue, return, and spawn edges.]
  • The dag G = (V, E) represents a parallel instruction stream.
  • Each vertex v ∈ V represents a (Cilk) thread: a maximal sequence of instructions not containing parallel control (spawn, sync, return).
  • Every edge e ∈ E is either a spawn edge, a return edge, or a continue edge.

  9. Cactus Stack. [Diagram: procedures A-E; each spawned child's view of the stack extends its ancestors' views, e.g. B sees A-B, D sees A-C-D.] Cilk supports C's rule for pointers: a pointer to stack space can be passed from parent to child, but not from child to parent. (Cilk also supports malloc.) Cilk's cactus stack supports several views in parallel.

  10. Algorithmic Complexity Measures. TP = execution time on P processors.

  11. Algorithmic Complexity Measures. TP = execution time on P processors; T1 = work.

  12. Algorithmic Complexity Measures. TP = execution time on P processors; T1 = work; T∞ = span.* (*Also called critical-path length or computational depth.)

  13. Algorithmic Complexity Measures. TP = execution time on P processors; T1 = work; T∞ = span.* Lower bounds: TP ≥ T1/P and TP ≥ T∞. (*Also called critical-path length or computational depth.)

  14. Speedup. Definition: T1/TP = speedup on P processors.
  If T1/TP = Θ(P) ≤ P, we have linear speedup;
  if T1/TP = P, we have perfect linear speedup;
  if T1/TP > P, we have superlinear speedup, which is not possible in our model because of the lower bound TP ≥ T1/P.

  15. Parallelism. Because we have the lower bound TP ≥ T∞, the maximum possible speedup given T1 and T∞ is T1/T∞ = parallelism = the average amount of work per step along the span.

  16. Example: fib(4). [Diagram: the fib(4) dag with the threads of the critical path numbered 1-8.] Assume for simplicity that each Cilk thread in fib() takes unit time to execute. Work: T1 = 17. Span: T∞ = 8.

  17. Example: fib(4). Assume for simplicity that each Cilk thread in fib() takes unit time to execute. Work: T1 = 17. Span: T∞ = 8. Parallelism: T1/T∞ = 17/8 = 2.125. Using many more than 2 processors makes little sense.

  18. Cilk's Work-Stealing Scheduler. Each processor maintains a work deque of ready threads, and it manipulates the bottom of the deque like a stack. [Diagram: four processors P, each with its own deque.] Spawn!

  19. Cilk's Work-Stealing Scheduler (continued). Spawn! Spawn!

  20. Cilk's Work-Stealing Scheduler (continued). Return!

  21. Cilk's Work-Stealing Scheduler (continued). Return!

  22. Cilk's Work-Stealing Scheduler. When a processor runs out of work, it steals a thread from the top of a random victim's deque. Steal!

  23. Cilk's Work-Stealing Scheduler (continued). Steal!

  24. Cilk's Work-Stealing Scheduler (continued).

  25. Cilk's Work-Stealing Scheduler (continued). Spawn!

  26. The T.H.E. Protocol
  • Deques are held in shared memory. Workers operate at the bottom (tail) of the deque, thieves at the top (head).
  • We must prevent race conditions in which a thief and its victim try to take the same procedure frame.
  • Locking the deque on every operation would be expensive for workers.
  • The T.H.E. protocol removes the locking overhead from the common case, where there is no conflict.

  27. The T.H.E. Protocol
  • Assumes only that reads and writes are atomic.
  • The head of the deque is at index H, the tail at index T, with T ≥ H.
  • Only the thief changes H; only the worker changes T.
  • To steal, a thief must acquire the lock L.
  • At most two processors operate on a deque at any time.
  • Three cases of interaction:
  • Two or more items on the deque: worker and thief each get one.
  • One item on the deque: either the worker or the thief gets the frame, but not both.
  • No items on the deque: both the worker and the thief fail.

  28. T.H.E. Protocol: The Worker/Victim
      push() {
          T++;
      }

      pop() {
          T--;
          if (H > T) {
              T++;
              lock(L);
              T--;
              if (H > T) {
                  T++;
                  unlock(L);
                  return FAILURE;
              }
              unlock(L);
          }
          return SUCCESS;
      }

      steal() {
          lock(L);
          H++;
          if (H > T) {
              H--;
              unlock(L);
              return FAILURE;
          }
          unlock(L);
          return SUCCESS;
      }

  29. Performance of Work-Stealing
  Theorem: Cilk's work-stealing scheduler achieves an expected running time of TP ≤ T1/P + O(T∞) on P processors.
  Pseudoproof: A processor is either working or stealing. The total time all processors spend working is T1. The expected cost of all steals is O(P·T∞). Since there are P processors, the expected time is (T1 + O(P·T∞))/P = T1/P + O(T∞). ■

  30. The Work-First Principle
  Definitions: TS = the running time of the C elision; c1 = T1/TS = work overhead; c∞ = critical-path overhead.
  Assumption: P ≪ T1/T∞ (ample parallelism).

  31. The Work-First Principle
  Definitions: TS = the running time of the C elision; c1 = T1/TS = work overhead; c∞ = critical-path overhead.
  Assumption: P ≪ T1/T∞.
  Work-first justification: since P ≪ T1/T∞ is equivalent to T∞ ≪ T1/P,
  (1) TP ≤ T1/P + O(T∞),
  (2) so TP ≈ T1/P + c∞·T∞ ≈ T1/P,
  (3) and with T1 = c1·TS, TP ≈ c1·TS/P.
  Hence: minimize the work overhead c1, even at the expense of a larger critical-path overhead c∞.

  32. Cilk Chess Programs
  • ★Socrates placed 3rd in the 1994 International Computer Chess Championship running on NCSA's 512-node Connection Machine CM5.
  • ★Socrates 2.0 took 2nd place in the 1995 World Computer Chess Championship running on Sandia National Labs' 1824-node Intel Paragon.
  • Cilkchess placed 1st in the 1996 Dutch Open running on a 12-processor Sun Enterprise 5000. It placed 2nd in 1997 and 1998 running on Boston University's 64-processor SGI Origin 2000.
  • Cilkchess tied for 3rd in the 1999 WCCC running on NASA's 256-node SGI Origin 2000.

  33. ★Socrates Normalized Speedup. [Graph: measured speedup T1/TP against normalized machine size, plotted with the perfect-speedup bound TP = T1/P, the span bound TP = T∞, and the model TP = T1/P + T∞.]

  34. Developing Socrates • For the competition, Socrates was to run on a 512-processor Connection Machine Model CM5 supercomputer at the University of Illinois. • The developers had easy access to a similar 32-processor CM5 at MIT. • One of the developers proposed a change to the program that produced a speedup of over 20% on the MIT machine. • After a back-of-the-envelope calculation, the proposed “improvement” was rejected!

  35. ★Socrates Speedup Paradox (using the model TP ≈ T1/P + T∞)
  Original program: T1 = 2048 seconds, T∞ = 1 second. T32 = 2048/32 + 1 = 65 seconds; T512 = 2048/512 + 1 = 5 seconds.
  Proposed program: T1 = 1024 seconds, T∞ = 8 seconds. T32 = 1024/32 + 8 = 40 seconds; T512 = 1024/512 + 8 = 10 seconds.
  The proposed version is faster on the 32-processor machine but twice as slow on the 512-processor competition machine.

  36. Cilk Performance
  • Cilk's work-stealing scheduler achieves TP = T1/P + O(T∞) expected time (provably) and TP ≈ T1/P + T∞ time (empirically).
  • Near-perfect linear speedup if P ≪ T1/T∞.
  • The average cost of a spawn in Cilk-5 is only 2-6 times the cost of an ordinary C function call, depending on the platform.
  • Empirical results show an average speedup of 6.2 on an 8-processor machine.

  37. Cilk: A Quick History
  • 1994 - Cilk 1
  • 1998 - Cilk 5
  • 2005 - JCilk
  • 2006 - Cilk Arts
  • 2008 - Cilk++
  • 2009 - Intel buys Cilk Arts
  • 2010 - Intel releases a commercial implementation in its compilers

  38. Questions?
