
Load Balancing and Multithreaded Programming


Presentation Transcript


  1. Load Balancing and Multithreaded Programming Nir Shavit Multiprocessor Synchronization Spring 2003

  2. How to write Parallel Apps? • Multithreaded Programming • Programming model • Programming language (Cilk) • Well-developed theory • Successful practice M. Herlihy & N. Shavit (c) 2003

  3. Why We Care • Interesting in its own right • The scheduler is an ideal application for lock-free data structures M. Herlihy & N. Shavit (c) 2003

  4. Multithreaded Fibonacci int fib(int n) { if (n < 2) { return n; } else { int x = spawn fib(n-1); int y = spawn fib(n-2); sync(); return x + y; }} *Cilk Code (Java Code in Notes) M. Herlihy & N. Shavit (c) 2003

  5. Multithreaded Fibonacci int fib(int n) { if (n < 2) { return n; } else { int x = spawn fib(n-1); int y = spawn fib(n-2); sync(); return x + y; }} Parallel method call M. Herlihy & N. Shavit (c) 2003

  6. Multithreaded Fibonacci int fib(int n) { if (n < 2) { return n; } else { int x = spawn fib(n-1); int y = spawn fib(n-2); sync(); return x + y; }} Wait for children to complete M. Herlihy & N. Shavit (c) 2003

  7. Multithreaded Fibonacci int fib(int n) { if (n < 2) { return n; } else { int x = spawn fib(n-1); int y = spawn fib(n-2); sync(); return x + y; }} Safe to use children’s values M. Herlihy & N. Shavit (c) 2003
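The slide notes that a Java version exists in the course notes; that code is not reproduced here, so the following is only a rough sketch of the same spawn/sync pattern using Java's standard ForkJoin framework (the Fib class name and the pool setup are illustrative choices, not from the slides):

    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.RecursiveTask;

    // Sketch only: fork() plays the role of spawn, join() the role of sync.
    class Fib extends RecursiveTask<Integer> {
        final int n;
        Fib(int n) { this.n = n; }

        @Override
        protected Integer compute() {
            if (n < 2) return n;
            Fib x = new Fib(n - 1);
            Fib y = new Fib(n - 2);
            x.fork();                  // "spawn": x may run in parallel
            int ySum = y.compute();    // run the second child in the current thread
            int xSum = x.join();       // "sync": wait for the forked child
            return xSum + ySum;        // safe to use the children's values
        }
    }

    // Usage: int result = new ForkJoinPool().invoke(new Fib(10));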

  8. Note • Spawn & synch operators • Like Israeli traffic signs • Are purely advisory in nature • The scheduler • Like the Israeli driver • Has complete freedom to decide M. Herlihy & N. Shavit (c) 2003

  9. Dynamic Behavior • Multithreaded program is • A directed acyclic graph (DAG) • That unfolds dynamically • A thread is • Maximal sequence of instructions • Without spawn, sync, or return M. Herlihy & N. Shavit (c) 2003

  10. Fib DAG [Figure: the computation DAG that fib(4) unfolds into: fib(4) spawns fib(3) and fib(2), which in turn spawn fib(2) and fib(1) calls, connected by spawn and sync edges] M. Herlihy & N. Shavit (c) 2003

  11. Arrows Reflect Dependencies [Figure: the same fib(4) DAG, with arrows showing the dependencies between spawned calls and their syncs] M. Herlihy & N. Shavit (c) 2003

  12. How Parallel is That? • Define work: • Total time on one processor • Define critical-path length: • Longest dependency path • Can’t beat that! M. Herlihy & N. Shavit (c) 2003

  13. Fib Work [Figure: the fib(4) DAG again, used to count the total work] M. Herlihy & N. Shavit (c) 2003

  14. Fib Work [Figure: the nodes of the fib(4) DAG numbered 1 through 17; the work is 17] M. Herlihy & N. Shavit (c) 2003

  15. Fib Critical Path [Figure: the fib(4) DAG, highlighting its longest dependency path] M. Herlihy & N. Shavit (c) 2003

  16. Fib Critical Path [Figure: the longest path through the fib(4) DAG, nodes numbered 1 through 8; the critical path length is 8] M. Herlihy & N. Shavit (c) 2003

  17. Notation Watch • TP = time on P processors • T1 = work (time on 1 processor) • T∞ = critical path length (time on ∞ processors) M. Herlihy & N. Shavit (c) 2003

  18. Simple Bounds • TP ≥ T1/P • In one step, can’t do more than P work • TP ≥ T∞ • Can’t beat infinite resources M. Herlihy & N. Shavit (c) 2003

  19. More Notation Watch • Speedup on P processors • Ratio T1/TP • How much faster with P processors • Linear speedup • T1/TP = Θ(P) • Max speedup (average parallelism) • T1/T∞ M. Herlihy & N. Shavit (c) 2003
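As a concrete check of these definitions, take the fib(4) computation above: its work is T1 = 17 and its critical-path length is T∞ = 8, so no matter how many processors are available the speedup is at most T1/T∞ = 17/8 ≈ 2.1; that ratio is the average parallelism.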

  20. Remarks • Graph nodes have out-degree ≤ 2 • Unique starting node • Unique ending node M. Herlihy & N. Shavit (c) 2003

  21. Matrix Multiplication M. Herlihy & N. Shavit (c) 2003

  22. Matrix Multiplication • Each n-by-n matrix multiplication • 8 multiplications • 4 additions • Of n/2-by-n/2 submatrices M. Herlihy & N. Shavit (c) 2003
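Spelled out (standard block arithmetic, not shown on the slide), the 2x2 block form of C = A·B is C11 = A11·B11 + A12·B21, C12 = A11·B12 + A12·B22, C21 = A21·B11 + A22·B21, and C22 = A21·B12 + A22·B22. That is eight half-size multiplications; the two products destined for each quadrant are then combined by four half-size additions.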

  23. Addition int add(Matrix C, Matrix T, int n) { if (n == 1) { C[1,1] = C[1,1] + T[1,1]; } else { partition C, T into half-size submatrices; spawn add(C11,T11,n/2); spawn add(C12,T12,n/2); spawn add(C21,T21,n/2); spawn add(C22,T22,n/2); sync(); }} M. Herlihy & N. Shavit (c) 2003
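The pseudocode above is Cilk-style; purely as an illustration, a Java ForkJoin version of the same recursive addition might look like the sketch below (the Add class, its offset-based partitioning, and the power-of-two size requirement are choices made for this sketch, not from the slides):

    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.RecursiveAction;

    // Sketch only: C and T are square double[n][n] with n a power of two;
    // (row, col) marks the top-left corner of the current submatrix and
    // stands in for "partition C, T into half-size submatrices".
    class Add extends RecursiveAction {
        final double[][] C, T;
        final int row, col, n;

        Add(double[][] C, double[][] T, int row, int col, int n) {
            this.C = C; this.T = T; this.row = row; this.col = col; this.n = n;
        }

        @Override
        protected void compute() {
            if (n == 1) {                        // base case: a single element
                C[row][col] += T[row][col];
                return;
            }
            int h = n / 2;                       // recurse on the four quadrants
            invokeAll(new Add(C, T, row,     col,     h),
                      new Add(C, T, row,     col + h, h),
                      new Add(C, T, row + h, col,     h),
                      new Add(C, T, row + h, col + h, h));
            // invokeAll returns only after all four children finish: the "sync()".
        }
    }

    // Usage: new ForkJoinPool().invoke(new Add(C, T, 0, 0, C.length));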

  24. Addition • Let AP(n) be running time • For n x n matrix • on P processors • For example • A1(n) is work • A∞(n) is critical path length M. Herlihy & N. Shavit (c) 2003

  25. Addition • Work is A1(n) = 4 A1(n/2) + Θ(1): the 4 A1(n/2) term counts the four spawned additions, and the Θ(1) term covers partitioning, sync, etc. M. Herlihy & N. Shavit (c) 2003

  26. Addition • Work is A1(n) = 4 A1(n/2) + Θ(1) = Θ(n²), the same as the double-loop serial summation M. Herlihy & N. Shavit (c) 2003

  27. Addition • Critical path length is A∞(n) = A∞(n/2) + Θ(1): the four spawned additions run in parallel, so only one half-size addition lies on the path, plus Θ(1) for partitioning, sync, etc. M. Herlihy & N. Shavit (c) 2003

  28. Addition • Critical path length is A∞(n) = A∞(n/2) + Θ(1) = Θ(log n), since n is halved Θ(log n) times M. Herlihy & N. Shavit (c) 2003

  29. Multiplication int mult(Matrix C, Matrix A, Matrix B, int n) { if (n == 1) { C[1,1] = A[1,1]·B[1,1]; } else { allocate temporary n·n matrix T; partition A,B,C,T into half-size submatrices; … M. Herlihy & N. Shavit (c) 2003

  30. Multiplication (con’t) spawn mult(C11,A11,B11,n/2); spawn mult(C12,A11,B12,n/2); spawn mult(C21,A21,B11,n/2); spawn mult(C22,A21,B12,n/2); spawn mult(T11,A12,B21,n/2); spawn mult(T12,A12,B22,n/2); spawn mult(T21,A22,B21,n/2); spawn mult(T22,A22,B22,n/2); sync(); spawn add(C,T,n); }} M. Herlihy & N. Shavit (c) 2003
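Again only as a sketch, here is the multiplication in the same Java ForkJoin style (same assumptions as the Add sketch above; the index offsets stand in for the explicit submatrix partition, and for brevity the final C = C + T is a plain loop instead of the parallel add the slide spawns):

    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.RecursiveAction;

    // Sketch only: square matrices, n a power of two, offsets instead of submatrix objects.
    class Multiply extends RecursiveAction {
        final double[][] C, A, B;
        final int cr, cc, ar, ac, br, bc, n;     // top-left corners and size

        Multiply(double[][] C, int cr, int cc,
                 double[][] A, int ar, int ac,
                 double[][] B, int br, int bc, int n) {
            this.C = C; this.cr = cr; this.cc = cc;
            this.A = A; this.ar = ar; this.ac = ac;
            this.B = B; this.br = br; this.bc = bc;
            this.n = n;
        }

        @Override
        protected void compute() {
            if (n == 1) {                            // base case: scalar product
                C[cr][cc] = A[ar][ac] * B[br][bc];
                return;
            }
            int h = n / 2;
            double[][] T = new double[n][n];         // temporary for the second products
            invokeAll(
                // first products go straight into the quadrants of C
                new Multiply(C, cr,     cc,     A, ar,     ac,     B, br,     bc,     h),  // C11 = A11*B11
                new Multiply(C, cr,     cc + h, A, ar,     ac,     B, br,     bc + h, h),  // C12 = A11*B12
                new Multiply(C, cr + h, cc,     A, ar + h, ac,     B, br,     bc,     h),  // C21 = A21*B11
                new Multiply(C, cr + h, cc + h, A, ar + h, ac,     B, br,     bc + h, h),  // C22 = A21*B12
                // second products go into the quadrants of the temporary T
                new Multiply(T, 0, 0, A, ar,     ac + h, B, br + h, bc,     h),            // T11 = A12*B21
                new Multiply(T, 0, h, A, ar,     ac + h, B, br + h, bc + h, h),            // T12 = A12*B22
                new Multiply(T, h, 0, A, ar + h, ac + h, B, br + h, bc,     h),            // T21 = A22*B21
                new Multiply(T, h, h, A, ar + h, ac + h, B, br + h, bc + h, h));           // T22 = A22*B22
            // invokeAll returns only once all eight products are done: the "sync()".
            for (int i = 0; i < n; i++)              // C = C + T (the slide spawns a parallel add here)
                for (int j = 0; j < n; j++)
                    C[cr + i][cc + j] += T[i][j];
        }
    }

    // Usage: new ForkJoinPool().invoke(new Multiply(C, 0, 0, A, 0, 0, B, 0, 0, C.length));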

  31. Multiplication • Work is M1(n) = 8 M1(n/2) + A1(n): the 8 M1(n/2) term counts the eight spawned multiplications, and A1(n) is the final addition M. Herlihy & N. Shavit (c) 2003

  32. Multiplication • Work is M1(n) = 8 M1(n/2) + Θ(n²) = Θ(n³), the same as the serial triple-nested loop M. Herlihy & N. Shavit (c) 2003

  33. Multiplication • Critical path length is M∞(n) = M∞(n/2) + A∞(n): the eight half-size multiplications run in parallel, followed by the final addition M. Herlihy & N. Shavit (c) 2003

  34. Multiplication • Critical path length is M∞(n) = M∞(n/2) + Θ(log n) = Θ(log² n), since each of the Θ(log n) levels of recursion contributes a Θ(log n) addition M. Herlihy & N. Shavit (c) 2003

  35. Parallelism • M1(n)/M∞(n) = Θ(n³/log² n) • To multiply two 1000 x 1000 matrices: 1000³/10² = 10⁷ • Much more than the number of processors on any real machine M. Herlihy & N. Shavit (c) 2003

  36. Shared-Memory Multiprocessors • Parallel applications • Java • Cilk, etc. • Mix of other jobs • All run together • Come & go dynamically M. Herlihy & N. Shavit (c) 2003

  37. Scheduling • Ideally, • User-level scheduler • Maps threads to dedicated processors • In real life, • User-level scheduler • Maps threads to fixed number of processes • Kernel-level scheduler • Maps processes to dynamic pool of processors M. Herlihy & N. Shavit (c) 2003

  38. For Example • Initially, • All P processors available for application • Serial computation • Takes over one processor • Leaving P-1 for us • Waits for I/O • We get that processor back …. M. Herlihy & N. Shavit (c) 2003

  39. Speedup • Map threads onto P processes • Cannot get P-fold speedup • What if the kernel doesn’t cooperate? • Can try for PA-fold speedup • PA is time-averaged number of processors the kernel gives us M. Herlihy & N. Shavit (c) 2003

  40. Static Load Balancing [Figure: speedup vs. number of processes (1 to 32) for mm(1024), lu(2048), barnes(16K,10), and heat(4K,512,100), compared against the ideal, on an 8-processor Sun Ultra Enterprise 5000] M. Herlihy & N. Shavit (c) 2003

  41. Dynamic Load Balancing [Figure: speedup vs. number of processes (1 to 32) for mm(1024), lu(2048), barnes(16K,10), heat(4K,512,100), msort(32M), and ray(), compared against the ideal, on an 8-processor Sun Ultra Enterprise 5000] M. Herlihy & N. Shavit (c) 2003

  42. Scheduling Hierarchy • User-level scheduler • Tells kernel which processes are ready • Kernel-level scheduler • Synchronous (for analysis, not correctness!) • Picks pi threads to schedule at step i • Time-weighted average is PA = (p1 + p2 + … + pT)/T M. Herlihy & N. Shavit (c) 2003
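For instance (an invented scenario, not from the slides): if over T = 4 steps the kernel schedules p1 = 8, p2 = 8, p3 = 4, p4 = 4 processes, then PA = (8 + 8 + 4 + 4)/4 = 6, and 6-fold speedup is the most we can hope for.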

  43. Greed is Good • Greedy scheduler • Schedules as much as it can • At each time step M. Herlihy & N. Shavit (c) 2003

  44. Theorem • Greedy scheduler ensures actual time T ≤ T1/PA + T∞(P-1)/PA M. Herlihy & N. Shavit (c) 2003
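As a quick numeric illustration (the processor counts are invented; the fib numbers come from the earlier slides): with T1 = 17, T∞ = 8, P = 4, and the kernel dedicating all four processors to us so that PA = 4, the bound gives T ≤ 17/4 + 8·(4-1)/4 = 4.25 + 6 = 10.25 steps.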

  45. Proof Strategy • Bound the second term, T∞(P-1)/PA M. Herlihy & N. Shavit (c) 2003

  46. Put Tokens in Buckets [Figure: two buckets, work and idle; a token goes in the work bucket when a scheduled thread executes a node, and in the idle bucket when a scheduled thread has nothing to execute] M. Herlihy & N. Shavit (c) 2003

  47. At the end … [Figure: total #tokens = work-bucket tokens + idle-bucket tokens] M. Herlihy & N. Shavit (c) 2003

  48. At the end … [Figure: the work bucket holds exactly T1 tokens] M. Herlihy & N. Shavit (c) 2003

  49. Must Show [Figure: the idle bucket holds at most T∞(P-1) tokens] M. Herlihy & N. Shavit (c) 2003

  50. Every Move You Make … • Scheduler is greedy • At least one node is ready • Number of idle threads in one step is at most pi - 1 ≤ P - 1 M. Herlihy & N. Shavit (c) 2003
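Putting the token counts together (the step that finishes the argument): each step deposits exactly pi tokens across the two buckets, so the total is p1 + … + pT = PA·T; the work bucket holds T1 tokens and the idle bucket at most T∞(P-1), hence PA·T ≤ T1 + T∞(P-1), which rearranges to T ≤ T1/PA + T∞(P-1)/PA, i.e. the theorem.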
