
Lithe: Enabling Efficient Composition of Parallel Libraries



Presentation Transcript


  1. Lithe: Enabling Efficient Composition of Parallel Libraries. Heidi Pan, Benjamin Hindman, Krste Asanović. xoxo@mit.edu, {benh, krste}@eecs.berkeley.edu. Massachusetts Institute of Technology / UC Berkeley. HotPar, Berkeley, CA, March 31, 2009.

  2. How to Build Parallel Apps? [Diagram: an app sits on the OS and the hardware; its functionality can be composed from any of several libraries ("or ... or ... or"), while its resource management must map the work onto Core 0 through Core 7.] Need both programmer productivity and performance!

  3. Composability is Key to Productivity: Functional Composability. Modularity: the same app can swap library implementations (e.g., bubblesort vs. quicksort). Code reuse: the same library implementation (e.g., sort) serves different apps (App 1, App 2).

  4. Composability is Key to Productivity: Performance Composability. Composing fast code with fast code should stay fast, and composing fast code with faster code should get fast(er).

  5. Talk Roadmap • Problem: Efficient parallel composability is hard! • Solution: • Harts • Lithe • Evaluation

  6. Motivational Example: Sparse QR Factorization (Tim Davis, Univ of Florida). Software architecture: SPQR parallelizes its column elimination tree with TBB, and each frontal matrix factorization calls MKL, which is parallelized with OpenMP. System stack: SPQR on TBB and MKL/OpenMP, on the OS, on the hardware.

  7. Out-of-the-Box Performance. [Chart: performance of SPQR on a 16-core machine; time (sec) per input matrix, comparing the sequential and out-of-the-box versions.]

  8. Out-of-the-Box Libraries Oversubscribe the Resources. [Diagram: TBB and OpenMP each spawn their own pools of virtualized kernel threads on top of the OS, leaving Core 0 through Core 3 juggling far more threads than hardware contexts.]
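The oversubscription this slide depicts can be reproduced with a small stand-in for SPQR: a TBB parallel loop whose body calls a "threaded MKL-like" kernel that opens its own OpenMP team. This is a minimal sketch, not code from the talk; fake_mkl_kernel is a hypothetical stand-in for a threaded MKL routine, and on P cores the nesting yields on the order of P x P kernel threads.

```cpp
#include <cstdio>
#include <omp.h>
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

// Hypothetical stand-in for a threaded MKL routine: every call opens a full
// OpenMP team, regardless of how many TBB workers are already running.
static void fake_mkl_kernel(int task)
{
    #pragma omp parallel
    {
        std::printf("task %d: OpenMP thread %d of %d\n",
                    task, omp_get_thread_num(), omp_get_num_threads());
    }
}

int main()
{
    const int tasks = 64;  // think: frontal-matrix factorizations in SPQR
    // TBB creates roughly one worker per core; each worker's kernel call then
    // creates another team of one OpenMP thread per core -> oversubscription.
    tbb::parallel_for(tbb::blocked_range<int>(0, tasks),
                      [](const tbb::blocked_range<int>& r) {
                          for (int i = r.begin(); i != r.end(); ++i)
                              fake_mkl_kernel(i);
                      });
    return 0;
}
```

Built with OpenMP enabled and linked against TBB, each TBB worker opens its own OpenMP team, which is exactly the thread explosion the slide illustrates.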

  9. MKL Quick Fix. "If more than one thread calls Intel MKL and the function being called is threaded, it is important that threading in Intel MKL be turned off. Set OMP_NUM_THREADS=1 in the environment." Using Intel MKL with Threaded Applications, http://www.intel.com/support/performancetools/libraries/mkl/sb/CS-017177.htm
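In code, the vendor's workaround amounts to pinning the OpenMP thread count to one before the threaded application starts calling MKL. A minimal sketch, assuming MKL picks up the standard OpenMP setting; the documented form is the environment variable itself (export OMP_NUM_THREADS=1 before launch).

```cpp
#include <cstdlib>
#include <omp.h>

int main()
{
    // Documented fix: OMP_NUM_THREADS=1 in the environment. The in-process
    // equivalent, applied before any MKL call is made:
    setenv("OMP_NUM_THREADS", "1", /*overwrite=*/1);
    omp_set_num_threads(1);

    // ... threaded application code (e.g., TBB tasks) calling MKL goes here;
    //     each MKL call now runs on a single thread ...
    return 0;
}
```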

  10. Sequential MKL in SPQR. [Diagram: with MKL's threading turned off, only TBB's workers occupy Core 0 through Core 3; each MKL call runs sequentially inside the TBB task that invoked it, on top of the OS and hardware.]

  11. Sequential MKL Performance. [Chart: performance of SPQR on a 16-core machine; time (sec) per input matrix, comparing the out-of-the-box and sequential-MKL versions.]

  12. SPQR Wants to Use Parallel MKL. When there is no task-level parallelism left, SPQR wants to exploit matrix-level parallelism instead.

  13. Share Resources Cooperatively. Tim Davis manually tunes the libraries to partition the resources effectively, e.g., TBB_NUM_THREADS = 2 and OMP_NUM_THREADS = 2, so that TBB and OpenMP split Core 0 through Core 3 between them.
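The kind of static split described here can be sketched in a few lines, assuming era-appropriate APIs: tbb::task_scheduler_init to cap TBB's worker pool and omp_set_num_threads to cap each OpenMP team. The values are illustrative; the point is that both caps are fixed before the computation starts.

```cpp
#include <omp.h>
#include <tbb/task_scheduler_init.h>

int main()
{
    // Illustrative static partition on a 4-core machine:
    // two cores' worth of TBB workers, two OpenMP threads per MKL call.
    tbb::task_scheduler_init tbb_workers(2);  // cap TBB's worker pool at 2
    omp_set_num_threads(2);                   // cap each OpenMP team at 2

    // ... run SPQR: TBB tasks walk the elimination tree, and each task's
    //     MKL call spreads across at most 2 OpenMP threads ...
    return 0;
}
```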

  14. Manually Tuned Performance. [Chart: performance of SPQR on a 16-core machine; time (sec) per input matrix, comparing the out-of-the-box, sequential-MKL, and manually tuned versions.]

  15. Manual Tuning Cannot Share Resources Effectively. A static split must decide up front when to give resources to OpenMP and when to give them to TBB, even though the right balance shifts as the computation progresses.

  16. Manual Tuning Destroys Functional Composability. The tuning is baked into the library: Tim Davis's OMP_NUM_THREADS = 4 setting for MKL/OpenMP also applies when a different client, such as LAPACK solving Ax=b, uses the same library.

  17. Manual Tuning Destroys Performance Composability. The tuned settings do not carry over: each new library version (MKLv1, MKLv2, MKLv3) or new app changes the balance, so the cores must be re-partitioned by hand every time.

  18. Talk Roadmap • Problem: Efficient parallel composability is hard! • Solution: • Harts: better resource abstraction • Lithe: framework for sharing resources • Evaluation

  19. Virtualized Threads are Bad. [Diagram: App 1's TBB and OpenMP threads and App 2's threads are all multiplexed by the OS across Core 0 through Core 7.] Different codes compete unproductively for resources.

  20. Space-Time Partitions Aren't Enough. [Diagram: the OS gives SPQR (TBB + MKL/OpenMP) and App 2 separate partitions of Core 0 through Core 7.] Space-time partitions isolate different apps, but what do we do within an app?

  21. Harts: Hardware Thread Contexts. Harts represent real hardware resources; they are requested, not created, and the OS does not manage harts on the app's behalf. [Diagram: SPQR's TBB and MKL/OpenMP share the harts backed by Core 0 through Core 7.]

  22. Sharing Harts. [Diagram: within a partition, Hart 0 through Hart 3 are handed back and forth over time between TBB and OpenMP, on top of the OS and hardware.]

  23. Cooperative Hierarchical Schedulers. The application call graph induces a library (scheduler) hierarchy, e.g., TBB calling into Cilk, Ct, and OpenMP. • Modular: each piece of the app is scheduled independently. • Hierarchical: the caller gives resources to the callee to execute on its behalf. • Cooperative: the callee gives the resources back to the caller when done.

  24. A Day in the Life of a Hart. Over time, the hart repeatedly asks its current scheduler "next?": the TBB scheduler hands it TBB tasks to execute; when the TBB scheduler has nothing left to do, the hart is given back to the parent; the Cilk scheduler chooses not to start a new task and finishes an existing one first; then the Ct scheduler is asked "next?" in turn.
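The narration above can be read as a loop the hart runs. The following is a hypothetical sketch; the names Scheduler, Task, and hart_main are illustrative, not Lithe's actual types. The hart keeps asking its current scheduler for work and, when a scheduler has nothing left, falls back to that scheduler's parent.

```cpp
// Illustrative types only; not the real Lithe or TBB/Cilk/Ct interfaces.
struct Task {
    void (*run)(void*);
    void* arg;
};

struct Scheduler {
    Scheduler* parent = nullptr;   // nullptr for the root of the hierarchy
    virtual Task* next() = 0;      // "next?": a runnable task, or nullptr
    virtual ~Scheduler() = default;
};

// What one hart does all day: run tasks for the current scheduler, and give
// itself back to the parent scheduler when the current one runs dry.
void hart_main(Scheduler* current)
{
    while (current != nullptr) {
        if (Task* t = current->next()) {
            t->run(t->arg);            // e.g., execute a TBB task
        } else {
            current = current->parent; // nothing left: hand the hart back
        }
    }
}
```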

  25. Standard Lithe ABI. The ordinary call/return interface exchanges values between caller and callee; Lithe adds a parallel interface for sharing harts between a parent scheduler (e.g., a TBBLithe or CilkLithe scheduler) and a child scheduler (e.g., an OpenMPLithe or TBBLithe scheduler): enter, yield, request, register, unregister. • Analogous to the function-call ABI that enables interoperable codes. • A mechanism for sharing harts, not a policy.
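To make the hart-sharing side of the ABI concrete, here is a hypothetical callback table a Lithe-compliant scheduler might export. The member names mirror the slide (register is spelled register_child because register is a C++ keyword); the real Lithe runtime defines its own types and signatures.

```cpp
// Hypothetical sketch of the hart-sharing interface on this slide; the real
// Lithe ABI has its own types. Each Lithe-compliant scheduler (TBBLithe,
// OpenMPLithe, ...) supplies these callbacks so its parent and children can
// move harts without knowing anything else about it.
struct lithe_sched;  // opaque handle to a scheduler instance

struct lithe_sched_callbacks {
    // A hart is granted to this scheduler by its parent; runs on that hart.
    void (*enter)(lithe_sched* self);
    // A hart this scheduler lent to a child has been given back.
    void (*yield)(lithe_sched* self, lithe_sched* child);
    // A child asks this scheduler for up to n additional harts.
    void (*request)(lithe_sched* self, lithe_sched* child, int n);
    // A callee registered a new child scheduler under this one, or removed it.
    void (*register_child)(lithe_sched* self, lithe_sched* child);
    void (*unregister_child)(lithe_sched* self, lithe_sched* child);
};
```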

  26. Lithe Runtime. [Diagram: TBBLithe and OpenMPLithe each implement enter, yield, request, register, and unregister; the Lithe runtime sits between them and the OS/hardware, tracks the scheduler hierarchy and the current scheduler, and moves harts between schedulers as they yield them.]

  27. Register / Unregister. matmult() begins with register(OpenMPLithe) and ends with unregister(OpenMPLithe): register dynamically adds the new scheduler (here, the OpenMPLithe scheduler) beneath the TBBLithe scheduler in the hierarchy, and unregister removes it when the call completes.

  28. Request. After register(OpenMPLithe), matmult() calls request(n); request asks the parent scheduler (TBBLithe) for more harts on behalf of the OpenMPLithe scheduler.

  29. Enter / Yield. The parent (TBBLithe) grants a hart by calling enter(OpenMPLithe) on it; when the OpenMPLithe scheduler is done with a hart, it calls yield(). Enter/yield is how additional harts are transferred between the parent and the child.
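Putting slides 27 through 29 together, the matmult() fragments amount to the following hypothetical sketch. The lithe_* functions and openmp_lithe_sched are illustrative stand-ins declared here only so the sketch is self-contained; they are not the real Lithe API.

```cpp
// Illustrative stand-ins for the runtime calls named on slides 27-29.
struct lithe_sched;                          // opaque scheduler handle
void lithe_register(lithe_sched* child);     // slide 27: add child under current
void lithe_unregister(lithe_sched* child);   // slide 27: restore the parent
void lithe_request(int nharts);              // slide 28: ask parent for harts
void lithe_yield();                          // slide 29: give this hart back

extern lithe_sched* openmp_lithe_sched;      // the OpenMPLithe scheduler instance

void matmult(/* matrix arguments elided */)
{
    lithe_register(openmp_lithe_sched);   // OpenMPLithe becomes the current scheduler
    lithe_request(3);                     // parent (TBBLithe) may grant harts by
                                          // calling enter() on OpenMPLithe for each

    // ... OpenMPLithe runs the multiply on whatever harts it receives; each
    //     hart calls lithe_yield() when its share of the work is finished ...

    lithe_unregister(openmp_lithe_sched); // hand control and the hart back to TBBLithe
}
```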

  30. SPQR with Lithe. [Timeline: during a matmult call, SPQR's TBBLithe scheduler registers MKL's OpenMPLithe scheduler, a request is made, harts enter, they yield back when the multiply finishes, and the scheduler unregisters.]

  31. SPQR with Lithe (continued). [Timeline: many matmult calls over time, each performing its own register / request / ... / unregister exchange between TBBLithe and MKL's OpenMPLithe.]

  32. Talk Roadmap • Problem: Efficient parallel composability is hard! • Solution: • Harts • Lithe • Evaluation

  33. Implementation • Harts: simulated using pinned Pthreads on x86 Linux (~600 lines of C & assembly) • Lithe: user-level library providing register, unregister, request, enter, yield, ... (~2000 lines of C, C++, and assembly) • TBBLithe: ~1500 of ~8000 relevant lines added/removed/modified • OpenMPLithe (GCC 4.4): ~1000 of ~6000 relevant lines added/removed/modified

  34. No Lithe Overhead Without Composing. • TBBLithe performance: µbenchmarks included with the TBB release. • OpenMPLithe performance: NAS parallel benchmarks. All results on Linux 2.6.18, 8-core Intel Clovertown.

  35. Performance Characteristics of SPQR (Input = ESOC). [Surface plot: time (sec) as a function of NUM_TBB_THREADS and NUM_OMP_THREADS.]

  36. Performance Characteristics of SPQR (Input = ESOC). [Same plot, highlighting the sequential configuration: TBB=1, OMP=1, 172.1 sec.]

  37. Performance Characteristics of SPQR (Input = ESOC). [Same plot, highlighting the out-of-the-box configuration: TBB=8, OMP=8, 111.8 sec.]

  38. Performance Characteristics of SPQR (Input = ESOC). [Same plot, highlighting the manually tuned configuration, 70.8 sec, versus out-of-the-box.]

  39. Performance of SPQR with Lithe. [Chart: time (sec) per input matrix, comparing the out-of-the-box, manually tuned, and Lithe versions.]

  40. Future Work. Port more runtimes to the Lithe ABI (CilkLithe, CtLithe) alongside TBBLithe and OpenMPLithe, each implementing enter, yield, request, register, and unregister, so that apps like SPQR can compose all of them.

  41. Conclusion • Composability is essential for parallel programming to become widely adopted. • Parallel libraries (e.g., SPQR's TBB and MKL/OpenMP) need to share resources cooperatively. • Lithe project contributions: Harts, a better resource model for parallel programming, and Lithe, which enables parallel codes to interoperate by standardizing the sharing of harts.

  42. Acknowledgements. We would like to thank George Necula and the rest of the Berkeley Par Lab for their feedback on this work. Research supported by Microsoft (Award #024263) and Intel (Award #024894) funding, with matching funds from U.C. Discovery (Award #DIG07-10227). This work has also been supported in part by a National Science Foundation Graduate Research Fellowship. Any opinions, findings, conclusions, or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the National Science Foundation. The authors also acknowledge the support of the Gigascale Systems Research Focus Center, one of five research centers funded under the Focus Center Research Program, a Semiconductor Research Corporation program.
