
Building Composable Parallel Software with Liquid Threads

Heidi Pan (MIT), Benjamin Hindman (UC Berkeley), Krste Asanovic (UC Berkeley). Presented at the Microsoft Numerical Library Incubation Team visit to UC Berkeley, April 29, 2008.





Presentation Transcript


  1. Building Composable Parallel Software with Liquid Threads — Heidi Pan (MIT), Benjamin Hindman (UC Berkeley), Krste Asanovic (UC Berkeley). Microsoft Numerical Library Incubation Team Visit, UC Berkeley, April 29, 2008.

  2. Today’s Parallel Programs are Fragile
  • Parallel programs usually need to be aware of hardware resources to achieve good performance:
  • Don’t incur the overhead of thread creation if there are no resources to run in parallel.
  • Run related tasks on the same core to preserve locality.
  • Today’s programs don’t have direct control over resources; they can only hope that the OS will do the right thing:
  • Create 1 kernel thread per core.
  • Manually multiplex work onto kernel threads to control locality & task prioritization.
  • Even if the OS tries to bind each thread to a particular core, it’s still not enough!
  [Figure: an integer programming app (branch & bound) spawns tasks onto the Task Parallel Library (TPL) runtime, which runs on kernel threads KT0–KT5 that the OS maps onto processors P0–P5.]
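A minimal sketch of the fragile pattern this slide describes (the helper name is ours, not from the talk): create one worker per core and manually multiplex tasks onto the fixed pool. The code silently assumes it owns the whole machine, which is exactly what breaks when it is composed with other parallel code.

```python
import os
import queue
import threading

def run_on_fixed_threads(tasks, nthreads=None):
    """Multiplex callables in `tasks` onto a worker pool sized to the machine."""
    nthreads = nthreads or os.cpu_count() or 1
    q = queue.Queue()
    for t in tasks:
        q.put(t)
    results, lock = [], threading.Lock()

    def worker():
        while True:
            try:
                task = q.get_nowait()
            except queue.Empty:
                return
            r = task()
            with lock:
                results.append(r)

    threads = [threading.Thread(target=worker) for _ in range(nthreads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results

print(sorted(run_on_fixed_threads([lambda i=i: i * i for i in range(8)])))
# [0, 1, 4, 9, 16, 25, 36, 49]
```

Note the hard-coded resource assumption in `os.cpu_count()`: the pool size is chosen with no knowledge of who else is running.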

  3. Today’s Parallel Codes are Not Composable
  • When a parallel app calls a parallel library, each runtime sizes itself to the machine, and the system is oversubscribed!
  • Today’s typical solution: use the sequential version of libraries within a parallel app!
  [Figure: the integer programming app (branch & bound) runs a parallel for on the Task Parallel Library (TPL) runtime, while the math library (MKL) it calls spawns work onto the OpenMP runtime; both runtimes independently create threads that the OS multiplexes onto processors P0–P5.]
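A toy model of the oversubscription this slide illustrates (the function name is ours): the app-level runtime sizes its worker pool to the machine, and a parallel library it calls independently does the same, so worker counts multiply instead of composing.

```python
def total_workers(ncores, nesting_depth):
    """Workers created when `nesting_depth` stacked runtimes each size themselves to the machine."""
    return ncores ** nesting_depth

cores = 6
print(total_workers(cores, 1))  # 6   -- the app alone: one worker per core
print(total_workers(cores, 2))  # 36  -- app + parallel library: 6x oversubscribed
```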

  4. Global Scheduler is Not the Right Solution
  • Difficult to design a one-size-fits-all scheduler that provides enough expressiveness and performance for a wide range of codes.
  • How do you design a dynamic load-balancing scheduler that preserves locality for both divide-and-conquer and linear algebra algorithms?
  • Difficult to convince all SW vendors and programmers to comply with the same programming model.
  • Difficult to optimize critical sections of code without interfering with or changing the global scheduler.
  [Figure: the integer programming app (B&B) and a solver both express parallel constructs (spawn, parallel for, …) to a single generic global scheduler in user space or the OS.]

  5. Cooperative Hierarchical Scheduling
  Goals:
  • Distributed Scheduling: customizable, scalable, extensible schedulers that make localized, code-specific scheduling decisions.
  • Hierarchical Scheduling: a parent decides the relative priority of its children.
  • Cooperative Scheduling: schedulers cooperate with each other to achieve globally optimal performance for the app.
  [Figure: the TPL scheduler (parent) of the integer programming app sits above the OpenMP scheduler (child) of the solver.]

  6. Cooperative Hierarchical Scheduling
  • Distributed Scheduling: at any point in time, each scheduler has full control over a subset of the kernel threads allotted to the application, on which it schedules its code.
  • Hierarchical Scheduling: a scheduler decides how many of its kernel threads to give to each child scheduler, and when those threads are given.
  • Cooperative Scheduling: a scheduler decides when to relinquish its kernel threads, instead of being pre-empted by its parent scheduler.
  [Figure: a scheduling tree in which TPL and OpenMP schedulers each own disjoint subsets of the application’s threads.]
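The thread accounting these bullets imply can be sketched as follows (all names are invented for illustration): every scheduler fully owns some number of threads, a parent grants threads to a child (deciding how many and when), and the child yields them back cooperatively instead of being pre-empted.

```python
class Scheduler:
    def __init__(self, name, threads=0):
        self.name = name
        self.threads = threads  # kernel threads this scheduler currently controls

    def grant(self, child, n):
        assert n <= self.threads, "a scheduler can only grant threads it owns"
        self.threads -= n
        child.threads += n

    def yield_back(self, parent, n):
        assert n <= self.threads
        self.threads -= n
        parent.threads += n

tpl = Scheduler("TPL", threads=6)    # parent: owns all 6 threads of the app
omp = Scheduler("OpenMP")            # child: starts with none
tpl.grant(omp, 2)                    # hierarchical: the parent decides how many, and when
assert (tpl.threads, omp.threads) == (4, 2)
omp.yield_back(tpl, 2)               # cooperative: the child relinquishes voluntarily
assert (tpl.threads, omp.threads) == (6, 0)
```

The invariant is that the total thread count is conserved across grants and yields; no scheduler ever runs on threads it has not been given.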

  7. Standardizing the Inter-Scheduler Interface
  • A standardized inter-scheduler resource management interface is needed to achieve cooperative hierarchical scheduling.
  • We need to extend the sequential ABI to support the transfer of resources!
  [Figure: the TPL scheduler (parent) and the OpenMP scheduler (child) communicate through the standardized interface.]

  8. Updating the ABI for the Parallel World
  • Functional ABI:
  • A call transfers the thread to the callee, which has full control of register & stack resources to schedule its instructions, and cooperatively relinquishes the thread upon return.
  • Identical to a sequential call.
  • Resource Mgmt ABI:
  • A parallel callee registers with its caller to ask for more resources.
  • The caller enters the callee on whatever additional threads it decides to grant.
  • The callee cooperatively yields threads back.
  [Figure: thread T0 calls into an OpenMP-scheduled solve(A); the callee registers with the TPL scheduler, which enters it on additional (stolen) threads; the callee later yields them, unregisters, and returns.]
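The handshake on this slide can be rendered as an event trace; the function and event names below are illustrative stand-ins, not the actual ABI symbols.

```python
def parallel_call(extra_threads_granted):
    """Trace of a parallel callee invoked through the resource-management ABI."""
    trace = ["call"]                            # caller's thread enters the callee
    trace.append("register")                    # callee asks the caller for more resources
    trace += ["enter"] * extra_threads_granted  # caller enters the callee on each granted thread
    trace += ["yield"] * extra_threads_granted  # callee cooperatively returns those threads
    trace.append("unregister")
    trace.append("ret")                         # original thread returns, as in a sequential call
    return trace

print(parallel_call(2))
# ['call', 'register', 'enter', 'enter', 'yield', 'yield', 'unregister', 'ret']
```

From the caller's side the trace still begins with `call` and ends with `ret`, which is what lets a parallel callee look like a sequential one.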

  9. The Case for a Resource Mgmt ABI
  By making resources first-class citizens, we enable:
  • Composability: code can be written without knowing the context in which it will be called, encouraging abstraction, reuse, and independence.
  • Scalability: code can call any library function without worrying about inadvertently oversubscribing the system’s resources.
  • Heterogeneity: an application can incorporate parallel libraries that are implemented in different languages and/or linked with different runtimes.
  • Transparency: a library function looks the same to its caller, regardless of whether its implementation is sequential or parallel.

  10. TPL Example: Managing Child Schedulers
  • T0: 1) Pushes continuations at spawn points onto its work queue. 2) Upon child registration, pushes the child’s enter task to recruit more threads. 3) The child keeps track of its own parallelism (not pushed onto the parent’s queue).
  • T1: Steals a subtree to compute.
  • T2: Steals the enter task, which effectively grants that thread to the child.
  [Figure: threads T0–T2 stealing from the TPL work queue while solve(A) runs under the OpenMP scheduler.]
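The queue discipline described above can be sketched like this (class and method names are ours): the parent's queue holds both its own continuations and the enter tasks a registered child pushed, and stealing an enter task is what grants the stealing thread to the child scheduler.

```python
from collections import deque

class ParentScheduler:
    def __init__(self):
        self.q = deque()

    def push_continuation(self, work):       # step 1: spawn-point continuations
        self.q.append(("continuation", work))

    def on_child_register(self, child):      # step 2: recruit threads for the child
        self.q.append(("enter", child))

    def steal(self):                         # an idle thread steals the next task
        return self.q.popleft() if self.q else None

tpl = ParentScheduler()
tpl.push_continuation("right-subtree")
tpl.on_child_register("OpenMP")
assert tpl.steal() == ("continuation", "right-subtree")  # T1 computes parent work
assert tpl.steal() == ("enter", "OpenMP")                # T2 is granted to the child
assert tpl.steal() is None
```

Step 3 from the slide is visible by omission: the child's own tasks never appear in this queue, so the parent only mediates thread grants, not the child's scheduling decisions.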

  11. MVMult Example: Managing a Variable # of Threads
  • Partition the work into tasks, each operating on an optimal cache block size.
  • Instead of statically mapping all tasks onto a fixed number of threads (SPMD), tasks are dynamically fetched by the currently available threads (and thus load balanced).
  • No loss of locality if there is no reuse of data between tasks.
  • Additional synchronization may be needed to impose an ordering on noncommutative floating-point operations.
  [Figure: a parallel for over matrix blocks; threads register, enter, repeatedly fetch the next task, then yield, unregister, and return.]
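The dynamic task fetching in these bullets can be sketched with a blocked matrix-vector multiply (all names are ours, and the block size stands in for the cache-optimal size): row blocks are claimed from a shared next-task counter, so the computation works correctly with however many threads show up.

```python
import threading

def mvmult(A, x, nthreads=3, block=2):
    n = len(A)
    y = [0] * n
    nblocks = (n + block - 1) // block
    next_task, lock = [0], threading.Lock()

    def worker():
        while True:
            with lock:                 # claim the next unprocessed row block
                b = next_task[0]
                next_task[0] += 1
            if b >= nblocks:
                return
            for i in range(b * block, min((b + 1) * block, n)):
                y[i] = sum(A[i][j] * x[j] for j in range(len(x)))

    threads = [threading.Thread(target=worker) for _ in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return y

print(mvmult([[1, 2], [3, 4]], [1, 1], nthreads=2, block=1))  # [3, 7]
```

Because each row is written by exactly one task, the result is deterministic here; the slide's caveat about ordering noncommutative floating-point operations applies when tasks must combine partial sums.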

  12. Liquid Threads Model
  • Thread resources flow dynamically & flexibly between different modules.
  • The result is more robust parallel code that adapts to different/changing environments.
  [Figure: processors P0–P3 flow between modules through call, enter, yield, and return.]

  13. Lithe: Liquid Thread Environment
  • Lithe is not a (high-level) programming model.
  • It is a low-level ABI for expert programmers (compiler/tool/standard-library developers) to control resources & map parallel codes: functional calls (call, ret) plus cooperative resource-management calls (enter, yield, request).
  • Lithe can be deployed incrementally because it supports sequential library function calls & provides some basic cooperative schedulers.
  • Lithe also supports management of other resources, such as memory and bandwidth.
  • Lithe also supports (uncooperative) revocation of resources by the OS.

  14. Lithe’s Interaction with the OS
  • Up to now, we’ve implicitly assumed that ours is the only app running, but the OS is usually time-multiplexing multiple apps onto the machine.
  • We believe that a manycore OS should partition the machine spatially & give each app direct control over resources (cores instead of kernel threads).
  • The OS may want to dynamically change the resource allocation between apps depending on the current workload.
  • Lithe-compliant schedulers are robust: they can easily absorb additional threads given by the OS & yield threads voluntarily to the OS.
  • Lithe-compliant schedulers can also check for contexts from threads pre-empted by the OS and schedule them on the remaining threads.
  • Lithe-compliant schedulers don’t use spinlocks (deadlock avoidance).
  [Figure: time-multiplexing of apps on processors P0–P3 vs. space-multiplexing (spatial partitioning), where App 1, App 2, and App 3 each own a subset of the processors.]

  15. Slither Status: In an Early Stage of Development
  • Slither simulates a variable-sized partition.
  • We simulate hard threads using pthreads.
  • We simulate partitions using processes.
  • The user can dynamically add/kill threads in the Vthread partition through the Slither prompt, & Vthread will adapt.
  [Figure: Fibonacci running on Vthread (a work-stealing scheduler) while threads are added and killed.]
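A rough sketch of the behavior this slide describes Slither demonstrating (all names are ours, and plain `threading.Thread` workers stand in for the hard threads that Slither simulates with pthreads): a partition whose worker count changes at runtime, absorbing new workers and retiring them on demand.

```python
import queue
import threading

class Partition:
    def __init__(self):
        self.tasks = queue.Queue()
        self.workers = []

    def add_thread(self):                    # partition grows: absorb a new worker
        t = threading.Thread(target=self._run)
        self.workers.append(t)
        t.start()

    def kill_thread(self):                   # partition shrinks: one worker retires
        self.tasks.put(None)                 # poison pill

    def _run(self):
        while True:
            task = self.tasks.get()
            if task is None:                 # took a pill: this worker exits
                return
            task()

    def drain(self):                         # retire every remaining worker
        for _ in self.workers:
            self.tasks.put(None)
        for t in self.workers:
            t.join()

done, lock = [0], threading.Lock()
def work():
    with lock:
        done[0] += 1

part = Partition()
for _ in range(20):
    part.tasks.put(work)
part.add_thread()
part.add_thread()                            # the user grows the partition...
part.kill_thread()                           # ...and shrinks it again
part.drain()
assert done[0] == 20                         # all work completes despite resizing
```

Because tasks are enqueued before any poison pill, all 20 tasks finish regardless of how the partition was resized, which is the adaptivity the slide attributes to Vthread.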

  16. Summary
  • Lithe defines a new parallel ABI that:
  • supports cooperative hierarchical scheduling.
  • enables a liquid threads model in which thread resources flow dynamically & flexibly between different modules.
  • provides the foundation to build composable & robust parallel software.
  • The work is funded partly by
