
Building Composable Parallel Software with Liquid Threads

Heidi Pan (MIT), Benjamin Hindman (UC Berkeley), Krste Asanovic (UC Berkeley). Presented at the Microsoft Numerical Library Incubation Team visit to UC Berkeley, April 29, 2008.





Presentation Transcript


  1. Building Composable Parallel Software with Liquid Threads — Heidi Pan (MIT), Benjamin Hindman (UC Berkeley), Krste Asanovic (UC Berkeley). Microsoft Numerical Library Incubation Team Visit, UC Berkeley, April 29, 2008.

  2. Today’s Parallel Programs are Fragile
  • Parallel programs usually need to be aware of hardware resources to achieve good performance:
  • Don’t incur the overhead of thread creation if there are no resources to run in parallel.
  • Run related tasks on the same core to preserve locality.
  • Today’s programs don’t have direct control over resources; they can only hope that the OS will do the right thing:
  • Create 1 kernel thread per core.
  • Manually multiplex work onto kernel threads to control locality & task prioritization.
  • Even if the OS tries to bind each thread to a particular core, it’s still not enough!
  [Figure: an integer programming app (branch & bound) spawns tasks onto the Task Parallel Library (TPL) runtime, which runs on kernel threads KT0–KT5 that the OS maps onto processors P0–P5.]
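A minimal sketch of the fragile pattern this slide describes (the helper name is ours, not from the talk): create one worker per core and manually multiplex tasks onto the fixed pool. The code silently assumes it owns the whole machine, which is exactly what breaks when it is composed with other parallel code.

```python
import os
import queue
import threading

def run_on_fixed_threads(tasks, nthreads=None):
    """Multiplex callables in `tasks` onto a worker pool sized to the machine."""
    nthreads = nthreads or os.cpu_count() or 1
    q = queue.Queue()
    for t in tasks:
        q.put(t)
    results, lock = [], threading.Lock()

    def worker():
        while True:
            try:
                task = q.get_nowait()
            except queue.Empty:
                return
            r = task()
            with lock:
                results.append(r)

    threads = [threading.Thread(target=worker) for _ in range(nthreads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results

print(sorted(run_on_fixed_threads([lambda i=i: i * i for i in range(8)])))
# [0, 1, 4, 9, 16, 25, 36, 49]
```

Note the hard-coded resource assumption in `os.cpu_count()`: the pool size is chosen with no knowledge of who else is running.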

  3. Today’s Parallel Codes are Not Composable
  • When a parallel app calls a parallel library, each runtime sizes itself to the machine, and the system is oversubscribed!
  • Today’s typical solution: use the sequential version of libraries within a parallel app!
  [Figure: the integer programming app (branch & bound) runs a parallel for on the Task Parallel Library (TPL) runtime, while the math library (MKL) it calls spawns work onto the OpenMP runtime; both runtimes independently create threads that the OS multiplexes onto processors P0–P5.]
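A toy model of the oversubscription this slide illustrates (the function name is ours): the app-level runtime sizes its worker pool to the machine, and a parallel library it calls independently does the same, so worker counts multiply instead of composing.

```python
def total_workers(ncores, nesting_depth):
    """Workers created when `nesting_depth` stacked runtimes each size themselves to the machine."""
    return ncores ** nesting_depth

cores = 6
print(total_workers(cores, 1))  # 6   -- the app alone: one worker per core
print(total_workers(cores, 2))  # 36  -- app + parallel library: 6x oversubscribed
```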

  4. Global Scheduler is Not the Right Solution
  • Difficult to design a one-size-fits-all scheduler that provides enough expressiveness and performance for a wide range of codes.
  • How do you design a dynamic load-balancing scheduler that preserves locality for both divide-and-conquer and linear algebra algorithms?
  • Difficult to convince all SW vendors and programmers to comply with the same programming model.
  • Difficult to optimize critical sections of code without interfering with or changing the global scheduler.
  [Figure: the integer programming app (B&B) and a solver both express parallel constructs (spawn, parallel for, …) to a single generic global scheduler in user space or the OS.]

  5. Cooperative Hierarchical Scheduling
  Goals:
  • Distributed Scheduling: customizable, scalable, extensible schedulers that make localized, code-specific scheduling decisions.
  • Hierarchical Scheduling: a parent decides the relative priority of its children.
  • Cooperative Scheduling: schedulers cooperate with each other to achieve globally optimal performance for the app.
  [Figure: the TPL scheduler (parent) of the integer programming app sits above the OpenMP scheduler (child) of the solver.]

  6. Cooperative Hierarchical Scheduling
  • Distributed Scheduling: at any point in time, each scheduler has full control over a subset of the kernel threads allotted to the application, on which it schedules its code.
  • Hierarchical Scheduling: a scheduler decides how many of its kernel threads to give to each child scheduler, and when those threads are given.
  • Cooperative Scheduling: a scheduler decides when to relinquish its kernel threads, instead of being pre-empted by its parent scheduler.
  [Figure: a scheduling tree in which TPL and OpenMP schedulers each own disjoint subsets of the application’s threads.]
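The thread accounting these bullets imply can be sketched as follows (all names are invented for illustration): every scheduler fully owns some number of threads, a parent grants threads to a child (deciding how many and when), and the child yields them back cooperatively instead of being pre-empted.

```python
class Scheduler:
    def __init__(self, name, threads=0):
        self.name = name
        self.threads = threads  # kernel threads this scheduler currently controls

    def grant(self, child, n):
        assert n <= self.threads, "a scheduler can only grant threads it owns"
        self.threads -= n
        child.threads += n

    def yield_back(self, parent, n):
        assert n <= self.threads
        self.threads -= n
        parent.threads += n

tpl = Scheduler("TPL", threads=6)    # parent: owns all 6 threads of the app
omp = Scheduler("OpenMP")            # child: starts with none
tpl.grant(omp, 2)                    # hierarchical: the parent decides how many, and when
assert (tpl.threads, omp.threads) == (4, 2)
omp.yield_back(tpl, 2)               # cooperative: the child relinquishes voluntarily
assert (tpl.threads, omp.threads) == (6, 0)
```

The invariant is that the total thread count is conserved across grants and yields; no scheduler ever runs on threads it has not been given.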

  7. Standardizing the Inter-Scheduler Interface
  • A standardized inter-scheduler resource management interface is needed to achieve cooperative hierarchical scheduling.
  • We need to extend the sequential ABI to support the transfer of resources!
  [Figure: the TPL scheduler (parent) and the OpenMP scheduler (child) communicate through the standardized interface.]

  8. Updating the ABI for the Parallel World
  • Functional ABI:
  • A call transfers the thread to the callee, which has full control of register & stack resources to schedule its instructions, and cooperatively relinquishes the thread upon return.
  • Identical to a sequential call.
  • Resource Mgmt ABI:
  • A parallel callee registers with its caller to ask for more resources.
  • The caller enters the callee on whatever additional threads it decides to grant.
  • The callee cooperatively yields threads back.
  [Figure: thread T0 calls into an OpenMP-scheduled solve(A); the callee registers with the TPL scheduler, which enters it on additional (stolen) threads; the callee later yields them, unregisters, and returns.]
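The handshake on this slide can be rendered as an event trace; the function and event names below are illustrative stand-ins, not the actual ABI symbols.

```python
def parallel_call(extra_threads_granted):
    """Trace of a parallel callee invoked through the resource-management ABI."""
    trace = ["call"]                            # caller's thread enters the callee
    trace.append("register")                    # callee asks the caller for more resources
    trace += ["enter"] * extra_threads_granted  # caller enters the callee on each granted thread
    trace += ["yield"] * extra_threads_granted  # callee cooperatively returns those threads
    trace.append("unregister")
    trace.append("ret")                         # original thread returns, as in a sequential call
    return trace

print(parallel_call(2))
# ['call', 'register', 'enter', 'enter', 'yield', 'yield', 'unregister', 'ret']
```

From the caller's side the trace still begins with `call` and ends with `ret`, which is what lets a parallel callee look like a sequential one.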

  9. The Case for a Resource Mgmt ABI
  By making resources first-class citizens, we enable:
  • Composability: code can be written without knowing the context in which it will be called, encouraging abstraction, reuse, and independence.
  • Scalability: code can call any library function without worrying about inadvertently oversubscribing the system’s resources.
  • Heterogeneity: an application can incorporate parallel libraries that are implemented in different languages and/or linked with different runtimes.
  • Transparency: a library function looks the same to its caller, regardless of whether its implementation is sequential or parallel.

  10. TPL Example: Managing Child Schedulers
  • T0: 1) Pushes continuations at spawn points onto its work queue. 2) Upon child registration, pushes the child’s enter task to recruit more threads. 3) The child keeps track of its own parallelism (not pushed onto the parent’s queue).
  • T1: Steals a subtree to compute.
  • T2: Steals the enter task, which effectively grants that thread to the child.
  [Figure: threads T0–T2 stealing from the TPL work queue while solve(A) runs under the OpenMP scheduler.]
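The queue discipline described above can be sketched like this (class and method names are ours): the parent's queue holds both its own continuations and the enter tasks a registered child pushed, and stealing an enter task is what grants the stealing thread to the child scheduler.

```python
from collections import deque

class ParentScheduler:
    def __init__(self):
        self.q = deque()

    def push_continuation(self, work):       # step 1: spawn-point continuations
        self.q.append(("continuation", work))

    def on_child_register(self, child):      # step 2: recruit threads for the child
        self.q.append(("enter", child))

    def steal(self):                         # an idle thread steals the next task
        return self.q.popleft() if self.q else None

tpl = ParentScheduler()
tpl.push_continuation("right-subtree")
tpl.on_child_register("OpenMP")
assert tpl.steal() == ("continuation", "right-subtree")  # T1 computes parent work
assert tpl.steal() == ("enter", "OpenMP")                # T2 is granted to the child
assert tpl.steal() is None
```

Step 3 from the slide is visible by omission: the child's own tasks never appear in this queue, so the parent only mediates thread grants, not the child's scheduling decisions.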

  11. MVMult Example: Managing a Variable # of Threads
  • Partition the work into tasks, each operating on an optimal cache block size.
  • Instead of statically mapping all tasks onto a fixed number of threads (SPMD), tasks are dynamically fetched by the currently available threads (and thus load balanced).
  • No loss of locality if there is no reuse of data between tasks.
  • Additional synchronization may be needed to impose an ordering on noncommutative floating-point operations.
  [Figure: a parallel for over matrix blocks; threads register, enter, repeatedly fetch the next task, then yield, unregister, and return.]
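The dynamic task fetching in these bullets can be sketched with a blocked matrix-vector multiply (all names are ours, and the block size stands in for the cache-optimal size): row blocks are claimed from a shared next-task counter, so the computation works correctly with however many threads show up.

```python
import threading

def mvmult(A, x, nthreads=3, block=2):
    n = len(A)
    y = [0] * n
    nblocks = (n + block - 1) // block
    next_task, lock = [0], threading.Lock()

    def worker():
        while True:
            with lock:                 # claim the next unprocessed row block
                b = next_task[0]
                next_task[0] += 1
            if b >= nblocks:
                return
            for i in range(b * block, min((b + 1) * block, n)):
                y[i] = sum(A[i][j] * x[j] for j in range(len(x)))

    threads = [threading.Thread(target=worker) for _ in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return y

print(mvmult([[1, 2], [3, 4]], [1, 1], nthreads=2, block=1))  # [3, 7]
```

Because each row is written by exactly one task, the result is deterministic here; the slide's caveat about ordering noncommutative floating-point operations applies when tasks must combine partial sums.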

  12. Liquid Threads Model
  • Thread resources flow dynamically & flexibly between different modules.
  • The result is more robust parallel code that adapts to different/changing environments.
  [Figure: processors P0–P3 flow between modules through call, enter, yield, and return.]

  13. Lithe: Liquid Thread Environment
  • Lithe is not a (high-level) programming model.
  • It is a low-level ABI for expert programmers (compiler/tool/standard-library developers) to control resources & map parallel codes: functional calls (call, ret) plus cooperative resource-management calls (enter, yield, request).
  • Lithe can be deployed incrementally because it supports sequential library function calls & provides some basic cooperative schedulers.
  • Lithe also supports management of other resources, such as memory and bandwidth.
  • Lithe also supports (uncooperative) revocation of resources by the OS.

  14. Lithe’s Interaction with the OS
  • Up to now, we’ve implicitly assumed that ours is the only app running, but the OS is usually time-multiplexing multiple apps onto the machine.
  • We believe that a manycore OS should partition the machine spatially & give each app direct control over resources (cores instead of kernel threads).
  • The OS may want to dynamically change the resource allocation between apps depending on the current workload.
  • Lithe-compliant schedulers are robust: they can easily absorb additional threads given by the OS & yield threads voluntarily to the OS.
  • Lithe-compliant schedulers can also check for contexts from threads pre-empted by the OS and schedule them on the remaining threads.
  • Lithe-compliant schedulers don’t use spinlocks (deadlock avoidance).
  [Figure: time-multiplexing of apps on processors P0–P3 vs. space-multiplexing (spatial partitioning), where App 1, App 2, and App 3 each own a subset of the processors.]

  15. Slither Status: In an Early Stage of Development
  • Slither simulates a variable-sized partition.
  • We simulate hard threads using pthreads.
  • We simulate partitions using processes.
  • The user can dynamically add/kill threads in the Vthread partition through the Slither prompt, & Vthread will adapt.
  [Figure: Fibonacci running on Vthread (a work-stealing scheduler) while threads are added and killed.]
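A rough sketch of the behavior this slide describes Slither demonstrating (all names are ours, and plain `threading.Thread` workers stand in for the hard threads that Slither simulates with pthreads): a partition whose worker count changes at runtime, absorbing new workers and retiring them on demand.

```python
import queue
import threading

class Partition:
    def __init__(self):
        self.tasks = queue.Queue()
        self.workers = []

    def add_thread(self):                    # partition grows: absorb a new worker
        t = threading.Thread(target=self._run)
        self.workers.append(t)
        t.start()

    def kill_thread(self):                   # partition shrinks: one worker retires
        self.tasks.put(None)                 # poison pill

    def _run(self):
        while True:
            task = self.tasks.get()
            if task is None:                 # took a pill: this worker exits
                return
            task()

    def drain(self):                         # retire every remaining worker
        for _ in self.workers:
            self.tasks.put(None)
        for t in self.workers:
            t.join()

done, lock = [0], threading.Lock()
def work():
    with lock:
        done[0] += 1

part = Partition()
for _ in range(20):
    part.tasks.put(work)
part.add_thread()
part.add_thread()                            # the user grows the partition...
part.kill_thread()                           # ...and shrinks it again
part.drain()
assert done[0] == 20                         # all work completes despite resizing
```

Because tasks are enqueued before any poison pill, all 20 tasks finish regardless of how the partition was resized, which is the adaptivity the slide attributes to Vthread.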

  16. Summary
  • Lithe defines a new parallel ABI that:
  • supports cooperative hierarchical scheduling.
  • enables a liquid threads model in which thread resources flow dynamically & flexibly between different modules.
  • provides the foundation to build composable & robust parallel software.
  • The work is funded partly by
