
Programming models for the Barrelfish multi-kernel operating system

Tim Harris

Based on joint work with Martín Abadi, Andrew Baumann, Paul Barham, Richard Black, Orion Hodson, Rebecca Isaacs, Ross McIlroy, Simon Peter, Vijayan Prabhakaran, Timothy Roscoe, Adrian Schüpbach, Akhilesh Singhania



Cache-coherent multicore

[Diagram: many cores, each with a private L2 cache; groups of cores share an L3 cache and a RAM bank]

AMD Istanbul: 6 cores, per-core L2, per-package L3



Single-chip cloud computer (SCC)

[Diagram: a 2-core tile, each core with a private L2 cache, sharing a message-passing buffer (MPB) and a router; memory controllers (MC-0, MC-1, MC-3, MC-4) connect the mesh to RAM; VRC and system interface]

  • Non-coherent caches

  • Hardware-supported messaging

  • 24 * 2-core tiles

  • On-chip mesh network



MSR Beehive

[Diagram: RISCN core modules on a ring interconnect (RingIn[31:0], SlotTypeIn[3:0], SrcDestIn[3:0]); a MemMux module handles messages and locks, multiplexing addresses and data (32-bit Rdreturn, 128-bit RD, pipelined bus to all cores) between the cores, the DDR controller, the display controller, and RAM]

  • Ring interconnect

  • Message passing in h/w

  • No cache coherence

  • Split-phase memory access



IBM Cell

[Diagram: eight SXU cores, each with a local store (LS) and memory flow controller (MFC); one PXU core with L1 and L2 caches; shared memory and I/O interfaces]

64-bit Power (PXU) and SIMD (SXU) core types



Messaging vs shared data as default

  • Fundamental model is message based

  • “It’s better to have shared memory and not need it than to need shared memory and not have it”

[Spectrum: traditional operating systems → Barrelfish multikernel]

Shared state, one big lock → fine-grained locking → clustered objects, partitioning → distributed state, replica maintenance



The Barrelfish multi-kernel OS

[Diagram: apps running above per-core OS nodes, each holding a state replica, on heterogeneous cores (x64, x64, ARM, accelerator core) joined by a hardware interconnect; OS nodes communicate via message passing]



The Barrelfish multi-kernel OS

[Diagram as on the previous slide]

System runs on heterogeneous hardware, currently supporting ARM, Beehive, SCC, x86 & x64



The Barrelfish multi-kernel OS

[Diagram as on the previous slide]

Focus of this talk: system components, each local to a specific core, and using message passing

System runs on heterogeneous hardware, currently supporting ARM, Beehive, SCC, x86 & x64



The Barrelfish multi-kernel OS

[Diagram as on the previous slide]

User-mode programs: several models supported, including conventional shared-memory OpenMP & pthreads

Focus of this talk: system components, each local to a specific core, and using message passing

System runs on heterogeneous hardware, currently supporting ARM, Beehive, SCC, x86 & x64


The Barrelfish messaging stack

  • High-level language runtime system: threads, join patterns, buffering, ...

  • Simple synchronous send/receive interface, support for concurrency

  • Portable IDL: marshalling to/from C structs, events on send/receive possible

  • H/W-specific interface: max message sizes, guarantees, flow control, ...


Programming models for the Barrelfish multi-kernel operating system

Multi-core hardware & Barrelfish · Event-based model · Synchronous model · Supporting concurrency · Cancellation · Performance



Event-based interface

interface Echo {
    rpc ping(in int32 arg, out int32 result);
}

[Diagram: the stub compiler turns the interface definition into interconnect-specific stubs (SCC, shared-mem, Beehive, same core) beneath a common event-based programming interface]



Event-based interface

interface Echo {
    rpc ping(in int32 arg, out int32 result);
}

// Try to send => OK or ERR_TX_BUSY
errval_t Echo_send_ping_call(Echo_binding *b, int arg);

// Register callback when send worth re-trying
errval_t Echo_register_send(Echo_binding *b, Echo_can_send_cb *cb);
typedef void Echo_can_send_cb(Echo_binding *b);

// Register callback on incoming "ping" response
errval_t Echo_register_recv_ping_resp(Echo_binding *b, Echo_recv_ping_cb *cb);
typedef void Echo_recv_ping_cb(Echo_binding *b, int result);

// Wait for next callback to be ready, then execute it
errval_t event_dispatch(void);

[Diagram: the common programming interface sits above interconnect-specific stubs (SCC, UMP, BMP, LMP)]


Event-based interface

Echo_binding *b;
...
Echo_register_recv_ping_resp(b, &response_handler);
...
bool done = false;
err = Echo_send_ping_call(b, 10);
if (err == ERR_TX_BUSY) {
    // Channel busy: stash the argument, retry when the send is worth re-trying
    b->st = malloc(sizeof(int));
    *(int*)b->st = 10;
    err = Echo_register_send(b, &resend_handler);
    assert(err == ERR_OK);
}
while (!done) {
    event_dispatch();
}

static void response_handler(Echo_binding *b, int val) {
    printf("Got response %d\n", val);
    done = true;
}

static void resend_handler(Echo_binding *b) {
    err = Echo_send_ping_call(b, *(int*)b->st);
    if (err == ERR_TX_BUSY) {
        err = Echo_register_send(b, &resend_handler);
        assert(err == ERR_OK);
    } else {
        free(b->st);
    }
}



Why do it this way?

  • Overlap computation and communication

    • Non-blocking send/receive operations allow the caller to continue with other work

  • Remain responsive to multiple clients

    • Don’t get “stuck” blocked on a receive from one client while another client is ready for service

  • Lightweight runtime system

    • No threads, no GC, etc.

    • Support on diverse hardware

    • Use within the implementation of runtime systems for higher-level languages
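
A minimal sketch of what the first two points buy, using only the generated operations above (the ERR_TX_BUSY retry path is omitted for brevity):

static int replies = 0, sum = 0;

static void on_resp(Echo_binding *b, int result) {
    sum += result;                  // tally each server's answer as it arrives
    replies++;
}

static int total_event(Echo_binding *b1, Echo_binding *b2, int arg) {
    Echo_register_recv_ping_resp(b1, &on_resp);
    Echo_register_recv_ping_resp(b2, &on_resp);
    Echo_send_ping_call(b1, arg);   // both requests outstanding at once
    Echo_send_ping_call(b2, arg);
    while (replies < 2) {
        event_dispatch();           // handlers for other clients can run here too
    }
    return sum;
}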


Programming models for the Barrelfish multi-kernel operating system

Multi-core hardware & Barrelfish · Event-based model · Synchronous model · Supporting concurrency · Cancellation · Performance



Goals for the synchronous model

  • Cleaner programming model

  • Integration in C

    • Resource consumption can be anticipated

  • Low overhead over the underlying messaging primitives

    • Don’t want to harm speed

    • Don’t want to harm flexibility (e.g., ability to compute while waiting for responses)

  • Focus on concurrency between communicating processes

    • Everything runs in a single thread (unless the code says otherwise)

    • Execution is deterministic (modulo the timing and content of external inputs)



Event-based programming model

interface Echo {
    rpc ping(in int32 arg, out int32 result);
}

[Diagram as before: the stub compiler generates interconnect-specific stubs (SCC, shared-mem, same core) beneath a common event-based programming interface]



Synchronous message-passing

interface Echo {
    rpc ping(in int32 arg, out int32 result);
}

[Diagram: synchronous message-passing stubs, generated by the stub compiler, layered above the common event-based programming interface and the interconnect-specific stubs (SCC, shared-mem, same core)]



Synchronous message-passing

interface Echo {
    rpc ping(in int32 arg, out int32 result);
}

// Send "ping", block until complete
void Echo_tx_ping(Echo_binding *b, int arg);

// Wait for and receive response to "ping"
void Echo_rx_ping(Echo_binding *b, int *result);

// RPC send-receive pair
void Echo_ping(Echo_binding *b, int arg, int *result);

[Diagram: these synchronous stubs sit above the common event-based interface and the interconnect-specific stubs (SCC, UMP, LMP)]


Channel abstraction

[Diagram: client-side and server-side endpoints joined by a pair of channels]

  • Pair of uni-directional channels, of bounded but unknown capacity

  • Send is synchronous between the sender and their channel endpoint

  • FIFO, lossless transmission, with unknown delay

  • Receive is synchronous between the receiver and the head of the channel

  • Only whole-channel failures, e.g. if the other party exits
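
To illustrate these semantics with the split stubs from the previous slide (do_other_work is a hypothetical placeholder):

Echo_tx_ping(b, 10);        // synchronous with our endpoint: blocks only until the channel accepts the message
do_other_work();            // message in flight: FIFO, lossless, unknown delay
int result;
Echo_rx_ping(b, &result);   // blocks until the response reaches the head of our channel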



Two back-to-back RPCs

  • This looks cleaner but:

    • We’ve lost the ability to contact multiple servers concurrently

    • We’ve lost the ability to overlap computation with waiting

static int total_rpc(Echo_binding *b1,
                     Echo_binding *b2,
                     int arg) {
    int result1, result2;
    Echo_ping(b1, arg, &result1);
    Echo_ping(b2, arg, &result2);
    return result1 + result2;
}


Programming models for the Barrelfish multi-kernel operating system

Multi-core hardware & Barrelfish · Event-based model · Synchronous model · Supporting concurrency · Cancellation · Performance


Adding asynchrony: async, do..finish

static int total_rpc(Echo_binding *b1,
                     Echo_binding *b2,
                     int arg) {
    int result1, result2;
    do {
        async Echo_ping(b1, arg, &result1);
        async Echo_ping(b2, arg, &result2);
    } finish;
    return result1 + result2;
}

  • If the async code blocks, then resume after the async

  • finish waits until all the async work (dynamically) in the do..finish has completed
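
Because a blocked async returns control to the caller, local computation can be overlapped with an outstanding RPC. A minimal variant of the example above (precompute and local are hypothetical, purely for illustration):

int local;
do {
    async Echo_ping(b1, arg, &result1);   // may block; caller resumes here
    local = precompute(arg);              // overlaps with the outstanding RPC
    async Echo_ping(b2, arg, &result2);
} finish;                                 // both RPCs complete before this returns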


Example: same-core L4-style RPC

static int total_rpc(Echo_binding *b1,
                     Echo_binding *b2,
                     int arg) {
    int result1, result2;
    do {
        async Echo_ping(b1, arg, &result1);
        async Echo_ping(b2, arg, &result2);
    } finish;
    return result1 + result2;
}

Step by step:

  • Execute to the “ping” call as normal

  • Message send transitions directly to the recipient process

  • Response returned to the caller

  • Caller proceeds through the second call


Example: cross-core RPC

static int total_rpc(Echo_binding *b1,
                     Echo_binding *b2,
                     int arg) {
    int result1, result2;
    do {
        async Echo_ping(b1, arg, &result1);
        async Echo_ping(b2, arg, &result2);
    } finish;
    return result1 + result2;
}

Step by step:

  • First call sends its message; this core is now idle

  • Resume after the async; send the second message

  • Resume after the async; block at finish until done

  • Continue once both responses have been received


Programming models for the Barrelfish multi-kernel operating system

Multi-core hardware & Barrelfish · Event-based model · Synchronous model · Supporting concurrency · Cancellation · Performance


Cancellation

  • Send a request on n channels

  • Receivers vote yes/no

  • Wait for votes one way or the other

static bool tally_votes(Vote_binding *b[], int n) {
    int yes = 0, no = 0;
    int winning = (n/2) + 1;
    do {
        for (int i = 0; i < n; i++) {
            async {
                bool v;
                Vote_get_vote(b[i], &v);
                if (v) { yes++; } else { no++; }
                if (yes >= winning || no >= winning) { ??? }
            }
        }
    } finish;
    return (yes > no);
}


Cancellation

static bool tally_votes(Vote_binding *b[], int n) {
    int yes = 0, no = 0;
    int winning = (n/2) + 1;
    do {
        for (int i = 0; i < n; i++) {
            async {
                bool v;
                if (Vote_get_vote_x(b[i], &v) != CANCELLED) {
                    if (v) { yes++; } else { no++; }
                    if (yes >= winning || no >= winning) cancel;
                }
            }
        }
    } finish;
    return (yes > no);
}

  • “_x” suffix marks this as a cancelable function

  • We block at finish waiting for all the async work to be done



Cancellation with deferred clean-up

Split RPC into explicit send/receive: distinguish two cancellation points

async {
    bool v;
    if (Echo_tx_get_vote_x(b[i]) == OK) {
        if (Echo_rx_get_vote_x(b[i], &v) == OK) {
            if (v) { yes++; } else { no++; }
            if (yes >= winning || no >= winning) cancel;
        } else {
            push_work(q, [=]{ Echo_tx_apologise(b[i]); });
        }
    }
}

C++ closure to apologise for lost votes

Push clean-up operation to queue if cancelled after tx, before rx
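
One plausible shape for that clean-up queue, sketched in plain C with a function/argument pair standing in for the C++ closure (all names here are illustrative, not Barrelfish's actual API):

#include <stdlib.h>

typedef struct work {
    void (*fn)(void *arg);          // deferred clean-up action
    void *arg;
    struct work *next;
} work_t;

typedef struct { work_t *head, *tail; } work_queue_t;

static void push_work(work_queue_t *q, void (*fn)(void *), void *arg) {
    work_t *w = malloc(sizeof(*w));
    w->fn = fn; w->arg = arg; w->next = NULL;
    if (q->tail) q->tail->next = w; else q->head = w;
    q->tail = w;
}

// Run after finish: every async has completed or been cancelled,
// so the deferred apologies can be sent safely.
static void drain_work(work_queue_t *q) {
    while (q->head) {
        work_t *w = q->head;
        q->head = w->next;
        if (!q->head) q->tail = NULL;
        w->fn(w->arg);
        free(w);
    }
}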


Programming models for the Barrelfish multi-kernel operating system

Multi-core hardware & Barrelfish · Event-based model · Synchronous model · Supporting concurrency · Cancellation · Performance



Implementation

[Diagram: a thread's stack holds finish-block (FB) records (current FB, enclosing FB, count, completion AWI); a per-thread run queue (first/last pointers) holds ready AWIs]

  • Techniques (sketched in code below):

    • Per-thread run queue of ready work items (“AWIs”)

    • Wait for next IO completion when queue empty

    • Book-keeping data all stack allocated

    • Perform work lazily when blocking: overhead of “async X()” versus “X()” is a couple of cycles
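
A minimal sketch of the dispatch idea (shapes assumed for illustration; wait_for_io_completion is a hypothetical stand-in for the real blocking primitive):

typedef struct awi {
    void (*run)(struct awi *self);      // continuation to resume
    struct awi *next;
} awi_t;

typedef struct {
    awi_t *first, *last;                // per-thread run queue of ready AWIs
} run_queue_t;

extern void wait_for_io_completion(run_queue_t *q);  // hypothetical: blocks, then enqueues ready AWIs

static void dispatch_loop(run_queue_t *q) {
    for (;;) {
        if (q->first) {
            awi_t *w = q->first;        // pop the next ready work item
            q->first = w->next;
            if (!q->first) q->last = NULL;
            w->run(w);                  // resume it
        } else {
            wait_for_io_completion(q);  // queue empty: wait for the next IO completion
        }
    }
}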


Performance

Ping-pong test, minimum-sized messages

  • AMD 4 * 4-core machine

  • Using cores sharing L3 cache

[Chart: ping-pong latency results]



Performance

  • “Do not fear async”

    • Think about correctness: if the callee doesn’t block then perf is basically unchanged



Software RAID0, 64KB transfers

2 * Intel X25-M SSDs, each 250MB/s

[Chart: throughput of the AC, Async, and Sync variants]



2-phase commit between cores

[Chart: performance of the AC, Async, and Sync variants]



Project status

  • Papers:

    • MMCS’08, HotOS’09, PLOS’09, SOSP’09, HotPar’10

  • Substantial prototype system:

    • 146kLOC (new), 488kLOC (including ported code)

  • Establishing collaborations around this research platform

    • More than 20,000 downloads of ETHZ source-code release

    • Intel: Early access to SCC, other test systems

    • Interest from other HW vendors: Tilera, ARM, . . .

    • Boston Uni, IBM Research: HPC work, BlueGene

    • KTH Sweden: Tilera port

    • Barcelona Supercomputing Center

