
Pillar



Presentation Transcript


  1. Pillar
     Jim Stichnoth, Programming Systems Lab, Intel

  2. Outline
     • Overview of the Pillar project
     • Details on the Pillar runtime
     Pillar - Jim Stichnoth - 2008-12-01

  3. Motivation
     • Implementing & tuning each new concurrent language is a lot of work!
     • Compiler
       • Find concurrency opportunities
       • Global optimizations
       • Register allocation, code scheduling, instruction selection
     • Runtime
       • Threading, synchronization, data parallelism, transactions, ...
       • Garbage collection, stack walking, exceptions, ...

  4. Pillar
     • Parallel Implementation Language
     • Low-level implementation language for high-level concurrent & managed languages
     • Reusable compiler & runtime infrastructure across high-level languages

  5. Pillar architecture (Very high level)
     [Diagram labels: Parallel program, Language compiler, Pillar program, Pillar compiler, Object code (compile-time tool chain); Language runtime, Pillar runtime (run-time executable). Legend: High-level language compiler, Pillar compiler, Pillar runtime.]

  6. The Pillar language
     • A set of C language extensions
       • Heavily influenced by C-- work at Harvard & Microsoft Research
       • Highly practical reason: reuse existing optimizing C compiler
     • Concurrency constructs
       • Thread creation, synchronization, data-parallel operations
     • Sequential constructs
       • Support managed languages, fix other shortcomings of C

  7. Concurrency constructs: Parallel call
     • pcall(aff) func(a, b, c);
     • Fork a new child thread & execute func, parent & child run concurrently
     • Affinity/locality hint via aff
     • Join can be implemented via shared synchronization object passed as a parameter

  8. Concurrency constructs: Parallel-ready sequential call
     • prscall(aff) func(a, b, c);
     • Parallel-Ready Sequential Call (Goldstein’96)
     • Semantics identical to pcall
     • Parent thread starts eagerly executing child func
     • An idle thread can introduce concurrency by stealing parent’s continuation
     • Optimized for sequential execution

  9. Concurrency constructs: Data parallelism
     • Intel’s Ct primitives
       • “C with throughput extensions”
       • Large set of nested data parallel primitives (a la NESL)
     • Compiler analysis & optimization
     • Future work

  10. Concurrency constructs: Bulk-spawn
      • Efficiently spawn a number of threads with similar arguments
      • Useful for data-parallel operations
      • Join at the end
      • Future work

  11. Concurrency constructs: Synchronization
      • Software transactions
      • Other common synchronization primitives

  12. Sequential constructs: Stack walking
      • Pillar runtime provides a frame-by-frame iterator over a thread’s stack
      • Spans
        • span KEY value { ... }
        • Associate metadata with a block of code
        • Look up metadata during stack walking
      • Garbage collection
        • No specific GC implementation (or object model) is provided or dictated
        • ref obj;
        • New ref type allows compiler to track GC references in stack frame
        • Optional parameters to ref declaration allow for arbitrary language-defined reference variants
        • E.g. interior pointers, weak references, pinned objects, etc.

  13. Sequential constructs: Second-class continuations
      • C’s setjmp/longjmp “done right”
        • Roughly speaking, continuation=setjmp and cut=longjmp
        • Directly in the language, not a library
        • A cut to the target continuation may pass arguments
        • Special source code annotations give the compiler extra control flow info
      • Syntax
          continuation k(a, b, c): ...
          foo(k); ...
          cut to k(x, y, z);
          foo() also cuts to k1, k2;
          foo() also unwinds to k3, k4;
          foo() never returns;
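
The "continuation=setjmp and cut=longjmp" analogy can be sketched in plain C. This is only an approximation under stated assumptions: the `continuation` struct, `cut_to`, `foo`, and `run` are hypothetical names invented for illustration, not Pillar syntax or Pillar runtime API. The argument buffer is declared `volatile` because standard C gives the compiler no control-flow information about `longjmp`, which is one of the gaps the slide says Pillar's annotations close.

```c
#include <assert.h>
#include <setjmp.h>

/* Hypothetical stand-in for a Pillar continuation: a jump target plus
 * buffer space for the arguments that a cut passes along. */
typedef struct {
    jmp_buf target;        /* where "cut to" resumes */
    volatile int args[3];  /* volatile: written between setjmp and longjmp */
} continuation;

/* Emulates "cut to k(x, y, z)": store the arguments, then jump.
 * This function never returns to its caller. */
static void cut_to(continuation *k, int x, int y, int z) {
    k->args[0] = x;
    k->args[1] = y;
    k->args[2] = z;
    longjmp(k->target, 1);
}

/* Plays the role of "foo(k) also cuts to k": it either returns
 * normally or leaves through the continuation. */
static int foo(continuation *k, int fail) {
    if (fail)
        cut_to(k, 1, 2, 3);
    return 0;
}

/* Returns 0 when foo returns normally, otherwise the sum of the
 * arguments delivered by the cut. */
int run(int fail) {
    continuation k;
    if (setjmp(k.target) != 0) {
        /* Plays the role of "continuation k(a, b, c):" */
        return k.args[0] + k.args[1] + k.args[2];
    }
    return foo(&k, fail);
}
```

Calling `run(0)` takes the ordinary return path, while `run(1)` resumes at the continuation with the three cut arguments.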

  14. Sequential constructs: Calls
      • Tail calls
        • tailcall foo();
        • Particularly for compiling functional languages
      • Managed/unmanaged calls
        • Unmanaged (legacy) code uses calling conventions like __cdecl, __stdcall, etc.
        • Managed (Pillar) functions implicitly add the managed attribute
        • Compiler recognizes mismatches, redirects through Pillar runtime routine
        • Allows stack unwinding past sections of unmanaged frames
        • Example:
            #pragma managed(off)
            #include <stdio.h>
            #pragma managed(on)

  15. Composable cuts
      • Cuts compose poorly with some operations
      • Example: cutting out of a transaction
        • A calls B
        • B starts a transaction
        • B calls C transactionally
        • C cuts back into A
        • Transaction was not ended!
      • Many other examples
      • Pillar’s solution: composable cuts
      • See LCPC’2007 paper for more details
      • Code on slide:
          function A():
            ... B(k); ...
            continuation k: ...
          function B(k):
            ... txn_begin(); C(k); txn_end(); ...
          function C(k):
            ... cut to k; ...

  16. Pillar compiler
      • Modification of Intel’s product compiler
      • Continuations: model additional control-flow edges, killing of callee-save registers during a cut
      • Recognize managed/unmanaged calls
      • GC support: track GC references through all compiler phases
      • Stack unwinding metadata: frame-by-frame unwinding, spans, GC roots
      • Implement Pillar runtime API to decode metadata at run time

  17. Pillar runtime
      • Implements key Pillar services
        • Parallel calls, prscall continuation stealing, futures
        • Stack walking, root set enumeration
        • Composable cuts
      • Invokes Pillar compiler’s metadata decoder as necessary
      • Built on top of McRT (Intel’s “Many Core Run Time”)
        • Provides core services such as user-level threads, scheduling, synchronization, software transactional memory
      • Approximately 7,000 lines of C code
      • API is architecture-neutral except for machine word size & stack iterator’s set of registers

  18. Pillar architecture
      [Diagram labels: Parallel program, Language converter, Pillar program, Pillar compiler, Metadata decoder, Object code & metadata (compile-time tool chain); Language runtime, Pillar runtime, GC interface, Garbage collector, McRT (run-time executable). Legend: High-level language compiler, Pillar compiler, Pillar runtime.]

  19. High-level languages
      • Java
        • Main motivation: throw huge volume of Pillar code at the Pillar compiler and runtime
        • Exercises stack iteration, spans, GC support, second-class continuations, managed/unmanaged calls
      • X10
        • Leverages IBM’s Java-based open-source reference implementation
        • Hard to study performance/scalability using reference implementation!
      • Concurrent functional language
        • Lots of concurrency due to limitations on side effects & dependencies
        • Exercises GC support, second-class continuations, tail calls
        • Implements a futures package using pcall

  20. Outline
      • Overview of the Pillar project
      • Gory details of the Pillar runtime & architecture

  21. Stack walking
      • Pillar runtime interface for iterating over a thread’s stack frames
        • Get youngest frame, get next frame, test for last frame
        • Getting next frame means simulating a function return
      • Access metadata associated with a stack frame
        • Look up span metadata
        • Enumerate the root set to the GC
      • How?
        • Code generator registers callback functions with Pillar runtime
          • Perform operations like: unwind one frame, look up span metadata
          • Associated with a code address range
        • Thus code generator defines its own flexible metadata format
        • Or, stack walking can be sped up by using a standard metadata format
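
The callback scheme on this slide might look roughly like the following sketch, which runs against a simulated stack rather than a real one. All names here (`frame`, `code_info`, `register_code_info`, `next_frame`, `lookup_span`, and the `toy_*` callbacks) are assumptions made for illustration and are not the Pillar runtime API. The toy "code generator" uses a made-up frame layout of two words with the return address first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A simulated frame: current code address plus a pointer into a fake stack. */
typedef struct {
    uintptr_t ip;
    const uintptr_t *sp;
} frame;

/* Callbacks a code generator would supply for its own metadata format. */
typedef int (*unwind_fn)(frame *f);  /* 1 = moved to caller, 0 = last frame */
typedef const char *(*span_fn)(const frame *f, int key);

typedef struct {
    uintptr_t lo, hi;  /* code address range [lo, hi) */
    unwind_fn unwind;
    span_fn span;
} code_info;

#define MAX_REGIONS 8
static code_info regions[MAX_REGIONS];
static size_t n_regions;

/* Registers callbacks for one code address range; returns 0 if the table is full. */
int register_code_info(uintptr_t lo, uintptr_t hi, unwind_fn u, span_fn s) {
    if (n_regions == MAX_REGIONS)
        return 0;
    regions[n_regions++] = (code_info){ lo, hi, u, s };
    return 1;
}

static const code_info *find_region(uintptr_t ip) {
    for (size_t i = 0; i < n_regions; i++)
        if (ip >= regions[i].lo && ip < regions[i].hi)
            return &regions[i];
    return NULL;
}

/* "Get next frame": dispatch to whoever generated the code at f->ip. */
int next_frame(frame *f) {
    const code_info *ci = find_region(f->ip);
    return ci ? ci->unwind(f) : 0;
}

const char *lookup_span(const frame *f, int key) {
    const code_info *ci = find_region(f->ip);
    return ci ? ci->span(f, key) : NULL;
}

/* Toy code generator: each frame is two words, return address first;
 * a return address of 0 marks the last frame. */
static int toy_unwind(frame *f) {
    uintptr_t ret = f->sp[0];
    if (ret == 0)
        return 0;
    f->ip = ret;   /* simulate the function return */
    f->sp += 2;
    return 1;
}

/* Toy span metadata: key 1 is defined only for addresses in [0x1100, 0x1180). */
static const char *toy_span(const frame *f, int key) {
    return (key == 1 && f->ip >= 0x1100 && f->ip < 0x1180) ? "handler-A" : NULL;
}

/* Walks a three-frame fake stack; returns frames * 10 + frames with span key 1. */
int demo_walk(void) {
    static const uintptr_t fake_stack[] = { 0x1100, 0, 0x1200, 0, 0, 0 };
    frame f = { 0x1000, fake_stack };
    int frames = 0, spans = 0;
    n_regions = 0;
    register_code_info(0x1000, 0x2000, toy_unwind, toy_span);
    do {
        frames++;
        if (lookup_span(&f, 1) != NULL)
            spans++;
    } while (next_frame(&f));
    return frames * 10 + spans;
}
```

Because the runtime only dispatches on the code address, each code generator stays free to choose its own metadata encoding, which is the flexibility the slide points out.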

  22. Garbage collection support
      • Pillar language & runtime do not dictate an object model
        • This is a contract between the Pillar program (generated from the high-level language) and the GC implementation (provided with the HLL)
        • Minimal language/runtime support allows highly flexible range of GC implementations
      • Generalized references
        • ref(TAG, parameter) r;
        • Predefined tags
          • PrtGcTagDefault: r is the canonical object pointer
          • PrtGcTagBase: r is a (possibly interior) pointer with respect to parameter
          • PrtGcTagOffset: r is an interior pointer at a given offset from the canonical base
        • Other user-defined tags, e.g. weak roots, pinned, etc.
      • Predefined tags correspond to traditional compiler optimizations
      • Compiler metadata allows tag & parameter to be passed directly to the GC
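
As a rough illustration of how a collector could interpret the three predefined tags, the sketch below recovers the canonical base for each kind of root and then relocates the root when the object moves. The `gc_root` layout, the way the tag's parameter is encoded (`base_slot` versus `offset`), and the function names are assumptions for this example only; the slides do not specify how Pillar actually passes the parameter to the GC.

```c
#include <assert.h>
#include <stddef.h>

/* Modeled loosely on PrtGcTagDefault / PrtGcTagBase / PrtGcTagOffset. */
typedef enum { TAG_DEFAULT, TAG_BASE, TAG_OFFSET } gc_tag;

/* One enumerated root. How the tag's parameter is represented is an
 * assumption: a pointer to the base reference for TAG_BASE, a byte
 * offset for TAG_OFFSET. */
typedef struct {
    void **slot;       /* address of the reference r in the frame */
    gc_tag tag;
    void **base_slot;  /* TAG_BASE: slot holding the canonical pointer */
    ptrdiff_t offset;  /* TAG_OFFSET: distance of r from the base */
} gc_root;

/* Returns the canonical object pointer that root r keeps alive. */
void *object_base(const gc_root *r) {
    switch (r->tag) {
    case TAG_BASE:   return *r->base_slot;
    case TAG_OFFSET: return (char *)*r->slot - r->offset;
    default:         return *r->slot;
    }
}

/* Rewrites the root after its object moved, preserving the interior offset. */
void relocate_root(gc_root *r, void *old_base, void *new_base) {
    ptrdiff_t interior = (char *)*r->slot - (char *)old_base;
    *r->slot = (char *)new_base + interior;
}

/* Moves one 16-byte "object" and checks that all three roots follow it.
 * Returns 1 on success. */
int demo_roots(void) {
    static char from_space[16], to_space[16];
    void *obj = from_space, *elem = from_space + 4, *field = from_space + 8;
    gc_root roots[3] = {
        { &obj,   TAG_DEFAULT, NULL, 0 },
        { &elem,  TAG_BASE,    &obj, 0 },
        { &field, TAG_OFFSET,  NULL, 8 },
    };
    void *bases[3];
    /* Resolve every base first, so TAG_BASE still sees the old obj value. */
    for (int i = 0; i < 3; i++)
        bases[i] = object_base(&roots[i]);
    for (int i = 0; i < 3; i++)
        relocate_root(&roots[i], bases[i], to_space);
    return bases[0] == (void *)from_space && bases[1] == (void *)from_space &&
           bases[2] == (void *)from_space && obj == (void *)to_space &&
           elem == (void *)(to_space + 4) && field == (void *)(to_space + 8);
}
```

Resolving all bases before relocating anything avoids an ordering hazard: a `TAG_BASE` root reads another slot that the collector is also about to rewrite.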

  23. Implementing exceptions
      • Iterate frame-by-frame using the stack unwinding interface
      • Use the span lookup interface to get handler metadata
        • Decide whether current frame handles the exception
      • Use “also unwinds to” metadata to get handler
        • Annotation on a function call (ordered list of continuations)
        • Causes compiler to produce metadata allowing lookup & instantiation of continuations during stack unwinding
      • Use the “cut to” mechanism to transfer control to the handler

  24. Cuts & continuations
      • A continuation is like a C jmp_buf, allowing a unit-time cut back to somewhere in the continuation’s stack frame
        • Also allows arguments to be passed back
      • Structure in Pillar:
        • Code address (initialized lazily)
        • Optional argument buffer space
      • Cut operation is simple
        • Copy arguments into buffer space
        • Load continuation address into predetermined register
        • Jump to code address
      • Continuation code needs to fix up stack frame based on predetermined register value

  25. Composable cuts
      • Each thread maintains lightweight virtual stack alongside regular stack
        • Virtual stack head (VSH) in TLS
        • A virtual stack element (VSE) defines a destructor operation
        • The application pushes/pops VSEs in a balanced fashion
      • When instantiating a continuation, also capture current VSH
        • Fat continuation contains code address, VSH, and arguments
      • During a cut operation, intervening destructors are executed
      [Diagram labels: virtual stack head; stack grows]
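
The virtual-stack mechanism can be approximated in portable C with `setjmp`/`longjmp` standing in for continuations and cuts. This is a sketch under assumptions: `vse`, `fat_continuation`, `vse_push`, `vse_pop`, and `composable_cut` are hypothetical names, the virtual stack head is a plain global rather than a TLS field, and the "transaction" is just a counter. It replays the A/B/C transaction example, with the destructor closing the transaction that a bare cut would have left open.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

/* Virtual stack element: a destructor run when a cut skips over it. */
typedef struct vse {
    struct vse *next;
    void (*destroy)(struct vse *self);
} vse;

/* Virtual stack head; a TLS field in Pillar, a plain global in this sketch. */
static vse *vsh;

/* "Fat" continuation: jump target plus the VSH captured at instantiation. */
typedef struct {
    jmp_buf target;
    vse *vsh;
} fat_continuation;

void vse_push(vse *e, void (*destroy)(vse *)) {
    e->destroy = destroy;
    e->next = vsh;
    vsh = e;
}

void vse_pop(void) {
    assert(vsh != NULL);
    vsh = vsh->next;
}

/* Runs every destructor pushed since k was instantiated, then jumps. */
void composable_cut(fat_continuation *k) {
    while (vsh != k->vsh) {
        vse *e = vsh;
        vsh = e->next;
        e->destroy(e);
    }
    longjmp(k->target, 1);
}

/* Toy transaction bookkeeping for the A/B/C example. */
static int open_txns;
static void txn_abort(vse *self) { (void)self; open_txns--; }

static void C(fat_continuation *k) { composable_cut(k); }

static void B(fat_continuation *k) {
    vse txn;
    open_txns++;                /* txn_begin() */
    vse_push(&txn, txn_abort);
    C(k);                       /* cuts past the code below */
    vse_pop();
    open_txns--;                /* txn_end() */
}

/* Returns 100 + open transactions if control arrived via the cut,
 * or just the number of open transactions on a normal return. */
int A(void) {
    fat_continuation k;
    k.vsh = vsh;
    if (setjmp(k.target) != 0)
        return 100 + open_txns; /* continuation k */
    B(&k);
    return open_txns;
}
```

With a bare `longjmp` in `C`, `A` would observe one transaction still open; the destructor run by `composable_cut` brings the count back to zero before control reaches the continuation.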

  26. Cuts & continuations example
        void foo(args) {
          int a;
          ref b;
          …
          bar(k1) also cuts to k1;
          …
          continuation k1(a, b):
          …
          continuation k2(b, a):
          …
        }
      • Instantiated k1 within the method forces preserved registers to be saved in prolog
      • Stack space allocated for continuations & locals
      • Instantiating k1 initializes eip & vsh fields
      • Eventually some method cuts to k1
      • Run any destructors based on TLS.vsh & k1.vsh (see earlier example)
      • Continuation prolog adjusts esp, copies args
      • Now ready to resume in k1’s code
      [Animated stack diagram: the stack grows toward low memory addresses, with esp at the low end. Frame contents listed on the slide: in-args, ret IP, saved registers (ebp, ebx, esi, edi), locals a and b, continuations k1 and k2 (fields vsh, eip, and argument slots a’, b’). Legend: uninitialized stack space, initialized stack space.]

  27. Virtual stack notes
      • VSEs can be used as markers as well as destructors
        • E.g., the location of a prscall or managed-to-unmanaged transition
        • Markers have a trivial destructor
      • Pillar runtime interface for iterating over VSEs
      • A VSE may contain GC roots
        • Register an enumeration function for a VSE type

  28. Stack limit check
      • Prscall continuation stealing results in two threads sharing one stack
        • Parent uses bottom part, child uses top part
        • With a long & dense enough chain of prscalls, the thread stack can become arbitrarily small
      • Therefore, each function prolog must begin with a limit check and conditional stack extension sequence
        • Special tailcall sequence to a stack-extension runtime routine that re-invokes function with a fresh stack
      • Observation: Every function begins with a yield check and a stack limit check. Can they be combined?
        • Suspending a thread installs a special stack limit value such that limit checks always fail
        • Explicit prolog yield check can be removed
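
One way to picture the combined check is the sketch below, where a single comparison in the prolog serves both purposes. The `tls_t` fields, `prolog_check`, `request_suspend`, and the result enum are hypothetical, the stack is assumed to grow toward lower addresses, and the sketch is single-threaded: a real runtime would need atomic accesses and would actually block the thread in the slow path.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-thread state; field names are assumptions. */
typedef struct {
    uintptr_t stack_limit;  /* value every prolog compares against */
    uintptr_t real_limit;   /* true lowest usable stack address */
    int yield_requested;    /* set by whoever wants to suspend this thread */
} tls_t;

typedef enum { PROLOG_OK, PROLOG_YIELDED, PROLOG_EXTEND } prolog_result;

/* Suspension request: poison the limit so the next prolog check fails. */
void request_suspend(tls_t *tls) {
    tls->yield_requested = 1;
    tls->stack_limit = UINTPTR_MAX;
}

/* Combined prolog check. The fast path is one comparison; the yield
 * flag is only examined after that comparison has failed. */
prolog_result prolog_check(tls_t *tls, uintptr_t sp, uintptr_t frame_size) {
    assert(sp >= frame_size);
    if (sp - frame_size > tls->stack_limit)
        return PROLOG_OK;                    /* fast path */
    if (tls->yield_requested) {
        /* A real runtime would block here until the thread is resumed. */
        tls->yield_requested = 0;
        tls->stack_limit = tls->real_limit;  /* restore the real limit */
        if (sp - frame_size > tls->real_limit)
            return PROLOG_YIELDED;
    }
    return PROLOG_EXTEND;  /* genuinely out of stack: extend and re-invoke */
}
```

Since `sp - frame_size > UINTPTR_MAX` can never hold, a suspended thread always falls into the slow path at its next function entry, so no separate yield test is needed on the fast path.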

  29. Managed/unmanaged calls
      • When calling into unmanaged (legacy) code, it’s no longer possible to reliably walk the stack frame-by-frame
      • Solution: push a managed-to-unmanaged (M2U) VSE before calling
        • Record all relevant context in M2U VSE
      • When unwinding from an unmanaged frame, search the virtual stack for the topmost M2U VSE
        • Restore context from VSE
        • Resume unwinding managed frames

  30. Thread-local storage
      • Pillar needs several TLS fields
        • Current stack limit value
        • Yield semaphore
        • Language-specific TLS pointer
          • E.g., nursery parameters for fast allocation
        • Virtual stack head
        • Etc.
      • TLS accesses tend to be very frequent
        • Therefore, a callee-save register (ebx) is reserved to hold TLS pointer within managed code
        • Substantial performance gain despite loss of register
      • In a pcall/prscall, child inherits parent’s language-specific TLS pointer
        • Child may want to override

  31. Cooperative preemption
      • Only a certain subset of instructions are GC-safe
        • I.e., the root set can be accurately determined
        • Compiler typically chooses the function entry and call sites as GC safepoints
      • Compiler generates calls to prtYield() at GC safepoints
        • Fast-path: check whether the TLS yield semaphore field is set
        • Can be inlined by compiler
      • Pillar runtime provides a suspend/resume interface
        • And an interface for iterating over threads

  32. McRT
      • McRT = Many-core RunTime
        • Internal platform for concurrency research
      • Features include:
        • Thread creation & scheduling
        • Large set of synchronization primitives
        • Scalable malloc/free
        • Software transactional memory
      • Pillar requires a few enhancements
        • Pillar-provided thread-id for synchronization
          • Maintain appearance of separate threads for prscall parent & child
        • Allows blocked threads to unblock to respond to suspend requests
        • Thread enumeration, in the presence of thread creation and dying
        • Ability to enumerate GC roots of “unborn” threads
        • Idle-wait function that can trigger prscall continuation stealing

  33. Private nurseries
      • Observation: High allocation rate of short-lived objects kills scalability
        • Some combination of memory & cache coherence traffic
        • The more we improved sequential performance, the worse scalability became!
      • Solution: Allocate from a thread-local “private nursery”
        • Invariant: No heap objects outside the private nursery point into the private nursery
        • If an update of an object field would break the invariant, do a private nursery collection
          • Move all live objects from the private nursery to the regular heap
          • Reset the private nursery
        • Doesn’t require stopping any other threads
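
The invariant and the write barrier that protects it can be sketched with a toy object model. Everything below is an assumption for illustration: the one-field `obj` layout, the fixed-size `nursery` and `heap` arrays, the explicit array of root slots, and the names `store_ref` and `collect_nursery` do not come from the slides, and a real collector would discover roots by walking the stack.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy object: one reference field, a forwarding pointer, and a payload. */
typedef struct obj {
    struct obj *ref;
    struct obj *forward;  /* set once the object has been evacuated */
    int value;
} obj;

#define NURSERY_OBJS 16
#define HEAP_OBJS 64
static obj nursery[NURSERY_OBJS];  /* thread-local in a real system */
static obj heap[HEAP_OBJS];
static size_t nursery_top, heap_top;
static unsigned collections;

static int in_nursery(const obj *o) {
    uintptr_t p = (uintptr_t)o;
    return p >= (uintptr_t)nursery && p < (uintptr_t)(nursery + NURSERY_OBJS);
}

static obj *new_obj(obj *space, size_t *top, size_t cap, int value) {
    assert(*top < cap);
    obj *o = &space[(*top)++];
    o->ref = NULL;
    o->forward = NULL;
    o->value = value;
    return o;
}

/* Copies o, and everything reachable from it, out of the nursery. */
static obj *evacuate(obj *o) {
    if (o == NULL || !in_nursery(o))
        return o;
    if (o->forward == NULL) {
        obj *copy = new_obj(heap, &heap_top, HEAP_OBJS, o->value);
        o->forward = copy;             /* set first, so cycles terminate */
        copy->ref = evacuate(o->ref);
    }
    return o->forward;
}

/* Private nursery collection: only this thread's roots are needed,
 * because no outside heap object points into the nursery. */
void collect_nursery(obj **slots[], size_t n) {
    for (size_t i = 0; i < n; i++)
        *slots[i] = evacuate(*slots[i]);
    nursery_top = 0;                   /* reset the nursery */
    collections++;
}

/* Write barrier: a heap -> nursery store would break the invariant,
 * so collect first and store the relocated pointer instead. */
void store_ref(obj *target, obj *value, obj **slots[], size_t n) {
    if (!in_nursery(target) && in_nursery(value)) {
        value = evacuate(value);
        collect_nursery(slots, n);
    }
    target->ref = value;
}

/* Returns 1 if the barrier behaved as described. */
int demo_nursery(void) {
    nursery_top = heap_top = 0;
    collections = 0;
    obj *shared = new_obj(heap, &heap_top, HEAP_OBJS, 0);
    obj *a = new_obj(nursery, &nursery_top, NURSERY_OBJS, 1);
    obj *b = new_obj(nursery, &nursery_top, NURSERY_OBJS, 2);
    obj **slots[] = { &a, &b };
    store_ref(a, b, slots, 2);         /* nursery -> nursery: no collection */
    if (collections != 0)
        return 0;
    store_ref(shared, a, slots, 2);    /* heap -> nursery: collects first */
    return collections == 1 && nursery_top == 0 &&
           !in_nursery(shared->ref) && shared->ref == a &&
           a->value == 1 && a->ref == b && b->value == 2 && !in_nursery(b);
}
```

The collection touches only the current thread's roots and nursery, which is why, as the slide notes, no other thread has to be stopped.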

  34. Private nurseries
      • Problem: Finding roots in deep stacks can be expensive
        • Observation: Deeper portions of stack tend to remain unchanged
        • Solution: “high-water marks” on stack
          • Each stack frame contains a high-water mark
          • Mark is cleared upon function entry
          • Stack walking interface allows mark to be set, and status to be queried
          • Stack walk for a private nursery collection can terminate early when a marked frame is found
      • Problem: “Unstable” performance with private nurseries
        • Overly frequent collections can kill performance
        • Hard to predict from static analysis of code
        • Experimenting with more general forms of escape analysis to reduce private nursery collections
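
The high-water-mark idea can be simulated with an array of frames. This sketch makes a conservative assumption that the slides do not spell out: the first marked frame encountered is still scanned (it may have kept running since the last collection) and the walk stops only before that frame's callers. The names `sim_frame`, `enter_function`, `leave_function`, and `scan_for_nursery_roots` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simulated frame: only the high-water mark matters for this sketch. */
typedef struct {
    int marked;
} sim_frame;

#define MAX_DEPTH 32
static sim_frame frames[MAX_DEPTH];
static size_t depth;  /* frames[depth - 1] is the youngest frame */

void reset_stack(void) { depth = 0; }

/* Function entry clears the mark of the newly created frame. */
void enter_function(void) {
    assert(depth < MAX_DEPTH);
    frames[depth].marked = 0;
    depth++;
}

void leave_function(void) {
    assert(depth > 0);
    depth--;
}

/* Walks from the youngest frame toward the oldest, marking each frame it
 * scans. Assumption: the first already-marked frame is scanned too, and
 * the walk stops there because its callers have been suspended, and
 * therefore unchanged, since the previous scan.
 * Returns the number of frames scanned. */
size_t scan_for_nursery_roots(void) {
    size_t visited = 0;
    for (size_t i = depth; i-- > 0; ) {
        int was_marked = frames[i].marked;
        /* A real collector would enumerate this frame's roots here. */
        visited++;
        frames[i].marked = 1;
        if (was_marked)
            break;
    }
    return visited;
}
```

On a five-frame stack the first scan visits all five frames; after two more calls the next scan visits only three, so the cost tracks the stack activity since the last collection rather than the total depth.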

  35. Prscall
      • Leave a special prscall frame/VSE on the stack
        • Flag indicates whether prscall continuation has been stolen
        • Provide some extra space for stolen continuation to expand into
      • Create new thread ID for child
      • Call child function
      • VSE destructor prevents cutting into “different” thread
      • Idle processor/thread suspends threads, looking for prscall
        • May be beneficial to steal deepest prscall
        • Set the continuation-stolen flag
        • Split the stack between parent and child
          • Including the virtual stack
        • Force a private-nursery collection
        • Cut to a continuation found within VSE, which returns to caller
      • Child returns to prscall frame, finds continuation-stolen flag set, exits

  36. Prscall challenges
      • Expected benefits
        • Dynamic load balancing
        • No locking in the common case
        • Auxiliary storage managed on stack, not heap
      • Difficulties/drawbacks
        • Stack limit check on every function entry
          • Inlining reduces function calls; combining with yield check helps
        • Possible stack extension/retraction hysteresis after stealing
        • What’s the best policy for where to steal?
        • Stealing also has to do a private nursery collection
          • Without high-water mark optimization
        • Finding the right granularity of concurrency

  37. Concurrent functional language
      • Working with a game company to design & implement the language
        • Novel type system
        • Functional style restricts dependencies, eases parallelization
      • Compilation/execution strategy:
        • Create thunks/closures to be evaluated
        • Compiler optimizations reduce number of thunks to evaluate, objects to allocate
        • Some (or all!) thunks can be spawned as futures
        • Vectorization for Larrabee
      • Pure C path in addition to Pillar
        • Boehm-Demers-Weiser conservative collector
          • Performance & scalability problems
        • Setjmp/longjmp

  38. Future research
      • Affinity
        • Automatic (& semi-automatic) means of scheduling threads near their data
      • Transactional memory
        • Find the right division of work between Pillar and high-level language
        • Interactions between transactions, thread creation, & cuts/exceptions
      • Bulk spawns & vectorization for data parallelism
      • Other task parallel models
