ParalleX: Future of Scalable Parallel Execution Models

A Presentation to CGCSC 2006: ParalleX: Towards a New Parallel Execution Model for Scalable Programming and Architectures Thomas Sterling Louisiana State University California Institute of Technology Oak Ridge National Laboratory September 11, 2006

Parallel Computing Challenges • Ease of programming • Representation of application algorithms • Control of resource management • Efficiency of operation • Overhead, latency, contention (bandwidth), starvation • Scalability – strict and weak • Tracking technology advances • Relative costs & sizes, balance of feeds & speeds, physical limits • Exposure & Exploitation of parallelism • Reduce execution time • Increase throughput • Hide latencies and overlap overheads • Symbolic computing, e.g. directed graphs

Observations on Historical Trends • Multiple Technology Generations • 5 generations of main memory • 6 generations of processor logic • Moore’s Law • DRAM • 4X every 3 years • Top-500 • Stretchable Linpack benchmark • ~100X per decade • Traverses multiple architecture classes • Architecture innovation tracks technology opportunity • Exploits technology strengths while mitigating weaknesses • Adopts alternative models of computation to govern execution • Technology has progressed > 2 orders of magnitude (density) while execution model has not changed in 15 years

Attributes of an Execution Model • Specifies referents, their interrelationships, and actions that can be performed on them • Defines semantics of state objects, functions, parallel flow control, and distributed interactions • Leaves unbound policies of implementation technology, structure, and mechanism • Is NOT a programming model, architecture, or virtual machine • Enables reasoning about the decision chain for resource scheduling • Conceptual framework for considering design decisions for programming languages, compilers, runtime, OS, and hardware architectures

Towards A New Model of Parallel Computation • Serve as a discipline to govern future scalable system architectures, programming methods, and runtime/OS • Address Dominant Challenges to Enable High Efficiency • Latency • Overhead • Starvation • Resource contention • Programmability • Originated with ANL DOE Advanced Programming project • Driver for SNL/UNM DOE Fast-OS Config-OS project • Provide alternative model to conventional CSP paradigm • Target for composability methodology and lightweight kernels • Test for dynamic adaptive runtime components • Explore system architecture and support mechanisms under NSF • Potentially divisive, disruptive, and ignored • Borrows from a lot of good stuff

Prior Projects that Influenced ParalleX • Beowulf • NASA, GSFC & JPL • Continuum Computing Architecture • DARPA, Caltech • HTMT • NSF/DARPA/NSA/NASA, Caltech & JPL • DIVA • DARPA, USC ISI • Gilgamesh • NASA, JPL • Percolation • NSF, U. of Delaware • Advanced Programming Models • DOE, ANL • Cascade • DARPA, Cray • Config-OS • DOE, SNL & UNM

Latency Hiding with Parcels • [Figure: idle time with respect to degree of parallelism]

LSU Research Strategy • Devise a new model of computation • Liberates current and future generation technologies • Empowers computer architecture beyond incrementalism • Specification with intermediate form syntax • Develop early reference implementations for near term application experiments and architecture studies • Derive resource/action decision chain strategy within context of new model to establish responsibilities of every level from application to hardware • Explore implications for programming models and compilers, runtime and operating systems, and parallel architecture design • Apply to conventional systems for near term improvements in non-disruptive market context • Parcel/threads FPGA accelerator

ParalleX Model’s Major Features • Specifies semantics, not implementation policies • Does not assume particular architecture • Permits different hardware/software implementations • Lends itself to innovative optimization and heterogeneity • Not a language • Provides an intermediate form (PXI) • Intrinsically latency hiding • Highly asynchronous • Exposes wide variety of forms of parallelism • Degrees of granularity, including fine grain • Different types of parallel action • Separation of virtual logical space from physical execution and storage space • Towards low power, multi-core, PIM, fault tolerance

A New Synthesis of Selected Concepts • Split-phase transaction • Telephone switching and database transaction processing • Dennis/Arvind with dataflow • Yelick/Culler with Split-C • Message-driven • Dennis using Tokens in dataflow • Dally in his J-machine • Hewitt with Actors model • The whole object oriented thing • Multi-threaded • Smith with HEP & MTA • Unix Pthreads

More Synthesis • Distributed Shared Memory • Scott with T3D • BBN with Butterfly machines • Futures • Hewitt with Actors • Halstead’s Multilisp • Arvind’s I-structures • Machines like • MTA • J-machine • Percolation • Sterling & Gao in HTMT • Babb’s coarse-grain dataflow • Salmon and Warren in their N-body tree code

Yet more Synthesis • Lightweight control objects • Dennis’s dataflow templates • Struct processing from MIND • One sided • Gropp et al with MPI-2 • Carlson with UPC • In-memory synchronization • Smith & Callahan with empty/full bits • Copy semantics • Variation on Gao’s location consistency model

Localities • A “locality” is a contiguous physical domain • Guarantees compound atomic operations on local state • Manages intra-locality latencies • Exposes diverse temporal locality attributes • Divides the world into synchronous and asynchronous • System comprises a set of mutually exclusive, collectively exhaustive localities • A first class object • An attribute of other objects • Heterogeneous • Specific inalienable properties

Split Phase Transactions • A transaction is a set of interdependent actions on exchanged values • Transactions are divided between successive phases • All actions of a transaction phase are relatively local • Assigned to a given execution element • Operations perform on local state for low latency • Phases are divided at stages of remote access or service request • Thus, asynchronous phasing at split
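The phase structure described on this slide can be sketched in a few lines of Python. This is only an illustrative toy, not ParalleX or PXIF code: the names `REMOTE`, `transaction`, and `run` are hypothetical, a generator stands in for a transaction, and each `yield` marks the split where a remote access ends the current phase.

```python
from collections import deque

# Hypothetical remote store; stands in for state held in another locality.
REMOTE = {"a": 3, "b": 4}

def transaction():
    # Phase 1: purely local work, ends at the first remote access.
    local_sum = 1 + 1
    a = yield ("read", "a")   # split: the request leaves and the phase ends
    # Phase 2: resumes only once the reply for "a" is available.
    b = yield ("read", "b")   # second split
    # Phase 3: purely local completion.
    return local_sum + a * b

def run(transactions):
    """Round-robin driver: a suspended transaction is requeued behind other work."""
    ready = deque((t, None) for t in transactions)
    results = []
    while ready:
        txn, reply = ready.popleft()
        try:
            op, key = txn.send(reply)      # run one phase up to the next split
        except StopIteration as done:
            results.append(done.value)     # transaction finished
            continue
        ready.append((txn, REMOTE[key]))   # service the request, requeue continuation
    return results

print(run([transaction(), transaction()]))
```

In this sketch the two transactions interleave phase by phase, so neither one holds the execution element while its remote request is outstanding.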

Multi-Grain Multithreading • Threads are collections of related operations that perform on locally shared data • A thread is a continuation combined with a local environment • Modifies local named data state and temporaries • Updates intra thread and inter thread control state • Does not assume sequential execution • Other flow control for intra-thread operations possible • Thread can realize transaction phase • Thread does not assume dedicated execution resources • Thread is first class object identified in global name space • Thread is ephemeral

Parcels • Enables message-driven computation • Messages that specify function to be performed on a named element • Moves work and data between objects in different localities • Parcels are not first-class objects • Exists in the world of “parcel sets” • First-class objects • Transfer between parcel sets is atomic, invariant, and unobservable • Major semantic content • Destination object • Action to be performed on targeted object • Operands for function to be performed • Continuation specifier
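The four semantic fields listed on this slide can be pictured as a small record type. The sketch below is an assumption-laden illustration, not the ParalleX parcel format: `Parcel`, `Locality`, `deliver`, and the `add` and `store` actions are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple

@dataclass(frozen=True)
class Parcel:
    destination: str                      # global name of the target object
    action: str                           # action to perform on the target
    operands: Tuple[Any, ...] = ()        # operands for that action
    continuation: Optional[str] = None    # hypothetical: name that receives the result

class Locality:
    """Toy locality holding named objects and a table of actions."""
    def __init__(self, objects):
        self.objects = dict(objects)
        self.actions = {
            "add": lambda value, x: value + x,
            "store": lambda value, x: x,
        }

    def deliver(self, parcel):
        """Apply the action at the destination; return a follow-on parcel if any."""
        current = self.objects[parcel.destination]
        result = self.actions[parcel.action](current, *parcel.operands)
        self.objects[parcel.destination] = result
        if parcel.continuation is None:
            return None
        # The continuation is modeled here as a parcel that stores the result.
        return Parcel(parcel.continuation, "store", (result,))

loc = Locality({"counter": 10, "out": None})
follow_on = loc.deliver(Parcel("counter", "add", (5,), continuation="out"))
loc.deliver(follow_on)
print(loc.objects)
```

Modeling the continuation as another parcel is one possible reading of "continuation specifier"; the slide does not say how continuations are encoded.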

Percolation Pre-Staging • An important latency hiding and scheduling technique • Overhead functions are not necessarily done optimally by high speed processors • Moves data and task specification to local temporary storage of an execution element by external means • Minimum overhead at execution site • Almost no remote accesses • Cycle: dispatch/prestage/execute/commit/control update • High speed execution element operates on work queue • Processors are dumb, memory is smart • Good for accelerators, functional elements, precious resources
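The dispatch/prestage/execute/commit cycle on this slide can be sketched as follows. This is a simplified single-process illustration under stated assumptions: `GLOBAL_MEMORY`, `prestage`, `execute_all`, and `percolate` are hypothetical names, and summing a list stands in for real work.

```python
from collections import deque

# Hypothetical remote data; stands in for memory outside the execution element.
GLOBAL_MEMORY = {"x": [1, 2, 3], "y": [10, 20, 30]}

def prestage(task_names, work_queue):
    """Done by 'external means': copy operands next to the execution element."""
    for name in task_names:
        work_queue.append((name, list(GLOBAL_MEMORY[name])))  # local copy

def execute_all(work_queue):
    """Execution element: drains its work queue and touches only staged data."""
    results = {}
    while work_queue:
        name, local_data = work_queue.popleft()
        results[name] = sum(local_data)
    return results

def percolate(task_names):
    queue = deque()
    prestage(task_names, queue)       # dispatch + prestage
    results = execute_all(queue)      # execute, with no remote accesses
    # commit: write results back to global memory in one step
    GLOBAL_MEMORY.update({f"{name}_sum": value for name, value in results.items()})
    return results

print(percolate(["x", "y"]))
```

The point of the split is that `execute_all` never reads `GLOBAL_MEMORY`; all overhead of fetching and writing back sits outside the execution element.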

Fine-grain event driven synchronization: breaking the barrier • A number of forms of synchronization are incorporated into the semantics • Message-driven remote thread instantiation • Lightweight objects • Data flow • Futures • In-memory synchronization • Control state is in the name space of the machine • Producer-consumer in memory • e.g., empty/full bits • Local mutual exclusion protection • Synchronization mechanisms as well as state are presumed to be intrinsic to memory • Directed trees and graphs • Low cost traversal
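Producer-consumer synchronization through an empty/full bit can be illustrated with a small write-once cell. This is a sketch of the general idea only, assuming a simplified future-like variant in which reads never empty the cell; `FullEmptyCell` and its methods are hypothetical names, not a description of any particular hardware.

```python
class FullEmptyCell:
    """Memory word with a full/empty bit; readers of an empty word are deferred."""

    def __init__(self):
        self.full = False
        self.value = None
        self.waiters = []   # deferred consumer continuations

    def read(self, continuation):
        if self.full:
            continuation(self.value)            # value is ready: proceed at once
        else:
            self.waiters.append(continuation)   # suspend instead of spinning

    def write(self, value):
        if self.full:
            raise RuntimeError("write to a full cell")
        self.value, self.full = value, True
        waiters, self.waiters = self.waiters, []
        for continuation in waiters:            # release every deferred consumer
            continuation(value)

seen = []
cell = FullEmptyCell()
cell.read(seen.append)   # consumer arrives first and is deferred
cell.write(7)            # producer fills the cell and releases the consumer
cell.read(seen.append)   # a later consumer proceeds immediately
print(seen)
```

Because the waiting consumers are stored with the cell itself, the synchronization state lives in memory alongside the data, which is the property the slide emphasizes.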

Global name space • User variables • Synchronization variables and objects • Threads as first-class objects • Moves virtual named elements in physical space • Parcel sets • Process • First class object • Specifies a broad task • Defines a distributed environment • Spans multiple localities • Need not be contiguous

Beyond current scope • Policies not specified • Execution order • Language and language syntax • What’s special about hardware • Runtime vs. OS responsibilities • Load balancing • What’s missing • Affinity, colocation • Fault intrinsics • Meta threads • I/O • Many details

PXIF – ParalleX Intermediate-form • Not a programming language • Provides command line-like hooks to relate to and control all elements and actions of ParalleX execution • Lists of actions and operands • ‘(<action> <target object> <function operands>)’ • Some special forms (sadly) • Create, delete, and move objects • Invoke, terminate, migrate actions • Syntaxless syntax • Currently uses a prefix notation but is amenable to other forms as long as there is a one-to-one isomorphism • Note that MPI has more than one syntax • In-work • Not finished • Subject to change • But great progress
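The prefix shape ‘(<action> <target object> <function operands>)’ can be illustrated with a tiny parser. The real PXIF grammar is not given in these slides, so everything here is an assumption: whitespace-separated tokens, no nesting, and the hypothetical names `parse_form`, `to_call_syntax`, and `counter`.

```python
def parse_form(text):
    """Parse '(action target operand ...)' into (action, target, operands)."""
    text = text.strip()
    if not (text.startswith("(") and text.endswith(")")):
        raise ValueError("expected a parenthesized form")
    tokens = text[1:-1].split()
    if len(tokens) < 2:
        raise ValueError("a form needs an action and a target object")
    action, target, *operands = tokens
    return action, target, operands

def to_call_syntax(action, target, operands):
    """Render the same triple in a different surface syntax."""
    return f"{target}.{action}({', '.join(operands)})"

parsed = parse_form("(invoke counter 5 7)")
print(parsed)
print(to_call_syntax(*parsed))
```

Rendering the parsed triple in a second notation is meant to show why the slide calls the syntax "syntaxless": as long as each form maps to exactly one (action, target, operands) triple, the surface notation can change.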

Reference Implementation • Goal • Validation of semantics and PXI formulation • Correctness • Completeness • Early testbed for experimentation and algorithm development • Executable reference for future PXI implementations by external collaborators • Strategy • Facilitates development of PXIF syntax specification • Employ rapid prototyping software development environment • Incremental design • Replace existing functions with PXI-specific modules • Refinement of ParalleX concepts and PXIF formalism

Accomplishments and Status • A previous toy version of some parts of early ParalleX concepts (predated actual PXIF) • Running PXIF interpreter (version 0.4) • Distributed vers. 0.3.2 to NASA ARC & UNM • Based on open source Common Lisp environment • CLISP & SBCL • Compiled interpreter • Virtual machine for execution of serialized code • ~75% PXIF implemented • Object representation of system, localities, threads, data, text • Futures, templates, continuations • Data types: arrays, structs, scalars • Automatic compilation of methods • Some small application kernels running on multithreaded PXVM • Collects runtime statistics • e.g., #threads, #instr, histogramming • Rudimentary PXIF debugger

PXIF from Sources to Execution • [Figure: diagram with elements labeled PXIF Sources, High-level Translator, Static Compiler, Dynamic Compiler, Internal Representation, Text Object Pool (Text objects), Thread Schedulers, and Multi-threaded VMs]

Conclusions • Undertaking an exploratory study of alternative execution models • Influenced by early architecture studies • Benefits from previous projects • Working toward • Specification • Reference implementation • Costs quantification • Realization on conventional distributed systems • FPGA-based accelerator • Architecture refinement • Programming model and language
