
Elastic Computing


Presentation Transcript


  1. Elastic Computing: A Framework for Effective Multi-core Heterogeneous Computing

  2. Introduction • Clear trend towards multi-core heterogeneous systems • Problem: increased application-design complexity • Different resources require different algorithms to execute efficiently • Compiler research attempts to compile code for different resources • Fundamentally limited, as compilers can’t infer one algorithm from another • Elastic Computing: an optimization framework with a knowledge base of implementations for different elastic functions • Designers call functions that automatically optimize for any system • i.e., designers specify “what” without specifying “how” [Figure: a compiler maps a single algorithm (e.g., quick_sort) to every resource, while Elastic Computing selects the optimal algorithm per resource (e.g., bitonic_sort on the FPGA, quick_sort on the uP)]

  3. Overview • Instead of specifying a specific implementation, applications use Elastic Functions • Elastic Functions contain a knowledge base of implementation and parallelization options • At run-time, the Elastic Computing Framework determines the best execution decisions • Decisions are based on available system resources as well as function parameters [Figure: an application calls sort(A, 100); the Elastic Function Library holds insertion_sort, quick_sort, and bitonic_sort implementations, and the Elastic Computing Framework compares their performance on the system resources]
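The idea on this slide can be sketched in miniature. The following C is purely illustrative and is not the framework's API: the hard-coded size threshold stands in for the framework's installation-time analysis, and `elastic_sort`, `insertion_sort`, and `cmp_int` are hypothetical names.

```c
#include <stddef.h>
#include <stdlib.h>

/* One candidate implementation from the "knowledge base": good for
   small inputs. */
static void insertion_sort(int *a, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        int key = a[i];
        size_t j = i;
        while (j > 0 && a[j - 1] > key) { a[j] = a[j - 1]; j--; }
        a[j] = key;
    }
}

/* Comparator for the C library's qsort, our second candidate. */
static int cmp_int(const void *x, const void *y)
{
    return (*(const int *)x > *(const int *)y) -
           (*(const int *)x < *(const int *)y);
}

/* The "elastic" sort: the caller says WHAT (sort this array); the
   dispatcher decides HOW based on the invocation parameters. The
   threshold 32 is an assumption standing in for a measured decision. */
static void elastic_sort(int *a, size_t n)
{
    if (n < 32)
        insertion_sort(a, n);
    else
        qsort(a, n, sizeof(int), cmp_int);
}
```

In the real framework the decision would also depend on which resources (CPU cores, GPU, FPGA) are free, not only on the input size.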

  4. Overview (continued) [Figure: for the sort(A, 100) invocation, the framework selects bitonic_sort as the most efficient implementation for the available resources]

  5. Overview • If multiple resources are available, the Elastic Computing Framework will dynamically parallelize work across different resources • Automatically determines an efficient partitioning of work to resources • Also determines the most efficient implementation for each resource individually [Figure: the sort(A, 100) invocation is partitioned, with quick sort running on one resource and bitonic sort on another]

  6. Overview (continued) [Figure: repeated nesting of partitions spreads the sort across several resources, mixing quick sort and bitonic sort implementations]

  7. Overview • Elastic Computing is transparent • Applications treat Elastic Computing as a high-performance auto-tuning library of functions • Elastic Computing determines how to efficiently execute the Elastic Functions on behalf of the application [Figure: the application calls sort(A, 100); the framework manages the sorting implementations and system resources behind the interface]

  8. Overview • Elastic Computing is transparent and portable • Elastic Computing automatically optimizes the Elastic Function execution to the available system resources, even if the application is moved to a different system [Figure: the same unmodified application runs across several systems with different resources]

  9. Overview • Elastic Computing is transparent, portable, and adaptive • Elastic Computing also automatically adapts the Elastic Function execution to the application’s input parameters (e.g., sorting 5 elements as opposed to 100) [Figure: for sort(A, 100) the framework selects quick_sort, while for sort(A, 5) it selects insertion_sort]

  10. Related Work • Parallel cross-compiling programming languages: • Examples: CUDA, OpenCL, DirectX, ImpulseC • Allow a single code file to describe parallel computation that can compile to numerous devices • Single-domain adaptable software libraries: • Examples: FFTW (for FFT) [Frigo 98], ATLAS (for linear algebra) [Whaley 98] • Measure the performance of execution alternatives and determine the best way to execute the function for the specific function call and system • General-purpose adaptable software libraries: • Examples: PetaBricks [Ansel 09], SPIRAL [Püschel 05] • Use custom languages to expose algorithmic/implementation choices to the compiler, and rely on measured performance and learning techniques to determine the best choice • Example: Qilin [Luk 09] • Uses dynamic compilation to determine a data graph, and relies on measured performance to determine an efficient partitioning of work across heterogeneous resources • Differentiating features of Elastic Computing: • Allows specification of multiple algorithms for different devices • Automatically determines efficient partitionings of work between heterogeneous devices • Supports both multi-core and heterogeneous devices and is not specific to any domain • Does not require custom programming languages or non-standard compilation • In most cases, previous work can be used in conjunction with Elastic Computing

  11. Optimization Steps • The Elastic Computing Framework performs two optimization steps to determine how to execute an Elastic Function efficiently • Implementation Assessment collects performance information about the different implementation options for an Elastic Function • Optimization Planning then analyzes the predicted performance to determine efficient execution decisions • To reduce run-time overhead, both optimization steps execute at installation-time and save their results to a file • May require several minutes to an hour to complete • Only needs to occur once per Elastic Function per system • At run-time, the Elastic Function Execution step looks up the optimization decisions to execute the Elastic Function on behalf of an application [Figure: installation-time flow from the Elastic Function through Implementation Assessment and Optimization Planning to the Optimization Decisions, which the run-time Elastic Function Execution step looks up on behalf of the application]

  12. Optimization Steps • Elastic Functions inform the Elastic Computing Framework of how to execute and optimize a function • May be created for nearly any function (e.g., sort, FFT, matrix multiply) • Elastic Functions contain numerous alternate implementations for executing the function • Implementations may be single-core, multi-core, and/or heterogeneous • All implementations adhere to the same input/output parameters, making them interchangeable • Elastic Functions also contain: • Dependent Implementations that specify how to parallelize the function • An Adapter to abstract function-specific details from the analysis steps • Details discussed later! [Figure: a Sort Elastic Function with Quick Sort C code, Bitonic Sort VHDL code, and Merge Sort CUDA code implementations feeding the installation-time optimization steps]

  13. Optimization Steps • Implementation Assessment creates performance predictors for the implementations of the Elastic Function • The performance predictors are called Implementation Performance Graphs (IPGs), which are: • Created for each implementation individually • Return the estimated execution time of the implementation when given the implementation’s invocation parameters • Example: a quick sort implementation invoked as QuickSort(array) on an int array[10000] maps the input parameter 10,000 to an estimated execution time of 1.3 sec [Figure: IPG for the Quick Sort C code, plotting input parameters on the x-axis against execution time on the y-axis]
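The IPG lookup described above amounts to piece-wise linear interpolation. Here is a minimal C sketch, assuming the graph is stored as an array of (work metric, execution time) sample points sorted by work metric; `IPGPoint` and `ipg_lookup` are illustrative names, not the framework's actual data structures.

```c
#include <stddef.h>

/* One sample point on a piece-wise linear IPG. */
typedef struct {
    double work_metric;  /* x-axis: work metric value       */
    double exec_time;    /* y-axis: measured execution time */
} IPGPoint;

/* Estimate execution time at work metric w: find the two sample
   points that bracket w and interpolate linearly between them;
   clamp to the endpoints outside the sampled range. */
static double ipg_lookup(const IPGPoint *pts, size_t n, double w)
{
    if (w <= pts[0].work_metric)     return pts[0].exec_time;
    if (w >= pts[n - 1].work_metric) return pts[n - 1].exec_time;
    for (size_t i = 1; i < n; i++) {
        if (w <= pts[i].work_metric) {
            double t = (w - pts[i - 1].work_metric) /
                       (pts[i].work_metric - pts[i - 1].work_metric);
            return pts[i - 1].exec_time +
                   t * (pts[i].exec_time - pts[i - 1].exec_time);
        }
    }
    return pts[n - 1].exec_time;  /* unreachable for sorted input */
}
```

A lookup is O(n) here; since IPGs are consulted at run-time, a real implementation would likely use binary search over the sample points.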

  14. Optimization Steps • Optimization Planning then analyzes the IPGs to predetermine efficient Elastic Function execution decisions • The goal is to make decisions that minimize the estimated execution time • Answers two main execution questions: • Which implementation is the most efficient for an invocation? • How to efficiently partition computation across multiple resources? • Details discussed later! [Figure: for a Sort(array) invocation on 10,000 elements, the IPG for Quick Sort predicts 1.3 sec and the IPG for Bitonic Sort predicts 1.1 sec, so Bitonic Sort is estimated to be most efficient]

  15. Optimization Steps • The output of Implementation Assessment and Optimization Planning is saved to a file for lookup at run-time • Applications execute normally until they invoke an Elastic Function • When an Elastic Function is invoked, the Elastic Function Execution step starts, which then: • Looks up the predetermined execution decisions based on the invocation parameters and the availability of system resources • Executes the Elastic Function using the predetermined decisions • Returns control to the application once the Elastic Function completes [Figure: the run-time Elastic Function Execution step reads the installation-time Optimization Decisions on behalf of the application]

  16. Design Flow [Figure: the design flow in four stages — Elastic Function Design (hardware vendors, library designers, and open-source efforts produce Elastic Function interface specifications and implementations, which are compiled), Application Design (the application developer writes application code against those interfaces and compiles it to an executable), Elastic Function Installation (Implementation Assessment and Optimization Planning run once per system to produce the installed Elastic Functions), and Run-time (the launched application invokes the installed Elastic Functions via Elastic Function Execution)]

  17. Design Flow • How does it work? • Implementation Assessment and Optimization Planning are the main research challenges and the focus of on-going research • Time for details!

  18. Adapter • Implementation Assessment creates Implementation Performance Graphs (IPGs) for each implementation to predict the execution time from the input parameters • An IPG is a piece-wise linear graph mapping the input parameters to estimated execution time • Question: how do we map input parameters to the x-axis for every Elastic Function? • Answer: the adapter • Example: a QuickSort(array) invocation on an int array[10000] maps naturally to 10,000 on the x-axis, but what value should a Convolve(a, b) invocation with float a[100] and float b[10000] map to? [Figure: the IPG for the Quick Sort C code maps 10,000 to 1.3 sec, while the x-axis value for the Convolution C code is still unknown]

  19. Adapter • The adapter maps the input/output parameters to a numeric value, called the work metric • Essentially provides an abstraction layer that allows Elastic Computing to analyze and, thereby, optimize any type of Elastic Function • The developer creates the adapter as part of the Elastic Function • Rules for the adapter’s mapping: • 1. Parameters that map to the same work metric value should require equal execution times • 2. As the work metric value increases, execution time should also increase • Example: sorting Elastic Function • Adapter: set the work metric equal to the number of elements to sort • Adheres to Rule 1: sorting the same number of elements generally takes the same time • Adheres to Rule 2: sorting more elements generally takes longer [Figure: the QuickSort(array) invocation on int array[10000] maps to work metric = 10,000, which the IPG maps to an execution time of 1.3 sec]

  20. Adapter • Any work metric mapping that (mostly) adheres to Rules 1 & 2 is a valid adapter • One technique is to set the mapping equal to the result of an asymptotic analysis of the function’s performance • Asymptotic analysis yields an equation that is approximately proportional to execution time • Use that equation as the work metric mapping • Example: convolution Elastic Function • Time-domain convolution has asymptotic performance equal to Θ(|a|*|b|) • Therefore, set the work metric equal to the product of the lengths of the two input vectors [Figure: the Convolve(a, b) invocation with float a[100] and float b[10000] maps to work metric = 100 * 10,000 = 1,000,000, which the IPG maps to an execution time of 1.7 sec]
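The two adapters described on these slides are one-line mappings. A C sketch, with hypothetical function names (the framework's actual adapter interface is not shown in the slides):

```c
#include <stddef.h>

/* Adapter for the sort Elastic Function: the work metric is simply
   the number of elements to sort (Rules 1 and 2 both hold, since
   equal-sized inputs sort in roughly equal time and larger inputs
   take longer). */
static double sort_work_metric(size_t num_elements)
{
    return (double)num_elements;
}

/* Adapter for the convolution Elastic Function: time-domain
   convolution is Theta(|a|*|b|), so the work metric is the product
   of the two input lengths. */
static double convolve_work_metric(size_t len_a, size_t len_b)
{
    return (double)len_a * (double)len_b;
}
```

So the slide's example invocation, with |a| = 100 and |b| = 10,000, maps to a work metric of 1,000,000.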

  21. Implementation Assessment • Implementation Assessment relies on a heuristic to create IPGs, which: • Samples the execution time of the implementation at several work metrics to determine performance • Performs statistical analyses on sets of samples to find work metric intervals with linear trends • Adapts the sampling process to collect fewer samples in regions with linear trends [Figure: the collected samples and the resulting IPG; the heuristic collected fewer samples in linear regions]
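One simple way to realize the adaptive sampling idea above is recursive bisection: measure an interval's endpoints and midpoint, and subdivide only where the midpoint deviates from the linear prediction. This is a hypothetical sketch of the general technique, not the paper's actual heuristic (which uses statistical analyses over sample sets); `measure` stands in for timing a real implementation, and a real heuristic would also merge the duplicate boundary samples this version emits.

```c
#include <math.h>
#include <stddef.h>

#define MAX_SAMPLES 256

typedef double (*MeasureFn)(double work_metric);

/* Recursively sample [lo, hi]: if the midpoint measurement matches the
   linear prediction within tol, keep only the endpoints (linear
   region); otherwise subdivide both halves.  Returns the number of
   (work metric, time) samples written to out_w/out_t. */
static size_t sample_interval(MeasureFn measure, double lo, double hi,
                              double tol, double *out_w, double *out_t,
                              size_t count)
{
    double t_lo = measure(lo), t_hi = measure(hi);
    double mid = 0.5 * (lo + hi);
    double t_mid = measure(mid);
    double t_linear = 0.5 * (t_lo + t_hi);  /* linear prediction at mid */

    if (count + 3 > MAX_SAMPLES || fabs(t_mid - t_linear) <= tol) {
        out_w[count] = lo; out_t[count] = t_lo; count++;
        out_w[count] = hi; out_t[count] = t_hi; count++;
        return count;
    }
    count = sample_interval(measure, lo, mid, tol, out_w, out_t, count);
    count = sample_interval(measure, mid, hi, tol, out_w, out_t, count);
    return count;
}

/* Synthetic "measurements" for demonstration only. */
static double linear_cost(double w)    { return 2.0 * w + 1.0; }
static double quadratic_cost(double w) { return w * w; }
```

On a perfectly linear cost the sketch stops after the two endpoint samples, while a nonlinear cost forces subdivision, matching the slide's observation that fewer samples are collected in linear regions.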

  22. Optimization Planning • Optimization Planning analyzes the IPGs to predetermine efficient execution decisions, and performs two main optimizations: • Fastest Implementation Planning predetermines the most efficient implementation for different invocation situations • Work Parallelization Planning predetermines how to efficiently parallelize computation • Fastest Implementation Planning (FIP) creates Function Performance Graphs (FPGs) that allow a single lookup to return the best implementation for an invocation • FIP creates an FPG by overlaying the IPGs of the candidate implementations and saving only the lowest envelope [Figure: the IPGs of the Quick Sort C code and Bitonic Sort VHDL code are overlaid, and the resulting FPG keeps only the lowest envelope across the work metric axis]
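The lowest-envelope construction can be sketched as follows. This is an illustrative simplification: real IPGs are piece-wise linear graphs whose envelope is computed over segments, whereas here each IPG is a cost array evaluated on a shared work-metric grid, and `FPGEntry` and `build_fpg` are hypothetical names.

```c
#include <stddef.h>

/* One FPG entry: the best predicted time at a grid point, plus which
   implementation achieves it. */
typedef struct {
    double exec_time;  /* lower envelope: best predicted time     */
    int    impl;       /* index of the winning implementation     */
} FPGEntry;

/* Overlay two IPGs (predicted cost per grid point) and keep, at each
   work metric, the cheaper implementation. */
static void build_fpg(const double *ipg_a, const double *ipg_b,
                      size_t n, FPGEntry *fpg)
{
    for (size_t i = 0; i < n; i++) {
        if (ipg_a[i] <= ipg_b[i]) {
            fpg[i].exec_time = ipg_a[i];
            fpg[i].impl = 0;   /* e.g., Quick Sort on the CPU    */
        } else {
            fpg[i].exec_time = ipg_b[i];
            fpg[i].impl = 1;   /* e.g., Bitonic Sort on the FPGA */
        }
    }
}
```

A single lookup into the resulting table then returns both the estimated time and the implementation to run, which is exactly what FIP precomputes.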

  23. Optimization Planning • Work Parallelization Planning (WPP) analyzes FPGs to determine partitionings of computation that minimize the estimated execution time • Dependent implementations are a type of implementation that uses WPP results to determine how to efficiently parallelize computation • Developers create dependent implementations based on divide-and-conquer algorithms • Divide-and-conquer algorithms divide a big instance of a problem into multiple smaller instances, and are common for many types of functions • Example: merge sort algorithm (a divide-and-conquer algorithm that performs sort) • Question: how to parallelize computation across resources to maximize performance? • Answer: determine partitionings that minimize the estimated execution time! [Figure: the Merge Sort dependent implementation partitions its input, performs the two recursive sorts in parallel, and merges the recursive outputs — e.g., Sort([3, 5, 7, 1, 2, 8, 5, 2]) partitions into Sort([3, 5, 7, 1, 2]) and Sort([8, 5, 2]), whose outputs [1, 2, 3, 5, 7] and [2, 5, 8] merge into [1, 2, 2, 3, 5, 5, 7, 8]]

  24. Optimization Planning • WPP uses a sweep-line algorithm to analyze pairs of FPGs and determine an efficient partitioning of computation between them • Example: partitioning sort between two resources • The algorithm analyzes all pairs of FPGs to consider all possible resource partitionings • The result of the algorithm is optimal, assuming the estimated FPG performance is accurate • Implementation Assessment and Optimization Planning iterate to consider repeated nesting of dependent implementations • Repeated nesting of dependent implementations allows for arbitrarily many partitions • Proposed improvements to WPP consider more parallelization options to allow more efficient parallelization decisions [Figure: a sweep line over the FPGs for sort on a CPU and on an FPGA shows that, when sorting 6,000 elements, partitioning 1,000 elements to the CPU and 5,000 to the FPGA lets both finish in 1.2 sec]
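The partitioning objective above can be sketched with a brute-force search in place of the sweep line: since the two halves run in parallel, the cost of a split is the slower half, and WPP picks the split minimizing that maximum. This is an illustrative simplification (the actual sweep-line algorithm avoids enumerating every split); the FPG lookups are represented as cost arrays indexed by work metric, and `plan_partition` is a hypothetical name.

```c
#include <stddef.h>

/* Given per-resource FPG costs (cost of doing w units of work on
   that resource, for w = 0..total_work), find the amount of work to
   assign to the CPU so that max(CPU time, FPGA time) is minimized. */
static size_t plan_partition(const double *fpg_cpu, const double *fpg_fpga,
                             size_t total_work)
{
    size_t best_split = 0;
    double best_time = fpg_fpga[total_work];  /* all work on the FPGA */
    for (size_t w = 0; w <= total_work; w++) {
        double t_cpu  = fpg_cpu[w];                /* CPU handles w     */
        double t_fpga = fpg_fpga[total_work - w];  /* FPGA handles rest */
        double t = (t_cpu > t_fpga) ? t_cpu : t_fpga;  /* parallel: max */
        if (t < best_time) { best_time = t; best_split = w; }
    }
    return best_split;  /* work metric assigned to the CPU */
}
```

With a CPU that costs 1 time unit per unit of work and an FPGA that costs 0.5, splitting 6 units as 2 to the CPU and 4 to the FPGA balances both sides, mirroring the slide's 1,000/5,000 example.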

  25. Status of Elastic Computing • The Elastic Computing Framework is working! • Consists of over 200 files and 25k lines of code • 13 Elastic Functions (and 35 implementations) created: • Convolution: Circular Convolution, Convolution, 2D Convolution • Linear Algebra: Inner Product, Matrix Multiply • Image Processing: Mean Filter, Optical Flow, Prewitt Filter, Sum-of-Absolute-Differences • Others: Floyd-Warshall, Lattice-Boltzmann, Longest Common Subsequence, and Sort • Easy to add new Elastic Functions and implementations • 5 processing resources supported: • Multi-threaded implementations support MPI communication/synchronization features • GPU support: any CUDA-supported GPUs • FPGA support: H101PCIXM, PROCeIII, and PROCStarIII • Adding support for new resources requires creating a wrapper for the driver’s interface • Elastic Computing Framework installed on: • Alpha, Delta, Elastic, Marvel, Novo-G, and Warp • Easy to add new platforms

  26. Experimental Results • Results collected on the Elastic system • The Convolution Elastic Function contains 5 implementations: • Single-threaded CPU implementation using the time-domain algorithm • Multi-threaded CPU implementation using the time-domain algorithm • GPU implementation using the time-domain algorithm • FPGA implementation using the frequency-domain algorithm • Dependent implementation using overlap-add partitioning [Figure: speedup of the Convolution Elastic Function as more resources are made available, and the parallelization decisions for an invocation with work metric = 1,024,000]

  27. Experimental Results • Results collected on Delta, Elastic, Marvel, and Novo-G for 11 Elastic Functions: • 2DConv = 2D convolution • Cconv = circular convolution • Conv = 1D convolution • FW = Floyd-Warshall • Inner = inner-product • Mean = mean image filter • MM = matrix multiply • Optical = optical flow • Prewitt = Prewitt edge detection • SAD = sum of absolute differences • Sort = sort

  28. Publication List • Elastic Computing publications: • J. Wernsing and G. Stitt, "Elastic computing: a framework for transparent, portable, and adaptive multi-core heterogeneous computing," in LCTES'10: Proceedings of the ACM SIGPLAN/SIGBED 2010 Conference on Languages, Compilers, and Tools for Embedded Systems, pp. 115–124, 2010. • J. Wernsing and G. Stitt, "A scalable performance prediction heuristic for implementation planning on heterogeneous systems," in ESTIMedia'10: 8th IEEE Workshop on Embedded Systems for Real-Time Multimedia, pp. 71–80, 2010. • J. Wernsing and G. Stitt, "RACECAR: A Heuristic for Automatic Function Specialization on Multi-core Heterogeneous Systems," under review in PPoPP'12: 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2012. • J. Wernsing and G. Stitt, "Elastic Computing: A Portable Optimization Framework for Hybrid Computers," under review in Parallel Computing Journal (ParCo) Special Issue on Application Accelerators in HPC. • Other publications: • J. Wernsing, J. Ling, G. Cieslewski, and A. George, "Lightweight Reliable Communications Library for High-Performance Embedded Space Applications," in DSN'07: 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, Edinburgh, UK, June 25–28, 2007 (student forum). • J. Coole, J. Wernsing, and G. Stitt, "A Traversal Cache Framework for FPGA Acceleration of Pointer Data Structures: A Case Study on Barnes-Hut N-body Simulation," in ReConFig'09: International Conference on Reconfigurable Computing and FPGAs, pp. 143–148, 2009. • J. Fowers, G. Brown, J. Wernsing, and G. Stitt, "A Performance and Energy Comparison of Convolution on GPUs, FPGAs, and Multicore Processors," under review in ACM Transactions on Architecture and Code Optimization (TACO) Special Issue on High-Performance and Embedded Architectures and Compilers.

  29. Conclusions • Elastic Computing enables effective multi-core heterogeneous computing by: • Providing a framework for designing, reusing, and automatically optimizing computation on multi-core heterogeneous systems • Adapting execution decisions to execute efficiently based on the invocation’s input parameters and the availability of system resources • Abstracting application developers from computation and optimization details • Enabling applications to be portable yet efficient across different systems • Main research challenges: • Implementation Assessment, which creates performance predictors for implementations • Optimization Planning, which predetermines efficient execution decisions by analyzing the performance predictors • Proposed improvements: • Improve Implementation Assessment to more intelligently sample an implementation when creating an IPG, resulting in reduced installation-time overhead without reducing accuracy • Improve Optimization Planning to consider more partitioning options, resulting in improved efficiency when parallelizing computation • Questions?
