
  1. MATE-CG: A MapReduce-Like Framework for Accelerating Data-Intensive Computations on Heterogeneous Clusters Wei Jiang and Gagan Agrawal

  2. Outline • Background • System Design and Implementation • Auto-Tuning Framework • Applications • Experiments • Related Work • Conclusion

  3. Background (I) • Map-Reduce • Simple API • Easy to write parallel programs: map and reduce • Fault-tolerant for large-scale data centers • Performance? • Always a concern for the HPC community • Generalized Reduction as a variant • Shares a similar processing structure • The key difference lies in a programmer-managed reduction object • Outperformed Map-Reduce for a sub-class of data-intensive applications

  4. Background (II) • Parallel Computing Environments • CPU clusters (multi-core) • Most widely used as traditional HPC platforms • Support MapReduce and many of its variants • GPU clusters (many-core) • Higher performance with better cost and energy efficiency • Low programming productivity • Limited MapReduce-like support: Mars for a single GPU, the recent IDAV for GPU clusters, … • CPU-GPU clusters (heterogeneous systems) • No MapReduce-like support to date!

  5. Background (III) • Previously developed MATE (a Map-Reduce system with an AlternaTE API) for multi-core environments • Phoenix implemented Map-Reduce on shared-memory systems • MATE adopted Generalized Reduction, first proposed in FREERIDE, developed at Ohio State (2001-2003) • Comparison between MATE and Phoenix for • Data mining applications • Comparing performance and API • Understanding performance overheads • MATE provided an alternative API that is better than "Map-Reduce" for some data-intensive applications • Assumption: the reduction object must fit in memory

  6. Background (IV) • To address this limitation of MATE, we developed Ex-MATE to support the management of large reduction objects, which are required by graph mining applications • Developed support for managing a disk-resident reduction object and updating it efficiently in distributed environments • Evaluated Ex-MATE against PEGASUS, a Hadoop-based graph mining system • Ex-MATE outperformed PEGASUS on three graph mining applications by factors ranging from 9 to 35

  7. Map-Reduce Execution

  8. Comparing Processing Structures • The Reduction Object represents the intermediate state of the execution • The reduce function is commutative and associative • Sorting, grouping, and similar overheads are eliminated by the reduction function/object

  9. Observations on Processing Structures • Map-Reduce is based on a functional idea • Does not maintain state • This can lead to overheads for managing intermediate results between map and reduce • Map can generate intermediate results of very large size • The reduction-based approach relies on a programmer-managed reduction object • Not as 'clean' • But it avoids sorting of intermediate results • Can also help shared-memory parallelization • Helps better fault-recovery
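To make the contrast concrete, here is a minimal C++ sketch (not MATE-CG's actual code) of the reduction-object style for a hypothetical histogram computation; because the update is commutative and associative, no intermediate (key, value) pairs need to be emitted, sorted, or grouped:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical histogram example: each input element updates the
// programmer-managed reduction object (histo) in place, so no (key, value)
// pairs are emitted, sorted, or grouped.
void local_reduction(const std::vector<unsigned char>& data,
                     std::vector<long>& histo /* reduction object */) {
    for (unsigned char v : data)
        histo[v] += 1;                 // accumulate directly into the object
}

// Combining per-thread or per-node copies is itself a reduction; it is
// correct only because the update (+) is commutative and associative.
void combine(std::vector<long>& global_obj, const std::vector<long>& local_obj) {
    for (std::size_t i = 0; i < global_obj.size(); ++i)
        global_obj[i] += local_obj[i];
}
```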

  10. System Design and Implementation • Execution overview of MATE-CG • System API: support of heterogeneous computing • Data types: Input_Space and Reduction_Object • Functions: CPU_Reduction and GPU_Reduction • Runtime • Partitioning the disk-resident dataset among nodes • Managing the large-sized reduction object on disk • Managing large-sized intermediate data • Using GPUs to accelerate computation

  11. MATE-CG Overview • Execution work-flow [figure]

  12. System API • Data types and functions [figure]
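The API figure does not survive in this transcript. Below is a hedged reconstruction of its shape, based only on the names listed on the previous slide; the exact declarations in MATE-CG differ:

```cpp
#include <cstddef>

// Hedged reconstruction of the API's shape from the names on the previous
// slide; the real MATE-CG declarations differ.
struct Input_Space {        // one split/chunk of the (disk-resident) input
    void*       data;
    std::size_t size;
};

struct Reduction_Object {   // programmer-managed, possibly disk-resident
    void*       state;
    std::size_t size;
};

// User-supplied reduction functions, one per device type. The runtime calls
// CPU_Reduction on chunks mapped to the CPU and GPU_Reduction (which would
// internally launch CUDA kernels) on chunks mapped to the GPU.
typedef void (*CPU_Reduction)(const Input_Space* in, Reduction_Object* obj);
typedef void (*GPU_Reduction)(const Input_Space* in, Reduction_Object* obj);
```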

  13. Implementation Considerations (I) • A multi-level data partitioning scheme • First, the partitioning function: partition the input into blocks and distribute them to different nodes • Data locality should be considered • Second, heterogeneous data mapping: cut each block into two parts, one for the CPU, the other for the GPU • How to identify the best data mapping? • Third, the splitting function: split each part of a data block into smaller chunks • Observation: a smaller chunk size suits the CPU and a larger chunk size suits the GPU
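An illustrative sketch of levels two and three of this scheme, assuming the data can be treated as a linear byte range; the names (Chunk, map_block, cpu_chunk_size, gpu_chunk_size) are hypothetical, not MATE-CG's:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative sketch, not MATE-CG's code.
struct Chunk { std::size_t offset, len; };

static std::vector<Chunk> make_chunks(std::size_t begin, std::size_t len,
                                      std::size_t chunk_size) {
    std::vector<Chunk> out;
    for (std::size_t off = begin; off < begin + len; off += chunk_size)
        out.push_back({off, std::min(chunk_size, begin + len - off)});
    return out;
}

// Cut one node-local block: fraction p goes to the CPU, the rest to the GPU,
// then split each part into chunks (smaller for the CPU, larger for the GPU).
void map_block(std::size_t block_len, double p,
               std::size_t cpu_chunk_size, std::size_t gpu_chunk_size,
               std::vector<Chunk>& cpu_chunks, std::vector<Chunk>& gpu_chunks) {
    const std::size_t cpu_len = static_cast<std::size_t>(block_len * p);
    cpu_chunks = make_chunks(0, cpu_len, cpu_chunk_size);
    gpu_chunks = make_chunks(cpu_len, block_len - cpu_len, gpu_chunk_size);
}
```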

  14. Implementation Considerations (II) • Management of the large-sized reduction object and intermediate data: • Reducing the disk I/O of large reduction objects: data access patterns are used to reuse splits of the reduction object as much as possible • Transparent to user code • Reducing the network cost of large intermediate data: a generic solution that invokes an all-to-all broadcast among all nodes would cause severe performance losses • Application-driven optimizations can be used to improve performance
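A hedged sketch of the split-reuse idea, assuming input chunks can be grouped ahead of time by the reduction-object split they update; all types and helpers here are illustrative stand-ins:

```cpp
#include <map>
#include <vector>

// Illustrative stand-ins, not MATE-CG's types.
using ROSplit = std::vector<long>;

ROSplit load_split(int /*split_id*/)                  { return ROSplit(1024, 0); }
void    store_split(int /*split_id*/, const ROSplit&) { /* write back to disk */ }
void    reduce_chunk(int /*chunk_id*/, ROSplit& s)    { s[0] += 1; /* placeholder */ }

// Input chunks are grouped by the reduction-object split they update, so
// each split is read and written once per pass instead of once per chunk.
void process_pass(const std::map<int, std::vector<int>>& chunks_by_split) {
    for (const auto& [split_id, chunk_ids] : chunks_by_split) {
        ROSplit split = load_split(split_id);   // one disk read per split
        for (int c : chunk_ids)
            reduce_chunk(c, split);             // all updates while in memory
        store_split(split_id, split);           // one disk write per split
    }
}
```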

  15. Auto-Tuning Framework • The auto-tuning problem: given an application, find the optimal parameter setting for distributing data between the CPU and the GPU, which have different processing capabilities • For example: 20/80? 50/50? 70/30? • Our approach: exploit the iterative nature of many data-intensive applications, which perform similar computations over a number of iterations • Construct an analytical model to predict performance • The optimal value is learned over the first few iterations • No compile-time search or tuning is needed • Low runtime overheads when there is a large number of iterations
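A minimal sketch of the per-iteration adjustment, assuming processing time is roughly linear in data size; the rule and names are assumptions consistent with the slides, not MATE-CG's exact code:

```cpp
// t_cpu_part / t_gpu_part: measured times of the CPU and GPU portions in
// the last iteration, which ran with CPU fraction p (0 < p < 1).
double retune_p(double p, double t_cpu_part, double t_gpu_part) {
    // Assuming time is roughly linear in data size:
    //   t_cpu_part ~ p * T_cp   and   t_gpu_part ~ (1 - p) * T_gp.
    const double T_cp = t_cpu_part / p;          // estimated full-data CPU time
    const double T_gp = t_gpu_part / (1.0 - p);  // estimated full-data GPU time
    return T_gp / (T_cp + T_gp);                 // both devices finish together
}
```

In this scheme the first two runs (with p = 1 and p = 0) seed the full-data time estimates directly, matching the two-step heuristic described on slide 19; later iterations keep calling retune_p to absorb variance in the measurements.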

  16. The Analytical Model (I) • We focus on the two main components of the overall running time on each node: the data processing time on the CPU and/or the GPU, and the overheads • First, considering the CPU only, we have: • Second, on the GPU, we have: • Third, letting Tcg represent the heterogeneous execution time using both the CPU and the GPU, we have:
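The three equations on this slide are lost in the transcript. A hedged reconstruction of their likely shape, assuming processing time grows linearly with data size (consistent with the heuristic on slide 19); the paper's exact terms, e.g. how overheads split across devices, may differ:

```latex
% Hedged reconstruction. T_o: per-node overheads (I/O, partitioning);
% T_cp, T_gp: time to process the full dataset on the CPU alone / the GPU
% alone (including device data-copying costs for the GPU).
\begin{align*}
T_c    &= T_o + T_{cp} && \text{(CPU only)} \\
T_g    &= T_o + T_{gp} && \text{(GPU only)} \\
T_{cg} &= \max\bigl(T_c(p),\, T_g(p)\bigr) && \text{(both devices, running concurrently)}
\end{align*}
```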

  17. The Analytical Model (II) • Let p represent the fraction of data assigned to the CPU; we then have: • Overall, relating Tcg to p gives the illustration on the next slide
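The two lost equations here would, under the same linear assumption, parameterize the component times by p:

```latex
\begin{align*}
T_c(p) &= T_o + p\,T_{cp} \\
T_g(p) &= T_o + (1-p)\,T_{gp}
\end{align*}
```

Since T_c(p) increases with p while T_g(p) decreases, T_cg(p) = max(T_c(p), T_g(p)) is V-shaped in p, with its minimum where the two lines cross; this is presumably what the illustration on the next slide depicted.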

  18. The Analytical Model (III) • Illustration of the relationship between Tcg and p:

  19. The Analytical Model (IV) • To minimize Tcg, we compute the optimal p: • To identify the best p, a simple heuristic is used: • First, set p to 1: use CPUs only • Second, set p to 0: use GPUs only • Obtain the necessary values for the other parameters in the above expression and predict an initial p • Adjust p in later iterations to account for variance in the measured values: make the CPU and the GPU finish simultaneously
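Under the hedged reconstruction above, the lost minimization expression would put the optimum where both devices finish simultaneously, which is exactly what the two-step heuristic measures:

```latex
% Setting T_c(p) = T_g(p):  T_o + p\,T_{cp} = T_o + (1-p)\,T_{gp}
p^{*} = \frac{T_{gp}}{T_{cp} + T_{gp}}
% Running with p = 1 measures T_cp; running with p = 0 measures T_gp.
```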

  20. Applications: three representatives • Gridding kernel from scientific computing • Single pass: converts visibilities into a grid model of the sky • The Expectation-Maximization (EM) algorithm from data mining • Iterative: estimates a vector of parameters • Two consecutive steps: the Expectation step (E-step) and the Maximization step (M-step) • PageRank from graph mining • Iterative: calculates the relative importance of web pages • Essentially a matrix-vector multiplication algorithm
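For reference, the standard PageRank power iteration makes the matrix-vector structure explicit (d is the damping factor, N the number of pages, M the column-normalized adjacency matrix of the web graph):

```latex
r^{(t+1)} = d\,M\,r^{(t)} + \frac{1-d}{N}\,\mathbf{1}
```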

  21. Applications: Optimizations (I) • The Expectation-Maximization algorithm • A large intermediate matrix is produced between the E-step and the M-step • Broadcasting such a large matrix among all nodes could incur substantial network communication costs • Optimization: on the same node, the M-step reads the same subset of the intermediate matrix as produced by the E-step (use of a common partitioner) • PageRank • Data-copying overheads are significant on GPUs • Smaller input-vector splits are shared by larger matrix blocks that need further splitting • Optimization: copy shared input-vector splits only once to save copying time (fine-grained copying)
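A hedged sketch of the fine-grained copying idea; the cache, the copy call, and the kernel launch are illustrative (a real implementation would use cudaMemcpy and a CUDA kernel):

```cpp
#include <set>
#include <utility>
#include <vector>

// A vector split shared by several matrix blocks is copied to GPU memory
// only once; later blocks that use it skip the copy.
struct DeviceCache {
    std::set<int> resident;                   // vector splits already on the GPU
    void ensure_on_gpu(int split_id) {
        if (resident.count(split_id)) return; // already copied: skip
        // cudaMemcpy(device_buf[split_id], host_buf[split_id], ...);  (assumed)
        resident.insert(split_id);
    }
};

// block_to_split: (matrix block, input-vector split it multiplies with)
void process_blocks(DeviceCache& cache,
                    const std::vector<std::pair<int, int>>& block_to_split) {
    for (const auto& [block, split] : block_to_split) {
        cache.ensure_on_gpu(split);           // copy each shared split once
        // launch_multiply(block, split);     // GPU mat-vec on this block (assumed)
        (void)block;                          // silence unused warning in this sketch
    }
}
```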

  22. Applications: Optimizations (II) • Outline of data copying and computation on GPUs

  23. Example User Code • Gridding kernel [figure]
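The user-code figure is lost in the transcript. Below is a hedged sketch of what such user code might look like under the API shape sketched earlier; MATE-CG's real example differs, and real gridding convolves each visibility over nearby cells rather than updating a single cell:

```cpp
#include <cstddef>

struct Visibility { float u, v, value; };

// CPU-side reduction: fold one chunk of visibilities into the sky grid
// (the reduction object), drastically simplified to nearest-cell updates
// and assuming non-negative grid coordinates.
void gridding_cpu_reduction(const Visibility* vis, std::size_t n,
                            float* grid, std::size_t grid_dim) {
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t gu = static_cast<std::size_t>(vis[i].u) % grid_dim;
        const std::size_t gv = static_cast<std::size_t>(vis[i].v) % grid_dim;
        grid[gu * grid_dim + gv] += vis[i].value;  // accumulate into the grid
    }
}
// A GPU_Reduction counterpart would launch a CUDA kernel performing the
// same accumulation with atomic adds on a device-resident grid.
```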

  24. Experiment Design (I) • Experimental platform • A heterogeneous CPU-GPU cluster • Each node has one Intel 8-core CPU and an NVIDIA Tesla (Fermi) GPU (448 cores) • Used up to 128 CPU cores and 7168 GPU cores on 16 nodes

  25. Experiment Design (II) • Three representative applications • Gridding kernel, EM, and PageRank • Each application was run in four modes on the cluster: • CPU-1: 1 CPU core per node, as the baseline • CPU-8: 8 CPU cores per node • GPU-only: only the GPU on each node • CPU-8-n-GPU: both 8 CPU cores and the GPU on each node

  26. Experiment Design (III) • We focused on three aspects: • For each application: scalability with an increasing number of GPUs, and performance improvement over CPUs • Performance improvement from heterogeneous execution and effectiveness of the auto-tuning framework • Performance impact of application-driven optimizations, with examples of system tuning

  27. Results: Scalability with # of GPUs (I) • PageRank: 64GB dataset; a graph of 1 billion nodes and 4 billion edges • [chart; data labels: 7.0, 6.8, 6.3, 5.0; 16%]

  28. Results: Scalability with # of GPUs (II) • Gridding Kernel: 32GB dataset; a collection of 800 million visibilities and a 6.4GB sky grid • [chart; data labels: 7.5, 7.2, 6.9, 6.5; 25%]

  29. Results: Scalability with # of GPUs (III) • EM: 32GB dataset; a cluster of 1 billion points • [chart; data labels: 7.6, 6.8, 5.0, 15.0, 3.0]

  30. Results: Auto-tuning (I) • PageRank: 64GB dataset on 16 nodes • [chart; data labels: 7%; p = 0.30]

  31. Results: Auto-tuning (II) • EM: 32GB dataset on 16 nodes • [chart; data labels: E: 29%, M: 24%; E: p = 0.31, M: p = 0.27]

  32. Results: Heterogeneous Execution • Gridding Kernel: 32GB dataset on 16 nodes • [chart; data labels: >=56%, >=42%]

  33. Results: App-Driven Optimizations (I) • EM: 4GB dataset with a 20GB intermediate matrix • [chart; data labels: 1.7, 7.7]

  34. Results: App-Driven Optimizations (II) • PageRank: 32GB dataset with a block size of 512MB and a GPU chunk size of 128MB • [chart; data label: 24%]

  35. Results: Examples of System Tuning • Gridding Kernel: 32GB dataset; varying cpu_chunk_size and gpu_chunk_size • [chart; data labels: 16 MB, 512MB]

  36. Insights • GPUs can significantly accelerate certain classes of computations, but they exhibit programming difficulties and introduce data-copying overheads • A suitable data mapping between the CPU and the GPU is crucial for the overall performance of heterogeneous computing • Application-specific opportunities should be exploited, e.g., through optional APIs, to make the best use of a parallel system • Automatic optimization is desirable for choosing the correct set of system parameter settings

  37. Related Work • Data-intensive computing with map-reduce-like models • Multi-core CPUs: Phoenix, Phoenix-rebirth, … • A single GPU: Mars, MapCG, … • GPU clusters: MITHRA, IDAV, … • CPU-GPU clusters: our MATE-CG • Programming heterogeneous systems • Software end: Merge, EXOCHI, Harmony, Qilin, … • Hardware end: CUBA, CUDA, OpenCL, … • Auto-tuning: many efforts • Basic idea: search for the best solution among all possibilities • Map solutions to performance metrics • Map hardware/software characteristics to parameters • Very useful for library generators, compilers, runtime systems, …

  38. Conclusions • MATE-CG supports a map-reduce-like API to ease the difficulty of programming a heterogeneous CPU-GPU cluster • The system achieves good scalability with an increasing number of nodes, and heterogeneous execution further improves performance over CPU-only or GPU-only execution • It introduces a novel and effective auto-tuning approach for choosing the best data mapping between the CPU and the GPU • Application-specific optimizations should be considered in the user code, and a high-level API should be coupled with significant auto-tuning support to identify the right system parameter settings automatically

  39. Questions?
