
  1. MATE-CG: A MapReduce-Like Framework for Accelerating Data-Intensive Computations on Heterogeneous Clusters Wei Jiang and Gagan Agrawal

  2. Outline • Background • System Design and Implementation • Auto-Tuning Framework • Applications • Experiments • Related Work • Conclusion

  3. Background (I) • Map-Reduce • Simple API • Easy to write parallel programs: map and reduce • Fault-tolerant for large-scale data centers • Performance? • Always a concern for the HPC community • Generalized Reduction as a variant • Shares a similar processing structure • The key difference lies in a programmer-managed reduction object • Outperformed Map-Reduce for a sub-class of data-intensive applications

  4. Background (II) • Parallel Computing Environments • CPU clusters (multi-core) • Most widely used as traditional HPC platforms • Support MapReduce and many of its variants • GPU clusters (many-core) • Higher performance with better cost and energy efficiency • Low programming productivity • Limited MapReduce-like support: Mars for a single GPU, the recent IDAV for GPU clusters, … • CPU-GPU clusters (heterogeneous systems) • No MapReduce-like support to date!

  5. Background (III) • Previously developed MATE (a Map-Reduce system with an AlternaTE API) for multi-core environments • Phoenix implemented Map-Reduce on shared-memory systems • MATE adopted Generalized Reduction, first proposed in FREERIDE, developed at Ohio State (2001-2003) • Comparison between MATE and Phoenix for • Data mining applications • Comparing performance and API • Understanding performance overheads • MATE provided an alternative API that is better than "Map-Reduce" for some data-intensive applications • Assumption: the reduction object must fit in memory

  6. Background (IV) • To address this limitation of MATE, we developed Ex-MATE to support the management of large reduction objects, which are required by graph mining applications • Developed support for managing a disk-resident reduction object and updating it efficiently in distributed environments • Evaluated Ex-MATE against PEGASUS, a Hadoop-based graph mining system • Ex-MATE outperformed PEGASUS on three graph mining applications by factors ranging from 9 to 35

  7. Map-Reduce Execution

  8. Comparing Processing Structures • The Reduction Object represents the intermediate state of the execution • The reduce function is commutative and associative • Sorting, grouping, and similar overheads are eliminated by the reduction function/object

  9. Observations on Processing Structures • Map-Reduce is based on a functional idea • Does not maintain state • This can lead to overheads for managing intermediate results between map and reduce • Map can generate intermediate results of very large size • The reduction-based approach relies on a programmer-managed reduction object • Not as 'clean' • But it avoids sorting of intermediate results • Can also help shared-memory parallelization • Helps better fault-recovery
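To make the contrast concrete, here is a minimal C++ sketch (not MATE-CG's actual code) of the reduction-object style for a hypothetical histogram computation; because the update is commutative and associative, no intermediate (key, value) pairs need to be emitted, sorted, or grouped:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical histogram example: each input element updates the
// programmer-managed reduction object (histo) in place, so no (key, value)
// pairs are emitted, sorted, or grouped.
void local_reduction(const std::vector<unsigned char>& data,
                     std::vector<long>& histo /* reduction object */) {
    for (unsigned char v : data)
        histo[v] += 1;                 // accumulate directly into the object
}

// Combining per-thread or per-node copies is itself a reduction; it is
// correct only because the update (+) is commutative and associative.
void combine(std::vector<long>& global_obj, const std::vector<long>& local_obj) {
    for (std::size_t i = 0; i < global_obj.size(); ++i)
        global_obj[i] += local_obj[i];
}
```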

  10. System Design and Implementation • Execution overview of MATE-CG • System API: support of heterogeneous computing • Data types: Input_Space and Reduction_Object • Functions: CPU_Reduction and GPU_Reduction • Runtime • Partitioning the disk-resident dataset among nodes • Managing the large-sized reduction object on disk • Managing large-sized intermediate data • Using GPUs to accelerate computation

  11. MATE-CG Overview • Execution work-flow [figure]

  12. System API • Data types and functions [figure]
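The API figure does not survive in this transcript. Below is a hedged reconstruction of its shape, based only on the names listed on the previous slide; the exact declarations in MATE-CG differ:

```cpp
#include <cstddef>

// Hedged reconstruction of the API's shape from the names on the previous
// slide; the real MATE-CG declarations differ.
struct Input_Space {        // one split/chunk of the (disk-resident) input
    void*       data;
    std::size_t size;
};

struct Reduction_Object {   // programmer-managed, possibly disk-resident
    void*       state;
    std::size_t size;
};

// User-supplied reduction functions, one per device type. The runtime calls
// CPU_Reduction on chunks mapped to the CPU and GPU_Reduction (which would
// internally launch CUDA kernels) on chunks mapped to the GPU.
typedef void (*CPU_Reduction)(const Input_Space* in, Reduction_Object* obj);
typedef void (*GPU_Reduction)(const Input_Space* in, Reduction_Object* obj);
```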

  13. Implementation Considerations (I) • A multi-level data partitioning scheme • First, the partitioning function: partition the input into blocks and distribute them to different nodes • Data locality should be considered • Second, heterogeneous data mapping: cut each block into two parts, one for the CPU, the other for the GPU • How to identify the best data mapping? • Third, the splitting function: split each part of a data block into smaller chunks • Observation: a smaller chunk size suits the CPU and a larger chunk size suits the GPU
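An illustrative sketch of levels two and three of this scheme, assuming the data can be treated as a linear byte range; the names (Chunk, map_block, cpu_chunk_size, gpu_chunk_size) are hypothetical, not MATE-CG's:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative sketch, not MATE-CG's code.
struct Chunk { std::size_t offset, len; };

static std::vector<Chunk> make_chunks(std::size_t begin, std::size_t len,
                                      std::size_t chunk_size) {
    std::vector<Chunk> out;
    for (std::size_t off = begin; off < begin + len; off += chunk_size)
        out.push_back({off, std::min(chunk_size, begin + len - off)});
    return out;
}

// Cut one node-local block: fraction p goes to the CPU, the rest to the GPU,
// then split each part into chunks (smaller for the CPU, larger for the GPU).
void map_block(std::size_t block_len, double p,
               std::size_t cpu_chunk_size, std::size_t gpu_chunk_size,
               std::vector<Chunk>& cpu_chunks, std::vector<Chunk>& gpu_chunks) {
    const std::size_t cpu_len = static_cast<std::size_t>(block_len * p);
    cpu_chunks = make_chunks(0, cpu_len, cpu_chunk_size);
    gpu_chunks = make_chunks(cpu_len, block_len - cpu_len, gpu_chunk_size);
}
```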

  14. Implementation Considerations (II) • Management of the large-sized reduction object and intermediate data: • Reducing the disk I/O of large reduction objects: data access patterns are used to reuse splits of the reduction object as much as possible • Transparent to user code • Reducing the network cost of large intermediate data: a generic solution that invokes an all-to-all broadcast among all nodes would cause severe performance losses • Application-driven optimizations can be used to improve performance
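A hedged sketch of the split-reuse idea, assuming input chunks can be grouped ahead of time by the reduction-object split they update; all types and helpers here are illustrative stand-ins:

```cpp
#include <map>
#include <vector>

// Illustrative stand-ins, not MATE-CG's types.
using ROSplit = std::vector<long>;

ROSplit load_split(int /*split_id*/)                  { return ROSplit(1024, 0); }
void    store_split(int /*split_id*/, const ROSplit&) { /* write back to disk */ }
void    reduce_chunk(int /*chunk_id*/, ROSplit& s)    { s[0] += 1; /* placeholder */ }

// Input chunks are grouped by the reduction-object split they update, so
// each split is read and written once per pass instead of once per chunk.
void process_pass(const std::map<int, std::vector<int>>& chunks_by_split) {
    for (const auto& [split_id, chunk_ids] : chunks_by_split) {
        ROSplit split = load_split(split_id);   // one disk read per split
        for (int c : chunk_ids)
            reduce_chunk(c, split);             // all updates while in memory
        store_split(split_id, split);           // one disk write per split
    }
}
```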

  15. Auto-Tuning Framework • The auto-tuning problem: given an application, find the optimal parameter setting for distributing data between the CPU and the GPU, which have different processing capabilities • For example: 20/80? 50/50? 70/30? • Our approach: exploit the iterative nature of many data-intensive applications, which perform similar computations over a number of iterations • Construct an analytical model to predict performance • The optimal value is learned over the first few iterations • No compile-time search or tuning is needed • Low runtime overheads when there is a large number of iterations
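A minimal sketch of the per-iteration adjustment, assuming processing time is roughly linear in data size; the rule and names are assumptions consistent with the slides, not MATE-CG's exact code:

```cpp
// t_cpu_part / t_gpu_part: measured times of the CPU and GPU portions in
// the last iteration, which ran with CPU fraction p (0 < p < 1).
double retune_p(double p, double t_cpu_part, double t_gpu_part) {
    // Assuming time is roughly linear in data size:
    //   t_cpu_part ~ p * T_cp   and   t_gpu_part ~ (1 - p) * T_gp.
    const double T_cp = t_cpu_part / p;          // estimated full-data CPU time
    const double T_gp = t_gpu_part / (1.0 - p);  // estimated full-data GPU time
    return T_gp / (T_cp + T_gp);                 // both devices finish together
}
```

In this scheme the first two runs (with p = 1 and p = 0) seed the full-data time estimates directly, matching the two-step heuristic described on slide 19; later iterations keep calling retune_p to absorb variance in the measurements.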

  16. The Analytical Model (I) • We focus on the two main components of the overall running time on each node: the data processing time on the CPU and/or the GPU, and the overheads • First, considering the CPU only, we have: • Second, on the GPU, we have: • Third, letting Tcg represent the heterogeneous execution time using both the CPU and the GPU, we have:
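The three equations on this slide are lost in the transcript. A hedged reconstruction of their likely shape, assuming processing time grows linearly with data size (consistent with the heuristic on slide 19); the paper's exact terms, e.g. how overheads split across devices, may differ:

```latex
% Hedged reconstruction. T_o: per-node overheads (I/O, partitioning);
% T_cp, T_gp: time to process the full dataset on the CPU alone / the GPU
% alone (including device data-copying costs for the GPU).
\begin{align*}
T_c    &= T_o + T_{cp} && \text{(CPU only)} \\
T_g    &= T_o + T_{gp} && \text{(GPU only)} \\
T_{cg} &= \max\bigl(T_c(p),\, T_g(p)\bigr) && \text{(both devices, running concurrently)}
\end{align*}
```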

  17. The Analytical Model (II) • Let p represent the fraction of data assigned to the CPU; we then have: • Overall, relating Tcg to p gives the illustration on the next slide
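The two lost equations here would, under the same linear assumption, parameterize the component times by p:

```latex
\begin{align*}
T_c(p) &= T_o + p\,T_{cp} \\
T_g(p) &= T_o + (1-p)\,T_{gp}
\end{align*}
```

Since T_c(p) increases with p while T_g(p) decreases, T_cg(p) = max(T_c(p), T_g(p)) is V-shaped in p, with its minimum where the two lines cross; this is presumably what the illustration on the next slide depicted.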

  18. The Analytical Model (III) • Illustration of the relationship between Tcg and p:

  19. The Analytical Model (IV) • To minimize Tcg, we compute the optimal p: • To identify the best p, a simple heuristic is used: • First, set p to 1: use CPUs only • Second, set p to 0: use GPUs only • Obtain the necessary values for the other parameters in the above expression and predict an initial p • Adjust p in later iterations to account for variance in the measured values: make the CPU and the GPU finish simultaneously
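Under the hedged reconstruction above, the lost minimization expression would put the optimum where both devices finish simultaneously, which is exactly what the two-step heuristic measures:

```latex
% Setting T_c(p) = T_g(p):  T_o + p\,T_{cp} = T_o + (1-p)\,T_{gp}
p^{*} = \frac{T_{gp}}{T_{cp} + T_{gp}}
% Running with p = 1 measures T_cp; running with p = 0 measures T_gp.
```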

  20. Applications: three representatives • Gridding kernel from scientific computing • Single pass: converts visibilities into a grid model of the sky • The Expectation-Maximization (EM) algorithm from data mining • Iterative: estimates a vector of parameters • Two consecutive steps: the Expectation step (E-step) and the Maximization step (M-step) • PageRank from graph mining • Iterative: calculates the relative importance of web pages • Essentially a matrix-vector multiplication algorithm
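For reference, the standard PageRank power iteration makes the matrix-vector structure explicit (d is the damping factor, N the number of pages, M the column-normalized adjacency matrix of the web graph):

```latex
r^{(t+1)} = d\,M\,r^{(t)} + \frac{1-d}{N}\,\mathbf{1}
```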

  21. Applications: Optimizations (I) • The Expectation-Maximization algorithm • A large intermediate matrix is produced between the E-step and the M-step • Broadcasting such a large matrix among all nodes could incur substantial network communication costs • Optimization: on the same node, the M-step reads the same subset of the intermediate matrix as produced by the E-step (use of a common partitioner) • PageRank • Data-copying overheads are significant on GPUs • Smaller input-vector splits are shared by larger matrix blocks that need further splitting • Optimization: copy shared input-vector splits only once to save copying time (fine-grained copying)
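A hedged sketch of the fine-grained copying idea; the cache, the copy call, and the kernel launch are illustrative (a real implementation would use cudaMemcpy and a CUDA kernel):

```cpp
#include <set>
#include <utility>
#include <vector>

// A vector split shared by several matrix blocks is copied to GPU memory
// only once; later blocks that use it skip the copy.
struct DeviceCache {
    std::set<int> resident;                   // vector splits already on the GPU
    void ensure_on_gpu(int split_id) {
        if (resident.count(split_id)) return; // already copied: skip
        // cudaMemcpy(device_buf[split_id], host_buf[split_id], ...);  (assumed)
        resident.insert(split_id);
    }
};

// block_to_split: (matrix block, input-vector split it multiplies with)
void process_blocks(DeviceCache& cache,
                    const std::vector<std::pair<int, int>>& block_to_split) {
    for (const auto& [block, split] : block_to_split) {
        cache.ensure_on_gpu(split);           // copy each shared split once
        // launch_multiply(block, split);     // GPU mat-vec on this block (assumed)
        (void)block;                          // silence unused warning in this sketch
    }
}
```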

  22. Applications: Optimizations (II) • Outline of data copying and computation on GPUs

  23. Example User Code • Gridding kernel [figure]
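The user-code figure is lost in the transcript. Below is a hedged sketch of what such user code might look like under the API shape sketched earlier; MATE-CG's real example differs, and real gridding convolves each visibility over nearby cells rather than updating a single cell:

```cpp
#include <cstddef>

struct Visibility { float u, v, value; };

// CPU-side reduction: fold one chunk of visibilities into the sky grid
// (the reduction object), drastically simplified to nearest-cell updates
// and assuming non-negative grid coordinates.
void gridding_cpu_reduction(const Visibility* vis, std::size_t n,
                            float* grid, std::size_t grid_dim) {
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t gu = static_cast<std::size_t>(vis[i].u) % grid_dim;
        const std::size_t gv = static_cast<std::size_t>(vis[i].v) % grid_dim;
        grid[gu * grid_dim + gv] += vis[i].value;  // accumulate into the grid
    }
}
// A GPU_Reduction counterpart would launch a CUDA kernel performing the
// same accumulation with atomic adds on a device-resident grid.
```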

  24. Experiment Design (I) • Experimental platform • A heterogeneous CPU-GPU cluster • Each node has one Intel 8-core CPU and an NVIDIA Tesla (Fermi) GPU (448 cores) • Used up to 128 CPU cores and 7168 GPU cores on 16 nodes

  25. Experiment Design (II) • Three representative applications • Gridding kernel, EM, and PageRank • Each application was run in four modes on the cluster: • CPU-1: 1 CPU core per node, as the baseline • CPU-8: 8 CPU cores per node • GPU-only: only the GPU on each node • CPU-8-n-GPU: both 8 CPU cores and the GPU on each node

  26. Experiment Design (III) • We focused on three aspects: • For each application: scalability with an increasing number of GPUs, and performance improvement over CPUs • Performance improvement from heterogeneous execution and effectiveness of the auto-tuning framework • Performance impact of application-driven optimizations, with examples of system tuning

  27. Results: Scalability with # of GPUs (I) • PageRank: 64GB dataset; a graph of 1 billion nodes and 4 billion edges • [chart; data labels: 7.0, 6.8, 6.3, 5.0; 16%]

  28. Results: Scalability with # of GPUs (II) • Gridding Kernel: 32GB dataset; a collection of 800 million visibilities and a 6.4GB sky grid • [chart; data labels: 7.5, 7.2, 6.9, 6.5; 25%]

  29. Results: Scalability with # of GPUs (III) • EM: 32GB dataset; a cluster of 1 billion points • [chart; data labels: 7.6, 6.8, 5.0, 15.0, 3.0]

  30. Results: Auto-tuning (I) • PageRank: 64GB dataset on 16 nodes • [chart; data labels: 7%; p = 0.30]

  31. Results: Auto-tuning (II) • EM: 32GB dataset on 16 nodes • [chart; data labels: E: 29%, M: 24%; E: p = 0.31, M: p = 0.27]

  32. Results: Heterogeneous Execution • Gridding Kernel: 32GB dataset on 16 nodes • [chart; data labels: >=56%, >=42%]

  33. Results: App-Driven Optimizations (I) • EM: 4GB dataset with a 20GB intermediate matrix • [chart; data labels: 1.7, 7.7]

  34. Results: App-Driven Optimizations (II) • PageRank: 32GB dataset with a block size of 512MB and a GPU chunk size of 128MB • [chart; data label: 24%]

  35. Results: Examples of System Tuning • Gridding Kernel: 32GB dataset; varying cpu_chunk_size and gpu_chunk_size • [chart; data labels: 16 MB, 512MB]

  36. Insights • GPUs can significantly accelerate certain classes of computations, but they exhibit programming difficulties and introduce data-copying overheads • A suitable data mapping between the CPU and the GPU is crucial for the overall performance of heterogeneous computing • Application-specific opportunities should be exploited, e.g., through optional APIs, to make the best use of a parallel system • Automatic optimization is desirable for choosing the correct set of system parameter settings

  37. Related Work • Data-intensive computing with map-reduce-like models • Multi-core CPUs: Phoenix, Phoenix-rebirth, … • A single GPU: Mars, MapCG, … • GPU clusters: MITHRA, IDAV, … • CPU-GPU clusters: our MATE-CG • Programming heterogeneous systems • Software end: Merge, EXOCHI, Harmony, Qilin, … • Hardware end: CUBA, CUDA, OpenCL, … • Auto-tuning: many efforts • Basic idea: search for the best solution among all possibilities • Map solutions to performance metrics • Map hardware/software characteristics to parameters • Very useful for library generators, compilers, runtime systems, …

  38. Conclusions • MATE-CG supports a map-reduce-like API to ease the difficulty of programming a heterogeneous CPU-GPU cluster • The system achieves good scalability with an increasing number of nodes, and heterogeneous execution further improves performance over CPU-only or GPU-only execution • It introduces a novel and effective auto-tuning approach for choosing the best data mapping between the CPU and the GPU • Application-specific optimizations should be considered in the user code, and a high-level API should be coupled with significant auto-tuning support to identify the right system parameter settings automatically

  39. Questions?
