
Optimizing MapReduce for GPUs with Effective Shared Memory Usage




  1. Optimizing MapReduce for GPUs with Effective Shared Memory Usage • Department of Computer Science and Engineering, The Ohio State University • Linchuan Chen and Gagan Agrawal

  2. Outline • Introduction • Background • System Design • Experiment Results • Related Work • Conclusions and Future Work

  3. Introduction • Motivations • GPUs • Suitable for extreme-scale computing • Cost-effective and power-efficient • MapReduce Programming Model • Emerged with the development of data-intensive computing • GPUs have been shown to be suitable for implementing MapReduce • Utilizing the fast but small shared memory for MapReduce is challenging • Storing (key, value) pairs leads to high memory overhead, prohibiting the use of shared memory

  4. Introduction • Our approach • Reduction-based method • Reduce each (key, value) pair into the reduction object immediately after it is generated by the map function • Very suitable for reduction-intensive applications • A general and efficient MapReduce framework • Dynamic memory allocation within a reduction object • Maintaining a memory hierarchy • Multi-group mechanism • Overflow handling

  5. Outline • Introduction • Background • System Design • Experiment Results • Related Work • Conclusions and Future Work

  6. MapReduce [Figure: map tasks (M) feed a group-by-key stage, which feeds reduce tasks (R)]

  7. MapReduce • Programming Model • Map() • Generates a large number of (key, value) pairs • Reduce() • Merges the values associated with the same key • Efficient Runtime System • Parallelization • Concurrency Control • Resource Management • Fault Tolerance • … …
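To make the model concrete, here is a minimal word-count example in the classic map/reduce style, written as CUDA device functions. It is an illustrative sketch, not the framework's actual API: emit() stands for the runtime's pair-collection hook (the same helper used in the pseudocode on slide 10).

    // Word count in the classic MapReduce style (illustrative sketch).
    // emit() is assumed to hand a (key, value) pair to the runtime.
    __device__ void map(const char* word) {
        emit(word, 1);                  // one (word, 1) pair per occurrence
    }

    __device__ void reduce(const char* key, const int* values, int n) {
        int total = 0;
        for (int i = 0; i < n; ++i)
            total += values[i];         // merge all counts for this word
        emit(key, total);               // final (word, count) pair
    }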

  8. GPUs • Processing Component • Memory Component [Figure: CUDA architecture — the host launches kernels as grids of thread blocks; each block has its own shared memory, each thread its own registers and local memory; all blocks share device (global), constant, and texture memory]

  9. Outline • Introduction • Background • System Design • Experiment Results • Related Work • Conclusions and Future Work

  10. System Design • Traditional MapReduce

    map(input) {
        (key, value) = process(input);
        emit(key, value);
    }

    // grouping of the key-value pairs (by the runtime system)

    reduce(key, iterator) {
        for each value in iterator
            result = operation(result, value);
        emit(key, result);
    }

  11. System Design • Reduction-based approach

    map(input) {
        (key, value) = process(input);
        reductionobject->insert(key, value);
    }

    reduce(value1, value2) {
        value1 = operation(value1, value2);
    }

  • Reduces the memory overhead of storing key-value pairs • Makes it possible to effectively utilize shared memory on a GPU • Eliminates the need for grouping • Especially suitable for reduction-intensive applications
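As a concrete illustration, below is a minimal CUDA sketch of what insert() might look like for a word-count-style workload, assuming a fixed-size open-addressing table of integer keys held in shared memory (keys initialized to -1, locks to 0). These names and the hashing scheme are assumptions; the actual reduction object supports variable-length keys and values through a memory allocator, as described on the following slides.

    // Illustrative sketch: reduce a (key, value) pair into a hash-table
    // reduction object held in shared memory. Assumed layout: keys[]
    // initialized to -1 (empty), locks[] to 0 (unlocked).
    __device__ void insert(int key, int value,
                           int* keys, int* counts, int* locks,
                           int num_buckets) {
        int b = key % num_buckets;                  // hash to a bucket
        for (;;) {
            if (atomicCAS(&locks[b], 0, 1) == 0) {  // try to lock bucket b
                bool hit = (keys[b] == key || keys[b] == -1);
                if (hit) {
                    keys[b] = key;                  // claim bucket if empty
                    counts[b] += value;             // reduce immediately: the
                }                                   // pair itself is never stored
                __threadfence_block();              // publish writes before unlock
                atomicExch(&locks[b], 0);           // release the lock
                if (hit) return;
                b = (b + 1) % num_buckets;          // collision: linear probing
            }
        }
    }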

  12. Challenges • Result collection and overflow handling • Maintain a memory hierarchy • Trade off space requirements against locking overhead • A multi-group scheme • To keep the framework general and efficient • A well-defined data structure for the reduction object

  13. Memory Hierarchy [Figure: each thread block keeps reduction objects in its shared memory; these are merged into a single reduction object in device memory, whose contents are copied to a result array and finally to host memory on the CPU]

  14. Reduction Object • Updating the reduction object • Use locks to synchronize • Memory allocation in reduction object • Dynamic memory allocation • Multiple offsets in device memory reduction object
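A sketch of how dynamic allocation inside the object might work: a simple atomic bump allocator over the object's data pool. The names next_free and pool_size are assumptions; per the slide, the device memory object keeps multiple such offsets to reduce contention on the allocator.

    // Illustrative bump allocator for a reduction object's data pool.
    // Returns the offset of a freshly reserved nbytes region, or -1 if
    // the pool is exhausted (the caller then triggers overflow handling).
    __device__ int ro_alloc(int* next_free, int pool_size, int nbytes) {
        int offset = atomicAdd(next_free, nbytes);  // reserve space atomically
        return (offset + nbytes <= pool_size) ? offset : -1;
    }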

  15. Reduction Object [Figure: per-bucket index arrays KeyIdx[i] and ValIdx[i] point into a data pool managed by the memory allocator; each entry stores key size, value size, key data, and value data]
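The layout in the figure can be read roughly as the following C struct; all field names are inferred from the figure and should be treated as illustrative, not the authors' exact definitions.

    // Rough reconstruction of the reduction object layout (illustrative).
    struct ReductionObject {
        int  num_buckets;
        int* locks;      // one lock per bucket
        int* key_idx;    // KeyIdx[i]: offset of bucket i's key in pool[]
        int* val_idx;    // ValIdx[i]: offset of bucket i's value in pool[]
        int  next_free;  // bump-allocator cursor into pool[]
        char pool[];     // packed entries: key size, val size, key data, val data
    };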

  16. Multi-group Scheme • Locks are used for synchronization • The large number of threads in each thread block leads to severe contention on the shared memory RO • One solution: full replication • every thread owns a shared memory RO • leads to memory overhead and combination overhead • Trade-off: the multi-group scheme • divide the threads in each thread block into multiple sub-groups • each sub-group owns a shared memory RO (see the kernel skeleton below) • Choice of the number of groups trades contention overhead against combination overhead
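The scheme might look like the following kernel skeleton, in which each thread block carves its shared memory into NUM_GROUPS reduction objects and assigns each thread to one of them; the group count and object size are illustrative assumptions (the experiments evaluate 1, 2, and 4 groups).

    #define NUM_GROUPS 4    // tunable: more groups mean less lock contention
                            // but more memory and combination overhead
    #define RO_INTS    512  // size of each sub-group's reduction object

    __global__ void map_kernel(const int* input, int n) {
        __shared__ int ro[NUM_GROUPS][RO_INTS];  // one RO per sub-group
        int group = threadIdx.x % NUM_GROUPS;    // this thread's sub-group
        int* my_ro = ro[group];                  // the only RO this thread locks

        for (int i = blockIdx.x * blockDim.x + threadIdx.x;
             i < n; i += gridDim.x * blockDim.x) {
            // map input[i] and insert the resulting pair into my_ro
            // (see the insert() sketch on slide 11)
        }
        __syncthreads();
        // combine the NUM_GROUPS objects, then merge into the device memory RO
    }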

  17. Overflow Handling • Swapping (sketched below) • Merge the full shared memory ROs into the device memory RO • Empty the full shared memory ROs • In-object sorting • Sort the buckets in the reduction object and discard the data that is no longer useful • Users define how two buckets are compared
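A sketch of the swapping step, assuming the hash-table layout from the earlier insert() sketch: when a block's shared memory RO fills up, every occupied bucket is reduced into the larger device memory RO and the shared object is emptied. insert_device() is an assumed counterpart of insert() that operates on the device memory object.

    __device__ void swap_out(int* s_keys, int* s_counts, int num_buckets,
                             int* d_keys, int* d_counts, int* d_locks,
                             int d_num_buckets) {
        __syncthreads();                        // all updates to the shared RO done
        for (int b = threadIdx.x; b < num_buckets; b += blockDim.x) {
            if (s_keys[b] != -1) {              // occupied bucket
                insert_device(s_keys[b], s_counts[b],    // reduce into the
                              d_keys, d_counts, d_locks, // device memory RO
                              d_num_buckets);
                s_keys[b] = -1;                 // empty the shared bucket
                s_counts[b] = 0;
            }
        }
        __syncthreads();                        // the shared RO is empty again
    }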

  18. Discussion • Reduction-intensive applications • Our framework has a clear advantage • Applications with little or no reduction • No need to use shared memory • Users need to set up system parameters • Developing auto-tuning techniques is future work

  19. Extension for Multi-GPU • Shared memory usage can speed up single-node execution • Potentially benefits the overall performance • Reduction objects can avoid global shuffling overhead • Can also reduce communication overhead

  20. Outline • Introduction • Background • System Design • Experiment Results • Related Work • Conclusions and Future Work

  21. Experiment Results • Applications used • 5 reduction-intensive • 2 map computation-intensive • Tested with small, medium, and large datasets • Evaluation of the multi-group scheme • 1, 2, and 4 groups • Comparison with other implementations • Sequential implementations • MapCG • Ji et al.'s work • Evaluation of the swapping mechanism • Tested with a large number of distinct keys

  22. Evaluation of the Multi-group Scheme

  23. Comparison with Sequential Implementations

  24. Comparison with MapCG • With reduction-intensive applications

  25. Comparison with MapCG • With other applications

  26. Comparison with Ji et al.'s work

  27. Evaluation of the Swapping Mechanism • vs. MapCG and Ji et al.'s work

  28. Evaluation of the Swapping Mechanism • vs. MapCG

  29. Evaluation of the Swapping Mechanism • swap_frequency = num_swaps / num_tasks

  30. Outline • Introduction • Background • System Design • Experiment Results • Related Work • Conclusions and Future Work

  31. Related Work • MapReduce for multi-core systems • Phoenix, Phoenix Rebirth • MapReduce on GPUs • Mars, MapCG • MapReduce-like framework on GPUs for SVM • Catanzaro et al. • MapReduce in heterogeneous environments • MITHRA, IDAV • Utilizing shared memory of GPUs for specific applications • Nyland et al., Gutierrez et al. • Compiler optimizations for utilizing shared memory • Baskaran et al. (PPoPP '08), Moazeni et al. (SASP '09)

  32. Conclusions and Future Work • Reduction-based MapReduce • Storing the reduction object across the GPU memory hierarchy • A multi-group scheme • Improved performance compared with previous implementations • Future work: extend the framework to support new architectures

  33. Thank you!
