
Mars: A MapReduce Framework on Graphics Processors

Mars: A MapReduce Framework on Graphics Processors. Bingsheng He (1), Wenbin Fang, Qiong Luo (Hong Kong Univ. of Sci. and Tech.); Naga K. Govindaraju (Microsoft Corp.); Tuyong Wang (Sina Corp.). (1) Currently at Microsoft Research Asia.


Presentation Transcript


  1. Mars: A MapReduce Framework on Graphics Processors • Bingsheng He (1), Wenbin Fang, Qiong Luo (Hong Kong Univ. of Sci. and Tech.) • Naga K. Govindaraju (Microsoft Corp.) • Tuyong Wang (Sina Corp.) • (1) Currently at Microsoft Research Asia

  2. Overview • Motivation • Design • Implementation • Evaluation • Conclusion

  3. Overview • Motivation • Design • Implementation • Evaluation • Conclusion

  4. Graphics Processing Units • Massively multi-threaded co-processors • 240 streaming processors on the NVIDIA GTX 280 • ~1 TFLOPS of peak performance • High-bandwidth memory • 10x or more the peak bandwidth of main memory • 142 GB/s, 1 GB GDDR3 memory on the GTX 280

  5. Graphics Processing Units (Cont.) • High-latency GDDR memory • 200 clock cycles of latency • Latency hiding using a large number of concurrent threads (>8K) • Low context-switch overhead • Better architectural support for memory • Inter-processor communication using a local memory • Coalesced access • High-speed bus to the main memory • Current: PCI Express (4 GB/s)

  6. GPGPU • Linear algebra [Larsen 01, Fatahalian 04, Galoppo 05] • FFT [Moreland 03, Horn 06] • Matrix operations [Jiang 05] • Folding@home, SETI@home • Database applications • Basic operators [Naga 04] • Sorting [Govindaraju 06] • Join [He 08]

  7. GPGPU Programming • “Assembly languages” • DirectX, OpenGL • Graphics rendering pipelines • “C/C++” • NVIDIA CUDA, ATI CAL or Brook+ • Different programming models • Low portability among different hardware vendors • NV GPU code cannot run on AMD GPU • “Functional language”?

  8. MapReduce • Without worrying about hardware details: • Make GPGPU programming much easier. • Harness the high parallelism and high computational capability of GPUs.

  9. MapReduce Functions • Process lots of data to produce other data • Input & Output: a set of records in the form of key/value pair • Programmer specifies two functions • map (in_key, in_value) -> emit list(intermediate_key, intermediate_value) • reduce (out_key, list(intermediate_value)) -> emit list(out_key, out_value)
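The two functions on slide 9 can be illustrated with a small sequential sketch. This is not Mars code: `map_fn`, `reduce_fn`, and `map_reduce` are hypothetical names, and word counting is chosen only as a familiar example. It shows the map phase emitting intermediate pairs, the grouping of values by key, and the reduce phase emitting one output pair per key.

```python
from collections import defaultdict

def map_fn(in_key, in_value):
    """Emit one (word, 1) intermediate pair per word in the input line."""
    return [(word, 1) for word in in_value.split()]

def reduce_fn(key, values):
    """Sum all intermediate values collected for one key."""
    return [(key, sum(values))]

def map_reduce(records, map_fn, reduce_fn):
    """Sequential stand-in for a MapReduce runtime (illustrative only)."""
    # Map phase: every input record produces a list of intermediate pairs.
    intermediate = []
    for in_key, in_value in records:
        intermediate.extend(map_fn(in_key, in_value))

    # Group phase: collect all values that share an intermediate key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: one reduce call per distinct key, in sorted key order.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output

records = [(0, "gpu cpu gpu"), (1, "cpu gpu")]
result = map_reduce(records, map_fn, reduce_fn)
print(result)
```

On a GPU, each call to the map or reduce function would run as its own thread; the sketch keeps only the data flow.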

  10. MapReduce Workflow • Figure from http://labs.google.com/papers/mapreduce.html

  11. MapReduce outside Google • Hadoop [Apache project] • MapReduce on multicore CPUs: Phoenix [HPCA'07, Ranger et al.] • MapReduce on Cell [07, Kruijf et al.] • Merge [ASPLOS '08, Linderman et al.] • MapReduce on GPUs [STMCS'08, Catanzaro et al.]

  12. Overview • Motivation • Design • Implementation • Evaluation • Conclusion

  13. MapReduce on GPU • Software stack, top to bottom: • Applications (Web Analysis, Data Mining) • MapReduce (Mars) • Rendering APIs (DirectX) and GPGPU languages (CUDA, Brook+) • GPU Drivers

  14. MapReduce on Multi-core CPU (Phoenix [HPCA'07]) • Input -> Split -> Map -> Partition -> Reduce -> Merge -> Output

  15. Limitations on GPUs • Rely on the CPU to allocate memory • How to support variable-length data? • How to allocate the output buffer on GPUs? • Lack of lock support • How to synchronize to avoid write conflicts?

  16. Data Structure for Mars • Supports variable-length records • A record = <key, value, index entry> • An index entry = <key size, key offset, val size, val offset>
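One way to picture the record layout on slide 16 is three flat buffers: one for key bytes, one for value bytes, and one for fixed-size index entries. The sketch below assumes that arrangement and assumes four 32-bit unsigned fields per index entry; neither detail is stated on the slide. `pack_records` and `get_record` are hypothetical helper names.

```python
import struct

# Assumed layout of one index entry: four little-endian 32-bit unsigned ints
# <key size, key offset, val size, val offset>.
INDEX_FMT = "<4I"
ENTRY_SIZE = struct.calcsize(INDEX_FMT)  # 16 bytes per entry

def pack_records(pairs):
    """Pack variable-length (key, value) byte strings into flat buffers."""
    keys, vals, index = bytearray(), bytearray(), bytearray()
    for key, val in pairs:
        # Offsets are the current lengths of the key and value buffers.
        index += struct.pack(INDEX_FMT, len(key), len(keys), len(val), len(vals))
        keys += key
        vals += val
    return bytes(keys), bytes(vals), bytes(index)

def get_record(keys, vals, index, i):
    """Fetch record i using only its fixed-size index entry."""
    key_size, key_off, val_size, val_off = struct.unpack_from(
        INDEX_FMT, index, i * ENTRY_SIZE
    )
    return keys[key_off:key_off + key_size], vals[val_off:val_off + val_size]

pairs = [(b"a.html", b"12"), (b"index.html", b"7"), (b"b", b"1024")]
keys, vals, index = pack_records(pairs)
print(get_record(keys, vals, index, 1))
```

Because every index entry has the same size, a thread can locate record i with a single multiplication, even though the keys and values themselves vary in length.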

  17. Lock-free scheme for result output Basic idea: Calculate the offset for each thread on the output buffer.

  18. Lock-free scheme example • Pick out the odd numbers from the array [1, 3, 2, 3, 4, 7, 9, 8]. • The map function acts as a filter that keeps only the odd numbers.

  19. Lock-free scheme example • Threads T1, T2, T3, T4 over [1, 3, 2, 3, 4, 7, 9, 8] • Step 1: Histogram • Step 2: Prefix sum (total: 5)

  20. Lock-free scheme example • Threads T1, T2, T3, T4 over [1, 3, 2, 3, 4, 7, 9, 8] • Step 3: Allocate an output buffer of 5 elements, using the histogram total

  21. Lock-free scheme example • Threads T1, T2, T3, T4 over [1, 3, 2, 3, 4, 7, 9, 8] • Step 4: Computation, with each thread writing at its prefix-sum offset • Output: the five odd numbers 1, 3, 3, 7, 9

  22. Lock-free scheme • Histogram on key size, value size, and record count. • Prefix sum on key size, value size, and record count. • Allocate the output buffer in GPU memory. • Perform the computation. • Benefits: avoids write conflicts; allocates the output buffer exactly once.
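The four steps can be traced on the example array as it appears on slides 19 and 21. The sketch assumes each of the four threads handles a contiguous chunk of two elements, which the transcript does not state, and it simulates the threads with ordinary loops. `lock_free_filter` is a hypothetical name.

```python
from itertools import accumulate

def is_odd(x):
    return x % 2 == 1

def lock_free_filter(data, num_threads, keep):
    """Simulate the histogram / prefix-sum / allocate / compute steps."""
    # Assumption: thread t owns one contiguous chunk of the input.
    chunk = (len(data) + num_threads - 1) // num_threads
    chunks = [data[t * chunk:(t + 1) * chunk] for t in range(num_threads)]

    # Step 1: histogram. Each thread counts how many results it will emit.
    counts = [sum(1 for x in c if keep(x)) for c in chunks]

    # Step 2: exclusive prefix sum gives each thread its write offset.
    offsets = [0] + list(accumulate(counts))[:-1]

    # Step 3: allocate the output buffer exactly once.
    output = [None] * sum(counts)

    # Step 4: computation. Each thread writes only inside its own range,
    # so no locks are needed.
    for t, c in enumerate(chunks):
        pos = offsets[t]
        for x in c:
            if keep(x):
                output[pos] = x
                pos += 1
    return counts, offsets, output

counts, offsets, output = lock_free_filter([1, 3, 2, 3, 4, 7, 9, 8], 4, is_odd)
print(counts, offsets, output)
```

In Mars the same idea is applied to key sizes, value sizes, and record counts rather than to a single element count.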

  23. Mars Workflow • Input -> MapCount -> Prefix sum -> Allocate intermediate buffer on GPU -> Map -> Sort and Group -> ReduceCount -> Prefix sum -> Allocate output buffer on GPU -> Reduce -> Output

  24. Mars Workflow: Map Only • Input -> MapCount -> Prefix sum -> Allocate intermediate buffer on GPU -> Map -> Output • Map only, without grouping and reduce

  25. Mars Workflow: Without Reduce • Input -> MapCount -> Prefix sum -> Allocate intermediate buffer on GPU -> Map -> Sort and Group -> Output • Map and grouping, without reduce

  26. APIs of Mars • Runtime-provided: • AddMapInput • MapReduce • EmitInterCount • EmitIntermediate • EmitCount (optional) • Emit (optional) • User-defined: • MapCount • Map • Compare (optional) • ReduceCount (optional) • Reduce (optional)
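The workflow variants and the split between counting functions and working functions can be sketched together. This is an illustrative model, not the Mars API: the real signatures are not given in the transcript, so `run_stage`, `mars_like`, and the user functions below are hypothetical, and the counting functions simply return a count rather than calling EmitInterCount or EmitCount.

```python
from itertools import accumulate, groupby

def exclusive_scan(sizes):
    """Exclusive prefix sum: the write offset of each instance."""
    return [0] + list(accumulate(sizes))[:-1]

def run_stage(inputs, count_fn, work_fn):
    """Two-pass stage: count, prefix sum, allocate once, then write."""
    counts = [count_fn(*rec) for rec in inputs]   # MapCount / ReduceCount
    offsets = exclusive_scan(counts)              # Prefix sum
    buffer = [None] * sum(counts)                 # Allocate exactly once
    for rec, off, n in zip(inputs, offsets, counts):
        emitted = work_fn(*rec)                   # Map / Reduce
        assert len(emitted) == n, "count pass and work pass must agree"
        buffer[off:off + n] = emitted
    return buffer

def mars_like(inputs, map_count, map_fn,
              reduce_count=None, reduce_fn=None, group=True):
    inter = run_stage(inputs, map_count, map_fn)
    if not group:
        return inter                              # Map only
    inter.sort(key=lambda kv: kv[0])              # Sort and Group
    grouped = [(k, [v for _, v in g])
               for k, g in groupby(inter, key=lambda kv: kv[0])]
    if reduce_fn is None:
        return grouped                            # Map and grouping only
    return run_stage(grouped, reduce_count, reduce_fn)

# Hypothetical user-defined functions: count views per URL.
def map_count(_, url): return 1
def map_fn(_, url): return [(url, 1)]
def reduce_count(url, ones): return 1
def reduce_fn(url, ones): return [(url, sum(ones))]

log = [(0, "/a"), (1, "/b"), (2, "/a")]
full = mars_like(log, map_count, map_fn, reduce_count, reduce_fn)
map_only = mars_like(log, map_count, map_fn, group=False)
grouped = mars_like(log, map_count, map_fn)
print(full, map_only, grouped)
```

The counting pass is what lets the runtime size each buffer before any thread writes to it, which is why the user supplies MapCount alongside Map.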

  27. Overview • Motivation • Design • Implementation • Evaluation • Conclusion

  28. Mars-GPU • NVIDIA CUDA • Each map instance or reduce instance is a GPU thread. Mars-CPU • Operating system's thread APIs • Each map instance or reduce instance is a CPU thread.

  29. Optimization According to CUDA Features • Coalesced access • Multiple accesses to consecutive memory addresses are combined into one transfer. • Built-in vector types (int4, char4, etc.) • Multiple small data items are fetched in one memory request.
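A rough counting model shows why both optimizations reduce memory traffic. The numbers are assumptions chosen for illustration, not hardware specifications: a group of 16 threads, 4-byte words, and 64-byte memory segments. Actual coalescing rules depend on the GPU generation. `segments_touched` and `requests_needed` are hypothetical helpers.

```python
def segments_touched(addresses, segment_bytes=64):
    """Number of distinct memory segments a group of threads touches."""
    return len({addr // segment_bytes for addr in addresses})

def requests_needed(total_bytes, bytes_per_request):
    """Memory requests needed to fetch total_bytes at a given width."""
    return -(-total_bytes // bytes_per_request)  # ceiling division

THREADS = 16   # assumed group of threads issuing loads together
WORD = 4       # assumed 4-byte element

# Coalesced pattern: thread t reads word t, so addresses are consecutive.
consecutive = [t * WORD for t in range(THREADS)]

# Uncoalesced pattern: thread t reads the first word of its own 64-byte record.
strided = [t * 64 for t in range(THREADS)]

coalesced_segments = segments_touched(consecutive)
strided_segments = segments_touched(strided)

# Vector types: fetching 16 bytes as four scalar loads vs. one int4-sized load.
scalar_requests = requests_needed(16, 4)
vector_requests = requests_needed(16, 16)

print(coalesced_segments, strided_segments, scalar_requests, vector_requests)
```

Under these assumptions the consecutive pattern touches one segment where the strided pattern touches sixteen, and the vector load replaces four requests with one.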

  30. Overview • Motivation • Design • Implementation • Evaluation • Conclusion

  31. Experimental Setup • Comparison • CPU: Phoenix, Mars-CPU • GPU: Mars-GPU

  32. Applications • String Match (SM): Find the position of a string in a file. [S: 32MB, M: 64MB, L: 128MB] • Inverted Index (II): Build inverted index for links in HTML files. [S: 16MB, M: 32MB, L: 64MB] • Similarity Score (SS): Compute the pair-wise similarity score for a set of documents. [S: 512x128, M: 1024x128, L: 2048x128]

  33. Applications (Cont.) • Matrix Multiplication (MM): Multiply two matrices. [S: 512x512, M: 1024x1024, L: 2048x2048] • Page View Count (PVC): Count the number of distinct page views from web logs. [S: 32MB, M: 64MB, L: 96MB] • Page View Rank (PVR): Find the top-10 hot pages in the web log. [S: 32MB, M: 64MB, L: 96MB]

  34. Effect of Coalesced Access • Coalesced access achieves a speedup of 1.2-2x

  35. Effect of Built-In Data Types • Built-in data types achieve a speedup of up to 2x

  36. Time Breakdown of Mars-GPU GPU accelerates computation in MapReduce

  37. Mars-GPU vs. Phoenix on Quad-core CPU • The speedup is 1.5-16x across the various data sizes

  38. Mars-GPU vs. Mars-CPU The GPU accelerates MapReduce up to 7 times

  39. Mars-CPU vs. Phoenix Mars-CPU is 1-5 times as fast as Phoenix

  40. Overview • Motivation • Design • Implementation • Evaluation • Conclusion

  41. Conclusion • MapReduce framework on GPUs • Ease of GPU application development • Performance acceleration • Want a copy of Mars? http://www.cse.ust.hk/gpuqp/Mars.html

  42. Discussion • A uniform co-processing framework between the CPU and the GPU • High performance computation routines • Index serving • Data mining (on-going) • Power consumption benchmarking of the GPU • The GPU is a test bed for the future CPU. • …
