1 / 27

Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission

Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission. Haicheng Wu*, Gregory Diamos # , Jin Wang*, Srihari Cadambi ^, Sudhakar Yalamanchili *, Srimat Chakradhar ^ * Georgia Institute of Technology # NVIDIA Research ^ NEC Laboratories America.

raheem
Download Presentation

Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission Haicheng Wu*, Gregory Diamos#, Jin Wang*, SrihariCadambi^, SudhakarYalamanchili*, SrimatChakradhar^ *Georgia Institute of Technology #NVIDIA Research ^NEC Laboratories America Sponsors: National Science Foundation, LogicBlox Inc. , IBM, and NVIDIA

  2. The General Purpose GPU • GPU is a many core co-processor • 10s to 100s of cores • 1000s to 10,000s of concurrent threads • CUDA and OpenCL are the dominant programming models • Well suited for data parallel apps • Molecular Dynamics, Options Pricing, Ray Tracing, etc. • Commodity: led by NVIDIA, AMD, and Intel CPU (Multi Core) 2-10 Cores ③Execute ① Input Data GPU ~1500 Cores ② Launch Kernel ④ Result MAIN MEM ~128GB PCI-E GPU MEM ~6GB

  3. Enterprise: Amazon EC2 GPU Instance NVIDIA Tesla Amazon EC2 GPU Instances

  4. Data Warehousing Applications on GPUs • The good • Lots of potential data parallelism • If data fits in GPU mem, 2x—27x speedup has been shown • The bad • Very large data set (will not even fit in host memory) • I/O bound (GPU has no disk) • PCI data transfer takes 15–90% of the total time* • B. He, M. Lu, K. Yang, R. Fang, N. K. Govindaraju, Q. Luo, and P. V. Sander. Relational query co-processing on graphics processors. • In TODS, 2009.

  5. This Work • Goal: Demonstrate the benefits of Kernel Fusion/Kernel Fission in enabling Large data warehousing applications on GPUs • Assumptions • In-memory system • Host memory, not GPU memory • Not OLTP (Online Transaction Processing) type simple queries • Focus on data analysis instead of data entry/retrieval

  6. Two Optimizations for Data Movement CPU (Multi Core) 2-10 Cores Our solutions are: Kernel Fusion– Aggregate computation to reuse data Kernel Fission– Overlap computation with PCI transfer GPU ~1500 Cores MAIN MEM ~128GB PCI-E GPU MEM ~6GB ~16GB/s This is the problem!!!

  7. Relational Algebra (RA) Operators RA are building blocks of DB APPs 7

  8. Common RA Combinations of TPC-H 8

  9. Experimental Environment Using a sequence of SELECTs to demonstrate the benefits of Kernel Fusion/Fission 9

  10. PCI Bandwidth vs. GPU Computation Capacity GPU Computation Capacity (1 SELECT) < PCI Bandwidth 10

  11. Kernel Fusion A1: A2: A1: A2: A3: KernelA + +/- A3: Fused Kernel A1 A2 A3 A1 A2 - Kernel A A3 Fused Kernel A , B Kernel B KernelB Result Result

  12. Benefits of Kernel Fusion-Reduce Data Footprint (1) Temporal Locality Spatial Locality GPU GPU A1 A2 temp temp A3 Result GPU temp A1 A2 A3 Result Traverse the data only ONCE 12

  13. Benefits of Kernel Fusion-Reduce Data Footprint (2) Memory Efficiency Reduce Data Transfer A1 A2 A1 A2 Kernel A A3 input1 GPU MEM Temp A3 Temp Kernel B CPU MEM GPU MEM result1 Result A1 input2 A1 A2 A3 A2 result2 Fused Kernel A , B A3 GPU MEM Result

  14. Benefits of Kernel Fusion-Enlarge Optimization Scope Eliminate Common Stages Enable More Opt Kernel A Fused Kernel A, B Kernel B Larger code is good for other optimizations: a) instruction scheduling, b) register assignment, c) constant propagation ……

  15. Examples of Kernel Fusion Original 1 SELECT Fused 2 SELECTs 15

  16. Kernel Fusion-Overall Performance Including PCI PCI-e noise Excluding PCI 1.80x speedup 16

  17. Kernel Fusion-Breakdown Execution Time Not needed Faster filter and gather 17

  18. Kernel Fusion-Sensitivity Lower selected rate is better Fusing more kernels is better 18

  19. Kernel Fission-CUDA Stream • Commands (Kernel or Memcpy) of different CUDA STREAM can run in parallel • Commands in the same CUDA STREAM have to run in sequential 19

  20. Kernel Fission-Stream Pool Stream Pool is a library that abstracts away the details of CUDA STREAM 20

  21. Kernel Fission-Different Ways to Use CUDA Stream Concurrently running two kernels is not always beneficial small uses half resource asbig 21

  22. Example of Kernel Fission 1.37x speedup 22

  23. Kernel Fusion + Kernel Fission 1.41x serial 1.31x fusion only 1.10x fission only 23

  24. Real Queries-Q1 Totally 1.26x speedup Query Plan

  25. Real Queries-Q21 Totally 1.13x speedup Query Plan 25

  26. Conclusions • Two Data movement optimizations (Kernel Fusion & Kernel Fission) saves the memory transfer time and speeds up the computation time for Data Warehousing Apps. • Kernel Fusion • Does not need to dump intermediate temporary data • Enlarge the optimization scope • Kernel Fission works like double buffer that can overlap data transfer with GPU Computation

  27. Thank You Questions? 27

More Related