GPUTeraSort: High Performance Graphics Co-processor Sorting for Large Data Management

Presentation Transcript

  1. GPUTeraSort: High Performance Graphics Co-processor Sorting for Large Data Management • Naga K. Govindaraju, Jim Gray, Ritesh Kumar, Dinesh Manocha • http://gamma.cs.unc.edu/GPUTERASORT

  2. Sorting “I believe that virtually every important aspect of programming arises somewhere in the context of sorting or searching!” -Don Knuth

  3. Sorting • Well studied • High performance computing • Databases • Computer graphics • Programming languages • ... • Google MapReduce algorithm • SPEC benchmark routine!

  4. Massive Databases • Terabyte data sets are common • Google sorts more than 100 billion terms in its index • More than 1 trillion records indexed on the web! • Database sizes are rapidly increasing! • Maximum DB size increases 3x per year (http://www.wintercorp.com) • Processor improvements not matching information explosion

  5. CPU vs. GPU [diagram: CPU (3 GHz) with 2 x 1 MB cache, system memory (2 GB) and AGP memory (512 MB), connected over a PCI-E bus (4 GB/s) to a GPU (690 MHz) with video memory (512 MB)]

  6. External Memory Sorting • Performed on terabyte-scale databases • Two-phase algorithm [Vitter01, Salzberg90, Nyberg94, Nyberg95] • Limited main memory • First phase – partitions the input file into large data chunks and writes sorted chunks known as “runs” • Second phase – merges the “runs” to generate the sorted file
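
The two phases on this slide can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the GPUTeraSort implementation: records are newline-free strings, the "input file" is an in-memory list, and the names `write_runs`, `merge_runs` and `external_sort` are hypothetical. Phase 1 sorts fixed-size chunks and writes each one to disk as a run; phase 2 streams the runs back and merges them.

```python
import heapq
import os
import tempfile


def write_runs(records, run_size, tmp_dir):
    """Phase 1: sort chunks of `run_size` records and write each chunk as a run file."""
    run_paths = []
    for start in range(0, len(records), run_size):
        run = sorted(records[start:start + run_size])
        path = os.path.join(tmp_dir, f"run_{len(run_paths)}.txt")
        with open(path, "w") as f:
            f.writelines(record + "\n" for record in run)
        run_paths.append(path)
    return run_paths


def merge_runs(run_paths):
    """Phase 2: k-way merge of the sorted runs, reading each run sequentially."""
    files = [open(path) for path in run_paths]
    try:
        streams = [(line.rstrip("\n") for line in f) for f in files]
        return list(heapq.merge(*streams))
    finally:
        for f in files:
            f.close()


def external_sort(records, run_size):
    """Run both phases using a temporary directory for the run files."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        return merge_runs(write_runs(records, run_size, tmp_dir))
```

For example, `external_sort(["pear", "apple", "fig", "kiwi"], run_size=2)` writes two runs and merges them into one sorted list. In a real sorter the run size would be chosen from the available memory, and the merge output would be written to disk rather than collected in a list.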

  7. External Memory Sorting • Performance mainly governed by I/O • Salzberg Analysis: Given the main memory size M and the file size N, if the I/O read size per run is T in phase 2, external memory sorting achieves efficient I/O performance if the run size R in phase 1 is given by R ≈ √(TN)

  8. External Memory Sorting • Given the main memory size M and the file size N, if the I/O read size per run is T in phase 2, external memory sorting achieves efficient I/O performance if the run size R in phase 1 is given by R ≈ √(TN) [figure label: N]

  9. External Memory Sorting • Given the main memory size M and the file size N, if the I/O read size per run is T in phase 2, external memory sorting achieves efficient I/O performance if the run size R in phase 1 is given by R ≈ √(TN) [figure label: R]

  10. External Memory Sorting • Given the main memory size M and the file size N, if the I/O read size per run is T in phase 2, external memory sorting achieves efficient I/O performance if the run size R in phase 1 is given by R ≈ √(TN) [figure label: T]

  11. Salzberg Analysis • If N=100GB, T=2MB, then R ≈ 230MB • Large data sorting on CPUs can achieve high I/O performance by sorting large runs
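
The formula R ≈ √(TN) can be evaluated directly. The sketch below assumes decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes) and a hypothetical helper name, `salzberg_run_size`. It also checks one way to read the formula, which is an assumption rather than something stated on the slides: if phase 2 keeps one read buffer of size T per run, the N/R buffers together occupy exactly R bytes when R = √(TN). With these units the direct evaluation comes out at roughly 447 MB, not the 230 MB quoted on the slide; the transcript does not say which unit conventions or additional factors lie behind the slide's figure.

```python
import math

MB = 10**6  # decimal megabyte (assumption)
GB = 10**9  # decimal gigabyte (assumption)


def salzberg_run_size(file_size_bytes, read_size_bytes):
    """Run size R = sqrt(T * N) for file size N and phase-2 read size T."""
    return math.sqrt(read_size_bytes * file_size_bytes)


n = 100 * GB
t = 2 * MB
r = salzberg_run_size(n, t)
num_runs = n / r

print(f"run size: {r / MB:.0f} MB")
print(f"number of runs: {num_runs:.0f}")
# One T-sized buffer per run needs num_runs * T bytes, which equals R.
print(f"phase-2 buffer memory: {num_runs * t / MB:.0f} MB")
```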

  12. Massive Data Handling on CPUs • Require random memory accesses • Small CPU caches (< 2MB) • Slower than even sequential disk accesses – bottleneck shift from I/O to memory • Widening memory to compute gap! • External memory sorting on CPUs can have low performance due to • High memory latency on account of cache misses • Or low I/O performance • Sorting is hard!

  13. Graphics Processing Units (GPUs) • Commodity processor for graphics applications • Massively parallel vector processors • High memory bandwidth • Low memory latency pipeline • Programmable • High growth rate

  14. GPU: Commodity Processor • Laptops • Consoles • Cell phones • PSP • Desktops

  15. Graphics Processing Units (GPUs) • Commodity processor for graphics applications • Massively parallel vector processors • 10x more operations per sec than CPUs • High memory bandwidth • Low memory latency pipeline • Programmable • High growth rate

  16. Parallelism on GPUs • Graphics FLOPS • GPU – 1.3 TFLOPS • CPU – 25.6 GFLOPS

  17. Graphics Processing Units (GPUs) • Commodity processor for graphics applications • Massively parallel vector processors • High memory bandwidth • Better hides memory latency • Programmable • 10x more memory bandwidth than CPUs • High growth rate

  18. Graphics Pipeline • Low pipeline depth • 56 GB/s • Hides memory latency!! [diagram stages: programmable vertex processing (fp32); polygon setup, culling, rasterization; programmable per-pixel math (fp32); per-pixel texture, fp16 blending; Z-buf, fp16 blending, anti-alias (MRT); data passed between stages: vertex, polygon, pixel, texture, image; memory]

  19. Non-Graphics Pipeline Abstraction • Courtesy: David Kirk, Chief Scientist, NVIDIA [diagram stages: programmable MIMD processing (fp32); SIMD “rasterization”; programmable SIMD processing (fp32); data fetch, fp16 blending; predicated write, fp16 blend, multiple output; data passed between stages: data, setup lists, rasterizer; memory]

  20. Graphics Processing Units (GPUs) • Commodity processor for graphics applications • Massively parallel vector processors • High memory bandwidth • Low memory latency pipeline • Programmable • High growth rate

  21. Technology Trends: CPU and GPU [chart: log of relative processing power by year, 2002 to 2008, comparing CPU and GPU growth against the Moore’s Law trajectory]

  22. Architecture of Phase 1: GPUTeraSort

  23. GPUs for Sorting: Issues • No support for arbitrary writes • Optimized CPU algorithms do not map! • Requires new algorithms – sorting networks • Lack of support for general data types • Out-of-core algorithms • Limited GPU memory • Difficult to program

  24. General Sorting on GPUs • Sorting networks: No data dependencies • Utilize high parallelism on GPUs • To handle large keys, use bitonic radix sort • Perform bitonic sort on the 4 most significant bytes (MSB) using GPUs, compute sorted records with equal 4 MSBs, proceed to the next 4 bytes on those and so on • Can handle any length keys
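
The wide-key scheme on this slide (sort on the 4 most significant bytes, then refine groups of records that tie on those bytes using the next 4 bytes) can be sketched as follows. This is a CPU-only illustration under stated assumptions: Python's built-in `sorted` stands in for the GPU bitonic sort pass, keys are `bytes` objects, and the name `sort_wide_keys` is hypothetical.

```python
def sort_wide_keys(keys, offset=0, width=4):
    """Sort byte-string keys by comparing `width` bytes at a time, starting at `offset`."""
    if len(keys) <= 1:
        return list(keys)
    # Stand-in for the GPU pass: order by the current chunk of the key only.
    ordered = sorted(keys, key=lambda k: k[offset:offset + width])
    result = []
    i = 0
    while i < len(ordered):
        chunk = ordered[i][offset:offset + width]
        j = i
        # Find the group of keys that tie on this chunk.
        while j < len(ordered) and ordered[j][offset:offset + width] == chunk:
            j += 1
        group = ordered[i:j]
        # Resolve ties on the next chunk, if any key in the group has bytes left.
        if len(group) > 1 and any(len(k) > offset + width for k in group):
            group = sort_wide_keys(group, offset + width, width)
        result.extend(group)
        i = j
    return result
```

For 10-byte keys this takes at most three passes (4, 4 and 2 bytes), and later passes only touch groups that tied on all earlier bytes, which is why the approach handles keys of any length.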

  25. GPU-Based Sorting Networks • Represent data as 2D arrays • Multi-stage algorithm • Each stage involves multiple steps • In each step • Compare one array element against exactly one other element at fixed distance • Perform a conditional assignment (MIN or MAX) at each element location
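
The stage/step structure described here matches the textbook bitonic sorting network, sketched below in plain Python. The sketch assumes a power-of-two input length and uses the standard partner rule `i XOR dist`; it is not necessarily the exact network used in GPUTeraSort. The inner loop over `i` is the part a GPU would execute in parallel, since each output element depends only on two elements of the previous array.

```python
def bitonic_sort(values):
    """Sort a list whose length is a power of two with a bitonic sorting network."""
    a = list(values)
    n = len(a)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    stage = 2
    while stage <= n:          # each stage merges bitonic sequences of length `stage`
        dist = stage // 2
        while dist >= 1:       # each step compares elements exactly `dist` apart
            out = a[:]         # a step reads the old array and writes a new one
            for i in range(n):
                partner = i ^ dist
                ascending = (i & stage) == 0
                # The lower index keeps MIN in ascending blocks, MAX in descending ones.
                keep_min = (i < partner) == ascending
                if keep_min:
                    out[i] = min(a[i], a[partner])
                else:
                    out[i] = max(a[i], a[partner])
            a = out
            dist //= 2
        stage *= 2
    return a
```

Because the comparison pattern is fixed in advance and does not depend on the data, every step is a conditional assignment with no scattered writes, which suits hardware that lacks support for arbitrary writes.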

  26. [Flash animation removed to save space (46 MB)]

  27. 2D Memory Addressing • GPUs optimized for 2D representations • Map 1D arrays to 2D arrays • Minimum and maximum regions mapped to row-aligned or column-aligned quads
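
A minimal sketch of the 1D-to-2D mapping, assuming a row-major layout with a power-of-two row width and the XOR partner rule of a standard bitonic network (the helper names are hypothetical). It shows why comparison partners line up along rows or columns: a compare distance smaller than the width changes only the column, while a distance that is a multiple of the width changes only the row.

```python
def to_2d(index, width):
    """Map a 1D array index to (row, col) in a row-major 2D array."""
    return divmod(index, width)


def to_1d(row, col, width):
    """Map (row, col) back to the 1D array index."""
    return row * width + col


def partner_2d(row, col, dist, width):
    """2D position of the element compared against (row, col) at distance `dist`."""
    return to_2d(to_1d(row, col, width) ^ dist, width)


# Width-4 array: distance 1 stays in the same row, distance 8 stays in the same column.
print(partner_2d(1, 2, dist=1, width=4))
print(partner_2d(1, 2, dist=8, width=4))
```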

  28. 1D – 2D Mapping MIN MAX

  29. 1D – 2D Mapping [figure label: MIN] • Effectively reduces instructions per element

  30. Sorting on GPU: Pipelining and Parallelism • Input Vertices • Texturing, Caching and 2D Quad Comparisons • Sequential Writes

  31. Comparison with GPU-Based Algorithms 3-6x faster than prior GPU-based algorithms!

  32. GPU vs. High-End Multi-Core CPUs • 2-2.5x faster than Intel high-end processors • Single GPU performance comparable to high-end dual core Athlon • Hand-optimized CPU code from Intel Corporation!

  33. Slashdot and Tom’s Hardware News Headlines • Super-Moore’s Law Growth • 50 GB/s on a single GPU • Peak Performance: Effectively hide memory latency with 15 GOP/s • Download URL: http://gamma.cs.unc.edu/GPUSORT

  34. Implementation & Results • Pentium IV PC ($170) • NVIDIA 7800 GT ($270) • 2 GB RAM ($152) • 9 80GB SATA disks ($477) • SuperMicro Motherboard & SATA Controller ($325) • Windows XP • PC costs $1469

  35. Implementation & Results • Indy SortBenchmark • 10 byte random string keys • 100 byte long records • Sort maximum amount in 644 seconds

  36. Overall Performance Faster and more scalable than Dual Xeon processors (3.6 GHz)!

  37. Performance/$ • 1.8x faster than current Terabyte sorter • World’s best price-to-performance system • http://research.microsoft.com/barc/SortBenchmark

  38. Analysis: I/O Performance • Salzberg Analysis: 100 MB Run Size • Peak sequential throughput in MB/s

  39. Analysis: I/O Performance • Salzberg Analysis: 100 MB Run Size • Pentium IV: 25MB Run Size (to reduce memory latency) • Less work and only 75% IO efficient!

  40. Analysis: I/O Performance • Salzberg Analysis: 100 MB Run Size • Dual 3.6 GHz Xeons: 25MB Run size (to reduce memory latency) • More cores, less work but only 85% IO efficient!

  41. Analysis: I/O Performance • Salzberg Analysis: 100 MB Run Size • 7800 GT: 100MB run size • Ideal work, and 92% IO efficient with single CPU!

  42. Task Parallelism • Performance limited by IO and memory [chart: reorder or sequential IO vs. sorting 100MB on GPU] • Sorting 100MB on GPU: 3x > reorder or sequential IO

  43. Why GPU-like Architectures for Large Data Management? GPU Plateau: Data Management Performance Crisis

  44. Advantages • Exploit high memory bandwidth on GPUs • Higher memory performance than CPU-based algorithms • High I/O performance due to large run sizes

  45. Advantages • Offload work from CPUs • CPU cycles well-utilized for resource management • Scalable solution for large databases • Best performance/price solution for terabyte sorting

  46. Limitations • May not work well on variable-sized keys and almost sorted databases • Requires programmable GPUs (GPUs manufactured after 2003)

  47. Conclusions • Designed new sorting algorithms on GPUs • Handles wide keys and long records • Achieves 10x higher memory performance • Memory efficient sorting algorithm with peak memory performance of (50 GB/s) on GPUs • 15 GOP/sec on a single GPU

  48. Conclusions • Novel external memory sorting algorithm as a scalable solution • Achieves peak I/O performance on CPUs • Best performance/price solution – world’s fastest sorting system • High performance growth rate characteristics • Improve 2-3 times/yr

  49. Future Work • Designed high performance/price solutions • High wattage and cooling requirements of CPUs and GPUs • To exploit GPUs, we need easy-to-use programming APIs • Promising directions: BrookGPU, Microsoft Accelerator, Sh, etc. • Scientific libraries utilizing high parallelism and memory bandwidth • Scientific routines on LU, QR, SVD, FFT, etc. • BLAS library on GPUs • Eventually, build GPU-LAPACK and Matlab routines

  50. GPUFFTW • 4x faster than IMKL on high-end Quad cores • Slashdot Headlines, May 2006 • N. Govindaraju, S. Larsen, J. Gray and D. Manocha, Proc. of ACM SuperComputing, 2006 (to appear) • Download URL: http://gamma.cs.unc.edu/GPUFFTW