Presentation Transcript

GPUTeraSort: High Performance Graphics Co-processor Sorting for Large Data Management

Naga K. Govindaraju, Jim Gray, Ritesh Kumar, Dinesh Manocha

http://gamma.cs.unc.edu/GPUTERASORT


Sorting

“I believe that virtually every important aspect of programming arises somewhere in the context of sorting or searching!”

- Don Knuth


Sorting

  • Well studied

    • High performance computing

    • Databases

    • Computer graphics

    • Programming languages

    • ...

  • Google MapReduce algorithm

  • SPEC benchmark routine!


Massive Databases

  • Terabyte-data sets are common

    • Google sorts more than 100 billion terms in its index

    • > 1 trillion records in the web index!

  • Database sizes are rapidly increasing!

    • Maximum DB sizes increase 3x per year (http://www.wintercorp.com)

    • Processor improvements are not keeping pace with the information explosion


CPU vs. GPU

[Diagram: a CPU (3 GHz, 2 x 1 MB cache) with 2 GB of system memory (512 MB of it AGP-addressable) connected over a PCI-E bus (4 GB/s) to a GPU (690 MHz) with 512 MB of video memory.]


External Memory Sorting

  • Performed on Terabyte-scale databases

  • Two-phase algorithm [Vitter01, Salzberg90, Nyberg94, Nyberg95]

    • Limited main memory

    • First phase: partition the input file into large data chunks and write sorted chunks known as "runs"

    • Second phase: merge the "runs" to generate the sorted output file (see the sketch after this list)
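A minimal single-threaded sketch of the two-phase scheme, assuming newline-delimited records and hypothetical run-file naming; the paper's implementation is far more elaborate (GPU sorting, asynchronous I/O), so this only shows the structure:

```cpp
#include <algorithm>
#include <fstream>
#include <queue>
#include <string>
#include <vector>

void external_sort(const std::string& in, const std::string& out,
                   std::size_t run_bytes) {
    // Phase 1: fill memory with up to run_bytes of records, sort, emit a run.
    std::ifstream src(in);
    std::vector<std::string> run_names;
    std::vector<std::string> buf;
    std::size_t used = 0;
    std::string line;
    auto flush = [&]() {
        if (buf.empty()) return;
        std::sort(buf.begin(), buf.end());
        run_names.push_back(out + ".run" + std::to_string(run_names.size()));
        std::ofstream r(run_names.back());
        for (const std::string& s : buf) r << s << '\n';
        buf.clear();
        used = 0;
    };
    while (std::getline(src, line)) {
        used += line.size();
        buf.push_back(std::move(line));
        if (used >= run_bytes) flush();
    }
    flush();

    // Phase 2: k-way merge of the runs using a min-heap of run heads.
    struct Head {
        std::string rec;
        std::size_t run;
        bool operator>(const Head& o) const { return rec > o.rec; }
    };
    std::vector<std::ifstream> runs;
    for (const std::string& n : run_names) runs.emplace_back(n);
    std::priority_queue<Head, std::vector<Head>, std::greater<Head>> heads;
    for (std::size_t i = 0; i < runs.size(); ++i)
        if (std::getline(runs[i], line)) heads.push({line, i});
    std::ofstream dst(out);
    while (!heads.empty()) {
        Head h = heads.top();
        heads.pop();
        dst << h.rec << '\n';
        if (std::getline(runs[h.run], line)) heads.push({line, h.run});
    }
}
```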


External Memory Sorting

  • Performance mainly governed by I/O

    Salzberg analysis: given main memory of size M and a file of size N, if the I/O read size per run in phase 2 is T, then external memory sorting achieves efficient I/O performance when the run size R in phase 1 is R ≈ √(TN)



Salzberg Analysis

  • If N = 100 GB and T = 2 MB, then R ≈ 230 MB

  • Large data sorting on CPUs can achieve high I/O performance by sorting large runs
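One way to see where the √(TN) rule comes from, assuming phase 1 devotes essentially all of main memory M to the run being formed:

```latex
% Phase 1 sorts one run at a time in memory, so R <= M (ideally R ~ M).
% Phase 2 merges the N/R runs concurrently, keeping a T-byte read buffer
% per run, so their total size must fit in memory: (N/R) T <= M.
\[
  \frac{N}{R}\,T \;\le\; M \;\approx\; R
  \quad\Longrightarrow\quad
  R^{2} \;\gtrsim\; TN
  \quad\Longrightarrow\quad
  R \;\approx\; \sqrt{TN}.
\]
```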


Massive Data Handling on CPUs

  • Requires random memory accesses

    • Small CPU caches (< 2 MB)

    • Slower than even sequential disk accesses – the bottleneck shifts from I/O to memory

    • Widening memory-to-compute gap!

  • External memory sorting on CPUs can have low performance due to

    • High memory latency on account of cache misses, or

    • Low I/O performance

  • Sorting is hard!


Graphics Processing Units (GPUs)

  • Commodity processor for graphics applications

  • Massively parallel vector processors

  • High memory bandwidth

    • Low memory latency pipeline

    • Programmable

  • High growth rate


GPU: Commodity Processor

[Images: laptops, consoles, cell phones, PSP, desktops.]


Graphics Processing Units (GPUs)

  • Commodity processor for graphics applications

  • Massively parallel vector processors

    • 10x more operations per sec than CPUs

  • High memory bandwidth

    • Low memory latency pipeline

    • Programmable

  • High growth rate


Parallelism on GPUs

[Chart: graphics FLOPS over time. GPU: 1.3 TFLOPS; CPU: 25.6 GFLOPS.]


Graphics Processing Units (GPUs)

  • Commodity processor for graphics applications

  • Massively parallel vector processors

  • High memory bandwidth

    • Hides memory latency better

    • Programmable

    • 10x more memory bandwidth than CPUs

  • High growth rate


Graphics Pipeline

[Diagram: the graphics pipeline, fed at 56 GB/s – vertex (programmable vertex processing, fp32) → setup (polygon setup, culling, rasterization) → pixel (programmable per-pixel math, fp32) → texture (per-pixel texture, fp16 blending) → memory (Z-buffer, fp16 blending, anti-aliasing with MRT) → image. Low pipeline depth; hides memory latency!]


NON-Graphics Pipeline Abstraction

[Diagram: the same pipeline viewed as a general-purpose processor – data (programmable MIMD processing, fp32) → lists (SIMD "rasterization" setup) → data (programmable SIMD processing, fp32) → data (data fetch, fp16 blending) → memory (predicated write, fp16 blend, multiple output).]

Courtesy: David Kirk, Chief Scientist, NVIDIA




Technology Trends: CPU and GPU

[Chart: log of relative processing power, 2002-2008. The GPU curve, driven by graphics requirements (enhanced experience), grows at 3x per 18 months before settling onto the Moore's Law trajectory, limited by cooling (cost); value/UMA and enthusiast/specialty segments are marked. The CPU curves (leading edge, mainstream desktop, DT 'replacement', mobile, and value, with clock rates from 0.8 GHz up to a projected 31 GHz) follow the ordinary Moore's Law trajectory set by corporate desktop software requirements.]



GPUs for Sorting: Issues

  • No support for arbitrary writes

    • Optimized CPU algorithms do not map!

    • Requires new algorithms – sorting networks

  • Lack of support for general data types

  • Out-of-core algorithms

    • Limited GPU memory

  • Difficult to program


General Sorting on GPUs

  • Sorting networks: no data dependencies

    • Utilize the high parallelism of GPUs

  • To handle large keys, use a bitonic radix sort (sketched below)

    • Perform a bitonic sort on the 4 most significant bytes (MSBs) using the GPU, identify groups of records whose 4 MSBs are equal, and proceed to the next 4 bytes within those groups, and so on

    • Can handle keys of any length
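A hypothetical CPU rendering of that key-splitting recursion; the Record layout and helper names here are assumptions, and on the GPU the std::sort call would be the bitonic sorting network described on the following slides:

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

struct Record { std::string key; std::string payload; };  // assumed layout

// Pack the 4 key bytes starting at off big-endian, so integer order matches
// lexicographic byte order; short keys are zero-padded.
static std::uint32_t prefix32(const std::string& k, std::size_t off) {
    std::uint32_t p = 0;
    for (std::size_t i = 0; i < 4; ++i)
        p = (p << 8) | (off + i < k.size() ? (unsigned char)k[off + i] : 0);
    return p;
}

// Sort on 4-byte prefixes, then recurse 4 bytes deeper on runs of ties.
void sort_by_prefix(std::vector<Record>& r, std::size_t off = 0) {
    std::sort(r.begin(), r.end(), [off](const Record& a, const Record& b) {
        return prefix32(a.key, off) < prefix32(b.key, off);
    });
    for (std::size_t lo = 0; lo < r.size();) {
        std::size_t hi = lo + 1;
        while (hi < r.size() &&
               prefix32(r[hi].key, off) == prefix32(r[lo].key, off)) ++hi;
        bool deeper = false;  // recurse only if some key has bytes left
        for (std::size_t i = lo; i < hi; ++i)
            if (r[i].key.size() > off + 4) deeper = true;
        if (hi - lo > 1 && deeper) {
            std::vector<Record> run(r.begin() + lo, r.begin() + hi);
            sort_by_prefix(run, off + 4);
            std::copy(run.begin(), run.end(), r.begin() + lo);
        }
        lo = hi;
    }
}
```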


GPU-Based Sorting Networks

  • Represent data as 2D arrays

  • Multi-stage algorithm

    • Each stage involves multiple steps

  • In each step (see the sketch after this list)

    • Compare each array element against exactly one other element at a fixed distance

    • Perform a conditional assignment (MIN or MAX) at each element location
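A compact CPU sketch of such a network (bitonic sort, assuming a power-of-two input size); on the GPU every iteration of the innermost loop is an independent fragment, so each step is a single parallel pass:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Bitonic sorting network: log n stages, each with up to log n steps.
// In step (k, j), element i is compared only with its partner at the fixed
// distance j (index i ^ j) and MIN/MAX-assigned; a step has no data
// dependencies, so all comparisons can run in parallel.
void bitonic_sort(std::vector<unsigned>& a) {
    const std::size_t n = a.size();  // must be a power of two
    for (std::size_t k = 2; k <= n; k <<= 1) {          // stage
        for (std::size_t j = k >> 1; j > 0; j >>= 1) {  // step
            for (std::size_t i = 0; i < n; ++i) {       // parallel on a GPU
                const std::size_t p = i ^ j;            // fixed-distance partner
                const bool asc = (i & k) == 0;          // block sort direction
                if (p > i && (a[i] > a[p]) == asc)
                    std::swap(a[i], a[p]);              // conditional MIN/MAX
            }
        }
    }
}
```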



2D Memory Addressing

  • GPUs optimized for 2D representations

    • Map 1D arrays to 2D arrays

    • Minimum and maximum regions mapped to row-aligned or column-aligned quads (index mapping sketched below)
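A sketch of the assumed 1D-to-2D index mapping; the texture width W is a tuning parameter, and elements compared at a fixed 1D distance then land in row- or column-aligned quads:

```cpp
// Map a 1D array index to 2D texel coordinates and back.
struct Texel { unsigned x, y; };

inline Texel to2d(unsigned i, unsigned W) { return Texel{i % W, i / W}; }
inline unsigned to1d(Texel t, unsigned W) { return t.y * W + t.x; }
```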



1D – 2D Mapping

[Figure: the 1D array laid out in 2D, with the MIN regions shown as aligned quads.]

Effectively reduces the number of instructions per element.


Sorting on GPU: Pipelining and Parallelism

[Figure: input vertices flow through texturing, caching, and 2D quad comparisons into sequential writes, keeping the pipeline full.]


Comparison with GPU-Based Algorithms

3-6x faster than prior GPU-based algorithms!


GPU vs. High-End Multi-Core CPUs

2-2.5x faster than Intel high-end processors

Single GPU performance comparable to a high-end dual-core Athlon

Hand-optimized CPU code from Intel Corporation!


Super-Moore's Law Growth

50 GB/s on a single GPU

Peak performance: effectively hides memory latency with 15 GOP/s

Slashdot News and Tom's Hardware News headlines

Download URL: http://gamma.cs.unc.edu/GPUSORT


Implementation & Results

  • Pentium IV PC ($170)

  • NVIDIA 7800 GT ($270)

  • 2 GB RAM ($152)

  • 9 x 80 GB SATA disks ($477)

  • SuperMicro Motherboard & SATA Controller ($325)

  • Windows XP

  • PC costs $1469


Implementation & Results

  • Indy Sort Benchmark

    • 10-byte random string keys

    • 100-byte records

    • Sort the maximum amount of data in 644 seconds


Overall Performance

Faster and more scalable than dual Xeon processors (3.6 GHz)!


Performance/$

1.8x faster than the current terabyte sorter

World’s best price-to-performance system

http://research.microsoft.com/barc/SortBenchmark


Analysis: I/O Performance

Salzberg analysis: 100 MB run size

[Chart: peak sequential disk throughput in MB/s.]


Analysis: I/O Performance

Salzberg analysis: 100 MB run size

Pentium IV: 25 MB run size (to reduce memory latency)

Less work, and only 75% I/O efficient!


Analysis: I/O Performance

Salzberg analysis: 100 MB run size

Dual 3.6 GHz Xeons: 25 MB run size (to reduce memory latency)

More cores, less work, but only 85% I/O efficient!


Analysis: I/O Performance

Salzberg analysis: 100 MB run size

7800 GT: 100 MB run size

Ideal work, and 92% I/O efficient with a single CPU!


Task Parallelism

Performance is limited by I/O and memory

[Chart: per-task comparison of reorder, sequential I/O, and sorting 100 MB on the GPU; the GPU sort rate is roughly 3x that of the reorder or sequential I/O tasks.]
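A minimal sketch of that overlap, with hypothetical stand-ins read_chunk, gpu_sort_run, and write_run for the real pipeline stages: while the GPU sorts run k, the CPU performs the disk I/O for run k+1.

```cpp
#include <optional>
#include <thread>
#include <utility>
#include <vector>

// Trivial stubs standing in for the real stages (assumed names).
static std::optional<std::vector<char>> read_chunk() { return std::nullopt; }
static std::vector<char> gpu_sort_run(std::vector<char> c) { return c; }
static void write_run(const std::vector<char>&) {}

int main() {
    std::optional<std::vector<char>> cur = read_chunk();
    while (cur) {
        std::vector<char> sorted;
        // The GPU sorts the current run on its own thread...
        std::thread gpu([&] { sorted = gpu_sort_run(std::move(*cur)); });
        // ...while the CPU overlaps the next disk read.
        std::optional<std::vector<char>> next = read_chunk();
        gpu.join();
        write_run(sorted);  // sequential write of the finished run
        cur = std::move(next);
    }
}
```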


Why GPU-like Architectures for Large Data Management?

[Chart: GPU performance keeps climbing while CPU performance plateaus, pointing to a data management performance crisis.]


Advantages

  • Exploit high memory bandwidth on GPUs

    • Higher memory performance than CPU-based algorithms

  • High I/O performance due to large run sizes


Advantages

  • Offload work from CPUs

    • CPU cycles well-utilized for resource management

  • Scalable solution for large databases

  • Best performance/price solution for terabyte sorting


Limitations

  • May not work well on variable-sized keys and almost-sorted databases

  • Requires programmable GPUs (GPUs manufactured after 2003)


Conclusions

  • Designed new sorting algorithms on GPUs

    • Handles wide keys and long records

  • Achieves 10x higher memory performance

    • Memory-efficient sorting algorithm with a peak memory performance of 50 GB/s on GPUs

    • 15 GOP/s on a single GPU


Conclusions

  • Novel external memory sorting algorithm as a scalable solution

    • Achieves peak I/O performance on CPUs

    • Best performance/price solution – world’s fastest sorting system

  • High performance growth rate characteristics

    • Improves 2-3x per year


Future Work

  • Design high performance/price solutions

    • Must address the high wattage and cooling requirements of CPUs and GPUs

  • To exploit GPUs, we need easy-to-use programming APIs

    • Promising directions: BrookGPU, Microsoft Accelerator, Sh, etc.

  • Scientific libraries utilizing high parallelism and memory bandwidth

    • Scientific routines for LU, QR, SVD, FFT, etc.

    • BLAS library on GPUs

    • Eventually, build GPU-LAPACK and Matlab routines



GPUFFTW

N. Govindaraju, S. Larsen, J. Gray and D. Manocha, Proc. of ACM SuperComputing, 2006 (to appear)

4x faster than IMKL on high-end quad cores

SlashDot Headlines, May 2006

Download URL: http://gamma.cs.unc.edu/GPUFFTW


GPU Roadmap

  • GPUs are becoming more general purpose

    • Fewer limitations in the Microsoft DirectX 10 API

      • Better and consistent floating-point support

      • Integer instruction support

      • More programmable stages, etc.

    • Significant advances in performance

  • GPUs are being widely adopted in commercial applications

    • E.g., Microsoft Vista


Call to Action

[Slide callout: 40 GOPs and 40 GB/s.]

  • Don't put all your eggs in the multi-core basket

  • If you want teraops, go where they are

  • If you want memory bandwidth, go where the memory bandwidth is

  • The CPU-GPU gap is widening

  • The Microsoft Xbox is ½ teraop today.


Acknowledgements

Research Sponsors:

  • Army Research Office

  • Defense and Advanced Research Projects Agency

  • National Science Foundation

  • Naval Research Laboratory

  • Intel Corporation

  • Microsoft Corporation

    • Craig Peeper, Peter-Pike Sloan, David Blythe, Jingren Zhou

  • NVIDIA Corporation

  • RDECOM


Acknowledgements

  • David Tuft (UNC)

  • UNC Systems, GAMMA and Walkthrough groups


Thank You

  • Questions or Comments?

{naga,ritesh,dm}@cs.unc.edu

Jim.Gray@microsoft.com

http://www.cs.unc.edu/~naga

http://research.microsoft.com/~Gray

