GPUTeraSort: High Performance Graphics Co-processor Sorting for Large Data Management


Naga K. Govindaraju, Jim Gray, Ritesh Kumar, Dinesh Manocha

http://gamma.cs.unc.edu/GPUTERASORT

Sorting

“I believe that virtually every important aspect of programming arises somewhere in the context of sorting or searching!”

-Don Knuth

Sorting
  • Well studied
    • High performance computing
    • Databases
    • Computer graphics
    • Programming languages
    • ...
  • Google MapReduce algorithm
  • SPEC benchmark routine!
Massive Databases
  • Terabyte-scale data sets are common
    • Google sorts more than 100 billion terms in its index
    • More than 1 trillion records in the indexed web!
  • Database sizes are rapidly increasing!
    • Maximum database sizes increase 3x per year (http://www.wintercorp.com)
    • Processor improvements not matching information explosion
CPU vs. GPU
  • CPU (3 GHz)
    • 2 x 1 MB Cache
    • System Memory (2 GB)
    • AGP Memory (512 MB)
  • PCI-E Bus (4 GB/s)
  • GPU (690 MHz)
    • Video Memory (512 MB)
External Memory Sorting
  • Performed on Terabyte-scale databases
  • Two-phase algorithm [Vitter01, Salzberg90, Nyberg94, Nyberg95]
    • Limited main memory
    • First phase – partitions input file into large data chunks and writes sorted chunks known as “Runs”
    • Second phase – Merge the “Runs” to generate the sorted file
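The two phases can be sketched on a small scale as follows. This is a minimal CPU-only illustration, not the GPUTeraSort implementation: the function name `external_sort`, the one-integer-per-line text run files, and the `run_size` parameter are assumptions made for the example, and the merged result is returned as a list only to keep the sketch short.

```python
import heapq
import os
import tempfile


def external_sort(values, run_size):
    """Two-phase external sort of an iterable of integers (illustrative)."""
    run_paths = []

    def flush(chunk):
        # Sort one chunk in memory and write it to disk as a run.
        chunk.sort()
        fd, path = tempfile.mkstemp(suffix=".run")
        with os.fdopen(fd, "w") as f:
            f.writelines(f"{v}\n" for v in chunk)
        run_paths.append(path)

    # Phase 1: partition the input into chunks and write sorted runs.
    chunk = []
    for v in values:
        chunk.append(v)
        if len(chunk) == run_size:
            flush(chunk)
            chunk = []
    if chunk:
        flush(chunk)

    # Phase 2: merge the runs, reading each run file sequentially.
    files = [open(p) for p in run_paths]
    try:
        streams = [(int(line) for line in f) for f in files]
        return list(heapq.merge(*streams))
    finally:
        for f in files:
            f.close()
        for p in run_paths:
            os.remove(p)
```

A larger `run_size` produces fewer runs, so the merge phase has fewer files to interleave.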
External Memory Sorting
  • Performance mainly governed by I/O

Salzberg Analysis: Given the main memory size M and the file size N, if the I/O read size per run is T in phase 2, external memory sorting achieves efficient I/O performance if the run size R in phase 1 is given by R ≈ √(TN)

Salzberg Analysis
  • If N=100GB, T=2MB, then R ≈ 230MB
  • Large data sorting on CPUs can achieve high I/O performance by sorting large runs
Massive Data Handling on CPUs
  • Require random memory accesses
    • Small CPU caches (< 2MB)
    • Slower than even sequential disk accesses – bottleneck shift from I/O to memory
    • Widening memory-to-compute gap!
  • External memory sorting on CPUs can have low performance due to
    • High memory latency on account of cache misses
    • Or low I/O performance
  • Sorting is hard!
Graphics Processing Units (GPUs)
  • Commodity processor for graphics applications
  • Massively parallel vector processors
  • High memory bandwidth
    • Low memory latency pipeline
    • Programmable
  • High growth rate
GPU: Commodity Processor
  • Laptops
  • Consoles
  • Cell phones
  • PSP
  • Desktops

Graphics Processing Units (GPUs)
  • Commodity processor for graphics applications
  • Massively parallel vector processors
    • 10x more operations per sec than CPUs
  • High memory bandwidth
    • Low memory latency pipeline
    • Programmable
  • High growth rate
Parallelism on GPUs
  • Graphics FLOPS
    • GPU: 1.3 TFLOPS
    • CPU: 25.6 GFLOPS

Graphics Processing Units (GPUs)
  • Commodity processor for graphics applications
  • Massively parallel vector processors
  • High memory bandwidth
    • Better hides memory latency
    • Programmable
    • 10x more memory bandwidth than CPUs
  • High growth rate
Graphics Pipeline
  • Vertex: programmable vertex processing (fp32)
  • Setup and rasterizer: polygon setup, culling, rasterization
  • Pixel: programmable per-pixel math (fp32)
  • Texture: per-pixel texture, fp16 blending
  • Image (memory): Z-buf, fp16 blending, anti-alias (MRT)
  • 56 GB/s
  • Low pipeline depth
  • Hides memory latency!!

NON-Graphics Pipeline Abstraction
  • Data: programmable MIMD processing (fp32)
  • Lists (setup, rasterizer): SIMD “rasterization”
  • Data: programmable SIMD processing (fp32)
  • Data: data fetch, fp16 blending
  • Data (memory): predicated write, fp16 blend, multiple output
  • Courtesy: David Kirk, Chief Scientist, NVIDIA

Technology Trends: CPU and GPU
  • Chart of log of relative processing power by year (2002, 2004, 2006, 2008) for CPU and GPU
  • Annotations: graphics requirements (enhanced experience), Moore’s Law trajectory, cooling (cost) limitations
  • Market segments shown: Value / UMA, Mainstream Desktop, Enthusiast / Specialty, Mobile, Leading Edge

GPUs for Sorting: Issues
  • No support for arbitrary writes
    • Optimized CPU algorithms do not map!
    • Requires new algorithms – sorting networks
  • Lack of support for general data types
  • Out-of-core algorithms
    • Limited GPU memory
  • Difficult to program
General Sorting on GPUs
  • Sorting networks: No data dependencies
    • Utilize high parallelism on GPUs
  • To handle large keys, use bitonic radix sort
    • Perform bitonic sort on the 4 most significant bytes (MSBs) using GPUs, find the sorted records with equal 4 MSBs, then proceed to the next 4 bytes on those, and so on
    • Can handle any length keys
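The idea of sorting wide keys four bytes at a time can be sketched as below. This is a rough CPU illustration under stated assumptions: the name `sort_wide_keys` is hypothetical, keys are Python byte strings, and the built-in `sorted` stands in for the GPU bitonic sort of each 4-byte window.

```python
def sort_wide_keys(keys, chunk=4):
    """Sort byte-string keys by comparing `chunk` bytes at a time."""

    def sort_range(items, offset):
        # Sort on the current window only. sorted() is a CPU stand-in
        # for the GPU bitonic sort described on the slide.
        items = sorted(items, key=lambda k: k[offset:offset + chunk])
        out = []
        i = 0
        while i < len(items):
            # Find the group of records that tie on this window.
            window = items[i][offset:offset + chunk]
            j = i + 1
            while j < len(items) and items[j][offset:offset + chunk] == window:
                j += 1
            group = items[i:j]
            # Resolve ties on the next window, if any key has bytes left.
            if len(group) > 1 and any(len(k) > offset + chunk for k in group):
                group = sort_range(group, offset + chunk)
            out.extend(group)
            i = j
        return out

    return sort_range(list(keys), 0)
```

Only groups that tie on the current window are revisited, so keys that differ early are never compared on their later bytes.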
GPU-Based Sorting Networks
  • Represent data as 2D arrays
  • Multi-stage algorithm
    • Each stage involves multiple steps
  • In each step
    • Compare one array element against exactly one other element at fixed distance
    • Perform a conditional assignment (MIN or MAX) at each element location
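The stage and step structure can be sketched with a plain bitonic sorting network. This is a sequential Python illustration, not GPU code: the name `bitonic_sort` and the flat list representation are assumptions, and the inner loop over `i` plays the role of the per-element work a GPU would run in parallel.

```python
def bitonic_sort(data):
    """Sort a list whose length is a power of two with a bitonic network."""
    a = list(data)
    n = len(a)
    assert n & (n - 1) == 0, "length must be a power of two"

    k = 2
    while k <= n:                # stage: merge bitonic sequences of length k
        j = k // 2
        while j >= 1:            # step: compare elements at fixed distance j
            nxt = list(a)
            for i in range(n):   # each element is handled independently
                partner = i ^ j
                ascending = (i & k) == 0
                lower = (i & j) == 0
                # Conditional assignment: keep MIN or MAX of the pair.
                if lower == ascending:
                    nxt[i] = min(a[i], a[partner])
                else:
                    nxt[i] = max(a[i], a[partner])
            a = nxt
            j //= 2
        k *= 2
    return a
```

Each step reads only from the previous array and writes every output location exactly once, which is why no scattered writes are needed.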
2D Memory Addressing
  • GPUs optimized for 2D representations
    • Map 1D arrays to 2D arrays
    • Minimum and maximum regions mapped to row-aligned or column-aligned quads
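One way to see why the MIN and MAX regions become aligned rectangles is to lay the 1D array out row by row and mark which elements take the minimum in a given step. The sketch below assumes a row-major layout with a power-of-two width; the name `min_region_mask` and its parameters are hypothetical, with `j` the comparison distance and `k` the stage size of a bitonic step.

```python
def min_region_mask(n, width, j, k):
    """Return a 2D grid of booleans: True where the step writes a MIN.

    n: number of elements (power of two)
    width: row length (power of two); index i sits at row i // width,
           column i % width
    j, k: comparison distance and stage size of the bitonic step
    """
    height = n // width
    grid = [[False] * width for _ in range(height)]
    for i in range(n):
        row, col = divmod(i, width)
        lower = (i & j) == 0       # element is the lower index of its pair
        ascending = (i & k) == 0   # pair is ordered ascending in this stage
        grid[row][col] = (lower == ascending)
    return grid
```

With `n=16`, `width=4`, `k=16`, a distance of `j=1` gives identical rows (column-aligned strips), while `j=8` gives rows that are entirely MIN or entirely MAX (row-aligned blocks).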
1D – 2D Mapping
  • Effectively reduce instructions per element

Sorting on GPU: Pipelining and Parallelism
  • Input Vertices
  • Texturing, Caching and 2D Quad Comparisons
  • Sequential Writes

Comparison with GPU-Based Algorithms

3-6x faster than prior GPU-based algorithms!

GPU vs. High-End Multi-Core CPUs

2-2.5x faster than Intel high-end processors

Single GPU performance comparable to high-end dual core Athlon

Hand-optimized CPU code from Intel Corporation!

Super-Moore’s Law Growth

Slashdot News and Tom’s Hardware News Headlines

50 GB/s on a single GPU

Peak Performance: Effectively hide memory latency with 15 GOP/s

Download URL: http://gamma.cs.unc.edu/GPUSORT

Implementation & Results
  • Pentium IV PC ($170)
  • NVIDIA 7800 GT ($270)
  • 2 GB RAM ($152)
  • 9 80GB SATA disks ($477)
  • SuperMicro Motherboard & SATA Controller ($325)
  • Windows XP
  • PC costs $1469
Implementation & Results
  • Indy Sort Benchmark
    • 10-byte random string keys
    • 100-byte records
    • Sort maximum amount in 644 seconds
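The record format above can be mimicked in a few lines. This is an illustrative sketch only: it assumes the 10-byte key occupies the first bytes of each 100-byte record, and the names `make_records`, `sort_records`, and `is_sorted`, as well as the filler payload, are made up for the example.

```python
import os

RECORD_SIZE = 100   # bytes per record, from the slide
KEY_SIZE = 10       # bytes per key, from the slide (assumed to lead the record)


def make_records(count):
    """Generate `count` records with random 10-byte keys."""
    payload = b"x" * (RECORD_SIZE - KEY_SIZE)   # filler; real payloads differ
    return [os.urandom(KEY_SIZE) + payload for _ in range(count)]


def sort_records(records):
    """Sort whole records by their key prefix."""
    return sorted(records, key=lambda r: r[:KEY_SIZE])


def is_sorted(records):
    """Check that keys are in non-decreasing order."""
    return all(a[:KEY_SIZE] <= b[:KEY_SIZE]
               for a, b in zip(records, records[1:]))
```

Because the key is only a tenth of each record, a sorter can compare keys without touching the remaining 90 bytes until the records are written out.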
Overall Performance

Faster and more scalable than Dual Xeon processors (3.6 GHz)!

Performance/$

1.8x faster than current Terabyte sorter

World’s best price-to-performance system

http://research.microsoft.com/barc/SortBenchmark

Analysis: I/O Performance
  • Salzberg Analysis: 100 MB Run Size
  • Peak sequential throughput in MB/s
  • Pentium IV: 25 MB run size (to reduce memory latency)
    • Less work and only 75% IO efficient!
  • Dual 3.6 GHz Xeons: 25 MB run size (to reduce memory latency)
    • More cores, less work but only 85% IO efficient!
  • 7800 GT: 100 MB run size
    • Ideal work, and 92% IO efficient with single CPU!

Task Parallelism

Performance limited by IO and memory

Sorting 100MB on GPU: 3x > reorder or sequential IO

Why GPU-like Architectures for Large Data Management?

GPU

Plateau: Data Management Performance Crisis

Advantages
  • Exploit high memory bandwidth on GPUs
    • Higher memory performance than CPU-based algorithms
  • High I/O performance due to large run sizes
  • Offload work from CPUs
    • CPU cycles well-utilized for resource management
  • Scalable solution for large databases
  • Best performance/price solution for terabyte sorting
Limitations
  • May not work well on variable-sized keys and almost sorted databases
  • Requires programmable GPUs (GPUs manufactured after 2003)
Conclusions
  • Designed new sorting algorithms on GPUs
    • Handles wide keys and long records
  • Achieves 10x higher memory performance
    • Memory-efficient sorting algorithm with peak memory performance of 50 GB/s on GPUs
    • 15 GOP/sec on a single GPU
  • Novel external memory sorting algorithm as a scalable solution
    • Achieves peak I/O performance on CPUs
    • Best performance/price solution – world’s fastest sorting system
  • High performance growth rate characteristics
    • Improve 2-3 times/yr
Future Work
  • Designed high performance/price solutions
    • High wattage and cooling requirements of CPUs and GPUs
  • To exploit GPUs, we need easy-to-use programming APIs
    • Promising directions: BrookGPU, Microsoft Accelerator, Sh, etc.
  • Scientific libraries utilizing high parallelism and memory bandwidth
    • Scientific routines on LU, QR, SVD, FFT, etc.
    • BLAS library on GPUs
    • Eventually, build GPU-LAPACK and Matlab routines
GPUFFTW

N. Govindaraju, S. Larsen, J. Gray and D. Manocha, Proc. of ACM SuperComputing, 2006 (to appear)

4x faster than IMKL on high-end Quad cores

SlashDot Headlines, May 2006

Download URL: http://gamma.cs.unc.edu/GPUFFTW

GPU Roadmap
  • GPUs are becoming more general purpose
    • Fewer limitations in Microsoft DirectX10 API
      • Better and consistent floating point support,
      • Integer instruction support,
      • More programmable stages, etc.
    • Significant advance in performance
  • GPUs are being widely adopted in commercial applications
    • E.g., Microsoft Vista
Call to Action

40 GOPS, 40 GB/s

  • Don’t put all your eggs in the Multi-core basket
  • If you want TeraOps, go where they are
  • If you want memory bandwidth, go where the memory bandwidth is
  • CPU-GPU gap is widening
  • Microsoft Xbox is ½ TeraOP today
Acknowledgements

Research Sponsors:

  • Army Research Office
  • Defense Advanced Research Projects Agency
  • National Science Foundation
  • Naval Research Laboratory
  • Intel Corporation
  • Microsoft Corporation
    • Craig Peeper, Peter-Pike Sloan, David Blythe, Jingren Zhou
  • NVIDIA Corporation
  • RDECOM
  • David Tuft (UNC)
  • UNC Systems, GAMMA and Walkthrough groups
Thank You
  • Questions or Comments?

{naga,ritesh,dm}@cs.unc.edu

Jim.Gray@microsoft.com

http://www.cs.unc.edu/~naga

http://research.microsoft.com/~Gray