High Performance Sorting and Searching using Graphics Processors

Naga K. Govindaraju, Microsoft Concurrency

Sorting and Searching

“I believe that virtually every important aspect of programming arises somewhere in the context of sorting or searching!”

-Don Knuth

Sorting and Searching
  • Well studied
    • High performance computing
    • Databases
    • Computer graphics
    • Programming languages
    • ...
  • Google MapReduce algorithm
  • SPEC benchmark routine!
Massive Databases
  • Terabyte data sets are common
    • Google sorts more than 100 billion terms in its index
    • More than 1 trillion web records indexed!
  • Database sizes are rapidly increasing!
    • Max DB size increases 3x per year (http://www.wintercorp.com)
    • Processor improvements not matching information explosion
CPU vs. GPU

[Figure: system diagram. CPU (3 GHz) with 2 x 1 MB cache and system memory (2 GB); AGP memory (512 MB); PCI-E bus (4 GB/s); GPU (690 MHz) with video memory (512 MB).]

Massive Data Handling on CPUs
  • Require random memory accesses
    • Small CPU caches (< 2MB)
    • Random memory accesses slower than even sequential disk accesses
  • High memory latency
    • Huge memory to compute gap!
  • CPUs are deeply pipelined
    • Pentium 4 has 30 pipeline stages
    • Do not hide latency - high cycles per instruction (CPI)
  • CPU is under-utilized for data intensive applications
Massive Data Handling on CPUs
  • Sorting is hard!
  • GPUs are a potentially scalable solution to terabyte sorting and scientific computing
    • We beat the sorting benchmark with GPUs and provide a scalable solution
Graphics Processing Units (GPUs)
  • Commodity processor for graphics applications
  • Massively parallel vector processors
  • High memory bandwidth
    • Low memory latency pipeline
    • Programmable
  • High growth rate
  • Power-efficient
GPU: Commodity Processor
  • Laptops
  • Consoles
  • Cell phones
  • PSP
  • Desktops

Graphics Processing Units (GPUs)
  • Commodity processor for graphics applications
  • Massively parallel vector processors
    • 10x more operations per second than CPUs
  • High memory bandwidth
    • Pipeline better hides memory latency
    • Programmable
  • High growth rate
  • Power-efficient
Parallelism on GPUs
  • Graphics FLOPS
    • GPU – 1.3 TFLOPS
    • CPU – 25.6 GFLOPS

Graphics Processing Units (GPUs)
  • Commodity processor for graphics applications
  • Massively parallel vector processors
  • High memory bandwidth
    • Pipeline better hides memory latency
    • Programmable
    • 10x more memory bandwidth than CPUs
  • High growth rate
  • Power-efficient
Graphics Pipeline
  • Low pipeline depth
  • 56 GB/s memory bandwidth
  • Hides memory latency!!

[Figure: pipeline stages, with labels as on the slide. vertex: programmable vertex processing (fp32); polygon, setup, rasterizer: polygon setup, culling, rasterization; pixel: programmable per-pixel math (fp32); texture: per-pixel texture, fp16 blending; image: Z-buf, fp16 blending, anti-alias (MRT); memory.]

NON-Graphics Pipeline Abstraction

[Figure: the graphics pipeline relabeled for general data. data: programmable MIMD processing (fp32); lists, setup, rasterizer: SIMD “rasterization”; data: programmable SIMD processing (fp32); data: data fetch, fp16 blending; data: predicated write, fp16 blend, multiple output; memory.]

Courtesy: David Kirk, Chief Scientist, NVIDIA

Exploiting Technology Moving Faster than Moore’s Law

[Figure: GPU growth rate vs. CPU growth rate.]

GPUs for Sorting and Searching: Issues
  • No support for arbitrary writes
    • Optimized CPU algorithms do not map!
  • Lack of support for general data types
  • Cache-efficient algorithms
    • Small data caches
    • No cache information from vendors
  • Out-of-core algorithms
    • Limited GPU memory
Outline
  • Overview
  • Sorting and Searching on GPUs
  • Applications
  • Conclusions and Future Work
Sorting on GPUs
  • Adaptive sorting algorithms
    • Extent of sorted order in a sequence
  • General sorting algorithms
  • External memory sorting algorithms
Adaptive Sorting on GPUs
  • Prior adaptive sorting algorithms require random data writes
  • GPUs optimized for minimum depth or visible surface computation
    • Using depth test functionality
  • Design adaptive sorting using only minimum computations
N. Govindaraju, M. Henson, M. Lin and D. Manocha, Proc. of ACM I3D, 2005

Adaptive Sorting Algorithm
  • Multiple iterations
  • Each iteration uses a two pass algorithm
    • First pass – Compute an increasing sequence M
    • Second pass - Compute the sorted elements in M
    • Iterate on the remaining unsorted elements
Increasing Sequence

Given a sequence $S = \{x_1, \ldots, x_n\}$, an element $x_i$ belongs to M if and only if $x_i \le x_j$ for all $j > i$, $x_j \in S$.

M is an increasing sequence.

Increasing Sequence Computation
  • Scan from $x_n$ down to $x_1$, maintaining a running minimum Min (initially ∞)
  • $x_n \le \infty$, so $x_n$ always belongs to M
  • At each $x_i$: is $x_i \le$ Min? If yes, prepend $x_i$ to M and set Min = $x_i$
  • The last test is whether $x_1 \le \{x_2, \ldots, x_n\}$
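
The right-to-left scan for M can be sketched on the CPU as follows. This is a minimal illustration, not the GPU implementation from the talk (which relies on depth-test minimum operations); the function name `increasing_sequence_flags` is hypothetical.

```python
import math


def increasing_sequence_flags(seq):
    """Mark each x_i that is <= every element after it (membership in M).

    Scans from the last element to the first while keeping a running
    minimum, as on the slides. Hypothetical helper for illustration.
    """
    flags = [False] * len(seq)
    running_min = math.inf  # "x_n <= infinity" always holds
    for i in range(len(seq) - 1, -1, -1):
        if seq[i] <= running_min:
            flags[i] = True        # x_i joins M
            running_min = seq[i]   # Min = x_i
    return flags


if __name__ == "__main__":
    print(increasing_sequence_flags([3, 1, 2]))
```

For the input `[3, 1, 2]`, only `1` and `2` are flagged, because `3` is larger than an element that follows it.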

Computing Sorted Elements

Theorem 1: Given the increasing sequence M, the rank of an element $x_i$ in M is determined if $x_i < \min(I - M)$.

  • Linear-time algorithm
    • Maintaining minimum
  • Scan $X_1, X_2, \ldots, X_n$ in order; at each $X_i$: is $X_i$ in M?
    • No. Update min
    • Yes. Is $X_i \le$ min? If so, append $X_i$ to the sorted list

Algorithm Analysis

Knuth’s measure of disorder:

Given a sequence I and its longest increasing sequence LIS(I), the sequence of disordered elements Y = I - LIS(I)

Theorem 2: Given a sequence I and LIS(I), our adaptive algorithm sorts in at most (2 ||Y|| + 1) iterations

Pictorial Proof

[Figure: the sequence $X_1 X_2 \ldots X_l X_{l+1} X_{l+2} \ldots X_m X_{m+1} X_{m+2} \ldots X_q X_{q+1} X_{q+2} \ldots X_n$, annotated three times with the label “2 iterations”.]

Example

[Figure: 8 is shown out of place against the increasing run 1 2 3 4 5 6 7 9. In the next frame, 1 2 3 4 5 6 7 are marked “Sorted”; 8 and 9 remain.]
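
Putting the two passes together, the iteration structure can be sketched on the CPU as below. This is only an illustration under stated assumptions: the function name `adaptive_sort` is hypothetical, the slides do not say where 8 sits in the example input (it is placed first here), and the real algorithm computes the minimum operations with GPU depth tests.

```python
import math


def adaptive_sort(seq):
    """Sort by repeating the two-pass scheme; return (sorted_list, iterations)."""
    remaining = list(seq)
    result = []
    iterations = 0
    while remaining:
        iterations += 1

        # First pass: flag the increasing sequence M (scan right to left).
        in_m = [False] * len(remaining)
        running_min = math.inf
        for i in range(len(remaining) - 1, -1, -1):
            if remaining[i] <= running_min:
                in_m[i] = True
                running_min = remaining[i]

        # Second pass: scan left to right, tracking the minimum of the
        # elements outside M seen so far. Elements of M that do not exceed
        # it have a determined rank and are appended to the output.
        min_outside = math.inf
        leftover = []
        for value, flagged in zip(remaining, in_m):
            if not flagged:
                min_outside = min(min_outside, value)
                leftover.append(value)
            elif value <= min_outside:
                result.append(value)
            else:
                leftover.append(value)

        remaining = leftover  # iterate on the unsorted elements
    return result, iterations


if __name__ == "__main__":
    # Assumed placement of 8 at the front of the slide's example values.
    print(adaptive_sort([8, 1, 2, 3, 4, 5, 6, 7, 9]))
```

With this assumed input, the first iteration emits 1 through 7 and the second emits 8 and 9, so the sort finishes in two iterations, within the bound of Theorem 2 for one disordered element.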

Advantages
  • Linear in the input size and sorted extent
    • Works well on almost sorted input
  • Maps well to GPUs
    • Uses depth test functionality for minimum operations
    • Useful for performing 3D visibility ordering
    • Perform transparency computations on dynamic 3D environments
  • Cons:
    • Expected time: $O(n^2 - 2n\sqrt{n})$ on random sequences
Video: Transparent PowerPlant
  • 790K polygons
  • Depth complexity ~ 13
  • 1600x1200 resolution
  • NVIDIA GeForce 6800
  • 5-8 fps
General Sorting on GPUs
  • Design sorting algorithms with deterministic memory accesses – “Texturing” on GPUs
    • 56 GB/s peak memory bandwidth
    • Can better hide the memory latency!!
  • Require minimum and maximum computations – “Blending functionality” on GPUs
    • Low branching overhead
  • No data dependencies
    • Utilize high parallelism on GPUs
GPU-Based Sorting Networks
  • Represent data as 2D arrays
  • Multi-stage algorithm
    • Each stage involves multiple steps
  • In each step
    • Compare one array element against exactly one other element at fixed distance
    • Perform a conditional assignment (MIN or MAX) at each element location
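
The stage and step structure described above matches a bitonic sorting network, which the later analysis slide names explicitly. The following is a minimal CPU sketch, assuming a power-of-two input length and a 1D array; the function name `bitonic_sort` is hypothetical, and on the GPU each step would be a rendering pass over a 2D texture rather than a Python loop.

```python
def bitonic_sort(values):
    """Sort with a bitonic sorting network (length must be a power of two)."""
    n = len(values)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    data = list(values)
    k = 2
    while k <= n:            # stage: merge bitonic runs of length k
        j = k // 2
        while j >= 1:        # step: compare elements at fixed distance j
            out = data[:]    # read the old array, write a new one
            for i in range(n):
                partner = i ^ j                # exactly one other element
                ascending = (i & k) == 0       # direction of this block
                is_lower = (i & j) == 0        # lower index of the pair
                if is_lower == ascending:
                    out[i] = min(data[i], data[partner])   # MIN location
                else:
                    out[i] = max(data[i], data[partner])   # MAX location
            data = out
            j //= 2
        k *= 2
    return data


if __name__ == "__main__":
    print(bitonic_sort([3, 7, 4, 8, 6, 2, 1, 5]))
```

Every output location is computed independently from two fixed input locations, which is why the scheme needs no arbitrary writes and no data-dependent branching.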
2D Memory Addressing
  • GPUs optimized for 2D representations
    • Map 1D arrays to 2D arrays
    • Minimum and maximum regions mapped to row-aligned or column-aligned quads
1D – 2D Mapping

[Figure: 2D array with a region labeled MIN.]

Effectively reduce instructions per element
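
One way to see why the regions become row-aligned or column-aligned is to assume a row-major mapping with a power-of-two width (an assumption; the slides do not state the layout). The sketch below marks, for a comparison distance `d`, the elements that are the lower-index member of their pair. The function name `lower_partner_mask` is hypothetical.

```python
def lower_partner_mask(width, height, d):
    """Return a 2D grid of booleans under an assumed row-major mapping.

    Cell (x, y) holds 1D index i = y * width + x; it is True when i is the
    lower-index element of its compare pair at power-of-two distance d.
    """
    return [
        [((y * width + x) & d) == 0 for x in range(width)]
        for y in range(height)
    ]


if __name__ == "__main__":
    for d in (1, 4):
        print("distance", d)
        for row in lower_partner_mask(4, 4, d):
            print("".join("#" if cell else "." for cell in row))
```

For a 4 x 4 array, distance 1 yields alternating columns and distance 4 yields alternating rows, so each region can be covered by a few axis-aligned quads instead of per-element tests.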

Sorting on GPU: Pipelining and Parallelism

[Figure: pipeline stages. Input vertices; texturing, caching and 2D quad comparisons; sequential writes.]

Comparison with GPU-Based Algorithms

3-6x faster than prior GPU-based algorithms!

GPU vs. High-End Multi-Core CPUs

2-2.5x faster than Intel high-end processors

Single GPU performance comparable to high-end dual-core Athlon

Hand-optimized CPU code from Intel Corporation!

Slashdot and Tom's Hardware Guide headlines, June 2005

N. Govindaraju, S. Larsen, J. Gray and D. Manocha, Proc. of ACM SuperComputing, 2006

GPU Cache Model
  • Small data caches
    • Better hide the memory latency
    • Vendors do not disclose cache information – critical for scientific computing on GPUs
  • We design a simple model
    • Determine cache parameters (block and cache sizes)
    • Improve sorting, FFT and SGEMM performance
Cache Evictions

[Figure: a cache with three successive cache evictions.]

Cache issues

Cache misses per step = 2 W H / (h B)

[Figure: cache diagram with a dimension labeled h and three cache evictions.]

Analysis
  • lg n possible steps in bitonic sorting network
  • Step k is performed (lg n - k + 1) times and $h = 2^{k-1}$
  • Data fetched from memory = $2 n f(B)$, where $f(B) = (B-1)(\lg n - 1) + 0.5(\lg n - \lg B)^2$
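
The formulas above can be evaluated directly. The sketch below is only arithmetic on the slide's expressions; the function names are hypothetical, and reading W and H as the array width and height and B as the cache block size is an assumption, since the slides do not define the symbols or their units.

```python
import math


def cache_misses_per_step(W, H, h, B):
    """Cache misses per step = 2 W H / (h B), as stated on the slide."""
    return 2 * W * H / (h * B)


def f(B, n):
    """f(B) = (B - 1)(lg n - 1) + 0.5 (lg n - lg B)^2."""
    lg_n = math.log2(n)
    return (B - 1) * (lg_n - 1) + 0.5 * (lg_n - math.log2(B)) ** 2


def data_fetched(n, B):
    """Data fetched from memory = 2 n f(B), in the slide's unspecified units."""
    return 2 * n * f(B, n)


if __name__ == "__main__":
    n, B = 2 ** 20, 8  # illustrative values, not taken from the talk
    print(f(B, n), data_fetched(n, B))
    print(cache_misses_per_step(1024, 1024, 4, 8))
```

Varying `B` for a fixed `n` shows how strongly the modeled traffic depends on the block size, which is the parameter the cache model sets out to determine.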
Super-Moore’s Law Growth

50 GB/s on a single GPU

Peak Performance: Effectively hide memory latency with 15 GOP/s

N. Govindaraju, J. Gray, R. Kumar and D. Manocha, Proc. of ACM SIGMOD 2006 (to appear)

External Memory Sorting
  • Performed on Terabyte-scale databases
  • Two-phase algorithm [Vitter01, Salzberg90, Nyberg94, Nyberg95]
    • Limited main memory
    • First phase – partitions input file into large data chunks and writes sorted chunks known as “Runs”
    • Second phase – Merge the “Runs” to generate the sorted file
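
The two phases can be sketched in a few lines. This is a minimal in-memory illustration, not the system from the talk: Python lists stand in for files on disk, the function name `external_sort` and the `run_size` parameter are hypothetical, and the run-sorting step is where the talk's approach would hand work to the GPU.

```python
import heapq


def external_sort(records, run_size):
    """Two-phase sort: build sorted runs, then merge them."""
    if run_size <= 0:
        raise ValueError("run_size must be positive")

    # Phase 1: partition the input into chunks and sort each one ("runs").
    runs = [
        sorted(records[start:start + run_size])
        for start in range(0, len(records), run_size)
    ]

    # Phase 2: k-way merge of the runs into the final sorted output.
    return list(heapq.merge(*runs))


if __name__ == "__main__":
    print(external_sort([5, 2, 9, 1, 7, 3, 8], run_size=3))
```

Larger runs mean fewer runs to merge in the second phase, which is why the run size chosen in phase 1 drives I/O efficiency.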
External Memory Sorting
  • Performance mainly governed by I/O

Salzberg Analysis: Given the main memory size M and the file size N, if the I/O read size per run is T in phase 2, external memory sorting achieves efficient I/O performance if the run size R in phase 1 is given by R ≈ √(TN)

Salzberg Analysis
  • If N=100GB, T=2MB, then R ≈ 230MB
  • Large data sorting is inefficient on CPUs
    • R » CPU cache sizes – memory latency
External memory sorting
  • External memory sorting on CPUs can have low performance due to
    • High memory latency
    • Or low I/O performance
  • Our algorithm
    • Sorts large data arrays on GPUs
    • Perform I/O operations in parallel on CPUs
I/O Performance

Salzberg Analysis: 100 MB run size

Pentium IV: 25 MB run size. Less work and only 75% I/O efficient!

Dual 3.6 GHz Xeons: 25 MB run size. More cores, less work, but only 85% I/O efficient!

7800 GT: 100 MB run size. Ideal work, and 92% I/O efficient with a single CPU!

Task Parallelism

Performance limited by IO and memory

Overall Performance

Faster and more scalable than Dual Xeon processors (3.6 GHz)!

Performance/$

1.8x faster than current Terabyte sorter

World’s best performance/$ system

Advantages
  • Exploit high memory bandwidth on GPUs
    • Higher memory performance than CPU-based algorithms
  • High I/O performance due to large run sizes
  • Offload work from CPUs
    • CPU cycles well-utilized for resource management
  • Scalable solution for large databases
  • Best performance/price solution for terabyte sorting
N. Govindaraju, B. Lloyd, W. Wang, M. Lin and D. Manocha, Proc. of ACM SIGMOD, 2004

Searching for Quantiles
  • Given a set of values in the GPU, compute the Kth-largest number
  • Traditional CPU algorithms require arbitrary data writes; a new algorithm is required without
    • Data rearrangement
    • Data readback to CPU
  • Our solution – search for the Kth-largest number
K-th Largest Number
  • Let vk denote the k-th largest number
  • How do we generate a number m equal to vk?
    • Without knowing vk’s value
    • Count the number of values ≥ some given value
    • Starting from the most significant bit, determine the value of each bit at a time
K-th Largest Number
  • Given a set S of values
    • c(m): number of values ≥ m
    • vk: the k-th largest number
  • We have
    • If c(m) ≥ k, then m ≤ vk
    • If c(m) < k, then m > vk
  • c(m) computed using occlusion queries
2nd Largest in 9 Values
  • Values: 0011, 1011, 1101, 0111, 0101, 0001, 0111, 1010, 0010
  • Start with m = 0000; the target is v2 = 1011
  • Draw a quad at depth 8, compute c(1000): m = 1000, c(m) = 3, so 1st bit = 1
  • Draw a quad at depth 12, compute c(1100): m = 1100, c(m) = 1, so 2nd bit = 0
  • Draw a quad at depth 10, compute c(1010): m = 1010, c(m) = 3, so 3rd bit = 1
  • Draw a quad at depth 11, compute c(1011): m = 1011, c(m) = 2, so 4th bit = 1 and m = v2 = 1011
Our algorithm
  • Initialize m to 0
  • Start with the MSB and scan all bits till LSB
  • At each bit, put 1 in the corresponding bit-position of m
  • If c(m) < k, make that bit 0
  • Proceed to the next bit
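
The bit-by-bit search can be sketched on the CPU as follows. This is an illustration only: the function names are hypothetical, values are assumed to be non-negative integers of a known bit width, and `count_at_least` stands in for the occlusion-query count c(m) used on the GPU.

```python
def count_at_least(values, m):
    """c(m): number of values >= m (stand-in for a GPU occlusion query)."""
    return sum(1 for v in values if v >= m)


def kth_largest(values, k, bits):
    """Build the k-th largest value one bit at a time, MSB first."""
    m = 0
    for bit in range(bits - 1, -1, -1):
        candidate = m | (1 << bit)          # tentatively set this bit to 1
        if count_at_least(values, candidate) >= k:
            m = candidate                   # c(m) >= k: keep the 1
        # otherwise c(m) < k: leave the bit at 0
    return m


if __name__ == "__main__":
    values = [0b0011, 0b1011, 0b1101, 0b0111, 0b0101,
              0b0001, 0b0111, 0b1010, 0b0010]
    print(format(kth_largest(values, k=2, bits=4), "04b"))
```

Run on the nine values from the worked example with k = 2, the search tests 1000, 1100, 1010 and 1011 in turn and ends at 1011, without rearranging or reading back the data.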
Median

3x performance improvement per year!

Application: Collision Detection
  • Compute the overlapping geometric primitives
  • Fundamental to
    • Robotics and motion planning, computer graphics, computational geometry, computer animations, virtual reality, etc.
  • Overlap computation is a sorting problem!
Application: Collision Detection
  • Exploit interactivity in graphics applications
    • Objects do not move significantly – use coherence
    • Adaptive visibility ordering for dynamic data sets!
  • Define precedence based on visibility
    • An object O precedes a set of objects S if it is completely in front of S
Visibility for Collisions: Geometric Interpretation

Sufficient but not a necessary condition for existence of separating surface with unit depth complexity

More culling efficiency than bounding volumes

[Figure: a view direction and objects labeled O, O1 and O2.]

Govindaraju et al., Proc. of ACM SIGGRAPH, 2005

Interactive Cloth Simulation
  • 450 ms per frame
  • Over 100x performance improvement over prior algorithms
Numerical Algorithms on GPUs
  • Fundamental to scientific, data mining, multimedia and high performance computing applications
  • Highly compute- and memory-intensive
    • LU, QR and singular value decomposition $O(n^3)$
    • SORT $O(n \log^2 n)$
    • FFT $O(n \log n)$
    • SGEMM $O(n^3)$
N. Galoppo, N. Govindaraju, M. Henson and D. Manocha, Proc. of ACM SuperComputing 2005

LU Decomposition

Faster than ATLAS!

[Figure: performance chart; chart labels: (Aug 2004), (Jun 2005).]

Numerical Algorithms on GPUs

Slashdot news headlines, May 2006

Numerical Libraries
  • GPUSORT – http://gamma.cs.unc.edu/GPUSORT (Over 1200 downloads since 07/2005)
  • GPUFFTW – http://gamma.cs.unc.edu/GPUFFTW (Over 900 downloads since 06/2006)
  • LUGPU – http://gamma.cs.unc.edu/LUGPULIB (Over 300 downloads since 11/2005)
Conclusions
  • Designed new sorting algorithms on GPUs
    • Adaptive, general and external memory sorting algorithms
    • Applied to interactive graphics applications
  • Presented novel GPU memory models
    • Memory-efficient sorting algorithm with peak memory performance of 50 GB/s on GPUs
    • Applicable to all scientific computing algorithms on GPUs and GPU-like architectures
Conclusions
  • Novel external memory sorting algorithm as a scalable solution
    • Achieves peak I/O performance on CPUs
    • Best performance/price solution – world’s fastest sorting system
  • High performance growth rate characteristics
    • Improve 2-3 times/yr
Future Work: Power and Heat
  • Designed high performance/price solutions
    • High wattage of CPUs - major scalability bottleneck!
    • Effective cooling solutions for CPUs
  • Design high performance/watt solutions using GPUs
    • Two orders of magnitude more power efficient!
    • Improve the reliability, fault tolerance and stability
  • Adaptive power management algorithms on GPUs
Future Work: Scientific libraries
  • Scientific libraries utilizing high parallelism and memory bandwidth
    • Scientific routines on LU, QR, SVD, FFT, etc.
    • BLAS library on GPUs
    • Build GPU-LAPACK and Matlab routines
  • Multi-cores are emerging
    • Exploit parallelism in multiple CPUs and GPUs
    • Quad-GPU solutions provide > 200GB/s
Future Work
  • Design and analyze new scientific and HPC algorithms
    • Exploit processor parallelism
    • Lower memory latency
Acknowledgements

Research Sponsors:

  • Army Research Office
  • Defense Advanced Research Projects Agency
  • National Science Foundation
  • Naval Research Laboratory
  • Intel Corporation
  • RDECOM
Acknowledgements
  • NVIDIA, ATI and Microsoft Corporation
    • Architecture
      • David Kirk, Michael Doggett, Steven Molnar
    • Drivers
      • Evan Hart, Paul Keller, Nick Triantos
    • Developer relations
      • Mark Harris, Mark Segal
    • Microsoft DirectX
      • David Blythe, Craig Peeper, Peter-Pike Sloan
Acknowledgements

Collaborators:

  • Anastassia Ailamaki (CMU)
  • Ian Buck (NVIDIA)
  • Jim Gray (MSR)
  • Ming Lin (UNC Chapel Hill)
  • David Luebke (NVIDIA)
  • Dinesh Manocha (UNC)
  • John Owens (UC Davis)
  • Timothy Purcell (NVIDIA)
  • Stephane Redon (INRIA)
  • Rasmus Tamstorf (Walt Disney Feature Animation)
Acknowledgements
  • Supporters:
    • Fred Brooks (UNC Chapel Hill)
    • Pat Hanrahan (Stanford University)
    • Michael Macedonia (ARDA)
    • Jan Prins (UNC Chapel Hill)
    • David Tarditi (Microsoft Research)
    • Jingren Zhou (Microsoft Research)
    • Faculty members of UNC Chapel Hill
Acknowledgements
  • Students:
    • Nico Galoppo
    • Russel Gayle
    • Michael Henson
    • Nitin Jain
    • Ilknur Kabul
    • Ritesh Kumar
    • Brandon Lloyd
    • Nikunj Raghuvanshi
    • Avneesh Sud
    • Talha Zaman
  • UNC Systems, GAMMA and Walkthrough groups
Thank You
  • Questions or Comments?

[email protected]

http://www.cs.unc.edu/~naga
