Loading in 5 sec....

High Performance Sorting and Searching using Graphics ProcessorsPowerPoint Presentation

High Performance Sorting and Searching using Graphics Processors

Download Presentation

High Performance Sorting and Searching using Graphics Processors

Loading in 2 Seconds...

- 301 Views
- Updated On :
- Presentation posted in: Internet / Web

High Performance Sorting and Searching using Graphics Processors. Naga K. Govindaraju Microsoft Concurrency. Sorting and Searching. “I believe that virtually every important aspect of programming arises somewhere in the context of sorting or searching !” -Don Knuth. Sorting and Searching.

High Performance Sorting and Searching using Graphics Processors

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

High Performance Sorting and Searching using Graphics Processors

Naga K. GovindarajuMicrosoft Concurrency

“I believe that virtually every important aspect of programming arises somewhere in the context of sorting or searching!”

-Don Knuth

- Well studied
- High performance computing
- Databases
- Computer graphics
- Programming languages
- ...

- Google map reduce algorithm
- Spec benchmark routine!

- Terabyte-data sets are common
- Google sorts more than 100 billion terms in its index
- > 1 Trillion records in web indexed!

- Database sizes are rapidly increasing!
- Max DB sizes increases 3x per year (http://www.wintercorp.com)
- Processor improvements not matching information explosion

CPU(3 GHz)

AGP Memory(512 MB)

GPU (690 MHz)

Video Memory(512 MB)

2 x 1 MB Cache

System Memory(2 GB)

PCI-E Bus(4 GB/s)

GPU (690 MHz)

Video Memory(512 MB)

- Require random memory accesses
- Small CPU caches (< 2MB)
- Random memory accesses slower than even sequential disk accesses

- High memory latency
- Huge memory to compute gap!

- CPUs are deeply pipelined
- Pentium 4 has 30 pipeline stages
- Do not hide latency - high cycles per instruction (CPI)

- CPU is under-utilized for data intensive applications

- Sorting is hard!
- GPU a potentially scalable solution to terabyte sorting and scientific computing
- We beat the sorting benchmark with GPUs and provide a scaleable solution

- Commodity processor for graphics applications
- Massively parallel vector processors
- High memory bandwidth
- Low memory latency pipeline
- Programmable

- High growth rate
- Power-efficient

Laptops

Consoles

Cell phones

PSP

Desktops

- Commodity processor for graphics applications
- Massively parallel vector processors
- 10x more operations per sec than CPUs

- High memory bandwidth
- Better hides memory latency pipeline
- Programmable

- High growth rate
- Power-efficient

Graphics FLOPS

GPU – 1.3 TFLOPS

CPU – 25.6 GFLOPS

- Commodity processor for graphics applications
- Massively parallel vector processors
- High memory bandwidth
- Better hides latency pipeline
- Programmable
- 10x more memory bandwidth than CPUs

- High growth rate
- Power-efficient

Low pipeline depth

56 GB/s

programmable vertex

processing (fp32)

vertex

polygon setup,

culling, rasterization

setup

polygon

rasterizer

Hides memory latency!!

programmable per-

pixel math (fp32)

pixel

per-pixel texture,

fp16 blending

texture

Z-buf, fp16 blending,

anti-alias (MRT)

memory

image

programmable MIMD

processing (fp32)

data

Courtesy: David Kirk,Chief Scientist, NVIDIA

SIMD

“rasterization”

setup

lists

rasterizer

programmable SIMD

processing (fp32)

data

data fetch,

fp16 blending

data

predicated write, fp16

blend, multiple output

memory

data

- Commodity processor for graphics applications
- Massively parallel vector processors
- High memory bandwidth
- Better hides latency pipeline
- Programmable

- High growth rate
- Power-efficient

GPU Growth Rate

CPU Growth Rate

Exploiting Technology Moving

Faster than Moore’s Law

- Commodity processor for graphics applications
- Massively parallel vector processors
- High memory bandwidth
- Better hides latency pipeline
- Programmable

- High growth rate
- Power-efficient

- No support for arbitrary writes
- Optimized CPU algorithms do not map!

- Lack of support for general data types
- Cache-efficient algorithms
- Small data caches
- No cache information from vendors

- Out-of-core algorithms
- Limited GPU memory

- Overview
- Sorting and Searching on GPUs
- Applications
- Conclusions and Future Work

- Adaptive sorting algorithms
- Extent of sorted order in a sequence

- General sorting algorithms
- External memory sorting algorithms

- Prior adaptive sorting algorithms require random data writes
- GPUs optimized for minimum depth or visible surface computation
- Using depth test functionality

- Design adaptive sorting using only minimum computations

N. Govindaraju, M. Henson, M. Lin and D. Manocha, Proc. Of ACM I3D, 2005

- Multiple iterations
- Each iteration uses a two pass algorithm
- First pass – Compute an increasing sequence M
- Second pass - Compute the sorted elements in M
- Iterate on the remaining unsorted elements

Given a sequence S={x1,…, xn}, an element xi belongs to M if and only if xi ≤ xj, i<j, xj in S

≤

X1 X2 X3… Xi-1 Xi Xi+1 … Xn-2 Xn-1 Xn

M is an increasing sequence

Compute

X1 X2 … Xi-1 Xi Xi+1 … Xn-1 Xn

Compute

Xn≤ ∞

Increasing SequenceComputation

Xn

Compute

Yes. Prepend xi to MMin = xi

xi≤Min?

Increasing SequenceComputation

XiXi+1 … Xn-1 Xn

Compute

Increasing SequenceComputation

X1 X2 … Xi-1 Xi Xi+1 … Xn-1 Xn

x1≤{x2,…,xn}?

Theorem 1:Given the increasing sequence M, rank of an element xi in M is determined if xi < min (I-M)

X1 X2 X3… Xi-1 Xi Xi+1 … Xn-2 Xn-1 Xn

≥

≤

≤

X1X3… XiXi+1 … Xn-2 Xn-1

X2 … Xi-1 … Xn

- Linear-time algorithm
- Maintaining minimum

Compute

X1X3… XiXi+1 … Xn-2 Xn-1

X2 … Xi-1 … Xn

Compute

No. Update min

Xi in M?

Yes. Append Xito sorted list

Xi≤ min?

X1 X2 … Xi-1Xi

Compute

X1 X2 … Xi-1 Xi Xi+1 … Xn-1Xn

Knuth’s measure of disorder:

Given a sequence I and its longest increasing sequence LIS(I), the sequence of disordered elements Y = I - LIS(I)

Theorem 2: Given a sequence I and LIS(I), our adaptive algorithm sorts in at most (2 ||Y|| + 1) iterations

X1 X2 …Xl Xl+1 Xl+2 …Xm Xm+1 Xm+2 ...Xq Xq+1 Xq+2 ...Xn

2 iterations

2 iterations

2 iterations

X1 X2 …XlXl+1 Xl+2 …XmXm+1 Xm+2 ...XqXq+1 Xq+2 ...Xn

≤

8

12345679

Sorted

8

9

1234567

- Linear in the input size and sorted extent
- Works well on almost sorted input

- Maps well to GPUs
- Uses depth test functionality for minimum operations
- Useful for performing 3D visibility ordering
- Perform transparency computations on dynamic 3D environments

- Cons:
- Expected time: O(n2 – 2 n√n) on random sequences

- 790K polygons
- Depth complexity ~ 13
- 1600x1200 resolution
- NVIDIA GeForce 6800
- 5-8 fps

N. Govindaraju,N. Raghuvanshi and D. Manocha, Proc. Of ACM SIGMOD, 2005

- General datasets
- High performance

- Design sorting algorithms with deterministic memory accesses – “Texturing” on GPUs
- 56 GB/s peak memory bandwidth
- Can better hide the memory latency!!

- Require minimum and maximum computations – “Blending functionality” on GPUs
- Low branching overhead

- No data dependencies
- Utilize high parallelism on GPUs

- Represent data as 2D arrays
- Multi-stage algorithm
- Each stage involves multiple steps

- In each step
- Compare one array element against exactly one other element at fixed distance
- Perform a conditional assignment (MIN or MAX) at each element location

Sorting Flash Animation here

- GPUs optimized for 2D representations
- Map 1D arrays to 2D arrays
- Minimum and maximum regions mapped to row-aligned or column-aligned quads

MIN

MAX

MIN

Effectively reduce instructions per element

Input Vertices

Texturing, Caching and 2D Quad

Comparisons

Sequential Writes

3-6x faster than prior GPU-based algorithms!

2-2.5x faster thanIntel high-end processors

Single GPU performance comparable tohigh-end dual core Athlon

Hand-optimized CPU code from Intel Corporation!

2-2.5x faster thanIntel high-end processors

Single GPU performance comparable tohigh-end dual core Athlon

Slash Dot and Toms Hardware Guide Headlines, June 2005

N. Govindaraju,S. Larsen,J. Gray, and D. Manocha, Proc. Of ACM SuperComputing, 2006

- Small data caches
- Better hide the memory latency
- Vendors do not disclose cache information – critical for scientific computing on GPUs

- We design simple model
- Determine cache parameters (block and cache sizes)
- Improve sorting, FFT and SGEMM performance

Cache

Cache Eviction

Cache Eviction

Cache Eviction

h

Cache misses per step = 2 W H/ (h B)

Cache

Cache Eviction

Cache Eviction

Cache Eviction

- lg n possible steps in bitonic sorting network
- Step k is performed (lg n – k+1) times and h = 2k-1
- Data fetched from memory = 2 n f(B) where f(B)=(B-1) (lg n -1) + 0.5 (lg n –lg B)2

h

Cache

50 GB/s on a single GPU

Peak Performance: Effectively hide memory latency with 15 GOP/s

N. Govindaraju,J. Gray, R. Kumar and D. Manocha, Proc. of ACM SIGMOD 2006 (to appear)

- Performed on Terabyte-scale databases
- Two phases algorithm [Vitter01, Salzberg90, Nyberg94, Nyberg95]
- Limited main memory
- First phase – partitions input file into large data chunks and writes sorted chunks known as “Runs”
- Second phase – Merge the “Runs” to generate the sorted file

- Performance mainly governed by I/O
Salzberg Analysis: Given the main memory size M and the file size N, if the I/O read size per run is T in phase 2, external memory sorting achieves efficient I/O performance if the run size R in phase 1 is given by R ≈ √(TN)

- If N=100GB, T=2MB, then R ≈ 230MB
- Large data sorting is inefficient on CPUs
- R » CPU cache sizes – memory latency

- External memory sorting on CPUs can have low performance due to
- High memory latency
- Or low I/O performance

- Our algorithm
- Sorts large data arrays on GPUs
- Perform I/O operations in parallel on CPUs

Salzberg Analysis: 100 MB Run Size

Salzberg Analysis: 100 MB Run Size

Pentium IV: 25MB Run Size

Less work and only 75% IO efficient!

Salzberg Analysis: 100 MB Run Size

Dual 3.6 GHz Xeons: 25MB Run size

More cores, less work but only 85% IO efficient!

Salzberg Analysis: 100 MB Run Size

7800 GT: 100MB run size

Ideal work, and 92% IO efficient with single CPU!

Performance limited by IO and memory

Faster and more scalable than Dual Xeon processors (3.6 GHz)!

1.8x faster than current Terabyte sorter

World’s best performance/$ system

- Exploit high memory bandwidth on GPUs
- Higher memory performance than CPU-based algorithms

- High I/O performance due to large run sizes

- Offload work from CPUs
- CPU cycles well-utilized for resource management

- Scalable solution for large databases
- Best performance/price solution for terabyte sorting

N. Govindaraju,B. Lloyd, W. Wang, M. Lin and D. Manocha, Proc. Of ACM SIGMOD, 2004

- Given a set of values in the GPU, compute the Kth-largest number
- Traditional CPU algorithms require arbitrary data writes - require new algorithm without
- Data rearrangement
- Data readback to CPU

- Our solution – search for the Kth-largest number

- Let vk denote the k-th largest number
- How do we generate a number m equal to vk?
- Without knowing vk’s value
- Count the number of values ≥ some given value
- Starting from the most significant bit, determine the value of each bit at a time

- Given a set S of values
- c(m) —number of values ≥ m
- vk — the k-th largest number

- We have
- If c(m) ≥ k, then m ≤ vk
- If c(m) < k, then m > vk

- c(m) computed using occlusion queries

0011

1011

1101

m = 0000

v2 = 1011

0111

0101

0001

0111

1010

0010

0011

1011

1101

m = 1000

v2 = 1011

0111

0101

0001

0111

1010

0010

0011

1011

1101

m = 1000

v2 = 1011

0111

0101

0001

c(m) = 3

0111

1010

0010

0011

1011

1101

m = 1100

v2 = 1011

0111

0101

0001

0111

1010

0010

0011

1011

1101

m = 1100

v2 = 1011

0111

0101

0001

c(m) = 1

0111

1010

0010

0011

1011

1101

m = 1010

v2 = 1011

0111

0101

0001

0111

1010

0010

0011

1011

1101

m = 1010

v2 = 1011

0111

0101

0001

c(m) = 3

0111

1010

0010

0011

1011

1101

m = 1011

v2 = 1011

0111

0101

0001

0111

1010

0010

0011

1011

1101

m = 1011

v2 = 1011

0111

0101

0001

c(m) = 2

0111

1010

0010

- Initialize m to 0
- Start with the MSB and scan all bits till LSB
- At each bit, put 1 in the corresponding bit-position of m
- If c(m) < k, make that bit 0
- Proceed to the next bit

NV35

3x performance improvement per year!

- Compute the overlapping geometric primitives
- Fundamental to
- Robotics and motion planning, computer graphics, computational geometry, computer animations, virtual reality, etc.

- Overlap computation is a sorting problem!

- Exploit interactivity in graphics applications
- Objects do not move significantly – use coherence
- Adaptive visibility ordering for dynamic data sets!

- Define precedence based on visibility
- An object O precedes a set of objects S if it is completely in front of S

View

O

O1

O2

View

O

Sufficient but not a necessary condition for existence of separating surface with unit depth complexity

More culling efficiency than bounding volumes

O1

O2

Govindaraju et al., Proc. Of ACM SIGGRAPH, 2005

- 450 ms per frame
- Over 100x performance improvement over prior algorithms

- Fundamental to scientific, data mining, multimedia and high performance computing applications
- Highly compute- and memory-intensive
- LU, QR and singular value decomposition O(n3)
- SORT O(n log2n)
- FFT O(n logn)
- SGEMM O(n3)

N. Galoppo, N. Govindaraju,M. Henson and D. Manocha, Proc. of ACM SuperComputing 2005

Faster than Atlas!

(Aug 2004)

(Jun 2005)

Slash Dot News Headlines, May 2006

- GPUSORT – http://gamma.cs.unc.edu/GPUSORT(Over 1200 downloads since 07/2005)
- GPUFFTW – http://gamma.cs.unc.edu/GPUFFTW(Over 900 downloads since 06/2006)
- LUGPU – http://gamma.cs.unc.edu/LUGPULIB(Over 300 downloads since 11/2005)

- Designed new sorting algorithms on GPUs
- Adaptive, general and external memory sorting algorithms
- Applied to interactive graphics applications

- Presented novel GPU memory models
- Memory efficient sorting algorithm with peak memory performance of (50 GB/s) on GPUs
- Applicable to all scientific computing algorithms on GPUs and GPU-like architectures

- Novel external memory sorting algorithm as a scalable solution
- Achieves peak I/O performance on CPUs
- Best performance/price solution – world’s fastest sorting system

- High performance growth rate characteristics
- Improve 2-3 times/yr

- Designed high performance/price solutions
- High wattage of CPUs - major scalability bottleneck!
- Effective cooling solutions for CPUs

- Design high performance/watt solutions using GPUs
- Two orders of magnitude more power efficient!
- Improve the reliability, fault tolerance and stability

- Adaptive power management algorithms on GPUs

- Scientific libraries utilizing high parallelism and memory bandwidth
- Scientific routines on LU, QR, SVD, FFT, etc.
- BLAS library on GPUs
- Build GPU-LAPACK and Matlab routines

- Multi-cores are emerging
- Exploit parallelism in multiple CPUs and GPUs;
- Quad-GPU solutions provide > 200GB/s

- Design and analyze new scientific and HPC algorithms
- Exploit processor parallelism
- Lower memory latency

Research Sponsors:

- Army Research Office
- Defense and Advanced Research Projects Agency
- National Science Foundation
- Naval Research Laboratory
- Intel Corporation
- RDECOM

- NVIDIA, ATI and Microsoft Corporation
- Architecture
- David Kirk, Michael Doggett, Steven Molnar

- Drivers
- Evan Hart, Paul Keller, Nick Triantos

- Developer relations
- Mark Harris, Mark Segal

- Microsoft DirectX
- David Blythe, Craig Peeper, Peter-Pike Sloan

- Architecture

Collaborators:

- Anastassia Ailamaki (CMU)
- Ian Buck (NVIDIA)
- Jim Gray (MSR)
- Ming Lin (UNC Chapel Hill)
- David Luebke (NVIDIA)
- Dinesh Manocha (UNC)
- John Owens (UC Davis)
- Timothy Purcell (NVIDIA)
- Stephane Redon (INRIA)
- Rasmus Tamstorf (Walt Disney Animation Feature)

- Supporters:
- Fred Brooks (UNC Chapel Hill)
- Pat Hanrahan (Stanford University)
- Michael Macedonia (ARDA)
- Jan Prins (UNC Chapel Hill)
- David Tarditi (Microsoft Research)
- Jingren Zhou (Microsoft Research)
- Faculty members of UNC Chapel Hill

- Ritesh Kumar
- Brandon Lloyd
- Nikunj Raghuvanshi
- Avneesh Sud
- Talha Zaman

- Students:
- Nico Galoppo
- Russel Gayle
- Michael Henson
- Nitin Jain
- Ilknur Kabul

- UNC Systems, GAMMA and Walkthrough groups

- Questions or Comments?

naga@cs.unc.edu

http://www.cs.unc.edu/~naga

- http://www.webmasterworld.com/forum3/2657.htmhttp://www.dnlodge.com/dnl-resource-collection/1291-google-page-rank-update-history.html