Loading in 2 Seconds...

High Performance Sorting and Searching using Graphics Processors

Loading in 2 Seconds...

- 318 Views
- Uploaded on

Download Presentation
## PowerPoint Slideshow about 'high performance sorting and searching ' - ryanadan

**An Image/Link below is provided (as is) to download presentation**

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

### High Performance Sorting and Searching using Graphics Processors

Naga K. GovindarajuMicrosoft Concurrency

Sorting and Searching

“I believe that virtually every important aspect of programming arises somewhere in the context of sorting or searching!”

-Don Knuth

Sorting and Searching

- Well studied
- High performance computing
- Databases
- Computer graphics
- Programming languages
- ...
- Google map reduce algorithm
- Spec benchmark routine!

Massive Databases

- Terabyte-data sets are common
- Google sorts more than 100 billion terms in its index
- > 1 Trillion records in web indexed!
- Database sizes are rapidly increasing!
- Max DB sizes increases 3x per year (http://www.wintercorp.com)
- Processor improvements not matching information explosion

AGP Memory(512 MB)

CPU vs. GPUGPU (690 MHz)

Video Memory(512 MB)

2 x 1 MB Cache

System Memory(2 GB)

PCI-E Bus(4 GB/s)

GPU (690 MHz)

Video Memory(512 MB)

Massive Data Handling on CPUs

- Require random memory accesses
- Small CPU caches (< 2MB)
- Random memory accesses slower than even sequential disk accesses
- High memory latency
- Huge memory to compute gap!
- CPUs are deeply pipelined
- Pentium 4 has 30 pipeline stages
- Do not hide latency - high cycles per instruction (CPI)
- CPU is under-utilized for data intensive applications

Massive Data Handling on CPUs

- Sorting is hard!
- GPU a potentially scalable solution to terabyte sorting and scientific computing
- We beat the sorting benchmark with GPUs and provide a scaleable solution

Graphics Processing Units (GPUs)

- Commodity processor for graphics applications
- Massively parallel vector processors
- High memory bandwidth
- Low memory latency pipeline
- Programmable
- High growth rate
- Power-efficient

Graphics Processing Units (GPUs)

- Commodity processor for graphics applications
- Massively parallel vector processors
- 10x more operations per sec than CPUs
- High memory bandwidth
- Better hides memory latency pipeline
- Programmable
- High growth rate
- Power-efficient

Graphics Processing Units (GPUs)

- Commodity processor for graphics applications
- Massively parallel vector processors
- High memory bandwidth
- Better hides latency pipeline
- Programmable
- 10x more memory bandwidth than CPUs
- High growth rate
- Power-efficient

Graphics Pipeline

56 GB/s

programmable vertex

processing (fp32)

vertex

polygon setup,

culling, rasterization

setup

polygon

rasterizer

Hides memory latency!!

programmable per-

pixel math (fp32)

pixel

per-pixel texture,

fp16 blending

texture

Z-buf, fp16 blending,

anti-alias (MRT)

memory

image

NON-Graphics Pipeline Abstraction

programmable MIMD

processing (fp32)

data

Courtesy: David Kirk,Chief Scientist, NVIDIA

SIMD

“rasterization”

setup

lists

rasterizer

programmable SIMD

processing (fp32)

data

data fetch,

fp16 blending

data

predicated write, fp16

blend, multiple output

memory

data

Graphics Processing Units (GPUs)

- Commodity processor for graphics applications
- Massively parallel vector processors
- High memory bandwidth
- Better hides latency pipeline
- Programmable
- High growth rate
- Power-efficient

Graphics Processing Units (GPUs)

- Commodity processor for graphics applications
- Massively parallel vector processors
- High memory bandwidth
- Better hides latency pipeline
- Programmable
- High growth rate
- Power-efficient

GPUs for Sorting and Searching: Issues

- No support for arbitrary writes
- Optimized CPU algorithms do not map!
- Lack of support for general data types
- Cache-efficient algorithms
- Small data caches
- No cache information from vendors
- Out-of-core algorithms
- Limited GPU memory

Outline

- Overview
- Sorting and Searching on GPUs
- Applications
- Conclusions and Future Work

Sorting on GPUs

- Adaptive sorting algorithms
- Extent of sorted order in a sequence
- General sorting algorithms
- External memory sorting algorithms

Adaptive Sorting on GPUs

- Prior adaptive sorting algorithms require random data writes
- GPUs optimized for minimum depth or visible surface computation
- Using depth test functionality
- Design adaptive sorting using only minimum computations

N. Govindaraju, M. Henson, M. Lin and D. Manocha, Proc. Of ACM I3D, 2005

Adaptive Sorting Algorithm- Multiple iterations
- Each iteration uses a two pass algorithm
- First pass – Compute an increasing sequence M
- Second pass - Compute the sorted elements in M
- Iterate on the remaining unsorted elements

Increasing Sequence

Given a sequence S={x1,…, xn}, an element xi belongs to M if and only if xi ≤ xj, i<j, xj in S

Computing Sorted Elements

Theorem 1:Given the increasing sequence M, rank of an element xi in M is determined if xi < min (I-M)

Computing Sorted Elements

X1 X2 X3… Xi-1 Xi Xi+1 … Xn-2 Xn-1 Xn

Computing Sorted Elements

- Linear-time algorithm
- Maintaining minimum

Algorithm Analysis

Knuth’s measure of disorder:

Given a sequence I and its longest increasing sequence LIS(I), the sequence of disordered elements Y = I - LIS(I)

Theorem 2: Given a sequence I and LIS(I), our adaptive algorithm sorts in at most (2 ||Y|| + 1) iterations

Pictorial Proof

X1 X2 …Xl Xl+1 Xl+2 …Xm Xm+1 Xm+2 ...Xq Xq+1 Xq+2 ...Xn

Advantages

- Linear in the input size and sorted extent
- Works well on almost sorted input
- Maps well to GPUs
- Uses depth test functionality for minimum operations
- Useful for performing 3D visibility ordering
- Perform transparency computations on dynamic 3D environments
- Cons:
- Expected time: O(n2 – 2 n√n) on random sequences

Video: Transparent PowerPlant

- 790K polygons
- Depth complexity ~ 13
- 1600x1200 resolution
- NVIDIA GeForce 6800
- 5-8 fps

N. Govindaraju,N. Raghuvanshi and D. Manocha, Proc. Of ACM SIGMOD, 2005

General Sorting on GPUs- General datasets
- High performance

General Sorting on GPUs

- Design sorting algorithms with deterministic memory accesses – “Texturing” on GPUs
- 56 GB/s peak memory bandwidth
- Can better hide the memory latency!!
- Require minimum and maximum computations – “Blending functionality” on GPUs
- Low branching overhead
- No data dependencies
- Utilize high parallelism on GPUs

GPU-Based Sorting Networks

- Represent data as 2D arrays
- Multi-stage algorithm
- Each stage involves multiple steps
- In each step
- Compare one array element against exactly one other element at fixed distance
- Perform a conditional assignment (MIN or MAX) at each element location

2D Memory Addressing

- GPUs optimized for 2D representations
- Map 1D arrays to 2D arrays
- Minimum and maximum regions mapped to row-aligned or column-aligned quads

Sorting on GPU: Pipelining and Parallelism

Input Vertices

Texturing, Caching and 2D Quad

Comparisons

Sequential Writes

Comparison with GPU-Based Algorithms

3-6x faster than prior GPU-based algorithms!

GPU vs. High-End Multi-Core CPUs

2-2.5x faster thanIntel high-end processors

Single GPU performance comparable tohigh-end dual core Athlon

Hand-optimized CPU code from Intel Corporation!

GPU vs. High-End Multi-Core CPUs

2-2.5x faster thanIntel high-end processors

Single GPU performance comparable tohigh-end dual core Athlon

Slash Dot and Toms Hardware Guide Headlines, June 2005

N. Govindaraju,S. Larsen,J. Gray, and D. Manocha, Proc. Of ACM SuperComputing, 2006

GPU Cache Model- Small data caches
- Better hide the memory latency
- Vendors do not disclose cache information – critical for scientific computing on GPUs
- We design simple model
- Determine cache parameters (block and cache sizes)
- Improve sorting, FFT and SGEMM performance

Analysis

- lg n possible steps in bitonic sorting network
- Step k is performed (lg n – k+1) times and h = 2k-1
- Data fetched from memory = 2 n f(B) where f(B)=(B-1) (lg n -1) + 0.5 (lg n –lg B)2

Super-Moore’s Law Growth

50 GB/s on a single GPU

Peak Performance: Effectively hide memory latency with 15 GOP/s

N. Govindaraju,J. Gray, R. Kumar and D. Manocha, Proc. of ACM SIGMOD 2006 (to appear)

External Memory Sorting- Performed on Terabyte-scale databases
- Two phases algorithm [Vitter01, Salzberg90, Nyberg94, Nyberg95]
- Limited main memory
- First phase – partitions input file into large data chunks and writes sorted chunks known as “Runs”
- Second phase – Merge the “Runs” to generate the sorted file

External Memory Sorting

- Performance mainly governed by I/O

Salzberg Analysis: Given the main memory size M and the file size N, if the I/O read size per run is T in phase 2, external memory sorting achieves efficient I/O performance if the run size R in phase 1 is given by R ≈ √(TN)

Salzberg Analysis

- If N=100GB, T=2MB, then R ≈ 230MB
- Large data sorting is inefficient on CPUs
- R » CPU cache sizes – memory latency

External memory sorting

- External memory sorting on CPUs can have low performance due to
- High memory latency
- Or low I/O performance
- Our algorithm
- Sorts large data arrays on GPUs
- Perform I/O operations in parallel on CPUs

I/O Performance

Salzberg Analysis: 100 MB Run Size

I/O Performance

Salzberg Analysis: 100 MB Run Size

Pentium IV: 25MB Run Size

Less work and only 75% IO efficient!

I/O Performance

Salzberg Analysis: 100 MB Run Size

Dual 3.6 GHz Xeons: 25MB Run size

More cores, less work but only 85% IO efficient!

I/O Performance

Salzberg Analysis: 100 MB Run Size

7800 GT: 100MB run size

Ideal work, and 92% IO efficient with single CPU!

Task Parallelism

Performance limited by IO and memory

Overall Performance

Faster and more scalable than Dual Xeon processors (3.6 GHz)!

Advantages

- Exploit high memory bandwidth on GPUs
- Higher memory performance than CPU-based algorithms
- High I/O performance due to large run sizes

Advantages

- Offload work from CPUs
- CPU cycles well-utilized for resource management
- Scalable solution for large databases
- Best performance/price solution for terabyte sorting

N. Govindaraju,B. Lloyd, W. Wang, M. Lin and D. Manocha, Proc. Of ACM SIGMOD, 2004

Searching for Quantiles- Given a set of values in the GPU, compute the Kth-largest number
- Traditional CPU algorithms require arbitrary data writes - require new algorithm without
- Data rearrangement
- Data readback to CPU
- Our solution – search for the Kth-largest number

K-th Largest Number

- Let vk denote the k-th largest number
- How do we generate a number m equal to vk?
- Without knowing vk’s value
- Count the number of values ≥ some given value
- Starting from the most significant bit, determine the value of each bit at a time

K-th Largest Number

- Given a set S of values
- c(m) —number of values ≥ m
- vk — the k-th largest number
- We have
- If c(m) ≥ k, then m ≤ vk
- If c(m) < k, then m > vk
- c(m) computed using occlusion queries

Our algorithm

- Initialize m to 0
- Start with the MSB and scan all bits till LSB
- At each bit, put 1 in the corresponding bit-position of m
- If c(m) < k, make that bit 0
- Proceed to the next bit

Kth-Largest

NV35

Median

3x performance improvement per year!

Application: Collision Detection

- Compute the overlapping geometric primitives
- Fundamental to
- Robotics and motion planning, computer graphics, computational geometry, computer animations, virtual reality, etc.
- Overlap computation is a sorting problem!

Application: Collision Detection

- Exploit interactivity in graphics applications
- Objects do not move significantly – use coherence
- Adaptive visibility ordering for dynamic data sets!
- Define precedence based on visibility
- An object O precedes a set of objects S if it is completely in front of S

O

Visibility for Collisions: Geometric InterpretationSufficient but not a necessary condition for existence of separating surface with unit depth complexity

More culling efficiency than bounding volumes

O1

O2

Govindaraju et al., Proc. Of ACM SIGGRAPH, 2005

Interactive Cloth Simulation- 450 ms per frame
- Over 100x performance improvement over prior algorithms

Numerical Algorithms on GPUs

- Fundamental to scientific, data mining, multimedia and high performance computing applications
- Highly compute- and memory-intensive
- LU, QR and singular value decomposition O(n3)
- SORT O(n log2n)
- FFT O(n logn)
- SGEMM O(n3)

N. Galoppo, N. Govindaraju,M. Henson and D. Manocha, Proc. of ACM SuperComputing 2005

LU-DecompostionFaster than Atlas!

(Aug 2004)

(Jun 2005)

Numerical Algorithms on GPUs

Slash Dot News Headlines, May 2006

Numerical Libraries

- GPUSORT – http://gamma.cs.unc.edu/GPUSORT(Over 1200 downloads since 07/2005)
- GPUFFTW – http://gamma.cs.unc.edu/GPUFFTW(Over 900 downloads since 06/2006)
- LUGPU – http://gamma.cs.unc.edu/LUGPULIB(Over 300 downloads since 11/2005)

Conclusions

- Designed new sorting algorithms on GPUs
- Adaptive, general and external memory sorting algorithms
- Applied to interactive graphics applications
- Presented novel GPU memory models
- Memory efficient sorting algorithm with peak memory performance of (50 GB/s) on GPUs
- Applicable to all scientific computing algorithms on GPUs and GPU-like architectures

Conclusions

- Novel external memory sorting algorithm as a scalable solution
- Achieves peak I/O performance on CPUs
- Best performance/price solution – world’s fastest sorting system
- High performance growth rate characteristics
- Improve 2-3 times/yr

Future Work: Power and Heat

- Designed high performance/price solutions
- High wattage of CPUs - major scalability bottleneck!
- Effective cooling solutions for CPUs
- Design high performance/watt solutions using GPUs
- Two orders of magnitude more power efficient!
- Improve the reliability, fault tolerance and stability
- Adaptive power management algorithms on GPUs

Future Work: Scientific libraries

- Scientific libraries utilizing high parallelism and memory bandwidth
- Scientific routines on LU, QR, SVD, FFT, etc.
- BLAS library on GPUs
- Build GPU-LAPACK and Matlab routines
- Multi-cores are emerging
- Exploit parallelism in multiple CPUs and GPUs;
- Quad-GPU solutions provide > 200GB/s

Future Work

- Design and analyze new scientific and HPC algorithms
- Exploit processor parallelism
- Lower memory latency

Acknowledgements

Research Sponsors:

- Army Research Office
- Defense and Advanced Research Projects Agency
- National Science Foundation
- Naval Research Laboratory
- Intel Corporation
- RDECOM

Acknowledgements

- NVIDIA, ATI and Microsoft Corporation
- Architecture
- David Kirk, Michael Doggett, Steven Molnar
- Drivers
- Evan Hart, Paul Keller, Nick Triantos
- Developer relations
- Mark Harris, Mark Segal
- Microsoft DirectX
- David Blythe, Craig Peeper, Peter-Pike Sloan

Acknowledgements

Collaborators:

- Anastassia Ailamaki (CMU)
- Ian Buck (NVIDIA)
- Jim Gray (MSR)
- Ming Lin (UNC Chapel Hill)
- David Luebke (NVIDIA)
- Dinesh Manocha (UNC)
- John Owens (UC Davis)
- Timothy Purcell (NVIDIA)
- Stephane Redon (INRIA)
- Rasmus Tamstorf (Walt Disney Animation Feature)

Acknowledgements

- Supporters:
- Fred Brooks (UNC Chapel Hill)
- Pat Hanrahan (Stanford University)
- Michael Macedonia (ARDA)
- Jan Prins (UNC Chapel Hill)
- David Tarditi (Microsoft Research)
- Jingren Zhou (Microsoft Research)
- Faculty members of UNC Chapel Hill

Acknowledgements

- Ritesh Kumar
- Brandon Lloyd
- Nikunj Raghuvanshi
- Avneesh Sud
- Talha Zaman

- Students:
- Nico Galoppo
- Russel Gayle
- Michael Henson
- Nitin Jain
- Ilknur Kabul
- UNC Systems, GAMMA and Walkthrough groups

Page rank update

- http://www.webmasterworld.com/forum3/2657.htmhttp://www.dnlodge.com/dnl-resource-collection/1291-google-page-rank-update-history.html

Download Presentation

Connecting to Server..