High Performance Sorting and Searching using Graphics Processors

High Performance Sorting and Searching using Graphics Processors Naga K. GovindarajuMicrosoft Concurrency

Sorting and Searching “I believe that virtually every important aspect of programming arises somewhere in the context of sorting or searching!” -Don Knuth

Sorting and Searching • Well studied • High performance computing • Databases • Computer graphics • Programming languages • ... • Google map reduce algorithm • Spec benchmark routine!

Massive Databases • Terabyte-data sets are common • Google sorts more than 100 billion terms in its index • > 1 Trillion records in web indexed! • Database sizes are rapidly increasing! • Max DB sizes increases 3x per year (http://www.wintercorp.com) • Processor improvements not matching information explosion

CPU(3 GHz) AGP Memory(512 MB) CPU vs. GPU GPU (690 MHz) Video Memory(512 MB) 2 x 1 MB Cache System Memory(2 GB) PCI-E Bus(4 GB/s) GPU (690 MHz) Video Memory(512 MB)

Massive Data Handling on CPUs • Require random memory accesses • Small CPU caches (< 2MB) • Random memory accesses slower than even sequential disk accesses • High memory latency • Huge memory to compute gap! • CPUs are deeply pipelined • Pentium 4 has 30 pipeline stages • Do not hide latency - high cycles per instruction (CPI) • CPU is under-utilized for data intensive applications

Massive Data Handling on CPUs • Sorting is hard! • GPU a potentially scalable solution to terabyte sorting and scientific computing • We beat the sorting benchmark with GPUs and provide a scaleable solution

Graphics Processing Units (GPUs) • Commodity processor for graphics applications • Massively parallel vector processors • High memory bandwidth • Low memory latency pipeline • Programmable • High growth rate • Power-efficient

GPU: Commodity Processor Laptops Consoles Cell phones PSP Desktops

Graphics Processing Units (GPUs) • Commodity processor for graphics applications • Massively parallel vector processors • 10x more operations per sec than CPUs • High memory bandwidth • Better hides memory latency pipeline • Programmable • High growth rate • Power-efficient

Parallelism on GPUs Graphics FLOPS GPU – 1.3 TFLOPS CPU – 25.6 GFLOPS

Graphics Processing Units (GPUs) • Commodity processor for graphics applications • Massively parallel vector processors • High memory bandwidth • Better hides latency pipeline • Programmable • 10x more memory bandwidth than CPUs • High growth rate • Power-efficient

Low pipeline depth Graphics Pipeline 56 GB/s programmable vertex processing (fp32) vertex polygon setup, culling, rasterization setup polygon rasterizer Hides memory latency!! programmable per- pixel math (fp32) pixel per-pixel texture, fp16 blending texture Z-buf, fp16 blending, anti-alias (MRT) memory image

NON-Graphics Pipeline Abstraction programmable MIMD processing (fp32) data Courtesy: David Kirk,Chief Scientist, NVIDIA SIMD “rasterization” setup lists rasterizer programmable SIMD processing (fp32) data data fetch, fp16 blending data predicated write, fp16 blend, multiple output memory data

Graphics Processing Units (GPUs) • Commodity processor for graphics applications • Massively parallel vector processors • High memory bandwidth • Better hides latency pipeline • Programmable • High growth rate • Power-efficient

GPU Growth Rate CPU Growth Rate Exploiting Technology Moving Faster than Moore’s Law

Graphics Processing Units (GPUs) • Commodity processor for graphics applications • Massively parallel vector processors • High memory bandwidth • Better hides latency pipeline • Programmable • High growth rate • Power-efficient

CPU vs. GPU(Henry Moreton: NVIDIA, Aug. 2005)

GPUs for Sorting and Searching: Issues • No support for arbitrary writes • Optimized CPU algorithms do not map! • Lack of support for general data types • Cache-efficient algorithms • Small data caches • No cache information from vendors • Out-of-core algorithms • Limited GPU memory

Outline • Overview • Sorting and Searching on GPUs • Applications • Conclusions and Future Work

Sorting on GPUs • Adaptive sorting algorithms • Extent of sorted order in a sequence • General sorting algorithms • External memory sorting algorithms

Adaptive Sorting on GPUs • Prior adaptive sorting algorithms require random data writes • GPUs optimized for minimum depth or visible surface computation • Using depth test functionality • Design adaptive sorting using only minimum computations

N. Govindaraju, M. Henson, M. Lin and D. Manocha, Proc. Of ACM I3D, 2005 Adaptive Sorting Algorithm • Multiple iterations • Each iteration uses a two pass algorithm • First pass – Compute an increasing sequence M • Second pass - Compute the sorted elements in M • Iterate on the remaining unsorted elements

Increasing Sequence Given a sequence S={x1,…, xn}, an element xi belongs to M if and only if xi ≤ xj, i<j, xj in S

≤ Increasing Sequence X1 X2 X3… Xi-1 Xi Xi+1 … Xn-2 Xn-1 Xn M is an increasing sequence

Compute Increasing SequenceComputation X1 X2 … Xi-1 Xi Xi+1 … Xn-1 Xn

Compute Xn≤ ∞ Increasing SequenceComputation Xn

Compute Yes. Prepend xi to MMin = xi xi≤Min? Increasing SequenceComputation XiXi+1 … Xn-1 Xn

Compute Increasing SequenceComputation X1 X2 … Xi-1 Xi Xi+1 … Xn-1 Xn x1≤{x2,…,xn}?

Computing Sorted Elements Theorem 1:Given the increasing sequence M, rank of an element xi in M is determined if xi < min (I-M)

Computing Sorted Elements X1 X2 X3… Xi-1 Xi Xi+1 … Xn-2 Xn-1 Xn

≥ ≤ ≤ Computing Sorted Elements X1X3… XiXi+1 … Xn-2 Xn-1 X2 … Xi-1 … Xn

Computing Sorted Elements • Linear-time algorithm • Maintaining minimum

Compute Computing Sorted Elements X1X3… XiXi+1 … Xn-2 Xn-1 X2 … Xi-1 … Xn

Compute No. Update min Xi in M? Yes. Append Xito sorted list Xi≤ min? Computing Sorted Elements X1 X2 … Xi-1Xi

Compute Computing Sorted Elements X1 X2 … Xi-1 Xi Xi+1 … Xn-1Xn

Algorithm Analysis Knuth’s measure of disorder: Given a sequence I and its longest increasing sequence LIS(I), the sequence of disordered elements Y = I - LIS(I) Theorem 2: Given a sequence I and LIS(I), our adaptive algorithm sorts in at most (2 ||Y|| + 1) iterations

Pictorial Proof X1 X2 …Xl Xl+1 Xl+2 …Xm Xm+1 Xm+2 ...Xq Xq+1 Xq+2 ...Xn

2 iterations 2 iterations 2 iterations Pictorial Proof X1 X2 …XlXl+1 Xl+2 …XmXm+1 Xm+2 ...XqXq+1 Xq+2 ...Xn

≤ Example 8 1 2 3 4 5 6 7 9

Sorted Example 8 9 1 2 3 4 5 6 7

Analysis

Advantages • Linear in the input size and sorted extent • Works well on almost sorted input • Maps well to GPUs • Uses depth test functionality for minimum operations • Useful for performing 3D visibility ordering • Perform transparency computations on dynamic 3D environments • Cons: • Expected time: O(n2 – 2 n√n) on random sequences

Video: Transparent PowerPlant • 790K polygons • Depth complexity ~ 13 • 1600x1200 resolution • NVIDIA GeForce 6800 • 5-8 fps

Video: Transparent PowerPlant

N. Govindaraju,N. Raghuvanshi and D. Manocha, Proc. Of ACM SIGMOD, 2005 General Sorting on GPUs • General datasets • High performance

General Sorting on GPUs • Design sorting algorithms with deterministic memory accesses – “Texturing” on GPUs • 56 GB/s peak memory bandwidth • Can better hide the memory latency!! • Require minimum and maximum computations – “Blending functionality” on GPUs • Low branching overhead • No data dependencies • Utilize high parallelism on GPUs

GPU-Based Sorting Networks • Represent data as 2D arrays • Multi-stage algorithm • Each stage involves multiple steps • In each step • Compare one array element against exactly one other element at fixed distance • Perform a conditional assignment (MIN or MAX) at each element location

Sorting Flash Animation here

2D Memory Addressing • GPUs optimized for 2D representations • Map 1D arrays to 2D arrays • Minimum and maximum regions mapped to row-aligned or column-aligned quads

High Performance Sorting and Searching using Graphics Processors

High Performance Sorting and Searching using Graphics Processors

Presentation Transcript

Searching and Sorting

Searching and Sorting

Sorting and Searching

Searching and Sorting

High Performance Processors and Systems

Sorting and Searching

Searching and Sorting

Searching and Sorting

SEARCHING AND SORTING

Gnort : High Performance Network Intrusion Detection Using Graphics Processors

High Performance Processors

High Performance Discrete Fourier Transforms on Graphics Processors

Searching and Sorting

Searching and Sorting

Sorting and Searching

Searching and Sorting

Searching and Sorting

Sorting and Searching

Searching and Sorting

Sorting and Searching