## PowerPoint Slideshow about 'Database Operations on GPU' - Albert_Lan


Presentation Transcript

Outline

- Database Operations on GPU
- Point List Generation on GPU
- Nearest Neighbor Searching on GPU

Design Issues

- Low bandwidth between GPU and CPU
- Avoid frame buffer readbacks
- No arbitrary writes
- Avoid data rearrangements
- Programmable pipeline has poor branching
- Evaluate branches using fixed function tests

Design Overview

- Use depth test functionality of GPUs for performing comparisons
- Implements all possible comparisons <, <=, >=, >, ==, !=, ALWAYS, NEVER
- Use stencil test for data validation and storing results of comparison operations
- Use occlusion query to count number of elements that satisfy some condition

Basic Operations

Basic SQL query

Select A

From T

Where C

A= attributes or aggregations (SUM, COUNT, MAX etc)

T=relational table

C= Boolean Combination of Predicates (using operators AND, OR, NOT)

Basic Operations

- Predicates – ai op constant or ai op aj
- Op is one of <,>,<=,>=,!=, =, TRUE, FALSE
- Boolean combinations – Conjunctive Normal Form (CNF) expression evaluation
- Aggregations – COUNT, SUM, MAX, MEDIAN, AVG

Predicate Evaluation

- ai op constant (d)
- Copy the attribute values ai into depth buffer
- Define the comparison operation using depth test
- Draw a screen filling quad at depth d

glDepthFunc(…)

glStencilOp(fail,zfail,zpass);
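The steps above can be imitated on the CPU to see what the depth and stencil buffers end up holding. This is only a sketch in plain Python: `evaluate_predicate`, `DEPTH_FUNCS`, and the list-based "buffers" are hypothetical names for illustration, not OpenGL calls. In OpenGL itself the depth function compares the incoming fragment (the quad at depth d) against the stored value, so the operand order is mirrored relative to this sketch.

```python
import operator

# Comparison operators standing in for the depth-test modes.
DEPTH_FUNCS = {
    "<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge,
    "==": operator.eq, "!=": operator.ne,
    "ALWAYS": lambda a, d: True, "NEVER": lambda a, d: False,
}

def evaluate_predicate(depth_buffer, op, d, stencil=None):
    """Evaluate `a_i op d` for every record.

    depth_buffer holds the attribute values a_i (one per pixel).
    stencil holds 1 for records that are still valid, 0 otherwise.
    Returns the updated stencil and the number of records that passed.
    """
    if stencil is None:
        stencil = [1] * len(depth_buffer)
    compare = DEPTH_FUNCS[op]
    passed = 0
    for i, a in enumerate(depth_buffer):
        if stencil[i] and compare(a, d):
            passed += 1      # the value an occlusion query would report
        else:
            stencil[i] = 0   # record fails the predicate
    return stencil, passed

attrs = [10, 24, 37, 99]
stencil, count = evaluate_predicate(attrs, "<", 30)
```

Here the stencil marks the first two records as selected and the count is what a COUNT aggregation over the same predicate would return.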

Predicate Evaluation

- Comparing two attributes:
- ai op aj is treated as (ai – aj) op 0
- Semi-linear queries
- Easy to compute with fragment shader
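A comparison between two attributes is the simplest semi-linear query: a weighted sum of attributes compared against a constant. The following Python sketch shows the idea on the CPU; the function name `semilinear` and the tuple-based records are assumptions made for this example, and on the GPU the weighted sum would be computed per pixel by a fragment program.

```python
import operator

OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
       ">=": operator.ge, "==": operator.eq, "!=": operator.ne}

def semilinear(records, coeffs, op, b=0):
    """Return one boolean per record for: sum(coeffs[j] * record[j]) op b."""
    compare = OPS[op]
    return [compare(sum(c * a for c, a in zip(coeffs, rec)), b)
            for rec in records]

# Each record is (a0, a1); a0 < a1 is rewritten as (a0 - a1) < 0,
# i.e. coefficients (1, -1) compared against the constant 0.
records = [(3, 7), (9, 2), (5, 5)]
a0_lt_a1 = semilinear(records, coeffs=(1, -1), op="<")
```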

Boolean Combinations

- Expression provided as a CNF
- CNF is of the form $(A_1 \text{ AND } A_2 \text{ AND } \dots \text{ AND } A_k)$, where $A_i = (B_{i1} \text{ OR } B_{i2} \text{ OR } \dots \text{ OR } B_{i m_i})$

- The CNF contains no NOT operator
- If an expression has a NOT operator, invert the comparison operation to eliminate it

E.g. NOT (ai < d) => (ai >= d)

- For example, compute ai within [low, high]
- Evaluated as ( ai >= low ) AND ( ai <= high )

Range Query

- Compute ai within [low, high]
- Evaluated as ( ai >= low ) AND ( ai <= high )
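A range query is a CNF with two single-predicate clauses. The Python sketch below evaluates a CNF clause by clause while keeping a per-record validity flag, loosely mirroring the role the stencil buffer plays; it does not reproduce the actual stencil-operation sequence, and the names `evaluate_cnf`, `eliminate_not`, and the dictionary-based records are hypothetical.

```python
import operator

OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
       ">=": operator.ge, "==": operator.eq, "!=": operator.ne}

# Inverse comparison used to remove a NOT in front of a predicate.
INVERSE = {"<": ">=", ">=": "<", ">": "<=", "<=": ">", "==": "!=", "!=": "=="}

def eliminate_not(attr, op, const):
    """Rewrite NOT (attr op const) as an equivalent NOT-free predicate."""
    return (attr, INVERSE[op], const)

def evaluate_cnf(records, cnf):
    """cnf is a list of clauses; each clause is a list of (attr, op, const).

    A record stays valid only if every clause has at least one true predicate.
    """
    valid = [1] * len(records)
    for clause in cnf:
        for i, rec in enumerate(records):
            if valid[i] and not any(OPS[op](rec[attr], c)
                                    for attr, op, c in clause):
                valid[i] = 0
    return valid

records = [{"a": 5}, {"a": 15}, {"a": 25}]
low, high = 10, 20
in_range = evaluate_cnf(records, [[("a", ">=", low)], [("a", "<=", high)]])
```

Summing the returned flags gives the COUNT of records in the range.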

Aggregations

- COUNT, MAX, MIN, SUM, AVG
- No data rearrangements

COUNT

- Use occlusion queries to get pixel pass count
- Syntax:
- Begin occlusion query
- Perform database operation
- End occlusion query
- Get count of number of attributes that passed database operation
- Involves no additional overhead!

MAX, MIN, MEDIAN

- We compute Kth-largest number
- Traditional algorithms require data rearrangements
- We perform no data rearrangements, no frame buffer readbacks

K-th Largest Number

- By comparing and counting, determine each bit of the result in order from MSB to LSB

Example: Parallel Max

- S = {10, 24, 37, 99, 192, 200, 200, 232}
- Step 1: Draw quad at 128 (10000000)
- Values >= 128: {192, 200, 200, 232}, so bit 7 is set
- Step 2: Draw quad at 192 (11000000)
- Values >= 192: {192, 200, 200, 232}, so bit 6 is set
- Step 3: Draw quad at 224 (11100000)
- Values >= 224: {232}, so bit 5 is set
- Step 4: Draw quad at 240 (11110000)
- No values pass, so bit 4 is cleared
- Step 5: Draw quad at 232 (11101000)
- Values >= 232: {232}, so bit 3 is set
- Steps 6, 7, 8: Draw quads at 236, 234, 233. No values pass, so Max is 232
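The bit-by-bit procedure in this example can be written as a short CPU sketch. In the Python below, the counting expression stands in for a depth comparison followed by an occlusion query; the function name `kth_largest` and the 8-bit default are assumptions made for this illustration.

```python
def kth_largest(values, k, bits=8):
    """Find the k-th largest value using only compare-and-count steps.

    Bits are decided from MSB to LSB: a bit is kept if at least k values
    are >= the candidate formed by setting that bit.
    """
    x = 0
    for b in range(bits - 1, -1, -1):
        candidate = x | (1 << b)                        # depth of the quad
        count = sum(1 for v in values if v >= candidate)  # pass count
        if count >= k:
            x = candidate
    return x

S = [10, 24, 37, 99, 192, 200, 200, 232]
maximum = kth_largest(S, 1)        # k = 1 gives MAX
minimum = kth_largest(S, len(S))   # k = n gives MIN
```

With k = 1 the candidates tried are 128, 192, 224, 240, 232, 236, 234, 233, the same quads as in the example. No values are moved or rewritten at any step.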

Accumulator, Mean

- Accumulator - Use sorting algorithm and add all the values
- Mean – Use accumulator and divide by n
- Interval range arithmetic
- Alternative algorithm
- Use fragment programs – requires very few renderings
- Use mipmaps [Harris et al. 02], fragment programs [Coombe et al. 03]

Accumulator

- Data representation is of form

$a_k 2^k + a_{k-1} 2^{k-1} + \dots + a_0$

$\text{Sum} = \text{sum}(a_k)\, 2^k + \text{sum}(a_{k-1})\, 2^{k-1} + \dots + \text{sum}(a_0)$

Current GPUs support no bit-masking operations

The Algorithm

>=0.5 means i-th bit is 1
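One way to read the 0.5 threshold, assuming the test is applied to the fractional part of the value divided by 2^(i+1), is sketched below in Python. That interpretation and the names `bit_is_set` and `accumulate` are assumptions for this illustration; the slide itself does not show the fragment program.

```python
import math

def bit_is_set(value, i):
    """Test bit i of a non-negative integer without bit masking.

    Dividing by 2**(i+1) moves bit i to the first fractional position,
    so the fractional part is >= 0.5 exactly when bit i is 1.
    """
    scaled = value / 2 ** (i + 1)
    return (scaled - math.floor(scaled)) >= 0.5

def accumulate(values, bits=8):
    """Sum = sum over bits i of (number of values with bit i set) * 2**i."""
    total = 0
    for i in range(bits):
        count_i = sum(1 for v in values if bit_is_set(v, i))  # one count per bit
        total += count_i * 2 ** i
    return total

S = [10, 24, 37, 99, 192, 200, 200, 232]
total = accumulate(S)
mean = total / len(S)
```

Each per-bit count corresponds to one counting pass, which is why the accumulator needs one pass per bit of precision.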

Implementation

- Algorithm
- CPU – Intel compiler 7.1 with hyper-threading, multi-threading, SIMD optimizations
- GPU – NVIDIA Cg Compiler
- Hardware
- Dell Precision Workstation with Dual 2.8GHz Xeon Processor
- NVIDIA GeForce FX 5900 Ultra GPU
- 2GB RAM

Benchmarks

- TCP/IP database with 1 million records and four attributes
- Census database with 360K records

Analysis: Issues

- Precision
- Copy time
- Integer arithmetic
- Depth compare masking
- Memory management
- No Branching
- No random writes

Analysis: Performance

- Relative Performance Gain
- High Performance – Predicate evaluation, multi-attribute queries, semi-linear queries, count
- Medium Performance – Kth-largest number
- Low Performance - Accumulator

High Performance

- Parallel pixel processing engines
- Pipelining
- Early Z-cull
- Eliminate branch mispredictions

Medium Performance

- Parallelism
- FX 5900 has clock speed 450MHz, 8 pixel processing engines
- Rendering a single 1000x1000 quad takes 0.278ms
- Rendering 19 such quads takes 5.28ms; the observed time is 6.6ms
- 80% efficiency in parallelism!
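The figures above follow from simple arithmetic, assuming each of the 8 pixel engines processes one pixel per clock cycle (an assumption made here to reproduce the numbers). A short Python check:

```python
clock_hz = 450e6          # FX 5900 clock speed
engines = 8               # pixel processing engines
pixels = 1000 * 1000      # one 1000x1000 quad

# Ideal time for one quad, assuming one pixel per engine per clock.
quad_ms = pixels / (clock_hz * engines) * 1e3

ideal_19_ms = 19 * quad_ms        # ideal time for 19 quads
observed_ms = 6.6                 # observed time from the slide
efficiency = ideal_19_ms / observed_ms

print(f"{quad_ms:.3f} ms per quad, {ideal_19_ms:.2f} ms for 19 quads, "
      f"{efficiency:.0%} efficiency")
```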

Low Performance

- No gain over SIMD based CPU implementation
- Two main reasons:
- Lack of integer-arithmetic
- Clock rate

Advantages

- Algorithms progress at GPU growth rate
- Offload CPU work
- Fast due to massive parallelism on GPUs
- Algorithms could be generalized to any geometric shape
- Eg. Max value within a triangular region
- Commodity hardware!

GPU Point List Generation

- Data compaction

Timing

Reduces a highly sparse matrix with N elements to a list of its M active entries in O(N) + M log N steps.
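The two terms of that cost can be seen in a simplified one-dimensional histogram pyramid: building the pyramid touches O(N) cells, and each of the M outputs is found by a top-down walk of log N levels. The Python sketch below is a 1D illustration only, assuming a power-of-two input length; the cited work operates on 2D textures, and `build_pyramid` and `point_list` are hypothetical names.

```python
def build_pyramid(active):
    """Build levels of pairwise sums; levels[0] is the 0/1 input itself."""
    levels = [list(active)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([prev[i] + prev[i + 1] for i in range(0, len(prev), 2)])
    return levels

def point_list(active):
    """Return the indices of the active (non-zero) entries, in order."""
    levels = build_pyramid(active)
    total = levels[-1][0]              # M, the number of active entries
    result = []
    for k in range(total):             # each output is an independent walk
        idx, key = 0, k
        for level in reversed(levels[:-1]):
            idx *= 2                   # descend to the left child
            left = level[idx]
            if key >= left:            # k-th entry is not in the left child
                key -= left
                idx += 1               # move to the right child
        result.append(idx)
    return result

compacted = point_list([0, 1, 0, 0, 1, 1, 0, 0])
```

Because every output index is resolved independently, the walks can run in parallel, one per output element.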

Applications

- Image Analysis
- Feature Detection
- Volume Analysis
- Sparse Matrix Generation

Searching

- 1D Binary Search
- Nearest Neighbor Search for High dimension space
- K-NN Search

Binary Search

- Find a specific element in an ordered list
- Implement just like CPU algorithm
- Assuming hardware supports long enough shaders
- Finds the first element of a given value v
- If v does not exist, finds the smallest element greater than v
- Search algorithm is sequential, but many searches can be executed in parallel
- Number of pixels drawn determines number of searches executed in parallel
- 1 pixel == 1 search

Binary Search

- Search for v0

Search starts at center of sorted array

v2 >= v0 so search left half of sub-array

Search position: Initialize = 4

Sorted list (indices 0 to 7): v0, v0, v0, v2, v2, v2, v5, v5

Binary Search

- Search for v0

v0 >= v0 so search left half of sub-array

Search position: Initialize = 4, Step 1 = 2

Binary Search

- Search for v0

v0 >= v0 so search left half of sub-array

Search position: Initialize = 4, Step 1 = 2, Step 2 = 1

Binary Search

- Search for v0

At this point, we either have found v0 or are 1 element too far left

One last step to resolve

Search position: Initialize = 4, Step 1 = 2, Step 2 = 1, Step 3 = 0

Binary Search

- Search for v0

Done!

Search position: Initialize = 4, Step 1 = 2, Step 2 = 1, Step 3 = 0, Step 4 = 0

Binary Search

- Search for v0 and v2

Search starts at center of sorted array

Both searches proceed to the left half of the array

Search positions (v0, v2): Initialize = (4, 4)

Binary Search

- Search for v0 and v2

The search for v0 continues as before

The search for v2 overshot, so go back to the right

Search positions (v0, v2): Initialize = (4, 4), Step 1 = (2, 2)

Binary Search

- Search for v0 and v2

We’ve found the proper v2, but are still looking for v0

Both searches continue

Search positions (v0, v2): Initialize = (4, 4), Step 1 = (2, 2), Step 2 = (1, 3)

Binary Search

- Search for v0 and v2

Now, we’ve found the proper v0, but overshot v2

The cleanup step takes care of this

Search positions (v0, v2): Initialize = (4, 4), Step 1 = (2, 2), Step 2 = (1, 3), Step 3 = (0, 2)

Binary Search

- Search for v0 and v2

Done! Both v0 and v2 are located properly

Search positions (v0, v2): Initialize = (4, 4), Step 1 = (2, 2), Step 2 = (1, 3), Step 3 = (0, 2), Step 4 = (0, 3)

Binary Search Summary

- Single rendering pass
- Each pixel drawn performs independent search
- O(log n) steps
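The walkthrough above can be reproduced with a fixed-step search in which every query runs the same number of iterations, which is what lets many searches proceed in lockstep. The Python below is a CPU sketch, assuming a power-of-two list length; the stride schedule (n/4, halved each step, never below 1) is an assumption chosen because it reproduces the positions shown on the slides, and `fixed_step_search` is a hypothetical name.

```python
def fixed_step_search(sorted_list, v):
    """Return (index, trace) for the first element >= v.

    Always runs log2(n) steps plus one cleanup step, with no early exit.
    If every element is smaller than v, the last index is returned.
    """
    n = len(sorted_list)
    pos = n // 2                       # start at the center
    stride = n // 4
    trace = [pos]
    for _ in range(n.bit_length() - 1):    # log2(n) steps for power-of-two n
        step = max(stride, 1)
        if sorted_list[pos] >= v:
            pos -= step                # target is here or to the left
        else:
            pos += step                # overshot, go back to the right
        pos = min(max(pos, 0), n - 1)  # stay inside the array
        trace.append(pos)
        stride //= 2
    # Cleanup: we may be one element too far left.
    if sorted_list[pos] < v and pos < n - 1:
        pos += 1
    trace.append(pos)
    return pos, trace

data = [0, 0, 0, 2, 2, 2, 5, 5]        # v0 = 0, v2 = 2, v5 = 5
index_v0, trace_v0 = fixed_step_search(data, 0)
index_v2, trace_v2 = fixed_step_search(data, 2)
```

The two traces match the Initialize and Step 1 to Step 4 positions shown for the v0 and v2 searches.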

Nearest Neighbor Search

- A fundamental step in similarity search for data mining, retrieval…
- Curse of dimensionality
- When dimensionality is very high, structures like k-d trees do not help
- Use GPU to improve linear scan

Distances

- N-norm distance
- Cosine distance: acos(dot(x, y)) for unit-length vectors x and y

Data Representation

- Use separate textures to store different dimensions.

Distance Computation

- Accumulating distance component of different dimensions
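A CPU sketch of this layout and accumulation, in Python: each dimension is stored as its own array (standing in for a texture), and the distance is built up one dimension at a time, so each pass touches every data point once. The names `accumulate_distances` and `nearest` and the choice of the p-norm are assumptions for this illustration.

```python
def accumulate_distances(dim_arrays, query, p=2):
    """Return the p-norm distance from `query` to every data point.

    dim_arrays[d][i] is coordinate d of point i (one array per dimension).
    """
    n = len(dim_arrays[0])
    acc = [0.0] * n
    for dim_values, q in zip(dim_arrays, query):   # one pass per dimension
        for i, x in enumerate(dim_values):
            acc[i] += abs(x - q) ** p              # add this component
    return [a ** (1.0 / p) for a in acc]

def nearest(dim_arrays, query, p=2):
    """Linear scan: index of the point with the smallest distance."""
    dists = accumulate_distances(dim_arrays, query, p)
    return min(range(len(dists)), key=dists.__getitem__)

# Three 2D points: (0, 0), (3, 4), (1, 1), stored dimension by dimension.
dims = [[0.0, 3.0, 1.0], [0.0, 4.0, 1.0]]
distances = accumulate_distances(dims, (0.0, 0.0))
closest = nearest(dims, (3.0, 3.0))
```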

K-Nearest Neighbor Search

- Given a sample point p, find the k points nearest p within a data set
- On the CPU, this is easily done with a heap or priority queue
- Can add or reject neighbors as search progresses
- Don’t know how to build one efficiently on GPU
- kNN-grid
- Can only add neighbors…

Candidate neighbors must be within max search radius

Visit voxels in order of distance to sample point

Figure legend: sample point, candidate neighbor, neighbors found

kNN-grid Algorithm

Want 4 neighbors

If current number of neighbors found is less than the number requested, grow search radius

kNN-grid Algorithm

Want 4 neighbors (neighbors found: 1)

If current number of neighbors found is less than the number requested, grow search radius

kNN-grid Algorithm

Want 4 neighbors (neighbors found: 2)

Don’t add neighbors outside maximum search radius

Don’t grow search radius when neighbor is outside maximum radius

kNN-grid Algorithm

Want 4 neighbors (neighbors found: 2)

Add neighbors within search radius

kNN-grid Algorithm

Want 4 neighbors (neighbors found: 3)

Add neighbors within search radius

kNN-grid Algorithm

Want 4 neighbors (neighbors found: 4)

Don’t expand search radius if enough neighbors already found

kNN-grid Algorithm

Want 4 neighbors (neighbors found: 4)

Add neighbors within search radius

kNN-grid Algorithm

Want 4 neighbors (neighbors found: 5)

Visit all other voxels accessible within determined search radius

Add neighbors within search radius

kNN-grid Algorithm

Want 4 neighbors (neighbors found: 6)

Finds all neighbors within a sphere centered about sample point

May locate more than requested k-nearest neighbors

kNN-grid Summary

Want 4 neighbors (neighbors found: 6)
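The rules from the kNN-grid slides (visit voxels by distance, grow the radius only while fewer than k neighbors are found, never exceed the maximum radius, then collect everything inside the final radius) can be put together in a small CPU sketch. The Python below works in 2D with a uniform grid; the function name `knn_grid`, the `cell` parameter, and the dictionary-based grid are assumptions for this illustration, not the GPU implementation.

```python
import math
from collections import defaultdict

def knn_grid(points, sample, k, max_radius, cell=1.0):
    """Return (radius, neighbors); neighbors may hold more than k points."""
    grid = defaultdict(list)
    for p in points:                       # bin points into voxels
        grid[(math.floor(p[0] / cell), math.floor(p[1] / cell))].append(p)

    def voxel_distance(c):
        # Distance from the sample to the closest point of voxel c.
        dx = max(c[0] * cell - sample[0], 0.0, sample[0] - (c[0] + 1) * cell)
        dy = max(c[1] * cell - sample[1], 0.0, sample[1] - (c[1] + 1) * cell)
        return math.hypot(dx, dy)

    # Visit voxels in order of distance to the sample point.
    voxels = sorted((c for c in grid if voxel_distance(c) <= max_radius),
                    key=voxel_distance)
    radius, found = 0.0, []
    for c in voxels:
        if len(found) >= k and voxel_distance(c) > radius:
            break                          # no remaining voxel can contribute
        for p in grid[c]:
            d = math.hypot(p[0] - sample[0], p[1] - sample[1])
            if d > max_radius:
                continue                   # never add beyond the maximum radius
            if d <= radius:
                found.append(p)            # inside current radius: always add
            elif len(found) < k:
                radius = d                 # too few neighbors: grow the radius
                found.append(p)
    return radius, found

pts = [(0.5, 0.5), (1.5, 0.5), (0.5, 2.5), (3.5, 3.5),
       (0.2, 0.9), (2.5, 0.5), (9.0, 9.0), (1.2, 1.2)]
radius, found = knn_grid(pts, sample=(0.6, 0.6), k=4, max_radius=3.0)
```

Because neighbors are only ever added, the result is every point inside the final radius, which contains the k nearest neighbors but may include extra points.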

References

- Naga Govindaraju, Brandon Lloyd, Wei Wang, Ming Lin and Dinesh Manocha, Fast Computation of Database Operations using Graphics Processors, http://www.gpgpu.org/s2004/slides/govindaraju.DatabaseOperations.ppt
- Benjamin Bustos, Oliver Deussen, Stefan Hiller, and Daniel Keim, A Graphic Hardware Accelerated Algorithm for Nearest Neighbor Search
- Gernot Ziegler, Art Tevs, Christian Theobalt, Hans-Peter Seidel, GPU Point List Generation through Histogram Pyramids, http://www.mpi-inf.mpg.de/~gziegler/gpu_pointlist/
- Tim Purcell, Sorting and Searching, http://www.gpgpu.org/s2005/slides/purcell.SortingAndSearching.ppt
