A Hardware Processing Unit For Point Sets

S. Heinzle, G. Guennebaud, M. Botsch, M. Gross. Graphics Hardware 2008.

Motivation

- Point-based graphics established
  - Powerful algorithms
  - Representation
  - Processing
  - Manipulation
  - Rendering

- Decomposition
  - Get neighborhood
  - Operate on neighbors

Motivation

- GPUs not suited for getting neighborhood
  - SIMD
  - Incoherent branching
  - Dynamic data structures slow
  - Recursive calls not supported

- CPUs
  - Small number of FPUs
  - Inflexible memory caches

[Images courtesy of NVIDIA and Intel]

Contributions

- Hardware architecture for point sets
  - Neighbor search module
  - Novel advanced caching mechanism
  - Reconfigurable processing module
  - Programmability using FPGA compiler

- FPGA prototype and measurements
  - Small & Lean
  - Integration into multi-core CPU/GPU possible

Outline

- Related Work
- Spatial Searching and Caching
- Architecture and Prototype
- Results
- Conclusion

Related Work

- Kd-Tree [Bentley 75]
- kNN on GPUs [Ma and McCool 02]
- Kd-Tree on GPUs [Popov et al. 07]
- Kd-Tree Hardware [Woop et al. 05], [Woop et al. 06]

Related Work

- Adaptive SPH Fluid Simulation [Adams et al. 07]
- Algebraic Moving Least Squares [Guennebaud and Gross 07]
- Linear Moving Least Squares [Adamson and Alexa 04]

Linear Moving Least Squares

- Implicit surface defined by a set of points
- Surface defined by the points x that project onto themselves
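To illustrate what "projecting onto themselves" means, here is a minimal sketch of a simplified projection in the spirit of [Adamson and Alexa 04]. It assumes oriented input points (one normal per point), Gaussian weights with a hypothetical support parameter `h`, and a hypothetical function name `mls_project`; none of these details are taken from the slides.

```python
import math

def mls_project(x, points, normals, h=1.0, iterations=10):
    """Iteratively move x onto the surface implied by oriented points."""
    x = list(x)
    for _ in range(iterations):
        # Gaussian weight of every input point relative to the current x.
        w = [math.exp(-sum((xc - pc) ** 2 for xc, pc in zip(x, p)) / (h * h))
             for p in points]
        total = sum(w)
        # Weighted average position a(x) and normalized weighted normal n(x).
        a = [sum(wi * p[d] for wi, p in zip(w, points)) / total for d in range(3)]
        n = [sum(wi * m[d] for wi, m in zip(w, normals)) for d in range(3)]
        length = math.sqrt(sum(c * c for c in n))
        n = [c / length for c in n]
        # Signed distance of x to the plane through a(x) with normal n(x).
        f = sum(nc * (xc - ac) for nc, xc, ac in zip(n, x, a))
        # Step along the normal; points with f == 0 stay where they are.
        x = [xc - f * nc for xc, nc in zip(x, n)]
    return x
```

A point is on the surface when this projection leaves it unchanged, that is, when its signed distance `f` is zero. Every iteration needs the neighborhood of the current `x`, which is why neighbor search dominates this kind of operator.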

Spatial Search

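- Spatial search: kNN and eNN
  - kNN: the k nearest neighbors of a query point
  - eNN: all neighbors within radius e of a query point
  - Common in most point operations
  - Based on kd-tree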

- Example eNN:
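An eNN query on a kd-tree can be sketched in software as follows. This is a minimal illustration, not the hardware design: the median split, the leaf bucket size, and the names `build_kdtree`, `dist2` and `enn_search` are all assumptions made for the example.

```python
def build_kdtree(points, depth=0, leaf_size=4):
    """Build a kd-tree with median splits; leaves hold small point buckets."""
    if len(points) <= leaf_size:
        return ("leaf", list(points))
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    # Left subtree has coordinates <= split, right subtree has >= split.
    return ("node", axis, pts[mid][axis],
            build_kdtree(pts[:mid], depth + 1, leaf_size),
            build_kdtree(pts[mid:], depth + 1, leaf_size))

def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def enn_search(tree, q, radius, out=None):
    """Collect all points within `radius` of the query point q."""
    out = [] if out is None else out
    if tree[0] == "leaf":
        out.extend(p for p in tree[1] if dist2(p, q) <= radius * radius)
        return out
    _, axis, split, left, right = tree
    # Descend only into subtrees whose half-space intersects the query ball.
    if q[axis] - radius <= split:
        enn_search(left, q, radius, out)
    if q[axis] + radius >= split:
        enn_search(right, q, radius, out)
    return out
```

The recursion visits both children only when the query ball straddles the splitting plane, so most of the tree is skipped for small radii.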

Spatial Search

- kNN search similar to eNN search:
  - Start with infinite radius
  - Sort leaf points into priority queue
  - Shrink radius with every point sorted
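The three steps above can be sketched in software as follows. This is a minimal illustration under assumptions: the kd-tree layout (median split, leaf buckets) and the names `build_kdtree`, `dist2` and `knn_search` are hypothetical, and a binary heap stands in for the priority queue.

```python
import heapq

def build_kdtree(points, depth=0, leaf_size=4):
    """Build a kd-tree with median splits; leaves hold small point buckets."""
    if len(points) <= leaf_size:
        return ("leaf", list(points))
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return ("node", axis, pts[mid][axis],
            build_kdtree(pts[:mid], depth + 1, leaf_size),
            build_kdtree(pts[mid:], depth + 1, leaf_size))

def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_search(tree, q, k):
    """Return the k nearest points to q as sorted (squared distance, point)."""
    heap = []  # max-heap of the best candidates, stored as (-d2, point)

    def radius2():
        # The search radius is infinite until k candidates have been found.
        return -heap[0][0] if len(heap) == k else float("inf")

    def visit(node):
        if node[0] == "leaf":
            for p in node[1]:
                d2 = dist2(p, q)
                if d2 < radius2():
                    if len(heap) == k:
                        heapq.heapreplace(heap, (-d2, p))  # radius shrinks
                    else:
                        heapq.heappush(heap, (-d2, p))
            return
        _, axis, split, left, right = node
        delta = q[axis] - split
        near, far = (left, right) if delta < 0 else (right, left)
        visit(near)
        # The far side is needed only if the current ball crosses the plane.
        if delta * delta <= radius2():
            visit(far)

    visit(tree)
    return sorted((-neg_d2, p) for neg_d2, p in heap)
```

Visiting the near child first lets the radius shrink early, which prunes more of the far subtrees.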

Coherent Neighbor Cache (eNN)

- Find neighbors in slightly bigger radius
- Re-use result for spatially close query

Re-use if
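One sufficient re-use test follows from the triangle inequality: if the cached search used the radius enlarged by some slack, every point within the original radius of a new query lies in the cached set as long as the new query is at most that slack away from the cached one. The sketch below assumes this condition; the names `enn_from_cache` and `slack` are hypothetical, and the condition is not copied from the slide.

```python
import math

def enn_from_cache(cached_query, cached_neighbors, slack, new_query, radius):
    """Answer an eNN query from a cached, enlarged neighborhood if possible.

    cached_neighbors must hold all points within (radius + slack) of
    cached_query. Returns None when the cached result cannot be re-used.
    """
    if math.dist(cached_query, new_query) > slack:
        return None  # too far away: a fresh kd-tree search is needed
    # The cached set is a superset of the answer; filter it exactly.
    return [p for p in cached_neighbors if math.dist(p, new_query) <= radius]
```

Filtering the cached superset against the new query keeps the result exact, at the cost of a slightly larger search for the cached query.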

Coherent Neighbor Cache (kNN, exact)

- Find (k+1) neighbors
- Re-use result for spatially close query

Re-use if
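A sufficient condition for exact re-use can be derived from the triangle inequality. Let r_k and r_(k+1) be the distances from the cached query to its k-th and (k+1)-th neighbors, and let d be the distance between the cached and the new query. The k cached neighbors are at most r_k + d from the new query, and every other point is at least r_(k+1) - d away, so the k nearest neighbors are unchanged when 2d < r_(k+1) - r_k. The sketch below assumes this derived condition; the function name is hypothetical.

```python
import math

def can_reuse_exact_knn(cached_query, cached_dists, new_query):
    """Check whether cached exact kNN results remain valid for new_query.

    cached_dists: sorted distances from cached_query to its k+1 nearest
    neighbors, so the last two entries are r_k and r_(k+1).
    """
    d = math.dist(cached_query, new_query)
    r_k, r_k1 = cached_dists[-2], cached_dists[-1]
    return 2 * d < r_k1 - r_k
```

The extra (k+1)-th neighbor is what makes the test possible: it measures how much empty space surrounds the cached neighborhood.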

Coherent Neighbor Cache (kNN, approximation)

- Approximation error e
- Enlarge radius

Re-use if
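The following sketch shows one possible re-use test for approximate queries, and it may differ from the condition shown on the slide. It assumes that an approximation error e means the farthest returned neighbor may be at most (1 + e) times as far away as the true k-th neighbor of the new query. With d the distance between the two queries and r_k the cached k-th neighbor distance, the returned neighbors are at most r_k + d away and the true k-th neighbor is at least r_k - d away, so re-use is safe when d * (2 + e) <= e * r_k. The function name is hypothetical.

```python
import math

def can_reuse_approx_knn(cached_query, r_k, new_query, eps):
    """Check whether cached kNN results are a (1 + eps)-approximation.

    r_k: distance from cached_query to its k-th nearest neighbor.
    eps: tolerated relative error on the k-th neighbor distance.
    """
    d = math.dist(cached_query, new_query)
    # Equivalent to (r_k + d) <= (1 + eps) * (r_k - d).
    return d * (2 + eps) <= eps * r_k
```

Under this assumption, even a small tolerance turns the cached result into a valid answer for a whole ball of nearby queries, whereas the exact test depends on the gap after the k-th neighbor.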

Coherent Neighbor Cache

- Eight cached neighborhoods
- Problem: parallel queries in kd-tree module
- Interleave spatially similar queries
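A software analogue of a cache with eight neighborhoods might look like the sketch below. It is an illustration only: it uses eNN queries with an enlarged radius, a brute-force scan as a stand-in for the kd-tree module, and oldest-first replacement, none of which is stated on the slides. The class and attribute names are hypothetical.

```python
import math
from collections import deque

class CoherentNeighborCache:
    """Small cache of enlarged eNN neighborhoods (illustrative sketch)."""

    def __init__(self, points, radius, slack, entries=8):
        self.points, self.radius, self.slack = points, radius, slack
        # Bounded deque: appending to a full cache drops the oldest entry.
        self.entries = deque(maxlen=entries)
        self.hits = 0
        self.misses = 0

    def query(self, q):
        # Hit: some cached query lies within `slack` of q, so its enlarged
        # neighborhood is guaranteed to contain every neighbor of q.
        for center, superset in self.entries:
            if math.dist(center, q) <= self.slack:
                self.hits += 1
                return [p for p in superset if math.dist(p, q) <= self.radius]
        # Miss: search with the enlarged radius (brute force stands in for
        # the kd-tree traversal) and cache the result.
        self.misses += 1
        big = self.radius + self.slack
        superset = [p for p in self.points if math.dist(p, q) <= big]
        self.entries.append((q, superset))
        return [p for p in superset if math.dist(p, q) <= self.radius]
```

Usage: create one cache per point set and radius, then call `query` for each query point. A stream of spatially close queries is served mostly from the cached neighborhoods.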

Kd-Tree Traversal

NodeRecurse

- Kd-tree structure on chip
- 16 threads
- Pipelining and multi-threading

Stacks

- 16 stacks
- Parallel read/write
- Bounded in depth
- 6 bytes per thread per recursion

Leaf

- 16 parallel priority queues (1-cycle ops)
- Queues store pointers and distances
- Bandwidth bottleneck

Processing Module

- Multithreaded quad-port bank of 16 registers
- 128 threads
- Programmability using FPGA-technology

Further Data

- Implemented on two FPGAs
  - 64-bit DDR DRAM
  - Interconnection: no overhead

- Resource usage (registers and LUTs)
  - Virtex 2 Pro 100 (kNN): 26% registers, 38% LUTs
  - Virtex 2 Pro 70 (MLS): 47% registers, 52% LUTs

- Clock frequency: 75 MHz

Applications

- Tested on various applications
- PCI interface of prototype slow

- [Weyrich et al. 04]

- [Adams et al. 07]

Results kNN

[Figure: number of queries vs. number of neighbors, relative to FPGA (x1). Clock frequencies shown: 75 MHz, 2200 MHz, 1200 MHz. Labels: CPU x1.1, x1.4, x1.5; CUDA x1.6, x2.4, x4; CUDA w/o sort x3.1, x4.0; ASIC estimate (500 MHz) x6.6.]

Results kNN

- Small hardware footprint
- FPGA slightly slower
- Realistic clock frequency: prototype faster than CPU/GPU

Results MLS

- FPGA faster than CPU
- kNN bottleneck
  - FPGA
  - GPU

[Figure: number of queries vs. number of neighbors. Clock frequencies shown: 75 MHz, 2200 MHz, 1200 MHz. Labels: FPGA x1, MLS CPU x0.4, MLS CUDA x3.8.]

Coherent Neighbor Cache

[Figure: number of queries vs. level of coherence. Curves: CPU, e=0.1; FPGA, e=0.1; FPGA, exact.]

Results Approximation Error (MLS projection)

[Figure: MLS error vs. e approximation, compared with no approximation.]

[Figure: cache hits vs. e approximation.]

Approximation Error (visual)

- Coherent Neighbor Cache:
  - Not optimal for exact queries
  - Approximate queries
    - Can be tolerated in most cases
    - Greatly increases performance
    - Even for small approximations

Conclusion

- Novel hardware architecture for
  - Nearest-neighbor searches
  - Generic meshless processing operators

- Cache exploiting spatial coherence
- Good performance considering resources
- Possible GPU integration

Future Work

- Programmable data structure
  - Support different data structures
  - Programmability in data structure
  - Construction on-chip

- ‘Real’ programmability in point processing module


