
CUDA ITK



Presentation Transcript


  1. CUDA ITK Won-Ki Jeong SCI Institute University of Utah

  2. NVIDIA G80 • New architecture for computing on the GPU • GPU as massively parallel multithreaded machine • One step further from streaming model • New hardware features • Unified shaders (ALUs) • Flexible memory access (scatter) • Fast user-controllable on-chip memory • Integer, bitwise operations

  3. NVIDIA CUDA • C extension for programming NVIDIA GPUs • No graphics API overhead • Easy to learn • Supported by development tools • Extensions / API • Function types: __global__, __device__, __host__ • Variable types: __shared__, __constant__ • cudaMalloc(), cudaFree(), cudaMemcpy(), … • __syncthreads(), atomicAdd(), … • Program types • Device program (kernel): runs on the GPU • Host program: runs on the CPU and calls device programs

  4. CUDA ITK • ITK powered by CUDA • Many registration / image processing functions are computationally expensive yet parallelizable • Current ITK parallelization is bound by the number of CPU cores • Our approach • Implement several well-known ITK image filters using NVIDIA CUDA • Focus on 3D volume processing • CT / MRI datasets are mostly 3D volumes

  5. CUDA ITK • CUDA code is integrated into ITK • Transparent to ITK users • No need to modify existing code that uses ITK • Checks the environment variable ITK_CUDA • Entry point: GenerateData() or ThreadedGenerateData() • If ITK_CUDA == 0 • Execute original ITK code • If ITK_CUDA == 1 • Execute CUDA code
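The environment-variable dispatch on this slide can be sketched on the host side as follows. This is a minimal illustration, not the actual CUDA ITK source; the function name GenerateData and the returned path labels are placeholders.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical filter entry point: in real ITK this logic would live
// inside a filter's GenerateData() / ThreadedGenerateData().
std::string GenerateData() {
    const char* flag = std::getenv("ITK_CUDA");
    if (flag && std::string(flag) == "1") {
        return "cuda";   // would launch the CUDA kernel path
    }
    return "cpu";        // falls back to the original ITK code
}
```

Because the check happens at the filter's entry point, existing pipelines run unchanged and switch implementations purely through the environment.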

  6. ITK image space filters • Convolution filters • Mean filter • Gaussian filter • Derivative filter • Hessian of Gaussian filter • Statistical filter • Median filter • PDE-based filter • Anisotropic diffusion filter

  7. Speed up using CUDA • Mean filter : ~ 140x • Median filter : ~ 25x • Gaussian filter : ~ 60x • Anisotropic diffusion : ~ 70x

  8. Convolution filters • Separable filter • N-dimensional convolution = N × 1D convolutions • For filter radius r, cost per pixel drops from (2r+1)^N to N(2r+1) • Example • 2D Gaussian = 2 × 1D Gaussians
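The separability idea above can be sketched as a CPU reference in 2D: two 1D passes (along x, then along y) replace one direct 2D convolution. This is an illustrative sketch, not the GPU kernel; borders are clamped and the kernel is assumed odd-length.

```cpp
#include <vector>

// Separable 2D convolution as two 1D passes. For radius r this costs
// 2*(2r+1) multiplies per pixel instead of (2r+1)^2 for direct 2D.
std::vector<float> convolveSeparable(const std::vector<float>& img,
                                     int w, int h,
                                     const std::vector<float>& k1d) {
    int r = (int)k1d.size() / 2;
    auto clamp = [](int v, int lo, int hi) {
        return v < lo ? lo : (v > hi ? hi : v);
    };
    std::vector<float> tmp(img.size()), out(img.size());
    // Pass 1: convolve along x
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            float s = 0.0f;
            for (int i = -r; i <= r; ++i)
                s += k1d[i + r] * img[y * w + clamp(x + i, 0, w - 1)];
            tmp[y * w + x] = s;
        }
    // Pass 2: convolve along y
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            float s = 0.0f;
            for (int i = -r; i <= r; ++i)
                s += k1d[i + r] * tmp[clamp(y + i, 0, h - 1) * w + x];
            out[y * w + x] = s;
        }
    return out;
}
```

A separable Gaussian kernel such as {0.25, 0.5, 0.25} applied this way reproduces the full 2D outer-product kernel, which is exactly the "2D Gaussian = 2 × 1D Gaussians" identity on the slide.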

  9. GPU implementation • Apply 1D convolution along each axis • Minimize overlapping • [Figure: input and output volumes in global memory; the 1D kernel is applied to a tile staged in shared memory]

  10. Minimize overlapping • Usually kernel width is large ( > 20 for Gaussian) • Max block size ~ 8x8x8 • Each pixel has 6 neighbors in 3D • Use long and thin blocks to minimize overlapping • [Figure: many small blocks with overlapping aprons vs. long, thin blocks with no overlap along the filter axis]
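The benefit of long, thin blocks can be seen with a little arithmetic. For a 1D convolution pass of radius r, a block covering b pixels along the filter axis must also load a 2r-pixel apron into shared memory, so the ratio of loaded to useful pixels is (b + 2r) / b. The concrete block shapes below are illustrative assumptions (the slide only states a max block of ~8x8x8 and kernel width > 20, i.e. r ≈ 10):

```cpp
// Redundant-load factor for one 1D convolution pass: useful span b along
// the filter axis plus a 2r apron, divided by the useful span b.
double apronOverhead(int b, int r) {
    return double(b + 2 * r) / double(b);
}
```

With r = 10, an 8x8x8 block (b = 8 along the filter axis) loads 3.5x the useful data, while a hypothetical 64x4x2 block of the same 512-thread volume (b = 64) loads only about 1.31x, so the thin block wastes far less bandwidth.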

  11. Median filter • Viola et al. [VIS 03] • Find the median by bisection of histogram bins • log2(# bins) iterations (e.g., 8-bit pixels : 8 iterations) • [Figure: four bisection steps narrowing the intensity range 0–7 around the median]

  12. Pseudo code (GPU median filter)

Copy current block from global to shared memory
min = 0; max = 255;
pivot = (min + max) / 2.0f;
for (i = 0; i < 8; i++)
{
  count = 0;
  for (j = 0; j < kernelsize; j++)
  {
    if (kernel[j] > pivot) count++;
  }
  if (count <= kernelsize / 2) max = floor(pivot);
  else min = ceil(pivot);
  pivot = (min + max) / 2.0f;
}
return floor(pivot);
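The pseudo code above translates directly into a runnable CPU reference. This is a sketch of the bisection idea for 8-bit values (odd window size assumed), not the GPU kernel itself:

```cpp
#include <vector>
#include <cmath>

// Median of 8-bit values by bisecting the intensity range [0, 255]:
// 8 iterations halve the range until a single bin remains.
int bisectionMedian(const std::vector<unsigned char>& vals) {
    float minv = 0.0f, maxv = 255.0f;
    float pivot = (minv + maxv) / 2.0f;
    for (int i = 0; i < 8; ++i) {
        int count = 0;
        for (unsigned char v : vals)
            if (v > pivot) ++count;           // values strictly above the pivot
        if (count <= (int)vals.size() / 2)
            maxv = std::floor(pivot);         // median is at or below the pivot
        else
            minv = std::ceil(pivot);          // median is above the pivot
        pivot = (minv + maxv) / 2.0f;
    }
    return (int)std::floor(pivot);
}
```

No sorting is needed, and the 8-iteration loop count is fixed regardless of window size, which is what makes the approach GPU-friendly.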

  13. Perona & Malik anisotropic PDE • Nonlinear diffusion: I_t = div( c(|grad I|) grad I ), with fall-off c(x) = exp( -(x/k)^2 ) • Fall-off function c (conductance) controls anisotropy • Less smoothing across high gradients • Contrast parameter k • Numerical solution • Euler explicit integration (iterative method) • Finite differences for derivative computation

  14. Gradient & conductance map • Half x / y / z direction gradients / conductance for each pixel • 2D example • For an n^2 block, 4(n+1)^2 + (n+2)^2 floats of shared memory are required • [Figure: n*n block in global memory; (n+2)*(n+2) padded tile plus four (n+1)*(n+1) maps (grad x, grad y, cond x, cond y) in shared memory]
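As a sanity check on that formula (n = 16 is an illustrative block size, not taken from the slide):

```cpp
// Shared-memory footprint (in floats) for one n*n block of the 2D
// example: the (n+2)^2 padded image tile plus four (n+1)^2
// half-direction maps (grad x, grad y, cond x, cond y).
int sharedFloats(int n) {
    return 4 * (n + 1) * (n + 1) + (n + 2) * (n + 2);
}
```

For n = 16 this is 4·17² + 18² = 1480 floats, about 5.8 KB, which fits comfortably in the 16 KB of shared memory per multiprocessor on G80-class hardware.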

  15. Euler integration • Use pre-computed gradients and conductance • Each gradient / conductance is used twice • Avoid redundant computation by using pre-computed gradient / conductance map
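The precompute-then-reuse scheme on slides 14–15 can be sketched in 1D on the CPU: half-grid gradients and conductances are computed once, and each pair is then used by both pixels sharing that half-edge. This is an illustrative reference (1D, zero-flux borders, exponential Perona-Malik conductance), not the CUDA kernel.

```cpp
#include <vector>
#include <cmath>

// One explicit Euler step of 1D Perona-Malik diffusion.
std::vector<float> diffuseStep(const std::vector<float>& u, float k, float dt) {
    size_t n = u.size();
    std::vector<float> g(n - 1), c(n - 1), out(u);
    for (size_t i = 0; i + 1 < n; ++i) {
        g[i] = u[i + 1] - u[i];                     // half-grid gradient
        c[i] = std::exp(-(g[i] / k) * (g[i] / k));  // conductance (fall-off)
    }
    for (size_t i = 0; i < n; ++i) {
        float flux = 0.0f;
        if (i + 1 < n) flux += c[i] * g[i];          // flux from right half-edge
        if (i > 0)     flux -= c[i - 1] * g[i - 1];  // flux from left half-edge
        out[i] = u[i] + dt * flux;                   // zero-flux at the borders
    }
    return out;
}
```

Each g[i] / c[i] pair appears in the update of both pixel i and pixel i+1, so precomputing the maps halves the gradient and conductance evaluations, exactly the redundancy the slide describes avoiding.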

  16. Experiments • Test environment • CPU : AMD Opteron Dual Core 1.8GHz • GPU : Tesla C870 • Input volume is 128^3

  17. Result • Mean filter • Gaussian filter

  18. Result • Median filter • Anisotropic diffusion

  19. Summary • ITK powered by CUDA • Image space filters using CUDA • Up to 140x speed up • Future work • GPU image class for ITK • Reduce CPU to GPU memory I/O • Pipelining support • Image registration • Numerical library (vnl) • Out-of-GPU-core processing • Seismic volumes (~10s to 100s GB)

  20. Questions?
