Implementation of Voxel Volume Projection Operators Using CUDA


### Implementation of Voxel Volume Projection Operators Using CUDA

Applications to Iterative Cone-Beam CT reconstruction

Three questions to keep in mind during the presentation
• Name three differences between the GPU and the CPU that contribute to the observed performance improvements.
• Scattered accumulation (a[ix] = a[ix] + b) can result in reduced performance and unexpected results in CUDA. Why? Please state two possible problems!
• Suppose that each thread in a CUDA program is responsible for writing one unique output element. Give one possible solution for handling "odd-sized" output data.
What are voxel volume projectors?

[Figure: an X-ray source, the object f, and a detector recording the projection data p]

• Object: f
• Projection data: p (line integrals through f)
• Forward projection: P, mapping f to p
• Backprojection: P^T, mapping p to f

The tomography problem
• Given p, find an image/volume f such that the difference between Pf and p is small in some sense, e.g. such that z(f) = ||Pf - p||^2 attains a minimum.
• Steepest descent approach:
• Update step: f_{k+1} = f_k - 2α P^T (P f_k - p)
• We use this method only to illustrate why P and P^T are needed. In practice, faster and better-regularized methods are used.
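The update step above can be sketched in a few lines of NumPy. This is a toy illustration with a small dense stand-in for P (hypothetical sizes and step length; in practice P is far too large to form explicitly):

```python
import numpy as np

# Hypothetical tiny system: P maps a 16-voxel "volume" to 8 ray sums.
rng = np.random.default_rng(0)
P = rng.standard_normal((8, 16))   # dense stand-in for the forward projector
f_true = rng.random(16)
p = P @ f_true                     # simulated projection data

# Steepest descent on z(f) = ||P f - p||^2
sigma_max = np.linalg.norm(P, 2)   # largest singular value of P
alpha = 0.5 / sigma_max**2         # step length keeping the iteration stable
f = np.zeros(16)
for k in range(2000):
    f = f - 2 * alpha * (P.T @ (P @ f - p))   # f_{k+1} = f_k - 2a P^T (P f_k - p)
```

After enough iterations the residual ||Pf - p|| becomes small; the point is only that every iteration needs one application of P and one of P^T.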
Why implement P and P^T on the GPU?
• Reasonable sizes of f and p are around 2 GB each, implying a nominal size of 2 GB x 2 GB for P and P^T.
• Although these matrices are sparse, the nonzero elements still amount to approximately 2000 GB, i.e., matrix-vector products involving these matrices are computationally demanding.
• The GPU offers several features that help speed up the computations:
• Massive parallelism
• Hardware linear interpolation
• Very high memory bandwidth
• Texture caches "optimized for 2D locality"
The Joseph forward projector [1]
• Illustration in the 2D case:
• Generalization to 3D is straightforward: instead of linear interpolation, bilinear interpolation is used.

[Figure: a ray crossing a pixel grid, picking up interpolated samples x_0, x_1, x_2, x_3 at steps of length L]

p_i = L (x_0 + x_1 + x_2 + x_3)

[1] Joseph, P. M., An improved algorithm for reprojecting rays through pixel images, IEEE Transactions on Medical Imaging, 1982, MI-1, 192-196
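In 2D, the Joseph method steps one pixel column at a time along the ray's dominant direction and interpolates linearly along the other axis. A minimal NumPy sketch (a hypothetical helper, not the presentation's CUDA code; assumes |slope| <= 1 and unit pixel spacing):

```python
import numpy as np

def joseph_forward_2d(img, y0, slope):
    """Line integral through img along y = y0 + slope*x, with x the
    dominant direction (|slope| <= 1). Hypothetical illustration only."""
    ny, nx = img.shape
    acc = 0.0
    for ix in range(nx):                 # one interpolated sample per column
        y = y0 + slope * ix
        iy = int(np.floor(y))
        w = y - iy                       # linear interpolation weight
        if 0 <= iy and iy + 1 < ny:
            acc += (1.0 - w) * img[iy, ix] + w * img[iy + 1, ix]
    L = np.sqrt(1.0 + slope * slope)     # step length along the ray per column
    return acc * L

# Horizontal ray through row 3 of a constant image: integral = nx * 1.0
img = np.ones((8, 8))
val = joseph_forward_2d(img, 3.0, 0.0)   # -> 8.0
```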

Implementation sketch for P
• We choose to let one thread correspond to one ray.
• Determine the ray origin and direction.
• Determine the intersection of the ray with the voxel volume.
• Step along the ray and accumulate values from the voxel volume, using the 3D texture cache.
• Multiply by L and store to the output element.
Handling of "odd" detector sizes
• The x-ray detector is divided into 2D blocks corresponding to CUDA thread blocks, using a static block size (16x2).
• To handle detector dimensions that are not divisible by the block size, conditional statements are used:
• Although this reduces efficiency due to divergent execution, the reduction is small for detectors with a reasonably large number of channels and rows.

```cuda
// Return if outside detector
if (rowIx >= Nrows) return;
if (chIx >= Nch) return;

// Calculate focus center and displacement vectors
float alpha = theta + D_ALPHA(chIx);
float3 fc = make_float3(Rf*cos(alpha), Rf*sin(alpha), z0 + D_Z(chIx));
float3 fw = make_float3(-sin(alpha), cos(alpha), 0.f);
float3 fh = make_float3(-cos(alpha)*sin(aAngle),
                        -sin(alpha)*sin(aAngle), cos(aAngle));

// Calculate detector center and displacement vectors
float3 dc = fc + (Rf+Rd) * make_float3(-cos(theta), -sin(theta), D_SLOPE(rowIx));
float3 dw = make_float3(sin(theta), -cos(theta), 0.f);
float3 dh = make_float3(0.f, 0.f, 1.f);

// Calculate focus position in texture index coordinates
float3 f = ((fc + fOffset.x*fw + fOffset.y*fh) - origin) / spacing;

// Calculate detector position in texture index coordinates
float3 d = ((dc + dOffset.x*dw + dOffset.y*dh) - origin) / spacing;

// Create ray struct
Ray ray;
ray.o = f;
ray.d = normalize(d - f);
```

Code based on the NVIDIA CUDA SDK "volumeRender" sample.
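On the host side, the "odd size" handling corresponds to rounding the grid dimensions up and letting out-of-range threads return early, as in the guards at the top of the kernel. A sketch in Python (the helper name is hypothetical):

```python
import math

def launch_config(n_ch, n_rows, block=(16, 2)):
    """Hypothetical host-side helper: grid dimensions covering an
    n_ch x n_rows detector with a static block size, rounded up so
    that "odd" detector sizes are fully covered; threads falling
    outside the detector must then return early inside the kernel."""
    return (math.ceil(n_ch / block[0]), math.ceil(n_rows / block[1]))

# 672 channels and 24 rows divide evenly: no masked threads.
grid_even = launch_config(672, 24)   # -> (42, 12)
# An odd size such as 673 x 25 rounds up, leaving some threads masked.
grid_odd = launch_config(673, 25)    # -> (43, 13)
```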

Implementation details 2: Accumulation of contributions

```cuda
// Accumulate contributions along ray
float dValue = 0;
float t = tnear;
for (int i = 0; i < dimensions.x; i++)
{
    // Calculate current position
    float3 pos = ray.o + ray.d*t + 0.5f;   // texture origin: (0.5f, 0.5f, 0.5f)
    dValue += tex3D(devT_reconVolume, pos.x, pos.y, pos.z);
    // Increase t
    t += tstep;
    if (t >= tfar) break;
}

// Update detector data value
dValue *= length(ray.d*spacing)*tstep;
projectionData[rowIx*Nch + chIx] += dValue;
```

Code based on the NVIDIA CUDA SDK "volumeRender" sample.

Experiments – forward projection (P)
• Input data dimensions: 512x512x257 floats (the volume f)
• Output data dimensions: 672x24x3000 floats (the projection data p)
• Only very small differences in accuracy, with no practical implications, occur between the CPU and GPU implementations.
• Calculation times:
• CPU: 2500 s
• GPU: 47 s
• Speed-up factor: approximately 50x
• For a larger collection of problems, speed-ups between 20x and 50x have been observed.
Efficient implementation of the exact adjoint operator P^T is much trickier. Why?
• 2D illustration of the procedure:
• Same interpolation coefficients as for P, but with scattered accumulation instead of reading.
• No hardware interpolation, and no 2D/3D textures.
• A new parallelization setup is needed: one ray per thread leads to corrupted results.

[Figure: a ray scattering its value p_i into the voxels nearest each plane intersection]

• Let one thread represent only one accumulation from a ray, i.e.,
• calculation of the position in the voxel volume,
• calculation of and multiplication by the interpolation coefficients,
• accumulation to the 4 voxels closest to the intersection of the active ray with the voxel volume plane.
• Very short threads, and very many of them.
• One thread block represents a number of rays, separated so that conflicting updates do not occur:

[Figure: rays grouped into Block 0, Block 1, Block 2, ..., with separation between blocks]
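The exact-adjoint idea can be illustrated in 2D with NumPy: scatter each ray value into the image using the same interpolation weights as the forward projector, so that <Pf, p> = <f, P^T p> holds by construction. These helpers are hypothetical illustrations (a single ray, linear interpolation, serial execution; the block-separation parallelization is omitted):

```python
import numpy as np

def forward_ray(img, y0, slope):
    # Joseph-style 2D forward: linear interpolation in y, one step per column.
    ny, nx = img.shape
    acc = 0.0
    for ix in range(nx):
        y = y0 + slope * ix
        iy = int(np.floor(y))
        w = y - iy
        if 0 <= iy and iy + 1 < ny:
            acc += (1.0 - w) * img[iy, ix] + w * img[iy + 1, ix]
    return acc * np.sqrt(1.0 + slope * slope)

def back_ray(shape, y0, slope, p_val):
    # Exact adjoint: scatter p_val into the image with the SAME weights.
    ny, nx = shape
    out = np.zeros(shape)
    L = np.sqrt(1.0 + slope * slope)
    for ix in range(nx):
        y = y0 + slope * ix
        iy = int(np.floor(y))
        w = y - iy
        if 0 <= iy and iy + 1 < ny:
            out[iy, ix] += (1.0 - w) * p_val * L       # scattered accumulation
            out[iy + 1, ix] += w * p_val * L
    return out

# Adjointness check for a single ray: <P f, p> == <f, P^T p>
rng = np.random.default_rng(1)
f = rng.random((8, 8))
p = 0.7
lhs = forward_ray(f, 2.3, 0.25) * p
rhs = np.sum(f * back_ray(f.shape, 2.3, 0.25, p))
```

With one ray per thread, two rays crossing the same voxel would race on the `out[...] += ...` updates; this is exactly why the rays within a thread block must be kept separated.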

Experiments – backprojection (P^T)
• Input data dimensions: 672x24x3000 floats (the projection data p)
• Output data dimensions: 512x512x257 floats (the volume f)
• Calculation times:
• CPU: 2700 s
• GPU: 700 s
• Speed-up factor: approximately 4x
• For a larger collection of problems, speed-ups between 4x and 8x have been observed.
Approximate implementation of the adjoint operator P^T
• Bilinear interpolation on the detector.
• 2D illustration:
• Parallelization is now accomplished by letting one pixel (one voxel in 3D) correspond to one thread.
• This method is approximate, since its interpolation coefficients generally differ from those of the exact P^T.
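A voxel-driven sketch of this idea in 2D with NumPy: every pixel only reads from the detector, so no scattered writes and no update conflicts occur. Both the geometry mapping and the helper name are hypothetical stand-ins for a single view:

```python
import numpy as np

def approx_backproject_2d(proj, det_coord):
    """Voxel-driven (gather) backprojection sketch: for each pixel,
    look up a fractional detector index and linearly interpolate the
    projection values there. det_coord[i, j] holds the detector
    coordinate for pixel (i, j); hypothetical illustration only."""
    ny, nx = det_coord.shape
    out = np.zeros((ny, nx))
    for i in range(ny):
        for j in range(nx):
            u = det_coord[i, j]
            iu = int(np.floor(u))
            w = u - iu
            if 0 <= iu and iu + 1 < proj.size:
                # One thread per pixel would do exactly this read:
                out[i, j] = (1.0 - w) * proj[iu] + w * proj[iu + 1]
    return out

# Toy parallel-beam geometry: pixel column j maps straight to detector index j.
proj = np.arange(8, dtype=float)
det = np.tile(np.arange(8, dtype=float), (8, 1))
img = approx_backproject_2d(proj, det)
```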
Performance comparison – approximate versus exact P^T operator
• Calculation times:
• CPU P^T: 2700 s
• GPU P^T (exact): 700 s
• GPU P^T (approximate): 60 s
• Axial images of head phantom reconstructions (25 HU – 75 HU):

[Figure: axial reconstruction images, exact versus approximate P^T]

Conclusions and future research
• For operations such as P and P^T, speedups of 4x to 50x can be obtained.
• The amount of speedup depends strongly on:
• the possibility to read memory efficiently (coalesced, or via texture caches / constant memory);
• the possibility to use hardware interpolation;
• the complexity of the kernel: using too many registers slows down the program, and lookup tables in constant memory can help reduce the register count.
• Although scattered writes are supported by CUDA, for these problems they did not seem worthwhile from a performance point of view. Note that using them prohibits the use of cudaArray textures.
• Remark: even if the approximate P^T operator gives a poor result in the example shown here, there are other situations where the exact operator is superior. It is therefore of interest to find better approximations.