Parallel Beam Back Projection: Implementation

Srdjan Coric, Miriam Leeser, Eric Miller

Outline
• Annapolis WildStar
• “Simple Architecture”
  • algorithm
  • datapath
• Performance
• Results
• Parallelism extraction
• “Advanced Architecture 4x”
  • datapath
• Performance
• Results
• Implementation issues
• Future directions

Data Flow
• Sinogram data address generation
• Sinogram data retrieval
• Sinogram data prefetch
• Linear interpolation
• Data accumulation
• Data read
• Data write
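The stages listed on this slide can be sketched in software as a plain back-projection loop. This is a minimal floating-point illustration, not the hardware datapath: the function name `back_project`, the evenly spaced angles over 180 degrees, and the scaling of the image onto the detector are all assumptions made for the example, and the prefetch stage is not modelled.

```python
import math

def back_project(sinogram, size):
    """Accumulate every projection of `sinogram` into a size x size image.

    sinogram[p][s] is sample s of projection p. Angles are assumed to be
    evenly spaced over 180 degrees (an assumption for this sketch).
    """
    n_proj = len(sinogram)
    n_samp = len(sinogram[0])
    image = [[0.0] * size for _ in range(size)]
    centre = (size - 1) / 2.0
    # Assumed scaling: the image diagonal fits inside the detector width.
    scale = (n_samp - 1) / (size * math.sqrt(2.0))
    for p in range(n_proj):
        theta = math.pi * p / n_proj
        c, s = math.cos(theta), math.sin(theta)
        row = sinogram[p]
        for y in range(size):
            for x in range(size):
                # Sinogram data address generation.
                t = ((x - centre) * c + (y - centre) * s) * scale
                t += (n_samp - 1) / 2.0
                i = min(max(int(math.floor(t)), 0), n_samp - 2)
                frac = min(max(t - i, 0.0), 1.0)
                # Sinogram data retrieval: two neighbouring samples.
                a, b = row[i], row[i + 1]
                # Linear interpolation between the two samples.
                value = a + frac * (b - a)
                # Data accumulation: read the pixel, add, write it back.
                image[y][x] += value
    return image
```

With a constant sinogram every pixel ends up equal to the number of projections, which is a quick sanity check on the accumulation step.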

[Figure: critical error-accumulation path, with fixed-point format diagrams for LUT1, LUT2 and LUT3. Labels: corner starting position, LUT1 starting position, LUT1 quantization error, LUT2 quantization error, LUT3 quantization error, bit reduction error, interpolation factor error.]
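To illustrate why quantization error on an accumulation path matters, the sketch below quantizes a per-pixel address increment to a fixed-point grid and adds it repeatedly. Everything here is hypothetical: the slide does not give the contents of the lookup tables, so the increment `cos(theta)`, the helper names, and the choice of 10 fractional bits in the usage line are illustrative assumptions only.

```python
import math

def quantize(value, frac_bits):
    """Round `value` to a fixed-point grid with `frac_bits` fractional bits."""
    scale = 1 << frac_bits
    return round(value * scale) / scale

def accumulated_error(theta, steps, frac_bits):
    """Worst address error after repeatedly adding a quantized increment.

    The exact increment is assumed to be cos(theta); the quantized copy
    stands in for a value read from a lookup table.
    """
    exact_step = math.cos(theta)
    q_step = quantize(exact_step, frac_bits)
    position = 0.0
    worst = 0.0
    for k in range(1, steps + 1):
        position += q_step  # incremental address update
        worst = max(worst, abs(position - k * exact_step))
    return worst

# Illustrative values only: 512 steps (one image row), 10 fractional bits.
print(accumulated_error(0.3, 512, 10))
```

Because each addition carries the same rounding error, the worst case grows linearly with the number of steps and is bounded by `steps * 2**-(frac_bits + 1)`.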

“Simple Architecture” Datapath

Performance Results: Software vs. FPGA Hardware
• Software - Floating point - 450 MHz Pentium: ~240 s
• Software - Floating point - 1 GHz Dual Pentium: ~94 s
• Software - Fixed point - 450 MHz Pentium: ~50 s
• Software - Fixed point - 1 GHz Dual Pentium: ~28 s
• Hardware - 50 MHz: ~5.4 s
Parameters:
• 1024 projections
• 1024 samples per projection
• 512 × 512 pixel image
• 9-bit sinogram data
• 3-bit interpolation factor
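The ~5.4 s hardware figure is consistent with a simple cycle count. The sketch below assumes one pixel update per clock cycle and ignores pipeline fill and host transfers; that throughput is an assumption made here, not a statement from the slide.

```python
PROJECTIONS = 1024
IMAGE_SIDE = 512
CLOCK_HZ = 50e6

# One accumulation per pixel per projection.
updates = PROJECTIONS * IMAGE_SIDE * IMAGE_SIDE
# Assumed throughput: one update per clock cycle.
seconds = updates / CLOCK_HZ
print(f"{updates} updates -> {seconds:.2f} s")
```

Under that assumption the estimate lands within a few hundredths of a second of the measured value.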

Original image vs. hardware output image. Zoom: ~200%. Grayscale range < pixel value range (heart features in focus).

Original image vs. hardware output image. Zoom: ~200%. Grayscale range < pixel value range (lung features in focus).

Difference image: original image - hardware output image

Parallelism Issues
• Case 1: No parallelism extracted
• Case 2: Pixel-level parallelism extracted
• Case 3: Projection-level parallelism extracted
Memory bandwidth requirements at 50 MHz (for data accumulation):
• Case 1: 0.4 GB/s
• Case 2: 1.6 GB/s
• Case 3: 0.4 GB/s
• Memory bandwidth limit: 1.2 GB/s
[Figure: processing volumes V1, V2, V3 drawn over the axes projections, image columns and image rows.]
T ~ k1*V1, T ~ k1*V2, T ~ k2*V3, where k1 < k2, V2 = V3 = V1/4, and T = execution time.
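The three bandwidth figures can be reproduced with a small model. It assumes each pixel update costs one 32-bit read and one 32-bit write of the accumulator memory; the word size is an assumption chosen because it matches the 0.4 GB/s figure, and the 4-way factor is taken from the "Advanced Architecture 4x" name. For Case 3 the model further assumes the four projection contributions are summed on chip before a single read-modify-write.

```python
CLOCK_HZ = 50e6
BYTES_PER_WORD = 4   # assumed 32-bit accumulator word
LIMIT_GBPS = 1.2     # memory bandwidth limit from the slide

def accumulation_bandwidth_gbps(pixels_per_cycle):
    """Bandwidth for one read plus one write per updated pixel per cycle."""
    return pixels_per_cycle * 2 * BYTES_PER_WORD * CLOCK_HZ / 1e9

cases = {
    "Case 1 (no parallelism)": 1,
    "Case 2 (4 pixels per cycle)": 4,
    # Four projections feed the same pixel, so memory still sees one update.
    "Case 3 (4 projections per cycle)": 1,
}
for name, pixels in cases.items():
    bw = accumulation_bandwidth_gbps(pixels)
    verdict = "fits" if bw <= LIMIT_GBPS else "exceeds limit"
    print(f"{name}: {bw:.1f} GB/s ({verdict})")
```

In this model only pixel-level parallelism exceeds the 1.2 GB/s limit, which matches the choice of projection-level parallelism for the advanced design.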

Simple Architecture vs. Advanced Architecture - Data Path (projection parallelism extracted)

Performance Results: Software vs. FPGA Hardware
• Software - Floating point - 450 MHz Pentium: ~240 s
• Software - Floating point - 1 GHz Dual Pentium: ~94 s
• Software - Fixed point - 450 MHz Pentium: ~50 s
• Software - Fixed point - 1 GHz Dual Pentium: ~28 s
• Hardware - 50 MHz: ~5.4 s
• Hardware (Advanced Architecture) - 50 MHz: ~1.3 s
Parameters:
• 1024 projections
• 1024 samples per projection
• 512 × 512 pixel image
• 9-bit sinogram data
• 3-bit interpolation factor
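The speedups implied by these timings can be worked out directly. The snippet below only divides the approximate run times listed on the slide; the dictionary keys are shorthand labels introduced for the example.

```python
software_s = {
    "float, 450 MHz Pentium": 240.0,
    "float, 1 GHz Dual Pentium": 94.0,
    "fixed, 450 MHz Pentium": 50.0,
    "fixed, 1 GHz Dual Pentium": 28.0,
}
hardware_s = {
    "Simple Architecture": 5.4,
    "Advanced Architecture": 1.3,
}

def speedup(baseline_s, accelerated_s):
    """Ratio of baseline run time to accelerated run time."""
    return baseline_s / accelerated_s

for hw_name, hw_time in hardware_s.items():
    for sw_name, sw_time in software_s.items():
        print(f"{hw_name} vs {sw_name}: {speedup(sw_time, hw_time):.1f}x")

# Gain of the advanced over the simple hardware design.
arch_gain = speedup(hardware_s["Simple Architecture"],
                    hardware_s["Advanced Architecture"])
print(f"Advanced vs Simple: {arch_gain:.1f}x")
```

The hardware-to-hardware ratio comes out slightly above 4, in line with a 4-way parallel design, though the inputs are rounded so the exact value should not be over-read.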

Implementation Issues - fanout: prj_num(3), fanout = 1565! Routing delay = 7.913 ns (~39.99%)

Implementation Issues - fanout: odd_2_A_4[4], fanout = 144!

Memory Bridges
3 architectures implemented:
• A: “Simple Architecture” = non-parallel (on slide 6)
• B: “Advanced Architecture” = 4-way parallel (slide 12)
• C: “Bridge Free Advanced Arch” = as B, but contains no memory bridges (all design buffers in BlockRAMs)
Memory bridges run from the PCI bus to the memory banks and are required for Host-Memory communication. For design C, the bridges are a separate design that is downloaded before (after) design C is downloaded, so that input data can be stored to (output data read from) the memories on the WildStar board.
Virtex1000 resource utilization:
• 11% logic, 90% BlockRAMs (with bridges)
• 39% logic, 100% BlockRAMs
• 21% logic, 100% BlockRAMs

Floorplan of the “Bridge Free Advanced Architecture” (design C on the previous slide)

Future Directions • Graduate
