
Power-Efficient Medical Image Processing using PUMA

Ganesh Dasika, Kevin Fan¹, Scott Mahlke. University of Michigan, Advanced Computer Architecture Laboratory. ¹Parakinetics, Inc.



Presentation Transcript


  1. Power-Efficient Medical Image Processing using PUMA • Ganesh Dasika, Kevin Fan¹, Scott Mahlke • University of Michigan, Advanced Computer Architecture Laboratory • ¹Parakinetics, Inc.

  2. The Advent of the GPGPU • Increasingly popular substrate for HPC • Astrophysics • Weather Prediction • EDA • Financial instrument pricing • Medical Imaging

  3. Advantages of GPGPUs • High degree of parallelism • Data-level • Thread-level • High bandwidth • Commodity products • Increasingly programmable

  4. Disadvantages of GPGPUs • Gap between computation and bandwidth • 933 GFLOPS : 142 GB/s bandwidth (0.15 B of data per FLOP, ~26:1 compute:mem ratio) • Very high power consumption • Graphics-specific hardware • Multiple thread contexts • Large register files and memories • Fully general datapath • Inefficiencies present in all general-purpose architectures
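The two ratios quoted on this slide can be reproduced with a short calculation. The sketch below assumes that the ~26:1 figure counts 32-bit (4-byte) operands; the slide does not state this, but it is the operand width that reproduces the quoted number. The function names are illustrative.

```python
# Figures quoted on the slide.
PEAK_GFLOPS = 933.0     # peak compute throughput, GFLOP/s
MEM_BW_GBPS = 142.0     # off-chip memory bandwidth, GB/s
BYTES_PER_OPERAND = 4   # assumption: 32-bit floating-point operands


def bytes_per_flop(gflops, gbps):
    """Bytes of memory traffic available per floating-point operation."""
    return gbps / gflops


def compute_to_mem_ratio(gflops, gbps, bytes_per_operand):
    """FLOPs that can be issued per operand fetched from memory."""
    operands_per_sec = gbps / bytes_per_operand  # in G operands/s
    return gflops / operands_per_sec


print(round(bytes_per_flop(PEAK_GFLOPS, MEM_BW_GBPS), 2))  # 0.15
print(round(compute_to_mem_ratio(PEAK_GFLOPS, MEM_BW_GBPS, BYTES_PER_OPERAND), 1))  # 26.3
```

Under this reading, a kernel needs roughly 26 floating-point operations per operand loaded to keep the GPU compute-bound, which is well above the ~5:1 ratio the talk later quotes for medical imaging kernels.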

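  5. Programmability vs Efficiency? [Figure: design space plotted as flexibility against efficiency. Labeled points: general-purpose processors; DSPs; FPGAs; domain-specific accelerators and GPGPUs; loop accelerators and ASICs. An open region marked "???" is annotated "Highly efficient, some programmability".]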

  6. Medical Image Reconstruction • Compute intensive loops • 32-bit floating point code • High data/bandwidth requirements • Increased demand for portability, low power • Much current research focuses on using GPGPUs for this domain

  7. CT Image Reconstruction • X-ray emitters and receptors on opposite sides of the patient • Received X-ray intensity corresponds to tissue density • Multiple scans ("slices") taken around the patient are combined to reconstruct one 2D image

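  8. Projection & Sinogram • Projection: all ray-sums in one direction • Sinogram: all projections [Figure: X-rays pass through the object f(x,y) at angle θ, giving the projection P_θ(t) along detector coordinate t; the sinogram stacks the projections for all angles.]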

  9. Example: Backprojection [Figure: a sinogram and the resulting backprojected image.]
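As a rough illustration of what backprojection computes, the sketch below smears each projection back across the image along the direction it was measured. It is a minimal, unfiltered, nearest-neighbour version for a parallel-beam geometry; the function name, the geometry, and the interpolation choice are assumptions made for illustration, not the implementation used in the talk.

```python
import math


def backproject(sinogram, angles, size):
    """Unfiltered parallel-beam backprojection onto a size x size image.

    sinogram[i][k] is the ray-sum measured at detector bin k for angles[i]
    (radians). Each pixel accumulates the ray-sum of the bin its centre
    projects onto, using nearest-neighbour lookup.
    """
    n_det = len(sinogram[0])
    img_centre = (size - 1) / 2.0
    det_centre = (n_det - 1) / 2.0
    image = [[0.0] * size for _ in range(size)]
    for projection, theta in zip(sinogram, angles):
        c, s = math.cos(theta), math.sin(theta)
        for row in range(size):
            y = row - img_centre
            for col in range(size):
                x = col - img_centre
                t = x * c + y * s                # detector coordinate of this pixel
                k = int(round(t + det_centre))   # nearest detector bin
                if 0 <= k < n_det:
                    image[row][col] += projection[k]
    return image


# Two views of a single bright point at the image centre.
point_view = [0.0, 0.0, 1.0, 0.0, 0.0]
img = backproject([point_view, point_view], [0.0, math.pi / 2], size=5)
```

With only two views the result is a cross centred on the point, which is the streaking that the filtering step on the next slide is meant to suppress. The triple loop nest is also the kind of compute-intensive loop the talk targets.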

  10. Example: Filtered Backprojection [Figure: a filtered sinogram and the resulting reconstructed image.]

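  11. Reconstruction: Solve for m's [Figure: an X-ray emitter, a small grid of unknown densities representing the "human body", and the detector values measured on the far side. Values as extracted from the figure: 22, 12, 10, 15, 16, 22, 11, 10.]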
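The "solve for m's" idea can be sketched as a small linear system: each ray contributes one equation whose right-hand side is the measured detector value and whose unknowns are the densities the ray crosses. The example below uses a 2 x 2 grid with hypothetical densities and ray layout (not the values from the slide) and solves it with Kaczmarz iteration, a simple algebraic reconstruction method chosen here for illustration; the talk does not say which solver it has in mind.

```python
def kaczmarz(rays, measurements, n_pixels, sweeps=200):
    """Solve rays @ m = measurements by cyclic Kaczmarz projection.

    rays[i][j] is the weight of pixel j in ray i (1 if the ray crosses it).
    Each step projects the current estimate onto one ray's equation.
    """
    m = [0.0] * n_pixels
    for _ in range(sweeps):
        for ray, measured in zip(rays, measurements):
            norm = sum(w * w for w in ray)
            predicted = sum(w * v for w, v in zip(ray, m))
            step = (measured - predicted) / norm
            m = [v + step * w for v, w in zip(m, ray)]
    return m


# Hypothetical 2 x 2 body, pixels ordered [m1, m2, m3, m4] row by row.
true_densities = [1.0, 2.0, 3.0, 4.0]
rays = [
    [1, 1, 0, 0],  # horizontal ray through the top row
    [0, 0, 1, 1],  # horizontal ray through the bottom row
    [1, 0, 1, 0],  # vertical ray through the left column
    [0, 1, 0, 1],  # vertical ray through the right column
    [1, 0, 0, 1],  # diagonal ray, needed to make the solution unique
]
# Simulated detector values: the ray-sums of the true densities.
detector = [sum(w * d for w, d in zip(ray, true_densities)) for ray in rays]

estimate = kaczmarz(rays, detector, n_pixels=4)
```

Row and column sums alone leave the 2 x 2 system underdetermined, which is why the diagonal ray is included; the same reasoning motivates measuring at many angles in a real scanner.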

  12. Real Reconstruction Problem • Intensity measured • Rays transmitted through multiple "pixels" • Find individual "pixel" values from transmission data • Hundreds of diagonals at hundreds of angles [Figure: a 512 × 512 grid of unknown pixel values, each shown as "?", with sample measured ray values 712, 199, 255, 534, 417, 364, 555, 501, 355.]

  13. Medical Imaging Applications • Image reconstruction for MRI/CT/PET scans • Large amounts of Vector/Thread-level parallelism • FP-intensive kernels • Often requiring math library functions • Data-intensive (~5:1 compute:mem ratio)

  14. Current Concerns: Portability/Power • Currently, most scans require moving the patient to an imaging room • Consumes time • Stress on patient • Studies show benefits of portable, bedside scanners: • 86% increase in patients suitable for post-stroke thrombolytic therapy [Weinreb et al., RSNA] • 80-100% drop in scan-related complications [Gunnarsson et al., J. of Neurosurgery] • New X-Ray emitters push for mAs of current use

  15. Current Concerns: Performance • High-accuracy CT algorithms take too long • Iterative forward/backward projection • ~Hours on modern CT scanners instead of minutes • Interventional radiology • Scans currently take minutes, but should take seconds • CT fluoroscopy • Several scans done in succession

  16. Flexibility • Software algorithms change over time • Non-recurring engineering (NRE) cost • Time-to-market

  17. PUMA • Tiled architecture • Bandwidth-matched for improved efficiency • Each tile is a "Programmable Loop Accelerator" [Figure: an array of tiles connected through an external interface to the CPU, memory, and disk.]

  18. Programmable Loop Accelerator • Generalize the accelerator without losing efficiency [Figure: the flexibility against efficiency/performance design space from slide 5 (general-purpose processors; DSPs; FPGAs; domain-specific accelerators and GPGPUs; loop accelerators and ASICs), with programmable loop accelerators placed in the region previously marked "???".]

  19. Designing Loop Accelerators [Figure: a loop written in C code is mapped to a hardware loop accelerator built from function units (<<, *, +, &, MEM, BR), a CRF, and local memories joined by point-to-point connections.]

  20. Loop Accelerator Architecture • Hardware realization of a modulo-scheduled loop • Parameterized hardware: • FUs • Shift Register Files • Static Control • Point-to-point Interconnect [Figure: function units (+, &, MEM, BR), a CRF, and local memory joined by point-to-point connections, with an FSM issuing the control signals.]
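In modulo scheduling, a new loop iteration starts every II (initiation interval) cycles, and the function-unit mix bounds how small II can be. The sketch below computes the standard resource-constrained lower bound on II (often called ResMII). The operation and FU counts are hypothetical, each operation is assumed to occupy its FU for one cycle, and recurrence constraints, which can raise II further, are ignored.

```python
import math


def res_mii(op_counts, fu_counts):
    """Resource-constrained minimum initiation interval.

    op_counts[kind] is the number of operations of that kind in the loop
    body; fu_counts[kind] is the number of function units that can run them.
    """
    return max(math.ceil(n / fu_counts[kind]) for kind, n in op_counts.items())


# Hypothetical loop body: 6 adds, 4 multiplies, 3 memory operations.
loop_ops = {"add": 6, "mul": 4, "mem": 3}

# A datapath with one FU per operation sustains one iteration per cycle.
matched = res_mii(loop_ops, {"add": 6, "mul": 4, "mem": 3})   # 1

# A datapath with fewer FUs than operations needs a larger II.
shared = res_mii(loop_ops, {"add": 2, "mul": 2, "mem": 1})    # 3
```

This is the trade-off behind the parameterized FU set: a datapath sized to the loop reaches the minimum II, while a smaller or different FU mix may still schedule the loop, only at a larger II.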

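  21. Programmable Loop-Accelerator Architecture • Loop accelerator (LA) compared with programmable loop accelerator (PLA): • Functionality: custom FU set (LA); generalized FUs + MOVs (PLA) • Storage: limited size, no addressing (LA); rotating register files (PLA) • Connectivity: point-to-point (LA); ring + port-swapping (PLA) • Control: hardwired control (LA); literal register file + control memory (PLA) [Figure: the PLA datapath, with generalized FUs (+/-, &/|, MEM, BR), rotating register files (RR) alongside shift register files (SRF), a CRF, literals, local memory, a ring added to the point-to-point connections, and a control memory added to the FSM that issues the control signals.]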

  22. MRI.FH PLA • ~0.6 mm² per tile • 38 FUs • 128 32-bit registers • Inter-FU bandwidth of 1 TB/s

  23. Performance on MRI.FH PLA [Figure: performance chart; legend categories: unschedulable, II preserved, II doubled.]

  24. Efficiency on MRI.FH PLA

  25. PUMA System Design • 5 systems designed around 5 benchmarks • Each composed of identical tiles • Assume same bandwidth as the GTX 280 (142 GB/s) • Number of tiles based on the bandwidth requirements of the benchmark [Figure: an array of tiles connected through an external interface to the CPU, memory, and disk.]
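One plausible reading of "number of tiles based on bandwidth requirements" is that tiles are added until their combined demand saturates the fixed 142 GB/s interface. The sketch below follows that reading; the sizing rule and the per-tile demand figures are assumptions made for illustration, since the slide gives neither.

```python
import math

SYSTEM_BW_GBPS = 142.0  # from the slide: same bandwidth as the GTX 280


def tiles_for_bandwidth(system_bw_gbps, tile_bw_gbps):
    """Largest tile count whose combined demand fits the external bandwidth.

    tile_bw_gbps is the sustained bandwidth one tile needs when running
    the benchmark at full rate (a hypothetical input here).
    """
    if tile_bw_gbps <= 0:
        raise ValueError("per-tile bandwidth demand must be positive")
    return max(1, math.floor(system_bw_gbps / tile_bw_gbps))


# Hypothetical per-tile demands for two benchmarks, in GB/s.
bandwidth_hungry = tiles_for_bandwidth(SYSTEM_BW_GBPS, 8.0)   # 17 tiles
compute_heavy = tiles_for_bandwidth(SYSTEM_BW_GBPS, 2.5)      # 56 tiles
```

Under this rule a benchmark with a higher compute:mem ratio gets more tiles, because each tile draws less of the shared bandwidth.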

  26. System Performance [Figure: performance chart for the five systems, annotated with power figures of 4 W, 3 W, 2.8 W, 2.3 W, and 2.7 W.]

  27. Performance vs. GPGPU • 2X the performance of the GTS 250 • 63% of the performance of the GTX 295

  28. Efficiency vs. GPGPU [Figure: efficiency chart annotated with 54X and 22X.]

  29. Conclusions • Power-efficient accelerator for medical imaging • ASIC-like efficiency with programmability • 63-201% of GPU performance • 22-54X GPU Performance/Power efficiency

  30. Thank you! Questions?
