Programming Massively Parallel Processors
Lecture Slides for Chapter 8: Application Case Study – Advanced MRI Reconstruction

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010

ECE408, University of Illinois, Urbana-Champaign


Objective

  • To learn about computational thinking skills through a concrete example

    • Problem formulation

    • Designing implementations to steer around limitations

    • Validating results

    • Understanding the impact of your improvements

  • A top to bottom experience!



Acknowledgements

Sam S. Stone§, Haoran Yi§, Justin P. Haldar†, Deepthi Nandakumar, Bradley P. Sutton†, Zhi-Pei Liang†, Keith Thulborn*

§ Center for Reliable and High-Performance Computing
† Beckman Institute for Advanced Science and Technology
Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign
* University of Illinois at Chicago Medical Center



Overview

  • Magnetic resonance imaging

  • Non-Cartesian Scanner Trajectory

  • Least-squares (LS) reconstruction algorithm

  • Optimizing the LS reconstruction on the G80

    • Overcoming bottlenecks

    • Performance tuning

  • Summary



Reconstructing MR Images

(Diagram: Cartesian scan data → FFT; spiral scan data → gridding → FFT, or → LS.)

Cartesian scan data + FFT: slow scan, fast reconstruction, images may be poor.



Reconstructing MR Images

(Diagram: Cartesian scan data → FFT; spiral scan data → gridding → FFT, or → LS.)

Spiral scan data + gridding¹ + FFT: fast scan, fast reconstruction, better images.

¹ Based on Fig. 1 of Lustig et al., "Fast Spiral Fourier Transform for Iterative MR Image Reconstruction," IEEE Int'l Symp. on Biomedical Imaging, 2004.



Reconstructing MR Images

(Diagram: Cartesian scan data → FFT; spiral scan data → gridding → FFT, or → least-squares (LS).)

Spiral scan data + LS: superior images at the expense of significantly more computation.



An Exciting Revolution – Sodium Map of the Brain

  • Images of sodium in the brain
  • Very large number of samples for increased SNR
  • Requires high-quality reconstruction
  • Enables study of brain-cell viability before anatomic changes occur in stroke and cancer treatment – within days!

Courtesy of Keith Thulborn and Ian Atkinson, Center for MR Research, University of Illinois at Chicago



Least-Squares Reconstruction

  • Q depends only on scanner configuration

  • FHd depends on scan data

  • ρ found using linear solver

  • Accelerate Q and FHd on G80

    • Q: 1-2 days on CPU

    • FHd: 6-7 hours on CPU

    • ρ: 1.5 minutes on CPU

(Flow: Compute Q for FHF → Acquire data → Compute FHd → Find ρ.)
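
For reference, this pipeline corresponds to the usual regularized least-squares formulation (a sketch of the standard formulation, not spelled out on the slide; F is the Fourier-encoding matrix, d the scan data, and λ, W a regularization weight and matrix):

\hat{\rho} = \arg\min_{\rho} \|F\rho - d\|_2^2 + \lambda \|W\rho\|_2^2
\quad\Longrightarrow\quad
(F^{H}F + \lambda W^{H}W)\,\rho = F^{H}d

Q is a precomputed, scanner-dependent data structure that stands in for FHF, FHd is recomputed for each scan, and the linear solver recovers ρ.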



for (m = 0; m < M; m++) {
  phiMag[m] = rPhi[m]*rPhi[m] + iPhi[m]*iPhi[m];
  for (n = 0; n < N; n++) {
    expQ = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
    rQ[n] += phiMag[m]*cos(expQ);
    iQ[n] += phiMag[m]*sin(expQ);
  }
}

(a) Q computation

for (m = 0; m < M; m++) {
  rMu[m] = rPhi[m]*rD[m] + iPhi[m]*iD[m];
  iMu[m] = rPhi[m]*iD[m] - iPhi[m]*rD[m];
  for (n = 0; n < N; n++) {
    expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
    cArg = cos(expFhD);
    sArg = sin(expFhD);
    rFhD[n] += rMu[m]*cArg - iMu[m]*sArg;
    iFhD[n] += iMu[m]*cArg + rMu[m]*sArg;
  }
}

(b) FHd computation

Q vs. FHd



Algorithms to Accelerate

for (m = 0; m < M; m++) {
  rMu[m] = rPhi[m]*rD[m] + iPhi[m]*iD[m];
  iMu[m] = rPhi[m]*iD[m] - iPhi[m]*rD[m];
  for (n = 0; n < N; n++) {
    expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
    cArg = cos(expFhD);
    sArg = sin(expFhD);
    rFhD[n] += rMu[m]*cArg - iMu[m]*sArg;
    iFhD[n] += iMu[m]*cArg + rMu[m]*sArg;
  }
}

  • Scan data

    • M = # scan points

    • kx, ky, kz = 3D scan data

  • Pixel data

    • N = # pixels

    • x, y, z = input 3D pixel data

    • rFhD, iFhD= output pixel data

  • Complexity is O(MN)

  • Inner loop

    • 13 FP MUL or ADD ops

    • 2 FP trig ops

    • 12 loads, 2 stores (see the estimate below)
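
Putting those inner-loop counts together gives a rough arithmetic-intensity estimate (my arithmetic, assuming 4-byte single-precision values and counting only the MUL/ADD ops):

\frac{13\ \text{FP ops}}{(12 + 2) \times 4\ \text{bytes}} \approx 0.23\ \text{FLOP/byte}

At well under one FP operation per byte of memory traffic, a direct port is bound by memory bandwidth rather than compute, which is what the register and constant-memory optimizations later in this deck attack.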



From C to CUDA: Step 1 – What unit of work is assigned to each thread?

for (m = 0; m < M; m++) {
  rMu[m] = rPhi[m]*rD[m] + iPhi[m]*iD[m];
  iMu[m] = rPhi[m]*iD[m] - iPhi[m]*rD[m];
  for (n = 0; n < N; n++) {
    expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
    cArg = cos(expFhD);
    sArg = sin(expFhD);
    rFhD[n] += rMu[m]*cArg - iMu[m]*sArg;
    iFhD[n] += iMu[m]*cArg + rMu[m]*sArg;
  }
}



One Possibility

__global__ void cmpFHd(float* rPhi, float* iPhi, float* rD, float* iD,
                       float* kx, float* ky, float* kz,
                       float* x, float* y, float* z,
                       float* rMu, float* iMu,
                       float* rFhD, float* iFhD, int N) {
  int m = blockIdx.x * FHD_THREADS_PER_BLOCK + threadIdx.x;
  rMu[m] = rPhi[m]*rD[m] + iPhi[m]*iD[m];
  iMu[m] = rPhi[m]*iD[m] - iPhi[m]*rD[m];
  for (int n = 0; n < N; n++) {
    float expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
    float cArg = cos(expFhD);
    float sArg = sin(expFhD);
    /* Problem: every thread accumulates into every rFhD[n]/iFhD[n],
       so threads in different blocks race on these global updates. */
    rFhD[n] += rMu[m]*cArg - iMu[m]*sArg;
    iFhD[n] += iMu[m]*cArg + rMu[m]*sArg;
  }
}



One Possibility – Improved

__global__ void cmpFHd(float* rPhi, float* iPhi, float* rD, float* iD,
                       float* kx, float* ky, float* kz,
                       float* x, float* y, float* z,
                       float* rMu, float* iMu,
                       float* rFhD, float* iFhD, int N) {
  int m = blockIdx.x * FHD_THREADS_PER_BLOCK + threadIdx.x;
  /* Keep this thread's rMu[m]/iMu[m] in registers so the loop
     below does not re-read them from global memory. */
  float rMu_reg, iMu_reg;
  rMu_reg = rMu[m] = rPhi[m]*rD[m] + iPhi[m]*iD[m];
  iMu_reg = iMu[m] = rPhi[m]*iD[m] - iPhi[m]*rD[m];
  for (int n = 0; n < N; n++) {
    float expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
    float cArg = cos(expFhD);
    float sArg = sin(expFhD);
    rFhD[n] += rMu_reg*cArg - iMu_reg*sArg;
    iFhD[n] += iMu_reg*cArg + rMu_reg*sArg;
  }
}



Back to the Drawing Board – Maybe Map the n Loop to Threads?

for (m = 0; m < M; m++) {
  rMu[m] = rPhi[m]*rD[m] + iPhi[m]*iD[m];
  iMu[m] = rPhi[m]*iD[m] - iPhi[m]*rD[m];
  for (n = 0; n < N; n++) {
    expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
    cArg = cos(expFhD);
    sArg = sin(expFhD);
    rFhD[n] += rMu[m]*cArg - iMu[m]*sArg;
    iFhD[n] += iMu[m]*cArg + rMu[m]*sArg;
  }
}



for (m = 0; m < M; m++) {
  rMu[m] = rPhi[m]*rD[m] + iPhi[m]*iD[m];
  iMu[m] = rPhi[m]*iD[m] - iPhi[m]*rD[m];
  for (n = 0; n < N; n++) {
    expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
    cArg = cos(expFhD);
    sArg = sin(expFhD);
    rFhD[n] += rMu[m]*cArg - iMu[m]*sArg;
    iFhD[n] += iMu[m]*cArg + rMu[m]*sArg;
  }
}

(a) FHd computation

for (m = 0; m < M; m++) {
  for (n = 0; n < N; n++) {
    rMu[m] = rPhi[m]*rD[m] + iPhi[m]*iD[m];
    iMu[m] = rPhi[m]*iD[m] - iPhi[m]*rD[m];
    expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
    cArg = cos(expFhD);
    sArg = sin(expFhD);
    rFhD[n] += rMu[m]*cArg - iMu[m]*sArg;
    iFhD[n] += iMu[m]*cArg + rMu[m]*sArg;
  }
}

(b) after code motion



A Second Option for the cmpFHd Kernel

__global__ void cmpFHd(float* rPhi, float* iPhi, float* rD, float* iD,
                       float* kx, float* ky, float* kz,
                       float* x, float* y, float* z,
                       float* rMu, float* iMu,
                       float* rFhD, float* iFhD, int M) {
  int n = blockIdx.x * FHD_THREADS_PER_BLOCK + threadIdx.x;
  for (int m = 0; m < M; m++) {
    /* Each thread owns pixel n, so the rFhD[n]/iFhD[n] updates no
       longer conflict; but every thread redundantly recomputes and
       stores rMu[m]/iMu[m] for all M scan points. */
    float rMu_reg = rMu[m] = rPhi[m]*rD[m] + iPhi[m]*iD[m];
    float iMu_reg = iMu[m] = rPhi[m]*iD[m] - iPhi[m]*rD[m];
    float expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
    float cArg = cos(expFhD);
    float sArg = sin(expFhD);
    rFhD[n] += rMu_reg*cArg - iMu_reg*sArg;
    iFhD[n] += iMu_reg*cArg + rMu_reg*sArg;
  }
}



We Do Have Another Option



for (m = 0; m < M; m++) {
  rMu[m] = rPhi[m]*rD[m] + iPhi[m]*iD[m];
  iMu[m] = rPhi[m]*iD[m] - iPhi[m]*rD[m];
  for (n = 0; n < N; n++) {
    expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
    cArg = cos(expFhD);
    sArg = sin(expFhD);
    rFhD[n] += rMu[m]*cArg - iMu[m]*sArg;
    iFhD[n] += iMu[m]*cArg + rMu[m]*sArg;
  }
}

(a) FHd computation

for (m = 0; m < M; m++) {
  rMu[m] = rPhi[m]*rD[m] + iPhi[m]*iD[m];
  iMu[m] = rPhi[m]*iD[m] - iPhi[m]*rD[m];
}

for (m = 0; m < M; m++) {
  for (n = 0; n < N; n++) {
    expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
    cArg = cos(expFhD);
    sArg = sin(expFhD);
    rFhD[n] += rMu[m]*cArg - iMu[m]*sArg;
    iFhD[n] += iMu[m]*cArg + rMu[m]*sArg;
  }
}

(b) after loop fission



A Separate cmpMu Kernel

__global__ void cmpMu(float* rPhi, float* iPhi, float* rD, float* iD,
                      float* rMu, float* iMu)
{
  int m = blockIdx.x * MU_THREADS_PER_BLOCK + threadIdx.x;
  rMu[m] = rPhi[m]*rD[m] + iPhi[m]*iD[m];
  iMu[m] = rPhi[m]*iD[m] - iPhi[m]*rD[m];
}
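
A possible host-side launch for this kernel (a sketch, not shown on the slides; it assumes M is a multiple of MU_THREADS_PER_BLOCK and that all arrays already reside in device memory):

int nBlocks = M / MU_THREADS_PER_BLOCK;  /* one thread per scan point */
cmpMu<<<nBlocks, MU_THREADS_PER_BLOCK>>>(rPhi, iPhi, rD, iD, rMu, iMu);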



__global__ void cmpFHd(float* kx, float* ky, float* kz,
                       float* x, float* y, float* z,
                       float* rMu, float* iMu,
                       float* rFhD, float* iFhD, int N) {
  int m = blockIdx.x * FHD_THREADS_PER_BLOCK + threadIdx.x;
  /* With the Mu computation split off, each thread handles one scan
     point m; the conflicting writes to rFhD[n]/iFhD[n] remain. */
  for (int n = 0; n < N; n++) {
    float expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
    float cArg = cos(expFhD);
    float sArg = sin(expFhD);
    rFhD[n] += rMu[m]*cArg - iMu[m]*sArg;
    iFhD[n] += iMu[m]*cArg + rMu[m]*sArg;
  }
}



for (m = 0; m < M; m++) {
  for (n = 0; n < N; n++) {
    expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
    cArg = cos(expFhD);
    sArg = sin(expFhD);
    rFhD[n] += rMu[m]*cArg - iMu[m]*sArg;
    iFhD[n] += iMu[m]*cArg + rMu[m]*sArg;
  }
}

(a) before loop interchange

for (n = 0; n < N; n++) {
  for (m = 0; m < M; m++) {
    expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
    cArg = cos(expFhD);
    sArg = sin(expFhD);
    rFhD[n] += rMu[m]*cArg - iMu[m]*sArg;
    iFhD[n] += iMu[m]*cArg + rMu[m]*sArg;
  }
}

(b) after loop interchange

Figure 7.9: Loop interchange of the FHd computation.



A Third Option for the cmpFHd Kernel

__global__ void cmpFHd(float* kx, float* ky, float* kz,
                       float* x, float* y, float* z,
                       float* rMu, float* iMu,
                       float* rFhD, float* iFhD, int M) {
  int n = blockIdx.x * FHD_THREADS_PER_BLOCK + threadIdx.x;
  /* After the interchange, each thread owns one pixel n and loops
     over all scan points, so no two threads write the same output. */
  for (int m = 0; m < M; m++) {
    float expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
    float cArg = cos(expFhD);
    float sArg = sin(expFhD);
    rFhD[n] += rMu[m]*cArg - iMu[m]*sArg;
    iFhD[n] += iMu[m]*cArg + rMu[m]*sArg;
  }
}



Using Registers to Reduce Global Memory Traffic

__global__ void cmpFHd(float* kx, float* ky, float* kz,
                       float* x, float* y, float* z,
                       float* rMu, float* iMu,
                       float* rFhD, float* iFhD, int M) {
  int n = blockIdx.x * FHD_THREADS_PER_BLOCK + threadIdx.x;
  /* Register-allocate this thread's pixel inputs and accumulators. */
  float xn_r = x[n]; float yn_r = y[n]; float zn_r = z[n];
  float rFhDn_r = rFhD[n]; float iFhDn_r = iFhD[n];
  for (int m = 0; m < M; m++) {
    float expFhD = 2*PI*(kx[m]*xn_r + ky[m]*yn_r + kz[m]*zn_r);
    float cArg = cos(expFhD);
    float sArg = sin(expFhD);
    rFhDn_r += rMu[m]*cArg - iMu[m]*sArg;
    iFhDn_r += iMu[m]*cArg + rMu[m]*sArg;
  }
  rFhD[n] = rFhDn_r; iFhD[n] = iFhDn_r;
}



Tiling of Scan Data

LS recon uses multiple grids:

  • Each grid operates on all pixels
  • Each grid operates on a distinct subset of scan data
  • Each thread in the same grid operates on a distinct pixel

Thread n operates on pixel n:

for (m = 0; m < M/32; m++) {
  exQ = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
  rQ[n] += phi[m]*cos(exQ);
  iQ[n] += phi[m]*sin(exQ);
}
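
A sketch of the host-side driver this tiling implies (my reconstruction, not shown on the slide): 32 sequential grid launches, each covering M/32 scan points, using the (startM, endM) form of the Q kernel that appears under Step 2 below. The kernel name cmpQ and the constant Q_THREADS_PER_BLOCK are illustrative:

for (int chunk = 0; chunk < 32; chunk++) {
  /* Each grid processes scan points [chunk*M/32, (chunk+1)*M/32). */
  cmpQ<<<N/Q_THREADS_PER_BLOCK, Q_THREADS_PER_BLOCK>>>
      (x, y, z, rQ, iQ, kx, ky, kz, phi,
       chunk*(M/32), (chunk+1)*(M/32));
}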



Chunking k-Space Data to Fit into Constant Memory

__constant__ float kx_c[CHUNK_SIZE],
                   ky_c[CHUNK_SIZE], kz_c[CHUNK_SIZE];

int main() {
  int i;
  for (i = 0; i < M/CHUNK_SIZE; i++) {
    /* Constant memory is filled with cudaMemcpyToSymbol;
       4 bytes per float, CHUNK_SIZE scan points per chunk. */
    cudaMemcpyToSymbol(kx_c, &kx[i*CHUNK_SIZE], 4*CHUNK_SIZE);
    cudaMemcpyToSymbol(ky_c, &ky[i*CHUNK_SIZE], 4*CHUNK_SIZE);
    cudaMemcpyToSymbol(kz_c, &kz[i*CHUNK_SIZE], 4*CHUNK_SIZE);
    cmpFHd<<<N/FHD_THREADS_PER_BLOCK, FHD_THREADS_PER_BLOCK>>>
          (rPhi, iPhi, phiMag, x, y, z, rMu, iMu, CHUNK_SIZE);
  }
  /* Need to call the kernel one more time if M is not a
     perfect multiple of CHUNK_SIZE. */
}



Revised Kernel for Constant Memory

__global__ void cmpFHd(float* x, float* y, float* z,
                       float* rMu, float* iMu,
                       float* rFhD, float* iFhD, int M) {
  int n = blockIdx.x * FHD_THREADS_PER_BLOCK + threadIdx.x;
  float xn_r = x[n]; float yn_r = y[n]; float zn_r = z[n];
  float rFhDn_r = rFhD[n]; float iFhDn_r = iFhD[n];
  for (int m = 0; m < M; m++) {
    /* k-space data now comes from the constant-memory arrays,
       which are served by the constant cache. */
    float expFhD = 2*PI*(kx_c[m]*xn_r + ky_c[m]*yn_r + kz_c[m]*zn_r);
    float cArg = cos(expFhD);
    float sArg = sin(expFhD);
    rFhDn_r += rMu[m]*cArg - iMu[m]*sArg;
    iFhDn_r += iMu[m]*cArg + rMu[m]*sArg;
  }
  rFhD[n] = rFhDn_r; iFhD[n] = iFhDn_r;
}



Figure 7.14: Effect of k-space data layout on constant cache efficiency. (a) k-space data stored in separate arrays; (b) k-space data stored in an array whose elements are structs.



struct kdata {
  float x, y, z;
};

__constant__ struct kdata k_c[CHUNK_SIZE];

int main() {
  int i;
  for (i = 0; i < M/CHUNK_SIZE; i++) {
    /* 12 bytes per scan point: one float each for x, y, z.
       k is the host-side array of struct kdata scan points. */
    cudaMemcpyToSymbol(k_c, &k[i*CHUNK_SIZE], 12*CHUNK_SIZE);
    cmpFHd<<<N/FHD_THREADS_PER_BLOCK, FHD_THREADS_PER_BLOCK>>>
          (rPhi, iPhi, phiMag, x, y, z, rMu, iMu, CHUNK_SIZE);
  }
}

Figure 7.15: Adjusting k-space data layout to improve cache efficiency.



__global__ void cmpFHd(float* x, float* y, float* z,
                       float* rMu, float* iMu,
                       float* rFhD, float* iFhD, int M) {
  int n = blockIdx.x * FHD_THREADS_PER_BLOCK + threadIdx.x;
  float xn_r = x[n]; float yn_r = y[n]; float zn_r = z[n];
  float rFhDn_r = rFhD[n]; float iFhDn_r = iFhD[n];
  for (int m = 0; m < M; m++) {
    /* One constant-cache line now serves x, y, and z of the
       same scan point. */
    float expFhD = 2*PI*(k_c[m].x*xn_r + k_c[m].y*yn_r + k_c[m].z*zn_r);
    float cArg = cos(expFhD);
    float sArg = sin(expFhD);
    rFhDn_r += rMu[m]*cArg - iMu[m]*sArg;
    iFhDn_r += iMu[m]*cArg + rMu[m]*sArg;
  }
  rFhD[n] = rFhDn_r; iFhD[n] = iFhDn_r;
}

Figure 7.16: Adjusting the k-space data memory layout in the FHd kernel.



From C to CUDA: Step 2 – Where Are the Potential Bottlenecks?

Bottlenecks

  • Memory BW

  • Trig ops

  • Overheads (branches, addr calcs)

__global__ void Q(float* x, float* y, float* z, float* rQ, float* iQ,
                  float* kx, float* ky, float* kz, float* phi,
                  int startM, int endM)
{
  int n = blockIdx.x*TPB + threadIdx.x;
  for (int m = startM; m < endM; m++) {
    float exQ = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
    rQ[n] += phi[m] * cos(exQ);
    iQ[n] += phi[m] * sin(exQ);
  }
}



Step 3: Overcoming Bottlenecks

  • LS recon on CPU (SP)

    • Q: 45 hours, 0.5 GFLOPS

    • FHd: 7 hours, 0.7 GFLOPS

  • Counting each trig op as 1 FLOP



Step 3: Overcoming Bottlenecks (Mem BW)

  • Register allocate pixel data

    • Inputs (x, y, z); Outputs (rQ, iQ)

  • Exploit temporal and spatial locality in access to scan data

    • Constant memory + constant caches

    • Shared memory



Step 3: Overcoming Bottlenecks (Mem BW)

  • Register allocation of pixel data

    • Inputs (x, y, z); Outputs (rQ, iQ)

    • FP arithmetic to off-chip loads: 2 to 1

  • Performance

    • 5.1 GFLOPS (Q), 5.4 GFLOPS (FHd)

  • Still bottlenecked on memory BW



Step 3: Overcoming Bottlenecks (Mem BW)

  • Old bottleneck: off-chip BW

    • Solution: constant memory

    • FP arithmetic to off-chip loads: 284 to 1

  • Performance

    • 18.6 GFLOPS (Q), 22.8 GFLOPS (FHd)

  • New bottleneck: trig operations



Sidebar: Estimating Off-Chip Loads with Const Cache

  • How can we approximate the number of off-chip loads when using the constant caches?

  • Given: 128 tpb, 4 blocks per SM, 256 scan points per grid

  • Assume no evictions due to cache conflicts

  • 7 accesses to global memory per thread (x, y, z, rQ x 2, iQ x 2)

    • 4 blocks/SM * 128 threads/block * 7 accesses/thread = 3,584 global mem accesses

  • 4 accesses to constant memory per scan point (kx, ky, kz, phi)

    • 256 scan points * 4 loads/point = 1,024 constant mem accesses

  • Total off-chip memory accesses = 3,584 + 1,024 = 4,608

  • Total FP arithmetic ops = 4 blocks/SM * 128 threads/block * 256 iters/thread * 10 ops/iter = 1,310,720

  • FP arithmetic to off-chip loads: 284 to 1 (see the check below)
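
As a check, the final ratio is just the quotient of the two totals above:

\frac{1{,}310{,}720\ \text{FP ops}}{4{,}608\ \text{off-chip accesses}} \approx 284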



Step 3: Overcoming Bottlenecks (Trig)

  • Old bottleneck: trig operations

    • Solution: SFUs (see the sketch after this list)

  • Performance

    • 98.2 GFLOPS (Q), 92.2 GFLOPS (FHd)

  • New bottleneck: overhead of branches and address calculations
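
A minimal sketch of what the SFU change looks like in the inner loop (my illustration, based on the register-tiled FHd kernel shown earlier; __cosf() and __sinf() are CUDA's hardware intrinsics, which execute on the special function units and trade a small amount of accuracy for much higher throughput):

for (int m = 0; m < M; m++) {
  float expFhD = 2*PI*(kx_c[m]*xn_r + ky_c[m]*yn_r + kz_c[m]*zn_r);
  float cArg = __cosf(expFhD);  /* SFU intrinsic instead of cos() */
  float sArg = __sinf(expFhD);  /* SFU intrinsic instead of sin() */
  rFhDn_r += rMu[m]*cArg - iMu[m]*sArg;
  iFhDn_r += iMu[m]*cArg + rMu[m]*sArg;
}

The accuracy cost of such approximations is exactly what the next sidebar addresses.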



Sidebar: Effects of Approximations

  • Avoid temptation to measure only absolute error (I0 – I)

    • Can be deceptively large or small

  • Metrics (definitions after this list)

    • PSNR: peak signal-to-noise ratio

    • SNR: signal-to-noise ratio

  • Avoid temptation to consider only the error in the computed value

    • Some apps are resistant to approximations; others are very sensitive
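
For reference, the standard definitions of these metrics (not spelled out on the slide; I0 is the reference image, I the reconstructed image, and N the number of pixels):

\text{MSE} = \frac{1}{N}\sum_i \bigl(I_0(i) - I(i)\bigr)^2, \qquad
\text{PSNR} = 20\log_{10}\frac{\max_i |I_0(i)|}{\sqrt{\text{MSE}}}, \qquad
\text{SNR} = 20\log_{10}\frac{\sqrt{\frac{1}{N}\sum_i I_0(i)^2}}{\sqrt{\text{MSE}}}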

A.N. Netravali and B.G. Haskell, Digital Pictures: Representation, Compression, and Standards (2nd ed.), Plenum Press, New York, NY, 1995.



Step 3: Overcoming Bottlenecks (Overheads)

  • Old bottleneck: Overhead of branches and address calculations

    • Solution: Loop unrolling and experimental tuning

  • Performance

    • 179 GFLOPS (Q), 145 GFLOPS (FHd)



Experimental Tuning: Tradeoffs

  • In the Q kernel, three parameters are natural candidates for experimental tuning

    • Loop unrolling factor (1, 2, 4, 8, 16; see the sketch after this list)

    • Number of threads per block (32, 64, 128, 256, 512)

    • Number of scan points per grid (32, 64, 128, 256, 512, 1024, 2048)

  • Can’t optimize these parameters independently

    • Resource sharing among threads (register file, shared memory)

    • Optimizations that increase a thread’s performance often increase the thread’s resource consumption, reducing the total number of threads that execute in parallel

  • Optimization space is not linear

    • Threads are assigned to SMs in large thread blocks

    • Causes discontinuity and non-linearity in the optimization space
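
A sketch of what experimenting with the unrolling factor can look like in source (my illustration, not the authors' code; #pragma unroll is the standard nvcc hint, the factor shown is one of the candidate values, and the names kx_c, phiMag_c, rQn_r, iQn_r follow the register/constant-memory conventions used earlier):

/* Hypothetical: unroll the scan-point loop of the Q kernel by 4.
   Unrolling replicates the loop body, cutting per-iteration branch
   and address-calculation overhead at the cost of more registers. */
#pragma unroll 4
for (int m = 0; m < M; m++) {
  float exQ = 2*PI*(kx_c[m]*xn_r + ky_c[m]*yn_r + kz_c[m]*zn_r);
  rQn_r += phiMag_c[m]*__cosf(exQ);
  iQn_r += phiMag_c[m]*__sinf(exQ);
}

Because a larger factor raises register pressure, and the register file is shared across a block, each unrolling factor has to be re-tuned jointly with the threads-per-block and scan-points-per-grid choices.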



Experimental Tuning: Example

Increase in per-thread performance, but fewer threads: lower overall performance.



Experimental Tuning: Scan Points Per Grid



Sidebar: Cache-Conscious Data Layout

  • The kx, ky, kz, and phi components of the same scan point have spatial and temporal locality

    • Prefetching

    • Caching

  • The old layout (separate kx, ky, kz, and phi arrays) does not fully leverage that locality

  • The new struct-based layout does



Experimental Tuning: Scan Points Per Grid (Improved Data Layout)



Experimental Tuning: Loop Unrolling Factor



Sidebar: Optimizing the CPU Implementation

  • Optimizing the CPU implementation of your application is very important

    • Often, the transformations that increase performance on the CPU also increase performance on the GPU (and vice versa)

    • The research community won’t take your results seriously if your baseline is crippled

  • Useful optimizations

    • Data tiling

    • SIMD vectorization (SSE)

    • Fast math libraries (AMD, Intel)

    • Classical optimizations (loop unrolling, etc.)

  • Intel compiler (icc, icpc)



Summary of Results

(Results chart: 8× speedup.)



Summary of Results

(Results chart: 357×, 228×, and 108× speedups.)



Questions?


Scanner image distributed under the GNU Free Documentation License.

GeForce 8800 GTX image obtained from http://www.nvnews.net/previews/geforce_8800_gtx/index.shtml



Algorithms to Accelerate

Compute FHd:

for (K = 0; K < numK; K++) {
  rRho[K] = rPhi[K]*rD[K] + iPhi[K]*iD[K];
  iRho[K] = rPhi[K]*iD[K] - iPhi[K]*rD[K];
}

for (X = 0; X < numP; X++) {
  for (K = 0; K < numK; K++) {
    exp = 2*PI*(kx[K]*x[X] + ky[K]*y[X] + kz[K]*z[X]);
    cArg = cos(exp);
    sArg = sin(exp);
    rFH[X] += rRho[K]*cArg - iRho[K]*sArg;
    iFH[X] += iRho[K]*cArg + rRho[K]*sArg;
  }
}

  • Inner loop

    • 14 FP MUL or ADD ops

    • 4 FP trig ops

    • 12 loads (naively)



Experimental Methodology

  • Reconstruct a 3D image of a human brain¹

    • 3.2 M scan data points acquired via 3D spiral scan

    • 256K pixels

  • Compare the performance of several reconstructions

    • Gridding + FFT recon¹ on CPU (Intel Core 2 Extreme, quad core)

    • LS recon on CPU (double precision, single precision)

    • LS recon on GPU (NVIDIA GeForce 8800 GTX)

  • Metrics

    • Reconstruction time: compute FHd and run linear solver

    • Run time: compute Q or FHd

¹ Courtesy of Keith Thulborn and Ian Atkinson, Center for MR Research, University of Illinois at Chicago


