16bit 3D Convolution Implementation SSE + OpenMP Benchmarking on Penryn

Dr. Zvi Danovich, Senior Application Engineer, January 2008

Agenda
• Mathematics of 3D convolution
• Main idea of SSE implementation of 1D convolution
• Basic routine of algorithm: 2D convolution – 1 line
• Main routine of algorithm: 3D convolution – line by line
• Adding OpenMP, benchmarking, conclusions

3D convolution – what is it?
• 3D convolution (with a 3x3x3 kernel K) is computed for each pixel P as
  $P(x,y,z) = \sum_{k=-1}^{1} \sum_{j=-1}^{1} \sum_{i=-1}^{1} K(i,j,k)\, p(x+i,\, y+j,\, z+k)$
  where p are the source pixels and K are the convolution kernel values.
• In other words, each new pixel is the sum of 27 products of source pixel values with the corresponding kernel values inside the 3x3x3 kernel cube.
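As a reference point, the 27-product sum can be written as a plain scalar loop. This is a minimal sketch, not the presented SSE code: the function name `conv3d_pixel`, the slice-by-slice memory layout, the flat 27-entry kernel array and the unsigned 16-bit pixel type are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Direct 3x3x3 sum for one interior pixel: 27 products.
   Assumed layout: src[(z * ny + y) * nx + x], kernel k[(dz+1)*9 + (dy+1)*3 + (dx+1)]. */
int32_t conv3d_pixel(const uint16_t *src, int nx, int ny, int nz,
                     const int32_t k[27], int x, int y, int z) {
    /* Only interior pixels have all 27 neighbours. */
    assert(x >= 1 && x + 1 < nx && y >= 1 && y + 1 < ny && z >= 1 && z + 1 < nz);
    int32_t sum = 0;
    for (int dz = -1; dz <= 1; ++dz)
        for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx)
                sum += k[(dz + 1) * 9 + (dy + 1) * 3 + (dx + 1)] *
                       (int32_t)src[((z + dz) * ny + (y + dy)) * nx + (x + dx)];
    return sum;
}
```

A scalar routine of this kind is also a convenient baseline for checking a vectorized version.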

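Recombination from 1D convolutions
• If the 1D convolution along a line (with source pixels p and kernel values K) is defined as
  $C_{j,k}(x) = \sum_{i=-1}^{1} K(i,j,k)\, p(x+i,\, y+j,\, z+k)$
  then the final line of the 3D convolution is
  $P(x,y,z) = \sum_{k=-1}^{1} \sum_{j=-1}^{1} C_{j,k}(x)$
  i.e. the 3D convolution can be presented as a double sum of nine 1D convolutions: 3 planes with 3 lines in each plane.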
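The recombination can be sketched in scalar C as nine calls to a 1D routine. The names `conv1d_at` and `conv3d_from_1d`, the memory layout and the flat kernel array are illustrative assumptions, not taken from the presentation.

```c
#include <assert.h>
#include <stdint.h>

/* 1D convolution at position x of one line with one kernel row (k-, kc, k+). */
int32_t conv1d_at(const uint16_t *line, int x, const int32_t krow[3]) {
    return krow[0] * (int32_t)line[x - 1] + krow[1] * (int32_t)line[x] +
           krow[2] * (int32_t)line[x + 1];
}

/* 3D result at interior pixel (x, y, z) as a double sum of nine 1D convolutions.
   Assumed layout: src[(z * ny + y) * nx + x], kernel rows at k + (dz+1)*9 + (dy+1)*3. */
int32_t conv3d_from_1d(const uint16_t *src, int nx, int ny,
                       const int32_t k[27], int x, int y, int z) {
    assert(x >= 1 && x + 1 < nx && y >= 1 && y + 1 < ny && z >= 1);
    int32_t sum = 0;
    for (int dz = -1; dz <= 1; ++dz)           /* 3 planes */
        for (int dy = -1; dy <= 1; ++dy) {     /* 3 lines in each plane */
            const uint16_t *line = src + ((z + dz) * ny + (y + dy)) * nx;
            const int32_t *krow = k + (dz + 1) * 9 + (dy + 1) * 3;
            sum += conv1d_at(line, x, krow);
        }
    return sum;
}
```

The point of this form is that only the 1D routine needs to be vectorized; the outer double sum is plain addition of lines.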

Main part of algorithm: 1D convolution, idea of implementation
• Start from 3 sequential QUADs of a source line (pixels p[-4]..p[-1], p[0]..p[3], p[4]..p[7]) and multiply all three by the kernel values of the line, denoted k-, kc and k+.
• Using PALIGNR, select the QUAD shifted left (p[-1]..p[2]) for the products with k- and the QUAD shifted right (p[1]..p[4]) for the products with k+. Sum them with the unshifted QUAD of products with kc (p[0]..p[3]).
• The resulting sums are the convolution expressions for the central QUAD:
  k-·p[-1] + kc·p[0] + k+·p[1]
  k-·p[0] + kc·p[1] + k+·p[2]
  k-·p[1] + kc·p[2] + k+·p[3]
  k-·p[2] + kc·p[3] + k+·p[4]
[Figure: three rows of products k-·p, kc·p and k+·p for source pixels p[-4]..p[7]; PALIGNR selects the shifted QUADs from the k- and k+ rows.]
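The selection step can be modelled in portable scalar C. In this sketch `select_quad` is a hypothetical stand-in for PALIGNR on 32-bit lanes (with intrinsics the same selection would presumably be `_mm_alignr_epi8` with byte offsets 12 and 4), and `conv1d_quad` is an illustrative name; neither comes from the presentation.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar stand-in for PALIGNR: view lo[0..3], hi[0..3] as eight consecutive
   32-bit lanes and copy four of them, starting at lane `shift`. */
void select_quad(const int32_t lo[4], const int32_t hi[4], int shift,
                 int32_t out[4]) {
    assert(shift >= 0 && shift <= 4);
    for (int i = 0; i < 4; ++i) {
        int lane = shift + i;
        out[i] = lane < 4 ? lo[lane] : hi[lane - 4];
    }
}

/* 1D convolution of the central QUAD `cur` using its neighbours. */
void conv1d_quad(const int32_t prev[4], const int32_t cur[4],
                 const int32_t next[4], int32_t km, int32_t kc, int32_t kp,
                 int32_t out[4]) {
    int32_t pm[4], cm[4], cc[4], cp[4], np[4], left[4], right[4];
    for (int i = 0; i < 4; ++i) {        /* multiply whole QUADs first */
        pm[i] = km * prev[i];
        cm[i] = km * cur[i];
        cc[i] = kc * cur[i];
        cp[i] = kp * cur[i];
        np[i] = kp * next[i];
    }
    select_quad(pm, cm, 3, left);        /* k- * p[-1..2] */
    select_quad(cp, np, 1, right);       /* k+ * p[1..4]  */
    for (int i = 0; i < 4; ++i)
        out[i] = left[i] + cc[i] + right[i];
}
```

Multiplying before selecting means every product QUAD is computed once and can be reused by the neighbouring output QUADs.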

Basic routine of algorithm: 2D convolution – 1 line
• The main loop processes sequential EIGHTs of 16-bit pixels for 3 adjacent lines (unrolled inside one step). The 1D convolution (in 32-bit form) is computed for the 2 QUADs of each EIGHT, and the results for the 3 lines are summed up, forming the 2D convolution results.
• To avoid "if"s in the main loop, the very first step is separated into a prolog, which is simpler than the general step.
• Below is the description of the computations for 1 line (of the 3 lines) in a general main-loop step. It starts by loading an EIGHT of 16-bit source pixels and unpacking them into two 32-bit QUADs:
  Load EIGHT of 16-bit source pixels: p0 p1 p2 p3 p4 p5 p6 p7
  Shuffle -> first unpacked 32-bit QUAD: p0 p1 p2 p3
  Shuffle -> second unpacked 32-bit QUAD: p4 p5 p6 p7
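The unpacking step corresponds to the following scalar sketch. It assumes unsigned 16-bit pixels that are zero-extended (which is what a byte shuffle with zero fill would produce); the presentation does not state the signedness, and `unpack_eight` is a name chosen here for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Widen an EIGHT of 16-bit pixels into two 32-bit QUADs.
   Assumption: pixels are unsigned, so the upper 16 bits are zero-filled. */
void unpack_eight(const uint16_t eight[8], int32_t quad0[4], int32_t quad1[4]) {
    assert(eight && quad0 && quad1);
    for (int i = 0; i < 4; ++i) {
        quad0[i] = (int32_t)eight[i];       /* p0..p3 */
        quad1[i] = (int32_t)eight[i + 4];   /* p4..p7 */
    }
}
```

Working in 32 bit leaves headroom for the sum of 27 products before the final normalization.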

Basic routine of algorithm: 2D convolution – 1 line
• Multiply the 2 QUADs (from the previous foil) by the three K values (denoted k-, kc, k+) using SSE4 mullo_epi32, resulting in 6 product QUADs. Treat them together with the 2 similar product QUADs saved at the previous step (in the figure, the k- and k+ products of pixels p[-4]..p[-1]).
• Using PALIGNR, select the appropriate QUADs and start/continue forming 3 sum QUADs:
  (1) RED frame: 2D convolution of the 1st source QUAD; finalized and stored at the end of the current step.
  (2) GREEN frame: 2D convolution of the 2nd source QUAD; finalized and stored at the end of the next step (or of the epilog).
  (Prev) YELLOW frame: 2D convolution of the previous 2nd source QUAD; finalized and stored at the end of the current step.
• Therefore, at the end of the current step, 2 resulting 2D convolution QUADs are stored: the PREVIOUS 2nd and the CURRENT 1st.
[Figure: rows of k-, kc and k+ products for pixels p[-4]..p[7], with the RED (1), GREEN (2) and YELLOW (Prev) frames marking the QUADs that feed each sum.]

Basic routine of algorithm: 2D convolution – 1 line, finalizing
• As already mentioned, each step processes and sums up data from 3 adjacent lines: it performs the computations from the previous foils for the 2 other lines as well, with the corresponding sets of kernel components.
• The prolog step does not compute the PREVIOUS sum and does not store it. The epilog step computes and stores the very last 2D convolution QUAD, in the same way as the PREVIOUS computation in a regular step.
• Finally, the above routine builds ONE 32-bit line of 2D convolution results.
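The prolog / main loop / epilog structure can be sketched in scalar C as follows. This is a model of the control flow only, under several assumptions that are not in the presentation: pixels outside the line are treated as zero, the line length is a multiple of 8, each QUAD is finished in one call instead of being accumulated across steps, and the names `add_quad`, `load_quad` and `conv2d_line` are invented here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* out[i] += k- * left + kc * cur[i] + k+ * right for the QUAD `cur`. */
static void add_quad(const int32_t prev[4], const int32_t cur[4],
                     const int32_t next[4], const int32_t k[3], int32_t out[4]) {
    for (int i = 0; i < 4; ++i) {
        int32_t left  = i > 0 ? cur[i - 1] : prev[3];
        int32_t right = i < 3 ? cur[i + 1] : next[0];
        out[i] += k[0] * left + k[1] * cur[i] + k[2] * right;
    }
}

static void load_quad(const uint16_t *p, int32_t q[4]) {
    for (int i = 0; i < 4; ++i) q[i] = p[i];
}

/* One 32-bit line of 2D convolution from 3 adjacent source lines.
   n must be a positive multiple of 8; k holds 3 rows of (k-, kc, k+).
   Assumption: pixels beyond the ends of a line count as zero. */
void conv2d_line(const uint16_t *const lines[3], const int32_t k[9],
                 int n, int32_t *out) {
    static const int32_t zero[4] = {0, 0, 0, 0};
    int32_t before[3][4], pending[3][4], q0[4], q1[4];
    assert(n >= 8 && n % 8 == 0);
    memset(out, 0, (size_t)n * sizeof *out);

    for (int l = 0; l < 3; ++l) {            /* prolog: no PREVIOUS QUAD yet */
        load_quad(lines[l], before[l]);
        load_quad(lines[l] + 4, pending[l]);
        add_quad(zero, before[l], pending[l], k + 3 * l, out);
    }
    for (int b = 8; b < n; b += 8)           /* main loop: one EIGHT per step */
        for (int l = 0; l < 3; ++l) {
            load_quad(lines[l] + b, q0);
            load_quad(lines[l] + b + 4, q1);
            add_quad(before[l], pending[l], q0, k + 3 * l, out + b - 4); /* PREVIOUS 2nd */
            add_quad(pending[l], q0, q1, k + 3 * l, out + b);            /* CURRENT 1st  */
            memcpy(before[l], q0, sizeof q0);
            memcpy(pending[l], q1, sizeof q1);
        }
    for (int l = 0; l < 3; ++l)              /* epilog: the very last QUAD */
        add_quad(before[l], pending[l], zero, k + 3 * l, out + n - 4);
}
```

Each general step finishes two QUADs (the previous 2nd and the current 1st), so the second QUAD of every EIGHT is always completed one step late, and the last one is completed by the epilog.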

Main routine of algorithm: 3D convolution – line by line
• To build the full 3D convolution stack, this routine runs over the lines (inner loop) of all slices (outer loop).
• For each source line, it computes three 32-bit 2D convolution lines, based on the previous, current and next slices, using the "2D convolution – 1 line" routine described above.
• The resulting 3D convolution line is built by summing up these 3 lines, normalizing by an arithmetic shift, and converting the result to 16 bit with packs_epi32.
[Figure: Line -1 2D conv. (slice -1, previous) + Line 0 2D conv. (slice 0, current) + Line +1 2D conv. (slice 1, next) -> summing up -> 32-bit 3D convolution -> shift (after the shift the values are actually 16 bit) -> packs_epi32 -> store -> final 16-bit 3D convolution EIGHT.]
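The final sum, shift and pack can be modelled per element in scalar C. The helper names `saturate16` and `finish_3d_line` and the `shift` parameter are assumptions for illustration; the saturation mirrors what packs_epi32 does for each 32-bit lane.

```c
#include <assert.h>
#include <stdint.h>

/* Signed saturation of a 32-bit value to 16 bit, as packs_epi32 does per lane. */
static int16_t saturate16(int32_t v) {
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Combine three 32-bit 2D convolution lines into one 16-bit 3D convolution line. */
void finish_3d_line(const int32_t *prev, const int32_t *cur, const int32_t *next,
                    int n, int shift, int16_t *out) {
    assert(shift >= 0 && shift < 32);
    for (int i = 0; i < n; ++i) {
        int32_t sum = prev[i] + cur[i] + next[i];
        /* >> on a negative int is implementation-defined in C; mainstream
           compilers implement it as an arithmetic shift, which is assumed here. */
        out[i] = saturate16(sum >> shift);
    }
}
```

The shift amount depends on the kernel scaling; a kernel whose 27 values sum to a power of two would be normalized exactly by the matching shift.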

Parallelizing by OpenMP and benchmarking
• To parallelize the above algorithm with OpenMP over the outer (slices) loop, three 32-bit working lines are allocated for each thread.
• Benchmarks with and without OpenMP on a 2-way HPTN machine (8 cores):

  Runs                                          Serial/SSE   SSE/(SSE+OpenMP)   Serial/(SSE+OpenMP)
  3 (equivalent of 3D gradient computation)     ~3           ~5.5               ~16.3
  10                                            ~3           ~6.3               ~18.6

• The SSE speed-up (3x) is close to the theoretical limit of 4x for operations on vectors of four 32-bit values.
• The additional OpenMP speed-up (5.5x-6.3x) brings the overall speed-up to 16.3x-18.6x.
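The parallel structure might look like the sketch below. It shows only the slice loop and the per-thread working lines: `line2d_stub` is a hypothetical placeholder that copies one line instead of computing a 2D convolution, the shift and saturation are left out, and `conv3d_slices` is an invented name. Built without OpenMP support, the pragmas are ignored and the code runs serially.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical placeholder for the "2D convolution – 1 line" routine:
   it only copies line y of a slice into a 32-bit working line. */
static void line2d_stub(const uint16_t *slice, int nx, int y, int32_t *out) {
    for (int x = 0; x < nx; ++x) out[x] = slice[y * nx + x];
}

/* Processes interior slices 1..nz-2. Returns 0 on success, -1 if a thread
   could not allocate its working lines. */
int conv3d_slices(const uint16_t *src, int nx, int ny, int nz, int16_t *dst) {
    int failed = 0;
    assert(nx > 0 && ny > 0 && nz >= 3);
    #pragma omp parallel reduction(|:failed)
    {
        /* Three 32-bit working lines per thread, allocated once per thread. */
        int32_t *work = malloc(3 * (size_t)nx * sizeof *work);
        if (!work) failed = 1;
        #pragma omp for
        for (int z = 1; z < nz - 1; ++z) {               /* outer loop: slices */
            for (int y = 0; work && y < ny; ++y) {       /* inner loop: lines */
                for (int s = 0; s < 3; ++s)              /* previous, current, next slice */
                    line2d_stub(src + (size_t)(z + s - 1) * ny * nx, nx, y,
                                work + s * nx);
                int16_t *out = dst + ((size_t)z * ny + y) * nx;
                for (int x = 0; x < nx; ++x)             /* sum of the 3 working lines */
                    out[x] = (int16_t)(work[x] + work[nx + x] + work[2 * nx + x]);
            }
        }
        free(work);
    }
    return failed ? -1 : 0;
}
```

Giving each thread its own working lines keeps the slice iterations independent, so no synchronization is needed inside the loop.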
