
D2MA: Accelerating Coarse-Grained Data Transfer for GPUs

D2MA: Accelerating Coarse-Grained Data Transfer for GPUs. D. Anoushe Jamshidi, Mehrzad Samadi, and Scott Mahlke, University of Michigan. PACT-23, August 27th, 2014.





Presentation Transcript


  1. D2MA: Accelerating Coarse-Grained Data Transfer for GPUs D. Anoushe Jamshidi, Mehrzad Samadi, and Scott Mahlke University of Michigan PACT-23 August 27th, 2014

  2. Achieving Peak GPU Performance: Theory and Practice — Matrix multiplication: it is not easy to fully utilize GPU capabilities! [Chart: matrix multiplication throughput for the hardware peak, CUBLAS, and the CUDA SDK implementation]

  3. A Quick Overview of GPUs [Diagram: a GPU chip contains multiple SMs connected through an interconnect and L2 cache to DRAM; each SM has fetch/decode/issue logic, a register file, SPs, LD/ST units, shared memory, an L1D cache, and writeback. Load data travels from DRAM through the interconnect and L1D cache to the SM, and results flow back through writeback]

  4. A Quick Overview of GPUs [Diagram: the same SM pipeline, annotated to show that a load that goes out to DRAM takes ~100's of cycles]

  5. How do GPUs Achieve Great Performance? • Effectively use available memory bandwidth • Exploit data reuse when possible [Diagram: four SPs each issuing a separate store to a different word of the same cache line]

  6. How do GPUs Achieve Great Performance? • Effectively use available memory bandwidth • Exploit data reuse when possible • Regular, well-coalesced memory accesses [Diagram: with coalescing, the four SPs' accesses to one cache line combine into a single store]

  7. Buffering to Optimize Bandwidth [Diagram: tiles Tile[0], Tile[1], Tile[2] are staged from DRAM (~100's of cycles away) into shared memory, which the SPs can access in <10 cycles] Buffer data in fast shared memory.

  8. Buffering Problem 1: Wasted Storage [Diagram: each tile takes a roundabout path to shared memory, passing through DRAM, the L1D cache, and the register file before the STS instructions land it] Data is duplicated in shared memory, the caches, and the register file.

  9. Buffering Problem 2: Code Expansion The simple buffering loop compiles into a wall of address arithmetic: [73 SASS instructions — mostly IADD/SHL/IMAD address calculations feeding a handful of LD.E global loads and STS shared-memory stores, ending in BAR.SYNC] [59 PTX instructions — the same pattern of add/shl/mul address computation around ld.global.f32 / st.shared.f32 pairs, ending in bar.sync] The CUDA source is only four lines:

  __global__ void CUDAkernel2DCT(float *dst, float *src, int ImgStride) {
      __shared__ float tile[TILE_HEIGHT * STRIDE];
      // Preliminary address calculations …
      float *tile_ptr = tile + <offset>;
      // Buffer into shared memory
      #pragma unroll
      for (unsigned int i = 0; i < TILE_SIZE; i++)
          tile_ptr[i * STRIDE] = src[i * ImgStride];
      __syncthreads();
      // Processing data …
  }

  Each tile transfer requires many arithmetic ops to calculate addresses. Address generation consumes ~50% of tile transfer cycles. CUDA: 4 lines → PTX: 59 instructions → SASS: 73 instructions.

  10. Objective • A tool to help achieve better memory performance • Inspired by Direct Memory Access (DMA) [Diagram: in a CPU system, a DMA engine moves data from DRAM into the cache without occupying the CPU]

  11. Objective • A tool to help achieve better memory performance • Inspired by Direct Memory Access (DMA) [Diagram: naïvely attaching a CPU-style DMA engine to a GPU SM] A traditional DMA engine is a poor fit: GPU kernels are not interruptible, and per-transfer bookkeeping is heavy.

  12. D2MA: The Big Picture [Diagram: a D2MA engine inside each SM transfers data directly from DRAM into that SM's shared memory]

  13. D2MA: Data-Parallel Direct Memory Access • Take advantage of regular memory accesses & unified L1D/shared memory space • Decouple tile transfers from SM resources • Simplify address generation • Improve memory pipelining • Direct path to shared memory [Diagram: the D2MA engine sits beside the SM pipeline and streams Tile[0] through the L1D cache's MSHRs directly into shared memory]

  14. D2MA Programming Model Original Code (CUDA: 4 lines; PTX: 59 instructions):

  __global__ void CUDAkernel2DCT(float *dst, float *src, int ImgStride) {
      __shared__ float tile[T_HEIGHT * T_STRIDE];
      int OffsThreadInRow = threadIdx.y * T_SIZE + threadIdx.x;
      int OffsThreadInCol = threadIdx.z * T_SIZE;
      src += FMUL(blockIdx.y * T_HEIGHT + OffsThreadInCol, ImgStride) + blockIdx.x * T_WIDTH + OffsThreadInRow;
      dst += FMUL(blockIdx.y * T_HEIGHT + OffsThreadInCol, ImgStride) + blockIdx.x * T_WIDTH + OffsThreadInRow;
      float *tile_ptr = tile + OffsThreadInCol * T_STRIDE + OffsThreadInRow;
      // Buffer into shared memory
      #pragma unroll
      for (unsigned int i = 0; i < T_SIZE; i++)
          tile_ptr[i * T_STRIDE] = src[i * ImgStride];
      __syncthreads();
      // Process rows then columns
      CUDAsubroutineInplaceDCTvector(tile + (OffsThreadInCol + threadIdx.x) * T_STRIDE + OffsThreadInRow - threadIdx.x, 1);
      CUDAsubroutineInplaceDCTvector(tile_ptr, T_STRIDE);
      for (unsigned int i = 0; i < T_SIZE; i++)
          dst[i * ImgStride] = tile_ptr[i * T_STRIDE];
  }

  D2MA-Optimized Code (CUDA: 4 lines; PTX: 12 instructions) — the buffering loop and barrier are replaced by four D2MA calls, and the per-thread offsets drop out of the src address:

  __global__ void D2MAkernel2DCT(float *dst, float *src, int ImgStride) {
      __shared__ float tile[T_HEIGHT * T_STRIDE];
      int OffsThreadInRow = threadIdx.y * T_SIZE + threadIdx.x;
      int OffsThreadInCol = threadIdx.z * T_SIZE;
      src += FMUL(blockIdx.y * T_HEIGHT, ImgStride) + blockIdx.x * T_WIDTH;
      dst += FMUL(blockIdx.y * T_HEIGHT + OffsThreadInCol, ImgStride) + blockIdx.x * T_WIDTH + OffsThreadInRow;
      float *tile_ptr = tile + OffsThreadInCol * T_STRIDE + OffsThreadInRow;
      // Buffer into shared memory via D2MA
      d2ma_configure_matrix(tile, src, T_HEIGHT, T_WIDTH, ImgStride);
      d2ma_set_datatype_float();
      d2ma_enable_shmem_blank_col();
      d2ma_ignite_buffer(0);
      // Process rows then columns
      CUDAsubroutineInplaceDCTvector(tile + (OffsThreadInCol + threadIdx.x) * T_STRIDE + OffsThreadInRow - threadIdx.x, 1);
      CUDAsubroutineInplaceDCTvector(tile_ptr, T_STRIDE);
      for (unsigned int i = 0; i < T_SIZE; i++)
          dst[i * ImgStride] = tile_ptr[i * T_STRIDE];
  }

  15. D2MA Overview [Diagram: the D2MA Engine sits inside the SM alongside the pipeline. It contains a controller, four buffer descriptor slots (Buf. 0–3, each holding a global address, shared address, element count, element size, and stride), AGEN logic, and a consistency checker; it connects to the L1D cache's MSHRs and to shared memory]

  16. D2MA Operation: Configuration Configuration instructions flow down the SM pipeline and fill one of the engine's buffer descriptors — on the slide, Buf. 0 is loaded with global address 0x1020, shared address 0x20, 64 elements of size 4, and stride 1:

  d2ma_configure_matrix(tile, src, T_HEIGHT, T_WIDTH, ImgStride);
  d2ma_set_datatype_float();
  d2ma_enable_shmem_blank_col();
  d2ma_ignite_buffer(0);

  17. D2MA Operation: Addr. Generation The ignite instruction for buffer #0 hands its descriptor to the AGEN logic, which seeds a global-memory address generator with 0x1020 and a shared-memory address generator with 0x20:

  d2ma_configure_matrix(tile, src, T_HEIGHT, T_WIDTH, ImgStride);
  d2ma_set_datatype_float();
  d2ma_enable_shmem_blank_col();
  d2ma_ignite_buffer(0);

  18. D2MA Operation: Memory Transfer The two address generators step in lockstep (global 0x1020 → 0x10A0, shared 0x20 → 0xA0), and each generated pair is recorded in the L1D cache's MSHRs as a global-to-shared mapping (e.g., 0x1020 → 0x20, 0x10A0 → 0xA0).

  19. D2MA Operation: Memory Transfer As data returns from DRAM, the MSHR entries redirect each fill to its shared-memory target (global 0x1020 → shared 0x20, global 0x10A0 → shared 0xA0), landing the tile directly in shared memory without passing through the register file.

  20. D2MA Operation: Enforcing Synchronization Without D2MA, the programmer must guarantee consistency: a thread block starts TX 1 and hits a thread barrier (syncthreads()); with no warp ready to schedule, the SM idles until the barrier is satisfied and TX 1 ends, and only then can threads load from the buffer and start TX 2 behind another barrier. With D2MA, there is no syncthreads(): a thread block starts TX 1, code independent of the buffer executes in the meantime, and a load from the buffer is simply re-executed until TX 1 ends, after which TX 2 can start. Synchronization is handled transparently by hardware.

  21. Experimental Evaluation • GPGPU-Sim v3.2.1 • Benchmarks from the NVIDIA CUDA SDK and Rodinia that perform shared memory buffering

  22. Results: Performance [Chart: per-benchmark speedups over the baseline] Geomean speedup: 1.36x

  23. Results: Cycle Breakdown [Chart: tile-transfer cycle breakdown, baseline vs. D2MA] Address generation: improved by 98%. Memory transfer: reduced by 66%. Average transfer cycles: ~5x reduction.

  24. Results: Overheads • Model of the D2MA Engine synthesized using Synopsys • Compared to an NVIDIA GTX 480: • Die area: 529 mm² • TDP: 250 W • One D2MA Engine per SM (15 SMs): • Area overhead: 0.016% • Power overhead: 0.022%

  25. Conclusion • Programmers must optimize memory traffic to achieve good performance on GPUs • Shared memory buffering improves bandwidth utilization, but still has overheads • D2MA decouples tiled data buffering from existing SM resources • Reduces the cost of address generation by 98% • Improves memory transfer times by 66% • Improves performance by 1.36x • Reduces dynamic instructions executed by 7% • Enforces synchronization transparently • Low area and power overheads (<0.03%)

  26. Thank You! • Questions? Image credits: http://www.opengraphicdesign.com/web/ajax-loading-graphics/ http://www.barriersandbollards.com/html-pages/mb50-1.png

  27. D2MA: Accelerating Coarse-Grained Data Transfer for GPUs D. Anoushe Jamshidi, Mehrzad Samadi, and Scott Mahlke University of Michigan PACT-23 August 27th, 2014

  28. Special Addressing Modes [Diagrams: blank column mode and halo addressing mode]

  29. Results: Dynamic Instruction Count Reduction
