
D2MA: Accelerating Coarse-Grained Data Transfer for GPUs

D2MA: Accelerating Coarse-Grained Data Transfer for GPUs. D. Anoushe Jamshidi, Mehrzad Samadi, and Scott Mahlke, University of Michigan. PACT-23, August 27th, 2014.





Presentation Transcript


  1. D2MA: Accelerating Coarse-Grained Data Transfer for GPUs D. Anoushe Jamshidi, Mehrzad Samadi, and Scott Mahlke University of Michigan PACT-23 August 27th, 2014

  2. Achieving Peak GPU Performance: Theory and Practice — Matrix multiplication: it is not easy to fully utilize GPU capabilities! [Chart: matrix multiplication throughput for the hardware peak, CUBLAS, and the CUDA SDK implementation]

  3. A Quick Overview of GPUs [Diagram: a GPU chip contains multiple SMs connected through an interconnect and L2 cache to DRAM; each SM has fetch/decode/issue logic, a register file, SPs, LD/ST units, shared memory, an L1D cache, and writeback. Load data travels from DRAM through the interconnect and L1D cache to the SM, and results flow back through writeback]

  4. A Quick Overview of GPUs [Diagram: the same SM pipeline, annotated to show that a load that goes out to DRAM takes ~100's of cycles]

  5. How do GPUs Achieve Great Performance? • Effectively use available memory bandwidth • Exploit data reuse when possible [Diagram: four SPs each issuing a separate store to a different word of the same cache line]

  6. How do GPUs Achieve Great Performance? • Effectively use available memory bandwidth • Exploit data reuse when possible • Regular, well-coalesced memory accesses [Diagram: with coalescing, the four SPs' accesses to one cache line combine into a single store]

  7. Buffering to Optimize Bandwidth [Diagram: tiles Tile[0], Tile[1], Tile[2] are staged from DRAM (~100's of cycles away) into shared memory, which the SPs can access in <10 cycles] Buffer data in fast shared memory.

  8. Buffering Problem 1: Wasted Storage [Diagram: each tile takes a roundabout path to shared memory, passing through DRAM, the L1D cache, and the register file before the STS instructions land it] Data is duplicated in shared memory, the caches, and the register file.

  9. Buffering Problem 2: Code Expansion The simple buffering loop compiles into a wall of address arithmetic: [73 SASS instructions — mostly IADD/SHL/IMAD address calculations feeding a handful of LD.E global loads and STS shared-memory stores, ending in BAR.SYNC] [59 PTX instructions — the same pattern of add/shl/mul address computation around ld.global.f32 / st.shared.f32 pairs, ending in bar.sync] The CUDA source is only four lines:

  __global__ void CUDAkernel2DCT(float *dst, float *src, int ImgStride) {
      __shared__ float tile[TILE_HEIGHT * STRIDE];
      // Preliminary address calculations …
      float *tile_ptr = tile + <offset>;
      // Buffer into shared memory
      #pragma unroll
      for (unsigned int i = 0; i < TILE_SIZE; i++)
          tile_ptr[i * STRIDE] = src[i * ImgStride];
      __syncthreads();
      // Processing data …
  }

  Each tile transfer requires many arithmetic ops to calculate addresses. Address generation consumes ~50% of tile transfer cycles. CUDA: 4 lines → PTX: 59 instructions → SASS: 73 instructions.

  10. Objective • A tool to help achieve better memory performance • Inspired by Direct Memory Access (DMA) [Diagram: in a CPU system, a DMA engine moves data from DRAM into the cache without occupying the CPU]

  11. Objective • A tool to help achieve better memory performance • Inspired by Direct Memory Access (DMA) [Diagram: naïvely attaching a CPU-style DMA engine to a GPU SM] A traditional DMA engine is a poor fit: GPU kernels are not interruptible, and per-transfer bookkeeping is heavy.

  12. D2MA: The Big Picture [Diagram: a D2MA engine inside each SM transfers data directly from DRAM into that SM's shared memory]

  13. D2MA: Data-Parallel Direct Memory Access • Take advantage of regular memory accesses & unified L1D/shared memory space • Decouple tile transfers from SM resources • Simplify address generation • Improve memory pipelining • Direct path to shared memory [Diagram: the D2MA engine sits beside the SM pipeline and streams Tile[0] through the L1D cache's MSHRs directly into shared memory]

  14. D2MA Programming Model Original Code (CUDA: 4 lines; PTX: 59 instructions):

  __global__ void CUDAkernel2DCT(float *dst, float *src, int ImgStride) {
      __shared__ float tile[T_HEIGHT * T_STRIDE];
      int OffsThreadInRow = threadIdx.y * T_SIZE + threadIdx.x;
      int OffsThreadInCol = threadIdx.z * T_SIZE;
      src += FMUL(blockIdx.y * T_HEIGHT + OffsThreadInCol, ImgStride) + blockIdx.x * T_WIDTH + OffsThreadInRow;
      dst += FMUL(blockIdx.y * T_HEIGHT + OffsThreadInCol, ImgStride) + blockIdx.x * T_WIDTH + OffsThreadInRow;
      float *tile_ptr = tile + OffsThreadInCol * T_STRIDE + OffsThreadInRow;
      // Buffer into shared memory
      #pragma unroll
      for (unsigned int i = 0; i < T_SIZE; i++)
          tile_ptr[i * T_STRIDE] = src[i * ImgStride];
      __syncthreads();
      // Process rows then columns
      CUDAsubroutineInplaceDCTvector(tile + (OffsThreadInCol + threadIdx.x) * T_STRIDE + OffsThreadInRow - threadIdx.x, 1);
      CUDAsubroutineInplaceDCTvector(tile_ptr, T_STRIDE);
      for (unsigned int i = 0; i < T_SIZE; i++)
          dst[i * ImgStride] = tile_ptr[i * T_STRIDE];
  }

  D2MA-Optimized Code (CUDA: 4 lines; PTX: 12 instructions) — the buffering loop and barrier are replaced by four D2MA calls, and the per-thread offsets drop out of the src address:

  __global__ void D2MAkernel2DCT(float *dst, float *src, int ImgStride) {
      __shared__ float tile[T_HEIGHT * T_STRIDE];
      int OffsThreadInRow = threadIdx.y * T_SIZE + threadIdx.x;
      int OffsThreadInCol = threadIdx.z * T_SIZE;
      src += FMUL(blockIdx.y * T_HEIGHT, ImgStride) + blockIdx.x * T_WIDTH;
      dst += FMUL(blockIdx.y * T_HEIGHT + OffsThreadInCol, ImgStride) + blockIdx.x * T_WIDTH + OffsThreadInRow;
      float *tile_ptr = tile + OffsThreadInCol * T_STRIDE + OffsThreadInRow;
      // Buffer into shared memory via D2MA
      d2ma_configure_matrix(tile, src, T_HEIGHT, T_WIDTH, ImgStride);
      d2ma_set_datatype_float();
      d2ma_enable_shmem_blank_col();
      d2ma_ignite_buffer(0);
      // Process rows then columns
      CUDAsubroutineInplaceDCTvector(tile + (OffsThreadInCol + threadIdx.x) * T_STRIDE + OffsThreadInRow - threadIdx.x, 1);
      CUDAsubroutineInplaceDCTvector(tile_ptr, T_STRIDE);
      for (unsigned int i = 0; i < T_SIZE; i++)
          dst[i * ImgStride] = tile_ptr[i * T_STRIDE];
  }

  15. D2MA Overview [Diagram: the D2MA Engine sits inside the SM alongside the pipeline. It contains a controller, four buffer descriptor slots (Buf. 0–3, each holding a global address, shared address, element count, element size, and stride), AGEN logic, and a consistency checker; it connects to the L1D cache's MSHRs and to shared memory]

  16. D2MA Operation: Configuration Configuration instructions flow down the SM pipeline and fill one of the engine's buffer descriptors — on the slide, Buf. 0 is loaded with global address 0x1020, shared address 0x20, 64 elements of size 4, and stride 1:

  d2ma_configure_matrix(tile, src, T_HEIGHT, T_WIDTH, ImgStride);
  d2ma_set_datatype_float();
  d2ma_enable_shmem_blank_col();
  d2ma_ignite_buffer(0);

  17. D2MA Operation: Addr. Generation The ignite instruction for buffer #0 hands its descriptor to the AGEN logic, which seeds a global-memory address generator with 0x1020 and a shared-memory address generator with 0x20:

  d2ma_configure_matrix(tile, src, T_HEIGHT, T_WIDTH, ImgStride);
  d2ma_set_datatype_float();
  d2ma_enable_shmem_blank_col();
  d2ma_ignite_buffer(0);

  18. D2MA Operation: Memory Transfer The two address generators step in lockstep (global 0x1020 → 0x10A0, shared 0x20 → 0xA0), and each generated pair is recorded in the L1D cache's MSHRs as a global-to-shared mapping (e.g., 0x1020 → 0x20, 0x10A0 → 0xA0).

  19. D2MA Operation: Memory Transfer As data returns from DRAM, the MSHR entries redirect each fill to its shared-memory target (global 0x1020 → shared 0x20, global 0x10A0 → shared 0xA0), landing the tile directly in shared memory without passing through the register file.

  20. D2MA Operation: Enforcing Synchronization Without D2MA, the programmer must guarantee consistency: a thread block starts TX 1 and hits a thread barrier (syncthreads()); with no warp ready to schedule, the SM idles until the barrier is satisfied and TX 1 ends, and only then can threads load from the buffer and start TX 2 behind another barrier. With D2MA, there is no syncthreads(): a thread block starts TX 1, code independent of the buffer executes in the meantime, and a load from the buffer is simply re-executed until TX 1 ends, after which TX 2 can start. Synchronization is handled transparently by hardware.

  21. Experimental Evaluation • GPGPU-Sim v3.2.1 • Benchmarks from the NVIDIA CUDA SDK and Rodinia that perform shared memory buffering

  22. Results: Performance [Chart: per-benchmark speedups over the baseline] Geomean speedup: 1.36x

  23. Results: Cycle Breakdown [Chart: tile-transfer cycle breakdown, baseline vs. D2MA] Address generation: improved by 98%. Memory transfer: reduced by 66%. Average transfer cycles: ~5x reduction.

  24. Results: Overheads • Model of the D2MA Engine synthesized using Synopsys • Compared to an NVIDIA GTX 480: • Die area: 529 mm² • TDP: 250 W • One D2MA Engine per SM (15 SMs): • Area overhead: 0.016% • Power overhead: 0.022%

  25. Conclusion • Programmers must optimize memory traffic to achieve good performance on GPUs • Shared memory buffering improves bandwidth utilization, but still has overheads • D2MA decouples tiled data buffering from existing SM resources • Reduces the cost of address generation by 98% • Improves memory transfer times by 66% • Improves performance by 1.36x • Reduces dynamic instructions executed by 7% • Enforces synchronization transparently • Low area and power overheads (<0.03%)

  26. Thank You! • Questions? Image credits: http://www.opengraphicdesign.com/web/ajax-loading-graphics/ http://www.barriersandbollards.com/html-pages/mb50-1.png

  27. D2MA: Accelerating Coarse-Grained Data Transfer for GPUs D. Anoushe Jamshidi, Mehrzad Samadi, and Scott Mahlke University of Michigan PACT-23 August 27th, 2014

  28. Special Addressing Modes [Diagrams: blank column mode and halo addressing mode]

  29. Results: Dynamic Instruction Count Reduction
