Fast Matrix Multiplication with CUDA

University of Central Florida






Overview

  • Platform

    • GEFORCE 8800GT, 512MB

    • Core: G92, Shader frequency: 1.5 GHz, Mem frequency: 900 MHz

  • Performance

    • Tuned for 4k x 4k matrices: 192 GFlops

  • Revisiting the Tiled version

  • Using large tiles

    • Base algorithm

    • Optimized algorithm

  • Tools and tips




The Tiled version

  • Tile Size: 16 x 16

  • 256 threads / block

  • 14 registers per thread, 2076 bytes of shared memory per block

  • Occupancy: 2/3

[Diagram: the 256 threads of a block arranged as a 16 x 16 tile, T0 through T255.]
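For reference, a minimal sketch of this tiled kernel; the signature and names are illustrative (not from the slides), and it assumes square matrices whose width W is a multiple of 16:

__global__ void matmul_tiled(const float *A, const float *B, float *C, int W)
{
    __shared__ float As[16][16];
    __shared__ float Bs[16][16];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * 16 + ty;  // this thread's element of C
    int col = blockIdx.x * 16 + tx;

    float Psub = 0.0f;
    for (int m = 0; m < W / 16; m++) {
        // Each thread stages one element of the current A and B tiles.
        As[ty][tx] = A[row * W + (m * 16 + tx)];
        Bs[ty][tx] = B[(m * 16 + ty) * W + col];
        __syncthreads();
        for (int k = 0; k < 16; k++)
            Psub += As[ty][k] * Bs[k][tx];  // the inner-loop MAD discussed below
        __syncthreads();
    }
    C[row * W + col] = Psub;
}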




The Tiled version – Memory access

  • Every half warp accesses contiguous memory locations (consecutive tx reads consecutive addresses), so global memory accesses are fully coalesced.





The Tiled version – Bank conflicts

  • No bank conflicts: a half warp reading As[ty][k] hits a single address, which the 16 banks serve as a broadcast; a half warp reading Bs[k][tx] hits 16 consecutive words in 16 different banks.

[Diagram: the 16 shared-memory banks and the broadcast access pattern for threads T0 through T255.]




The Tiled version - Bottlenecks

  • If the memory bandwidth and the ALUs were both fully used:

    • 14.4 Gfloat/s of loads vs. 168 GMAD/s of compute

    • i.e., a hardware balance of 11.67 MADs per float loaded

  • With 16 x 16 tiles:

    • W³/8 loads in total, i.e., only 8 MADs per float loaded

    • Too many loads!

  • Solution: larger tiles.

  • "Psub += As[ty][k] * Bs[k][tx]"

    • compiles to extra offset-calculation instructions (visible in the decuda output below).

  • 77 GFlops (4k x 4k)

mov.b32 $r12, s[$ofs4+0x0000]
mov.b32 $r7, s[$ofs4+0x0040]
mad.rn.f32 $r11, s[$ofs1+0x000c], $r11, $r13
add.b32 $ofs4, $ofs3, 0x0000019c
mad.rn.f32 $r13, s[$ofs1+0x0010], $r12, $r11
mov.b32 $r12, s[$ofs4+0x0000]
mov.b32 $r11, s[$ofs4+0x0040]
mad.rn.f32 $r7, s[$ofs1+0x0014], $r7, $r13
add.b32 $ofs4, $ofs3, 0x0000021c
mad.rn.f32 $r13, s[$ofs1+0x0018], $r12, $r7

[Diagram: operand reuse in a TWidth x THeight tile: each staged element is reused TWidth or THeight times.]
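The slide's numbers can be reproduced from the 8800GT's published specifications (57.6 GB/s memory bandwidth and 112 stream processors; these two figures are not on the slide itself):

$$\frac{57.6\ \text{GB/s}}{4\ \text{B/float}} = 14.4\ \text{Gfloat/s}, \qquad 112 \times 1.5\ \text{GHz} = 168\ \text{GMAD/s}, \qquad \frac{168}{14.4} \approx 11.67\ \text{MAD/float}.$$

With 16 x 16 tiles every loaded element is reused 16 times, so a W x W multiplication issues $2W^3/16 = W^3/8$ global loads against its $W^3$ MADs: only 8 MADs per float loaded, well below the 11.67 balance point, which is why this version is bandwidth-bound.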




Using Large Tiles

  • Each thread:

    • 17 loads per iteration (1 element of A, 16 of B)

    • W/16 iterations

    • about W³/15 loads in total, i.e., 16 x 16 = 256 MADs per 17 loads, roughly 15 MADs per load

[Diagram: each block of 256 threads computes a 16 x 256 tile of C, kept in registers as 16 Psubs per thread; the current 16 x 16 sub-tile of A sits in shared memory.]
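The per-thread state implied by the diagram, as a sketch; the declarations are assumptions, but the names ashare, c, a, b, and bb match the code slides that follow:

__shared__ float ashare[16][17];  // current 16 x 16 sub-tile of A (the pitch of 17 is explained under Solution 3)
float c[16] = {0};                // the 16 Psubs each thread accumulates, kept in registers
float a, b[4], bb[4];             // prefetch registers used by the optimized version (the base version uses a scalar b)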




Using Large Tiles - Algorithm

  • For each sub-tile in A & B:

    • Read the sub-tile of A into shared memory, 1 number per thread.

    • For each of the 16 numbers in B:

      • Read one number from B into a register.

      • Perform one MAD for each Psub.

  • To remove the extra offset-calculation instructions, we want the sub-tile of A stored in column-major format in shared memory (see the code slide below).

    • But …

[Diagram: the A, B, and C tiles and threads T0 T1 T2 … T255.]




Using Large Tiles - Algorithm

  • Solution 1:

    • Transpose A to column-major format first.

[Diagram: Solution 1: A transposed into Shared A; threads T0 … T15 write to banks B0 … B15.]

  • Solution 2:

    • Read A in row-major format and write it to shared memory in column-major format.

    • But this causes bank conflicts when writing to shared memory!

[Diagram: Solution 2: A read row-major, written column-major into Shared A; the column writes from threads T0 … T15 collide in the banks.]




Using Large Tiles - Algorithm

  • Solution 3:

    • Pad Shared A with one empty row.

    • No bank conflicts, and no need to transpose A.

    • 164 GFlops (4k x 4k).

[Diagram: Solution 3: Shared A padded by one row; successive threads T0 … T15 now map to different banks B0, B1, … B15, so the writes are conflict-free.]
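A sketch of why the padding works; the 16 x 17 shape is an assumption consistent with the ashare[tx][ty] indexing on the next slide:

// With a pitch of 17 words instead of 16, the 16 threads of a half warp
// writing ashare[tx][ty] touch addresses 17 words apart; since 17 mod 16 = 1,
// the accesses fall into 16 distinct banks and nothing is serialized.
__shared__ float ashare[16][17];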




Using Large Tiles - Code

for (int i = 0; i < MATRIX_WIDTH / 16; i++)
{
    // Stage one element of the current 16 x 16 sub-tile of A per thread.
    ashare[tx][ty] = A[0];
    __syncthreads();

#pragma unroll // 150 GFlops (4k x 4k) without unroll
    for (int k = 0; k < 16; k++)
    {
        // One element of B per step, kept in a register.
        b = B[k * MATRIX_WIDTH];
        // 16 MADs: accumulate b against one row of the A tile.
        comp16(b, &ashare[k][0], c);
    }

    // Advance to the next sub-tile.
    A += 16;
    B += 16 * MATRIX_WIDTH;
    __syncthreads();
}
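comp16 itself is not shown in the transcript; a minimal sketch consistent with its call sites (the body is an assumption):

// Hypothetical comp16: multiply one B value against 16 staged A values and
// accumulate into the thread's 16 Psubs (c stays in registers when unrolled).
__device__ void comp16(float b, float *as, float *c)
{
#pragma unroll
    for (int j = 0; j < 16; j++)
        c[j] += as[j] * b;
}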




Using Large Tiles - Optimized

The optimized version software-pipelines the global loads: while comp16 consumes the four B values saved in bb[], the next four are already being fetched into b[], and the next element of A is fetched into the register a.

do
{
    // Stage the A element prefetched in the previous iteration.
    ashare[tx][ty] = a;
    __syncthreads();

    // Prefetch the A element of the next tile.
    a = A[0];

    // Consume B values 0-3 (in bb) while fetching values 4-7 into b.
    bb[0] = b[0]; bb[1] = b[1]; bb[2] = b[2]; bb[3] = b[3];
    b[0] = B[4 * MATRIX_WIDTH];
    b[1] = B[5 * MATRIX_WIDTH];
    b[2] = B[6 * MATRIX_WIDTH];
    b[3] = B[7 * MATRIX_WIDTH];
    for (int i = 0; i < 4; i++)
        comp16(bb[i], &ashare[i][0], c);

    // Consume B values 4-7 while fetching values 8-11.
    bb[0] = b[0]; bb[1] = b[1]; bb[2] = b[2]; bb[3] = b[3];
    b[0] = B[8 * MATRIX_WIDTH];
    b[1] = B[9 * MATRIX_WIDTH];
    b[2] = B[10 * MATRIX_WIDTH];
    b[3] = B[11 * MATRIX_WIDTH];
    for (int i = 0; i < 4; i++)
        comp16(bb[i], &ashare[i + 4][0], c);

    // Consume B values 8-11 while fetching values 12-15.
    bb[0] = b[0]; bb[1] = b[1]; bb[2] = b[2]; bb[3] = b[3];
    b[0] = B[12 * MATRIX_WIDTH];
    b[1] = B[13 * MATRIX_WIDTH];
    b[2] = B[14 * MATRIX_WIDTH];
    b[3] = B[15 * MATRIX_WIDTH];
    for (int i = 0; i < 4; i++)
        comp16(bb[i], &ashare[i + 8][0], c);

    // Advance to the next tile, then consume B values 12-15 while
    // fetching values 0-3 of the new tile.
    bb[0] = b[0]; bb[1] = b[1]; bb[2] = b[2]; bb[3] = b[3];
    A += 16;
    B += 16 * MATRIX_WIDTH;
    b[0] = B[0 * MATRIX_WIDTH];
    b[1] = B[1 * MATRIX_WIDTH];
    b[2] = B[2 * MATRIX_WIDTH];
    b[3] = B[3 * MATRIX_WIDTH];
    for (int i = 0; i < 4; i++)
        comp16(bb[i], &ashare[i + 12][0], c);

    __syncthreads();
} while (A < Alast);

... // last iteration (no prefetch) handled separately




Using Large Tiles - Performance

Execution time is measured as the computation time on the GPU.




Using Large Tiles – Performance 2

Gflops (comp): excluding CPU-GPU data transfer time.

Gflops (total): including CPU-GPU data transfer time.




Tools - CUDA GPU Occupancy Calculator

  • An NVIDIA spreadsheet: enter the threads per block and the register and shared-memory usage reported by ptxas, and it computes the occupancy per multiprocessor.




Tools - decuda

  • decuda disassembles the native GPU code inside .cubin files (the assembly listing on the bottleneck slide is decuda output).

  • Developed by Wladimir J. van der Laan, a PhD candidate at the Institute of Mathematics and Computing Science of the University of Groningen.

  • http://www.cs.rug.nl/~wladimir/decuda/




Tools – CUDA Visual Profiler

  • http://forums.nvidia.com/index.php?showtopic=57443

    • Reports GPU Time, CPU Time, and Occupancy.

    • Profiler counters:

      • gld_incoherent: number of non-coalesced global memory loads

      • gld_coherent: number of coalesced global memory loads

      • gst_incoherent: number of non-coalesced global memory stores

      • gst_coherent: number of coalesced global memory stores

      • local_load: number of local memory loads

      • local_store: number of local memory stores

      • branch: number of branch events (instruction and/or sync stack)

      • divergent_branch: number of divergent branches within a warp

      • instructions: number of dynamic instructions (in fetch)

      • warp_serialize: number of threads in a warp serialized based on address (GRF or constant)

      • cta_launched: number of CTAs launched on the PM TPC




Tips

  • Get the per-kernel usage of reg, smem, cmem, and lmem with --ptxas-options=-v:

    • nvcc -m32 -o data/matrix_kernel.cubin -cubin matrix_kernel.cu --compiler-options -fno-strict-aliasing -I. -I/usr/local/cuda/include -I../../common/inc -DUNIX -O3 --ptxas-options=-v

  • Compile with --maxrregcount=N to cap the registers per thread (e.g., --maxrregcount=32) and raise occupancy, at the cost of possible spills to local memory.
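With -v, ptxas prints each kernel's resource usage. An illustrative line (values chosen to match the tiled version's 14 registers and 2076 bytes of shared memory, not copied from a real run):

ptxas info    : Used 14 registers, 2076 bytes smem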




References

  • NVIDIA CUDA Samples:

    • http://www.nvidia.com/object/cuda_sample_linear_algebra.html

    • Simple CUBLAS

    • Matrix Multiplication

    • Matrix Transpose

  • NVIDIA Forum:

    • http://forums.nvidia.com/index.php?showtopic=47689&st=0
