
Memory Optimizations for Graphics Processing Units



  1. Memory Optimizations for Graphics Processing Units The material in these slides has been taken from the NVIDIA manuals (Best Practices Guide & Optimizing Matrix Transpose in CUDA), and from a paper by Ryoo et al [Ryoo12]. See "A bit of History" in the last slide

  2. What are Graphics Processing Units?

  3. The Wheel of Reincarnation A scandalously brief history of GPUs • In the good old days, the graphics hardware was just the VGA. All the processing was in software. • People started complaining: software is slow… • But, what do you want to run at the hardware level? Do you know how the frame buffer works? Can you program the VGA standard in any way?

  4. The Wheel of Reincarnation A scandalously brief history of GPUs • Some functions, like the rasterizer, are heavily used. What is rasterization? • Better to implement these functions in hardware. How can we implement a function at the hardware level? What is the advantage of implementing a function at the hardware level? Is there any drawback? Can we program (in any way) this hardware used to implement a specific function?

  5. Graphics Pipeline A scandalously brief history of GPUs • Graphics can be processed in a pipeline. • Transform, project, clip, display, etc… • Some functions, although different, can be implemented by very similar hardware. • Add a graphics API to program the shaders. • But this API is so specific… and the hardware is so powerful… what a waste! Shading is an example. Do you know what a shader is?

  6. General Purpose Hardware A scandalously brief history of GPUs • Let’s add an instruction set to the shader. • Let’s augment this hardware with general purpose integer operations. • What about adding some branching machinery too? • Hum… add a high-level language on top of this stuff. • Plus a lot of documentation. Advertise it! It should look cool! • Oh boy: we now have two general purpose processors. • We should unify them. The rant starts all over again…

  7. 1.5 turns around the wheel A scandalously brief history of GPUs • Let’s add a display processor to the display processor • After all, there are some operations that are really specific, and performance critical… Dedicated rasterizer

  8. Brief Timeline A scandalously brief history of GPUs

  9. Computer Organization An outrageously concise overview of the programming model. • GPUs exhibit different types of parallelism • Single Instruction Multiple Data (SIMD) • Single Program Multiple Data (SPMD) • In the end, we have a MSIMD hardware. Why are GPUs so parallel? Why don't traditional CPUs show off all this parallelism? We can think of SIMD hardware as a firing squad: we have a captain and a row of soldiers. The captain issues orders, such as set, aim, fire! And all the soldiers, upon hearing one of these orders, perform an action. They all do the same action, yet they use different guns and bullets.

  10. The Programming Environment An outrageously concise overview of the programming model. • There are two main programming languages used to program graphics processing units today: OpenCL and C for CUDA. • These are not the first languages developed for GPUs. They came after Cg and HLSL, for instance. • But they are much more general and expressive. • We will focus on C for CUDA. • This language lets the programmer explicitly write code that will run on the CPU, and code that will run on the GPU. • It is a heterogeneous programming language.

  11. From C to CUDA in one Step An outrageously concise overview of the programming model. • This program, written in C, performs a typical vector operation, reading two arrays, x and y, and updating y in place. • We will translate this program to C for CUDA.

void saxpy_serial(int n, float alpha, float *x, float *y) {
  for (int i = 0; i < n; i++)
    y[i] = alpha * x[i] + y[i];
}
// Invoke the serial function:
saxpy_serial(n, 2.0, x, y);

What is the asymptotic complexity of this program? How much can we parallelize this program? In a world with many – really many – processors, e.g., the PRAM world, what would be the complexity of this program?
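A quick sanity check on those questions (my own note, not on the original slide): the sequential loop performs n multiply-add operations, so it runs in O(n) time. Every iteration is independent of the others, so with at least n processors, as in the PRAM model, all iterations can execute at once and the time drops to O(1).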

  12. The first CUDA program An outrageously concise overview of the programming model.

void saxpy_serial(int n, float alpha, float *x, float *y) {
  for (int i = 0; i < n; i++)
    y[i] = alpha * x[i] + y[i];
}
// Invoke the serial function:
saxpy_serial(n, 2.0, x, y);

What happened to the loop in the CUDA program?

__global__ void saxpy_parallel(int n, float alpha, float *x, float *y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    y[i] = alpha * x[i] + y[i];
}
// Invoke the parallel kernel:
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);

  13. Understanding the Code An outrageously concise overview of the programming model. • Threads are grouped in warps, blocks and grids • Threads in different grids do not talk to each other • Grids are divided in blocks • Threads in the same block share memory and barriers • Blocks are divided in warps • Threads in the same warp follow the SIMD model.

__global__ void saxpy_parallel(int n, float alpha, float *x, float *y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    y[i] = alpha * x[i] + y[i];
}
// Invoke the parallel kernel:
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
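To make "share memory and barriers" concrete, here is a minimal sketch (not from the original slides; the kernel name and the reversal operation are illustrative) of threads in one block cooperating through __shared__ memory and a __syncthreads() barrier:

// Each block of 256 threads reverses its 256-element tile of the array.
// Assumes the array length is a multiple of 256.
__global__ void reverse_tile(float *data) {
  __shared__ float tile[256];                     // one copy per block, visible to its threads
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  tile[threadIdx.x] = data[i];                    // each thread loads one element
  __syncthreads();                                // barrier: wait until the whole tile is loaded
  data[i] = tile[blockDim.x - 1 - threadIdx.x];   // read an element loaded by another thread
}
// Launch example: reverse_tile<<<n / 256, 256>>>(d_data);

Without the barrier, a thread could read a tile position that its neighbor has not written yet; the barrier is what makes the intra-block sharing safe.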

  14. Raising the level An outrageously concise overview of the programming model. • CUDA programs contain CPU programs plus kernels • Kernels are called via a special syntax: kernel<<<dGrd, dBck>>>(A, B, w, C); • The C part of the program is compiled as traditional C. • The kernel part is first translated into PTX, and then this high-level assembly is translated into SASS.

__global__ void matMul1(float* B, float* C, float* A, int w) {
  float Pvalue = 0.0;
  for (int k = 0; k < w; ++k) {
    Pvalue += B[threadIdx.y * w + k] * C[k * w + threadIdx.x];
  }
  A[threadIdx.x + threadIdx.y * w] = Pvalue;
}

void Mul(const float* A, const float* B, int width, float* C) {
  int size = width * width * sizeof(float);
  // Load A and B to the device
  float* Ad; cudaMalloc((void**)&Ad, size);
  cudaMemcpy(Ad, A, size, cudaMemcpyHostToDevice);
  float* Bd; cudaMalloc((void**)&Bd, size);
  cudaMemcpy(Bd, B, size, cudaMemcpyHostToDevice);
  // Allocate C on the device
  float* Cd; cudaMalloc((void**)&Cd, size);
  // Compute the execution configuration assuming
  // the matrix dimensions are multiples of BLOCK_SIZE
  dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
  dim3 dimGrid(width / dimBlock.x, width / dimBlock.y);
  // Launch the device computation
  Muld<<<dimGrid, dimBlock>>>(Ad, Bd, width, Cd);
  // Read C from the device
  cudaMemcpy(C, Cd, size, cudaMemcpyDeviceToHost);
  // Free device memory
  cudaFree(Ad); cudaFree(Bd); cudaFree(Cd);
}

  15. Lowering the level An outrageously concise overview of the programming model. • CUDA assembly is called Parallel Thread Execution (PTX). What do you think an assembly language for parallel programming should have?

__global__ void saxpy_parallel(int n, float a, float *x, float *y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    y[i] = a * x[i] + y[i];
}

.entry saxpy_GPU (n, a, x, y) {
  .reg .u16 %rh<4>;
  .reg .u32 %r<6>;
  .reg .u64 %rd<8>;
  .reg .f32 %f<6>;
  .reg .pred %p<3>;
$LBB1__Z9saxpy_GPUifPfS_:
  mov.u16 %rh1, %ctaid.x;
  mov.u16 %rh2, %ntid.x;
  mul.wide.u16 %r1, %rh1, %rh2;
  cvt.u32.u16 %r2, %tid.x;
  add.u32 %r3, %r2, %r1;
  ld.param.s32 %r4, [n];
  setp.le.s32 %p1, %r4, %r3;
  @%p1 bra $Lt_0_770;
  .loc 28 13 0
  cvt.u64.s32 %rd1, %r3;
  mul.lo.u64 %rd2, %rd1, 4;
  ld.param.u64 %rd3, [y];
  add.u64 %rd4, %rd3, %rd2;
  ld.global.f32 %f1, [%rd4+0];
  ld.param.u64 %rd5, [x];
  add.u64 %rd6, %rd5, %rd2;
  ld.global.f32 %f2, [%rd6+0];
  ld.param.f32 %f3, [alpha];
  mad.f32 %f4, %f2, %f3, %f1;
  st.global.f32 [%rd4+0], %f4;
$Lt_0_770:
  exit;
}

  16. Memory Optimizations

  17. A Brief Overview of the GPU Threading Model • Each thread has local registers and local memory • Threads are grouped in warps • Warps run in SIMD exec • Warps are grouped in blocks • Shared memory + syncs • Blocks are grouped in grids • Each grid represents a kernel

  18. A Brief Overview of the GPU Threading Model How do different grids communicate? How do threads in the same block communicate? What determines the size of the block of threads? What determines the size of the grid of threads? What is the effect of branches in the warp execution?
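Some of these questions can be answered programmatically: the hardware limits that constrain block size, grid size, warp size and shared memory are exposed by the CUDA runtime. A minimal sketch (not from the original slides, only assuming the standard runtime API):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);               // properties of device 0
  printf("warp size:             %d\n", prop.warpSize);
  printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
  printf("max grid size (x):     %d\n", prop.maxGridSize[0]);
  printf("shared memory / block: %zu bytes\n", prop.sharedMemPerBlock);
  printf("registers / block:     %d\n", prop.regsPerBlock);
  return 0;
}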

  19. Going to the Archives • GPUs are much more memory intensive than traditional CPUs. Let's look at an example. • The GeForce 8800 processes 32 pixels per clock. Each pixel contains a color (3 bytes) and a depth (4 bytes), which are read and written. On average, 16 extra bytes of information are read for each pixel. How many bytes are processed per clock? To put these numbers in perspective, how much data is processed in each cycle of an ordinary x86 CPU?
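One plausible accounting, working only from the slide's own numbers: each pixel needs (3 + 4) bytes read plus (3 + 4) bytes written for color and depth, plus 16 extra bytes read, about 30 bytes per pixel; at 32 pixels per clock that is roughly 32 × 30 = 960 bytes of memory traffic per clock.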

  20. The GPU Archive • Registers: fast, yet few. Private to each thread • Shared memory: used by threads in the same block • Local memory: off-chip and slow. Private to each thread • Global memory: off-chip and slow. Used to provide communication between blocks and grids.

  21. The GPU Archive • Registers: fast, yet few. Private to each thread • Shared memory: used by threads in the same block • Local memory: off-chip and slow. Private to each thread • Global memory: off-chip and slow. Used to provide communication between blocks and grids. Why can't we leave all the data in registers? Why can't we leave all the data in shared memory? The CPU also has a memory hierarchy. Do you remember what this hierarchy looks like? Why do we have a memory hierarchy in the CPU as well?
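A minimal sketch (not from the original slides; kernel and variable names are illustrative, and it assumes a one-dimensional launch with blockDim.x ≤ 256 that exactly covers the arrays) of how each of these memory spaces shows up in CUDA source code:

__global__ void memory_spaces(float *in, float *out) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;   // i normally lives in a register
  float r = in[i];                                 // r: register; in[i]: global memory read
  __shared__ float buf[256];                       // shared memory: one copy per block
  float tmp[32];                                   // dynamically indexed per-thread array:
  tmp[i % 32] = 2.0f * r;                          //   placed in off-chip local memory
  buf[threadIdx.x] = tmp[i % 32];
  __syncthreads();
  out[i] = buf[threadIdx.x] + r;                   // global memory write
}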

  22. The Traveler’s Journal

  23. Inter-Device Communication

  24. The Interplanetary Trip • Copying data between GPU and CPU is pretty slow. CUDA provides some library functions for this: • cudaMalloc: allocates data in the GPU memory space • cudaMemset: fills a memory area with a value • cudaFree: frees the data in the GPU memory space • cudaMemcpy: copies data from CPU to GPU, or vice-versa

  25. The interplanetary trip What is each of these calls below doing?

int nbytes = 1024 * sizeof(int);
int *a_d = 0;
cudaMalloc((void**) &a_d, nbytes);
cudaMemset(a_d, 0, nbytes);
cudaFree(a_d);
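One possible answer, added here as commented code (the calls themselves are exactly those on the slide):

int nbytes = 1024 * sizeof(int);     // size of 1024 integers, in bytes
int *a_d = 0;                        // will hold a device (GPU) address
cudaMalloc((void**)&a_d, nbytes);    // allocate nbytes in the GPU global memory
cudaMemset(a_d, 0, nbytes);          // fill that device buffer with zeros
cudaFree(a_d);                       // release the device buffer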

  26. The interplanetary trip This program copies data from the host to the device, then moves the data inside the device, and finally brings it back to the host memory.

int main(int argc, char** argv) {
  float *a_h, *b_h;
  float *a_d, *b_d;
  int N = 14, nBytes, i;
  nBytes = N * sizeof(float);
  a_h = (float *)malloc(nBytes);
  b_h = (float *)malloc(nBytes);
  cudaMalloc((void **) &a_d, nBytes);
  cudaMalloc((void **) &b_d, nBytes);
  cudaMemset(a_d, 0, nBytes);
  cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice);
  cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice);
  cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);
  for (i = 0; i < N; i++) { ASSERT( a_h[i] == b_h[i] ); }
  free(a_h); free(b_h);
  cudaFree(a_d); cudaFree(b_d);
  return EXIT_SUCCESS;
}

  27. The interplanetary trip (full program as on slide 26) Highlighted step: float *a_h, *b_h; float *a_d, *b_d; int N = 14, nBytes, i;

  28. The interplanetary trip (full program as on slide 26) Highlighted step: nBytes = N * sizeof(float); a_h = (float*)malloc(nBytes); b_h = (float*)malloc(nBytes); [Diagram: host now holds a_h and b_h]

  29. The interplanetary trip (full program as on slide 26) Highlighted step: cudaMalloc((void**)&a_d, nBytes); cudaMalloc((void**)&b_d, nBytes); [Diagram: host holds a_h, b_h; device holds a_d, b_d]

  30. The interplanetary trip (full program as on slide 26) Highlighted step: cudaMemset(a_d, 0, nBytes); [Diagram: host holds a_h, b_h; device holds a_d, b_d]

  31. The interplanetary trip (full program as on slide 26) Highlighted step: cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice); [Diagram: host holds a_h, b_h; device holds a_d, b_d]

  32. The interplanetary trip (full program as on slide 26) Highlighted step: cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice); [Diagram: host holds a_h, b_h; device holds a_d, b_d]

  33. The interplanetary trip (full program as on slide 26) Highlighted step: cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost); [Diagram: host holds a_h, b_h; device holds a_d, b_d]

  34. The interplanetary trip (full program as on slide 26) Highlighted step: for (i = 0; i < N; i++) { ASSERT(a_h[i] == b_h[i]); } [Diagram: host holds a_h, b_h; device holds a_d, b_d]

  35. The interplanetary trip (full program as on slide 26) Highlighted step: free(a_h); free(b_h); [Diagram: host buffers freed; device still holds a_d and b_d]

  36. The interplanetary trip (full program as on slide 26) Highlighted step: cudaFree(a_d); cudaFree(b_d); [Diagram: both host and device buffers freed]

  37. Inter-device Communication • Inter-device communication, i.e., between the CPU and the GPU, should be minimized as much as possible. • Inter-device communication is orders of magnitude slower than reading data from shared memory, for instance. • That is one reason why GPUs are a poor fit for interactive applications that must exchange small amounts of data with the CPU all the time.

  38. Avoid traveling whenever you can

cudaMalloc((void**) &d_vec, mem_size);
cudaMemcpy(d_vec, h_vec, mem_size, cudaMemcpyHostToDevice);
kernel0<<< gridSize0, blockSize0 >>>(d_vec, vec_size);
kernel1<<< gridSize1, blockSize1 >>>(d_vec, vec_size);
cudaMemcpy(h_vec, d_vec, mem_size, cudaMemcpyDeviceToHost);
cudaFree(d_vec);

d_vec does not change between kernel calls. Therefore, there is no need to send it again! From maxSort, available on the course web page.

  39. Keep data on the GPU • Once data is sent to the GPU, it stays in the GPU's DRAM, even after the kernel has finished executing. • Try invoking kernels on data that is already on the GPU. Can you think of a situation in which it is better to leave a kernel, do some computation on the CPU, and then call another kernel? By the way, can you think of a problem that is inherently sequential?

  40. The GPU deserves complex work What is the complexity of copying data from the CPU to the GPU? Is it worth doing matrix sum on the GPU? Is it worth doing matrix multiplication on the GPU? (In these kernels, tid abbreviates threadIdx.)

__global__ void matSumKernel(float* S, float* A, float* B, int side) {
  int ij = tid.x + tid.y * side;
  S[ij] = A[ij] + B[ij];
}

__global__ void matMul1(float* B, float* C, float* A, int w) {
  float v = 0.0;
  for (int k = 0; k < w; ++k) {
    v += B[tid.y * w + k] * C[k * w + tid.x];
  }
  A[tid.x + tid.y * w] = v;
}

  41. Matrix Sum × Matrix Mul • Matrix Sum: (see the comparison below) • Matrix Mul: (see the comparison below)
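A back-of-the-envelope comparison, assuming n × n matrices of floats: both kernels must transfer O(n²) words between host and device. The matrix sum then performs only n² additions, roughly one arithmetic operation per word transferred, so the PCIe copy dominates and the GPU buys little. The matrix multiplication performs about 2n³ arithmetic operations on those same O(n²) words, i.e. O(n) operations per word transferred, so for large n the computation dwarfs the transfer and offloading it to the GPU pays off.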

  42. The ballerina’s waltz • Start working as soon as data is available - use a pipeline!

cudaMemcpy(dst, src, N * sizeof(float), dir);
kernel<<<N/nThreads, nThreads>>>(dst);

• C for CUDA has an API for asynchronous transfers:

sz = N * sizeof(float) / nStreams;
for (i = 0; i < nStreams; i++) {
  offset = i * N / nStreams;
  cudaMemcpyAsync(dst+offset, src+offset, sz, dir, stream[i]);
}
for (i = 0; i < nStreams; i++) {
  gridSize = N / (nThreads * nStreams);
  offset = i * N / nStreams;
  kernel<<<gridSize, nThreads, 0, stream[i]>>>(dst+offset);
}

What is the glue between data and computation in this example?
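The snippet above assumes the streams and buffers already exist. A minimal sketch of that missing setup (not from the original slides; sizes and names are illustrative). Note that asynchronous copies only overlap with kernel execution when the host buffer is page-locked, which is why it is allocated with cudaMallocHost:

const size_t N = 1 << 20;                          // example problem size
const int nStreams = 4;
cudaStream_t stream[nStreams];
float *src, *dst;
cudaMallocHost((void**)&src, N * sizeof(float));   // pinned (page-locked) host memory
cudaMalloc((void**)&dst, N * sizeof(float));       // device buffer
for (int i = 0; i < nStreams; i++)
  cudaStreamCreate(&stream[i]);                    // one stream per chunk of the pipeline
// ... issue the cudaMemcpyAsync / kernel calls shown above ...
for (int i = 0; i < nStreams; i++)
  cudaStreamSynchronize(stream[i]);                // wait for each pipeline stage to finish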

  43. The ballerina’s waltz • Asynchronous memory copy overlaps data transfer and GPU processing. This technique for obtaining parallelism, namely pipeline parallelism, is a pattern used in many different scenarios. Could you name other examples of pipeline parallelism?

  44. Global Memory Access

  45. The Silk Road Trip • Reading from or writing to the global memory is also slow. • But not as slow as transferring data between host and device. • The global memory is on the graphics board, but off-chip.

  46. The Matrix Multiplication Kernel In the PRAM model, what is the asymptotic complexity of the matrix multiplication problem? Could you translate this program to C for CUDA?

void matmult(float* B, float* C, float* A, int w) {
  for (unsigned int i = 0; i < w; ++i) {
    for (unsigned int j = 0; j < w; ++j) {
      A[i * w + j] = 0.0;
      for (unsigned int k = 0; k < w; ++k) {
        A[i * w + j] += B[i * w + k] * C[k * w + j];
      }
    }
  }
}

  47. Matrix Multiplication Kernel What is the asymptotic complexity of this program? Given width = 10, how many accesses to the global memory does this program perform? How can we tell how many floating-point operations per second this program will perform? In this example, each thread is responsible for multiplying one line of B by one column of C, to produce an element of A.

__global__ void matMul1(float* B, float* C, float* A, int Width) {
  float Pvalue = 0.0;
  int tx = blockIdx.x * blockDim.x + threadIdx.x;
  int ty = blockIdx.y * blockDim.y + threadIdx.y;
  for (int k = 0; k < Width; ++k) {
    Pvalue += B[tx * Width + k] * C[k * Width + ty];
  }
  A[ty + tx * Width] = Pvalue;
}

From matMul, available on the course web page.
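A worked answer to the counting question (my arithmetic, assuming one thread per output element and counting every global load and store): each thread executes Width loop iterations with two global loads each, plus one global store, i.e. 2·Width + 1 accesses. For width = 10 there are 10 × 10 = 100 threads, so the kernel performs 100 × (2·10 + 1) = 2,100 global memory accesses.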

  48. GFLOPS How many instructions do we find in block $Lt_0_1282 of this code? How many floating-point operations does this program perform in the inner loop? What, then, is the proportion of floating-point operations per GPU operation? If the GeForce 8800 GTX can perform 172.8 GFLOPS, how many GFLOPS could we expect from this code? And if we measure a much lower number, what could be the reasons for this bad performance behavior?

mov.f32 %f1, 0f00000000;
$Lt_0_1282:
  cvt.u64.u32 %rd3, %r7;
  mul.lo.u64 %rd4, %rd3, 4;
  ld.param.u64 %rd2, [B];
  add.u64 %rd5, %rd2, %rd4;
  ld.global.f32 %f2, [%rd5+0];
  cvt.u64.u32 %rd6, %r9;
  mul.lo.u64 %rd7, %rd6, 4;
  ld.param.u64 %rd1, [C];
  add.u64 %rd8, %rd1, %rd7;
  ld.global.f32 %f3, [%rd8+0];
  mad.f32 %f1, %f2, %f3, %f1;
  add.u32 %r7, %r7, 1;
  ld.param.s32 %r3, [w];
  add.u32 %r9, %r3, %r9;
  setp.ne.s32 %p2, %r7, %r8;
  @%p2 bra $Lt_0_1282;
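A rough way to reason about the last two questions (my estimate, not a measurement): the loop body above contains 16 instructions, and only one of them, the mad.f32, does floating-point work. If instructions issue at a uniform rate, only about 1/16 of the issue slots perform useful floating-point math, so we should expect something on the order of 172.8 / 16 ≈ 11 GFLOPS rather than the peak; in practice, uncoalesced global memory accesses can push the measured number even lower.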

  49. Coalesced Access to the Global Memory • The global memory is divided into segments of 16 cells. If 16 threads read data from the same segment, the memory access takes only one trip. • However, if each thread reads from a different segment, we may have a slow access to the global memory. • In order to know if we are accessing the same segment, we can do the following check: • take thread t whose id is 16 × n • find out the segment s that thread t is accessing • for every thread t + i, 1 ≤ i ≤ 15, see if t + i is accessing segment s. A coalesced and an uncoalesced access pattern are sketched below.
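A minimal sketch of the two access patterns (not from the original slides; kernel names are illustrative, and in-bounds launches are assumed). In the first kernel, the 16 threads of a half-warp touch 16 consecutive words, which fall in the same segment; in the second, consecutive threads are stride words apart and may each hit a different segment.

// Coalesced: consecutive threads read consecutive addresses.
__global__ void copy_coalesced(float *out, const float *in) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  out[i] = in[i];                       // one memory transaction per half-warp
}

// Uncoalesced: consecutive threads read addresses that are `stride` words apart.
__global__ void copy_strided(float *out, const float *in, int stride) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  out[i] = in[i * stride];              // the threads of one half-warp hit many segments
}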

  50. The Anatomy of a Block [Diagram: threads of a block laid out by their (tid.x, tid.y) coordinates; threads with consecutive tid.x and the same tid.y form a warp.] Warps should read data in the same segment.
