
GPU Memory Details



Presentation Transcript


  1. GPU Memory Details
     Martin Kruliš

  2. Overview
  • Note that details about host memory interconnection are platform specific.
  • [Diagram: host memory attached to the CPU at ~25 GB/s; the CPU connected to the GPU device over PCI Express (16/32 GB/s); on the GPU chip, each SMP contains cores, registers, and an L1 cache, all SMPs share an L2 cache, and the off-chip global memory is reached at > 100 GB/s]

  3. Host-Device Transfers
  • PCIe Transfers
    • Much slower than internal GPU data transfers
    • Issued explicitly by host code: cudaMemcpy(dst, src, size, direction);
      • With one exception: when the GPU memory is mapped into the host memory space
    • The transfer call has significant overhead, so bulk transfers are preferred
  • Overlapping
    • Up to 2 asynchronous transfers can run while the GPU is computing (see the sketch below)
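A minimal sketch of these transfer patterns (not part of the original slides): one bulk synchronous copy, followed by two streams that overlap asynchronous copies with kernel execution. The kernel name process and all sizes are illustrative; pinned host memory is assumed so the asynchronous copies can proceed.

    #include <cuda_runtime.h>

    // Illustrative kernel: doubles every element of its chunk
    __global__ void process(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main() {
        const int n = 1 << 22, half = n / 2;
        const size_t bytes = n * sizeof(float);

        float *host, *dev;
        cudaMallocHost(&host, bytes);      // pinned memory enables async copies
        cudaMalloc(&dev, bytes);

        // One bulk synchronous transfer (preferred over many small ones)
        cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);

        // Overlap: while one stream computes on its half of the data,
        // the other stream can already be transferring the second half.
        cudaStream_t s[2];
        for (int k = 0; k < 2; ++k) cudaStreamCreate(&s[k]);
        for (int k = 0; k < 2; ++k) {
            float *chunk = dev + k * half;
            cudaMemcpyAsync(chunk, host + k * half, half * sizeof(float),
                            cudaMemcpyHostToDevice, s[k]);
            process<<<(half + 255) / 256, 256, 0, s[k]>>>(chunk, half);
        }
        cudaDeviceSynchronize();

        cudaFree(dev);
        cudaFreeHost(host);
        return 0;
    }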

  4. Global Memory
  • Global Memory Properties
    • Off-chip, but on the GPU device
    • High bandwidth and high latency (~100 GB/s, 400-600 clock cycles)
    • Operated in transactions: contiguous aligned segments of 32 B - 128 B
    • Data are cached in L2
      • On CC < 3.0 also cached in L1, configurable by compiler flag:
        • -Xptxas -dlcm=ca (Cache Always, i.e. also in L1; default)
        • -Xptxas -dlcm=cg (Cache Global, i.e. L2 only)
      • CC 3.x reserves L1 for local memory caching

  5. Global Memory
  • Coalesced Transfers
    • The number of transactions caused by a global memory access depends on the access pattern
    • Certain access patterns are optimized:
      • CC 1.x: threads sequentially access an aligned memory block; subsequent threads access subsequent words
      • CC 2.0 and later: threads access an aligned memory block; the accesses within the block can be permuted

  6. Global Memory
  • Access Patterns: perfectly aligned sequential access

  7. Global Memory
  • Access Patterns: perfectly aligned with permutation

  8. Global Memory
  • Access Patterns: continuous sequential, but misaligned

  9. Global Memory
  • Coalesced Loads Impact
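A minimal sketch (not from the slides) contrasting a coalesced and a strided global memory access pattern; the kernel names are illustrative:

    // Coalesced: thread i reads element i, so a warp touches one contiguous,
    // aligned segment and the access is served by few transactions.
    __global__ void copyCoalesced(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Strided: thread i reads element i * stride, so a warp touches many
    // separate segments and the hardware issues several transactions per warp.
    __global__ void copyStrided(const float *in, float *out, int n, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i * stride < n) out[i] = in[i * stride];
    }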

  10. Shared Memory
  • Memory Shared by the SM
    • Divided into banks
      • Each bank can be accessed independently
      • Consecutive 32-bit words are in consecutive banks
      • Optionally, a 64-bit word division is used (CC 3.x)
    • Bank conflicts are serialized
      • Except for reading the same address (broadcast)
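A common way to avoid bank conflicts is to pad shared memory arrays so that column-wise accesses do not map to a single bank. A minimal sketch (not from the slides), assuming a square matrix whose dimension is a multiple of the tile size and a 32x32 thread block:

    #define TILE 32

    // Without the +1 padding, the column-wise read tile[threadIdx.x][threadIdx.y]
    // would make all 32 threads of a warp hit the same bank (stride 32) and
    // serialize; padding shifts consecutive rows into different banks.
    __global__ void transposeTile(const float *in, float *out, int width) {
        __shared__ float tile[TILE][TILE + 1];

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];
        __syncthreads();

        int tx = blockIdx.y * TILE + threadIdx.x;
        int ty = blockIdx.x * TILE + threadIdx.y;
        out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];
    }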

  11. Shared Memory
  • Linear Addressing
    • Each thread in a warp accesses a different memory bank
    • No collisions

  12. Shared Memory
  • Linear Addressing with Stride
    • Each thread accesses the 2*i-th item
    • 2-way conflicts (2x slowdown) on CC < 3.0
    • No collisions on CC 3.x, due to the 64 bits per cycle throughput

  13. Shared Memory
  • Linear Addressing with Stride
    • Each thread accesses the 3*i-th item
    • No collisions, since the number of banks is not divisible by the stride

  14. Shared Memory
  • Broadcast
    • One set of threads accesses a value in bank #12 and the remaining threads access a value in bank #20
    • Broadcasts are served independently on CC 1.x, i.e., the sample below causes a 2-way conflict
    • CC 2.x and 3.x serve all broadcasts simultaneously

  15. Shared Memory
  • Shared Memory vs. L1 Cache
    • On most devices, they are the same resource
    • The division can be set for each kernel by cudaFuncSetCacheConfig(kernel, cacheConfig);
      • The cache configuration can prefer L1 or shared memory (i.e., selecting 48 kB of the 64 kB for the preferred one)
  • Shared Memory Configuration
    • Some devices (CC 3.x) can configure the memory banks: cudaFuncSetSharedMemConfig(kernel, config);
      • The config selects between 32-bit and 64-bit mode
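A minimal sketch of both calls; the kernel is a placeholder and the enum values are the standard CUDA runtime constants:

    #include <cuda_runtime.h>

    // Any __global__ function works here; this one is just illustrative
    __global__ void myKernel(float *data) {
        data[threadIdx.x] *= 2.0f;
    }

    int main() {
        // Prefer the larger shared memory partition (e.g. 48 kB shared / 16 kB L1)
        cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);

        // On CC 3.x devices, request 64-bit shared memory banks
        cudaFuncSetSharedMemConfig(myKernel, cudaSharedMemBankSizeEightByte);
        return 0;
    }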

  16. Registers
  • Registers
    • One register pool per multiprocessor: 8-64k of 32-bit registers (depending on CC)
    • Register allocation is determined by the compiler
    • As fast as the cores (no extra clock cycles)
    • Read-after-write dependency costs ~24 clock cycles
      • Can be hidden if there are enough active warps
    • The hardware scheduler (and the compiler) attempts to avoid register bank conflicts whenever possible
      • The programmer has no direct control over these conflicts

  17. Local Memory
  • Per-thread Global Memory
    • Allocated automatically by the compiler
      • The compiler may report the amount of allocated local memory (use --ptxas-options=-v)
    • Large structures and arrays are placed here instead of in registers
  • Register Pressure
    • When there are not enough registers to accommodate the data of a thread, registers are spilled into local memory
    • Can be moderated by selecting smaller thread blocks (see the sketch below)
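A minimal sketch (not from the slides) of a kernel whose per-thread array ends up in local memory; the __launch_bounds__ qualifier, which caps per-thread register usage so the chosen block size fits, is an addition here and the kernel itself is hypothetical. Compiling with --ptxas-options=-v reports the register and local memory usage.

    // The 64-element per-thread array is indexed dynamically, so the compiler
    // places it in local memory rather than in registers.
    __global__ void __launch_bounds__(256)
    histogramPerThread(const unsigned char *in, int n, unsigned int *out) {
        unsigned int local[64] = {0};     // per-thread, lives in local memory

        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)
            ++local[in[i] & 63];          // dynamic index forces memory placement

        for (int b = 0; b < 64; ++b)      // merge per-thread counts
            atomicAdd(&out[b], local[b]);
    }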

  18. Constant and Texture Memory
  • Constant Memory
    • A special 64 kB read-only memory, cached on chip
      • 8 kB is the cache working set per multiprocessor
    • CC 2.x introduces the LDU (LoaD Uniform) instruction
      • The compiler uses it to load read-only variables that are thread-independent through this cache
  • Texture Memory
    • The texture cache is optimized for 2D spatial locality
    • Additional functionality like fast data interpolation, a normalized coordinate system, or handling of boundary cases
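A minimal sketch of constant memory usage (not from the slides); the symbol coeffs and the filter kernel are illustrative:

    #include <cuda_runtime.h>

    // Filter coefficients kept in constant memory; all threads of a warp
    // reading the same coefficient are served by the constant cache.
    __constant__ float coeffs[16];

    __global__ void applyFilter(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float acc = 0.0f;
            for (int k = 0; k < 16; ++k)
                acc += coeffs[k] * in[min(i + k, n - 1)];
            out[i] = acc;
        }
    }

    void uploadCoeffs(const float *hostCoeffs) {
        // Constant memory is filled from the host via cudaMemcpyToSymbol
        cudaMemcpyToSymbol(coeffs, hostCoeffs, 16 * sizeof(float));
    }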

  19. Memory Allocation
  • Global Memory
    • cudaMalloc(), cudaFree()
    • Dynamic kernel allocation
      • malloc() and free() called from a kernel
      • cudaDeviceSetLimit(cudaLimitMallocHeapSize, size)
  • Shared Memory
    • Statically (e.g., __shared__ int foo[16];)
    • Dynamically (by a kernel launch parameter):
      extern __shared__ float bar[];
      float *bar1 = &(bar[0]);
      float *bar2 = &(bar[size_of_bar1]);
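A minimal sketch of launching a kernel with dynamically sized shared memory; the kernel name is illustrative and n is assumed to be a multiple of the block size:

    extern __shared__ float buffer[];   // size supplied at launch time

    __global__ void scaleViaShared(const float *in, float *out, float factor) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        buffer[threadIdx.x] = in[i];    // stage the input in shared memory
        __syncthreads();
        out[i] = buffer[threadIdx.x] * factor;
    }

    void launch(const float *in, float *out, float factor, int n) {
        int block = 256;
        int grid = n / block;
        // Third launch parameter = bytes of dynamic shared memory per block
        scaleViaShared<<<grid, block, block * sizeof(float)>>>(in, out, factor);
    }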

  20. Implications and Guidelines
  • Global Memory
    • Data should be accessed in a coalesced manner
    • Hot data should be manually cached in shared memory (see the sketch below)
  • Shared Memory
    • Bank conflicts need to be avoided
      • Redesign data structures in a column-wise manner
      • Use strides that are not divisible by the number of banks
  • Registers and Local Memory
    • Use as few as possible; avoid register spilling
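A minimal sketch of manually caching hot data in shared memory, here a 1D averaging stencil (not from the slides; BLOCK and RADIUS are illustrative, the kernel is assumed to be launched with BLOCK threads per block and n a multiple of BLOCK):

    #define BLOCK 256
    #define RADIUS 3

    // Each block stages its input tile plus a halo of RADIUS elements on each
    // side into shared memory, so every input value is read from global memory
    // once and then reused by up to 2*RADIUS+1 threads.
    __global__ void stencil1D(const float *in, float *out, int n) {
        __shared__ float tile[BLOCK + 2 * RADIUS];

        int g = blockIdx.x * blockDim.x + threadIdx.x;   // global index
        int s = threadIdx.x + RADIUS;                    // index inside the tile

        tile[s] = in[g];
        if (threadIdx.x < RADIUS) {                      // load the halo elements
            tile[s - RADIUS] = in[max(g - RADIUS, 0)];
            tile[s + BLOCK]  = in[min(g + BLOCK, n - 1)];
        }
        __syncthreads();

        float acc = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; ++k)
            acc += tile[s + k];
        out[g] = acc / (2 * RADIUS + 1);
    }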

  21. Implications and Guidelines
  • Memory Caching
    • Structures should be designed to utilize the caches in the best way possible
      • The working set of the active blocks should fit into the L2 cache
    • Provide maximum information to the compiler
      • Use const for constant data
      • Use __restrict__ to indicate that no pointer aliasing will occur
  • Data Alignment
    • Operate on 32-bit/64-bit values only
    • Align data structures to suitable powers of 2
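A short sketch of the const and __restrict__ qualifiers on kernel parameters; the SAXPY kernel is just a familiar example, not from the slides:

    // const + __restrict__ tells the compiler the pointers do not alias, so
    // loads of x can be cached and reordered more aggressively.
    __global__ void saxpy(int n, float a,
                          const float * __restrict__ x,
                          float * __restrict__ y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }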

  22. Maxwell Architecture
  • What is new in Maxwell…
    • L1 is merged with the texture cache
      • Data are cached in L1 the same way as in Fermi
    • Shared memory is an independent resource
      • 64 kB or 96 kB, not shared with L1
    • Shared memory uses 32-bit banks
      • A return to the Fermi-like style, keeping the aggregated bandwidth
    • Faster shared memory atomic operations

  23. Discussion
