
CIS 6930: Chip Multiprocessor: Parallel Architecture and Programming

Presentation Transcript


  1. CIS 6930: Chip Multiprocessor: Parallel Architecture and Programming • Fall 2009 • Jih-Kwon Peir • Computer & Information Science & Engineering • University of Florida

  2. CIS 6930: Chip Multiprocessor: Parallel Architecture and Programming • Acknowledgement: Slides borrowed from • Accelerators for Science and Engineering Applications: GPUs and Multicores, by David Kirk / NVIDIA and Wen-mei Hwu / University of Illinois, 2006-2008 (http://www.greatlakesconsortium.org/events/GPUMulticore/agenda.html) • Course material posted in the CUDA Zone (http://www.nvidia.com/object/cuda_education.html) • Intel Software Network (http://software.intel.com/en-us/academic/) • The Art of Multiprocessor Programming (http://software.intel.com/en-us/academic/) • Presentation slides from various papers

  3. Course Goals • Learn how to program massively parallel processors and achieve • high performance • functionality and maintainability • scalability across future generations • Acquire technical knowledge required to achieve the above goals • principles and patterns of parallel programming • processor architecture features and constraints • programming API, tools and techniques • Learn new many-core general-purpose and GPU processor architecture • Organization and memory systems • Parallel programming basics: Locking, synchronization, mutual exclusion, transactional memory, etc.

  4. Course Outline • Week 1-2: Introduction, GPU architectures, CUDA programming • Week 3-6: CUDA threads, code blocks, grids, CUDA memory, synchronization, performance • Week 7: Project selection and discussion • Week 8-9: Intel many-core architectures • Week 10-11: Parallel programming model, synchronization, mutual exclusion, conditional synchronization, locks, barriers, concurrency and correctness, sequential programs and consistency • Additional topics: Fermi and Larrabee • Week 12-13: Discussion of advanced issues in multi-core architecture and programming • Week 14-16: In-depth discussion of project topics and project presentations

  5. CUDA – GPU Programming • Integrated host+device application C program • Serial or modestly parallel parts in host C code • Highly parallel parts in device SPMD kernel C code • Execution alternates: Serial Code (host) → Parallel Kernel (device): KernelA<<< nBlk, nTid >>>(args); → Serial Code (host) → Parallel Kernel (device): KernelB<<< nBlk, nTid >>>(args);
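
  A minimal sketch of this host/device alternation, assuming hypothetical kernels kernelA and kernelB and illustrative launch parameters (nBlk blocks of nTid threads), not the course's actual example:

  __global__ void kernelA(float* d_data, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per element
      if (i < n) d_data[i] *= 2.0f;                    // highly parallel device work
  }

  __global__ void kernelB(float* d_data, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) d_data[i] += 1.0f;
  }

  void run(float* d_data, int n) {
      int nTid = 256;                      // threads per block (assumed value)
      int nBlk = (n + nTid - 1) / nTid;    // enough blocks to cover n elements
      // ... serial host code ...
      kernelA<<<nBlk, nTid>>>(d_data, n);  // parallel kernel on the device
      // ... more serial host code ...
      kernelB<<<nBlk, nTid>>>(d_data, n);
      cudaDeviceSynchronize();             // wait for the device to finish
  }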

  6. CUDA Thread Blocks and Threads • Each thread uses IDs to decide what data to work on • Block ID: 1D or 2D • Thread ID: 1D, 2D, or 3D • Simplifies memory addressing when processing multidimensional data • Image processing • Solving PDEs on volumes • …
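
  As a hedged illustration (the kernel and variable names below are mine, not from the slides), a 2D kernel typically combines the block ID and thread ID into a global element index:

  __global__ void scale2D(float* img, int width, int height, float s) {
      // Global 2D coordinates built from block ID and thread ID
      int col = blockIdx.x * blockDim.x + threadIdx.x;
      int row = blockIdx.y * blockDim.y + threadIdx.y;
      if (col < width && row < height)
          img[row * width + col] *= s;     // row-major addressing of the 2D data
  }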

  7. Matrix Multiplication – A Simple Example

  // Matrix multiplication on the (CPU) host in double precision
  void MatrixMulOnHost(float* M, float* N, float* P, int Width)
  {
      for (int i = 0; i < Width; ++i)
          for (int j = 0; j < Width; ++j) {
              double sum = 0;
              for (int k = 0; k < Width; ++k) {
                  double a = M[i * Width + k];
                  double b = N[k * Width + j];
                  sum += a * b;
              }
              P[i * Width + j] = sum;
          }
  }

  [Figure: M, N, and P are Width × Width matrices; row i of M is combined with column j of N to produce element P(i, j).]
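
  A hedged sketch of the straightforward device version (one thread per output element, every operand read from global memory); this is the kind of kernel whose G80 performance is analyzed on the following slides, though the name and launch configuration here are assumptions:

  __global__ void MatrixMulSimple(float* Md, float* Nd, float* Pd, int Width) {
      int Row = blockIdx.y * blockDim.y + threadIdx.y;
      int Col = blockIdx.x * blockDim.x + threadIdx.x;
      if (Row < Width && Col < Width) {
          float Pvalue = 0;
          for (int k = 0; k < Width; ++k)
              // Two global-memory reads per multiply-add
              Pvalue += Md[Row * Width + k] * Nd[k * Width + Col];
          Pd[Row * Width + Col] = Pvalue;
      }
  }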

  8. G80 Example: Thread Scheduling (cont.) • SM implements zero-overhead warp scheduling • At any time, only one of the warps is executed by an SM • Warps whose next instruction has its operands ready for consumption are eligible for execution • Eligible warps are selected for execution based on a prioritized scheduling policy • All threads in a warp execute the same instruction when selected

  9. Thread Scheduling (cont.) • Each code block is assigned to one SM; each SM can take up to 8 blocks • Each block holds up to 512 threads, divided into 32-thread warps; each warp is scheduled on the 8 SPs, 4 threads per SP, and executes in SIMT mode • An SP is pipelined with ~30 stages; fetch, decode, gather and write-back act on whole warps, so they have a throughput of 1 warp/slow clock • Execute acts on groups of 8 threads, or quarter-warps (there are only 8 SPs/SM), so its throughput is 1 warp/4 fast clocks or 1 warp/2 slow clocks • The fetch/decode/... stages have a higher throughput to feed both the MAD and the SFU/MUL units alternately, hence the peak rate of 8 MAD + 8 MUL per (fast) clock cycle • Need 6 warps (or 192 threads) per SM to hide the read-after-write latencies

  10. G80 Implementation of CUDA Memories

  [Figure: a grid with two thread blocks; each thread has its own registers, each block has a shared memory, and all blocks share the per-grid global and constant memories, which the host can also access.]

  • Each thread can: • Read/write per-thread registers • Read/write per-thread local memory • Read/write per-block shared memory • Read/write per-grid global memory • Read-only per-grid constant memory
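
  A hedged sketch of how these memory spaces appear in CUDA source (the qualifiers are standard CUDA; the variable names, sizes, and the single-block assumption are mine):

  __constant__ float coeff[16];          // per-grid constant memory (read-only in kernels)
  __device__   float globalBuf[256];     // per-grid global memory

  __global__ void memorySpaces(float* gIn) {
      // Assumes one block of at most 256 threads for this illustration
      __shared__ float tile[256];        // per-block shared memory
      float r = gIn[threadIdx.x];        // r lives in a per-thread register
      tile[threadIdx.x] = r * coeff[0];
      __syncthreads();                   // make the shared-memory writes visible to the block
      globalBuf[threadIdx.x] = tile[threadIdx.x];
  }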

  11. How about performance on G80? • All threads access global memory for their input matrix elements • Two memory accesses (8 bytes) per floating-point multiply-add • 4 B of memory traffic per FLOP • 4 × 346.5 = 1386 GB/s required to achieve the peak FLOP rating • The available 86.4 GB/s limits the code to 21.6 GFLOPS • The actual code runs at about 15 GFLOPS • Need to drastically cut down memory accesses to get closer to the peak 346.5 GFLOPS

  12. Tiled Matrix Multiplication Kernel

  __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
  {
      __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
      __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
      int bx = blockIdx.x;  int by = blockIdx.y;
      int tx = threadIdx.x; int ty = threadIdx.y;
      // Identify the row and column of the Pd element to work on
      int Row = by * TILE_WIDTH + ty;
      int Col = bx * TILE_WIDTH + tx;
      float Pvalue = 0;
      // Loop over the Md and Nd tiles required to compute the Pd element
      for (int m = 0; m < Width/TILE_WIDTH; ++m) {
          // Collaborative loading of Md and Nd tiles into shared memory
          Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
          Nds[ty][tx] = Nd[Col + (m*TILE_WIDTH + ty)*Width];
          __syncthreads();
          for (int k = 0; k < TILE_WIDTH; ++k)
              Pvalue += Mds[ty][k] * Nds[k][tx];
          __syncthreads();
      }
      Pd[Row*Width + Col] = Pvalue;
  }
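
  A hedged sketch of the corresponding host-side launch, with grid and block dimensions chosen to match the tiling; TILE_WIDTH is assumed to be a compile-time constant such as 16, and Md, Nd, Pd are assumed to be device pointers already holding the input matrices:

  #define TILE_WIDTH 16

  void MatrixMulOnDevice(float* Md, float* Nd, float* Pd, int Width) {
      // One thread block per TILE_WIDTH x TILE_WIDTH output tile
      dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
      dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
      MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
  }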

  13. Today’s Intel PC Architecture: Single-Core System • FSB connection between processor and Northbridge (82925X) • Memory Control Hub • Northbridge handles “primary” PCIe to video/GPU and DRAM • PCIe x16 bandwidth of 8 GB/s (4 GB/s in each direction) • Southbridge (ICH6RW) handles other peripherals

  14. GeForce-8 Series HW Overview

  [Figure: the Streaming Processor Array is built from Texture Processor Clusters (TPCs); each TPC contains texture units (TEX) and Streaming Multiprocessors (SMs); each SM has an instruction L1, a data L1, instruction fetch/dispatch logic, shared memory, 8 streaming processors (SPs), and 2 special function units (SFUs).]

  15. SM Warp Scheduling

  [Figure: the SM multithreaded warp scheduler interleaves instructions from different warps over time, e.g. warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, warp 8 instruction 12, warp 3 instruction 96, ...]

  • SM hardware implements zero-overhead warp scheduling • Warps whose next instruction has its operands ready for consumption are eligible for execution • Eligible warps are selected for execution based on a prioritized scheduling policy • All threads in a warp execute the same instruction when selected • 4 clock cycles are needed to dispatch the same instruction for all threads in a warp on G80 • If one global memory access is needed for every 4 instructions, a minimum of 13 warps is needed to fully tolerate a 200-cycle memory latency

  16. CUDA Device Memory Space: Review

  [Figure: the device grid holds thread blocks; each thread has registers and local memory, each block has shared memory, and the grid has global, constant, and texture memories that the host can also access.]

  • Each thread can: • R/W per-thread registers • R/W per-thread local memory • R/W per-block shared memory • R/W per-grid global memory • Read only per-grid constant memory • Read only per-grid texture memory • The host can R/W global, constant, and texture memories using copy functions
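
  A hedged sketch of the host-side allocation and copy calls implied by the last bullet (the array size and names are illustrative, not from the slides):

  #include <cuda_runtime.h>
  #include <stdlib.h>

  void copyExample(void) {
      int n = 1024;
      size_t bytes = n * sizeof(float);
      float* h_A = (float*)malloc(bytes);                     // host memory
      float* d_A = NULL;                                      // device global memory
      cudaMalloc((void**)&d_A, bytes);
      cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);    // host -> device global memory
      // ... launch kernels that read/write d_A ...
      cudaMemcpy(h_A, d_A, bytes, cudaMemcpyDeviceToHost);    // device -> host
      cudaFree(d_A);
      free(h_A);
  }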

  17. Memory Layout of a Matrix in C

  [Figure: a 4×4 matrix with elements Mx,y is stored in linearized row-major order: M0,0 M1,0 M2,0 M3,0, then M0,1 M1,1 M2,1 M3,1, then M0,2 M1,2 M2,2 M3,2, then M0,3 M1,3 M2,3 M3,3. The kernel's access direction is shown across threads T1-T4, which step through the matrix one time period after another.]

  18. Bank Addressing Examples

  [Figure: two shared-memory access patterns. With linear addressing and stride == 2, pairs of threads map to the same bank (2-way bank conflicts). With linear addressing and stride == 8, eight threads map to the same bank (8-way bank conflicts).]
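
  A hedged illustration of the access patterns above on G80's 16-bank shared memory (the kernel, array name, and the one-warp-per-block assumption are mine):

  __global__ void bankConflictDemo(float* out) {
      __shared__ float buf[256];
      int tid = threadIdx.x;                 // assume one 32-thread block for this demo
      for (int i = tid; i < 256; i += blockDim.x)
          buf[i] = (float)i;                 // initialize shared memory cooperatively
      __syncthreads();

      float a = buf[tid];                    // stride 1: one thread per bank, conflict-free
      float b = buf[tid * 2];                // stride 2: two threads per bank, 2-way conflicts
      float c = buf[tid * 8];                // stride 8: eight threads per bank, 8-way conflicts

      out[tid] = a + b + c;
  }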

  19. Control Flow Instructions • Main performance concern with branching is divergence • Threads within a single warp take different paths • Different execution paths are serialized in G80 • The control paths taken by the threads in a warp are traversed one at a time until there are no more • A common case: avoid divergence when the branch condition is a function of the thread ID • Example with divergence: If (threadIdx.x > 2) { } • This creates two different control paths for threads in a block • Branch granularity < warp size; threads 0, 1, and 2 follow a different path than the rest of the threads in the first warp • Example without divergence: If (threadIdx.x / WARP_SIZE > 2) { } • This also creates two different control paths for threads in a block • Branch granularity is a whole multiple of warp size; all threads in any given warp follow the same path (see the sketch below)
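
  A hedged side-by-side sketch of the two cases; WARP_SIZE is assumed to be 32 and the work in each branch is placeholder code:

  #define WARP_SIZE 32

  __global__ void divergentBranch(float* data) {
      int tid = threadIdx.x;
      // Divergent: within the first warp, threads 0-2 take the else path and
      // threads 3-31 take the if path, so the warp executes both paths serially.
      if (tid > 2) data[tid] += 1.0f;
      else         data[tid] -= 1.0f;
  }

  __global__ void nonDivergentBranch(float* data) {
      int tid = threadIdx.x;
      // Non-divergent: the condition is constant across each warp
      // (warps 0-2 take one path, later warps the other), so no warp splits.
      if (tid / WARP_SIZE > 2) data[tid] += 1.0f;
      else                     data[tid] -= 1.0f;
  }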

  20. Vector Reduction with Branch Divergence

  [Figure: a reduction tree over array elements 0-11 and beyond. In iteration 1, threads 0, 2, 4, 6, 8, 10 compute 0+1, 2+3, 4+5, 6+7, 8+9, 10+11; in iteration 2 the partial sums 0...3, 4...7, 8...11 are formed; in iteration 3, 0...7 and 8...15. Only every 2nd, then 4th, then 8th thread stays active, so the active threads are scattered across warps.]
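
  A hedged sketch of a shared-memory reduction written this interleaved way (names, the 512-element shared array, and the per-block output are assumptions); the set of active threads within every warp thins out each iteration, causing divergence:

  __global__ void reduceInterleaved(float* input, float* output, int n) {
      __shared__ float s[512];                   // assumes blockDim.x <= 512
      unsigned int t = threadIdx.x;
      unsigned int i = blockIdx.x * blockDim.x + t;
      s[t] = (i < n) ? input[i] : 0.0f;
      __syncthreads();

      // Interleaved pairs: only threads whose index is a multiple of 2*stride add,
      // so the active threads are scattered across every warp (branch divergence)
      for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
          if (t % (2 * stride) == 0)
              s[t] += s[t + stride];
          __syncthreads();
      }
      if (t == 0) output[blockIdx.x] = s[0];     // one partial sum per block
  }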

  21. No Divergence until < 16 sub-sums

  [Figure: in the improved reduction over elements 0-31, iteration 1 adds elements half an array apart: thread 0 computes 0+16, thread 1 computes 1+17, ..., thread 15 computes 15+31. The active threads stay contiguous at the low end of the thread range, so divergence does not appear until fewer than 16 sub-sums remain.]
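
  A hedged sketch of the reduction rewritten with a halving stride so that the active threads stay contiguous (names and the 512-element shared array are assumptions, as above):

  __global__ void reduceContiguous(float* input, float* output, int n) {
      __shared__ float s[512];                   // assumes blockDim.x <= 512
      unsigned int t = threadIdx.x;
      unsigned int i = blockIdx.x * blockDim.x + t;
      s[t] = (i < n) ? input[i] : 0.0f;
      __syncthreads();

      // The stride starts at half the block and halves each iteration;
      // active threads are always 0..stride-1, so whole warps stay convergent
      // until the active set shrinks below a warp.
      for (unsigned int stride = blockDim.x / 2; stride > 0; stride /= 2) {
          if (t < stride)
              s[t] += s[t + stride];
          __syncthreads();
      }
      if (t == 0) output[blockIdx.x] = s[0];     // one partial sum per block
  }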

  22. Fundamentals of Parallel Computing • Parallel computing requires that • The problem can be decomposed into sub-problems that can be safely solved at the same time • The programmer structures the code and data to solve these sub-problems concurrently • The goals of parallel computing are • To solve problems in less time, and/or • To solve bigger problems, and/or • To achieve better solutions The problems must be large enough to justify parallel computing and to exhibit exploitable concurrency.

  23. Challenges of Parallel Programming • Finding and exploiting concurrency often requires looking at the problem from a non-obvious angle • Computational thinking (J. Wing) • Dependences need to be identified and managed • The order of task execution may change the answers • Obvious: One step feeds result to the next steps • Subtle: numeric accuracy may be affected by ordering steps that are logically parallel with each other • Performance can be drastically reduced by many factors • Overhead of parallel processing • Load imbalance among processor elements • Inefficient data sharing patterns • Saturation of critical resources such as memory bandwidth

  24. Fermi Implements CUDA • Definitions of memory scope, grid, thread block, and thread are the same as in Tesla • Grid: array of thread blocks • Thread block: up to 1,536 concurrent threads, communicating through shared memory • The GPU has an array of SMs; each executes one or more thread blocks, and each block is grouped into warps of 32 threads • Other resource constraints are implementation based

  25. Fermi – GT300 Key Features • 32 cores per SM, 512 cores total • Fully pipelined integer and floating-point units that implement the new IEEE 754-2008 standard, including fused multiply-add (FMA) • Two warps from different thread blocks (even different kernels) can be issued and executed concurrently • ECC protection from the registers to DRAM • Linear addressing model with caching at all levels • Large shared memory / L1 cache • Double-precision performance 8x faster than GT200, reaching ~600 double-precision GFLOPS

  26. Fermi – GT300 Key Features (cont.) • Fermi supports simultaneous execution of multiple kernels from the same application, each kernel distributed to one or more SMs • The GigaThread hardware thread scheduler manages 1,536 simultaneously active threads for each SM across 16 kernels • Switching from one application to another is 20x faster on Fermi • Fermi supports OpenCL, Fortran, C++, Java, Matlab, and Python • Each SM has 32 cores, 16 LD/ST units, and 4 SFUs • Fermi supports FMA for both single and double precision

  27. Instruction Schedule Example • A total of 32 instructions from one or two warps can be dispatched in each cycle to any two of the four execution blocks within a Fermi SM: two blocks of 16 cores each, one block of four Special Function Units, and one block of load/store units. This figure shows how instructions are issued to the four execution blocks. • It takes two cycles for the 32 instructions in each warp to execute on the cores or load/store units. A warp of 32 special-function instructions is issued in a single cycle but takes eight cycles to complete on the four SFUs • Another major improvement in Fermi and PTX 2.0 is a new unified addressing model. All addresses in the GPU are allocated from a continuous 40-bit (one terabyte) address space. Global, shared, and local addresses are defined as ranges within this address space and can be accessed by common load/store instructions. (The load/store instructions support 64-bit addresses to allow for future growth.)

  28. Multi-Core Architecture: Intel Quad-Core Technology of Today – Cache Structure • The L2 cache of today's quad-core processors is not one cache shared by all 4 cores; instead, there are two L2 caches, each shared by two cores

  [Figure: cores 0 and 1 share one 4MB L2 cache, cores 2 and 3 share another 4MB L2 cache; both connect through the bus interface to a 1066 MHz/1333 MHz FSB.]

  29. Programming with OpenMP* – What Is OpenMP*?

  [Figure: a cloud of OpenMP constructs, e.g. #pragma omp parallel for private(A, B), #pragma omp critical, C$OMP PARALLEL DO ORDERED PRIVATE (A, B, C), C$OMP PARALLEL REDUCTION (+: A, B), C$OMP SECTIONS, !$OMP BARRIER, C$OMP ATOMIC, omp_set_lock(lck), CALL OMP_SET_NUM_THREADS(10), setenv OMP_SCHEDULE "dynamic", Nthrds = OMP_GET_NUM_PROCS(), ...]

  • http://www.openmp.org • Current spec is OpenMP 2.5 • 250 pages (combined C/C++ and Fortran)
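
  A hedged minimal C example in the style of the directives shown above (a parallel for plus a reduction; the array size and contents are illustrative only):

  #include <omp.h>
  #include <stdio.h>

  #define N 1000000
  static double a[N];

  int main(void) {
      double sum = 0.0;

      // Work-sharing loop: iterations are divided among the team of threads
      #pragma omp parallel for
      for (int i = 0; i < N; ++i)
          a[i] = 0.5 * i;

      // Reduction: each thread keeps a private partial sum, combined at the end
      #pragma omp parallel for reduction(+: sum)
      for (int i = 0; i < N; ++i)
          sum += a[i];

      printf("sum = %f, max threads = %d\n", sum, omp_get_max_threads());
      return 0;
  }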

  30. More material • Intel Larrabee Architecture • Herlihy’s Book • Chapter 1: Introduction • Chapter 2: Mutual Exclusion
