Parallel Computing CS 6021/01 Advanced Computer Architecture Final Project Spring 2019 • Group 1 • Hu Longhua • Shweta Khandal • Vannel Zeufack
Plan • Introduction • Concepts • Parallel Computing Memory Architectures • Parallel Programming Models • References
Introduction • Definition: the simultaneous use of multiple compute resources to solve a computing problem. • Necessary due to the power wall: single-processor performance is limited by heat dissipation. • We can solve • Larger problems • Faster
Amdahl’s law • 1 server -> 20 customers per hour • 2 servers -> 40 customers per hour • 3 servers -> 60 customers per hour • This linear scaling only holds if • Servers serve at the same speed • Servers do not share resources • Amdahl’s law captures the general limit: the serial (non-parallelizable) portion of the work bounds the achievable speedup.
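The formula below states Amdahl's law: if a fraction p of the work can be parallelized across N processors, the remaining serial fraction (1 − p) limits the overall speedup.

```latex
% Amdahl's law: speedup with N processors when a fraction p of the work
% is parallelizable and the remaining (1 - p) must run serially.
S(N) = \frac{1}{(1 - p) + \frac{p}{N}},
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{1 - p}
```

For example, if 90% of a program parallelizes perfectly (p = 0.9), the speedup can never exceed 10x, no matter how many servers are added.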
Parallelism vs Concurrency Concurrency: managing the execution of multiple tasks such that they seem to be occurring at the same time. Parallelism: running two tasks at the exact same time
Parallelism vs Concurrency • Parallelism: tasks run at the same time; multiple processors needed • Concurrency: tasks are interleaved, running one at a time; a single processor is enough
Types of parallelism • Bit-Level • Based on increasing CPU word size (from 4-bit to 64-bit microprocessors) • Reduces the number of instructions the processor must execute to perform an operation on variables larger than the word length. • Instruction-Level Parallelism • Based on the simultaneous execution of many instructions • Can occur both at the hardware level (chips) and the software level (compilers) • Task-Level Parallelism • Running many different tasks at the same time on the same data • A task (process/thread) is a unit of execution and is made of many instructions. • Data-Level Parallelism • Running the same task on different data at the same time, as in the sketch below
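A minimal sketch of data-level parallelism using POSIX threads (the array, the doubling operation, and the two-way split are illustrative, not from the slides): the same task runs on different halves of the data at the same time.

```c
#include <pthread.h>
#include <stdio.h>

#define N 8
static double data[N] = {1, 2, 3, 4, 5, 6, 7, 8};

struct chunk { int start; int end; };   /* range of the array handled by one thread */

/* The same task (doubling each element) runs on different data in each thread. */
static void *double_chunk(void *arg) {
    struct chunk *c = (struct chunk *)arg;
    for (int i = c->start; i < c->end; i++)
        data[i] *= 2.0;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    struct chunk lo = {0, N / 2}, hi = {N / 2, N};

    pthread_create(&t1, NULL, double_chunk, &lo);
    pthread_create(&t2, NULL, double_chunk, &hi);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    for (int i = 0; i < N; i++)
        printf("%.1f ", data[i]);
    printf("\n");
    return 0;
}
```

Task-level parallelism would instead give each thread a different function; here both threads execute the same function on disjoint data.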
Memory Architectures • Shared Memory • Uniform Memory Access (UMA) • Non-Uniform Memory Access (NUMA) • Distributed Memory • Hybrid Architecture • Hybrid Architecture with Accelerators (co-processors) • GPGPU (General Purpose Graphical Processing Unit) • MIC (Many Integrated Core)
Shared Memory • Multiple processors can operate independently but share the same memory resources. • Changes in a memory location made by one processor are visible to all other processors. • Classified as UMA (Uniform Memory Access) and NUMA (Non-uniform Memory Access), based upon memory access times.
Shared Memory: Uniform Memory Access (UMA) • Identical processors • Equal access times to memory • Sometimes called CC-UMA (Cache Coherent UMA). • Most commonly represented today by Symmetric Multiprocessor (SMP) machines
Shared Memory: Non-Uniform Memory Access (NUMA) • Often made by physically linking two or more SMPs (Symmetric Multiprocessors) • One SMP can directly access the memory of another SMP • Processors do not have equal access times to all memories • Memory access across the link is slower
Shared Memory: Pros and Cons • Advantages • User-friendly programming perspective on memory • Data sharing between tasks is both fast and uniform due to the proximity of memory to CPUs • Disadvantages • Lack of scalability between memory and CPUs: adding more CPUs • can increase traffic on the shared memory-CPU path • and, for cache-coherent systems, can increase traffic associated with cache/memory management.
Distributed Memory • Memory is local to each processor • Data exchanged by message passing over a network • Because each processor has its own local memory, it operates independently. Hence, the concept of cache coherency does not apply • The network “fabric” used for data transfer varies widely, though it can be as simple as Ethernet.
Distributed Memory: Pros and Cons • Advantages • Memory is scalable with the number of processors. Increase the number of processors and the size of memory increases proportionately. • Each processor can rapidly access its own memory. • Cost effectiveness. • Disadvantages • The programmer is responsible for many of the details associated with data communication between processors. • It may be difficult to map existing data structures, based on global memory, to this memory organization. • Non-uniform memory access times
Hybrid Architecture • The largest and fastest computers in the world today employ both shared and distributed memory architectures. • The shared memory component can be a shared memory machine and/or graphics processing units (GPU) • Network communications are required to move data from one machine to another
Hybrid Architecture with accelerators • Why accelerators or co-processors are needed • Clock frequency is limited by power requirements and heat dissipation restrictions (an unmanageable problem). • The number of cores per chip keeps increasing. • For HPC, we need a chip that can provide higher computing performance at lower energy cost.
Hybrid Architecture with accelerators • How to solve it • The current solution is a hybrid system containing both CPUs and “accelerators”, plus other forms of parallelism such as vector instruction support. • It is widely accepted that hybrid systems with accelerators deliver the highest-performance, most energy-efficient computing in HPC. • The most common accelerators are the MIC (Many Integrated Core) and the GPGPU (General Purpose Graphics Processing Unit).
Accelerated (GPGPU and MIC) Systems • Accelerator (or co-processor): a computer processor used to supplement the functions of the primary processor (the CPU), allowing even greater parallelism. • GPGPU (General Purpose Graphics Processing Unit) • Derived from graphics hardware • Requires a new programming model and specific libraries and compilers (CUDA, OpenCL) • Newer GPUs support the IEEE 754-2008 floating-point standard • Does not support flow control (handled by the host thread) • MIC (Many Integrated Core) • Derived from traditional CPU hardware • Based on the x86 instruction set • Supports multiple programming models (OpenMP, MPI, OpenCL) • Flow control can be handled on the accelerator
CPU vs MIC vs GPU Architecture Comparison • CPU: general-purpose architecture • MIC: power-efficient multiprocessor x86 design • GPU: massively data parallel
Hybrid Architecture with accelerators (GPGPU and MIC) • Calculations are made in both the CPU and the accelerator • Accelerators provide an abundance of low-cost FLOPS • CPU and accelerator typically communicate over the PCIe bus • Load balancing is critical for performance
Parallel Programming Models • Shared Memory Model without threads • Shared Memory Model with threads • Distributed Memory Model with Message Passing Interface • Hybrid Model
Shared Memory Model (without threads) • Simplest parallel programming model • Processes/tasks share a common address space, which they read and write to asynchronously • Locks/semaphores are used to control access to the shared memory, resolve contentions and prevent race conditions and deadlocks. • Examples • POSIX standard provides an API to implement shared memory model • UNIX provides shared memory segments (shmget, shmat, shmctl, etc)
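A minimal sketch of the UNIX System V shared memory segments mentioned above (error handling omitted; the message string is illustrative): the parent creates and attaches a segment, a forked child writes to it, and the parent reads the same memory.

```c
#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    /* Create a private 4 KiB shared memory segment. */
    int shmid = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);

    /* Attach the segment into this process's address space;
       the attachment is inherited by the child after fork(). */
    char *mem = (char *)shmat(shmid, NULL, 0);

    if (fork() == 0) {                 /* child: write into shared memory */
        strcpy(mem, "hello from the child");
        _exit(0);
    }

    wait(NULL);                        /* parent: wait, then read the child's write */
    printf("parent sees: %s\n", mem);

    shmdt(mem);                        /* detach and remove the segment */
    shmctl(shmid, IPC_RMID, NULL);
    return 0;
}
```

Both processes read and write the same physical memory asynchronously; with more than one writer, a lock or semaphore would be needed to avoid race conditions.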
Shared Memory Model without threads • Advantages • No need to explicitly specify the communication of data between tasks. • Simplest model • Disadvantages • Data locality issues • Deadlocks and race conditions
Threads Model • Shared memory programming model but using threads • Threads implementations commonly comprise: • A library of subroutines that are called from within parallel source code • A set of compiler directives embedded in either serial or parallel source code
POSIX Threads • Specified by the IEEE POSIX 1003.1c standard (1995). C Language only. • Part of Unix/Linux operating systems • Library based • Commonly referred to as Pthreads. • Very explicit parallelism; requires significant programmer attention to detail.
Pthreads • The subroutines which comprise the Pthreads API can be informally grouped into four major groups: • Thread management: routines that work directly on threads - creating, detaching, joining, etc. • Mutexes: routines that deal with synchronization. Mutex functions provide for creating, destroying, locking and unlocking mutexes. • Condition variables: routines that address communications between threads that share a mutex. • Synchronization: routines that manage read/write locks and barriers. • A major disadvantage is the risk of deadlock.
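A minimal sketch of the first two groups, thread management and mutexes (the counter, thread count, and iteration count are illustrative): without the mutex, the concurrent increments would race and the final count would be wrong.

```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define ITERS    100000

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread increments the shared counter under the mutex. */
static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&lock);
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t tids[NTHREADS];

    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tids[i], NULL, worker, NULL);   /* thread management: create */
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tids[i], NULL);                    /* thread management: join */

    printf("counter = %ld (expected %d)\n", counter, NTHREADS * ITERS);
    return 0;
}
```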
OpenMP • Industry standard, jointly defined and endorsed by a group of major computer hardware and software vendors, organizations and individuals. • Compiler directive based • Portable / multi-platform, including Unix and Windows platforms • Available in C/C++ and Fortran implementations • Can be very easy and simple to use - provides for "incremental parallelism". Can begin with serial code.
OpenMP Fork-Join Model • The master thread forks a team of threads at the start of a parallel region; the team executes the region in parallel and joins back to the master at its end.
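A minimal C sketch of the fork-join model (the array size and the computation are illustrative); compile with an OpenMP-enabled compiler, e.g. gcc -fopenmp.

```c
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N];
    double sum = 0.0;

    /* Fork: the master thread spawns a team and the loop iterations are
       divided among the threads. Join: the team synchronizes at the end
       of the parallel region and the master continues alone. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = i * 0.5;
        sum += a[i];
    }

    printf("sum = %.1f, computed with up to %d threads\n",
           sum, omp_get_max_threads());
    return 0;
}
```

This shows the "incremental parallelism" mentioned above: removing the single pragma leaves a correct serial program.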
Fork-Join Model (OpenMP-style) in Java • [Code demo omitted: two sample runs with incorrect output ordering and one with the correct ordering]
Distributed Memory / Message Passing Model • Tasks use their own local memory • Tasks exchange data through communications by sending and receiving messages. • Data transfer usually requires cooperative operations to be performed by each process. • From a programming perspective, message passing implementations usually comprise a library of subroutines. • The programmer is responsible for determining all parallelism.
Distributed Memory / Message Passing Model • Point-to-point communication • Thread safety • MPI is mainly used for portable parallel programs, parallel libraries, and irregular or dynamic data relationships that do not fit the data-parallel model.
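A minimal sketch of MPI point-to-point communication (the rank numbers, tag, and message value are illustrative); build with mpicc and run with, e.g., mpirun -np 2.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                              /* data lives in rank 0's local memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Cooperative operation: the receive must match the send. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}
```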
Hybrid Model • A hybrid model combines more than one of the previously described programming models • Currently, a common example of a hybrid model is the combination of the message passing model (MPI) with the threads model (OpenMP). • Threads perform computationally intensive kernels using local, on-node data • Communications between processes on different nodes occur over the network using MPI
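A minimal sketch of the MPI + OpenMP combination (the work split and the harmonic-sum computation are illustrative): each MPI process handles a slice of the problem and OpenMP threads parallelize the node-local work, while MPI combines the partial results across processes.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char **argv) {
    int rank, nprocs, provided;
    double local = 0.0, global = 0.0;

    /* Request FUNNELED support: only the master thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Node-local, computationally intensive part: OpenMP threads
       share this process's slice of the index range. */
    #pragma omp parallel for reduction(+:local)
    for (int i = rank; i < N; i += nprocs)
        local += 1.0 / (i + 1.0);

    /* Inter-node communication: combine the per-process partial sums. */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("harmonic sum of first %d terms = %f\n", N, global);

    MPI_Finalize();
    return 0;
}
```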
Hybrid Model • MPI with CPU-GPU (Graphics Processing Unit) programming: another similar and increasingly popular example of a hybrid model • MPI tasks run on CPUs using local memory and communicate with each other over a network. • Computationally intensive kernels are off-loaded to GPUs on-node. • Data exchange between node-local memory and GPUs uses CUDA (or something equivalent).
References • Introduction to parallel computing • https://en.wikipedia.org/wiki/Parallel_computing#Fine-grained,_coarse-grained,_and_embarrassing_parallelism • https://computing.llnl.gov/tutorials/parallel_comp/ • Parallel vs Concurrent Programming • https://www.youtube.com/watch?v=ltTQaMSk6ME • https://www.youtube.com/watch?v=FChZP09Ba4E • GPU vs ManyCore • https://www.greymatter.com/corporate/hardcopy-article/gpu-vs-manycore/ • MIC & GPU Architecture • https://www.lrz.de/services/compute/courses/x_lecturenotes/MIC_GPU_Workshop/MIC-AND-GPU-2015.pdf • Parallel Programming Models • http://apiacoa.org/teaching/big-data/smp.en.html • https://www.cs.uky.edu/~jzhang/CS621/chapter9.pdf • http://hpcg.purdue.edu/bbenes/classes/CGT%20581-I/lectures/CGT%20581-I-01-Introduction.pdf • file:///C:/Users/Lenovo/Downloads/BPTX_2013_2_11320_0_378526_0_153227.pdf • https://homes.cs.washington.edu/~djg/teachingMaterials/spac/grossmanSPAC_forkJoinFramework.html • https://www.mpi-forum.org/docs/