Parallel Computing CS 6021/01 Advanced Computer Architecture Final Project Spring 2019 • Group 1 • Hu Longhua • Shweta Khandal • Vannel Zeufack
Plan • Introduction • Concepts • Parallel Computing Memory Architectures • Parallel Programming Models • References
Introduction • Definition: the simultaneous use of multiple compute resources to solve a computing problem. • Necessary due to the power wall: single-processor performance is limited by heat dissipation. • We can solve • Larger problems • Faster
Amdahl’s law • 1 server -> 20 customers per hour • 2 servers -> 40 customers per hour • 3 servers -> 60 customers per hour • This linear scaling only holds if • Servers serve at the same speed • Servers do not share resources • Amdahl’s law captures the general limit: the serial (non-parallelizable) portion of the work bounds the achievable speedup.
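The formula below states Amdahl's law: if a fraction p of the work can be parallelized across N processors, the remaining serial fraction (1 − p) limits the overall speedup.

```latex
% Amdahl's law: speedup with N processors when a fraction p of the work
% is parallelizable and the remaining (1 - p) must run serially.
S(N) = \frac{1}{(1 - p) + \frac{p}{N}},
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{1 - p}
```

For example, if 90% of a program parallelizes perfectly (p = 0.9), the speedup can never exceed 10x, no matter how many servers are added.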
Parallelism vs Concurrency Concurrency: managing the execution of multiple tasks such that they seem to be occurring at the same time. Parallelism: running two tasks at the exact same time
Parallelism vs Concurrency • Parallelism: tasks run at the same time; multiple processors needed • Concurrency: tasks are interleaved, running one at a time; a single processor is enough
Types of parallelism • Bit-Level • Based on increasing CPU word size (from 4-bit to 64-bit microprocessors) • Reduces the number of instructions the processor must execute to perform an operation on variables larger than the word length. • Instruction-Level Parallelism • Based on the simultaneous execution of many instructions • Can occur both at the hardware level (chips) and the software level (compilers) • Task-Level Parallelism • Running many different tasks at the same time on the same data • A task (process/thread) is a unit of execution and is made of many instructions. • Data-Level Parallelism • Running the same task on different data at the same time, as in the sketch below
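A minimal sketch of data-level parallelism using POSIX threads (the array, the doubling operation, and the two-way split are illustrative, not from the slides): the same task runs on different halves of the data at the same time.

```c
#include <pthread.h>
#include <stdio.h>

#define N 8
static double data[N] = {1, 2, 3, 4, 5, 6, 7, 8};

struct chunk { int start; int end; };   /* range of the array handled by one thread */

/* The same task (doubling each element) runs on different data in each thread. */
static void *double_chunk(void *arg) {
    struct chunk *c = (struct chunk *)arg;
    for (int i = c->start; i < c->end; i++)
        data[i] *= 2.0;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    struct chunk lo = {0, N / 2}, hi = {N / 2, N};

    pthread_create(&t1, NULL, double_chunk, &lo);
    pthread_create(&t2, NULL, double_chunk, &hi);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    for (int i = 0; i < N; i++)
        printf("%.1f ", data[i]);
    printf("\n");
    return 0;
}
```

Task-level parallelism would instead give each thread a different function; here both threads execute the same function on disjoint data.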
Memory Architectures • Shared Memory • Uniform Memory Access (UMA) • Non-Uniform Memory Access (NUMA) • Distributed Memory • Hybrid Architecture • Hybrid Architecture with Accelerators (co-processors) • GPGPU (General Purpose Graphical Processing Unit) • MIC (Many Integrated Core)
Shared Memory • Multiple processors can operate independently but share the same memory resources. • Changes in a memory location made by one processor are visible to all other processors. • Classified as UMA (Uniform Memory Access) and NUMA (Non-uniform Memory Access), based upon memory access times.
Shared Memory: Uniform Memory Access (UMA) • Identical processors • Equal access times to memory • Sometimes called CC-UMA (Cache Coherent UMA). • Most commonly represented today by Symmetric Multiprocessor (SMP) machines
Shared Memory: Non-Uniform Memory Access (NUMA) • Often made by physically linking two or more SMPs (Symmetric Multiprocessors) • One SMP can directly access the memory of another SMP • Processors do not have equal access times to all memories • Memory access across the link is slower
Shared Memory: Pros and Cons • Advantages • User-friendly programming perspective on memory • Data sharing between tasks is both fast and uniform due to the proximity of memory to CPUs • Disadvantages • Lack of scalability between memory and CPUs: adding more CPUs • can increase traffic on the shared memory-CPU path • and, for cache-coherent systems, can increase traffic associated with cache/memory management.
Distributed Memory • Memory is local to each processor • Data exchanged by message passing over a network • Because each processor has its own local memory, it operates independently. Hence, the concept of cache coherency does not apply • The network “fabric” used for data transfer varies widely, though it can be as simple as Ethernet.
Distributed Memory: Pros and Cons • Advantages • Memory is scalable with the number of processors. Increase the number of processors and the size of memory increases proportionately. • Each processor can rapidly access its own memory. • Cost effectiveness. • Disadvantages • The programmer is responsible for many of the details associated with data communication between processors. • It may be difficult to map existing data structures, based on global memory, to this memory organization. • Non-uniform memory access times
Hybrid Architecture • The largest and fastest computers in the world today employ both shared and distributed memory architectures. • The shared memory component can be a shared memory machine and/or graphics processing units (GPU) • Network communications are required to move data from one machine to another
Hybrid Architecture with accelerators • Why accelerators or co-processors are needed • Clock frequency is limited by power requirements and heat dissipation restrictions (an unmanageable problem). • The number of cores per chip keeps increasing. • For HPC, we need a chip that can provide higher computing performance at lower energy cost.
Hybrid Architecture with accelerators • How to solve it • The current solution is a hybrid system containing both CPUs and “accelerators”, plus other forms of parallelism such as vector instruction support. • It is widely accepted that hybrid systems with accelerators deliver the highest-performance, most energy-efficient computing in HPC. • The most common accelerators are the MIC (Many Integrated Core) and the GPGPU (General Purpose Graphics Processing Unit).
Accelerated (GPGPU and MIC) Systems • Accelerator (or co-processor): a computer processor used to supplement the functions of the primary processor (the CPU), allowing even greater parallelism. • GPGPU (General Purpose Graphics Processing Unit) • Derived from graphics hardware • Requires a new programming model and specific libraries and compilers (CUDA, OpenCL) • Newer GPUs support the IEEE 754-2008 floating-point standard • Does not support flow control (handled by the host thread) • MIC (Many Integrated Core) • Derived from traditional CPU hardware • Based on the x86 instruction set • Supports multiple programming models (OpenMP, MPI, OpenCL) • Flow control can be handled on the accelerator
CPU vs MIC vs GPU Architecture Comparison • CPU: general-purpose architecture • MIC: power-efficient multiprocessor x86 design • GPU: massively data parallel
Hybrid Architecture with accelerators (GPGPU and MIC) • Calculations are made in both the CPU and the accelerator • Accelerators provide an abundance of low-cost FLOPS • CPU and accelerator typically communicate over the PCIe bus • Load balancing is critical for performance
Parallel Programming Models • Shared Memory Model without threads • Shared Memory Model with threads • Distributed Memory Model with Message Passing Interface • Hybrid Model
Shared Memory Model (without threads) • Simplest parallel programming model • Processes/tasks share a common address space, which they read and write to asynchronously • Locks/semaphores are used to control access to the shared memory, resolve contentions and prevent race conditions and deadlocks. • Examples • POSIX standard provides an API to implement shared memory model • UNIX provides shared memory segments (shmget, shmat, shmctl, etc)
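A minimal sketch of the UNIX System V shared memory segments mentioned above (error handling omitted; the message string is illustrative): the parent creates and attaches a segment, a forked child writes to it, and the parent reads the same memory.

```c
#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    /* Create a private 4 KiB shared memory segment. */
    int shmid = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);

    /* Attach the segment into this process's address space;
       the attachment is inherited by the child after fork(). */
    char *mem = (char *)shmat(shmid, NULL, 0);

    if (fork() == 0) {                 /* child: write into shared memory */
        strcpy(mem, "hello from the child");
        _exit(0);
    }

    wait(NULL);                        /* parent: wait, then read the child's write */
    printf("parent sees: %s\n", mem);

    shmdt(mem);                        /* detach and remove the segment */
    shmctl(shmid, IPC_RMID, NULL);
    return 0;
}
```

Both processes read and write the same physical memory asynchronously; with more than one writer, a lock or semaphore would be needed to avoid race conditions.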
Shared Memory Model without threads • Advantages • No need to explicitly specify the communication of data between tasks. • Simplest model • Disadvantages • Data locality issues • Deadlocks and race conditions
Threads Model • Shared memory programming model but using threads • Threads implementations commonly comprise: • A library of subroutines that are called from within parallel source code • A set of compiler directives embedded in either serial or parallel source code
POSIX Threads • Specified by the IEEE POSIX 1003.1c standard (1995). C Language only. • Part of Unix/Linux operating systems • Library based • Commonly referred to as Pthreads. • Very explicit parallelism; requires significant programmer attention to detail.
Pthreads • The subroutines which comprise the Pthreads API can be informally grouped into four major groups: • Thread management: routines that work directly on threads - creating, detaching, joining, etc. • Mutexes: routines that deal with synchronization. Mutex functions provide for creating, destroying, locking and unlocking mutexes. • Condition variables: routines that address communications between threads that share a mutex. • Synchronization: routines that manage read/write locks and barriers. • A major disadvantage is the risk of deadlock.
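A minimal sketch of the first two groups, thread management and mutexes (the counter, thread count, and iteration count are illustrative): without the mutex, the concurrent increments would race and the final count would be wrong.

```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define ITERS    100000

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread increments the shared counter under the mutex. */
static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&lock);
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t tids[NTHREADS];

    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tids[i], NULL, worker, NULL);   /* thread management: create */
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tids[i], NULL);                    /* thread management: join */

    printf("counter = %ld (expected %d)\n", counter, NTHREADS * ITERS);
    return 0;
}
```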
OpenMP • Industry standard, jointly defined and endorsed by a group of major computer hardware and software vendors, organizations and individuals. • Compiler directive based • Portable / multi-platform, including Unix and Windows platforms • Available in C/C++ and Fortran implementations • Can be very easy and simple to use - provides for "incremental parallelism". Can begin with serial code.
OpenMP Fork-Join Model • The master thread forks a team of threads at the start of a parallel region; the team executes the region in parallel and joins back to the master at its end.
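A minimal C sketch of the fork-join model (the array size and the computation are illustrative); compile with an OpenMP-enabled compiler, e.g. gcc -fopenmp.

```c
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N];
    double sum = 0.0;

    /* Fork: the master thread spawns a team and the loop iterations are
       divided among the threads. Join: the team synchronizes at the end
       of the parallel region and the master continues alone. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = i * 0.5;
        sum += a[i];
    }

    printf("sum = %.1f, computed with up to %d threads\n",
           sum, omp_get_max_threads());
    return 0;
}
```

This shows the "incremental parallelism" mentioned above: removing the single pragma leaves a correct serial program.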
Fork-Join Model (OpenMP-style) in Java • [Code demo omitted: two sample runs with incorrect output ordering and one with the correct ordering]
Distributed Memory / Message Passing Model • Tasks use their own local memory • Tasks exchange data through communications by sending and receiving messages. • Data transfer usually requires cooperative operations to be performed by each process. • From a programming perspective, message passing implementations usually comprise a library of subroutines. • The programmer is responsible for determining all parallelism.
Distributed Memory / Message Passing Model • Point-to-point communication • Thread safety • MPI is mainly used for portable parallel programs, parallel libraries, and irregular or dynamic data relationships that do not fit the data-parallel model.
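A minimal sketch of MPI point-to-point communication (the rank numbers, tag, and message value are illustrative); build with mpicc and run with, e.g., mpirun -np 2.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                              /* data lives in rank 0's local memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Cooperative operation: the receive must match the send. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}
```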
Hybrid Model • A hybrid model combines more than one of the previously described programming models • Currently, a common example of a hybrid model is the combination of the message passing model (MPI) with the threads model (OpenMP). • Threads perform computationally intensive kernels using local, on-node data • Communications between processes on different nodes occur over the network using MPI
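A minimal sketch of the MPI + OpenMP combination (the work split and the harmonic-sum computation are illustrative): each MPI process handles a slice of the problem and OpenMP threads parallelize the node-local work, while MPI combines the partial results across processes.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char **argv) {
    int rank, nprocs, provided;
    double local = 0.0, global = 0.0;

    /* Request FUNNELED support: only the master thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Node-local, computationally intensive part: OpenMP threads
       share this process's slice of the index range. */
    #pragma omp parallel for reduction(+:local)
    for (int i = rank; i < N; i += nprocs)
        local += 1.0 / (i + 1.0);

    /* Inter-node communication: combine the per-process partial sums. */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("harmonic sum of first %d terms = %f\n", N, global);

    MPI_Finalize();
    return 0;
}
```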
Hybrid Model • MPI with CPU-GPU (Graphics Processing Unit) programming: another similar and increasingly popular example of a hybrid model • MPI tasks run on CPUs using local memory and communicate with each other over a network. • Computationally intensive kernels are off-loaded to GPUs on-node. • Data exchange between node-local memory and GPUs uses CUDA (or something equivalent).
References • Introduction to parallel computing • https://en.wikipedia.org/wiki/Parallel_computing#Fine-grained,_coarse-grained,_and_embarrassing_parallelism • https://computing.llnl.gov/tutorials/parallel_comp/ • Parallel vs Concurrent Programming • https://www.youtube.com/watch?v=ltTQaMSk6ME • https://www.youtube.com/watch?v=FChZP09Ba4E • GPU vs ManyCore • https://www.greymatter.com/corporate/hardcopy-article/gpu-vs-manycore/ • MIC & GPU Architecture • https://www.lrz.de/services/compute/courses/x_lecturenotes/MIC_GPU_Workshop/MIC-AND-GPU-2015.pdf • Parallel Programming Models • http://apiacoa.org/teaching/big-data/smp.en.html • https://www.cs.uky.edu/~jzhang/CS621/chapter9.pdf • http://hpcg.purdue.edu/bbenes/classes/CGT%20581-I/lectures/CGT%20581-I-01-Introduction.pdf • file:///C:/Users/Lenovo/Downloads/BPTX_2013_2_11320_0_378526_0_153227.pdf • https://homes.cs.washington.edu/~djg/teachingMaterials/spac/grossmanSPAC_forkJoinFramework.html • https://www.mpi-forum.org/docs/