
SE-292 High Performance Computing Intro. To Concurrent Programming & Parallel Architecture



Presentation Transcript


  1. SE-292 High Performance Computing Intro. To Concurrent Programming & Parallel Architecture

  2. Concurrent Programming • Until now: execution involved a single flow of control through the program • Concurrent programming is about programs with multiple flows of control • For example: a program that runs as multiple processes cooperating to achieve a common goal • To cooperate, processes must somehow communicate

  3. Inter Process Communication (IPC) • Using files • Parent process creates 2 files before forking the child process • Child inherits file descriptors from the parent, and they share the file pointers • Can use one file for the parent to write and the child to read, and the other for the child to write and the parent to read • The OS also supports something called a pipe • Producer writes at one end (write-end) and consumer reads from the other end (read-end) • A pipe corresponds to 2 file descriptors (int fd[2]) • Reads from fd[0] return, in FIFO order, the data written to fd[1] (a pipe is unidirectional; use two pipes for two-way communication) • Used with fork – the parent process creates a pipe and uses it to communicate with a child process
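
A minimal sketch of this pipe-with-fork pattern (error handling omitted for brevity; the message text is only illustrative):

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int main(void) {
        int fd[2];
        char buf[64];

        pipe(fd);                      /* fd[0]: read end, fd[1]: write end */
        if (fork() == 0) {             /* child: reads from the pipe */
            close(fd[1]);              /* close the unused write end */
            ssize_t n = read(fd[0], buf, sizeof buf - 1);
            buf[n] = '\0';
            printf("child got: %s\n", buf);
            close(fd[0]);
            return 0;
        }
        close(fd[0]);                  /* parent: writes to the pipe */
        write(fd[1], "hello", 5);
        close(fd[1]);
        wait(NULL);                    /* reap the child */
        return 0;
    }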

  4. Other IPC Mechanisms • Processes could communicate through variables that are shared between them • Shared variables live in shared memory; other variables are private to a process • Special OS support lets a program specify objects that are to be placed in shared regions of the address space • System V shared memory – shmget, shmat (POSIX shared memory offers the analogous shm_open/mmap interface) • Processes could also communicate by sending and receiving messages to each other • Special OS support for these messages
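
A minimal sketch of a shared variable using the shmget/shmat calls named above (error handling omitted):

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <sys/wait.h>

    int main(void) {
        /* create an anonymous shared segment and attach it */
        int shmid = shmget(IPC_PRIVATE, sizeof(int), IPC_CREAT | 0600);
        int *X = shmat(shmid, NULL, 0);

        *X = 0;
        if (fork() == 0) {             /* child sees the same X */
            *X = 42;
            return 0;
        }
        wait(NULL);
        printf("X = %d\n", *X);        /* prints 42: the child's write is visible */

        shmdt(X);                      /* detach */
        shmctl(shmid, IPC_RMID, NULL); /* mark the segment for removal */
        return 0;
    }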

  5. More Ideas on IPC Mechanisms • Sometimes processes don’t need to communicate explicit values to cooperate • They might just have to synchronize their activities • Example: Process 1 reads 2 matrices, Process 2 multiplies them, Process 3 writes the result matrix • Process 2 should not start work until Process 1 finishes reading, etc. • Called process synchronization • Synchronization primitives • Examples: mutex lock, semaphore, barrier

  6. Programming With Shared Variables • Consider a 2-process program in which both processes increment a shared variable

    shared int X = 0;
    P1: X++;        P2: X++;

• Q: What is the value of X after this? • Complication: remember that X++ compiles into something like

    LOAD  R1, 0(R2)
    ADD   R1, R1, 1
    STORE 0(R2), R1

  7. Problem with using shared variables • Final value of X could be 1! • P1 loads X into R1 and increments R1 • P2 loads X into its register before P1 stores the new value into X • Net result: P1 stores 1, P2 stores 1 • Moral of the example: processes that interact using shared variables must be synchronized • The problem arises when 2 or more processes try to update a shared variable • Critical Section: the part of the program where a shared variable is accessed like this
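
A runnable illustration of this lost update, using two POSIX threads (threads are introduced later in these slides; compile with -pthread). On most machines the final value is visibly less than 2,000,000:

    #include <stdio.h>
    #include <pthread.h>

    int X = 0;                         /* shared, unsynchronized */

    void *inc(void *arg) {
        for (int i = 0; i < 1000000; i++)
            X++;                       /* load, add, store: not atomic */
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, inc, NULL);
        pthread_create(&t2, NULL, inc, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("X = %d (expected 2000000)\n", X);
        return 0;
    }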

  8. Critical Section Problem: Mutual Exclusion • Must synchronize processes so that they access shared variable one at a time in critical section; called Mutual Exclusion • Mutex Lock: a synchronization primitive • AcquireLock(L) • Done before critical section of code • Returns when safe for process to enter critical section • ReleaseLock(L) • Done after critical section • Allows another process to acquire lock

  9. Implementing a Lock

    int L = 0;            /* 0: lock available */

    AcquireLock(L):
        while (L == 1) ;  /* `BUSY WAITING' */
        L = 1;

    ReleaseLock(L):
        L = 0;

  10. Why this implementation fails Assume that lock L is currently available (L = 0) and that 2 processes, P1 and P2, try to acquire it. AcquireLock compiles to something like

    wait: LW   R1, Addr(L)    ; while (L == 1) ;
          BNEZ R1, wait
          ADDI R1, R0, 1      ; L = 1;
          SW   R1, Addr(L)

One possible interleaving over time:

    P1: LW reads 0                                (context switch)
    P2: LW reads 0; BNEZ falls through; ADDI; SW – P2 enters CS
                                                  (context switch)
    P1: BNEZ falls through; ADDI; SW – P1 enters CS

THE IMPLEMENTATION ALLOWS PROCESSES P1 AND P2 TO BE IN THE CRITICAL SECTION TOGETHER!

  11. Busy Wait Lock Implementation • Hardware support is useful for implementing a lock • Example: Test&Set instruction

    Test&Set(Lock):
        tmp  = Lock
        Lock = 1
        return tmp

where these 3 steps happen atomically or indivisibly, i.e., all 3 happen as one operation (with nothing happening in between) • An atomic Read-Modify-Write (RMW) instruction

  12. Busy Wait Lock with Test&Set

    AcquireLock(L):
        while (Test&Set(L)) ;

    ReleaseLock(L):
        L = 0;

• Consider the case where P1 is currently in its critical section and P2–P10 are executing AcquireLock: all are spinning in the while loop • When P1 releases the lock, by the definition of Test&Set exactly one of P2–P10 will read the new lock value of 0 and set L back to 1
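
Modern C expresses this same lock with the C11 atomic_flag type, whose test-and-set is guaranteed atomic; a minimal sketch:

    #include <stdatomic.h>

    atomic_flag L = ATOMIC_FLAG_INIT;

    void acquire_lock(void) {
        /* atomically: tmp = L; L = 1; keep spinning while tmp was 1 */
        while (atomic_flag_test_and_set(&L))
            ;                          /* busy wait */
    }

    void release_lock(void) {
        atomic_flag_clear(&L);         /* L = 0 */
    }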

  13. More on Locks • Other names for this kind of lock • Mutex • Spin wait lock • Busy wait lock • Can have locks where instead of busy waiting, an unsuccessful process gets blocked by the operating system

  14. Semaphore • A more general synchronization mechanism • Operations: P (wait) and V (signal) • P(S) • if S is nonzero, decrements S and returns • Else, suspends the process until S becomes nonzero, when the process is restarted • After restarting, decrements S and returns • V(S) • Increments S by 1 • If there are processes blocked for S, restarts exactly one of them

  15. Critical Section Problem & Semaphore • Semaphore S = 1; • Before critical section: P(S) • After critical section: V(S) • Semaphores can do more than mutex locks • Initialize S to 10 and 10 processes will be allowed to proceed • P1: read matrices; P2: multiply; P3: write product • Semaphores S1 = S2 = 0; • End of P1: V(S1); beginning of P2: P(S1); end of P2: V(S2); beginning of P3: P(S2)
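
A sketch of this read/multiply/write ordering using the POSIX semaphore calls that appear later in these slides (three threads stand in for the three processes; the function names are illustrative):

    #include <pthread.h>
    #include <semaphore.h>

    sem_t S1, S2;

    void *p1_read(void *a)     { /* ...read matrices... */ sem_post(&S1); return NULL; }
    void *p2_multiply(void *a) { sem_wait(&S1); /* ...multiply... */ sem_post(&S2); return NULL; }
    void *p3_write(void *a)    { sem_wait(&S2); /* ...write product... */ return NULL; }

    int main(void) {
        pthread_t t1, t2, t3;
        sem_init(&S1, 0, 0);           /* shared between threads, initial value 0 */
        sem_init(&S2, 0, 0);
        pthread_create(&t1, NULL, p1_read, NULL);
        pthread_create(&t2, NULL, p2_multiply, NULL);
        pthread_create(&t3, NULL, p3_write, NULL);
        pthread_join(t1, NULL); pthread_join(t2, NULL); pthread_join(t3, NULL);
        return 0;
    }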

  16. Deadlock Consider the following process: P1: lock(L); wait(L); • P1 is waiting for something (the release of a lock that it is itself holding) that will never happen • A simple case of a general problem called deadlock • Deadlock: a cycle of processes waiting for resources held by others while holding resources needed by others
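
The cyclic case is more common in practice; a minimal sketch with two pthread mutexes (lock and function names are illustrative), where each thread holds one lock and waits for the other:

    #include <pthread.h>

    pthread_mutex_t A = PTHREAD_MUTEX_INITIALIZER;
    pthread_mutex_t B = PTHREAD_MUTEX_INITIALIZER;

    void *t1(void *arg) {
        pthread_mutex_lock(&A);        /* holds A ... */
        pthread_mutex_lock(&B);        /* ... waits for B (held by t2) */
        pthread_mutex_unlock(&B);
        pthread_mutex_unlock(&A);
        return NULL;
    }

    void *t2(void *arg) {
        pthread_mutex_lock(&B);        /* holds B ... */
        pthread_mutex_lock(&A);        /* ... waits for A (held by t1) */
        pthread_mutex_unlock(&A);
        pthread_mutex_unlock(&B);
        return NULL;
    }

    int main(void) {
        pthread_t x, y;
        pthread_create(&x, NULL, t1, NULL);
        pthread_create(&y, NULL, t2, NULL);
        pthread_join(x, NULL);         /* may never return: deadlock */
        pthread_join(y, NULL);
        return 0;
    }

A standard fix is to make every thread acquire the locks in the same global order.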

  17. Classical Problems Producers-Consumers Problem • Bounded buffer problem • Producer process makes things and puts them into a fixed-size shared buffer • Consumer process takes things out of the shared buffer and uses them • Must ensure that the producer doesn't put into a full buffer and the consumer doesn't take out of an empty buffer • While treating buffer accesses as a critical section

  18. Producers-Consumers Problem

    shared Buffer[0 .. N-1]

    Producer: repeatedly
        Produce x
        if (buffer is full) wait for consumption
        Buffer[i++] = x
        signal consumer

    Consumer: repeatedly
        if (buffer is empty) wait for production
        y = Buffer[--i]
        signal producer
        Consume y
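
A runnable sketch of the bounded buffer following this pattern with POSIX primitives (a counting semaphore per condition plus a mutex around the buffer accesses; N and the loop counts are illustrative):

    #include <stdio.h>
    #include <pthread.h>
    #include <semaphore.h>

    #define N 8
    int buffer[N], in = 0, out = 0;
    sem_t empty_slots, full_slots;     /* counting semaphores */
    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

    void *producer(void *arg) {
        for (int x = 0; x < 100; x++) {
            sem_wait(&empty_slots);    /* wait if the buffer is full */
            pthread_mutex_lock(&m);    /* critical section */
            buffer[in] = x; in = (in + 1) % N;
            pthread_mutex_unlock(&m);
            sem_post(&full_slots);     /* signal the consumer */
        }
        return NULL;
    }

    void *consumer(void *arg) {
        for (int i = 0; i < 100; i++) {
            sem_wait(&full_slots);     /* wait if the buffer is empty */
            pthread_mutex_lock(&m);
            int y = buffer[out]; out = (out + 1) % N;
            pthread_mutex_unlock(&m);
            sem_post(&empty_slots);    /* signal the producer */
            printf("consumed %d\n", y);
        }
        return NULL;
    }

    int main(void) {
        pthread_t p, c;
        sem_init(&empty_slots, 0, N);  /* N free slots initially */
        sem_init(&full_slots, 0, 0);
        pthread_create(&p, NULL, producer, NULL);
        pthread_create(&c, NULL, consumer, NULL);
        pthread_join(p, NULL); pthread_join(c, NULL);
        return 0;
    }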

  19. Dining Philosophers Problem • N philosophers sit around a circular table with a plate of food in front of each and a fork between every 2 philosophers • Each philosopher repeatedly: Eat (using the 2 adjacent forks), then Think • Problem: avoid deadlock; be fair

  20. THREADS • Thread: a basic unit of CPU utilization; a thread of control within a process • `Light weight process' – the weight relates to • Time for creation • Time for context switch • Size of context • Recall the context of a process

  21. Threads and Processes • Thread context • Thread id • Stack • Stack pointer, PC, GPR values • So, thread context switching can be fast • Many threads in same process that share parts of process context • Virtual address space (other than stack) • So, threads in the same process share variables that are not stack allocated

  22. Threads and Sharing • Shares with other threads of a process – code section, data section, open files and signals

  23. Threads • Benefits – responsiveness, communication, parallelism and scalability • Types – user threads and kernel threads • Multithreading models • Many-to-one: efficient, but the entire process blocks if one thread makes a blocking system call • One-to-one: e.g., Linux. Allows parallelism, but heavier weight • Many-to-many: a balance between the above two schemes

  24. Thread Implementation • Could either be supported in the operating system or by a library • Pthreads: POSIX thread library – a standard for defining thread creation and synchronization • int pthread_create(pthread_t *thread, const pthread_attr_t *attr, void *(*start_routine)(void *), void *arg) • pthread_attr • pthread_join • pthread_exit • pthread_detach • Do "man -k pthread"
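
A minimal sketch of pthread_create and pthread_join (compile with -pthread; the worker function and thread count are illustrative):

    #include <stdio.h>
    #include <pthread.h>

    void *worker(void *arg) {
        int id = *(int *)arg;          /* the argument arrives as a void * */
        printf("hello from thread %d\n", id);
        return NULL;                   /* same as pthread_exit(NULL) */
    }

    int main(void) {
        pthread_t t[4];
        int id[4];
        for (int i = 0; i < 4; i++) {
            id[i] = i;
            pthread_create(&t[i], NULL, worker, &id[i]);
        }
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);  /* wait for each thread to finish */
        return 0;
    }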

  25. Synchronization Primitives • Mutex locks • int pthread_mutex_lock(pthread_mutex_t *mutex) • If the mutex is already locked, the calling thread blocks until the mutex becomes available • Returns with the mutex object referenced by mutex in the locked state, with the calling thread as its owner • pthread_mutex_unlock • Semaphores • sem_init • sem_wait • sem_post
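
A sketch that closes the loop on the X++ race from slide 6: guarding the increment with the mutex calls above makes the final count exact.

    #include <stdio.h>
    #include <pthread.h>

    int X = 0;
    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

    void *inc(void *arg) {
        for (int i = 0; i < 1000000; i++) {
            pthread_mutex_lock(&m);    /* enter critical section */
            X++;                       /* now effectively atomic */
            pthread_mutex_unlock(&m);  /* leave critical section */
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, inc, NULL);
        pthread_create(&t2, NULL, inc, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("X = %d\n", X);         /* always exactly 2000000 */
        return 0;
    }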

  26. Pthread scheduling • Process contention scope – scheduling user-level threads among a set of kernel threads. • System contention scope – scheduling kernel threads for CPU. • Functions for setting the scope - pthread_attr_setscope, pthread_attr_getscope • Can use PTHREAD_SCOPE_PROCESS for PCS and PTHREAD_SCOPE_SYSTEM for SCS

  27. Thread Safety • A function is thread safe if it always produces correct results when called repeatedly from multiple concurrent threads • Thread-unsafe functions: • Those that don't protect shared variables • Those that keep state across multiple invocations • Those that return a pointer to a static variable • Those that call thread-unsafe functions • Races • When the correctness of a program depends on one thread reaching a point x before another thread reaches a point y
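
An example of the third category: the C library's strtok keeps its scan position in hidden static state, so two threads tokenizing at once corrupt each other; the POSIX strtok_r makes that state an explicit, caller-owned parameter:

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        char s[] = "a,b,c";
        char *save;                    /* caller-owned scan state */

        /* strtok(s, ",") would be thread-unsafe (static internal state) */
        for (char *tok = strtok_r(s, ",", &save);
             tok != NULL;
             tok = strtok_r(NULL, ",", &save))
            printf("%s\n", tok);
        return 0;
    }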

  28. Parallel Architecture

  29. PARALLEL ARCHITECTURE • Parallel Machine: a computer system with more than one processor • Motivations • Faster execution time, exploiting non-dependencies between regions of code • Presents a level of modularity • Resource constraints, e.g. large databases • A certain class of algorithms lend themselves to parallelism • Aggregate bandwidth to memory/disk; increase in data throughput • Clock rate improvement in the past decade – ~40% per year • Memory access time improvement in the past decade – ~10% per year (hence the growing processor–memory gap)

  30. Classification of Architectures – Flynn's classification In terms of parallelism in the instruction and data streams • Single Instruction Single Data (SISD): serial computers • Single Instruction Multiple Data (SIMD) - vector processors and processor arrays - Examples: CM-2, Cray C90, Cray Y-MP, Hitachi 3600 Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

  31. Classification of Architectures – Flynn’s classification • Multiple Instruction Single Data (MISD): Not popular • Multiple Instruction Multiple Data (MIMD) - Most popular - IBM SP and most other supercomputers, clusters, computational Grids etc. Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

  32. Classification of Architectures – Based on Memory • Shared memory • 2 types – UMA and NUMA • NUMA examples: HP Exemplar, SGI Origin, Sequent NUMA-Q [Figure: UMA and NUMA shared-memory organizations] Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

  33. Classification 2: Shared Memory vs Message Passing • Shared memory machine: the n processors share a physical address space • Communication can be done through this shared memory • The alternative is sometimes referred to as a message passing machine or a distributed memory machine [Figure: message passing machine – processor/memory (P, M) pairs connected by an interconnect; shared memory machine – processors connected through an interconnect to a common main memory]

  34. Shared Memory Machines The shared memory could itself be distributed among the processor nodes • Each processor might have some portion of the shared physical address space that is physically close to it and therefore accessible in less time • Terms: NUMA vs UMA architecture • Non-Uniform Memory Access • Uniform Memory Access

  35. Classification of Architectures – Based on Memory • Distributed memory Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/ • More recently: multi-cores • Yet another classification – MPPs, NOW (Berkeley), COW, Computational Grids

  36. Parallel Architecture: Interconnection Networks • An interconnection network is defined by its switches, links and interfaces • Switches – provide mapping between input and output ports, buffering, routing etc. • Interfaces – connect nodes to the network

  37. Parallel Architecture: Interconnections • Indirect interconnects: nodes are connected to interconnection medium, not directly to each other • Shared bus, multiple bus, crossbar, MIN • Direct interconnects: nodes are connected directly to each other • Topology: linear, ring, star, mesh, torus, hypercube • Routing techniques: how the route taken by the message from source to destination is decided • Network topologies • Static – point-to-point communication links among processing nodes • Dynamic – Communication links are formed dynamically by switches

  38. Interconnection Networks • Static • Bus • Completely connected • Star • Linear array, Ring (1-D torus) • Mesh • k-d mesh: d dimensions with k nodes in each dimension • Hypercube – a d-dimensional mesh with 2 nodes per dimension (d = log p) • Trees – our campus network • Dynamic – communication links are formed dynamically by switches • Crossbar • Multistage • For more details, and evaluation of topologies, refer to the book by Grama et al.

  39. Indirect Interconnects [Figures: shared bus, multiple bus, 2x2 crossbar, crossbar switch, multistage interconnection network]

  40. Direct Interconnect Topologies [Figures: linear array, ring, star, 2D mesh, torus, hypercube (binary n-cube) for n = 2 and n = 3]

  41. Evaluating Interconnection topologies • Diameter – maximum distance between any two processing nodes • Fully connected – 1 • Star – 2 • Ring – p/2 • Hypercube – log p • Connectivity – multiplicity of paths between 2 nodes; the minimum number of arcs to be removed from the network to break it into two disconnected networks • Linear array – 1 • Ring – 2 • 2-D mesh – 2 • 2-D mesh with wraparound – 4 • d-dimensional hypercube – d

  42. Evaluating Interconnection topologies • Bisection width – minimum number of links to be removed from the network to partition it into 2 equal halves • Ring – 2 • p-node 2-D mesh – √p • Tree – 1 • Star – 1 • Completely connected – p²/4 • Hypercube – p/2

  43. Evaluating Interconnection topologies • Channel width – number of bits that can be communicated simultaneously over a link, i.e. the number of physical wires between 2 nodes • Channel rate – performance of a single physical wire • Channel bandwidth – channel rate times channel width • Bisection bandwidth – maximum volume of communication allowed between two halves of the network, i.e. bisection width times channel bandwidth

  44. Shared Memory Architecture: Caches [Figure: processors P1 and P2 with private caches over a shared memory holding X = 0] • P1 reads X – miss; X = 0 is loaded into P1's cache • P2 reads X – miss; X = 0 is loaded into P2's cache • P1 writes X = 1 – P1's copy becomes X = 1 • P2 reads X again – cache hit on the stale copy X = 0: wrong data!!

  45. Cache Coherence Problem • If each processor in a shared-memory multiprocessor has a data cache • Potential data consistency problem: the cache coherence problem • Shared variable modified in a private cache • Objective: processes shouldn't read `stale' data • Solutions • Hardware: cache coherence mechanisms

  46. Cache Coherence Protocols • Write update – propagate the modified cache line to all other processors on every write • Write invalidate – a write invalidates the copies in other caches; each processor re-fetches the updated line the next time it reads the data • Which is better?

  47. Invalidation Based Cache Coherence [Figure: the same two-processor example, now with an invalidate message] • P1 reads X – X = 0 cached • P2 reads X – X = 0 cached • P1 writes X = 1 – an Invalidate message removes the copy in P2's cache • P2 reads X again – miss; fetches the up-to-date X = 1

  48. Cache Coherence using invalidate protocols • 3 states associated with data items • Shared – the item is valid in 2 or more caches • Invalid – another processor (say P0) has updated the data item • Dirty – the state of the data item in P0, which holds the only up-to-date copy • Implementations • Snoopy • for bus-based architectures • a shared bus interconnect where all cache controllers monitor all bus activity • there is only one operation on the bus at a time; cache controllers can be built to take corrective action and enforce coherence in the caches • memory operations are propagated over the bus and snooped • Directory-based • instead of broadcasting memory operations to all processors, propagate coherence operations only to the relevant processors • a central directory maintains the states of cache blocks and their associated processors • implemented with presence bits
