CMPE 421 Parallel Computer Architecture



  1. CMPE 421 Parallel Computer Architecture Multi Processing

  2. Multi Processing • Goal: connect multiple computers to get higher performance • Multiprocessors: create powerful computers by connecting many existing smaller ones • Scalability: H/W and S/W are designed to be sold with a variable number of processors • Availability: if one processor fails, the others should continue • Power efficiency • Job-level (process-level) parallelism (threads): high throughput for independent jobs • Parallel processing program: a single program run on multiple processors • Multi-core microprocessors: chips with multiple processors (cores) • Clusters: a set of computers connected over a local area network that functions as a single large multiprocessor

  3. Hardware and Software • Hardware • Serial: e.g., Pentium 4 • Parallel: e.g., Core 2 Duo, quad-core, Xeon • Software • Sequential: e.g., matrix multiplication • Concurrent: e.g., operating system • Sequential/concurrent software can run on serial/parallel hardware • Challenge: making effective use of parallel hardware • Examples • Databases • File servers • Computer-aided design packages • Multiprocessing operating systems • Google's PC clusters

  4. Parallel Programming • Parallel software is the problem • Need to get significant performance improvement • Otherwise, just use a faster uniprocessor, since it’s easier! • Difficulties • Partitioning (loops) • Coordination • Communications overhead

  5. Scaling Example • Workload: sum of 10 scalars, and 10 × 10 matrix sum • Speed up from 10 to 100 processors • Single processor: Time = (10 + 100) × tadd • 10 processors • Time = 10 × tadd + 100/10 × tadd = 20 × tadd • Speedup = 110/20 = 5.5 • 100 processors • Time = 10 × tadd + 100/100 × tadd = 11 × tadd • Speedup = 110/11 = 10 • Assumes load can be balanced across processors

  6. Scaling Example (cont) • What if matrix size is 100 × 100? • Single processor: Time = (10 + 10000) × tadd • 10 processors • Time = 10 × tadd + 10000/10 × tadd = 1010 × tadd • Speedup = 10010/1010 = 9.9 • 100 processors • Time = 10 × tadd + 10000/100 × tadd = 110 × tadd • Speedup = 10010/110 = 91 • Assuming load balanced

  7. Strong vs Weak Scaling • Strong scaling: problem size fixed • As in the previous examples • Weak scaling: problem size proportional to the number of processors • 10 processors, 10 × 10 matrix • Time = 20 × tadd • 100 processors, 32 × 32 matrix (1024 ≈ 1000 elements) • Time = 10 × tadd + 1000/100 × tadd ≈ 20 × tadd • Constant performance in this example (see the sketch below)
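The arithmetic on slides 5-7 can be checked with a few lines of code. The sketch below is my own illustration, not part of the lecture: it assumes a perfectly balanced load, measures time in units of tadd, and uses the matrix sizes and processor counts from the slides (the 32 × 32 weak-scaling case has 1024 elements, which the slide rounds to 1000).

    /* Speedup calculator for the "10 scalars + matrix sum" workload.
       Assumption: the 10 scalar additions stay serial, the matrix
       additions are split evenly across p processors. */
    #include <stdio.h>

    static double time_units(int matrix_elems, int p)
    {
        return 10.0 + (double)matrix_elems / p;   /* time in units of tadd */
    }

    int main(void)
    {
        /* Strong scaling: problem size fixed (slides 5 and 6). */
        printf("10x10,   p=10 : speedup = %.1f\n", time_units(100, 1) / time_units(100, 10));
        printf("10x10,   p=100: speedup = %.1f\n", time_units(100, 1) / time_units(100, 100));
        printf("100x100, p=10 : speedup = %.1f\n", time_units(10000, 1) / time_units(10000, 10));
        printf("100x100, p=100: speedup = %.1f\n", time_units(10000, 1) / time_units(10000, 100));

        /* Weak scaling: the matrix grows with p (slide 7). */
        printf("weak, p=10  (10x10): time = %.0f tadd\n", time_units(100, 10));
        printf("weak, p=100 (32x32): time = %.0f tadd\n", time_units(1024, 100));
        return 0;
    }

Running it reproduces the speedups 5.5, 10, 9.9, and 91, and a roughly constant 20 × tadd in both weak-scaling cases.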

  8. Problems that drive the design of multiprocessors and clusters • How do parallel processors share data? • How do parallel processors coordinate when operating on shared data? • How many processors?

  9. How do parallel processors share data? • Shared-memory processors • A single address space that all processors share • Since the processors communicate through shared variables in memory, accessing any memory location is performed via loads and stores • Processors and memory are typically connected by a single bus

  10. How do parallel processors coordinate? • The difficulty of coordination among processors: one processor could start working on data before another is finished with it. This coordination is called synchronization • When sharing is supported with a single address space, there must be a separate mechanism for synchronization • Write-back caches are used to keep bus traffic at a minimum • Caches are used to reduce latency and to lower bus traffic • Hardware must be provided to ensure that caches and memory are consistent (cache coherency); we will discuss this later • One approach uses a lock: only one processor at a time can acquire the lock, and other processors interested in the shared data must wait until the original processor unlocks the variable (see the sketch below)
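As a concrete illustration of the lock approach above, here is a minimal sketch (my own, not from the slides) in which a POSIX mutex serializes updates to a shared counter; the two threads and the iteration count are arbitrary choices for the example.

    #include <pthread.h>
    #include <stdio.h>

    static long shared_counter = 0;                          /* the shared data */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);    /* acquire the lock: others must wait */
            shared_counter++;             /* critical section on shared data */
            pthread_mutex_unlock(&lock);  /* release so another thread may enter */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", shared_counter);  /* 200000 with the lock held */
        return 0;
    }

Without the lock the two increments could interleave and lose updates; with it, only one thread at a time operates on the shared variable.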

  11. Types of Memory Access • UMA (uniform memory access) – SMP (symmetric multiprocessor) • All accesses to main memory take the same amount of time, no matter which processor makes the request or which location is requested • NUMA (nonuniform memory access) • Some main memory accesses are faster than others, depending on the processor making the request and which location is requested • Can scale to larger sizes than UMA, so NUMA machines are potentially higher performance

  12. Example: Sum Reduction • Suppose we have a single-bus UMA multiprocessor with 100 processors. Write a parallel processing program to calculate the sum of 100,000 numbers. • Each processor has an ID: 0 ≤ Pn ≤ 99 • Partition: 1000 numbers per processor • Initial summation on each processor (vectors A and sum are shared variables, Pn is the processor's number, i is a private variable):
    sum[Pn] = 0;
    for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
      sum[Pn] = sum[Pn] + A[i];
• Now we need to add these partial sums • Reduction: divide and conquer • Half the processors add pairs, then a quarter, … • Need to synchronize between reduction steps

  13. Example: Sum Reduction • Processors start by running the loop that sums their subset of vector A (vectors A and sum are shared variables, Pn is the processor's number, i is a private variable) • The processors then coordinate in adding together the partial sums (half is a private variable initialized to 100, the number of processors):
    half = 100;
    repeat
      synch();
      if (half%2 != 0 && Pn == 0)
        sum[0] = sum[0] + sum[half-1];
        /* Conditional sum needed when half is odd; Processor0 gets missing element */
      half = half/2; /* dividing line on who sums */
      if (Pn < half) sum[Pn] = sum[Pn] + sum[Pn+half];
    until (half == 1);
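The two phases above can be turned into a runnable shared-memory program. The sketch below is an illustration only, under assumed values (NPROC = 4 threads, N = 100,000 numbers all set to 1.0); it uses POSIX threads, with a pthread barrier standing in for synch().

    #include <pthread.h>
    #include <stdio.h>

    #define NPROC 4                     /* number of "processors" (threads) */
    #define N     100000                /* how many numbers to sum */

    static double A[N];
    static double sum[NPROC];           /* shared partial sums */
    static pthread_barrier_t barrier;   /* plays the role of synch() */

    static void *worker(void *arg)
    {
        int Pn = (int)(long)arg;
        int chunk = N / NPROC;

        /* Phase 1: each thread sums its own subset of A. */
        sum[Pn] = 0.0;
        for (int i = chunk * Pn; i < chunk * (Pn + 1); i++)
            sum[Pn] += A[i];

        /* Phase 2: tree reduction, mirroring the repeat/until loop above. */
        int half = NPROC;
        do {
            pthread_barrier_wait(&barrier);    /* synch() */
            if (half % 2 != 0 && Pn == 0)      /* odd count: P0 picks up the extra */
                sum[0] += sum[half - 1];
            half = half / 2;                   /* dividing line on who sums */
            if (Pn < half)
                sum[Pn] += sum[Pn + half];
        } while (half != 1);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NPROC];
        for (int i = 0; i < N; i++) A[i] = 1.0;   /* expected total: 100000 */

        pthread_barrier_init(&barrier, NULL, NPROC);
        for (long p = 0; p < NPROC; p++)
            pthread_create(&t[p], NULL, worker, (void *)p);
        for (int p = 0; p < NPROC; p++)
            pthread_join(t[p], NULL);
        printf("sum = %.0f\n", sum[0]);           /* final result lands in sum[0] */
        return 0;
    }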

  14. An Example with 10 Processors (figure: the reduction tree; the partial sums sum[P0] … sum[P9] held by processors P0–P9 are combined in successive steps as half goes from 10 to 5 to 2 to 1)

  15. Alternative Model (Message Passing) • Distributed-memory multiprocessors • Each processor has a private physical address space • Hardware sends/receives messages between processors • Example: clusters (of desktop computers) • A processor knows when a message is sent • A processor knows when a message arrives; sometimes the receiving processor is asked to confirm receipt • Network of independent computers • Each has private memory and its own OS • Connected using the I/O system • E.g., Ethernet/switch, Internet • Suitable for applications with independent tasks • Web servers, databases, simulations, …

  16. Sum Reduction (Again) • Suppose we have a network-connected multiprocessor of 100 processors. Write a parallel processing program to calculate the sum of 100,000 numbers. • Divide the 100,000 numbers into 100 subsets of 1000 numbers each; each subset is summed by an individual processor (the different subsets are copied to the different private memories, and all processors have the same copy of the program) • Then add the partial sums together in log2(100) ≈ 7 steps • First do the partial sums:
    sum = 0;
    for (i = 0; i < 1000; i = i + 1)
      sum = sum + AN[i];
• Reduction • Half the processors send, the other half receive and add • Then a quarter send, a quarter receive and add, …

  17. Sum Reduction (Again) • Communication and coordination between processors happen through message passing • Given send() and receive() operations:
    limit = 100; half = 100; /* 100 processors */
    repeat
      half = (half+1)/2; /* send vs. receive dividing line */
      if (Pn >= half && Pn < limit) send(Pn - half, sum);
      if (Pn < (limit/2)) sum = sum + receive();
      limit = half; /* upper limit of senders */
    until (half == 1); /* exit with final sum */
• Send/receive also provide synchronization • Assumes send/receive take a similar time to an addition
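For comparison, here is a runnable message-passing version of the same loop. It is a sketch only, assuming an MPI installation, and uses MPI_Send/MPI_Recv in place of the slides' send() and receive(); each process sums 1000 locally initialized values, so the expected total is 1000 × (number of processes).

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int Pn, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &Pn);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Each process sums its own private subset (here: 1000 copies of 1.0). */
        double sum = 0.0;
        for (int i = 0; i < 1000; i = i + 1)
            sum = sum + 1.0;

        /* Tree reduction with explicit send/receive, following the loop above. */
        int limit = nprocs, half = nprocs;
        do {
            half = (half + 1) / 2;              /* send vs. receive dividing line */
            if (Pn >= half && Pn < limit)
                MPI_Send(&sum, 1, MPI_DOUBLE, Pn - half, 0, MPI_COMM_WORLD);
            if (Pn < limit / 2) {               /* receivers add the incoming partial sum */
                double partial;
                MPI_Recv(&partial, 1, MPI_DOUBLE, Pn + half, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                sum = sum + partial;
            }
            limit = half;                       /* upper limit of senders */
        } while (half != 1);

        if (Pn == 0)
            printf("total = %.0f\n", sum);      /* final sum ends up on process 0 */
        MPI_Finalize();
        return 0;
    }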

  18. Cache Coherence • Shared-memory multiprocessors have a cache coherence problem • Multiple copies of the same memory data can exist in different caches simultaneously • An update in one cache will cause an inconsistent view of memory • Solutions: • Software oriented: compilers identify the data items that may become inconsistent across caches and instruct the hardware not to cache them • Hardware oriented: • Directory protocol: the directory acts as a filter through which a processor must ask permission to load an entry from primary memory into its cache; when an entry is changed, the directory either updates or invalidates the other caches holding that entry • Snoopy protocol: the individual caches monitor the address lines for accesses to memory locations that they have cached; when a write is observed to a location that a cache has a copy of, the cache controller invalidates its own copy of the snooped location • Snoopy protocols come in two variants: write-update and write-invalidate

  19. Cache Coherence • Example: P1 reads A from shared memory (A=5); P2 reads A from shared memory (A=5); P1 modifies A to 10 (A=10 in cache 1, but A=5 in cache 2 and in memory): inconsistent • We cannot apply a write-through policy for every update • Why? P1 would have to tell P2 and memory that A=10 on every write, which would be too slow and too costly in terms of memory accesses and bus traffic • We need cache coherence protocols to solve this problem • Multiple copies are not a problem when reading; they become a problem when writing

  20. Snooping • Cache controllers monitor (snoop) the shared bus to determine whether or not they have a copy of the shared block that is being updated (written) • Shared reads are no problem • So snoopy cache coherence protocols must find all caches that share an object about to be written • When there is a write to a shared data block, all other copies of that block can be updated or invalidated (how will be explained later) • In snoopy caches the bus is a broadcast medium: every cache listens to all invalidates and read requests and performs the appropriate coherence operations locally • Tags are duplicated to reduce the demands of snooping on the cache

  21. Read Misses • If the request is a read that missed (the value could not be found in the local cache), then all other caches check whether they have a copy of the requested block; if one does, it supplies the data to the requesting processor.

  22. Handling Writes • Ensuring that all other processors sharing the data are informed of writes can be handled in two ways: • Write-update (write-broadcast): the writing processor broadcasts the new data over the bus and all copies are updated • All writes go to the bus → higher bus traffic • Since new values appear in caches sooner, it can reduce latency • Write-invalidate (other "owners" are told their copies are no longer valid): the writing processor issues an invalidation signal on the bus; snooping caches check whether they have a copy of the data and, if so, invalidate the cache block containing the word (this allows multiple readers but only one writer) • Uses the bus only on the first write → lower bus traffic, so better use of bus bandwidth

  23. An Example of a Cache Coherence (CC) Protocol • Finite state transition diagram for a write-invalidate protocol based on a write-back policy. Each cache block is in one of three states: 1. Shared (read only): this cache block is clean (not written) and may be shared. 2. Modified (read/write): this cache block is dirty (written) and may not be shared. 3. Invalid: this cache block does not hold valid data.

  24. A Write-Invalidate CC Protocol (figure: finite state transition diagram over the states Invalid, Shared (clean), and Modified (dirty); processor reads and writes (hits and misses, with invalidates sent on writes) and bus events (an invalidate received because another processor wrote this block, or another processor's read or write miss forcing a write-back of the old block) move a block between states; the plain write-back caching protocol is drawn in black, processor-side coherence additions in red, and bus-side coherence additions in blue)

  25. CC Protocol • A read miss causes the cache to acquire the bus and write back the victim block (if it was in the Modified (dirty) state). All the other caches monitor the read miss to see whether the block is in their cache. If one has a copy and it is in the Modified state, the block is written back and its state is changed to Invalid. The read miss is then satisfied by reading the block from memory, and the state of the block is set to Shared. • Read hits do not change the cache state.

  26. CC Protocol • A write miss to an Invalid block causes the cache to acquire the bus, read the block, modify the portion of the block being written, and change the block's state to Modified. A write miss to a Shared block causes the cache to acquire the bus, send an invalidate signal to invalidate any other existing copies in other caches, modify the portion of the block being written, and change the block's state to Modified. • A write hit to a Modified block causes no action. A write hit to a Shared block causes the cache to acquire the bus, send an invalidate signal to invalidate any other existing copies in other caches, modify the part of the block being written, and change the state to Modified.
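The transitions described on slides 23-26 can be summarized in a small state-transition function. The sketch below is purely a software model for illustration (the event names and the demo in main are my own); in a real machine this logic lives in the cache controller hardware.

    #include <stdio.h>

    typedef enum { INVALID, SHARED, MODIFIED } BlockState;
    typedef enum { PROC_READ, PROC_WRITE, BUS_READ_MISS, BUS_INVALIDATE } Event;

    /* Next state of one cache block for one event, per the protocol above. */
    static BlockState next_state(BlockState s, Event e)
    {
        switch (e) {
        case PROC_READ:          /* a read hit keeps the state; a read miss loads
                                    the block from memory in the Shared state */
            return (s == INVALID) ? SHARED : s;
        case PROC_WRITE:         /* a write hit or miss ends in Modified (an
                                    invalidate goes on the bus unless already M) */
            return MODIFIED;
        case BUS_READ_MISS:      /* another cache misses on this block: a dirty
                                    copy is written back and dropped to Invalid */
            return (s == MODIFIED) ? INVALID : s;
        case BUS_INVALIDATE:     /* another processor writes this block */
            return INVALID;
        }
        return s;
    }

    int main(void)
    {
        BlockState s = INVALID;
        s = next_state(s, PROC_READ);       /* Invalid  -> Shared   */
        s = next_state(s, PROC_WRITE);      /* Shared   -> Modified */
        s = next_state(s, BUS_INVALIDATE);  /* Modified -> Invalid  */
        printf("final state = %d\n", s);    /* prints 0 (INVALID)   */
        return 0;
    }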

  27. Write-Invalidate CC Examples • I = invalid (many copies may exist), S = shared (many), M = modified (only one) • (figure: step-by-step bus scenarios for a block A cached by Proc 1 and Proc 2 with main memory. On a read miss, the read request goes on the bus; a snooping cache either lets main memory supply A, or first writes A back to main memory if its copy is Modified, and the requester loads A in state S. On a write miss, the writer gets A, writes it, changes its state to M, and sends an invalidate for A, so the other copy changes its state to I.)

  28. Grid Computing • Separate computers interconnected by long-haul networks • E.g., Internet connections • Work units farmed out, results sent back • Can make use of idle time on PCs • E.g., SETI@home, World Community Grid

  29. Multithreading • Performing multiple threads of execution in parallel • Replicate registers, PC, etc. • Fast switching between threads • Fine-grain multithreading • Switch threads after each cycle • Interleave instruction execution • If one thread stalls, others are executed • Coarse-grain multithreading • Only switch on a long stall (e.g., L2-cache miss) • Simplifies hardware, but doesn't hide short stalls (e.g., data hazards)

  30. Simultaneous Multithreading • In a multiple-issue, dynamically scheduled processor • Schedule instructions from multiple threads • Instructions from independent threads execute whenever functional units are available • Within threads, dependencies are handled by scheduling and register renaming • Example: Intel Pentium 4 HT • Two threads: duplicated registers, shared functional units and caches

  31. Multithreading Example

  32. Future of Multithreading • Will it survive? In what form? • Power considerations → simplified microarchitectures • Simpler forms of multithreading • Tolerating cache-miss latency • Thread switch may be most effective • Multiple simple cores might share resources more effectively

  33. Instruction and Data Streams • An alternate classification • SPMD: Single Program Multiple Data • A parallel program on a MIMD computer • Conditional code for different processors
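A minimal SPMD sketch (an illustration, assuming an MPI installation): every process runs this one program, and conditional code on the process rank gives different processors different work.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0)
            printf("rank 0: coordinating\n");    /* one branch for the coordinator */
        else
            printf("rank %d: working\n", rank);  /* another branch for the workers */

        MPI_Finalize();
        return 0;
    }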
