Multiprocessing and Parallel Processing Chapter 9

Multiprocessing and Parallel Processing Chapter 9 N. Guydosh 5/4/04

Parallel Processing - Introduction • We already looked at parallel processing when we studied pipelining. • Uniprocessors are running out of gas, even with pipelining (“fine granularity parallelism”) and cache. • Technology, the speed of light, and quantum mechanics are now the limit. • It was Rear Admiral Grace Hopper who held up a piece of wire about a foot long (11.8”) to demonstrate what a nanosecond was (light travels about 11.8” in a nanosec; a pulse in real wire is somewhat slower) – BTW she was a pioneer in computers and a key figure in the development of COBOL – see: http://www.sdsc.edu/ScienceWomen/hopper.html http://www.chinfo.navy.mil/navpalib/ships/destroyers/hopper/hoprcom.html • The next wave of performance enhancement is parallel processing at the system level. • At the programming level it is called cooperative processing. • The theoretical limit of performance of n processors working in parallel is a speedup of n over a single processor. • Besides a speedup factor, parallel processing has the potential of being more reliable: if one node crashes, processing continues at a slightly lower performance.

Parallel Processing – Introduction (cont.) • Some questions to be investigated when considering parallel processing: • Cooperative processing vs. independent jobs on each node • Independent jobs would be like multiprogramming, except that each job/thread gets its own processor rather than merely a memory partition and a share of one processor. • For cooperative processing there are two options: • Every node has the same program and the data is distributed among the nodes: good for simple processing of massive data loads. • A single program is distributed over all the nodes along with distributed data: true distributed processing, and the most difficult to achieve. A good example is parallel logic simulation. • How are the processors connected? • A common bus with common memory, but independent caches (SMP). • An interconnection of general purpose computers, the connection media being a hardware switch or a LAN. • Processor communication: shared memory vs. message passing • Load balancing: get maximum utilization for all processors. • …

Flynn’s Classification – A Quick Comparison • SISD - Single Instruction, Single Data stream • Basic uniprocessor with a single program counter • SIMD - Single Instruction, Multiple Data stream • A logically single stream of instructions operating on different units of data in parallel, e.g. a vector processor • Example of an implementation: a single stream of SIMD instructions from a single program counter in a special SISD host processor is broadcast to many parallel SIMD processors, each with its own registers and cache memory. Each of the SIMD processors then executes the same instruction on a different unit of data, in parallel and in lock step. Example: the CM-2 “Super Computer” with 65,536 processors, each having a 1 bit ALU (32 way bit slicing?) • MISD - Multiple Instruction, Single Data stream: a single sequence of data is broadcast to different parallel processors, each executing a different instruction sequence. • Not implemented in practice. • MIMD - Multiple Instruction, Multiple Data stream: many parallel processors executing different instruction streams on different data items. • Commonly implemented with “loosely coupled” clusters of general purpose computers on a network (see later) and also with tightly coupled SMP.
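The lock-step SIMD idea above can be sketched in a few lines of Python. This is an illustration only, not a model of the CM-2: the function `simd_execute`, the `OPS` table, and the two-instruction "ISA" are hypothetical names made up for the example, and an ordinary list comprehension stands in for the parallel processing elements.

```python
import operator

# Hypothetical two-instruction ISA, invented for this illustration.
OPS = {"add": operator.add, "mul": operator.mul}

def simd_execute(program, lanes):
    """Run one instruction stream over many data elements in lock step.

    program: list of (opcode, operand) pairs issued by the single host.
    lanes:   one data value per processing element (PE).
    """
    lanes = list(lanes)
    for opcode, operand in program:      # one program counter, in the host
        op = OPS[opcode]
        # Every PE applies the same broadcast instruction to its own datum.
        lanes = [op(value, operand) for value in lanes]
    return lanes

result = simd_execute([("mul", 2), ("add", 1)], [1, 2, 3, 4])
print(result)  # [3, 5, 7, 9]
```

Note that there is only one loop over instructions: the data elements never get their own control flow, which is what separates SIMD from MIMD.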

Parallel Systems – The Big Picture From Stallings, “Operating Systems”, 4th ed.

MIMD – A Closer Look • The most versatile and highest performing of all the configurations • An interconnection of conceptually “standalone” computers ... sometimes called clusters • Each node can run an independent instruction stream on its own data. • In principle, it is possible to interconnect existing commercial PCs and workstations into an MIMD configuration via a connection media. • Nodes in an MIMD are typically SISD machines. • Nodes can even be heterogeneous: they need only satisfy the network/interconnect interface. • MIMDs are generally scalable: you can grow them from two processors to many processors. • Implication: MIMDs have a high degree of “RAS”: Reliable, Available, and Serviceable. If a node drops out, reconfigure without it, and the MIMD cluster still runs (maybe slightly slower, but it does not die).

MIMD – A Closer Look (cont.) • In a multitasking environment within a node, you can allow a node to do ordinary “office” work while it still participates in an MIMD job such as a galactic collision simulation (the many body problem) or a weather simulation. • Inter-node communication and synchronization are via message passing. • Locally available clusters reside at the Cornell University Theory Center (Supercomputing Center), run by IBM and used by scientists world wide. The interconnection mechanism is a massive hierarchical cross bar switch: point to point, with many-to-many simultaneous communication paths, as distinct from a LAN. The logical interface may be the same as a LAN.

MIMD – Possible Programming Schemes • MIMD is ideally suited for distributed and/or cooperative processing. • Simplest is to distribute the data: you can still do a SIMD-style job on MIMD. In order to process a massive volume of data, distribute the data over all the nodes and let the nodes independently chew up the data. This is not true SIMD, because it is not lock step, and the computed results of each node have to be collected together into a single result, perhaps by a master node. • Coolest: distributed or cooperating algorithms: • Instead of distributing data, partition the algorithm into program loads for each node; this may be very hairy. • Programs now run independently but still need periodic or occasional synchronization. • Like pipelining, there are data and logical hazards: node 10 needs data being generated by node 20, but node 20 has a heavier work load or is a slower machine and is late with the data ==> node 10 then blocks and waits for it. • Even cooler: time warp. Node 10 guesses at the data from node 20 and runs with it; later it checks for correctness, and if the guess was wrong it “rolls back time” and redoes the calculation with the correct data. • A popular application of MIMD is simulation and modeling in the physical sciences.
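The "distribute the data, then collect at a master" scheme above can be sketched as follows. This is a minimal sketch under stated assumptions: threads from Python's standard `concurrent.futures` module stand in for cluster nodes, and `node_job` and `distribute_and_collect` are hypothetical names chosen for the example. The per-node job (a sum of squares) is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor

def node_job(chunk):
    """Work done independently on one node: here, a sum of squares."""
    return sum(x * x for x in chunk)

def distribute_and_collect(data, num_nodes=4):
    # Master: split the data into one contiguous chunk per node.
    size = max(1, (len(data) + num_nodes - 1) // num_nodes)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # Nodes run independently (no lock step); threads stand in for nodes.
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = list(pool.map(node_job, chunks))
    # Master: combine the per-node results into a single answer.
    return sum(partials)

total = distribute_and_collect(list(range(1, 101)))
print(total)  # 338350
```

Every node runs the same program, so nothing in the algorithm itself had to be partitioned; only the final collection step needs coordination.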

Some Basic Problems With MIMD • How do processors (nodes) share data? • How do we coordinate the processing of the nodes? • Must cope with overhead in global controls, like a large committee working on a single task. • Partitioning algorithms is far from easy: it is a research topic. • Some algorithms are easier to parallelize than others, and theoretical limits exist. • Some algorithms are intrinsically serial, and others are intrinsically parallel. Real world algorithms are a mixture of both. • Load balancing is a problem.

Speedup and Amdahl’s Law

Speedup and Amdahl’s Law

Speedup and Amdahl’s Law (cont.)
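The content of these slides did not survive, so what follows is the standard textbook statement of Amdahl's Law rather than a reconstruction of the slides: if a fraction f of the work can be spread over n processors and the rest is serial, then speedup = 1 / ((1 - f) + f / n). The function name `amdahl_speedup` and the 90% figure are assumptions picked for illustration.

```python
def amdahl_speedup(parallel_fraction, n):
    """Speedup on n processors when only parallel_fraction of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n)

# With 90% of the work parallelizable, adding processors helps less and less.
for n in (1, 10, 100, 1000):
    print(n, round(amdahl_speedup(0.9, n), 2))
# 1 1.0
# 10 5.26
# 100 9.17
# 1000 9.91
```

The serial 10% caps the speedup at 10 no matter how many processors are added, which is why the ideal speedup of n is only a theoretical limit.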

Example- Parallel Addition, pp. 716-717

Example- Parallel Addition, pp. 716-717 (cont.)

TWO BASIC APPROACHES TO INTERCONNECTING MIMD PROCESSORS • Network/switch connected (see later) • Single (common) bus connection • Each processor has its own private cache connected to a bus to which the main memory is also connected. • Traffic per processor and the bus bandwidth determine the useful number of processors. • Key problem is cache coherency: keeping each processor cache up to date when other processors change main memory. • Cache-coherence protocols are needed.

Parallel Programming- Single Bus Fig. 9.2

Parallel Programming- Single Bus • See example page 719 • What started out as a simple minded problem of adding a long column of numbers ends up conceptually hairy when done on 10 processors. • Split the list up into 10 parts. • Each processor writes its entry in a new list of 10 partial sums. • Use divide and conquer to add the partial sums recursively. • Must use “barrier synchronization” to make sure that the partial sums a given processor is adding are up to date … echoes of semaphores and locks! • Because of the greater complexity and the synchronization required, parallel programming is significantly more difficult. • It is easy to introduce problems not related to the original logical problem (in this example, addition) which could produce incorrect results.
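The steps above can be sketched with Python threads standing in for the 10 processors. This is a sketch in the spirit of the textbook example, not a copy of it: `parallel_sum`, `partial`, and `worker` are names chosen here, and `threading.Barrier` from the standard library plays the role of the barrier synchronization.

```python
import threading

def parallel_sum(values, num_procs=10):
    partial = [0] * num_procs                  # shared list of partial sums
    barrier = threading.Barrier(num_procs)
    chunk = (len(values) + num_procs - 1) // num_procs

    def worker(pn):
        # Phase 1: each "processor" sums its own slice of the list.
        partial[pn] = sum(values[pn * chunk:(pn + 1) * chunk])
        # Phase 2: divide and conquer, halving the active processors each step.
        half = num_procs
        while half > 1:
            barrier.wait()                     # partial sums must be up to date
            if half % 2 == 1 and pn == 0:
                partial[0] += partial[half - 1]    # pick up the odd one out
            half //= 2
            if pn < half:
                partial[pn] += partial[pn + half]

    threads = [threading.Thread(target=worker, args=(pn,))
               for pn in range(num_procs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return partial[0]

print(parallel_sum(list(range(1, 101))))  # 5050
```

Removing the `barrier.wait()` call leaves a program that usually still prints the right answer but can read a partial sum before it is complete, which is exactly the kind of bug that has nothing to do with addition itself.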

Cache Coherency Problem- Single Bus • Snooping (see p. 720) • Processors monitor the bus to see if a write is to data that is shared with another processor. Modifying processors broadcast addresses (tags) and data. Once a block’s associativity set is identified, only the tag is needed to identify the block, using a “hardware” search. • On a write: all processors check to see if they have a copy of the modified block, and then either invalidate it or update it. • On a read miss: all processors check to see if they have a copy of the requested data, and possibly supply the data to the cache that missed. • To enable efficient snooping, address tags in caches are duplicated and made available through an independent port, so the processor and the snooper can access a cache simultaneously as long as the references are not to the same block (set). If they are, a stall results.

Cache Coherency Problem- Single Bus • Snooping protocols, two types: • Write-invalidate (uses the bus only on the 1st write): • The writing processor causes all copies in other caches to be invalidated. • It issues an invalidate signal over the bus (along with tag?); all processors check to see if they have a copy, and if so invalidate it. • Allows multiple readers but only a single writer. • Write-update (uses the bus on every write): • Rather than invalidating every shared block, the writer broadcasts the new data over the bus (block and tag?), and all processors with copies update them. • Write-update is like write-through, but now to other caches.
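A toy simulation can make the difference in bus traffic concrete. This is a deliberately simplified sketch, not a real protocol implementation: the `Bus` and `Cache` classes and the `demo` function are hypothetical, each "block" is a single value, an "exclusive" flag is the only state kept per line, and a write-back on a snooped read is assumed for the write-invalidate case.

```python
class Bus:
    def __init__(self, memory, policy):
        self.memory = memory          # dict: address -> value
        self.policy = policy          # "invalidate" or "update"
        self.caches = []
        self.transactions = 0         # count of bus uses

class Cache:
    def __init__(self, bus):
        self.bus = bus
        self.lines = {}               # address -> [value, exclusive?]
        bus.caches.append(self)

    def others(self):
        return [c for c in self.bus.caches if c is not self]

    def read(self, addr):
        if addr not in self.lines:                 # read miss: go to the bus
            self.bus.transactions += 1
            for other in self.others():            # snoop: owner writes back
                line = other.lines.get(addr)
                if line and line[1]:
                    self.bus.memory[addr] = line[0]
                    line[1] = False
            self.lines[addr] = [self.bus.memory[addr], False]
        return self.lines[addr][0]

    def write(self, addr, value):
        line = self.lines.get(addr)
        if self.bus.policy == "invalidate":
            if not (line and line[1]):             # first write: use the bus
                self.bus.transactions += 1
                for other in self.others():
                    other.lines.pop(addr, None)    # invalidate their copies
            self.lines[addr] = [value, True]       # now the single writer
        else:                                      # write-update
            self.bus.transactions += 1             # every write uses the bus
            self.bus.memory[addr] = value
            for other in self.others():
                if addr in other.lines:
                    other.lines[addr][0] = value   # refresh their copies
            self.lines[addr] = [value, False]

def demo(policy):
    bus = Bus({0x10: 1}, policy)
    a, b = Cache(bus), Cache(bus)
    a.read(0x10)
    b.read(0x10)
    for v in (2, 3, 4):
        a.write(0x10, v)
    return b.read(0x10), bus.transactions

print(demo("invalidate"))  # (4, 4)
print(demo("update"))      # (4, 5)
```

Both policies let the second cache see the final value 4. In this run write-invalidate uses the bus once for the three writes and once more for the later read miss, while write-update uses the bus on all three writes but the later read hits in the cache.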

Cache Coherency Problem- Single Bus Fig. 9.4

Synchronization in a Single Bus Multiprocessor • In a cooperative processing situation we must coordinate access to shared data when the cooperating processes are running on different processors. • This should be nothing new to people who took Operating Systems … remember the producer/consumer problem or the readers/writers problem. • The multiprocessors and bus must provide lock mechanisms such as semaphores, test and set, or atomic swap functions. • Using locks, a process must acquire the lock in order to access a shared variable. • The lock serializes access in order to guarantee integrity and avoid race conditions. … ironically, after all the “parallelizing” effort we now serialize: an example of processing that is intrinsically serial.
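A test-and-set spin lock can be sketched in Python as follows. Python has no hardware test-and-set instruction, so this sketch assumes that a non-blocking `threading.Lock.acquire(blocking=False)` is an acceptable stand-in for the atomic operation; the `SpinLock` class and the `run` function are hypothetical names for the example.

```python
import threading
import time

class SpinLock:
    def __init__(self):
        self._flag = threading.Lock()      # "held" means the flag is set

    def test_and_set(self):
        """Atomically set the flag; return True if it was already set."""
        return not self._flag.acquire(blocking=False)

    def acquire(self):
        while self.test_and_set():         # spin until we find the flag clear
            time.sleep(0)                  # yield so the holder can progress

    def release(self):
        self._flag.release()               # clear the flag

def run(num_threads=4, increments=2000):
    lock = SpinLock()
    shared = {"count": 0}

    def worker():
        for _ in range(increments):
            lock.acquire()
            shared["count"] += 1           # critical section: one at a time
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["count"]

print(run())  # 8000
```

The increments happen strictly one after another, so this part of the program gets no speedup from the extra threads: the lock buys correctness at the cost of serialization.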

Network Connected MIMD Systems • Previously, the connection media (the bus) was placed between the multiple processors and memory. • For a single bus system (previous case), the connection media is used for every memory access, • thus making the bus and memory a bottleneck. • This approach limits the number of processors on the system and the physical separation of the processors. • A more flexible configuration: • Use a “network” connection media to interconnect a large number of complete computers (each node now having its own processor, cache, and memory). • For a network connected system, the connection media is used only for interprocessor communication: each processor has its own independent memory. • The “network” could be logically a LAN, but is more likely a massive cross-bar switch allowing “many-to-many” simultaneous communications.

Network Connected MIMD Systems (cont.) • Bus connected systems use shared memory and a single memory space. • Network connected systems use distributed (physical) memory and multiple private address spaces. • Message passing is used for synchronization and communication (including the exchange of data). Fig. 9.8
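Message passing between nodes with private memories can be sketched as follows. This is a single-machine illustration under stated assumptions: threads stand in for nodes, one `queue.Queue` per node stands in for its network inbox, and `run_cluster` and `node` are hypothetical names. No node ever touches another node's data directly.

```python
import queue
import threading

def run_cluster(per_node_data):
    n = len(per_node_data)
    mailboxes = [queue.Queue() for _ in range(n)]   # one inbox per node
    result = queue.Queue()

    def node(node_id, local_data):
        partial = sum(local_data)                   # uses only private memory
        if node_id != 0:
            mailboxes[0].put((node_id, partial))    # send a message to node 0
            return
        total = partial
        for _ in range(n - 1):
            _sender, value = mailboxes[0].get()     # receive: blocks until a
            total += value                          # message has arrived
        result.put(total)

    threads = [threading.Thread(target=node, args=(i, data))
               for i, data in enumerate(per_node_data)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result.get()

print(run_cluster([[1, 2], [3, 4], [5]]))  # 15
```

The blocking receive doubles as synchronization: node 0 cannot run ahead of a slow sender, so no separate lock or barrier is needed.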

Clusters • Generally loosely coupled collections of off-the-shelf machines connected logically in a network. • The network is typically high bandwidth and switch based, while still retaining the logical interface of a LAN. • Historical example is the IBM SP2: 32 nodes, each an RS/6000 workstation; it beat chess champion Kasparov in 1997. It is the basis of the cluster at the Cornell Theory Center. • Highly scalable, reliable and available. • N nodes have N independent memories. • Some tradeoffs between clusters and an SMP bus connected system: • Administration cost is higher for a cluster: an N machine cluster has the same overhead as managing N machines, whereas an N processor SMP system is more like managing a single machine. • The SMP memory bus connection is faster. • The nodes of a cluster do not have direct access to all memory in the system. • Separate memories can also be a plus: it is easier to reconfigure and replace bad machines. Generally, cluster system software runs on top of the local operating system.

Network Topologies • Network configurations for clusters • High performance end: fully connected, where every node has a connection to every other node ... cross bar switch • Low performance end: shared bus or LAN • Ring connected: each processor is connected to a switch, which in turn is connected to two neighboring switches, all arranged in a loop. • Unlike a bus, a ring is capable of many simultaneous transfers: several pairs of nodes can talk at the same time (except two nodes trying to talk to the same node). • Grids and n-cubes • Cross-bar switch • Any node can directly talk to any other node. • An omega network (a multistage switch network that uses less hardware than a full crossbar) may experience blocking: communication between one pair of nodes may block the communication between another pair of nodes. • Solution is to have redundant or alternate paths: by limiting the number of nodes and using alternate paths, an omega network can be made non-blocking.
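The cost difference between these topologies is easy to see by counting links. The helper functions below are hypothetical names written for this illustration; they assume bidirectional links, at least three nodes for a ring, and a power-of-two node count for an n-cube.

```python
def ring_neighbors(node, n):
    """The two neighbors of a node in an n-node ring."""
    return ((node - 1) % n, (node + 1) % n)

def hypercube_neighbors(node, dim):
    """Neighbors in an n-cube differ from the node in exactly one address bit."""
    return [node ^ (1 << bit) for bit in range(dim)]

def link_count(topology, n):
    """Number of bidirectional links needed to connect n nodes."""
    if topology == "ring":
        return n
    if topology == "full":
        return n * (n - 1) // 2
    if topology == "hypercube":
        dim = n.bit_length() - 1
        if n < 1 or (1 << dim) != n:
            raise ValueError("hypercube needs a power-of-two node count")
        return dim * n // 2
    raise ValueError("unknown topology: " + topology)

print(ring_neighbors(0, 8))          # (7, 1)
print(hypercube_neighbors(5, 3))     # [4, 7, 1]
for name in ("ring", "hypercube", "full"):
    print(name, link_count(name, 8))
# ring 8
# hypercube 12
# full 28
```

For 8 nodes the fully connected network already needs more than three times the links of a ring, and the gap grows quadratically, which is why intermediate topologies such as grids and n-cubes are attractive.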

Network Topologies Examples (figure): direct connections only to nearest neighbors
