
MIMD



Presentation Transcript


  1. MIMD: Multiple Instruction, Multiple Data

  2. Parallelism • We’ve looked at instruction-level parallelism so far: • VLIW • Superscalar architectures • SIMD?

  3. Today: Parallelism vs. Parallelism • Uni (ILP): • Pipelined • Superscalar • VLIW/”EPIC” • Multi (TLP): • SMP (“Symmetric”) • Distributed [Diagram: processors (P) and memories (M), attached directly in the SMP case and through a network (NET) in the distributed case.]

  4. MIMD • According to Flynn’s classification, each processor has its own ALU, CU, memory, and I/O devices. Each processor is capable of performing a processing task entirely independently of the other processors.

  5. A history… • Many early parallel processors were SIMD • More recently, MIMD has become the most common multiprocessor architecture • Why MIMD? • MIMD machines can be made from “off-the-shelf” chips • Many more uniprocessors are made than multiprocessors • Prices are low for uniprocessor chips (mass market) and high for specialized multiprocessor chips (too few are made) • Multiprocessors are cheaper if they use the same chips as uniprocessors

  6. SIMD and MIMD Processors [Figure: a typical SIMD architecture (a) and a typical MIMD architecture (b).]

  7. MIMD Processors • In contrast to SIMD processors, MIMD processors can execute different programs on different processors. • A variant of this, called single program multiple data streams (SPMD), executes the same program on different processors. • It is easy to see that SPMD and MIMD are closely related in terms of programming flexibility and underlying architectural support. • Examples of such platforms include current-generation Sun Ultra Servers, SGI Origin Servers, multiprocessor PCs, workstation clusters, and the IBM SP.

  8. SIMD-MIMD Comparison • SIMD computers require less hardware than MIMD computers (single control unit). • However, since SIMD processors are specially designed, they tend to be expensive and have long design cycles. • Not all applications are naturally suited to SIMD processors. • In contrast, platforms supporting the SPMD paradigm can be built from inexpensive off-the-shelf components with relatively little effort in a short amount of time.

  9. MIMD • MIMD machines can be further subdivided: • Centralized shared-memory architectures • All processors sit on the same bus and use the same centralized memory • Works well with a smaller number of processors • Bus bandwidth becomes a problem with many processors • Physically distributed memory • Each processor has some memory near it and can access other processors’ memory over a network • With good data locality, most memory accesses are local • Works well even with a large number of processors

  10. Shared-Address-Space Platforms • Multiprocessors • Part (or all) of the memory is accessible to all processors. • Processors interact by modifying data objects stored in this shared address space. • If the time taken by a processor to access any memory word in the system (global or local) is identical, the platform is classified as a uniform memory access (UMA) machine; otherwise, it is a non-uniform memory access (NUMA) machine.

  11. NUMA and UMA Shared-Address-Space Platforms Typical shared-address-space architectures: (a) uniform-memory-access shared-address-space computer; (b) uniform-memory-access shared-address-space computer with caches and memories; (c) non-uniform-memory-access shared-address-space computer with local memory only.

  12. NUMA and UMA Shared-Address-Space Platforms • Recall: • The distinction between NUMA and UMA platforms is important from the point of view of algorithm design. NUMA machines require locality from the underlying algorithms for performance. • Programming these platforms is easier since reads and writes are implicitly visible to other processors. • However, reads and writes to shared data must be coordinated. • Caches in such machines require coordinated access to multiple copies. This leads to the cache coherence problem. • A weaker model of these machines provides an address map, but not coordinated access. These models are called non-cache-coherent shared-address-space machines.

  13. Shared-Address-Space vs. Shared-Memory Machines • It is important to note the difference between the terms shared address space and shared memory. • We refer to the former as a programming abstraction and to the latter as a physical machine attribute. • It is possible to provide a shared address space using physically distributed memory: Tuple Space, for example.

  14. Message-Passing Platforms • These platforms comprise a set of processing nodes, each with its own (exclusive) memory. • Instances of such a view come naturally from clustered workstations and non-shared-address-space multicomputers. • These platforms are programmed using (variants of) send and receive primitives. • Libraries such as MPI and PVM provide such primitives, as in the sketch below.
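As a minimal illustration of the send/receive style, the hedged C sketch below uses the two core MPI primitives. It is also an SPMD program in the sense of slide 7: every rank runs the same code and branches on its rank. The payload and rank numbers are invented for the example.

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal MPI send/receive sketch (illustrative, not from the slides).
 * Run with at least two ranks, e.g.: mpicc msg.c && mpirun -np 2 ./a.out */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which processor am I? */

    int x = 42;                             /* example payload */
    if (rank == 0) {
        /* rank 0 sends one int to rank 1, using message tag 0 */
        MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* rank 1 blocks until the matching message arrives */
        MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", x);
    }

    MPI_Finalize();
    return 0;
}
```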

  15. Message Passing vs. Shared Address Space Platforms • Message passing requires little hardware support, other than a network. • Shared address space platforms can easily emulate message passing. The reverse is more difficult to do (in an efficient manner).
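To make the easy emulation direction concrete, here is a hedged sketch of how a shared-address-space platform can emulate send/receive with an ordinary shared buffer plus a flag. The mailbox type and function names are ours, and C11 atomics stand in for whatever synchronization the real platform provides.

```c
#include <stdatomic.h>

/* One-slot "mailbox" living in shared memory (illustrative names).
 * send_msg publishes data with a release store; recv_msg observes it
 * with an acquire load, so the receiver never reads a half-written slot. */
typedef struct {
    int data;
    atomic_int full;                /* 0 = empty, 1 = message waiting */
} mailbox_t;

void send_msg(mailbox_t *m, int v) {
    while (atomic_load_explicit(&m->full, memory_order_acquire))
        ;                           /* spin until the slot is free */
    m->data = v;
    atomic_store_explicit(&m->full, 1, memory_order_release);  /* publish */
}

int recv_msg(mailbox_t *m) {
    while (!atomic_load_explicit(&m->full, memory_order_acquire))
        ;                           /* spin until a message arrives */
    int v = m->data;
    atomic_store_explicit(&m->full, 0, memory_order_release);  /* free slot */
    return v;
}
```

The reverse direction means building the illusion of a single address space out of explicit messages, which is why it is the harder one to do efficiently.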

  16. Motivations for MIMD • Advantages: • 1. Reliability • 2. Potential n-fold performance improvement

  17. Key Questions • How do parallel processors share data? • How do parallel processors communicate? • How many processors?

  18. Data Sharing • Key hardware issues • Shared memory: how to keep caches coherent • Message passing: low-cost communication

  19. Cache Coherence • This is more of a shared-memory problem but can also be associated with distributed-memory systems.

  20. Communication Costs in Parallel Machines • Along with idling and contention, communication is a major overhead in parallel programs. • The cost of communication is dependent on a variety of features including the programming model semantics, the network topology, data handling and routing, and associated software protocols.
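A common first-order cost model (a standard assumption in this literature, not stated on the slide) charges a fixed startup latency plus a per-word transfer cost, so sending an m-word message takes roughly

    t_comm = t_s + t_w * m

where t_s is the startup (latency) overhead and t_w is the per-word transfer time; contention and multi-hop routing add further terms on specific topologies.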

  21. Multicomputer [Diagram: two nodes, each a processor + cache (A and B) with its own memory, connected by an interconnect.]

  22. Multiprocessor: “Symmetric” Multiprocessor, or SMP [Diagram: caches A and B sharing a single memory.]

  23. But both can have a cache coherence problem… [Diagram: X is 0 in memory. Caches A and B both read X. A then writes X, so A holds X: 1 while B and memory still hold X: 0, and B’s next read of X returns the stale value. Oops!]
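The slide’s scenario, rewritten as two threads for concreteness. This is a hedged sketch only: the unsynchronized sharing is formally a data race in C, which is exactly the kind of access pattern the coherence hardware discussed next has to make safe.

```c
#include <pthread.h>
#include <stdio.h>

int X = 0;  /* shared; deliberately unsynchronized to mirror the slide */

void *cpu_a(void *arg) {            /* "Cache A": read X ... write X */
    int seen = X;
    X = 1;
    printf("A saw %d, then wrote 1\n", seen);
    return NULL;
}

void *cpu_b(void *arg) {            /* "Cache B": read X ... read X */
    printf("B read X = %d\n", X);
    printf("B read X = %d\n", X);   /* without coherence, could stay stale */
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&b, NULL, cpu_b, NULL);
    pthread_create(&a, NULL, cpu_a, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}
```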

  24. Cache coherence protocols • Directory based: • Whether or not a physical memory block is shared is recorded in one central location • Called “the directory” • Snooping: • Every cache with entries from the centralized main memory also keeps each block’s “sharing status” • No centralized state is kept • Caches are connected to a shared memory bus • If there is bus traffic, the caches check (or “snoop”) to see if they have the block being transferred on the bus
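For the directory-based flavor, a directory entry might look like the hedged C sketch below. The state names, the 64-processor sharer bitmap, and dir_handle_write are illustrative choices, not the slides’.

```c
#include <stdint.h>

/* Illustrative directory entry for one memory block (<= 64 processors). */
typedef enum { UNCACHED, SHARED_BLOCK, EXCLUSIVE } dir_state_t;

typedef struct {
    dir_state_t state;      /* how the block is currently cached */
    uint64_t    sharers;    /* bit i set => processor i holds a copy */
} dir_entry_t;

/* On a write request from processor p: invalidate every other sharer,
 * then record p as the exclusive owner of the block. */
void dir_handle_write(dir_entry_t *e, int p) {
    for (int i = 0; i < 64; i++) {
        if (((e->sharers >> i) & 1) && i != p) {
            /* send_invalidate(i) would go over the network here */
        }
    }
    e->sharers = 1ULL << p;
    e->state   = EXCLUSIVE;
}
```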

  25. Snoopy Cache [Diagram: a CPU, its cache of (State, Tag, Data) entries, and the shared bus.] • CPU references check the cache tags (as usual) • Cache misses are filled from memory (as usual) • Plus: other reads/writes on the bus must check the tags too, and possibly invalidate entries

  26. Method • Method one: write invalidation • Method two: write update

  27. Maintaining the coherence requirement • Method one: make sure the writing processor has the only cached copy of a data word before it is written • Called the “write invalidate protocol” • A write invalidates the other cached copies of the data • Most common for both snooping and directory schemes

  28. Write invalidate example • Assumes neither cache holds value/location X at first • When the 2nd miss by B occurs, CPU A responds with the value, canceling the response from memory • A updates B’s cache, and the memory contents of X are updated

  29. Maintaining the coherence requirement • What if 2 processors try to write at the same time? • One of them does it first • The other’s copy is invalidated • When the first write is done, the other gets that new copy • Then it in turn invalidates all cached copies and writes… • Caches snoop on the bus, so they’ll detect a “request to write”; whichever machine gets to the bus 1st goes 1st • For the 2nd to complete its write, it needs a new copy first • The bus access protocol enforces serialization

  30. Snoopy • In snoopy caches, all caches listen on a broadcast medium for invalidates and read requests and perform the appropriate coherence operations locally. • What actually happens when a miss occurs?

  31. Cache Coherence • With a write-through cache, no problem • The data is always in main memory • But in a shared-memory machine, every cache write would go back to main memory – bad for bandwidth! • What about write-back caches, though? • Much harder. • The most recent value of a datum could be in a cache instead of memory • How to handle write-back caches? • Snoop. • Each processor snoops every address placed on the bus • If a processor has a dirty copy of the requested cache block, it responds to the read request, and the memory request is cancelled

  32. An example protocol • A bus-based protocol is usually implemented with a finite-state-machine controller in each node • The controller responds to requests from the processor and from the bus • It changes the state of the selected cache block and uses the bus to access data or invalidate it, as in the sketch below
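A hedged reconstruction of such a controller as C state-transition functions, using the common three-state MSI write-invalidate scheme. The state and event names are ours; the slides’ actual protocol, with its Valid/Dirty/Shared bits, is walked through next.

```c
/* Illustrative MSI snoopy write-invalidate controller (one cache block). */
typedef enum { INVALID, SHARED_STATE, MODIFIED } line_state_t;
typedef enum { PR_READ, PR_WRITE } cpu_op_t;    /* from this node's CPU */
typedef enum { BUS_READ, BUS_WRITE } bus_op_t;  /* snooped from the bus */

/* Transition on a request from the local processor. */
line_state_t on_cpu(line_state_t s, cpu_op_t op) {
    switch (s) {
    case INVALID:       /* miss: fetch the block over the bus;            */
                        /* a write also invalidates other copies          */
        return (op == PR_READ) ? SHARED_STATE : MODIFIED;
    case SHARED_STATE:  /* read hits; a write broadcasts an invalidate    */
        return (op == PR_READ) ? SHARED_STATE : MODIFIED;
    case MODIFIED:      /* both reads and writes hit locally              */
        return MODIFIED;
    }
    return INVALID;
}

/* Transition on a request snooped from another processor. */
line_state_t on_snoop(line_state_t s, bus_op_t op) {
    if (s == MODIFIED && op == BUS_READ)
        return SHARED_STATE;  /* supply dirty data; memory's reply is cancelled */
    if (op == BUS_WRITE)
        return INVALID;       /* another writer invalidates our copy */
    return s;
}
```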

  33. [Diagram] P: one of many processors.

  34. [Diagram] P now shows an address register (Addr 000000) and R/W flags. This indicates what operation the processor is trying to perform and with what address.

  35. [Diagram] The processor’s cache: a 4-bit Tag, 4 lines (ID 00–11), and Valid (V), Dirty (D), and Shared (S) bits, all initially 0:

Tag   ID  V  D  S
0000  00  0  0  0
0000  01  0  0  0
0000  10  0  0  0
0000  11  0  0  0

  36. Note: for this somewhat simplified example we won’t concern ourselves with how many bytes (or words) are in each line. Assume that it’s more than one.
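The walkthrough implicitly splits each 6-bit address into a 4-bit tag (high bits) and a 2-bit line ID (low bits); a small hedged sketch of that decomposition (the helper names are ours):

```c
#include <stdint.h>

/* Address layout assumed by the walkthrough:
 * 6-bit address = 4-bit tag + 2-bit line ID.
 * (Byte-offset bits within a line are ignored, matching the slides'
 * simplification.) */
static inline uint8_t addr_tag(uint8_t addr) { return (addr >> 2) & 0xF; } /* high 4 bits */
static inline uint8_t addr_id(uint8_t addr)  { return addr & 0x3; }        /* low 2 bits  */

/* Example: address 101010 -> tag 1010, line ID 10;
 *          address 111100 -> tag 1111, line ID 00, as in the slides. */
```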

  37. [Diagram] The bus, with an indication of the address and operation (Addr 000000, R/W).

  38. [Diagram] These bus operations are coming from other processors, which aren’t shown.

  39. [Diagram] Main memory is attached to the bus.

  40. The processor issues a read of address 101010.

  41. The cache reports... MISS.

  42. The cache reports a MISS because the tags don’t match: address 101010 selects line 10, which still holds tag 0000, not 1010.

  43. The data is read from memory; line 10 now holds Tag 1010 with V=1, D=0, S=1.

  44. The S bit indicates that this line is “shared”, which means other caches might have the same value.

  45. From now on we will show these as 2-step operations… Step 1: the request (here, a read of address 101010).

  46. Step 2: the result and the change to the cache (a MISS; line 10 is filled with Tag 1010, V=1, D=0, S=1).

  47. A write of address 111100.

  48. Write Miss: line 00 now holds Tag 1111 with V=1, D=1, S=0.

  49. Keep in mind that since most cache configurations have multiple bytes per line, a write miss will actually require us to get the line from memory into the cache first, since we are only writing one byte into the line.

  50. Note: the Dirty bit signifies that the data in the cache is not the same as in memory.
