Graduate Computer Architecture I

Graduate Computer Architecture I Lecture 11: Distributed Memory Multiprocessors

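Natural Extensions of Memory System [Figure: three organizations in order of increasing scale. Shared cache: processors P1..Pn reach an interleaved first-level cache and interleaved main memory through a switch. Centralized memory ("dance hall", UMA): P1..Pn each have a cache ($) and reach the memory modules through an interconnection network. Distributed memory (NUMA): each processor has a cache and its own memory, and the nodes are joined by an interconnection network.]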

Fundamental Issues • Naming • Synchronization • Performance: Latency and Bandwidth

Fundamental Issue #1: Naming • Naming • what data is shared • how it is addressed • what operations can access data • how processes refer to each other • Choice of naming affects • code produced by a compiler • with a shared address space, a load only needs the address; with message passing, the code must keep track of the processor number and local virtual address • replication of data • via loads in the cache memory hierarchy, or via software replication and consistency

Fundamental Issue #1: Naming • Global physical address space • any processor can generate an address and access it in a single operation • memory can be anywhere: virtual address translation handles it • Global virtual address space • the address space of each process can be configured to contain all shared data of the parallel program • Segmented shared address space • locations are named <process number, address> uniformly for all processes of the parallel program
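The segmented naming scheme can be illustrated with a minimal sketch. It assumes a hypothetical layout in which each process owns one contiguous, equally sized slice of a flat global address space; `SEGMENT_BYTES`, `to_segmented`, and `to_global` are names invented for this illustration, not part of any real system.

```python
# Hypothetical layout: every process owns one contiguous slice of
# SEGMENT_BYTES in a flat global address space (an assumption made only
# for this illustration).
SEGMENT_BYTES = 1 << 20  # 1 MiB per process


def to_segmented(global_addr):
    """Map a flat global address to a <process number, address> name."""
    return divmod(global_addr, SEGMENT_BYTES)


def to_global(proc, local_addr):
    """Inverse mapping: <process number, address> back to a flat address."""
    if not 0 <= local_addr < SEGMENT_BYTES:
        raise ValueError("address lies outside the process's segment")
    return proc * SEGMENT_BYTES + local_addr
```

With this layout the name is the same for every process of the parallel program: `to_segmented(3 * SEGMENT_BYTES + 16)` yields process 3, address 16, regardless of who asks.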

Fundamental Issue #2: Synchronization • Message passing • implicit coordination, through the transmission and arrival of data • Shared address • explicit coordination • write a flag • awaken a thread • interrupt a processor

Parallel Architecture Framework • Programming Model • Multiprogramming • lots of independent jobs • no communication • Shared address space • communicate via memory • Message passing • send and receive messages • Communication Abstraction • Shared address space • load, store, atomic swap • Message passing • send, receive library calls • Debate over this topic • ease of programming vs. scalability

Scalable Machines • Design trade-offs for the machines • specialized vs. commodity nodes • capability of node-to-network interface • supporting programming models • Scalability • avoids inherent design limits on resources • bandwidth increases as resources are added • latency does not increase • cost increases slowly as resources are added

Bandwidth Scalability [Figure: nodes, each a processor (P) with memory modules (M), attached to switches (S); typical switches: bus, crossbar, multiplexers] • Fundamentally limits bandwidth • Amount of wires • Bus vs. network switch

Dancehall Multiprocessor Organization [Figure: memory modules (M) on one side of a scalable network of switches; processors (P), each with a cache ($), on the other side]

Generic Distributed System Organization [Figure: nodes attached to a scalable network of switches; each node contains a processor (P), a cache ($), memory (M), and a communication assist (Comm Assist)]

Key Property of Distributed System • Large number of independent communication paths between nodes • allow a large number of concurrent transactions using different wires • Transactions are initiated independently • No global arbitration • Effect of a transaction is visible only to the nodes involved • effects propagated through additional transactions

Programming Models Realized by Protocols [Figure: layers from top to bottom. Parallel applications: CAD, database, scientific modeling. Programming models: multiprogramming, shared address, message passing, data parallel. Communication abstraction, reached by compilation or library (user/system boundary). Operating systems support (hardware/software boundary). Communication hardware. Physical communication medium. Label: Network Transactions.]

Network Transaction [Figure: two nodes, each with a processor (P), memory (M), and communication assist (CA), exchange a message over a scalable network; labels: Communication Assist, Node Architecture] • Output processing • checks • translation • formatting • scheduling • Input processing • checks • translation • buffering • action • Interpretation of the message • Complexity of the message • Processing in the Comm. Assist • Processing power

Shared Address Space Abstraction [Figure: timeline between source and destination. (1) Initiate memory access [load, global address] (2) Address translation (3) Local/remote check (4) Request transaction: read request sent to the destination (5) Remote memory access at the destination while the source waits (6) Reply transaction: read response returned to the source (7) Complete memory access] • Fundamentally a two-way request/response protocol • writes have an acknowledgement • Issues • fixed or variable length (bulk) transfers • remote virtual or physical address • deadlock avoidance and input buffer full • Memory coherency and consistency
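The seven steps of a remote load can be sketched as a small simulation. This is an illustration under stated assumptions, not a description of any real communication assist: the `Node` class, its fields, and the block layout (each node owns `words_per_node` consecutive global addresses) are all hypothetical, and the network is modelled as a direct method call.

```python
class Node:
    """One node: local memory plus a communication assist (modelled as methods)."""

    def __init__(self, node_id, words_per_node, network):
        self.node_id = node_id
        self.words_per_node = words_per_node
        self.memory = [0] * words_per_node
        self.network = network      # dict: node id -> Node
        self.remote_requests = 0    # request transactions sent by this node

    def load(self, global_addr):
        # (2) address translation and (3) local/remote check
        home, offset = divmod(global_addr, self.words_per_node)
        if home == self.node_id:
            return self.memory[offset]
        # (4) request transaction; the source waits for the response
        self.remote_requests += 1
        return self.network[home].handle_read_request(offset)

    def handle_read_request(self, offset):
        # (5) remote memory access performed at the destination;
        # (6) the return value stands in for the reply transaction
        return self.memory[offset]


network = {}
for i in range(4):
    network[i] = Node(i, 8, network)

network[2].memory[5] = 42            # word 5 on node 2 is global address 21
value = network[0].load(2 * 8 + 5)   # remote load issued by node 0
```

Note that `handle_read_request` never involves the remote node's processor, only its memory, which mirrors the request/response character of the protocol.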

Shared Physical Address Space

Shared Address Abstraction • Source and destination data addresses are specified by the source of the request • a degree of logical coupling and trust • No storage logically “outside the address space” • may employ temporary buffers for transport • Operations are fundamentally request/response • Remote operation can be performed on remote memory • logically does not require intervention of the remote processor

Message passing • Bulk transfers • Synchronous • Send completes after the matching receive is posted and the source data is sent • Receive completes after the data transfer from the matching send is complete • Asynchronous • Send completes as soon as the send buffer may be reused

Synchronous Message Passing • Constrained programming model • Destination contention very limited • User/System boundary [Figure: timeline between source and destination. (1) Initiate send: the source executes Send P_dest, local VA, len; the destination executes Recv P_src, local VA, len (2) Address translation on P_src (3) Local/remote check (4) Send-ready request (5) Remote check for posted receive (tag check, assume success) while the source waits (6) Reply transaction: recv-ready reply (7) Bulk data transfer: data-transfer request from source VA to dest VA or ID]
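The handshake in steps (4) through (7) can be sketched as follows. This is a minimal model under assumptions: `Receiver`, `synchronous_send`, the integer tags, and the list-based buffers are hypothetical names chosen for the illustration, and a failed tag check simply returns instead of waiting.

```python
class Receiver:
    def __init__(self):
        self.posted = {}   # tag -> destination buffer (a list) posted by a receive
        self.log = []      # replies sent, for inspection

    def post_recv(self, tag, buffer):
        self.posted[tag] = buffer

    def send_ready_request(self, tag, length):
        # (5) remote check for a posted receive with a matching tag
        buffer = self.posted.get(tag)
        ok = buffer is not None and len(buffer) >= length
        self.log.append("recv-rdy reply" if ok else "no match")
        return ok  # (6) reply transaction

    def data_transfer(self, tag, data):
        # (7) bulk data transfer into the posted buffer
        buffer = self.posted.pop(tag)
        buffer[:len(data)] = data


def synchronous_send(dest, tag, data):
    """Return True once the three-phase transfer has completed."""
    if not dest.send_ready_request(tag, len(data)):   # (4) send-ready request
        return False  # a real sender would wait or retry; this sketch gives up
    dest.data_transfer(tag, data)
    return True


dest = Receiver()
buf = [0, 0, 0, 0]
dest.post_recv(7, buf)
sent = synchronous_send(dest, 7, [1, 2, 3])
```

Because data moves only after the receive has been matched, no storage is needed inside the message layer, which is the point of the constrained model.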

Asynchronous Message Passing: Optimistic • More powerful programming model • Wildcard receive => non-deterministic • Storage required within msg layer?

Active Messages • User-level analog of network transaction • transfer data packet and invoke handler to extract it from the network and integrate with on-going computation • Request/Reply • Event notification: interrupts, polling, events? • May also perform memory-to-memory transfer [Figure labels: request handler, reply handler]
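A minimal sketch of the request/reply pattern follows, assuming polling as the event-notification mechanism. `AMNode`, `read_request_handler`, and `read_reply_handler` are hypothetical names for this illustration; the "network" is a per-node queue.

```python
from collections import deque


class AMNode:
    def __init__(self, node_id, network):
        self.node_id = node_id
        self.network = network     # dict: node id -> AMNode
        self.inbox = deque()       # arrived messages awaiting polling
        self.memory = {}
        self.replies = []

    def send(self, dest, handler, *args):
        # The message itself names the handler to run at the destination.
        self.network[dest].inbox.append((handler, self.node_id, args))

    def poll(self):
        # Extract each message from the network and run its handler.
        while self.inbox:
            handler, src, args = self.inbox.popleft()
            handler(self, src, *args)


def read_request_handler(node, src, key):
    # Request handler: runs at the destination and issues the reply.
    node.send(src, read_reply_handler, key, node.memory.get(key))


def read_reply_handler(node, src, key, value):
    # Reply handler: integrates the data into the requester's computation.
    node.replies.append((key, value))


net = {}
a = AMNode(0, net)
b = AMNode(1, net)
net[0], net[1] = a, b
b.memory["x"] = 99
a.send(1, read_request_handler, "x")
b.poll()   # runs the request handler on node 1
a.poll()   # runs the reply handler on node 0
```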

Message Passing Abstraction • Source knows send data address, dest. knows receive data address • after handshake they both know both • Arbitrary storage “outside the local address spaces” • may post many sends before any receives • non-blocking asynchronous sends reduce the requirement to an arbitrary number of descriptors • Fundamentally a 3-phase transaction • includes a request / response • can use optimistic 1-phase in limited “safe” cases

Data Parallel • Operations can be performed in parallel on each element of a large regular data structure, such as an array • Data parallel programming languages lay out data across processors • Processing Element (PE) • one control processor broadcasts to many PEs • When computers were large, could amortize the control portion over many replicated PEs • Condition flag per PE so that a PE can skip an operation • Data distributed in each memory • Early 1980s VLSI => SIMD rebirth • 32 1-bit PEs + memory on a chip was the PE
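The per-PE condition flag can be sketched in a few lines. This is an illustration only: `simd_step` is a hypothetical helper, and a Python list stands in for the array of processing elements, one element per PE.

```python
def simd_step(values, flags, op):
    """Apply one broadcast operation; PEs whose flag is False skip it."""
    return [op(v) if active else v for v, active in zip(values, flags)]


data = [3, -1, 4, -5]                          # one element per processing element
flags = [v < 0 for v in data]                  # each PE sets its own condition flag
result = simd_step(data, flags, lambda v: -v)  # broadcast "negate"; only flagged PEs act
```

Every PE receives the same instruction from the control processor; the flag is what lets the array compute an absolute value without per-PE branching.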

Data Parallel • Architecture Development • Vector processors have similar ISAs, but no data placement restriction • SIMD led to Data Parallel Programming languages • Single Program Multiple Data (SPMD) model • All processors execute identical program • Advanced VLSI Technology • Single chip FPUs • Fast µProcs (SIMD less attractive)

Cache Coherent System • Invoking coherence protocol • state of the line is maintained in the cache • protocol is invoked if an “access fault” occurs on the line • Actions to Maintain Coherence • Look at states of block in other caches • Locate the other copies • Communicate with those copies

Scalable Cache Coherence • Scalable networks • many simultaneous transactions • Realizing programming models through network transaction protocols • efficient node-to-network interface • interprets transactions • Scalable distributed memory • Caches naturally replicate data • coherence through bus snooping protocols • consistency • Need cache coherence protocols that scale! • no broadcast or single point of order

Bus-based Coherence • All actions done as broadcast on bus • faulting processor sends out a “search” • others respond to the search probe and take necessary action • Could do it in scalable network too • broadcast to all processors, and let them respond • Conceptually simple, but doesn’t scale with the number of processors p • on bus, bus bandwidth doesn’t scale • on scalable network, every fault leads to at least p network transactions

One Approach: Hierarchical Snooping • Extend snooping approach • hierarchy of broadcast media • processors are in the bus- or ring-based multiprocessors at the leaves • parents and children connected by two-way snoopy interfaces • main memory may be centralized at root or distributed among leaves • Actions handled similarly to bus, but not full broadcast • faulting processor sends out “search” bus transaction on its bus • propagates up and down hierarchy based on snoop results • Problems • high latency: multiple levels, and snoop/lookup at every level • bandwidth bottleneck at root

Scalable Approach: Directories • Directory • Keeps track of the cached copies of each memory block • Maintains the state of each memory block • On a cache miss • Look up directory entry • Communicate only with the nodes with copies • Scalable networks • Communication through network transactions • Different ways to organize directory
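A directory entry can be sketched as a state plus a set of nodes holding copies. This is a simplified illustration, not a complete protocol: the `Directory` class, the three states "U" (uncached), "S" (shared), and "E" (exclusive), and the message names in the log are assumptions made for the example, and acknowledgements and races are ignored.

```python
class Directory:
    """Per-block directory entry: a state and the set of nodes with copies."""

    def __init__(self):
        self.state = {}     # block -> "U" (uncached), "S" (shared), "E" (exclusive)
        self.copies = {}    # block -> set of node ids holding the block
        self.messages = []  # log of network transactions, for inspection

    def read_miss(self, node, block):
        holders = self.copies.setdefault(block, set())
        if self.state.get(block, "U") == "E":
            # Fetch the block back from its single owner, which keeps a shared copy.
            (owner,) = holders
            self.messages.append(("fetch", owner, block))
        holders.add(node)
        self.state[block] = "S"
        self.messages.append(("data_reply", node, block))

    def write_miss(self, node, block):
        holders = self.copies.setdefault(block, set())
        # Contact only the nodes that actually hold a copy.
        for other in sorted(holders - {node}):
            self.messages.append(("invalidate", other, block))
        self.copies[block] = {node}
        self.state[block] = "E"
        self.messages.append(("data_reply", node, block))


d = Directory()
d.read_miss(1, "pA")
d.read_miss(2, "pA")
d.write_miss(1, "pA")   # invalidates only node 2's copy
```

The key contrast with broadcast is in `write_miss`: the number of invalidations depends on the number of sharers, not on the number of processors in the machine.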

Basic Directory Transactions [Figure: each node has a processor (P), cache (C), communication assist (A), and memory with directory (M/D). (a) Read miss to a block in dirty state: 1. Read request to directory; 2. Reply with owner identity; 3. Read request to owner (the node with the dirty copy); 4a. Data reply to requestor; 4b. Revision message to directory. (b) Write miss to a block with two sharers: 1. RdEx request to directory; 2. Reply with sharers' identity; 3a, 3b. Invalidation requests to the sharers; 4a, 4b. Invalidation acks to requestor.]
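Counting the transactions in the write-miss case against the broadcast lower bound gives a feel for why directories scale. The sketch below is a back-of-the-envelope count under assumptions: both function names are hypothetical, and the directory count covers only the messages listed for a write miss (one request, one reply, one invalidation and one ack per sharer).

```python
def broadcast_transactions(p):
    """Lower bound for broadcast coherence: at least one transaction per processor."""
    return p


def directory_write_miss_transactions(sharers):
    """Messages for a write miss to a block with the given number of sharers."""
    # 1 RdEx request + 1 reply with the sharers' identity
    # + one invalidation request and one invalidation ack per sharer
    return 2 + 2 * sharers


p = 64
with_broadcast = broadcast_transactions(p)              # grows with machine size
with_directory = directory_write_miss_transactions(2)   # grows with sharer count only
```

For the two-sharer case the directory count is 6 messages whether the machine has 8 processors or 1024.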

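Example Directory Protocol (1st Read) [Figure: state diagrams for the caches of P1 and P2 and for the directory controller at memory M, with states D, E, S, I, U and transitions R/req, R/reply. A load, ld vA -> rd pA, sends Read pA to the directory controller, which records P1: pA.]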

Example Directory Protocol (Read Share) [Figure: the same state diagrams, with transitions R/req, R/reply, and R/_. A load, ld vA -> rd pA, is issued; the directory controller now records P1: pA and P2: pA.]

Example Directory Protocol (Wr to shared) [Figure: the same state diagrams, with transitions R/req, R/reply, R/_, W/req E, Inv/_, and RX/invalidate&reply. A store, st vA -> wr pA, sends Read_to_update pA to the directory controller, which sends Invalidate pA to the other sharer, collects the Inv ACK, replies with xD(pA), and records P1: pA Excl.]

A Popular Middle Ground • Two-level “hierarchy” • Coherence across nodes is directory-based • directory keeps track of nodes, not individual processors • Coherence within nodes is snooping or directory • orthogonal, but needs a good interface of functionality • Examples • Convex Exemplar: directory-directory • Sequent, Data General, HAL: directory-snoopy

Two-level Hierarchies [Figure: four organizations of processors (P) with caches (C). (a) Snooping-snooping: bus-based nodes (bus B1) with main memory, joined through snooping adapters on a second bus B2. (b) Snooping-directory: bus-based nodes (bus B1) with main memory, a directory, and an assist, joined by a network. (c) Directory-directory: nodes with assists (A) and memory/directory (M/D) on Network1, joined through directory adapters on Network2. (d) Directory-snooping: nodes with assists (A) and memory/directory (M/D) on Network1, joined through dir/snoopy adapters on a bus (or ring).]

Memory Consistency [Figure: (a) one processor executes A = 1; flag = 1; while another executes while (flag == 0); print A;. Processors P1, P2, P3 each have a memory on an interconnection network; the memories hold A: 0 and flag: 0 -> 1. The messages are labeled 1: A = 1 (delayed), 2: flag = 1, and 3: load A. (b) The same three processors, with a congested path.] • Memory coherence • gives a consistent view of the memory • does not specify how consistent • that is, in what order operations appear to execute
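The flag example can be replayed with a small deterministic model. The sketch assumes that each write becomes visible at its memory only after a per-message network latency, and that the consumer's load of A takes no time once it sees the flag; `printed_value` and its parameters are hypothetical names for this illustration.

```python
def printed_value(write_latency_A, write_latency_flag):
    """Replay the flag example with per-message network latencies.

    The producer issues A = 1 at time 0 and flag = 1 at time 1. Each write
    takes effect at its memory only after its latency has elapsed.
    Returns the value of A that the consumer prints.
    """
    a_visible_at = 0 + write_latency_A
    flag_visible_at = 1 + write_latency_flag

    # The consumer spins until it observes flag == 1, then loads A at once.
    load_time = flag_visible_at
    return 1 if a_visible_at <= load_time else 0


in_order = printed_value(write_latency_A=1, write_latency_flag=1)    # A arrives first
congested = printed_value(write_latency_A=10, write_latency_flag=1)  # write to A is delayed
```

In the congested case the consumer sees the flag set and still prints the old value of A, even though each individual location is kept coherent.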

Memory Consistency • Relaxed Consistency • Allows out-of-order completion • Different read and write ordering models • Increases performance, but makes errors possible • Current Systems • Relaxed models • Expect properly synchronized programs • Use of standard synchronization libraries
