
Scalable Multiprocessors (II)



  1. Scalable Multiprocessors (II)

  2. Realizing Programming Models • What is required to implement programming models on large distributed-memory machines? • network transactions • protocols • safety • input buffer problem • fetch deadlock

  3. Programming Models Realized by Protocols • [Layer diagram: parallel applications (CAD, database, scientific modeling) sit on the programming models (multiprogramming, shared address, message passing, data parallel), realized by compilation or a library over the communication abstraction (user/system boundary), which rests on operating systems support (hardware/software boundary), communication hardware, and the physical communication medium] • Network transactions rather than bus transactions • What is a network transaction?

  4. Network Transaction Primitive • One-way transfer of information from a source output buffer to a dest. input buffer • causes some action at the destination • occurrence is not directly visible at source • Deposit data, state change, reply
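The primitive can be sketched in a few lines; the names (`Node`, `deliver`, `process_one`) are illustrative, not from any real machine:

```python
from collections import deque

class Node:
    """Toy destination node: an input buffer plus local memory."""
    def __init__(self):
        self.input_buffer = deque()
        self.memory = {}

    def deliver(self, message):
        # The network transaction deposits the message into the destination's
        # input buffer; the source never directly observes this happening.
        self.input_buffer.append(message)

    def process_one(self):
        # Later, the destination pops the message and performs the action it
        # encodes -- here, a simple data deposit into local memory.
        kind, addr, value = self.input_buffer.popleft()
        if kind == "deposit":
            self.memory[addr] = value

dst = Node()
dst.deliver(("deposit", 0x40, 7))  # one-way: source output -> dest input buffer
dst.process_one()                  # the action happens at the destination
```

A reply, if any, is itself just another one-way transaction in the reverse direction.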

  5. Bus Transactions vs Net Transactions • Issues (bus vs. network): • protection check: V->P vs. individual component • format: wires vs. flexible (encapsulation) • output buffering: reg, FIFO vs. reg, FIFO, descriptors • media arbitration: global vs. local • destination naming and routing: implicit vs. destination and routing info • input buffering: limited vs. many sources • action: memory access vs. complex action, + response • transfer ordering: strong order vs. weak ordering • delivery guarantees: guaranteed vs. discard information or defer transmission when the destination buffer is full

  6. Shared Address Space Abstraction • Fundamentally a two-way request/response protocol • writes have an acknowledgement • Issues • fixed or variable length (bulk) transfers • remote virtual or physical address, where is action performed? • deadlock avoidance and input buffer full • coherent? • consistent?
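The two-way request/response pattern can be sketched as follows (a toy model with illustrative names, not any real machine's protocol): a home node services remote reads with data responses and remote writes with acknowledgements:

```python
class SASHome:
    """Toy home node for a region of the shared address space."""
    def __init__(self):
        self.mem = {}

    def handle(self, op, addr, value=None):
        # A request transaction arrives at the home node; the return value
        # stands for the second, reverse-direction network transaction.
        if op == "read":
            return ("read_response", self.mem.get(addr, 0))
        if op == "write":
            self.mem[addr] = value
            return ("write_ack",)   # writes complete only after the ack

home = SASHome()
ack = home.handle("write", 0x10, 42)   # request -> acknowledgement
data = home.handle("read", 0x10)       # request -> response carrying data
```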

  7. The Fetch Deadlock Problem • Even if a node cannot issue a request, it must sink network transactions. • An incoming transaction may be a request, which will generate a response. • [Figure: two PEs exchanging requests and responses through the network]

  8. Consistency • [Figure: P1 executes A = 1; flag = 1; while another processor executes while (flag == 0); print A; with A and flag in memories on different nodes. The write to A (1) is delayed on a congested path, the write to flag (2) arrives first, and the load of A (3) returns the stale value 0.] • write atomicity violated without caching • Sequential Consistency • the interconnection network does not serialize memory accesses to locations on different nodes • multiple outstanding writes for performance (long latency)
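The violation in this example can be replayed directly, assuming only that the interconnect is free to deliver the two writes in either order (a toy model, not a real memory system):

```python
# A and flag live in memories on different nodes; P1 issues A = 1 then
# flag = 1, but the network delivers them in the opposite order because the
# path to A's node is congested.
memory = {"A": 0, "flag": 0}
arrivals = [("flag", 1), ("A", 1)]   # delivery order, not issue order

var, val = arrivals[0]
memory[var] = val                    # flag = 1 lands first
stale_read = memory["A"]             # the spinning reader now loads A...
var, val = arrivals[1]
memory[var] = val                    # ...before A = 1 finally arrives
```

Under sequential consistency the reader could never observe flag == 1 together with A == 0.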

  9. Key Properties of SAS Abstraction • Source and destination data addresses are specified by the source of the request • a degree of logical coupling and trust • No storage logically “outside the application address space(s)” • may employ temporary buffers for transport • Operations are fundamentally request response • Remote operation can be performed on remote memory • logically does not require intervention of the remote processor

  10. Message passing • A one-way transfer from a source to a destination • Bulk transfers • Pairwise synchronization event • Synchronous • Send completes after matching recv and source data sent • Receive completes after data transfer complete from matching send • Asynchronous • Send completes after send buffer may be reused

  11. Synchronous Message Passing • Constrained programming model. Deterministic! • Sender Initiated or Receiver Initiated (two net transactions)

  12. Asynch. Message Passing: Optimistic • More powerful programming model • Wildcard receive => non-deterministic • Storage required within msg layer?

  13. Asynch. Message Passing: Optimistic • Problems • data is received into a temporary input buffer and later copied to its proper destination: extra copy delay • the input buffer problem • large buffers are needed if the transfers are large • many senders may target one receiver

  14. Asynch. Msg Passing: Conservative • Where is the buffering? • Short message optimizations (without handshaking) • Certain amount of space for receive data • Credit scheme
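A minimal sketch of the credit scheme (names are illustrative; real communication assists implement this in hardware or in the message layer): the receiver grants the sender a fixed number of buffer slots up front, and a credit flows back each time a slot is drained, so the input buffer can never overflow:

```python
class CreditLink:
    """Toy credit-based flow control between one sender and one receiver."""
    def __init__(self, slots):
        self.credits = slots   # slots the receiver has reserved for us
        self.buffer = []       # the receiver's input buffer

    def try_send(self, msg):
        if self.credits == 0:
            return False       # out of credit: back off instead of overflowing
        self.credits -= 1
        self.buffer.append(msg)
        return True

    def consume(self):
        self.buffer.pop(0)
        self.credits += 1      # credit returns once the slot is free again

link = CreditLink(slots=2)
link.try_send("a")
link.try_send("b")               # both accepted
blocked = link.try_send("c")     # False: sender must wait
link.consume()                   # receiver drains a slot
retried = link.try_send("c")     # True: credit came back
```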

  15. Key Features of Msg Passing Abstraction • Source knows send data address, dest. knows receive data address • after handshake they both know both • Arbitrary storage “outside the local address spaces” • may post many sends before any receives • Fundamentally a 3-phase transaction • includes a request / response • can use optimistic 1-phase in limited “Safe” cases • credit scheme
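The 3-phase transaction can be sketched as a rendezvous (class and method names are invented for illustration): phase 1 announces the send, phase 2 replies with the destination address once a matching receive is posted, and phase 3 moves the data straight to that address:

```python
class Rendezvous:
    def __init__(self):
        self.posted = {}               # tag -> destination buffer

    def post_receive(self, tag, buffer):
        self.posted[tag] = buffer      # receiver declares where data belongs

    def send_ready(self, tag):
        # Phase 1: "ready to send" request; phase 2: the response tells the
        # sender where (and whether) it may deposit the data.
        return self.posted.get(tag)

    def transfer(self, buffer, data):
        # Phase 3: bulk data lands directly at its destination, so the message
        # layer needs no arbitrary intermediate storage.
        buffer.extend(data)

net = Rendezvous()
dest = []
net.post_receive(tag=7, buffer=dest)
where = net.send_ready(tag=7)
if where is not None:
    net.transfer(where, [1, 2, 3])
```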

  16. Active Messages • User-level analog of network transaction • transfer data packet and invoke handler to extract it from the network and integrate with on-going computation • Request/Reply • [Figure: a request invokes a request handler at the destination; its reply invokes a reply handler back at the source] • Event notification: interrupts, polling, signal?
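A sketch of the dispatch mechanism (handler indices and names are invented for illustration): each packet carries the index of the handler that will consume it on arrival:

```python
handlers = {}

def register(index, fn):
    handlers[index] = fn

results = []
register(0, lambda a, b: results.append(a + b))  # a toy "request handler"

def am_deliver(packet):
    # The message names its own handler; on arrival the handler runs at once,
    # pulling the data out of the network and into the ongoing computation.
    index, *payload = packet
    handlers[index](*payload)

am_deliver((0, 3, 4))   # handler 0 integrates the payload into `results`
```

A reply would simply be another such packet naming a reply handler at the original source.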

  17. Common Challenges • Input buffer overflow • N-1 queue: over-commitment => must slow sources: how? • reserve space per source (credit) • when available for reuse? • Ack or Higher level • Refuse input when full • backpressure in reliable network • tree saturation • deadlock free • what happens to traffic not bound for congested dest? • Reserve ack back channel • drop packets

  18. Challenges (cont’d) • Fetch Deadlock • For the network to remain deadlock free, nodes must continue accepting messages, even when they cannot send messages • what if an incoming transaction is a request? • each may generate a response, which cannot be sent! • what happens when internal buffering is full? • logically independent request/reply networks • physical networks • virtual channels with separate input/output queues • bound requests and reserve input buffer space • K(P-1) requests + K responses per node • service discipline to avoid fetch deadlock? • NACK on input buffer full • NACK delivery?
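The buffer-reservation bound above is easy to state concretely; assuming P nodes and at most K outstanding requests per node, reserving this much input space guarantees every arriving transaction can be sunk:

```python
def reserved_input_slots(P, K):
    # Every other node may have up to K requests in flight to us, and each of
    # our own K outstanding requests may return exactly one response.
    return {"requests": K * (P - 1), "responses": K}

slots = reserved_input_slots(P=64, K=4)   # e.g. a 64-node machine, K = 4
```

With this reservation (plus logically separate request and reply handling) a node never has to refuse input, so fetch deadlock cannot arise.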

  19. Challenges in Realizing Prog. Models in the Large • One-way transfer of information • No global knowledge, nor global control • barriers, scans, reduce, global-OR give fuzzy global state • Very large number of concurrent transactions • Management of input buffer resources • many sources can issue a request and over-commit the destination before any see the effect • Latency is large enough that you are tempted to “take risks” • optimistic protocols • large transfers • dynamic allocation • Many more degrees of freedom in the design and engineering of these systems

  20. Summary • Scalability • physical, bandwidth, latency and cost • level of integration • Realizing Programming Models • network transactions • protocols • safety • input buffer problem: N-1 • fetch deadlock • Communication Architecture Design Space • how much hardware interpretation of the network transaction?

  21. Communication Architecture Design

  22. Network Transaction Processing • [Figure: node architecture: each node (processor P, memory M) attaches to a scalable network through a communication assist (CA). Output processing: checks, translation, formatting, scheduling. Input processing: checks, translation, buffering, action.] • Key Design Issues: • How much interpretation of the message? • How much dedicated processing in the Comm. Assist?

  23. Spectrum of Designs • None: physical bit stream • blind, physical DMA (nCUBE, iPSC, . . .) • User/System • user-level port (CM-5, *T) • user-level handler (J-Machine, Monsoon, . . .) • Remote virtual address • processing, translation (Paragon, Meiko CS-2) • Global physical address • proc + memory controller (RP3, BBN, T3D) • Cache-to-cache • cache controller (Dash, KSR, Flash) • Down the list: increasing HW support, specialization, intrusiveness, performance (???)

  24. Net Transactions: Physical DMA • DMA controlled by regs, generates interrupts • Physical => OS initiates transfers • Send-side • construct system “envelope” around user data in kernel area • Receive • must receive into system buffer, since no interpretation in CA

  25. nCUBE Network Interface • independent DMA channel per link direction • leave input buffers always open • segmented messages • routing interprets envelope • dimension-order routing on hypercube • bit-serial with 36-bit cut-through • Timing: vendor message library ~160 µs; Active Messages: Os = 16 instructions, 260 cycles, 13 µs; Or = 18 instructions, 300 cycles, 15 µs (includes interrupt)

  26. Conventional LAN Network Interface • [Figure: NIC sits on the I/O bus with its own controller, transceiver, and TX/RX DMA engines (addr, len registers); host memory holds chains of buffer descriptors (addr, len, status, next) for transmit and receive]

  27. User Level Ports • initiate transaction at user level • deliver to user without OS intervention • network port in user space • User/system flag in envelope • protection check, translation, routing, media access in src CA • user/sys check in dest CA, interrupt on system

  28. User Level Network ports • Appears to user as logical message queues plus status • The communication primitives allow a portion of the process state to be in the network

  29. Example: CM-5 • [Figure: machine organization: processing, control, and I/O partitions joined by separate data, control, and diagnostics networks; each node couples a SPARC processor, FPU, caches, and NI control on the MBUS with vector units and DRAM controllers] • Input and output FIFO for each network • 2 data networks • requests and responses • tag per message • index NI mapping table • Timing: Os = 50 cycles (1.5 µs), Or = 53 cycles (1.6 µs), interrupt 10 µs • *T: integrated NI on chip • iWARP also

  30. User Level Handlers • [Figure: message carrying a user/system flag, data, address, and destination travels between nodes (P, Mem) and is delivered directly to a user-level handler] • Hardware support to vector to the address specified in the message • message ports in registers

  31. J-Machine • Each node a small msg driven processor • HW support to queue msgs and dispatch to msg handler task

  32. *T

  33. iWARP • [Figure: host interface unit] • Nodes integrate communication with computation on a systolic basis • msg data direct to register • stream into memory
