
Scalable Multiprocessors (III)



  1. Scalable Multiprocessors (III)

  2. Spectrum of Designs
  • None: physical bit stream
    • blind, physical DMA: nCUBE, iPSC, . . .
  • User/System
    • user-level port: CM-5, *T
    • user-level handler: J-Machine, Monsoon, . . .
  • Dedicated Processor: message passing, remote virtual address
    • processing, translation: Paragon, Meiko CS-2
  • Global physical address
    • processor + memory controller: RP3, BBN, T3D
  • Cache-to-cache
    • cache controller: Dash, KSR, Flash
  • Increasing HW support, specialization, intrusiveness, performance (???)

  3. Dedicated Message Processing
  • Interpretation of network transactions is not bound into the hardware design
    • interpretation is done in software on a dedicated communication processor (CP)
  • Off-loads protocol processing from the compute processor to the CP
  • Can also support a global address space

  4. Without Specialized Hardware Design
  [Figure: two nodes connected by the network; each has memory, a network interface (NI), a compute processor (P), and a message processor (MP); P runs at user level, the MP at system level]
  • General-purpose processor performs arbitrary output processing (at system level)
  • General-purpose processor interprets incoming network transactions (at system level)
  • User processor ↔ message processor: communicate through shared memory
  • Message processor ↔ message processor: communicate via system network transactions

  5. Levels of Network Transaction
  [Figure: user information travels from the source node's memory through its NI, across the network, and into the destination node; user/system levels marked at each end]
  • User processor stores cmd / msg / data into the shared output queue (sketched below)
    • must still check for output queue full (or make it elastic)
  • Communication assists make the transaction happen
    • checking, translation, scheduling, transport, interpretation
  • Effect observed on the destination address space and/or events
  • Protocol is divided between the two layers
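
A minimal sketch of the user-side enqueue described above, assuming a single-producer/single-consumer ring in memory shared with the message processor. All names here (msg_desc_t, oq_t, oq_post) are illustrative, not any machine's actual interface.

```c
/* Minimal sketch of the user-side enqueue into a shared output queue. */
#include <stdbool.h>
#include <stdint.h>

#define OQ_ENTRIES 64u                   /* power of two, cheap wrap-around */

typedef struct {                         /* one command/message descriptor  */
    uint32_t dest_node;                  /* destination node id             */
    uint32_t handler;                    /* what the remote side should do  */
    uint64_t data[4];                    /* small payload in the descriptor */
} msg_desc_t;

typedef struct {                         /* ring in memory shared with MP   */
    volatile uint32_t head;              /* advanced by the message proc.   */
    volatile uint32_t tail;              /* advanced by the user processor  */
    msg_desc_t entries[OQ_ENTRIES];
} oq_t;

/* Post a descriptor; returns false when the queue is full, so the caller
 * can retry or fall back to an elastic (spill-to-memory) scheme.  A real
 * implementation also needs a write barrier before publishing the tail. */
bool oq_post(oq_t *oq, const msg_desc_t *d)
{
    uint32_t tail = oq->tail;
    if (tail - oq->head == OQ_ENTRIES)   /* the "output queue full" check   */
        return false;
    oq->entries[tail % OQ_ENTRIES] = *d; /* store cmd / msg / data          */
    oq->tail = tail + 1;                 /* make it visible to the MP       */
    return true;
}
```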

  6. Example: Intel Paragon
  [Figure: mesh of compute nodes plus I/O nodes, devices, and a service network; each node has two i860XP processors (compute P and message processor MP) sharing memory over a 400 MB/s bus, an NI with send/receive DMA (sDMA/rDMA), and 175 MB/s full-duplex links carrying 2048-byte packets (route, MP handler, variable data, EOP)]
  • i860XP, 50 MHz; 16 KB 4-way cache, 32 B blocks, MESI
  • Memory bus: 400 MB/s; network links: 175 MB/s full duplex

  7. User Level Abstraction
  • Any user process can post a transaction for any other process in its protection domain
    • the communication layer moves OQ_src -> IQ_dest (see the sketch below)
    • may involve indirection: VAS_src -> VAS_dest
  [Figure: two pairs of processes, each process with a virtual address space (VAS), an output queue (OQ), and an input queue (IQ)]
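
A behavioral sketch of the communication layer's half of this abstraction. The helpers are hypothetical stand-ins: same_domain() for the protection check, translate_va() for the VAS_src -> VAS_dest indirection, net_deliver() for the transport that lands the descriptor in IQ_dest.

```c
#include <stdbool.h>

enum { OQ_ENTRIES = 64 };                       /* ring size, as above      */

typedef struct {
    int           dest_pid;                     /* destination process      */
    unsigned long va;                           /* address in VAS_src       */
    char          payload[32];
} desc_t;

bool same_domain(int src_pid, int dest_pid);
unsigned long translate_va(int dest_pid, unsigned long src_va);
void net_deliver(int dest_pid, const desc_t *d);

void comm_layer_move(desc_t oq[], unsigned *head, unsigned tail, int src_pid)
{
    for (; *head != tail; (*head)++) {          /* drain OQ_src in order    */
        desc_t d = oq[*head % OQ_ENTRIES];
        if (!same_domain(src_pid, d.dest_pid))  /* protection-domain check  */
            continue;                           /* or flag a violation      */
        d.va = translate_va(d.dest_pid, d.va);  /* VAS_src -> VAS_dest      */
        net_deliver(d.dest_pid, &d);            /* ends up in IQ_dest       */
    }
}
```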

  8. Basic Implementation Costs: Scalar
  • End-to-end scalar (7-word) transfer: about 10.5 µs
  • Cache-to-cache transfer (quad-word ops)
    • producer, consumer: cache misses and hits become bus transactions
  • To the NI FIFO: read status, check, write, . . .
  • From the NI FIFO: read status, check, dispatch, read, read, . . .
  [Figure: the message passes registers -> cache -> user OQ -> net FIFO -> network -> net FIFO -> user IQ -> cache, with roughly 1.5 to 2 µs spent at each CP and MP stage; network FIFO-to-FIFO time is 250 ns + H*40 ns for H hops]
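
To put a number on the network term from the figure: the FIFO-to-FIFO time grows with the hop count H. Taking H = 5 purely as an illustrative value,

```latex
T_{\text{net}}(H) \approx 250\,\text{ns} + H \cdot 40\,\text{ns},
\qquad
T_{\text{net}}(5) \approx 250 + 5 \cdot 40 = 450\ \text{ns},
```

which is small compared with the roughly 10.5 µs spent in the processors and message processors.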

  9. Virtual DMA -> Virtual DMA
  • Send MP segments the transfer into 8 KB pages and does VA -> PA translation (sketched below)
  • Receive MP reassembles, then does dispatch and VA -> PA translation per page
  [Figure: block-transfer path: user OQ -> sDMA -> net FIFO (2048 B packets, 175 MB/s links) -> rDMA -> user IQ, with a 400 MB/s memory bus on each side and per-stage CP/MP costs similar to the scalar case]
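
A sketch of the send-side message processor's job in this virtual-DMA path: split a user buffer into pieces that never cross an 8 KB page boundary and translate each piece before programming the send-DMA engine. lookup_pa() and start_sdma() are hypothetical stand-ins for the page-table walk and the sDMA interface.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 8192u                       /* 8 KB pages, as on the slide */

uint64_t lookup_pa(uintptr_t va);             /* VA -> PA translation        */
void start_sdma(int dest, uint64_t pa, size_t len);

void send_virtual_dma(int dest, uintptr_t va, size_t len)
{
    while (len > 0) {
        /* each piece must be physically contiguous, so stop at page ends */
        size_t in_page = PAGE_SIZE - (va & (PAGE_SIZE - 1));
        size_t chunk   = len < in_page ? len : in_page;
        start_sdma(dest, lookup_pa(va), chunk);
        va  += chunk;
        len -= chunk;
    }
}
```

The receive MP does the mirror image: it reassembles arriving pages, translates VA -> PA in the destination address space, and dispatches once per page.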

  10. Single Page Transfer Rate
  [Graph: transfer rate vs. size for a single page; effective buffer size 3232 B vs. actual buffer size 2048 B]

  11. Case Study: Meiko CS-2 Concept
  [Figure: SPARCstation 10 nodes attached via the MBUS; each node's communication unit provides dedicated processing for cmd, out, in, event, and reply handling, plus DMA between memory and the network]
  • Asymmetric CP
  • Circuit-switched network transaction
    • source-destination circuit held open for request-response
    • limited command set executed directly on the NI
  • Dedicated communication processor for each step in the flow

  12. Case Study: Meiko CS-2 Organization
  [Figure: organization of the elan network processor, which sits between node memory and the network: a command processor accepts CMD/addr from P (via SWAP) and issues transactions such as write_block (50-µs limit); a thread processor with a RISC instruction set runs up to 64-K non-preemptive threads, constructs arbitrary network transactions, and handles the output protocol; an input processor executes incoming net transactions (write_word, write_block, DMA, set-event); a DMA engine works from user descriptors and data in memory; a reply processor generates output control, set-event, and Run/Start/Set-event DMA actions; all share a common memory interface]

  13. Spectrum of Designs
  • None: physical bit stream
    • blind, physical DMA: nCUBE, iPSC, . . .
  • User/System
    • user-level port: CM-5, *T
    • user-level handler: J-Machine, Monsoon, . . .
  • Dedicated Processor: message passing, remote virtual address
    • processing, translation: Paragon, Meiko CS-2
  • Global physical address
    • processor + memory controller: RP3, BBN, T3D
  • Cache-to-cache
    • cache controller: Dash, KSR, Flash
  • Increasing HW support, specialization, intrusiveness, performance (???)

  14. Shared Physical Address Space
  [Figure: on a read, the source communication assist's pseudo-memory turns the load (tag, src, addr, read, dest) into an output transaction on the scalable network; the destination's pseudo-processor parses it, performs the memory access through its MMU/cache, and returns a read response (src, Rrsp, tag, data) that completes the read at the source]
  • NI emulates a memory controller at the source
  • NI emulates a processor at the destination
  • Must be deadlock free
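
A behavioral sketch of the request-response pair just described. txn_t, net_send(), local_mem_read(), and home_node() are illustrative stand-ins, not any machine's actual interface.

```c
#include <stdint.h>

typedef struct {
    uint8_t  tag;          /* READ request or RRSP response                 */
    int      src_node;     /* who to send the data back to                  */
    uint64_t addr;         /* global physical address                       */
    uint64_t data;         /* valid in the response                         */
} txn_t;

enum { TXN_READ = 1, TXN_RRSP = 2 };

void     net_send(int node, const txn_t *t);
uint64_t local_mem_read(uint64_t addr);
int      home_node(uint64_t addr);

/* Source side ("pseudo-memory"): the CA fields a load whose physical
 * address turns out to be remote and emits a read-request transaction. */
void pseudo_memory_read(int my_node, uint64_t addr)
{
    txn_t req = { .tag = TXN_READ, .src_node = my_node, .addr = addr };
    net_send(home_node(addr), &req);        /* output processing            */
    /* the issuing load completes only when the matching response arrives  */
}

/* Destination side ("pseudo-processor"): runs on arrival of a request,
 * performs the memory access, and returns the data. */
void pseudo_processor_handle(const txn_t *req)
{
    txn_t rsp = { .tag = TXN_RRSP, .src_node = req->src_node,
                  .addr = req->addr, .data = local_mem_read(req->addr) };
    net_send(req->src_node, &rsp);          /* response                     */
    /* requests and responses must use independent resources (e.g. separate
     * virtual networks) for the whole scheme to stay deadlock free */
}
```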

  15. Case Study: Cray T3D
  • A "shell" of support circuitry around the microprocessor embodies the parallel processing capability
  • Remote memory operations are encoded in the address
  [Figure: node organization; 37-bit addresses, 300 MB/s links]

  16. Case Study: Cray T3D
  • No L2 cache
    • local memory access: 155 ns (23 cycles), versus 300 ns on a DEC Alpha workstation
  • Single blocking remote write: 900 ns, plus annex setup and address arithmetic
  • Special support for synchronization
    • dedicated network for global-OR and global-AND operations
    • atomic swap and fetch&inc
  • User-level message queue
    • involves a remote kernel trap
    • enqueue: 25 µs; invoking: 75 µs
  • Small messages using fetch&inc
    • enqueue: 3 µs; dequeue: 1.5 µs
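
Purely as an illustration of encoding a remote memory operation in the address itself, in the spirit of the T3D: the field widths below are invented for the example (an 11-bit PE number over a 26-bit offset, totalling 37 bits to echo the figure) and do not reflect the real T3D annex/DTB layout.

```c
#include <stdint.h>

#define PE_SHIFT    26
#define PE_MASK     0x7FFu                 /* 11-bit PE number (made up)    */
#define OFF_MASK    ((1u << PE_SHIFT) - 1)

static inline uint64_t remote_addr(unsigned pe, uint32_t offset)
{
    return ((uint64_t)(pe & PE_MASK) << PE_SHIFT) | (offset & OFF_MASK);
}

/* A blocking remote write is then just an ordinary store to that address;
 * the surrounding "shell" circuitry routes it to the target node's memory
 * (roughly 900 ns on the T3D, plus the annex setup mentioned above). */
static inline void remote_write(unsigned pe, uint32_t offset, uint64_t value)
{
    volatile uint64_t *p =
        (volatile uint64_t *)(uintptr_t)remote_addr(pe, offset);
    *p = value;
}
```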

  17. Clusters and NOW
  • Cluster: a collection of complete computers with a dedicated interconnect
  • Types of clusters
    • Older systems
      • availability clusters
      • multiprogramming clusters: VAX/VMS clusters
    • Newer systems: mainly used as parallel machines
      • high-performance clusters: Beowulf
      • load-leveling clusters: MOSIX
      • web-service clusters: Linux Virtual Server
      • storage clusters: GFS and OpenGFS
      • database clusters: Oracle Parallel Server
      • high-availability clusters: FailSafe, Heartbeat
      • SSI clusters: OpenSSI cluster project

  18. Technology Breakthrough
  • Scalable, low-latency interconnects
  • Traditional local area networks
    • shared bus: Ethernet
    • ring: token ring and FDDI
  • Scalable bandwidth
    • switch-based local area networks: HIPPI switches, FDDI switches, and Fibre Channel
    • ATM
    • Fast Ethernet and Gigabit Ethernet
  • System area networks
    • ServerNet: Tandem Corp.
    • Myrinet: switch with 8 ports at 160 MB/s each
    • InfiniBand

  19. Issues
  • Communication abstractions
    • TCP/IP
    • Active Messages: user-level network transactions (see the sketch below)
    • Reflective Memory: Shrimp project
    • VIA: led by Intel, Microsoft, and Compaq
  • Hardware support for the communication assist
    • memory bus vs. I/O bus
  • Node architecture
    • single processor vs. SMP
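
A minimal sketch of the Active Message abstraction named above: the sender specifies a handler plus a few word-sized arguments, and the handler runs on the destination when the message arrives. am_request() and am_poll() are illustrative stand-ins, not the actual Active Messages API.

```c
#include <stdint.h>

typedef void (*am_handler_t)(int src_node, uint64_t a0, uint64_t a1);

void am_request(int dest_node, am_handler_t h, uint64_t a0, uint64_t a1);
void am_poll(void);                 /* drain the network, run handlers      */

static volatile uint64_t result;    /* written by the handler               */
static volatile int      ready;

/* Handler: deposit a value and set the flag the computation spins on. */
static void deposit_handler(int src_node, uint64_t value, uint64_t unused)
{
    (void)src_node; (void)unused;
    result = value;
    ready  = 1;
}

/* Sender side: a user-level network transaction, no kernel involved. */
void send_value(int dest_node, uint64_t value)
{
    am_request(dest_node, deposit_handler, value, 0);
}

/* Receiver side: poll until the handler has fired. */
uint64_t wait_for_value(void)
{
    while (!ready)
        am_poll();                  /* deposit_handler runs inside poll     */
    return result;
}
```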

  20. Case Study: NOW
  • General-purpose processor embedded in the NIC

  21. Reflective Memory
  • Writes to a local transmit region are reflected into the corresponding remote receive regions
    • a form of memory-based message passing (sketched below)
  [Figure: nodes i, j, and k map transmit regions (T0..T3) and receive regions (R0..R3) of their virtual address spaces onto shared physical/I/O pages, so a write to a Tx region on one node appears in the matching Rx region on the others]
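
A sketch of the reflective-memory programming model: an ordinary store into a mapped transmit region is picked up by the NI and reflected into the corresponding receive region on the nodes that mapped it. The mapping calls map_tx_region()/map_rx_region() are hypothetical, and in-order reflection of writes is assumed.

```c
#include <stddef.h>
#include <stdint.h>

volatile uint32_t *map_tx_region(int region_id, size_t len);
volatile uint32_t *map_rx_region(int region_id, size_t len);

/* Writer (node i): plain stores; the hardware does the propagation. */
void announce(volatile uint32_t *tx, uint32_t seq, uint32_t value)
{
    tx[1] = value;                  /* payload first ...                    */
    tx[0] = seq;                    /* ... then the word readers spin on    */
}

/* Reader (node j): spins on its local copy of the reflected region. */
uint32_t wait_for(volatile uint32_t *rx, uint32_t seq)
{
    while (rx[0] != seq)
        ;                           /* the reflected write shows up here    */
    return rx[1];
}
```

Node i would call announce(map_tx_region(0, 4096), ...) while node j spins in wait_for(map_rx_region(0, 4096), ...); the point is that the "send" is nothing more than a store.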

  22. Case Study: DEC Memory Channel
  • See also Shrimp
  • PCT: page control table

  23. Implications for Parallel Software and Synchronization

  24. Communication Performance
  • Microbenchmarks: the basic network transactions on a user-to-user basis
    • Active Messages
    • shared address space
    • standard MPI message passing
  • Application level: see Section 7.8.4 in the textbook

  25. Message Time Breakdown
  • End-to-end message time = round-trip time / 2
    • measured by the source processor: there is no global clock
  • Overhead: time that cannot be used for useful computation
  • Latency: can potentially be masked by other useful work
  [Figure: timeline of a message from the source processor through the source communication assist, the network, and the destination communication assist to the destination processor; total communication latency = O_s + L + O_r, with L the observed network latency]
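
Written out, the breakdown the figure sketches is the standard overhead/latency split (symbols as in the slide):

```latex
T_{\text{msg}}
 \;=\; \underbrace{O_s}_{\text{send overhead}}
 \;+\; \underbrace{L}_{\text{network latency (maskable)}}
 \;+\; \underbrace{O_r}_{\text{receive overhead}},
\qquad
T_{\text{msg}} \;\approx\; \tfrac{1}{2}\, T_{\text{round trip}} .
```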

  26. Message Time Comparison
  • One-way Active Message time (five-word message)
  [Bar chart: time per message (µs, 0 to 14) split into sending-side overhead (O_s), communication latency (L, a pipelined sequence of request-response operations), and receiving-side overhead (O_r) for CM-5, Paragon, Meiko CS-2, NOW Ultra, and T3D; annotations: "accessing system memory", "uncached read over I/O bus"]

  27. Performance Analysis
  • Send overhead
    • CM-5: uncached writes of data and an uncached read of NI status
    • Paragon: bus-based cache-coherence protocol within the node
    • Meiko: a pointer is enqueued in the NI with a single swap instruction, but swap is very slow
    • -> the cost of uncached operations, synchronization, and misses is critical to communication performance
  • Receive overhead
    • cache-to-cache transfer: Paragon
    • uncached transfer: CM-5, CS-2 (faster than cache-to-cache)
    • NOW: uncached read over the I/O bus

  28. Performance Analysis (cont'd)
  • Latency: CA occupancy + channel occupancy + network delay
    • CM-5 (20 MB/s links): channel occupancy
    • Paragon (175 MB/s links): CP occupancy
    • Meiko (40 MB/s): accessing system memory from the CP
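
As a back-of-the-envelope illustration of channel occupancy, using the Paragon numbers quoted earlier (a full 2048-byte packet on a 175 MB/s link; the choice of a full packet is for illustration):

```latex
t_{\text{channel}}
 \;=\; \frac{\text{packet size}}{\text{link bandwidth}}
 \;=\; \frac{2048\ \text{B}}{175\ \text{MB/s}}
 \;\approx\; 11.7\ \mu\text{s} .
```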

  29. SAS Time Comparison
  • Performance of a remote read
  [Bar chart: remote-read time (µs, 0 to 25) split into issue, latency, and gap for CM-5, Paragon, Meiko CS-2, NOW Ultra, and T3D]
