Chapter 7 (excl. 7.9): Scalable Multiprocessors


  1. Chapter 7 (excl. 7.9): Scalable Multiprocessors
     EECS 570: Fall 2003 -- rev3

  2. Outline
     Scalability
     • bus doesn't hack it: need scalable interconnect (network)
     Realizing Programming Models on Scalable Systems
     • network transactions
     • protocols
     • shared address space
     • message passing (synchronous, asynchronous, etc.)
     • active messages
     • safety: buffer overflow, fetch deadlock
     Communication Architecture Design Space
     • where does the network attach to the processor node?
     • how much hardware interpretation of the network transaction?
     • impact on cost & performance

  3. Scalability
     How do bandwidth, latency, cost, and packaging scale with P?
     • Ideal:
       • latency, per-processor bandwidth, per-processor cost are constants
       • packaging does not create an upper bound, does not exacerbate the others
     • Bus:
       • per-processor BW scales as 1/P
       • latency increases with P:
         • queuing delays for fixed bus length (linear at saturation?)
         • as bus length is increased to accommodate more CPUs, clock must slow
     • Reality:
       • "scalable" may just mean sub-linear dependence (e.g., logarithmic)
       • practical limits ($/customer), sweet spot + growth path
       • switched interconnect (network)

  4. Aside on Cost-Effective Computing
     Traditional view: efficiency = Speedup(P)/P
     Efficiency < 1 → parallelism not worthwhile?
     But much of a computer's cost is NOT in the processor (memory, packaging, interconnect, etc.)
     [Wood & Hill, IEEE Computer 2/95]
     Let Costup(P) = Cost(P)/Cost(1)
     Parallel computing is cost-effective when Speedup(P) > Costup(P)
     E.g., for an SGI PowerChallenge w/500MB: Costup(32) ≈ 8.6,
     so the 32-processor machine pays off at any speedup above 8.6,
     i.e., an efficiency of only about 27%

  5. Network Transaction Primitive
     One-way transfer of information from source to destination that causes some action at the destination (see the sketch below):
     • process info and/or deposit in buffer
     • state change (e.g., set flag, interrupt program)
     • maybe initiate reply (separate network transaction)
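
To make the primitive concrete, a minimal C sketch (all types and names here are invented for illustration): delivering a packet causes exactly one of the actions listed above, and any reply would be a separate transaction back to the source.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical packet format: a small header interpreted at the destination. */
typedef enum { DEPOSIT, SET_FLAG } action_t;

typedef struct {
    action_t action;      /* what the destination should do  */
    int      src;         /* source node, in case a reply is needed */
    char     payload[64];
} packet_t;

static char rx_buffer[64];   /* destination-side buffer */
static int  msg_ready = 0;   /* destination-side flag   */

/* Destination-side delivery: one-way, causes a state change. */
void deliver(const packet_t *p) {
    switch (p->action) {
    case DEPOSIT:  memcpy(rx_buffer, p->payload, sizeof rx_buffer); break;
    case SET_FLAG: msg_ready = 1; break;   /* e.g., poll target or interrupt */
    }
    /* A reply, if any, would be a separate network transaction to p->src. */
}

int main(void) {
    packet_t p = { DEPOSIT, 0, "hello" };
    deliver(&p);
    printf("received: %s (flag=%d)\n", rx_buffer, msg_ready);
    return 0;
}
```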

  6. Network Transaction Correctness Issues
     • protection
       • user/user, user/system; what if VM protection doesn't apply?
       • fault containment (in a large machine, aggregate component MTBF may be low)
     • format
       • variable length? header info?
       • affects efficiency, ability to handle in HW
     • buffering/flow control
       • finite buffering in the network itself
       • messages show up unannounced, e.g., many-to-one pattern
     • deadlock avoidance
       • if you're not careful -- details later
     • action
       • what happens on delivery? how many options are provided?
     • system guarantees
       • delivery, ordering

  7. Performance Issues
     Key parameters: latency, overhead, bandwidth
     LogP model: L (latency), o (overhead), g (gap, the reciprocal of per-processor bandwidth) as functions of P
     [diagram: sending CPU → sending NI → network → receiving NI → receiving CPU]

  8. Programming Models
     What is the user's view of a network transaction?
     • depends on system architecture
     • remember the layered approach: OS/compiler/library may implement an alternative model on top of the HW-provided model
     We'll look at three basic ones:
     • Active Messages: "assembly language" for msg-passing systems
     • Message Passing: MPI-style interface, as seen by application programmers
     • Shared Address Space: ignoring cache coherence for now

  9. Active Messages
     [diagram: a request message invokes a request handler at the receiver; its reply invokes a reply handler back at the sender]
     User-level analog of the network transaction:
     • invoke handler function at receiver to extract packet from network (sketch below)
     • grew out of attempts to do dataflow on msg-passing machines & remote procedure calls
     • handler may send a reply, but no other messages
     • event notification: interrupts, polling, events?
     • may also perform memory-to-memory transfer
     Flexible (can do almost any action on msg reception), but requires tight cooperation between CPU and network for high performance
     • may be better to have HW do a few things faster
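
A toy C rendering of this idea (the am_* names are invented, and "sending" is simulated by a direct call; a real system dispatches in the NI or runtime): the message carries a pointer to the handler to run on arrival, and a request handler may send at most one reply, nothing else.

```c
#include <stdio.h>

typedef struct am_msg am_msg_t;
typedef void (*handler_t)(int src, am_msg_t *m);

struct am_msg {
    handler_t handler;   /* code to run at the destination */
    int       arg;       /* small payload carried in the packet */
};

/* Simulated network: delivery vectors directly to the named handler. */
void am_send(int src, int dest, handler_t h, int arg) {
    am_msg_t m = { h, arg };
    (void)dest;
    m.handler(src, &m);   /* receiver extracts the packet in the handler */
}

void reply_handler(int src, am_msg_t *m) {
    printf("got reply %d from node %d\n", m->arg, src);
}

/* Request handler: processes the packet and sends one reply, nothing more. */
void request_handler(int src, am_msg_t *m) {
    am_send(/*src=*/1, /*dest=*/src, reply_handler, m->arg * 2);
}

int main(void) {
    am_send(/*src=*/0, /*dest=*/1, request_handler, 21);
    return 0;
}
```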

  10. Message Passing
      Basic idea:
      • Send(dest, tag, buffer) -- tag is an arbitrary integer
      • Recv(src, tag, buffer) -- src/tag may be wildcard ("any")
      Completion semantics (MPI sketch below):
      • receive completes after data transfer from the matching send completes
      • synchronous send completes after matching receive and data sent
      • asynchronous send completes after send buffer may be reused
        • msg may simply be copied into an alternate buffer, on src or dest node
      Blocking vs. non-blocking:
      • does the function wait for "completion" before returning?
      • non-blocking: extra function calls to check for completion
      • assume blocking for now
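
These semantics map directly onto MPI, the interface the slides allude to; a minimal sketch (assuming two ranks) contrasting the three completion flavors:

```c
/* Compile with mpicc; run with: mpirun -np 2 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, x = 42, y = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Synchronous: completes only after the matching receive starts. */
        MPI_Ssend(&x, 1, MPI_INT, 1, /*tag=*/0, MPI_COMM_WORLD);
        /* Asynchronous (standard): completes once the buffer is reusable;
           the data may merely have been copied into a system buffer. */
        MPI_Send(&x, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);
        /* Non-blocking: returns immediately; completion checked separately. */
        MPI_Request req;
        MPI_Isend(&x, 1, MPI_INT, 1, 2, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        for (int tag = 0; tag < 3; tag++)
            /* src/tag could be MPI_ANY_SOURCE / MPI_ANY_TAG (wildcards). */
            MPI_Recv(&y, 1, MPI_INT, 0, tag, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        printf("rank 1 received %d three times\n", y);
    }
    MPI_Finalize();
    return 0;
}
```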

  11. Synchronous Message Passing
      Three-phase operation: ready-to-send, ready-to-receive, transfer
      • can skip 1st phase if receiver initiates & specifies source
      Overhead and latency tend to be high
      Transfer can achieve high bandwidth w/sufficient msg length
      Programmer must avoid deadlock (e.g., pairwise exchange; see the sketch below)
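
The classic trap, sketched in MPI: if both partners in a pairwise exchange issue a synchronous send first, each blocks in the ready-to-send phase forever. Ordering by rank (or using MPI_Sendrecv) breaks the cycle:

```c
#include <mpi.h>
#include <stdio.h>

/* Pairwise exchange between rank and rank^1 (run with an even -np).
   If both partners called MPI_Ssend first, each would wait forever for a
   ready-to-receive that never comes: deadlock. */
int main(int argc, char **argv) {
    int rank, partner, mine, theirs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    partner = rank ^ 1;
    mine = rank;

    if (rank < partner) {   /* lower rank sends first... */
        MPI_Ssend(&mine, 1, MPI_INT, partner, 0, MPI_COMM_WORLD);
        MPI_Recv(&theirs, 1, MPI_INT, partner, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    } else {                /* ...higher rank receives first */
        MPI_Recv(&theirs, 1, MPI_INT, partner, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Ssend(&mine, 1, MPI_INT, partner, 0, MPI_COMM_WORLD);
    }
    printf("rank %d got %d\n", rank, theirs);
    MPI_Finalize();
    return 0;
}
```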

  12. Asynch. Msg Passing: Conservative
      Same as synchronous, except the msg can be buffered on the sender
      • allows computation to continue sooner
      • deadlock still an issue (though less so) -- buffering is finite

  13. Asynch. Message Passing: Optimistic
      Sender just ships data, hopes receiver can handle it
      Benefit: lower latency
      Problems (sketch below):
      • receive was posted: need fast lookup while data streams in
      • receive not posted: buffer? NACK? discard?
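
A sketch of the receiver-side machinery this implies (queue sizes and names are invented): an incoming message is matched against a posted-receive list; on a miss it lands in an "unexpected" queue (the buffering option) until a matching receive is posted.

```c
#include <stdio.h>
#include <string.h>

#define MAXQ 16

typedef struct { int used, src, tag; char data[64]; } entry_t;

static entry_t posted[MAXQ];      /* receives posted by the application */
static entry_t unexpected[MAXQ];  /* optimistic arrivals with no match  */

static entry_t *find(entry_t *q, int src, int tag) {
    for (int i = 0; i < MAXQ; i++)
        if (q[i].used && q[i].src == src && q[i].tag == tag) return &q[i];
    return NULL;
}

static entry_t *free_slot(entry_t *q) {
    for (int i = 0; i < MAXQ; i++)
        if (!q[i].used) return &q[i];
    return NULL;   /* queue full: a real system must NACK, drop, or stall */
}

/* Arrival path: fast lookup in the posted list while data streams in. */
void on_arrival(int src, int tag, const char *data) {
    entry_t *e = find(posted, src, tag);
    if (e) {                                  /* receive was posted */
        strncpy(e->data, data, sizeof e->data - 1);
        printf("delivered src=%d tag=%d\n", src, tag);
        e->used = 0;
    } else {                                  /* not posted: buffer it */
        entry_t *u = free_slot(unexpected);
        if (!u) { printf("overflow: nack or drop\n"); return; }
        u->used = 1; u->src = src; u->tag = tag;
        strncpy(u->data, data, sizeof u->data - 1);
    }
}

/* Posting a receive first checks the unexpected queue for an early arrival. */
void post_recv(int src, int tag, char *out) {
    entry_t *u = find(unexpected, src, tag);
    if (u) { strcpy(out, u->data); u->used = 0; return; }
    entry_t *p = free_slot(posted);
    if (p) { p->used = 1; p->src = src; p->tag = tag; }  /* match later */
}

int main(void) {
    char buf[64] = "";
    on_arrival(3, 7, "early");   /* optimistic send beat the receive */
    post_recv(3, 7, buf);
    printf("recv got: %s\n", buf);
    return 0;
}
```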

  14. Key Features of Msg Passing Abstraction
      Source knows the send data address, dest. knows the receive data address
      • after the handshake, they both know both
      Arbitrary storage for asynchronous protocol
      • may post many sends before any receives
      Fundamentally a 3-phase transaction
      • can use optimistic 1-phase in limited cases
      Latency and overhead tend to be higher than SAS; high BW is easier
      Hardware support?
      • DMA: physical or virtual (better/harder)

  15. Shared Address Space Abstraction
      Two-way request/response protocol
      • reads require a data response
      • writes have an acknowledgment (for consistency)
      Issues
      • virtual or physical address on the net? (where does translation happen?)
      • coherence, consistency, etc. (later)

  16. Key Properties of SAS Abstraction
      Data addresses are specified by the source of the request
      • no dynamic buffer allocation
      • protection achieved through virtual memory translation
      Low-overhead initiation: one instruction (load or store)
      High bandwidth more challenging
      • may require prefetching, separate "block transfer engine"
      Synchronization less straightforward (no explicit event notification)
      Simple request-response pairs
      • few fixed message types
      • practical to implement in hardware w/o remote CPU involvement
      Input buffering / flow control issue
      • what if the request rate exceeds local memory bandwidth?

  17. Challenge 1: Input Buffer Overflow
      Options:
      • refuse input when full
        • creates "back pressure" (in a reliable network)
      • to avoid deadlock: low-level ack/nack
        • assumes a dedicated network path for ack/nack (common in rings)
        • retry on nack
      • drop packets
        • retry on timeout
      • avoid overflow by reserving space per source ("credit-based"; see the sketch below)
        • when is the space available for reuse?
        • scalability?
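
A toy sketch of the credit-based option (variable names invented; real NICs keep these counters in hardware): a sender holds credits equal to the buffer space reserved for it at the receiver, spends one per message, and stalls at zero until the receiver frees a slot and returns a credit.

```c
#include <stdio.h>

#define CREDITS 4   /* buffer slots reserved per source at the receiver */

static int credits = CREDITS;   /* sender-side view of remaining space */
static int rx_occupied = 0;     /* receiver-side buffer occupancy      */

/* Sender: may inject only while it holds credit; else back pressure. */
int try_send(int msg) {
    if (credits == 0) return 0;   /* stall: no reserved space left */
    credits--;
    rx_occupied++;                /* message lands in its reserved slot */
    printf("sent %d (credits left %d)\n", msg, credits);
    return 1;
}

/* Receiver: draining a buffer slot returns one credit to the source. */
void consume(void) {
    if (rx_occupied == 0) return;
    rx_occupied--;
    credits++;                    /* the credit-return message */
}

int main(void) {
    for (int i = 0; i < 6; i++)
        if (!try_send(i)) { consume(); try_send(i); }  /* wait, then retry */
    return 0;
}
```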

  18. Challenge 2: Fetch Deadlock
      Processing a message may require sending a reply; what if the reply can't be sent due to input buffer overflow?
      • step 1: guarantee that replies can be sunk @ destination
        • requester reserves buffer space for the reply (sketch below)
      • step 2: guarantee that replies can be sent into the network
        • back pressure: logically independent request/reply networks
          • physical networks or virtual channels
        • credit-based: bound outgoing requests to K per node
          • buffer space for K(P-1) requests + K responses at each node
        • low-level ack/nack, packet dropping
          • guarantee that replies will never be nacked or dropped
      For cache coherence protocols, some requests may require more: forward the request to another node, send multiple invalidations
      • must extend these techniques or nack such requests up front
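
A minimal sketch of step 1 (names invented): a node refuses to issue a request until buffer space for the eventual reply is reserved locally, which simultaneously bounds its outstanding requests to K.

```c
#include <stdio.h>

#define K 2   /* max outstanding requests per node */

static int reply_slots_free = K;   /* local space reserved for replies */

/* Issue a request only if its eventual reply is guaranteed a buffer. */
int issue_request(int id) {
    if (reply_slots_free == 0) {
        printf("request %d deferred: no reply slot free\n", id);
        return 0;            /* proceeding would risk fetch deadlock */
    }
    reply_slots_free--;      /* reserve before the request leaves */
    printf("request %d issued\n", id);
    return 1;
}

/* A reply arrival is always sinkable: it lands in the reserved slot. */
void reply_arrived(int id) {
    printf("reply %d sunk in reserved slot\n", id);
    reply_slots_free++;
}

int main(void) {
    issue_request(0);
    issue_request(1);
    issue_request(2);    /* deferred: K requests already outstanding */
    reply_arrived(0);
    issue_request(2);    /* now succeeds */
    return 0;
}
```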

  19. Outline
      Scalability [7.1 - read]
      Realizing Programming Models
      • network transactions
      • protocols: SAS, MP, Active Messages
      • safety: buffer overflow, fetch deadlock
      Communication Architecture Design Space
      • where does hardware fit into node architecture?
      • how much hardware interpretation of the network transaction?
        • how much gap between hardware and user semantics?
        • remainder must be done in software
        • increased flexibility, increased latency & overhead
      • main CPU or dedicated/specialized processor?

  20. Massively Parallel Processor (MPP) Architectures
      [diagram: node with processor + cache on a memory bus with main memory; an I/O bridge leads to an I/O bus with a disk controller and disks; the network interface sits on the memory bus, close to the processor]
      • Network interface typically close to the processor
        • memory bus: locked to a specific processor architecture/bus protocol
        • registers/cache: only in research machines
      • Time-to-market is long
        • processor already available, or work closely with processor designers
      • Maximizes performance (and, typically, cost)

  21. Network of Workstations
      [diagram: node with processor + cache, core chip set, and main memory; the I/O bus hosts disk, graphics, and network interface controllers; the NI raises interrupts to the processor]
      • Network interface on the I/O bus
      • Standards (e.g., PCI) => longer life, faster to market
      • Slow (microseconds) to access network interface
      • "System Area Network" (SAN): between LAN & MPP

  22. Transaction Interpretation
      Simple: HW doesn't interpret much, if anything
      • DMA from/to buffer, interrupt or set flag on completion
      • nCUBE, conventional LAN
      • requires OS for address translation, often a user/kernel copy
      User-level messaging: get the OS out of the way
      • HW does protection checks to allow direct user access to the network
      • may have minimal interpretation otherwise
      • may be on I/O bus (Myrinet), memory bus (CM-5), or in regs (J-Machine, *T)
      • may require CPU involvement in all data transfers (explicit memory-to-network copy)

  23. Transaction Interpretation (cont'd)
      Virtual DMA: get the CPU out of the way (maybe)
      • basic protection plus address translation: user-level bulk DMA
      • usually to a limited region of the addr space (pinned)
      • can be done in hardware (VIA, Meiko CS-2) or software (some Myrinet, Intel Paragon)
      Reflective memory
      • DEC Memory Channel, Princeton SHRIMP
      Global physical address space (NUMA): everything in hardware
      • complexity increases, but performance does too (if done right)
      Cache coherence: even more so
      • stay tuned

  24. Net Transactions: Physical DMA
      • Physical addresses: OS must initiate transfers
        • system call per message on both ends: ouch
      • Sending OS copies data to a kernel buffer w/ header/trailer
        • can avoid the copy if the interface does scatter/gather
      • Receiver copies the packet into an OS buffer, then interprets
        • user message then copied (or mapped) into user space

  25. nCUBE/2 Network Interface
      • independent DMA channel per link direction
      • segmented messages: can inspect the header to direct the remainder of the DMA directly to the user buffer
        • avoids the copy at the expense of an extra interrupt + DMA setup cost
        • can't let the buffer be paged out (did nCUBE have VM?)

  26. Conventional LAN Network Interface
      [diagram: NIC on the I/O bus with TX and RX DMA engines; host memory holds chains of buffer descriptors (Addr, Len, Status, Next) that the NIC controller walks to transmit and receive data]

  27. User Level Messaging
      • map network hardware into the user's address space
      • talk directly to the network via loads & stores
      • user-to-user communication without OS intervention: low latency
      • protection: user/user & user/system
      • DMA hard… CPU involvement (copying) becomes the bottleneck

  28. User Level Network Ports
      Appears to the user as logical message queues plus status (see the sketch below)
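
A software analogue of such a port (a real one would be device registers or NI SRAM mapped into the user's address space; this in-memory ring is just a sketch): a queue written and read with ordinary stores and loads, with full/empty status derived from head and tail.

```c
#include <stdio.h>
#include <stdint.h>

#define QSIZE 8   /* power of two: queue depth visible to the user */

/* One logical message queue plus status, as mapped into user space. */
typedef struct {
    uint32_t head, tail;    /* status: full/empty derived from these */
    uint64_t slot[QSIZE];   /* message descriptors / small payloads  */
} port_t;

static int port_full(const port_t *p)  { return p->tail - p->head == QSIZE; }
static int port_empty(const port_t *p) { return p->tail == p->head; }

/* Send: an ordinary store into the mapped queue; no OS on the fast path. */
int port_send(port_t *p, uint64_t msg) {
    if (port_full(p)) return 0;        /* user must poll status and retry */
    p->slot[p->tail % QSIZE] = msg;
    p->tail++;                          /* NI would observe the new tail */
    return 1;
}

/* Receive: poll status, then read the message with an ordinary load. */
int port_recv(port_t *p, uint64_t *msg) {
    if (port_empty(p)) return 0;
    *msg = p->slot[p->head % QSIZE];
    p->head++;
    return 1;
}

int main(void) {
    port_t q = { 0, 0, { 0 } };
    uint64_t m;
    port_send(&q, 0xcafe);
    if (port_recv(&q, &m)) printf("got %#llx\n", (unsigned long long)m);
    return 0;
}
```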

  29. Example: CM-5
      • Input and output FIFO for each network
      • Two data networks
      • Save/restore network buffers on context switch

  30. User Level Handlers
      [diagram: message format with a user/system bit, data, address, and dest fields; messages flow between the memory and processor of each node]
      • Hardware support to vector to the address specified in the message
        • message ports in registers
        • alternate register set for the handler?
      • Examples: J-Machine, Monsoon, *T (MIT), iWARP (CMU)

  31. J-Machine
      • Each node a small message-driven processor
      • HW support to queue msgs and dispatch to a msg handler task

  32. Dedicated Message Processing Without Specialized Hardware
      [diagram: each node pairs a compute processor P and a message processor MP sharing memory behind one NI; user code runs on P, system-level message handling on MP]
      • General-purpose processor performs arbitrary output processing (at system level)
      • General-purpose processor interprets incoming network transactions (in system)
      • User Processor <-> Msg Processor: share memory
      • Msg Processor <-> Msg Processor: via system network transaction

  33. Levels of Network Transaction
      [diagram: as on the previous slide -- P and MP per node, shared memory, NI to the network]
      • User processor stores cmd / msg / data into a shared output queue
        • must still check for output queue full (or grow it dynamically)
      • Communication assists make the transaction happen
        • checking, translation, scheduling, transport, interpretation
      • Avoids system call overhead
      • Multiple bus crossings likely the bottleneck

  34. Example: Intel Paragon
      [diagram: node with two i860XP CPUs (50 MHz, 16 KB 4-way caches, 32 B blocks, MESI) -- one compute processor P and one message processor MP -- sharing a 400 MB/s memory bus with send/receive DMA (sDMA/rDMA); NI with 2048 B queues onto a network of 175 MB/s duplex links; separate I/O nodes attach devices]

  35. [figure-only slide]

  36. Dedicated MP w/Specialized NI: Meiko CS-2
      • Integrate the message processor into the network interface
        • active-messages-like capability
        • dedicated threads for DMA, reply handling, simple remote memory access
        • supports user-level virtual DMA
          • own page table
          • can take a page fault, signal the OS, restart
          • meanwhile, nack the other node
      • Problem: processor is slow, time-slices threads
        • fundamental issue with building your own CPU

  37. Myricom Myrinet (Berkeley NOW)
      • Programmable network interface on the I/O bus (Sun SBus or PCI)
        • embedded custom CPU ("LANai", ~40 MHz RISC)
        • 256 KB SRAM
        • 3 DMA engines: to network, from network, to/from host memory
      • Downloadable firmware executes in kernel mode
        • includes source-based routing protocol
      • SRAM pages can be mapped into user space
        • separate pages for separate processes
        • firmware can define status words, queues, etc.
          • data for short messages, or pointers for long ones
        • firmware can do address translation too… w/OS help
        • polls to check for sends from the user
      • Bottom line: I/O bus still the bottleneck; CPU could be faster

  38. Shared Physical Address Space
      • Implement the SAS model in hardware w/o caching
        • actual caching must be done by copying from remote memory to local
        • programming paradigm looks more like message passing than Pthreads
        • yet low-latency & low-overhead transfers thanks to HW interpretation; high bandwidth too, if done right
        • result: great platform for MPI & compiled data-parallel codes
      • Implementation (sketch below):
        • "pseudo-memory" acts as the memory controller for remote mem, converts accesses into network transactions (requests)
        • "pseudo-CPU" on the remote node receives requests, performs them on local memory, sends replies
        • split-transaction or retry-capable bus required (or dual-ported mem)
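
To make the pseudo-memory / pseudo-CPU pairing concrete, a toy C sketch (the function and type names are mine, and the "network" is a direct call): the requesting side turns a load or store into a request transaction, and the home node performs it on local memory and returns the response.

```c
#include <stdio.h>
#include <stdint.h>

/* Simulated home node's memory, indexed by word address. */
static uint32_t remote_mem[1024];

typedef struct { enum { READ_REQ, WRITE_REQ } op;
                 uint32_t addr, data; } request_t;
typedef struct { uint32_t data; } response_t;

/* "Pseudo-CPU" at the home node: services requests on local memory. */
response_t pseudo_cpu(request_t r) {
    response_t resp = { 0 };
    if (r.op == READ_REQ)  resp.data = remote_mem[r.addr];  /* data response */
    else                   remote_mem[r.addr] = r.data;     /* ack response  */
    return resp;
}

/* "Pseudo-memory" at the requester: converts an access to a transaction. */
uint32_t remote_load(uint32_t addr) {
    request_t r = { READ_REQ, addr, 0 };
    return pseudo_cpu(r).data;        /* reads require a data response */
}

void remote_store(uint32_t addr, uint32_t val) {
    request_t r = { WRITE_REQ, addr, val };
    (void)pseudo_cpu(r);              /* ack needed for consistency */
}

int main(void) {
    remote_store(5, 0xbeef);
    printf("remote_load(5) = 0x%x\n", remote_load(5));
    return 0;
}
```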

  39. Example: Cray T3D
      • Up to 2,048 Alpha 21064s
        • no off-chip L2, to avoid its inherent latency
      • In addition to remote memory ops, includes:
        • prefetch buffer (hides remote latency)
        • DMA engine (requires OS trap)
        • synchronization operations (swap, fetch&inc, global AND/OR)
        • message queue (requires OS trap on the receiver)
      • Big problem: physical address space
        • 21064 supports only 32 bits
        • a 2K-node machine is limited to 2 MB per node
        • external "DTB annex" provides segment-like registers for extended addressing, but management is expensive & ugly

  40. Cray T3E
      • Similar to T3D, but uses the Alpha 21164 instead of the 21064 (on-chip L2)
        • still has physical address space problems
      • E-registers for remote communication and synchronization
        • 512 user, 128 system; 64 bits each
        • replace/unify the DTB annex, prefetch queue, block transfer engine, remote load/store, and message queue
      • Address specifies a source or destination E-register and a command
      • Data contains a pointer to a block of 4 E-regs and an index for the centrifuge
      • Centrifuge
        • supports data distributions used in data-parallel languages (HPF)
        • 4 E-regs for a global memory operation: mask, base, two arguments
      • Get & Put operations

  41. T3E (continued)
      • Atomic memory operations
        • E-registers & centrifuge used
        • F&I, F&Add, Compare&Swap, Masked_Swap
      • Messaging
        • arbitrary number of queues (user or system)
        • 64-byte messages
        • create a msg queue by storing a message control word to a memory location
      • Msg send
        • construct data in an aligned block of 8 E-regs
        • send like a put, but dest must be a message control word
        • processor is responsible for queue space (buffer management)
      • Barrier and eureka synchronization

  42. DEC Memory Channel (Princeton SHRIMP)
      [diagram: sender's virtual send region maps through a page control table to physical pages; writes are reflected across the network into a pinned physical receive region mapped into the receiver's virtual address space]
      • Reflective memory
        • writes on the sender appear in the receiver's memory
        • send & receive regions
        • page control table
      • Receive region is pinned in memory
      • Requires duplicate writes; really just message buffers

  43. Performance of Distributed Memory Machines
      • Microbenchmarking
        • one-way latency of a small (five-word) message
          • echo test: round-trip time divided by 2 (sketch below)
        • shared-memory remote read
        • message-passing operations
        • see text
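
The echo test in MPI form (a sketch of mine; any messaging layer would be timed the same way): time many round trips of a five-word message and halve the average.

```c
#include <mpi.h>
#include <stdio.h>

#define ITERS 10000
#define WORDS 5   /* small, five-word message as in the text */

int main(int argc, char **argv) {
    int rank, buf[WORDS] = { 0 };
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {          /* echo test: send, then await the echo */
            MPI_Send(buf, WORDS, MPI_INT, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, WORDS, MPI_INT, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {   /* echo server */
            MPI_Recv(buf, WORDS, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, WORDS, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }
    }
    if (rank == 0)                /* one-way latency = round trip / 2 */
        printf("one-way latency: %.2f us\n",
               (MPI_Wtime() - t0) / ITERS / 2 * 1e6);
    MPI_Finalize();
    return 0;
}
```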

  44. Network Transaction Performance
      See Figure 7.31.

  45. Remote Read Performance
      See Figure 7.32.

  46. Summary of Distributed Memory Machines
      • Convergence of architectures
        • everything "looks basically the same": processor, cache, memory, communication assist
      • Communication assist
        • where is it? (I/O bus, memory bus, processor registers)
        • what does it know?
          • does it just move bytes, or does it perform some functions?
        • is it programmable?
        • does it run user code?
      • Network transaction
        • input & output buffering
        • action on remote node
