Explore cluster computing with commodity parts, fast communication, and virtual memory. Learn about virtual networks, scheduling, parallel I/O, and scalability of clusters. Discover the power of explicit coscheduling and subsystem structures in cluster systems.
High-Performance Clusters, Part 2: Generality
David E. Culler
Computer Science Division, U.C. Berkeley
PODC/SPAA Tutorial, Sunday, June 28, 1998
What's Different about Clusters?
• Commodity parts?
• Communications packaging?
• Incremental scalability?
• Independent failure?
• Intelligent network interfaces?
• Fast, scalable communication?
=> A complete system on every node:
• virtual memory
• scheduler
• file system
• ...
Topics: Part 2
• Virtual networks: communication meets virtual memory
• Scheduling
• Parallel I/O
• Clusters of SMPs
• VIA
General-Purpose Requirements
• Many timeshared processes, each with direct, protected access
• User and system
• Client/server, parallel clients, parallel servers: they grow, shrink, and handle node failures
• Multiple packages in a process, each possibly with its own internal communication layer
• Use communication as easily as memory
Virtual Networks
• An endpoint abstracts the notion of being "attached to the network"
• A virtual network is a collection of endpoints that can name each other
• Many processes on a node can each have many endpoints, each with its own protection domain (see the sketch below)
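To make the abstraction concrete, here is a minimal sketch of what per-endpoint state might look like; the struct and field names are illustrative assumptions, not the actual Berkeley NOW / AM-II interface:

```c
#include <stdint.h>

/* Hedged sketch of per-endpoint state for a virtual network.
 * All names here are invented for illustration. */
typedef struct endpoint {
    uint32_t vnet_id;        /* which virtual network this endpoint belongs to */
    uint32_t ep_index;       /* this endpoint's name within the virtual network */
    uint32_t protection_tag; /* checked by the NIC on every send/receive */
    void    *tx_queue;       /* send descriptors (pinned when the endpoint is hot) */
    void    *rx_queue;       /* receive descriptors */
    int      resident;       /* nonzero if currently bound to a NIC frame */
} endpoint_t;
```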
How Are They Managed?
• How do you get direct hardware access for performance across a large space of logical resources?
• Just like virtual memory: the active portion of the large logical space is bound to physical resources
[Figure: processes 1..n, each with endpoints in host memory; the network interface's own memory holds the currently active endpoints]
Endpoint Transition Diagram
• HOT: read/write, resident in NIC memory
• WARM: read-only, paged host memory (a write or message arrival makes it hot; eviction from the NIC makes a hot endpoint warm)
• COLD: paged host memory (a read warms it; swapping makes a warm endpoint cold)
The state machine is sketched in code below.
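The diagram reduces to a small state machine; this hedged sketch encodes the transitions as read from the slide (enum and function names are invented):

```c
/* Illustrative endpoint state machine for the HOT/WARM/COLD diagram. */
typedef enum { EP_COLD, EP_WARM, EP_HOT } ep_state_t;
typedef enum { EV_READ, EV_WRITE, EV_MSG_ARRIVAL, EV_EVICT, EV_SWAP } ep_event_t;

ep_state_t ep_next_state(ep_state_t s, ep_event_t e) {
    switch (s) {
    case EP_COLD:
        return (e == EV_READ) ? EP_WARM : EP_COLD;     /* read pages it in */
    case EP_WARM:
        if (e == EV_WRITE || e == EV_MSG_ARRIVAL)
            return EP_HOT;                             /* bind to a NIC frame */
        if (e == EV_SWAP)
            return EP_COLD;                            /* page it out */
        return EP_WARM;
    case EP_HOT:
        return (e == EV_EVICT) ? EP_WARM : EP_HOT;     /* lose the NIC frame */
    }
    return s;
}
```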
Network Interface Support
• The NIC holds endpoint frames (e.g., frames 0-7, each with transmit and receive queues)
• It services active endpoints
• It signals misses to the driver, using a system endpoint
Solaris System Abstractions
• Segment driver: manages portions of an address space
• Device driver: manages an I/O device
The virtual network driver combines both roles.
LogP Performance
• Competitive latency
• Increased NIC processing
• The difference is mostly ack processing, protection checks, data structures, and code quality
• Virtualization is cheap
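As a refresher on what the LogP measurements capture (standard LogP definitions, not slide content): with network latency L, per-message send/receive overhead o, gap g, and P processors, the basic message costs are

```latex
T_{\text{one-way}} \;=\; o_{\text{send}} + L + o_{\text{recv}} \;\approx\; L + 2o,
\qquad
T_{\text{round-trip}} \;\approx\; 2(L + 2o)
```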
Bursty Communication Among Many
[Figure: message-burst workload exchanged among several clients and servers]
Multiple VNs, Single-Threaded Server
Multiple VNs, Multithreaded Server
Perspective on Virtual Networks
• Networking abstractions are vertical stacks: a new function means a new layer, which gets poked through for performance
• Virtual networks provide a horizontal abstraction: a basis for building new, fast services
• Open questions: What is the communication "working set"? What placement, replacement, ...?
Beyond the Personal Supercomputer
• Able to timeshare parallel programs, with fast, protected communication
• Mix with sequential and interactive jobs
• Use fast communication in OS subsystems: parallel file system, network virtual memory, ...
• Nodes have a powerful, local OS scheduler
• Problem: local schedulers do not know to run parallel jobs in parallel
Local Scheduling
• Schedulers act independently, without global control
• A program waits while trying to communicate with peers that are not running
• 10-100x slowdowns for fine-grain programs!
=> Coordinated scheduling is needed
Explicit Coscheduling
• Global context switch according to a precomputed schedule
• How do you build it? Does it work?
Typical Cluster Subsystem Structures
[Figure: two structures. Master-slave: one master directs local-service daemons on each node, which run over the communication layer alongside applications. Peer-to-peer: a global-service component on each node coordinates with its peers, again over the communication layer.]
Ideal Cluster Subsystem Structure
• Obtain coordination without explicit subsystem interaction, using only the events in the program
• Very easy to build
• Potentially very robust to component failures
• Inherently "service on demand"
• Scalable
• The local service component can evolve
[Figure: global service and local service stacked on each node, coordinating only through application communication]
Three Approaches Examined in NOW
• GLUNIX: explicit master-slave (user level); a matrix algorithm picks the parallel program (PP) to run, and stops and signals try to force the desired PP onto the processors
• Explicit peer-to-peer scheduling, assisted by virtual networks; coscheduling daemons decide on the PP and kick the Solaris scheduler
• Implicit: modify the parallel runtime library so that it gets itself coscheduled under the standard scheduler
Problems with Explicit Coscheduling
• Implementation complexity
• Need to identify parallel programs in advance
• Interacts poorly with interactive use and load imbalance
• Introduces new potential faults
• Scalability
Why Implicit Coscheduling Might Work
• Active-message request-reply model
• Infer non-local state from local observations, and react to maintain coordination (a sketch of the wait loop follows the table):

Observation        Implication              Action
fast response      partner scheduled        spin
delayed response   partner not scheduled    block

[Figure: timeline on four workstations; Job A spins through request-response while scheduled with its peers, sleeps when Job B holds a node]
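A minimal sketch of the spin-then-block rule above; the primitives and timing parameter are assumptions made for illustration, not the NOW runtime's API:

```c
#include <stdbool.h>

/* Hypothetical primitives, assumed for illustration only. */
extern bool reply_arrived(int req_id);      /* has our reply come back? */
extern int  poll_network(void);             /* service pending messages; count served */
extern void block_until_wakeup(int req_id); /* sleep until the reply interrupt */
extern long now_usec(void);

/* Two-phase waiting: spin long enough for a scheduled partner to answer,
 * keep spinning while we are serving other processes' messages, and
 * otherwise block so the local scheduler can run something else. */
void wait_for_reply(int req_id, long spin_usec) {
    long deadline = now_usec() + spin_usec;
    while (!reply_arrived(req_id)) {
        if (poll_network() > 0)
            deadline = now_usec() + spin_usec; /* messages arriving: stay scheduled */
        if (now_usec() > deadline) {
            block_until_wakeup(req_id);        /* partner likely descheduled */
            return;
        }
    }
}
```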
Obvious Questions
• Does it work?
• How long do you spin?
• What are the requirements on the local scheduler?
How Long to Spin?
• Answer: round-trip time + context-switch time + message-processing time
• Spin one round trip to stay scheduled together, plus a wake-up interval to get scheduled together
• Keep spinning if you are serving messages
• An interval of about 3x the wake-up cost works well
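Written out as a formula (a paraphrase of the rule above, with LogP's round-trip estimate substituted in; the symbols are mine, not the slides'):

```latex
T_{\text{spin}} \;\approx\;
\underbrace{2(L + 2o)}_{\text{round trip}}
\;+\; T_{\text{context switch}} \;+\; T_{\text{msg processing}}
```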
Does it work?
Synthetic Bulk-Synchronous Apps
• Range of granularity and load imbalance
• Spin-wait: 10x slowdown
With a Mixture of Reads
• Block-immediate: 4x slowdown
Timesharing Split-C Programs
Many Questions
• What about a mix of jobs? Sequential jobs? Unbalanced placement?
• Fairness?
• Scalability?
• How broadly can implicit coordination be applied in the design of cluster subsystems?
• Can resource management be completely decentralized? Computational economies, ecologies
A Look at Serious File I/O
• Traditional I/O system vs. the NOW I/O system
• Benchmark problem: sort a large number of 100-byte records with 10-byte keys
• Start on disk, end on disk; accessible as files (use the file system)
• Datamation sort: 1 million records
• Minute sort: as much as you can in a minute
[Figure: a traditional single processor-memory I/O system vs. NOW's many processor-memory (P-M) nodes]
NOW-Sort Algorithm
• Read: N/P records from disk into memory
• Distribute: scatter keys to the processors holding the result buckets; gather keys from all processors
• Sort: partial radix sort on each bucket
• Write: write records to disk
• (2-pass variant: gather data runs onto disk, then do a local, external merge sort)
A sketch of the distribute step follows.
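A minimal sketch of the distribute step, assuming 100-byte records with the 10-byte key first and a power-of-two processor count; the names and the exact partitioning rule are illustrative, not NOW-Sort's code:

```c
#include <stdint.h>
#include <string.h>

#define REC_BYTES 100   /* record size; the 10-byte key comes first */

/* Destination processor for a record: the key's leading bits select the
 * owner, so each processor ends up with a contiguous key range
 * (assumes the processor count is a power of two, P = 2^log2_p). */
static inline int dest_proc(const uint8_t *key, int log2_p) {
    uint32_t top = ((uint32_t)key[0] << 16) | ((uint32_t)key[1] << 8) | key[2];
    return (int)(top >> (24 - log2_p));
}

/* Scatter phase: copy each local record into the send buffer of the
 * processor that owns its bucket (send_buf/send_cnt are per-destination). */
void scatter(const uint8_t *recs, long n, int log2_p,
             uint8_t **send_buf, long *send_cnt) {
    for (long i = 0; i < n; i++) {
        const uint8_t *r = recs + i * REC_BYTES;
        int d = dest_proc(r, log2_p);
        memcpy(send_buf[d] + send_cnt[d] * REC_BYTES, r, REC_BYTES);
        send_cnt[d]++;
    }
}
```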
Key Implementation Techniques
• Performance isolation: a highly tuned local disk-to-disk sort
• manage local memory
• manage disk striping
• memory-mapped I/O with madvise, plus buffering
• manage overlap with threads
• Efficient communication: completely hidden under disk I/O, though it competes for I/O-bus bandwidth
• Self-tuning software: probe available memory, disk bandwidth, and trade-offs
World-Record Disk-to-Disk Sort
• Sustained 500 MB/s of disk bandwidth and 1,000 MB/s of network bandwidth
• ...but only in the wee hours of the morning
Towards a Cluster File System
• A remote-disk (RD) system built on a virtual network
[Figure: client linking RDlib, talking to an RD server over active messages]
A sketch of the request path follows.
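A hedged sketch of how a remote-disk read could be phrased over active messages, in the spirit of the client/RD-server split above; the handler-registration scheme and every name are invented for illustration:

```c
#include <stddef.h>
#include <stdint.h>

#define RD_BLK 8192

/* Hypothetical active-message layer (illustration only): handlers are
 * registered by index, and requests name the handler to run at the server. */
extern void am_request_4(int dest_ep, int handler_idx, uint64_t arg);
extern void am_reply_bulk(int src_ep, const void *buf, size_t len);
extern size_t disk_read(uint64_t blockno, void *buf, size_t len);

enum { RD_READ_HANDLER = 1 };   /* index agreed on by client and server */

/* Server side: runs when a client's read request arrives. */
void rd_read_handler(int src_ep, uint64_t blockno) {
    static char buf[RD_BLK];
    disk_read(blockno, buf, RD_BLK);     /* local disk I/O on the server */
    am_reply_bulk(src_ep, buf, RD_BLK);  /* stream the block back */
}

/* Client side (RDlib): fire the request; the reply handler delivers data. */
void rd_read(int server_ep, uint64_t blockno) {
    am_request_4(server_ep, RD_READ_HANDLER, blockno);
}
```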
Streaming Transfer Experiment
Results
• Data distribution affects resource utilization, not delivered bandwidth
I/O Bus Crossings
Opportunity: PDISK
• Producers dump data into the I/O river; consumers pull it out
• Hash data records across disks (see the sketch below)
• Match producers to consumers
• Integrated with work scheduling
[Figure: processors (P) linked by fast communication with remote queues to streaming disk queues for fast I/O]
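One way the hashing step could look; the function name and the choice of FNV-1a are illustrative assumptions, not from the talk:

```c
#include <stdint.h>
#include <stddef.h>

/* Hash a record's key to pick the disk queue it flows to, so the
 * "river" spreads records evenly across the disks (FNV-1a hash). */
uint32_t disk_for_record(const uint8_t *key, size_t keylen, uint32_t ndisks) {
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < keylen; i++) {
        h ^= key[i];
        h *= 16777619u;
    }
    return h % ndisks;
}
```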
What Will Be the Building Block?
• Clusters of SMPs: NOWs grow the number of nodes, SMPs grow the processors per node
[Figure: SMP nodes, each with processors and memory on a memory interconnect plus network cards, joined through the network cloud; chart of nodes vs. processors per node]
Multi-Protocol Communication
• A uniform programming model is key
• Multiprotocol messaging: careful layout of message queues; concurrent objects; polling the network hurts the memory system
• Shared virtual memory relies on the underlying messages
• Pooling vs. contention
[Figure: sends/writes and receives/reads pass through a single communication layer spanning shared memory and the network]
A sketch of the protocol choice follows.
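A minimal sketch of the protocol-selection decision: within one SMP, deliver through a shared-memory queue; across SMPs, go through the NIC. All names are invented for illustration:

```c
#include <stddef.h>

/* Hypothetical topology and transport primitives, illustration only. */
extern int  node_of(int proc);   /* which SMP a process lives on */
extern int  my_node(void);
extern void shmem_enqueue(int proc, const void *msg, size_t len);
extern void net_send(int proc, const void *msg, size_t len);

void mp_send(int dest, const void *msg, size_t len) {
    if (node_of(dest) == my_node())
        shmem_enqueue(dest, msg, len);  /* cache-to-cache, no NIC crossing */
    else
        net_send(dest, msg, len);       /* cross-SMP via the network */
}
```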
LogP Analysis of Shared-Memory AM
Virtual Interface Architecture (VIA)
[Figure: applications (sockets, MPI, legacy code, etc.) use the slow path through the kernel VIA driver for open, connect, and map-memory operations, and the fast user-level path through the VI user agent ("libvia") for descriptor reads and writes; doorbells signal the VI-capable NIC, which holds each VI's send (S) and receive (R) queues plus a completion queue, and reports completed requests]
VIA Implementation Overview
[Figure: send path on a VI NIC. (1) The host writes a memory-mapped doorbell page; (2) the NIC issues a DMA request and (3) DMA-reads the descriptor from the descriptor queue; (4) it issues a DMA request and (5) DMA-reads the data from the Tx buffers; ... (7) received data is DMA-written to buffers in kernel memory mapped into the application.]
A sketch of the host-side send posting follows.
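A hedged sketch of the host side of this path: build a descriptor in the registered (pinned) send queue, then ring the doorbell so the NIC can DMA the descriptor and then the data. Field and parameter names are illustrative, not the exact VIA structures:

```c
#include <stdint.h>

typedef struct {
    uint64_t data_addr;   /* registered buffer holding the payload */
    uint32_t length;
    uint32_t mem_handle;  /* protection handle from memory registration */
    uint32_t control;     /* opcode / completion flags */
} via_desc_t;

void via_post_send(via_desc_t *sendq, uint32_t *tail,
                   volatile uint32_t *doorbell,
                   uint64_t buf, uint32_t len, uint32_t handle) {
    via_desc_t *d = &sendq[*tail];       /* queue wrap omitted for brevity */
    d->data_addr  = buf;
    d->length     = len;
    d->mem_handle = handle;
    d->control    = 1u;                  /* illustrative SEND opcode */
    (*tail)++;
    *doorbell = *tail;  /* step (1): doorbell write; the NIC then DMA-reads
                           the descriptor (2,3) and the data (4,5) */
}
```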
Current VIA Performance
VIA Ahead
• You will be able to buy decent clusters
• Virtualization in host memory is easy; will it go beyond pinned regions? Active endpoints (doorbells) still need to be managed
• Complex descriptor queues will hinder low-latency short messages: NICs will chew on them, but many instructions remain on the host
• Need to re-examine where error handling, flow control, and retry are performed
• Interactions with scheduling, I/O, locking, etc. will dominate application speed-up, and will demand new development methodologies
Conclusions
• A complete system on every node makes clusters a very powerful architecture; we can finally get serious about I/O
• Extend the system globally: virtual memory systems, schedulers, file systems, ...
• Efficient communication enables new solutions to classic systems challenges
• Opens a rich set of issues for parallel processing beyond the personal supercomputer or LAN, where SPAA and PODC meet