
High-Performance Clusters part 2: Generality



Presentation Transcript


1. High-Performance Clusters part 2: Generality
David E. Culler, Computer Science Division, U.C. Berkeley
PODC/SPAA Tutorial, Sunday, June 28, 1998

2. What's Different about Clusters?
• Commodity parts?
• Communications Packaging?
• Incremental Scalability?
• Independent Failure?
• Intelligent Network Interfaces?
• Fast Scalable Communication?
=> Complete system on every node
  • virtual memory
  • scheduler
  • file system
  • ...

3. Topics: Part 2
• Virtual Networks
  • communication meets virtual memory
• Scheduling
• Parallel I/O
• Clusters of SMPs
• VIA

4. General purpose requirements
• Many timeshared processes
  • each with direct, protected access
  • user and system
• Client/server, parallel clients, parallel servers
  • they grow, shrink, handle node failures
• Multiple packages in a process
  • each may have its own internal communication layer
• Use communication as easily as memory

5. Virtual Networks
• An endpoint abstracts the notion of being "attached to the network"
• A virtual network is a collection of endpoints that can name each other
• Many processes on a node can each have many endpoints, each with its own protection domain
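To make the abstraction concrete, here is a minimal sketch in C of what an endpoint/virtual-network interface could look like. The names (vn_endpoint, vn_create, vn_send, vn_poll) are hypothetical illustrations, not the actual Berkeley virtual-network API.

```c
/* Hypothetical endpoint interface, sketching the virtual-network abstraction.
 * Each process may hold many endpoints; endpoints in the same virtual network
 * can name one another by a small integer, and the NIC enforces protection. */
#include <stddef.h>

typedef struct vn_endpoint vn_endpoint;          /* opaque endpoint handle */

/* Create an endpoint inside virtual network `vnet_id`, owned and protected
 * by the calling process. */
vn_endpoint *vn_create(int vnet_id);

/* Send a short message from a local endpoint to endpoint number `dest`
 * within the same virtual network. */
int vn_send(vn_endpoint *ep, int dest, const void *msg, size_t len);

/* Drain the endpoint's receive queue, invoking a handler per message. */
typedef void (*vn_handler)(int src, const void *msg, size_t len);
int vn_poll(vn_endpoint *ep, vn_handler handler);
```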

6. How are they managed?
• How do you get direct hardware access for performance with a large space of logical resources?
• Just like virtual memory
  • active portion of the large logical space is bound to physical resources
[Figure: processes 1..n in host memory, with their active endpoints bound to memory on the NIC (network interface)]

7. Endpoint Transition Diagram
[Diagram: endpoint states HOT (read/write, resident in NIC memory), WARM (read-only, paged host memory), and COLD (paged host memory). Writes and message arrivals promote an endpoint toward HOT; a read promotes COLD to WARM; evict and swap demote it back down.]
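A small sketch of how a driver might track these states in C; the enum values and transition helpers are illustrative, not code from the actual system.

```c
/* Illustrative endpoint-state bookkeeping mirroring the diagram above:
 * HOT endpoints live in NIC memory (read/write), WARM ones are read-only
 * and paged to host memory, COLD ones are fully paged out. */
typedef enum { EP_COLD, EP_WARM, EP_HOT } ep_state;

typedef struct {
    ep_state state;
    int      nic_frame;     /* NIC frame index if HOT, -1 otherwise */
} endpoint_meta;

/* A write or an arriving message requires the endpoint to be HOT,
 * so the driver binds it to a free (or just-evicted) NIC frame. */
void ep_touch_for_write(endpoint_meta *ep, int free_frame) {
    if (ep->state != EP_HOT) {
        ep->nic_frame = free_frame;    /* bind to physical NIC resources */
        ep->state = EP_HOT;
    }
}

/* Under pressure the driver evicts a HOT endpoint back to host memory. */
void ep_evict(endpoint_meta *ep) {
    ep->nic_frame = -1;
    ep->state = EP_WARM;               /* a later swap would demote it to COLD */
}
```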

8. Network Interface Support
• NIC has endpoint frames
• Services active endpoints
• Signals misses to the driver
  • using a system endpoint
[Figure: NIC with endpoint frames 0..7, each with transmit and receive queues; an endpoint miss is reported through the system endpoint]
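A sketch of the NIC-side dispatch this slide implies, with an 8-frame table and a miss path; the structure and names are assumptions for illustration only.

```c
/* NIC-side sketch: 8 resident endpoint frames; traffic for an endpoint that
 * is not resident is reported to the host driver via the system endpoint. */
#define NIC_FRAMES 8

typedef struct {
    int endpoint_id;       /* logical endpoint bound to this frame (-1 = free) */
    /* transmit and receive queues would live here */
} nic_frame;

static nic_frame frames[NIC_FRAMES];

/* Return the frame bound to `endpoint_id`, or -1 to signal a miss
 * (the miss would be queued on the system endpoint for the driver). */
int nic_lookup(int endpoint_id) {
    for (int i = 0; i < NIC_FRAMES; i++)
        if (frames[i].endpoint_id == endpoint_id)
            return i;
    return -1;
}
```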

9. Solaris System Abstractions
• Segment Driver
  • manages portions of an address space
• Device Driver
  • manages an I/O device
• Virtual Network Driver: built from both abstractions

10. LogP Performance
• Competitive latency
• Increased NIC processing
• Difference mostly:
  • ack processing
  • protection check
  • data structures
  • code quality
• Virtualization cheap
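For reference, the usual LogP back-of-the-envelope for a small request-reply exchange (ignoring the gap g and assuming symmetric send/receive overheads) is

$$T_{\mathrm{round\ trip}} \approx 2\,(o_{\mathrm{send}} + L + o_{\mathrm{recv}}) = 2\,(L + 2o),$$

so the ack processing, protection checks, and data-structure costs listed above show up mainly as increases in the overhead term $o$.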

11. Bursty Communication among Many
[Figure: message-burst workload among multiple client and server processes spread across nodes]

12. Multiple VNs, Single-threaded Server

13. Multiple VNs, Multithreaded Server

14. Perspective on Virtual Networks
• Networking abstractions are vertical stacks
  • new function => new layer
  • poke through for performance
• Virtual Networks provide a horizontal abstraction
  • basis for building new, fast services
• Open questions
  • What is the communication "working set"?
  • What placement, replacement, ...?

15. Beyond the Personal Supercomputer
• Able to timeshare parallel programs
  • with fast, protected communication
• Mix with sequential and interactive jobs
• Use fast communication in OS subsystems
  • parallel file system, network virtual memory, ...
• Nodes have a powerful, local OS scheduler
• Problem: local schedulers do not know to run parallel jobs in parallel

16. Local Scheduling
• Schedulers act independently, without global control
• A program waits while trying to communicate with peers that are not running
• 10-100x slowdowns for fine-grain programs!
=> need coordinated scheduling

17. Explicit Coscheduling
• Global context switch according to a precomputed schedule
• How do you build it? Does it work?

18. Typical Cluster Subsystem Structures
[Figure: two structures. Master-slave: a master process coordinates local service (LS) components over a communication layer, with applications (A) on each node. Peer-to-peer: global service (GS) components on every node cooperate directly over the communication layer, above the local services and applications.]

19. Ideal Cluster Subsystem Structure
[Figure: a global service (GS) component paired with a local service (LS) component on each node, alongside the applications (A); coordination arises without explicit GS-to-GS interaction]
• Obtain coordination without explicit subsystem interaction, only the events in the program
  • very easy to build
  • potentially very robust to component failures
  • inherently "service on-demand"
  • scalable
• Local service component can evolve

20. Three approaches examined in NOW
[Figure: the three structures, from a master (M) over local services and applications, to peer global services (GS) on every node, to purely local services]
• GLUNIX: explicit master-slave (user level)
  • matrix algorithm to pick the parallel program (PP) to run
  • uses stops & signals to try to force the desired PP to run
• Explicit peer-to-peer scheduling assist with VNs
  • co-scheduling daemons decide on the PP and kick the Solaris scheduler
• Implicit
  • modify the parallel run-time library to allow it to get itself co-scheduled with the standard scheduler

21. Problems with explicit coscheduling
• Implementation complexity
• Need to identify parallel programs in advance
• Interacts poorly with interactive use and load imbalance
• Introduces new potential faults
• Scalability

22. Why implicit coscheduling might work
[Figure: timelines on workstations 1-4 running Jobs A and B; Job A issues a request, spins while the response is fast, and sleeps when the response is delayed]
• Active message request-reply model
• Infer non-local state from local observations; react to maintain coordination

  observation       | implication            | action
  fast response     | partner scheduled      | spin
  delayed response  | partner not scheduled  | block
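The table maps directly onto a two-phase wait. The sketch below is illustrative C under assumed primitives (am_poll, block_until_message, usec_now are hypothetical stand-ins), not the actual parallel run-time code.

```c
/* Two-phase spin-block wait at a communication event, following the table:
 * a fast reply implies the partner is scheduled, so keep spinning; a slow
 * reply implies it is not, so block and give the CPU back to the local scheduler. */
#include <stdint.h>

extern int      am_poll(void);              /* hypothetical: handle pending messages */
extern void     block_until_message(void);  /* hypothetical: sleep until a message arrives */
extern uint64_t usec_now(void);             /* hypothetical: current time in microseconds */

void wait_for_reply(volatile int *reply_arrived, uint64_t spin_limit_us) {
    uint64_t start = usec_now();
    while (!*reply_arrived) {
        am_poll();                               /* serve incoming requests while spinning */
        if (usec_now() - start > spin_limit_us)  /* partner is probably descheduled */
            block_until_message();               /* yield; wake up on the next message */
    }
}
```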

23. Obvious Questions
• Does it work?
• How long do you spin?
• What are the requirements on the local scheduler?

24. How Long to Spin?
• Answer: round-trip time + context switch + message processing
  • round trip to stay scheduled together
  • plus wake-up to get scheduled together
  • keep spinning if serving messages
    • in intervals of 3x the wake-up time
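A literal rendering of the rule on this slide; the function names are illustrative and the inputs are placeholders, not measured values from the talk.

```c
/* Baseline spin time = round-trip time + context-switch cost + message
 * processing cost; keep extending the spin in chunks of roughly 3x the
 * wake-up cost while messages are still being served. */
double baseline_spin_us(double round_trip_us, double ctx_switch_us, double msg_proc_us) {
    return round_trip_us + ctx_switch_us + msg_proc_us;
}

double extended_spin_us(double wakeup_us) {
    return 3.0 * wakeup_us;   /* per additional interval while serving messages */
}
```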

25. Does it work?

26. Synthetic Bulk-synchronous Apps
• Range of granularity and load imbalance
• spin-wait: 10x slowdown

27. With a mixture of reads
• block-immediate: 4x slowdown

28. Timesharing Split-C Programs

29. Many Questions
• What about
  • a mix of jobs?
  • sequential jobs?
  • unbalanced placement?
• Fairness?
• Scalability?
• How broadly can implicit coordination be applied in the design of cluster subsystems?
• Can resource management be completely decentralized?
  • Computational economies, ecologies

30. A Look at Serious File I/O
[Figure: traditional I/O system (one processor-memory node with disks) vs. NOW I/O system (many processor-memory nodes, each with local disks)]
• Traditional I/O system
• NOW I/O system
• Benchmark problem: sort a large number of 100-byte records with 10-byte keys
  • start on disk, end on disk
  • accessible as files (use the file system)
  • Datamation sort: 1 million records
  • Minute sort: how much can be sorted in a minute

31. NOW-Sort Algorithm
• Read
  • N/P records from disk -> memory
• Distribute
  • scatter keys to the processors holding the result buckets
  • gather keys from all processors
• Sort
  • partial radix sort on each bucket
• Write
  • write records to disk
  • (two-pass variant: gather data runs onto disk, then a local, external merge sort)
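To make the phase structure concrete, here is a single-node, single-pass sketch in C: the bucketing step stands in for the scatter/gather distribution (which in the real NOW-Sort crosses the network), and qsort stands in for the partial radix sort. File names and the bucket count are illustrative, not taken from the actual implementation.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define REC 100   /* record size in bytes (Datamation format) */
#define KEY 10    /* key size in bytes */
#define P   16    /* bucket count; stands in for the P processors */

static int cmp_rec(const void *a, const void *b) { return memcmp(a, b, KEY); }

int main(void) {
    /* Read phase: load this node's share of records (file name illustrative). */
    FILE *in = fopen("records.in", "rb");
    if (!in) { perror("records.in"); return 1; }
    fseek(in, 0, SEEK_END);
    long n = ftell(in) / REC;
    rewind(in);
    char *recs = malloc((size_t)n * REC);
    if (fread(recs, REC, (size_t)n, in) != (size_t)n) return 1;
    fclose(in);

    /* Distribute phase (local stand-in): bucket records by their top key byte.
     * In NOW-Sort this is the scatter/gather of keys across all processors. */
    char *bucket[P]; long cnt[P] = {0};
    for (int b = 0; b < P; b++) bucket[b] = malloc((size_t)n * REC);
    for (long i = 0; i < n; i++) {
        int b = (unsigned char)recs[i * REC] * P / 256;
        memcpy(bucket[b] + cnt[b] * REC, recs + i * REC, REC);
        cnt[b]++;
    }

    /* Sort + write phases: sort each bucket, then append buckets in key order. */
    FILE *out = fopen("records.out", "wb");
    for (int b = 0; b < P; b++) {
        qsort(bucket[b], (size_t)cnt[b], REC, cmp_rec);
        fwrite(bucket[b], REC, (size_t)cnt[b], out);
        free(bucket[b]);
    }
    fclose(out);
    free(recs);
    return 0;
}
```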

32. Key Implementation Techniques
• Performance isolation: highly tuned local disk-to-disk sort
  • manage local memory
  • manage disk striping
  • memory-mapped I/O with madvise, buffering
  • manage overlap with threads
• Efficient communication
  • completely hidden under disk I/O
  • competes for I/O bus bandwidth
• Self-tuning software
  • probe available memory, disk bandwidth, trade-offs
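As an example of the memory-mapped I/O technique mentioned above (not the NOW-Sort source itself), a POSIX mmap/madvise read path looks roughly like this:

```c
/* Map an input file and advise the kernel of the access pattern, so reads
 * stream from disk with aggressive readahead instead of double buffering. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc < 2) return 1;
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);
    char *data = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    /* The madvise hints let the VM system prefetch ahead of a sequential scan. */
    madvise(data, (size_t)st.st_size, MADV_SEQUENTIAL);
    madvise(data, (size_t)st.st_size, MADV_WILLNEED);

    /* ... scan `data` here, e.g. the bucketing pass of the sort ... */

    munmap(data, (size_t)st.st_size);
    close(fd);
    return 0;
}
```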

33. World-Record Disk-to-Disk Sort
• Sustain 500 MB/s of disk bandwidth and 1,000 MB/s of network bandwidth
  • but only in the wee hours of the morning

34. Towards a Cluster File System
• Remote disk system built on a virtual network
[Figure: client linked with RDlib talks to an RD (remote disk) server over active messages]
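A rough sketch of how a remote-disk read could be layered on request/reply active messages over a virtual network; the am_request/am_poll primitives and the message format are assumptions for illustration, not the actual RDlib interface.

```c
/* Client-side remote-disk read built on request/reply active messages.
 * The AM primitives below are hypothetical stand-ins. */
#include <stddef.h>

extern int am_request(int server_ep, int handler_id, const void *arg, size_t len);
extern int am_poll(void);

#define RD_READ_HANDLER 1

struct rd_read_req { long block; void *dst; volatile int *done; };

/* Ask the RD server for one block; the server's reply handler (not shown)
 * copies the data into `dst` and sets `*done`. */
void rd_read_block(int server_ep, long block, void *dst) {
    volatile int done = 0;
    struct rd_read_req req = { block, dst, &done };
    am_request(server_ep, RD_READ_HANDLER, &req, sizeof req);
    while (!done)
        am_poll();        /* spin (or spin-block, as in the scheduling part) */
}
```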

35. Streaming Transfer Experiment

36. Results
• Data distribution affects resource utilization
  • not delivered bandwidth

37. I/O Bus Crossings

38. Opportunity: PDISK
• Producers dump data into the I/O river
• Consumers pull it out
• Hash data records across disks
• Match producers to consumers
• Integrated with work scheduling
[Figure: producer and consumer processes connected by fast communication (remote queues) on one side and fast I/O (streaming disk queues) on the other]
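The record-hashing step can be pictured as follows; the disk-queue API is hypothetical and the hash is just a simple, well-known choice.

```c
/* Hash each record's key onto one of the streaming disk queues, so producers
 * spread data evenly across disks and consumers can locate records by key. */
#include <stddef.h>
#include <stdint.h>

#define NUM_DISK_QUEUES 8     /* illustrative */

extern void disk_queue_put(int queue, const void *rec, size_t len);  /* hypothetical */

static uint32_t hash_key(const unsigned char *key, size_t len) {
    uint32_t h = 2166136261u;            /* FNV-1a */
    for (size_t i = 0; i < len; i++) { h ^= key[i]; h *= 16777619u; }
    return h;
}

void produce_record(const unsigned char *rec, size_t rec_len, size_t key_len) {
    int q = (int)(hash_key(rec, key_len) % NUM_DISK_QUEUES);
    disk_queue_put(q, rec, rec_len);     /* data flows into the "I/O river" */
}
```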

39. What will be the building block?
[Figure: an SMP node with processors, memory, and network cards on a memory interconnect; several SMPs joined by a network cloud. Design space: NOWs vs. clusters of SMPs, number of nodes vs. processors per node.]

40. Multi-Protocol Communication
• A uniform programming model is key
• Multiprotocol messaging
  • careful layout of message queues
  • concurrent objects
  • polling the network hurts memory performance
• Shared virtual memory
  • relies on underlying messages
• Pooling vs. contention
[Figure: send/write and receive/read operations pass through a communication layer that uses shared memory within a node and the network between nodes]
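The protocol choice boils down to a dispatch on the destination. The sketch below is illustrative, with hypothetical shm_/net_ primitives standing in for the two transports.

```c
/* Multi-protocol send: use a shared-memory queue for a destination process
 * on the same SMP, and the network (e.g. an AM over the NIC) otherwise. */
#include <stddef.h>

extern int my_node(void);
extern int node_of(int dest_proc);
extern int shm_queue_send(int dest_proc, const void *msg, size_t len); /* hypothetical */
extern int net_send(int dest_proc, const void *msg, size_t len);       /* hypothetical */

int mp_send(int dest_proc, const void *msg, size_t len) {
    if (node_of(dest_proc) == my_node())
        return shm_queue_send(dest_proc, msg, len);  /* within the SMP */
    return net_send(dest_proc, msg, len);            /* across the network cloud */
}
```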

41. LogP analysis of shared-memory AM

42. Virtual Interface Architecture
[Figure: an application (Sockets, MPI, legacy codes) sits on the VI User Agent ("libvia"). Slow-path operations (Open, Connect, Map Memory) go through the VIA kernel driver; fast-path operations (descriptor read/write, doorbells) go straight from user level to the VI-capable NIC. Each VI has send (S) and receive (R) queues, plus completion (C) queues for completed requests.]
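A schematic of the fast path in code form: build a descriptor in a registered queue, then ring the doorbell. The struct layout and doorbell write are illustrative of the model, not the actual VIA/VIPL definitions.

```c
/* Illustrative VIA-style send: the application writes a descriptor into a
 * pre-registered descriptor queue, then writes a memory-mapped doorbell;
 * the NIC DMAs the descriptor and data buffer with no kernel involvement. */
#include <stdint.h>

struct send_desc {                 /* hypothetical descriptor layout */
    uint64_t buf_addr;             /* address of a pinned data buffer */
    uint32_t buf_len;
    uint32_t mem_handle;           /* from memory registration (slow path) */
    uint32_t flags;
};

void post_send(struct send_desc *queue, unsigned *tail,
               volatile uint32_t *doorbell,       /* memory-mapped NIC page */
               uint64_t buf, uint32_t len, uint32_t handle) {
    unsigned slot = (*tail)++;
    queue[slot].buf_addr   = buf;
    queue[slot].buf_len    = len;
    queue[slot].mem_handle = handle;
    queue[slot].flags      = 0;
    *doorbell = slot;              /* ring the doorbell: tell the NIC to go */
}
```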

43. VIA Implementation Overview
[Figure: to issue a request the host (1) writes a memory-mapped doorbell page; the NIC then (2, 3) DMA-reads the descriptor from the descriptor queue, (4, 5) DMA-reads the transmit data buffers and performs the block transfer, and (7) DMA-writes received data and completions into kernel memory mapped to the application.]

44. Current VIA Performance

45. VIA ahead
• You will be able to buy decent clusters
• Virtualization in host memory is easy
  • will it go beyond pinned regions?
  • still need to manage active endpoints (doorbells)
• Complex descriptor queues will hinder low-latency short messages
  • NICs will chew on them, but many instructions on the host
• Need to re-examine where error handling, flow control, and retry are performed
• Interactions with scheduling, I/O, locking, etc. will dominate application speed-up
  • will demand new development methodologies

46. Conclusions
• A complete system on every node makes clusters a very powerful architecture
  • can finally get serious about I/O
• Extend the system globally
  • virtual memory systems,
  • schedulers,
  • file systems, ...
• Efficient communication enables new solutions to classic systems challenges
• Opens a rich set of issues for parallel processing beyond the personal supercomputer or LAN
  • where SPAA and PODC meet
