Clusters, Technology, and “All that Stuff”


  1. Clusters, Technology, and “All that Stuff” Philip Papadopoulos San Diego Supercomputer Center 31 August 2000

  2. Motivation for COTS Clusters Gigabit Networks! - Myrinet, SCI, FC-AL, Giganet, GigE, ATM, Servernet, and (Infiniband, soon) • Killer micros: low-cost Gigaflop processors here for a few kilo-$$ per processor • Killer networks: gigabit network hardware, high-performance software (e.g. Fast Messages), soon at 100’s of $$ per connection • Leverage commodity HW and SW (Linux, NT), build key technologies • Technology dislocation coming very soon!

  3. High Performance Communication (figure: switched multi-gigabit networks with user-level access vs. switched 100 Mbit networks with OS-mediated access) • Level of network interface support + NIC/network router latency • Overhead and latency of communication → deliverable bandwidth • High-performance communication enables Programmability! • Low-latency, low-overhead, high-bandwidth cluster communication • … much more is needed … • Usability issues, I/O, Reliability, Availability • Remote process debugging/monitoring • Techniques for scalable cluster management
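
The point that per-message overhead and latency, not just link rate, set the deliverable bandwidth can be seen with a simple cost model. The sketch below is illustrative only: the software-overhead figure is an assumption, while the latency and peak-bandwidth numbers are the Myrinet values quoted later in the deck (slide 14).

/* Illustrative cost model: delivered bandwidth vs. message size.
 * delivered = size / (overhead + latency + size / peak_bandwidth)  */
#include <stdio.h>

int main(void) {
    double overhead_us = 5.0;   /* assumed per-message software overhead */
    double latency_us  = 18.0;  /* Myrinet latency (slide 14) */
    double peak_MBps   = 145.0; /* Myrinet bandwidth (slide 14) */

    for (double bytes = 64; bytes <= 1 << 20; bytes *= 4) {
        double wire_us  = bytes / peak_MBps;    /* 1 MB/s == 1 byte/us */
        double total_us = overhead_us + latency_us + wire_us;
        printf("%8.0f B -> %6.1f MB/s delivered\n", bytes, bytes / total_us);
    }
    return 0;
}

Small messages see only a fraction of the link rate, which is exactly why user-level access (cutting the OS out of the overhead term) matters.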

  4. SDSC’s First HPC Cluster (figure: front and back views showing Myrinet, Fast Ethernet, and power cabling) • 16 Compute Nodes (32 Proc) (25 Gflops, 20 Gbit/s bisection BW) • 2 Front-end servers • 11 GB total memory • 216 GB total disk (72 GB on Front ends)

  5. FY00 Cluster Hardware (August Deployment) • SDSC Cluster: 90 IA-32 2-way Nodes • Keck 1 Cluster: 16 IA-32 2-way Nodes • Keck 2 Cluster: 38 IA-32 2-way Nodes • System Stakeholders: NBCR, SDSC, Keck, Onuchic • Vendors: • IBM and Compaq nodes @ each site • Compaq Servernet II interconnect at Keck sites • Myrinet 2000 (Myricom, Inc.) interconnect at SDSC

  6. FY01 SDSC Cluster Hardware Plans • Additional 104 nodes via loaners and donations expected • Additional stakeholders possible (e.g., SIO) • Goal: Get to 256 nodes

  7. Key Vendors • IBM • Best node pricing • Donations • Strong relationship • Compaq • Best packaging • Best for Keck centers • Equipment loans • Offers evaluation of another interconnect • Myricom • Myrinet 2000 → >200 MB/sec inter-node BW

  8. Application Collaborations at UCSD • Keck 1 (Ellisman, 3D Electron Microscope Tomography) • Keck 2 (McCammon, Onuchic, Computational Chemistry) • SIO (Stammer, Global Ocean Modeling) • NBCR (Baldridge, National Biomedical Computational Resource) • Goals: • Assistance, consulting, software packaging, troubleshooting, federation of clusters on campus with high-speed network (our own version of chaos)

  9. Keck Sites • Working with two key sites to work out kinks in transferring management/infrastructure technology to application groups • Start of a Campus Grid for Computing • Microscopy Lab (Ellisman) • 32-way (64 Processor) Cluster by Jan ‘01 • Servernet II interconnect • 3D Tomography • Computational Chemistry Lab • 64-way (128 Processor) Cluster by Jan ‘01 • Servernet II Interconnect

  10. HPC Machine History • The 1980’s – Decade of Vector Supers • Many New Vendors • The 1990’s – Decade of MPPs • Many Vendors Lost (Dead Supercomputer Society) • The 2000’s – Decade of Clusters • End-users as “vendors”? • The 2010’s – The Grid • Harnessing Chaos (or perhaps just Chaos)

  11. HPC Reality: Losing Software Capability • As HPC has gone through technological shifts, we have lost capability in software • HPC has ever-decreasing influence on computing trends • AOL, Amazon, Exodus, …, WalMart, Yahoo, ZDNet are where the $$ are • Challenge is to leverage commodity trends with a community-maintained HPC software base (“The Open Source Buzz”)

  12. Technological Shifts Coming Now (or Soon) • Memory bandwidth of COTS systems • 4 – 8X increase this year (RIMM, double-clocked SDRAM) • Increased I/O performance • 4X improvement today (64-bit/66 MHz PCI) • 10X (PCI-X) within 12 months • Increased network performance / decrease in $$ • 1X Infiniband (2.5 Gbit/s) – hardware convergence • Intel designing motherboards with multiple I/O busses and on-board Infiniband • 64-bit integer performance everywhere.

  13. Taking Stock • Clusters are Proven Computational Engines (Many existence proofs) • Upcoming technology dislocation makes them very attractive at multiple scales • Today’s Vendors care about HPC issues, but the economic realities make it harder and harder to fully support our unique software stack. • Can they be turned into general-purpose, highly-scalable, maintainable, production-quality machines? • YES! (But there is work to do)

  14. Cluster Interconnect Today • Myrinet (Myricom) • 1.28 Gb/s, full duplex • 18 us latency • 145 MB/s bandwidth • $1500 / port • Servernet II (Compaq)

  15. Cluster Interconnect Tomorrow • Myrinet (Myricom) • 2.0 Gb/s, full duplex • 9 us latency • 250 MB/s bandwidth • $ ??? / port • Available: today • Infiniband

  16. Cluster Compute Node Today

  17. Cluster Compute Node Tomorrow (figure: 1.6 GHz CPU; 64-bit @ 400 MHz = 3.2 GB/s; 2 channels 16-bit @ 800 MHz = 3.2 GB/s; PCI-X 64-bit @ 133 MHz = 1.06 GB/s) • In the next 9 months, every speed and feed gets at least a 2x bump!

  18. Commodity CPU – Pentium 3 • 0.8 Gflops (Peak) • 1 Flop / cycle @ 800 MHz • 25.6 GB/s L2 cache feed • 800 MHz * 256-bit • 1.06 GB/s Memory-I/O bus • 133 MHz * 64-bit
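
For reference, the figures above are just width-times-clock arithmetic (a quick check, not part of the original slide):

\[
\begin{aligned}
\text{peak} &= 1~\text{flop/cycle} \times 800~\text{MHz} = 0.8~\text{Gflops}\\
\text{L2 feed} &= 256~\text{bit} \times 800~\text{MHz} = 32~\text{B} \times 800~\text{MHz} = 25.6~\text{GB/s}\\
\text{memory bus} &= 64~\text{bit} \times 133~\text{MHz} = 8~\text{B} \times 133~\text{MHz} \approx 1.06~\text{GB/s}
\end{aligned}
\]

The same arithmetic applies to the Pentium 4, Power3, and Power4 numbers on the following slides.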

  19. Commodity CPU – Pentium 4 (block diagram: 44 GB/s L2 cache and control, 256-bit @ 1.4 GHz; 3.2 GB/s system interface, 64-bit @ 400 MHz; trace cache, integer ALUs, and FP/MMX/SSE execution units)

  20. Commodity CPU – Pentium 4 • 2.8 Gflops • 2 Flops / cycle @ 1.4 GHz • 128-bit vector registers (Streaming SIMD Extensions) • Can apply operations on 2 64-bit floating point values per clock • 44 GB/s L2 cache feed • 3.2 GB/s Memory-I/O bus
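
As a concrete illustration of the 128-bit vector registers, here is a minimal sketch using SSE2 compiler intrinsics. The intrinsics API is an assumption on my part (the slide describes only the hardware capability); compile on a Pentium 4-class machine with SSE2 enabled.

/* One 128-bit SSE2 operation works on two 64-bit doubles at once. */
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdio.h>

int main(void) {
    double a[2] = {1.0, 2.0}, b[2] = {3.0, 4.0}, c[2];
    __m128d va = _mm_loadu_pd(a);     /* load two doubles into one register */
    __m128d vb = _mm_loadu_pd(b);
    __m128d vc = _mm_add_pd(va, vb);  /* two additions in one instruction */
    _mm_storeu_pd(c, vc);
    printf("%g %g\n", c[0], c[1]);    /* prints 4 6 */
    return 0;
}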

  21. Commodity Frequency Trend

  22. Looking Forward • Historical CAGRs • Pentium Core: 41% CAGR • P6 Core (Pentium Pro, Pentium II, Pentium III): 49% CAGR • 1.9 GHz clock in 2H01 • 3.8 Gflops • L2 cache feed of 60 GB/s!
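
A rough sanity check of the projection, using only figures that appear elsewhere in the deck: compounding the 1.4 GHz Pentium 4 clock at the historical 41–49% rate for one year lands at or slightly above the quoted 1.9 GHz, and the derived numbers follow the usual width-times-clock arithmetic:

\[
\begin{aligned}
f_{\text{2H01}} &\approx 1.4~\text{GHz} \times (1.41\text{–}1.49) \approx 2.0\text{–}2.1~\text{GHz} \quad (\text{vs. the quoted } 1.9~\text{GHz})\\
\text{peak} &= 2~\text{flops/cycle} \times 1.9~\text{GHz} = 3.8~\text{Gflops}\\
\text{L2 feed} &= 32~\text{B} \times 1.9~\text{GHz} \approx 60~\text{GB/s}
\end{aligned}
\]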

  23. Power3 • Current CPU used in Blue Horizon (222 MHz) • 4 Flops/cycle Peak • Fused Multiply-Add • 888 MFlops for BH • L2 Cache Feed: • 7.1 GB/s • 256-bit @ 222MHz • Memory-I/O Bus Feed: • 1.8 GB/s • 128-bit @ 111MHz

  24. Power3 @ 375 MHz

  25. Power3 @ 375 MHz • 1.5 GFlops (Peak) • 4 Flops / cycle @ 375MHz • 8 GB/s L2 Cache Feed: • 256-bit @ 250MHz • Memory-I/O Bus Feed: • 1.5 GB/s • 128-bit @ 93.75MHz

  26. Power4 • Chip Multiprocessor (CMP) • 4.0 GFlop / CPU (Peak) • 50 GB/s / CPU L2 cache feed • 2.5 GB/s / CPU memory bus feed • Numbers in the figure are aggregate • 10 GB/s / CPU in 8-way configuration • 5 GB/s / CPU I/O feed • Available 2H01

  27. CPU Summary (table; * GB/s per CPU)

  28. Commodity Benefits • Ride the commodity performance curve

  29. Commodity Benefits • Ride the commodity RDRAM price curve

  30. MPICH – GM Performance

  31. MPICH – GM Performance

  32. Some Small NPB Results (chart with regions labeled “BH Faster” and “Cluster Faster”)

  33. Some Realities (and Advantages)? • More Heterogeneity • Node performance • Node architecture • System can be designed with different resources in different partitions • Large Memory • Large Disk • Bandwidth • Staged acquisitions can take advantage of commodity trends

  34. Some Deep Dark Secrets • Clusters are phenomenal price/performance computational engines … • Can be hard to manage without experience • High-performance I/O is still unsolved • The effort to find where something has failed grows at least linearly with cluster size • Not cost-effective if every cluster “burns” a person just for care and feeding • We’re working to change that…

  35. It’s all in the software … … and the management

  36. Cluster projects have focused on high-performance messaging • BIP (Basic Interface for Parallelism) [Linux] • M-VIA – Berkeley Lab modular VIA project • Active Messages – Berkeley NOW/Millennium • GM – from Myricom • General purpose (what we use on our Linux cluster) • Real World Computing Partnership – Japanese consortium • U-Net – Cornell • High performance over ATM and Fast Ethernet • HPVM – Fast Messages and NT

  37. We’re starting with infrastructure • Concentrating on management, deployment, scaling, automation, security • Complete cluster install of 16 nodes in under 10 minutes • Reinstallation (unattended) in under 7 minutes • Easy to maintain software consistency • If there's a question about what's on a machine, don't think about it, reinstall it! • Easy to upgrade • Add new packages to the system configuration (stored on a front-end machine) then reinstall all the machines • Automatic scheduling of upgrades • can schedule upgrades through a batch system or periodic script

  38. Working with the MPI-GM Model: Usher/Patron • Sender transmits to a hostname & port number • Receiver de-multiplexes by port number • Port numbers assigned from a single configuration file • Consequences • Port numbers must be agreed upon a priori • “Guaranteed” collision of port numbers when multiple jobs are run • Usher/Patron (developed by Katz) removes the need for a centralized database for port assignment • Uses an RPC-based reservation/claim system on each node to dynamically assign port numbers to applications • Timeouts allow for recovery of allocated ports
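
A minimal in-process sketch of the reserve/claim idea follows. This is an illustration under stated assumptions, not the actual Usher/Patron code: the real system runs this logic over RPC with an usher on each node, and the port count and timeout below are invented.

/* Sketch of a per-node GM port table with reserve -> claim -> timeout logic. */
#include <stdio.h>
#include <time.h>

#define NPORTS 8                     /* invented number of GM ports per node */
#define RESERVE_TIMEOUT 30           /* seconds before an unclaimed port frees */

enum port_state { FREE, RESERVED, CLAIMED };
static enum port_state ports[NPORTS];
static time_t reserved_at[NPORTS];

int reserve_port(void) {             /* hand out the first available port */
    time_t now = time(NULL);
    for (int p = 0; p < NPORTS; p++) {
        if (ports[p] == RESERVED && now - reserved_at[p] > RESERVE_TIMEOUT)
            ports[p] = FREE;         /* timeout: recover an unclaimed reservation */
        if (ports[p] == FREE) {
            ports[p] = RESERVED;
            reserved_at[p] = now;
            return p;
        }
    }
    return -1;                       /* no ports available */
}

int claim_port(int p) {              /* application confirms it will use the port */
    if (p < 0 || p >= NPORTS || ports[p] != RESERVED) return -1;
    ports[p] = CLAIMED;
    return 0;
}

int main(void) {
    int p = reserve_port();
    if (p >= 0 && claim_port(p) == 0)
        printf("application bound to GM port %d\n", p);
    return 0;
}

Because ports are handed out dynamically on each node, two jobs sharing a node can no longer collide on a port chosen a priori from a static configuration file.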

  39. Job launch • MPI-Launch • Handles reserve/claim/vacate system • Starts jobs with SSH • Runs the first node in the foreground • Interactive node (MPI Rank 0) • Runs subsequent nodes in the background • Non-interactive nodes • Multiple cluster-wide jobs now work • A secure and scalable replacement for mpirun
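
The launch behavior described above can be pictured with a short sketch. This is an illustration only, not the actual mpi-launch source; it assumes the application binary already exists on every node and ignores the reserve/claim handshake.

/* Start one process per node over ssh; keep rank 0 in the foreground. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main(int argc, char **argv) {
    if (argc < 3) {
        fprintf(stderr, "usage: %s app node0 node1 ...\n", argv[0]);
        return 1;
    }
    const char *app = argv[1];
    pid_t rank0 = -1;

    for (int rank = 0; rank < argc - 2; rank++) {
        pid_t pid = fork();
        if (pid == 0) {                        /* child: run the app remotely */
            execlp("ssh", "ssh", argv[rank + 2], app, (char *)NULL);
            perror("execlp");
            _exit(127);
        }
        if (rank == 0) rank0 = pid;            /* rank 0 stays interactive */
    }

    int status = 0;
    waitpid(rank0, &status, 0);                /* wait only on the foreground rank */
    return WIFEXITED(status) ? WEXITSTATUS(status) : 1;
}

Ranks 1..N-1 run in the background; only the rank 0 process is tied to the user’s terminal, which matches the interactive/non-interactive split above.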

  40. Other software … • Portland Group F77/F90 compiler • Portable Batch System • Trivial configuration at the moment • Rudimentary up/down health monitoring • GM hardware/software on each node for the high-performance network • MPI (GM-aware version of MPICH 1.2) • Globus 1.1.3 – integration with the batch system not done yet • Public key certificate client (and server) (Link, Schroeder) • Standard Red Hat Linux 6.2 on each node

  41. Conclusions • Clustering is actively being pursued at SDSC • Becoming a gathering point of technology for NPACI • Aggressively attacking issues for ultra-scale • Actively transferring technology to make clusters more available to application groups • Working to build a Campus Grid of Clusters • Want to harness expertise at SDSC to • Define the production system environment • Build/port/deploy the needed infrastructure components • Web site coming … but not ready yet • New mailing list: clusters@sdsc.edu
