
Can Commodity Linux Clusters Scale to Petaflops?


Presentation Transcript


  1. Can Commodity Linux Clusters Scale to Petaflops? P. Beckman

  2. Refining The Question
  • A petaflop is 1E+15 floating point operations per second as reported by the Top500 (not theoretical peak)
  • What is commodity?
    • Beowulf Classic: the Computer Shopper catalog. Not great, but good intent.
    • How many suppliers thrive?
    • How bad would it be if you could no longer buy a part or get the support you want easily?
    • What can I fix or modify myself?
  • HW: usually x86 servers; SW: Linux

  3. Refining The Question (part 2)
  • What is a Cluster?
    • A “box” can be sold separately, and usually is
    • A complete OS; it can run standalone
    • Capacity is expanded by adding another box
    • Is the SP2 a cluster? Is the ES a cluster?
  • A Petaflop-scale Commodity Cluster:
    • A large collection of interconnected boxes running Linux and achieving a petaflop on the Top500

  4. Is It Possible?
  • Maybe not “if”, but “when”
  • Could we do it now? If not, how soon?

  5. Home-grown Clusters
  • June 1997: the Berkeley Sun (Solaris) cluster is the first cluster to make the Top500
  • June 1998: the first Linux cluster debuts on the Top500
  • By June 2002, 10% of all machines on the Top500 are Linux clusters!
  • However, Linux represents only 7.5% of aggregate performance on the Top500
  • Two possible conclusions…

  6. The Expansion of Linux in the Top500
  [chart; Berkeley NOW (Solaris) labeled]

  7. How Efficient Are Linux Clusters?
  [chart; labeled points include Cplant and Aramco (2048 CPUs, Ethernet)]

  8. Ouch! (scaling could be a problem)
  [chart; Earth Simulator labeled]

  9. Delivered Performance/CPU

  10. How Far Behind is Commodity?
  [chart; ES at 7 GF/CPU labeled]

  11. Are Linux Clusters Keeping Up? Comparing Apples to Oranges (really, money spent)

  12. Compared To What? Excelling at mediocrity

  13. Observations?
  • Clusters are the most popular HPC platform
  • Linux is expanding fastest for HPC, but mostly in the mid and lower tiers of the Top500
  • Why?
  • CPU efficiency is mediocre
    • A PIII cluster with 1K nodes is only 67% efficient
  • CPU Cost Effectiveness?
    • NCSA Itanium: 2.11 GF/CPU
    • P4 Xeon: 2.05 GF/CPU
    • AMD: 2.0 GF/CPU (3 years slower than ES? see the sketch below)
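
  A minimal back-of-envelope sketch (in Python, not from the talk) of where a remark like “3 years slower than ES” comes from: compare the ~2 GF/CPU delivered by these commodity CPUs with the Earth Simulator's ~7 GF/CPU, and assume per-CPU delivered Linpack doubles roughly every 18 months (the doubling period is my assumption, not stated on the slide).

```python
import math

# Delivered Linpack per CPU, from the slides.
commodity_gf_per_cpu = 2.0   # Itanium / P4 Xeon / AMD class nodes (this slide)
es_gf_per_cpu        = 7.0   # Earth Simulator (slide 10)

# Assumption: per-CPU delivered performance doubles roughly every 18 months.
doubling_months = 18

years_behind = math.log2(es_gf_per_cpu / commodity_gf_per_cpu) * doubling_months / 12
print(f"commodity is ~{years_behind:.1f} years behind the ES per CPU")   # ~2.7 years
```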

  14. Petaflops Now?
  • Assume: 2 GF/CPU Linpack, no loss for scaling (obviously wrong)
  • A Petaflop Linux cluster would then require about 500,000 processors (arithmetic sketched below)
  • Al Geist: “The next generation of Peta-scale computers are being designed with 50,000 to 100,000 processors”
  • Ignoring power, wiring plans, and interconnection networks, how big is it? (disagreeing with Thomas)
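
  The node count is a one-line calculation; a sketch using only the figures stated above:

```python
PETAFLOP      = 1e15   # flop/s, as counted by the Top500 (Rmax, not theoretical peak)
FLOPS_PER_CPU = 2e9    # 2 GF/CPU delivered Linpack, no scaling loss (slide assumption)

print(f"{PETAFLOP / FLOPS_PER_CPU:,.0f} CPUs")   # 500,000 CPUs
```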

  15. Commodity is shrinking
  [photo: VIA C3 board, 17 cm (6.7 in); 800 MHz CPU, 100/133 MHz FSB, 1 GB RAM, PCI card]
  • Special “blades” not required
  • New form factors can achieve approx. 528 nodes per rack (sans management… arg!!!)
  • Each rack needs ~12 ft² of floor space (space to move the rack)
  • 500K CPUs require about 11.3K ft² (arithmetic sketched below)
    • Not the Nimitz, but simply one former dot-com office space in the Bay Area
  • Cost? At $3K/node, that is about $1.5B (one big black plane)
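
  A sketch of the rack, floor-space, and cost arithmetic, using only the per-rack and per-node figures from this slide:

```python
import math

cpus           = 500_000   # one CPU per node
nodes_per_rack = 528
sqft_per_rack  = 12        # includes space to move the rack
cost_per_node  = 3_000     # dollars

racks = math.ceil(cpus / nodes_per_rack)
print(f"racks:       {racks}")                                    # 947
print(f"floor space: {racks * sqft_per_rack:,} ft^2")             # ~11.4K ft^2
print(f"cost:        ${cpus * cost_per_node / 1e9:.1f} billion")  # $1.5 billion
```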

  16. But Would It Work? NO
  • Buying the hardware is not the bottleneck; the system software won’t scale
  • Software is so hard and costly that spending lots and lots more money won’t show immediate effects

  17. Silly examples (part 1/2)
  • mpicc myprog.c; mpirun a.out
  • For a 10 MB executable with 100BT, it would take 8.3 days to start your job (estimate sketched below)
    • Or: 6.8 million dollars of machine lifetime (5 years)
  • With 2 Gbit Myrinet, it would take about 7 hrs
    • Or: about $240K of machine lifetime (5 years)
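
  A hedged sketch of that kind of launch-time estimate: assume a single head node pushes the 10 MB binary to each of 500,000 nodes in turn at the raw line rate. The slide's 8.3-day and 7-hour figures presumably assume lower effective throughput, so this simplistic model lands in the same ballpark (days vs. hours) rather than reproducing them exactly. The $1.5B machine cost and 5-year lifetime come from the earlier slides.

```python
exe_bytes    = 10 * 2**20            # 10 MB executable
nodes        = 500_000
machine_cost = 1.5e9                 # dollars (slide 15)
lifetime_s   = 5 * 365 * 24 * 3600   # 5-year machine lifetime, in seconds

for name, bits_per_s in [("100BT", 100e6), ("2 Gbit Myrinet", 2e9)]:
    seconds = nodes * exe_bytes * 8 / bits_per_s   # serial push at raw line rate
    wasted  = machine_cost * seconds / lifetime_s  # machine lifetime burned while waiting
    print(f"{name:15s} {seconds / 3600:7.1f} h   (~${wasted / 1e6:.1f}M of machine lifetime)")
```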

  18. Silly examples (part 2/2)
  • mpicc hello_world.c; mpirun a.out
  • Probably 1 million socket connections would be required; Linux scales to a couple thousand
  • A recv() where the “master” node collects 1000 floating point numbers from each node would require 3.8 GB of RAM (sketched below)
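
  The memory figure is straightforward arithmetic; a sketch assuming 8-byte (double precision) values, which gives about 4 GB (3.7 GiB), i.e. the slide's roughly 3.8 Gig:

```python
nodes           = 500_000
values_per_node = 1_000
bytes_per_value = 8        # assuming double-precision floats

total = nodes * values_per_node * bytes_per_value
print(f"{total / 1e9:.1f} GB ({total / 2**30:.2f} GiB) of receive buffers")  # 4.0 GB (3.73 GiB)
```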

  19. Cluster Sizing Rule of Thumb
  • System software (Linux, MPI, filesystems, etc.) scales from 64 nodes to at most 2048 nodes for most HPC applications
    • Max socket connections
    • Direct-access message tag lists & buffers
    • NFS / storage system clients
    • Debugging
    • Etc.
  • It is probably hard to rewrite MPI and all Linux system software for O(100,000)-node clusters

  20. Eliminating “Flat” Could Help, But It Must Be Nearly Transparent
  • A lesson from the IP address space crisis about 5 years ago:
    • Nearly every workstation at every institution had a real, global IP address
    • IP space became difficult to find, and there were fears of running out
    • Router tables were growing too big
  • NAT (Network Address Translation, i.e. IP masquerading) came to the rescue
    • Large institutions now have a few global IP addresses instead of class B networks

  21. Maybe A Similar Technique Could Apply To Large Clusters & The Grid
  • Currently, firewalls and NATing make Grid computing nearly impossible
  • To scale Grids and clusters to 100K nodes, we may want something like a Grid/Cluster NAT
  Hypothesis: Nearly transparent Grid/MPI NAT translation may let system software scale to 100K nodes…
  Flat is bad

  22. Waiting For Commodity Linux Cluster Petaflops
  • If system software is only likely to work at about 1000 nodes, then to reach a petaflop each node must deliver a teraflop (Linpack)
  • Going from today’s 2 GF nodes to 1 TF will probably take 12 to 15 years (projection sketched below)
  • Large SMPs could shave some time
  • Conclusion:
    • Petaflop Linux cluster: 10-15 years “naturally”
    • Solve the scalable system software issues, and we can reduce that time by buying more nodes (32K nodes in 6-7 years)
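
  A hedged sketch of the timeline behind these estimates, assuming per-node delivered Linpack doubles roughly every 18 months (the doubling period is my assumption; the slide's 12-15 year and 6-7 year ranges presumably bracket faster and slower growth):

```python
import math

def years_to_reach(target_gf, start_gf=2.0, doubling_months=18):
    """Years until per-node Linpack grows from start_gf to target_gf."""
    return math.log2(target_gf / start_gf) * doubling_months / 12

# ~1,000-node cluster: each node must deliver 1 TF (1,000 GF)
print(f"{years_to_reach(1_000):.1f} years")          # ~13.4 years

# 32K-node cluster: each node needs only ~31 GF
print(f"{years_to_reach(1e6 / 32_000):.1f} years")   # ~5.9 years
```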
