
Current and Future for NT Clustering with HPVM


Presentation Transcript


1. Current and Future for NT Clustering with HPVM
Philip M. Papadopoulos
Department of Computer Science and Engineering, University of California, San Diego
JPC4 - Oak Ridge, TN

2. Outline
• NT Clustering - our clusters, software
• What's new in the latest version of HPVM (1.9)
• Looking at performance
  • Gratuitous bandwidth and latency
  • Iowa State results (Luecke, Raffin, Coyle)
• Futures for HPVM
  • Natural upgrade paths (Windows 2000, Lanai 7, …)
  • Adding dynamics

3. Why NT?
• Technical reasons
  • Good support for SMP systems
  • The system is designed to be threaded at all levels
  • User-scheduled, ultra-lightweight threads (NT Fibers) are very powerful (see the sketch below)
  • Integrated, extensible performance-monitoring system
  • Well-supported device-driver development environment
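The NT Fibers mentioned above are ordinary Win32 primitives (ConvertThreadToFiber, CreateFiber, SwitchToFiber). The sketch below is only a minimal, self-contained illustration of user-level scheduling with fibers; it is not taken from HPVM and says nothing about how HPVM actually uses them.

```c
#include <windows.h>
#include <stdio.h>

static LPVOID g_mainFiber;   /* fiber context of the original thread */

/* A fiber routine: runs until it voluntarily switches away. */
VOID CALLBACK WorkerFiber(LPVOID param)
{
    (void)param;
    for (int i = 0; i < 3; i++) {
        printf("worker fiber: step %d\n", i);
        SwitchToFiber(g_mainFiber);   /* yield back; no kernel scheduling involved */
    }
    SwitchToFiber(g_mainFiber);       /* a fiber routine must never return */
}

int main(void)
{
    /* The creating thread must itself become a fiber before it can switch. */
    g_mainFiber = ConvertThreadToFiber(NULL);
    LPVOID worker = CreateFiber(0, WorkerFiber, NULL);

    for (int i = 0; i < 3; i++) {
        SwitchToFiber(worker);        /* user-level context switch into the worker */
        printf("main fiber: back in control\n");
    }
    DeleteFiber(worker);
    return 0;
}
```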

4. Remote Access (NT is Challenged)
• Myth: you can't do things remotely in NT
• Fact: you can, it just doesn't have a unified remote abstraction like rsh/ssh (think client/server)
  • Remote manipulation of the registry (regini.exe)
  • Remote administrative access to the file system
  • Ability to create remote threads (CreateRemoteThread)
  • Ability to start/stop services (sc.exe; see the service-control sketch below)
• Too many interfaces! One must essentially learn new tools to perform (scripted) remote admin.
• NT Terminal Server and Win2K improve access, but still fall short of X-Windows.
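As a concrete example of the kind of per-task interface the slide complains about, the sketch below starts a service on a remote node through the Win32 service control API, the same capability sc.exe exposes from the command line. The machine name \\node01 and the service name SomeService are placeholders; this is only an illustration of the API shape, not anything HPVM ships.

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Connect to the service control manager on a remote machine. */
    SC_HANDLE scm = OpenSCManager("\\\\node01", NULL, SC_MANAGER_CONNECT);
    if (scm == NULL) {
        fprintf(stderr, "OpenSCManager failed: %lu\n", GetLastError());
        return 1;
    }

    /* Open the (placeholder) service with permission to start it. */
    SC_HANDLE svc = OpenService(scm, "SomeService", SERVICE_START);
    if (svc == NULL) {
        fprintf(stderr, "OpenService failed: %lu\n", GetLastError());
        CloseServiceHandle(scm);
        return 1;
    }

    if (!StartService(svc, 0, NULL))
        fprintf(stderr, "StartService failed: %lu\n", GetLastError());
    else
        printf("service started on remote node\n");

    CloseServiceHandle(svc);
    CloseServiceHandle(scm);
    return 0;
}
```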

5. Hardware/Software Environment
• Our clusters: 64 dual-processor Pentium IIs
  • 32 HP Kayak: 300 MHz, 384 MB RAM, 100 GB disk
  • 32 HP LPr NetServer: 450 MHz, 1024 MB RAM, 36 GB disk
• Myrinet - Lanai 4 32-bit PCI cards in all 64 machines
• Giganet - hardware VIA, only on the NetServers
• NT Terminal Server 4.0 on all nodes
• LSF for managing/starting parallel jobs
• HPVM is the "clusterware"

6. High Performance Virtual Machines
PI: Andrew A. Chien; co-PIs: Daniel Reed, David Padua
Students: Scott Pakin, Mario Lauria*, Louis Giannini, Paff Liu*, Geta Sampemane, Kay Connelly, and Andy Lavery
Research Staff: Philip Papadopoulos, Greg Bruno, Caroline Papadopoulos*, Mason Katz*, Greg Koenig, and Qian Liu
*Funded from other sources
URL: http://www-csag.ucsd.edu/projects/hpvm.html
DARPA #E313, AFOSR F30602-96-1-0286

7. What is HPVM?
• High-performance (MPP-class), thread-safe communication
• A layered set of APIs (not just MPI) that allow applications to obtain a significant fraction of hardware performance
• A small number of services that allow distributed processes to find and communicate with each other
• Device-driver support for Myrinet; vendor driver for VIA
• The focus/contribution has been effective layering, especially short-message performance

8. Supported APIs
• FM (Fast Messages) - core messaging layer; reliable, in-order delivery
• MPI - MPICH 1.0 based
• SHMEM - put/get interface (similar to Cray's)
• BSP - Bulk Synchronous Parallel (Oxford)
• Global Arrays - global abstraction for matrix operations (PNNL)
• TCGMSG - Theoretical Chemistry Group Messaging

9. Libraries/Layering
[Layering diagram: SHMEM, Global Arrays, MPI, and BSP sit on top of Fast Messages, which runs over Myrinet, VIA, or shared memory (SMP)]
• All libraries are layered on top of FM
• Semantics are active-message like (see the sketch below)
• FM was designed for building other libraries; the FM level is not intended for applications
• Designed for efficient gather/scatter and header processing
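To make "active-message like" concrete, the toy sketch below shows the pattern: a message names a handler, the receive side dispatches that handler on extraction, and the send side can gather a header and payload from separate buffers. All of the am_* names and the loop-back "wire" are invented for illustration; they are not FM's actual entry points or HPVM code.

```c
#include <stdio.h>
#include <string.h>

typedef void (*am_handler_t)(const void *data, unsigned len);

#define MAX_HANDLERS 16
static am_handler_t handler_table[MAX_HANDLERS];

/* One-slot loop-back "wire" standing in for Myrinet/VIA. */
static struct { int handler_id; unsigned len; char payload[256]; } wire;

static void am_register(int id, am_handler_t h) { handler_table[id] = h; }

/* Gather-style send: header and payload are appended as separate pieces. */
static void am_send(int handler_id, const void *hdr, unsigned hlen,
                    const void *body, unsigned blen)
{
    wire.handler_id = handler_id;
    memcpy(wire.payload, hdr, hlen);
    memcpy(wire.payload + hlen, body, blen);
    wire.len = hlen + blen;
}

/* Receive side: extract the pending message and run its handler. */
static void am_extract(void)
{
    handler_table[wire.handler_id](wire.payload, wire.len);
}

static void print_handler(const void *data, unsigned len)
{
    printf("handler ran on %u bytes: %s\n", len, (const char *)data);
}

int main(void)
{
    am_register(0, print_handler);
    const char hdr[] = "hdr:";
    const char body[] = "hello";
    am_send(0, hdr, sizeof hdr - 1, body, sizeof body);
    am_extract();
    return 0;
}
```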

10. What's New in HPVM 1.9
• Better performance (relative to v1.1 @ NCSA)
  • 25% more bandwidth (80 MB/s → 100+ MB/s)
  • 14% latency reduction (10 µs → 8.6 µs)
• Three transports
  • Shared-memory transport + [Myrinet, VIA]
  • Standalone desktop version uses shared memory
• Integration with NT Performance Monitor
• Improved configuration/installation
• BSP API added

11. Performance Basics (Ver 1.9)
• Myrinet
  • FM: 100+ MB/s, 8.6 µs latency
  • MPI: 91 MB/s @ 64 KB, 9.6 µs latency
  • Approximately 10% overhead
• Giganet (VIA)
  • FM: 81 MB/s, 14.7 µs latency
  • MPI: 77 MB/s, 18.6 µs latency
  • 5% bandwidth overhead, but 26% latency overhead!
• Shared-memory transport
  • FM: 195 MB/s, 3.13 µs latency
  • MPI: 85 MB/s, 5.75 µs latency
  • Our software structure requires 2 memory copies per packet :-(
• (A ping-pong kernel of the kind typically used for such measurements is sketched below.)
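Latency and bandwidth figures like those above are normally produced by a simple ping-pong microbenchmark. The sketch below is a generic MPI version, not HPVM's actual test code; the message size and repetition count are arbitrary.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, reps = 1000, size = 8;   /* 8-byte messages for latency; use large sizes for bandwidth */
    char *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = malloc(size);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {               /* rank 0 pings, rank 1 pongs */
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0) {
        double one_way = (t1 - t0) / (2.0 * reps);   /* half the round-trip time */
        printf("one-way latency: %.2f usec, bandwidth: %.2f MB/s\n",
               one_way * 1e6, size / one_way / 1e6);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}
```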

12. Gratuitous Bandwidth Graphs
• FM bandwidth is usually a good indicator of deliverable bandwidth
• High bandwidth is attained even for small messages
• N1/2 ≈ 512 bytes (N1/2 is the message size at which half of the peak bandwidth is reached, so a ~512-byte message already sees roughly 50 MB/s)

13. "Nothing is more humbling or more revealing than having others use your software."

14. Iowa State Performance Results
• "Comparing the Communication Performance and Scalability of a Linux and a NT Cluster of PCs, a Cray Origin 2000, an IBM SP and a Cray T3E-600"
• Glenn R. Luecke, Bruno Raffin and James J. Coyle, Iowa State
• Machines
  • 64-node NT SuperCluster, NCSA, dual PIII 550 MHz, HPVM 1.1
  • 64-node AltaCluster, ABQ HPCC, dual PII 450 MHz, GM
  • Origin 2000 (O2K), 64 nodes, Eagan, MN, dual 300 MHz R12000
  • Cray T3E-600, 512 processors, Eagan, MN, 300 MHz Alpha EV5
  • IBM SP, 250 processors, Maui (96 were 160 MHz)
• They ran MPI benchmarks at 8-byte, 10000-byte, and 1 MB message sizes

15. Right Shift - 8-Byte Messages
[Graph: time (ms) vs. number of processors]
• FM optimization for short messages (the right-shift kernel is sketched below)
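For reference, the right-shift test has each rank send a message to its right neighbor and receive from its left. The sketch below is an illustrative MPI version, not the Iowa State benchmark code; MSG_BYTES is a placeholder.

```c
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define MSG_BYTES 8   /* try 8, 10000, or 1<<20, as in the study */

int main(int argc, char **argv)
{
    int rank, size;
    static char sendbuf[MSG_BYTES], recvbuf[MSG_BYTES];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    memset(sendbuf, rank & 0xff, MSG_BYTES);

    int right = (rank + 1) % size;          /* destination: right neighbor */
    int left  = (rank - 1 + size) % size;   /* source: left neighbor */

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    MPI_Sendrecv(sendbuf, MSG_BYTES, MPI_CHAR, right, 0,
                 recvbuf, MSG_BYTES, MPI_CHAR, left, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("right shift of %d bytes took %.3f ms\n",
               MSG_BYTES, (t1 - t0) * 1e3);

    MPI_Finalize();
    return 0;
}
```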

16. Right Shift - 10000-Byte Messages
[Graph: time (ms) vs. number of processors]
• FM: starts at 25 MB/s and drops to 12 MB/s above 64 nodes

17. Right Shift - 1 MB Messages
[Graph: time (ms) vs. number of processors]
• The change at 64 processors prompted the shared-memory transport in HPVM 1.9
• Curve flattened (better scalability)
• Recently (last week) we found a fairness issue in the FM Lanai control program

18. MPI Barrier - 8 Bytes
[Graph: time (ms) vs. number of processors]
• FM significantly faster at 128 processors (4x - 9x); a barrier-timing sketch follows this slide
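Barrier costs like those plotted here are usually measured by averaging many consecutive MPI_Barrier calls. The sketch below shows that pattern; it is illustrative only and not the published benchmark.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, reps = 1000;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);              /* warm up and synchronize the start */
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++)
        MPI_Barrier(MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("average barrier time: %.3f ms\n", (t1 - t0) / reps * 1e3);

    MPI_Finalize();
    return 0;
}
```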

19. MPI Barrier - 10000 Bytes
[Graph: time (ms) vs. number of processors]
• FM 2.5x slower than the T3E, 2x slower than the O2K

20. Interpreting These Numbers
• Concentration on short-message performance puts clusters on par with (expensive) traditional supercomputers
• Longer-message performance is not as competitive; version 1.9 addresses some of the issues
• Lends some understanding of large-application performance on the NT SuperCluster

21. Future HPVM Development
• (Obvious) things that will happen
  • Support for Windows 2000
  • Alpha NT - move toward a 64-bit code base
  • Support for the new Myrinet Lanai 7 hardware
• HPVM development will move into a support role for other projects
  • Agile Objects: high-performance OO computing
  • Federated clusters
  • Tracking the NCSA SuperCluster hardware curve

22. Current State for Reference
• HPVM supports multiple processes per node and multiple process groups per cluster
  • Inter-group communication is not supported
• In-order, reliable messaging is guaranteed by a credit-based flow-control scheme (see the sketch below)
  • The static scheme is simple but inflexible
• Only one route between any pair of processes
  • Even if multiple routes are available, only one is used
• Communication within the cluster is very fast; outside it is not
• The speed comes from many static constraints
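For readers unfamiliar with credit-based flow control, the toy sketch below shows the general idea: a sender may only inject as many packets as it holds credits for, and the receiver hands credits back as it frees receive buffers. Every name and constant is invented; this is the textbook pattern the slide names, not HPVM's scheme.

```c
#include <stdio.h>

#define RECV_SLOTS 4            /* receive buffers reserved per sender */

typedef struct {
    int credits;                /* packets the sender may still inject */
} sender_state_t;

/* Sender side: only inject a packet if a credit is available. */
static int try_send_packet(sender_state_t *s, int seq)
{
    if (s->credits == 0) {
        printf("packet %d stalled: no credits\n", seq);
        return 0;               /* caller must wait for returned credits */
    }
    s->credits--;
    printf("packet %d sent (credits left: %d)\n", seq, s->credits);
    return 1;
}

/* Receiver side: freeing a buffer returns a credit to the sender.
   In a real transport this piggybacks on reverse traffic or an explicit ack. */
static void return_credit(sender_state_t *s)
{
    s->credits++;
}

int main(void)
{
    sender_state_t s = { RECV_SLOTS };

    for (int seq = 0; seq < 6; seq++) {
        if (!try_send_packet(&s, seq)) {
            return_credit(&s);          /* pretend the receiver drained one buffer */
            try_send_packet(&s, seq);
        }
    }
    return 0;
}
```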

23. Designed and Now Implementing
• Dynamic flow-control scheme for better scalability
  • Support larger clusters
• Multiple routes and out-of-order packet re-sequencing (see the sketch below)
  • Allow parallel paths for high-performance WAN connections
• Support inter-group communication
  • Driven by Agile Objects' need for remote method invocation / client-server interactions
• Support "federated clusters"
  • Integration into the Grid; bring the performance of the cluster outside of the machine room
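Re-sequencing out-of-order packets generally means tagging each packet with a sequence number, delivering in order, and parking early arrivals until the gap fills. The sketch below shows that mechanism in isolation; the names, window size, and loop-back driver are invented for illustration and are not HPVM code.

```c
#include <stdio.h>
#include <string.h>

#define WINDOW 8

static char parked[WINDOW][64];       /* out-of-order packets, indexed by seq % WINDOW */
static int  present[WINDOW];
static int  next_expected = 0;        /* next sequence number to deliver */

static void deliver(int seq, const char *data)
{
    printf("delivered #%d: %s\n", seq, data);
}

static void on_packet_arrival(int seq, const char *data)
{
    if (seq != next_expected) {       /* arrived early over another route: park it */
        strcpy(parked[seq % WINDOW], data);
        present[seq % WINDOW] = 1;
        return;
    }
    deliver(seq, data);               /* in order: deliver immediately */
    next_expected++;
    while (present[next_expected % WINDOW]) {   /* drain packets the gap was blocking */
        present[next_expected % WINDOW] = 0;
        deliver(next_expected, parked[next_expected % WINDOW]);
        next_expected++;
    }
}

int main(void)
{
    on_packet_arrival(1, "beta");     /* arrives ahead of #0 */
    on_packet_arrival(0, "alpha");    /* fills the gap; #1 is drained right after */
    on_packet_arrival(2, "gamma");
    return 0;
}
```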

24. Is Linux in HPVM's Future?
• Maybe ;-)
• The critical technical hurdle is finding a user-scheduled, lightweight thread package
  • The NT version makes use of "Fibers"
• The major impediment is time and a driving project

25. Summary
• HPVM gives good relative and absolute performance
• HPVM moving past the "numbers game"
  • Concentrate on overall usability
  • Integration into the Grid
• Software will continue development but takes on a support role for driving projects
• Check out www-csag.ucsd.edu
