
Linux Clusters in ITD


  1. Linux Clusters in ITD Efstratios Efstathiadis Information Technology Division

  2. Outline • Linux in Scientific Computing • Large Scale Linux Installation & Configuration • File Sharing: NAS/SAN, NFS, PVFS • Cluster Interconnects • Load Management Systems • Parallel Computing • System Monitoring Tools • Linux Clusters in ITD • Thoughts, Conclusions

  3. Linux in Scientific Computing • Features of Scientific Computing: • Floating point performance is very important • Users write their own codes. • Fortran is common. • GUIs and user-friendly interfaces are not required. • The goal is Science, not Computer Science

  4. Linux in Scientific Computing • Scientific computing is one of the first areas where Linux has had a major impact on production and mission-critical computing. • Access to cheap hardware. • License Issues. • Vendor Response/Support is slow. • Access to Source code is needed to implement desired features. • Availability of manpower. • Availability of Scientific Tools/Resources.

  5. Linux Endorsement • SUN: (www.sun.com/linux) • Porting its software products to Linux (Java 2, Forte for Java, OpenOffice, Grid Engine) • Porting Linux to the UltraSPARC architecture • Provides common utilities for Solaris and Linux so that users can move between the two • Improves the compatibility between the two so that applications can run on both • Sun StorEdge T3 Arrays are compatible with Linux • IBM: (www-1.ibm.com/linux) • AFS • Storage Devices • Linux pre-installed (20% of its Intel-based servers run Linux) • openclustergroup • Spends over $1.3B supporting Linux.

  6. Processor Support in Linux • Most Popular: • x86 • Alpha • Sparc • PowerPC • MIPS Which Processor? It Depends on: • Cost • Performance • Availability of Software

  7. Performance SPEC: Standard Performance Evaluation Corporation (http://www.spec.org) SPECint95: 8 integer-intensive C codes SPECfp95: 10 floating-point scientific FORTRAN codes

  8. Software • Compilers: • gcc, g++, g77: Available on all platforms, but: • Generated code is not very fast • No parallelization for SMPs • g77 is Fortran 77 only • g++ has its limitations • x86 Compilers: • Portland Group (www.pgroup.com) • Fortran 90/95, OpenMP parallelization, HPF, better performance (~15%) • Kuck and Associates (www.kai.com) • C++ / OpenMP parallelization • NAG, Absoft, Fujitsu, etc. (A compile-line sketch follows below.)
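
A hedged sketch of typical compile lines for the compilers listed above; the flags shown are the commonly documented ones, and the exact versions and options available on a given cluster may differ:

    # GNU g77: portable, but no SMP parallelization
    g77 -O2 -o mycode mycode.f
    # Portland Group pgf90 with OpenMP directives enabled (-mp)
    pgf90 -O2 -mp -o mycode mycode.f90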

  9. What is a Cluster? • International Data Corp. (IDC) cluster requirements: • Software must provide an environment that looks, as much as possible, like a single system to everyone. • The environment must provide higher data and application availability than is possible on single systems. • Developers must not have to use special APIs to make an application work in the clustering environment. • Administrators must be able to treat the configuration as a single management domain. • There must be facilities for components of a single application, or entire applications, to run in parallel on many different processors to improve single-application performance or the overall scalability of the environment.

  10. What is a Cluster? Cluster A cluster is a collection of interconnected computers that can be viewed and used as a single, unified computing resource.

  11. Large Scale Linux Installation • Choice of: • Diskless Install: one copy of Linux to maintain; requires special tools; doesn’t scale to a large number of nodes • Local Install: • Kickstart • System Imager • LUI (Linux Utility for cluster Installation) • g4u (Ghost for Unix: http://www.feyrer.de/g4u/ )

  12. Large Scale Linux Installation Kickstart Pulls and installs a list of RPM packages from a RedHat mirror site (such as linux.bnl.gov) specified in a configuration file (ks.cfg); a sample ks.cfg sketch follows below. • Cluster nodes must be on a public network. • Have to maintain several ks.cfg configuration files. • RedHat only. • No easy way to propagate configuration changes.
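
A minimal ks.cfg sketch, assuming an NFS install tree; the server, directory, partitioning, and package list are illustrative placeholders, not the actual BNL configuration:

    # ks.cfg (illustrative values only)
    install
    lang en_US
    keyboard us
    network --bootproto dhcp
    nfs --server linux.bnl.gov --dir /export/redhat
    rootpw changeme
    timezone US/Eastern
    zerombr yes
    clearpart --all
    part / --size 4000
    part swap --size 512
    lilo --location mbr
    %packages
    @ Base
    %post
    # commands here run in the freshly installed system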

  13. Large Scale Linux Installation System Imager (http://systemimager.sourceforge.net) It pulls the system image of a master client into an Image Server. Cluster nodes can then pull the image they choose from the Image Server. • The Image Server “pulls” the system image of a master client. • Cluster nodes use rsync and tftp to pull images from the Image Server. • Can be done on a private network. • It supports several Linux distributions. • Configuration changes can be easily propagated to clients through rsync (see the sketch below). • rsync (http://rsync.samba.org) can pull just the new and/or modified files off the server, rather than the whole system image.
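
A hedged sketch of propagating a configuration change with rsync; the image server hostname ("imageserver") and rsync module name ("node_image") are hypothetical:

    # On a cluster node: fetch only files that changed on the image server
    rsync -av --delete imageserver::node_image/etc/ /etc/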

  14. File Sharing: DAS DAS: Direct Attached Storage

  15. File Sharing: NAS NAS: Network Attached Storage

  16. File Sharing: NAS Network Attached Storage (NAS): Shared storage on a network. A dedicated high-performance single purpose machine. • Separate Data Servers from Application Servers • Provide Centralized Data Management • Scalability • Dynamic Growth of Filesystems (LVM). • Journaling Filesystems • RAID controllers • Support for Multiple protocols (NFS, CIFS, HTTP, FTP etc) • Multiple Network Interfaces • Uses existing Network Infrastructure • Web Admin/Monitor GUI (netattach) • Redundant Power Supplies/fans/cables • Linux Support

  17. File Sharing: SAN SAN: Storage Area Network

  18. File Sharing: SAN Storage Area Network (SAN): Shared storage on a network. A dedicated high-performance network connecting storage elements to the back end of the servers. • Provides the benefits of NAS and also isolates the storage traffic onto a dedicated high-performance network. • Disk drives are attached directly to a Fibre Channel network (not possible on a TCP/IP network). SAN Disadvantages • Expensive: must build a dedicated, high-performance network • Lack of strong standards • Proprietary solutions only

  19. File Sharing: NFS • What is NFS? • The Network File System (NFS) protocol provides transparent remote access to shared file systems across networks. The NFS protocol is designed to be machine, operating system, network architecture and transport protocol independent. This independence is achieved through the use of Remote Procedure Call (RPC) primitives built on top of the eXternal Data Representation (XDR). • How is NFS3 different from NFS2? • NFS version 3 adds 64-bit file system support (version 2 is limited to 32 bits), reliable asynchronous writes (version 2 supports only synchronous writes), better cache consistency by providing attribute information before and after an operation, and better performance on directory lookups through READDIRPLUS calls, which reduce the number of messages passed between client and server (READDIRPLUS returns file handles and attributes in addition to directory entries). The maximum data transfer size, fixed at 8 KB in NFS2, is now set by values in the FSINFO return structure. (A client mount example follows below.)
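
A hedged sketch of mounting the same export as NFSv2 and then NFSv3 on a Linux client; the server name, export path, and transfer sizes are illustrative:

    # NFSv2 mount (8 KB maximum transfer size)
    mount -t nfs -o nfsvers=2,rsize=8192,wsize=8192 server:/scratch /scratch
    # NFSv3 mount (transfer size negotiated via FSINFO; asynchronous writes)
    mount -t nfs -o nfsvers=3,rsize=8192,wsize=8192 server:/scratch /scratch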

  20. File Sharing: NFS • NFS Benchmarking: Bonnie (http://www.textuality.com/bonnie) The same filesystem (/scratch) is mounted on the Linux client using different NFS versions. The NFS server runs Solaris 7 (sun3.bnl.gov); the client is a dual 800MHz RedHat 6.2 host (BLC).

  NFSv2
               -------Sequential Output-------- ---Sequential Input-- --Random--
               -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
  Machine   MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
          1000   698  2.2   619  0.5   655  1.1  5615 17.3  9994 12.9 176.2  2.8

  NFSv3
               -------Sequential Output-------- ---Sequential Input-- --Random--
               -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
  Machine   MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
          1000  4630 15.6  4631  4.4  2329  4.6 11233 39.3 11185 16.6 695.4 11.5
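
For reference, a hedged sketch of the kind of Bonnie invocation behind the tables in these slides; the mount point, test size, and report label are illustrative:

    # 1000 MB test file on the NFS-mounted filesystem, labeled for the report
    bonnie -d /scratch -s 1000 -m NFSv3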

  21. File Sharing: NFS • NFS Benchmarking: Bonnie Linux Server - Linux Client (RedHat 6.2, 2.2.18)

               -------Sequential Output-------- ---Sequential Input-- --Random--
               -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
  Machine   MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU   /sec %CPU
  NFSv3-v2 1000  9787 33.5  9886  8.7  3222  5.6  8630 29.3  9110 13.6  115.9  0.9
  NFSv3-v3 1000  9848 34.1  9911  8.9  3227  5.6  8740 30.0  9087 12.7  116.5  1.0
  Local    1000 19643 61.5 25040 11.7  8297 12.7 15651 40.9 18885 11.1 1150.4  6.9

  22. File Sharing: NFS • Network Attached Storage (NAS) Linux Server (VA 9450 NAS): dual PIII Xeon 700MHz, 2.0GB RAM, RedHat 6.2, NFSv3, Mylex extremeRAID 2000, ext3. Linux Client (VA 2200): dual PIII 800MHz, 0.5GB RAM, RedHat 6.2 (2.2.18 kernel with NFSv3). Server and client are on the same network switch (Cisco 4006).

               -------Sequential Output-------- ---Sequential Input-- --Random--
               -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
  Machine   MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU   /sec %CPU
  NFSv3-v3 1000  9848 34.1  9911  8.9  3227  5.6  8740 30.0  9087 12.7  116.5  1.0
  NASv3-v3 1024  7551 25.8  7544  6.8  5048 10.7 11504 36.9 11488 15.9 1640.2 22.1
  NASv3-v3 2047  7428 25.5  7427  7.0  4021  7.8 10297 33.7  9940 13.9  740.5 13.5
  NFSv3-v2 1000  9787 33.5  9886  8.7  3222  5.6  8630 29.3  9110 13.6  115.9  0.9
  NASv3-v2  512  7430 25.1  7280  7.0  6500 10.4 26453 63.8 19378 16.2 6881.1 87.7
  NASv3-v2 1024  7360 25.3  7431  7.0  4811  9.1 11357 36.5 10985 16.3 1742.8 24.0
  NASv3-v2 2047  7348 25.1  7398  6.9  4154  8.2 10813 34.9 10474 15.1  981.3 13.5

  23. File Sharing: NFS The setup of having a SUN workstation as a “main node” serving home directories is pretty common. quark.phy.bnl.gov, sun1.sns.bnl.gov, sun2.bnl.gov (sun65.bnl.gov) etc • http://linux.itd.bnl.gov/NFS • http://nfs.sourceforge.net

  24. File Sharing: PVFS PVFS: Parallel Virtual File System http://www.parl.clemson.edu/pvfs/desc.html Stripes file data across multiple disks in different nodes (I/O nodes) in a Cluster. This way large files can be created and bandwidth is increased. Four major components to the PVFS system: • Metadata server (mgr) • I/O server (iod) • PVFS native API (libpvfs) • PVFS Linux kernel support

  25. Cluster Network Fast Ethernet • Transmission Speed: 0.1Gbps, Latency: ~100 µs, Cost/Connection: <$1000 Gigabit Ethernet • Maximum Bandwidth: 1.0Gbps; Cost: $1,650/connection (based on 64 ports, copper). Myrinet • Low-latency, short-distance network (System Area Network). • Maximum Bandwidth: 1.2Gbps; Latency: ~9 µs, Cost: >$2,500/connection • Single-vendor hardware. CDIC Cluster Interconnect: Cisco 4006: 3 x 48-port full-duplex Fast Ethernet switch

  26. Network Graph http://linux.itd.bnl.gov/netpipe

  27. Network Signature

  28. Cluster Network: Private vs Public • Private Network: cluster security, setup and administration are much easier, but applications cannot interact with the outside world. • Public Network: security, setup and administration are more difficult and IP addresses are needed, but interaction with the outside world is possible.

  29. Load Management Systems (LMS) • Transparent Load Sharing Users submit jobs without being concerned about which cluster resource is used to process the job. • Control Over Resource Sharing Rather than leaving it up to individuals to search the network for available resources and capacity to run their jobs, the LMS controls the resources in the cluster. The LMS takes the specifications or requirements of the job into account when assigning resources, matching requirements with the resources available. • Implement Policies In an LMS, rules can be established that automatically set priorities for jobs among groups or teams. This enables the LMS to implement resource sharing between groups.

  30. Load Management Systems (LMS) • Batch queuing • Load Balancing • Failover Capability • Job Accounting/Statistics • User specifiable Resources • Relinking/Recompiling of Application Programs • Fault tolerant • Suspend/Resume jobs • Job Status • Host Status • Meta-Job Capability • Cluster-wide resources • Job Migration • Central Control

  31. Load Management Systems (LMS) Numerous LMS (or CMS) packages are available on Linux. • Portable Batch System (PBS) (http://pbs.mrj.com) Developed by NASA; it is freely distributed by a commercial company which can also provide service and support. • Load Sharing Facility (LSF) (http://www.platform.com) • Sun Grid Engine (CODINE) (http://www.sun.com/software/gridware/linux/) • Distributed Queuing System (DQS) (http://www.scri.fsu.edu/~pasko/dqs.html) • Generic Network Queuing System (GNQS) (http://www.gnqs.org/) • LoadLeveler (http://www.austin.ibm.com/software/sp_products/loadlev.html) Developed by IBM; it is a modified version of the Condor batch queuing system (http://www.cs.wisc.edu/condor/)

  32. Load Management Systems (LMS) • Portable Batch System (PBS) PBS was designed and developed by NASA to provide control over the initiation, scheduling, and execution of batch jobs. • User Interfaces: GUI (xpbs) and Command Line Interface (CLI) • Heterogeneous Clusters • Interactive (debugging sessions or jobs that require user command-line input) and Batch Jobs • Parallel code support for MPI, PVM, HPF • File Staging • Automatic Load-Leveling: the PBS scheduler offers numerous ways to distribute workload across the cluster, based on hardware configuration, resource availability and keyboard activity • Job Accounting • Cross-system Scheduling • Web Site: http://pbs.mrj.com • Short introduction at BNL: http://www.itd.bnl.gov/bcf/cluster/pbs/ • SPF (Single Point of Failure): the central PBS server (A sample batch script sketch follows below.)
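
A hedged sketch of a PBS batch script for an MPI job on a cluster like the ones described here; the resource request, job name, and executable name are illustrative:

    #!/bin/sh
    #PBS -N mpi_test              # job name
    #PBS -l nodes=4:ppn=2         # request 4 nodes, 2 CPUs each
    #PBS -l walltime=01:00:00
    #PBS -j oe                    # merge stdout and stderr
    cd $PBS_O_WORKDIR
    mpirun -np 8 -machinefile $PBS_NODEFILE ./a.out

Submit the script and follow its status from the command line:

    qsub job.pbs
    qstat -a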

  33. Parallel Processing • The use of multiple processors to execute different parts of a program simultaneously. • The main goal is to reduce wall-clock time (also cost, memory constraints, etc.) • Things to consider: • Is the problem parallelizable? (e.g. the recurrence F(k+2) = F(k+1) + F(k) has an inherently serial dependency) • Parallel overhead (the amount of time required to coordinate parallel tasks) • Synchronization of parallel tasks (waiting for two or more tasks to reach a specified point) • Granularity of the problem • SMP vs DMP (is the network a factor?)

  34. Parallel Processing • Threads Used on SMP hosts only; not widely used in scientific computing. • Compiler generated parallel programs • The compiler detects concurrency in loops and distributes the work in a loop to different threads. • The compiler is usually assisted by compiler directives (see the sketch below). • MPI, PVM • Embarrassing parallelism Independent processes can be executed in parallel with little or no coupling between them.
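
A minimal sketch of compiler-directed loop parallelism in C, using an OpenMP directive of the style the OpenMP-capable compilers mentioned earlier (Portland Group, KAI) accept; the loop and arrays are purely illustrative:

    #include <stdio.h>

    #define N 1000000

    static double a[N], b[N];

    int main(void) {
        int i;
        /* The compiler splits the iterations of this loop across threads on an
           SMP node; a non-OpenMP compiler simply ignores the directive. */
        #pragma omp parallel for private(i)
        for (i = 0; i < N; i++)
            a[i] = 2.0 * b[i] + 1.0;
        printf("a[0] = %f\n", a[0]);
        return 0;
    }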

  35. Message Passing Interface (MPI) • MPI is a message-passing library, a collection of subroutines that facilitate communication (exchange of data and synchronization) among processes in a distributed-memory program. • MPI offers portability and performance; it is not an official (ISO/ANSI) standard, but a widely adopted de facto one. • Messages consist of the actual data you send/receive plus an envelope of information that helps route the data. In MPI message-passing calls there are three parameters that describe the data and another three that specify the routing (envelope). Data: startbuf, count, datatype. Envelope: dest, tag, communicator. • Messages are sent over TCP sockets. (A minimal send/receive sketch follows below.)
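
A minimal send/receive sketch in C showing the data and envelope parameters described above; it is illustrative only (compile with mpicc, run with mpirun -np 2):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, x = 0;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            x = 42;
            /* data: &x, 1, MPI_INT   envelope: dest=1, tag=0, MPI_COMM_WORLD */
            MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            printf("rank 1 received %d\n", x);
        }

        MPI_Finalize();
        return 0;
    }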

  36. Message Passing Interface (MPI) • Number of processors: are we overdoing it with too many processes, creating too much message passing over the network and increasing synchronization time? • Message size: is it optimal for our network technology? What bandwidth do we get when we pass messages of different sizes? • Design: is our problem very fine grained? • Do we take advantage of loop unrolling? Use the right compiler flags. • Take advantage of the resources; avoid nodes that are busy. Sometimes slower nodes can do the job as well as a fast but busy node. • Benchmark. Get the numbers. What is important to your code? Memory, CPU, etc.

  37. Benchmarking MPI • Tools included in the MPICH distribution • LLC Bench (http://icl.cs.utk.edu/projects/llcbench/index.htm) • Vampir MPI Performance Analysis Tool • NAS Parallel Benchmarks (NPB) • http://www.nas.nasa.gov/software/NPB • NAS: Numerical Aerospace Simulation • NPB is installed on BGC under /opt/pgi/bench/NPB2.3 • NPB results: only one program seems to benefit from the Gigabit interconnect upgrade, due to its large messages. • NAS serial version (helps understand the architecture) (A ping-pong timing sketch follows below.)
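
A hedged sketch of the kind of point-to-point (ping-pong) measurement the tools above automate, timed with MPI_Wtime; the message size and repetition count are arbitrary (compile with mpicc, run with mpirun -np 2):

    #include <mpi.h>
    #include <stdio.h>

    #define NBYTES (1 << 20)   /* 1 MB messages */
    #define REPS   100

    static char buf[NBYTES];

    int main(int argc, char **argv) {
        int rank, i;
        double t0, t1;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        t0 = MPI_Wtime();
        for (i = 0; i < REPS; i++) {
            if (rank == 0) {
                /* bounce the buffer to rank 1 and back */
                MPI_Send(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
            } else if (rank == 1) {
                MPI_Recv(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
                MPI_Send(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();
        if (rank == 0)
            printf("average round-trip bandwidth: %.1f MB/s\n",
                   2.0 * REPS * NBYTES / (t1 - t0) / 1.0e6);
        MPI_Finalize();
        return 0;
    }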

  38. NPB-2.3-Serial (W Class)

  39. Cluster Monitors Administrators: • Cluster Usage • Log File scans • Intrusion Detection • Hardware and Software Inventories, etc. Users: • How many nodes are in the cluster? • What type of nodes (architecture)? PC, SPARC, SGI • Available resources (CPU speed, disk space, memory, etc.) • What is being used? Which nodes are “empty”?

  40. Cluster Monitors • Load Management Systems provide some sort of monitoring. • HP OpenView Open Source Products • Pong3 (Perl, www.megacity.org/pong3) • System-Info (Perl, http://blaine.res.wpi.net/) • spong (Perl, spong.sourceforge.net) • bWatch (Tcl/Tk and ssh, user customizable) • Vacum (VA product) • rps (Perl, rps.sourceforge.net)

  41. HP OpenView • IT Operations • ITO agent running on client • ITO central console • ITO agents monitor client log files and report events back to the central console • Actions can be taken either manually or automatically at the console in response to events • RADIA (Novadigm Product) • Hardware and software Inventories • Software distributions • NNM (Network Node Manager) • Net Metrics: Report Network Statistics • Net Rangers: Intrusion detection

  42. Spong (spong.sourceforge.net) • Provides CPU, memory, disk utilization • Checks availability of services (ssh, http, PBS etc.) (*) • Lists running jobs (sorted by CPU usage) (*) • Keeps a history of events per host • Warns admins (by email) of status changes • Usage graphs (per hour/day/month/year) (**) • Scans log files • Per-host configuration • Open Source (*) Modified/Enhanced (**) unstable

  43. Remote ps (http://rps.sourceforge.net)

  44. The CDIC Cluster (BGC) • 49 Nodes on a local network bgc000: Master Node, 2 NICs 001-029: dual PIII 700MHz, 1GB Memory, 8GB disk 030-047: dual PIII 500MHz, 0.5GB Memory, 2GB disk bgc-f1: fileserver hosting “local” user home directories (50GB) (“public” home directories are mounted on the master node only, under /itd) • RedHat 6.2 • PBS (with MPI support) • MPICH-1.2.1 • Initial installation with kickstart; rsync is used to propagate updates. • Portland Group Compilers (3.2): 16-CPU, 2-user license • Monitors: bWatch, spong, pong3 • Interim backup solution

  45. The Brookhaven Cluster (BLC) • General Purpose Cluster • 60 Nodes on a public network blc000.bnl.gov: Master Node 001-040: dual PIII 800MHz, 0.5GB Memory, 2x9GB disk 041-059: dual PIII 500MHz, 0.5GB Memory, 18GB disk • RedHat 6.2 • PBS (with MPI support). • MPICH-1.2.1 • Home directories are hosted on a Solaris File Server (userdata.bnl.gov) • Initial installation, configuration and updates with System Imager (host images are kept on the master node) • Portland Group Compilers (3.2): 64-CPU, 4-user license • Monitors: bWatch, spong, pong3, HPOV

  46. The SNS Cluster (SNSC) • 6 Nodes on public network snsc00.sns.bnl.gov: Master Node 01-05: dual PIII 700MHz, 0.5GB Memory, 18GB disk • RedHat 7.0 • PBS (with MPI support). (?) • MPICH-1.2.1 • Home directories are hosted on a Solaris File Server (sun1.sns.bnl.gov) • Initial installation, configuration and updates with System Imager ( Host Images are kept on the master node).

  47. Thoughts ... Need for Centralized Cluster Management and Homogeneity: • Easier monitoring, administration, maintenance and recovery. • Users will have the option to share resources (idle CPUs, filesystems (NAS), network switches, printers, etc.) and costs (licenses, software). • Increased user interaction. • Faster integration of new groups.
