
Scyld ClusterWare System Administration



  1. Scyld ClusterWare System Administration Confidential – Internal Use Only

  2. Orientation Agenda – Part 1 • Scyld ClusterWare foundations • Booting process • Startup scripts • File systems • Name services • Cluster Configuration • Cluster Components • Networking infrastructure • NFS File servers • IPMI Configuration • Break Confidential – Internal Use Only

  3. Orientation Agenda – Part 2 • Parallel jobs • MPI configuration • Infiniband interconnect • Queuing • Initial setup • Tuning • Policy case studies • Other software and tools • Troubleshooting • Questions and Answers Confidential – Internal Use Only

  4. Orientation Agenda – Part 1 • Scyld ClusterWare foundations • Booting process • Startup scripts • File systems • Name services • Cluster Configuration • Cluster Components • Networking infrastructure • NFS File servers • Break Confidential – Internal Use Only

  5. Cluster Virtualization Architecture Realized (Diagram: Master Node, Interconnection Network, Optional Disks, Internet or Internal Network) • Minimal in-memory OS with single daemon rapidly deployed in seconds - no disk required • Less than 20 seconds • Virtual, unified process space enables intuitive single sign-on, job submission • Effortless job migration to nodes • Monitor & manage efficiently from the Master • Single System Install • Single Process Space • Shared cache of the cluster state • Single point of provisioning • Better performance due to lightweight nodes • No version skew is inherently more reliable • Manage & use a cluster like a single SMP machine Confidential – Internal Use Only

  6. Elements of Cluster Systems (Diagram: Master Node, Interconnection Network, Optional Disks, Internet or Internal Network) • Some important elements of a cluster system • Booting and Provisioning • Process creation, monitoring and control • Update and consistency model • Name services • File Systems • Physical management • Workload virtualization Confidential – Internal Use Only

  7. Booting and Provisioning • Integrated, automatic network boot • Basic hardware reporting and diagnostics in the Pre-OS stage • Only CPU, memory and NIC needed • Kernel and minimal environment from master • Just enough to say “what do I do now?” • Remaining configuration driven by master • Logs are stored in: • /var/log/messages • /var/log/beowulf/node.* Confidential – Internal Use Only
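For example, those logs can be followed from the master while a node boots; node number 0 is assumed here:

  # Watch node 0's per-node boot log
  tail -f /var/log/beowulf/node.0
  # Check the master-side syslog for boot/DHCP/TFTP activity
  grep -i beowulf /var/log/messages | tail -20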

  8. DHCP and TFTP services • Started from /etc/rc.d/init.d/beowulf • Locates vmlinuz in /boot • Configures syslog and other parameters on the head node • Loads kernel modules • Sets up libraries • Creates the ramdisk image for compute nodes • Starts the DHCP/TFTP server (beoserv) • Configures NAT for IP forwarding if needed • Starts the kickback name service daemon (4.2.0+) • Tunes the network stack Confidential – Internal Use Only
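Assuming the standard init script behavior, the service can be cycled or its configuration re-read from the head node, for example:

  service beowulf reload     # re-read /etc/beowulf/config without rebooting compute nodes
  service beowulf restart    # full restart of beoserv and the supporting services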

  9. Compute Node Boot Process • Starts with /etc/beowulf/node_up • Calls /usr/lib/beoboot/bin/node_up • Usage: node_up <nodenumber> • Sets up: • System date • Basic network configuration • Kernel modules (device drivers) • Network routing • setup_fs • Name services • chroot • Prestages files (4.2.0+) • Other init scripts in /etc/beowulf/init.d Confidential – Internal Use Only

  10. Compute Node Boot Process • Starts with /etc/beowulf/node_up • Calls /usr/lib/beoboot/bin/node_up • Usage: node_up <nodenumber> • Sets up: • System date • Basic network configuration • Kernel modules (device drivers) • Network routing • setup_fs • Name services • chroot • Prestages files (4.2.0+) • Other init scripts in /etc/beowulf/init.d Confidential – Internal Use Only

  11. Subnet configuration • Default used to be a class C network • netmask 255.255.255.0 • Limited to 155 compute nodes ( 100 + $NODE < 255 ) • Last octet denotes special devices • x.x.x.10 switches • x.x.x.30 storage • Infiniband is a separate network • x.x.1.$(( 100 + $NODE )) • Needed eth0:1 to reach the IPMI network • x.x.2.$(( 100 + $NODE )) • /etc/sysconfig/network-scripts/ifcfg-eth0:1 • ifconfig eth0:1 10.54.2.1 netmask 255.255.255.0 Confidential – Internal Use Only
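A minimal sketch of that alias interface file, using the addresses from the example above (values vary per site):

  # /etc/sysconfig/network-scripts/ifcfg-eth0:1
  DEVICE=eth0:1
  IPADDR=10.54.2.1
  NETMASK=255.255.255.0
  ONBOOT=yes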

  12. Subnet configuration • New standard is a class B network • netmask 255.255.0.0 • Limited to 100 * 256 compute nodes • 10.54.50.x – 10.54.149.x • Third octet denotes special devices • x.x.10.x switches • x.x.30.x storage • Infiniband is a separate network • x.$(( x+1 )).x.x • IPMI is on the same network (eth0:1 not needed) • x.x.150.$NODE Confidential – Internal Use Only

  13. Compute Node Boot Process • Starts with /etc/beowulf/node_up • Calls /usr/lib/beoboot/bin/node_up • Usage: node_up <nodenumber> • Sets up: • System date • Basic network configuration • Kernel modules (device drivers) • Network routing • setup_fs • Name services • chroot • Prestages files (4.2.0+) • Other init scripts in /etc/beowulf/init.d Confidential – Internal Use Only

  14. Setup_fs • Script is in /usr/lib/beoboot/bin/setup_fs • Configuration file: /etc/beowulf/fstab • Per-node overrides are selected like this:
  # Select which FSTAB to use.
  if [ -r /etc/beowulf/fstab.$NODE ] ; then
      FSTAB=/etc/beowulf/fstab.$NODE
  else
      FSTAB=/etc/beowulf/fstab
  fi
  echo "setup_fs: Configuring node filesystems using $FSTAB..."
  • $MASTER is determined and populated • The "nonfatal" option allows compute nodes to finish the boot process and log errors in /var/log/beowulf/node.* • NFS mounts of external servers need to be done via IP address because name services have not been configured yet Confidential – Internal Use Only
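For illustration, a compute node fstab combining local scratch space with NFS mounts might look like the sketch below; the paths, addresses, and the use of $MASTER follow the conventions described above and are not taken from a real configuration:

  # /etc/beowulf/fstab (illustrative)
  /dev/sda1            swap      swap   defaults           0 0
  /dev/sda2            /scratch  ext2   defaults,nonfatal  0 0
  $MASTER:/home        /home     nfs    defaults           0 0
  10.54.30.0:/export   /data     nfs    defaults,nonfatal  0 0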

  15. beofdisk • beofdisk configures partition tables on compute nodes • To configure the first drive: • bpsh 0 fdisk /dev/sda • Typical interactive usage • Query partition table: • beofdisk -q --node 0 • Write partition tables to other nodes: • for i in $(seq 1 10); do beofdisk -w --node $i ; done • Create devices initially • Use the head node's /dev/sd* as a reference:
  [root@scyld beowulf]# ls -l /dev/sda*
  brw-rw---- 1 root disk 8, 0 May 20 08:18 /dev/sda
  brw-rw---- 1 root disk 8, 1 May 20 08:18 /dev/sda1
  brw-rw---- 1 root disk 8, 2 May 20 08:18 /dev/sda2
  brw-rw---- 1 root disk 8, 3 May 20 08:18 /dev/sda3
  [root@scyld beowulf]# bpsh 0 mknod /dev/sda1 b 8 1
  Confidential – Internal Use Only
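If the same device nodes are needed on several nodes, a loop like this sketch can create them, reusing the major/minor numbers shown above (node range is illustrative):

  # Create /dev/sda1-/dev/sda3 on nodes 0-10
  for node in $(seq 0 10); do
      bpsh $node mknod /dev/sda1 b 8 1
      bpsh $node mknod /dev/sda2 b 8 2
      bpsh $node mknod /dev/sda3 b 8 3
  done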

  16. Create local filesystems • After partitions have been created, mkfs • bpsh -an mkswap /dev/sda1 • bpsh -an mkfs.ext2 /dev/sda2 • ext2 is a non-journaled filesystem, faster than ext3 for a scratch file system • If corruption occurs, simply mkfs again • Copy the int18 bootblock if needed: • bpcp /usr/lib/beoboot/bin/int18_bootblock $NODE:/dev/sda • /etc/beowulf/config options for file system creation:
  # The compute node file system creation and consistency checking policies.
  fsck full
  mkfs never
  Confidential – Internal Use Only

  17. Compute Node Boot Process • Starts with /etc/beowulf/node_up • Calls /usr/lib/beoboot/bin/node_up • Usage: node_up <nodenumber> • Sets up: • System date • Basic network configuration • Kernel modules (device drivers) • Network routing • setup_fs • Name services • chroot • Prestages files (4.2.0+) • Other init scripts in /etc/beowulf/init.d Confidential – Internal Use Only

  18. Name services • /usr/lib/beoboot/bin/node_up populates /etc/hosts and /etc/nsswitch.conf on compute nodes • beo name service determines values from /etc/beowulf/config file • bproc name service determines values from current environment • ‘getent’ can be used to query entries • getent netgroup cluster • getent hosts 10.54.0.1 • getent hosts n3 • If system-config-authentication is run, ensure that proper entries still exist in /etc/nsswitch.conf (head node) Confidential – Internal Use Only

  19. BeoNSS Hostnames (Diagram: Master Node and compute nodes n0 n1 n2 n3 n4 n5 on the Interconnection Network, with Optional Disks and an Internet or Internal Network) • Opportunity: We control IP address assignment • Assign node IP addresses in node order • Changes name lookup to addition • Master: 10.54.0.1, GigE Switch: 10.54.10.0, IB Switch: 10.54.11.0, NFS/Storage: 10.54.30.0, Nodes: 10.54.50.$node • Name format • Cluster hostnames have the base form n<N> • Options for admin-defined names and networks • Special names for "self" and "master" • Current machine is ".-2" or "self" • Master is known as ".-1", "master", "master0" Confidential – Internal Use Only

  20. Changes • Prior to 4.2.0 • Hostnames default to .<NODE> form • /etc/hosts had to be populated with alternative names and IP addresses • May break @cluster netgroup and hence NFS exports • /etc/passwd and /etc/group needed on compute nodes for Torque • 4.2.0+ • Hostnames default to n<NODE> form • Configuration is driven by /etc/beowulf/config and beoNSS • Username and groups can be provided by kickback daemon for Torque Confidential – Internal Use Only

  21. Compute Node Boot Process • Starts with /etc/beowulf/node_up • Calls /usr/lib/beoboot/bin/node_up • Usage: node_up <nodenumber> • Sets up: • System date • Basic network configuration • Kernel modules (device drivers) • Network routing • setup_fs • Name services • chroot • Prestages files (4.2.0+) • Other init scripts in /etc/beowulf/init.d Confidential – Internal Use Only

  22. ClusterWare Filecache functionality • Provided by the filecache kernel module • Configured by /etc/beowulf/config libraries directives • Dynamically controlled by ‘bplib’ • Capabilities exist in all ClusterWare 4 versions • 4.2.0 adds the prestage keyword in /etc/beowulf/config • Prior versions needed additional scripts in /etc/beowulf/init.d • For libraries listed in /etc/beowulf/config, files can be prestaged by running ‘md5sum’ on the file:
  # Prestage selected libraries. The keyword is generic, but the current
  # implementation only knows how to "prestage" a file that is open'able on
  # the compute node: through the libcache, across NFS, or already exists
  # locally (which isn't really a "prestaging", since it's already there).
  prestage_libs=`beoconfig prestage`
  for libname in $prestage_libs ; do
      # failure isn't always fatal, so don't use run_cmd
      echo "node_up: Prestage file:" $libname
      bpsh $NODE md5sum $libname > /dev/null
  done
  Confidential – Internal Use Only
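As a rough sketch, the corresponding directives in /etc/beowulf/config might look like the lines below; the paths are illustrative and the exact keyword syntax should be checked against ‘man beowulf-config’:

  # Files/directories served to compute nodes through the filecache
  libraries /lib64 /usr/lib64
  # 4.2.0+: pull selected files onto compute nodes during boot
  prestage /usr/lib64/libmpi.so /opt/app/lib/libapp.so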

  23. Compute Node Boot Process • Starts with /etc/beowulf/node_up • Calls /usr/lib/beoboot/bin/node_up • Usage: node_up <nodenumber> • Sets up: • System date • Basic network configuration • Kernel modules (device drivers) • Network routing • setup_fs • Name services • chroot • Prestages files (4.2.0+) • Other init scripts in /etc/beowulf/init.d Confidential – Internal Use Only

  24. Compute nodes init.d scripts • Located in /etc/beowulf/init.d • Scripts start on the head node and need explicit bpsh and beomodprobe to operate on compute nodes • $NODE has been prepopulated by /usr/lib/beoboot/bin/node_up • Order is based on file name • Numbered files can be used to control order • beochkconfig is used to set +x bit on files Confidential – Internal Use Only
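A minimal sketch of such a script; the file name, directory, and sysctl setting are illustrative, and $NODE is assumed to be set by node_up as described above:

  #!/bin/sh
  # /etc/beowulf/init.d/20_scratch - runs on the head node once per booting compute node
  bpsh $NODE mkdir -p /scratch/tmp
  bpsh $NODE chmod 1777 /scratch/tmp
  bpsh $NODE sysctl -w vm.overcommit_memory=1
  exit 0

Make the script executable (beochkconfig or chmod +x) so node_up will run it at boot.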

  25. Cluster Configuration • /etc/beowulf/config is the central location for cluster configuration • Features are documented in ‘man beowulf-config’ • Compute node order is determined by ‘node’ parameters • Changes can be activated by doing a ‘service beowulf reload’ Confidential – Internal Use Only
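For illustration only, a fragment of /etc/beowulf/config might contain node entries like these (MAC addresses are placeholders; see ‘man beowulf-config’ for the full syntax), with changes activated by ‘service beowulf reload’:

  # The order of 'node' entries determines node numbers: first entry is n0, next is n1, ...
  node 00:25:90:aa:bb:01
  node 00:25:90:aa:bb:02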

  26. Orientation Agenda – Part 1 • Scyld ClusterWare foundations • Booting process • Startup scripts • File systems • Name services • Cluster Configuration • Cluster Components • Networking infrastructure • NFS File servers • IPMI configuration • Break Confidential – Internal Use Only

  27. Elements of Cluster Systems (Diagram: Master Node, Interconnection Network, Optional Disks, Internet or Internal Network) • Some important elements of a cluster system • Booting and Provisioning • Process creation, monitoring and control • Update and consistency model • Name services • File Systems • Physical management • Workload virtualization Confidential – Internal Use Only

  28. Compute Node Boot Process • Starts with /etc/beowulf/node_up • Calls /usr/lib/beoboot/bin/node_up • Usage: node_up <nodenumber> • Sets up: • System date • Basic network configuration • Kernel modules (device drivers) • Network routing • setup_fs • Name services • chroot • Prestages files (4.2.0+) • Other init scripts in /etc/beowulf/init.d Confidential – Internal Use Only

  29. Remote Filesystems (Diagram: Master Node, Interconnection Network, Optional Disks, Internet or Internal Network) • Remote - Share a single disk among all nodes • Every node sees same filesystem • Synchronization mechanisms manage changes • Locking has either high overhead or causes serial blocking • "Traditional" UNIX approach • Relatively low performance • Doesn't scale well; server becomes bottleneck in large systems • Simplest solution for small clusters, reading/writing small files Confidential – Internal Use Only

  30. NFS Server Configuration • Head node NFS services • Configuration in /etc/exports • Provides system files (/bin, /usr/bin) • Increase the number of NFS daemons: • echo "RPCNFSDCOUNT=64" > /etc/sysconfig/nfs ; service nfs restart • Dedicated NFS server • SLES10 was recommended; RHEL5 now includes some xfs support • xfs has better performance • The OS has better IO performance than RHEL4 • Network trunking can be used to increase bandwidth (with caveats) • Hardware RAID • Adaptec RAID card • CTRL-A at boot • arcconf utility from http://www.adaptec.com/en-US/support/raid/ • External storage (Xyratex or nStor) • SAS-attached • Fibre Channel attached Confidential – Internal Use Only
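A sketch of head-node /etc/exports entries for the private cluster network; the paths and subnet follow the conventions used elsewhere in these slides and will differ per site:

  # /etc/exports (illustrative)
  /home   10.54.0.0/255.255.0.0(rw,sync,no_root_squash)
  /opt    10.54.0.0/255.255.0.0(ro,async)

After editing, ‘exportfs -ra’ (or the ‘service nfs restart’ above) re-exports the list.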

  31. Network trunking • Use multiple physical links as a single pipe for data • Configuration must be done on host and switch • SLES 10 configuration • Create a configuration file /etc/sysconfig/network/ifcfg-bond0 for the bond0 interface:
  BOOTPROTO=static
  DEVICE=bond0
  IPADDR=10.54.30.0
  NETMASK=255.255.0.0
  STARTMODE=onboot
  MTU=''
  BONDING_MASTER=yes
  BONDING_SLAVE_0=eth0
  BONDING_SLAVE_1=eth1
  BONDING_MODULE_OPTS='mode=0 miimon=500'
  Confidential – Internal Use Only

  32. Network trunking • HP switch configuration • Create trunk group via serial or telnet interface • Netgear (admin:password) • Create trunk group via http interface • Cisco • Create etherchannel configuration Confidential – Internal Use Only

  33. External Storage • Xyratex arrays have a configuration interface • Text based via serial port • Newer devices (nStor 5210, Xyratex F/E 5402/5412/5404) have embedded StorView • http://storage0:9292 • admin:password • RAID arrays, logical drives are configured and monitored • LUNs are numbered and presented on each port. Highest LUN is the controller itself • Multipath or failover needs to be configured Confidential – Internal Use Only

  34. Need for QLogic Failover • Collapse LUN presentation in the OS to a single instance per LUN • Minimize the potential for user error while maintaining failover and static load balancing Confidential – Internal Use Only

  35. Physical Management • ipmitool • Intelligent Platform Management Interface (IPMI) is implemented by the baseboard management controller (BMC) • Serial-over-LAN (SOL) can be implemented • Allows access to hardware such as sensor data or power states • E.g. ipmitool -H n$NODE-ipmi -U admin -P admin power {status,on,off} • bpctl • Controls the operational state and ownership of compute nodes • Examples might be to reboot or power off a node • Reboot: bpctl -S all -R • Power off: bpctl -S all -P • Limit user and group access to run on a particular node or set of nodes Confidential – Internal Use Only
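For example, a power-status sweep over the first few nodes might look like this sketch (hostname pattern and credentials follow the slide's example):

  # Query the BMC power state of nodes 0-9 over the IPMI network
  for node in $(seq 0 9); do
      echo -n "n$node: "
      ipmitool -H n$node-ipmi -U admin -P admin power status
  done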

  36. IPMI Configuration • Full spec is available here: • http://www.intel.com/design/servers/ipmi/pdf/IPMIv2_0_rev1_0_E3_markup.pdf • Penguin-specific configuration • Recent products all have IPMI implementations. Some are in-band (share physical media with eth0), some are out-of-band (separate port and cable from eth0) • Altus 1300, 600, 650 – In-band, lan channel 6 • Altus 1600, 2600, 1650, 2650; Relion 1600, 2600, 1650, 2650, 2612 – Out-of-band, lan channel 2 • Relion 1670 – In-band, lan channel 1 • Altus x700/x800, Relion x700 – Out-of-band OR in-band, lan channel 1 • Some ipmitool versions have a bug and need the following command to commit a write: • bpsh $NODE ipmitool raw 12 1 $CHANNEL 0 0 Confidential – Internal Use Only
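As a sketch, setting a node's BMC LAN parameters in-band might look like the following; the channel, credentials, and addresses are illustrative and follow the x.x.150.$NODE convention from the subnet slides:

  CHANNEL=1
  bpsh $NODE ipmitool lan set $CHANNEL ipsrc static
  bpsh $NODE ipmitool lan set $CHANNEL ipaddr 10.54.150.$NODE
  bpsh $NODE ipmitool lan set $CHANNEL netmask 255.255.0.0
  bpsh $NODE ipmitool lan print $CHANNEL    # verify the settings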

  37. Orientation Agenda – Part 2 • Parallel jobs • MPI configuration • Infiniband interconnect • Queueing • Initial setup • Tuning • Policy case studies • Other software and tools • Questions and Answers Confidential – Internal Use Only

  38. Explicitly Parallel Programs • Different paradigms exist for parallelizing programs • Shared memory • OpenMP • Sockets • PVM • Linda • MPI • Most distributed parallel programs are now written using MPI • Different options for MPI stacks: MPICH, OpenMPI, HP, and Intel • ClusterWare comes integrated with customized versions of MPICH and OpenMPI Confidential – Internal Use Only

  39. Compiling MPICH programs • mpicc, mpiCC, mpif77, mpif90 are used to automatically compile code and link in the correct MPI libraries from /usr/lib64/MPICH • GNU, PGI, and Intel compilers are supported • They effectively set the libraries and includes for compiling and linking:
  prefix="/usr"
  part1="-I${prefix}/include"
  part2=""
  part3="-lmpi -lbproc"
  …
  part1="-L${prefix}/${lib}/MPICH/p4/gnu $part1"
  …
  $cc $part1 $part2 $part3
  Confidential – Internal Use Only
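Typical usage is a drop-in replacement for the compiler, for example (source file names are illustrative):

  mpicc -O2 -o hello hello_mpi.c     # C program linked against MPICH
  mpif90 -O2 -o solver solver.f90    # Fortran 90 equivalent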

  40. Running MPICH programs • mpirun is used to launch MPICH programs • If Infiniband is installed, the interconnect fabric can be chosen using the machine flag: • -machine p4 • -machine vapi • Done by changing LD_LIBRARY_PATH at runtime • export LD_LIBRARY_PATH=${libdir}/MPICH/${MACHINE}/${compiler}:${LD_LIBRARY_PATH} • Hooks for using mpiexec with the queue system:
  elif [ -n "${PBS_JOBID}" ]; then
      for var in NP NO_LOCAL ALL_LOCAL BEOWULF_JOB_MAP
      do
          unset $var
      done
      for hostname in `cat $PBS_NODEFILE`
      do
          NODENUMBER=`getent hosts ${hostname} | awk '{print $3}' | tr -d '.'`
          BEOWULF_JOB_MAP="${BEOWULF_JOB_MAP}:${NODENUMBER}"
      done
      # Clean a leading : from the map
      export BEOWULF_JOB_MAP=`echo ${BEOWULF_JOB_MAP} | sed 's/^://g'`
      # The -n 1 argument is important here
      exec mpiexec -n 1 ${progname} "$@"
  Confidential – Internal Use Only

  41. Environment Variable Options • Additional environment variable control: • NP — The number of processes requested, but not the number of processors. As in the example earlier in this section, NP=4 ./a.out will run the MPI program a.out with 4 processes. • ALL_CPUS — Set the number of processes to the number of CPUs available to the current user. Similar to the example above, ALL_CPUS=1 ./a.out would run the MPI program a.out on all available CPUs. • ALL_NODES — Set the number of processes to the number of nodes available to the current user. Similar to the ALL_CPUS variable, but you get a maximum of one CPU per node. This is useful for running a job per node instead of per CPU. • ALL_LOCAL — Run every process on the master node; used for debugging purposes. • NO_LOCAL — Don't run any processes on the master node. • EXCLUDE — A colon-delimited list of nodes to be avoided during node assignment. • BEOWULF_JOB_MAP — A colon-delimited list of nodes. The first node listed will be the first process (MPI Rank 0) and so on. Confidential – Internal Use Only
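Combining these, typical invocations might look like the sketch below (program name and node numbers are illustrative):

  NP=4 NO_LOCAL=1 ./a.out           # 4 processes, none on the master node
  BEOWULF_JOB_MAP=0:0:1:1 ./a.out   # ranks 0-1 on node 0, ranks 2-3 on node 1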

  42. Running MPICH programs • Prior to ClusterWare 4.1.4, mpich jobs were spawned outside of the queue system • BEOWULF_JOB_MAP had to be set based on the machines listed in $PBS_NODEFILE:
  number_of_nodes=`cat $PBS_NODEFILE | wc -l`
  hostlist=`cat $PBS_NODEFILE | head -n 1`
  for i in $(seq 2 $number_of_nodes ) ; do
      hostlist=${hostlist}:`cat $PBS_NODEFILE | head -n $i | tail -n 1`
  done
  BEOWULF_JOB_MAP=`echo $hostlist | sed 's/\.//g' | sed 's/n//g'`
  export BEOWULF_JOB_MAP
  • Starting with ClusterWare 4.1.4, mpiexec was included with the distribution. mpiexec is an alternative spawning mechanism that starts processes as part of the queue system • Other MPI implementations have alternatives. HP-MPI and Intel MPI use rsh and run outside of the queue system. OpenMPI uses libtm to properly start processes Confidential – Internal Use Only

  43. MPI Primer • Only a brief introduction to MPI is provided here. Many other in-depth tutorials are available on the web and in published sources. • http://www.mpi-forum.org/docs/mpi-11-html/mpi-report.html • http://www.llnl.gov/computing/tutorials/mpi/ • Paradigms for writing parallel programs depend upon the application • SIMD (single-instruction multiple-data) • MIMD (multiple-instruction multiple-data) • MISD (multiple-instruction single-data) • SIMD will be presented here as it is a commonly used template • A single application source is compiled to perform operations on different sets of data • The data is read by the different threads or passed between threads via messages (hence MPI = message passing interface) • Contrast this with shared memory or OpenMP, where data is shared locally via memory • Optimizations in the MPI implementation can perform localhost optimization; however, the program is still written using a message-passing construct • The MPI specification has many functions; however, most MPI programs can be written with only a small subset Confidential – Internal Use Only

  44. Infiniband Primer • Infiniband provides a low-latency, high-bandwidth interconnect for message passing, minimizing IO overhead for tightly coupled parallel applications • Infiniband requires hardware, kernel drivers, O/S support, user-land drivers, and application support • Prior to 4.2.0, the software stack was provided by SilverStorm • Starting with 4.2.0, ClusterWare migrated to the OpenFabrics (OFED, OpenIB) stack Confidential – Internal Use Only

  45. Infiniband Subnet Manager • Every Infiniband network requires a Subnet Manager to discover and manage the topology • Our clusters typically ship with a Managed QLogic Infiniband switch with an embedded subnet manager (10.54.0.20; admin:adminpass) • Subnet Manager is configured to start at switch boot • Alternatively, a software Subnet Manager (e.g. openSM) can be run on a host connected to the Infiniband fabric. • Typically the embedded subnet manager is more robust and provides a better experience Confidential – Internal Use Only

  46. Communication Layers • Verbs API (VAPI) provides a hardware specific interface to the transport media • Any program compiled with VAPI can only run on the same hardware profile and drivers • Makes portability difficult • Direct Access Programming Language (DAPL) provides a more consistent interface • DAPL layers can communicate with IB, Myrinet, and 10GigE hardware • Better portability for MPI libraries • TCP/IP interface • Another upper layer protocol provides IP-over-IB (IPoIB) where the IB interface is assigned an IP address and most standard TCP/IP applications work Confidential – Internal Use Only

  47. MPI Implementation Comparison • MPICH is provided by Argonne National Labs • Runs only over Ethernet • Ohio State University has ported MPICH to use the Verbs API => MVAPICH • Similar to MPICH but uses Infiniband • LAM-MPI was another implementation which provided a more modular format • OpenMPI is the successor to LAM-MPI and has many options • Can use different physical interfaces and spawning mechanisms • http://www.openmpi.org • HP-MPI, Intel-MPI • Licensed MPICH2 code and added functionality • Can use a variety of physical interconnects Confidential – Internal Use Only

  48. OpenMPI Configuration • ./configure --prefix=/opt/openmpi --with-udapl=/usr --with-tm=/usr --with-openib=/usr --without-bproc --without-lsf_bproc --without-grid --without-slurm --without-gridengine --without-portals --without-gm --without-loadleveler --without-xgrid --without-mx --enable-mpirun-prefix-by-default --enable-static • make all • make install • Create scripts in /etc/profile.d to set default environment variables for all users • mpirun -v -mca pls_rsh_agent rsh -mca btl openib,sm,self -machinefile machinefile ./IMB-MPI1 Confidential – Internal Use Only
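A sketch of such a profile script, using the install prefix from the configure line above (the file name is illustrative):

  # /etc/profile.d/openmpi.sh - default OpenMPI environment for all users
  export PATH=/opt/openmpi/bin:$PATH
  export LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH
  export MANPATH=/opt/openmpi/share/man:$MANPATH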

  49. Queuing • How are resources allocated among multiple users and/or groups? • Statically, by using bpctl user and group permissions • ClusterWare supports a variety of queuing packages • TaskMaster (advanced MOAB policy-based scheduler integrated with ClusterWare) • Torque • SGE Confidential – Internal Use Only

  50. Interacting with TaskMaster • Because TaskMaster uses the MOAB scheduler with Torque pbs_server and pbs_mom components, all of the Torque commands are still valid • qsub will submit a job to Torque, MOAB then polls pbs_server to detect new jobs • msub will submit a job to Moab which then pushes the job to pbs_server • Other TaskMaster commands • qstat -> showq • qdel, qhold, qrls -> mjobctl • pbsnodes -> showstate • qmgr -> mschedctl, mdiag • Configuration in /opt/moab/moab.cfg Confidential – Internal Use Only
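For illustration, a minimal Torque job script and its submission might look like this sketch (job name, resource request, and program are placeholders; mpiexec usage follows the earlier MPICH slides):

  #!/bin/sh
  #PBS -N testjob
  #PBS -l nodes=2:ppn=4
  #PBS -j oe
  cd $PBS_O_WORKDIR
  mpiexec -n 1 ./a.out

  qsub job.sh    # submit through Torque; MOAB polls pbs_server and schedules it
  msub job.sh    # or submit to MOAB directly, which pushes the job to pbs_server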
