XC 3.0 SystemsAdministration

XC 3.0 SystemsAdministration XC Systems administration Fred Cassirer January 25, 2006

Agenda • Management Overview • Command broadcast • System services • Configuration and management database • Monitoring the System • Remote console • Firewall • Customization • Network device naming • Troubleshooting

Management Philosophy • Simple is better • Minimize load on application nodes, concentrate management processing on management hubs where possible • Use industry standard tools and techniques • Provide the necessary “glue”, configuration, and policy for best in class cluster management

Components • Database • MySQL server using schema designed for HP’s HPC environment • Nagios • System monitoring, reporting, management, eventing, and notification • SuperMon • Optimized metric aggregation across entire cluster. • Syslog-NG • Next Generation syslog replaces syslogd functionality and provides tiered aggregation of system logs • Pdsh • Parallel distributed command execution and copy services • Console management • Log and control console output

XC Management Stack Monitoring & logging Distributed commands Nagios cmf pdsh pdcp Syslog Super-Mon Nagiosplug-ins Installation Configuration Kickstart XCConfig SystemImager XC Database mySQL Red Hat Compatible Distribution Orange components are open tools configured / adapted for use by XC.

System Operations:Monitoring and logging SystemFiles Nagios Mgmt Server XCDB syslog-ng forwarding Supermon aggregation Aggregation points MgmtHub MgmtHub MgmtHub

pdsh (1) • Multithreaded remote shell • Used to send a shell command to multiple nodes • [root@n5 root]# pdsh -a hostname • n5: n5 • n4: n4 • n3: n3 • n1: n1 • n2: n2 • [root@n5 root]#

pdsh (2) • Options: • -a all nodes • -f # sets max number of similtaneous commands • -w nodelist sets a list of target nodes • -x nodelist exclude a list of nodes Where « nodelist » is standard node list syntax such as: n[1-50,75,80-1044]

cexec • A shell script that invoke pdsh to perform commands on a group of hosts • The host group has previously been defined with the hostgroup command # hostgroup -c group1 # hostgroup -a n1 group1 n1 # hostgroup -a n2 group1 n2, n1 # cexec -r group1 hostname n1: n1 n2: n2

pdcp • Copy files to multiple nodes: pdcp –a –n `nodename` /etc/passwd /etc/passwd

Services in XC cluster (1)

Displaying all services • shownode servers cmf: n[15-16] compute_engine: n[10-14,16] dbserver: n16 dhcp: n16 gather_data: n[10-16] gmmon: n16 hpasm: n[10-16] hptc-lm: n16 hptc_cluster_fs: n16 hptc_cluster_fs_client: n[10-16] httpd: n16 imageserver: n16 iptables: n[10-16] lkcd: n[10-16] lsf: n16 mpiic: n16 munge: n[10-16] nagios: n16 nagios_monitor: n[15-16] nat: n16 network: n[10-16] nfs_server: n16 nrpe: n[10-16] nsca: n16 ntp: n16 pdsh: n[10-16] pwrmgmtserver: n16 slurm: n16 supermond: n[15-16] swmlogger: n16 syslogng_forward: n[15-16]

Displaying nodes that provide a given service shownode servers [service] shownode servers imageserver n8 shownode servers ntp n8 shownode servers lvs n[6-7]

Displaying the services provided by a given node • shownode services n7 mond lsf network lkcd slurm_compute slurm_controller lvs iptables cmf_client hptc_cluster_fs_client syslogng gather_data pdsh nrpe hpasm slurm_launch • shownode services n8 clients • cmf: n[1-7] • hptc_cluster_fs: n[1-7] • nagios: n[1-7] • nat: n[1-7] • ntp: n[1-8] • supermond: n[1-7] • syslogng_forward: n[1-7]

Displaying which are the service providers for a given node shownode services n3 servers cmf: n8 hptc_cluster_fs: n8 nagios: n8 nat: n[8,10] ntp: n8 supermond: n8 syslogng_forward: n8

Starting/stopping a service • on one node: service slurm stop service slurm restart • On a group of nodes: pdsh –w[n1-4] service ntpd stop pdsh –w[n1-4] service ntpd restart • On all nodes: pdsh –a service ntpd stop pdsh –a service ntpd restart

Adding a new service • /opt/hptc/config/roles_service.ini ….. [roles_to_services] compute = <<EOT slurm_compute mond nrpe EOT external = nat disk_io = <<EOT nfs_server EOT login = lvs management_hub = <<EOT supermond syslogng_forward EOT …. Service « stanza » Service name (compute, login, etc)

Adding a service to an existing role • cd /opt/hptc/config • cp roles_services.ini roles_services.ini.ORIG • Add the new service inside the role (in between <<EOT and EOT ) • Create the appropriate gconfig and nconfig files • reset_db • /opt/hptc/config/sbin/cluster_prep • /opt/hptc/config/sbin/discover –system –verbose • /opt/hptc/config/sbin/cluster_config

Creating a new role and adding a new service to it • cd /opt/hptc/config • cp roles_services.ini roles_services.ini.ORIG • Insert the new role stanza • Insert the appropriate services list in the stanza • Create the appropriate gconfig and nconfig files • reset_db • /opt/hptc/config/sbin/cluster_prep • /opt/hptc/config/sbin/discover –system –verbose • /opt/hptc/config/sbin/cluster_config

Configuration and management database • Key to the XC configuration • Keep track of: which node provides which service which node received which service network interface configuration enabled/disabled nodes And other things

Database • MySQL server version 4.0.20 • MySQL text-based client • Perl DBI interface for MySQL • HPC value added management tools • shownodes: displays information on node configuration, statistics, services and status • managedb: assists in three areas • backs up the entire database • archives supermon log table data • dumps entire database in human readable form • reset_db: if you need to run cluster_config or discover again

Database: using mysql directly (1) Location of sql database: /opt/hptc/database/lib # mysql -p Enter password: Welcome to the MySQL monitor. Commands end with ; or \g. Your MySQL connection id is 55969 to server version: 4.0.20-Max Type 'help;' or '\h' for help. Type '\c' to clear the buffer. mysql> show databases; +--------------+ | Database | +--------------+ | cmdb |  this is the XC Configuration and Management Database | hptc_archive | | mysql | | qsnet | | test

Database HPC value added: shownode • shownode --help • USAGE: /opt/hptc/bin/shownode subcommand [options] • subcommand is one of: • all list all nodes • metrics show various statistics • roles show information about node roles • servers show which nodes provide services to whom • services show which services are provided by which nodes • clients show who are the clients of the services • status show which nodes are up and which are down • enabled show which nodes are enabled • disabled show which nodes are disabled • config show configuration details for nodes and other hardware • /opt/hptc/bin/shownode subcommand --help for more details

Database HPC value added: shownodes (cont) [root@n9 root]# shownode roles common: n[1-9] compute: n[1-9] disk_io: n9 external: n[8-9] login: n[8-9] management_hub: n9 nat_client: n[1-7] node_management: n9 resource_management: n[8-9]

Database HPC value added: shownodes (cont) shownode enabled n1 n3 n4 n5 n6 n7 n8

Database HPC value added: shownodes (cont) shownode status n1 ON n2 ON n3 ON n4 ON n5 ON n6 ON n7 ON

Database HPC value added: shownodes (cont) # shownode config --help USAGE: /opt/hptc/bin/shownode config [options] { nodes [name ...] | cp_ports [name ...] | switches [name ...] | otherports [name ...] | hostgroups [name ...] | roles | sysparams | node_prefix | golden_client | all } options: --help this text --indent n indent each level by n spaces --labelwidth n print up to n characters of each label --perl output in format understood by perl --admininfo print extra information useful for system admins

Database HPC value added: enablenode and disablenode • The operation manipulates the is_enabled bit in database • Example: • setnode --disable n4 • setnode --disable n[3,4] • Control for cluster related commands • Currently not used by most services • Affects startsys(8) and stopsys(8) node lists • Disabled nodes will not be affected by these commands • Use shownode(1) to list • “shownode enabled” • “shownode disabled”

Database HPC value added:backup/restore cmdb managedb backup • Creates/opt/hptc/datbase/cmdbbackup-YYMMDDhhmmss.sql • Example: date Tue Jul 26 15:54:54 CEST 2005 managedb backup ls /opt/hptc/database/*.sql /opt/hptc/database/dbbackup-20050726155458.sql • Restore sequence: ls /opt/hptc/database/*.sql /opt/hptc/database/dbbackup-20050726155458.sql managedb restore /opt/hptc/database/dbbackup-20050726155153.sql mysql -u root -p cmdb < cmdbbackup-YYMMDDhhmmss.sql • Restoration will remove all old data and replace with data from the backup file.

Database HPC value added:archiving metrics data • Example 1: archiving metrics data older than four days: managedb archive –archive ./artfile_Tuesday 4d • Example 2: restoring metrics data: Managedb restore –archive ./artfile_Tuesday

Database HPC value added:purging metrics data • Metrics data on cmdb can be purged without archiving: • Example: purging data older than 2 weeks: managedb purge 2w

Monitoring the system • Health and Metrics • Nagios • Supermon • Syslog-ng • Console • CMF

Management roles • XC management services are part of three XC roles • Management Server • Denotes the node where the nagios master resides • Only *ONE* node may be assigned the management_server role • For V3.0 the management_server role must be assigned to the headnode (nh) • Management Hub • Denotes those nodes that provide management aggregation services. • Assign this role to as many nodes as you like. Cluster-config will by default assign 1 management_hub for every 150-200 compute nodes. • The Management Server node is also a management hub. • Supermon and CMF services are associated with a Management Hub • Common • Every node that is monitored has an XC “nrpe” service defined. • Every node that is monitored has an XC “mond” service defined • Console Services • By default, a console service role is assigned identically to Management Hubs. For systems where physical console connectivity is limited, assign this role to those nodes that have direct console access.

Mgmt Server – Nagios master and headnode, 1 per system. Mgmt Hub – Distributed management services such as nagios monitor, supermon and syslog-ng aggregator Console Services – Service nodes connected to the console network. Utilized by Mgmt Server/Hub’s to proxy requests that require console connectivity such as sensor collection, firmware log (SEL/IML) and console services (cmf) Console Services Monitoring Mgmt Server Mgmt Hub Mgmt Hub Console Services Console Services … …

Root Supermon – Supermond connects to all supermonds and manages a set of subnodes mond – Run on every node (even those with supermond) and collect metrics via the kernel and the “monhole” (if configured) Nagios/Nagios monitor – Runs the check_metrics plug-in periodically to cause supermon data collection and storage into the database Nagios master check_metrics Nagios monitor DB supermon Supermon manages requests for mond children supermon Manage requests for mond collection mond mond mond mond mond mond per-node data collector reports up to parent supermon “monhole” can be configured to pass any metric data for aggregation

Nagios • Version 2.0-b6 • No changes to Nagios engine for Perseus • Value add is in the dynamic configuration and template model used by XC • Easily extendable and configurable by the end user • Dynamically adapts to the cluster configuration based on XC service assignments (roles) • Distributed monitoring supported in V3.0! Supports failover for active plugins from nagios monitors • Graphically monitors nodes, environmentals, processes, power state, slurm, jobs, LSF, syslog, load average etc • Monitors ProCurve switches via SNMP. Checks for slow ethernet links (linkspeed <1GB) • Initiates supermon metric gathering

Nagios • Nagios gets all of its data from “plug-ins”, • Essentially shell/perl/C programs • return 0/1/2 for Ok, Warning, Critical • produce a status string on stdout. • Runs as the user “nagios” • nagiostats command new for V3.0 • Nagios master tracks distributed nagios monitors and reports if they fail to provide status • User customizable threshold values for all nodes/classes via nagios_vars.ini

Nagios Runs as user “nagios” Nagios demon does most all work on management nodes Serves as scheduler for supermon data collection LSF failover functionality invoked by nagios plug-in Dynamically creates nagios configuration files based on cluster topology and role assignment Offloads management from nagios clients Nagios Nagios master demon on nh Nagios Monitor Nagios Monitor Nagios Monitor Nagios clients Nagios process offloading nagios master Forwards results to master Nagios clients run mond for metrics gathering hourly root key synchronization

Navigation

Nagios Tactical Overview

Hostgroup summary (Roles)

Nagios master services

Management Server and Hub services • Apache HTTPS Server – Apache webserver status • Configuration Monitor – Updates node configuration • LSF Failover Monitor - Watch LSF and failover if needed • Monitor Power – Watch power nodes, report overall status • Nagios Monitor – Watch nagios master and monitor • Resource Monitor – Report squeue info • Root key synchronization – Report on ssh keys • Slurm Monitor – Report sinfo for each node • Supermon Metrics Monitor – Gather predetermined metrics • Syslog Alert Monitor – Watch for patterns in syslogalertrules file and report for each node • Load average - Provide per-node load average information and alerts • Environment - Provide per-node sensor reporting and alerts • Node information - Per node process, user, and disk statistics

Plug-ins that report data for individual nodes or switches • System Event Log - Monitor hardware event log for iLO and IPMI based systems and alert based on patterns in selRules file. • SFS monitoring - Monitor an attached SFS appliance • Procurve switch monitor - Gather switch status and metrics via SNMP

XC 3.0 SystemsAdministration

XC 3.0 SystemsAdministration

Presentation Transcript

Planmeca Proline XC

Passion for XC

Safe XC Soaring from Byron

HGP XC Wants Athletes Which Means HGP XC Wants You

Safe XC Soaring from Byron

RANDOLPH WILDCATS XC - 2014

KTM 2011 XC-F LAUNCH

XC ounter Group

Perry Pumas XC

XC Basics

WFHS XC 2013

XC Training Planner

Howard XC 2019

Introduce XC hardware and software

Safe XC Soaring from Byron

xc vikl