VERITAS Cluster Server for Solaris

VERITAS Cluster Server for Solaris Troubleshooting

Objectives
After completing this lesson, you will be able to:
• Monitor system and cluster status.
• Apply troubleshooting techniques in a VCS environment.
• Detect and solve VCS communication problems.
• Identify and solve VCS engine problems.
• Correct service group problems.
• Resolve problems with resources.
• Solve problems with agents.
• Correct resource type problems.
• Plan for disaster recovery.

Monitoring VCS
• VCS log files
• System log files
• The hastatus utility
• SNMP traps
• Event notification triggers
• Cluster Manager

VCS Log Entries
• Engine log: /var/VRTSvcs/log/engine_A.log
• View logs using the GUI or the hamsg command:
    hamsg engine_A
• Example entries (most recent last):
    TAG_D 2001/04/03 12:17:44 VCS:11022:VCS engine (had) started
    TAG_D 2001/04/03 12:17:44 VCS:10114:opening GAB library
    TAG_C 2001/04/03 12:17:45 VCS:10526:IpmHandle::recv peer exited errno 10054
    TAG_E 2001/04/03 12:17:52 VCS:10077:received new cluster membership
    TAG_E 2001/04/03 12:17:52 VCS:10080:Membership: 0x3, Jeopardy: 0x0
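The example entries share a visible layout: a tag, a date and time, and a VCS:<number>:<text> message. As a minimal sketch, assuming every engine log line follows the layout of these samples, the entries could be split into fields like this. The function name parse_entry and the field names are illustrative, not part of VCS, and the meaning of the tag letter is not interpreted.

```python
import re
from datetime import datetime

# Pattern inferred only from the sample entries on the slide:
# TAG_<letter> <date> <time> VCS:<message number>:<text>
ENTRY = re.compile(
    r"^TAG_(?P<tag>[A-Z]) "
    r"(?P<stamp>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"VCS:(?P<msg_id>\d+):(?P<text>.*)$"
)


def parse_entry(line):
    """Return a dict for one engine log line, or None if it does not match."""
    m = ENTRY.match(line.strip())
    if m is None:
        return None
    return {
        "tag": m.group("tag"),
        "time": datetime.strptime(m.group("stamp"), "%Y/%m/%d %H:%M:%S"),
        "msg_id": int(m.group("msg_id")),
        "text": m.group("text"),
    }


sample = """\
TAG_D 2001/04/03 12:17:44 VCS:11022:VCS engine (had) started
TAG_C 2001/04/03 12:17:45 VCS:10526:IpmHandle::recv peer exited errno 10054
TAG_E 2001/04/03 12:17:52 VCS:10080:Membership: 0x3, Jeopardy: 0x0
"""

entries = [parse_entry(line) for line in sample.splitlines()]
for e in entries:
    print(e["time"].time(), e["tag"], e["msg_id"], e["text"])
```

Lines that do not match the assumed layout return None, so a caller can skip or report them instead of failing.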

Agent Log Entries
• Agent logs are kept in /var/VRTSvcs/log.
• Log files are named AgentName_A.log.
• LogLevel attribute settings:
    • none
    • error (default setting)
    • info
    • debug
    • all
• To change the log level:
    hatype -modify res_type LogLevel debug

Troubleshooting Guide
Start by running hastatus -summary:
• Cluster communication problems are indicated by the message:
    Cannot connect to server -- Retry Later
• VCS engine startup problems are indicated by systems in one of the WAIT states.
• Service group, resource, or agent problems are indicated within the hastatus display.
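The three-way decision on this slide can be sketched as a small function. This is an illustration only: the function name triage is hypothetical, the sample strings are shortened stand-ins rather than real hastatus output, and the check for WAIT states assumes that system states appear as whole words ending in _WAIT.

```python
def triage(hastatus_summary):
    """Map `hastatus -summary` text to the troubleshooting area named on the slide."""
    if "Cannot connect to server" in hastatus_summary:
        return "cluster communication (check LLT and GAB)"
    # Assumption: system states are printed as bare words such as ADMIN_WAIT.
    if any(word.endswith("_WAIT") for word in hastatus_summary.split()):
        return "VCS engine startup (system in a WAIT state)"
    return "service group, resource, or agent (read the hastatus display)"


# Hypothetical, shortened outputs used only for illustration.
print(triage("Cannot connect to server -- Retry Later"))
print(triage("A  train11  STALE_ADMIN_WAIT  0"))
print(triage("A  train11  RUNNING  0"))
```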

Cluster Communication Problems
Run lltconfig to determine if LLT is running. If LLT is not running:
• Check the /etc/llttab file:
    • Verify that the node number is within range (0-31).
    • Verify that the cluster number is within range (0-255).
    • Determine whether the link directive is specified correctly (qf3 should be qfe, for example).
• Check the /etc/llthosts file:
    • Verify that node numbers are within range.
    • Verify that the system names match the entries in the llttab or sysname files.
• Check the /etc/VRTSvcs/conf/sysname file:
    • Make sure there is only one system name in the file.
    • Verify that the system name matches the entry in the llthosts file.
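Several of these checks are mechanical and can be sketched in a few lines. The example below assumes that each llthosts line holds a node number followed by a system name, and that sysname holds a single name; the function names and the sample file contents are hypothetical. It works on strings rather than reading the real files, so it can be tried anywhere.

```python
def check_llthosts(text):
    """Return ({node_id: name}, [problems]) for llthosts-style text."""
    hosts, problems = {}, []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if len(parts) != 2 or not parts[0].isdigit():
            problems.append(f"unparseable line: {line!r}")
            continue
        node_id, name = int(parts[0]), parts[1]
        if not 0 <= node_id <= 31:  # node number range from the slide
            problems.append(f"node number {node_id} out of range 0-31")
        hosts[node_id] = name
    return hosts, problems


def check_sysname(text, hosts):
    """Check that sysname holds exactly one name and that llthosts lists it."""
    names = text.split()
    if len(names) != 1:
        return [f"expected one system name, found {len(names)}"]
    if names[0] not in hosts.values():
        return [f"{names[0]} is not listed in llthosts"]
    return []


def cluster_number_ok(value):
    """Cluster number range from the slide (0-255)."""
    return 0 <= value <= 255


# Hypothetical file contents for illustration.
hosts, host_problems = check_llthosts("0 train11\n1 train12\n40 train13\n")
print(host_problems)
print(check_sysname("train11\n", hosts))
print(check_sysname("train11 train12\n", hosts))
```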

Problems with LLT
If LLT is running:
• Run lltstat -n to determine if systems can see each other on the LLT link.
• Check the physical network connection(s) if LLT cannot see each node.

    train11# lltconfig
    LLT is running
    train11# lltstat -n
    LLT node information:
        Node           State       Links
      * 0 train11      OPEN        2
        1 train12      CONNWAIT    2

    train12# lltconfig
    LLT is running
    train12# lltstat -n
    LLT node information:
        Node           State       Links
        0 train11      CONNWAIT    2
      * 1 train12      OPEN        2
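The lltstat -n output shown on this slide can be scanned for nodes that are not in the OPEN state. The following is a minimal sketch that assumes the column layout of the slide's example (an optional asterisk marking the local node, then node number, name, state, and link count); parse_lltstat is a hypothetical helper name.

```python
def parse_lltstat(output):
    """Parse `lltstat -n` text laid out like the slide's example."""
    nodes = []
    for line in output.splitlines():
        parts = line.split()
        local = bool(parts) and parts[0] == "*"  # asterisk marks the local node
        if local:
            parts = parts[1:]
        # Data rows have four fields: number, name, state, links.
        if len(parts) == 4 and parts[0].isdigit() and parts[3].isdigit():
            nodes.append({
                "id": int(parts[0]),
                "name": parts[1],
                "state": parts[2],
                "links": int(parts[3]),
                "local": local,
            })
    return nodes


sample = """\
LLT node information:
    Node           State       Links
  * 0 train11      OPEN        2
    1 train12      CONNWAIT    2
"""

nodes = parse_lltstat(sample)
not_open = [n["name"] for n in nodes if n["state"] != "OPEN"]
print("Nodes not in OPEN state:", not_open)
```

In the slide's example, each system reports itself as OPEN and its peer as CONNWAIT, which is the pattern that points to the physical network connections.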

Problems with GAB
Check GAB by running gabconfig -a:
• No port a membership indicates a GAB problem.
    • Check the seed number in /etc/gabtab.
    • If a node is not operational, and the cluster is therefore not seeded, force GAB to start:
        gabconfig -x
    • If GAB starts and immediately shuts down, check LLT and private network cabling.
• No port h membership indicates a VCS engine (had) startup problem.

No port memberships listed:
    # gabconfig -a
    GAB Port Memberships
    ===================================

HAD not running; GAB and LLT functioning:
    # gabconfig -a
    GAB Port Memberships
    ===================================
    Port a gen 24110002 membership 01
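The port a and port h rule above can be applied to captured gabconfig -a output. This sketch assumes each membership line has the form shown on the slide (Port <letter> gen <value> membership <members>); the function names are hypothetical and the membership string is stored without being decoded.

```python
import re

# Line format taken from the slide: "Port a gen 24110002 membership 01"
PORT_LINE = re.compile(r"^Port (?P<port>\w) gen (?P<gen>\w+) membership (?P<members>.*)$")


def parse_gabconfig(output):
    """Return {port letter: membership string} from `gabconfig -a` text."""
    ports = {}
    for line in output.splitlines():
        m = PORT_LINE.match(line.strip())
        if m:
            ports[m.group("port")] = m.group("members").strip()
    return ports


def diagnose(ports):
    """Apply the slide's rule: port a is GAB, port h is the VCS engine (had)."""
    if "a" not in ports:
        return "GAB problem: no port a membership"
    if "h" not in ports:
        return "had startup problem: no port h membership"
    return "GAB and had memberships present"


sample = """\
GAB Port Memberships
===================================
Port a gen 24110002 membership 01
"""

ports = parse_gabconfig(sample)
print(ports)
print(diagnose(ports))
```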

VCS Engine Startup Problems
Check the VCS engine (HAD) by running hastatus -sum:
• Check GAB and LLT if you see this message:
    Cannot connect to server -- Retry Later
• Verify that the main.cf file is valid and that system names match llthosts and llttab:
    hacf -verify /etc/VRTSvcs/conf/config
• Check for systems in WAIT states:
    • STALE_ADMIN_WAIT: The system has a stale configuration and no other system is in a RUNNING state.
    • ADMIN_WAIT: The system cannot build or obtain a valid configuration.

STALE_ADMIN_WAIT
To recover from the STALE_ADMIN_WAIT state:
• Visually inspect the main.cf file to determine whether it is valid.
• Edit the main.cf file, if necessary.
• Verify the syntax of main.cf, if modified:
    hacf -verify config_dir
• Start VCS on the system with the valid main.cf file:
    hasys -force system_name
• All other systems perform a remote build from the system now running.

ADMIN_WAIT
A system can be in the ADMIN_WAIT state under these circumstances:
• A .stale flag exists and the main.cf file has a syntax problem.
• A disk error affecting main.cf occurs during a local build.
• The system is performing a remote build and the last running system fails.
Restore main.cf and use the procedure for STALE_ADMIN_WAIT.

Identifying Other Problems
After verifying that HAD, LLT, and GAB are functioning properly, run hastatus -sum to identify problems in other areas:
• Service groups
• Resources
• Agents and resource types

Service Group Problems: Group Not Configured to Start or Run
• Service group not brought online automatically when VCS starts:
    • Check the AutoStart and AutoStartList attributes:
        hagrp -display service_group
• Service group not configured to run on the system:
    • Check the SystemList attribute.
    • Verify that the system name is included.

Service Group AutoDisabled
Autodisable occurs when:
• GAB sees a system but had is not running on the system.
• Resources of the service group are not fully probed on all systems in the SystemList.
• A particular system is visible through disk heartbeat only.
To recover:
• Make sure that the service group is offline on all systems in the SystemList attribute.
• Clear the AutoDisabled attribute:
    hagrp -autoenable service_group -sys system
• Bring the service group online.

Service Group Not Fully Probed
Usually a result of improperly configured resource attributes:
• Check the ProbesPending attribute:
    hagrp -display service_group
• Check which resources are not probed:
    hastatus -sum
• Check the Probed attribute for resources:
    hares -display
• To probe resources:
    hares -probe resource -sys system

Service Group Frozen
• Verify the value of the Frozen and TFrozen attributes:
    hagrp -display service_group
• Unfreeze the service group:
    hagrp -unfreeze group [-persistent]
• If you freeze persistently, you must unfreeze persistently.

Service Group Is Not Offline Elsewhere
• Determine which resources are online/offline:
    hastatus -sum
• Verify the State attribute:
    hagrp -display service_group
• Offline the group on the other system:
    hagrp -offline
• Flush the service group:
    hagrp -flush service_group -sys system

Service Group Waiting for Resource
• Review the IState attribute of all resources to determine which resource is waiting to go online.
• Use hastatus to identify the resource.
• Make sure the resource is offline (at the operating system level).
• Clear the internal state of the service group:
    hagrp -flush service_group -sys system
• Bring all other resources in the service group offline and try to bring these resources online on another system.
• Verify that the resource works properly outside VCS.
• Check for errors in attribute values.

Incorrect Local Name
A service group cannot be brought online if the system name is inconsistent in the llthosts, llttab, or main.cf files.
• Check each file for consistent use of system names.
• Correct any discrepancies.
• If main.cf is changed, stop and restart VCS.
• If llthosts or llttab is changed:
    • Stop VCS, GAB, and LLT.
    • Restart LLT, GAB, and VCS.

Concurrency Violations
• A concurrency violation occurs when a failover service group is online or partially online on more than one system.
• Notification is provided by the Violation trigger, which:
    • Is invoked on the system that caused the concurrency violation
    • Notifies the administrator and takes the service group offline on the system causing the violation
    • Is configured by default with the violation script in /opt/VRTSvcs/bin/triggers
    • Can be customized to:
        • Send a message to the system log.
        • Display a warning on all cluster systems.
        • Send e-mail messages.

Service Group Waiting for Resource to Go Offline
• Identify which resource is not offline:
    hastatus -summary
• Check logs.
• Manually bring the resource offline, if necessary.
• Configure the ResNotOff trigger for notification or action.

Resource Problems: Unable to Bring Resources Online
Possible causes of failure while bringing resources online:
• Waiting for child resources
• Stuck in a WAIT state
• Agent not running

Problems Bringing Resources Offline
• Waiting for parent resources to come offline
• Waiting for a resource to respond
• Agent not running

Critical Resource Faults
• Determine which critical resource has faulted:
    hastatus -summary
• Make sure that the resource is offline.
• Examine the engine log.
• Fix the problem.
• Verify that the resources work properly outside of VCS.
• Clear the fault in VCS.

Clearing Faults
After external problems are fixed:
• Clear any faults on nonpersistent resources:
    hares -clear resource -sys system
• Check attribute fields for incorrect or missing data.
• If the service group is partially online:
    • Flush wait states:
        hagrp -flush service_group -sys system
    • Bring resources offline before bringing them online.

Agent Problems: Agent Not Running
• Determine whether the agent for that resource is FAULTED:
    hastatus -summary
• Use the ps command to verify that the agent process is not running.
• Check the log files for:
    • Incorrect pathname for the agent binary
    • Incorrect agent name
    • Corrupt agent binary

Resource Type Problems
A corrupted type definition can cause agents to fail by passing invalid arguments.
• Verify that the agent works properly outside of VCS.
• Verify values for the ArgList and ArgListValues type attributes:
    hatype -display res_type
• Restart the agent after making changes:
    haagent -start res_type -sys system

Planning for Disaster Recovery
• Back up key VCS files:
    • types.cf and customized types files
    • main.cf
    • main.cmd
    • sysname
    • LLT and GAB configuration files
    • Customized trigger scripts
    • Customized agents
• Use hagetcf to create an archive.

The hasnap Utility
Use the hasnap command to take snapshots of VCS configuration files on each node in the cluster. You can also restore the configuration from a snapshot.
• hasnap -backup: Backs up files in a snapshot format.
• hasnap -restore: Restores a previously created snapshot.
• hasnap -display: Displays details of previously created snapshots.
• hasnap -sdiff: Displays files that were changed on the local system after a specific snapshot was created.
• hasnap -fdiff: Displays the differences between a file in the cluster and its copy stored in a snapshot.
• hasnap -export: Exports a snapshot from the local, predefined directory to the specified file.
• hasnap -include: Configures the list of files or directories to be included in new snapshots, in addition to those included automatically by the -backup command.
• hasnap -exclude: Configures the list of files or directories to be excluded from new snapshots when backing up the configuration using the -backup command.
• hasnap -delete: Deletes snapshots from the predefined local directory on each node.

Summary
You should now be able to:
• Monitor system and cluster status.
• Apply troubleshooting techniques in a VCS environment.
• Resolve communication problems.
• Identify and solve VCS engine problems.
• Correct service group problems.
• Resolve problems with resources.
• Solve problems with agents.
• Correct resource type problems.
• Plan for disaster recovery.
