E.3. Common Cluster Behaviors: General

Loss of connectivity to a power switch or failure to fence a member

Common Causes: Serial power switch disconnected from controlling member. Network power switch disconnected from network.

Expected Behavior: Members controlled by the power switch cannot be shut down or restarted. In this case, if a member hangs, services will not fail over from any member controlled by the switch in question.

Verification: Run clustat to verify that services are still marked as running on the member, even though it is inactive according to membership.

Dissolution of the cluster quorum

Common Causes: A majority of cluster members (for example, 3 of 5 members) go offline.

Test Case: In a three-member cluster, stop the cluster software on two members.

Expected Behavior: All members which do not have controlling power switches reboot immediately. All services stop immediately and their states are not updated on the shared media (when running clustat, the service status blocks may still display that the service is running). Service managers exit. Cluster locks are lost and become unavailable.

Verification: Run clustat on one of the remaining active members.
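The majority rule in the common cause and test case above can be checked with a little integer arithmetic. The following is a minimal sketch only: the function name majority_offline is hypothetical, and it models nothing beyond the simple "more than half of the members are offline" condition stated above, not any other factor the cluster software may take into account.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit status 0) when a majority of
# cluster members is offline, per the simple-majority rule above.
#   $1: total number of cluster members
#   $2: number of members that are offline
majority_offline() {
    total=$1
    offline=$2
    # "Majority" means strictly more than half. Comparing 2 * offline
    # with total avoids fractional division in integer arithmetic.
    [ $((offline * 2)) -gt "$total" ]
}

# The test case above: two of three members stopped.
if majority_offline 3 2; then
    echo "3 members, 2 offline: quorum dissolves"
fi
```

With 5 members, the same check succeeds for 3 offline members and fails for 2, which matches the "3 of 5" example.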

Member loses participatory status in the cluster quorum but is not hung

Common Causes: Total loss of connectivity to other members.

Test Case: Disconnect all network cables from a cluster member.

Expected Behavior: If the member has no controlling power switches, it reboots immediately. Otherwise, it attempts to stop services as quickly as possible. If a quorum exists, the set of members comprising the cluster quorum will fence the member.
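The expected behavior above depends on two conditions, which can be laid out as a small decision sketch. This is an illustration for test planning, not cluster code: the function name and its yes/no arguments are hypothetical, and it reads the fencing sentence as applying to the case where a power switch controls the member.

```shell
#!/bin/sh
# Hypothetical helper: print what to expect when a member loses
# participatory status in the quorum but is not hung.
#   $1: "yes" if a power switch controls the member, otherwise "no"
#   $2: "yes" if the other members still form a quorum, otherwise "no"
expected_on_quorum_loss() {
    has_switch=$1
    quorum_elsewhere=$2
    if [ "$has_switch" != "yes" ]; then
        # No controlling power switch: the member reboots itself.
        echo "member reboots immediately"
        return 0
    fi
    echo "member stops services as quickly as possible"
    if [ "$quorum_elsewhere" = "yes" ]; then
        echo "quorum members fence the member"
    fi
}

expected_on_quorum_loss yes yes
```

Called with "no" as the first argument, the sketch prints only the immediate-reboot line.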

clumembd crashes

Test Case: killall -KILL clumembd

Expected Behavior: System reboot.

clumembd hangs, watchdog in use

Test Case: killall -STOP clumembd

Expected Behavior: System reboot may occur if clumembd hangs for longer than (failover_time - 1) seconds. The reboot is triggered externally by the watchdog timer.

clumembd hangs, no watchdog in use

Test Case: killall -STOP clumembd

Expected Behavior: System reboot may occur if clumembd hangs for longer than (failover_time) seconds. The reboot is triggered internally by clumembd.
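The two clumembd hang cases differ only in the threshold: (failover_time - 1) seconds with a watchdog and (failover_time) seconds without one. The sketch below works through that arithmetic. The function name hang_threshold is hypothetical, and the value 10 is an illustrative failover_time, not a documented default.

```shell
#!/bin/sh
# Hypothetical helper: print how many seconds clumembd may hang
# before a reboot may occur, following the two cases above.
#   $1: failover_time in seconds
#   $2: "yes" if a watchdog timer is in use, otherwise "no"
hang_threshold() {
    failover_time=$1
    watchdog=$2
    if [ "$watchdog" = "yes" ]; then
        # Watchdog in use: reboot is triggered externally, one second earlier.
        echo $((failover_time - 1))
    else
        # No watchdog: reboot is triggered internally by clumembd.
        echo "$failover_time"
    fi
}

hang_threshold 10 yes   # prints 9
hang_threshold 10 no    # prints 10
```

When running the killall -STOP test, leaving clumembd stopped for longer than the printed value is what makes the reboot possible.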

cluquorumd crashes

Test Case: killall -KILL cluquorumd

Expected Behavior: System reboot.

clusvcmgrd crashes

Test Case: killall -KILL clusvcmgrd

Expected Behavior: cluquorumd re-spawns clusvcmgrd, which runs the stop phase of all services. Services that are stopped are then started.

Verification: Consult system logs for a warning message from cluquorumd.

clulockd crashes

Test Case: killall -KILL clulockd

Expected Behavior: cluquorumd re-spawns clulockd. Locks may be unavailable (preventing service transitions) for a short period of time.

Verification: Consult system logs for a warning message from cluquorumd.
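The log check used to verify both the clusvcmgrd and clulockd cases can be done with grep. This is a minimal sketch under assumptions: the function name is hypothetical, and /var/log/messages is assumed as the log file, which depends on the syslog configuration of the member. It matches on the daemon name only, because the exact wording of the warning is not given here.

```shell
#!/bin/sh
# Hypothetical helper: print log lines that mention cluquorumd.
#   $1: log file to search (assumed default: /var/log/messages)
cluquorumd_messages() {
    logfile=${1:-/var/log/messages}
    # Match on the daemon name only; inspect the output for the warning.
    grep 'cluquorumd' "$logfile"
}
```

For example, running cluquorumd_messages | tail -n 20 right after the killall test shows the most recent matching entries.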

Unexpected system reboot without clean shutdown of cluster services

Common Causes: Any noted scenario which causes a system reboot.

Test Case: Run reboot -fn, or press the reset switch.

Expected Behavior: If a power switch controls the rebooting member and a cluster quorum exists, the member is also fenced (generally, power-cycled).

Loss of quorum during clean shutdown

Test Case: Stop cluster services (service clumanager stop) on all members.

Expected Behavior: Any remaining services are stopped uncleanly.

Verification: Consult the system logs for a warning message.

Successful STONITH fencing operation

Expected Behavior: Services on the member that was fenced are started elsewhere in the cluster, if possible.

Verification: Verify that services are, in fact, started after the member is fenced. This should only take a few seconds.

Unsuccessful fencing operation on cluster member

Common Causes: Power switch returned error status or is not reachable.

Test Case: Disconnect the power switch controlling a member and run reboot -fn on the member.

Expected Behavior: Services on a member which fails to be fenced are not started elsewhere in the cluster. If the member recovers, services on the cluster are restarted. Because there is no way to accurately determine the member's state, it is assumed to be still running even if heartbeats have stopped. Thus, all services should be reported as running on the down member.

Verification: Run clustat to verify that services are still marked as running on the member, even though it is inactive according to membership. Messages will be logged stating that the member is now in the PANIC state.

Error reading from one of the shared partitions

Test Case: Run dd to write zeros to the shared partition:

dd if=/dev/zero of=/dev/raw/raw1 bs=512 count=1
shutil -p /cluster/header

Expected Behavior: The event is logged. The data from the good shared partition is copied to the partition which returned errors.

Verification: A second read pass of the same data should not produce a second error message.

Error reading from both of the shared partitions

Common Causes: Shared media is either unreachable or both partitions have corruption.

Test Case: Unplug the SCSI or Fibre Channel cable from a member.

Expected Behavior: The event is logged. The configured action is taken to address loss of access to shared storage (reboot/halt/stop/ignore). The default action is to reboot.
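The four possible actions and the reboot default can be captured in a small validation sketch, for example in a test script that records which behavior to expect. The function name is hypothetical, and the sketch does not reflect how the cluster software itself stores or reads this setting.

```shell
#!/bin/sh
# Hypothetical helper: print the action expected on loss of access to
# shared storage, falling back to "reboot" when no action is given.
#   $1: configured action (reboot, halt, stop, or ignore); optional
storage_failure_action() {
    action=${1:-reboot}
    case "$action" in
        reboot|halt|stop|ignore)
            echo "$action"
            ;;
        *)
            echo "unknown action: $action" >&2
            return 1
            ;;
    esac
}

storage_failure_action        # prints reboot
storage_failure_action halt   # prints halt
```

Rejecting unknown values keeps a mistyped action in a test plan from being mistaken for one of the four listed behaviors.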