4.8. Handling Failed Services

The cluster places a service in the Failed state if it cannot start the service on any member and then cannot cleanly stop it. A Failed state can have various causes, such as a misconfiguration introduced while the service is running, or a service hang or crash. The Cluster Status Tool displays the service as Failed.

Figure 4-2. Service in Failed State

Note

You must disable a Failed service before you can modify or re-enable the service.

Handle Failed services with care. If service resources are still configured on the owner member, starting the service on another member can cause serious problems. For example, if a file system remains mounted on the owner member and you start the service on another member, the file system is mounted on both members at once, which can corrupt data. If the enable fails, the service remains in the Disabled state.
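Before starting the service on another member, it can help to confirm that the owner member has released the service's file systems. The following is a minimal Python sketch, not part of the cluster software, that checks text in /proc/mounts format for a given mount point. The function name is_mounted and the mount point /mnt/service_data are hypothetical and used only for illustration.

```python
def is_mounted(mount_point, mounts_text):
    """Return True if mount_point is listed as a mount target in mounts_text.

    mounts_text is expected in /proc/mounts format, where the second
    whitespace-separated field on each line is the mount point.
    Assumption: the mount point contains no spaces (the kernel writes
    those as the escape sequence \\040, which this sketch does not decode).
    """
    # Normalize trailing slashes so "/mnt/data/" matches "/mnt/data".
    target = mount_point.rstrip("/") or "/"
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and (fields[1].rstrip("/") or "/") == target:
            return True
    return False
```

On the owner member, you could pass the contents of /proc/mounts as mounts_text and the service's mount point (for example, the hypothetical /mnt/service_data) as mount_point. A True result suggests the file system still needs to be unmounted there before the service is enabled elsewhere.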

After highlighting the service and clicking Disable, you can attempt to correct the problem that caused the Failed state. After you modify the service, the cluster software enables the service on the owner member, if possible; otherwise, the service remains in the Disabled state. The following list details steps to follow in the event of service failure:

  1. Modify cluster event logging to log debugging messages. Viewing the logs can help determine problem areas. Refer to Section 8.6 Modifying Cluster Event Logging for more information.

  2. Use the Cluster Status Tool to attempt to enable or disable the service on one of the cluster members or failover domain members. Refer to Section 4.3 Disabling a Service and Section 4.4 Enabling a Service for more information.

  3. If the service does not start or stop on the member, examine the /var/log/messages and (if configured to log separately) /var/log/cluster log files, and diagnose and correct the problem. You may need to modify the service to fix incorrect information in the cluster configuration file (for example, an incorrect start script), or you may need to perform manual tasks on the owner member (for example, unmounting file systems).

  4. Repeat the attempt to enable or disable the service on the member. If repeated attempts fail to correct the problem and enable or disable the service, reboot the member.

  5. If you are still unable to start the service, verify that it can be started manually outside of the cluster framework. For example, this may include manually mounting the file systems and manually running the service start script.
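When examining the log files mentioned in step 3, it is often easier to look only at the lines that mention the affected service. The following is a minimal Python sketch, not part of the cluster software, that filters log text by service name. The function name service_log_lines is hypothetical, and the sketch makes no assumption about the format of the cluster's log messages beyond the service name appearing in them.

```python
import re


def service_log_lines(log_text, service_name):
    """Return the lines of log_text that mention service_name as a whole word.

    A whole-word match keeps a service named "nfs" from also matching
    unrelated words such as "nfsd".
    """
    # Lookarounds require that the name is not embedded in a longer word.
    pattern = re.compile(r"(?<!\w)" + re.escape(service_name) + r"(?!\w)")
    return [line for line in log_text.splitlines() if pattern.search(line)]
```

For example, you could read /var/log/messages (and /var/log/cluster, if configured) into a string and pass it with the name of the Failed service, then review the returned lines in order.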