HACMP

A list of HACMP-related documents can be found here:
http://www-03.ibm.com/servers/eserver/pseries/library/hacmp_docs.html

HACMP Daemons
HACMP Log files
HACMP Startup and Shutdown

HACMP Version 5.x
What is new in HACMP 5.x
Cluster Communication Daemon
Heartbeating
Forced Varyon of Volume Groups
Custom Resource Group
Application Monitoring
Resource Group Tasks

HACMP Daemons


01. clstrmgr
02. clinfo
03. clsmuxpd
04. cllockd

HACMP Log files


/tmp/hacmp.out: It records the output generated by the event scripts as they execute. When checking the /tmp/hacmp.out file, search for EVENT FAILED messages. These messages indicate that a failure has occurred. Then, starting from the failure message, read back through the log file to determine exactly what went wrong.

The /tmp/hacmp.out file is a standard text file. The system creates a new hacmp.out log file every day and retains the last seven copies. Each copy is identified by a number appended to the file name. The most recent log file is named /tmp/hacmp.out; the oldest version of the file is named /tmp/hacmp.out.7.
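
The search described above can be sketched as two small shell helpers. This is only an illustration: the function names find_event_failures and show_context are made up for this note, and the default context of 20 lines is an arbitrary choice.

```shell
#!/bin/sh
# find_event_failures [LOGFILE]
# Prints "line_number:text" for every line that contains EVENT FAILED.
# LOGFILE defaults to /tmp/hacmp.out.
find_event_failures() {
    logfile=${1:-/tmp/hacmp.out}
    grep -n "EVENT FAILED" "$logfile"
}

# show_context LOGFILE LINE [COUNT]
# Prints the COUNT lines (default 20) that end at line number LINE,
# so the events leading up to a failure can be read in order.
show_context() {
    logfile=$1
    line=$2
    count=${3:-20}
    start=$((line - count + 1))
    if [ "$start" -lt 1 ]; then
        start=1
    fi
    sed -n "${start},${line}p" "$logfile"
}
```

A typical use would be to run find_event_failures first, then pass one of the reported line numbers to show_context, for example: show_context /tmp/hacmp.out 1234 40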

/usr/es/adm/cluster.log: It is the main HACMP log file. HACMP error messages and messages about HACMP-related events are appended to this log with the time and date at which they occurred.

/usr/es/sbin/cluster/history/cluster.mmddyyyy: It contains time-stamped, formatted messages generated by HACMP scripts. The system creates a cluster history file whenever cluster events occur, identifying each file by the file name extension mmddyyyy, where mm indicates the month, dd indicates the day, and yyyy indicates the year.
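
As a small illustration of the mmddyyyy naming scheme, the following sketch builds the history file name for today or for a given date. The function names are hypothetical, and history_file_for assumes the month and day are passed already zero-padded.

```shell
#!/bin/sh
# history_file_today
# Prints the cluster history file name that corresponds to today's date.
history_file_today() {
    echo "/usr/es/sbin/cluster/history/cluster.$(date +%m%d%Y)"
}

# history_file_for MM DD YYYY
# Prints the history file name for a given date.
# MM and DD must already be two digits (for example 03 and 09).
history_file_for() {
    echo "/usr/es/sbin/cluster/history/cluster.$1$2$3"
}
```

For example, the file for today could be paged with: more "$(history_file_today)"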

/tmp/cspoc.log: It contains time-stamped, formatted messages generated by HACMP C-SPOC commands. The /tmp/cspoc.log file resides on the node that invokes the C-SPOC command.

/tmp/emuhacmp.out: It records the output generated by the event emulator scripts as they execute. The /tmp/emuhacmp.out file resides on the node from which the event emulator is invoked.

HACMP Startup and Shutdown


HACMP startup option:

Cluster to re-acquire resources: If cluster services were stopped with the forced option, HACMP expects all cluster resources on this node to be in the same state when cluster services are restarted. If you changed the state of any resource while cluster services were forced down, use this option to have HACMP reacquire resources during startup.

HACMP Shutdown Modes:

Graceful: The local machine shuts itself down gracefully. Remote machines interpret this as a graceful down and do not take over resources.

Takeover: The local machine shuts itself down gracefully. Remote machines interpret this as a non-graceful down and take over resources.

Forced: The local machine shuts down cluster services without releasing any resources. Remote machines do not take over any resources. This mode is useful for system maintenance.

HACMP 5.x

New in HACMP 5.1

  • SMIT Standard and Extended configuration paths (procedures)
  • Automated configuration discovery
  • Custom resource groups
  • Non IP networks based on heartbeating over disks
  • Fast disk takeover
  • Forced varyon of volume groups
  • Heartbeating over IP aliases
  • Heartbeating over disks
  • Heartbeat monitoring of service IP addresses/labels on takeover node(s)
  • Now there is only HACMP/ES, based on IBM Reliable Scalable Cluster Technology (RSCT)
  • Improved security, by using cluster communication daemon
  • Improved performance for cluster customization and synchronization
  • GPFS integration
  • Cluster verification enhancements

New in HACMP 5.2

  • Custom only resource groups
  • Cluster configuration auto correction
  • Cluster file collections
  • Automatic cluster verification
  • Application startup monitoring and multiple application monitors
  • Cluster lock manager dropped
  • Resource Monitoring and Control (RMC) subsystem replaces Event Management

HACMP 5.3 Limits

  • 32 nodes in a cluster
  • 64 resource groups in a cluster
  • 256 IP addresses known to HACMP (service and boot IP labels)
  • RSCT limit: 48 heartbeat rings

Cluster Communication Daemon

The Cluster Communication Daemon, clcomdES, provides secure remote command execution and HACMP ODM configuration file updates by using the principle of "least privilege".

The cluster communication daemon (clcomdES) has the following characteristics:

  • Since cluster communication does not require the standard AIX "r" commands, the dependency on the /.rhosts file has been removed. Thus, even in "standard" security mode, the cluster security has been enhanced.
  • Provides a reliable caching mechanism for other nodes' ODM copies on the local node (the node from which the configuration changes and synchronization are performed).
  • Limits the commands which can be executed as root on remote nodes (only the commands in /usr/es/sbin/cluster run as root).
  • clcomdES is started from /etc/inittab and is managed by the system resource controller (SRC) subsystem.
  • Provides its own heartbeat mechanism, and discovers active cluster nodes (even if cluster manager or RSCT is not running).
  • Uses HACMP ODM classes and the /usr/es/sbin/cluster/rhosts file to determine legitimate partners.

Heartbeating

Starting with HACMP V5.1, heartbeating is exclusively based on RSCT topology services.

The heartbeat via disk (diskhb) is a new feature introduced in HACMP V5.1, intended to provide additional protection against cluster partitioning and a simpler non-IP network configuration. This type of network can use any type of shared disk storage (Fibre Channel, SCSI, or SSA), as long as the disk used for exchanging keepalive (KA) messages is part of an AIX enhanced concurrent volume group. The disks used for heartbeat networks are not exclusively dedicated to this purpose; they can also be used to store shared application data.

Forced varyon of volume groups

HACMP V5.1 provides a new facility, the forced varyon of a volume group option on a node. You should use a forced varyon option only for volume groups that have mirrored logical volumes, and use caution when using this facility to avoid creating a partitioned cluster.

When using the forced varyon of volume groups option in a takeover situation, HACMP first tries a normal varyonvg. If this attempt fails due to lack of quorum, HACMP checks the integrity of the data to ensure that there is at least one available copy of all data in the volume group before trying to force the volume group online. If there is, it runs varyonvg -f; if not, the volume group remains offline and the resource group goes into an error state.
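
The decision flow described above can be sketched as a shell function. This is not HACMP's own code, only an outline of the order of the steps. The helper has_complete_copy is a hypothetical placeholder for HACMP's internal data integrity check; here it always answers "no", so the sketch never forces a volume group online unless you replace it. The sketch also does not distinguish a quorum failure from other varyonvg failures, which the real logic does.

```shell
#!/bin/sh
# has_complete_copy VGNAME
# HYPOTHETICAL placeholder for the integrity check that HACMP performs
# internally. Returns non-zero (no complete copy) so that the forced
# varyon is never attempted by default.
has_complete_copy() {
    return 1
}

# activate_vg VGNAME
# Prints "online", "online-forced" or "offline" and returns 0 on success.
activate_vg() {
    vg=$1

    # Step 1: always try a normal varyon first.
    if varyonvg "$vg"; then
        echo "online"
        return 0
    fi

    # Step 2: force the VG online only if a full copy of the data exists.
    if has_complete_copy "$vg" && varyonvg -f "$vg"; then
        echo "online-forced"
        return 0
    fi

    # Step 3: otherwise the VG stays offline (the RG would go into error).
    echo "offline"
    return 1
}
```

The point of the ordering is that varyonvg -f is the last resort and is attempted only after the data check succeeds.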

Custom Resource groups

Startup preferences

  • Online On Home Node Only: At node startup, the RG will only be brought online on the highest priority node. This behavior is equivalent to cascading RG behavior.
  • Online On First Available Node: At node startup, the RG will be brought online on the first node activated. This behavior is equivalent to that of a rotating RG or a cascading RG with inactive takeover. If a settling time is configured, it will affect RGs with this behavior.
  • Online On All Available Nodes: The RG should be online on all nodes in the RG. This behavior is equivalent to concurrent RG behavior. This startup preference will override certain fall-over and fall-back preferences.

Fallover preferences

  • Fallover To Next Priority Node In The List: The RG will fall over to the next available node in the node list. This behavior is equivalent to that of cascading and rotating RGs.
  • Fallover Using Dynamic Node Priority: The RG will fall over based on DNP calculations. The resource group must specify a DNP policy.
  • Bring Offline (On Error Node Only): The RG will not fall over on error; it will simply be brought offline. This behavior is most appropriate for concurrent-like RGs.

The settling time specifies how long HACMP waits for a higher priority node to join the cluster before activating a custom resource group that is currently offline. If you set the settling time, HACMP waits for the duration of the settling time interval to see whether a higher priority node joins the cluster, rather than simply activating the resource group on the first possible node that reintegrates into the cluster.

Fallback preferences

  • Fallback To Higher Priority Node: The RG will fall back to a higher priority node if one becomes available. This behavior is equivalent to cascading RG behavior. A fall-back timer will influence this behavior.
  • Never Fallback: The resource group will stay where it is, even if a higher priority node comes online. This behavior is equivalent to rotating RG behavior.

A delayed fall-back timer lets a custom resource group fall back to its higher priority node at a specified time. This lets you plan for outages for maintenance associated with this resource group.

You can specify the following types of delayed fall-back timers for a custom resource group:

  • Daily
  • Weekly
  • Monthly
  • Yearly
  • On a specific date

Application Monitoring

HACMP can also monitor applications in one of the following two ways:

  • Application process monitoring: Detects the death of a process, using RSCT event management capability.
  • Application custom monitoring: Monitors the health of an application based on a monitoring method (program or script) that you define.

When application monitoring is active, HACMP behaves as follows:

  • For application process monitoring, a kernel hook informs the cluster manager that the monitored process has died, and HACMP starts the application recovery process.

For the recovery action to take place, you must provide a method to clean up and restart the application (the start/stop scripts from the application server definition may be used). HACMP tries to restart the application a specified number of times and waits for it to stabilize, before sending a notification or actually moving the entire RG to a different node (the next one in the priority list).

  • For custom application monitoring (custom method), in addition to the cleanup and restart methods, you must also provide the program or script used for performing periodic application tests.
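
A custom monitoring method can be as simple as the sketch below. It assumes the usual convention that the monitor signals a healthy application with exit status 0 and a failure with any non-zero status, and it assumes the application writes its process ID to a PID file; the function name and the PID file location are hypothetical.

```shell
#!/bin/sh
# monitor_app PIDFILE
# Returns 0 if the process recorded in PIDFILE is still alive,
# non-zero otherwise (missing file, empty file, or dead process).
monitor_app() {
    pidfile=$1

    # No readable PID file: treat the application as down.
    [ -r "$pidfile" ] || return 1

    pid=$(cat "$pidfile")
    [ -n "$pid" ] || return 1

    # kill -0 sends no signal; it only tests whether the process exists.
    kill -0 "$pid" 2>/dev/null
}
```

A monitor script built on this would end with something like: monitor_app /var/run/myapp.pid; exit $?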

Resource Group Tasks


To list the resource groups configured for a cluster

 # cllsgrp

To list the details of a resource group

 # clshowres

To bring RG1 offline on Node3

 # clRGmove -g RG1 -n node3 -d    (-d brings the resource group down)

To bring CrucialRG online on Node3

 # clRGmove -g CrucialRG -n node3 -u

To check the current resource status

 # clfindres 
    or
 # clRGinfo

To find out the current cluster state and obtain information about the cluster

 # cldump

 Obtaining information via SNMP from Node: err3qci0...

 _____________________________________________________________________________
 Cluster Name: erpqa1
 Cluster State: UP
 Cluster Substate: STABLE
 _____________________________________________________________________________


 Node Name: err3qci0             State: UP

  Network Name: corp_ether_01      State: UP

    Address: 10.0.5.2        Label: r3qcibt1cp         State: UP
    Address: 10.0.6.2        Label: r3qcibt2cp         State: UP
    Address: 10.253.1.75     Label: sapr3qci           State: UP

  Network Name: prvt_ether_01      State: UP

    Address: 10.0.7.2        Label: r3qcibt1pt         State: UP
    Address: 10.0.8.2        Label: r3qcibt2pt         State: UP
    Address: 192.168.200.79  Label: psapr3qci          State: UP

  Network Name: ser_rs232_01       State:


 Node Name: err3qdb0             State: UP

  Network Name: corp_ether_01      State: UP

    Address: 10.0.5.1        Label: r3qdbbt1cp         State: UP
    Address: 10.0.6.1        Label: r3qdbbt2cp         State: UP
    Address: 10.253.1.55     Label: sapr3qdb           State: UP

  Network Name: prvt_ether_01      State: UP

    Address: 10.0.7.1        Label: r3qdbbt1pt         State: UP
    Address: 10.0.8.1        Label: r3qdbbt2pt         State: UP
    Address: 192.168.200.8   Label: psapr3qdb          State: UP

  Network Name: ser_rs232_01       State: UP

    Address:                 Label: r3qdb_ser          State: UP



 Cluster Name: erpqa1

 Resource Group Name: SapCI_RG
 Startup Policy: Online On Home Node Only
 Fallover Policy: Fallover To Next Priority Node In The List
 Fallback Policy: Never Fallback
 Site Policy: ignore
 Priority Override Information:
     Primary Instance POL:
 Node                         Group State
 ---------------------------- ---------------
 err3qci0                     ONLINE
 err3qdb0                     OFFLINE

 Resource Group Name: OraDB_RG
 Startup Policy: Online On Home Node Only
 Fallover Policy: Fallover To Next Priority Node In The List
 Fallback Policy: Never Fallback
 Site Policy: ignore
 Priority Override Information:
     Primary Instance POL:
 Node                         Group State
 ---------------------------- ---------------
 err3qdb0                     ONLINE
 err3qci0                     OFFLINE  

Synchronizing the VG information in HACMP while the cluster is running:

01. On the system where the VG changes were made, break the reserve on the disks using the varyonvg command

 # varyonvg -b -u <vgname>

02. Import the VG on the system where the VG information needs to be updated. Use the -n and -F flags so that the VG is not varied on

 # importvg -V <major #> -y <VG Name> -n -F <hdisk_name>

03. Varyon the VG without the SCSI reserves

 # varyonvg -b -u <vgname>

04. Change the VG so that it does not vary on automatically

 # chvg -an -Qy <vgname>

05. Varyoff the VG

 # varyoffvg <vgname>

06. Put the SCSI reserves back on the primary server

 # varyonvg <vgname>

Some useful HACMP Commands

To list all the app servers configured including start and stop script

 # cllsserv
 OraDB_APP  /usr/local/bin/dbstart   /usr/local/bin/dbstop
 SapCI_APP  /usr/local/bin/sapstart  /usr/local/bin/sapstop

To list the application monitoring configured on a cluster

 # cllsappmon
 OraDB_Mon       user
 SapCI_Mon       user

To get the detailed information about application monitoring

 # cllsappmon <app_mon_name>
 # cllsappmon -h OraDB_Mon
 #name   type  MONITOR_METHOD  MONITOR_INTERVAL        INVOCATION      HUNG_MONITOR_SIGNA 
 STABILIZATION_INTERVAL  FAILURE_ACTION  RESTART_COUNT   RESTART_INTERVAL        RESTART_METHOD  
 NOTIFY_METHOD   CLEANUP_METHOD  PROCESSES       PROCESS_OWNER   INSTANCE_COUNT  RESOURCE_TO_MONITOR
 OraDB_Mon  user  /usr/local/bin/dbmonitor   30  longrunning   9  180  fallover
 1   600  /usr/local/bin/dbstart  /usr/local/bin/dbstop

To clear the HACMP logs

 # clclear

HACMP Upgrading options

01. Rolling Migration
02. Snapshot Migration

To apply the online worksheet

 /usr/es/sbin/cluster/utilities/cl_opsconfig  <online worksheet file name>