VMware Aria Suite Lifecycle 8.14 introduces an innovative capability known as "VMware Identity Manager Cluster Auto-Recovery".
Why do we need this?
The aim of this 'autorecovery' service is to minimize the need to run the time-consuming 'Remediate' operation from the Suite Lifecycle UI.
In greenfield deployments, this feature is automatically activated, while in brownfield deployments it needs to be manually enabled after upgrading to VMware Aria Suite Lifecycle 8.14.
How does it work?
The 'autorecovery' service is deployed as a Linux service and operates on all three nodes within the vIDM cluster.
Operations like start/restart of the pgpool service are handled by the script individually on each node.
The cluster VIP (delegate IP) is managed only on the node whose role is primary.
Detachment of the VIP is done on standby nodes based on their role as "standby".
Operations like "recovery" are synchronized based on the node's status as database "primary", or, in the special case where all nodes are in standby, on its status as cluster "leader".
Because of this, no duplicate operations are triggered by multiple nodes (a simplified sketch follows below).
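As a rough illustration only, here is a minimal bash sketch of this role-based gating; the service's actual logic is not published, so the psql connection options below are assumptions. pg_is_in_recovery() is a standard PostgreSQL function that returns true on a standby and false on the primary.

    #!/bin/bash
    # Sketch only: gate actions on this node's database role, mirroring the
    # behaviour described above. Connection options are assumptions.
    IN_RECOVERY=$(psql -U postgres -At -c "SELECT pg_is_in_recovery();")

    if [ "$IN_RECOVERY" = "f" ]; then
        # Database primary: the only node that manages the cluster VIP /
        # delegate IP and coordinates recovery, so actions are not duplicated.
        echo "primary: ensure VIP is attached; drive recovery if needed"
    else
        # Standby: release the VIP if still present; never drive recovery
        # unless elected cluster 'leader' (the all-standby edge case).
        echo "standby: ensure VIP is detached"
    fi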
What challenges or issues does this feature tackle?
Cluster-VIP loss on primary
Cluster issues due to network outages
Recovery of "down" cluster node(s)
Avoids the need to initiate "Remediate" from the UI in most cases, which involves node restart(s) contributing to downtime
Eliminates the need to reboot vIDM nodes due to PostgreSQL cluster problems
Recovery in cases of significant replication delay ('significant' is configurable in bytes, say more than 1000 bytes of lag between primary and standby; see the measurement sketch after this list)
Recovery in rare cases where all nodes are in a 'standby' state
Prevents discrepancies in /etc/hosts
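For context on the replication-delay case: on PostgreSQL 10 and later, the lag between the primary and a standby can be read in bytes from the pg_stat_replication view. The one-liner below is an illustrative sketch to run on the primary, with connection options that are assumptions; the threshold itself is reportedly exposed as auto_recovery_replication_delay_threshold in lcm-pgpool.conf.

    # Sketch: report per-standby replay lag in bytes (run on the primary).
    # Connection options (-U postgres) are assumptions, not documented values.
    psql -U postgres -At -c \
      "SELECT application_name,
              pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)
       FROM pg_stat_replication;"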
Is there any downtime during the execution of auto-recovery?
No, there is no downtime when the auto-recovery script is triggered in the backend for any of the reasons above.
How do I enable and disable this feature?
Once enabled, users can go to the Day-2 operations pane of the globalenvironment (vIDM) and choose to disable it, and vice versa.
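Once toggled from the UI, one way to sanity-check the result on each vIDM appliance is via systemd. The unit name below is a placeholder, since the article does not name the service; list the units first to find the real one.

    # Placeholder unit name -- the article does not name the service.
    # Find the actual unit first, e.g.:
    #   systemctl list-units --type=service | grep -i recover
    systemctl status vidm-autorecovery.service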