I came across a problem recently where provisioning and reconfiguration of virtual machines through vRealize Automation were failing.
Exceptions seen in catalina.out:
2019-06-02 09:53:25,643 vcac: [component="cafe:event-broker" priority="INFO" thread="event-broker-service-taskExecutor8" tenant="" context="" parent="" token=""] com.vmware.vcac.core.event.broker.integration.PublishReplyEventServiceActivator.onApplicationEvent:96 - Message Broker unavailable[internal stop]
2019-04-02 09:53:25,711 vcac: [component="cafe:console-proxy" priority="WARN" thread="Grizzly(1)" tenant="" context="" parent="" token=""] com.vmware.vcac.platform.event.broker.client.stomp.StompEventSubscribeHandler.handleException:493 - Error during message processing: session:[f4329ebb-4e2e-7690-8b6a-3f420c8bd226], command[null], headers[{message=[Connection to broker closed.], content-length=[0]}], payload [{}]. Reason : [Connection to broker closed.]
2019-04-02 09:53:25,712 vcac: [component="cafe:console-proxy" priority="ERROR" thread="Grizzly(1)" tenant="" context="" parent="" token=""] com.vmware.vcac.core.service.event.ServerEventBrokerServiceFacade.handleError:337 - Error for command 'null', headers: '{message=[Connection to broker closed.], content-length=[0]}' java.lang.Exception: Connection to broker closed.
The above exceptions clearly state that the problem is with the messaging broker, which is RabbitMQ.
Performing RabbitMQ resets and then re-adding the second or third node (if available) to the master would eventually resolve the problem.
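For reference, here is a rough sketch of that reset-and-rejoin procedure using standard rabbitmqctl commands. The node names are taken from the logs below, and the assumption that psvra01 is the surviving master and psvra03 the node being reset is mine; your environment may differ.

```shell
# Run on the node being reset (assumed here to be psvra03):
rabbitmqctl stop_app                                    # stop the RabbitMQ application (Erlang VM stays up)
rabbitmqctl reset                                       # return the node to a blank state, forgetting its cluster membership
rabbitmqctl join_cluster rabbit@psvra01.nukescloud.com  # rejoin the cluster via the surviving master
rabbitmqctl start_app                                   # start the RabbitMQ application again
```
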
After a fair bit of research in the RabbitMQ logs, we see:
on node psvra01.nukescloud.com:
=INFO REPORT==== 11-Jun-2019::16:31:41 ===
rabbit on node 'rabbit@psvra03.nukescloud.com' down
=INFO REPORT==== 11-Jun-2019::16:31:41 ===
Keep rabbit@psvra03.nukescloud.com listeners: the node is already back
on node psvra03.nukescloud.com:
=INFO REPORT==== 11-Jun-2019::16:54:12 ===
rabbit on node 'rabbit@psvra01.nukescloud.com' down
=INFO REPORT==== 11-Jun-2019::16:54:12 ===
Keep rabbit@psvra01.nukescloud.com listeners: the node is already back
...
=INFO REPORT==== 11-Jun-2019::18:55:09 ===
rabbit on node 'rabbit@psvra01.nukescloud.com' down
=INFO REPORT==== 11-Jun-2019::18:55:09 ===
Keep rabbit@psvra01.nukescloud.com listeners: the node is already back
The above snippets clearly show that network partitions are occurring.
RabbitMQ does not tolerate network partitions well and, by default, does not recover from them properly on its own.
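As a side note, RabbitMQ itself has a `cluster_partition_handling` setting that controls what happens after a partition. Whether and how the vRA appliance sets this is not something I have verified, so treat the snippet below purely as a sketch of the upstream RabbitMQ option:

```erlang
%% Classic-format rabbitmq.config fragment.
%% `autoheal` makes RabbitMQ automatically pick a winning partition and
%% restart the nodes on the losing side. Other valid values are
%% `ignore` (the default) and `pause_minority`.
[
  {rabbit, [
    {cluster_partition_handling, autoheal}
  ]}
].
```
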
Executing the command below will make the cluster somewhat more resilient to these network partitions:
rabbitmqctl set_policy ha-all "" '{"ha-mode":"all","ha-sync-mode":"automatic","ha-promote-on-failure":"always","ha-promote-on-shutdown":"always"}'
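To confirm the policy took effect and to check cluster health afterwards, the standard rabbitmqctl inspection commands can be used (nothing vRA-specific here):

```shell
rabbitmqctl list_policies   # should list the ha-all policy with the JSON definition above
rabbitmqctl cluster_status  # shows cluster nodes, which are running, and any partitions
```
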
Newer versions of vRealize Automation have mechanisms in place to detect this sort of issue and attempt an automated recovery.
The command mentioned above helps to a certain extent, but a fully available and redundant network between the vRealize Automation nodes is still required.