I came across a problem recently where provisioning and reconfiguration of virtual machines through vRealize Automation were failing.
Exceptions seen in catalina.out:
2019-06-02 09:53:25,643 vcac: [component="cafe:event-broker" priority="INFO" thread="event-broker-service-taskExecutor8" tenant="" context="" parent="" token=""] com.vmware.vcac.core.event.broker.integration.PublishReplyEventServiceActivator.onApplicationEvent:96 - Message Broker unavailable[internal stop]
2019-04-02 09:53:25,711 vcac: [component="cafe:console-proxy" priority="WARN" thread="Grizzly(1)" tenant="" context="" parent="" token=""] com.vmware.vcac.platform.event.broker.client.stomp.StompEventSubscribeHandler.handleException:493 - Error during message processing: session:[f4329ebb-4e2e-7690-8b6a-3f420c8bd226], command[null], headers[{message=[Connection to broker closed.], content-length=[0]}], payload [{}]. Reason : [Connection to broker closed.]
2019-04-02 09:53:25,712 vcac: [component="cafe:console-proxy" priority="ERROR" thread="Grizzly(1)" tenant="" context="" parent="" token=""] com.vmware.vcac.core.service.event.ServerEventBrokerServiceFacade.handleError:337 - Error for command 'null', headers: '{message=[Connection to broker closed.], content-length=[0]}' java.lang.Exception: Connection to broker closed.
The above exceptions clearly state that the problem is with the messaging broker, which is RabbitMQ.
Performing RabbitMQ resets and then re-adding the second or third node (if available) to the master would eventually resolve the problem.
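For reference, here is a rough sketch of that reset-and-rejoin procedure using standard rabbitmqctl commands. The node names are taken from the logs below, and the assumption that psvra01 is the surviving master and psvra03 the node being reset is mine; your environment may differ.

```shell
# Run on the node being reset (assumed here to be psvra03):
rabbitmqctl stop_app                                    # stop the RabbitMQ application (Erlang VM stays up)
rabbitmqctl reset                                       # return the node to a blank state, forgetting its cluster membership
rabbitmqctl join_cluster rabbit@psvra01.nukescloud.com  # rejoin the cluster via the surviving master
rabbitmqctl start_app                                   # start the RabbitMQ application again
```
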
After a fair bit of research in the RabbitMQ logs, we see:
on node psvra01.nukescloud.com:
=INFO REPORT==== 11-Jun-2019::16:31:41 ===
rabbit on node 'rabbit@psvra03.nukescloud.com' down
=INFO REPORT==== 11-Jun-2019::16:31:41 ===
Keep rabbit@psvra03.nukescloud.com listeners: the node is already back
on node psvra03.nukescloud.com:
=INFO REPORT==== 11-Jun-2019::16:54:12 ===
rabbit on node 'rabbit@psvra01.nukescloud.com' down
=INFO REPORT==== 11-Jun-2019::16:54:12 ===
Keep rabbit@psvra01.nukescloud.com listeners: the node is already back
...
=INFO REPORT==== 11-Jun-2019::18:55:09 ===
rabbit on node 'rabbit@psvra01.nukescloud.com' down
=INFO REPORT==== 11-Jun-2019::18:55:09 ===
Keep rabbit@psvra01.nukescloud.com listeners: the node is already back
The above snippets clearly show that network partitions are occurring.
RabbitMQ does not tolerate network partitions well and, by default, does not recover from them properly on its own.
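As a side note, RabbitMQ itself has a `cluster_partition_handling` setting that controls what happens after a partition. Whether and how the vRA appliance sets this is not something I have verified, so treat the snippet below purely as a sketch of the upstream RabbitMQ option:

```erlang
%% Classic-format rabbitmq.config fragment.
%% `autoheal` makes RabbitMQ automatically pick a winning partition and
%% restart the nodes on the losing side. Other valid values are
%% `ignore` (the default) and `pause_minority`.
[
  {rabbit, [
    {cluster_partition_handling, autoheal}
  ]}
].
```
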
Executing the command below will make the cluster somewhat more resilient to these network partitions:
rabbitmqctl set_policy ha-all "" '{"ha-mode":"all","ha-sync-mode":"automatic","ha-promote-on-failure":"always","ha-promote-on-shutdown":"always"}'
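To confirm the policy took effect and to check cluster health afterwards, the standard rabbitmqctl inspection commands can be used (nothing vRA-specific here):

```shell
rabbitmqctl list_policies   # should list the ha-all policy with the JSON definition above
rabbitmqctl cluster_status  # shows cluster nodes, which are running, and any partitions
```
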
Newer versions of vRealize Automation have mechanisms in place to detect this sort of issue and attempt an automated recovery.
The command mentioned above helps to a certain extent, but a fully available and redundant network between the vRealize Automation nodes is still required.