NSX data collection unavailable

Arun Nukula
Sep 16, 2019
In preparation for using NSX network, security, and load balancing capabilities in vRealize Automation, we first have to create an NSX endpoint.

I was asked to look into a problem where, even after an NSX endpoint had been created successfully and its association mapped, selecting Data Collection under Compute Resource did not show Network and Security Inventory.

Looking at the logs after the NSX endpoint was created, we can see that a data collection work item, VCNSInventory, was created.

Reference : ManagerService / All.log

[UTC:2019-09-03 10:29:48 Local:2019-09-03 15:59:48] [Debug]: [sub-thread-Id="45" context="" token=""] DC: Created data collection item, WorkflowInstanceId 183022, Task VCNSInventory, EntityID 8ed67519-99fb-4afa-811f-227e753a24eb, StatusID = 457b3af7-b739-45b2-ab9f-0cdd79596af0

Taking one of these instances, 183022, as an example, I inspected the worker logs.

Worker initialises instance

2019-09-03T10:29:49.962Z DC-DEM02 vcac: [component="iaas:DynamicOps.DEM.exe" priority="Trace" thread="4268"] [sub-thread-Id="27" context="" token=""] Worker Controller: initializing instance 183022 - vSphereVCNSInventory of the workflow execution unit

2019-09-03T10:29:52.009Z DC-DEM02 vcac: [component="iaas:DynamicOps.DEM.exe" priority="Trace" thread="4268"] [sub-thread-Id="27" context="" token=""] WorkflowExecutionUnit: initialize started: 183022

2019-09-03T10:30:14.401Z DC-DEM02 vcac: [component="iaas:DynamicOps.DEM.exe" priority="Trace" thread="4864"] [sub-thread-Id="28" context="" token=""] Workflow ID: 183022 Activity <Mark Data Collection Complete>: State: Closed

2019-09-03T10:30:14.417Z DC-DEM02 vcac: [component="iaas:DynamicOps.DEM.exe" priority="Debug" thread="4864"] [sub-thread-Id="28" context="" token=""] Workflow Complete: 183022 - Successful

2019-09-03T10:30:14.417Z DC-DEM02 vcac: [component="iaas:DynamicOps.DEM.exe" priority="Trace" thread="4864"] [sub-thread-Id="28" context="" token=""] Worker Controller: WriteCompletedWorkflow

As shown above, the workflow did go through data collection and was marked as successful, but Network and Security Inventory never showed up in the UI.
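Following one workflow instance through a busy worker log is easier with a small filter. Here is a minimal Python sketch, assuming log lines in the format shown above. The function name trace_instance is my own and is not part of any vRA tooling.

```python
import re

def trace_instance(lines, instance_id):
    """Return (matching_lines, outcome) for one workflow instance.

    outcome is the word that follows 'Workflow Complete: <id> - ',
    or None if no completion line was found.
    """
    wanted = re.escape(str(instance_id))
    # Match the id as a whole number so 183022 does not match 1830221.
    id_pattern = re.compile(r"(?<!\d)" + wanted + r"(?!\d)")
    done_pattern = re.compile(r"Workflow Complete: " + wanted + r" - (\w+)")

    matching_lines = []
    outcome = None
    for line in lines:
        if id_pattern.search(line):
            matching_lines.append(line)
            done = done_pattern.search(line)
            if done:
                outcome = done.group(1)
    return matching_lines, outcome
```

Feeding it the lines of the worker log together with 183022 would return the lines that mention that instance and the outcome "Successful". Lines that do not carry the instance id, such as WriteCompletedWorkflow, are not picked up.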

At this point, when we ran Test Connection on an endpoint and clicked OK, the test connection succeeded but the endpoint could not be saved.

That's when I suspected that something must be wrong with the endpoints table.

This suspicion was confirmed after reviewing the API data captured in a HAR file.

Now that we knew something was definitely wrong with the endpoints, we ran the query select * from ManagementEndpoints and found stale entries for all vSphere endpoints.

Ideally there should be only one entry per vSphere endpoint in this table, but here we had two per vSphere endpoint.
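Spotting the duplicates by eye is fine for a couple of endpoints, but the check can also be scripted against an export of the table. A minimal Python sketch follows, assuming each row has been reduced to an (endpoint name, ManagementEndpointID) pair. The name column is an assumption on my part, and duplicated_endpoints is a hypothetical helper.

```python
from collections import defaultdict

def duplicated_endpoints(rows):
    """Return {endpoint_name: [ids]} for names that have more than one id.

    rows: iterable of (endpoint_name, management_endpoint_id) pairs.
    The pairing by name is an assumption about how the rows were exported.
    """
    ids_by_name = defaultdict(set)
    for name, endpoint_id in rows:
        # GUIDs may appear in upper or lower case, so compare them lowercased.
        ids_by_name[name].add(endpoint_id.lower())
    return {
        name: sorted(ids)
        for name, ids in ids_by_name.items()
        if len(ids) > 1
    }
```

Any endpoint name that comes back from this function has more than one row and needs a closer look.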

How do we now identify which entry is the correct one and which ManagementEndpointId should be deleted?

For this, grep vSphereAgent.log (the proxy agent log) and search for managementEndpointId. The managementEndpointId you find in the log is the correct one, and that entry must remain in the ManagementEndpointID column of the dbo.ManagementEndpoints table.

Example 2019-09-09T03:54:23.466Z DC-AGENT01 vcac: [component="iaas:VRMAgent.exe" priority="Debug" thread="900"] [sub-thread-Id="6" context="" token=""] Ping Sent Successfully : [<?xml version="1.0" encoding="utf-16"?><pingReport agentName="vCenter" agentVersion="7.3.0.0" agentLocation="PRDVC" WorkitemsProcessed="9254"><Endpoint externalReferenceId="cbeebd33-245a-4b18-a8a8-d337e8c46627" productName="VMware vCenter Server" version="6.5.0" licenseName="VMware vCenter Server 6 Standard" /><ManagementEndpoint Name="vCenter" /><Nodes><Node name="SINGAPORE" type="Cluster" identity="prodvc/IDBI DC/host/SINGAPORE" datacenterExternalReferenceId="datacenter-21" externalReferenceId="domain-c26" isCluster="True" managementEndpointId="e5b052e1-0792-465a-a2a8-6b8b031f48ac" /><Node name="DC_PRODUCTION_RHEL_CLUSTER" type="Cluster" identity="prodvc/SGP/host/SINGAPORE" datacenterExternalReferenceId="datacenter-21" externalReferenceId="domain-c1310" isCluster="True" managementEndpointId="e5b052e1-0792-465a-a2a8-6b8b031f48ac" /></Nodes><AgentTypes><AgentType name="Hypervisor" /><AgentType name="vSphereHypervisor" /></AgentTypes></pingReport>]
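The same search can be done with a few lines of Python instead of grep. This is a minimal sketch, assuming ping report lines in the format shown above. The function name endpoint_ids_in_log is hypothetical.

```python
import re
from collections import Counter

# Matches managementEndpointId="<36-character GUID>" inside a ping report.
ENDPOINT_ID = re.compile(r'managementEndpointId="([0-9a-fA-F-]{36})"')

def endpoint_ids_in_log(lines):
    """Count every managementEndpointId value found in the given log lines."""
    counts = Counter()
    for line in lines:
        for endpoint_id in ENDPOINT_ID.findall(line):
            # The delete statements in this post use upper-case GUIDs while
            # the agent log uses lower case, so normalise before counting.
            counts[endpoint_id.lower()] += 1
    return counts
```

Run against the agent log of each proxy agent (for example by passing an open file object as lines), this should leave you with one id per agent. That id is the one to keep in dbo.ManagementEndpoints.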

Now that we knew which entries were correct, having cross-checked vSphereAgent.log against the ManagementEndpoints table, we had to remove the stale entries from this table.

We took a backup of the IaaS SQL database along with snapshots, and then executed delete statements on the entries we believed to be stale.

delete from dbo.ManagementEndpoints where ManagementEndpointID = 'E15DFAAE-229E-4874-AACB-793BDB6076F4';

delete from dbo.ManagementEndpoints where ManagementEndpointID = '03CACB31-23DD-444C-A493-8DDC8BC4E4CF';

But this did not solve our problem. After removing the stale entries, saving the endpoints threw a different exception.

When you create an endpoint in vRA, it creates an entry not only in the IaaS database but also in vRA's Postgres database.

We explored a table called epconf_endpoint. This table holds all endpoints created through the vRA UI, and the id in the Postgres database must match the ManagementEndpointId in the SQL (IaaS) database.

Remember, these were the ids we deleted from SQL. The reason for "Endpoint with id [xxxxxx] is not found in iaas" is this discrepancy between IaaS and Postgres.
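Once the ids from both databases are exported, the comparison itself is simple. Here is a minimal Python sketch, assuming you have the name and id of each vSphere endpoint from epconf_endpoint and the list of ManagementEndpointID values from dbo.ManagementEndpoints. The function name find_mismatches is hypothetical.

```python
def find_mismatches(postgres_endpoints, iaas_ids):
    """Return the Postgres endpoints whose id is not known to IaaS.

    postgres_endpoints: dict of endpoint name -> id from epconf_endpoint.
    iaas_ids: iterable of ManagementEndpointID values from the IaaS database.
    """
    # Compare case-insensitively, since the two databases may print
    # GUIDs in different cases.
    known = {endpoint_id.lower() for endpoint_id in iaas_ids}
    return {
        name: endpoint_id
        for name, endpoint_id in postgres_endpoints.items()
        if endpoint_id.lower() not in known
    }
```

Every endpoint returned here would be expected to fail with the "is not found in iaas" error on save. Pass in only the vSphere endpoints, because the NSX endpoint is stored in a different IaaS table and would otherwise be reported as a mismatch.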

Updating the ids here in Postgres with the ones taken from IaaS for the appropriate endpoints would resolve this data mismatch. But there is a catch.

As you can see above, an NSX endpoint has already been created. We already know this, since that is the endpoint we are troubleshooting.

Along with the NSX endpoint, an association is created. This association information is stored in the epconf_association table.

This association table contains:

id: the id of the association

from_endpoint_id: the id of the NSX endpoint, that is, the NSXEndpointId from the IaaS database, which is also its id in epconf_endpoint in the Postgres database

to_endpoint_id: the id of the vSphere endpoint that the NSX endpoint is mapped to

Note: NSX endpoint information is stored in the [DynamicOps.VCNSModel].[VCNSEndpoints] table of the IaaS database.

This is where we found an answer to our problem

The to_endpoint_id inside epconf_association was pointing to the wrong id.
Both ids under epconf_endpoint had to be modified to match the ones present in IaaS.

Remediation Plan

As a first step, we deleted the NSX endpoint from the vRA UI. This removed its entry from epconf_association, so there was no longer any need to update that table.

After removing the NSX endpoint from the UI, we moved on to epconf_endpoint to update the ids with the correct ones taken from the IaaS database.

Updating vCenter endpoint

update epconf_endpoint set id = 'e5b052e1-0792-465a-a2a8-6b8b031f48ac' where name = 'vCenter';

Updating vCenter01 endpoint

update epconf_endpoint set id = '5646fa1e-6a2b-4d08-9381-219fe6d92a5e' where name = 'vCenter01';

After we corrected the ids in epconf_endpoint (Postgres) to match ManagementEndpointID (IaaS database), we were able to save the endpoints successfully.

After this, creating the NSX endpoint and mapping it to the correct vSphere endpoint resulted in a successful NSX data collection.

!! Hope this helps !!
