
- Version table below vRA cluster tab showing incorrect status
I was working on a vRA upgrade that was being performed through vRLCM. Most of the components were upgraded, apart from DEM Orchestrator, which kept failing for unknown reasons. Since this was the last remaining component, we decided to go ahead and install it manually. The installation succeeded and the whole upgrade completed, but when we went to VAMI and checked the Cluster tab and the versions, the DEM Orchestrator status on the node was still set to "Upgrade Failed".

The information under the Cluster tab of VAMI actually comes from two places:
1. The Management Agent installed on the IaaS node where the respective components are installed
2. The cluster_nodes table of vPostgres

Inspecting cluster_nodes, we find that the ping_info column holds the above information:

{
  "installationDrive": "C:\\",
  "name": "XXXXXXX.adDEO",
  "certificate": { },
  "metrics": {
    "cpuPercentage": 0.0,
    "memory": 52379648
  },
  "state": "Started",
  "version": "7.6.0.16195",
  "type": "DemOrchestrator",
  "condition": "Failed"
}

As you can see, there is a "condition" tag inside the ping_info column of the cluster_nodes table. It is written into this table by the Management Agent, which in turn pulls the information from the registry of the IaaS node where the components are installed. To resolve this issue, we have to get this condition tag removed.

To remediate:
1. Go to the following registry path: HKEY_LOCAL_MACHINE\SOFTWARE\WOW6432Node\VMware, Inc.\VMware vCloud Automation Center DEM
2. Click on DemInstance01 or DemInstance02 (depends on your environment)
3. On the right side, check whether a ComponentCondition registry key exists
4. If yes, delete the registry key
5. Restart the Management Agent on the node

Now if you go back to VAMI, the version column and the data inside the ping_info column should show the appropriate information.
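The shape of the problem can be illustrated in Python — this is only a sketch of stripping the "condition" tag from a ping_info JSON document, not a supported way to edit the cluster_nodes table (the proper fix is the registry change above, after which the Management Agent rewrites the row itself):

```python
import json

def clean_ping_info(ping_info_json: str) -> str:
    """Return ping_info JSON with the stray 'condition' tag removed."""
    data = json.loads(ping_info_json)
    # Drop the flag that keeps VAMI showing "Upgrade Failed"
    data.pop("condition", None)
    return json.dumps(data)

raw = '{"state": "Started", "type": "DemOrchestrator", "condition": "Failed"}'
cleaned = clean_ping_info(raw)
print("condition" in json.loads(cleaned))  # → False
```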
- vSphere Web (Flash) Client Supportability and End of Life
vSphere Web Client uses Adobe Flash, and Adobe Flash reaches End of Life (EOL) in Dec 2020. All major browsers have aligned their efforts to disable or stop running Flash applications by that date. VMware vSphere 6.5 and 6.7 are supported until Nov 2021, and both versions ship with the vSphere Web Client (Flash). The vSphere Client (HTML5), starting with vSphere 6.7 Update 1, has become feature complete and supports all vSphere management capabilities. VMware recommends that customers upgrade their vCenter Server(s) to 6.7 Update 3 by Dec 2020 and use the vSphere Client (HTML5) to manage the vSphere environment. For more information, please review the following resources: KB Article 78589; VMware vSphere Blog - vSphere Web Client Support beyond Adobe Flash EOL
- Disable health broker service ( vrhb-service ) on vRA 7.x
Perform the following steps to ensure that the vrhb service is disabled and not monitored by cron jobs.

Note: Take snapshots first (No Memory, No Quiescing)

Step 1 Stop the health service monitor by commenting out the cron job in /etc/cron.d/monitor-vrhb-cron

Step 2 Kill any instances of the monitor that might still be running:
ps -A | grep monitor-vrhb.sh | awk '{print $1}' | xargs --no-run-if-empty kill -9

Step 3 Stop the health service:
service vrhb-service stop

Step 4 Verify that the service is stopped; if a process is still found, kill it manually:
ps aux | grep Quorum

Step 5 Clean up the Health Service datastores (aka sandboxes):
rm -r /var/lib/vrhb/service-host/sandbox
rm -r /var/lib/vrhb/vra-tests-host/sandbox

Note: I have not configured the health broker service in my lab, so you would not see the vra-tests-host/sandbox command being executed

Step 6 As a last step, turn off vrhb-service using chkconfig:
chkconfig vrhb-service off
- The remote server returned an error: (403) Forbidden
You might end up in a situation where your IAAS service is not REGISTERED on VAMI Repository.log [UTC:2020-03-18 03:20:56 Local:2020-03-18 03:20] [Error]: [sub-thread-Id="51" context="" token=""] System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation. ---> System.AggregateException: One or more errors occurred. ---> System.Security.Authentication.AuthenticationException: OAuth token request failed. URL: https://<>/SERVICE: endpoints/types/sso ---> System.Net.Http.HttpRequestException: An error occurred while sending the request. ---> System.Net.WebException: The remote server returned an error: (403) Forbidden. at System.Net.HttpWebRequest.EndGetRequestStream(IAsyncResult asyncResult, TransportContext& context) at System.Net.Http.HttpClientHandler.GetRequestStreamCallback(IAsyncResult ar) --- End of inner exception stack trace --- at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at DynamicOps.Common.Client.RestClient.<>c__DisplayClassc9`2.<b__c8>d__cb.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() * * * * --- End of inner exception stack trace --- at System.Threading.Tasks.Task`1.GetResultCore(Boolean waitCompletionNotification) at DynamicOps.Repository.Runtime.SecurityModel.CafeSecurityProvider.LoadSecurityInformation(UserIdentity userIdentity) at DynamicOps.Repository.Runtime.SecurityModel.SecurityModelContext.GetIdentityTasksFromCache(UserIdentity userIdentity) at DynamicOps.Repository.Runtime.SecurityModel.SecurityModelContext.get_IdentityTasks() at DynamicOps.Repository.Runtime.ServiceModel.Data.RepositoryDataService`2.CalculateWritePermissionScopes(Int32 entityId) at DynamicOps.Repository.Runtime.ServiceModel.Data.RepositoryDataService`2.InternalOnChangeEntity[TEntity](Int32 entityId, TEntity 
entity, IQueryable`1 entitySet, UpdateOperations operation) at DynamicOps.Repository.Runtime.ServiceModel.Data.TrackingModelDataService.OnChangeTrackingLogItems(TrackingLogItem entity, UpdateOperations operation) inc:\Windows\Temp\0bxcpk4c.0.cs:line 105 --- End of inner exception stack trace --- at System.Data.Services.DataService`1.BatchDataService.HandleBatchContent(Stream responseStream) INNER EXCEPTION: System.AggregateException: One or more errors occurred. ---> System.Security.Authentication.AuthenticationException: OAuth token request failed. URL: https://<>/SERVICE: endpoints/types/sso ---> System.Net.Http.HttpRequestException: An error occurred while sending the request. ---> System.Net.WebException: The remote server returned an error: (403) Forbidden. at System.Net.HttpWebRequest.EndGetRequestStream(IAsyncResult asyncResult, TransportContext& context) at System.Net.Http.HttpClientHandler.GetRequestStreamCallback(IAsyncResult ar) --- End of inner exception stack trace --- at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at DynamicOps.Common.Client.RestClient.<>c__DisplayClassc9`2.<b__c8>d__cb.MoveNext() Web_Admin.log [UTC:2020-03-17 18:41:08 Local:2020-03-17 18:41] [Error]: [sub-thread-Id="12" context="" token=""] Error occurred writing to the repository tracking log System.Net.WebException: The remote server returned an error: (403) Forbidden. 
at System.Net.HttpWebRequest.GetRequestStream(TransportContext& context) at System.Net.HttpWebRequest.GetRequestStream() at System.Data.Services.Client.ODataRequestMessageWrapper.SetRequestStream(ContentStream requestStreamContent) at System.Data.Services.Client.BatchSaveResult.BatchRequest() at System.Data.Services.Client.DataServiceContext.SaveChanges(SaveChangesOptions options) at DynamicOps.Repository.RepositoryServiceContext.SaveChanges(SaveChangesOptions options) at DynamicOps.Repository.Tracking.RepoLoggingSingleton.WriteExceptionToLogs(String message, Exception exceptionObject, Boolean writeAsWarning)

These messages clearly indicate that your IaaS is trying to fetch an auth token from the Manager but is unable to get it. Expected output would look like the following:

[UTC:2020-03-11 12:33:36 Local:2020-03-11 00:33] [VMware.Cafe]: [sub-thread-Id="1" context="" token=""] Setting CafeClientCacheDuration: 00:05:00
[UTC:2020-03-11 12:33:36 Local:2020-03-11 00:33] [VMware.Cafe]: [sub-thread-Id="1" context="" token=""] (1) GET endpoints/types/sso
[UTC:2020-03-11 12:33:36 Local:2020-03-11 00:33] [VMware.Cafe]: [sub-thread-Id="10" context="" token=""] (1) Response: OK 0:00.105
[UTC:2020-03-11 12:33:37 Local:2020-03-11 00:33] [VMware.Cafe]: [sub-thread-Id="8" context="" token=""] (2) POST SAAS/t/vsphere.local/auth/oauthtoken?grant_type=client_credentials
[UTC:2020-03-11 12:33:37 Local:2020-03-11 00:33] [VMware.Cafe]: [sub-thread-Id="11" context="" token=""] (2) Response: OK 0:00.118
[UTC:2020-03-11 12:33:37 Local:2020-03-11 00:33] [VMware.Cafe]: [sub-thread-Id="8" context="" token=""] (3) GET endpoints/types/com.vmware.csp.cafe.authentication.api/default

To resolve this problem:

Step 1 Take snapshots (MANDATORY; Note: No Memory or Quiescing)

Step 2 Validate that all the certificates are in place and valid; you may do this from VAMI

Step 3 Reinitiate trust under the Actions section of the Certificate tab on the vRA appliance's VAMI

Step 4 Reboot the environment systematically as per documentation

Once the environment is up, you should see all services coming back appropriately
- VAMI login lockouts
If an incorrect password is entered a few times on VAMI, it locks you out from logging in again for a few minutes. Here's the procedure to clear the lockout instantly so that you don't have to wait.

Step 1 Identify how many login failures occurred with the account "root":
pam_tally2 -u root

Step 2 Reset this value to 0:
pam_tally2 -u root --reset

Step 3 Verify that the value is back to 0:
pam_tally2 -u root

Step 4 Test your login to VAMI; it should be working now
- Third-Party Plugins are not visible on all vRO nodes after we disable & enable from ControlCenter
Came across a problem where a plugin was not visible on all the nodes of an embedded vRO cluster, version 7.6. This issue is usually seen with third-party plugins and not with out-of-the-box plugins. When the user disables a plugin from vRO's Control Center and then enables it back, though it shows as enabled in Control Center, inside the appliance it is still marked as disabled.

Log snippet showing the plugin being marked as disabled as per its configuration:

2019-12-30 19:24:07.091+1100 [pool-1-thread-1] INFO {} [ModulesFactory] Deploying plug-in : '/var/lib/vco/app-server/../app-server/plugins/f5_big_ip_plugin.dar'
2019-12-30 19:24:07.609+1100 [pool-1-thread-1] INFO {} [SDKDatasource] setInvokerMode() -> Datasource 'datasource' in module 'F5Networks' set to 'direct'
2019-12-30 19:24:08.465+1100 [pool-1-thread-1] INFO {} [ModulesFactory] vCO plug-in 'F5Networks' is disabled

The configuration of this plugin is stored under /var/lib/vco/app-server/conf/plugins/_VSOPluginState.xml. When you edit this file you will see the plugin's state set to disabled; change that value to enabled. Then restart the vCO service:

service vco-server stop && service vco-server start

Once services are up, you should be able to see the plugin available on all three nodes. This looks like a bug.
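The edit to _VSOPluginState.xml can be sketched programmatically. Note the XML layout below is hypothetical — inspect the real file before scripting against it, as the actual schema may name elements and attributes differently:

```python
import xml.etree.ElementTree as ET

# Hypothetical _VSOPluginState.xml layout -- the real schema may differ.
SAMPLE = """<plugin-states>
  <plugin name="F5Networks" state="disabled"/>
  <plugin name="NSX" state="enabled"/>
</plugin-states>"""

def enable_plugin(xml_text: str, plugin_name: str) -> str:
    """Flip the named plugin's state from disabled to enabled."""
    root = ET.fromstring(xml_text)
    for plugin in root.findall("plugin"):
        if plugin.get("name") == plugin_name:
            plugin.set("state", "enabled")
    return ET.tostring(root, encoding="unicode")

fixed = enable_plugin(SAMPLE, "F5Networks")
print('state="enabled"' in fixed)  # → True
```

Remember to restart the vco-server service afterwards, as described above, so the new state is picked up.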
- Enabling debug logging for vRA's horizon
Execute the commands below to enable debug logging for vRA's horizon and connector components:

sed -i 's/rootLogger=INFO/rootLogger=DEBUG/g' /usr/local/horizon/conf/saas-log4j.properties
sed -i 's/rootLogger=INFO/rootLogger=DEBUG/g' /usr/local/horizon/conf/hc-log4j.properties

No restart is required after changing these settings.

Execute the commands below to disable debug logging:

sed -i 's/rootLogger=DEBUG/rootLogger=INFO/g' /usr/local/horizon/conf/saas-log4j.properties
sed -i 's/rootLogger=DEBUG/rootLogger=INFO/g' /usr/local/horizon/conf/hc-log4j.properties

Again, no restart is needed after reverting.
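What the sed lines do, expressed in Python — a sketch of the substitution, assuming the log4j properties files carry a rootLogger=<LEVEL> entry as the commands above imply:

```python
import re

def set_root_logger(properties_text: str, level: str) -> str:
    """Replicate the sed substitution: rewrite every rootLogger=<LEVEL> entry."""
    return re.sub(r"rootLogger=[A-Z]+", f"rootLogger={level}", properties_text)

conf = "log4j.rootLogger=INFO, stdout"
print(set_root_logger(conf, "DEBUG"))  # → log4j.rootLogger=DEBUG, stdout
```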
- PLAIN login refused: user 'vcac-server_XXXX'- invalid credentials & RabbitMQ Unrecoverable Exception
Last week I was working on two such issues where the RabbitMQ configuration was messed up, though the circumstances which led to these problems were entirely different. In the first instance, RabbitMQ reached an unrecoverable state due to an outage. The second instance occurred because of an upgrade failure where an appliance was replaced with a new one and never joined the cluster properly, causing services not to register. In both scenarios we had to manually break the cluster and then create one again. The steps below will help you rebuild the cluster successfully, assuming you have 2 VAs in your vRA environment.

Note: Snapshots are a must before you perform these steps. Ensure you have valid backups as well.

Step 1 Stop all services on both nodes:
vcac-vami service-manage stop vco-server vcac-server horizon-workspace elasticsearch

Step 2 Bring down the rabbitmq monitor by running "service cluster-rabbitmq-monitor stop" on both nodes

Step 3 Bring down rabbitmq by running "service rabbitmq-server stop" on both nodes

Step 4 Take a backup of the folder /var/lib/rabbitmq* and store it in a safe location. Then erase the rabbitmq state by running "rm -rf /var/lib/rabbitmq/*" on both nodes

Step 5 REBOOT the appliances

Step 6 When the appliances are starting up, there will be a point where rabbitmq has to start. Ensure this starts properly

Step 7 Verify rabbitmq is running:
service rabbitmq-server status
Validate rabbitmq is running only on a single node:
rabbitmqctl cluster_status

Step 8 Prepare the first node for clustering by running "vcac-vami rabbitmq-cluster-config set-cluster-node"

Step 9 Execute the command on the first node to get the cluster info:
vcac-vami rabbitmq-cluster-config generate_join_cluster_variables
Note down the output; you will need it in the next step

Step 10 Execute the command on the second node, replacing USERNAME, PASSWORD, COOKIE, HOST, and USE_LONGNAME with their values from the previous step:
vcac-vami rabbitmq-cluster-config join-cluster 'USERNAME' 'PASSWORD' 'COOKIE' 'HOST' 'USE_LONGNAME'

Step 11 Validate that rabbitmq is now running in cluster mode on both nodes:
rabbitmqctl cluster_status

Step 12 Bring up the rabbitmq monitor by running the command below on both nodes:
service cluster-rabbitmq-monitor start

Step 13 Execute the command below to stop and start the services on both MASTER and REPLICA in a systematic manner. Start with your MASTER node first; once its services are coming back, move on to the REPLICA:
vcac-vami service-manage stop vco-server vcac-server horizon-workspace elasticsearch && vcac-vami service-manage start elasticsearch horizon-workspace hzn-dots vcac-server vco-server

*** I'll share an example with screenshots / a recording soon ***
- Accessing Custom Groups causes System exception in vRA UI
I was recently involved in a problem where a tenant admin, trying to access custom groups, encounters a system exception. The cause of this problem is unknown, but due to changes performed for specific users on the Active Directory side, vRA marks them as soft-deleted inside the Postgres database instead of completely removing them.

Note: To resolve this problem we have to modify vRA's Postgres database, so ensure a proper backup is taken. I would also advise taking snapshots of the vRA appliances. If in doubt or not comfortable executing these changes, contact VMware Support through a Support Request.

Step 0 Log in to vRA's database, either over SSH or using the PgAdmin tool. PgAdmin is recommended as it makes it easy to modify and update rows.

Step 1 Extract the user ids from the custom groups ("groupType"='DYNAMIC') in the Group table:
select array_agg(e::text::int) from saas."Group", json_array_elements("compositionRules"::json -> 'addedUserIds') as e where "groupType"='DYNAMIC';

Step 2 Take the user ids from Step 1 and query for soft-deleted users (paste the ids inside IN()):
select "idUser" from saas."Users" where "idUserStatus"=3 and "idUser" IN();

Step 3 Having all problematic user ids, query the Group table again to get the groups they are members of:
select "id" from saas."Group" where "compositionRules" ~* 'iduser1|idUser2';
As shown in the screenshot, id 1959 is the one which was soft-deleted from our environment, so the query becomes:
select "id" from saas."Group" where "compositionRules" ~* '1959';

Step 4 Now browse to the group identified above. Double-click on compositionRules; you can see user 1959 is part of this group. Edit this section to remove the id from the column and save it. As you can see, 1959 does not exist anymore.

Use PgAdmin to edit the groups from the list in Step 3 and remove the problematic user ids. If PgAdmin is not an option, load everything for a given group, remove the problematic user ids from the "compositionRules" column, and update the group back into the DB.

Perform a complete sync after this, and the exception should not be seen anymore.
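If you do the edit outside PgAdmin, the compositionRules change amounts to removing the soft-deleted id from the addedUserIds array. A minimal sketch, assuming compositionRules is a JSON document with an addedUserIds list (as the Step 1 query implies):

```python
import json

def remove_user(composition_rules: str, user_id: int) -> str:
    """Drop a soft-deleted user id from a group's compositionRules JSON."""
    rules = json.loads(composition_rules)
    rules["addedUserIds"] = [u for u in rules.get("addedUserIds", []) if u != user_id]
    return json.dumps(rules)

rules = '{"addedUserIds": [1959, 2001, 2002]}'
print(remove_user(rules, 1959))  # → {"addedUserIds": [2001, 2002]}
```

The cleaned JSON is what you would write back into the "compositionRules" column before running the complete sync.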
- Failed to save workflow schema, vRO HTML5 client exception
One of the major enhancements vRealize Automation 7.6 brings to the table is HTML5 functionality in vRO. To name a few of its capabilities:
- Build, import, edit, debug and execute workflows from the HTML5 interface
- Create, update and use workflow configuration elements
- Compare and restore different versions of the same workflow side by side
- Enhanced dashboard and permissions for the web client
- Improvements in workflow run troubleshooting

But in a couple of instances I did see a problem with this new HTML5 vRO client where it does not allow you to save workflows. The issue is caused by a bug in the GA version of this component. Let's see how to resolve it.

Step 1 Download the JGit CLI client:
curl -k --output jgit https://repo.eclipse.org/content/groups/releases//org/eclipse/jgit/org.eclipse.jgit.pgm/5.4.0.201906121030-r/org.eclipse.jgit.pgm-5.4.0.201906121030-r.sh

Step 2 Move jgit to /usr/bin, set it executable and change its owner to vco:
mv jgit /usr/bin/
chmod +x /usr/bin/jgit
chown vco:vco /usr/bin/jgit

Step 3 Remove the HEAD ref of master:
mv /storage/db/vcodata/git/__SYSTEM.git/refs/heads/master /var/lib/vco/master_ref.backup

Step 4 Check where HEAD should be:
cd /storage/db/vcodata/git/__SYSTEM.git
jgit log
Get the commit ref sha from the first line, e.g. 01bb5f58716dc12016b4fd8798c7fa8c91c76bf3

Step 5 Log in as vco:
su - vco

Step 6 Create the master ref:
echo '01bb5f58716dc12016b4fd8798c7fa8c91c76bf3' > /storage/db/vcodata/git/__SYSTEM.git/refs/heads/master

Step 7 Run jgit garbage collection:
cd /storage/db/vcodata/git/__SYSTEM.git
jgit gc

Step 8 Log in as root again and make sure everything in the git folder is owned by vco:vco:
exit
cd /storage/db/vcodata/git/__SYSTEM.git
chown -R vco:vco .

Step 9 Restart the vRO server:
service vco-server restart

Step 10 Create a workflow and then try to save it; it should work as expected
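Step 4 boils down to grabbing the 40-character commit sha from the first line of the log output. A small helper one might use for that (illustrative only; the sha in the sample is the example value from the article):

```python
import re

def first_commit_sha(log_output: str) -> str:
    """Pull the 40-hex commit id from the first 'commit ...' line of jgit/git log output."""
    match = re.search(r"commit\s+([0-9a-f]{40})", log_output)
    if not match:
        raise ValueError("no commit sha found in log output")
    return match.group(1)

sample = "commit 01bb5f58716dc12016b4fd8798c7fa8c91c76bf3\nAuthor: vco\n"
print(first_commit_sha(sample))  # → 01bb5f58716dc12016b4fd8798c7fa8c91c76bf3
```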
- Multiple/duplicate requests are created automatically after submitting a single XaaS blueprint
Note: The timestamps mentioned below are just examples.

When the user submits the request for the first time at [UTC:2019-08-19 23:49:05,475 Local:2019-08-20 11:49:05,475], catalina.out of vRA shows four submissions, each with a different execution id for the workflow:

[UTC:2019-08-19 23:49:11,573 Local:2019-08-20 11:49:11,573] vcac: [component="cafe:o11n-gateway" priority="INFO" thread="o11n-gateway-service-subscrAmqpExec-6" tenant="vsphere.local" context="Yz2eG2qO" parent="" token="ibo5TyOu"] com.vmware.vcac.o11n.gateway.service.impl.ExecutionStatusChecker.call:192 - Workflow 076852da-ba58-4da2-82c4-4a1c6ff5b9eb has been executed with executionId 0c0f1f1a-74d7-4d8b-ba99-39a664c178be
[UTC:2019-08-19 23:49:30,774 Local:2019-08-20 11:49:30,774] vcac: [component="cafe:o11n-gateway" priority="INFO" thread="o11n-gateway-service-subscrAmqpExec-2" tenant="vsphere.local" context="b6bDVSVw" parent="" token="eXptiSGd"] com.vmware.vcac.o11n.gateway.service.impl.ExecutionStatusChecker.call:192 - Workflow 076852da-ba58-4da2-82c4-4a1c6ff5b9eb has been executed with executionId 0274db79-3d68-4eb9-9e89-707b9e5e40ed
[UTC:2019-08-19 23:49:42,724 Local:2019-08-20 11:49:42,724] vcac: [component="cafe:o11n-gateway" priority="INFO" thread="o11n-gateway-service-subscrAmqpExec-4" tenant="vsphere.local" context="CnaBzAN7" parent="" token="BbiNoxQS"] com.vmware.vcac.o11n.gateway.service.impl.ExecutionStatusChecker.call:192 - Workflow 076852da-ba58-4da2-82c4-4a1c6ff5b9eb has been executed with executionId 6694050f-4c77-4c5a-8e80-921edda3d492
[UTC:2019-08-19 23:50:35,643 Local:2019-08-20 11:50:35,643] vcac: [component="cafe:o11n-gateway" priority="INFO" thread="o11n-gateway-service-subscrAmqpExec-4" tenant="vsphere.local" context="eWqLt26s" parent="" token="zLMn0Xqs"] com.vmware.vcac.o11n.gateway.service.impl.ExecutionStatusChecker.call:192 - Workflow 076852da-ba58-4da2-82c4-4a1c6ff5b9eb has been executed with executionId f6afefd9-86bc-4b4c-a8aa-abf54554f1e3

This is a bug which is reproducible in both vRA 7.5 and 7.6. The trigger is that the onAction() handler within the catalog request form fires every time the user clicks 'Submit', 'Previous' or 'Next'. The handler in turn calls the handleActionSuccess() method, which issues a POST call to the Catalog service with the request details. Since the handler is invoked on each of the three buttons, multiple POST requests end up being issued. The simple fix is to only invoke the POST call when the action is 'Submit', which avoids the duplicate requests. This is being fixed in an upcoming patch release for vRA 7.5 and 7.6
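The described fix can be sketched as a guard in the action handler. This is a Python stand-in for the UI logic (the real client code is not Python, and post_request here is a hypothetical callback representing the POST to the Catalog service):

```python
def on_action(action: str, request_details: dict, post_request) -> bool:
    """Only forward the catalog request to the Catalog service on 'Submit'.

    'Previous' and 'Next' must not trigger a POST, otherwise each click
    files another duplicate request.
    """
    if action != "Submit":
        return False
    post_request(request_details)
    return True

posted = []
on_action("Next", {"bp": "XaaS"}, posted.append)
on_action("Submit", {"bp": "XaaS"}, posted.append)
print(len(posted))  # → 1
```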
- NSX data collection unavailable
In preparation for using NSX network, security, and load balancing capabilities in vRealize Automation, we first have to create an NSX endpoint. I was asked to look into a problem where, even after creating the endpoint successfully with its association mapped, selecting data collection under Compute Resource did not show the Network and Security Inventory.

Looking at the logs after the NSX endpoint was created, we do see a data collection workitem being created, namely VCNSInventory.

Reference: ManagerService / All.log
[UTC:2019-09-03 10:29:48 Local:2019-09-03 15:59:48] [Debug]: [sub-thread-Id="45" context="" token=""] DC: Created data collection item, WorkflowInstanceId 183022, Task VCNSInventory, EntityID 8ed67519-99fb-4afa-811f-227e753a24eb, StatusID = 457b3af7-b739-45b2-ab9f-0cdd79596af0

Taking instance 183022 into consideration and inspecting the worker logs, the worker initialises the instance:

2019-09-03T10:29:49.962Z DC-DEM02 vcac: [component="iaas:DynamicOps.DEM.exe" priority="Trace" thread="4268"] [sub-thread-Id="27" context="" token=""] Worker Controller: initializing instance 183022 - vSphereVCNSInventory of the workflow execution unit
2019-09-03T10:29:52.009Z DC-DEM02 vcac: [component="iaas:DynamicOps.DEM.exe" priority="Trace" thread="4268"] [sub-thread-Id="27" context="" token=""] WorkflowExecutionUnit: initialize started: 183022
2019-09-03T10:30:14.401Z DC-DEM02 vcac: [component="iaas:DynamicOps.DEM.exe" priority="Trace" thread="4864"] [sub-thread-Id="28" context="" token=""] Workflow ID: 183022 Activity : State: Closed
2019-09-03T10:30:14.417Z DC-DEM02 vcac: [component="iaas:DynamicOps.DEM.exe" priority="Debug" thread="4864"] [sub-thread-Id="28" context="" token=""] Workflow Complete: 183022 - Successful
2019-09-03T10:30:14.417Z DC-DEM02 vcac: [component="iaas:DynamicOps.DEM.exe" priority="Trace" thread="4864"] [sub-thread-Id="28" context="" token=""] Worker Controller: WriteCompletedWorkflow

As shown above, the data collection ran and was marked as successful, but it never showed up in the UI. At this point, when we performed a Test Connection for an endpoint and clicked OK, the test connection succeeded but vRA was unable to save the endpoint. That's when I suspected something must be wrong with the endpoints table, and the assumption turned into confirmation after reviewing the API data captured in a HAR file.

Now that we knew something was definitely wrong with the endpoints, the query select * from ManagementEndpoints showed stale entries for all vSphere endpoints. Ideally there should be only one entry per vSphere endpoint in this table, but here we had two per endpoint. How do we identify which is the correct one and which ManagementEndpointId should be deleted? For this, grep vSphereAgent.log (the Proxy Agent log) for managementEndpointId. The managementEndpointId you find in the log is the correct one, and this entry must remain under ManagementEndpointID of the dbo.ManagementEndpoints table.

Example:
2019-09-09T03:54:23.466Z DC-AGENT01 vcac: [component="iaas:VRMAgent.exe" priority="Debug" thread="900"] [sub-thread-Id="6" context="" token=""] Ping Sent Successfully : []

Having cross-checked vSphereAgent.log against the ManagementEndpoints table, we knew which entries were correct and had to remove the stale ones. We took a backup of the SQL IaaS database along with snapshots, and then executed delete statements on the entries we identified as stale:

delete from dbo.ManagementEndpoints where ManagementEndpointID = 'E15DFAAE-229E-4874-AACB-793BDB6076F4';
delete from dbo.ManagementEndpoints where ManagementEndpointID = '03CACB31-23DD-444C-A493-8DDC8BC4E4CF';

But this did not solve our problem.
Removing the stale entries and then saving the endpoints threw a different exception this time. When you create an endpoint in vRA, it not only creates an entry in IaaS but also creates one inside vRA's Postgres database. We explored the table epconf_endpoint; it holds all endpoints created through the vRA UI, and the id from the Postgres database must match the ManagementEndpointId of the SQL (IaaS) database. Remember, those were the ids we deleted from SQL; the reason for "Endpoint with id [xxxxxx] is not found in iaas" is this discrepancy between IaaS and Postgres.

Updating the ids for the appropriate endpoints here in Postgres would resolve this data mismatch, but there is a catch. As you can see above, there is already an NSX endpoint created (which is, after all, what we are troubleshooting). Along with the NSX endpoint there is an association, and this association information is stored under the epconf_association table, which contains the id of the association plus:
- from_endpoint_id: your NSXEndpointId from the IaaS database and id from epconf_endpoint of your Postgres database
- to_endpoint_id: the mapping you create to one of the vSphere endpoints

Note: NSX endpoint information is stored inside the table [DynamicOps.VCNSModel].[VCNSEndpoints] of the IaaS database.

This is where we found the answer to our problem: the to_endpoint_id inside epconf_association was pointing to a wrong id. Both ids under epconf_endpoint had to be modified to the ones present under IaaS.

Remediation plan:
1. As a first step, we deleted the NSX endpoint from the vRA UI. This removed the entry from epconf_association, so there was no need to update that table anymore.
2. After removing the NSX endpoint from the UI, we moved on to epconf_endpoint to update the ids with the correct ones taken from the IaaS database.

Updating the vCenter endpoint:
update epconf_endpoint set id = 'e5b052e1-0792-465a-a2a8-6b8b031f48ac' where name = 'vCenter';
Updating the vCenter01 endpoint:
update epconf_endpoint set id = '5646fa1e-6a2b-4d08-9381-219fe6d92a5e' where name = 'vCenter01';

After we corrected the ids inside epconf_endpoint (Postgres) to match ManagementEndpointID (IaaS database), we were able to save endpoints successfully. After this, creating the NSX endpoint and mapping it to the correct vSphere endpoint resulted in a successful NSX data collection.

!! Hope this helps !!
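The consistency rule this article relies on — every epconf_endpoint id in Postgres must match a ManagementEndpointID in IaaS — can be sketched as a quick cross-check. This is illustrative only, with in-memory stand-ins for the two tables (in practice you would feed it rows fetched from Postgres and SQL Server):

```python
def find_mismatched_endpoints(postgres_rows, iaas_ids):
    """Return (name, id) pairs from epconf_endpoint whose id has no
    matching ManagementEndpointID in the IaaS database.

    GUIDs are compared case-insensitively, since SQL Server and Postgres
    may store them in different cases.
    """
    iaas = {i.lower() for i in iaas_ids}
    return [(row["name"], row["id"]) for row in postgres_rows
            if row["id"].lower() not in iaas]

postgres_rows = [
    {"name": "vCenter", "id": "e5b052e1-0792-465a-a2a8-6b8b031f48ac"},
    {"name": "vCenter01", "id": "0000dead-0000-0000-0000-000000000000"},  # stale id
]
iaas_ids = ["E5B052E1-0792-465A-A2A8-6B8B031F48AC",
            "5646FA1E-6A2B-4D08-9381-219FE6D92A5E"]
print(find_mismatched_endpoints(postgres_rows, iaas_ids))  # flags only vCenter01
```

Any endpoint this flags is a candidate for the epconf_endpoint update statements shown above.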




