Last week I was involved in a use case where our team was helping one of our customers onboard virtual machines to vRealize Automation 8.x.
Once a virtual machine is onboarded, an extensibility subscription triggers a vRO workflow that gets all of the machine's properties from vRealize Automation 7.x and then updates the machine's records in version 8.x.
But for some reason we were hitting an exception stating:
2022-06-02T08:16:07.454Z ERROR vco [host='vco-app-69b4969975-hvgbp' thread='WorkflowExecutorPool-Thread-65' user='vro-gateway-U9ITj3QActbfOqea' org='-' trace='-'] {|__SYSTEM|vro-gateway-U9ITj3QActbfOqea:Migration Set VM Properties :339411e8-ae83-4318-8e0f-924d3456a591:token=92f0de3f-07e7-475b-affd-3a2439d4f1d1} SCRIPTING_LOG - Workflow execution stack:
***
item: 'Migration Set VM Properties /item4', state: 'failed', business state: '', exception: 'Error: Unable to set properties for machine 2ef5658e-3a51-381d-9cfb-9d1ff3531095. Error: 404 Service not found: http://10.244.2.28:8282/resources/compute/<<resourceid>> (Dynamic Script Module name : updateMachineProperties#26)'
workflow: 'Migration Set VM Properties ' (339****91)
Going by the exception, it clearly states that it is unable to set the properties of the machine because the service (provisioning-service) returns a 404, which means there is some discrepancy in the machine's data/records.
Investigation
Based on the timestamp mentioned above, we start by looking at the provisioning service. Filtering on the trace id, we see just the five statements below:
2022-06-02T08:16:07.412Z WARN provisioning [host='provisioning-service-app-5d7b9467b7-lz5j9' thread='reactor-http-epoll-9' user='configadmin' org='<<replacedorgid>>' trace='9272***2986' parent='' span='c4e755***43e2'] c.v.i.api.common.ControllerRoutingUtil.handleOperationInternal:464 - [Username=configadmin][OrgID=<<replacedorgid>>] The request is missing mandatory query parameter apiVersion.
2022-06-02T08:16:07.414Z WARN provisioning [host='provisioning-service-app-5d7b9467b7-lz5j9' thread='xn-index-queries-19' user='configadmin' org='<<replacedorgid>>' trace='927****2986' parent='' span='c4e75****3e2'] c.v.xenon.common.ServiceErrorResponse.create:83 - message: Service not found: http://10.244.2.28:8282/resources/compute/<<resourceid>>, statusCode: 404, serverErrorId: ff2f34d3-bdab-4b1c-b4f5-378f3b46849d
2022-06-02T08:16:07.414Z WARN provisioning [host='provisioning-service-app-5d7b9467b7-lz5j9' thread='xn-index-queries-19' user='configadmin' org='<<replacedorgid>>' trace='92724***72986' parent='' span='c4e75***43e2'] c.v.a.s.w.p.x.c.StatefulServiceController.lambda$loadService$20:704 - Failed to start service /resources/compute/<<resourceid>> with 404 status code.
2022-06-02T08:16:07.414Z WARN provisioning [host='provisioning-service-app-5d7b9467b7-lz5j9' thread='xn-index-queries-19' user='configadmin' org='<<replacedorgid>>' trace='927****72986' parent='' span='c4e75***3e2'] c.v.xenon.common.ServiceErrorResponse.create:83 - message: Service not found: http://10.244.2.28:8282/resources/compute/<<resourceid>>, statusCode: 404, serverErrorId: 1a207e42-2daa-48dd-bc4b-6c5f7599da47
2022-06-02T08:16:07.416Z INFO provisioning [host='provisioning-service-app-5d7b9467b7-lz5j9' thread='reactor-http-epoll-9' user='configadmin' org='<<replacedorgid>>' trace='92724***2986' parent='' span='c4e755***43e2'] reactor.netty.http.server.AccessLog.info:270 - 10.244.2.11 - - [02/Jun/2022:08:16:07 +0000] "PATCH /iaas/api/machines/<<resourceid>> HTTP/1.1" 404 269 8282 31 ms
The orgID and resourceid have been replaced for security reasons.
As we can see, we get a 404, which means the resource isn't found when the following PATCH call is made:
PATCH /iaas/api/machines/<<resourceid>> HTTP/1.1" 404
Now we know that there is an issue somewhere with the resource data, so let's log in to the database and see if we can find a clue.
Note: search_columns is a custom function. It can be created by executing the code below in Postgres:
CREATE OR REPLACE FUNCTION search_columns(
    needle text,
    haystack_tables name[] default '{}',
    haystack_schema name[] default '{public}'
)
RETURNS table(schemaname text, tablename text, columnname text, rowctid text)
AS $$
begin
  -- Walk every column of every base table in the given schema(s)
  FOR schemaname,tablename,columnname IN
    SELECT c.table_schema, c.table_name, c.column_name
    FROM information_schema.columns c
    JOIN information_schema.tables t ON
      (t.table_name = c.table_name AND t.table_schema = c.table_schema)
    WHERE (c.table_name = ANY(haystack_tables) OR haystack_tables = '{}')
      AND c.table_schema = ANY(haystack_schema)
      AND t.table_type = 'BASE TABLE'
  LOOP
    -- Grab the ctid of a row whose column value matches the needle, if any
    EXECUTE format('SELECT ctid FROM %I.%I WHERE cast(%I as text)=%L',
      schemaname,
      tablename,
      columnname,
      needle
    ) INTO rowctid;
    IF rowctid is not null THEN
      RETURN NEXT;
    END IF;
  END LOOP;
END;
$$ language plpgsql;
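A side note on the function's signature: the two optional array parameters let you restrict the scan rather than walking every base table in the public schema. A minimal sketch, reusing the catalog tables that show up later in this post:

-- Limit the scan to specific catalog tables to speed the search up
select * from search_columns('nameoftheresourceonboarded', '{dep_request_event,dep_resource}');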
After the function is created, we will use it to identify which tables contain the onboarded resource, searching by the machine name:
select * from search_columns('nameoftheresourceonboarded');
We will connect to the catalog database and execute the queries. The result returned would look something like this:

Taking this result into consideration, let's search both tables. Starting with dep_request_event, we get 4 records: 2 ONBOARDED records and 2 ACTION records.
select * from dep_request_event where resource_name = 'nameoftheresourceonboarded';

Two ONBOARDED records indicate that the machine was onboarded twice; note the difference in the timestamps.
The next query searches the dep_resource table:
select id,name,deployment_id,summary_properties,origin,provider_resource_name,accounts from dep_resource where name = 'nameoftheresourceonboarded';

So we have two records: one for DISCOVERED and one for ONBOARDED.
Keep the ids from this table handy. For security reasons I've wiped them out, but the ids here should map to document_self_link in the compute_state table of the provisioning database.
Let's now connect to the provisioning database and identify which tables the resource name is part of:
select * from search_columns('nameoftheresourceonboarded');
The result would be as shown below:

Keep this in mind when we review the records in the compute_description table:
select document_self_link,document_update_action,TO_TIMESTAMP(document_update_time_micros/1000000),id,TO_TIMESTAMP(document_creation_time_micros/1000000) from compute_description where name = 'nameoftheresourceonboarded';

There is one record for DELETE and another for PATCH.
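As a quick sanity check, a hedged sketch that reuses only the compute_description columns from the query above can count documents per update action:

-- Count compute_description documents per update action for this machine
select document_update_action, count(*)
from compute_description
where name = 'nameoftheresourceonboarded'
group by document_update_action;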
Next, let's check the compute_state table:
select document_self_link,document_update_action,TO_TIMESTAMP(document_update_time_micros/1000000),id,TO_TIMESTAMP(document_creation_time_micros/1000000) from compute_state where name = 'nameoftheresourceonboarded';

Just one record is returned. What's important to understand here is that the id from dep_resource should match the document_self_link of the ONBOARDED resource.
But what we observed is that it is matched against the DISCOVERED resource's document_self_link instead. See the snippet below:
2022-06-02T08:15:37.690Z INFO tango-blueprint [host='tango-blueprint-service-app-5b46464947-xtkw8' thread='http-nio-8080-exec-11' user='relocation-8d1Zeo8aFltm4u8j(configadmin)' org='replacedorgid' project='2f***82' deployment='a68***270' trace='ad9***78744'] com.vmware.tango.blueprint.service.BlueprintDeploymentService - Sending read tile operation for resource 2ef5***095 with resource link /resources/compute/284ac****aaca9
Ideally it should look like this:
Sending read tile operation for resource 2ef5***095 with resource link /resources/compute/2ef5***095
So we now know that there is no record in the database for
/resources/compute/2ef5***095
This is the reason for the 404 exception seen in the workflow snippet at the beginning.
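To confirm this from the provisioning database, a hedged verification query (the placeholder below stands for the ONBOARDED id taken from dep_resource, which is not reproduced here) is:

-- Expect the ONBOARDED machine's document; zero rows reproduces the 404
select document_self_link, name, document_update_action
from compute_state
where document_self_link = '/resources/compute/<id-from-dep_resource>';

In our case this returned no rows, while the DISCOVERED self link did return the single record shown above, which is exactly the mismatch behind the failing PATCH /iaas/api/machines call.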
Remediation
When onboarding a machine, a duplicate entry for the machine is created in the resource center, with one entry in the 'discovered' state and one in the 'onboarded' state. This is a regression caused by a fix for an onboarding failure when machines have legacy IDs in the provisioning service inventory.
Applying KB 88162 should fix this issue. Once done, we should be able to onboard machines as expected, and no duplicate records should be created.
Note: Ensure a snapshot is taken from vRSLCM before implementing the patch mentioned in the KB Article
After the fix is implemented, offboard the machines that were onboarded previously; in our case the count was very low.
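To identify which machines still carry duplicate entries and therefore need to be offboarded and onboarded again, a simple sketch against the catalog database (using only the dep_resource table queried earlier) is:

-- Machines with more than one dep_resource entry (e.g. DISCOVERED + ONBOARDED)
select name, count(*) as entries
from dep_resource
group by name
having count(*) > 1;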
This issue has also been documented in the release notes.