
Search Results


  • Duplicate VM gets detected but original VM gets deleted when a Content Library template is used

    We had a use case where a user accidentally passed, through vRA 8.x, a VM name which was already provisioned and existing on vCenter. The moment this was done, the request was expected to fail anyway, detecting that there is a VM with the same name already present on the endpoint:

2021-04-05T00:03:29.934Z ERROR provisioning [host='provisioning-service-app-67b88d497d-w7cvq' thread='vsphere-io-72' user='' org='' trace='' parent='' span=''] c.v.p.c.m.a.v.VSphereAdapterInstanceService.log:453 - [8282/provisioning/vsphere/instance-adapter] Error in createInstanceAsync
[com.vmware.photon.controller.model.adapters.vsphere.vapi.RpcException: Cannot deploy library item 1f19595c-1770-4288-b76c-d44b0766d232 ( com.vmware.vdcs.util.duplicate_name An object of type "ManagedObjectReference: type = VirtualMachine, value = vm-45, serverGuid = ae737187-1a20-4c5c-a1cd-9d7b3f7e1d8c" named "vcdrphoton" already exists.)
at com.vmware.photon.controller.model.adapters.vsphere.vapi.VapiClient.throwIfError(VapiClient.java:261)
at com.vmware.photon.controller.model.adapters.vsphere.vapi.LibraryClient.deployOvfLibItem(LibraryClient.java:453)
at com.vmware.photon.controller.model.adapters.vsphere.InstanceClient.createInstanceFromLibraryItem(InstanceClient.java:4309)
at com.vmware.photon.controller.model.adapters.vsphere.VSphereAdapterInstanceService.createInstanceAsync(VSphereAdapterInstanceService.java:498)
at com.vmware.photon.controller.model.adapters.vsphere.VSphereAdapterInstanceService.lambda$handleCreateInstanceAsync$11(VSphereAdapterInstanceService.java:351)
at java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072)
at java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834) ]

Example Blueprint:

formatVersion: 1
inputs: {}
resources:
  VirtualMachine:
    type: Cloud.vSphere.Machine
    properties:
      name: testone
      image: photonfour
      flavor: small
      networks:
        - network: '${resource.Network.id}'
          assignment: static
  Network:
    type: Cloud.vSphere.Network
    properties:
      networkType: existing

If this issue occurs when using a normal vSphere template, there is no impact. But if you are using a Content Library based template, there is a risk that the original virtual machine with the same name gets deleted. For vCenter 7.x, please upgrade your vRealize Automation version to 8.3 Patch 1. For vCenter 6.x, the fix will be part of vRealize Automation 8.4 Patch 1. All we need to ensure is that we do not pass an existing virtual machine name when using Content Library templates.
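Until you are on a fixed version, the safest guard is a pre-request check on the requested name. A minimal sketch of such a guard (the helper name and the in-memory name list are hypothetical; in practice the existing names would come from a vCenter inventory query):

```python
def find_conflicting_vm_name(requested_name, existing_vm_names):
    """Return the existing VM name that conflicts with the requested one, or None.

    Comparison is case-insensitive as a conservative check; adjust it to match
    how your vCenter actually treats name collisions.
    """
    wanted = requested_name.strip().lower()
    for name in existing_vm_names:
        if name.strip().lower() == wanted:
            return name
    return None


# Usage: refuse the request up front instead of risking the original VM.
existing = ["vcdrphoton", "testone"]
conflict = find_conflicting_vm_name("VCDRphoton", existing)
if conflict is not None:
    print(f'Refusing request: a VM named "{conflict}" already exists on the endpoint')
```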

  • Validations not being honored for Day-2 actions on Custom Resources

    We have a workflow in vRO called "Add Firewall Rules" (or anything, for that matter). In the input form we have set a validation on one of the fields to check for any sort of duplication when the user enters a value; this is achieved by using a validation action in the background. Now, when we use this same workflow as a Day-2 action on a Custom Resource, we do not see the same validation pane while editing the request parameters. This is not a bug but expected behavior: external validations are not yet supported for Day-2 actions.
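For context, the duplicate check behind such a validation action is simple. A sketch of the logic in Python (vRO actions themselves are written in JavaScript, and the function name here is illustrative; as I understand the convention, an external validation returns an error message when invalid and nothing when valid):

```python
def validate_unique(value, existing_values):
    """Duplicate check in the style of a vRO external-validation action:
    returns an error message string when the value already exists, else None."""
    if value in existing_values:
        return f'"{value}" already exists; please enter a unique value'
    return None
```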

  • Behind the Scenes: vRealize Automation Upgrade 8.2 to 8.3

    I delivered a training recently about the process which occurs in the background when we upgrade vRealize Automation 8.2 to 8.3, and thought it would be useful for everyone, hence this blog.

As a first step, we need to keep the updaterepo ready and downloaded. Then take a manual snapshot; once the snapshot task is complete, we can proceed with the upgrade. Before we proceed, we need to perform an inventory sync - unless this is done, the PROCEED option will not be available. Once done, we need to run the precheck (the whole precheck PDF is attached here). When you click on NEXT, check the repo URL and then submit; the upgrade starts. The upgrade successfully completes in about 1.5 hours. Here's the list of log files which are to be monitored during the upgrade.

BEHIND THE SCENES

The first thing which happens when you click on upgrade is that we download the manifest, and this is logged under upgrade-noop.log:

[INFO][2021-03-30 06:20:47][labvra.nukescloud.com] Downloading manifest ...
[INFO][2021-03-30 06:20:48][labvra.nukescloud.com] Manifest downloaded successfully.
% Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
* Trying 10.109.45.33...
* TCP_NODELAY set
* Connected to reflcm.nukescloud.com (10.109.45.33) port 443 (#0)
* ALPN, offering http/1.1
* Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
* ignoring certificate verify locations due to disabled peer verification
* TLSv1.2 (OUT), TLS header, Certificate Status (22): } [5 bytes data]
* TLSv1.2 (OUT), TLS handshake, Client hello (1): } [512 bytes data]
* * * *
* Connection #0 to host reflcm.nukescloud.com left intact
[INFO][2021-03-30 06:20:48][labvra.nukescloud.com] Validating manifest ...
[INFO][2021-03-30 06:20:48][labvra.nukescloud.com] Manifest signature validated successfully.
[INFO][2021-03-30 06:20:48][labvra.nukescloud.com] Repository product ID matched successfully.
[INFO][2021-03-30 06:20:48][labvra.nukescloud.com] Manifest validated successfully.
[INFO][2021-03-30 06:20:48][labvra.nukescloud.com] Searching for update bootstrap RPM ...
[INFO][2021-03-30 06:20:48][labvra.nukescloud.com] Update bootstrap RPM found.
[INFO][2021-03-30 06:20:48][labvra.nukescloud.com] Update bootstrap RPM downloaded successfully.
* * *
* SSL certificate verify result: self signed certificate (18), continuing anyway.
} [5 bytes data]
> GET /repo/productBinariesRepo/vra/8.3.0/upgrade/update/package-pool/bootstrap-update-8.3.0.15014-1.noarch.rpm HTTP/1.1
> Host: reflcm.nukescloud.com
> User-Agent: curl/7.59.0
> Accept: */*
* *
< { [15833 bytes data]
100 64516 0 64516 0 0 2250k 0 --:--:-- --:--:-- --:--:-- 2250k
* Connection #0 to host reflcm.nukescloud.com left intact

Then the actual upgrade starts, logged under upgrade-YYYY-MM-DD-HH-MM-SS.log.

1. Searches for the bootstrap package and checks that the SSH port is open and it is able to connect

[INFO][2021-03-30 06:20:56][labvra.nukescloud.com] Retrieving product version...
[INFO][2021-03-30 06:20:56][labvra.nukescloud.com] Product version retrieved successfully.
[INFO][2021-03-30 06:20:56][labvra.nukescloud.com] Searching for bootstrap package for product version 8.2.0.12946 in /var/vmware/prelude/upgrade/bootstrap/rel801 ...
[INFO][2021-03-30 06:20:56][labvra.nukescloud.com] Searching for bootstrap package for product version 8.2.0.12946 in /var/vmware/prelude/upgrade/bootstrap/rel801-hf345 ...
[INFO][2021-03-30 06:20:56][labvra.nukescloud.com] Searching for bootstrap package for product version 8.2.0.12946 in /var/vmware/prelude/upgrade/bootstrap/rel810-gap12 ...
[INFO][2021-03-30 06:20:57][labvra.nukescloud.com] Applying configuration profile lcm.
[INFO][2021-03-30 06:20:57][labvra.nukescloud.com] Configuration profile lcm applied successfully.
[INFO][2021-03-30 06:20:58][labvra.nukescloud.com] Retrieving nodes status...
[INFO][2021-03-30 06:20:58][labvra.nukescloud.com] Nodes status retrieval succeeded.
[INFO][2021-03-30 06:20:58][labvra.nukescloud.com] Processing nodes data...
[INFO][2021-03-30 06:20:58][labvra.nukescloud.com] Nodes data processing succeeded.
[INFO][2021-03-30 06:20:58][labvra.nukescloud.com] Verifying nodes...
[INFO][2021-03-30 06:20:58][labvra.nukescloud.com] Nodes verification completed successfully.
[INFO][2021-03-30 06:20:58][labvra.nukescloud.com] Retrieving services status...
[INFO][2021-03-30 06:20:59][labvra.nukescloud.com] Services status retrieval succeeded.
[INFO][2021-03-30 06:21:00][labvra.nukescloud.com] Processing services data...
[INFO][2021-03-30 06:21:18][labvra.nukescloud.com] Services data processing succeeded.
[INFO][2021-03-30 06:21:18][labvra.nukescloud.com] Verifying services...
[INFO][2021-03-30 06:21:19][labvra.nukescloud.com] Services verification completed successfully.
[INFO][2021-03-30 06:21:21][labvra.nukescloud.com] Creating SSH key pairs
[INFO][2021-03-30 06:21:21][labvra.nukescloud.com] SSH key pairs created successfully.
[INFO][2021-03-30 06:21:21][labvra.nukescloud.com] Configuring SSH on nodes.
[INFO][2021-03-30 06:21:22][labvra.nukescloud.com] Service sshd status: active/enabled
[INFO][2021-03-30 06:21:22][labvra.nukescloud.com] SSH port 22 is open.
[INFO][2021-03-30 06:21:22][labvra.nukescloud.com] SSH configurated on nodes successfully.
[INFO][2021-03-30 06:21:22][labvra.nukescloud.com] Running remote command: /opt/scripts/upgrade/ssh-noop.sh at host: labvra.nukescloud.com
[INFO][2021-03-30 06:21:23][labvra.nukescloud.com] Remote command succeeded: /opt/scripts/upgrade/ssh-noop.sh at host: labvra.nukescloud.com
[INFO][2021-03-30 06:21:23][labvra.nukescloud.com] Verifying remote nodes are able to connect to one another and to this node.
[INFO][2021-03-30 06:21:23][labvra.nukescloud.com] Verification that nodes are able to connect to one another and to this node succeeded.

2. Retrieves the product version on all nodes (in this environment we just have one node)

[INFO][2021-03-30 06:21:24][labvra.nukescloud.com] Retriving product versions on all nodes.
[INFO][2021-03-30 06:21:24][labvra.nukescloud.com] Running remote command: /opt/scripts/upgrade/vami-save-vers.sh prep at host: labvra.nukescloud.com
[INFO][2021-03-30 06:21:25][labvra.nukescloud.com] Retrieving product version.
[INFO][2021-03-30 06:21:25][labvra.nukescloud.com] Product version retrieved successfully.
[INFO][2021-03-30 06:21:25][labvra.nukescloud.com] Remote command succeeded: /opt/scripts/upgrade/vami-save-vers.sh prep at host: labvra.nukescloud.com
[INFO][2021-03-30 06:21:25][labvra.nukescloud.com] Product versions successfully retrieved from all nodes.
[INFO][2021-03-30 06:21:25][labvra.nukescloud.com] Checking that product versions match across all nodes.
[INFO][2021-03-30 06:21:25][labvra.nukescloud.com] Product versions match across all nodes verified.

3. Performs an infrastructure health check

[INFO][2021-03-30 06:21:26][labvra.nukescloud.com] Checking appliance nodes infrastructure health
[INFO][2021-03-30 06:21:26][labvra.nukescloud.com] Running remote command: /opt/health/run-once.sh health at host: labvra.nukescloud.com
[INFO][2021-03-30 06:21:31][labvra.nukescloud.com] Remote command succeeded: /opt/health/run-once.sh health at host: labvra.nukescloud.com
[INFO][2021-03-30 06:21:31][labvra.nukescloud.com] Infrastructure health check passed on all appliance nodes.

4. Sets up the update repository on all nodes and verifies that it is accessible across all nodes

[INFO][2021-03-30 06:21:33][labvra.nukescloud.com] Setting up update repository on all nodes.
[INFO][2021-03-30 06:21:33][labvra.nukescloud.com] Running remote command: vamicli update --repo https://reflcm.nukescloud.com/repo/productBinariesRepo/vra/8.3.0/upgrade/update at host: labvra.nukescloud.com
[INFO][2021-03-30 06:21:34][labvra.nukescloud.com] Remote command succeeded: vamicli update --repo https://reflcm.nukescloud.com/repo/productBinariesRepo/vra/8.3.0/upgrade/update at host: labvra.nukescloud.com
[INFO][2021-03-30 06:21:34][labvra.nukescloud.com] Update repository set successfully on all nodes.
[INFO][2021-03-30 06:21:34][labvra.nukescloud.com] Verifying the access to update repository on all nodes.
[INFO][2021-03-30 06:21:34][labvra.nukescloud.com] Running remote command: /opt/scripts/upgrade/vami-config-repo.sh 'local' at host: labvra.nukescloud.com
[INFO][2021-03-30 06:21:34][labvra.nukescloud.com] Verifying the access to the update repository.
[INFO][2021-03-30 06:21:36][labvra.nukescloud.com] Access to the update repository verified.
[INFO][2021-03-30 06:21:36][labvra.nukescloud.com] Verifying availability of new product version in the repository.
[INFO][2021-03-30 06:21:36][labvra.nukescloud.com] Availability of new version in the repository verified.
[INFO][2021-03-30 06:21:36][labvra.nukescloud.com] Remote command succeeded: /opt/scripts/upgrade/vami-config-repo.sh 'local' at host: labvra.nukescloud.com
[INFO][2021-03-30 06:21:36][labvra.nukescloud.com] Update repository is verified successfully on all nodes.

5. When the above vami-config-repo.sh runs, the repository is set, as logged under /opt/vmware/var/log/vami/vami.log:

30/03/2021 06:21:34 [INFO] Setting local repository address. url=https://reflcm.nukescloud.com/repo/productBinariesRepo/vra/8.3.0/upgrade/update, username=, password=
30/03/2021 06:21:34 [INFO] Checking available updates. jobid=1
30/03/2021 06:21:34 [INFO] Downloading latest manifest.
jobId=1, url=https://reflcm.nukescloud.com/repo/productBinariesRepo/vra/8.3.0/upgrade/update, username=, password=
30/03/2021 06:21:35 [INFO] Using download cache
30/03/2021 06:21:35 [INFO] Downloading file. URL: https://reflcm.nukescloud.com/repo/productBinariesRepo/vra/8.3.0/upgrade/update/manifest/manifest-latest.xml
30/03/2021 06:21:35 [INFO] Using download cache
30/03/2021 06:21:35 [INFO] Downloading file. URL: https://reflcm.nukescloud.com/repo/productBinariesRepo/vra/8.3.0/upgrade/update/manifest/manifest-latest.xml.sig
30/03/2021 06:21:35 [INFO] Signature script output: Verified OK
30/03/2021 06:21:35 [INFO] Signature script output:
30/03/2021 06:21:35 [INFO] Manifest signature verification passed
30/03/2021 06:21:35 [INFO] New manifest. installed-version=8.2.0.12946, downloaded-version=8.3.0.15014

6. Starts shutting down the application and infrastructure services

[INFO][2021-03-30 06:21:38][labvra.nukescloud.com] Shutting down application services
[INFO][2021-03-30 06:23:58][labvra.nukescloud.com] Application services shut down successfully.
[INFO][2021-03-30 06:23:58][labvra.nukescloud.com] Shutting down infrastructure services
[INFO][2021-03-30 06:26:27][labvra.nukescloud.com] Infrastructure services shut down successfully.
[INFO][2021-03-30 06:26:29][labvra.nukescloud.com] Saving restore points on all nodes.
[INFO][2021-03-30 06:26:29][labvra.nukescloud.com] Running remote command: /opt/scripts/upgrade/rstp-save.sh 'local' sys-config sys-config at host: labvra.nukescloud.com
[INFO][2021-03-30 06:26:34][labvra.nukescloud.com] Saving local restore points.
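As the vami.log entries above show, the manifest and its detached signature are always fetched from fixed paths under the configured repo URL. A small sketch deriving those URLs (the helper name is mine; the path layout is taken from the log lines, not from any documented API):

```python
def manifest_urls(repo_url):
    """Derive the manifest-latest.xml and .sig URLs that vami downloads
    from the configured update repository (layout observed in vami.log)."""
    base = repo_url.rstrip("/")
    manifest = f"{base}/manifest/manifest-latest.xml"
    return manifest, f"{manifest}.sig"


repo = "https://reflcm.nukescloud.com/repo/productBinariesRepo/vra/8.3.0/upgrade/update"
manifest, signature = manifest_urls(repo)
```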
========================= [2021-03-30 06:24:03.425+0000] Waiting for deploy healthcheck =========================
========================= [2021-03-30 06:24:09.389+0000] Waiting for command execution pods =========================
========================= [2021-03-30 06:24:15.420+0000] Tear down existing deployment =========================
========================= [2021-03-30 07:24:29.787+0000] Waiting for command execution pods =========================
========================= [2021-03-30 07:24:32.368+0000] Tear down existing deployment =========================

7. Saves a local restore point for Kubernetes, processes node data, and then activates monitors on all nodes

[INFO][2021-03-30 06:26:34][labvra.nukescloud.com] Saving local restore point for Kubernetes object prelude-vaconfig.
[INFO][2021-03-30 06:26:35][labvra.nukescloud.com] Local restore point for Kubernetes object prelude-vaconfig saved successfully.
Flag --export has been deprecated, This flag is deprecated and will be removed in future.
[INFO][2021-03-30 06:26:35][labvra.nukescloud.com] Local restore points saved successfully.
[INFO][2021-03-30 06:26:35][labvra.nukescloud.com] Restore points on all nodes saved successfully.
[INFO][2021-03-30 06:28:25][labvra.nukescloud.com] Retrieving nodes status...
[INFO][2021-03-30 06:28:25][labvra.nukescloud.com] Nodes status retrieval succeeded.
[INFO][2021-03-30 06:28:25][labvra.nukescloud.com] Processing nodes data...
[INFO][2021-03-30 06:28:25][labvra.nukescloud.com] Nodes data processing succeeded.
[INFO][2021-03-30 06:28:25][labvra.nukescloud.com] Verifying nodes...
[INFO][2021-03-30 06:28:25][labvra.nukescloud.com] Nodes verification completed successfully.
[INFO][2021-03-30 06:28:27][labvra.nukescloud.com] Activating local monitors on all nodes.
[INFO][2021-03-30 06:28:27][labvra.nukescloud.com] Running remote command: /opt/scripts/upgrade/mon-activate.sh at host: labvra.nukescloud.com
[INFO][2021-03-30 06:28:27][labvra.nukescloud.com] Activating upgrade monitor on the node

8. Once the upgrade monitor is activated on the node and the cluster nodes are removed successfully, it proceeds with the installation. Remember, this is a simple install.

[INFO][2021-03-30 06:28:27][labvra.nukescloud.com] Upgrade monitor activated successfully on the node.
[INFO][2021-03-30 06:28:27][labvra.nukescloud.com] Remote command succeeded: /opt/scripts/upgrade/mon-activate.sh at host: labvra.nukescloud.com
[INFO][2021-03-30 06:28:27][labvra.nukescloud.com] Local monitors successfully activated on all nodes.
[INFO][2021-03-30 06:28:28][labvra.nukescloud.com] Removing nodes from the cluster
[INFO][2021-03-30 06:28:28][labvra.nukescloud.com] Running remote command: vracli cluster leave at host: labvra.nukescloud.com
[INFO][2021-03-30 06:44:26][labvra.nukescloud.com] Cluster nodes removed successfully

9. The installation of the updates starts

[INFO][2021-03-30 06:46:01][labvra.nukescloud.com] Starting installation of updates.
[INFO][2021-03-30 06:46:01][labvra.nukescloud.com] Update installation started successfully.
[INFO][2021-03-30 06:47:01][labvra.nukescloud.com] VAMI checking upgrade progress
[INFO][2021-03-30 06:47:01][labvra.nukescloud.com] VAMI upgrade in progress.
[INFO][2021-03-30 06:47:01][labvra.nukescloud.com] VAMI upgrade still in progress.

10. Just a second after "Starting installation of updates" appears under upgrade-xx-xx-xx-xx.log, you would see activity under vami.log:

30/03/2021 06:46:02 [INFO] Installing updates. instanceid=VMware:VMware_8.3.0.15014, jobid=2
30/03/2021 06:46:02 [INFO] Installing update.
instanceId=VMware:VMware_8.3.0.15014, jobId=2, url=https://reflcm.nukescloud.com/repo/productBinariesRepo/vra/8.3.0/upgrade/update, username=, password=
30/03/2021 06:46:02 [INFO] Downloading and installing update packages
30/03/2021 06:46:02 [INFO] Using download cache
30/03/2021 06:46:02 [INFO] Downloading file. URL: https://reflcm.nukescloud.com/repo/productBinariesRepo/vra/8.3.0/upgrade/update/manifest/manifest-latest.xml
30/03/2021 06:46:02 [INFO] Using download cache
30/03/2021 06:46:02 [INFO] Downloading file. URL: https://reflcm.nukescloud.com/repo/productBinariesRepo/vra/8.3.0/upgrade/update/manifest/manifest-latest.xml.sig
30/03/2021 06:46:02 [INFO] Signature script output: Verified OK
30/03/2021 06:46:02 [INFO] Signature script output:
30/03/2021 06:46:02 [INFO] Manifest signature verification passed
30/03/2021 06:46:02 [INFO] Creating /opt/vmware/var/lib/vami/update/data/update_progress.json
30/03/2021 06:46:02 [INFO] Downloading the following packages for update version 8.3.0.15014
30/03/2021 06:46:02 [INFO] package NEW PACKAGE : VMware-Log-Insight-Agent noarch (none) 8.3.0 17489324 /package-pool/VMware-Log-Insight-Agent-8.3.0-17489324.noarch.rpm rpm 9457625 e744601203ec01d0d637453260324b7b35a872ce
30/03/2021 06:46:02 [INFO] package UPDATE VERSION: abx-common noarch (none) 8.3.0.15014 1 /package-pool/abx-common-8.3.0.15014-1.noarch.rpm rpm 43780508 c134c5476bd805293bdfaf45dc9b6bd9e06bde63
30/03/2021 06:46:02 [INFO] package UPDATE VERSION: bindutils x86_64 (none) 9.16.6 1.ph2 /package-pool/bindutils-9.16.6-1.ph2.x86_64.rpm rpm 3584466 67d101f6718a81a2abc3dfd9b54381cac6948b60
30/03/2021 06:46:02 [INFO] package UPDATE VERSION: bootstrap-prelude noarch (none) 8.3.0.15014 1 /package-pool/bootstrap-prelude-8.3.0.15014-1.noarch.rpm rpm 2629839 14c2906620ddca6690e13867f1e385851b9e06d8
30/03/2021 06:46:02 [INFO] package UPDATE VERSION: cifs-utils x86_64 (none) 6.7 2.ph2 /package-pool/cifs-utils-6.7-2.ph2.x86_64.rpm rpm 38568 2eea29de217fc837792d0c8db9848d240c3b18d5
30/03/2021 06:46:02 [INFO] package NEW PACKAGE : config-xl noarch (none) 8.3.0.15014 1 /package-pool/config-xl-8.3.0.15014-1.noarch.rpm rpm 14255 c32254fd9e2c1074ea1b5e33e9c72af5350c27e4
30/03/2021 06:46:02 [INFO] package UPDATE VERSION: containerd x86_64 (none) 1.2.14 1.ph2 /package-pool/containerd-1.2.14-1.ph2.x86_64.rpm rpm 24221328 227d22a05417f4286e0247ca87ab415f0b095231
* * *
30/03/2021 06:48:34 [INFO] package UPDATE VERSION: vro-migrate noarch (none) 8.3.0.1611732871 17522798 /package-pool/vro-migrate-8.3.0.1611732871-17522798.noarch.rpm rpm 4984 eb517af86432ea5b752627335696eec3509a052c
30/03/2021 06:48:34 [INFO] Install packages set: '/opt/vmware/var/lib/vami/update/data/package-pool/package-pool/VMware-Log-Insight-Agent-8.3.0-17489324.noarch.rpm' '/opt/vmware/var/lib/vami/update/data/package-pool/package-pool/abx-common-8.3.0.15014-1.noarch.rpm' '/opt/vmware/var/lib/vami/update/data/package-pool/package-pool/bindutils-9.16.6-1.ph2.x86_64.rpm' '/opt/vmware/var/lib/vami/update/data/package-pool/package-pool/bootstrap-prelude-8.3.0.15014-1.noarch.rpm' '/opt/vmware/var/lib/vami/update/data/package-pool/package-pool/cifs-utils-6.7-2.ph2.x86_64.rpm' * * vmware-studio-appliance-config_3.0.0.7-210201212046.noarch.rpm' '/opt/vmware/var/lib/vami/update/data/package-pool/package-pool/vmware-studio-init_3.0.0.7-210201212046.noarch.rpm' '/opt/vmware/var/lib/vami/update/data/package-pool/package-pool/vro-migrate-8.3.0.1611732871-17522798.noarch.rpm'
30/03/2021 06:48:34 [INFO] Reboot required when installing linux package
30/03/2021 06:48:34 [INFO] Created reboot required file
30/03/2021 06:48:34 [INFO] Update 8.3.0.15014 manifest is set to be installed
30/03/2021 06:48:34 [INFO] Using update post-install script
30/03/2021 06:48:34 [INFO] Running updatecli to install updates.
command={ mkdir -p /usr/share/update-notifier ; ln -s /opt/vmware/share/vami/vami_notify_reboot_required /usr/share/update-notifier/notify-reboot-required ; /opt/vmware/share/vami/update/updatecli '/opt/vmware/var/lib/vami/update/data/job/2' '8.2.0.12946' '8.3.0.15014' ; mv /opt/vmware/var/lib/vami/update/data/update_progress.json /opt/vmware/var/lib/vami/update/data/.update_progress.json ; } >> /opt/vmware/var/log/vami/updatecli.log 2>&1 &
30/03/2021 06:48:36 [INFO] Installation running in the background

11. Once the installation starts in the background, this is the time we need to monitor /var/log/bootstrap/postupdate.log. The entries below show the series of scripts which run in the background.

+ log 'Main bootstrap postupdate started'
++ date '+%Y-%m-%d %H:%M:%S'
+ echo '2021-03-30 06:51:46 Main bootstrap postupdate started'
2021-03-30 06:51:46 Main bootstrap postupdate started
+ for script in "${bootstrap_dir}"/*
+ echo
+ log '/etc/bootstrap/postupdate.d/00-clear-services.sh starting...'
+ log '/etc/bootstrap/postupdate.d/00-disable-set-hostname.sh starting...'
+ log '/etc/bootstrap/postupdate.d/00-fix-vmtoolsd-deps.sh starting...'
+ log '/etc/bootstrap/postupdate.d/01-10-reset-k8s starting...'
+ log '/etc/bootstrap/postupdate.d/02-07-prune-images.sh starting...'
* *
untagged: photonos3_private:latest
deleted: sha256:30ded4d67031ea69c5792250dcd4c466511d940ff92f13cd6cc945c4456596ed
untagged: relocation-service:ed72098
untagged: relocation-service_private:latest
deleted: sha256:840c0a6cb51902b0bb1d81c7ebb3b7ae233e1666dbe387f39a24d7066ad1781a
deleted: sha256:3e737404723de6f8e2a6d8ad85b8cad2b9206e76b40b536766ba6a164a6d24ed
deleted: sha256:ea1c6ad6204eb1aa5ffccc44a3890014b52a39bc9ed64de34af88195670196e5
deleted: sha256:3a90641efc2b2704dcfa4a8777a4c8cce7150b683811e315773ce35cb83d2e92
deleted: sha256:fa46acfe7078d8404ebda3e229ae4058e68aa579512454194f3bc6121be0818e
untagged: rabbitmq_private:3.8.5-photon_9ff5c8078eac
untagged: rabbitmq_private:latest
* *
+ log '/etc/bootstrap/postupdate.d/02-10-load-images.sh starting...'
+ log '/etc/bootstrap/postupdate.d/02-disable-var-log-cron.sh starting...'
+ log '/etc/bootstrap/postupdate.d/03-10-setup-k8s.sh starting...'
+ log '/etc/bootstrap/postupdate.d/10-disable-tzselect.sh starting...'
+ log '/etc/bootstrap/postupdate.d/20-relocate-db.sh starting...'
+ log '/etc/bootstrap/postupdate.d/90-disable-default-services.sh starting...'
+ log '/etc/bootstrap/postupdate.d/91-00-remove-config-xl.sh starting...'
+ log '/etc/bootstrap/postupdate.d/99-10-handover-prelude starting...'
* *
# Wait for Kubernetes and other prerequisites
vracli status first-boot -w 1800 && \
/opt/scripts/upgrade/vami-set-update.sh "success" || \
/opt/scripts/upgrade/vami-set-update.sh "error"
rm -f /etc/bootstrap/everyboot.d/zz-zz-resume-upgrade.sh'
+ chmod a+x /etc/bootstrap/everyboot.d/zz-zz-resume-upgrade.sh
+ export -f vami_reboot_background
+ echo 'Scheduling a post-upgrade reboot...'
Scheduling a post-upgrade reboot...
+ echo 'Post-upgrade reboot scheduled'
Post-upgrade reboot scheduled
+ exit 0
+ nohup bash -c vami_reboot_background
+ log '/etc/bootstrap/postupdate.d/99-10-handover-prelude done, succeeded.'
++ date '+%Y-%m-%d %H:%M:%S'
+ echo '2021-03-30 07:14:06 /etc/bootstrap/postupdate.d/99-10-handover-prelude done, succeeded.'
* *
+ log '/etc/bootstrap/postupdate.d/README is not executable.'
++ date '+%Y-%m-%d %H:%M:%S'
+ echo '2021-03-30 07:14:06 /etc/bootstrap/postupdate.d/README is not executable.'
2021-03-30 07:14:06 /etc/bootstrap/postupdate.d/README is not executable.
+ log 'main bootstrap postupdate done'
++ date '+%Y-%m-%d %H:%M:%S'
+ echo '2021-03-30 07:14:06 main bootstrap postupdate done'
2021-03-30 07:14:06 main bootstrap postupdate done
+ '[' postupdate == firstboot ']'
+ exit 0
Rebooting the appliance after VAMI upgrade ...

12. Finishes the postupdate, and the appliance is rebooted.

13. As we can see above, the appliance was rebooted sometime after 2021-03-30 07:14:06. Under upgrade-2021-03-30-06-20-48.log, once the appliance is back you would see the status change:

[INFO][2021-03-30 07:14:01][labvra.nukescloud.com] VAMI checking upgrade progress
[INFO][2021-03-30 07:14:01][labvra.nukescloud.com] VAMI upgrade in progress.
[INFO][2021-03-30 07:14:01][labvra.nukescloud.com] VAMI upgrade still in progress.
[INFO][2021-03-30 07:14:06][labvra.nukescloud.com] Waiting for VAMI to exit ...
[INFO][2021-03-30 07:14:36][labvra.nukescloud.com] Verifying VAMI overall upgrade result ...
[INFO][2021-03-30 07:14:36][labvra.nukescloud.com] VAMI upgrade in progress.
[INFO][2021-03-30 07:17:01][labvra.nukescloud.com] VAMI checking upgrade progress
[INFO][2021-03-30 07:17:01][labvra.nukescloud.com] VAMI upgrade in progress.
* * *
[INFO][2021-03-30 07:22:01][labvra.nukescloud.com] VAMI checking upgrade progress
[INFO][2021-03-30 07:22:01][labvra.nukescloud.com] VAMI upgrade completed succesfully.
[INFO][2021-03-30 07:22:01][labvra.nukescloud.com] VAMI upgrade completed successfully.
[INFO][2021-03-30 07:23:01][labvra.nukescloud.com] Saving artifacts metadata
[INFO][2021-03-30 07:23:02][labvra.nukescloud.com] Artifacts metadata saved.
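The repeated "VAMI checking upgrade progress" entries show the orchestrator polling on a fixed interval until it sees a terminal state. A generic sketch of that pattern (the interval, timeout, and status strings here are illustrative, not the product's actual values):

```python
import time


def poll_until_terminal(check, timeout_s=3600, interval_s=60,
                        clock=time.monotonic, sleep=time.sleep):
    """Re-run check() periodically until it reports a terminal status,
    mirroring the repeated VAMI progress checks in the log above."""
    deadline = clock() + timeout_s
    while True:
        status = check()
        if status in ("success", "error"):
            return status
        if clock() >= deadline:
            return "timeout"
        sleep(interval_s)
```

Injecting `clock` and `sleep` keeps the loop testable without waiting for real minutes to pass.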
[INFO][2021-03-30 07:24:01][labvra.nukescloud.com] Resolving post-upgrade controller.
[INFO][2021-03-30 07:24:01][labvra.nukescloud.com] This node is elected for post-upgrade controller.
[INFO][2021-03-30 07:24:03][labvra.nukescloud.com] Resolving post-upgrade nodes quorum...
[INFO][2021-03-30 07:24:04][labvra.nukescloud.com] Adding nodes to the cluster.
[INFO][2021-03-30 07:24:04][labvra.nukescloud.com] Cluster nodes added successfully.
[INFO][2021-03-30 07:24:05][labvra.nukescloud.com] Restoring from restore points on all nodes.
[INFO][2021-03-30 07:24:05][labvra.nukescloud.com] Running remote command: /opt/scripts/upgrade/rstp-restore.sh 'local' sys-config sys-config at host: labvra.nukescloud.com
[INFO][2021-03-30 07:24:11][labvra.nukescloud.com] Restoring all restore points.

14. Restore point restoration starts, and then the application and infrastructure services start as well:

[INFO][2021-03-30 07:22:01][labvra.nukescloud.com] VAMI checking upgrade progress
[INFO][2021-03-30 07:22:01][labvra.nukescloud.com] VAMI upgrade completed succesfully.
[INFO][2021-03-30 07:22:01][labvra.nukescloud.com] VAMI upgrade completed successfully.
[INFO][2021-03-30 07:23:01][labvra.nukescloud.com] Saving artifacts metadata
[INFO][2021-03-30 07:23:02][labvra.nukescloud.com] Artifacts metadata saved.
[INFO][2021-03-30 07:24:01][labvra.nukescloud.com] Resolving post-upgrade controller.
[INFO][2021-03-30 07:24:01][labvra.nukescloud.com] This node is elected for post-upgrade controller.
[INFO][2021-03-30 07:24:03][labvra.nukescloud.com] Resolving post-upgrade nodes quorum...
[INFO][2021-03-30 07:24:04][labvra.nukescloud.com] Adding nodes to the cluster.
[INFO][2021-03-30 07:24:04][labvra.nukescloud.com] Cluster nodes added successfully.
[INFO][2021-03-30 07:24:05][labvra.nukescloud.com] Restoring from restore points on all nodes.
[INFO][2021-03-30 07:24:05][labvra.nukescloud.com] Running remote command: /opt/scripts/upgrade/rstp-restore.sh 'local' sys-config sys-config at host: labvra.nukescloud.com
[INFO][2021-03-30 07:24:11][labvra.nukescloud.com] Restoring all restore points.
15. At this point, check /var/log/deploy.log to confirm that the services are coming back up as expected
========================= [2021-03-30 07:26:36.605+0000] Creating kubernetes namespaces =========================
========================= [2021-03-30 07:26:39.435+0000] Applying ingress certificate =========================
========================= [2021-03-30 07:26:50.589+0000] Updating etcd configuration to include https_proxy if such exists =========================
========================= [2021-03-30 07:26:57.836+0000] Deploying infrastructure services =========================
========================= [2021-03-30 07:28:43.477+0000] Clearing liquibase locks =========================
========================= [2021-03-30 07:28:56.296+0000] DB upgrade schema =========================
========================= [2021-03-30 07:32:08.619+0000] Populating initial identity-service data =========================
========================= [2021-03-30 07:32:38.488+0000] Deploying application services =========================
========================= [2021-03-30 07:42:06.787+0000] Deploying application UI =========================
========================= [2021-03-30 07:43:28.519+0000] Setting feature UI toggles to Provisioning service =========================
========================= [2021-03-30 07:43:54.731+0000] Registering embedded vRO =========================
========================= [2021-03-30 07:43:58.698+0000] Registering embedded ABX on-prem endpoint in organizations =========================
16. That marks the end of the application and infrastructure start-up procedure, after which a verification phase runs. This procedure is documented in 2 logs
Reference: Verifies if all kubernetes related services are up and running: upgrade-k8s-2021-03-30-06-20-48.log
Reference: Copy of deploy.log: upgrade-deploy-2021-03-30-06-20-48.log
17. With the verification complete, the upgrade now works through the last few steps
[INFO][2021-03-30 07:44:32][labvra.nukescloud.com] Infrastrucre and application services deployed successfully.
[INFO][2021-03-30 07:44:33][labvra.nukescloud.com] Retrieving services status...
[INFO][2021-03-30 07:44:33][labvra.nukescloud.com] Services status retrieval succeeded.
[INFO][2021-03-30 07:44:33][labvra.nukescloud.com] Processing services data...
[INFO][2021-03-30 07:44:54][labvra.nukescloud.com] Services data processing succeeded.
[INFO][2021-03-30 07:44:54][labvra.nukescloud.com] Verifying services...
[INFO][2021-03-30 07:44:56][labvra.nukescloud.com] Services verification completed successfully.
[INFO][2021-03-30 07:44:58][labvra.nukescloud.com] Cleaning up restore point on all nodes.
[INFO][2021-03-30 07:44:58][labvra.nukescloud.com] Running remote command: /opt/scripts/upgrade/rstp-clean.sh sys-config at host: labvra.nukescloud.com
[INFO][2021-03-30 07:45:03][labvra.nukescloud.com] Cleaning up restore point /data/restorepoint/sys-config .
[INFO][2021-03-30 07:45:03][labvra.nukescloud.com] Restore point /data/restorepoint/sys-config cleaned up successfully.
[INFO][2021-03-30 07:45:03][labvra.nukescloud.com] Restore point cleaned up on all nodes.
[INFO][2021-03-30 07:45:03][labvra.nukescloud.com] Cleaning up restore point on all nodes.
[INFO][2021-03-30 07:45:03][labvra.nukescloud.com] Running remote command: /opt/scripts/upgrade/rstp-clean.sh artifacts-lastworking at host: labvra.nukescloud.com
[INFO][2021-03-30 07:45:08][labvra.nukescloud.com] Cleaning up restore point /data/restorepoint/artifacts-lastworking .
[INFO][2021-03-30 07:45:08][labvra.nukescloud.com] Restore point /data/restorepoint/artifacts-lastworking cleaned up successfully.
[INFO][2021-03-30 07:45:08][labvra.nukescloud.com] Restore point cleaned up on all nodes.
[INFO][2021-03-30 07:45:08][labvra.nukescloud.com] Cleaning up restore point on all nodes.
[INFO][2021-03-30 07:45:09][labvra.nukescloud.com] Running remote command: /opt/scripts/upgrade/rstp-clean.sh live-data at host: labvra.nukescloud.com
[INFO][2021-03-30 07:45:14][labvra.nukescloud.com] Cleaning up restore point /data/restorepoint/live-data .
[INFO][2021-03-30 07:45:14][labvra.nukescloud.com] Restore point /data/restorepoint/live-data cleaned up successfully.
[INFO][2021-03-30 07:45:14][labvra.nukescloud.com] Restore point cleaned up on all nodes.
[INFO][2021-03-30 07:45:15][labvra.nukescloud.com] Reverting SSH configuration on nodes.
[INFO][2021-03-30 07:45:16][labvra.nukescloud.com] SSH configuration reverted on nodes successfully.
[INFO][2021-03-30 07:45:24][labvra.nukescloud.com] Cleaning up upgrade runtime state.
[INFO][2021-03-30 07:45:25][labvra.nukescloud.com] Archiving upgrade runtime data.
[INFO][2021-03-30 07:45:26][labvra.nukescloud.com] Upgrade runtime data archived successfully.
[INFO][2021-03-30 07:45:26][labvra.nukescloud.com] Clearing upgrade runtime directory.
[INFO][2021-03-30 07:45:26][labvra.nukescloud.com] Upgrade runtime directory cleared successfully.
[INFO][2021-03-30 07:45:26][labvra.nukescloud.com] Upgrade runtime clean up completed.
This concludes the vRealize Automation upgrade. As said earlier, the following logs are the ones which we need to monitor
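The `=========` stage banners in deploy.log make it easy to follow upgrade progress at a glance. A minimal sketch, run here against a captured sample rather than a live appliance (the sample file name is illustrative):

```shell
# Extract the stage banners from a captured copy of deploy.log and strip
# the "=" padding, leaving "[timestamp] stage name" per line. The sample
# below is an excerpt from the deploy.log output shown above.
cat > /tmp/deploy-sample.log <<'EOF'
========================= [2021-03-30 07:26:36.605+0000] Creating kubernetes namespaces =========================
[INFO] some other noise between stages
========================= [2021-03-30 07:42:06.787+0000] Deploying application UI =========================
EOF

# Keep only banner lines, then trim leading and trailing "=" runs.
grep '^=========' /tmp/deploy-sample.log \
  | sed -e 's/^=*[[:space:]]*//' -e 's/[[:space:]]*=*$//'
```

On a real appliance the same pipeline would be pointed at /var/log/deploy.log instead of the sample file.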

  • Unable to run deployments with AD integration in vRA 8.3 if OU path is more than 50 characters

    If vRA 8.3 has AD integration and you attempt to deploy a machine to an OU whose relative path is more than 50 characters long, the deployment will fail. This is a bug that is fixed in vRealize Automation 8.4 GA
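    Until an environment is on 8.4, the OU path length can be checked up front before submitting a deployment. A minimal bash sketch; the OU value below is a hypothetical example, not taken from any real environment:

```shell
#!/bin/bash
# Check that an OU relative path stays within the 50-character limit that
# vRA 8.3 enforces on AD-integrated deployments. The DN here is a
# hypothetical example.
ou="OU=Linux,OU=Platform,OU=Engineering,DC=corp"

if [ "${#ou}" -gt 50 ]; then
    echo "OU path is ${#ou} characters: deployments will fail on vRA 8.3"
else
    echo "OU path is ${#ou} characters: within the 50-character limit"
fi
```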

  • Unable to update resource elements in vRealize Orchestrator

    We've recently seen issues while trying to update resource elements in vRealize Orchestrator 7.x. Some of the exceptions you would see when the vRO workflow is executed are shown below. These excerpts are from scripting.log

    2021-03-09 11:42:35.070+0000 [WorkflowExecutorPool-Thread-2] ERROR {|__SYSTEM|administrator@vsphere.local:updateResourceEleme:f896e1b8-4b5e-484e-b5d1-db04390a1bcc:token=e98c4043-52d4-4375-96b4-c699da88f34d} [ResourceElementManagementServiceImpl] Failed to commit 'Resource saved.'. Reason: null
    java.lang.NullPointerException
    Cannot set content from mime attachment : ch.dunes.util.DunesServerException: org.springframework.orm.jpa.JpaSystemException: Transaction was marked for rollback only; cannot commit; nested exception is org.hibernate.TransactionException: Transaction was marked for rollback only; cannot commit

    One of the main reasons this does not work is that the jgit garbage collection is failing in the background

    2021-03-09 11:42:23.354+0000 [http-nio-127.0.0.1-8280-exec-6] ERROR {} [ContentVersionManagementServiceImpl] Failed to disable auto push: org.eclipse.jgit.api.errors.TransportException: /storage/db/vcodata/git/__SYSTEM.git: internal server error
    java.lang.RuntimeException: org.eclipse.jgit.api.errors.TransportException: /storage/db/vcodata/git/__SYSTEM.git: internal server error
    at com.vmware.o11n.git.StagingRepositoryService.checkout(StagingRepositoryService.java:226)
    at com.vmware.o11n.service.version.ContentVersionManagementServiceImpl$ReusableStaginRepository.(ContentVersionManagementServiceImpl.java:495)
    at com.vmware.o11n.service.version.ContentVersionManagementServiceImpl$StagingRepositoryPooledObjectFactory.create(ContentVersionManagementServiceImpl.java:455)
    at com.vmware.o11n.service.version.ContentVersionManagementServiceImpl$StagingRepositoryPooledObjectFactory.create(ContentVersionManagementServiceImpl.java:445)
    at org.apache.commons.pool2.BasePooledObjectFactory.makeObject(BasePooledObjectFactory.java:58)

    2021-03-10 03:00:00.000+0000 [vcoSystemTaskScheduler-3] DEBUG {} [ContentVersionManagementServiceImpl] Starting garbage collection in all repositories.
    2021-03-10 03:00:05.931+0000 [vcoSystemTaskScheduler-3] INFO {} [BareRepositoryService] Started garbage collection for __SYSTEM.git
    2021-03-10 03:00:06.352+0000 [vcoSystemTaskScheduler-3] ERROR {} [BareRepositoryService] Failed to run garbage collection for __SYSTEM.git. Reason:
    2021-03-10 03:00:06.352+0000 [vcoSystemTaskScheduler-3] ERROR {} [BareRepositoryService] Garbage collection failed.
    org.eclipse.jgit.api.errors.JGitInternalException: Garbage collection failed.
    at org.eclipse.jgit.api.GarbageCollectCommand.call(GarbageCollectCommand.java:229)
    at com.vmware.o11n.git.BareRepositoryService.runGc(BareRepositoryService.java:109)
    at com.vmware.o11n.service.version.ContentVersionManagementServiceImpl.lambda$compact$3(ContentVersionManagementServiceImpl.java:260)
    at com.vmware.o11n.service.version.ContentVersionManagementServiceImpl.wrapInTransaction(ContentVersionManagementServiceImpl.java:111)
    at com.vmware.o11n.service.version.ContentVersionManagementServiceImpl.compact(ContentVersionManagementServiceImpl.java:256)
    at com.vmware.o11n.service.version.ContentVersionManagementServiceImpl.lambda$compactAll$2(ContentVersionManagementServiceImpl.java:251)
    at java.util.ArrayList.forEach(ArrayList.java:1257)

    Garbage collection runs every day at 3 AM. If you notice recurring garbage collection failures, as shown below, then execute the steps mentioned in the remediation section

    2021-03-06 03:00:12.167+0000 [vcoSystemTaskScheduler-3] ERROR {} [BareRepositoryService] Garbage collection failed.
    org.eclipse.jgit.api.errors.JGitInternalException: Garbage collection failed.
    2021-03-07 03:00:00.545+0000 [vcoSystemTaskScheduler-4] ERROR {} [BareRepositoryService] Garbage collection failed.
    org.eclipse.jgit.api.errors.JGitInternalException: Garbage collection failed.
    2021-03-08 03:00:03.969+0000 [vcoSystemTaskScheduler-3] ERROR {} [BareRepositoryService] Garbage collection failed.
    org.eclipse.jgit.api.errors.JGitInternalException: Garbage collection failed.
    2021-03-09 03:00:00.469+0000 [vcoSystemTaskScheduler-4] ERROR {} [BareRepositoryService] Garbage collection failed.
    org.eclipse.jgit.api.errors.JGitInternalException: Garbage collection failed.
    2021-03-10 03:00:06.352+0000 [vcoSystemTaskScheduler-3] ERROR {} [BareRepositoryService] Garbage collection failed.
    org.eclipse.jgit.api.errors.JGitInternalException: Garbage collection failed.

    To resolve this issue follow these steps
    1. Download the JGit CLI client. The command below works if internet access is available; otherwise execute it on a node which has internet access and copy the file onto the vRA node
    curl -k --output jgit https://repo.eclipse.org/content/groups/releases//org/eclipse/jgit/org.eclipse.jgit.pgm/5.4.0.201906121030-r/org.eclipse.jgit.pgm-5.4.0.201906121030-r.sh
    2. Move jgit to /usr/bin, change its owner to vco and make it executable
    mv jgit /usr/bin/
    chown vco:vco /usr/bin/jgit
    chmod +x /usr/bin/jgit
    3. Move the HEAD ref of master out of the way
    mv /storage/db/vcodata/git/__SYSTEM.git/refs/heads/master /var/lib/vco/master_ref.backup
    4. Check where the HEAD should be. Get the commit ref sha from the first line, e.g. 01bb5f58716dc12016b4fd8798c7fa8c91c76bf3
    cd /storage/db/vcodata/git/__SYSTEM.git
    jgit log
    5. Log in as vco
    su - vco
    6. Create the master ref
    echo '01bb5f58716dc12016b4fd8798c7fa8c91c76bf3' > /storage/db/vcodata/git/__SYSTEM.git/refs/heads/master
    7. Run jgit garbage collection
    cd /storage/db/vcodata/git/__SYSTEM.git
    jgit gc
    8. Log in as root again and make sure everything in the git folder is owned by vco:vco
    exit
    cd /storage/db/vcodata/git/__SYSTEM.git
    chown -R vco:vco .
    9. Restart the vRO server
    service vco-server restart
    This should fix any garbage collection issues
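    Before attempting the remediation, it is worth confirming how many nights GC has actually been failing. A minimal sketch, run here against a captured sample of scripting.log (on a real appliance, point the grep at the vRO scripting.log itself):

```shell
# Count failed nightly GC runs by scanning a captured copy of the vRO
# scripting.log for the BareRepositoryService error. The sample below
# reuses entries quoted above; the file path is illustrative.
cat > /tmp/scripting-sample.log <<'EOF'
2021-03-06 03:00:12.167+0000 [vcoSystemTaskScheduler-3] ERROR {} [BareRepositoryService] Garbage collection failed.
2021-03-07 03:00:00.545+0000 [vcoSystemTaskScheduler-4] ERROR {} [BareRepositoryService] Garbage collection failed.
2021-03-08 03:00:03.969+0000 [vcoSystemTaskScheduler-3] ERROR {} [BareRepositoryService] Garbage collection failed.
EOF

# One count per failed nightly run; a steadily growing number means the
# repository needs the manual jgit gc remediation.
grep -c 'Garbage collection failed' /tmp/scripting-sample.log
```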

  • vIDM Inventory Sync fails after implementing workaround to fix CVE-2020-4006 (KB 81731)

    Came across an issue where vIDM Inventory Sync fails after applying KB 81731, which describes the workaround for CVE-2020-4006. The vIDM configurator page (:8443) will not be available/accessible because the workaround blocks it due to the security vulnerability.

    CVE-2020-4006 has been determined to affect some releases of Workspace ONE Access, Identity Manager, and Workspace ONE Access Connector. This vulnerability and its impact on VMware products are documented in VMSA-2020-0027. Please review this advisory before continuing, as there may be considerations outside the scope of this document, including permanent fixes.

    Affected product versions:
    VMware Workspace ONE Access 20.10 (Linux)
    VMware Workspace ONE Access 20.01 (Linux)
    VMware Identity Manager 3.3.3 (Linux)
    VMware Identity Manager 3.3.2 (Linux)
    VMware Identity Manager 3.3.1 (Linux)
    VMware Identity Manager Connector 3.3.2, 3.3.1 (Linux)
    VMware Identity Manager Connector 3.3.3, 3.3.2, 3.3.1 (Windows)
    VMware Identity Manager Connector 19.03.0.0, 19.03.0.1

    According to the KB, the following steps implement the workaround on vIDM

    Implement workaround for Linux based vIDM appliances
    1. Use SSH to connect to the appliance using "sshuser" credentials configured during installation or updated later.
    2. Switch to root by typing su and provide "root" credentials configured during installation or updated later.
    3. Run the following commands:
    cd /opt/vmware/horizon/workspace
    mkdir webapps.tmp
    mv webapps/cfg webapps.tmp
    mv conf/Catalina/localhost/cfg.xml webapps.tmp
    service horizon-workspace restart
    4. Repeat the steps for all Linux based appliances affected by CVE-2020-4006.

    The moment this workaround is implemented on vIDM, Inventory Sync of vIDM through vRLCM will fail. For Inventory Sync to work through vRLCM, one has to revert the workaround as stated in the KB

    Revert workaround for Linux based appliances
    1. Use SSH to connect to the appliance using "sshuser" credentials configured during installation or updated later.
    2. Switch to root by typing su and provide "root" credentials configured during installation or updated later.
    3. Run the following commands:
    cd /opt/vmware/horizon/workspace
    mv webapps.tmp/cfg webapps
    mv webapps.tmp/cfg.xml conf/Catalina/localhost
    rmdir webapps.tmp
    service horizon-workspace restart
    4. Repeat the steps for all Linux based appliances affected by CVE-2020-4006

    This issue is fixed in vIDM 3.3.4, which is packaged with vRLCM 8.3 and vRA 8.3
    NOTE: If there are plans to upgrade vIDM, you have to revert the workaround using the steps mentioned above.
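    A quick way to tell whether the workaround is currently in place on an appliance is to look for the webapps.tmp directory that the KB steps create. A minimal sketch, demonstrated against a mock directory tree rather than a real appliance (on a live vIDM node the path would be /opt/vmware/horizon/workspace):

```shell
# Detect whether the KB 81731 workaround is applied by checking for the
# webapps.tmp directory the workaround steps create. A mock directory
# tree stands in for /opt/vmware/horizon/workspace here.
WS=/tmp/mock-workspace
mkdir -p "$WS/webapps.tmp/cfg"   # simulate an appliance with the workaround applied

if [ -d "$WS/webapps.tmp" ]; then
    echo "workaround applied: revert it before running a vRLCM Inventory Sync"
else
    echo "workaround not applied: Inventory Sync should work"
fi
```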

  • A useful command to list only patch logs while implementing vRA HFs

    Execute the command below to tail only the patch-related log entries during HF implementation
    tail -f /var/log/vmware/vcac/vcac-config.log | grep cluster.patch

  • Useful Postgres commands which can be executed on vRA appliance database

    A list of a few commands which can help in troubleshooting database related issues

    Log in to the database
    /opt/vmware/vpostgres/current/bin/psql vcac postgres

    Exit the database
    \q

    Display the largest tables
    SELECT nspname || '.' || relname AS "relation", pg_size_pretty(pg_total_relation_size(C.oid)) AS "total_size" FROM pg_class C LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace) WHERE nspname NOT IN ('pg_catalog', 'information_schema') AND C.relkind <> 'i' AND nspname !~ '^pg_toast' ORDER BY pg_total_relation_size(C.oid) DESC LIMIT 20;

    Display the last vacuum and analyze times
    SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze FROM pg_stat_all_tables WHERE schemaname = 'public';

    Display the largest objects
    SELECT relname AS objectname, relkind AS objecttype, reltuples AS "#entries", pg_size_pretty(relpages::bigint*8*1024) AS size FROM pg_class WHERE relpages >= 8 ORDER BY relpages DESC;

    Whole database size
    select pg_size_pretty(pg_database_size('vcac'));

    Single table size
    SELECT pg_size_pretty(pg_total_relation_size('event'));

    Size per row of a table (substitute the real table and primary key column)
    SELECT primary_key, pg_column_size(tablename.*) FROM tablename;

  • Mandatory Prechecks before a vRealize Automation 7.x Upgrade

    Collating a few pre-checks which have to be performed before a vRealize Automation upgrade, based on our experience

    Check 1
    Ensure vPostgres is in a healthy state
    Simple environment
    Distributed environment

    Check 2
    The IaaS nodes should have a last connected time of no more than 30 seconds; appliances should be no more than 10 minutes. Here's the command you can use from a shell session
    vcac-config cluster-config -list | sed -e 's/[{}]/''/g' | awk -v k="text" '{n=split($0,a,","); for (i=1; i<=n; i++) print a[i]}' | grep -E 'nodeHost|lastUpdateRelative'
    or one can check from the VAMI as well

    Check 3
    Health checks of all components of vRA have to pass. Click on this link to find out how to perform them. That link has screenshots as well, so that you can compare the outputs.

    Check 4
    Ensure there are no duplicates inside the Postgres database
    Log in to Postgres
    psql -U postgres -d vcac
    Check for duplicates by executing the query below
    select * from (
    select categoryid, name, count(*)-1, 'vmo_configelement' as type from vmo_configelement group by categoryid, name having count(*) > 1
    union select parentcategoryid, name, count(*)-1, 'vmo_configelementcategory' as type from vmo_configelementcategory group by parentcategoryid, name having count(*) > 1
    union select elementid, '' as name, count(*)-1, 'vmo_configelementcontent' as type from vmo_configelementcontent group by elementid having count(*) > 1
    union select '' as id, name, count(*)-1, 'vmo_package' as type from vmo_package group by name having count(*) > 1
    union select packageid, '' as name, count(*)-1, 'vmo_packagecontent' as type from vmo_packagecontent group by packageid having count(*) > 1
    union select policyid, '' as name, count(*)-1, 'vmo_policycontent' as type from vmo_policycontent group by policyid having count(*) > 1
    union select categoryid, name, count(*)-1, 'vmo_resourceelement' as type from vmo_resourceelement group by categoryid, name having count(*) > 1
    union select elementid, '' as name, count(*)-1, 'vmo_resourceelementcontent' as type from vmo_resourceelementcontent group by elementid having count(*) > 1
    union select scriptmodulecategoryid, name, count(*)-1, 'vmo_scriptmodule' as type from vmo_scriptmodule group by scriptmodulecategoryid, name having count(*) > 1
    union select '' as id, name, count(*)-1, 'vmo_scriptmodulecategory' as type from vmo_scriptmodulecategory group by name having count(*) > 1
    union select scriptmoduleid, '' as name, count(*)-1, 'vmo_scriptmodulecontent' as type from vmo_scriptmodulecontent group by scriptmoduleid having count(*) > 1
    union select dunesobjectid, version, count(*)-1, 'vmo_versionhistorycontent' as type from vmo_versionhistorycontent group by dunesobjectid, version having count(*) > 1
    union select workflowid, '' as name, count(*)-1, 'vmo_workflowcontent' as type from vmo_workflowcontent group by workflowid having count(*) > 1
    ) as u;
    Just copy, paste and execute the query. The output should show 0 rows

    Check 5
    Ensure the cluster_nodes table inside Postgres has just a certificate and not a complete chain
    Log in to Postgres
    psql -U postgres -d vcac
    Check the cluster_nodes table
    select * from cluster_nodes where node_type = 'VA';
    The certificate here must match /opt/vmware/etc/lighttpd/server.pem
    Moreover, server.pem must only contain a PRIVATE KEY and a CERTIFICATE. It must not contain a chain with ROOT and INTERMEDIATE certificates
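    The sed/awk pipeline from Check 2 can be tried safely against a captured sample of the cluster-config output before running it on a live appliance. A minimal sketch; the JSON fragment below is illustrative, not real vcac-config output:

```shell
# Exercise the Check 2 parsing pipeline against a small captured sample so
# its behaviour is clear: strip the braces, split on commas, and keep only
# the nodeHost and lastUpdateRelative fields.
cat > /tmp/cluster-sample.json <<'EOF'
{"nodeHost":"vra1.corp.local","lastUpdateRelative":"8 seconds ago","other":"x"}
EOF

sed -e 's/[{}]//g' /tmp/cluster-sample.json \
  | awk -v k="text" '{n=split($0,a,","); for (i=1; i<=n; i++) print a[i]}' \
  | grep -E 'nodeHost|lastUpdateRelative'
```

On a real appliance the input would come from `vcac-config cluster-config -list` instead of the sample file.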

  • AutoConfigurer cleaned up resource

    AutoConfigurer is the process that deletes "RETIRED" resources from a vRA 8.x environment. The moment a resource is marked as retired, it is deleted from vRA in the next data collection cycle. Here's a simple test performed in my lab

    1. Deployed a test deployment (deployment details, and the corresponding machine on the vCenter)
    2. Checked that the latest data collection had happened
    3. Powered off the virtual machine which belongs to the deployment; the deployment history shows it detected as powered off
    4. Deleted the virtual machine from the vCenter, with the next data collection due to start in 4 minutes

    Remember, this machine has already been deleted from the vCenter. After the new data collection cycle is complete, AutoConfigurer deletes the resource after marking it as RETIRED

    2021-02-09T06:51:10.137Z [priority='INFO' thread='vsphere-enum-2' user='' org='' context='' parent='' token=''] c.v.p.c.m.a.v.p.ResourceBatchProcessor.lambda$fetchToBeDeletedResources$7:233 - [premvc.prem.com:production] Will delete [1] documents of type [VirtualMachine] for resources [["retiretest->/resources/compute/06510162-8982-41a1-898a-f6f721f2a41e"]].
    2021-02-09T06:51:10.138Z [priority='INFO' thread='vsphere-enum-2' user='' org='' context='' parent='' token=''] c.v.p.c.m.a.v.p.VMBatchProcessor.lambda$deleteResourcesIncremental$28:270 - premvc.prem.com:production Deleting 1 resources of type VirtualMachine with morefs [VirtualMachine:vm-3109]
    2021-02-09T06:51:10.139Z [priority='INFO' thread='vsphere-enum-2' user='' org='' context='' parent='' token=''] c.v.p.c.m.a.v.p.VirtualMachineProcessor.processDelete:186 - [premvc.prem.com]:[production]: Begin deleting machine [retiretest] with self link [/resources/compute/06510162-8982-41a1-898a-f6f721f2a41e]
    2021-02-09T06:51:10.159Z [priority='INFO' thread='xn-index-updates-50' user='' org='' context='' parent='' token=''] c.v.p.c.m.a.v.processors.ProcessorUtils.lambda$retireCompute$5:219 - [premvc.prem.com]:[production]: The compute [retiretest] with self link [/resources/compute/06510162-8982-41a1-898a-f6f721f2a41e] is marked as RETIRED with expiration micros [1613112670139000]

    AutoConfigurer then posts a message on the deployment pane stating it has deleted the resource
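    The retirement events are easy to pull out of the provisioning log. A minimal sketch, run against a sample built from the entries quoted above (on a live system the input would be the provisioning service log itself):

```shell
# Pull AutoConfigurer retirement events out of a captured provisioning log
# and list the compute names that were marked RETIRED. The sample reuses a
# log line quoted above, shortened to the fields the sed expression needs.
cat > /tmp/provisioning-sample.log <<'EOF'
2021-02-09T06:51:10.159Z [priority='INFO' thread='xn-index-updates-50'] ProcessorUtils.lambda$retireCompute$5:219 - [premvc.prem.com]:[production]: The compute [retiretest] with self link [/resources/compute/06510162-8982-41a1-898a-f6f721f2a41e] is marked as RETIRED with expiration micros [1613112670139000]
EOF

# Extract the compute name between "The compute [" and the closing "]".
grep 'marked as RETIRED' /tmp/provisioning-sample.log \
  | sed 's/.*The compute \[\([^]]*\)\].*/\1/'
```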

  • vPostgres Database Failover Scenarios

    The Master database constantly streams transactions to the Replica servers. Once the Master fails, the active and working Replica is ready to serve read-only requests right away. When the new Master is promoted (manually or automatically), all upcoming requests are moved to it.

    Up to the vRA 7.2 release, manual failover was supported, which means that the MASTER is replaced with a Replica selected by the Administrator. The Administrator needs to trigger the promotion operation and wait for its completion.

    With vRA 7.3, PostgreSQL database failover is automatic when three appliances are deployed and synchronous replication is configured. The failover scenarios do not differ from the manual failover scenarios, except that the recovery steps are performed automatically when automatic failover is on. It is the ability of the vRA PostgreSQL database component to automatically keep a Master node alive at any time without further intervention from the Administrator. There are some cases in which vRA is unable to perform an automatic failover, for example if more than half of the nodes are down at the same time. Monitoring https://vra-fqdn:5434/api/status when in sync mode (automatic failover turned on), as specified in the documentation, will notify the client to perform a manual failover when the manualFailoverNeeded status flag is true.

    3 Node (vRA Appliances) Failover Scenarios
    As you can see in the screenshot, there are three nodes: one is MASTER, another is SYNC REPLICA and the last is POTENTIAL REPLICA. For systems that use synchronous replication, there is one replica node completely in sync with the MASTER, known as the SYNC REPLICA. In the above screenshot it's vra3.vcloud.local which is the SYNC REPLICA.

    The SYNC REPLICA, also called the HOT-SYNC node, is the one that will be promoted to MASTER as soon as the current MASTER's state changes to DOWN. Only the SYNC REPLICA is in complete sync with the MASTER node; the others' status is set to POTENTIAL. In ASYNCHRONOUS replication mode all nodes have Sync State set to ASYNC; the MASTER does not have any Sync State value.

    The Valid column in the screenshot shows that the database instance on the node is healthy. The Priority column shows the position of a Replica in relation to the Master node. The Master node has no priority. The SYNC REPLICA's priority is set to 0, as it is next in line to become MASTER. Subsequent POTENTIAL REPLICAs have priority set to 1, 2... and so on.

    Three Node (vRA Appliances) Auto Failover Scenarios
    When one of the three nodes is down, AutoFailover is performed. There is no further AutoFailover capability until all three nodes are alive again. Let's explore a few scenarios.

    Scenario 1: MASTER is DOWN
    What happens
    The SYNC REPLICA becomes MASTER and picks up the database function automatically (performed by the automatic failover agent). The POTENTIAL REPLICA becomes the new SYNC REPLICA. vRA appliances go into READ ONLY mode until the automatic failover completes.
    Once the former MASTER is back
    It is set to REPLICA automatically by the failover agent; no manual action is needed. If the former MASTER cannot be fixed, set the PostgreSQL database to asynchronous mode.

    Scenario 2: SYNC REPLICA is DOWN
    What happens
    There is no change to the MASTER, hence no downtime. You may experience some delay in database requests for a couple of seconds until the POTENTIAL REPLICA becomes the new SYNC REPLICA. Promotion of nodes is done automatically by the PostgreSQL service.
    Once the former SYNC REPLICA is back online
    It is automatically set to POTENTIAL REPLICA.

    Scenario 3: POTENTIAL REPLICA is DOWN
    What happens
    Nothing. There is no downtime.
    When the former POTENTIAL REPLICA comes back online
    It is automatically set to POTENTIAL REPLICA. If the former POTENTIAL REPLICA does not come back online, then set the PostgreSQL database to asynchronous mode.

    When two out of three nodes are DOWN at the same time, the system starts working in read-only mode until a manual repair is performed. When just one database node is available in the system, it should be switched to asynchronous mode.

    Scenario 4: MASTER and POTENTIAL REPLICA are DOWN
    What happens
    The SYNC REPLICA will not be promoted to MASTER automatically. The system starts working in read-only mode until a manual promotion is performed.
    Once the former MASTER and POTENTIAL REPLICA are back online
    They should be reset manually to synchronize against the MASTER, and then replication should be turned back to SYNCHRONOUS mode.

    Scenario 5: SYNC REPLICA and POTENTIAL REPLICA are DOWN
    What happens
    The MASTER is not able to process R/W transactions. The system starts working in READ ONLY mode until a manual repair is performed.
    When the SYNC REPLICA and POTENTIAL REPLICA are back
    They must be reset manually to synchronize against the MASTER, and the system should be turned back to SYNCHRONOUS mode.

    When a link failure happens between nodes

    Scenario 6: Link failure between SITE A and SITE B, but all 3 nodes are UP and ONLINE (SITE A has MASTER and POTENTIAL REPLICA, SITE B has SYNC REPLICA)
    What happens
    The POTENTIAL REPLICA on SITE A becomes SYNC REPLICA. There is no downtime. When the link is re-established, no manual action is needed.

    Scenario 7: Link failure between SITE A and SITE B, and all three nodes are UP and ONLINE (SITE A has MASTER, SITE B has POTENTIAL REPLICA)
    What happens
    The SYNC REPLICA becomes MASTER and picks up the DB function automatically, and the POTENTIAL REPLICA becomes the new SYNC REPLICA (performed by the automatic failover agent). The system works in read-only mode until this promotion finishes. When the link is re-established, the former MASTER is reset as a REPLICA. There is no manual action needed.

    vRA Database Configuration Files and Log Locations
    vPostgres configuration file: /var/vmware/vpostgres/current/pgdata/pg_hba.conf
    vPostgres logs: /storage/db/pgdata/pg_log/postgresql.csv, /storage/db/pgdata/serverlog
    Automatic Failover Agent log: /var/log/vcac/vcac-config.log

    #vRealizeAutomation
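    The status endpoint mentioned above can be polled for the manualFailoverNeeded flag. A minimal sketch against a saved sample response; the JSON shape below is an assumption to be checked against your environment, and on a live system the input would come from `curl -sk https://vra-fqdn:5434/api/status`:

```shell
# Check the manualFailoverNeeded flag from a saved copy of the
# https://vra-fqdn:5434/api/status response. The JSON shape here is an
# assumption; on a live system fetch the real response with:
#   curl -sk https://vra-fqdn:5434/api/status
cat > /tmp/status-sample.json <<'EOF'
{"distributed":true,"manualFailoverNeeded":false,"syncMode":"SYNC"}
EOF

if grep -q '"manualFailoverNeeded":true' /tmp/status-sample.json; then
    echo "manual failover needed: promote a replica manually"
else
    echo "no manual failover needed"
fi
```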

  • vRA 8.x PostgreSQL pods showing down?

    I was analyzing a problem where PostgreSQL pods were showing down and vRA 8.x deployments were stuck and not working as expected. In an ideal scenario, there should always be one master and two standby nodes in vRA 8.x

    Ideal status

    Node "postgres-0.postgres.prelude.svc.cluster.local":
    PostgreSQL version: 10.10
    Total data size: 26 GB
    Conninfo: host=postgres-0.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10
    Role: standby
    WAL archiving: disabled (on standbys "archive_mode" must be set to "always" to be effective)
    Archive command: /bin/true
    WALs pending archiving: 0 pending files
    Replication connections: 0 (of maximal 10)
    Replication slots: 0 physical (of maximal 10; 0 missing)
    Upstream node: postgres-1.postgres.prelude.svc.cluster.local (ID: 101)
    Replication lag: 0 seconds
    Last received LSN: 307/595EBA48
    Last replayed LSN: 307/595EBA48

    Node "postgres-1.postgres.prelude.svc.cluster.local":
    PostgreSQL version: 10.10
    Total data size: 26 GB
    Conninfo: host=postgres-1.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10
    Role: primary
    WAL archiving: enabled
    Archive command: /bin/true
    WALs pending archiving: 0 pending files
    Replication connections: 1 (of maximal 10)
    Replication slots: 0 physical (of maximal 10; 0 missing)
    Replication lag: n/a

    Node "postgres-2.postgres.prelude.svc.cluster.local":
    PostgreSQL version: 10.10
    Total data size: 26 GB
    Conninfo: host=postgres-2.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10
    Role: standby
    WAL archiving: disabled (on standbys "archive_mode" must be set to "always" to be effective)
    Archive command: /bin/true
    WALs pending archiving: 0 pending files
    Replication connections: 0 (of maximal 10)
    Replication slots: 0 physical (of maximal 10; 0 missing)
    Upstream node: postgres-1.postgres.prelude.svc.cluster.local (ID: 101)
    Replication lag: 0 seconds
    Last received LSN: 304/E1A1B458
    Last replayed LSN: 304/E1A1B458

    When the issue was seen, one of the postgres pods was reporting down

    postgres-0   1/1   Running   1   3d11h
    postgres-1   1/1   Running   0   3d11h
    postgres-2   0/1   Running   1   3d11h

    By the time we realized it, the problem had fixed itself. I was curious to know what happened and how it fixed itself, so let's start our deep dive.

    I started by inspecting postgres-1, which is the current master node. Until 2021-01-29 06:10:19 +0000 UTC it was monitoring postgres-0 as its upstream master, and all of a sudden at 2021-01-29 06:10:35 +0000 it was unable to ping postgres-0

    2021-01-29 06:09:19 +0000 UTC [repmgrd] node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) monitoring upstream node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) in normal state
    2021-01-29 06:10:19 +0000 UTC [repmgrd] node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) monitoring upstream node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) in normal state
    2021-01-29 06:10:35 +0000 UTC [repmgrd] unable to ping "host=postgres-0.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10"
    2021-01-29 06:10:35 +0000 UTC [repmgrd] unable to connect to upstream node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100)
    2021-01-29 06:10:35 +0000 UTC [repmgrd] checking state of node 100, 1 of 1 attempts
    2021-01-29 06:10:37 +0000 UTC [socat] W waitpid(-1, {}, WNOHANG): no child has exited

    The snippets below show that postgres-1 was promoted as MASTER, where the previous MASTER was postgres-0.
postgres-1 has been promoted as MASTER on 2021-01-29 06:11:06.17188+00 2021-01-29 06:09:19 +0000 UTC [repmgrd] node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) monitoring upstream node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) in normal state 2021-01-29 06:10:19 +0000 UTC [repmgrd] node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) monitoring upstream node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) in normal state 2021-01-29 06:10:35 +0000 UTC [repmgrd] unable to ping "host=postgres-0.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10" 2021-01-29 06:10:35 +0000 UTC [repmgrd] unable to connect to upstream node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) 2021-01-29 06:10:35 +0000 UTC [repmgrd] checking state of node 100, 1 of 1 attempts 2021-01-29 06:10:37 +0000 UTC [socat] W waitpid(-1, {}, WNOHANG): no child has exited 2021-01-29 06:10:45 +0000 UTC [repmgrd] unable to ping "user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10 dbname=repmgr-db host=postgres-0.postgres.prelude.svc.cluster.local fallback_application_name=repmgr" 2021-01-29 06:10:45 +0000 UTC [repmgrd] unable to reconnect to node 100 after 1 attempts 2021-01-29 06:10:45 +0000 UTC [repmgrd] 1 active sibling nodes registered 2021-01-29 06:10:45 +0000 UTC [repmgrd] primary node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) and this node have the same location ("default") 2021-01-29 06:10:45 +0000 UTC [repmgrd] local node's last receive lsn: 418/C2E8E700 2021-01-29 06:10:45 +0000 UTC [repmgrd] checking state of sibling node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) 2021-01-29 06:10:45 +0000 UTC [repmgrd] node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) reports its upstream is node 100, last seen 20 second(s) ago 2021-01-29 06:10:45 +0000 UTC [repmgrd] standby node "postgres-2.postgres.prelude.svc.cluster.local" 
(ID: 102) last saw primary node 20 second(s) ago 2021-01-29 06:10:45 +0000 UTC [repmgrd] last receive LSN for sibling node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) is: 418/C2E8E700 2021-01-29 06:10:45 +0000 UTC [repmgrd] node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) has same LSN as current candidate "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) 2021-01-29 06:10:45 +0000 UTC [repmgrd] node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) has lower priority (98) than current candidate "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) (99) 2021-01-29 06:10:45 +0000 UTC [repmgrd] visible nodes: 2; total nodes: 2; no nodes have seen the primary within the last 2 seconds 2021-01-29 06:10:45 +0000 UTC [repmgrd] promotion candidate is "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) 2021-01-29 06:10:45 +0000 UTC [repmgrd] executing "failover_validation_command" 2021-01-29 06:10:56 +0000 UTC [repmgrd] no output returned from command 2021-01-29 06:10:56 +0000 UTC [repmgrd] failover validation command returned zero 2021-01-29 06:10:56 +0000 UTC [repmgrd] this node is the winner, will now promote itself and inform other nodes 2021-01-29 06:10:56 +0000 UTC [repmgrd] promote_command is: 2021-01-29 06:10:56 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out 2021-01-29 06:10:56 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out 2021-01-29 06:10:56 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out 2021-01-29 06:10:56 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out 2021-01-29 06:10:56 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out 2021-01-29 06:10:56 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out 2021-01-29 06:10:56 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out 2021-01-29 06:10:56 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out 
2021-01-29 06:10:58 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out 2021-01-29 06:10:58 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out 2021-01-29 06:10:58 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out 2021-01-29 06:10:58 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out 2021-01-29 06:10:58 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out 2021-01-29 06:10:58 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out 2021-01-29 06:10:58 +0000 UTC [socat] W waitpid(-1, {}, WNOHANG): no child has exited * * * 2021-01-29 06:11:04 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out 2021-01-29 06:11:04 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out 2021-01-29 06:11:06 +0000 UTC [repmgr] 1 sibling nodes found, but option "--siblings-follow" not specified 2021-01-29 06:11:06 +0000 UTC [repmgr] promoting standby to primary 2021-01-29 06:11:06 +0000 UTC [postgres] [9-1] 2021-01-29 06:11:06.065 UTC [171] LOG: received promote request 2021-01-29 06:11:06 +0000 UTC [postgres] [7-1] 2021-01-29 06:11:06.065 UTC [175] FATAL: terminating walreceiver process due to administrator command 2021-01-29 06:11:06 +0000 UTC [postgres] [10-1] 2021-01-29 06:11:06.066 UTC [171] LOG: invalid resource manager ID 47 at 418/C2E8E700 2021-01-29 06:11:06 +0000 UTC [postgres] [11-1] 2021-01-29 06:11:06.066 UTC [171] LOG: redo done at 418/C2E8E6D8 2021-01-29 06:11:06 +0000 UTC [postgres] [12-1] 2021-01-29 06:11:06.066 UTC [171] LOG: last completed transaction was at log time 2021-01-29 06:10:25.633443+00 2021-01-29 06:11:06 +0000 UTC [postgres] [13-1] 2021-01-29 06:11:06.076 UTC [171] LOG: selected new timeline ID: 6 2021-01-29 06:11:06 +0000 UTC [postgres] [14-1] 2021-01-29 06:11:06.148 UTC [171] LOG: archive recovery complete 2021-01-29 06:11:06 +0000 UTC [repmgr] waiting up to 60 seconds (parameter "promote_check_timeout") 
for promotion to complete 2021-01-29 06:11:06 +0000 UTC [repmgr] STANDBY PROMOTE successful 2021-01-29 06:11:06 +0000 UTC [repmgr-event] 101 standby_promote 1 2021-01-29 06:11:06.17188+00 server "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) was successfully promoted to primary 2021-01-29 06:11:06 +0000 UTC [repmgr-event] Reconfiguring the db proxy with primary: postgres-1.postgres.prelude.svc.cluster.local 2021-01-29 06:11:06 +0000 UTC [postgres] [6-1] 2021-01-29 06:11:06.203 UTC [170] LOG: database system is ready to accept connections 2021-01-29 06:11:06 +0000 UTC [sudo] postgres : TTY=unknown ; PWD=/etc/service/repmgrd ; USER=root ; COMMAND=/usr/sbin/sv stop db-proxy 2021-01-29 06:11:06 +0000 UTC [sudo] pam_unix(sudo:session): session opened for user root by (uid=0)

Following this, we see the snippets below on postgres-1: at 2021-01-29 06:11:07 +0000 the old primary postgres-0 is marked as failed, and postgres-2 is notified to follow the new primary as a standby.

2021-01-29 06:11:06 +0000 UTC [socat] W exiting on signal 15 2021-01-29 06:11:06 +0000 UTC [socat] W exiting on signal 15 2021-01-29 06:11:06 +0000 UTC [socat] W exiting on signal 15 2021-01-29 06:11:06 +0000 UTC [socat] W exiting on signal 15 2021-01-29 06:11:06 +0000 UTC [socat] W exiting on signal 15 2021-01-29 06:11:06 +0000 UTC [socat] W exiting on signal 15 * * 2021-01-29 06:11:06 +0000 UTC [socat] W exiting on signal 15 2021-01-29 06:11:06 +0000 UTC [socat] W exiting on signal 15 2021-01-29 06:11:06 +0000 UTC [sudo] pam_unix(sudo:session): session closed for user root 2021-01-29 06:11:06 +0000 UTC [sudo] postgres : TTY=unknown ; PWD=/etc/service/repmgrd ; USER=root ; COMMAND=/usr/bin/tee /tmp/primary 2021-01-29 06:11:06 +0000 UTC [sudo] pam_unix(sudo:session): session opened for user root by (uid=0) 2021-01-29 06:11:06 +0000 UTC [sudo] pam_unix(sudo:session): session closed for user root 2021-01-29 06:11:06 +0000 UTC [sudo] postgres : TTY=unknown ; PWD=/etc/service/repmgrd ; USER=root ; COMMAND=/usr/sbin/sv start
db-proxy 2021-01-29 06:11:06 +0000 UTC [sudo] pam_unix(sudo:session): session opened for user root by (uid=0) socat: no process found 2021-01-29 06:11:07 +0000 UTC [sudo] pam_unix(sudo:session): session closed for user root 2021-01-29 06:11:07 +0000 UTC [sudo] postgres : TTY=unknown ; PWD=/etc/service/repmgrd ; USER=root ; COMMAND=/usr/bin/touch /tmp/initialized 2021-01-29 06:11:07 +0000 UTC [sudo] pam_unix(sudo:session): session opened for user root by (uid=0) 2021-01-29 06:11:07 +0000 UTC [sudo] pam_unix(sudo:session): session closed for user root 2021-01-29 06:11:07 +0000 UTC [repmgrd] executing notification command for event "repmgrd_failover_promote" 2021-01-29 06:11:07 +0000 UTC [repmgr-event] 101 repmgrd_failover_promote 1 2021-01-29 06:11:07.074526+00 node 101 promoted to primary; old primary 100 marked as failed 2021-01-29 06:11:07 +0000 UTC [repmgrd] 1 followers to notify 2021-01-29 06:11:07 +0000 UTC [repmgrd] notifying node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) to follow node 101 2021-01-29 06:11:07 +0000 UTC [repmgrd] switching to primary monitoring mode 2021-01-29 06:11:07 +0000 UTC [repmgrd] executing notification command for event "repmgrd_reload" 2021-01-29 06:11:07 +0000 UTC [repmgr-event] 101 repmgrd_reload 1 2021-01-29 06:11:07.086852+00 monitoring cluster primary "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) 2021-01-29 06:11:07 +0000 UTC [repmgrd] monitoring cluster primary "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) 2021-01-29 06:11:12 +0000 UTC [repmgrd] new standby "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) has connected 2021-01-29 06:11:12 +0000 UTC [repmgrd] executing notification command for event "child_node_new_connect" 2021-01-29 06:11:12 +0000 UTC [repmgr-event] 101 child_node_new_connect 1 2021-01-29 06:11:12.114782+00 new standby "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) has connected 2021-01-29 06:12:07 +0000 UTC [repmgrd] monitoring primary node 
"postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state 2021-01-29 06:13:07 +0000 UTC [repmgrd] monitoring primary node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state 2021-01-29 06:14:07 +0000 UTC [repmgrd] monitoring primary node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state

Around 2021-01-29 06:16:48 +0000, postgres-1 reports that postgres-2 has disconnected as a standby.

postgres-1:

2021-01-29 06:16:08 +0000 UTC [repmgrd] monitoring primary node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state 2021-01-29 06:16:48 +0000 UTC [repmgrd] standby node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) has disconnected 2021-01-29 06:16:48 +0000 UTC [repmgrd] executing notification command for event "child_node_disconnect" 2021-01-29 06:16:48 +0000 UTC [repmgr-event] 101 child_node_disconnect 1 2021-01-29 06:16:48.7483+00 standby node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) has disconnected 2021-01-29 06:17:08 +0000 UTC [repmgrd] monitoring primary node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state 2021-01-29 06:17:16 +0000 UTC [postgres] [7-1] 2021-01-29 06:17:16.899 UTC [2820] LOG: unexpected EOF on client connection with an open transaction 2021-01-29 06:17:16 +0000 UTC [postgres] [7-1] 2021-01-29 06:17:16.914 UTC [2982] LOG: could not receive data from client: Connection reset by peer 2021-01-29 06:17:16 +0000 UTC [postgres] [7-1] 2021-01-29 06:17:16.914 UTC [2973] LOG: could not send data to client: Broken pipe 2021-01-29 06:17:16 +0000 UTC [postgres] [8-1] 2021-01-29 06:17:16.918 UTC [2973] LOG: could not receive data from client: Connection reset by peer 2021-01-29 06:17:16 +0000 UTC [postgres] [9-1] 2021-01-29 06:17:16.918 UTC [2973] FATAL: connection to client lost 2021-01-29 06:17:16 +0000 UTC [postgres] [7-1] 2021-01-29 06:17:16.920 UTC [2811] LOG: unexpected EOF on client connection with an open transaction
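The election at the start of this failover trace (06:10:45) follows the rule visible in the repmgrd messages: the standby with the highest last-received LSN becomes the promotion candidate, and on an LSN tie the node with the higher priority wins (here postgres-1 at priority 99 beats postgres-2 at 98). A rough sketch of that selection rule, not repmgr's actual code; the `pick_candidate` helper and its input format are illustrative:

```shell
#!/bin/bash
# Sketch of the promotion-candidate rule seen in the repmgrd log above:
# highest last-received LSN wins; on an LSN tie, higher node priority wins.
# Input lines (illustrative format): "<node_id> <lsn> <priority>"
pick_candidate() {
  while read -r id lsn prio; do
    # An LSN like 418/C2E8E700 is two hex words; fold it into one number
    # so it can be compared numerically.
    high=${lsn%%/*}; low=${lsn##*/}
    printf '%d %d %s\n' "$((16#$high * 4294967296 + 16#$low))" "$prio" "$id"
  done | sort -k1,1nr -k2,2nr | head -n1 | awk '{print $3}'
}

# Both standbys report LSN 418/C2E8E700; priority 99 beats 98 -> 101
printf '101 418/C2E8E700 99\n102 418/C2E8E700 98\n' | pick_candidate
```

With an LSN tie, as in the log above, priority decides; a standby that had received more WAL would win regardless of priority.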
2021-01-29 06:18:09 +0000 UTC [repmgrd] monitoring primary node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state 2021-01-29 06:19:09 +0000 UTC [repmgrd] monitoring primary node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state postgres-2 2021-01-29 06:11:08 +0000 UTC [postgres] [6-1] 2021-01-29 06:11:08.631 UTC [7056] LOG: fetching timeline history file for timeline 6 from primary server 2021-01-29 06:11:08 +0000 UTC [repmgr] STANDBY FOLLOW successful 2021-01-29 06:11:08 +0000 UTC [repmgr] executing notification command for event "standby_follow" 2021-01-29 06:11:08 +0000 UTC [repmgr-event] 102 standby_follow 1 2021-01-29 06:11:08.63358+00 standby attached to upstream node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) 101 host=postgres-1.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/rep mgr-db.cred connect_timeout=10 postgres-1.postgres.prelude.svc.cluster.local 2021-01-29 06:11:08 +0000 UTC [repmgr-event] Reconfiguring the db proxy with primary: postgres-1.postgres.prelude.svc.cluster.local 2021-01-29 06:11:08 +0000 UTC [postgres] [7-1] 2021-01-29 06:11:08.662 UTC [7056] LOG: started streaming WAL from primary at 418/C2000000 on timeline 5 2021-01-29 06:11:08 +0000 UTC [sudo] postgres : TTY=unknown ; PWD=/etc/service/repmgrd ; USER=root ; COMMAND=/usr/sbin/sv stop db-proxy 2021-01-29 06:11:08 +0000 UTC [sudo] pam_unix(sudo:session): session opened for user root by (uid=0) 2021-01-29 06:11:08 +0000 UTC [socat] W exiting on signal 15 2021-01-29 06:11:08 +0000 UTC [socat] W exiting on signal 15 * * 2021-01-29 06:11:08 +0000 UTC [postgres] [8-1] 2021-01-29 06:11:08.821 UTC [7056] LOG: replication terminated by primary server 2021-01-29 06:11:08 +0000 UTC [postgres] [8-2] 2021-01-29 06:11:08.821 UTC [7056] DETAIL: End of WAL reached on timeline 5 at 418/C2E8E700. 
2021-01-29 06:11:08 +0000 UTC [postgres] [10-1] 2021-01-29 06:11:08.822 UTC [7052] LOG: new target timeline is 6 2021-01-29 06:11:08 +0000 UTC [postgres] [9-1] 2021-01-29 06:11:08.828 UTC [7056] LOG: restarted WAL streaming at 418/C2000000 on timeline 6 2021-01-29 06:11:09 +0000 UTC [sudo] pam_unix(sudo:session): session closed for user root 2021-01-29 06:11:09 +0000 UTC [sudo] postgres : TTY=unknown ; PWD=/etc/service/repmgrd ; USER=root ; COMMAND=/usr/bin/tee /tmp/primary 2021-01-29 06:11:09 +0000 UTC [sudo] pam_unix(sudo:session): session opened for user root by (uid=0) 2021-01-29 06:11:09 +0000 UTC [sudo] pam_unix(sudo:session): session closed for user root 2021-01-29 06:11:09 +0000 UTC [sudo] postgres : TTY=unknown ; PWD=/etc/service/repmgrd ; USER=root ; COMMAND=/usr/sbin/sv start db-proxy 2021-01-29 06:11:09 +0000 UTC [sudo] pam_unix(sudo:session): session opened for user root by (uid=0) socat: no process found 2021-01-29 06:11:09 +0000 UTC [sudo] pam_unix(sudo:session): session closed for user root 2021-01-29 06:11:09 +0000 UTC [sudo] postgres : TTY=unknown ; PWD=/etc/service/repmgrd ; USER=root ; COMMAND=/usr/bin/touch /tmp/initialized 2021-01-29 06:11:09 +0000 UTC [sudo] pam_unix(sudo:session): session opened for user root by (uid=0) 2021-01-29 06:11:09 +0000 UTC [sudo] pam_unix(sudo:session): session closed for user root 2021-01-29 06:11:09 +0000 UTC [repmgrd] node 102 now following new upstream node 101 2021-01-29 06:11:09 +0000 UTC [repmgrd] executing notification command for event "repmgrd_failover_follow" 2021-01-29 06:11:09 +0000 UTC [repmgr-event] 102 repmgrd_failover_follow 1 2021-01-29 06:11:09.56332+00 node 102 now following new upstream node 101 2021-01-29 06:11:09 +0000 UTC [repmgrd] resuming standby monitoring mode 2021-01-29 06:12:10 +0000 UTC [repmgrd] node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) monitoring upstream node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state 2021-01-29 06:13:11 +0000 UTC 
[repmgrd] node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) monitoring upstream node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state 2021-01-29 06:14:12 +0000 UTC [repmgrd] node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) monitoring upstream node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state 2021-01-29 06:15:13 +0000 UTC [repmgrd] node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) monitoring upstream node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state 2021-01-29 06:16:14 +0000 UTC [repmgrd] node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) monitoring upstream node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state 2021-01-29 06:16:46 +0000 UTC [repmgrd] TERM signal received 2021-01-29 06:16:46 +0000 UTC [repmgrd] executing notification command for event "repmgrd_shutdown" 2021-01-29 06:16:46 +0000 UTC [repmgr-event] 102 repmgrd_shutdown 1 2021-01-29 06:16:46.832319+00 TERM signal received 2021-01-29 06:16:46 +0000 UTC [repmgrd] repmgrd terminating... 
2021-01-29 06:16:47 +0000 UTC [postgres] [6-1] 2021-01-29 06:16:47.251 UTC [7051] LOG: received fast shutdown request 2021-01-29 06:16:47 +0000 UTC [postgres] [7-1] 2021-01-29 06:16:47.263 UTC [7051] LOG: aborting any active transactions 2021-01-29 06:16:47 +0000 UTC [postgres] [10-1] 2021-01-29 06:16:47.263 UTC [7056] FATAL: terminating walreceiver process due to administrator command 2021-01-29 06:16:48 +0000 UTC [postgres] [5-1] 2021-01-29 06:16:48.594 UTC [7053] LOG: shutting down 2021-01-29 06:16:48 +0000 UTC [postgres] [8-1] 2021-01-29 06:16:48.719 UTC [7051] LOG: database system is shut down

After postgres-2 shut down around 2021-01-29 06:16:48 +0000, a startup procedure began immediately:

+ set -e + source /scripts/utils.sh ++ export PGCTLTIMEOUT=120 ++ PGCTLTIMEOUT=120 ++ export PGDATA=/data/live ++ PGDATA=/data/live ++ export PGFLAGS=/data/flags ++ PGFLAGS=/data/flags ++ export BACKUP_DIR=/data/backup ++ BACKUP_DIR=/data/backup ++ export DB_POPULATE=/scratch/populate_db.sql ++ DB_POPULATE=/scratch/populate_db.sql ++ export REPMGR_USER=repmgr-db ++ REPMGR_USER=repmgr-db ++ export REPMGR_PASSFILE=/scratch/repmgr-db.cred ++ REPMGR_PASSFILE=/scratch/repmgr-db.cred ++ export REPMGR_DB=repmgr-db ++ REPMGR_DB=repmgr-db ++ export DEFAULT_FAILURE_DELAY=10s ++ DEFAULT_FAILURE_DELAY=10s ++ export 'ERR_TOO_MANY_MASTERS=Found more than one active master.' ++ ERR_TOO_MANY_MASTERS='Found more than one active master.' ++ export 'ERR_NO_ACTIVE_MASTER=Found no active masters.' ++ ERR_NO_ACTIVE_MASTER='Found no active masters.'
* * * https://10.244.4.1:443/apis/apps/v1/namespaces/prelude/statefulsets/postgres + local replicas_count=3 + '[' 3 -eq 0 ']' ++ hostname -s ++ cut -b 10- + local index=2 ++ xargs -n 1 -I '{}' echo 'postgres-{}.postgres.prelude.svc.cluster.local' +++ expr 3 - 1 ++ seq 0 2 + export 'ALL_NODES=postgres-0.postgres.prelude.svc.cluster.local postgres-1.postgres.prelude.svc.cluster.local postgres-2.postgres.prelude.svc.cluster.local' + ALL_NODES='postgres-0.postgres.prelude.svc.cluster.local postgres-1.postgres.prelude.svc.cluster.local postgres-2.postgres.prelude.svc.cluster.local' ++ hostname -f + export SELF_NAME=postgres-2.postgres.prelude.svc.cluster.local + SELF_NAME=postgres-2.postgres.prelude.svc.cluster.local ++ expr 100 + 2 + export SELF_ID=102 + SELF_ID=102 ++ echo 'postgres-0.postgres.prelude.svc.cluster.local postgres-1.postgres.prelude.svc.cluster.local postgres-2.postgres.prelude.svc.cluster.local' ++ c1grep -v postgres-2.postgres.prelude.svc.cluster.local ++ grep -v postgres-2.postgres.prelude.svc.cluster.local + export 'OTHER_NODES=postgres-0.postgres.prelude.svc.cluster.local postgres-1.postgres.prelude.svc.cluster.local' + OTHER_NODES='postgres-0.postgres.prelude.svc.cluster.local postgres-1.postgres.prelude.svc.cluster.local' + return 0 + setup_repmgr + cat ++ connstring_from_node_id 102 ++ local node_id=102 +++ expr 102 - 100 ++ local node_index=2 +++ hostname +++ cut -d - -f 1 +++ hostname -d ++ local host=postgres-2.postgres.prelude.svc.cluster.local ++ echo 'host=postgres-2.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10' ++ expr 200 - 102 + setup_repmgrd + cat + find_current_master + master_nodes=() + read -r line ++ echo 'postgres-0.postgres.prelude.svc.cluster.local postgres-1.postgres.prelude.svc.cluster.local' + '[' -z postgres-0.postgres.prelude.svc.cluster.local ']' ++ ssh postgres@postgres-0.postgres.prelude.svc.cluster.local repmgr node status --csv ssh-keygen: 
generating new host keys: RSA DSA ECDSA ED25519 ++ true + local 'node_status="Node name","postgres-0.postgres.prelude.svc.cluster.local" "Node ID","100" "PostgreSQL version","10.10" "Total data size","32 GB" "Conninfo","host=postgres-0.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10" "Role","primary" "WAL archiving","enabled" "Archive command","/bin/true" "WALs pending archiving","0" "Replication connections","0 (of maximal 10)" "Replication slots","0 physical (of maximal 10; 0 missing)" "Upstream node","(none)" "Replication lag","n/a" "Last received LSN","(none)" "Last replayed LSN","(none)" "max_wal_senders",10 "occupied_wal_senders",0 "max_replication_slots",10 "active_replication_slots",0 "inactive_replication_slots",0 "missing_replication_slots",0' * * * ++ cut -d ' ' -f 1 + node_id=100 + master_nodes+=("${node_id}") + read -r line + '[' -z postgres-1.postgres.prelude.svc.cluster.local ']' ++ ssh postgres@postgres-1.postgres.prelude.svc.cluster.local repmgr node status --csv ++ true + local 'node_status="Node name","postgres-1.postgres.prelude.svc.cluster.local" "Node ID","101" "PostgreSQL version","10.10" "Total data size","32 GB" "Conninfo","host=postgres-1.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10" "Role","primary" "WAL archiving","enabled" "Archive command","/bin/true" "WALs pending archiving","0" "Replication connections","0 (of maximal 10)" "Replication slots","0 physical (of maximal 10; 0 missing)" "Upstream node","(none)" "Replication lag","n/a" "Last received LSN","(none)" "Last replayed LSN","(none)" "max_wal_senders",10 "occupied_wal_senders",0 "max_replication_slots",10 "active_replication_slots",0 "inactive_replication_slots",0 "missing_replication_slots",0' * * * ++ cut -d , -f 2 ++ grep '"Node ID"' ++ cut -d ' ' -f 1 ++ tr -d '"' + node_id=101 + master_nodes+=("${node_id}") + read -r line + 
unique_master_nodes=($(echo "${master_nodes[@]}" | tr ' ' '\n' | sort | uniq | tr '\n' ' ')) ++ echo 100 101 ++ sort ++ uniq ++ tr '\n' ' ' ++ tr ' ' '\n' + unique_master_count=2 + '[' 2 -eq 0 ']' + '[' 2 -eq 1 ']' + echo 'Found multiple master nodes 100' 101. + CURRENT_MASTER='Found more than one active master.' + return + [[ ! -f /data/live/postgresql.conf ]] + [[ -f /data/live/recovery.conf ]] + [[ Found more than one active master. == \F\o\u\n\d\ \m\o\r\e\ \t\h\a\n\ \o\n\e\ \a\c\t\i\v\e\ \m\a\s\t\e\r\. ]] + echo Too many active masters, will sleep and retry + delayed_exec /scripts/postgres_main.sh + sleep 10s Found multiple master nodes 100 101. Too many active masters, will sleep and retry

Even though postgres-2 kept attempting to start, it could not come up because it detected multiple master nodes. This retry loop continued until only one master was found, at 2021-01-29 07:51:53.623, and only then did the service start:

+ set -e + source /scripts/utils.sh ++ export PGCTLTIMEOUT=120 ++ PGCTLTIMEOUT=120 ++ export PGDATA=/data/live ++ PGDATA=/data/live ++ export PGFLAGS=/data/flags ++ PGFLAGS=/data/flags ++ export BACKUP_DIR=/data/backup ++ BACKUP_DIR=/data/backup ++ export DB_POPULATE=/scratch/populate_db.sql ++ DB_POPULATE=/scratch/populate_db.sql ++ export REPMGR_USER=repmgr-db ++ REPMGR_USER=repmgr-db ++ export REPMGR_PASSFILE=/scratch/repmgr-db.cred ++ REPMGR_PASSFILE=/scratch/repmgr-db.cred ++ export REPMGR_DB=repmgr-db * * * + unique_master_nodes=($(echo "${master_nodes[@]}" | tr ' ' '\n' | sort | uniq | tr '\n' ' ')) ++ echo 101 ++ sort ++ tr ' ' '\n' ++ uniq ++ tr '\n' ' ' + unique_master_count=1 + '[' 1 -eq 0 ']' + '[' 1 -eq 1 ']' + echo 'Found master node 101.' + CURRENT_MASTER=101 + return Found master node 101. + [[ ! -f /data/live/postgresql.conf ]] + [[ -f /data/live/recovery.conf ]] + [[ 101 == \F\o\u\n\d\ \m\o\r\e\ \t\h\a\n\ \o\n\e\ \a\c\t\i\v\e\ \m\a\s\t\e\r\. ]] + [[ 101 == \F\o\u\n\d\ \n\o\ \a\c\t\i\v\e\ \m\a\s\t\e\r\s\.
]] + [[ 101 == \1\0\2 ]] + echo 'Existing master 101 found, will attempt to join to it.' + init_as_standby 101 + CURRENT_MASTER=101 Existing master 101 found, will attempt to join to it. ++ find /data/live -maxdepth 0 -empty -type d ++ wc -l + '[' 0 -eq 0 ']' +++ hostname_from_node_id 101 +++ local node_id=101 ++++ expr 101 - 100 +++ local node_index=1 ++++ hostname ++++ cut -d - -f 1 ++++ hostname -d +++ echo postgres-1.postgres.prelude.svc.cluster.local ++ ssh postgres@postgres-1.postgres.prelude.svc.cluster.local /scripts/postgres_timeline.sh + their_timeline=6 + '[' -z 6 ']' ++ /scripts/postgres_timeline.sh + my_timeline=6 + '[' -n 6 ']' + '[' 6 -lt 6 ']' + echo Taking a backup of the database directory. Taking a backup of the database directory. + rm -rf /data/backup/backup_label.old /data/backup/base /data/backup/global /data/backup/pg_commit_ts /data/backup/pg_dynshmem /data/backup/pg_hba.conf /data/backup/pg_ident.conf /data/backup/pg_logical /data/backup/pg_multixact /data/backup/pg_notify /data/backup/pg_replslot /data/backup/pg_serial /data/backup/pg_snapshots /data/backup/pg_stat /data/backup/pg_stat_tmp /data/backup/pg_subtrans /data/backup/pg_tblspc /data/backup/pg_twophase /data/backup/PG_VERSION /data/backup/pg_wal /data/backup/pg_xact /data/backup/postgresql.auto.conf /data/backup/postgresql.conf /data/backup/postmaster.opts /data/backup/recovery.conf /data/backup/recovery.done + mv /data/live/backup_label.old /data/live/base /data/live/global /data/live/pg_commit_ts /data/live/pg_dynshmem /data/live/pg_hba.conf /data/live/pg_ident.conf /data/live/pg_logical /data/live/pg_multixact /data/live/pg_notify /data/live/pg_replslot /data/live/pg_serial /data/live/pg_snapshots /data/live/pg_stat /data/live/pg_stat_tmp /data/live/pg_subtrans /data/live/pg_tblspc /data/live/pg_twophase /data/live/PG_VERSION /data/live/pg_wal /data/live/pg_xact /data/live/postgresql.auto.conf /data/live/postgresql.conf /data/live/postmaster.opts /data/live/recovery.conf 
/data/live/recovery.done /data/backup/ + chown -R postgres:postgres /data/live * * /scripts/repmgr_event_handler.sh "102" "standby_clone" "1" "2021-01-29 07:51:53.563216+00" "cloned from host \"postgres-1.postgres.prelude.svc.cluster.local\", port 5432; backup method: pg_basebackup; --force: Y" "" "" "" 2021-01-29 07:51:53 +0000 UTC [repmgr-event] 102 standby_clone 1 2021-01-29 07:51:53.563216+00 cloned from host "postgres-1.postgres.prelude.svc.cluster.local", port 5432; backup method: pg_basebackup; --force: Y + chpst -u postgres pg_ctl start waiting for server to start....2021-01-29 07:51:53.623 UTC [18761] LOG: listening on IPv4 address "0.0.0.0", port 5432 2021-01-29 07:51:53.623 UTC [18761] LOG: listening on IPv6 address "::", port 5432 2021-01-29 07:51:53 +0000 UTC [postgres] [1-1] 2021-01-29 07:51:53.623 UTC [18761] LOG: listening on IPv4 address "0.0.0.0", port 5432 2021-01-29 07:51:53 +0000 UTC [postgres] [2-1] 2021-01-29 07:51:53.623 UTC [18761] LOG: listening on IPv6 address "::", port 5432 2021-01-29 07:51:53.647 UTC [18761] LOG: listening on Unix socket "/run/postgresql/.s.PGSQL.5432" 2021-01-29 07:51:53 +0000 UTC [postgres] [3-1] 2021-01-29 07:51:53.647 UTC [18761] LOG: listening on Unix socket "/run/postgresql/.s.PGSQL.5432" 2021-01-29 07:51:53.802 UTC [18761] LOG: ending log output to stderr 2021-01-29 07:51:53.802 UTC [18761] HINT: Future log output will go to log destination "syslog". 2021-01-29 07:51:53 +0000 UTC [postgres] [4-1] 2021-01-29 07:51:53.802 UTC [18761] LOG: ending log output to stderr 2021-01-29 07:51:53 +0000 UTC [postgres] [4-2] 2021-01-29 07:51:53.802 UTC [18761] HINT: Future log output will go to log destination "syslog". 
2021-01-29 07:51:53 +0000 UTC [postgres] [5-1] 2021-01-29 07:51:53.818 UTC [18762] LOG: database system was interrupted; last known up at 2021-01-29 07:51:03 UTC 2021-01-29 07:51:53 +0000 UTC [postgres] [6-1] 2021-01-29 07:51:53.947 UTC [18762] LOG: entering standby mode 2021-01-29 07:51:53 +0000 UTC [postgres] [7-1] 2021-01-29 07:51:53.977 UTC [18762] LOG: redo starts at 418/F9000028 2021-01-29 07:51:54 +0000 UTC [postgres] [8-1] 2021-01-29 07:51:54.171 UTC [18762] LOG: consistent recovery state reached at 418/FBB43658 .2021-01-29 07:51:55 +0000 UTC [postgres] [5-1] 2021-01-29 07:51:55.299 UTC [18761] LOG: database system is ready to accept read only connections 2021-01-29 07:51:55 +0000 UTC [postgres] [6-1] 2021-01-29 07:51:55.324 UTC [18771] LOG: started streaming WAL from primary at 418/FC000000 on timeline 6 done server started + chpst -u postgres repmgr standby register --force INFO: connecting to local node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) INFO: connecting to primary database INFO: executing notification command for event "standby_register" DETAIL: command is: /scripts/repmgr_event_handler.sh "102" "standby_register" "1" "2021-01-29 07:51:55.429089+00" "standby registration succeeded; upstream node ID is 101 (-F/--force option was used)" "101" "host=postgres-1.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10" "postgres-1.postgres.prelude.svc.cluster.local" 2021-01-29 07:51:55 +0000 UTC [repmgr-event] 102 standby_register 1 2021-01-29 07:51:55.429089+00 standby registration succeeded; upstream node ID is 101 (-F/--force option was used) 101 host=postgres-1.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10 postgres-1.postgres.prelude.svc.cluster.local INFO: standby registration complete NOTICE: standby node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) successfully registered + 
hostname_from_node_id 101

So, between 2021-01-29 06:16:48 +0000 and 2021-01-29 07:51:53.623, roughly 1.5 hours, postgres-2 was down because of the multiple-master situation.

The snippet below explains why postgres-0 became a master while postgres-1 was already master. The service on postgres-0 restarted around 2021-01-29 06:13:xx, and postgres-0 correctly identified that there was only one master (101). It then compared its own timeline with that master's. Because the master's timeline was not newer than its own (both reported timeline 5), postgres-0 decided to start as a master itself:

+ set -e + source /scripts/utils.sh ++ export PGCTLTIMEOUT=120 ++ PGCTLTIMEOUT=120 ++ export PGDATA=/data/live ++ PGDATA=/data/live * * * + unique_master_nodes=($(echo "${master_nodes[@]}" | tr ' ' '\n' | sort | uniq | tr '\n' ' ')) ++ echo 101 101 ++ tr ' ' '\n' ++ sort ++ uniq ++ tr '\n' ' ' + unique_master_count=1 + '[' 1 -eq 0 ']' + '[' 1 -eq 1 ']' + echo 'Found master node 101.' Found master node 101. + CURRENT_MASTER=101 + return + [[ ! -f /data/live/postgresql.conf ]] + [[ -f /data/live/recovery.conf ]] + [[ 101 == \F\o\u\n\d\ \m\o\r\e\ \t\h\a\n\ \o\n\e\ \a\c\t\i\v\e\ \m\a\s\t\e\r\. ]] + [[ 101 == \F\o\u\n\d\ \n\o\ \a\c\t\i\v\e\ \m\a\s\t\e\r\s\. ]] + [[ 101 == \1\0\0 ]] + echo 'Existing master 101 found, will compare timelines.' Existing master 101 found, will compare timelines. ++ /scripts/postgres_timeline.sh + my_timeline=5 +++ hostname_from_node_id 101 +++ local node_id=101 ++++ expr 101 - 100 +++ local node_index=1 ++++ hostname ++++ cut -d - -f 1 ++++ hostname -d +++ echo postgres-1.postgres.prelude.svc.cluster.local ++ ssh postgres@postgres-1.postgres.prelude.svc.cluster.local /scripts/postgres_timeline.sh + their_timeline=5 + '[' -n 5 ']' + '[' -z 5 ']' + '[' 5 -gt 5 ']' + echo 'The other master has equal or older timeline, I should not join to it!' + echo 'Start as a master and let the liveness check decide on.'
+ init_as_master The other master has equal or older timeline, I should not join to it! Start as a master and let the liveness check decide on. ++ find /data/live -maxdepth 0 -empty -type d ++ wc -l + '[' 0 -gt 0 ']' + setup_postgres_parameters + cat + setup_postgres_hba + local pg_hba=/data/live/pg_hba.conf + truncate -s 0 /data/live/pg_hba.conf + cat + chpst -u postgres pg_ctl start waiting for server to start....2021-01-29 06:13:33.487 UTC [120] LOG: listening on IPv4 address "0.0.0.0", port 5432 2021-01-29 06:13:33.488 UTC [120] LOG: listening on IPv6 address "::", port 5432 2021-01-29 06:13:33 +0000 UTC [postgres] [1-1] 2021-01-29 06:13:33.487 UTC [120] LOG: listening on IPv4 address "0.0.0.0", port 5432 2021-01-29 06:13:33 +0000 UTC [postgres] [2-1] 2021-01-29 06:13:33.488 UTC [120] LOG: listening on IPv6 address "::", port 5432 2021-01-29 06:13:33.494 UTC [120] LOG: listening on Unix socket "/run/postgresql/.s.PGSQL.5432" 2021-01-29 06:13:33 +0000 UTC [postgres] [3-1] 2021-01-29 06:13:33.494 UTC [120] LOG: listening on Unix socket "/run/postgresql/.s.PGSQL.5432" 2021-01-29 06:13:33.707 UTC [120] LOG: ending log output to stderr 2021-01-29 06:13:33.707 UTC [120] HINT: Future log output will go to log destination "syslog". 2021-01-29 06:13:33 +0000 UTC [postgres] [4-1] 2021-01-29 06:13:33.707 UTC [120] LOG: ending log output to stderr 2021-01-29 06:13:33 +0000 UTC [postgres] [4-2] 2021-01-29 06:13:33.707 UTC [120] HINT: Future log output will go to log destination "syslog". 2021-01-29 06:13:33 +0000 UTC [postgres] [5-1] 2021-01-29 06:13:33.715 UTC [121] LOG: database system was shut down at 2021-01-29 06:13:02 UTC 2021-01-29 06:13:33 +0000 UTC [postgres] [5-1] 2021-01-29 06:13:33.733 UTC [120] LOG: database system is ready to accept connections done server started + populate_databases + chpst -u postgres psql -f /scratch/populate_db.sql * * * You are now connected to database "repmgr-db" as user "postgres". 
ALTER EXTENSION psql:/scratch/populate_db.sql:199: NOTICE: version "5.1" of extension "repmgr" is already installed + [[ -f /data/flags/reindex.flag ]] + chpst -u postgres repmgr primary register --force INFO: connecting to primary database... INFO: "repmgr" extension is already installed INFO: executing notification command for event "primary_register" DETAIL: command is: /scripts/repmgr_event_handler.sh "100" "primary_register" "1" "2021-01-29 06:13:34.052955+00" "existing primary record updated" "" "" "" 2021-01-29 06:13:34 +0000 UTC [repmgr-event] 100 primary_register 1 2021-01-29 06:13:34.052955+00 existing primary record updated NOTICE: primary node record (ID: 100) updated + hostname -f + touch /tmp/initialized + runsv /etc/service/db-proxy/ + runsv /etc/service/repmgrd/ socat: no process found [2021-01-29 06:13:34] [NOTICE] using provided configuration file "/etc/repmgr.conf" [2021-01-29 06:13:34] [NOTICE] setup syslog (level: , facility: LOCAL0) 2021-01-29 06:13:34 +0000 UTC [repmgrd] repmgrd (repmgrd 5.1.0) starting up 2021-01-29 06:13:34 +0000 UTC [repmgrd] connecting to database "host=postgres-0.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10" 2021-01-29 06:13:34 +0000 UTC [repmgrd] starting monitoring of node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) 2021-01-29 06:13:34 +0000 UTC [repmgrd] "connection_check_type" set to "ping" 2021-01-29 06:13:34 +0000 UTC [repmgrd] executing notification command for event "repmgrd_start" 2021-01-29 06:13:34 +0000 UTC [repmgr-event] 100 repmgrd_start 1 2021-01-29 06:13:34.104243+00 monitoring cluster primary "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) 2021-01-29 06:13:34 +0000 UTC [repmgrd] monitoring cluster primary "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) 2021-01-29 06:13:34 +0000 UTC [repmgrd] child node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) is not yet attached 2021-01-29 06:13:34 
+0000 UTC [repmgrd] child node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) is not yet attached 2021-01-29 06:14:34 +0000 UTC [repmgrd] monitoring primary node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) in normal state 2021-01-29 06:15:34 +0000 UTC [repmgrd] monitoring primary node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) in normal state 2021-01-29 06:16:34 +0000 UTC [repmgrd] monitoring primary node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) in normal state

So between 2021-01-29 06:13:34 and 2021-01-29 07:45:56 +0000 there was a multiple-master situation. Around 2021-01-29 07:45:56 +0000 is when postgres-0 received a termination signal:

2021-01-29 07:40:59 +0000 UTC [repmgrd] monitoring primary node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) in normal state 2021-01-29 07:42:00 +0000 UTC [repmgrd] monitoring primary node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) in normal state 2021-01-29 07:43:00 +0000 UTC [repmgrd] monitoring primary node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) in normal state 2021-01-29 07:44:00 +0000 UTC [repmgrd] monitoring primary node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) in normal state 2021-01-29 07:45:00 +0000 UTC [repmgrd] monitoring primary node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) in normal state 2021-01-29 07:45:56 +0000 UTC [repmgrd] TERM signal received 2021-01-29 07:45:56 +0000 UTC [repmgrd] executing notification command for event "repmgrd_shutdown" 2021-01-29 07:45:56 +0000 UTC [repmgr-event] 100 repmgrd_shutdown 1 2021-01-29 07:45:56.497585+00 TERM signal received 2021-01-29 07:45:56 +0000 UTC [repmgrd] repmgrd terminating...
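The decision that created the second master is the timeline comparison in the startup script traces: a node only rejoins an existing master whose timeline is strictly newer than its own; on an equal or older timeline it starts as a master and leaves the outcome to the liveness check. A minimal sketch of that branch (the `decide_role` helper is illustrative, not the actual script):

```shell
#!/bin/bash
# Mirror of the join-vs-promote branch seen in the startup traces:
# rejoin only if the other master's timeline is strictly newer than ours.
decide_role() {
  local my_timeline=$1 their_timeline=$2
  if [ "$their_timeline" -gt "$my_timeline" ]; then
    echo "join"     # their history is ahead: clone and follow as a standby
  else
    echo "promote"  # equal or older: start as a master (split-brain risk)
  fi
}

decide_role 5 5   # the 06:13 case on postgres-0: both on timeline 5 -> promote
decide_role 5 6   # the 07:45 restart: master moved to timeline 6 -> join
```

The equal-timeline case is exactly what bit postgres-0 at 06:13: both nodes reported timeline 5 in that trace, so the strict greater-than test sent postgres-0 down the master path.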
2021-01-29 07:45:56 +0000 UTC [postgres] [6-1] 2021-01-29 07:45:56.919 UTC [120] LOG: received fast shutdown request 2021-01-29 07:45:56 +0000 UTC [postgres] [7-1] 2021-01-29 07:45:56.925 UTC [120] LOG: aborting any active transactions 2021-01-29 07:45:56 +0000 UTC [postgres] [6-1] 2021-01-29 07:45:56.925 UTC [10502] FATAL: terminating connection due to administrator command 2021-01-29 07:45:56 +0000 UTC [postgres] [6-1] 2021-01-29 07:45:56.925 UTC [10504] FATAL: terminating connection due to administrator command 2021-01-29 07:45:56 +0000 UTC [postgres] [6-1] 2021-01-29 07:45:56.925 UTC [10496] FATAL: terminating connection due to administrator command

On the subsequent restart, postgres-0 detects that postgres-1 is on a newer timeline (6, versus its own 5) and demotes itself to a standby rather than continuing as a master:

+ unique_master_count=1 + '[' 1 -eq 0 ']' + '[' 1 -eq 1 ']' Found master node 101. + echo 'Found master node 101.' + CURRENT_MASTER=101 + return + [[ ! -f /data/live/postgresql.conf ]] + [[ -f /data/live/recovery.conf ]] + [[ 101 == \F\o\u\n\d\ \m\o\r\e\ \t\h\a\n\ \o\n\e\ \a\c\t\i\v\e\ \m\a\s\t\e\r\. ]] + [[ 101 == \F\o\u\n\d\ \n\o\ \a\c\t\i\v\e\ \m\a\s\t\e\r\s\. ]] + [[ 101 == \1\0\0 ]] Existing master 101 found, will compare timelines. + echo 'Existing master 101 found, will compare timelines.' ++ /scripts/postgres_timeline.sh + my_timeline=5 +++ hostname_from_node_id 101 +++ local node_id=101 ++++ expr 101 - 100 +++ local node_index=1 ++++ hostname ++++ cut -d - -f 1 ++++ hostname -d +++ echo postgres-1.postgres.prelude.svc.cluster.local ++ ssh postgres@postgres-1.postgres.prelude.svc.cluster.local /scripts/postgres_timeline.sh + their_timeline=6 + '[' -n 5 ']' + '[' -z 6 ']' + '[' 6 -gt 5 ']' + echo 'The other master has a more recent timeline, I should join to it.' The other master has a more recent timeline, I should join to it.
+ init_as_standby 101
+ CURRENT_MASTER=101
++ find /data/live -maxdepth 0 -empty -type d
++ wc -l
+ '[' 0 -eq 0 ']'
+++ hostname_from_node_id 101
+++ local node_id=101
++++ expr 101 - 100
+++ local node_index=1
++++ hostname
++++ cut -d - -f 1
++++ hostname -d
+++ echo postgres-1.postgres.prelude.svc.cluster.local
++ ssh postgres@postgres-1.postgres.prelude.svc.cluster.local /scripts/postgres_timeline.sh
+ their_timeline=6
+ '[' -z 6 ']'
++ /scripts/postgres_timeline.sh
+ my_timeline=5
+ '[' -n 5 ']'
Taking a backup of the database directory.
+ '[' 6 -lt 5 ']'
+ echo Taking a backup of the database directory.

* * *

2021-01-29 07:53:39 +0000 UTC [repmgrd] repmgrd (repmgrd 5.1.0) starting up
2021-01-29 07:53:39 +0000 UTC [repmgrd] connecting to database "host=postgres-0.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10"
2021-01-29 07:53:39 +0000 UTC [repmgrd] starting monitoring of node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100)
2021-01-29 07:53:39 +0000 UTC [repmgrd] "connection_check_type" set to "ping"
2021-01-29 07:53:39 +0000 UTC [repmgrd] executing notification command for event "repmgrd_start"
2021-01-29 07:53:39 +0000 UTC [repmgr-event] 100 repmgrd_start 1 2021-01-29 07:53:39.302793+00 monitoring connection to upstream node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101)
2021-01-29 07:53:39 +0000 UTC [repmgrd] monitoring connection to upstream node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101)
2021-01-29 07:54:39 +0000 UTC [repmgrd] node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) monitoring upstream node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state
2021-01-29 07:55:40 +0000 UTC [repmgrd] node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) monitoring upstream node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state
2021-01-29 07:56:41 +0000 UTC [repmgrd] node
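Both nodes report their timeline through /scripts/postgres_timeline.sh, invoked locally and over ssh in the trace above. The script's contents are not shown in the logs; one plausible implementation, assuming it reads the cluster's control file, would extract the latest checkpoint's TimeLineID from pg_controldata output:

```shell
#!/bin/bash
# Hypothetical sketch of what a timeline probe like /scripts/postgres_timeline.sh
# might do. The real script's contents are an assumption, not taken from the logs.
timeline_from_controldata() {
    # Expects pg_controldata output on stdin; prints just the timeline number
    # from the "Latest checkpoint's TimeLineID" line.
    awk -F': *' '/Latest checkpoint.s TimeLineID/ {print $2; exit}'
}

# Example usage (PGDATA path is an assumption for this environment):
# pg_controldata /data/live | timeline_from_controldata
```

A plain integer on stdout is exactly what the caller needs, since the trace assigns the output directly to my_timeline and their_timeline and compares them numerically.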
"postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) monitoring upstream node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state
2021-01-29 07:57:41 +0000 UTC [repmgrd] node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) monitoring upstream node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state
2021-01-29 07:58:42 +0000 UTC [repmgrd] node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) monitoring upstream node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state
2021-01-29 07:59:42 +0000 UTC [repmgrd] node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) monitoring upstream node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state

So postgres-0 is detected as a standby at 2021-01-29 07:53:39.302793+00. Roughly two minutes earlier, at 2021-01-29 07:51:55 +0000 to be precise, just before the split-brain situation was resolved on postgres-0, postgres-2 came back as well. It looks like there is a cleanup script which runs in the background after about 1.5 hours to check whether a multiple-master situation exists. I will blog more details about it once the research progresses.
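The earlier trace hints at what such a multiple-master check looks like: it computes a unique_master_count and branches on zero, one, or more masters, emitting the exact strings "Found no active masters." and "Found more than one active master." that the [[ comparisons test for. A sketch consistent with those visible variables and messages (the real cleanup script is not shown in the logs, so the surrounding structure here is an assumption):

```shell
#!/bin/bash
# Sketch of a multiple-master check consistent with the trace: given the list
# of node IDs currently claiming to be primary, either print the single
# master's ID or one of the sentinel error strings the caller matches on.
find_active_master() {
    local masters=("$@")
    local unique_master_count=${#masters[@]}
    if [ "$unique_master_count" -eq 0 ]; then
        echo "Found no active masters."
    elif [ "$unique_master_count" -eq 1 ]; then
        echo "${masters[0]}"
    else
        echo "Found more than one active master."
    fi
}
```

Returning a node ID and error messages through the same stdout channel explains the odd-looking [[ 101 == \F\o\u\n\d\ ... ]] string comparisons in the trace: the caller has to distinguish an ID from a sentinel message.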
