Search Results

249 results found with an empty search

Unable to run deployments with AD integration in vRA 8.3 if OU path is more than 50 characters
If there is AD integration inside vRA 8.3 and then you attempt to deploy a machine to an OU relative path that is more than 50 characters long then your deployments would fail. This is a bug that is fixed in vRealize Automation 8.4 GA
Unable to update resource elements in vRealize Orchestrator
We've recently seen issues while trying update resource elements in vRealize Orchestrator 7.x Some of the exceptions you would see when the vRO workflow is executed is as below. Below excerpts are from scripting.log 2021-03-09 11:42:35.070+0000 [WorkflowExecutorPool-Thread-2] ERROR {|__SYSTEM|administrator@vsphere.local:updateResourceEleme:f896e1b8-4b5e-484e-b5d1-db04390a1bcc:token=e98c4043-52d4-4375-96b4-c699da88f34d} [ResourceElementManagementServiceImpl] Failed to commit 'Resource saved.'. Reason: null java.lang.NullPointerException Cannot set content from mime attachment : ch.dunes.util.DunesServerException: org.springframework.orm.jpa.JpaSystemException: Transaction was marked for rollback only; cannot commit; nested exception is org.hibernate.TransactionException: Transaction was marked for rollback only; cannot commit One of the main reasons why this is not working is because the jgit garbage collection is not working as expected in the background 2021-03-09 11:42:23.354+0000 [http-nio-127.0.0.1-8280-exec-6] ERROR {} [ContentVersionManagementServiceImpl] Failed to disable auto push: org.eclipse.jgit.api.errors.TransportException: /storage/db/vcodata/git/__SYSTEM.git: internal server error java.lang.RuntimeException: org.eclipse.jgit.api.errors.TransportException: /storage/db/vcodata/git/__SYSTEM.git: internal server error at com.vmware.o11n.git.StagingRepositoryService.checkout(StagingRepositoryService.java:226) at com.vmware.o11n.service.version.ContentVersionManagementServiceImpl$ReusableStaginRepository.(ContentVersionManagementServiceImpl.java:495) at com.vmware.o11n.service.version.ContentVersionManagementServiceImpl$StagingRepositoryPooledObjectFactory.create(ContentVersionManagementServiceImpl.java:455) at com.vmware.o11n.service.version.ContentVersionManagementServiceImpl$StagingRepositoryPooledObjectFactory.create(ContentVersionManagementServiceImpl.java:445) at org.apache.commons.pool2.BasePooledObjectFactory.makeObject(BasePooledObjectFactory.java:58) 2021-03-10 03:00:00.000+0000 [vcoSystemTaskScheduler-3] DEBUG {} [ContentVersionManagementServiceImpl] Starting garbage collection in all repositories. 2021-03-10 03:00:05.931+0000 [vcoSystemTaskScheduler-3] INFO {} [BareRepositoryService] Started garbage collection for __SYSTEM.git 2021-03-10 03:00:06.352+0000 [vcoSystemTaskScheduler-3] ERROR {} [BareRepositoryService] Failed to run garbage collection for __SYSTEM.git. Reason: 2021-03-10 03:00:06.352+0000 [vcoSystemTaskScheduler-3] ERROR {} [BareRepositoryService] Garbage collection failed. org.eclipse.jgit.api.errors.JGitInternalException: Garbage collection failed. at org.eclipse.jgit.api.GarbageCollectCommand.call(GarbageCollectCommand.java:229) at com.vmware.o11n.git.BareRepositoryService.runGc(BareRepositoryService.java:109) at com.vmware.o11n.service.version.ContentVersionManagementServiceImpl.lambda$compact$3(ContentVersionManagementServiceImpl.java:260) at com.vmware.o11n.service.version.ContentVersionManagementServiceImpl.wrapInTransaction(ContentVersionManagementServiceImpl.java:111) at com.vmware.o11n.service.version.ContentVersionManagementServiceImpl.compact(ContentVersionManagementServiceImpl.java:256) at com.vmware.o11n.service.version.ContentVersionManagementServiceImpl.lambda$compactAll$2(ContentVersionManagementServiceImpl.java:251) at java.util.ArrayList.forEach(ArrayList.java:1257) Garbage Collection runs everyday at 3 AM. If you notice garbage collection failures as shown below then execute the steps mentioned in the remediation section 2021-03-06 03:00:12.167+0000 [vcoSystemTaskScheduler-3] ERROR {} [BareRepositoryService] Garbage collection failed. org.eclipse.jgit.api.errors.JGitInternalException: Garbage collection failed. 2021-03-07 03:00:00.545+0000 [vcoSystemTaskScheduler-4] ERROR {} [BareRepositoryService] Garbage collection failed. org.eclipse.jgit.api.errors.JGitInternalException: Garbage collection failed. 2021-03-08 03:00:03.969+0000 [vcoSystemTaskScheduler-3] ERROR {} [BareRepositoryService] Garbage collection failed. org.eclipse.jgit.api.errors.JGitInternalException: Garbage collection failed. 2021-03-09 03:00:00.469+0000 [vcoSystemTaskScheduler-4] ERROR {} [BareRepositoryService] Garbage collection failed. org.eclipse.jgit.api.errors.JGitInternalException: Garbage collection failed. 2021-03-10 03:00:06.352+0000 [vcoSystemTaskScheduler-3] ERROR {} [BareRepositoryService] Garbage collection failed. org.eclipse.jgit.api.errors.JGitInternalException: Garbage collection failed. To resolve this issue follow these steps Download JGit cli client , below command would work if internet is available, else execute this command on node which has internet access and copy the file into vRA node curl -k --output jgit https://repo.eclipse.org/content/groups/releases//org/eclipse/jgit/org.eclipse.jgit.pgm/5.4.0.201906121030-r/org.eclipse.jgit.pgm-5.4.0.201906121030-r.sh 2. Move jgit to /usr/bin, set it to executable and change its owner to vco and make it executable mv jgit /usr/bin/ chown vco:vco /usr/bin/jgit chmod +x /usr/bin/jgit 3. Remove the HEAD ref of the MASTER mv /storage/db/vcodata/git/__SYSTEM.git/refs/heads/master /var/lib/vco/master_ref.backup 4. Check where the HEAD should be.Get the commit ref sha from the first line. ex: 01bb5f58716dc12016b4fd8798c7fa8c91c76bf3 cd /storage/db/vcodata/git/__SYSTEM.git jgit log 5. Login as vco su - vco 6. Create the master ref echo '01bb5f58716dc12016b4fd8798c7fa8c91c76bf3' > /storage/db/vcodata/git/__SYSTEM.git/refs/heads/master 7. Run jgit garbage collection cd /storage/db/vcodata/git/__SYSTEM.git jgit gc 8. Login as root again and make sure everything in the git folder is owned by vco:vco exit cd /storage/db/vcodata/git/__SYSTEM.git chown -R vco:vco 9. Restart the vRO server service vco-server restart This should fix any garbage collection issues
vIDM Inventory Sync fails after implementing workaround to fix CVE-2020-4006 (KB 81731)
Came across an issue where we see vIDM Inventory Sync fails after applying KB 81731 which talks about fixing CVE-2020-4006 vIDM configurator page(:8443) will not be available/accessible and we are blocking it due to security vulnerability CVE-2020-4006 has been determined to affect some releases of Workspace ONE Access, Identity Manager, and Workspace ONE Access Connector. This vulnerability and its impact on VMware products are documented in VMSA-2020-0027. Please review this advisory before continuing as there may be considerations outside the scope of this document including permanent fixes. Affected Product versions: VMware Workspace ONE Access 20.10 (Linux) VMware Workspace ONE Access 20.01 (Linux) VMware Identity Manager 3.3.3 (Linux) VMware Identity Manager 3.3.2 (Linux) VMware Identity Manager 3.3.1 (Linux) VMware Identity Manager Connector 3.3.2, 3.3.1 (Linux) VMware Identity Manager Connector 3.3.3, 3.3.2, 3.3.1 (Windows) VMware Identity Manager Connector 19.03.0.0, 19.03.0.1 According to KB following steps were suggested to implement workaround on vIDM Implement workaround for Linux based vIDM appliances Use SSH to connect to appliance using “sshuser” credentials configured during installation or updated later. Switch to root by typing su and provide “root” credentials configured during installation or updated later. Run the following commands: cd /opt/vmware/horizon/workspace mkdir webapps.tmp mv webapps/cfg webapps.tmp mv conf/Catalina/localhost/cfg.xml webapps.tmp service horizon-workspace restart Repeat steps for all Linux based appliances affected by CVE-2020-4006. The moment this workaround is implemented on vIDM , Inventory Sync of vIDM through vRLCM will fail For Inventory Sync to work through vRLCM one has to revert the workaround as stated in the KB Revert workaround for Linux based appliances Use SSH to connect to appliance using “sshuser” credentials configured during installation or updated later. Switch to root by typing su and provide “root” credentials configured during installation or updated later. Run the following commands: cd /opt/vmware/horizon/workspace mv webapps.tmp/cfg webapps mv webapps.tmp/cfg.xml conf/Catalina/localhost rmdir webapps.tmp service horizon-workspace restart Repeat steps for all Linux based appliances affected by CVE-2020-4006 This issue is fixed in vIDM 3.3.4 which is packaged with vRLCM 8.3 and vRA 8.3 NOTE: If there are plans to upgrade vIDM you have to revert the patch using the steps mentioned above.
A useful command to list only patch logs while implementing vRA HF's
Execute below command which would list only patch logs during HF implementation tail -f /var/log/vmware/vcac/vcac-config.log | grep cluster.patch
Useful Postgres commands which can be executed on vRA appliance database
List of few commands which can help in troubleshooting database related issues Login into Database /opt/vmware/vpostgres/current/bin/psql vcac postgres Exit Database \q Display largest tables across all formats SELECT nspname || '.' || relname AS "relation", pg_size_pretty(pg_total_relation_size(C.oid)) AS "total_size" FROM pg_class C LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace) WHERE nspname NOT IN ('pg_catalog', 'information_schema') AND C.relkind <> 'i' AND nspname !~ '^pg_toast' ORDER BY pg_total_relation_size(C.oid) DESC LIMIT 20; Display last Vaccum stuff SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze FROM pg_stat_all_tables WHERE schemaname = 'public'; Display largest objects SELECT relname AS objectname, relkind AS objecttype, reltuples AS "#entries", pg_size_pretty(relpages::bigint*8*1024) AS size FROM pg_class WHERE relpages >= 8 ORDER BY relpages DESC; Whole database size select pg_size_pretty(pg_database_size('vcac')); Single table size SELECT pg_size_pretty(pg_total_relation_size('event')); Size per row from the table SELECT primary_key, pg_column_size(tablename.*) FROM tablename;
Mandatory Prechecks before a vRealize Automation 7.x Upgrade
Collating few pre-checks which has to be performed when your performing a vRealize Automation upgrade based on our experience Check 1 Ensure vPostgres is in a healthy state Simple environment Distributed Environment Check 2 These IaaS nodes should have the last connected time, not more than 30 seconds. Appliances should not be more than 10 minutes. Here's the command you can use from a shell session vcac-config cluster-config -list | sed -e 's/[{}]/''/g' | awk -v k="text" '{n=split($0,a,","); for (i=1; i<=n; i++) print a[i]}' | grep -E 'nodeHost|lastUpdateRelative' or one can check from VAMI as well Check 3 Health checks of all components of vRA have to pass. Click on this link to find out how you have to perform them. This link has screenshots as well so that you may compare the outputs. Check 4 Ensure there are no duplicates inside Postgres database Login into Postgres psql -U postgres -d vcac Check for duplicates by executing below query select * from (select categoryid, name, count(*)-1, 'vmo_configelement' as type from vmo_configelement group by categoryid, name having count(*) > 1 union select parentcategoryid, name, count(*)-1, 'vmo_configelementcategory' as type from vmo_configelementcategory group by parentcategoryid, name having count(*) > 1 union select elementid, '' as name, count(*)-1, 'vmo_configelementcontent' as type from vmo_configelementcontent group by elementid having count(*) > 1 union select '' as id, name, count(*)-1, 'vmo_package' as type from vmo_package group by name having count(*) > 1 union select packageid, '' as name, count(*)-1, 'vmo_packagecontent' as type from vmo_packagecontent group by packageid having count(*) > 1 union select policyid, '' as name, count(*)-1, 'vmo_policycontent' as type from vmo_policycontent group by policyid having count(*) > 1 union select categoryid, name, count(*)-1, 'vmo_resourceelement' as type from vmo_resourceelement group by categoryid, name having count(*) > 1 union select elementid, '' as name, count(*)-1, 'vmo_resourceelementcontent' as type from vmo_resourceelementcontent group by elementid having count(*) > 1 union select scriptmodulecategoryid, name, count(*)-1, 'vmo_scriptmodule' as type from vmo_scriptmodule group by scriptmodulecategoryid, name having count(*) > 1 union select '' as id, name, count(*)-1, 'vmo_scriptmodulecategory' as type from vmo_scriptmodulecategory group by name having count(*) > 1 union select scriptmoduleid, '' as name, count(*)-1, 'vmo_scriptmodulecontent' as type from vmo_scriptmodulecontent group by scriptmoduleid having count(*) > 1 union select dunesobjectid, version, count(*)-1, 'vmo_versionhistorycontent' as type from vmo_versionhistorycontent group by dunesobjectid, version having count(*) > 1 union select workflowid, '' as name, count(*)-1, 'vmo_workflowcontent' as type from vmo_workflowcontent group by workflowid having count(*) > 1) as u; Just copy and paste the query. Then execute. The output should show 0 rows as shown below Check 5 Ensure cluster_nodes table inside Postgres just has a certificate and not a complete chain Login into Postgres psql -U postgres -d vcac Check cluster_nodes table select * from cluster_nodes where node_type = 'VA'; The certificate here must match /opt/vmware/etc/lighttpd/server.pem Moreover, server.pem must only contain a PRIVATE KEY and a CERTIFICATE This must not contain a chain like ROOT and INTERMEDIATE CERTIFICATE
AutoConfigurer cleaned up resource
AutoConfigurer is a process of deleting "RETIRED" resources from vRA 8.x environment. The moment it's marked as retired in the next data collection cycle it's deleted from vRA Here's a simple test performed in my lab Deployed a test deployment Here are the details of the deployment Deployment details on the vCenter Checked the latest data collection if it happened Powered off the virtual machine which belongs to the deployment Deleted the virtual machine Data Collection yet to start in 4 minutes Deployment History of this machine Detected as powered off Data Collection about to be executed Remember , this machine or resource has already been deleted from the vCenter After the new data collection cycle is complete This is where your AutoConfigurer will delete the resource after marking it as RETIRED 2021-02-09T06:51:10.137Z [priority='INFO' thread='vsphere-enum-2' user='' org='' context='' parent='' token=''] c.v.p.c.m.a.v.p.ResourceBatchProcessor.lambda$fetchToBeDeletedResources$7:233 - [premvc.prem.com:production] Will delete [1] documents of type [VirtualMachine] for resources [["retiretest->/resources/compute/06510162-8982-41a1-898a-f6f721f2a41e"]]. 2021-02-09T06:51:10.138Z [priority='INFO' thread='vsphere-enum-2' user='' org='' context='' parent='' token=''] c.v.p.c.m.a.v.p.VMBatchProcessor.lambda$deleteResourcesIncremental$28:270 - premvc.prem.com:production Deleting 1 resources of type VirtualMachine with morefs [VirtualMachine:vm-3109] 2021-02-09T06:51:10.139Z [priority='INFO' thread='vsphere-enum-2' user='' org='' context='' parent='' token=''] c.v.p.c.m.a.v.p.VirtualMachineProcessor.processDelete:186 - [premvc.prem.com]:[production]: Begin deleting machine [retiretest] with self link [/resources/compute/06510162-8982-41a1-898a-f6f721f2a41e] 2021-02-09T06:51:10.159Z [priority='INFO' thread='xn-index-updates-50' user='' org='' context='' parent='' token=''] c.v.p.c.m.a.v.processors.ProcessorUtils.lambda$retireCompute$5:219 - [premvc.prem.com]:[production]: The compute [retiretest] with self link [/resources/compute/06510162-8982-41a1-898a-f6f721f2a41e] is marked as RETIRED with expiration micros [1613112670139000] Here's your AutoConfigurer message on the deployment pane stating it has deleted the resource
vPostgres Database Failover Scenarios
Master Database constantly streams transactions to Replica servers Once Master fails, the active and working Replica is ready to proceed with Read-Only requests right away.When the new Master is promoted ( manually or automatically ), all upcoming requests are moved to it. Up to vRA 7.2 release , manual failover was supported which means that the MASTER is replaced with a Replica selected by Administrator Administrator needs to trigger the promotion operation and wait for its completion With vRA 7.3, PostgreSQL database failover is automatic, when three appliances are deployed and synchronous replication is configured Failover scenarios does not differ from the manual failover scenarios except parts of the scenarios with the steps needed to recover the environment are performed automatically, when automatic failover is on. It is the ability of the vRA PostgreSQL database component to automatically sustain a Master node alive at any time without further intervention of Administrator. There are some cases that vRA is unable to perform an automatic failover, for example if more than half of the nodes are off at the same time.Monitoring the https://vra-fqdn:5434/api/status when in sync mode ( automatic failover turned on ) as specified in documentation will notify the client to perform manual failover when the manualFailoverNeeded status flag is true 3 Node ( vRA Appliances ) Failover Scenarios As you can see in the screenshot we do see three nodes One being MASTER , other being SYNC REPLICA and the last one is POTENTIAL REPLICA For systems that use synchronous replication, there is one replica node completely in sync with the MASTER known as SYNC REPLICA. In the above screenshot it's vra3.vcloud.local which is SYNC REPLICA The SYNC REPLICA which is also called as HOT-SYNC node will be the one who would be promoted as MASTER as soon as current MASTER state would be changed to DOWN Only SYNC REPLICA will be the one in complete SYNC with MASTER node , other's status would be set to POTENTIAL. In ASYNCHRONOUS replication mode all node's have Sync State set to ASYNC. MASTER does not have any Sync State value. The Valid column in the screenshot shows that the database instance in the node is healthy The Priority column shows the position of Replica in relation to the Master node. The Master node has no priority. SYNC REPLICA's priority would be set to 0 as it's the one next in line to become MASTER. Subsequent POTENTIAL REPLICA's would be having priority set to 1, 2.... and so on Three Node ( vRA Appliances ) Auto Failover Scenarios When one of three nodes is down , the AutoFailover will be performed. No more AutoFailover capability till all three nodes are alive Let's explore few scenarios Scenario 1: MASTER is DOWN What Happens SYNC REPLICA becomes MASTER and picks up the database function automatically ( performed by automatic failover agent ) POTENTIAL REPLICA is now a new SYNC REPLICA vRA appliances get into READ ONLY mode till automatic failover completes. Once former MASTER is back It would be set to REPLICA automatically by failover agent. No manual action is needed. If former MASTER cannot be fixed, set the PostgreSQL database to asynchronous mode Scenario 2: SYNC REPLICA is DOWN What Happens There would be no change to MASTER, hence no downtime. You may experience some delay in the database requests for a couple of seconds until Potential Replica becomes the new SYNC REPLICA Promotion of nodes is automatically done by the PostgreSQL service Once former SYNC REPLICA is back online It would be automatically set to POTENTIAL REPLICA Scenario 3: POTENTIAL REPLICA is DOWN What happens Nothing. There is no downtime When former POTENTIAL REPLICA comes back online It will automatically be set to POTENTIAL REPLICA If former POTENTIAL REPLICA does not come online then set the PostgreSQL database to asynchronous mode. When two out of three nodes are DOWN at the same time, the system starts working in a read only mode until a manual repair is performed. When just one Database on node is available in the system, it should be switched to asynchronous mode. Scenario 4: MASTER AND POTENTIAL REPLICA are DOWN What happens SYNC REPLICA will not be prompted as MASTER automatically System starts working in a read only mode until the manual promotion is performed Once Former MASTER and POTENTIAL REPLICA are back online They should be reset manually to synchronize against MASTER Then the replication should be turned back to SYNCHRONOUS mode Scenario 5: SYNC REPLICA and POTENTIAL REPLICA are DOWN What Happens MASTER is not able to process R/W transactions System starts working in READ ONLY mode until a manual repair is performed When the SYNC REPLICA and POTENTIAL REPLICA are back They must be reset manually to synchronize against the MASTER The system should be turned back to SYNCHRONOUS mode When link failure happens between NODES Scenario 6: Link Failure between SITE A and SITE B, but all 3 nodes are UP and ONLINE SITE A has MASTER and POTENTIAL REPLICA SITE B has SYNC REPLICA What happens POTENTIAL REPLICA on SITE A becomes SYNC REPLICA There is no downtime When the link is re-established no manual action is needed Scenario 7: Link failure between SITE A and SITE B and all three nodes are UP and ONLINE SITE A has MASTER SITE B has POTENTIAL REPLICA What happens SYNC REPLICA becomes MASTER and picks up the DB function automatically, and POTENTIAL REPLICA becomes the new SYNC REPLICA (performed by the automatic failover agent). System is working in a read only mode until this promotion finishes. When the link is re-established , former MASTER is reset as REPLICA. There is no MANUAL action needed vRA Database Configuration Files and it's log locations vPostgres configuration file: /var/vmware/vpostgres/current/pgdata/pg_hba.conf vPostgreslog: /storage/db/pgdata/pg_log/postgresql.csv, /storage/db/pgdata/serverlog Automatic Failover Agent log: /var/log/vcac/vcac-config.log #vRealizeAutomation
vRA 8.x PostgreSQL pods showing down ?
I was doing analysis on a problem where PostgreSQL pods were showing down. vRA 8.x deployments were stuck and not working as expected In an ideal scenario, there should always be one master and two standby nodes in vRA 8.x Ideal Status Node "postgres-0.postgres.prelude.svc.cluster.local": PostgreSQL version: 10.10 Total data size: 26 GB Conninfo: host=postgres-0.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10 Role: standby WAL archiving: disabled (on standbys "archive_mode" must be set to "always" to be effective) Archive command: /bin/true WALs pending archiving: 0 pending files Replication connections: 0 (of maximal 10) Replication slots: 0 physical (of maximal 10; 0 missing) Upstream node: postgres-1.postgres.prelude.svc.cluster.local (ID: 101) Replication lag: 0 seconds Last received LSN: 307/595EBA48 Last replayed LSN: 307/595EBA48 Node "postgres-1.postgres.prelude.svc.cluster.local": PostgreSQL version: 10.10 Total data size: 26 GB Conninfo: host=postgres-1.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10 Role: primary WAL archiving: enabled Archive command: /bin/true WALs pending archiving: 0 pending files Replication connections: 1 (of maximal 10) Replication slots: 0 physical (of maximal 10; 0 missing) Replication lag: n/a Node "postgres-2.postgres.prelude.svc.cluster.local": PostgreSQL version: 10.10 Total data size: 26 GB Conninfo: host=postgres-2.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10 Role: standby WAL archiving: disabled (on standbys "archive_mode" must be set to "always" to be effective) Archive command: /bin/true WALs pending archiving: 0 pending files Replication connections: 0 (of maximal 10) Replication slots: 0 physical (of maximal 10; 0 missing) Upstream node: postgres-1.postgres.prelude.svc.cluster.local (ID: 101) Replication lag: 0 seconds Last received LSN: 304/E1A1B458 Last replayed LSN: 304/E1A1B458 When issue was seen one of the postgres pods was reporting down postgres-0 1/1 Running 1 3d11h postgres-1 1/1 Running 0 3d11h postgres-2 0/1 Running 1 3d11h By the time we realized, it fixed itself Was curious to know what happened and how did it fix itself. So let's start our deepdive Started inspecting postgres-1 which is the current master node Till 2021-01-29 06:10:19 +0000 UTC , it was detecting that postgres-0 was it's master and all of a sudden at 2021-01-29 06:10:35 +0000 it was unable to ping postgres-0 2021-01-29 06:09:19 +0000 UTC [repmgrd] node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) monitoring upstream node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) in normal state 2021-01-29 06:10:19 +0000 UTC [repmgrd] node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) monitoring upstream node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) in normal state 2021-01-29 06:10:35 +0000 UTC [repmgrd] unable to ping "host=postgres-0.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10" 2021-01-29 06:10:35 +0000 UTC [repmgrd] unable to connect to upstream node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) 2021-01-29 06:10:35 +0000 UTC [repmgrd] checking state of node 100, 1 of 1 attempts 2021-01-29 06:10:37 +0000 UTC [socat] W waitpid(-1, {}, WNOHANG): no child has exited . Below snippets show that postgres-1 has been promoted as MASTER, where previous MASTER was postgres-0. postgres-1 has been promoted as MASTER on 2021-01-29 06:11:06.17188+00 2021-01-29 06:09:19 +0000 UTC [repmgrd] node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) monitoring upstream node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) in normal state 2021-01-29 06:10:19 +0000 UTC [repmgrd] node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) monitoring upstream node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) in normal state 2021-01-29 06:10:35 +0000 UTC [repmgrd] unable to ping "host=postgres-0.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10" 2021-01-29 06:10:35 +0000 UTC [repmgrd] unable to connect to upstream node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) 2021-01-29 06:10:35 +0000 UTC [repmgrd] checking state of node 100, 1 of 1 attempts 2021-01-29 06:10:37 +0000 UTC [socat] W waitpid(-1, {}, WNOHANG): no child has exited 2021-01-29 06:10:45 +0000 UTC [repmgrd] unable to ping "user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10 dbname=repmgr-db host=postgres-0.postgres.prelude.svc.cluster.local fallback_application_name=repmgr" 2021-01-29 06:10:45 +0000 UTC [repmgrd] unable to reconnect to node 100 after 1 attempts 2021-01-29 06:10:45 +0000 UTC [repmgrd] 1 active sibling nodes registered 2021-01-29 06:10:45 +0000 UTC [repmgrd] primary node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) and this node have the same location ("default") 2021-01-29 06:10:45 +0000 UTC [repmgrd] local node's last receive lsn: 418/C2E8E700 2021-01-29 06:10:45 +0000 UTC [repmgrd] checking state of sibling node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) 2021-01-29 06:10:45 +0000 UTC [repmgrd] node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) reports its upstream is node 100, last seen 20 second(s) ago 2021-01-29 06:10:45 +0000 UTC [repmgrd] standby node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) last saw primary node 20 second(s) ago 2021-01-29 06:10:45 +0000 UTC [repmgrd] last receive LSN for sibling node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) is: 418/C2E8E700 2021-01-29 06:10:45 +0000 UTC [repmgrd] node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) has same LSN as current candidate "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) 2021-01-29 06:10:45 +0000 UTC [repmgrd] node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) has lower priority (98) than current candidate "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) (99) 2021-01-29 06:10:45 +0000 UTC [repmgrd] visible nodes: 2; total nodes: 2; no nodes have seen the primary within the last 2 seconds 2021-01-29 06:10:45 +0000 UTC [repmgrd] promotion candidate is "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) 2021-01-29 06:10:45 +0000 UTC [repmgrd] executing "failover_validation_command" 2021-01-29 06:10:56 +0000 UTC [repmgrd] no output returned from command 2021-01-29 06:10:56 +0000 UTC [repmgrd] failover validation command returned zero 2021-01-29 06:10:56 +0000 UTC [repmgrd] this node is the winner, will now promote itself and inform other nodes 2021-01-29 06:10:56 +0000 UTC [repmgrd] promote_command is: 2021-01-29 06:10:56 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out 2021-01-29 06:10:56 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out 2021-01-29 06:10:56 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out 2021-01-29 06:10:56 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out 2021-01-29 06:10:56 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out 2021-01-29 06:10:56 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out 2021-01-29 06:10:56 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out 2021-01-29 06:10:56 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out 2021-01-29 06:10:58 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out 2021-01-29 06:10:58 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out 2021-01-29 06:10:58 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out 2021-01-29 06:10:58 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out 2021-01-29 06:10:58 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out 2021-01-29 06:10:58 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out 2021-01-29 06:10:58 +0000 UTC [socat] W waitpid(-1, {}, WNOHANG): no child has exited * * * 2021-01-29 06:11:04 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out 2021-01-29 06:11:04 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out 2021-01-29 06:11:06 +0000 UTC [repmgr] 1 sibling nodes found, but option "--siblings-follow" not specified 2021-01-29 06:11:06 +0000 UTC [repmgr] promoting standby to primary 2021-01-29 06:11:06 +0000 UTC [postgres] [9-1] 2021-01-29 06:11:06.065 UTC [171] LOG: received promote request 2021-01-29 06:11:06 +0000 UTC [postgres] [7-1] 2021-01-29 06:11:06.065 UTC [175] FATAL: terminating walreceiver process due to administrator command 2021-01-29 06:11:06 +0000 UTC [postgres] [10-1] 2021-01-29 06:11:06.066 UTC [171] LOG: invalid resource manager ID 47 at 418/C2E8E700 2021-01-29 06:11:06 +0000 UTC [postgres] [11-1] 2021-01-29 06:11:06.066 UTC [171] LOG: redo done at 418/C2E8E6D8 2021-01-29 06:11:06 +0000 UTC [postgres] [12-1] 2021-01-29 06:11:06.066 UTC [171] LOG: last completed transaction was at log time 2021-01-29 06:10:25.633443+00 2021-01-29 06:11:06 +0000 UTC [postgres] [13-1] 2021-01-29 06:11:06.076 UTC [171] LOG: selected new timeline ID: 6 2021-01-29 06:11:06 +0000 UTC [postgres] [14-1] 2021-01-29 06:11:06.148 UTC [171] LOG: archive recovery complete 2021-01-29 06:11:06 +0000 UTC [repmgr] waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete 2021-01-29 06:11:06 +0000 UTC [repmgr] STANDBY PROMOTE successful 2021-01-29 06:11:06 +0000 UTC [repmgr-event] 101 standby_promote 1 2021-01-29 06:11:06.17188+00 server "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) was successfully promoted to primary 2021-01-29 06:11:06 +0000 UTC [repmgr-event] Reconfiguring the db proxy with primary: postgres-1.postgres.prelude.svc.cluster.local 2021-01-29 06:11:06 +0000 UTC [postgres] [6-1] 2021-01-29 06:11:06.203 UTC [170] LOG: database system is ready to accept connections 2021-01-29 06:11:06 +0000 UTC [sudo] postgres : TTY=unknown ; PWD=/etc/service/repmgrd ; USER=root ; COMMAND=/usr/sbin/sv stop db-proxy 2021-01-29 06:11:06 +0000 UTC [sudo] pam_unix(sudo:session): session opened for user root by (uid=0) Post this on postgres-1 we see following snippets. At 2021-01-29 06:11:07 +0000 postgres-0 has been marked as a failure. Notifies postgres-2 as a new standby node 2021-01-29 06:11:06 +0000 UTC [socat] W exiting on signal 15 2021-01-29 06:11:06 +0000 UTC [socat] W exiting on signal 15 2021-01-29 06:11:06 +0000 UTC [socat] W exiting on signal 15 2021-01-29 06:11:06 +0000 UTC [socat] W exiting on signal 15 2021-01-29 06:11:06 +0000 UTC [socat] W exiting on signal 15 2021-01-29 06:11:06 +0000 UTC [socat] W exiting on signal 15 * * 2021-01-29 06:11:06 +0000 UTC [socat] W exiting on signal 15 2021-01-29 06:11:06 +0000 UTC [socat] W exiting on signal 15 2021-01-29 06:11:06 +0000 UTC [sudo] pam_unix(sudo:session): session closed for user root 2021-01-29 06:11:06 +0000 UTC [sudo] postgres : TTY=unknown ; PWD=/etc/service/repmgrd ; USER=root ; COMMAND=/usr/bin/tee /tmp/primary 2021-01-29 06:11:06 +0000 UTC [sudo] pam_unix(sudo:session): session opened for user root by (uid=0) 2021-01-29 06:11:06 +0000 UTC [sudo] pam_unix(sudo:session): session closed for user root 2021-01-29 06:11:06 +0000 UTC [sudo] postgres : TTY=unknown ; PWD=/etc/service/repmgrd ; USER=root ; COMMAND=/usr/sbin/sv start db-proxy 2021-01-29 06:11:06 +0000 UTC [sudo] pam_unix(sudo:session): session opened for user root by (uid=0) socat: no process found 2021-01-29 06:11:07 +0000 UTC [sudo] pam_unix(sudo:session): session closed for user root 2021-01-29 06:11:07 +0000 UTC [sudo] postgres : TTY=unknown ; PWD=/etc/service/repmgrd ; USER=root ; COMMAND=/usr/bin/touch /tmp/initialized 2021-01-29 06:11:07 +0000 UTC [sudo] pam_unix(sudo:session): session opened for user root by (uid=0) 2021-01-29 06:11:07 +0000 UTC [sudo] pam_unix(sudo:session): session closed for user root 2021-01-29 06:11:07 +0000 UTC [repmgrd] executing notification command for event "repmgrd_failover_promote" 2021-01-29 06:11:07 +0000 UTC [repmgr-event] 101 repmgrd_failover_promote 1 2021-01-29 06:11:07.074526+00 node 101 promoted to primary; old primary 100 marked as failed 2021-01-29 06:11:07 +0000 UTC [repmgrd] 1 followers to notify 2021-01-29 06:11:07 +0000 UTC [repmgrd] notifying node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) to follow node 101 2021-01-29 06:11:07 +0000 UTC [repmgrd] switching to primary monitoring mode 2021-01-29 06:11:07 +0000 UTC [repmgrd] executing notification command for event "repmgrd_reload" 2021-01-29 06:11:07 +0000 UTC [repmgr-event] 101 repmgrd_reload 1 2021-01-29 06:11:07.086852+00 monitoring cluster primary "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) 2021-01-29 06:11:07 +0000 UTC [repmgrd] monitoring cluster primary "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) 2021-01-29 06:11:12 +0000 UTC [repmgrd] new standby "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) has connected 2021-01-29 06:11:12 +0000 UTC [repmgrd] executing notification command for event "child_node_new_connect" 2021-01-29 06:11:12 +0000 UTC [repmgr-event] 101 child_node_new_connect 1 2021-01-29 06:11:12.114782+00 new standby "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) has connected 2021-01-29 06:12:07 +0000 UTC [repmgrd] monitoring primary node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state 2021-01-29 06:13:07 +0000 UTC [repmgrd] monitoring primary node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state 2021-01-29 06:14:07 +0000 UTC [repmgrd] monitoring primary node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state Around 2021-01-29 06:16:48 +0000 postgres-1 reports that postgres-2 has disconnected as a standby postgres-1 2021-01-29 06:16:08 +0000 UTC [repmgrd] monitoring primary node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state 2021-01-29 06:16:48 +0000 UTC [repmgrd] standby node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) has disconnected 2021-01-29 06:16:48 +0000 UTC [repmgrd] executing notification command for event "child_node_disconnect" 2021-01-29 06:16:48 +0000 UTC [repmgr-event] 101 child_node_disconnect 1 2021-01-29 06:16:48.7483+00 standby node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) has disconnected 2021-01-29 06:17:08 +0000 UTC [repmgrd] monitoring primary node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state 2021-01-29 06:17:16 +0000 UTC [postgres] [7-1] 2021-01-29 06:17:16.899 UTC [2820] LOG: unexpected EOF on client connection with an open transaction 2021-01-29 06:17:16 +0000 UTC [postgres] [7-1] 2021-01-29 06:17:16.914 UTC [2982] LOG: could not receive data from client: Connection reset by peer 2021-01-29 06:17:16 +0000 UTC [postgres] [7-1] 2021-01-29 06:17:16.914 UTC [2973] LOG: could not send data to client: Broken pipe 2021-01-29 06:17:16 +0000 UTC [postgres] [8-1] 2021-01-29 06:17:16.918 UTC [2973] LOG: could not receive data from client: Connection reset by peer 2021-01-29 06:17:16 +0000 UTC [postgres] [9-1] 2021-01-29 06:17:16.918 UTC [2973] FATAL: connection to client lost 2021-01-29 06:17:16 +0000 UTC [postgres] [7-1] 2021-01-29 06:17:16.920 UTC [2811] LOG: unexpected EOF on client connection with an open transaction 2021-01-29 06:18:09 +0000 UTC [repmgrd] monitoring primary node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state 2021-01-29 06:19:09 +0000 UTC [repmgrd] monitoring primary node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state postgres-2 2021-01-29 06:11:08 +0000 UTC [postgres] [6-1] 2021-01-29 06:11:08.631 UTC [7056] LOG: fetching timeline history file for timeline 6 from primary server 2021-01-29 06:11:08 +0000 UTC [repmgr] STANDBY FOLLOW successful 2021-01-29 06:11:08 +0000 UTC [repmgr] executing notification command for event "standby_follow" 2021-01-29 06:11:08 +0000 UTC [repmgr-event] 102 standby_follow 1 2021-01-29 06:11:08.63358+00 standby attached to upstream node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) 101 host=postgres-1.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/rep mgr-db.cred connect_timeout=10 postgres-1.postgres.prelude.svc.cluster.local 2021-01-29 06:11:08 +0000 UTC [repmgr-event] Reconfiguring the db proxy with primary: postgres-1.postgres.prelude.svc.cluster.local 2021-01-29 06:11:08 +0000 UTC [postgres] [7-1] 2021-01-29 06:11:08.662 UTC [7056] LOG: started streaming WAL from primary at 418/C2000000 on timeline 5 2021-01-29 06:11:08 +0000 UTC [sudo] postgres : TTY=unknown ; PWD=/etc/service/repmgrd ; USER=root ; COMMAND=/usr/sbin/sv stop db-proxy 2021-01-29 06:11:08 +0000 UTC [sudo] pam_unix(sudo:session): session opened for user root by (uid=0) 2021-01-29 06:11:08 +0000 UTC [socat] W exiting on signal 15 2021-01-29 06:11:08 +0000 UTC [socat] W exiting on signal 15 * * 2021-01-29 06:11:08 +0000 UTC [postgres] [8-1] 2021-01-29 06:11:08.821 UTC [7056] LOG: replication terminated by primary server 2021-01-29 06:11:08 +0000 UTC [postgres] [8-2] 2021-01-29 06:11:08.821 UTC [7056] DETAIL: End of WAL reached on timeline 5 at 418/C2E8E700. 2021-01-29 06:11:08 +0000 UTC [postgres] [10-1] 2021-01-29 06:11:08.822 UTC [7052] LOG: new target timeline is 6 2021-01-29 06:11:08 +0000 UTC [postgres] [9-1] 2021-01-29 06:11:08.828 UTC [7056] LOG: restarted WAL streaming at 418/C2000000 on timeline 6 2021-01-29 06:11:09 +0000 UTC [sudo] pam_unix(sudo:session): session closed for user root 2021-01-29 06:11:09 +0000 UTC [sudo] postgres : TTY=unknown ; PWD=/etc/service/repmgrd ; USER=root ; COMMAND=/usr/bin/tee /tmp/primary 2021-01-29 06:11:09 +0000 UTC [sudo] pam_unix(sudo:session): session opened for user root by (uid=0) 2021-01-29 06:11:09 +0000 UTC [sudo] pam_unix(sudo:session): session closed for user root 2021-01-29 06:11:09 +0000 UTC [sudo] postgres : TTY=unknown ; PWD=/etc/service/repmgrd ; USER=root ; COMMAND=/usr/sbin/sv start db-proxy 2021-01-29 06:11:09 +0000 UTC [sudo] pam_unix(sudo:session): session opened for user root by (uid=0) socat: no process found 2021-01-29 06:11:09 +0000 UTC [sudo] pam_unix(sudo:session): session closed for user root 2021-01-29 06:11:09 +0000 UTC [sudo] postgres : TTY=unknown ; PWD=/etc/service/repmgrd ; USER=root ; COMMAND=/usr/bin/touch /tmp/initialized 2021-01-29 06:11:09 +0000 UTC [sudo] pam_unix(sudo:session): session opened for user root by (uid=0) 2021-01-29 06:11:09 +0000 UTC [sudo] pam_unix(sudo:session): session closed for user root 2021-01-29 06:11:09 +0000 UTC [repmgrd] node 102 now following new upstream node 101 2021-01-29 06:11:09 +0000 UTC [repmgrd] executing notification command for event "repmgrd_failover_follow" 2021-01-29 06:11:09 +0000 UTC [repmgr-event] 102 repmgrd_failover_follow 1 2021-01-29 06:11:09.56332+00 node 102 now following new upstream node 101 2021-01-29 06:11:09 +0000 UTC [repmgrd] resuming standby monitoring mode 2021-01-29 06:12:10 +0000 UTC [repmgrd] node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) monitoring upstream node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state 2021-01-29 06:13:11 +0000 UTC [repmgrd] node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) monitoring upstream node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state 2021-01-29 06:14:12 +0000 UTC [repmgrd] node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) monitoring upstream node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state 2021-01-29 06:15:13 +0000 UTC [repmgrd] node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) monitoring upstream node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state 2021-01-29 06:16:14 +0000 UTC [repmgrd] node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) monitoring upstream node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state 2021-01-29 06:16:46 +0000 UTC [repmgrd] TERM signal received 2021-01-29 06:16:46 +0000 UTC [repmgrd] executing notification command for event "repmgrd_shutdown" 2021-01-29 06:16:46 +0000 UTC [repmgr-event] 102 repmgrd_shutdown 1 2021-01-29 06:16:46.832319+00 TERM signal received 2021-01-29 06:16:46 +0000 UTC [repmgrd] repmgrd terminating... 2021-01-29 06:16:47 +0000 UTC [postgres] [6-1] 2021-01-29 06:16:47.251 UTC [7051] LOG: received fast shutdown request 2021-01-29 06:16:47 +0000 UTC [postgres] [7-1] 2021-01-29 06:16:47.263 UTC [7051] LOG: aborting any active transactions 2021-01-29 06:16:47 +0000 UTC [postgres] [10-1] 2021-01-29 06:16:47.263 UTC [7056] FATAL: terminating walreceiver process due to administrator command 2021-01-29 06:16:48 +0000 UTC [postgres] [5-1] 2021-01-29 06:16:48.594 UTC [7053] LOG: shutting down 2021-01-29 06:16:48 +0000 UTC [postgres] [8-1] 2021-01-29 06:16:48.719 UTC [7051] LOG: database system is shut down After the shutdown on postgres-2 which happened around 2021-01-29 06:16:48 +0000 , we see there was a startup procedure immediately + set -e + source /scripts/utils.sh ++ export PGCTLTIMEOUT=120 ++ PGCTLTIMEOUT=120 ++ export PGDATA=/data/live ++ PGDATA=/data/live ++ export PGFLAGS=/data/flags ++ PGFLAGS=/data/flags ++ export BACKUP_DIR=/data/backup ++ BACKUP_DIR=/data/backup ++ export DB_POPULATE=/scratch/populate_db.sql ++ DB_POPULATE=/scratch/populate_db.sql ++ export REPMGR_USER=repmgr-db ++ REPMGR_USER=repmgr-db ++ export REPMGR_PASSFILE=/scratch/repmgr-db.cred ++ REPMGR_PASSFILE=/scratch/repmgr-db.cred ++ export REPMGR_DB=repmgr-db ++ REPMGR_DB=repmgr-db ++ export DEFAULT_FAILURE_DELAY=10s ++ DEFAULT_FAILURE_DELAY=10s ++ export 'ERR_TOO_MANY_MASTERS=Found more than one active master.' ++ ERR_TOO_MANY_MASTERS='Found more than one active master.' ++ export 'ERR_NO_ACTIVE_MASTER=Found no active masters.' ++ ERR_NO_ACTIVE_MASTER='Found no active masters.' * * * https://10.244.4.1:443/apis/apps/v1/namespaces/prelude/statefulsets/postgres + local replicas_count=3 + '[' 3 -eq 0 ']' ++ hostname -s ++ cut -b 10- + local index=2 ++ xargs -n 1 -I '{}' echo 'postgres-{}.postgres.prelude.svc.cluster.local' +++ expr 3 - 1 ++ seq 0 2 + export 'ALL_NODES=postgres-0.postgres.prelude.svc.cluster.local postgres-1.postgres.prelude.svc.cluster.local postgres-2.postgres.prelude.svc.cluster.local' + ALL_NODES='postgres-0.postgres.prelude.svc.cluster.local postgres-1.postgres.prelude.svc.cluster.local postgres-2.postgres.prelude.svc.cluster.local' ++ hostname -f + export SELF_NAME=postgres-2.postgres.prelude.svc.cluster.local + SELF_NAME=postgres-2.postgres.prelude.svc.cluster.local ++ expr 100 + 2 + export SELF_ID=102 + SELF_ID=102 ++ echo 'postgres-0.postgres.prelude.svc.cluster.local postgres-1.postgres.prelude.svc.cluster.local postgres-2.postgres.prelude.svc.cluster.local' ++ c1grep -v postgres-2.postgres.prelude.svc.cluster.local ++ grep -v postgres-2.postgres.prelude.svc.cluster.local + export 'OTHER_NODES=postgres-0.postgres.prelude.svc.cluster.local postgres-1.postgres.prelude.svc.cluster.local' + OTHER_NODES='postgres-0.postgres.prelude.svc.cluster.local postgres-1.postgres.prelude.svc.cluster.local' + return 0 + setup_repmgr + cat ++ connstring_from_node_id 102 ++ local node_id=102 +++ expr 102 - 100 ++ local node_index=2 +++ hostname +++ cut -d - -f 1 +++ hostname -d ++ local host=postgres-2.postgres.prelude.svc.cluster.local ++ echo 'host=postgres-2.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10' ++ expr 200 - 102 + setup_repmgrd + cat + find_current_master + master_nodes=() + read -r line ++ echo 'postgres-0.postgres.prelude.svc.cluster.local postgres-1.postgres.prelude.svc.cluster.local' + '[' -z postgres-0.postgres.prelude.svc.cluster.local ']' ++ ssh postgres@postgres-0.postgres.prelude.svc.cluster.local repmgr node status --csv ssh-keygen: generating new host keys: RSA DSA ECDSA ED25519 ++ true + local 'node_status="Node name","postgres-0.postgres.prelude.svc.cluster.local" "Node ID","100" "PostgreSQL version","10.10" "Total data size","32 GB" "Conninfo","host=postgres-0.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10" "Role","primary" "WAL archiving","enabled" "Archive command","/bin/true" "WALs pending archiving","0" "Replication connections","0 (of maximal 10)" "Replication slots","0 physical (of maximal 10; 0 missing)" "Upstream node","(none)" "Replication lag","n/a" "Last received LSN","(none)" "Last replayed LSN","(none)" "max_wal_senders",10 "occupied_wal_senders",0 "max_replication_slots",10 "active_replication_slots",0 "inactive_replication_slots",0 "missing_replication_slots",0' * * * ++ cut -d ' ' -f 1 + node_id=100 + master_nodes+=("${node_id}") + read -r line + '[' -z postgres-1.postgres.prelude.svc.cluster.local ']' ++ ssh postgres@postgres-1.postgres.prelude.svc.cluster.local repmgr node status --csv ++ true + local 'node_status="Node name","postgres-1.postgres.prelude.svc.cluster.local" "Node ID","101" "PostgreSQL version","10.10" "Total data size","32 GB" "Conninfo","host=postgres-1.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10" "Role","primary" "WAL archiving","enabled" "Archive command","/bin/true" "WALs pending archiving","0" "Replication connections","0 (of maximal 10)" "Replication slots","0 physical (of maximal 10; 0 missing)" "Upstream node","(none)" "Replication lag","n/a" "Last received LSN","(none)" "Last replayed LSN","(none)" "max_wal_senders",10 "occupied_wal_senders",0 "max_replication_slots",10 "active_replication_slots",0 "inactive_replication_slots",0 "missing_replication_slots",0' * * * ++ cut -d , -f 2 ++ grep '"Node ID"' ++ cut -d ' ' -f 1 ++ tr -d '"' + node_id=101 + master_nodes+=("${node_id}") + read -r line + unique_master_nodes=($(echo "${master_nodes[@]}" | tr ' ' '\n' | sort | uniq | tr '\n' ' ')) ++ echo 100 101 ++ sort ++ uniq ++ tr '\n' ' ' ++ tr ' ' '\n' + unique_master_count=2 + '[' 2 -eq 0 ']' + '[' 2 -eq 1 ']' + echo 'Found multiple master nodes 100' 101. + CURRENT_MASTER='Found more than one active master.' + return + [[ ! -f /data/live/postgresql.conf ]] + [[ -f /data/live/recovery.conf ]] + [[ Found more than one active master. == \F\o\u\n\d\ \m\o\r\e\ \t\h\a\n\ \o\n\e\ \a\c\t\i\v\e\ \m\a\s\t\e\r\. ]] + echo Too many active masters, will sleep and retry + delayed_exec /scripts/postgres_main.sh + sleep 10s Found multiple master nodes 100 101. Too many active masters, will sleep and retry Even though postgres-2 was attempting to start it was not coming up as it was detecting multiple master nodes This retry went on till the point it found only 1 Master: 2021-01-29 07:51:53.623 , that's when the service started + set -e + source /scripts/utils.sh ++ export PGCTLTIMEOUT=120 ++ PGCTLTIMEOUT=120 ++ export PGDATA=/data/live ++ PGDATA=/data/live ++ export PGFLAGS=/data/flags ++ PGFLAGS=/data/flags ++ export BACKUP_DIR=/data/backup ++ BACKUP_DIR=/data/backup ++ export DB_POPULATE=/scratch/populate_db.sql ++ DB_POPULATE=/scratch/populate_db.sql ++ export REPMGR_USER=repmgr-db ++ REPMGR_USER=repmgr-db ++ export REPMGR_PASSFILE=/scratch/repmgr-db.cred ++ REPMGR_PASSFILE=/scratch/repmgr-db.cred ++ export REPMGR_DB=repmgr-db * * * + unique_master_nodes=($(echo "${master_nodes[@]}" | tr ' ' '\n' | sort | uniq | tr '\n' ' ')) ++ echo 101 ++ sort ++ tr ' ' '\n' ++ uniq ++ tr '\n' ' ' + unique_master_count=1 + '[' 1 -eq 0 ']' + '[' 1 -eq 1 ']' + echo 'Found master node 101.' + CURRENT_MASTER=101 + return Found master node 101. + [[ ! -f /data/live/postgresql.conf ]] + [[ -f /data/live/recovery.conf ]] + [[ 101 == \F\o\u\n\d\ \m\o\r\e\ \t\h\a\n\ \o\n\e\ \a\c\t\i\v\e\ \m\a\s\t\e\r\. ]] + [[ 101 == \F\o\u\n\d\ \n\o\ \a\c\t\i\v\e\ \m\a\s\t\e\r\s\. ]] + [[ 101 == \1\0\2 ]] + echo 'Existing master 101 found, will attempt to join to it.' + init_as_standby 101 + CURRENT_MASTER=101 Existing master 101 found, will attempt to join to it. ++ find /data/live -maxdepth 0 -empty -type d ++ wc -l + '[' 0 -eq 0 ']' +++ hostname_from_node_id 101 +++ local node_id=101 ++++ expr 101 - 100 +++ local node_index=1 ++++ hostname ++++ cut -d - -f 1 ++++ hostname -d +++ echo postgres-1.postgres.prelude.svc.cluster.local ++ ssh postgres@postgres-1.postgres.prelude.svc.cluster.local /scripts/postgres_timeline.sh + their_timeline=6 + '[' -z 6 ']' ++ /scripts/postgres_timeline.sh + my_timeline=6 + '[' -n 6 ']' + '[' 6 -lt 6 ']' + echo Taking a backup of the database directory. Taking a backup of the database directory. + rm -rf /data/backup/backup_label.old /data/backup/base /data/backup/global /data/backup/pg_commit_ts /data/backup/pg_dynshmem /data/backup/pg_hba.conf /data/backup/pg_ident.conf /data/backup/pg_logical /data/backup/pg_multixact /data/backup/pg_notify /data/backup/pg_replslot /data/backup/pg_serial /data/backup/pg_snapshots /data/backup/pg_stat /data/backup/pg_stat_tmp /data/backup/pg_subtrans /data/backup/pg_tblspc /data/backup/pg_twophase /data/backup/PG_VERSION /data/backup/pg_wal /data/backup/pg_xact /data/backup/postgresql.auto.conf /data/backup/postgresql.conf /data/backup/postmaster.opts /data/backup/recovery.conf /data/backup/recovery.done + mv /data/live/backup_label.old /data/live/base /data/live/global /data/live/pg_commit_ts /data/live/pg_dynshmem /data/live/pg_hba.conf /data/live/pg_ident.conf /data/live/pg_logical /data/live/pg_multixact /data/live/pg_notify /data/live/pg_replslot /data/live/pg_serial /data/live/pg_snapshots /data/live/pg_stat /data/live/pg_stat_tmp /data/live/pg_subtrans /data/live/pg_tblspc /data/live/pg_twophase /data/live/PG_VERSION /data/live/pg_wal /data/live/pg_xact /data/live/postgresql.auto.conf /data/live/postgresql.conf /data/live/postmaster.opts /data/live/recovery.conf /data/live/recovery.done /data/backup/ + chown -R postgres:postgres /data/live * * /scripts/repmgr_event_handler.sh "102" "standby_clone" "1" "2021-01-29 07:51:53.563216+00" "cloned from host \"postgres-1.postgres.prelude.svc.cluster.local\", port 5432; backup method: pg_basebackup; --force: Y" "" "" "" 2021-01-29 07:51:53 +0000 UTC [repmgr-event] 102 standby_clone 1 2021-01-29 07:51:53.563216+00 cloned from host "postgres-1.postgres.prelude.svc.cluster.local", port 5432; backup method: pg_basebackup; --force: Y + chpst -u postgres pg_ctl start waiting for server to start....2021-01-29 07:51:53.623 UTC [18761] LOG: listening on IPv4 address "0.0.0.0", port 5432 2021-01-29 07:51:53.623 UTC [18761] LOG: listening on IPv6 address "::", port 5432 2021-01-29 07:51:53 +0000 UTC [postgres] [1-1] 2021-01-29 07:51:53.623 UTC [18761] LOG: listening on IPv4 address "0.0.0.0", port 5432 2021-01-29 07:51:53 +0000 UTC [postgres] [2-1] 2021-01-29 07:51:53.623 UTC [18761] LOG: listening on IPv6 address "::", port 5432 2021-01-29 07:51:53.647 UTC [18761] LOG: listening on Unix socket "/run/postgresql/.s.PGSQL.5432" 2021-01-29 07:51:53 +0000 UTC [postgres] [3-1] 2021-01-29 07:51:53.647 UTC [18761] LOG: listening on Unix socket "/run/postgresql/.s.PGSQL.5432" 2021-01-29 07:51:53.802 UTC [18761] LOG: ending log output to stderr 2021-01-29 07:51:53.802 UTC [18761] HINT: Future log output will go to log destination "syslog". 2021-01-29 07:51:53 +0000 UTC [postgres] [4-1] 2021-01-29 07:51:53.802 UTC [18761] LOG: ending log output to stderr 2021-01-29 07:51:53 +0000 UTC [postgres] [4-2] 2021-01-29 07:51:53.802 UTC [18761] HINT: Future log output will go to log destination "syslog". 2021-01-29 07:51:53 +0000 UTC [postgres] [5-1] 2021-01-29 07:51:53.818 UTC [18762] LOG: database system was interrupted; last known up at 2021-01-29 07:51:03 UTC 2021-01-29 07:51:53 +0000 UTC [postgres] [6-1] 2021-01-29 07:51:53.947 UTC [18762] LOG: entering standby mode 2021-01-29 07:51:53 +0000 UTC [postgres] [7-1] 2021-01-29 07:51:53.977 UTC [18762] LOG: redo starts at 418/F9000028 2021-01-29 07:51:54 +0000 UTC [postgres] [8-1] 2021-01-29 07:51:54.171 UTC [18762] LOG: consistent recovery state reached at 418/FBB43658 .2021-01-29 07:51:55 +0000 UTC [postgres] [5-1] 2021-01-29 07:51:55.299 UTC [18761] LOG: database system is ready to accept read only connections 2021-01-29 07:51:55 +0000 UTC [postgres] [6-1] 2021-01-29 07:51:55.324 UTC [18771] LOG: started streaming WAL from primary at 418/FC000000 on timeline 6 done server started + chpst -u postgres repmgr standby register --force INFO: connecting to local node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) INFO: connecting to primary database INFO: executing notification command for event "standby_register" DETAIL: command is: /scripts/repmgr_event_handler.sh "102" "standby_register" "1" "2021-01-29 07:51:55.429089+00" "standby registration succeeded; upstream node ID is 101 (-F/--force option was used)" "101" "host=postgres-1.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10" "postgres-1.postgres.prelude.svc.cluster.local" 2021-01-29 07:51:55 +0000 UTC [repmgr-event] 102 standby_register 1 2021-01-29 07:51:55.429089+00 standby registration succeeded; upstream node ID is 101 (-F/--force option was used) 101 host=postgres-1.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10 postgres-1.postgres.prelude.svc.cluster.local INFO: standby registration complete NOTICE: standby node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) successfully registered + hostname_from_node_id 101 So pretty much between 2021-01-29 06:16:48 +0000 --> 2021-01-29 07:51:53.623 for 1.5 hours this postgres-2 was down because of the multiple master situation. Below snippet we can discuss why postgres-0 became master , when postgres-1 was already a master On postgres-0 looks like it restarted the service around 2021-01-29 06:13:xx , this postgres-0 identified that there was only 1 Master. Then it tried to compare the timelines with existing master. While it was trying to compare existing timelines with the existing MASTER. Because of this difference in the timeline postgres-0 decides to become MASTER by itself + set -e + source /scripts/utils.sh ++ export PGCTLTIMEOUT=120 ++ PGCTLTIMEOUT=120 ++ export PGDATA=/data/live ++ PGDATA=/data/live * * * + unique_master_nodes=($(echo "${master_nodes[@]}" | tr ' ' '\n' | sort | uniq | tr '\n' ' ')) ++ echo 101 101 ++ tr ' ' '\n' ++ sort ++ uniq ++ tr '\n' ' ' + unique_master_count=1 + '[' 1 -eq 0 ']' + '[' 1 -eq 1 ']' + echo 'Found master node 101.' Found master node 101. + CURRENT_MASTER=101 + return + [[ ! -f /data/live/postgresql.conf ]] + [[ -f /data/live/recovery.conf ]] + [[ 101 == \F\o\u\n\d\ \m\o\r\e\ \t\h\a\n\ \o\n\e\ \a\c\t\i\v\e\ \m\a\s\t\e\r\. ]] + [[ 101 == \F\o\u\n\d\ \n\o\ \a\c\t\i\v\e\ \m\a\s\t\e\r\s\. ]] + [[ 101 == \1\0\0 ]] + echo 'Existing master 101 found, will compare timelines.' Existing master 101 found, will compare timelines. ++ /scripts/postgres_timeline.sh + my_timeline=5 +++ hostname_from_node_id 101 +++ local node_id=101 ++++ expr 101 - 100 +++ local node_index=1 ++++ hostname ++++ cut -d - -f 1 ++++ hostname -d +++ echo postgres-1.postgres.prelude.svc.cluster.local ++ ssh postgres@postgres-1.postgres.prelude.svc.cluster.local /scripts/postgres_timeline.sh + their_timeline=5 + '[' -n 5 ']' + '[' -z 5 ']' + '[' 5 -gt 5 ']' + echo 'The other master has equal or older timeline, I should not join to it!' + echo 'Start as a master and let the liveness check decide on.' + init_as_master The other master has equal or older timeline, I should not join to it! Start as a master and let the liveness check decide on. ++ find /data/live -maxdepth 0 -empty -type d ++ wc -l + '[' 0 -gt 0 ']' + setup_postgres_parameters + cat + setup_postgres_hba + local pg_hba=/data/live/pg_hba.conf + truncate -s 0 /data/live/pg_hba.conf + cat + chpst -u postgres pg_ctl start waiting for server to start....2021-01-29 06:13:33.487 UTC [120] LOG: listening on IPv4 address "0.0.0.0", port 5432 2021-01-29 06:13:33.488 UTC [120] LOG: listening on IPv6 address "::", port 5432 2021-01-29 06:13:33 +0000 UTC [postgres] [1-1] 2021-01-29 06:13:33.487 UTC [120] LOG: listening on IPv4 address "0.0.0.0", port 5432 2021-01-29 06:13:33 +0000 UTC [postgres] [2-1] 2021-01-29 06:13:33.488 UTC [120] LOG: listening on IPv6 address "::", port 5432 2021-01-29 06:13:33.494 UTC [120] LOG: listening on Unix socket "/run/postgresql/.s.PGSQL.5432" 2021-01-29 06:13:33 +0000 UTC [postgres] [3-1] 2021-01-29 06:13:33.494 UTC [120] LOG: listening on Unix socket "/run/postgresql/.s.PGSQL.5432" 2021-01-29 06:13:33.707 UTC [120] LOG: ending log output to stderr 2021-01-29 06:13:33.707 UTC [120] HINT: Future log output will go to log destination "syslog". 2021-01-29 06:13:33 +0000 UTC [postgres] [4-1] 2021-01-29 06:13:33.707 UTC [120] LOG: ending log output to stderr 2021-01-29 06:13:33 +0000 UTC [postgres] [4-2] 2021-01-29 06:13:33.707 UTC [120] HINT: Future log output will go to log destination "syslog". 2021-01-29 06:13:33 +0000 UTC [postgres] [5-1] 2021-01-29 06:13:33.715 UTC [121] LOG: database system was shut down at 2021-01-29 06:13:02 UTC 2021-01-29 06:13:33 +0000 UTC [postgres] [5-1] 2021-01-29 06:13:33.733 UTC [120] LOG: database system is ready to accept connections done server started + populate_databases + chpst -u postgres psql -f /scratch/populate_db.sql * * * You are now connected to database "repmgr-db" as user "postgres". ALTER EXTENSION psql:/scratch/populate_db.sql:199: NOTICE: version "5.1" of extension "repmgr" is already installed + [[ -f /data/flags/reindex.flag ]] + chpst -u postgres repmgr primary register --force INFO: connecting to primary database... INFO: "repmgr" extension is already installed INFO: executing notification command for event "primary_register" DETAIL: command is: /scripts/repmgr_event_handler.sh "100" "primary_register" "1" "2021-01-29 06:13:34.052955+00" "existing primary record updated" "" "" "" 2021-01-29 06:13:34 +0000 UTC [repmgr-event] 100 primary_register 1 2021-01-29 06:13:34.052955+00 existing primary record updated NOTICE: primary node record (ID: 100) updated + hostname -f + touch /tmp/initialized + runsv /etc/service/db-proxy/ + runsv /etc/service/repmgrd/ socat: no process found [2021-01-29 06:13:34] [NOTICE] using provided configuration file "/etc/repmgr.conf" [2021-01-29 06:13:34] [NOTICE] setup syslog (level: , facility: LOCAL0) 2021-01-29 06:13:34 +0000 UTC [repmgrd] repmgrd (repmgrd 5.1.0) starting up 2021-01-29 06:13:34 +0000 UTC [repmgrd] connecting to database "host=postgres-0.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10" 2021-01-29 06:13:34 +0000 UTC [repmgrd] starting monitoring of node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) 2021-01-29 06:13:34 +0000 UTC [repmgrd] "connection_check_type" set to "ping" 2021-01-29 06:13:34 +0000 UTC [repmgrd] executing notification command for event "repmgrd_start" 2021-01-29 06:13:34 +0000 UTC [repmgr-event] 100 repmgrd_start 1 2021-01-29 06:13:34.104243+00 monitoring cluster primary "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) 2021-01-29 06:13:34 +0000 UTC [repmgrd] monitoring cluster primary "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) 2021-01-29 06:13:34 +0000 UTC [repmgrd] child node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) is not yet attached 2021-01-29 06:13:34 +0000 UTC [repmgrd] child node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) is not yet attached 2021-01-29 06:14:34 +0000 UTC [repmgrd] monitoring primary node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) in normal state 2021-01-29 06:15:34 +0000 UTC [repmgrd] monitoring primary node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) in normal state 2021-01-29 06:16:34 +0000 UTC [repmgrd] monitoring primary node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) in normal state So between 2021-01-29 06:13:34 till 2021-01-29 07:45:56 +0000 there was a multiple master situation.Around 2021-01-29 07:45:56 +0000 is when termination was received on postgres-0 2021-01-29 07:40:59 +0000 UTC [repmgrd] monitoring primary node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) in normal state 2021-01-29 07:42:00 +0000 UTC [repmgrd] monitoring primary node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) in normal state 2021-01-29 07:43:00 +0000 UTC [repmgrd] monitoring primary node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) in normal state 2021-01-29 07:44:00 +0000 UTC [repmgrd] monitoring primary node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) in normal state 2021-01-29 07:45:00 +0000 UTC [repmgrd] monitoring primary node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) in normal state 2021-01-29 07:45:56 +0000 UTC [repmgrd] TERM signal received 2021-01-29 07:45:56 +0000 UTC [repmgrd] executing notification command for event "repmgrd_shutdown" 2021-01-29 07:45:56 +0000 UTC [repmgr-event] 100 repmgrd_shutdown 1 2021-01-29 07:45:56.497585+00 TERM signal received 2021-01-29 07:45:56 +0000 UTC [repmgrd] repmgrd terminating... 2021-01-29 07:45:56 +0000 UTC [postgres] [6-1] 2021-01-29 07:45:56.919 UTC [120] LOG: received fast shutdown request 2021-01-29 07:45:56 +0000 UTC [postgres] [7-1] 2021-01-29 07:45:56.925 UTC [120] LOG: aborting any active transactions 2021-01-29 07:45:56 +0000 UTC [postgres] [6-1] 2021-01-29 07:45:56.925 UTC [10502] FATAL: terminating connection due to administrator command 2021-01-29 07:45:56 +0000 UTC [postgres] [6-1] 2021-01-29 07:45:56.925 UTC [10504] FATAL: terminating connection due to administrator command 2021-01-29 07:45:56 +0000 UTC [postgres] [6-1] 2021-01-29 07:45:56.925 UTC [10496] FATAL: terminating connection due to administrator command This is when postgres-0 detects that postgres-1 has a new timeline and is the latest one and demotes itself to be a standby than a Master + unique_master_count=1 + '[' 1 -eq 0 ']' + '[' 1 -eq 1 ']' Found master node 101. + echo 'Found master node 101.' + CURRENT_MASTER=101 + return + [[ ! -f /data/live/postgresql.conf ]] + [[ -f /data/live/recovery.conf ]] + [[ 101 == \F\o\u\n\d\ \m\o\r\e\ \t\h\a\n\ \o\n\e\ \a\c\t\i\v\e\ \m\a\s\t\e\r\. ]] + [[ 101 == \F\o\u\n\d\ \n\o\ \a\c\t\i\v\e\ \m\a\s\t\e\r\s\. ]] + [[ 101 == \1\0\0 ]] Existing master 101 found, will compare timelines. + echo 'Existing master 101 found, will compare timelines.' ++ /scripts/postgres_timeline.sh + my_timeline=5 +++ hostname_from_node_id 101 +++ local node_id=101 ++++ expr 101 - 100 +++ local node_index=1 ++++ hostname ++++ cut -d - -f 1 ++++ hostname -d +++ echo postgres-1.postgres.prelude.svc.cluster.local ++ ssh postgres@postgres-1.postgres.prelude.svc.cluster.local /scripts/postgres_timeline.sh + their_timeline=6 + '[' -n 5 ']' + '[' -z 6 ']' + '[' 6 -gt 5 ']' + echo 'The other master has a more recent timeline, I should join to it.' The other master has a more recent timeline, I should join to it. + init_as_standby 101 + CURRENT_MASTER=101 ++ find /data/live -maxdepth 0 -empty -type d ++ wc -l + '[' 0 -eq 0 ']' +++ hostname_from_node_id 101 +++ local node_id=101 ++++ expr 101 - 100 +++ local node_index=1 ++++ hostname ++++ cut -d - -f 1 ++++ hostname -d +++ echo postgres-1.postgres.prelude.svc.cluster.local ++ ssh postgres@postgres-1.postgres.prelude.svc.cluster.local /scripts/postgres_timeline.sh + their_timeline=6 + '[' -z 6 ']' ++ /scripts/postgres_timeline.sh + my_timeline=5 + '[' -n 5 ']' Taking a backup of the database directory. + '[' 6 -lt 5 ']' + echo Taking a backup of the database directory. * * * 2021-01-29 07:53:39 +0000 UTC [repmgrd] repmgrd (repmgrd 5.1.0) starting up 2021-01-29 07:53:39 +0000 UTC [repmgrd] connecting to database "host=postgres-0.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10" 2021-01-29 07:53:39 +0000 UTC [repmgrd] starting monitoring of node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) 2021-01-29 07:53:39 +0000 UTC [repmgrd] "connection_check_type" set to "ping" 2021-01-29 07:53:39 +0000 UTC [repmgrd] executing notification command for event "repmgrd_start" 2021-01-29 07:53:39 +0000 UTC [repmgr-event] 100 repmgrd_start 1 2021-01-29 07:53:39.302793+00 monitoring connection to upstream node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) 2021-01-29 07:53:39 +0000 UTC [repmgrd] monitoring connection to upstream node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) 2021-01-29 07:54:39 +0000 UTC [repmgrd] node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) monitoring upstream node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state 2021-01-29 07:55:40 +0000 UTC [repmgrd] node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) monitoring upstream node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state 2021-01-29 07:56:41 +0000 UTC [repmgrd] node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) monitoring upstream node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state 2021-01-29 07:57:41 +0000 UTC [repmgrd] node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) monitoring upstream node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state 2021-01-29 07:58:42 +0000 UTC [repmgrd] node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) monitoring upstream node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state 2021-01-29 07:59:42 +0000 UTC [repmgrd] node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) monitoring upstream node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state The moment postgres-0 is now detected as standby at 2021-01-29 07:53:39.302793+00 , just 2 minutes early when this split brain issue was resolved on postgres-0 that's when postgres-2 comes back as well, to be precise : 2021-01-29 07:51:55 +0000 Looks like there is a cleanup script which runs in the backgorund after 1.5 hours to check if there is a multiple master situation. Will blog more details about it once research progresses
Deployment failures are seen when cloud-init based config added to the blueprint
When a user attempts to deploy a virtual machine from a blueprint that has cloud-init commands in it we then see the following exception as soon as we clone the machine and power it on Exception vCenter responded with "Permission to perform this operation was denied" Logs where the exception is recorded vRealize Automation logs provisioning-service-app-xxxxxxx/provisioning-service-app.log ebs-app-xxxxxxx/ebs-app.log catalog-service-app-xxxxxxx/catalog-service-app.log tango-blueprint-service-app-xxxxxxx/tango-blueprint-service-app.log proxy-service-xxxxxxx/squid-proxy.log vCenter vpxd-xxxx.log Snippets vCenter 2021-01-06T05:45:16.075Z verbose vpxd[04022] [Originator@6876 sub=Default opID=4d822cb6] [VpxVmomi] Invoking [reconfigure] on [vim.VirtualMachine:vm-748] session [52d723af-ca77-4846-2ffc-943a15ac0d3c(520ecdac-6870-7740-2a98-8ca92a0ed8f6)] 2021-01-06T05:45:16.075Z verbose vpxd[04022] [Originator@6876 sub=Vmomi opID=4d822cb6] Invoke error: vim.VirtualMachine.reconfigure session: 52d723af-ca77-4846-2ffc-943a15ac0d3c Throw: vim.fault.NoPermission vRA 2021-01-06T05:45:16.059Z [priority='INFO' thread='vsphere-io-70' user='' org='' context='' parent='' token=''] c.v.p.c.m.a.vsphere.InstanceClient.log:448 - Customizing after clone, setting cpu count to [2] on VM [VirtualMachine:vm-748]. 2021-01-06T05:45:16.060Z [priority='INFO' thread='vsphere-io-70' user='' org='' context='' parent='' token=''] c.v.p.c.m.a.vsphere.InstanceClient.log:448 - Customizing after clone, setting memory in MB to [2048] on VM [VirtualMachine:vm-748]. 2021-01-06T05:45:16.067Z [priority='ERROR' thread='vsphere-io-70' user='' org='' context='' parent='' token=''] c.v.p.c.m.a.v.u.c.SOAPMessageHandler.logMessage:113 - Fault received from vCenter : ServerFaultCodePermission to perform this operation was denied.VApp.ApplicationConfig After a little bit of research found that the following permission was missing adding this specific permission to the user account which is doing this task did help in resolving this problem vApp.vApp application configuration: Allows modification of a vApp's internal structure, such as product information and properties. If you look at the vRA log snippet in detail it clearly tells you that what permission is needed in order to fix this issue. VApp.ApplicationConfig For more information on permissions or privileges refer to doco: https://docs.vmware.com/en/VMware-vSphere/6.7/com.vmware.vsphere.security.doc/GUID-20C1A2F2-F7ED-4E03-A62C-E2E6F3743E39.html
What is Cirruswave?
Cirruswave is a Cloud Enablement Platform that allows us to develop, document, test & deploy enterprise-grade, secure robust, and scalable microservices Watch this video to know more......
vRealize Automation 8.1 P3 deploy.sh is failing with Error: cannot connect to Tiller
Deploying vRealize Automation 8.1 P3 fails at the "checking helm version" step with the error: cannot connect to Tiller. This issue occurs because the tiller is running on a single vRA node in the case of vRA HA. However, if the deploy.sh is run from a node different than the tiller node, the helm requests will timeout and the deploy.sh will fail. Workaround Download the file attached below On only one of the vRA nodes, upload the "2672731-add-tiller-proxy.tar.gz" file in "/root" Run this command tar -xvf 2672731-add-tiller-proxy.tar.gz && chmod a+x 2672731-add-tiller-proxy.sh && ./2672731-add-tiller-proxy.sh && rm 2672731-add-tiller-proxy.* Run the "helm ls" command from all vRA nodes to confirm that helm can be reached from all vRA nodes