I was analyzing a problem where PostgreSQL pods were showing as down in a vRA 8.x environment; deployments were stuck and not working as expected.
In an ideal scenario there should always be one master and two standby nodes in vRA 8.x.
Ideal Status
Node "postgres-0.postgres.prelude.svc.cluster.local":
PostgreSQL version: 10.10
Total data size: 26 GB
Conninfo: host=postgres-0.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10
Role: standby
WAL archiving: disabled (on standbys "archive_mode" must be set to "always" to be effective)
Archive command: /bin/true
WALs pending archiving: 0 pending files
Replication connections: 0 (of maximal 10)
Replication slots: 0 physical (of maximal 10; 0 missing)
Upstream node: postgres-1.postgres.prelude.svc.cluster.local (ID: 101)
Replication lag: 0 seconds
Last received LSN: 307/595EBA48
Last replayed LSN: 307/595EBA48
Node "postgres-1.postgres.prelude.svc.cluster.local":
PostgreSQL version: 10.10
Total data size: 26 GB
Conninfo: host=postgres-1.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10
Role: primary
WAL archiving: enabled
Archive command: /bin/true
WALs pending archiving: 0 pending files
Replication connections: 1 (of maximal 10)
Replication slots: 0 physical (of maximal 10; 0 missing)
Replication lag: n/a
Node "postgres-2.postgres.prelude.svc.cluster.local":
PostgreSQL version: 10.10
Total data size: 26 GB
Conninfo: host=postgres-2.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10
Role: standby
WAL archiving: disabled (on standbys "archive_mode" must be set to "always" to be effective)
Archive command: /bin/true
WALs pending archiving: 0 pending files
Replication connections: 0 (of maximal 10)
Replication slots: 0 physical (of maximal 10; 0 missing)
Upstream node: postgres-1.postgres.prelude.svc.cluster.local (ID: 101)
Replication lag: 0 seconds
Last received LSN: 304/E1A1B458
Last replayed LSN: 304/E1A1B458
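For reference: each node report above is the output of repmgr node status run as the postgres user on that node. The startup scripts later in this post do the same thing over ssh, so from inside any of the postgres pods the report can be reproduced with, for example:
# Run from inside one of the postgres pods; hostnames as used elsewhere in this post.
for node in postgres-0 postgres-1 postgres-2; do
  ssh "postgres@${node}.postgres.prelude.svc.cluster.local" repmgr node status
done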
When the issue was seen, one of the postgres pods was reporting as down (0/1 ready):
postgres-0 1/1 Running 1 3d11h
postgres-1 1/1 Running 0 3d11h
postgres-2 0/1 Running 1 3d11h
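The listing above is standard kubectl get pods output. Assuming the prelude namespace from the service names used throughout this post, it can be reproduced with:
kubectl -n prelude get pods | grep postgres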
By the time we noticed, it had fixed itself.
I was curious to know what happened and how it recovered on its own, so let's start our deep dive.
I started by inspecting postgres-1, which is the current master node.
Until 2021-01-29 06:10:19 +0000 UTC it was monitoring postgres-0 as its upstream (master); then, all of a sudden at 2021-01-29 06:10:35 +0000, it was unable to ping postgres-0:
2021-01-29 06:09:19 +0000 UTC [repmgrd] node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) monitoring upstream node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) in normal state
2021-01-29 06:10:19 +0000 UTC [repmgrd] node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) monitoring upstream node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) in normal state
2021-01-29 06:10:35 +0000 UTC [repmgrd] unable to ping "host=postgres-0.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10"
2021-01-29 06:10:35 +0000 UTC [repmgrd] unable to connect to upstream node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100)
2021-01-29 06:10:35 +0000 UTC [repmgrd] checking state of node 100, 1 of 1 attempts
2021-01-29 06:10:37 +0000 UTC [socat] W waitpid(-1, {}, WNOHANG): no child has exited
The snippets below show that postgres-1 was promoted to MASTER at 2021-01-29 06:11:06.17188+00, replacing the previous MASTER, postgres-0:
2021-01-29 06:09:19 +0000 UTC [repmgrd] node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) monitoring upstream node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) in normal state
2021-01-29 06:10:19 +0000 UTC [repmgrd] node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) monitoring upstream node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) in normal state
2021-01-29 06:10:35 +0000 UTC [repmgrd] unable to ping "host=postgres-0.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10"
2021-01-29 06:10:35 +0000 UTC [repmgrd] unable to connect to upstream node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100)
2021-01-29 06:10:35 +0000 UTC [repmgrd] checking state of node 100, 1 of 1 attempts
2021-01-29 06:10:37 +0000 UTC [socat] W waitpid(-1, {}, WNOHANG): no child has exited
2021-01-29 06:10:45 +0000 UTC [repmgrd] unable to ping "user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10 dbname=repmgr-db host=postgres-0.postgres.prelude.svc.cluster.local fallback_application_name=repmgr"
2021-01-29 06:10:45 +0000 UTC [repmgrd] unable to reconnect to node 100 after 1 attempts
2021-01-29 06:10:45 +0000 UTC [repmgrd] 1 active sibling nodes registered
2021-01-29 06:10:45 +0000 UTC [repmgrd] primary node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) and this node have the same location ("default")
2021-01-29 06:10:45 +0000 UTC [repmgrd] local node's last receive lsn: 418/C2E8E700
2021-01-29 06:10:45 +0000 UTC [repmgrd] checking state of sibling node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102)
2021-01-29 06:10:45 +0000 UTC [repmgrd] node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) reports its upstream is node 100, last seen 20 second(s) ago
2021-01-29 06:10:45 +0000 UTC [repmgrd] standby node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) last saw primary node 20 second(s) ago
2021-01-29 06:10:45 +0000 UTC [repmgrd] last receive LSN for sibling node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) is: 418/C2E8E700
2021-01-29 06:10:45 +0000 UTC [repmgrd] node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) has same LSN as current candidate "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101)
2021-01-29 06:10:45 +0000 UTC [repmgrd] node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) has lower priority (98) than current candidate "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) (99)
2021-01-29 06:10:45 +0000 UTC [repmgrd] visible nodes: 2; total nodes: 2; no nodes have seen the primary within the last 2 seconds
2021-01-29 06:10:45 +0000 UTC [repmgrd] promotion candidate is "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101)
2021-01-29 06:10:45 +0000 UTC [repmgrd] executing "failover_validation_command"
2021-01-29 06:10:56 +0000 UTC [repmgrd] no output returned from command
2021-01-29 06:10:56 +0000 UTC [repmgrd] failover validation command returned zero
2021-01-29 06:10:56 +0000 UTC [repmgrd] this node is the winner, will now promote itself and inform other nodes
2021-01-29 06:10:56 +0000 UTC [repmgrd] promote_command is:
2021-01-29 06:10:56 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out
2021-01-29 06:10:56 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out
2021-01-29 06:10:56 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out
2021-01-29 06:10:56 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out
2021-01-29 06:10:56 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out
2021-01-29 06:10:56 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out
2021-01-29 06:10:56 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out
2021-01-29 06:10:56 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out
2021-01-29 06:10:58 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out
2021-01-29 06:10:58 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out
2021-01-29 06:10:58 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out
2021-01-29 06:10:58 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out
2021-01-29 06:10:58 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out
2021-01-29 06:10:58 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out
2021-01-29 06:10:58 +0000 UTC [socat] W waitpid(-1, {}, WNOHANG): no child has exited
*
*
*
2021-01-29 06:11:04 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out
2021-01-29 06:11:04 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out
2021-01-29 06:11:06 +0000 UTC [repmgr] 1 sibling nodes found, but option "--siblings-follow" not specified
2021-01-29 06:11:06 +0000 UTC [repmgr] promoting standby to primary
2021-01-29 06:11:06 +0000 UTC [postgres] [9-1] 2021-01-29 06:11:06.065 UTC [171] LOG: received promote request
2021-01-29 06:11:06 +0000 UTC [postgres] [7-1] 2021-01-29 06:11:06.065 UTC [175] FATAL: terminating walreceiver process due to administrator command
2021-01-29 06:11:06 +0000 UTC [postgres] [10-1] 2021-01-29 06:11:06.066 UTC [171] LOG: invalid resource manager ID 47 at 418/C2E8E700
2021-01-29 06:11:06 +0000 UTC [postgres] [11-1] 2021-01-29 06:11:06.066 UTC [171] LOG: redo done at 418/C2E8E6D8
2021-01-29 06:11:06 +0000 UTC [postgres] [12-1] 2021-01-29 06:11:06.066 UTC [171] LOG: last completed transaction was at log time 2021-01-29 06:10:25.633443+00
2021-01-29 06:11:06 +0000 UTC [postgres] [13-1] 2021-01-29 06:11:06.076 UTC [171] LOG: selected new timeline ID: 6
2021-01-29 06:11:06 +0000 UTC [postgres] [14-1] 2021-01-29 06:11:06.148 UTC [171] LOG: archive recovery complete
2021-01-29 06:11:06 +0000 UTC [repmgr] waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
2021-01-29 06:11:06 +0000 UTC [repmgr] STANDBY PROMOTE successful
2021-01-29 06:11:06 +0000 UTC [repmgr-event] 101 standby_promote 1 2021-01-29 06:11:06.17188+00 server "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) was successfully promoted to primary
2021-01-29 06:11:06 +0000 UTC [repmgr-event] Reconfiguring the db proxy with primary: postgres-1.postgres.prelude.svc.cluster.local
2021-01-29 06:11:06 +0000 UTC [postgres] [6-1] 2021-01-29 06:11:06.203 UTC [170] LOG: database system is ready to accept connections
2021-01-29 06:11:06 +0000 UTC [sudo] postgres : TTY=unknown ; PWD=/etc/service/repmgrd ; USER=root ; COMMAND=/usr/sbin/sv stop db-proxy
2021-01-29 06:11:06 +0000 UTC [sudo] pam_unix(sudo:session): session opened for user root by (uid=0)
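The automatic failover above is driven by repmgrd's configuration. Below is a hedged sketch of the relevant repmgr.conf parameters for node 101; the values marked as inferred come from the log lines above, everything else is an assumption rather than a dump of the appliance's actual config.
node_id=101
node_name='postgres-1.postgres.prelude.svc.cluster.local'
conninfo='host=postgres-1.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10'
location='default'                    # inferred: "primary node ... and this node have the same location (\"default\")"
failover='automatic'                  # inferred: repmgrd promotes without operator action
priority=99                           # inferred: node 101 has priority 99, node 102 has 98
reconnect_attempts=1                  # inferred: "checking state of node 100, 1 of 1 attempts"
connection_check_type='ping'          # inferred: "\"connection_check_type\" set to \"ping\""
failover_validation_command='...'     # executed before promotion; actual command not shown in the log
promote_command='...'                 # logged as "promote_command is:" but the value is cut off above
event_notification_command='/scripts/repmgr_event_handler.sh ...'  # handler path seen in the repmgr-event lines; placeholders not shown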
After this, the following appears on postgres-1. At 2021-01-29 06:11:07 +0000 the old primary postgres-0 is marked as failed, postgres-2 is notified to follow the new primary, and shortly afterwards it connects as a standby:
2021-01-29 06:11:06 +0000 UTC [socat] W exiting on signal 15
2021-01-29 06:11:06 +0000 UTC [socat] W exiting on signal 15
2021-01-29 06:11:06 +0000 UTC [socat] W exiting on signal 15
2021-01-29 06:11:06 +0000 UTC [socat] W exiting on signal 15
2021-01-29 06:11:06 +0000 UTC [socat] W exiting on signal 15
2021-01-29 06:11:06 +0000 UTC [socat] W exiting on signal 15
*
*
2021-01-29 06:11:06 +0000 UTC [socat] W exiting on signal 15
2021-01-29 06:11:06 +0000 UTC [socat] W exiting on signal 15
2021-01-29 06:11:06 +0000 UTC [sudo] pam_unix(sudo:session): session closed for user root
2021-01-29 06:11:06 +0000 UTC [sudo] postgres : TTY=unknown ; PWD=/etc/service/repmgrd ; USER=root ; COMMAND=/usr/bin/tee /tmp/primary
2021-01-29 06:11:06 +0000 UTC [sudo] pam_unix(sudo:session): session opened for user root by (uid=0)
2021-01-29 06:11:06 +0000 UTC [sudo] pam_unix(sudo:session): session closed for user root
2021-01-29 06:11:06 +0000 UTC [sudo] postgres : TTY=unknown ; PWD=/etc/service/repmgrd ; USER=root ; COMMAND=/usr/sbin/sv start db-proxy
2021-01-29 06:11:06 +0000 UTC [sudo] pam_unix(sudo:session): session opened for user root by (uid=0)
socat: no process found
2021-01-29 06:11:07 +0000 UTC [sudo] pam_unix(sudo:session): session closed for user root
2021-01-29 06:11:07 +0000 UTC [sudo] postgres : TTY=unknown ; PWD=/etc/service/repmgrd ; USER=root ; COMMAND=/usr/bin/touch /tmp/initialized
2021-01-29 06:11:07 +0000 UTC [sudo] pam_unix(sudo:session): session opened for user root by (uid=0)
2021-01-29 06:11:07 +0000 UTC [sudo] pam_unix(sudo:session): session closed for user root
2021-01-29 06:11:07 +0000 UTC [repmgrd] executing notification command for event "repmgrd_failover_promote"
2021-01-29 06:11:07 +0000 UTC [repmgr-event] 101 repmgrd_failover_promote 1 2021-01-29 06:11:07.074526+00 node 101 promoted to primary; old primary 100 marked as failed
2021-01-29 06:11:07 +0000 UTC [repmgrd] 1 followers to notify
2021-01-29 06:11:07 +0000 UTC [repmgrd] notifying node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) to follow node 101
2021-01-29 06:11:07 +0000 UTC [repmgrd] switching to primary monitoring mode
2021-01-29 06:11:07 +0000 UTC [repmgrd] executing notification command for event "repmgrd_reload"
2021-01-29 06:11:07 +0000 UTC [repmgr-event] 101 repmgrd_reload 1 2021-01-29 06:11:07.086852+00 monitoring cluster primary "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101)
2021-01-29 06:11:07 +0000 UTC [repmgrd] monitoring cluster primary "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101)
2021-01-29 06:11:12 +0000 UTC [repmgrd] new standby "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) has connected
2021-01-29 06:11:12 +0000 UTC [repmgrd] executing notification command for event "child_node_new_connect"
2021-01-29 06:11:12 +0000 UTC [repmgr-event] 101 child_node_new_connect 1 2021-01-29 06:11:12.114782+00 new standby "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) has connected
2021-01-29 06:12:07 +0000 UTC [repmgrd] monitoring primary node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state
2021-01-29 06:13:07 +0000 UTC [repmgrd] monitoring primary node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state
2021-01-29 06:14:07 +0000 UTC [repmgrd] monitoring primary node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state
Around 2021-01-29 06:16:48 +0000, postgres-1 reports that postgres-2 has disconnected as a standby.
postgres-1
2021-01-29 06:16:08 +0000 UTC [repmgrd] monitoring primary node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state
2021-01-29 06:16:48 +0000 UTC [repmgrd] standby node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) has disconnected
2021-01-29 06:16:48 +0000 UTC [repmgrd] executing notification command for event "child_node_disconnect"
2021-01-29 06:16:48 +0000 UTC [repmgr-event] 101 child_node_disconnect 1 2021-01-29 06:16:48.7483+00 standby node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) has disconnected
2021-01-29 06:17:08 +0000 UTC [repmgrd] monitoring primary node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state
2021-01-29 06:17:16 +0000 UTC [postgres] [7-1] 2021-01-29 06:17:16.899 UTC [2820] LOG: unexpected EOF on client connection with an open transaction
2021-01-29 06:17:16 +0000 UTC [postgres] [7-1] 2021-01-29 06:17:16.914 UTC [2982] LOG: could not receive data from client: Connection reset by peer
2021-01-29 06:17:16 +0000 UTC [postgres] [7-1] 2021-01-29 06:17:16.914 UTC [2973] LOG: could not send data to client: Broken pipe
2021-01-29 06:17:16 +0000 UTC [postgres] [8-1] 2021-01-29 06:17:16.918 UTC [2973] LOG: could not receive data from client: Connection reset by peer
2021-01-29 06:17:16 +0000 UTC [postgres] [9-1] 2021-01-29 06:17:16.918 UTC [2973] FATAL: connection to client lost
2021-01-29 06:17:16 +0000 UTC [postgres] [7-1] 2021-01-29 06:17:16.920 UTC [2811] LOG: unexpected EOF on client connection with an open transaction
2021-01-29 06:18:09 +0000 UTC [repmgrd] monitoring primary node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state
2021-01-29 06:19:09 +0000 UTC [repmgrd] monitoring primary node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state
postgres-2
2021-01-29 06:11:08 +0000 UTC [postgres] [6-1] 2021-01-29 06:11:08.631 UTC [7056] LOG: fetching timeline history file for timeline 6 from primary server
2021-01-29 06:11:08 +0000 UTC [repmgr] STANDBY FOLLOW successful
2021-01-29 06:11:08 +0000 UTC [repmgr] executing notification command for event "standby_follow"
2021-01-29 06:11:08 +0000 UTC [repmgr-event] 102 standby_follow 1 2021-01-29 06:11:08.63358+00 standby attached to upstream node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) 101 host=postgres-1.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10 postgres-1.postgres.prelude.svc.cluster.local
2021-01-29 06:11:08 +0000 UTC [repmgr-event] Reconfiguring the db proxy with primary: postgres-1.postgres.prelude.svc.cluster.local
2021-01-29 06:11:08 +0000 UTC [postgres] [7-1] 2021-01-29 06:11:08.662 UTC [7056] LOG: started streaming WAL from primary at 418/C2000000 on timeline 5
2021-01-29 06:11:08 +0000 UTC [sudo] postgres : TTY=unknown ; PWD=/etc/service/repmgrd ; USER=root ; COMMAND=/usr/sbin/sv stop db-proxy
2021-01-29 06:11:08 +0000 UTC [sudo] pam_unix(sudo:session): session opened for user root by (uid=0)
2021-01-29 06:11:08 +0000 UTC [socat] W exiting on signal 15
2021-01-29 06:11:08 +0000 UTC [socat] W exiting on signal 15
*
*
2021-01-29 06:11:08 +0000 UTC [postgres] [8-1] 2021-01-29 06:11:08.821 UTC [7056] LOG: replication terminated by primary server
2021-01-29 06:11:08 +0000 UTC [postgres] [8-2] 2021-01-29 06:11:08.821 UTC [7056] DETAIL: End of WAL reached on timeline 5 at 418/C2E8E700.
2021-01-29 06:11:08 +0000 UTC [postgres] [10-1] 2021-01-29 06:11:08.822 UTC [7052] LOG: new target timeline is 6
2021-01-29 06:11:08 +0000 UTC [postgres] [9-1] 2021-01-29 06:11:08.828 UTC [7056] LOG: restarted WAL streaming at 418/C2000000 on timeline 6
2021-01-29 06:11:09 +0000 UTC [sudo] pam_unix(sudo:session): session closed for user root
2021-01-29 06:11:09 +0000 UTC [sudo] postgres : TTY=unknown ; PWD=/etc/service/repmgrd ; USER=root ; COMMAND=/usr/bin/tee /tmp/primary
2021-01-29 06:11:09 +0000 UTC [sudo] pam_unix(sudo:session): session opened for user root by (uid=0)
2021-01-29 06:11:09 +0000 UTC [sudo] pam_unix(sudo:session): session closed for user root
2021-01-29 06:11:09 +0000 UTC [sudo] postgres : TTY=unknown ; PWD=/etc/service/repmgrd ; USER=root ; COMMAND=/usr/sbin/sv start db-proxy
2021-01-29 06:11:09 +0000 UTC [sudo] pam_unix(sudo:session): session opened for user root by (uid=0)
socat: no process found
2021-01-29 06:11:09 +0000 UTC [sudo] pam_unix(sudo:session): session closed for user root
2021-01-29 06:11:09 +0000 UTC [sudo] postgres : TTY=unknown ; PWD=/etc/service/repmgrd ; USER=root ; COMMAND=/usr/bin/touch /tmp/initialized
2021-01-29 06:11:09 +0000 UTC [sudo] pam_unix(sudo:session): session opened for user root by (uid=0)
2021-01-29 06:11:09 +0000 UTC [sudo] pam_unix(sudo:session): session closed for user root
2021-01-29 06:11:09 +0000 UTC [repmgrd] node 102 now following new upstream node 101
2021-01-29 06:11:09 +0000 UTC [repmgrd] executing notification command for event "repmgrd_failover_follow"
2021-01-29 06:11:09 +0000 UTC [repmgr-event] 102 repmgrd_failover_follow 1 2021-01-29 06:11:09.56332+00 node 102 now following new upstream node 101
2021-01-29 06:11:09 +0000 UTC [repmgrd] resuming standby monitoring mode
2021-01-29 06:12:10 +0000 UTC [repmgrd] node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) monitoring upstream node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state
2021-01-29 06:13:11 +0000 UTC [repmgrd] node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) monitoring upstream node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state
2021-01-29 06:14:12 +0000 UTC [repmgrd] node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) monitoring upstream node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state
2021-01-29 06:15:13 +0000 UTC [repmgrd] node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) monitoring upstream node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state
2021-01-29 06:16:14 +0000 UTC [repmgrd] node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) monitoring upstream node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state
2021-01-29 06:16:46 +0000 UTC [repmgrd] TERM signal received
2021-01-29 06:16:46 +0000 UTC [repmgrd] executing notification command for event "repmgrd_shutdown"
2021-01-29 06:16:46 +0000 UTC [repmgr-event] 102 repmgrd_shutdown 1 2021-01-29 06:16:46.832319+00 TERM signal received
2021-01-29 06:16:46 +0000 UTC [repmgrd] repmgrd terminating...
2021-01-29 06:16:47 +0000 UTC [postgres] [6-1] 2021-01-29 06:16:47.251 UTC [7051] LOG: received fast shutdown request
2021-01-29 06:16:47 +0000 UTC [postgres] [7-1] 2021-01-29 06:16:47.263 UTC [7051] LOG: aborting any active transactions
2021-01-29 06:16:47 +0000 UTC [postgres] [10-1] 2021-01-29 06:16:47.263 UTC [7056] FATAL: terminating walreceiver process due to administrator command
2021-01-29 06:16:48 +0000 UTC [postgres] [5-1] 2021-01-29 06:16:48.594 UTC [7053] LOG: shutting down
2021-01-29 06:16:48 +0000 UTC [postgres] [8-1] 2021-01-29 06:16:48.719 UTC [7051] LOG: database system is shut down
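A side note on gathering these logs: the postgres-2 container restarted (restart count 1 in the earlier pod listing), so the shutdown messages above belong to the previous container instance. Assuming the logs go to the container's stdout, as the tags and timestamps suggest, one way to pull them is kubectl's --previous flag; the current container log then begins with the startup trace shown next.
kubectl -n prelude logs postgres-2 --previous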
After the shutdown on postgres-2, which happened around 2021-01-29 06:16:48 +0000, we see a startup procedure begin immediately:
+ set -e
+ source /scripts/utils.sh
++ export PGCTLTIMEOUT=120
++ PGCTLTIMEOUT=120
++ export PGDATA=/data/live
++ PGDATA=/data/live
++ export PGFLAGS=/data/flags
++ PGFLAGS=/data/flags
++ export BACKUP_DIR=/data/backup
++ BACKUP_DIR=/data/backup
++ export DB_POPULATE=/scratch/populate_db.sql
++ DB_POPULATE=/scratch/populate_db.sql
++ export REPMGR_USER=repmgr-db
++ REPMGR_USER=repmgr-db
++ export REPMGR_PASSFILE=/scratch/repmgr-db.cred
++ REPMGR_PASSFILE=/scratch/repmgr-db.cred
++ export REPMGR_DB=repmgr-db
++ REPMGR_DB=repmgr-db
++ export DEFAULT_FAILURE_DELAY=10s
++ DEFAULT_FAILURE_DELAY=10s
++ export 'ERR_TOO_MANY_MASTERS=Found more than one active master.'
++ ERR_TOO_MANY_MASTERS='Found more than one active master.'
++ export 'ERR_NO_ACTIVE_MASTER=Found no active masters.'
++ ERR_NO_ACTIVE_MASTER='Found no active masters.'
*
*
*
https://10.244.4.1:443/apis/apps/v1/namespaces/prelude/statefulsets/postgres
+ local replicas_count=3
+ '[' 3 -eq 0 ']'
++ hostname -s
++ cut -b 10-
+ local index=2
++ xargs -n 1 -I '{}' echo 'postgres-{}.postgres.prelude.svc.cluster.local'
+++ expr 3 - 1
++ seq 0 2
+ export 'ALL_NODES=postgres-0.postgres.prelude.svc.cluster.local
postgres-1.postgres.prelude.svc.cluster.local
postgres-2.postgres.prelude.svc.cluster.local'
+ ALL_NODES='postgres-0.postgres.prelude.svc.cluster.local
postgres-1.postgres.prelude.svc.cluster.local
postgres-2.postgres.prelude.svc.cluster.local'
++ hostname -f
+ export SELF_NAME=postgres-2.postgres.prelude.svc.cluster.local
+ SELF_NAME=postgres-2.postgres.prelude.svc.cluster.local
++ expr 100 + 2
+ export SELF_ID=102
+ SELF_ID=102
++ echo 'postgres-0.postgres.prelude.svc.cluster.local
postgres-1.postgres.prelude.svc.cluster.local
postgres-2.postgres.prelude.svc.cluster.local'
++ c1grep -v postgres-2.postgres.prelude.svc.cluster.local
++ grep -v postgres-2.postgres.prelude.svc.cluster.local
+ export 'OTHER_NODES=postgres-0.postgres.prelude.svc.cluster.local
postgres-1.postgres.prelude.svc.cluster.local'
+ OTHER_NODES='postgres-0.postgres.prelude.svc.cluster.local
postgres-1.postgres.prelude.svc.cluster.local'
+ return 0
+ setup_repmgr
+ cat
++ connstring_from_node_id 102
++ local node_id=102
+++ expr 102 - 100
++ local node_index=2
+++ hostname
+++ cut -d - -f 1
+++ hostname -d
++ local host=postgres-2.postgres.prelude.svc.cluster.local
++ echo 'host=postgres-2.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10'
++ expr 200 - 102
+ setup_repmgrd
+ cat
+ find_current_master
+ master_nodes=()
+ read -r line
++ echo 'postgres-0.postgres.prelude.svc.cluster.local
postgres-1.postgres.prelude.svc.cluster.local'
+ '[' -z postgres-0.postgres.prelude.svc.cluster.local ']'
++ ssh postgres@postgres-0.postgres.prelude.svc.cluster.local repmgr node status --csv
ssh-keygen: generating new host keys: RSA DSA ECDSA ED25519
++ true
+ local 'node_status="Node name","postgres-0.postgres.prelude.svc.cluster.local"
"Node ID","100"
"PostgreSQL version","10.10"
"Total data size","32 GB"
"Conninfo","host=postgres-0.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10"
"Role","primary"
"WAL archiving","enabled"
"Archive command","/bin/true"
"WALs pending archiving","0"
"Replication connections","0 (of maximal 10)"
"Replication slots","0 physical (of maximal 10; 0 missing)"
"Upstream node","(none)"
"Replication lag","n/a"
"Last received LSN","(none)"
"Last replayed LSN","(none)"
"max_wal_senders",10
"occupied_wal_senders",0
"max_replication_slots",10
"active_replication_slots",0
"inactive_replication_slots",0
"missing_replication_slots",0'
*
*
*
++ cut -d ' ' -f 1
+ node_id=100
+ master_nodes+=("${node_id}")
+ read -r line
+ '[' -z postgres-1.postgres.prelude.svc.cluster.local ']'
++ ssh postgres@postgres-1.postgres.prelude.svc.cluster.local repmgr node status --csv
++ true
+ local 'node_status="Node name","postgres-1.postgres.prelude.svc.cluster.local"
"Node ID","101"
"PostgreSQL version","10.10"
"Total data size","32 GB"
"Conninfo","host=postgres-1.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10"
"Role","primary"
"WAL archiving","enabled"
"Archive command","/bin/true"
"WALs pending archiving","0"
"Replication connections","0 (of maximal 10)"
"Replication slots","0 physical (of maximal 10; 0 missing)"
"Upstream node","(none)"
"Replication lag","n/a"
"Last received LSN","(none)"
"Last replayed LSN","(none)"
"max_wal_senders",10
"occupied_wal_senders",0
"max_replication_slots",10
"active_replication_slots",0
"inactive_replication_slots",0
"missing_replication_slots",0'
*
*
*
++ cut -d , -f 2
++ grep '"Node ID"'
++ cut -d ' ' -f 1
++ tr -d '"'
+ node_id=101
+ master_nodes+=("${node_id}")
+ read -r line
+ unique_master_nodes=($(echo "${master_nodes[@]}" | tr ' ' '\n' | sort | uniq | tr '\n' ' '))
++ echo 100 101
++ sort
++ uniq
++ tr '\n' ' '
++ tr ' ' '\n'
+ unique_master_count=2
+ '[' 2 -eq 0 ']'
+ '[' 2 -eq 1 ']'
+ echo 'Found multiple master nodes 100' 101.
+ CURRENT_MASTER='Found more than one active master.'
+ return
+ [[ ! -f /data/live/postgresql.conf ]]
+ [[ -f /data/live/recovery.conf ]]
+ [[ Found more than one active master. == \F\o\u\n\d\ \m\o\r\e\ \t\h\a\n\ \o\n\e\ \a\c\t\i\v\e\ \m\a\s\t\e\r\. ]]
+ echo Too many active masters, will sleep and retry
+ delayed_exec /scripts/postgres_main.sh
+ sleep 10s
Found multiple master nodes 100 101.
Too many active masters, will sleep and retry
Even though postgres-2 was attempting to start, it would not come up because it kept detecting multiple master nodes.
This retry loop went on until it finally found only one master, at 2021-01-29 07:51:53.623, which is when the service started.
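Before looking at that successful run, here is a hedged reconstruction of the master-discovery logic pieced together from the xtrace above. The loop over the peer nodes, the ssh probe with repmgr node status --csv and the unique-master counting are visible in the trace; the exact CSV filtering in between is elided (the "*" lines), so that part and the variable handling are assumptions, not the actual /scripts/postgres_main.sh code.
# Hedged reconstruction of find_current_master(); see the note above.
find_current_master() {
    local master_nodes=()
    while read -r line; do
        [ -z "${line}" ] && continue
        # Ask each peer for its repmgr view; "|| true" keeps "set -e" from aborting
        # when a peer is unreachable.
        local node_status
        node_status=$(ssh "postgres@${line}" repmgr node status --csv || true)
        # Assumption: only peers reporting themselves as primary contribute an ID.
        if echo "${node_status}" | grep -q '"Role","primary"'; then
            local node_id
            node_id=$(echo "${node_status}" | grep '"Node ID"' | cut -d , -f 2 | tr -d '"' | cut -d ' ' -f 1)
            master_nodes+=("${node_id}")
        fi
    done < <(echo "${OTHER_NODES}")

    local unique_master_nodes=($(echo "${master_nodes[@]}" | tr ' ' '\n' | sort | uniq | tr '\n' ' '))
    local unique_master_count=${#unique_master_nodes[@]}

    if [ "${unique_master_count}" -eq 0 ]; then
        CURRENT_MASTER="${ERR_NO_ACTIVE_MASTER}"   # "Found no active masters."
    elif [ "${unique_master_count}" -eq 1 ]; then
        echo "Found master node ${unique_master_nodes[0]}."
        CURRENT_MASTER="${unique_master_nodes[0]}"
    else
        echo "Found multiple master nodes ${unique_master_nodes[*]}."
        CURRENT_MASTER="${ERR_TOO_MANY_MASTERS}"   # "Found more than one active master."
    fi
    return 0
}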
+ set -e
+ source /scripts/utils.sh
++ export PGCTLTIMEOUT=120
++ PGCTLTIMEOUT=120
++ export PGDATA=/data/live
++ PGDATA=/data/live
++ export PGFLAGS=/data/flags
++ PGFLAGS=/data/flags
++ export BACKUP_DIR=/data/backup
++ BACKUP_DIR=/data/backup
++ export DB_POPULATE=/scratch/populate_db.sql
++ DB_POPULATE=/scratch/populate_db.sql
++ export REPMGR_USER=repmgr-db
++ REPMGR_USER=repmgr-db
++ export REPMGR_PASSFILE=/scratch/repmgr-db.cred
++ REPMGR_PASSFILE=/scratch/repmgr-db.cred
++ export REPMGR_DB=repmgr-db
*
*
*
+ unique_master_nodes=($(echo "${master_nodes[@]}" | tr ' ' '\n' | sort | uniq | tr '\n' ' '))
++ echo 101
++ sort
++ tr ' ' '\n'
++ uniq
++ tr '\n' ' '
+ unique_master_count=1
+ '[' 1 -eq 0 ']'
+ '[' 1 -eq 1 ']'
+ echo 'Found master node 101.'
+ CURRENT_MASTER=101
+ return
Found master node 101.
+ [[ ! -f /data/live/postgresql.conf ]]
+ [[ -f /data/live/recovery.conf ]]
+ [[ 101 == \F\o\u\n\d\ \m\o\r\e\ \t\h\a\n\ \o\n\e\ \a\c\t\i\v\e\ \m\a\s\t\e\r\. ]]
+ [[ 101 == \F\o\u\n\d\ \n\o\ \a\c\t\i\v\e\ \m\a\s\t\e\r\s\. ]]
+ [[ 101 == \1\0\2 ]]
+ echo 'Existing master 101 found, will attempt to join to it.'
+ init_as_standby 101
+ CURRENT_MASTER=101
Existing master 101 found, will attempt to join to it.
++ find /data/live -maxdepth 0 -empty -type d
++ wc -l
+ '[' 0 -eq 0 ']'
+++ hostname_from_node_id 101
+++ local node_id=101
++++ expr 101 - 100
+++ local node_index=1
++++ hostname
++++ cut -d - -f 1
++++ hostname -d
+++ echo postgres-1.postgres.prelude.svc.cluster.local
++ ssh postgres@postgres-1.postgres.prelude.svc.cluster.local /scripts/postgres_timeline.sh
+ their_timeline=6
+ '[' -z 6 ']'
++ /scripts/postgres_timeline.sh
+ my_timeline=6
+ '[' -n 6 ']'
+ '[' 6 -lt 6 ']'
+ echo Taking a backup of the database directory.
Taking a backup of the database directory.
+ rm -rf /data/backup/backup_label.old /data/backup/base /data/backup/global /data/backup/pg_commit_ts /data/backup/pg_dynshmem /data/backup/pg_hba.conf /data/backup/pg_ident.conf /data/backup/pg_logical /data/backup/pg_multixact /data/backup/pg_notify /data/backup/pg_replslot /data/backup/pg_serial /data/backup/pg_snapshots /data/backup/pg_stat /data/backup/pg_stat_tmp /data/backup/pg_subtrans /data/backup/pg_tblspc /data/backup/pg_twophase /data/backup/PG_VERSION /data/backup/pg_wal /data/backup/pg_xact /data/backup/postgresql.auto.conf /data/backup/postgresql.conf /data/backup/postmaster.opts /data/backup/recovery.conf /data/backup/recovery.done
+ mv /data/live/backup_label.old /data/live/base /data/live/global /data/live/pg_commit_ts /data/live/pg_dynshmem /data/live/pg_hba.conf /data/live/pg_ident.conf /data/live/pg_logical /data/live/pg_multixact /data/live/pg_notify /data/live/pg_replslot /data/live/pg_serial /data/live/pg_snapshots /data/live/pg_stat /data/live/pg_stat_tmp /data/live/pg_subtrans /data/live/pg_tblspc /data/live/pg_twophase /data/live/PG_VERSION /data/live/pg_wal /data/live/pg_xact /data/live/postgresql.auto.conf /data/live/postgresql.conf /data/live/postmaster.opts /data/live/recovery.conf /data/live/recovery.done /data/backup/
+ chown -R postgres:postgres /data/live
*
*
/scripts/repmgr_event_handler.sh "102" "standby_clone" "1" "2021-01-29 07:51:53.563216+00" "cloned from host \"postgres-1.postgres.prelude.svc.cluster.local\", port 5432; backup method: pg_basebackup; --force: Y" "" "" ""
2021-01-29 07:51:53 +0000 UTC [repmgr-event] 102 standby_clone 1 2021-01-29 07:51:53.563216+00 cloned from host "postgres-1.postgres.prelude.svc.cluster.local", port 5432; backup method: pg_basebackup; --force: Y
+ chpst -u postgres pg_ctl start
waiting for server to start....2021-01-29 07:51:53.623 UTC [18761] LOG: listening on IPv4 address "0.0.0.0", port 5432
2021-01-29 07:51:53.623 UTC [18761] LOG: listening on IPv6 address "::", port 5432
2021-01-29 07:51:53 +0000 UTC [postgres] [1-1] 2021-01-29 07:51:53.623 UTC [18761] LOG: listening on IPv4 address "0.0.0.0", port 5432
2021-01-29 07:51:53 +0000 UTC [postgres] [2-1] 2021-01-29 07:51:53.623 UTC [18761] LOG: listening on IPv6 address "::", port 5432
2021-01-29 07:51:53.647 UTC [18761] LOG: listening on Unix socket "/run/postgresql/.s.PGSQL.5432"
2021-01-29 07:51:53 +0000 UTC [postgres] [3-1] 2021-01-29 07:51:53.647 UTC [18761] LOG: listening on Unix socket "/run/postgresql/.s.PGSQL.5432"
2021-01-29 07:51:53.802 UTC [18761] LOG: ending log output to stderr
2021-01-29 07:51:53.802 UTC [18761] HINT: Future log output will go to log destination "syslog".
2021-01-29 07:51:53 +0000 UTC [postgres] [4-1] 2021-01-29 07:51:53.802 UTC [18761] LOG: ending log output to stderr
2021-01-29 07:51:53 +0000 UTC [postgres] [4-2] 2021-01-29 07:51:53.802 UTC [18761] HINT: Future log output will go to log destination "syslog".
2021-01-29 07:51:53 +0000 UTC [postgres] [5-1] 2021-01-29 07:51:53.818 UTC [18762] LOG: database system was interrupted; last known up at 2021-01-29 07:51:03 UTC
2021-01-29 07:51:53 +0000 UTC [postgres] [6-1] 2021-01-29 07:51:53.947 UTC [18762] LOG: entering standby mode
2021-01-29 07:51:53 +0000 UTC [postgres] [7-1] 2021-01-29 07:51:53.977 UTC [18762] LOG: redo starts at 418/F9000028
2021-01-29 07:51:54 +0000 UTC [postgres] [8-1] 2021-01-29 07:51:54.171 UTC [18762] LOG: consistent recovery state reached at 418/FBB43658
.2021-01-29 07:51:55 +0000 UTC [postgres] [5-1] 2021-01-29 07:51:55.299 UTC [18761] LOG: database system is ready to accept read only connections
2021-01-29 07:51:55 +0000 UTC [postgres] [6-1] 2021-01-29 07:51:55.324 UTC [18771] LOG: started streaming WAL from primary at 418/FC000000 on timeline 6
done
server started
+ chpst -u postgres repmgr standby register --force
INFO: connecting to local node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102)
INFO: connecting to primary database
INFO: executing notification command for event "standby_register"
DETAIL: command is:
/scripts/repmgr_event_handler.sh "102" "standby_register" "1" "2021-01-29 07:51:55.429089+00" "standby registration succeeded; upstream node ID is 101 (-F/--force option was used)" "101" "host=postgres-1.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10" "postgres-1.postgres.prelude.svc.cluster.local"
2021-01-29 07:51:55 +0000 UTC [repmgr-event] 102 standby_register 1 2021-01-29 07:51:55.429089+00 standby registration succeeded; upstream node ID is 101 (-F/--force option was used) 101 host=postgres-1.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10 postgres-1.postgres.prelude.svc.cluster.local
INFO: standby registration complete
NOTICE: standby node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) successfully registered
+ hostname_from_node_id 101
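At this point postgres-2 is cloned, streaming on timeline 6, and registered again. To confirm the cluster is back to one primary and two attached standbys, the overall view can be checked from any node (run as the postgres user, the same way the scripts invoke repmgr):
chpst -u postgres repmgr cluster show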
So between 2021-01-29 06:16:48 +0000 and 2021-01-29 07:51:53.623, roughly 1.5 hours, postgres-2 was down because of the multiple-master situation.
The snippets below explain why postgres-0 became a master when postgres-1 was already the master.
On postgres-0 the service restarted around 2021-01-29 06:13:xx. During startup it identified that there was only one master (node 101) and compared its own WAL timeline with that master's timeline. Both were on timeline 5, and because the existing master's timeline was not newer than its own, postgres-0 decided to start as a MASTER itself.
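A hedged reconstruction of that timeline decision, based purely on the xtrace that follows (the echoed messages and comparisons are visible in the trace; the exact conditional structure in the real script is an assumption):
# Hedged sketch of the timeline comparison; CURRENT_MASTER is the single master found (101 here).
my_timeline=$(/scripts/postgres_timeline.sh)
their_timeline=$(ssh "postgres@$(hostname_from_node_id "${CURRENT_MASTER}")" /scripts/postgres_timeline.sh)

if [ -n "${my_timeline}" ] && [ -n "${their_timeline}" ] && [ "${their_timeline}" -gt "${my_timeline}" ]; then
    echo 'The other master has a more recent timeline, I should join to it.'
    init_as_standby "${CURRENT_MASTER}"
else
    # Equal (or older) timeline: exactly the postgres-0 case below (5 vs 5),
    # which is how the second master appears.
    echo 'The other master has equal or older timeline, I should not join to it!'
    echo 'Start as a master and let the liveness check decide on.'
    init_as_master
fi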
+ set -e
+ source /scripts/utils.sh
++ export PGCTLTIMEOUT=120
++ PGCTLTIMEOUT=120
++ export PGDATA=/data/live
++ PGDATA=/data/live
*
*
*
+ unique_master_nodes=($(echo "${master_nodes[@]}" | tr ' ' '\n' | sort | uniq | tr '\n' ' '))
++ echo 101 101
++ tr ' ' '\n'
++ sort
++ uniq
++ tr '\n' ' '
+ unique_master_count=1
+ '[' 1 -eq 0 ']'
+ '[' 1 -eq 1 ']'
+ echo 'Found master node 101.'
Found master node 101.
+ CURRENT_MASTER=101
+ return
+ [[ ! -f /data/live/postgresql.conf ]]
+ [[ -f /data/live/recovery.conf ]]
+ [[ 101 == \F\o\u\n\d\ \m\o\r\e\ \t\h\a\n\ \o\n\e\ \a\c\t\i\v\e\ \m\a\s\t\e\r\. ]]
+ [[ 101 == \F\o\u\n\d\ \n\o\ \a\c\t\i\v\e\ \m\a\s\t\e\r\s\. ]]
+ [[ 101 == \1\0\0 ]]
+ echo 'Existing master 101 found, will compare timelines.'
Existing master 101 found, will compare timelines.
++ /scripts/postgres_timeline.sh
+ my_timeline=5
+++ hostname_from_node_id 101
+++ local node_id=101
++++ expr 101 - 100
+++ local node_index=1
++++ hostname
++++ cut -d - -f 1
++++ hostname -d
+++ echo postgres-1.postgres.prelude.svc.cluster.local
++ ssh postgres@postgres-1.postgres.prelude.svc.cluster.local /scripts/postgres_timeline.sh
+ their_timeline=5
+ '[' -n 5 ']'
+ '[' -z 5 ']'
+ '[' 5 -gt 5 ']'
+ echo 'The other master has equal or older timeline, I should not join to it!'
+ echo 'Start as a master and let the liveness check decide on.'
+ init_as_master
The other master has equal or older timeline, I should not join to it!
Start as a master and let the liveness check decide on.
++ find /data/live -maxdepth 0 -empty -type d
++ wc -l
+ '[' 0 -gt 0 ']'
+ setup_postgres_parameters
+ cat
+ setup_postgres_hba
+ local pg_hba=/data/live/pg_hba.conf
+ truncate -s 0 /data/live/pg_hba.conf
+ cat
+ chpst -u postgres pg_ctl start
waiting for server to start....2021-01-29 06:13:33.487 UTC [120] LOG: listening on IPv4 address "0.0.0.0", port 5432
2021-01-29 06:13:33.488 UTC [120] LOG: listening on IPv6 address "::", port 5432
2021-01-29 06:13:33 +0000 UTC [postgres] [1-1] 2021-01-29 06:13:33.487 UTC [120] LOG: listening on IPv4 address "0.0.0.0", port 5432
2021-01-29 06:13:33 +0000 UTC [postgres] [2-1] 2021-01-29 06:13:33.488 UTC [120] LOG: listening on IPv6 address "::", port 5432
2021-01-29 06:13:33.494 UTC [120] LOG: listening on Unix socket "/run/postgresql/.s.PGSQL.5432"
2021-01-29 06:13:33 +0000 UTC [postgres] [3-1] 2021-01-29 06:13:33.494 UTC [120] LOG: listening on Unix socket "/run/postgresql/.s.PGSQL.5432"
2021-01-29 06:13:33.707 UTC [120] LOG: ending log output to stderr
2021-01-29 06:13:33.707 UTC [120] HINT: Future log output will go to log destination "syslog".
2021-01-29 06:13:33 +0000 UTC [postgres] [4-1] 2021-01-29 06:13:33.707 UTC [120] LOG: ending log output to stderr
2021-01-29 06:13:33 +0000 UTC [postgres] [4-2] 2021-01-29 06:13:33.707 UTC [120] HINT: Future log output will go to log destination "syslog".
2021-01-29 06:13:33 +0000 UTC [postgres] [5-1] 2021-01-29 06:13:33.715 UTC [121] LOG: database system was shut down at 2021-01-29 06:13:02 UTC
2021-01-29 06:13:33 +0000 UTC [postgres] [5-1] 2021-01-29 06:13:33.733 UTC [120] LOG: database system is ready to accept connections
done
server started
+ populate_databases
+ chpst -u postgres psql -f /scratch/populate_db.sql
*
*
*
You are now connected to database "repmgr-db" as user "postgres".
ALTER EXTENSION
psql:/scratch/populate_db.sql:199: NOTICE: version "5.1" of extension "repmgr" is already installed
+ [[ -f /data/flags/reindex.flag ]]
+ chpst -u postgres repmgr primary register --force
INFO: connecting to primary database...
INFO: "repmgr" extension is already installed
INFO: executing notification command for event "primary_register"
DETAIL: command is:
/scripts/repmgr_event_handler.sh "100" "primary_register" "1" "2021-01-29 06:13:34.052955+00" "existing primary record updated" "" "" ""
2021-01-29 06:13:34 +0000 UTC [repmgr-event] 100 primary_register 1 2021-01-29 06:13:34.052955+00 existing primary record updated
NOTICE: primary node record (ID: 100) updated
+ hostname -f
+ touch /tmp/initialized
+ runsv /etc/service/db-proxy/
+ runsv /etc/service/repmgrd/
socat: no process found
[2021-01-29 06:13:34] [NOTICE] using provided configuration file "/etc/repmgr.conf"
[2021-01-29 06:13:34] [NOTICE] setup syslog (level: , facility: LOCAL0)
2021-01-29 06:13:34 +0000 UTC [repmgrd] repmgrd (repmgrd 5.1.0) starting up
2021-01-29 06:13:34 +0000 UTC [repmgrd] connecting to database "host=postgres-0.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10"
2021-01-29 06:13:34 +0000 UTC [repmgrd] starting monitoring of node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100)
2021-01-29 06:13:34 +0000 UTC [repmgrd] "connection_check_type" set to "ping"
2021-01-29 06:13:34 +0000 UTC [repmgrd] executing notification command for event "repmgrd_start"
2021-01-29 06:13:34 +0000 UTC [repmgr-event] 100 repmgrd_start 1 2021-01-29 06:13:34.104243+00 monitoring cluster primary "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100)
2021-01-29 06:13:34 +0000 UTC [repmgrd] monitoring cluster primary "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100)
2021-01-29 06:13:34 +0000 UTC [repmgrd] child node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) is not yet attached
2021-01-29 06:13:34 +0000 UTC [repmgrd] child node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) is not yet attached
2021-01-29 06:14:34 +0000 UTC [repmgrd] monitoring primary node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) in normal state
2021-01-29 06:15:34 +0000 UTC [repmgrd] monitoring primary node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) in normal state
2021-01-29 06:16:34 +0000 UTC [repmgrd] monitoring primary node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) in normal state
So from 2021-01-29 06:13:34 until 2021-01-29 07:45:56 +0000 there was a multiple-master situation. Around 2021-01-29 07:45:56 +0000 is when postgres-0 received the termination signal:
2021-01-29 07:40:59 +0000 UTC [repmgrd] monitoring primary node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) in normal state
2021-01-29 07:42:00 +0000 UTC [repmgrd] monitoring primary node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) in normal state
2021-01-29 07:43:00 +0000 UTC [repmgrd] monitoring primary node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) in normal state
2021-01-29 07:44:00 +0000 UTC [repmgrd] monitoring primary node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) in normal state
2021-01-29 07:45:00 +0000 UTC [repmgrd] monitoring primary node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) in normal state
2021-01-29 07:45:56 +0000 UTC [repmgrd] TERM signal received
2021-01-29 07:45:56 +0000 UTC [repmgrd] executing notification command for event "repmgrd_shutdown"
2021-01-29 07:45:56 +0000 UTC [repmgr-event] 100 repmgrd_shutdown 1 2021-01-29 07:45:56.497585+00 TERM signal received
2021-01-29 07:45:56 +0000 UTC [repmgrd] repmgrd terminating...
2021-01-29 07:45:56 +0000 UTC [postgres] [6-1] 2021-01-29 07:45:56.919 UTC [120] LOG: received fast shutdown request
2021-01-29 07:45:56 +0000 UTC [postgres] [7-1] 2021-01-29 07:45:56.925 UTC [120] LOG: aborting any active transactions
2021-01-29 07:45:56 +0000 UTC [postgres] [6-1] 2021-01-29 07:45:56.925 UTC [10502] FATAL: terminating connection due to administrator command
2021-01-29 07:45:56 +0000 UTC [postgres] [6-1] 2021-01-29 07:45:56.925 UTC [10504] FATAL: terminating connection due to administrator command
2021-01-29 07:45:56 +0000 UTC [postgres] [6-1] 2021-01-29 07:45:56.925 UTC [10496] FATAL: terminating connection due to administrator command
After this restart, postgres-0 detects that postgres-1 is on a newer timeline (6 versus its own 5) and joins it as a standby instead of starting as a master:
+ unique_master_count=1
+ '[' 1 -eq 0 ']'
+ '[' 1 -eq 1 ']'
Found master node 101.
+ echo 'Found master node 101.'
+ CURRENT_MASTER=101
+ return
+ [[ ! -f /data/live/postgresql.conf ]]
+ [[ -f /data/live/recovery.conf ]]
+ [[ 101 == \F\o\u\n\d\ \m\o\r\e\ \t\h\a\n\ \o\n\e\ \a\c\t\i\v\e\ \m\a\s\t\e\r\. ]]
+ [[ 101 == \F\o\u\n\d\ \n\o\ \a\c\t\i\v\e\ \m\a\s\t\e\r\s\. ]]
+ [[ 101 == \1\0\0 ]]
Existing master 101 found, will compare timelines.
+ echo 'Existing master 101 found, will compare timelines.'
++ /scripts/postgres_timeline.sh
+ my_timeline=5
+++ hostname_from_node_id 101
+++ local node_id=101
++++ expr 101 - 100
+++ local node_index=1
++++ hostname
++++ cut -d - -f 1
++++ hostname -d
+++ echo postgres-1.postgres.prelude.svc.cluster.local
++ ssh postgres@postgres-1.postgres.prelude.svc.cluster.local /scripts/postgres_timeline.sh
+ their_timeline=6
+ '[' -n 5 ']'
+ '[' -z 6 ']'
+ '[' 6 -gt 5 ']'
+ echo 'The other master has a more recent timeline, I should join to it.'
The other master has a more recent timeline, I should join to it.
+ init_as_standby 101
+ CURRENT_MASTER=101
++ find /data/live -maxdepth 0 -empty -type d
++ wc -l
+ '[' 0 -eq 0 ']'
+++ hostname_from_node_id 101
+++ local node_id=101
++++ expr 101 - 100
+++ local node_index=1
++++ hostname
++++ cut -d - -f 1
++++ hostname -d
+++ echo postgres-1.postgres.prelude.svc.cluster.local
++ ssh postgres@postgres-1.postgres.prelude.svc.cluster.local /scripts/postgres_timeline.sh
+ their_timeline=6
+ '[' -z 6 ']'
++ /scripts/postgres_timeline.sh
+ my_timeline=5
+ '[' -n 5 ']'
Taking a backup of the database directory.
+ '[' 6 -lt 5 ']'
+ echo Taking a backup of the database directory.
*
*
*
2021-01-29 07:53:39 +0000 UTC [repmgrd] repmgrd (repmgrd 5.1.0) starting up
2021-01-29 07:53:39 +0000 UTC [repmgrd] connecting to database "host=postgres-0.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10"
2021-01-29 07:53:39 +0000 UTC [repmgrd] starting monitoring of node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100)
2021-01-29 07:53:39 +0000 UTC [repmgrd] "connection_check_type" set to "ping"
2021-01-29 07:53:39 +0000 UTC [repmgrd] executing notification command for event "repmgrd_start"
2021-01-29 07:53:39 +0000 UTC [repmgr-event] 100 repmgrd_start 1 2021-01-29 07:53:39.302793+00 monitoring connection to upstream node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101)
2021-01-29 07:53:39 +0000 UTC [repmgrd] monitoring connection to upstream node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101)
2021-01-29 07:54:39 +0000 UTC [repmgrd] node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) monitoring upstream node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state
2021-01-29 07:55:40 +0000 UTC [repmgrd] node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) monitoring upstream node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state
2021-01-29 07:56:41 +0000 UTC [repmgrd] node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) monitoring upstream node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state
2021-01-29 07:57:41 +0000 UTC [repmgrd] node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) monitoring upstream node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state
2021-01-29 07:58:42 +0000 UTC [repmgrd] node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) monitoring upstream node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state
2021-01-29 07:59:42 +0000 UTC [repmgrd] node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) monitoring upstream node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state
The split-brain was effectively resolved once postgres-0 went down at 2021-01-29 07:45:56 +0000: postgres-2's next retry then found a single master and came back at 2021-01-29 07:51:55 +0000, and postgres-0 itself rejoined as a standby about two minutes later, at 2021-01-29 07:53:39.302793+00.
It looks like there is a cleanup/health check that runs in the background and, in this case after about 1.5 hours, detected the multiple-master situation and resolved it. I will blog more details about it once the research progresses.
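For what it's worth, the trace message "Start as a master and let the liveness check decide on." hints that this resolver is a periodic liveness/health check that spots the duplicate primary and terminates the stale one. Purely as an illustration of what such a check could look like, and not the actual vRA script, here is a sketch reusing the helpers seen in the traces above:
# Illustration only (assumed, not the real check): exit non-zero when this node is a
# primary but some other primary is on a newer timeline, so the pod gets restarted.
my_timeline=$(/scripts/postgres_timeline.sh)
for node in ${OTHER_NODES}; do
    status=$(ssh "postgres@${node}" repmgr node status --csv || true)
    echo "${status}" | grep -q '"Role","primary"' || continue
    their_timeline=$(ssh "postgres@${node}" /scripts/postgres_timeline.sh || true)
    if [ -n "${their_timeline}" ] && [ "${their_timeline}" -gt "${my_timeline}" ]; then
        echo "Another primary (${node}) is on a newer timeline; failing the check."
        exit 1
    fi
done
exit 0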