Before we initiate an upgrade of vRealize Automation 8.x to 8.2, we have to upgrade vRealize Suite Lifecycle Manager (vRLCM) to 8.2
Now that we have vRLCM ready on 8.2, let's discuss the steps taken to upgrade vRA to version 8.2
User validations
Validate Postgres Replication
I've ensured there are no Postgres replication issues by executing the command below
seq 0 2 | xargs -r -n 1 -I {} kubectl -n prelude exec postgres-{} -- chpst -u postgres repmgr node status
DEBUG: connecting to: "user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10 dbname=repmgr-db host=postgres-0.postgres.prelude.svc.cluster.local fallback_application_name=repmgr"
Node "postgres-0.postgres.prelude.svc.cluster.local":
PostgreSQL version: 10.10
Total data size: 936 MB
Conninfo: host=postgres-0.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10
Role: primary
WAL archiving: enabled
Archive command: /bin/true
WALs pending archiving: 0 pending files
Replication connections: 2 (of maximal 10)
Replication slots: 0 physical (of maximal 10; 0 missing)
Replication lag: n/a
DEBUG: connecting to: "user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10 dbname=repmgr-db host=postgres-1.postgres.prelude.svc.cluster.local fallback_application_name=repmgr"
Node "postgres-1.postgres.prelude.svc.cluster.local":
PostgreSQL version: 10.10
Total data size: 933 MB
Conninfo: host=postgres-1.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10
Role: standby
WAL archiving: disabled (on standbys "archive_mode" must be set to "always" to be effective)
Archive command: /bin/true
WALs pending archiving: 0 pending files
Replication connections: 0 (of maximal 10)
Replication slots: 0 physical (of maximal 10; 0 missing)
Upstream node: postgres-0.postgres.prelude.svc.cluster.local (ID: 100)
Replication lag: 0 seconds
Last received LSN: 2/DA9C5A00
Last replayed LSN: 2/DA9C5A00
DEBUG: connecting to: "user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10 dbname=repmgr-db host=postgres-2.postgres.prelude.svc.cluster.local fallback_application_name=repmgr"
Node "postgres-2.postgres.prelude.svc.cluster.local":
PostgreSQL version: 10.10
Total data size: 933 MB
Conninfo: host=postgres-2.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10
Role: standby
WAL archiving: disabled (on standbys "archive_mode" must be set to "always" to be effective)
Archive command: /bin/true
WALs pending archiving: 0 pending files
Replication connections: 0 (of maximal 10)
Replication slots: 0 physical (of maximal 10; 0 missing)
Upstream node: postgres-0.postgres.prelude.svc.cluster.local (ID: 100)
Replication lag: 0 seconds
Last received LSN: 2/DA9C5DA8
Last replayed LSN: 2/DA9C5DA8
My vRA 8.x environment is a distributed instance, hence it consists of 3 vRA nodes.
Each Postgres instance runs on one of these nodes, and data is constantly replicated from the primary to the standbys in the background
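To see the whole replication topology at a glance and confirm which node currently holds the primary role, repmgr's cluster view can be run with the same chpst wrapper as above (a minimal sketch; postgres-0 is simply the pod I ran it from):
# Show all repmgr nodes, their role (primary/standby) and upstream in one table
kubectl -n prelude exec postgres-0 -- chpst -u postgres repmgr cluster show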
No LB Changes Needed
I have not made any changes to the load balancer that is managing my distributed vRA instances.
Validate Pod Health
Ensure all pods are in the Running and Ready state; a quick check is shown below
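A quick way to check this from any vRA node (the prelude namespace is the same one used for the Postgres check above):
# List all pods in the prelude namespace with their Ready counts and status
kubectl -n prelude get pods
# Anything not in the Running phase should be investigated before upgrading
kubectl -n prelude get pods --field-selector=status.phase!=Running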
Trigger Inventory Sync
Trigger an inventory sync for the vRA environment in vRLCM before starting the upgrade
Submitting Upgrade Request and Prechecks
Step:1
Create a snapshot using vRLCM
Browse to the vRA environment and then select UPGRADE
Step:2
This will bring up the upgrade UI where you have to select the Repository Type
In my case, I had downloaded 8.2 beforehand and had it ready under my Product Binaries
Step:3
This pane gives you an option to trigger an inventory sync if it was not performed earlier; if it has already been done before triggering the upgrade, you may skip it.
Once the inventory sync is complete, you may proceed to the next step
Step:4
In this step, you have to run a precheck before performing the upgrade
Once you click on RUN PRECHECK, you are presented with a pane where you have to confirm that all manual validations have been performed; this refers to the vIDM hardware resource requirements
The prechecks start
There is a failure: VMware introduced a check to ensure /services-logs has enough space on all the vRealize Automation appliances
This is a mandatory step that should not be missed.
If we click on VIEW under the Recommendations pane, we are presented with a pane that has all the steps to resolve the above problem.
The exception states that /dev/sdc, which is Hard Disk 3 on the virtual appliance, does not have enough space
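A quick way to confirm this on each appliance, assuming the standard vRA 8.x layout where Hard Disk 3 (/dev/sdc) backs the logs_vg/services-logs volume mounted at /services-logs:
# Size of the disk as seen by the guest OS
lsblk /dev/sdc
# Current size and free space of the services-logs filesystem
df -h /services-logs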
Remember, I had taken snapshots of my vRealize Automation appliances earlier, so before extending the disk I had to remove those snapshots (a virtual disk cannot be grown while snapshots exist).
Then I extended Hard Disk 3 from 8 GB to 30 GB, adding an additional 22 GB of space
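I resized the disk through the vSphere Client, but the same change can be scripted, for example with govc (a hedged example; vra-node-01 is a placeholder for your appliance's VM name):
# Grow Hard Disk 3 of the vRA appliance to 30 GB (all snapshots must be removed first)
govc vm.disk.change -vm vra-node-01 -disk.label "Hard disk 3" -size 30G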
In the screenshot below, you can see my /dev/sdc was only 8 GB.
Even after performing a resize, the new size was not reflected; the resize was throwing an error:
[2020-10-08T04:41:12.050Z] Disk size for disk /dev/sdb has not changed.
[2020-10-08T04:41:12.079Z] Rescanning disk /dev/sdc...
[2020-10-08T04:41:12.222Z] Disk size for disk /dev/sdc has increased from 8589934592 to 32212254720.
[2020-10-08T04:41:12.423Z] Resizing physical volume...
Physical volume "/dev/sdc" changed
1 physical volume(s) resized / 0 physical volume(s) not resized
[2020-10-08T04:41:12.559Z] Physical volume resized.
[2020-10-08T04:41:12.722Z] Extending logical volume services-logs...
Size of logical volume logs_vg/services-logs changed from <8.00 GiB (2047 extents) to <30.00 GiB (7679 extents).
Logical volume logs_vg/services-logs successfully resized.
[2020-10-08T04:41:12.903Z] Logical volume resized.
[2020-10-08T04:41:12.916Z] Resizing file system...
resize2fs 1.43.4 (31-Jan-2017)
open: No such file or directory while opening /dev/mapper/logs_vg-services-logs
[2020-10-08T04:41:13.029Z] ERROR: Error resizing file system.
[2020-10-08T04:41:13.053Z] Rescanning disk /dev/sdd...
[2020-10-08T04:41:13.178Z] Disk size for disk /dev/sdd has not changed.
This was the same instruction present under the VIEW pane. If you hit this exception, then you have to follow Step #3 from KB article 79925.
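I won't reproduce the KB here, but for context: the error above means the device-mapper node for logs_vg/services-logs has gone missing, and the generic LVM recovery for that class of problem looks roughly like this (follow the KB for the supported, version-specific procedure):
# Recreate any missing /dev/mapper device nodes for the volume groups
vgscan --mknodes
# Make sure the logs volume group is active
vgchange -ay logs_vg
# Re-run the filesystem resize against the now-present device node
resize2fs /dev/mapper/logs_vg-services-logs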
After this step, the new size is reflected and we can move forward, knowing that the prechecks will now be successful
As stated earlier, after resolving the /services-logs partition sizing issue, all precheck validations are successful
Now when we click on NEXT, we head into the final phase of submitting the vRA upgrade request
Once you click on SUBMIT, the upgrade is initiated
Upgrade
There is nothing a user has to do once the upgrade request is submitted. In my environment it took 2 hours and 35 minutes to complete the two stages of the upgrade
Stage 1 is called vRealize Automation Upgrade/Patch/Internal Network Range Change
Stage 2 is called productupgradeinventoryupdate
Stage 1 in detail
1. Starts the upgrade
2. Checks the vRealize Automation version
3. Copies the vIDM admin token to vRA
4. Initiates the vRA upgrade
5. Uploads the vRA upgrade pre-hook script
6. Runs the vRA upgrade pre-hook script
7. vRA upgrade status check
8. Prepares vRA for the upgrade; this loops for a while until all the nodes are prepared
9. Proceeds to take a snapshot
10. Extracts the vRA nodes
11. Extracts the vMOID from the VMs for vRA
12. Takes a snapshot of vRA using the vMOID
13. Powers on vRA using the vMOID
14. Performs hostname and IP checks until the appliance is back
15. Upgrade of vRealize Automation is triggered
16. This loops together with the upgrade status check
17. Waits for initialization after the vRA upgrade
18. Finalization
That's it for Stage 1. It takes a lot of time: of the 2 hours and 35 minutes for this 3-node architecture, the bulk is spent at steps 15 and 16, which is quite understandable
The second stage, productupgradeinventoryupdate, takes hardly any time and completes in milliseconds
Logs to check during an upgrade
These are a few logs which can be monitored during the upgrade, for example by tailing them as shown after the list below
The order in which the logs are listed is not the order in which the upgrade proceeds
/var/log/vmware/prelude/upgrade-YYYY-MM-DD-HH-NN-SS.log
/var/log/vmware/prelude/upgrade-report-latest
/var/log/vmware/prelude/upgrade-report-latest.json
/var/log/deploy.log
/opt/vmware/var/log/vami/vami.log
/opt/vmware/var/log/vami/updatecli.log
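During the upgrade these can simply be tailed on each appliance over SSH, for example:
# Follow the timestamped upgrade log and the deploy log as the upgrade progresses
tail -f /var/log/vmware/prelude/upgrade-*.log /var/log/deploy.log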
I will do a deep dive into these logs in my next blog post