At work, I’ve had the pleasure and the ability to build new systems and migrate users by workload instead of needing to upgrade systems that I know little about. I’ve done this when I needed to go from vSphere 5.5 to 6.5, Horizon View 6.2 to Horizon 7.3, and, most recently, from vRealize Automation 7.0.1 to 7.5. vSphere was done by Cluster. Horizon was done by Desktop Pool. vRealize Automation was done by Tenant. It’s a much longer process because of the transition, but it’s important to me to not interrupt the business. I’d much rather have the fallback of “let’s go back to the other system for now” instead of needing to roll everything back.
I’ve finally reached the point where I’ve shut down vRealize Automation 7.0.1 and all of the supporting systems. All the old templates have been deleted. There are no more unmanaged machines. The ESXi hosts have been moved into a new vCenter. Today was a good opportunity to re-provision them to bring them back into the rotation of running workloads.
During the upgrade process I ran into a snag. I had kicked off an upgrade via vSphere Update Manager (VUM) and gone about my way doing other things. When I looked at the system more than 30 minutes later, I still found it showing as (disconnected). Dropping into iDRAC, I was greeted with an error message:
I found this to be a rather interesting error message. I had added the host to a vCenter running vSphere 6.7 Update 3, and ESXi 5.5 hosts (regardless of update level) can’t connect to a 6.7 vCenter Server. I flipped over to vCenter and found that I, in fact, wasn’t crazy. The system reported that it was running ESXi 6.0.0 3620759. It’s an old build, but it should still be capable of upgrading to the latest and greatest.
I pressed Enter so that I could continue (it did tell me to do so, after all). The system rebooted to the same error again. Because I felt like it was a fluke the first two times, I rebooted a third time to the same error. This time, I decided to take a look at the vmkernel.log and see what I could find there.
When tailing the vmkernel.log file, I found most of the error that I was being presented with in the DCUI. A few lines above the error, I discovered that two bootconfig files were being evaluated. Another handful of lines above that, I was presented with the bootDiskUUID.
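For anyone retracing this step, here is a minimal sketch of pulling the boot-related lines out of the log instead of scrolling through a tail. The search terms are assumptions based on what is described above, not the exact strings from my log, and boot_lines is just an illustrative helper name.

```shell
#!/bin/sh
# Sketch: list boot-related lines from a vmkernel log, with line numbers.
# The pattern list is an assumption; adjust it to the messages you actually see.
boot_lines() {
    # Default to the usual ESXi log location unless a path is given.
    log="${1:-/var/log/vmkernel.log}"
    # -i: ignore case, -n: show line numbers, -E: extended regex
    grep -i -n -E 'bootbank|boot\.cfg|bootconfig|uuid' "$log"
}
```

Calling boot_lines with no argument reads /var/log/vmkernel.log; the line numbers make it easier to jump to the surrounding context.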
Sure enough, one of the files showed ESXi 5.5 with the other showing ESXi 6.0. I found this to be interesting since ls /bootbank returned data, but ls /altbootbank returned “not found” errors.
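A quick way to compare the two banks is to print the identifying fields from each bank's boot.cfg. This is a sketch under the assumption that the "bootconfig files" in the log correspond to /bootbank/boot.cfg and /altbootbank/boot.cfg; bank_summary is a made-up helper name, and it simply reports when a bank has no boot.cfg, which is the case described above.

```shell
#!/bin/sh
# Sketch: summarize the boot.cfg of one boot bank directory.
bank_summary() {
    cfg="$1/boot.cfg"
    if [ ! -f "$cfg" ]; then
        # Covers the "not found" case for a missing or unmounted bank.
        echo "$1: boot.cfg not found"
        return 1
    fi
    echo "$1:"
    # Print only the fields that usually tell the banks apart.
    grep -E '^(build|updated|bootstate)=' "$cfg"
}
```

Running bank_summary /bootbank followed by bank_summary /altbootbank should show at a glance which bank holds which build.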
(Aside: I wonder whether this is because the host booted after receiving the payload from VUM to perform an upgrade…?)
I was happy to discover the problem, but still struggled to find a solution. I played around a bit in localcli (esxcli was unavailable as the system wasn’t fully booted at the time) in an effort to change the primary boot configuration. I was able to query the boot configuration (localcli system boot device get) but did not have a way to change it. I searched online for a while, but didn’t find any information about setting a specific boot volume.
Thankfully, the system successfully mounted shared storage. I made a copy of the boot volume to the shared storage. I verified that the copy completed, carefully deleted the 5.5 boot volume, and rebooted the host. Upon reboot, the system successfully booted the ESXi 6.0 Update 2 image. This then allowed the system to connect to vCenter where I again issued the Remediation to upgrade the host. The host upgraded successfully. Success!
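The copy-and-verify step is the part worth being careful about, so here is a minimal sketch of one way to do it. It is not a record of the exact commands used here: the helper names are hypothetical, and it assumes cp -a, find, sort and md5sum are available in the shell you are working in.

```shell
#!/bin/sh
# Sketch: copy a directory to a backup location and verify it by checksum.

# Compare checksums of every regular file under two directory trees.
verify_copy() {
    a=$(cd "$1" && find . -type f -exec md5sum {} \; | sort) || return 1
    b=$(cd "$2" && find . -type f -exec md5sum {} \; | sort) || return 1
    [ "$a" = "$b" ]
}

# Copy the contents of src into dst, then verify the result.
backup_and_verify() {
    src="$1"
    dst="$2"
    mkdir -p "$dst" || return 1
    cp -a "$src"/. "$dst"/ || return 1
    verify_copy "$src" "$dst"
}
```

A call would look like backup_and_verify <boot-volume-path> /vmfs/volumes/<datastore>/boot-backup, with both placeholders filled in for your environment. Only consider deleting anything once the function returns success.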
Important note: I imagine that there is a better way to accomplish this task. Should this have failed, I was prepared to configure a new host after a fresh installation of ESXi 6.7 Update 3. If you are not prepared to install from scratch, it may be best to open a ticket with VMware Support instead of deleting boot-related items.
Do you know how to set the boot volume via CLI or how to fix this in a slightly cleaner way? Share it with me, please!