Yesterday I wrote a little about the process I went through when upgrading my production environment. Today, I want to talk about the issues I faced when doing so and what I learned during the project.
During the first part of the project, I moved virtual machines from a Distributed Switch to a Standard Switch. This was possible because every host but one had a redundant NIC. The idea was simple – assign the redundant NIC to a Standard Switch, copy port groups from the Distributed Switch, and update virtual machine network adapters using a script.
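The copy-and-repoint steps can be sketched in PowerCLI. This is a sketch under assumptions, not my original script: the names 'esx01', 'dvSwitch01', and 'vSwitch1' are hypothetical, an existing Connect-VIServer session is assumed, and the VLAN ID is read from the port group's VlanConfiguration.

```powershell
# Sketch only – assumes an existing Connect-VIServer session.
# 'esx01', 'dvSwitch01', and 'vSwitch1' are hypothetical names.
$vmhost = Get-VMHost -Name 'esx01'
$vss    = Get-VirtualSwitch -VMHost $vmhost -Name 'vSwitch1' -Standard

# Copy each distributed port group (skipping uplink port groups) to the
# Standard Switch, keeping the same name and VLAN tag
Get-VDSwitch -Name 'dvSwitch01' | Get-VDPortgroup |
    Where-Object { -not $_.IsUplink } |
    ForEach-Object {
        New-VirtualPortGroup -VirtualSwitch $vss -Name $_.Name `
            -VLanId $_.VlanConfiguration.VlanId
    }

# Repoint each VM network adapter at the same-named Standard port group
foreach ($vm in Get-VM -Location $vmhost) {
    foreach ($nic in Get-NetworkAdapter -VM $vm) {
        $pg = Get-VirtualPortGroup -VirtualSwitch $vss -Name $nic.NetworkName
        Set-NetworkAdapter -NetworkAdapter $nic -Portgroup $pg -Confirm:$false
    }
}
```

Because the port group names match on both switches, the repoint loop is what actually moves the adapters off the Distributed Switch.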
First Gotcha –
I learned that the redundant NICs on two of my hosts were configured at the physical switch as access ports. As a result, some virtual machines lost network connectivity until the issue was identified and rectified. The overall issue was one of configuration consistency and not one I had anticipated.
Second Gotcha –
The single host without a redundant NIC had been migrated to a Distributed Switch. I searched but was unable to come up with a scenario where migrating back to a Standard Switch didn’t result in a loss of connectivity. One might argue that I could simply vMotion machines to other hosts. Due to operational constraints, that was not an option for this host.
In my head, the move is a two-step process: migrate the physical NIC, then the vmkernel port carrying management traffic. Moving either one first drops connectivity, right? Anyone with a better way to do this… please, PLEASE let me know. I ended up using esxcli to remove the vmkernel port from the Distributed Switch and re-add it to the Standard Switch after migrating the physical NIC via the UI.
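For reference, the esxcli sequence looked roughly like this. It's a sketch run from the host's console/SSH session (which is why it survives the management network going down); vmk0, vSwitch0, the port group name, and the IP address are example values, not the real ones.

```shell
# Run on the ESXi host itself (DCUI/SSH) – names and addresses are examples.
# The physical NIC was already moved to vSwitch0 via the UI at this point.

# Create a Management port group on the Standard Switch
esxcli network vswitch standard portgroup add \
    --portgroup-name="Management Network" --vswitch-name=vSwitch0

# Remove the management vmkernel port from the Distributed Switch...
esxcli network ip interface remove --interface-name=vmk0

# ...and re-create it on the Standard Switch port group
esxcli network ip interface add --interface-name=vmk0 \
    --portgroup-name="Management Network"

# Re-apply the static management IP
esxcli network ip interface ipv4 set --interface-name=vmk0 \
    --ipv4=192.0.2.10 --netmask=255.255.255.0 --type=static
```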
In my opinion, this was another configuration inconsistency, this time at the physical layer. The host has the NICs; they just weren't up. Entirely my fault for not stalling on this host until I could plug in another NIC. That said – planned downtime (albeit rushed) of the VMs was accepted.
Third Gotcha –
After all my migration tasks were complete and all VMs were back on a Distributed Switch in my new environment, I received a message indicating that a team member couldn’t log in to vCenter. A quick review of some logs indicated that time had drifted by approximately ten minutes. I manually set the date and time in vCenter, and team members were able to log in successfully again.
The issue ended up being an NTP server (itself a VM) syncing its time from one of the hosts via VMware Tools – a configuration I didn’t know existed. The root cause was, as you may have already guessed, an ESXi host improperly configured for NTP. Another configuration inconsistency.
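A quick PowerCLI audit would have surfaced both halves of this problem – hosts with bad NTP configuration and VMs syncing guest time from their host. Again, a sketch assuming an existing Connect-VIServer session:

```powershell
# Sketch – assumes an existing Connect-VIServer session.
# Which NTP servers does each host use, and is ntpd actually running?
Get-VMHost | Select-Object Name,
    @{N='NtpServers';  E={ ($_ | Get-VMHostNtpServer) -join ',' }},
    @{N='NtpdRunning'; E={ ($_ | Get-VMHostService |
        Where-Object Key -eq 'ntpd').Running }}

# Which VMs sync guest time from their host via VMware Tools?
Get-VM | Where-Object { $_.ExtensionData.Config.Tools.SyncTimeWithHost } |
    Select-Object Name
```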
Fourth Gotcha –
A single VLAN tag did not update on either the Standard Switch or the Distributed Switch, which resulted in a loss of VM network connectivity. The resolution was to add the VLAN tag to the port group. I was unable to identify exactly why the script I wrote failed to properly create that port group.
I attempted to re-create the issue but was unsuccessful. Thankfully, it was a quick fix. Lesson learned – work on better error handling and/or output for scripts to make things easier to track down.
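On that error-handling lesson: wrapping each port-group creation in try/catch and then verifying the result would have flagged the miss immediately. A sketch – `$vss` and `$portGroups` stand in for the original script's own variables:

```powershell
# Sketch – $vss is the target Standard Switch, $portGroups the desired
# name/VLAN pairs; both stand in for the original script's variables.
foreach ($pg in $portGroups) {
    try {
        New-VirtualPortGroup -VirtualSwitch $vss -Name $pg.Name `
            -VLanId $pg.VlanId -ErrorAction Stop | Out-Null
    }
    catch {
        Write-Warning "Failed to create $($pg.Name) (VLAN $($pg.VlanId)): $_"
    }
}

# Verify the result instead of trusting it: anything requested but missing?
$portGroups | Where-Object {
    -not (Get-VirtualPortGroup -VirtualSwitch $vss -Name $_.Name `
            -ErrorAction SilentlyContinue)
} | ForEach-Object { Write-Warning "Missing port group: $($_.Name)" }
```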
Fifth Gotcha –
Simultaneous Purple Screens of Death on two upgraded hosts. The Exception 13 PSOD was essentially caused by differences in vNUMA handling between 5.5 and 6.5. I had tested vMotion between 5.5 and 6.5 hosts extensively to make sure something like this wouldn’t happen. What I didn’t test was a VM with a larger memory footprint, like Exchange or SQL Server. A VM accessing multiple NUMA nodes caused a PSOD on two hosts at the same time. The neat part was that the VMs that triggered the PSOD never actually migrated to the 6.5 hosts – they remained on their 5.5 source.
This stopped my project pretty hard. VMware GSS confirmed that an upgrade from 6.5 GA to at least 6.5a would have prevented this. This project was initially greenlit with 6.5 GA and I had not spent extensive hours testing with Update 1. Lesson learned – RTFM. This is listed as a Known Issue and I missed it.
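A pre-flight query for the VMs most likely to span NUMA nodes would have made the gap in my test plan obvious. A sketch – the 8 vCPU / 64 GB thresholds are my own assumptions, not VMware guidance; tune them to the hosts' actual NUMA node size:

```powershell
# Sketch – thresholds are assumptions; tune to the hosts' NUMA node size.
Get-VM | Where-Object { $_.NumCpu -gt 8 -or $_.MemoryGB -gt 64 } |
    Sort-Object MemoryGB -Descending |
    Select-Object Name, NumCpu, MemoryGB
```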
Sixth Gotcha –
Upgrade of a host failed on reboot when the host was unable to locate installation media. After multiple attempts to upgrade, I determined that the install media had died an untimely death. Thankfully, this didn’t result in an outage or a constrained cluster.
Ultimately, this would have happened at the next patch interval or whenever the host rebooted again. It wasn’t at all unique to the project, but was included in my retrospective anyway.
Seventh Gotcha –
With all other systems upgraded to Update 1, I needed to upgrade my VCSA to Update 1 (from GA). My VCSA was configured for VCHA, and I was able to successfully upgrade both the Passive and Witness nodes without issue. I then manually initiated a failover to the Passive node, and vCenter just never came back.
I found that the VCSA received a different IP than the static IP set on my Active node, but I have yet to determine why. There’s a KB article about the vcha user password expiring and causing replication to fail, but I had already visited that and did not have any database replication errors.
The resolution was to log in to the VAMI and re-assign the proper static IP. When I did this initially, all of my hosts disconnected when I logged in to the Web Client. After rebooting both the Active and Passive nodes simultaneously (defeating the purpose of VCHA entirely), the VCSA came back properly. That said – I still have an open case with VMware GSS.
There’s a handful of things to be said about the project, and I think the largest is configuration management. I’ve struggled with Host Profiles and recently lamented this in the vExpert Slack. Even with Host Profiles, at least two of the three configuration issues would still have been encountered!
Overall, the project was very successful, and this was one of the smoothest “upgrades” I’ve seen (or been the lead on, for that matter).
What have your experiences with upgrading to 6.5 been? Did you face any of these issues?