Meltdown/Spectre Patching – Enhanced vMotion Compatibility

tl;dr – If you’ve patched everything already but still fail verification, check whether EVC is enabled on the cluster. I found that EVC did not update itself as described in the KB article. Disabling and re-enabling EVC allowed the new CPU instructions to be presented to VMs.


By now, you and the rest of the world know about the Meltdown/Spectre vulnerabilities that were disclosed on 1/3/18 or sometime thereabouts. Earlier this week, VMware released patches and this KB detailing how to apply them.

I had already taken steps to patch the (physical) systems with available BIOS/firmware updates in my environment. When vCenter Server and ESXi patches were released (links to all of which can be found in the VMware Security Advisory here), I added those patches to the pile. Guest OSs had already received patches through other channels. There were probably patches for various lamps and/or lampshade microcode that needed to be applied elsewhere. Read: There’s really just a lot of patches to apply to mitigate this mess… moving on!

Being the absolute demon that he is, William Lam has created a script which will report on the vulnerability mitigation capability of ESXi hosts and VMs running on them. He’s documented that script and how to use it on his blog.

Using Mr. Lam’s script, I found that my ESXi hosts were patched properly and successfully seeing the new CPU instructions added by new microcode. The problem I had was identifying exactly why the VMs themselves weren’t receiving the same CPU instructions.

My validation process: create a new VM at VM Hardware Version 8 and verify that Mr. Lam’s script reported it accordingly; upgrade the VM hardware version, power the VM on, and run Mr. Lam’s script again for comparison. The results varied across the environment.
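For the curious, the PowerCLI version of that loop looks roughly like this. It’s a simplified sketch – the VM, host, and datastore names are made up, and the -Version values are from memory, so check them against your PowerCLI release:

# Create a throwaway VM at Hardware Version 8 for a "before" reading
$vm = New-VM -Name "SpectreTest01" -VMHost (Get-VMHost "esx01.lab.local") -Datastore (Get-Datastore "DS01") -Version v8

# ...run Mr. Lam's verification script for the baseline result...

# Upgrade the hardware version (VM must be powered off), then power on and re-run the script
Set-VM -VM $vm -Version v11 -Confirm:$false
Start-VM -VM $vm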

I narrowed the issue down to Enhanced vMotion Compatibility (EVC). On clusters where EVC was not enabled, my validation process showed that the VM was receiving the new CPU instructions. On clusters where EVC was enabled, the VM was not being presented with the new instructions.

The VMware KB article above indicates that when ESXi hosts in an EVC-enabled cluster are upgraded, the cluster maintains the current instruction set until every host in the cluster has been upgraded. At that point, EVC is supposed to automatically raise its baseline so VMs can receive the new instructions. My observations suggested otherwise, so I needed to test this.

All hosts in my EVC-enabled cluster had been patched and were reporting as such in both vCenter and Mr. Lam’s script. My vCenter Server had also been patched appropriately (which is required for exactly this reason) and reported as such in… well, vCenter. After putting a host in Maintenance Mode, I removed it from the EVC-enabled cluster and left it as a standalone host. My validation process succeeded! I moved the host back into an EVC-enabled cluster and validated again – vulnerable!

This confirmed to me that EVC was the culprit blocking my path to successful mitigation. After considering the potential consequences, I disabled EVC and re-enabled it on the cluster (at the same EVC level). Once re-enabled, all validation via Mr. Lam’s script passed. Further, Microsoft’s validation script in the Guest OS passed as well. Successful mitigation!
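I did the disable/re-enable through the Web Client, but for reference, a minimal PowerCLI sketch of the same round trip might look like this (hypothetical cluster name; passing $null to disable EVC is per the Set-Cluster documentation as I recall it – validate in a lab first):

# Capture the cluster's current EVC level
$cluster = Get-Cluster -Name "Prod-Cluster01"
$evcLevel = $cluster.EVCMode

# Disable EVC, then re-enable it at the same level
Set-Cluster -Cluster $cluster -EVCMode $null -Confirm:$false
Set-Cluster -Cluster $cluster -EVCMode $evcLevel -Confirm:$false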

EVC – I won’t let you get the best of me!

Note: This information has been relayed to VMware. I expect the KB article will be updated at some point in the near future. I’ll update this blog post to reflect that as soon as I’m made aware.

Helpful Jira Query for Total Time Spent

My last two posts have detailed information about a major project that I had an opportunity to lead. As part of that project, I wrote a fairly detailed retrospective where I wanted to cover interesting facts, issues that I encountered, and the resolutions to those issues.

One of the interesting facts that I included was the total number of hours I logged against the project in our tracking software, Jira. What I learned very quickly was that Jira does not have very extensive time reporting out of the box. Most time reporting appears to require third-party add-ons. Some of those add-ons are free while others are paid. I have absolutely no idea what my company does and doesn’t have. My only time in the tool is to create my Epics, Stories, Tasks, etc. and to log work against them.

It seems crazy to me that something so basic is not readily available in the tool. (If I’m overlooking something, point me in the right direction, please!) At the very least, I feel like the work done in the sub-parts of the Epic should roll up to the Epic itself. The worst part is that it doesn’t seem to be just me searching for this information. A not-so-exhaustive search turned up a feature request for exactly this type of function. That feature request is quite old and heavily watched and commented on by members of the community.

I’ll admit – for all intents and purposes, I’m a Jira rookie. After a little trial and error and some Google-fu, I was able to put together a query that displays what I’m looking for! It doesn’t produce a sexy report, but it seems to be worth having in my back pocket until I can figure out exactly how to obtain said sexy report. There were a lot of people looking for similar information, so I hope this is helpful. This query returns time worked in a little pop-up directly below the search bar.

For me, the following code only shows me the Stories linked to Epics, but no Tasks or Sub-tasks. The total time displayed appears to be correct, at least.

project = [Project Name] AND "Epic Link" = [EPIC-###] AND (issuetype = Story OR issuetype = Task OR issuetype = Sub-task) AND issueFunction in aggregateExpression("Total time", "timespent.sum()") 
Note: You’ll want to replace the project name and the Epic key (the stuff in brackets) with your own. Also worth mentioning: the issueFunction keyword comes from the ScriptRunner add-on, if I’m not mistaken, so the query won’t parse without it. Finally, if you find that copy/paste doesn’t work, try re-typing all the quotation marks – browsers and word processors love swapping in curly quotes, which break the query. In my ample experience (ha!), this resolves it.

Later on, I found that there was a duplicate story with some work logged against it. To query multiple Epics, you can change the code a little. Swap this in for the “Epic Link” clause in the query above and include as many Epics as you need, separated by commas inside the parentheses.

"Epic Link" IN [EPIC-123], [EPIC-234]

What I hope to do now is compare the time spent in the various phases of the project against the number of issues seen in those phases. I’m hoping to draw some type of correlation between the amount of time spent testing a particular phase and the number of issues seen when actually performing the tasks.

If you have a better way of reporting Jira time information, please share with me here or reach out to me on Twitter!


vSphere Migration – The Gotchas

Yesterday I wrote a little about the process I went through when upgrading my production environment. Today, I want to talk about the issues I faced when doing so and what I learned during the project.

During the first part of the project, I moved virtual machines from a Distributed Switch to a Standard Switch. I was able to do this as a result of redundant NICs on all but a single host. The idea was simple – assign a redundant NIC to a Standard Switch, copy port groups from the Distributed Switch, and update virtual machine network adapters using a script.
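For a rough idea of what that script does, here’s a simplified PowerCLI sketch. The switch and host names are hypothetical, and the real script carried more error handling (see the Fourth Gotcha below for why that matters):

$vmhost = Get-VMHost -Name "esx01.lab.local"
$vss = Get-VirtualSwitch -VMHost $vmhost -Name "vSwitch1" -Standard

# Copy each non-uplink DVS port group to the Standard Switch, preserving the VLAN ID
foreach ($pg in Get-VDSwitch -Name "DSwitch" | Get-VDPortgroup | Where-Object { -not $_.IsUplink }) {
    New-VirtualPortGroup -VirtualSwitch $vss -Name $pg.Name -VLanId $pg.VlanConfiguration.VlanId | Out-Null
}

# Flip each VM network adapter on the host to the same-named Standard Switch port group
Get-VM -Location $vmhost | Get-NetworkAdapter | ForEach-Object {
    Set-NetworkAdapter -NetworkAdapter $_ -Portgroup (Get-VirtualPortGroup -VMHost $vmhost -Name $_.NetworkName -Standard) -Confirm:$false
}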

First Gotcha –

I learned that the redundant NICs on two of my hosts were set at the physical switch level to be access ports. As a result, some virtual machines lost network connectivity until the issue was identified and rectified. The overall issue was one of configuration consistency and not one that I had anticipated.

Second Gotcha –

The single host without a redundant NIC had been migrated to a Distributed Switch. I searched but was unable to come up with a scenario where migrating back to a Standard Switch didn’t result in a loss of connectivity. One might argue that I could simply vMotion machines to other hosts. Due to operational constraints, that was not an option for this host.

In my head, the move is a two-step process: move the physical NIC, then move the vmkernel port for management traffic. Moving either one first drops connectivity, right? Anyone with a better way to do this… please, PLEASE let me know. I ended up using esxcli to remove the vmkernel port from the Distributed Switch and re-add it to the Standard Switch after migrating the physical NIC via the UI.
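Since writing this, I’ve read that PowerCLI’s Add-VirtualSwitchPhysicalNetworkAdapter cmdlet is supposed to migrate the uplink and the vmkernel port in a single operation, which should close the connectivity gap. I haven’t tested it myself, so treat this sketch (hypothetical names included) as an assumption worth validating in a lab first:

$vmhost = Get-VMHost -Name "esx01.lab.local"
$vss = Get-VirtualSwitch -VMHost $vmhost -Name "vSwitch0" -Standard
$vmnic = Get-VMHostNetworkAdapter -VMHost $vmhost -Physical -Name "vmnic0"
$vmk = Get-VMHostNetworkAdapter -VMHost $vmhost -VMKernel -Name "vmk0"
$pg = New-VirtualPortGroup -VirtualSwitch $vss -Name "Management Network"

# Reportedly moves the physical NIC and the vmkernel port together in one transaction
Add-VirtualSwitchPhysicalNetworkAdapter -VirtualSwitch $vss -VMHostPhysicalNic $vmnic -VMHostVirtualNic $vmk -VirtualNicPortgroup $pg -Confirm:$false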

In my opinion, this is another configuration inconsistency, this one at the physical layer. The host has the NICs; they just weren’t plugged in. Entirely my fault for not stalling on this host until I could connect another NIC. That said – planned downtime (albeit rushed) of VMs was accepted.

Third Gotcha –

After all my migration tasks were complete and all VMs were back on a Distributed Switch in my new environment, I received a message indicating that a team member couldn’t log in to vCenter. A quick review of some logs indicated that time had drifted approximately ten minutes. I was able to manually set my date and time in vCenter and team members were able to log in successfully again.
The issue ended up being an NTP server VM syncing time with its ESXi host via VMware Tools – a configuration I didn’t know existed. The root cause was, as you may have already guessed, an ESXi host improperly configured for NTP. Another configuration inconsistency.
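If you want to audit for that same combination, a couple of hedged PowerCLI one-liners (property paths as I understand them) should surface both halves of the problem:

# Which hosts have missing or unexpected NTP servers configured?
Get-VMHost | Select-Object Name, @{N="NtpServers"; E={ Get-VMHostNtpServer -VMHost $_ }}

# Which VMs are set to sync guest time with their host via VMware Tools?
Get-VM | Where-Object { $_.ExtensionData.Config.Tools.SyncTimeWithHost } | Select-Object Name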

Fourth Gotcha –

A single VLAN tag did not get set on either the Standard Switch or the Distributed Switch port group, which resulted in a loss of VM network connectivity. The resolution was to add the VLAN tag to the port group. I was unable to identify exactly why the script I wrote failed to properly create that port group.

I attempted to re-create the issue but was unsuccessful. Thankfully, it was a quick fix. Lesson learned – work on better error handling and/or output for scripts to make things easier to track down.
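In the meantime, a quick side-by-side audit of VLAN IDs would have caught this miss. A sketch with the same hypothetical switch names as before:

# Compare VLAN IDs by port group name on both sides
Get-VDSwitch -Name "DSwitch" | Get-VDPortgroup | Select-Object Name, @{N="VlanId"; E={ $_.VlanConfiguration.VlanId }}
Get-VirtualSwitch -Name "vSwitch1" -Standard | Get-VirtualPortGroup | Select-Object Name, VLanId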

Fifth Gotcha –

Simultaneous Purple Screens of Death on two upgraded hosts. The Exception 13 PSOD was essentially caused by differences in vNUMA handling between 5.5 and 6.5. I had tested vMotion between 5.5 and 6.5 hosts extensively to make sure something like this wouldn’t happen. What I didn’t test was a VM with a larger memory footprint, like Exchange or SQL Server. Access across multiple NUMA nodes caused a PSOD on two hosts at the same time. The neat part was that the VMs that triggered the PSODs never actually migrated to the 6.5 hosts and remained on their 5.5 source.
This stopped my project pretty hard. VMware GSS confirmed that an upgrade from 6.5 GA to at least 6.5a would have prevented this. This project was initially greenlit with 6.5 GA and I had not spent extensive hours testing with Update 1. Lesson learned – RTFM. This is listed as a Known Issue and I missed it.

Sixth Gotcha – 

Upgrade of a host failed on reboot when the host was unable to locate installation media. After multiple attempts to upgrade, I determined that the install media had died an untimely death. Thankfully, this didn’t result in an outage or a constrained cluster.
Ultimately, this would have happened at the next patch interval or whenever the host rebooted again. It wasn’t at all unique to the project, but was included in my retrospective anyway.

Seventh Gotcha – 

With all other systems upgraded to Update 1, I needed to upgrade my VCSA to Update 1 (from GA). My VCSA was configured for VCHA, and I was able to successfully upgrade both the Passive and Witness nodes without issue. I manually initiated a failover to the Passive node and vCenter just never came back.
I found that the VCSA received a different IP than the static IP set on my Active node, but I have yet to determine why. There’s a KB article about the vcha user password expiring and causing replication to fail, but I had already checked that and did not have any database replication errors.
The resolution was to log in to the VAMI and re-assign the proper static IP. After I did this, all of my hosts showed as disconnected when I logged in to the Web Client. After rebooting both Active and Passive nodes simultaneously (defeating the purpose of VCHA entirely), the VCSA came back properly. That said – I still have an open case with VMware GSS.

Conclusion – 

There’s a handful of things to be said about the project, and I think the largest is configuration management. I’ve struggled with Host Profiles and recently lamented about this in the vExpert Slack. Even with Host Profiles, at least two of the three configuration issues would still have been encountered!

Overall, the project was very successful and this was one of the smoothest “upgrades” I’ve seen (or been the lead on, for that matter).

What have your experiences with upgrading to 6.5 been? Did you face any of these issues?

vSphere Migration – What I Did (Or Oh No… What Did I Do!?)

A quick post about my recent vSphere Upgrade!

Earlier this month, I began working on a migration project to move production systems from vSphere 5.5 Update 3 to vSphere 6.5. There are many different reasons I wanted to do so. I’ll be honest: I’m one of the few who actually wanted to get back to the Web Client. At my last job, I used the Web Client exclusively, and it’s been really tough to go back to the C# client these last many months. A few other things? The Content Library and vCenter High Availability are must-haves, in my opinion.

So many options!

Option 1: Upgrade my existing environment to 6.5 with a Windows vCenter and external vCDB

Option 2: Migrate2VCSA – solid, viable choice


Option 3: Fresh environment beside the older environment and manual migration

Option 4: Wait until 5.5 hits End of Life and cry when that day hits
There’s probably a handful more options, but I think these four are likely the most popular (with more of us than we’d all care to admit taking Option 4).

Quick Option Breakdown:

In my opinion, Option 1 isn’t all that appealing. VMware has already announced the deprecation of the Windows vCenter. On top of that, a distributed deployment of it is a nightmare – no thanks.
Option 2 brings me to the VCSA, which VMware has identified as the direction that will continue to receive development resources. An added plus is that I can use VCHA with the Appliance. The con here is that I’d bring over all of the things that shouldn’t be there – stale permissions that should be removed, or weird vCenter configurations that I haven’t happened upon yet. Not my cup of tea.

Option 3 offered me the ability to do things slowly with more consideration. This is it. This is the one. I was able to set up a new distributed deployment of the PSC and deploy VCHA for the vCenter Appliance. I was able to re-create and organize my Virtual Machine folder hierarchy. I was able to create a better logical design of the virtual infrastructure and use that to guide the migration process.

The best part – it’s all mine! I understand the decisions made. I was able to document permissions and age out stale ones (finding one or two that weren’t, in fact, stale but seldom used). I can say that I know the environment inside and out.

Option 4 was listed as comedic relief, but I’m well aware that there’s some organizations that WILL stay on 5.5U3 right up until it is no longer supported.


A little about the process:

This migration has been a long time in the making. Many months ago I wrote about a script to move DVS port groups to VSS port groups. It was AWESOME to finally see that thing run for its intended purpose. It worked flawlessly!

Essentially, I ran one script to copy all DVS port groups to VSS port groups and flip virtual machine networking. Once I verified network connectivity, I consumed the host in my new environment. Once an entire cluster was in the new environment, I ran another script to create a DVS from the VSS port groups (see the sketch below). I then manually updated permissions and organized VMs. Were there extra steps? There sure were. Could I have done things a different way? Absolutely! And none of them would have been wrong.
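The reverse-direction script boils down to something like this sketch (hypothetical names again; the real version also re-attached uplinks and flipped the VMs back):

# Recreate a Distributed Switch and its port groups from the Standard Switch config
$vmhost = Get-VMHost -Name "esx01.lab.local"
$vds = New-VDSwitch -Name "DSwitch-New" -Location (Get-Datacenter -Name "DC01")
foreach ($pg in Get-VirtualSwitch -VMHost $vmhost -Name "vSwitch1" -Standard | Get-VirtualPortGroup) {
    New-VDPortgroup -VDSwitch $vds -Name $pg.Name -VlanId $pg.VLanId | Out-Null
}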

After iterating through all of the clusters, assigning permissions, putting VMs into folders, and allowing some time for the systems to stew, I began the upgrade process. The upgrade was the easiest part and another one of the reasons I wanted to use the VCSA – built-in vSphere Update Manager.

In the vExpert Slack and on Twitter, I’ve seen horror stories of VUM in both Windows and VCSA versions. I’ve never had an issue! This time around was no different. The upgrade felt like it was the most time-consuming part of the entire process… host reboots take forever.

Don’t let me fool you – I ran into some issues with the upgrade. Despite hours of testing the process, it just didn’t go perfectly. I beat myself up for it quite a bit, but I learned from it.

I just finished writing a pretty detailed retrospective on the issues I saw when upgrading. I’ll give an overview of those in another post.
How would you have chosen to upgrade? What would you have done to make things easier? What design concepts do you think I missed? Let me know!


Google Chrome Flash Fix for vSphere Web Client

Sometime during the middle of last week, my Google Chrome updated and left me unable to access vCenter via the vSphere Web Client. Currently, my Chrome is Version 62.0.3202.62. During this update, the embedded Adobe Flash Player was updated from 27.0.0.159 to 27.0.0.170. The result was less than stellar:

[Image: ChromeVersion]

[Image: VirtuallyGhettoCrash]

Image stolen from @lamw‘s blog post (listed below) because I fixed all my stuff before taking a screenshot…

In a blog post (which has now been updated to reflect the fix), William Lam provided instructions on how to use an older version of the pepflashplayer.dll to re-gain access to vCenter via the vSphere Web Client.

Good news! Adobe has released a newer version of Flash Player and that quick fix is no longer required. Note: At this time, it does not appear that Chrome itself has been updated.

To update the Chrome Flash component manually:

  • Open Chrome and enter chrome://components in the address bar
  • Scroll down to Adobe Flash Player which likely reads 27.0.0.170
  • Click Check for Updates and watch Flash Player update to 27.0.0.183 (shown below)
  • Connect to vCenter and test

[Image: ChromeComponents]


This has worked for me and many co-workers so far. If you have a different experience, please share!

Update: New installs of Chrome will automatically download the newest version of Adobe Flash Player.


vRealize Automation 7.3: Install, Configure, Manage (BETA) Review

Here’s the scenario:

I’ve recently been introduced to vRealize Automation at work. We’re entering the tail end of the roll-out and I’ve only just begun to understand exactly how to interact with it. I understand why an organization might want to use the product, but I’ve just never been at a place in my career to get a good hold on it.

Fast-forward only a moment or two in time. My Principal Virtualization Engineer, the guy leading the vRA deployment project, is moving to a different role in the organization. vRA is left to me to support, but I have no idea what I’m doing. Time to see what VMware has for training! Registration for vRealize Automation 7.3 Install, Configure, Manage (BETA): complete.

Note: The Beta for this course is no longer available, as the course is now Generally Available. Check out available Beta courses here.

Present day:

Last week, I had an opportunity to take the vRA 7.3: ICM Beta class. Initially, I had some reservations about taking a Beta training course. I wasn’t exactly sure what I should expect, but I was pleasantly surprised by the delivery and material. Where some may have been disappointed, I found it to be more helpful overall.

I don’t want to go into detail about the course materials. If you’re familiar with vRA, you can read about new features. If you’re not… attend a Beta! Let’s discuss things that I felt were beneficial about taking the Beta version of the course.

Price –

I’m not sure whether Beta courses vary in price, but this ICM only cost me 50% of what it normally would have. The $2k price tag is still more than I would want to shell out if I were covering it by myself, but the price point is no longer insurmountable. Better yet, the price made it an easier argument for the boss. I list this first because, when comparing all other benefits, this really is a substantial one. Not being able to attend… Well, it really makes getting any benefit out of the training difficult.

Instructors –

On the first day of training, my instructor, Brian Watrous, gave a good explanation of how the Beta class would be run. Brian is the Lead Instructor for the course, and it was very clear that he’s extremely knowledgeable and passionate about the product. (His blog is solid, too.) Daniel Crider – who developed the course, created the lab scenarios, and built the lab environment – joined Brian for the delivery of the Beta.

Essentially, we were going to be seeing a yet-to-be-final version of the course. Daniel was on-hand to take suggestions and feedback from Brian and other attendees on the lecture and lab materials while also offering his fair share of knowledge about the product. I don’t feel that I can adequately capture how unique the instructors made this training feel. I’m sure that this was a result of the Beta and am even more pleased that I attended as a result. The two instructors were excellent.

Attendees –

One of the more interesting aspects of the class was that there were a bunch of VMware Certified Instructors attending with me. As a result, there were some interesting dynamics in the classroom that I don’t think I would have otherwise experienced. I was learning about vRA for the first time among others who were probably learning the “What’s New” pieces of 7.3 (likely having conducted training sessions of their own on previous versions of the product).

Having never seen most of the product before, I asked my fair share of questions. I received answers from the instructors and/or other attendees. In retrospect, I think I may have been the guy that everyone got annoyed with… I’ll choose not to dwell on that.

Labs –

I mentioned before that Daniel had created the lab scenarios and the lab environment. The Beta class saw a new, never-before-seen lab on using Storage Policy Based Management in vRA. We almost saw a lab on Containers! Turns out that several layers of abstraction may make certain things difficult (note: containers are one of those certain things).

To me, labs can be difficult to take. They’re very detail-oriented and written so that anyone can complete them. It’s very easy to find yourself clicking through the steps without really understanding what you’re doing or why you’re doing it. As a result, it’s just as easy to leave the course not really having learned anything.

The labs weren’t perfect. Not everything worked as it was intended to. Some things broke and needed some troubleshooting to fix. I don’t say these things as a negative. This was a core piece of my learning experience – troubleshooting a vRA deployment that should be working. The beauty of this was the guidance from two very knowledgeable instructors. I even managed to fix some things on my own which made me feel pretty good at the end of the day!

As an interesting tidbit – Our final lab was learning how to install vRealize Automation. As Brian quipped, “You could argue that this course should be titled, ‘Configure, Manage, Install’ but it doesn’t sound as good.”

Summary (TL;DR) –

The Beta course was excellent! My instructors were extremely well-versed in both using and teaching the software. While the labs were troublesome at times, I spent a good amount of time both working in and troubleshooting the software. I learned a lot.

At the end of the day, I really feel that the training gave me the knowledge and tools that I need to support my organization’s deployment. Only time will tell. I feel good about the challenge and hope I can mature the deployment well.

Welcome to #Blogtober

Happy #Blogtober, everyone! In this #Blogtober post, I want to give a brief description of exactly what #Blogtober is and drop in some details on how I intend to use it. I’ll also include an overview of post ideas for the month (so I don’t forget, and so that you can hold me accountable). Big shout out to Matt Heldstab, who encouraged me to participate. For more info directly from the source, head over to blogtober.net.


What is #blogtober?

#Blogtober is a commitment for rookie and accomplished bloggers alike. The goal of #blogtober is pretty simple – create five blog posts in the month of October. In announcing the program, Matt lists the following three reasons –

  • #Blogtober gives new bloggers visibility in the community and issues a challenge to be held accountable to
  • We’re in conference season – There are many different conferences in the industry (VMworld, MS Ignite, etc.) that can provide blog topics
  • vExpert 2017 – Blogging can help earn vExpert status by sharing your knowledge with the community


To me, #blogtober is a more consumable version of #vdm30in30, which takes place in November. #vdm30in30’s goals are very much the same, but the requirements are loftier – 30 posts in the 30 days of November. I want to participate, but I just don’t have that much content… yet!


How do I plan to use #blogtober?

That’s where #blogtober comes in. This program gives me an easier, more consumable challenge. Five posts over the course of a month allows me to get into the mindset of blogging while also allowing me time to research how to start thinking like a blogger. I want to post technical content – I get so much useful information from others in our community. That said, I have a sneaking suspicion that a lot of what I post will end up being about soft skills or other observations. We’ll see what happens…


What topics are you going to cover, James?

I had a list of ideas that I wrote down specifically so as to not forget them. I’m pretty sure I threw it away. Great job, self!

What I remember:

  • Home Lab Setup – A description of my initial home lab setup followed by config changes and difficulties making said changes. Likely to be two separate posts.
  • DevOps – A discussion on company culture and Deming’s 14 Points of Management. I may drop DevOps out of this. I’m not an authority on it and it feels buzzword-y. Open to suggestions.
  • VMUG – A discussion about how VMUG has influenced my career.
  • Project Work – A retrospective of a to-be-completed project (pending approval).
  • Maybe more if I can keep going?

Also, I’m open to ideas! If you think something is worth expanding on, please let me know.


Want to get involved?

It’s October 3rd and there’s still plenty of month left! At the time of this writing, approximately 65 people are participating in the program! If you want to challenge yourself and/or get more involved in the community, this is a great starting place.

Step 1 – Head over to www.blogtober.net and comment on the blog post.
Step 2 – Write some neat stuff and post it on your blog (acquire a blog, if necessary… then blog about it!)
Step 3 – Throw the post out to the Twitterverse with the #blogtober hashtag

It’s really that simple.

Here’s to hoping that I can stick with five more posts for the month of #Blogtober (remember, this post doesn’t count!).