vRealize Automation 7.x Directory Sync Failure

tl;dr – In the event of Directory Sync failure, check the following two files on the vRA appliance for the proper Domain and Domain Controller information:

/usr/local/horizon/conf/domain_krb.properties and
/usr/local/horizon/conf/states/TENANTNAME/####/config-state.json

If stale records exist, remove them and restart the appliance. If you use an external vIDM, you may need to search for similar files there.

Scenario –

I have a vRealize Automation Appliance that hosts different tenants with identical directory structures. Directory sync completes correctly for all tenants but one. When I attempted to manually sync that tenant, I was met with the following error:

[Image: a nondescript error – just a little red box with no message]
That’s it. Seriously – that little red box was all. Super descriptive, right!?

To troubleshoot, I opened every browser I had on hand. I even installed a new one in the hope that an error message would appear in that box – no dice! At this point, I didn’t really know what I was looking for. I rebooted the vRA Appliance a few times and was still met with the same issue. I opted to open a Support Request.

As part of that Support Request, Randy (VMware Support) and I verified a few easy things. I compared Directory configuration across tenants and confirmed that the bind account functioned properly. Eventually, we took a look at the connector.log file found at /storage/log/vmware/horizon/connector.log and didn’t find much worthwhile. We then followed the log with less +F while I regenerated the above-referenced error to create some new entries. To my delight, the log received some action.

In the log, I found the following snippets (note – these aren’t contiguous entries, but relatively close to one another) –

2018-02-05 22:43:30,974 INFO  (SimpleAsyncTaskExecutor-41084) [3010@TENANTNAME;username@TENANTNAME;] com.vmware.horizon.directory.ldap.LdapConnector - Attempting to bind to sunset-dc.mueller-tech.com:389

2018-02-05 22:43:30,975 INFO  (SimpleAsyncTaskExecutor-41084) [3010@TENANTNAME;username@TENANTNAME;] com.vmware.horizon.directory.ldap.LdapConnector - LDAP Context env Json Values: {
"java.naming.provider.url" : "ldap://sunset-dc.mueller-tech.com:389",
2018-02-05 21:43:09,100 WARN  (SimpleAsyncTaskExecutor-39489) [3010@TENANTNAME;username@TENANTNAME;] com.vmware.horizon.directory.ldap.LdapConnector - Failed to connect to sunset-dc.mueller-tech.com:389
javax.naming.CommunicationException: sunset-dc.mueller-tech.com:389 [Root exception is java.net.UnknownHostException: sunset-dc.mueller-tech.com]

This particular tenant was trying to bind to a recently-sunset domain controller and the sync was failing as a result. Elsewhere in the log were details of exactly which users and groups were unable to update/sync.

While Randy researched on his end, I found some information in the VMware Identity Manager Documentation that stated the domain_krb.properties file needed to be updated manually when DCs were added or removed.

The domain_krb.properties file is located at /usr/local/horizon/conf/domain_krb.properties and, in my case, began with the following:

#Date of Initial Creation
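
The domain-to-DC entries themselves aren’t reproduced here, but the file is a simple properties file that maps each domain to its domain controllers. Purely as an illustration (the hostnames below are from this scenario, and the exact delimiter and port syntax can vary between vIDM versions, so check the file and VMware’s documentation before editing), an entry looks something like:

mueller-tech.com=sunset-dc.mueller-tech.com:389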

I took a quick snapshot of the vRA appliance and edited the file to remove the recently-sunset domain controller reference. Afterward, I issued service horizon-workspace restart and waited for the tenant to come back online. No good fortune. I rebooted the vRA appliance for good luck. Still no dice!

At the guidance of VMware Support, I looked at the config-state.json file at /usr/local/horizon/conf/states/TENANTNAME/####/config-state.json. In this file, I found more references to the recently-sunset domain controller listed as the “kdc” entry as shown below.

"crossRefs" : [ {
"host" : "mueller-tech.com",
"rootDomainController" : "DC=mueller-tech,DC=com",
"kdc" : "sunset-dc.mueller-tech.com",
"port" : 389,
"forestDn" : "DC=mueller-tech,DC=com",
"netBiosName" : "MUELLER-TECH"
} ],
"unresolvedCrossRefs" : [ ],
"crossRefMap" : {
"DC=mueller-tech,DC=com" : {
"host" : "mueller-tech.com",
"rootDomainController" : "DC=mueller-tech,DC=com",
"kdc" : "sunset-dc.mueller-tech.com",
"port" : 389,
"forestDn" : "DC=mueller-tech,DC=com",
"netBiosName" : "DOMAIN"
"netBiosNameByCrossRefMap" : {
"DOMAIN" : {
"host" : "mueller-tech.com",
"rootDomainController" : "DC=mueller-tech,DC=cp,",
"kdc" : "sunset-dc.mueller-tech.com",
"port" : 389,
"forestDn" : "DC=MUELLER-TECH,DC=COM",
"netBiosName" : "MUELLER-TECH"`

Comparing this file from the broken tenant against the same file from a tenant with a functioning sync, I found that the “kdc” entry needed to be updated to “ldap.mueller-tech.com”, and I made the change accordingly. Another quick service horizon-workspace restart and it was time to test.

DICE! (Wait… is that the opposite of “no dice” or not?) Directory sync finished faster than normal and my users were able to connect as needed.

Hopefully this won’t be needed in future versions of the VMware Identity Manager or vRealize Automation Appliance. This scenario doesn’t seem to be well-documented by VMware. Hopefully this will point someone in the right direction if need be.

Meltdown/Spectre Patching – Enhanced vMotion Compatibility

tl;dr – If you’ve patched everything already but still fail verification, check to see if EVC is enabled on the cluster. I found that EVC did not update itself as described in the KB article. Disabling and re-enabling EVC allowed the new CPU instructions to be applied.

By now, you and the rest of the world know about the Meltdown/Spectre vulnerabilities that were disclosed on 1/3/18 or sometime thereabouts. Earlier this week, VMware released patches and this KB detailing how to apply them.

I had already taken steps to patch the (physical) systems with available BIOS/firmware updates in my environment. When vCenter Server and ESXi patches were released (links to all of which can be found in the VMware Security Advisory here), I added those patches to the pile. Guest OSs had already received patches through other channels. There were probably patches for various lamps and/or lampshade microcode that needed to be applied elsewhere. Read: There’s really just a lot of patches to apply to mitigate this mess… moving on!

Being the absolute demon that he is, William Lam has created a script which will report on the vulnerability mitigation capability of ESXi hosts and VMs running on them. He’s documented that script and how to use it on his blog.

Using Mr. Lam’s script, I found that my ESXi hosts were patched properly and successfully seeing the new CPU instructions added by new microcode. The problem I had was identifying exactly why the VMs themselves weren’t receiving the same CPU instructions.

My validation process: Create a new VM at VM Hardware version 8 and verify that Mr. Lam’s script reported it as such. Upgrade the VM Hardware version, power on the VM, and run Mr. Lam’s script again for comparison. The results varied across the environment.
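
If you’d rather script that loop, a minimal PowerCLI sketch follows. It assumes an existing Connect-VIServer session; the VM, cluster, and datastore names are placeholders, and it leans on the API’s UpgradeVM method so the exact hardware-version names in your PowerCLI build don’t matter.

# Create a throwaway test VM at hardware version 8 (it is powered off by default)
$vm = New-VM -Name 'spectre-test' -ResourcePool (Get-Cluster 'TestCluster') -Datastore 'Datastore01' -Version v8
# ...run Mr. Lam's verification script against the VM here and note the result...

# Upgrade the virtual hardware to the latest the host supports, power on, and re-check
$vm.ExtensionData.UpgradeVM($null)
Start-VM -VM $vm | Out-Null
# ...run the verification script again and compare the results...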

I narrowed the issue down to Enhanced vMotion Compatibility (EVC). On clusters where EVC was not enabled, my validation process showed that the VM was receiving the new CPU instructions. On clusters where EVC was enabled, the VM was not being presented with the new instructions.

The VMware KB article above indicates that when ESXi hosts in an EVC-enabled cluster are upgraded, the cluster maintains the current instruction set until all ESXi hosts in the cluster have been upgraded. At that time, EVC will automatically upgrade itself to enable VMs to receive the new instructions. Based on my observations, I needed to test this.

All hosts in my EVC-enabled cluster had been patched and were reporting as such, both in vCenter and in Mr. Lam’s script. My vCenter Server had also been patched appropriately (which is required for exactly this reason) and reported as such in… well, vCenter. After putting a host in Maintenance Mode, I removed it from the EVC-enabled cluster and left it as a standalone host. Performing my validation process there was successful! I moved the host back into the EVC-enabled cluster and performed another validation – vulnerable again!

This confirmed to me that EVC was the culprit blocking my path to successful mitigation. After considering the potential consequences, I disabled and re-enabled EVC on the cluster (at the same EVC level). Once re-enabled, all validation via Mr. Lam’s script passed. Further, Microsoft’s in-guest validation passed as well. Successful mitigation!
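
For anyone who prefers to do that toggle from PowerCLI rather than the Web Client, a rough sketch is below. It assumes a PowerCLI build where Set-Cluster exposes -EVCMode (the 6.5-era releases do) and uses a placeholder cluster name; be sure you understand the vMotion implications before disabling EVC on a production cluster.

# Capture the current EVC level, disable EVC, then re-enable it at the same level
$cluster  = Get-Cluster 'ProdCluster'
$evcLevel = $cluster.EVCMode                                        # e.g. 'intel-haswell'
Set-Cluster -Cluster $cluster -EVCMode $null -Confirm:$false        # disable EVC
Set-Cluster -Cluster $cluster -EVCMode $evcLevel -Confirm:$false    # re-enable at the same level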

EVC – I won’t let you get the best of me!

Note: This information has been relayed to VMware. I expect that this will be updated at some point in the near future. I’ll update this blog post to reflect that as soon as I’m made aware.

Helpful Jira Query for Total Time Spent

My last two posts have detailed information about a major project that I had an opportunity to lead. As part of that project, I wrote a fairly detailed retrospective where I wanted to cover interesting facts, issues that I encountered, and the resolutions to those issues.

One of the interesting facts that I included was the total number of hours I logged against the project in our tracking software, Jira. What I learned very quickly was that Jira does not have very extensive time reporting out of the box. Most time reporting appears to require third-party add-ons. Some of those add-ons are free while others are paid. I have absolutely no idea what my company does and doesn’t have. My only time in the tool is to create my Epics, Stories, Tasks, etc. and to log work against them.

It seems crazy to me that something so basic is not readily available in the tool. (If I’m overlooking something, point me in the right direction, please!) At the very least, I feel like the work done in the sub-parts of the Epic should roll up to the Epic itself. The worst part is that it doesn’t seem to be just me searching for this information. A not-so-exhaustive search identified a feature request for exactly this type of function to be added. That feature request was quite old and closely watched by members of the community.

I’ll admit – for all intents and purposes, I’m a Jira rookie. After a little trial and error and some Google-fu, I was able to put together a query that displays what I’m looking for! It doesn’t produce a sexy report, but it seems to be worth having in my back pocket until I can figure out exactly how to obtain said sexy report. There were a lot of people looking for similar information, so I hope this is helpful. This query returns time worked in a little pop-up directly below the search bar.

For me, the following query only shows the Stories linked to the Epic, but no Tasks or Sub-tasks. The total time displayed appears to be correct, at least.

project = [Project Name] AND "Epic Link" = [EPIC-###] AND (issuetype = Story OR issuetype = Task OR issuetype = Sub-task) AND issueFunction in aggregateExpression("Total time", "timespent.sum()") 
Note: You’ll want to replace the name of your project and however you number your Epics (the stuff in brackets). If you happen to find that copy/paste doesn’t work, try to re-type all the quotation marks. In my ample experience (ha!), this resolves the broken query.

Later on, I found that there was a duplicate story with some work logged against it. To query multiple Epics, you can change the query a little. Replace the “Epic Link” clause after the first AND with the following, including as many Epics as you need (separated by commas).

"Epic Link" IN [EPIC-123], [EPIC-234]

What I hope to do now is to compare the time spent in various phases of the project and the number of issues seen on those phases. I’m hoping to draw some type of correlation between the amount of time spent testing that particular phase and the number of issues seen when actually performing the tasks.

If you have a better way of reporting Jira time information, please share with me here or reach out to me on Twitter!


vSphere Migration – The Gotchas

Yesterday I wrote a little about the process I went through when upgrading my production environment. Today, I want to talk about the issues I faced when doing so and what I learned during the project.

During the first part of the project, I moved virtual machines from a Distributed Switch to a Standard Switch. I was able to do this as a result of redundant NICs on all but a single host. The idea was simple – assign a redundant NIC to a Standard Switch, copy port groups from the Distributed Switch, and update virtual machine network adapters using a script.
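
For illustration, a minimal PowerCLI sketch of that idea (not the exact script referenced here) might look like the following. Host, switch, and port group names are placeholders, it assumes an existing Connect-VIServer session, and it skips the validation and error handling a real run deserves.

# Copy each DVS port group to a standard switch on one host, then flip that host's VMs over
$vmhost = Get-VMHost 'esxi01.example.com'
$vds    = Get-VDSwitch 'Prod-DVS'
$vss    = Get-VirtualSwitch -VMHost $vmhost -Name 'vSwitch1' -Standard

foreach ($dvpg in Get-VDPortgroup -VDSwitch $vds) {
    if (Get-VirtualPortGroup -VirtualSwitch $vss -Name $dvpg.Name -ErrorAction SilentlyContinue) { continue }
    $vlan = if ($dvpg.VlanConfiguration) { $dvpg.VlanConfiguration.VlanId } else { 0 }
    New-VirtualPortGroup -VirtualSwitch $vss -Name $dvpg.Name -VLanId $vlan | Out-Null
}

# Point each VM network adapter at the matching standard-switch port group
foreach ($nic in Get-VM -Location $vmhost | Get-NetworkAdapter) {
    $target = Get-VirtualPortGroup -VirtualSwitch $vss -Name $nic.NetworkName -ErrorAction SilentlyContinue
    if ($target) { Set-NetworkAdapter -NetworkAdapter $nic -Portgroup $target -Confirm:$false | Out-Null }
}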

First Gotcha –

I learned that the redundant NICs on two of my hosts were configured as access ports at the physical switch level. As a result, some virtual machines lost network connectivity until the issue was identified and rectified. The overall issue was one of configuration consistency and not one that I had anticipated.

Second Gotcha –

The single host without a redundant NIC had been migrated to a Distributed Switch. I searched but was unable to come up with a scenario where migrating back to a Standard Switch didn’t result in a loss of connectivity. One might argue that I could simply vMotion machines to other hosts. Due to operational constraints, that was not an option for this host.

In my head, the move is a two-step process: move the physical NIC, then move the vmkernel port carrying management traffic. Moving either one first loses connectivity, right? Anyone with a better way to do this… please, PLEASE let me know. I ended up using esxcli to remove the vmkernel port from the Distributed Switch and re-add it to the Standard Switch after migrating the physical NIC via the UI.
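
For reference, the esxcli commands for that kind of move are roughly as follows. Run them from the host console/DCUI (management connectivity drops part-way through), make sure the standard switch already has the uplink and a management port group, and treat the vmk name, port group name, and addressing as placeholders.

# Remove the management vmkernel interface from its DVS-backed configuration
esxcli network ip interface remove --interface-name=vmk0

# Re-create it on a port group that lives on the standard switch
esxcli network ip interface add --interface-name=vmk0 --portgroup-name="Management Network"

# Re-apply the static IPv4 configuration (the default gateway may need to be reset as well)
esxcli network ip interface ipv4 set --interface-name=vmk0 --ipv4=192.0.2.10 --netmask=255.255.255.0 --type=static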

In my opinion, this was another configuration inconsistency, and another at the physical layer. The host has the NICs; they just weren’t up. Entirely my fault for not stalling on this host until I could plug in another NIC. That said – planned (albeit rushed) downtime of the VMs was accepted.

Third Gotcha –

After all my migration tasks were complete and all VMs were back on a Distributed Switch in my new environment, I received a message indicating that a team member couldn’t log in to vCenter. A quick review of some logs indicated that time had drifted approximately ten minutes. I was able to manually set my date and time in vCenter and team members were able to log in successfully again.
The issue ended up being an NTP server VM syncing its time with one of the hosts via VMware Tools – a configuration I didn’t know existed. The root cause was, as you may have already guessed, an ESXi host improperly configured for NTP. Another configuration inconsistency.
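
If you want to catch that configuration ahead of time, a quick PowerCLI check (a sketch, assuming an existing vCenter connection) will list the VMs whose VMware Tools are set to sync guest time with the host:

# Report VMs with Tools time synchronization enabled
Get-VM | Where-Object { $_.ExtensionData.Config.Tools.SyncTimeWithHost } |
    Select-Object Name, @{N='ToolsTimeSync'; E={$_.ExtensionData.Config.Tools.SyncTimeWithHost}}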

Fourth Gotcha –

A single VLAN tag did not carry over to either the Standard Switch or the Distributed Switch, which resulted in a loss of VM network connectivity. The resolution was to add the VLAN tag to the port group. I was unable to identify exactly why the script I wrote failed to create that port group properly.

I tried to re-create the issue but was unsuccessful. Thankfully, it was a quick fix. Lesson learned – work on better error handling and/or output for scripts to make things easier to track down.

Fifth Gotcha –

Simultaneous Purple Screens of Death on two upgraded hosts. The Exception 13 PSOD was essentially caused by differences in vNUMA handling between 5.5 and 6.5. I had tested vMotion between 5.5 and 6.5 hosts extensively to make sure something like this wouldn’t happen. What I didn’t test was a VM with a larger memory footprint, like Exchange or SQL Server. A VM spanning multiple NUMA nodes caused a PSOD on two hosts at the same time. The neat part was that the VMs that caused the PSOD never actually migrated to the 6.5 hosts and remained on their 5.5 source hosts.
This stopped my project pretty hard. VMware GSS confirmed that an upgrade from 6.5 GA to at least 6.5a would have prevented this. This project was initially greenlit with 6.5 GA and I had not spent extensive hours testing with Update 1. Lesson learned – RTFM. This is listed as a Known Issue and I missed it.

Sixth Gotcha – 

Upgrade of a host failed on reboot when the host was unable to locate installation media. After multiple attempts to upgrade, I determined that the install media had died an untimely death. Thankfully, this didn’t result in an outage or a constrained cluster.
Ultimately, this would have happened at the next patch interval or whenever the host rebooted again. It wasn’t at all unique to the project, but was included in my retrospective anyway.

Seventh Gotcha – 

With all other systems upgraded to Update 1, I needed to upgrade my VCSA to Update 1 (from GA). My VCSA was configured for VCHA, and I was able to successfully upgrade both Passive and Witness nodes without issue. I manually initiated a failover to the Passive node and vCenter just never came back.
I found that the VCSA had received a different IP than the static IP set on my Active node, but I have yet to determine why. There’s a KB article about the vcha user password expiring and causing replication to fail, but I had already visited that and did not have any database replication errors.
The resolution was to log in to the VAMI and re-assign the proper static IP. When I first did this, all of my hosts showed as disconnected when I logged in to the Web Client. After rebooting both Active and Passive nodes simultaneously (defeating the purpose of VCHA entirely), the VCSA came back properly. That said – I still have an open case with VMware GSS.

Conclusion – 

There’s a handful of things to be said about the project, and I think the largest one is configuration management. I’ve struggled with Host Profiles and recently lamented about this in the vExpert Slack. Even with Host Profiles, at least two of the three configuration issues would still have been encountered!

Overall the project was very successful, and this was one of the smoothest “upgrades” I’ve seen (or been the lead on, for that matter).

What have your experiences with upgrading to 6.5 been? Did you face any of these issues?

vSphere Migration – What I Did (Or Oh No… What Did I Do!?)

A quick post about my recent vSphere Upgrade!

Earlier this month, I began working on a migration project to move production systems from vSphere 5.5 Update 3 to vSphere 6.5. There are many reasons I wanted to do so. I’ll be honest, I’m one of the few that actually wanted to get back to the Web Client. At my last job, I used the Web Client exclusively, and it’s been really tough to go back to the C# client these last many months. A few other things? The Content Library and vCenter High Availability are must-haves, in my opinion.

So many options!

Option 1: Upgrade my existing environment to 6.5 with a Windows vCenter and external vCDB

Option 2: Migrate2VCSA – solid, viable choice

Option 3: Fresh environment beside the older environment and manual migration

Option 4: Wait until 5.5 hits End of Life and cry when that day hits

There’s probably a handful more options, but I think these four are likely the most popular (with more than we all care to admit taking Option 4).

Quick Option Breakdown:

In my opinion, Option 1 isn’t all that appealing. VMware has already announced the deprecation of the Windows vCenter. On top of that, a distributed Windows deployment is a nightmare – no thanks.

Option 2 brings me to the VCSA, which VMware has identified as the direction that will continue to receive development resources. An added plus is that I can use VCHA with the Appliance. The con here is that I’d bring over all of the things that shouldn’t be there: stale permissions that should be removed, or weird vCenter configurations that I haven’t happened upon yet. Not my cup of tea.

Option 3 offered me the ability to do things slowly with more consideration. This is it. This is the one. I was able to set up a new distributed deployment of the PSC and deploy VCHA for the vCenter Appliance. I was able to re-create and organize my Virtual Machine folder hierarchy. I was able to create a better logical design of the virtual infrastructure and use that to guide the migration process.

The best part – it’s all mine! I understand the decisions made. I was able to document permissions and age out stale ones (finding one or two that weren’t, in fact, stale but seldom used). I can say that I know the environment inside and out.

Option 4 was listed as comedic relief, but I’m well aware that there’s some organizations that WILL stay on 5.5U3 right up until it is no longer supported.


A little about the process:

This migration has been a long time in the making. Many months ago I wrote about a script to move DVS port groups to VSS port groups. It was AWESOME to finally see that thing run for its intended purpose. It worked flawlessly!

Essentially, I ran one script to copy all DVS port groups to VSS port groups and flip virtual machine networking. Once I verified network connectivity, I consumed the host into my new environment. Once an entire cluster was in the new environment, I ran another script to create a DVS from the VSS port groups. I then manually updated permissions and organized VMs. Were there extra steps? There sure were. Could I have done things a different way? Absolutely! And none of them would have been wrong.
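
Here’s a rough PowerCLI sketch of that second step, re-creating distributed port groups from the standard-switch port groups. Names are placeholders, it assumes an existing Connect-VIServer session, and it’s only an illustration rather than the actual script.

# Re-create one host's standard-switch port groups on a new distributed switch
$vmhost = Get-VMHost 'esxi01.example.com'
$vds    = Get-VDSwitch 'NewProd-DVS'

foreach ($pg in Get-VirtualPortGroup -VMHost $vmhost -Standard) {
    if (Get-VDPortgroup -VDSwitch $vds -Name $pg.Name -ErrorAction SilentlyContinue) { continue }
    if ($pg.VLanId -gt 0) {
        New-VDPortgroup -VDSwitch $vds -Name $pg.Name -VlanId $pg.VLanId | Out-Null
    } else {
        New-VDPortgroup -VDSwitch $vds -Name $pg.Name | Out-Null
    }
}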

After iterating through all of the clusters, assigning permissions, putting VMs into folders, and allowing some time for the systems to stew, I began the upgrade process. The upgrade was the easiest part and another one of the reasons I wanted to use the VCSA – built-in vSphere Update Manager.

In the vExpert Slack and on Twitter, I’ve seen horror stories of VUM in both Windows and VCSA versions. I’ve never had an issue! This time around was no different. The upgrade felt like it was the most time-consuming part of the entire process… host reboots take forever.

Don’t let me fool you – I ran into some issues with the upgrade. Despite hours of testing the process, it just didn’t go perfectly. I beat myself up for it quite a bit, but I learned from it.

I just finished writing a pretty detailed retrospective on the issues I saw when upgrading. I’ll give an overview of those in another post.
How would you have chosen to upgrade? What would you have done to make things easier? What design concepts do you think I missed? Let me know!


Google Chrome Flash Fix for vSphere Web Client

Sometime during the middle of last week, my Google Chrome updated and made it so that I was unable to access vCenter via the vSphere Web Client. Currently, my Chrome is Version 62.0.3202.62. During this update, the embedded Adobe Flash Player was updated as well. The result was less than stellar:



Image stolen from @lamw‘s blog post (listed below) because I fixed all my stuff before I screenshot it…

In a blog post (which has now been updated to reflect the fix), William Lam provided instructions on how to use an older version of the pepflashplayer.dll to re-gain access to vCenter via the vSphere Web Client.

Good news! Adobe has released a newer version of Flash Player and that quick fix is no longer required. Note: At this time, it does not appear that Chrome itself has been updated.

To update the Chrome Flash component manually:

  • Open Chrome and enter chrome://components in the address bar
  • Scroll down to Adobe Flash Player and note the currently installed version
  • Click Check for Updates and watch Flash Player update to the newer version (shown below)
  • Connect to vCenter and test



This has worked for me and many co-workers so far. If you have a different experience, please share!

Update: New installs of Chrome will automatically download the newest version of Adobe Flash Player.




vRealize Automation 7.3: Install, Configure, Manage (BETA) Review

Here’s the scenario:

I’ve recently been introduced to vRealize Automation at work. We’re entering the tail end of the roll-out and I’ve only just begun to understand exactly how to interact with it. I understand why an organization might want to use the product, but I’ve just never been at a place in my career to get a good hold on it.

Fast-forward only a moment or two in time. My Principal Virtualization Engineer, the guy leading the vRA deployment project, is moving to a different role in the organization. vRA is left to me to support, but I have no idea what I’m doing. Time to see what VMware has for training! Registration for vRealize Automation 7.3 Install, Configure, Manage (BETA) complete.

Note: The Beta for this course is no longer available as it is now Generally Available. Check out available Beta courses here.

Present day:

Last week, I had an opportunity to take the vRA 7.3: ICM Beta class. Initially, I had some reservations about taking a Beta training course. I wasn’t exactly sure what I should expect, but I was pleasantly surprised by the delivery and material. Where some may have been disappointed, I found it to be more helpful overall.

I don’t want to go into detail about the course materials. If you’re familiar with vRA, you can read about new features. If you’re not… attend a Beta! Let’s discuss things that I felt were beneficial about taking the Beta version of the course.

Price –

I’m not sure whether Beta courses vary in price, but this ICM only cost me 50% of what it normally would have. The $2k price tag is still more than I would want to shell out if I were covering it by myself, but the price point is no longer insurmountable. Better yet, the price made it an easier argument for the boss. I list this first because, when comparing all other benefits, this really is a substantial one. Not being able to attend… Well, it really makes getting any benefit out of the training difficult.

Instructors –

On the first day of training, my instructor, Brian Watrous, gave a good explanation of how the Beta class would be run. Brian is the Lead Instructor for the course, and it was very clear that he’s extremely knowledgeable and passionate about the product. (His blog is solid, too.) Daniel Crider, who developed the course, created the lab scenarios, and built the lab environment, joined Brian for the delivery of the Beta.

Essentially, we were going to be seeing a yet-to-be-final version of the course. Daniel was on-hand to take suggestions and feedback from Brian and other attendees on the lecture and lab materials while also offering his fair share of knowledge about the product. I don’t feel that I can adequately capture how unique the instructors made this training feel. I’m sure that this was a result of the Beta and am even more pleased that I attended as a result. The two instructors were excellent.

Attendees –

One of the more interesting aspects of the class was that there were a bunch of VMware Certified Instructors attending with me. As a result, there were some interesting dynamics in the classroom that I don’t think I would have otherwise experienced. I was learning about vRA for the first time among others who were probably learning the “What’s New” pieces of 7.3 (likely having conducted training sessions of their own on previous versions of the product).

Having never seen most of the product before, I asked my fair share of questions. I received answers from the instructors and/or other attendees. In retrospect, I think I may have been the guy that everyone got annoyed with… I’ll choose not to dwell on that.

Labs –

I mentioned before that Daniel had created the lab scenarios and the lab environment. The Beta class saw a new, never-before seen lab on using Storage Policy Based Management in vRA. We almost saw a lab on Containers! Turns out that several layers of abstraction may make certain things difficult (note: containers are one of those certain things).

To me, labs can be difficult to take. They’re very detail-oriented and written so that anyone can complete them. It’s very easy to find yourself clicking through the steps and not really understanding what you’re doing or why you’re doing it. As a result, it’s just as easy to leave the course not really having learned anything.

The labs weren’t perfect. Not everything worked as it was intended to. Some things broke and needed some troubleshooting to fix. I don’t say these things as a negative. This was a core piece of my learning experience – troubleshooting a vRA deployment that should be working. The beauty of this was the guidance from two very knowledgeable instructors. I even managed to fix some things on my own which made me feel pretty good at the end of the day!

As an interesting tidbit – Our final lab was learning how to install vRealize Automation. As Brian quipped, “You could argue that this course should be titled, ‘Configure, Manage, Install’ but it doesn’t sound as good.”

Summary (TL;DR) –

The Beta course was excellent! My instructors were extremely well-versed in both using and teaching how to use the software. While the labs were troublesome at times, I spent a good amount of time in and troubleshooting the software. I learned a lot.

At the end of the day, I really feel that the training gave me the knowledge and tools that I need to support my organization’s deployment. Only time will tell. I feel good about the challenge and hope I can mature the deployment well.