Archives For VMware

Tips when preparing for the 6.1 upgrade:

  • You can typically find good blog posts with step-by-step guides for this type of work.  It can also be helpful to do a quick search to see which KBs have come out so far for the new release. Get to googling!
  • There are some very friendly folks in the virtualization community that enjoy sharing their tech experiences with others, so keep an eye on social media chatter regarding the upgrade (ahem, Twitter).
  • Test the upgrade in the lab. (Unfortunately, if your lab is not identical to production, you may not be able to test for all potential issues. In my case, the biggest bug I hit did not occur in the lab but did show up in the production environment.)
  • Snapshot, snapshot, snapshot. Especially the IaaS server. And back up the IaaS DB.

A few issues to be aware of when upgrading to 6.1:

Issue #1 –  If you try to upgrade the ID appliance and get “Error: Failed to install updates (Error while running installation tests…” and you see errors in the logs about conflicts with the certificate updates, try the following:

SSH to the ID appliance and run rpm -e vmware-certificate-client. Then try running the update again. Thanks to @stvkpln for sharing the fix.

Issue #2 – If you are going through the IaaS upgrade and get the following error near the end of the wizard (before the upgrade install even begins):


Check to see how many DEM registry keys you have in the following location:  HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\VMware, Inc.\VMware vCloud Automation Center DEM

If you see extra DEM or DEO keys (i.e. you only have 2 DEM workers installed on the server but you see 3 DEM worker keys), this may be related to your issue.
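The key count check described above can be sketched in a few lines of Python. This is only an illustration: it assumes you have already written down the role of each instance key found under the DEM registry path and the role of each DEM actually installed. The role labels ("worker", "orchestrator") and the function name are hypothetical, not anything vCAC itself uses.

```python
from collections import Counter

def extra_dem_keys(registry_roles, installed_roles):
    """Return, per role, how many more registry keys exist than installed DEMs."""
    in_registry = Counter(registry_roles)
    installed = Counter(installed_roles)
    return {
        role: in_registry[role] - installed[role]
        for role in in_registry
        if in_registry[role] > installed[role]
    }

# Hypothetical example: 2 DEM workers and 1 DEO installed, but 3 worker keys present.
registry_roles = ["worker", "worker", "worker", "orchestrator"]
installed_roles = ["worker", "worker", "orchestrator"]
print(extra_dem_keys(registry_roles, installed_roles))
```

An empty result means the key counts match what is installed, so the duplicate-key workaround probably does not apply.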


Option 1 (remove duplicate keys):

  • Export the DEM registry key to back it up: HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\VMware, Inc.\VMware vCloud Automation Center DEM
  • Check the contents of the registry keys that match your installed DEMs for version and install path information.
  • Remove the duplicate DEMInstance keys for the DEO and DEM.
  • Run the upgrade.

Option 2 (remove/reinstall):

  • Remove all DEMs from the machine
  • Remove the DEM registry keys
  • Run the upgrade
  • Install the 6.1 DEMs with the same names as the 6.0 DEMs

I would recommend going with Option 2 (especially if it is difficult to confirm by looking at the contents which keys match the installed DEMs). Thanks to @virtualjad and VMware engineering for sharing the workaround.

Issue #3 – Make sure to import the vCO package update for vCAC 6.1, as mentioned in KB 2088838, especially if you use stub workflows.


…written from the perspective of a Virtualization Engineer.  A very special thanks to Networking Guru @cjordanVA for being a key contributor on this post.

Overlay Transport Virtualization (OTV), which is a Cisco feature that was released in 2010, can be used to extend Layer 2 traffic between distributed data centers.  The extension of Layer 2 domains across data centers may be required to support certain high availability solutions, such as stretched clusters or application mobility.  Instead of traffic being sent as Layer 2 across a Data Center Interconnect (DCI), OTV encapsulates the Layer 2 traffic in Layer 3 packets.  There are some benefits to using OTV for Layer 2 extension between sites, such as limiting the impact of unknown-unicast flooding.  OTV also allows for FHRP Isolation, which allows the same default gateway to exist in the distributed data centers at the same time.  This can help reduce traffic tromboning between sites.

When planning an OTV implementation in an enterprise environment with existing production systems, here are a few things to include in the testing phase when collaborating with other teams:

  • Set up a conference call for the OTV implementation day and share this information with the infrastructure groups involved in the implementation and testing, i.e. Network, Storage, Server, and Virtualization engineers.  This will allow the staff involved to communicate easily when performing testing following the change.
  • Test pinging physical server interfaces by IP address at one datacenter from the other datacenter, and from various subnets.  Can you ping the interface from the same site, but not from the other site? (Make sure to establish a baseline before implementation day.)  Is your monitoring software at one site randomly alerting that it cannot ping devices at the other site?
  • If your vCenter Server manages hosts located in multiple data centers, was vCenter able to reconnect to ESXi hosts at the other datacenter (across the DCI) after OTV was enabled?
  • If you have systems that replicate storage/data between the data centers, test this replication after OTV is enabled and verify it completes successfully.
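Comparing the ping baseline against the results gathered after OTV is enabled can be sketched as follows. This is a minimal illustration, assuming the ping results have already been collected into dictionaries that map a target to whether it answered; the addresses and the function name are made up.

```python
def reachability_regressions(baseline, after):
    """Return targets that answered pings in the baseline but not after the change."""
    return sorted(
        target
        for target, reachable in baseline.items()
        if reachable and not after.get(target, False)
    )

# Hypothetical results: True means the target answered pings.
baseline = {"10.1.1.10": True, "10.1.2.20": True, "10.2.1.30": False}
after = {"10.1.1.10": True, "10.1.2.20": False, "10.2.1.30": False}
print(reachability_regressions(baseline, after))
```

Targets that were already unreachable in the baseline are ignored, which is why establishing the baseline before implementation day matters.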


Be aware of a couple of gotchas:

ARP aging timer/CAM aging timer – Make sure to set the ARP aging timer lower than the CAM aging timer to prevent traffic from getting randomly blackholed.  This is an issue to watch out for if OTV is being implemented in a mixed Catalyst/Nexus environment, and will not likely be an issue if the environment is all Nexus.  The default times for the aging timer depend on the Cisco platform.  The default for a Catalyst 6500 is different than the default for a Nexus 7000.

Symptoms of an aging timer issue:  You will more than likely see failures during the ping tests mentioned above, or you may see intermittent issues with establishing connectivity to certain hosts.
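The timer relationship can be expressed as a simple check. The sketch below is illustrative only; the timer values are commonly cited platform defaults used here as assumptions, so confirm the real values on your own devices before drawing conclusions.

```python
def arp_expires_before_cam(arp_timeout_s, cam_aging_s):
    """True when the ARP entry ages out (and is refreshed) before the CAM/MAC entry."""
    return arp_timeout_s < cam_aging_s

# Assumed example values in seconds; verify against your own platforms.
timers = {
    "catalyst-like": {"arp": 14400, "cam": 300},
    "nexus-like": {"arp": 1500, "cam": 1800},
}

for platform, t in timers.items():
    ok = arp_expires_before_cam(t["arp"], t["cam"])
    print(platform, "OK" if ok else "ARP timer should be lowered below the CAM timer")
```

With these assumed values, only the first platform would need its ARP timer lowered, which matches the observation that the problem tends to show up in mixed Catalyst/Nexus environments.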

MTU Settings – Since OTV encapsulation adds bytes to each packet and also sets the Don’t Fragment (DF) bit, the larger encapsulated packets cannot be fragmented in transit, so a larger MTU will need to be configured on every interface along the path of an OTV encapsulated packet.  Check the MTU settings prior to implementation, and again if issues arise when OTV is rolled out.  If the MTU settings were properly configured and issues are still encountered, consider rebooting the OTV edge devices as a troubleshooting step to verify the settings actually applied and did not get stuck (it’s happened).

Symptoms of an MTU-related issue:  If you have a vCenter server in one data center that manages hosts at the other datacenter, it may not be able to reconnect to the hosts at the other data center.  Storage replication may not complete successfully after OTV has been enabled.
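The MTU arithmetic can be sketched as below. The 42-byte overhead is the figure commonly cited for OTV encapsulation and is treated here as an assumption; the interface names and MTU values are made up for illustration.

```python
OTV_OVERHEAD_BYTES = 42  # assumption: commonly cited OTV encapsulation overhead

def required_transport_mtu(payload_mtu, overhead=OTV_OVERHEAD_BYTES):
    """MTU needed on the transport path to carry an encapsulated packet unfragmented."""
    return payload_mtu + overhead

def undersized_hops(path_mtus, payload_mtu, overhead=OTV_OVERHEAD_BYTES):
    """Return the interfaces on the path whose MTU is too small."""
    needed = required_transport_mtu(payload_mtu, overhead)
    return [name for name, mtu in path_mtus.items() if mtu < needed]

# Hypothetical path between the two OTV edge devices.
path_mtus = {"edge-a": 9216, "core-a": 1500, "dci": 1542, "edge-b": 9216}
print(required_transport_mtu(1500))
print(undersized_hops(path_mtus, payload_mtu=1500))
```

Because the DF bit is set, a single undersized hop is enough to drop full-size packets while small packets such as pings still get through, which is part of what makes this failure mode confusing.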

This post is a bit late since vCAC 6.0.1 (Service Pack 1) was just released.  However, I wanted to share some of the issues I came across during the installation and setup of vCloud Automation Center (vCAC) 6.0.  I have not yet had the opportunity to upgrade to 6.0.1, but I’m hoping one or more of the issues below has been fixed or at least identified.

      • After setting up two identity stores on the vCAC Identity/SSO appliance, one for a parent domain and one for a child domain, I had an issue authenticating to the parent domain when identity stores used LDAP ports 389 or 636.  The issue only occurred when the user had an account in both domains and the username was the same for both.  No longer had this issue when switching to LDAP Global Catalog ports 3268 or 3269.  (Verified that there was no issue authenticating and binding to the same domain controller using the same service account via ports 389 and 636 when testing with ldp.exe.)
      • Have not found documentation for changing the vCAC service account password.  This is assuming the same service account is being used for four vCAC IaaS services, one or more vCAC identity stores, and vCAC endpoint credentials. When I attempted to change the password for all of these, it broke vCAC, forcing me to revert the IaaS server back to its original state and reinstall the IaaS components.  Note: This brings me to some of the best advice I can give someone performing a vCAC installation – SNAPSHOT THE IaaS SERVER!!  I usually take a snapshot once before the pre-reqs, and once before installing the actual components.
      • Service Account used for vCAC endpoint credentials cannot use a password containing ‘=’ sign at the end.
      • Cannot add Active Directory security group that contains spaces to vCAC for assigning permissions.
      • When adding Active Directory security groups to vCAC to assign permissions for Business Groups, vCAC is not able to “pull up”/discover the group  (like it does for domain user accounts).  It does, however, work, provided the group really exists and the group name does not contain spaces.
      • When using a vCloud Suite Standard license, there is no option in the GUI to add a vCO Endpoint.  This was a big one for me.

If you run VMware on HP ProLiant servers, then you are probably familiar with HP’s VMware download site.  In addition to HP customized VMware ESXi ISOs and software bundles, this site also has what HP refers to as VMware firmware and software “recipes”.  The “recipes” list the drivers and firmware that HP recommends running along with a specified Service Pack for ProLiant (SPP) and certain ESXi versions.  While applying newer firmware and drivers to HP Blade enclosures can be a pain, it’s a good idea to perform these updates 1-2 times a year since each SPP is only supported for 1 year.

Stacy’s Example:

In the following example, I used the September 2013 “recipe” to apply updates to HP C7000 Blade Enclosures that were already running ESXi 5.0 Update 2 hosts.  There is more than one way to apply these updates, but this is the method I found the easiest.

  • Each HP Blade Enclosure was updated one at a time.
  • For each enclosure, updates were applied to the Onboard Administrators, the Virtual Connect Flex-10 Ethernet modules, and the blades themselves.  (FC switches in the enclosures were handled separately.)
  • Performed the steps detailed below for each enclosure.
  • Note: If your hosts have FC HBAs, check with your storage vendor as well to see if they support the new HBA firmware/drivers.

Blade Driver Updates – VUM

  • Created new VMware Update Manager (VUM) HP Extension/driver baselines based on the September 2013 HP “recipe”.  Reviewed host hardware for each cluster (i.e. looked at network adapters, RAID controllers, latest offline bundle, etc.) to determine the appropriate drivers to include in the baselines.
  • Attached the appropriate baselines to the appropriate clusters (again based on the hardware for each cluster and the “recipe”), and scanned.
  • Placed all ESXi hosts in the enclosure to be updated in maintenance mode. (It’s great if you are able to shut down and update all blades in the enclosure at once, but not everyone will have this luxury.)
  • Suspended alerting for hosts.
  • Remediated the hosts in the blade enclosure using the VUM baselines (Host Extensions).

Blade Firmware Updates – EFM

  • Used the Enclosure Firmware Management (EFM) feature to update blade firmware.  EFM can mount an SPP ISO via a URL; in this case the ISO was hosted on an internal server running IIS.  Prior to updating blade firmware, updated the SPP ISO on the IIS server and re-mounted the ISO in EFM.
  • Shut down hosts (which were still in maintenance mode) using the vSphere client.
  • Once hosts were shut down, used the HP EFM feature to manually apply firmware updates.
  • After the firmware updates completed (could take an hour), clicked on Rack Firmware in the OA and reviewed the current version/Firmware ISO version.

Virtual Connect (VC) and Onboard Administrator (OA) Updates – HPSUM from desktop

  • Temporarily disabled the Virtual Connect Domain IP Address (optional setting) in the Virtual Connect Manager in order for HPSUM to discover the Virtual Connects when the Onboard Administrator is added as a target (yes, HP bug workaround).
  • Ran HP SUM from the appropriate HP SPP from desktop.
  • Added Active OA hostname OR IP address as a target, chose Onboard Administrator as type.
  • Blade iLO interfaces, Virtual Connect Manager, and FC Switches were all discovered as associated targets by adding the OA.  For associated targets, de-selected everything except for the Virtual Connect Manager and clicked OK (the iLO interfaces for the blades were updated along with the rest of their firmware using the EFM, and the FC Switch firmware is handled separately).
  • The Virtual Connect Manager may then show as unknown in HPSUM.  Edited that target and changed target type to Virtual Connect, and entered the appropriate credentials.
  • After applying updates to the OAs and VCs, verified they updated to the correct firmware levels.
  • Re-enabled the Virtual Connect Domain IP Address setting.
  • Re-enabled alerting.

Hi folks – It’s time to take a quick break from the excitement of the vSphere 5.5 and VSAN announcement to read a blog post about vSphere Metro Storage Clusters (vMSCs aka stretched clusters)!  Specifically, this post is about what I’ve learned in regards to vMSC workload mobility between sites, or downtime avoidance.  Since my vMSC experience is solely based on the NetApp MetroCluster solution, the content below is NetApp-centric.

To take a step back – When I look at the scenarios that would cause all VMs running on a stretched cluster to completely vacate all hosts at one of the sites and (eventually) end up at the other site, I see two major types of events:

Unplanned Site Failover (Disaster Recovery)

  • Example:  Site A, which hosts half of a stretched cluster, goes completely down. This is an unplanned event, which results in a hard shutdown of all systems in the Site A datacenter.  Once the Site A LUNs are taken over by the Site B controller and fully available, VMs that were running at Site A need to be started at Site B.  Some would argue the DR process should be triggered manually (ie without MetroCluster TieBreaker).  The following doc is a great reference for testing vMSC failure or DR scenarios if you’re doing a proof of concept:

Planned Site Failover (Disaster/Downtime Avoidance)

  • Proactive non-disruptive migration of VM workloads and storage from Site A to Site B.  Performing this work non-disruptively is one of the benefits of a vSphere Metro Storage Cluster.  If equipment needs to be powered down at one of the physical sites (ie. for site maintenance or impending power outage scenario described in Duncan Epping’s blog post), this can be done without downtime for VMs on a stretched cluster.

If you have hundreds of VMs and multiple stretched clusters, it is important to plan and document the steps for these scenarios.  Since I could not find specific VMware documentation discussing the Planned Failover scenario in detail, I wanted to share an example of how this can be performed.  These steps happen to be for a 5.0 stretched cluster environment with one or more NetApp Fabric Metroclusters on the backend.

The following is an example of the process that can be used to non-disruptively fail over storage and VMs from site A to site B, and then fail everything back to site A.  This process could be different depending on your storage solution, or how many VMs you have hosted on your stretched cluster(s).  The steps on the VMware side could of course be scripted, but I am listing out the manual steps.  If you have multiple stretched clusters, you can perform VM migrations for the clusters simultaneously, depending on available resources/bandwidth.  *Note – If it’s within the budget, 10Gb NICs can make a huge difference in how quickly you can complete the vMotions.


Preparation – Document the steps beforehand, including an estimated timeline.  If you are in an environment where infrastructure management responsibilities are divided between various teams, meet with other teams to discuss plans and share documentation. Review NetApp KB 1014120 titled “How to perform MetroCluster failover method for a planned site-wide maintenance not requiring CFOD”.


  1. Fail over the MetroCluster(s) from site A to site B using the steps in the NetApp KB, including offlining plexes and performing a cf takeover.
  2. Once it is confirmed that storage has successfully been failed over, you can begin the VM migrations.
  3. Verify that DRS is in fact set to Fully Automated.
  4. For each stretched cluster, edit the cluster settings and modify the DRS Affinity “should” rule that keeps VMs at site A.  Change the Affinity rule so that it contains the Site B Host Affinity group instead of the Site A Host Affinity group.  Within 5 minutes, DRS should kick off the vMotions for the VMs in the associated VM Affinity group.  You can run DRS manually if short on time.
  5. Once you confirm all vMotions were successful, place the hosts in site A in maintenance mode.
  6. Document failover times and keep an eye on monitoring software for any VM network connectivity issues.


  1. Fail back the MetroCluster(s) from site B to site A using the steps in the NetApp KB, including Online Plexes/Resync and Giveback.
  2. Once it is confirmed that storage has been successfully failed back and synced, you can begin the VM migrations.
  3. Remove the hosts in site A from maintenance mode.
  4. For each stretched cluster, edit the cluster settings and modify the same DRS Affinity “should” rule that was modified during Failover.  Change the Affinity rule so that it contains the original Site A Host Affinity group.  Within 5 minutes, DRS should kick off the vMotions.
  5. Document failover times and keep an eye on monitoring software for any VM network connectivity issues.

For those in IT that remember life before virtualization, it is exciting to see this in action and confirm that storage and compute for hundreds of VMs can be non-disruptively failed over to a site kilometers away in just an hour.  As always, feel free to leave a comment if you have feedback.

Now that I’ve had a few days to recover, I wanted to share my experience from my trip to VMworld 2013.  After having such an amazing time last year, I decided that attending the 10th annual VMworld would be in my best interest, even if it meant “paying my own way”.  It turns out this was a good call!  Unlike last year, I had a better idea of what to expect during my time at the conference.

I arrived in San Francisco on Sunday, August 25th, just in time to participate in the v0dgeball fundraising event.  (Thanks again to @CommsNinja (Amy Lewis) for letting me play for the Cloudbunnies #FearTheEars!)  This was a good opportunity to help raise money for The Wounded Warrior Project playing dodgeball with a bunch of folks in the VMware community.  It turns out that my dodgeball skills are about as good as they were back in 3rd grade, not much improvement there 🙂 Thankfully, I made it through without injury and had a fun time in the tournament. Congrats to the EMC team on the victory!

I missed the VMworld Opening Ceremony, but fortunately after getting a bit lost I made it to the VMunderground party.  Great event for networking and catching up with everyone in the community!

On Monday, I attended the 1st Keynote where VMware announced the release of vSphere 5.5 and vCloud Suite 5.5.  VMware continued to talk about the path toward the Software-Defined Datacenter (SDDC) and the latest features included with 5.5.  I won’t go into the details since there are several bloggers that did a great job posting live blogs of the keynotes.  I will say that many of the announcements made during this keynote were not a surprise.  I spent most of the rest of the day attending sessions.  I really enjoyed the “group discussion” on HA with Duncan Epping and Keith Farcas; it was nice to give feedback, hear from peers and learn about possible futures. Later that evening, I made my way to CXIParty.  @CXI (Christopher Kusek) did a great job putting this together for the community.

On Tuesday, I was able to catch the last part of the 2nd Keynote with “Carl and Kit”.  My favorite part of this general session was the vCAC demo, since I will be building out a Proof Of Concept environment for vCAC when I return to work.  Like many other VMware customers, I am looking at how certain automation and management tools can bring an organization beyond basic virtualization and into a private cloud solution.  I also attended the “Ask the Expert vBloggers” session, which I enjoyed just as much as last year.  Later that evening, I had a great time attending the Veeam and vBacon parties.

Most of my Wednesday was spent preparing for and taking the VCAP5-DCA exam.  I’ll save that experience for a different post, but this may be the last time I mix cert exams with VMworld (a bit too much excitement all at one time).

Wednesday night was the VMworld 2013 party.  VMware did an awesome job putting this party together!  Imagine Dragons and Train performed at SF Giants stadium (AT&T park).  They basically threw a carnival in the stadium, along with a huge concert, and then topped it all off with fireworks at the end.  I was not too familiar with either of the bands before the party, but I became a fan of Imagine Dragons during their performance.  Not sure how VMware is going to top this one!

I was able to work on some Hands on Labs on Thursday morning before I left for SFO to head back home.  I did BYOD this year and would highly recommend going that route if you can.  Though I haven’t looked into it yet, I’m assuming I’ll (hopefully) be able to do many of these labs eventually online via Project NEE.

Overall, it was an outstanding VMworld trip!  Very grateful that I was able to catch up with friends I made last year and make new ones.

Without a doubt, troubleshooting storage performance issues can be a challenging part of any VMware admin’s job.  The cause of a VMware storage-related issue, in particular on a SAN, can be difficult to identify when the problem could be anywhere on the storage fabric.  Take your pick: host driver/firmware bug, bad host HBA, bad cable, bad FC switch port, wrong FC switch port setting, FC switch firmware bug, controller HBA bug, storage OS bug, misalignment…and the list goes on.  Here is an example of one experience I had when working on a VMware storage issue.


Intermittent, brief storage disconnects seen occurring on all VMware clusters/hosts attached via FC to two NetApp storage arrays.  When the disconnects occurred, they were seen across the hosts at the same time.  Along with the brief disconnects, very high latency spikes and a tremendous amount of SCSI resets were other symptoms seen on the hosts.  There was no obvious pattern – though it often seemed that the symptoms occurred more during the overnight hours, this behavior would also occur during the day.

The storage disconnects in vCenter looked like this in Tasks/Events for the hosts:

Lost access to volume xxxxxxxxxxxxxxxx (datastore name) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.

Seconds later…

Successfully restored access to volume xxxxxxxxxxxxxxxxx (datastore name) following connectivity issues.

These events were considered “informational”, so they were not errors that triggered vCenter email notifications, and if you weren’t monitoring logs, these could easily get missed.

Host logs:

A few different event types appeared in the logs, including a high number of –

H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0  DID_RESET

Latency spikes up to 1600 milliseconds via ESXTOP….



In collaboration with my colleagues on the storage side, we went through several troubleshooting steps using the tools we had available.  Cases were opened with both VMware and storage vendor.  Steps included:

  • Using the NetApp Virtual Storage Console (VSC) plugin, we used the online “functional” alignment feature to fix storage alignment for certain VMs in the short-term
  • Comparing VMware host and NetApp logs
  • Uploading tons of system bundles and running ESXTOP
  • Closer look at the environment – was anything being triggered by VM activity at those times, e.g. antivirus or resource-intensive jobs?
  • Reviewed HBA driver/firmware versions
  • Worked with DBAs to try and stagger SQL maintenance plans
  • Verified all aggregate snapshots except for aggr0 were still disabled on the NetApp side, since this had caused similar symptoms in the past. (Yep, still disabled)

After all of the troubleshooting steps above, the issue remained.  Here are the steps that finally led to the solution:

  • Get both vendors and customer together on status call – make sure everyone is on the same page
  • Run perfstats on the NetApp side and ESXTOP in batch mode on VMware side to capture data as the issue is occurring.
  • ESXTOP command: esxtop -b -a -d 10 -n 3000 | gzip -9c > /vmfs/volumes/datastorename/esxtopoutput.csv.gz  And of course, Duncan’s handy ESXTOP page helps with the analysis.
  • Provide the storage vendor with perfstats, esxtop files, and example times for when the disconnects occurred during the captures.
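With -d 10 and -n 3000, the esxtop capture above takes 3000 samples 10 seconds apart, so it covers 30,000 seconds, or roughly 8 hours and 20 minutes. Once the capture is unzipped, a rough first pass for latency spikes can be sketched in Python. This is only an illustration: it assumes the batch output is a CSV whose first column is the timestamp and whose latency counters contain "MilliSec/Command" in the column name. The host name, adapter name, and sample rows below are made up.

```python
import csv
import io

def latency_spikes(csv_text, threshold_ms, column_hint="MilliSec/Command"):
    """Return (timestamp, column, value) for every latency sample above threshold_ms."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    wanted = [i for i, name in enumerate(header) if column_hint in name]
    spikes = []
    for row in reader:
        for i in wanted:
            try:
                value = float(row[i])
            except (ValueError, IndexError):
                continue  # skip blank or malformed samples
            if value > threshold_ms:
                spikes.append((row[0], header[i], value))
    return spikes

# Hypothetical two-sample capture with a single latency column.
sample = "\n".join([
    r'"(PDH-CSV 4.0) (UTC)(0)","\\esx01\Physical Disk Adapter(vmhba1)\Average Guest MilliSec/Command"',
    '"06/01/2013 02:00:00","4.2"',
    '"06/01/2013 02:00:10","1600.0"',
])
spikes = latency_spikes(sample, threshold_ms=100)
print(spikes)
```

For a real capture, the text could be read with gzip.open(path, "rt") and passed to the same function. The timestamps it returns are the kind of example times that are useful to hand to the storage vendor along with the perfstats.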


NetApp bug #253964 was identified as the cause:  The long-running “s” type CP after a “z” (sync) type CP seen in the perfstat indicated we were hitting this bug.  The fix was to upgrade Data ONTAP to a newer version – 8.0.4, 8.0.5, or 8.1.2.  I am happy to confirm that upgrading Data ONTAP did resolve the issue.

Recently, I needed to move 100+ VMs in a VMware environment from an AMD cluster (4.x) to an Intel cluster (5.x). Here are the details and steps I took to accomplish this with PowerCLI:

Storage:  This work did not include moving the VM files to different datastores/LUNs or upgrading datastores, since the storage changes could be handled separately at a later time and do not require an outage.

Time constraints:  The migration could take no longer than about 1 hour, while the VMs were already down.  Due to my limited PowerCLI experience, I ended up dividing the VMs into three separate groups/scripts and running them concurrently.  This met the 1 hour requirement, which made everyone happy!

Networking:  The old cluster and new cluster were on different distributed vswitches, and the old cluster could not be added to the newer 5.x dvswitch.  I knew that this would cause a problem even with cold migrations.  Therefore, as part of the move, I needed to create a temporary 4.x “transfer” dvswitch with a “transfer” dvportgroup.  The idea of using a “transfer” dvswitch was taken from an awesome blog post.  I prepared three separate CSV files for the three different groups, with the old and new dvportgroup info.
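Splitting one master list of VMs into three input files can be sketched in Python as below. This is a hypothetical helper, not the method used for the actual migration: the column names (Name, VMhost, VLAN) and the sample rows are assumptions for illustration.

```python
import csv
import io

def split_rows(rows, groups=3):
    """Distribute rows round-robin so each group gets a similar number of VMs."""
    buckets = [[] for _ in range(groups)]
    for i, row in enumerate(rows):
        buckets[i % groups].append(row)
    return buckets

def to_csv(rows, fieldnames):
    """Render one group of rows as CSV text with a header line."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

# Hypothetical master list: VM name, destination host, destination port group.
fields = ["Name", "VMhost", "VLAN"]
master = [
    {"Name": f"vm{n:02d}", "VMhost": "esx-intel-01", "VLAN": "dvPG-100"}
    for n in range(1, 8)
]
buckets = split_rows(master)
print([len(b) for b in buckets])
print(to_csv(buckets[0], fields))
```

Round-robin keeps the groups close to the same size, which helps when the groups are run concurrently against a fixed time window.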

Disclaimer:  I’m a PowerCLI noob.  The PowerCLI scripts I share may be incredibly simple or not that exciting (to you).  However, if I’m posting it, that means it worked for me and I found it useful.  🙂

After searching the Interwebs to see if others had performed similar work, I found a helpful discussion in the VMware Communities Forum.  Thanks to “bentech201110141” and Luc Dekens, I was able to use the PowerCLI script mentioned there and modify it to fit the solution.

The Logic:

For each VM in the AMD cluster, specified in CSV file –

  • Change dvportgroup for VM to port group on “transfer” dvswitch
  • Move VM to destination host
  • Change dvportgroup for VM to the correct port group on destination dvswitch
  • Start VM

PowerCLI Script:

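# Each CSV row needs the VM name, the destination host (VMhost), and the destination port group (VLAN).
# Assumption: the VM name column is called "Name"; the original column name was lost from this page.
$vms = Import-CSV c:\csmoveinput
foreach ($vm in $vms) {
    $VMdestination = Get-VMHost $vm.VMhost
    $Network = $vm.VLAN
    # Move the VM's network adapters to the port group on the "transfer" dvswitch
    Get-VM -Name $vm.Name | Get-NetworkAdapter | Set-NetworkAdapter -StartConnected:$true -Confirm:$false -NetworkName dvTransfer
    # Cold-migrate the VM to the destination host
    Move-VM -VM $vm.Name -Destination $VMdestination
    # Connect the adapters to the correct port group on the destination dvswitch
    Get-VM -Name $vm.Name | Get-NetworkAdapter | Set-NetworkAdapter -StartConnected:$true -Confirm:$false -NetworkName $Network
    Start-VM -VM $vm.Name
}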

The idea of doing a little blogging first crossed my mind around this time last year.  In August 2012, I attended VMworld for the first time ever.  That amazing trip was made possible by a contest I won, thanks to vExpert Greg Stuart, a panel of judges, and a few generous sponsors.  For those that are looking for a way to get to VMworld, I encourage you to keep an eye on social media and community websites in the future.  Don’t pass up the chance to enter these contests, because after all, if I can win, so can you!

If you’re interested in reading about that experience from last year, here are some links:

During the conference, I was able to meet quite a few bloggers/vExperts in the VMware community and talk with them.  It was great being able to share and learn from others working with virtualization and server infrastructure.  That trip inspired me, and after almost a year, I decided it was time to join the online blogging community.  I’ll be writing about my experiences with virtualization, servers, and…since this is my personal blog, whatever else I find is blog worthy.  Like many others with personal tech blogs, I’m hoping one of my posts will save someone the time it took for me to find a resolution to a technical issue.  If nothing else, it will be great to have my own personalized tech knowledgebase available, whenever I need it.  Don’t hesitate to leave a comment or give feedback on a post.  Keep comments polite, no trolls or IT snobbery please – ain’t nobody got time for that!  Welcome all, and thanks for reading.