Archives For VMware

If you run at least some of your ESXi clusters on blades and you have multiple chassis/enclosures, you may choose to distribute the hosts in these clusters across multiple enclosures.  This is a good idea, in general, for many environments.  After all, though blade enclosure failures are rare, you don’t want a critical issue with a single enclosure taking down an entire cluster.

Recently, I saw one of these rare events impact an enclosure, and it was not pretty.  Sys admins know the feeling: alerts about multiple hosts being “down” come streaming into your inbox, and you start asking which hosts, which cluster, perhaps even which site…any commonality?  In this case, the following bug had just hit an HP enclosure:

This enclosure was NOT extremely out-of-date in terms of firmware and drivers.  The firmware was at a February 2013 SPP level, and the hosts were built from the latest HP ESXi 5.0 U2 customized ISO.

Here is a summary of what was seen when troubleshooting the issue for the impacted enclosure:

  • Both the Onboard Administrators and the Virtual Connect Manager were still accessible – somewhat.  See next bullet.
  • Virtual Connect Manager could be logged into, but was slow to respond.
  • Virtual Connect Manager showed the “stacking link” in a critical state.
  • Virtual Connect Manager also showed the 10Gb aggregated (LAG) uplinks were in an active/passive state as opposed to active/active, which is how they were originally configured.
  • None of the hosts in the enclosure could be pinged.  That is, every single blade lost network connectivity.  They still had FC connectivity to the FC switches.
  • Some of the ESXi hosts were still running, and some had suffered PSODs as a result of the bug.
  • Hosts that were still up eventually saw themselves as “isolated”.  Since the isolation response was set to “Shut down”, impacted VMs (luckily, not that many) were shut down and restarted on non-isolated hosts.
  • We exported a log from the Virtual Connect Manager, and HP helped identify the blade triggering the bug.  That host was shut down and the blade itself was “reset”; however, this did not restore normal functionality.
  • We reset one of the Virtual Connect modules, which restored network connectivity for some of the blades, but not all.
  • Some of the blades had to be rebooted for network connectivity to be completely restored.
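Since the HA isolation response determined what happened to the VMs during this outage, it is worth auditing how that setting is configured across all of your clusters. Here is a minimal PowerCLI sketch, assuming an existing Connect-VIServer session (the cluster name in the commented example is hypothetical):

```powershell
# List each cluster's HA isolation response so you know how VMs will react
# to a network-isolation event like the one described above
Get-Cluster | Select-Object Name, HAEnabled, HAIsolationResponse | Format-Table -AutoSize

# Example: switch a cluster to leave VMs powered on during isolation instead
# of shutting them down (cluster name "Prod-Cluster01" is hypothetical)
# Get-Cluster "Prod-Cluster01" | Set-Cluster -HAIsolationResponse DoNothing -Confirm:$false
```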

My plan for preventing this bug from recurring on any enclosure, based on the HP advisory (and, in general, to update everything to the September 2013 HP “recipe”), is as follows:

  • Using the Enclosure Firmware Management feature to apply the HP September 2013 Service Pack for Proliant (SPP) to each blade
  • Running HPSUM from the latest SPP to update the OA and VC firmware
  • Using Update Manager to apply the recommended ESXi NIC and HBA drivers, as well as the latest HP offline bundle.
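The Update Manager step can be driven from PowerCLI as well, using the Update Manager cmdlet snap-in. A hedged sketch, assuming a connected vCenter session and an existing driver/bundle baseline (the baseline and cluster names are hypothetical):

```powershell
# Attach a host baseline containing the recommended drivers and offline
# bundle, scan, and remediate the cluster (names below are hypothetical)
$baseline = Get-Baseline -Name "HP Sept 2013 Drivers and Bundle"
$cluster  = Get-Cluster "BladeCluster01"

Attach-Baseline -Baseline $baseline -Entity $cluster
Scan-Inventory -Entity $cluster
Remediate-Inventory -Entity $cluster -Baseline $baseline -Confirm:$false
```

Remediation places each host in maintenance mode in turn, so plan for DRS to evacuate VMs as it works through the cluster.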

As a side note, it appears so far that a different, minor HP bug remains even with these latest updates, as described in Ben Loveday’s blog post:


Hi folks – it’s time to take a quick break from the excitement of the vSphere 5.5 and VSAN announcements to read a blog post about vSphere Metro Storage Clusters (vMSCs, aka stretched clusters)!  Specifically, this post is about what I’ve learned about vMSC workload mobility between sites, or downtime avoidance.  Since my vMSC experience is based solely on the NetApp MetroCluster solution, the content below is NetApp-centric.

To take a step back: when you look at the scenarios that would cause all VMs running on a stretched cluster to completely vacate all hosts at one site and (eventually) end up at the other site, there are two major types of events:

Unplanned Site Failover (Disaster Recovery)

  • Example:  Site A, which hosts half of a stretched cluster, goes completely down.  This is an unplanned event that results in a hard shutdown of all systems in the Site A datacenter.  Once the Site A LUNs are taken over by the Site B controller and fully available, the VMs that were running at Site A need to be started at Site B.  Some would argue the DR process should be triggered manually (i.e., without MetroCluster TieBreaker).  The following doc is a great reference for testing vMSC failure or DR scenarios if you’re doing a proof of concept:

Planned Site Failover (Disaster/Downtime Avoidance)

  • Proactive, non-disruptive migration of VM workloads and storage from Site A to Site B.  Performing this work non-disruptively is one of the benefits of a vSphere Metro Storage Cluster.  If equipment needs to be powered down at one of the physical sites (e.g., for site maintenance, or the impending-power-outage scenario described in Duncan Epping’s blog post), this can be done without downtime for VMs on a stretched cluster.

If you have hundreds of VMs and multiple stretched clusters, it is important to plan and document the steps for these scenarios.  Since I could not find VMware documentation discussing the planned failover scenario in detail, I wanted to share an example of how it can be performed.  These steps happen to be for a 5.0 stretched cluster environment with one or more NetApp Fabric MetroClusters on the backend.

The following is an example of the process that can be used to non-disruptively fail over storage and VMs from site A to site B, and then fail everything back to site A.  This process could differ depending on your storage solution or how many VMs are hosted on your stretched cluster(s).  The steps on the VMware side could of course be scripted, but I am listing the manual steps.  If you have multiple stretched clusters, you can perform the VM migrations for the clusters simultaneously, depending on available resources/bandwidth.  *Note – if it’s within the budget, 10Gb NICs can make a huge difference in how quickly you can complete the vMotions.


Preparation – Document the steps beforehand, including an estimated timeline.  If you are in an environment where infrastructure management responsibilities are divided between various teams, meet with other teams to discuss plans and share documentation. Review NetApp KB 1014120 titled “How to perform MetroCluster failover method for a planned site-wide maintenance not requiring CFOD”.


Failover (Site A to Site B):

  1. Fail over the MetroCluster(s) from site A to site B using the steps in the NetApp KB, including offlining plexes and performing a cf takeover.
  2. Once it is confirmed that storage has successfully been failed over, you can begin the VM migrations.
  3. Verify that DRS is in fact set to Fully Automated.
  4. For each stretched cluster, edit the cluster settings and modify the DRS affinity “should” rule that keeps VMs at site A.  Change the rule so that it contains the Site B host affinity group instead of the Site A host affinity group.  Within 5 minutes, DRS should kick off the vMotions for the VMs in the associated VM affinity group.  You can run DRS manually if short on time.
  5. Once you confirm all vMotions were successful, place the hosts in site A in maintenance mode.
  6. Document failover times and keep an eye on monitoring software for any VM network connectivity issues.
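For the DRS rule change in step 4, newer PowerCLI releases expose VM/Host rules and groups via cmdlets; in a 5.0-era environment you would make the same change in the vSphere Client. A hedged sketch, assuming Set-DrsVMHostRule accepts the host-group swap (the cluster, rule, and group names are hypothetical):

```powershell
# Repoint the "should run on" rule from the Site A host group to Site B
$cluster = Get-Cluster "StretchedCluster01"                       # hypothetical
$rule    = Get-DrsVMHostRule -Cluster $cluster -Name "VMs-Should-Run-SiteA"
$siteB   = Get-DrsClusterGroup -Cluster $cluster -Name "SiteB-Hosts"
Set-DrsVMHostRule -Rule $rule -VMHostGroup $siteB

# Rather than waiting up to 5 minutes, apply DRS recommendations now
Get-DrsRecommendation -Cluster $cluster | Invoke-DrsRecommendation
```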


Failback (Site B to Site A):

  1. Fail back the MetroCluster(s) from site B to site A using the steps in the NetApp KB, including onlining plexes, resyncing, and performing a giveback.
  2. Once it is confirmed that storage has been successfully failed back and synced, you can begin the VM migrations.
  3. Remove the hosts in site A from maintenance mode.
  4. For each stretched cluster, edit the cluster settings and modify the same DRS Affinity “should” rule that was modified during Failover.  Change the Affinity rule so that it contains the original Site A Host Affinity group.  Within 5 minutes, DRS should kick off the vMotions.
  5. Document failover times and keep an eye on monitoring software for any VM network connectivity issues.

For those in IT who remember life before virtualization, it is exciting to see this in action and confirm that storage and compute for hundreds of VMs can be non-disruptively failed over to a site kilometers away in just an hour.  As always, feel free to leave a comment if you have feedback.

Now that I’ve had a few days to recover, I wanted to share my experience from my trip to VMworld 2013.  After having such an amazing time last year, I decided that attending the 10th annual VMworld would be in my best interest, even if it meant “paying my own way”.  It turns out this was a good call!  Unlike last year, I had a better idea of what to expect during my time at the conference.

I arrived in San Francisco on Sunday, August 25th, just in time to participate in the v0dgeball fundraising event.  (Thanks again to @CommsNinja (Amy Lewis) for letting me play for the Cloudbunnies #FearTheEars!)  This was a good opportunity to help raise money for The Wounded Warrior Project by playing dodgeball with a bunch of folks in the VMware community.  It turns out my dodgeball skills are about as good as they were back in 3rd grade – not much improvement there 🙂 Thankfully, I made it through without injury and had a fun time in the tournament. Congrats to the EMC team on the victory!

I missed the VMworld Opening Ceremony, but fortunately after getting a bit lost I made it to the VMunderground party.  Great event for networking and catching up with everyone in the community!

On Monday, I attended the first keynote, where VMware announced the release of vSphere 5.5 and vCloud Suite 5.5.  VMware continued to talk about the path toward the Software-Defined Datacenter (SDDC) and the latest features included with 5.5.  I won’t go into the details, since several bloggers did a great job posting live blogs of the keynotes.  I will say that many of the announcements made during this keynote were not a surprise.  The rest of the day I spent attending sessions for the most part.  I really enjoyed the “group discussion” on HA with Duncan Epping and Keith Farcas; it was nice to give feedback, hear from peers, and learn about possible futures. Later that evening, I made my way to the CXIParty.  @CXI (Christopher Kusek) did a great job putting this together for the community.

On Tuesday, I was able to catch the last part of the second keynote with “Carl and Kit”.  My favorite part of this general session was the vCAC demo, since I will be building out a proof-of-concept environment for vCAC when I return to work.  Like many other VMware customers, I am looking at how certain automation and management tools can bring an organization beyond basic virtualization and into a private cloud solution.  I also attended the “Ask the Expert vBloggers” session, which I enjoyed just as much as last year.  Later that evening, I had a great time attending the Veeam and vBacon parties.

Most of my Wednesday was spent preparing for and taking the VCAP5-DCA exam.  I’ll save that experience for a different post, but this may be the last time I mix cert exams with VMworld (a bit too much excitement all at one time).

Wednesday night was the VMworld 2013 party.  VMware did an awesome job putting this party together!  Imagine Dragons and Train performed at SF Giants stadium (AT&T park).  They basically threw a carnival in the stadium, along with a huge concert, and then topped it all off with fireworks at the end.  I was not too familiar with either of the bands before the party, but I became a fan of Imagine Dragons during their performance.  Not sure how VMware is going to top this one!

I was able to work on some Hands-on Labs on Thursday morning before I left for SFO to head back home.  I did BYOD this year and would highly recommend going that route if you can.  Though I haven’t looked into it yet, I’m assuming I’ll (hopefully) be able to do many of these labs eventually online via Project Nee.

Overall, it was an outstanding VMworld trip!  Very grateful that I was able to catch up with friends I made last year and make new ones.

Without a doubt, troubleshooting storage performance issues can be a challenging part of any VMware admin’s job.  The cause of a VMware storage-related issue, particularly on a SAN, can be difficult to identify when the problem could be anywhere on the storage fabric.  Take your pick: host driver/firmware bug, bad host HBA, bad cable, bad FC switch port, wrong FC switch port setting, FC switch firmware bug, controller HBA bug, storage OS bug, misalignment…and the list goes on.  Here is an example of one experience I had while working on a VMware storage issue.


The symptoms:

Intermittent, brief storage disconnects were occurring on all VMware clusters/hosts attached via FC to two NetApp storage arrays.  When the disconnects occurred, they were seen across the hosts at the same time.  Along with the brief disconnects, very high latency spikes and a tremendous number of SCSI resets were also seen on the hosts.  There was no obvious pattern – though the symptoms often seemed to occur more during the overnight hours, the behavior would also occur during the day.

The storage disconnects in vCenter looked like this in Tasks/Events for the hosts:

Lost access to volume xxxxxxxxxxxxxxxx (datastore name) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.

Seconds later…

Successfully restored access to volume xxxxxxxxxxxxxxxxx (datastore name) following connectivity issues.

These events were considered “informational”, so they were not errors that triggered vCenter email notifications, and if you weren’t monitoring logs, they could easily be missed.
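Because these events are informational and easy to miss, one option is to sweep them out of the vCenter event log with PowerCLI. A minimal sketch, assuming an existing Connect-VIServer session (adjust the time window and sample count for your environment):

```powershell
# Pull the last 24 hours of events and keep only the storage-loss messages
$events = Get-VIEvent -MaxSamples 10000 -Start (Get-Date).AddDays(-1) |
    Where-Object { $_.FullFormattedMessage -match "Lost access to volume" }

# Show when and on which host each disconnect occurred
$events |
    Select-Object CreatedTime, @{N="Host";E={$_.Host.Name}}, FullFormattedMessage |
    Sort-Object CreatedTime | Format-Table -AutoSize
```

Running something like this on a schedule gives you a simple way to spot the disconnect pattern over time, even though vCenter never raises an alarm for it.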

Host logs:

A few different event types appeared in the logs, including a high number of:

H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0  DID_RESET

Latency spikes up to 1600 milliseconds via ESXTOP….



In collaboration with my colleagues on the storage side, we went through several troubleshooting steps using the tools we had available.  Cases were opened with both VMware and the storage vendor.  Steps included:

  • Using the NetApp Virtual Storage Console (VSC) plugin’s online “functional” alignment feature to fix storage alignment for certain VMs in the short term
  • Comparing VMware host and NetApp logs
  • Uploading tons of system bundles and running ESXTOP
  • Taking a closer look at the environment – was anything being triggered by VM activity at those times (e.g., antivirus or resource-intensive jobs)?
  • Reviewing HBA driver/firmware versions
  • Working with DBAs to try to stagger SQL maintenance plans
  • Verifying that all aggregate snapshots except for aggr0 were still disabled on the NetApp side, since this had caused similar symptoms in the past. (Yep, still disabled.)

After all of the troubleshooting steps above, the issue remained.  Here are the steps that finally led to the solution:

  • Get both vendors and the customer together on a status call – make sure everyone is on the same page.
  • Run perfstats on the NetApp side and ESXTOP in batch mode on the VMware side to capture data as the issue is occurring.
  • ESXTOP command:  esxtop -b -a -d 10 -n 3000 | gzip -9c > /vmfs/volumes/datastorename/esxtopoutput.csv.gz  (Duncan’s handy ESXTOP page helps with the analysis.)
  • Provide the storage vendor with the perfstats, the ESXTOP files, and example times for when the disconnects occurred during the captures.


NetApp bug #253964 was identified as the cause: the long-running “s” type CP after a “z” (sync) type CP seen in the perfstat indicated we were hitting this bug.  The fix was to upgrade Data ONTAP to a newer version – 8.0.4, 8.0.5, or 8.1.2.  I am happy to confirm that upgrading Data ONTAP did resolve the issue.

Recently, I needed to move 100+ VMs in a VMware environment from an AMD cluster (4.x) to an Intel cluster (5.x). Here are the details and steps I took to accomplish this with PowerCLI:

Storage:  This work did not include moving the VM files to different datastores/LUNs or upgrading datastores, since the storage changes could be handled separately at a later time and do not require an outage.

Time constraints:  The migration could take no longer than about 1 hour, while the VMs were already down.  Due to my limited PowerCLI experience, I ended up dividing the VMs into three separate groups/scripts and running them concurrently.  This met the 1 hour requirement, which made everyone happy!

Networking:  The old cluster and the new cluster were on different distributed vSwitches, and the old cluster could not be added to the newer 5.x dvSwitch.  I knew this would cause a problem with even cold migrations.  Therefore, as part of the move, I needed to create a temporary 4.x “transfer” dvSwitch with a “transfer” dvPortgroup.  The idea of using a “transfer” dvSwitch was taken from this awesome blog post:  I prepared three separate CSV files for the three different groups, with the old and new dvPortgroup info.
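Creating the temporary transfer dvSwitch and portgroup can itself be done in PowerCLI via the distributed-switch cmdlets added in later releases. A hedged sketch, assuming a connected vCenter session (the datacenter, switch, and host names and the VLAN ID are hypothetical):

```powershell
# Build a 4.x-compatible "transfer" dvSwitch so both old and new hosts can attach
$dc  = Get-Datacenter "DC01"                                   # hypothetical
$vds = New-VDSwitch -Name "dvTransfer" -Location $dc -Version "4.1.0"
New-VDPortgroup -VDSwitch $vds -Name "dvTransfer-PG" -VlanId 100

# Attach the source hosts (destination hosts can be added the same way)
Add-VDSwitchVMHost -VDSwitch $vds -VMHost (Get-VMHost "esx01","esx02")
```

Pinning the switch version to 4.1.0 is the key detail here: it keeps the transfer dvSwitch usable by the 4.x hosts that cannot join the 5.x dvSwitch.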

Disclaimer:  I’m a PowerCLI noob.  The PowerCLI scripts I share may be incredibly simple or not that exciting (to you).  However, if I’m posting it, that means it worked for me and I found it useful.  🙂

After searching the Interwebs to see if others had performed similar work, I found the following discussion in the VMware Communities Forum:  Thanks to “bentech201110141” and Luc Dekens, I was able to take the PowerCLI script mentioned there and modify it to fit the solution.

The logical steps:

For each VM in the AMD cluster, specified in CSV file –

  • Change dvportgroup for VM to port group on “transfer” dvswitch
  • Move VM to destination host
  • Change dvportgroup for VM to the correct port group on destination dvswitch
  • Start VM

PowerCLI Script:

$vms = Import-CSV c:\csmoveinput
foreach ($vm in $vms){
    $VMdestination = Get-VMHost $vm.VMhost
    $Network = $vm.VLAN
    # Move the VM's NIC(s) to the transfer portgroup, cold-migrate the VM, then
    # re-attach to the destination portgroup and power on.  ($vm.Name assumes the
    # CSV has a "Name" column alongside "VMhost" and "VLAN".)
    Get-VM -Name $vm.Name | Get-NetworkAdapter | Set-NetworkAdapter -StartConnected:$true -Confirm:$false -NetworkName dvTransfer
    Move-VM -VM $vm.Name -Destination $VMdestination
    Get-VM -Name $vm.Name | Get-NetworkAdapter | Set-NetworkAdapter -StartConnected:$true -Confirm:$false -NetworkName $Network
    Start-VM -VM $vm.Name
}