My commute to work, like many others in the DC/Maryland/Virginia area, is a long one.  After years of sticking to music or local news radio as a distraction during the drive, I finally decided to give podcasts a try (always late to the party, but better late than never).  In the past couple of months, I’ve started listening to several virtualization/VMware/tech podcasts on a regular basis.  Like Twitter, it’s another great way to keep current on the latest tech trends and hear what peers and experts have to say.  Here is a short list of the podcasts I’ve listened to so far.  I’ll probably be updating this list as time goes on or creating a new “Podcasts” tab on my site to keep track of the podcasts I enjoy listening to.

VMware Communities Roundtable Podcast, John Mark Troyer, Mike Laverick –  This was the first podcast I checked out, and I can see why it’s so popular!  Listen to the latest from the VMware perspective. Still trying to catch it live…

VUPaaS (Virtualization User Podcast as a Service), Gurusimran “GS” Khalsa, Chris Wahl, Josh Atwell –  This is an awesome, newer podcast.  It’s really easy to relate to the discussions from a Sys Admin point of view.  Very technical and useful info. For example, after hearing GS mention “The Phoenix Project”, I ended up reading this IT novel about DevOps over the holidays.  Was also led to the “Packet Pushers” podcast after it was mentioned in one of the episodes.

Adapting IT, Lauren Malhoit –  Great conversations about tech!  Relatable and very easy to listen to.  It’s also great to see a podcast that is showcasing women in IT.

Geek Whisperers, John Mark Troyer, Amy Lewis, Matthew Brender –  This is an interesting podcast, and it provides perspectives on social media and tech that are new to me.

Packet Pushers, Greg Ferro, Ethan Banks –  Listening to what those Network folks talk about.  No, I’m not a Network Engineer, so some of the conversation flies above my head.  But I’m always interested in getting the big picture when it comes to datacenter infrastructure, and it helps to get familiar with the network technologies and terms that are being discussed.  I found the two part deep dive on OTV to be very interesting.

vSoup, Christian Mohn, Chris Dearden, Ed Czerwin –  One more podcast I checked out after listening to VUPaaS (thanks again GS).

If there are other tech podcasts that you would like to recommend, please leave a comment.  I'd also like to start listening to some non-tech podcasts, as time allows….

If you run VMware on HP ProLiant servers, then you are probably familiar with HP's VMware support site.  In addition to HP customized VMware ESXi ISOs and software bundles, this site also has what HP refers to as VMware firmware and software “recipes”.  The “recipes” list the drivers and firmware that HP recommends running along with a specified Service Pack for ProLiant (SPP) and certain ESXi versions.  While applying newer firmware and drivers to HP blade enclosures can be a pain, it’s a good idea to perform these updates once or twice a year, since each SPP is only supported for one year.

Stacy’s Example:

In the following example, I used the September 2013 “recipe” to apply updates to HP C7000 Blade Enclosures that were already running ESXi 5.0 Update 2 hosts.  There is more than one way to apply these updates, but this is the method I found the easiest.

  • Each HP Blade Enclosure was updated one at a time.
  • For each enclosure, updates were applied to the Onboard Administrators, the Virtual Connect Flex-10 Ethernet modules, and the blades themselves.  (FC switches in the enclosures were handled separately.)
  • Performed the steps detailed below for each enclosure.
  • Note: If your hosts have FC HBAs, check with your storage vendor as well to see if they support the new HBA firmware/drivers.
Blade Driver Updates – VUM
  • Created new VMware Update Manager (VUM) HP Extension/driver baselines based on the September 2013 HP “recipe”.  Reviewed the host hardware for each cluster (ie looked at network adapters, RAID controllers, latest offline bundle, etc) to determine the appropriate drivers to include in the baselines.
  • Attached the appropriate baselines to the appropriate clusters (again, based on the hardware for each cluster and the “recipe”), and scanned.
  • Placed all ESXi hosts in the enclosure to be updated in maintenance mode. (It’s great if you are able to shut down and update all blades in the enclosure at once, but not everyone will have this luxury)
  • Suspended alerting for the hosts.
  • Remediated the hosts in the blade enclosure using the VUM baselines (Host Extensions).
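The maintenance mode step above can be sketched in PowerCLI. This is a hypothetical example: the "*enc01*" name filter is an assumption, so substitute however you identify the blades that live in a given enclosure.

```powershell
# Sketch: put every ESXi host in one blade enclosure into maintenance mode before remediation.
# Assumes host names share an enclosure-specific string such as "enc01" (hypothetical naming).
$enclosureHosts = Get-VMHost -Name "*enc01*"
foreach ($esx in $enclosureHosts) {
    # With DRS fully automated, VMs are migrated off as each host enters maintenance mode.
    Set-VMHost -VMHost $esx -State Maintenance -Confirm:$false
}
```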
Blade Firmware Updates – EFM
  • Used the Enclosure Firmware Management (EFM) feature to update blade firmware.  EFM can mount an SPP ISO via URL, where it is hosted on an internal server running IIS.  Prior to updating blade firmware, updated the SPP ISO on the IIS server and re-mounted the ISO in EFM.
  • Shut down the hosts (which were still in maintenance mode) using the vSphere Client.
  • Once the hosts were shut down, used the HP EFM feature to manually apply the firmware updates.
  • After the firmware updates completed (could take an hour), clicked on Rack Firmware in the OA and reviewed the current version/Firmware ISO version.
Virtual Connect (VC) and Onboard Administrator (OA) Updates – HPSUM from desktop
  • Temporarily disabled the Virtual Connect Domain IP Address (optional setting) in the Virtual Connect Manager in order for HPSUM to discover the Virtual Connects when the Onboard Administrator is added as a target (yes, HP bug workaround).
  • Ran HP SUM from the appropriate HP SPP from desktop.
  • Added Active OA hostname OR IP address as a target, chose Onboard Administrator as type.
  • Blade iLO interfaces, Virtual Connect Manager, and FC Switches were all discovered as associated targets by adding the OA.  For associated targets, de-selected everything except for the Virtual Connect Manager and clicked OK (the iLO interfaces for the blades were updated along with the rest of their firmware using the EFM, and the FC Switch firmware is handled separately).
  • The Virtual Connect Manager may then show as unknown in HPSUM.  Edited that target and changed target type to Virtual Connect, and entered the appropriate credentials.
  • After applying updates to the OAs and VCs, verified they updated to the correct firmware levels.
  • Re-enabled the Virtual Connect Domain IP Address setting.
  • Re-enabled alerting.

If you happen to run at least some of your ESXi clusters on blades, and you have multiple chassis/enclosures, you may choose to distribute the hosts in those clusters across multiple enclosures.  This is a good idea, in general, for many environments.  After all, though this type of blade enclosure failure is rare, you don’t want a critical issue with a blade enclosure taking down an entire cluster.

Recently, I saw one of these rare events impact an enclosure, and it was not pretty.  For Sys Admins – you know that feeling you get when alerts about multiple hosts being “down” come streaming into your inbox.  You think: which hosts, which cluster, perhaps even which site…any commonality?  In this case, the following bug had just hit an HP enclosure:

This enclosure was NOT extremely out-of-date in terms of firmware and drivers.  The firmware was at a February 2013 SPP level, and the hosts were built from the latest HP ESXi 5.0 U2 customized ISO.

Here is a summary of what was seen when troubleshooting the issue for the impacted enclosure:

  • Both the Onboard Administrators and the Virtual Connect Manager were still accessible – somewhat.  See next bullet.
  • Virtual Connect Manager could be logged into, but was slow to respond.
  • Virtual Connect Manager showed the “stacking link” in a critical state.
  • Virtual Connect Manager also showed the 10Gb aggregated (LAG) uplinks were in an active/passive state as opposed to active/active, which is how they were originally configured.
  • None of the hosts in the enclosure could be pinged.  That is, every single blade lost network connectivity.  They still had FC connectivity to the FC switches.
  • Some of the ESXi hosts were still running, and some had suffered PSODs as a result of the bug.
  • Hosts that were still up eventually saw themselves as “isolated”.  Since the isolation response was set to “shutdown”, impacted VMs (luckily, not that many) were shutdown and restarted on non-isolated hosts.
  • Exported a log from the Virtual Connect Manager, and HP helped to identify the blade triggering the bug.  The host was shut down, and the blade itself was also “reset”; however, this did not restore normal functionality.
  • Reset one of the Virtual Connect modules. This restored network connectivity for some of the blades, but not all.
  • Some of the blades had to be rebooted in order for network connectivity to be completely restored.
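Since the HA isolation response determined what happened to the VMs on the isolated hosts, it is worth confirming that setting across all clusters before an event like this. A quick PowerCLI sketch:

```powershell
# Sketch: report the HA configuration and isolation response for every cluster,
# so there are no surprises about what happens to VMs when hosts become isolated.
Get-Cluster | Select-Object Name, HAEnabled, HAIsolationResponse
```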

My plan for preventing this bug from re-occurring on any enclosure, based on the HP advisory (and, in general, to bring everything up to the September 2013 HP “recipe”):

  • Using the Enclosure Firmware Management feature to apply the HP September 2013 Service Pack for Proliant (SPP) to each blade
  • Running HPSUM from the latest SPP to update the OA and VC firmware
  • Using Update Manager to apply the recommended ESXi NIC and HBA drivers, as well as the latest HP offline bundle.
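To verify afterwards that the hosts actually ended up at the driver and bundle levels called for in the “recipe”, something like the following PowerCLI sketch can help. The cluster name and the VIB name filter are assumptions; adjust both to your environment.

```powershell
# Sketch: list installed VIB versions on each host via esxcli, to compare against the HP "recipe".
foreach ($esx in (Get-Cluster "BladeCluster01" | Get-VMHost)) {
    $esxcli = Get-EsxCli -VMHost $esx
    $esxcli.software.vib.list() |
        Where-Object { $_.Name -match "net-|scsi-|hp" } |   # hypothetical filter for NIC/HBA/HP VIBs
        Select-Object @{N = "Host"; E = { $esx.Name }}, Name, Version
}
```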

As a side note, it appears so far that a different, minor HP bug remains even with these latest updates, as described in Ben Loveday’s blog post:


Hi folks – It’s time to take a quick break from the excitement of the vSphere 5.5 and VSAN announcement to read a blog post about vSphere Metro Storage Clusters (vMSCs aka stretched clusters)!  Specifically, this post is about what I’ve learned in regards to vMSC workload mobility between sites, or downtime avoidance.  Since my vMSC experience is solely based on the NetApp MetroCluster solution, the content below is NetApp-centric.

To take a step back – When you look at the scenarios that would cause all VMs running on a stretched cluster to completely vacate all hosts at one of the sites and (eventually) end up at the other site, I see two major types of events:

Unplanned Site Failover (Disaster Recovery)

  • Example:  Site A, which hosts half of a stretched cluster, goes completely down. This is an unplanned event, which results in a hard shutdown of all systems in the Site A datacenter.  Once the Site A LUNs are taken over by the Site B controller and fully available, VMs that were running at Site A need to be started at Site B.  Some would argue the DR process should be triggered manually (ie without MetroCluster TieBreaker).  The following doc is a great reference for testing vMSC failure or DR scenarios if you’re doing a proof of concept:

Planned Site Failover (Disaster/Downtime Avoidance)

  • Proactive non-disruptive migration of VM workloads and storage from Site A to Site B.  Performing this work non-disruptively is one of the benefits of a vSphere Metro Storage Cluster.  If equipment needs to be powered down at one of the physical sites (ie. for site maintenance or impending power outage scenario described in Duncan Epping’s blog post), this can be done without downtime for VMs on a stretched cluster.

If you have hundreds of VMs and multiple stretched clusters, it is important to plan and document the steps for these scenarios.  Since I could not find specific VMware documentation discussing the Planned Failover scenario in detail, I wanted to share an example of how this can be performed.  These steps happen to be for a 5.0 stretched cluster environment with one or more NetApp Fabric Metroclusters on the backend.

The following is an example of the process that can be used to non-disruptively failover storage and VMs from site A to site B, and then fail everything back to site A.  This process could be different depending on your storage solution, or how many VMs you have hosted on your stretched cluster(s).  The steps on the VMware side could of course be scripted, but I am listing out the manual steps.  If you have multiple stretched clusters, you can perform VM migrations for the clusters simultaneously, depending on available resources/bandwidth.  *Note – If it’s within the budget, 10Gb NICs can make a huge difference in how quickly you can complete the vMotions.


Preparation – Document the steps beforehand, including an estimated timeline.  If you are in an environment where infrastructure management responsibilities are divided between various teams, meet with other teams to discuss plans and share documentation. Review NetApp KB 1014120 titled “How to perform MetroCluster failover method for a planned site-wide maintenance not requiring CFOD”.


Failover:

  1. Fail over the MetroCluster(s) from site A to site B using the steps in the NetApp KB, including offlining the plexes and performing a cf takeover.
  2. Once it is confirmed that storage has successfully been failed over, you can begin the VM migrations.
  3. Verify that DRS is in fact set to Fully Automated.
  4. For each stretched cluster, edit the cluster settings and modify the DRS Affinity “should” rule that keeps VMs at site A.  Change the Affinity rule so that it contains the Site B Host Affinity group instead of the Site A Host Affinity group.  Within 5 minutes, DRS should kick off the vMotions for the VMs in the associated VM Affinity group.  You can run DRS manually if short on time.
  5. Once you confirm all vMotions were successful, place the hosts in site A in maintenance mode.
  6. Document failover times and keep an eye on monitoring software for any VM network connectivity issues.
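Before putting the site A hosts into maintenance mode, it is easy to double-check with PowerCLI that nothing powered-on is still sitting on them. A hedged sketch; the cluster name and the "*sitea*" host-naming filter are assumptions:

```powershell
# Sketch: confirm no powered-on VMs remain on the site A hosts of a stretched cluster.
$siteAHosts = Get-Cluster "StretchedCluster01" | Get-VMHost -Name "*sitea*"
Get-VM -Location $siteAHosts |
    Where-Object { $_.PowerState -eq "PoweredOn" } |
    Select-Object Name, VMHost
# An empty result means it is safe to place the site A hosts into maintenance mode.
```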


Failback:

  1. Fail over the MetroCluster(s) from site B to site A using the steps in the NetApp KB, including onlining the plexes, resyncing, and performing a giveback.
  2. Once it is confirmed that storage has been successfully failed back and synced, you can begin the VM migrations.
  3. Remove the hosts in site A from maintenance mode.
  4. For each stretched cluster, edit the cluster settings and modify the same DRS Affinity “should” rule that was modified during Failover.  Change the Affinity rule so that it contains the original Site A Host Affinity group.  Within 5 minutes, DRS should kick off the vMotions.
  5. Document failover times and keep an eye on monitoring software for any VM network connectivity issues.

For those in IT who remember life before virtualization, it is exciting to see this in action and confirm that storage and compute for hundreds of VMs can be non-disruptively failed over to a site kilometers away in just an hour.  As always, feel free to leave a comment if you have feedback.

Now that I’ve had a few days to recover, I wanted to share my experience from my trip to VMworld 2013.  After having such an amazing time last year, I decided that attending the 10th annual VMworld would be in my best interest, even if it meant “paying my own way”.  It turns out this was a good call!  Unlike last year, I had a better idea of what to expect during my time at the conference.

I arrived in San Francisco on Sunday, August 25th, just in time to participate in the v0dgeball fundraising event.  (Thanks again to @CommsNinja (Amy Lewis) for letting me play for the Cloudbunnies #FearTheEars!)  This was a good opportunity to help raise money for The Wounded Warrior Project by playing dodgeball with a bunch of folks in the VMware community.  It turns out that my dodgeball skills are about as good as they were back in 3rd grade, not much improvement there 🙂 Thankfully, I made it through without injury and had a fun time in the tournament. Congrats to the EMC team on the victory!

I missed the VMworld Opening Ceremony, but fortunately after getting a bit lost I made it to the VMunderground party.  Great event for networking and catching up with everyone in the community!

On Monday, I attended the 1st Keynote, where VMware announced the release of vSphere 5.5 and vCloud Suite 5.5.  VMware continued to talk about the path toward the Software-Defined Datacenter (SDDC) and the latest features included with 5.5.  I won’t go into the details, since several bloggers did a great job posting live blogs of the keynotes.  I will say that many of the announcements made during this keynote were not a surprise.  The rest of the day I spent attending sessions for the most part.  I really enjoyed the “group discussion” on HA with Duncan Epping and Keith Farkas; it was nice to give feedback, hear from peers and learn about possible futures. Later that evening, I made my way to CXIParty.  @CXI (Christopher Kusek) did a great job putting this together for the community.

On Tuesday, I was able to catch the last part of the 2nd Keynote with “Carl and Kit”.  My favorite part of this general session was the vCAC demo, since I will be building out a proof of concept environment for vCAC when I return to work.  Like many other VMware customers, I am looking at how certain automation and management tools can bring an organization beyond basic virtualization and into a private cloud solution.  I also attended the “Ask the Expert vBloggers” session, which I enjoyed just as much as last year.  Later that evening, I had a great time attending the Veeam and vBacon parties.

Most of my Wednesday was spent preparing for and taking the VCAP5-DCA exam.  I’ll save that experience for a different post, but this may be the last time I mix cert exams with VMworld (a bit too much excitement all at one time).

Wednesday night was the VMworld 2013 party.  VMware did an awesome job putting this party together!  Imagine Dragons and Train performed at the SF Giants’ stadium (AT&T Park).  They basically threw a carnival in the stadium, along with a huge concert, and then topped it all off with fireworks at the end.  I was not too familiar with either of the bands before the party, but I became a fan of Imagine Dragons during their performance.  Not sure how VMware is going to top this one!

I was able to work on some Hands-on Labs on Thursday morning before I left for SFO to head back home.  I did BYOD this year and would highly recommend going that route if you can.  Though I haven’t looked into it yet, I’m assuming I’ll (hopefully) be able to do many of these labs eventually online via Project NEE.

Overall, it was an outstanding VMworld trip!  Very grateful that I was able to catch up with friends I made last year and make new ones.

Without a doubt, troubleshooting storage performance issues can be a challenging part of any VMware admin’s job.  The cause of a VMware storage-related issue, particularly on a SAN, can be difficult to identify when the problem could be anywhere on the storage fabric.  Take your pick: host driver/firmware bug, bad host HBA, bad cable, bad FC switch port, wrong FC switch port setting, FC switch firmware bug, controller HBA bug, storage OS bug, misalignment…and the list goes on.  Here is an example of one experience I had when working on a VMware storage issue.


Intermittent, brief storage disconnects seen occurring on all VMware clusters/hosts attached via FC to two NetApp storage arrays.  When the disconnects occurred, they were seen across the hosts at the same time.  Along with the brief disconnects, very high latency spikes and a tremendous amount of SCSI resets were other symptoms seen on the hosts.  There was no obvious pattern – though it often seemed that the symptoms occurred more during the overnight hours, this behavior would also occur during the day.

The storage disconnects in vCenter looked like this in Tasks/Events for the hosts:

Lost access to volume xxxxxxxxxxxxxxxx (datastore name) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.

Seconds later…

Successfully restored access to volume xxxxxxxxxxxxxxxxx (datastore name) following connectivity issues.

These events were considered “informational”, so they were not errors that triggered vCenter email notifications, and if you weren’t monitoring logs, they could easily be missed.
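Since these messages are easy to miss, one option is to query vCenter for them periodically with PowerCLI. A hedged sketch; the one-day window and sample cap are arbitrary choices:

```powershell
# Sketch: pull the "Lost access to volume" events from the past day out of vCenter,
# since they are informational and do not trigger email notifications by default.
Get-VIEvent -Start (Get-Date).AddDays(-1) -MaxSamples 50000 |
    Where-Object { $_.FullFormattedMessage -like "*Lost access to volume*" } |
    Select-Object CreatedTime, @{N = "Host"; E = { $_.Host.Name }}, FullFormattedMessage
```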

Host logs:

A few different event types appeared in the logs, including a high number of:

H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0  DID_RESET

Latency spikes up to 1600 milliseconds via ESXTOP….



In collaboration with my colleagues on the storage side, we went through several troubleshooting steps using the tools we had available.  Cases were opened with both VMware and the storage vendor.  Steps included:

  • Using the NetApp Virtual Storage Console (VSC) plugin, we used the online “functional” alignment feature to fix storage alignment for certain VMs in the short-term
  • Comparing VMware host and NetApp logs
  • Uploading tons of system bundles and running ESXTOP
  • Taking a closer look at the environment – was anything being triggered by VM activity at those times? (ie antivirus, resource-intensive jobs)
  • Reviewed HBA driver/firmware versions
  • Worked with DBAs to try and stagger SQL maintenance plans
  • Verified all aggregate snapshots except for aggr0 were still disabled on the NetApp side, since this had caused similar symptoms in the past. (Yep, still disabled)

After all of the troubleshooting steps above, the issue remained.  Here are the steps that finally led to the solution:

  • Get both vendors and customer together on status call – make sure everyone is on the same page
  • Run perfstats on the NetApp side and ESXTOP in batch mode on VMware side to capture data as the issue is occurring.
  • ESXTOP command: esxtop -b -a -d 10 -n 3000 | gzip -9c > /vmfs/volumes/datastorename/esxtopoutput.csv.gz  And of course, Duncan’s handy ESXTOP page helped with the analysis:
  • Provide storage vendor with perfstats, esxtop files, and examples times for when the disconnects occurred during the captures.


NetApp bug #253964 was identified as the cause.  The long-running “s” type CP after a “z” (sync) type CP seen in the perfstat indicated we were hitting this bug.  The fix was to upgrade Data ONTAP to a newer version: 8.0.4, 8.0.5, or 8.1.2.  I am happy to confirm that upgrading Data ONTAP did resolve the issue.

Recently, I needed to move 100+ VMs in a VMware environment from an AMD cluster (4.x) to an Intel cluster (5.x). Here are the details and steps I took to accomplish this with PowerCLI:

Storage:  This work did not include moving the VM files to different datastores/LUNs or upgrading datastores, since the storage changes could be handled separately at a later time and do not require an outage.

Time constraints:  The migration could take no longer than about 1 hour, while the VMs were already down.  Due to my limited PowerCLI experience, I ended up dividing the VMs into three separate groups/scripts and running them concurrently.  This met the 1 hour requirement, which made everyone happy!

Networking:  The old cluster and new cluster were on different distributed vswitches, and the old cluster could not be added to the newer 5.x dvswitch.  I knew that this would cause a problem with even cold migrations.  Therefore, as part of the move, I needed to create a temporary 4.x “transfer” dvswitch w/ a “transfer” dvportgroup.   The idea of using a “transfer” dvswitch was taken from this awesome blog post:  I prepared three separate CSV files for the three different groups, with the old and new dvportgroup info.

Disclaimer:  I’m a PowerCLI noob.  The PowerCLI scripts I share may be incredibly simple or not that exciting (to you).  However, if I’m posting it, that means it worked for me and I found it useful.  🙂

After searching the Interwebs to see if others had performed similar work, I found the following discussion in the VMware Communities Forum:  Thanks to “bentech201110141” and Luc Dekens, I was able to use the PowerCLI script mentioned and modify it to fit the solution.

The Logic:

For each VM in the AMD cluster, specified in CSV file –

  • Change dvportgroup for VM to port group on “transfer” dvswitch
  • Move VM to destination host
  • Change dvportgroup for VM to the correct port group on destination dvswitch
  • Start VM

PowerCLI Script:

$vms = Import-Csv c:\csmoveinput
foreach ($vm in $vms){
    # Each row in the CSV is assumed to contain Name, VMhost, and VLAN columns
    $VMdestination = Get-VMHost $vm.VMhost
    $Network = $vm.VLAN
    # Connect the VM to the "transfer" dvportgroup so the cold migration can cross dvswitches
    Get-VM -Name $vm.Name | Get-NetworkAdapter | Set-NetworkAdapter -StartConnected:$true -Confirm:$false -NetworkName dvTransfer
    # Cold-migrate the VM to its destination host in the Intel cluster
    Move-VM -VM $vm.Name -Destination $VMdestination
    # Reconnect the VM to the correct port group on the destination dvswitch
    Get-VM -Name $vm.Name | Get-NetworkAdapter | Set-NetworkAdapter -StartConnected:$true -Confirm:$false -NetworkName $Network
    Start-VM -VM $vm.Name
}

Last year, I stumbled upon a great post from VCDX/vExpert blogger Chris Wahl about the steps required to upgrade firmware in an HP C7000 BladeSystem using HP SUM.  This is a simple option for updating the Onboard Administrators (OA), Virtual Connects (VC), or iLO firmware in an HP C7000 blade enclosure.  However, after several months of using HP SUM to update enclosures, I noticed something strange.

The Issue:

HP SUM was no longer discovering the Virtual Connect modules in any of my enclosures when using the OAs as targets.  The rest of the enclosure components were being discovered just fine, and since the VCs were being discovered via the OA in HP SUM before, I wondered…what changed?

I confirmed that I could add the VCs separately as targets in HP SUM successfully.   I also confirmed that the behavior was seen when using various versions of HP SUM, including the October 2012 SPP release and the February 2013 SPP release.

The Cause:

After a bit of troubleshooting with HP, the cause was identified:  When a “Virtual Connect Domain IP Address” has been enabled for a Virtual Connect Domain, the VCs are no longer discovered when using the OA as a target.  If “Virtual Connect Domain IP Address” was unchecked, HP SUM was again able to discover the VCs by using the OA as a target.  Supposedly, a fix for this bug will be included in a future release of HP SUM.  Until then, a workaround of temporarily disabling the setting, or adding the VCs separately as targets in HP SUM, can be used.


The idea of doing a little blogging first crossed my mind around this time last year.  In August 2012, I attended VMworld for the first time ever.  That amazing trip was made possible by a contest I won, thanks to vExpert Greg Stuart, a panel of judges, and a few generous sponsors.  For those that are looking for a way to get to VMworld, I encourage you to keep an eye on social media and community websites for future contests.  Don’t pass up the chance to enter these contests, because after all, if I can win, so can you!

If you’re interested in reading about that experience from last year, here are some links:

During the conference, I was able to meet quite a few bloggers/vExperts in the VMware community and talk with them.  It was great being able to share and learn from others working with virtualization and server infrastructure.  That trip inspired me, and after almost a year, I decided it was time to join the online blogging community.  I’ll be writing about my experiences with virtualization, servers, and…since this is my personal blog, whatever else I find is blog worthy.  Like many others with personal tech blogs, I’m hoping one of my posts will save someone the time it took for me to find a resolution to a technical issue.  If nothing else, it will be great to have my own personalized tech knowledgebase available, whenever I need it.  Don’t hesitate to leave a comment or give feedback on a post.  Keep comments polite, no trolls or IT snobbery please – ain’t nobody got time for that!  Welcome all, and thanks for reading.