
If you run at least some of your ESXi clusters on blades and have multiple chassis/enclosures, you may choose to distribute the hosts in each cluster across multiple enclosures.  This is a good idea, in general, for many environments.  Though this type of blade enclosure failure is rare, you don’t want a critical issue with one enclosure taking down an entire cluster.

Recently, I saw one of these rare events hit an enclosure, and it was not pretty.  Sysadmins know the feeling: alerts about multiple hosts being “down” come streaming into your inbox, and you start asking which hosts, which cluster, perhaps even which site… any commonality?  In this case, the following bug had just hit an HP enclosure:  http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?objectID=c02623029&lang=en&cc=us&taskId=135&prodSeriesId=3794423

This enclosure was NOT extremely out-of-date in terms of firmware and drivers.  The firmware was at a February 2013 SPP level, and the hosts were built from the latest HP ESXi 5.0 U2 customized ISO.

Here is a summary of what was seen when troubleshooting the issue for the impacted enclosure:

  • Both the Onboard Administrators and the Virtual Connect Manager were still accessible – somewhat.  See next bullet.
  • Virtual Connect Manager could be logged into, but was slow to respond.
  • Virtual Connect Manager showed the “stacking link” in a critical state.
  • Virtual Connect Manager also showed the 10Gb aggregated (LAG) uplinks were in an active/passive state as opposed to active/active, which is how they were originally configured.
  • None of the hosts in the enclosure could be pinged.  That is, every single blade lost network connectivity.  They still had FC connectivity to the FC switches.
  • Some of the ESXi hosts were still running, and some had suffered PSODs as a result of the bug.
  • Hosts that were still up eventually saw themselves as “isolated”.  Since the isolation response was set to “shut down”, impacted VMs (luckily, not that many) were shut down and restarted on non-isolated hosts.
  • Exported a log from the Virtual Connect Manager, and HP helped identify the blade triggering the bug.  The host was shut down and the blade itself was “reset”; however, this did not restore normal functionality.
  • Reset one of the Virtual Connect modules. This restored network connectivity for some of the blades, but not all.
  • Some of the blades had to be rebooted in order for network connectivity to be completely restored.

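The first triage step in an event like this, checking whether the “down” hosts share a common enclosure, is easy to script against an inventory. Here is a minimal sketch; the host names and the host-to-enclosure mapping are entirely hypothetical and would come from your own CMDB or inventory export:

```python
from collections import Counter

# Hypothetical inventory mapping each ESXi host to its blade enclosure.
HOST_ENCLOSURE = {
    "esx01": "enc-A", "esx02": "enc-A", "esx03": "enc-A",
    "esx04": "enc-B", "esx05": "enc-B",
}

def common_enclosure(down_hosts):
    """Return the enclosure shared by all down hosts, or None if the
    alerts span more than one enclosure (or hit none we know about)."""
    counts = Counter(HOST_ENCLOSURE[h] for h in down_hosts if h in HOST_ENCLOSURE)
    if len(counts) == 1:
        return next(iter(counts))
    return None

print(common_enclosure(["esx01", "esx02", "esx03"]))  # all in enc-A
```

When every alerting host maps back to one enclosure, as happened here, that points you at the chassis rather than at the individual blades.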
My plan for preventing this bug from recurring on any enclosure, based on the HP advisory (and, more generally, for updating everything to the September 2013 HP “recipe” from vibsdepot.hp.com):

  • Using the Enclosure Firmware Management feature to apply the HP September 2013 Service Pack for ProLiant (SPP) to each blade.
  • Running HPSUM from the latest SPP to update the OA and VC firmware.
  • Using Update Manager to apply the recommended ESXi NIC and HBA drivers, as well as the latest HP Offline bundle.

As a side note, it appears so far that a different, minor HP bug remains even with these latest updates, as described in Ben Loveday’s blog post: http://bensjibberjabber.wordpress.com/2013/01/09/storage-alert-bug-running-vsphere-on-hp-bl465c-g7-blades/

Sigh….


Last year, I stumbled upon a great post from VCDX/vExpert blogger Chris Wahl about the steps required to upgrade firmware in an HP C7000 BladeSystem using HP SUM.  It is a simple option for updating the Onboard Administrator (OA), Virtual Connect (VC), or iLO firmware in an HP C7000 blade enclosure.  However, after several months of using HP SUM to update enclosures, I noticed something strange.

The Issue:

HP SUM was no longer discovering the Virtual Connect modules in any of my enclosures when using the OAs as targets.  The rest of the enclosure components were being discovered just fine, and since the VCs had previously been discovered via the OA in HP SUM, I wondered: what changed?

I confirmed that I could add the VCs separately as targets in HP SUM successfully.   I also confirmed that the behavior was seen when using various versions of HP SUM, including the October 2012 SPP release and the February 2013 SPP release.

The Cause:

After a bit of troubleshooting with HP, the cause was identified:  when a “Virtual Connect Domain IP Address” has been enabled for a Virtual Connect Domain, the VCs are no longer discovered when using the OA as a target.  If “Virtual Connect Domain IP Address” was unchecked, HP SUM was again able to discover the VCs via the OA.  Supposedly, a fix for this bug will be included in a future release of HP SUM.  Until then, the workaround is either to temporarily disable the setting or to add the VCs as separate targets in HP SUM.
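Until the fix ships, the target list you feed HP SUM can encode the second workaround. This is a rough sketch of that decision only, with hypothetical IP addresses; it models the discovery behavior described above, not any HP SUM API:

```python
def sum_targets(oa_ip, vc_ips, vc_domain_ip_enabled):
    """Return the list of target IPs to give HP SUM for one enclosure."""
    targets = [oa_ip]  # the OA still discovers blades, iLOs, etc.
    if vc_domain_ip_enabled:
        # Known bug: with a VC Domain IP Address enabled, the VCs are
        # not discovered through the OA, so add each VC module explicitly.
        targets.extend(vc_ips)
    return targets

# Enclosure with the VC Domain IP enabled: the VCs must be listed too.
print(sum_targets("10.0.0.10", ["10.0.0.11", "10.0.0.12"], True))
```

Once the HP SUM fix arrives (or if the setting is disabled), the OA alone is a sufficient target again.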
