That Day An Entire HP C7000 Enclosure Got Hit With a Nasty Bug

October 3, 2013 — 6 Comments

If you run at least some of your ESXi clusters on blades and you have multiple chassis/enclosures, you may choose to distribute the hosts in those clusters across multiple enclosures.  In general, this is a good idea for many environments: although this type of blade enclosure failure is rare, you don’t want a critical issue with one enclosure taking down an entire cluster.
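
If it helps to picture that, here is a minimal Python sketch of the kind of check I mean: flag any cluster whose hosts all sit in a single enclosure. The cluster and enclosure mappings below are placeholder data; in practice you would pull them from vCenter and the Onboard Administrators.

    # Minimal sketch: flag clusters whose hosts all live in one enclosure.
    # The mappings below are placeholder data, not pulled from a real environment.
    clusters = {
        "Cluster-A": ["esx01", "esx02", "esx03", "esx04"],
        "Cluster-B": ["esx05", "esx06"],
    }
    host_to_enclosure = {
        "esx01": "enc1", "esx02": "enc1", "esx03": "enc2", "esx04": "enc2",
        "esx05": "enc1", "esx06": "enc1",  # Cluster-B has no enclosure diversity
    }

    for cluster, hosts in clusters.items():
        enclosures = {host_to_enclosure[h] for h in hosts}
        if len(enclosures) < 2:
            print(f"WARNING: {cluster} runs entirely in {enclosures.pop()}; "
                  "one enclosure failure takes down the whole cluster")
        else:
            print(f"OK: {cluster} spans enclosures {sorted(enclosures)}")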

Recently, I saw one of these rare events hit an enclosure, and it was not pretty.  Sysadmins know that feeling you get when alerts about multiple hosts being “down” come streaming into your inbox.  You start asking: which hosts, which cluster, perhaps even which site…any commonality?  In this case, it was the following bug, which had just hit an HP enclosure:  http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?objectID=c02623029&lang=en&cc=us&taskId=135&prodSeriesId=3794423

This enclosure was not particularly out of date in terms of firmware and drivers: the firmware was at a February 2013 SPP level, and the hosts were built from the latest HP ESXi 5.0 U2 customized ISO.

Here is a summary of what was seen when troubleshooting the issue for the impacted enclosure:

  • Both the Onboard Administrators and the Virtual Connect Manager were still accessible – somewhat.  See next bullet.
  • Virtual Connect Manager could be logged into, but was slow to respond.
  • Virtual Connect Manager showed the “stacking link” in a critical state.
  • Virtual Connect Manager also showed the 10Gb aggregated (LAG) uplinks were in an active/passive state as opposed to active/active, which is how they were originally configured.
  • None of the hosts in the enclosure could be pinged.  That is, every single blade lost network connectivity, although they still had FC connectivity to the FC switches.  (A quick ping-sweep sketch of this kind of check follows the list.)
  • Some of the ESXi hosts were still running, and some had suffered PSODs as a result of the bug.
  • Hosts that were still up eventually saw themselves as “isolated”.  Since the isolation response was set to “shutdown”, the impacted VMs (luckily, not that many) were shut down and restarted on non-isolated hosts.
  • Exported a log from the Virtual Connect Manager, and HP helped identify the blade triggering the bug.  That host was shut down and the blade itself was “reset”; however, this did not restore normal functionality.
  • Reset one of the Virtual Connect modules. This restored network connectivity for some of the blades, but not all.
  • Some of the blades had to be rebooted in order for network connectivity to be completely restored.
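
As a side note on the ping symptom above: the fastest way to confirm that every blade had lost network (and to spot the enclosure as the common factor) was a plain ping sweep grouped by enclosure. A minimal sketch is below; the host names and the host-to-enclosure mapping are placeholders, and it assumes a Linux-style ping and ICMP reachability to the management interfaces.

    # Minimal sketch: ping each ESXi management address and group the failures
    # by enclosure to spot commonality. Hosts and mapping are placeholders.
    import subprocess
    from collections import defaultdict

    host_to_enclosure = {
        "esx01.example.com": "enc1", "esx02.example.com": "enc1",
        "esx03.example.com": "enc2", "esx04.example.com": "enc2",
    }

    def is_reachable(host, timeout_s=2):
        """Single ICMP echo; True if the host answered within the timeout."""
        result = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout_s), host],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        return result.returncode == 0

    down_by_enclosure = defaultdict(list)
    for host, enclosure in host_to_enclosure.items():
        if not is_reachable(host):
            down_by_enclosure[enclosure].append(host)

    for enclosure, hosts in sorted(down_by_enclosure.items()):
        print(f"{enclosure}: {len(hosts)} unreachable host(s): {', '.join(hosts)}")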

My plan for preventing this bug from recurring on any enclosure, based on the HP advisory (and, more generally, to bring everything up to the September 2013 HP “recipe” from vibsdepot.hp.com):

  • Using the Enclosure Firmware Management feature to apply the HP September 2013 Service Pack for ProLiant (SPP) to each blade
  • Running HPSUM from the latest SPP to update the OA and VC firmware
  • Using Update Manager to apply the recommended ESXi NIC and HBA drivers, as well as the latest HP Offline bundle.  (A quick version-audit sketch follows the list.)
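
To verify the recipe actually lands on every blade, something like the sketch below can pull the NIC driver and firmware versions that esxcli reports. It assumes SSH is enabled on the hosts and the paramiko package is available; the host list, credentials, and vmnic name are placeholders, and it is a rough sanity check rather than a replacement for HPSUM’s own reporting.

    # Minimal sketch: collect NIC driver/firmware versions from each host over SSH.
    # Assumes SSH is enabled on the hosts and 'paramiko' is installed; the host
    # list, credentials, and vmnic name below are placeholders.
    import paramiko

    HOSTS = ["esx01.example.com", "esx02.example.com"]
    USER, PASSWORD = "root", "changeme"          # placeholder credentials
    CMD = "esxcli network nic get -n vmnic0"     # adjust the vmnic name as needed

    for host in HOSTS:
        ssh = paramiko.SSHClient()
        ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        ssh.connect(host, username=USER, password=PASSWORD)
        _, stdout, _ = ssh.exec_command(CMD)
        driver = firmware = drv_version = "unknown"
        for line in stdout.read().decode().splitlines():
            line = line.strip()
            if line.startswith("Driver:"):
                driver = line.split(":", 1)[1].strip()
            elif line.startswith("Firmware Version:"):
                firmware = line.split(":", 1)[1].strip()
            elif line.startswith("Version:"):
                drv_version = line.split(":", 1)[1].strip()
        ssh.close()
        print(f"{host}: driver {driver} {drv_version}, firmware {firmware}")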

As a side note, it appears so far that a different, minor HP bug still remains even with these latest updates, as described in Ben Loveday’s blog post: http://bensjibberjabber.wordpress.com/2013/01/09/storage-alert-bug-running-vsphere-on-hp-bl465c-g7-blades/

Sigh….


6 responses to That Day An Entire HP C7000 Enclosure Got Hit With a Nasty Bug

  1. 

    Hi,
    Were the blades at the Feb SPP as well? The HP article noted, for the 554FLB adapter for example, that the issue was seen in firmware older than 4.1.xxx. The Feb SPP comes with the 4.2.x firmware, and the HP custom ISO came with 4.2 drivers as well.

    We standardized on the Feb SPP, but we stayed at 3.6 VC and OA firmware.

    • 

      Hi Chris – Good question. My understanding is that the bug was triggered by the NIC firmware and not so much the drivers. Yes, the blades were all updated with the Feb SPP (firmware) using the HP EFM. The OAs and VCs were also updated at that time using the same SPP. The blades have HP NC553i Emulex network adapters. I’m still actually in the process of completing updates. Once I return to the office, I will double-check and confirm what the NIC firmware version was for the blades (can’t recall at the moment).

    • 

      Either way, you raise a good point that the Feb SPP appears to include firmware newer than the version mentioned in the advisory. It has me wondering whether (a) the EFM did not successfully update the NIC firmware earlier this year using the Feb 2013 SPP, (b) the NIC firmware update did not fix the bug, or (c) HP Support misdiagnosed which bug was affecting the enclosure. Hmmmmmm…will update after looking into it.

      One thing worth mentioning is that the newer 4.01 VC firmware includes a fix to limit the impact to just the blade that triggers the pause frame issue, so that it doesn’t impact the whole enclosure.

    • 

      Hi Chris – Just verified that the blades in the enclosure were running NC553i firmware version 4.2.401.6, newer than what was mentioned in the HP advisory. I just opened another case with HP to get clarification on why we hit the bug even with newer NIC firmware on all blades.

  2. 

    After escalating this issue, I was told that the bug can still occur if you have newer firmware but older drivers (i.e., if you’ve updated firmware but kept the inbox drivers that come with the HP customized ESXi ISO). I asked if the advisory could be updated to make this clearer.
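
    In other words, firmware at or above the fixed level is not enough on its own; the driver has to match the recipe too. A tiny sketch of that kind of cross-check is below. The 4.2.401.6 firmware value is what our blades reported; the driver threshold and the per-host data are placeholders, since the real targets should come from the current vibsdepot.hp.com recipe.

        # Minimal sketch: flag hosts whose NIC firmware is new enough but whose
        # driver is still older than the recipe target. The thresholds and the
        # per-host data below are placeholders, not the real recipe values.
        def ver(s):
            """Turn a dotted version string into a comparable tuple of ints."""
            return tuple(int(part) for part in s.split("."))

        MIN_FIRMWARE = ver("4.1.0.0")   # advisory: issue seen below 4.1.x
        MIN_DRIVER = ver("4.2.0.0")     # placeholder driver target from the recipe

        # host -> (installed NIC firmware, installed NIC driver version); sample data
        inventory = {
            "esx01": ("4.2.401.6", "4.0.355.1"),
            "esx02": ("4.2.401.6", "4.2.324.0"),
        }

        for host, (firmware, driver) in inventory.items():
            if ver(firmware) >= MIN_FIRMWARE and ver(driver) < MIN_DRIVER:
                print(f"{host}: firmware {firmware} is current, but driver {driver} "
                      "is below the recipe target; still exposed per HP Support")
            else:
                print(f"{host}: OK (firmware {firmware}, driver {driver})")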

  3. 

    Hey there! I could have sworn I’ve been to this site before, but after reading through some of the posts I realized it’s new to me. Anyways, I’m definitely happy I found it and I’ll be bookmarking and checking back often!
