If you run at least some of your ESXi clusters on blades and have multiple chassis/enclosures, you may choose to distribute the hosts in those clusters across multiple enclosures. In general, this is a good idea for many environments: though this type of blade enclosure failure is rare, you don’t want a critical issue with a single enclosure taking down an entire cluster.
Recently, I saw one of these rare events hit an enclosure, and it was not pretty. Sysadmins know the feeling: alerts about multiple hosts being “down” come streaming into your inbox, and you start asking which hosts, which cluster, perhaps even which site, and whether there is any commonality. In this case, the following bug had just hit an HP enclosure: http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?objectID=c02623029&lang=en&cc=us&taskId=135&prodSeriesId=3794423
This enclosure was NOT badly out of date on firmware and drivers: the firmware was at a February 2013 SPP level, and the hosts were built from the latest HP ESXi 5.0 U2 customized ISO.
Here is a summary of what was seen when troubleshooting the issue for the impacted enclosure:
- Both the Onboard Administrators and the Virtual Connect Manager were still accessible – somewhat. See next bullet.
- Virtual Connect Manager could be logged into, but was slow to respond.
- Virtual Connect Manager showed the “stacking link” in a critical state.
- Virtual Connect Manager also showed the 10Gb aggregated (LAG) uplinks were in an active/passive state as opposed to active/active, which is how they were originally configured.
- None of the hosts in the enclosure could be pinged. That is, every single blade lost network connectivity. They still had FC connectivity to the FC switches.
- Some of the ESXi hosts were still running, and some had suffered PSODs as a result of the bug.
- Hosts that were still up eventually declared themselves “isolated”. Since the isolation response was set to “shutdown”, the impacted VMs (luckily, not that many) were shut down and restarted on non-isolated hosts.
- Exported a log from the Virtual Connect Manager, and HP helped identify the blade triggering the bug. That host was shut down and the blade itself was “reset”, but this did not restore normal functionality.
- Reset one of the Virtual Connect modules. This restored network connectivity for some of the blades, but not all.
- Some of the blades had to be rebooted in order for network connectivity to be completely restored.
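For reference, the per-host checks used while confirming connectivity after each Virtual Connect reset can all be run from the ESXi shell. This is only a sketch: the vmnic name and the gateway address below are placeholders for your own environment.

```shell
# List physical NICs with their link state and driver (ESXi 5.x syntax)
esxcli network nic list

# Show the loaded driver and firmware version for one uplink
# (vmnic0 is a placeholder; substitute your actual uplink names)
ethtool -i vmnic0

# Verify vmkernel connectivity to the default gateway
# (10.0.0.1 is a placeholder address)
vmkping 10.0.0.1
```

Blades that still failed these checks after the Virtual Connect reset were candidates for the full reboot mentioned above.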
My plan for preventing this bug from recurring on any enclosure, based on the HP advisory (and, more generally, to bring everything up to the September 2013 HP “recipe” from vibsdepot.hp.com):
- Using the Enclosure Firmware Management feature to apply the HP September 2013 Service Pack for ProLiant (SPP) to each blade
- Running HPSUM from the latest SPP to update the OA and VC firmware
- Using Update Manager to apply the recommended ESXi NIC and HBA drivers, as well as the latest HP offline bundle.
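Once the SPP and driver updates are applied, it is worth verifying that each host actually landed at the intended “recipe” level. A rough sketch from the ESXi shell follows; the grep pattern and the vmnic parsing are assumptions, so adjust them for your adapters and build:

```shell
# List installed VIBs and filter for HP bundles and common HP blade
# NIC/HBA drivers (the pattern is only an example; match your hardware)
esxcli software vib list | grep -i -E 'hp|be2net|bnx2|qla'

# Print driver and firmware versions for every physical uplink
# (skips the header lines of esxcli output; format may vary by build)
for nic in $(esxcli network nic list | awk 'NR>2 {print $1}'); do
  echo "== ${nic} =="
  ethtool -i "${nic}"
done
```

Comparing this output against the driver versions listed in the vibsdepot.hp.com recipe document makes it easy to spot a host that was missed.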
As a side note, it appears so far that a different, minor HP bug remains even with these latest updates, as described in Ben Loveday’s blog post: http://bensjibberjabber.wordpress.com/2013/01/09/storage-alert-bug-running-vsphere-on-hp-bl465c-g7-blades/