Archives For Servers

Ran into a networking-related issue the other day, and while it wasn’t anything new or necessarily exciting, it taught me a lesson about cut-through switching vs. store-and-forward switching.  Sharing this one for my fellow non-Network Engineers:

Symptoms:

Realized (from the Server Engineer perspective) that we were seeing frequent packet receive CRC errors/drops on multiple server interfaces (ESXi hosts, HP Virtual Connects) in a particular datacenter row.  In this case, these devices were connected to Nexus 5Ks, either directly or via Nexus 2Ks.

Lesson Learned:

Depending on the model, a network switch can use either the store-and-forward method or the cut-through method to forward frames.  I believe Cisco Catalyst switches and the Cisco Nexus 7000 series typically use store-and-forward, and the Cisco Nexus 5000 and 2000 series use cut-through.  (Network Engineers: please correct me if that is incorrect.)

Store-and-Forward – When CRC errors are generated, e.g., from a device network interface with a bad cable, the switch buffers and checks each frame before sending it on, so the corruption gets detected and does not get propagated across the network or “switching domain”.

Cut-Through – When CRC errors are generated, e.g., from a device network interface with a bad cable, the corrupted frames may get propagated because a cut-through switch begins forwarding a frame before it has received the entire frame and verified its integrity.  Cut-through switching is good because it reduces switching latency, but it can allow CRC errors to “spread like the plague”.
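
To make the difference a little more concrete, here is a toy Python sketch (my own illustration, not how any real switch hardware is implemented) of the FCS/CRC-32 check a store-and-forward switch can perform because it buffers the whole frame first; a cut-through switch has already started forwarding the frame before this check could run.

```python
# Toy illustration of the FCS (CRC-32) check a store-and-forward switch can run
# because it holds the entire frame in its buffer. Frame layout is simplified;
# this is not how real switch hardware works.
import struct
import zlib

def add_fcs(frame_body: bytes) -> bytes:
    """Append a CRC-32 checksum to the frame body (what the sender does)."""
    fcs = zlib.crc32(frame_body) & 0xFFFFFFFF
    return frame_body + struct.pack("<I", fcs)

def fcs_is_valid(frame: bytes) -> bool:
    """Recompute the CRC over the body and compare it with the stored FCS."""
    body, stored = frame[:-4], struct.unpack("<I", frame[-4:])[0]
    return (zlib.crc32(body) & 0xFFFFFFFF) == stored

good_frame = add_fcs(b"dst-mac|src-mac|payload")
bad_frame = bytearray(good_frame)
bad_frame[10] ^= 0x01  # flip one bit, e.g., corruption from a bad cable

# Store-and-forward: the corrupted frame is dropped at the first switch.
for frame in (bytes(good_frame), bytes(bad_frame)):
    print("forward" if fcs_is_valid(frame) else "drop (CRC error)")

# Cut-through: the frame is already being forwarded before the FCS even
# arrives, so the CRC error is counted downstream instead of being stopped here.
```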

Resolution:

Worked with a friendly Network Engineer to identify the device and interface that was the source of these propagating errors, and then replaced the cable.

There were a couple of good sources of info that were very helpful for this one.

When migrating virtual or physical servers from one data center to another, especially if you are moving from Cisco Catalyst to Nexus switches, it’s helpful to be aware of the concept of Proxy ARP.  Here is a link to a Cisco article that explains Proxy ARP:

http://www.cisco.com/c/en/us/support/docs/ip/dynamic-address-allocation-resolution/13718-5.html

If Proxy ARP is enabled on a switch/router, it can hide or mask misconfigured default gateways/subnet masks on servers.  A switch/router with this setting enabled can help servers reach devices in other subnets, even if the configured default gateway on a server is incorrect. Once the misconfigured servers are moved to network equipment that has Proxy ARP disabled, the servers will no longer be able to communicate with devices in other subnets.  Proxy ARP is enabled by default on Catalyst switches and disabled by default on Nexus switches.  Make sure to review the Proxy ARP settings in both the originating and destination data center.  If this setting will be disabled at the destination site, run a script to check default gateways and subnet masks on servers before beginning a migration.
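
As a rough illustration of that pre-migration check, here is a minimal Python sketch; the server names and addresses in the inventory list are made up, and in practice you would pull each server’s IP/netmask/gateway from your own inventory or tooling.

```python
# Minimal sketch: flag servers whose configured default gateway is not inside
# their own subnet (the kind of misconfiguration Proxy ARP can mask).
# The inventory below is hypothetical; populate it from your own tooling.
import ipaddress

inventory = [
    {"name": "web01", "ip": "10.1.10.25", "netmask": "255.255.255.0", "gateway": "10.1.10.1"},
    {"name": "app02", "ip": "10.1.20.31", "netmask": "255.255.255.0", "gateway": "10.1.10.1"},  # wrong gateway
]

for server in inventory:
    subnet = ipaddress.ip_interface(f"{server['ip']}/{server['netmask']}").network
    gateway = ipaddress.ip_address(server["gateway"])
    if gateway not in subnet:
        print(f"{server['name']}: gateway {gateway} is outside subnet {subnet} - fix before migrating")
```

Anything flagged by a check like this is a server that may be relying on Proxy ARP today and could lose connectivity to other subnets as soon as it lands behind a switch with Proxy ARP disabled.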

My commute to work, like many others in the DC/Maryland/Virginia area, is a long one.  After years of sticking to music or local news radio as a distraction during the drive, I finally decided to give podcasts a try (always late to the party, but better late than never).  In the past couple of months, I’ve started listening to several virtualization/VMware/tech podcasts on a regular basis.  Like Twitter, it’s another great way to keep current on the latest tech trends and hear what peers and experts have to say.  Here is a short list of the podcasts I’ve listened to so far.  I’ll probably be updating this list as time goes on or creating a new “Podcasts” tab on my site to keep track of the podcasts I enjoy listening to.

VMware Communities Roundtable Podcast, John Mark Troyer, Mike Laverick – http://www.talkshoe.com/tc/19367.  This was the first podcast I checked out, and I can see why it’s so popular!  Listen to the latest from the VMware perspective. Still trying to catch it live…

VUPaaS (Virtualization User Podcast as a Service), Gurusimran “GS” Khalsa, Chris Wahl, Josh Atwell – http://vupaas.com.  This is an awesome, newer podcast.  It’s really easy to relate to the discussions from a Sys Admin point of view.  Very technical and useful info. For example, after hearing GS mention “The Phoenix Project”, I ended up reading this IT novel about DevOps over the holidays.  Was also led to the “Packet Pushers” podcast after it was mentioned in one of the episodes.

Adapting IT, Lauren Malhoit – http://www.adaptingit.com.  Great conversations about tech!  Relatable and very easy to listen to.  It’s also great to see a podcast that is showcasing women in IT.

Geek Whisperers, John Mark Troyer, Amy Lewis, Matthew Brender – http://geek-whisperers.com/.  This is an interesting podcast, and it provides perspectives on social media and tech that are new to me.

Packet Pushers, Greg Ferro, Ethan Banks – http://packetpushers.net.  Listening to what those Network folks talk about.  No, I’m not a Network Engineer, so some of the conversation flies above my head.  But I’m always interested in getting the big picture when it comes to datacenter infrastructure, and it helps to get familiar with the network technologies and terms that are being discussed.  I found the two part deep dive on OTV to be very interesting.

vSoup, Christian Mohn, Chris Dearden, Ed Czerwin – http://vsoup.net/.  One more podcast I checked out after listening to VUPaaS (thanks again, GS).

If there are other tech podcasts that you would like to recommend, please leave a comment.  Would also like to start listening to some non-tech podcasts, as time allows….

If you run VMware on HP Proliant servers, then you are probably familiar with http://vibsdepot.hp.com.  In addition to HP customized VMware ESXi ISOs and software bundles, this site also has what HP refers to as VMware firmware and software “recipes”.  The “recipes” list the drivers and firmware that HP recommends running along with a specified Service Pack for Proliant (SPP) and certain ESXi versions.  While applying newer firmware and drivers to HP Blade enclosures can be a pain, it’s a good idea to perform these updates 1-2 times a year since each SPP is only supported for 1 year.

Stacy’s Example:

In the following example, I used the September 2013 “recipe” to apply updates to HP C7000 Blade Enclosures that were already running ESXi 5.0 Update 2 hosts.  There is more than one way to apply these updates, but this is the method I found the easiest.

  • Each HP Blade Enclosure was updated one at a time.
  • For each enclosure, updates were applied to the Onboard Administrators, the Virtual Connect Flex-10 Ethernet modules, and the blades themselves.  (FC switches in the enclosures were handled separately.)
  • Performed the steps detailed below for each enclosure.
  • Note: If your hosts have FC HBAs, check with your storage vendor as well to see if they support the new HBA firmware/drivers.
Blade Driver Updates – VUM
  • Created new VMware Update Manager (VUM) HP Extension/driver baselines based on the September 2013 HP “recipe” (vibsdepot.hp.com).  Reviewed host hardware for each cluster (i.e., looked at network adapters, RAID controllers, latest offline bundle, etc.) to determine the appropriate drivers to include in the baselines.
  • Attached the appropriate baselines to the appropriate clusters (again, based on the hardware in each cluster and the “recipe”), and scanned.
  • Placed all ESXi hosts in the enclosure to be updated in maintenance mode; see the sketch after this list for one way to script this step.  (It’s great if you are able to shut down and update all blades in the enclosure at once, but not everyone will have this luxury.)
  • Suspended alerting for the hosts.
  • Remediated the hosts in the blade enclosure using the VUM baselines (Host Extensions).
Blade Firmware Updates – EFM
  • Used the Enclosure Firmware Management (EFM) feature to update blade firmware.  EFM can mount an SPP ISO via a URL; in this case, the ISO is hosted on an internal server running IIS.  Prior to updating blade firmware, updated the SPP ISO on the IIS server and re-mounted the ISO in EFM.
  • Shut down the hosts (which were still in maintenance mode) using the vSphere client.
  • Once the hosts were shut down, used the HP EFM feature to manually apply firmware updates.
  • After the firmware updates completed (could take an hour), clicked on Rack Firmware in the OA and reviewed the current version/Firmware ISO version.
Virtual Connect (VC) and Onboard Administrator (OA) Updates – HPSUM from desktop
  • Temporarily disabled the Virtual Connect Domain IP Address (optional setting) in the Virtual Connect Manager in order for HPSUM to discover the Virtual Connects when the Onboard Administrator is added as a target (yes, HP bug workaround).
  • Ran HP SUM from the appropriate HP SPP from desktop.
  • Added Active OA hostname OR IP address as a target, chose Onboard Administrator as type.
  • Blade iLO interfaces, Virtual Connect Manager, and FC Switches were all discovered as associated targets by adding the OA.  For associated targets, de-selected everything except for the Virtual Connect Manager and clicked OK (the iLO interfaces for the blades were updated along with the rest of their firmware using the EFM, and the FC Switch firmware is handled separately).
  • The Virtual Connect Manager may then show as unknown in HPSUM.  Edited that target and changed target type to Virtual Connect, and entered the appropriate credentials.
  • After applying updates to the OAs and VCs, verified they updated to the correct firmware levels.
  • Re-enabled the Virtual Connect Domain IP Address setting.
  • Re-enabled alerting.
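
For the maintenance mode step called out in the list above, here is a minimal pyVmomi sketch of how it could be scripted; the vCenter name, credentials, and blade host names are placeholders, and in my case the step was simply done through the vSphere client.

```python
# Minimal pyVmomi sketch: place the ESXi hosts in one blade enclosure into
# maintenance mode before remediating them with VUM baselines.
# The vCenter name, credentials, and host names are placeholders.
import ssl
import time

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ENCLOSURE_HOSTS = {"esx-blade01.example.com", "esx-blade02.example.com"}

def wait_for_task(task):
    # Poll the vCenter task until it completes.
    while task.info.state in (vim.TaskInfo.State.queued, vim.TaskInfo.State.running):
        time.sleep(5)
    if task.info.state != vim.TaskInfo.State.success:
        raise RuntimeError(task.info.error)

ctx = ssl._create_unverified_context()  # lab shortcut; validate certificates in production
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="changeme", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)
    for host in view.view:
        if host.name in ENCLOSURE_HOSTS and not host.runtime.inMaintenanceMode:
            print(f"Entering maintenance mode: {host.name}")
            # timeout=0 waits indefinitely; DRS handles evacuating running VMs
            wait_for_task(host.EnterMaintenanceMode_Task(timeout=0))
finally:
    Disconnect(si)
```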

If you happen to run at least some of your ESXi clusters on blades and you have multiple chassis/enclosures, you may choose to distribute the hosts in these clusters across multiple enclosures.  This is a good idea, in general, for many environments.  After all, even though blade enclosure failures are rare, you don’t want a critical issue with a single enclosure taking down an entire cluster.
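
As a rough sanity check along those lines, here is a small Python sketch; the cluster membership and host-to-enclosure mapping are hard-coded, made-up examples, and in a real environment you would pull them from vCenter and the Onboard Administrators.

```python
# Hypothetical sketch: flag clusters whose hosts all live in a single blade
# enclosure. The mappings below are made up; pull them from vCenter and the
# Onboard Administrators in a real environment.
clusters = {
    "prod-cluster-01": ["esx01", "esx02", "esx03", "esx04"],
    "prod-cluster-02": ["esx05", "esx06"],
}
host_enclosure = {
    "esx01": "enc-A", "esx02": "enc-B", "esx03": "enc-A", "esx04": "enc-B",
    "esx05": "enc-C", "esx06": "enc-C",  # both blades in the same enclosure
}

for cluster, hosts in clusters.items():
    enclosures = {host_enclosure[h] for h in hosts}
    if len(enclosures) < 2:
        print(f"{cluster}: all hosts are in {enclosures.pop()} - one enclosure failure takes out the whole cluster")
    else:
        print(f"{cluster}: spread across {len(enclosures)} enclosures - good")
```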

Recently, I saw one of these rare events impact an enclosure, and it was not pretty.  For Sys Admins – you know that feeling you get when alerts about multiple hosts being “down” come streaming into your inbox.  You think: which hosts, which cluster, perhaps even which site…any commonality?  In this case, the following bug had just hit an HP enclosure:  http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?objectID=c02623029&lang=en&cc=us&taskId=135&prodSeriesId=3794423

This enclosure was NOT extremely out-of-date in terms of firmware and drivers.  The firmware was at a February 2013 SPP level, and the hosts were built from the latest HP ESXi 5.0 U2 customized ISO.

Here is a summary of what was seen when troubleshooting the issue for the impacted enclosure:

  • Both the Onboard Administrators and the Virtual Connect Manager were still accessible – somewhat.  See next bullet.
  • Virtual Connect Manager could be logged into, but was slow to respond.
  • Virtual Connect Manager showed the “stacking link” in a critical state.
  • Virtual Connect Manager also showed the 10Gb aggregated (LAG) uplinks were in an active/passive state as opposed to active/active, which is how they were originally configured.
  • None of the hosts in the enclosure could be pinged.  That is, every single blade lost network connectivity.  They still had FC connectivity to the FC switches.
  • Some of the ESXi hosts were still running, and some had suffered PSODs as a result of the bug.
  • Hosts that were still up eventually saw themselves as “isolated”.  Since the isolation response was set to “shutdown”, impacted VMs (luckily, not that many) were shut down and restarted on non-isolated hosts.  (See the sketch after this list for a quick way to report this setting per cluster.)
  • Exported a log from the Virtual Connect Manager, and HP helped to identify the blade triggering the bug.  The host was shut down, and the blade itself was also “reset”; however, this did not restore normal functionality.
  • Reset one of the Virtual Connect modules. This restored network connectivity for some of the blades, but not all.
  • Some of the blades had to be rebooted in order for network connectivity to be completely restored.
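
As referenced in the list above, here is a small pyVmomi sketch for reporting each cluster’s HA isolation response; the vCenter name and credentials are placeholders.

```python
# Small pyVmomi sketch: report each cluster's HA (vSphere HA/DAS) host
# isolation response setting. Connection details are placeholders.
import ssl

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab shortcut; validate certificates in production
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="changeme", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ClusterComputeResource], True)
    for cluster in view.view:
        das = cluster.configurationEx.dasConfig
        vm_defaults = das.defaultVmSettings
        response = vm_defaults.isolationResponse if vm_defaults else "n/a"
        print(f"{cluster.name}: HA enabled={das.enabled}, isolation response={response}")
finally:
    Disconnect(si)
```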

My plan for preventing this bug from recurring on any enclosure, based on the HP advisory (and, in general, to update everything to the September 2013 HP “recipe” from vibsdepot.hp.com):

  • Using the Enclosure Firmware Management feature to apply the HP September 2013 Service Pack for Proliant (SPP) to each blade
  • Running HPSUM from the latest SPP to update the OA and VC firmware
  • Using Update Manager to apply recommended ESXi NIC and HBA drivers, as well as the latest HP offline bundle.

As a side note, it appears so far that a different, minor HP bug still remains even with these latest updates – as described in Ben Loveday’s blog post: http://bensjibberjabber.wordpress.com/2013/01/09/storage-alert-bug-running-vsphere-on-hp-bl465c-g7-blades/

Sigh….

Last year, I stumbled upon a great post from VCDX/vExpert blogger Chris Wahl about the steps required to upgrade firmware in an HP C7000 BladeSystem using HP SUM.  This is a simple option for updating the Onboard Administrators (OA), Virtual Connects (VC), or iLO firmware in an HP C7000 Blade enclosure.  However, after several months of using HP SUM to update enclosures, I noticed something strange.

The Issue:

HP SUM was no longer discovering the Virtual Connect modules in any of my enclosures when using the OAs as targets.  The rest of the enclosure components were being discovered just fine, and since the VCs were being discovered via the OA in HP SUM before, I wondered…what changed?

I confirmed that I could successfully add the VCs separately as targets in HP SUM.  I also confirmed that the behavior occurred with various versions of HP SUM, including the October 2012 SPP release and the February 2013 SPP release.

The Cause:

After a bit of troubleshooting with HP, the cause was identified: when a “Virtual Connect Domain IP Address” has been enabled for a Virtual Connect Domain, the VCs are no longer discovered when using the OA as a target.  If the “Virtual Connect Domain IP Address” setting was unchecked, HP SUM was again able to discover the VCs by using the OA as a target.  Supposedly, a fix for this bug will be included in a future release of HP SUM.  Until then, a workaround of temporarily disabling the setting, or adding the VCs separately in HP SUM, can be used.
