Archives For VMware vSphere

…written from the perspective of a Virtualization Engineer.  A very special thanks to Networking Guru @cjordanVA for being a key contributor on this post.

Overlay Transport Virtualization (OTV), a Cisco feature released in 2010, can be used to extend Layer 2 traffic between distributed data centers.  Extending Layer 2 domains across data centers may be required to support certain high-availability solutions, such as stretched clusters or application mobility.  Instead of traffic being sent as native Layer 2 across a Data Center Interconnect (DCI), OTV encapsulates the Layer 2 traffic in Layer 3 packets.  There are some benefits to using OTV for Layer 2 extension between sites, such as limiting the impact of unknown-unicast flooding.  OTV also supports FHRP isolation, which allows the same default gateway to be active in each data center at the same time, helping to reduce traffic tromboning between sites.

When planning an OTV implementation in an enterprise environment with existing production systems, here are a few things to include in the testing phase when collaborating with other teams:

  • Set up a conference call for the OTV implementation day and share this information with the infrastructure groups involved in the implementation and testing, i.e., Network, Storage, Server, and Virtualization engineers.  This allows everyone involved to communicate easily while testing after the change.
  • Test pinging physical server interfaces by IP address at one data center from the other data center, and from various subnets.  Can you ping an interface from the same site, but not from the other site?  (Make sure to establish a baseline before implementation day; a scripted ping baseline sketch follows this list.)  Is your monitoring software at one site randomly alerting that it cannot ping devices at the other site?
  • If your vCenter Server manages hosts located in multiple data centers, was vCenter able to reconnect to ESXi hosts at the other data center (across the DCI) after OTV was enabled?
  • If you have systems that replicate storage/data between the data centers, test this replication after OTV is enabled and verify it completes successfully.
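
For the ping tests above, it helps to script a repeatable baseline so the exact same checks can be run before and after the OTV change.  Below is a minimal Python sketch; the site names and IP addresses are hypothetical placeholders for your own device lists.

# Minimal cross-site ping baseline (site names and IPs are placeholders).
# Run it before the OTV change to record a baseline, then again afterward.
import subprocess
from datetime import datetime

TARGETS = {
    "site-a": ["10.1.10.21", "10.1.10.22"],   # physical server interfaces at site A
    "site-b": ["10.2.10.21", "10.2.10.22"],   # physical server interfaces at site B
}

def ping(ip):
    """Return True if the host answers a single ICMP echo (Linux ping syntax)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

if __name__ == "__main__":
    print("Ping baseline " + datetime.now().isoformat())
    for site, ips in TARGETS.items():
        for ip in ips:
            status = "OK" if ping(ip) else "FAILED"
            print(f"{site:8} {ip:15} {status}")

Run the same script from a machine at each site (and from a few different subnets) and keep the output, so any post-change failures are easy to spot.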


Be aware of a couple of gotchas:

ARP aging timer/CAM aging timer – Make sure the ARP aging timer is set lower than the CAM aging timer to prevent traffic from getting randomly blackholed.  This is primarily a concern when OTV is implemented in a mixed Catalyst/Nexus environment; it is unlikely to be an issue if the environment is all Nexus.  The default aging timers vary by Cisco platform: the defaults on a Catalyst 6500 differ from those on a Nexus 7000.
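
To double-check the timers on each switch, something like the following minimal sketch can help.  It uses the netmiko library; the device details are placeholders, and the show commands are the common IOS/NX-OS forms, so verify the syntax against your Catalyst/Nexus release.

# Pull the configured MAC (CAM) aging time and any ARP timeout overrides so the
# two can be compared. Device details are placeholders; command syntax varies
# by platform, so adjust for your Catalyst/Nexus version.
from netmiko import ConnectHandler

switch = {
    "device_type": "cisco_nxos",   # use "cisco_ios" for a Catalyst 6500
    "host": "192.0.2.10",          # placeholder management address
    "username": "admin",
    "password": "changeme",
}

conn = ConnectHandler(**switch)
try:
    cam_aging = conn.send_command("show mac address-table aging-time")
    arp_config = conn.send_command("show running-config | include arp timeout")
    print("CAM aging:\n" + cam_aging)
    print("ARP timeout overrides (empty output means the platform default):\n" + arp_config)
finally:
    conn.disconnect()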

Symptoms of an aging timer issue:  You will more than likely see failures during the ping tests mentioned above, or you may see intermittent issues establishing connectivity to certain hosts.

MTU Settings – Since OTV encapsulation adds overhead to every packet and also sets the do-not-fragment (“DF”) bit, a larger MTU will need to be configured on every interface along the path of an OTV-encapsulated packet.  Check the MTU settings prior to implementation, and again if issues arise when OTV is rolled out.  If the MTU settings were configured correctly but problems persist, consider rebooting the OTV edge devices as a troubleshooting step to verify that the MTU settings actually applied and did not get stuck (it’s happened).
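
A quick way to confirm the transport can carry the larger packets end to end is a do-not-fragment ping sized near the expected MTU.  The sketch below shells out to the Linux ping utility (the -M do and -s flags are Linux-specific); the target address and payload sizes are placeholders, so substitute your own addresses and configured MTU.  From an ESXi host, vmkping with the -d and -s options performs a similar test for VMkernel interfaces.

# Probe the path with DF-bit pings of increasing size to see where fragmentation
# would be required. Uses Linux ping flags (-M do sets DF, -s sets payload size);
# the target address and sizes below are placeholders.
import subprocess

TARGET = "192.0.2.1"                 # placeholder: address on the far side of the DCI
PAYLOADS = [1372, 1400, 1472, 1500]  # ICMP payload sizes to try, in bytes

for size in PAYLOADS:
    result = subprocess.run(
        ["ping", "-M", "do", "-c", "2", "-s", str(size), TARGET],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    status = "passed" if result.returncode == 0 else "failed (fragmentation would be required)"
    print(f"DF ping with {size}-byte payload: {status}")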

Symptoms of an MTU-related issue:  A vCenter Server in one data center may be unable to reconnect to the ESXi hosts it manages in the other data center, and storage replication between the data centers may not complete successfully after OTV has been enabled.

The other day while logged into my vCenter Orchestrator 5.5 client, I noticed that there were several packages and workflows missing from the library.  The images below show how it looked with the missing packages/workflows.  For example, you can see most of the subfolders are missing from the vCenter folder, and the vCenter package is completely gone from the library.

[Screenshots: missing workflows, missing packages]

I logged into the Orchestrator configuration web interface and verified that everything looked correct and “green”, and then I tried rebooting the appliance to see if that would make the missing contents appear in the GUI.  After searching VMware communities and talking with GSS, I found out that this is apparently a common issue with vCO 5.5.

The following steps resolved the issue for me:

  • Log into the vCO configuration web interface.
  • Go to the ‘Troubleshooting’ tab and select ‘reset to current version’ to reinstall the plug-ins.
  • Go to ‘Startup Options’ and select ‘Restart the Configuration server’.
  • Log back into the vCO configuration web interface, go to ‘Startup Options’ again, and select ‘Restart Service’.

Link to communities thread related to this issue: https://communities.vmware.com/thread/468000

The images below show how it all looked after completing these steps.  Back to normal!

[Screenshots: workflows and packages restored]

Oh, and by the way, if you’re impatient like me and you try to log in to vCO immediately after completing the steps above, you may get the following error.  If you do, wait a few minutes and try again.

[Screenshot: node error message]

If you run VMware on HP ProLiant servers, then you are probably familiar with http://vibsdepot.hp.com.  In addition to HP customized VMware ESXi ISOs and software bundles, this site also has what HP refers to as VMware firmware and software “recipes”.  The “recipes” list the drivers and firmware that HP recommends running alongside a specified Service Pack for ProLiant (SPP) and certain ESXi versions.  While applying newer firmware and drivers to HP blade enclosures can be a pain, it’s a good idea to perform these updates once or twice a year since each SPP is only supported for one year.

Stacy’s Example:

In the following example, I used the September 2013 “recipe” to apply updates to HP C7000 blade enclosures whose blades were already running ESXi 5.0 Update 2.  There is more than one way to apply these updates, but this is the method I found easiest.

  • Each HP Blade Enclosure was updated one at a time.
  • For each enclosure, updates were applied to the Onboard Administrators, the Virtual Connect Flex-10 Ethernet modules, and the blades themselves.  (FC switches in the enclosures were handled separately.)
  • Performed the steps detailed below for each enclosure.
  • Note: If your hosts have FC HBAs, check with your storage vendor as well to see if they support the new HBA firmware/drivers.
Blade Driver Updates – VUM
  • Created new VMware Update Manager (VUM) HP Extension/driver baselines based on the September 2013 HP “recipe” (vibsdepot.hp.com).  Reviewed the host hardware for each cluster (i.e., network adapters, RAID controllers, latest offline bundle, etc.) to determine the appropriate drivers to include in the baselines.
  • Attached the appropriate baselines to the appropriate clusters (again based on the hardware in each cluster and the “recipe”) and scanned.
  • Placed all ESXi hosts in the enclosure to be updated in maintenance mode.  (It’s great if you are able to shut down and update all blades in the enclosure at once, but not everyone will have this luxury.  A scripted maintenance-mode sketch follows this procedure.)
  • Suspended alerting for the hosts.
  • Remediated the hosts in the blade enclosure using the VUM baselines (Host Extensions).
Blade Firmware Updates – EFM
  • Used the Enclosure Firmware Management (EFM) feature to update the blade firmware.  EFM can mount an SPP ISO via a URL, in this case hosted on an internal server running IIS.  Prior to updating the blade firmware, updated the SPP ISO on the IIS server and re-mounted the ISO in EFM.
  • Shut down the hosts (which were still in maintenance mode) using the vSphere Client.
  • Once the hosts were shut down, used the EFM feature to manually apply the firmware updates.
  • After the firmware updates completed (this could take an hour), clicked Rack Firmware in the OA and reviewed the current versions against the firmware ISO version.
Virtual Connect (VC) and Onboard Administrator (OA) Updates – HPSUM from desktop
  • Temporarily disabled the Virtual Connect Domain IP Address (an optional setting) in the Virtual Connect Manager so that HPSUM could discover the Virtual Connect modules when the Onboard Administrator was added as a target (yes, an HP bug workaround).
  • Ran HPSUM from the appropriate HP SPP on my desktop.
  • Added the active OA hostname or IP address as a target and chose Onboard Administrator as the type.
  • The blade iLO interfaces, Virtual Connect Manager, and FC switches were all discovered as associated targets by adding the OA.  For the associated targets, de-selected everything except the Virtual Connect Manager and clicked OK (the blade iLO interfaces were updated along with the rest of the blade firmware using EFM, and the FC switch firmware was handled separately).
  • The Virtual Connect Manager may then show as unknown in HPSUM.  Edited that target, changed the target type to Virtual Connect, and entered the appropriate credentials.
  • After applying updates to the OAs and VCs, verified they updated to the correct firmware levels.
  • Re-enabled the Virtual Connect Domain IP Address setting.
  • Re-enabled alerting.
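
For the maintenance-mode step mentioned above, here is a minimal pyVmomi sketch.  The vCenter address, credentials, and host names are placeholders, and the actual remediation was still done through the Update Manager client.

# Put every ESXi host in the enclosure being updated into maintenance mode.
# vCenter address, credentials, and host names are placeholders; DRS (or manual
# vMotion) still has to evacuate running VMs for each task to complete.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

VCENTER = "vcenter.example.local"
USER = "administrator@vsphere.local"
PASSWORD = "changeme"
ENCLOSURE_HOSTS = {"esx01.example.local", "esx02.example.local"}

context = ssl._create_unverified_context()   # lab shortcut; use valid certificates in production
si = SmartConnect(host=VCENTER, user=USER, pwd=PASSWORD, sslContext=context)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)
    for host in view.view:
        if host.name in ENCLOSURE_HOSTS and not host.runtime.inMaintenanceMode:
            print("Entering maintenance mode: " + host.name)
            WaitForTask(host.EnterMaintenanceMode_Task(timeout=0))
    view.DestroyView()
finally:
    Disconnect(si)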

If you happen to run at least some of your ESXi clusters on blades, and you have multiple chassis/enclosures, you may choose to distribute the hosts in those clusters across multiple enclosures.  This is a good idea, in general, for many environments.  After all, even though enclosure-level failures are rare, you don’t want a critical issue with a single blade enclosure taking down an entire cluster.

Recently, I saw one of these rare events impact an enclosure, and it was not pretty.  Sys admins, you know that feeling you get when alerts about multiple hosts being “down” come streaming into your inbox.  You think: which hosts, which cluster, perhaps even which site…any commonality?  In this case, the following bug had just hit an HP enclosure:  http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?objectID=c02623029&lang=en&cc=us&taskId=135&prodSeriesId=3794423

This enclosure was NOT extremely out-of-date in terms of firmware and drivers.  The firmware was at a February 2013 SPP level, and the hosts were built from the latest HP ESXi 5.0 U2 customized ISO.

Here is a summary of what was seen when troubleshooting the issue for the impacted enclosure:

  • Both the Onboard Administrators and the Virtual Connect Manager were still accessible – somewhat.  See next bullet.
  • Virtual Connect Manager could be logged into, but was slow to respond.
  • Virtual Connect Manager showed the “stacking link” in a critical state.
  • Virtual Connect Manager also showed the 10Gb aggregated (LAG) uplinks were in an active/passive state as opposed to active/active, which is how they were originally configured.
  • None of the hosts in the enclosure could be pinged.  That is, every single blade lost network connectivity.  They still had FC connectivity to the FC switches.
  • Some of the ESXi hosts were still running, and some had suffered PSODs as a result of the bug.
  • Hosts that were still up eventually saw themselves as “isolated”.  Since the isolation response was set to “shutdown”, impacted VMs (luckily, not that many) were shut down and restarted on non-isolated hosts.
  • Exported a log from the Virtual Connect Manager, and HP helped identify the blade triggering the bug.  The host was shut down, and the blade itself was also “reset”; however, this did not restore normal functionality.
  • Reset one of the Virtual Connect modules. This restored network connectivity for some of the blades, but not all.
  • Some of the blades had to be rebooted in order for network connectivity to be completely restored.

My plan for preventing this bug from recurring on any enclosure, based on the HP advisory (and, more generally, to bring everything up to the September 2013 HP “recipe” from vibsdepot.hp.com):

  • Using the Enclosure Firmware Management feature to apply the HP September 2013 Service Pack for ProLiant (SPP) to each blade
  • Running HPSUM from the latest SPP to update the OA and VC firmware
  • Using Update Manager to apply the recommended ESXi NIC and HBA drivers, as well as the latest HP offline bundle.

As a side note, it appears so far that a different, minor HP bug still remains even with these latest updates, as described in Ben Loveday’s blog post: http://bensjibberjabber.wordpress.com/2013/01/09/storage-alert-bug-running-vsphere-on-hp-bl465c-g7-blades/

Sigh….

Now that I’ve had a few days to recover, I wanted to share my experience from my trip to VMworld 2013.  After having such an amazing time last year, I decided that attending the 10th annual VMworld would be in my best interest, even if it meant “paying my own way”.  It turns out this was a good call!  Unlike last year, I had a better idea of what to expect during my time at the conference.

I arrived in San Francisco on Sunday, August 25th, just in time to participate in the v0dgeball fundraising event.  (Thanks again to @CommsNinja (Amy Lewis) for letting me play for the Cloudbunnies #FearTheEars!)  This was a good opportunity to help raise money for the Wounded Warrior Project playing dodgeball with a bunch of folks in the VMware community.  It turns out that my dodgeball skills are about as good as they were back in 3rd grade, not much improvement there 🙂 Thankfully, I made it through without injury and had a fun time in the tournament.  Congrats to the EMC team on the victory!

I missed the VMworld Opening Ceremony, but fortunately after getting a bit lost I made it to the VMunderground party.  Great event for networking and catching up with everyone in the community!

On Monday, I attended the first keynote, where VMware announced the release of vSphere 5.5 and vCloud Suite 5.5.  VMware continued to talk about the path toward the Software-Defined Data Center (SDDC) and the latest features included with 5.5.  I won’t go into the details since several bloggers did a great job posting live blogs of the keynotes (check out http://blog.scottlowe.org/, for example).  I will say that many of the announcements made during this keynote were not a surprise.  I spent the rest of the day attending sessions for the most part.  I really enjoyed the “group discussion” on HA with Duncan Epping and Keith Farkas; it was nice to give feedback, hear from peers, and learn about possible futures.  Later that evening, I made my way to CXIParty.  @CXI (Christopher Kusek) did a great job putting this together for the community.

On Tuesday, I was able to catch the last part of the second keynote with “Carl and Kit”.  My favorite part of this general session was the vCAC demo, since I will be building out a proof-of-concept environment for vCAC when I return to work.  Like many other VMware customers, I am looking at how certain automation and management tools can take an organization beyond basic virtualization and into a private cloud solution.  I also attended the “Ask the Expert vBloggers” session, which I enjoyed just as much as last year.  Later that evening, I had a great time attending the Veeam and vBacon parties.

Most of my Wednesday was spent preparing for and taking the VCAP5-DCA exam.  I’ll save that experience for a different post, but this may be the last time I mix cert exams with VMworld (a bit too much excitement all at one time).

Wednesday night was the VMworld 2013 party.  VMware did an awesome job putting this party together!  Imagine Dragons and Train performed at the SF Giants’ stadium (AT&T Park).  They basically threw a carnival in the stadium, along with a huge concert, and then topped it all off with fireworks at the end.  I was not too familiar with either of the bands before the party, but I became a fan of Imagine Dragons during their performance.  Not sure how VMware is going to top this one!

I was able to work on some Hands-on Labs on Thursday morning before I left for SFO to head back home.  I did BYOD this year and would highly recommend going that route if you can.  Though I haven’t looked into it yet, I’m assuming I’ll (hopefully) be able to do many of these labs online eventually via Project NEE.

Overall, it was an outstanding VMworld trip!  Very grateful that I was able to catch up with friends I made last year and make new ones.