Archives For VMware

For those with networks that use Cisco OTV with Nexus 7Ks to extend Layer 2 connectivity between sites, be aware that there is a bug (CSCuq54506) that may cause brief network connectivity issues for VMs that are vMotioned between the sites. The first symptom you may notice is that a VM appears to drop pings or lose connectivity for roughly 1-2 minutes after it is vMotioned between sites.  Following a vMotion, the destination ESXi host sends RARP traffic to notify the switches so they can update their MAC tables. When this bug occurs, the RARP updates don’t reach all of the switches at the source site.  (Note: A missing portfast setting on the source or destination host switch ports can cause similar-looking symptoms, so it’s a good idea to confirm portfast is enabled on all of those ports.)

Troubleshooting steps you can run from the VMware side to help identify whether you are hitting the bug:

  • Start running two continuous pings to a test VM:  one continuous ping from the site you are vMotioning from, and one continuous ping from the site you are vMotioning to.
  • vMotion the test VM from one site to the other.
  • If the continuous ping at the source site (the site the VM was vMotioned from) drops for 30-60 seconds, but the continuous ping at the destination site (the site the VM was vMotioned to) stays up or drops only a single packet, you may want to work with Cisco TAC to determine whether the root cause is this bug.
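The comparison in the last step can be sketched in Python. This is only a rough sketch, not a diagnostic tool: it assumes you have already recorded one reply/timeout result per ping (at one ping per second) from each site, and the function names and thresholds are my own illustration based on the symptoms described above.

```python
from itertools import groupby


def longest_outage(replies):
    """Return the longest run of consecutive lost pings.

    `replies` is a list of booleans, one per ping sent at a fixed
    interval: True for a reply, False for a timeout.
    """
    longest = 0
    for got_reply, run in groupby(replies):
        if not got_reply:
            longest = max(longest, len(list(run)))
    return longest


def matches_bug_pattern(source_replies, dest_replies,
                        source_threshold=30, dest_threshold=1):
    """True when the source-site ping shows a long outage while the
    destination-site ping loses at most `dest_threshold` pings in a row.

    Thresholds are counted in pings and are assumptions, not values
    published by Cisco or VMware.
    """
    return (longest_outage(source_replies) >= source_threshold
            and longest_outage(dest_replies) <= dest_threshold)


# Example: 45 lost pings seen from the source site, 1 from the destination.
source = [True] * 10 + [False] * 45 + [True] * 10
dest = [True] * 10 + [False] * 1 + [True] * 54
print(matches_bug_pattern(source, dest))
```

A match only tells you the outage is one-sided, which is the pattern worth taking to Cisco TAC. It does not confirm the bug.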

 

Tips when preparing for the vCAC 6.1 upgrade:

  • You can typically find some good blog posts with step-by-step guides for this type of work. http://virtumaster.com/ is just one example.  It can also be helpful to do a quick search to see what KBs have come out so far for the new release. Get to googling!
  • There are some very friendly folks in the virtualization community that enjoy sharing their tech experiences with others, so keep an eye on social media chatter regarding the upgrade (ahem, Twitter).
  • Test the upgrade in the lab. (Though, unfortunately, if your lab is not identical to production, you may not be able to test for all potential issues. In my case, the biggest bug I hit did not occur in the lab but was seen in the production environment.)
  • Snapshot, snapshot, snapshot. Especially the IaaS server. And back up the IaaS DB.

 
A few issues to be aware of when upgrading to 6.1:

Issue #1 –  If you try to upgrade the ID appliance and get “Error: Failed to install updates (Error while running installation tests…” and you see errors in the logs about conflicts with the certificate updates, try the following:

SSH to the ID appliance and run rpm -e vmware-certificate-client. Then try running the update again. Thanks to @stvkpln for sharing the fix.

Issue #2 – If you are going through the IaaS upgrade and get the following error near the end of the wizard (before the upgrade install even begins):

[Image: exception]

Check to see how many DEM registry keys you have in the following location:  HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\VMware, Inc.\VMware vCloud Automation Center DEM

If you see extra DEM or DEO keys (i.e., you only have 2 DEM workers installed on the server but you see 3 DEM worker keys), this may be related to your issue.
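If you have more than a couple of DEMs, the comparison above gets tedious by eye. Here is a minimal Python sketch of the idea. It assumes you have already exported the registry and written down, for each instance subkey, the DEM name it records. The subkey names, DEM names, and function name below are all hypothetical, and the real layout under the DEM key may differ.

```python
from collections import defaultdict


def find_extra_dem_keys(registry_entries, installed_names):
    """Compare registry subkeys against the DEMs actually installed.

    `registry_entries` maps a registry subkey name to the DEM name
    recorded inside it (hypothetical layout, filled in by hand).
    Returns (duplicates, orphans):
      duplicates: {dem_name: [subkeys]} for names with more than one key
      orphans:    subkeys whose DEM name is not installed at all
    """
    keys_by_name = defaultdict(list)
    for subkey, dem_name in registry_entries.items():
        keys_by_name[dem_name].append(subkey)

    installed = set(installed_names)
    duplicates = {name: sorted(keys)
                  for name, keys in keys_by_name.items() if len(keys) > 1}
    orphans = sorted(key
                     for name, keys in keys_by_name.items()
                     if name not in installed
                     for key in keys)
    return duplicates, orphans


# Hypothetical example: two DEM workers installed, three worker keys present.
entries = {
    "DEMInstance1": "DEO-01",
    "DEMInstance2": "DEM-Worker-01",
    "DEMInstance3": "DEM-Worker-02",
    "DEMInstance4": "DEM-Worker-02",
}
duplicates, orphans = find_extra_dem_keys(
    entries, ["DEO-01", "DEM-Worker-01", "DEM-Worker-02"])
print(duplicates, orphans)
```

This only points at suspicious keys. Deciding which of two duplicate keys is the stale one still means checking the version and install path stored in each.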

Workaround:

Option 1 (remove duplicate keys):

  • Export the DEM registry key to back it up: HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\VMware, Inc.\VMware vCloud Automation Center DEM
  • Check the contents of the registry keys that match your installed DEMs for version and install path information.
  • Remove the duplicate DEMInstance keys for the DEO and DEM.
  • Run the upgrade.

Option 2 (remove/reinstall):

  • Remove all DEMs from the machine.
  • Remove the DEM registry keys.
  • Run the upgrade.
  • Install the 6.1 DEMs with the same names as the 6.0 DEMs.

I would recommend going with Option 2 (especially if it is difficult to confirm by looking at the contents which keys match the installed DEMs). Thanks to @virtualjad and VMware engineering for sharing the workaround.

Issue #3 – Make sure to import the vCO package update for vCAC 6.1, as mentioned in KB 2088838, especially if you use stub workflows.

 

If you are importing the HP Smart Array (hpsa) controller driver in Update Manager (VUM), be aware of the following issue, especially if your VUM repository already contains a previous version of the HP hpsa driver.

Recently, I imported the latest HP Smart Array (hpsa version 5.0.0.60-1) driver in VUM, created a new baseline and attached it to hosts. I noticed the compliance for this new baseline changed to green/compliant right away, though I knew these hosts did not have the latest hpsa driver update. After scanning the hosts again, the baseline still showed as green and the details showed “Not Applicable”. When I took a further look at the VUM repository, it appeared that the old version of the hpsa driver, version 5.0.0.44, which I then recalled was in the repository before I imported the newer version, was no longer there. There was only one HP hpsa host extension that could be seen in the VUM repository GUI, and its release date was consistent with the hpsa 5.0.0.60-1 driver.  It was almost as if the two hpsa driver versions had merged in VUM.

The root cause? It appears that the HP hpsa patch name and patch ID in the metadata were the same across various hpsa driver version releases, and had not been made unique by the vendor (hpsa driver for ESX, Patch ID hpsa-500-5.0.0 for multiple versions). In addition, VUM did not warn me that I was trying to import a patch with a patch name and patch ID that were already in the VUM repository.


Since Update Manager 5.0 does not let you remove items from the repository, the solution that is often proposed is to reinstall Update Manager and start clean. However, I was provided with a workaround for the issue that did not require an immediate reinstall of VUM, though eventually a reinstall will be required to clean up the repository. Hopefully, HP and VMware engineers will make the improvement needed to prevent this type of issue (and make it easier to remove items from the repository).

Workaround – Before you import the latest hpsa driver bundle, extract the files from the bundle. Tweak the following lines in the .xml files below. Zip the files again.

For example, for the hpsa 5.0.0.60-1 bundle, modify the patch ID tag in the following two files (i.e., add the exact version number to the end of the ID):

• hpsa-500-5.0.0-1874739\hpsa-500-5.0.0-offline_bundle-1874739\metadata\vmware.xml
• hpsa-500-5.0.0-1874739\hpsa-500-5.0.0-offline_bundle-1874739\metadata\bulletins\hpsa-500-5.0.0.xml


As you import the modified bundle in VUM, you should be able to see in the import wizard if this worked because the patch ID you see in the wizard should match the patch ID you assigned in the steps above.  Try these steps in the lab, use at your own risk, and let me know if you find any issues with the workaround.
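If you would rather not hand-edit the XML, the ID change can be sketched in Python. This is a minimal sketch under assumptions: it assumes the patch ID lives in an element named `<id>` (the tag name is my assumption), the sample XML is a made-up fragment rather than the real HP metadata, and extracting and re-zipping the bundle is left to you.

```python
import re


def rewrite_patch_id(xml_text, old_id, new_id):
    """Replace the text of every <id> element equal to `old_id`.

    Returns (new_xml_text, number_of_replacements) so the caller can
    confirm that something was actually changed in each file.
    """
    pattern = re.compile(
        r"(<id>\s*)" + re.escape(old_id) + r"(\s*</id>)")
    return pattern.subn(
        lambda m: m.group(1) + new_id + m.group(2), xml_text)


# Made-up fragment for illustration; the real files contain much more.
sample = "<bulletin><id>hpsa-500-5.0.0</id><vendor>HP</vendor></bulletin>"
updated, count = rewrite_patch_id(
    sample, "hpsa-500-5.0.0", "hpsa-500-5.0.0.60-1")
print(count, updated)
```

Run it against both metadata files and check that the count is non-zero for each. If the old ID also appears elsewhere in the metadata, this sketch will not catch it.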

Speaking of the HP Smart Array controller ESXi driver, make sure to check out this advisory if you have not seen it already: http://bit.ly/1mVFHQK

When migrating virtual or physical servers from one data center to another, especially if you are moving from Cisco Catalyst to Nexus switches, it’s helpful to be aware of the concept of Proxy ARP.  Here is a link to a Cisco article that explains Proxy ARP:

http://www.cisco.com/c/en/us/support/docs/ip/dynamic-address-allocation-resolution/13718-5.html.

If Proxy ARP is enabled on a switch/router, it can hide or mask misconfigured default gateways/subnet masks on servers.  A switch/router with this setting enabled can help servers reach devices in other subnets, even if the configured default gateway on a server is incorrect. Once the misconfigured servers are moved to network equipment that has Proxy ARP disabled, the servers will no longer be able to communicate with devices in other subnets.  Proxy ARP is enabled by default on Catalyst switches and disabled by default on Nexus switches.  Make sure to review the Proxy ARP settings in both the originating and destination data center.  If this setting will be disabled at the destination site, run a script to check default gateways and subnet masks on servers before beginning a migration.
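The pre-migration check mentioned above could look something like the following Python sketch. It assumes you have already collected each server's IP address, subnet mask, and default gateway (how you gather them is up to you), and the function name and sample addresses are purely illustrative.

```python
import ipaddress


def check_server_network(ip, netmask, gateway, expected_prefix=None):
    """Return a list of problems found in one server's IP settings.

    Flags a default gateway that is not inside the server's own subnet
    and, optionally, a subnet mask that differs from the expected prefix
    length for that VLAN.
    """
    problems = []
    network = ipaddress.ip_network(f"{ip}/{netmask}", strict=False)
    if ipaddress.ip_address(gateway) not in network:
        problems.append(f"gateway {gateway} is outside {network}")
    if expected_prefix is not None and network.prefixlen != expected_prefix:
        problems.append(
            f"prefix /{network.prefixlen} differs from expected "
            f"/{expected_prefix}")
    return problems


# Illustrative addresses only.
print(check_server_network("10.1.20.15", "255.255.255.0", "10.1.20.1"))
print(check_server_network("10.1.20.15", "255.255.255.0", "10.1.30.1"))
print(check_server_network("10.1.20.15", "255.255.0.0", "10.1.20.1",
                           expected_prefix=24))
```

The third case is the sneaky one: a mask that is too wide makes the server ARP directly for hosts in other subnets, which works only as long as something answers on their behalf.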

If you are a VMware user and have not checked out your local VMUG, I highly encourage you to check out an event in your area (www.vmug.com).  It’s a great way to connect with other users in the VMware community and hear about the latest products/solutions from vendors.

They have recently launched a new program called FeedForward.  Mike Laverick, who has been promoting this initiative in the community, has a blog post about it here:  http://www.mikelaverick.com/2014/04/coming-now-feedforward/.  It’s a mentoring program for users who are interested in sharing a presentation at a VMUG event.  While encouraging users to share with others in the VMUG, the program also provides users with an opportunity to hear feedback on their presentation before sharing it with others.  It’s great to see an initiative that would help support and drive user participation, since I found this to be lacking in my own limited experience attending VMUGs.  Sure, hearing from vendors is important, but it is extremely helpful to hear directly from other admins and engineers who are using the technology in the field (the good and the bad!).  After all, why spend unnecessary time trying to “reinvent the wheel” when there may be others out there who have encountered similar issues?

Full Disclosure: I’ve never presented or co-presented at a VMUG, and unfortunately due to upcoming obligations, it looks like I won’t be able to volunteer anytime soon. However, it’s definitely something I would consider in the future, and I’ll definitely blog here about my experience if that happens.

If you are interested in presenting at a VMUG or being a mentor, you can sign up at the following page:  http://www.vmug.com/feedforward.

…written from the perspective of a Virtualization Engineer.  A very special thanks to Networking Guru @cjordanVA for being a key contributor on this post.

Overlay Transport Virtualization (OTV), which is a Cisco feature that was released in 2010, can be used to extend Layer 2 traffic between distributed data centers.  The extension of Layer 2 domains across data centers may be required to support certain high availability solutions, such as stretched clusters or application mobility.  Instead of traffic being sent as Layer 2 across a Data Center Interconnect (DCI), OTV encapsulates the Layer 2 traffic in Layer 3 packets.  There are some benefits to using OTV for Layer 2 extension between sites, such as limiting the impact of unknown-unicast flooding.  OTV also allows for FHRP Isolation, which allows the same default gateway to exist in the distributed data centers at the same time.  This can help reduce traffic tromboning between sites.

When planning an OTV implementation in an enterprise environment with existing production systems, here are a few things to include in the testing phase when collaborating with other teams:

  • Set up a conference call for the OTV implementation day and share this information with the Infrastructure groups involved in the implementation and testing, i.e., Network, Storage, Server, and Virtualization engineers.  This will allow the staff involved to easily communicate when performing testing following the change.
  • Test pinging physical server interfaces by IP address at one datacenter from the other datacenter, and from various subnets.  Can you ping the interface from the same site, but not from the other site? (Make sure to establish a baseline before implementation day.)  Is your monitoring software at one site randomly alerting that it cannot ping devices at the other site?
  • If your vCenter Server manages hosts located in multiple data centers, was vCenter able to reconnect to ESXi hosts at the other datacenter (across the DCI) after OTV was enabled?
  • If you have systems that replicate storage/data between the data centers, test this replication after OTV is enabled and verify it completes successfully.
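For the ping tests, it helps to compare results against the baseline mechanically rather than from memory. A minimal Python sketch follows, assuming you record each test as a (source site, target IP) pair with a pass/fail result. The site names, addresses, and function name are made up for illustration.

```python
def reachability_regressions(baseline, after):
    """Return the (source, target) pairs that answered ping in the
    baseline but fail (or were not tested) after the change."""
    return sorted(pair for pair, reachable in baseline.items()
                  if reachable and not after.get(pair, False))


# Illustrative data: results recorded before and after enabling OTV.
baseline = {
    ("site-a", "10.1.20.15"): True,
    ("site-b", "10.1.20.15"): True,
    ("site-b", "10.1.30.40"): False,  # already unreachable beforehand
}
after = {
    ("site-a", "10.1.20.15"): True,
    ("site-b", "10.1.20.15"): False,
    ("site-b", "10.1.30.40"): False,
}
print(reachability_regressions(baseline, after))
```

Targets that were already failing before the change are ignored, which is the whole reason to take the baseline first.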

 

Be aware of a couple of gotchas:

ARP aging timer/CAM aging timer – Make sure to set the ARP aging timer lower than the CAM aging timer to prevent traffic from getting randomly blackholed.  This is an issue to watch out for if OTV is being implemented in a mixed Catalyst/Nexus environment, and will not likely be an issue if the environment is all Nexus.  The default times for the aging timer depend on the Cisco platform.  The default for a Catalyst 6500 is different than the default for a Nexus 7000.

Symptoms of an aging timer issue:  You will more than likely see failures during the ping tests mentioned above, or you may see intermittent issues with establishing connectivity to certain hosts.
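The timer rule itself is a simple comparison, so it is easy to run across a list of devices. Here is a small Python sketch. The device names and timer values are example inputs I made up for illustration, so read the real timers from your own switches.

```python
def arp_refreshes_before_cam_expires(arp_timeout_s, cam_aging_s):
    """True when the ARP timer is lower than the CAM aging timer, i.e.
    the ARP entry is refreshed before the MAC entry would age out."""
    return arp_timeout_s < cam_aging_s


# Example values only (assumptions for illustration, in seconds).
devices = {
    "catalyst-agg": {"arp_timeout_s": 14400, "cam_aging_s": 300},
    "nexus-agg": {"arp_timeout_s": 1500, "cam_aging_s": 1800},
}

at_risk = sorted(
    name for name, timers in devices.items()
    if not arp_refreshes_before_cam_expires(
        timers["arp_timeout_s"], timers["cam_aging_s"])
)
print(at_risk)
```

Any device that shows up in the list has an ARP timer that is not lower than its CAM timer, which is the condition to fix before implementation day.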

MTU Settings – Since OTV encapsulation adds bytes to each packet and also sets the do not fragment (“DF”) bit, a larger MTU will need to be configured on every interface along the path of an OTV-encapsulated packet.  Check the MTU settings prior to implementation, and again if issues arise when OTV is rolled out.  If the MTU settings are configured properly and issues persist, consider rebooting the OTV edge devices as a troubleshooting step, since the settings may not have actually been applied (it’s happened).

Symptoms of an MTU-related issue:  If you have a vCenter server in one data center that manages hosts at the other datacenter, it may not be able to reconnect to the hosts at the other data center.  Storage replication may not complete successfully after OTV has been enabled.
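To make the MTU check concrete, here is a short Python sketch that flags interfaces too small to carry an encapsulated full-size frame. The 42-byte overhead is the figure commonly quoted for OTV, but treat it as an assumption and confirm it against Cisco's documentation for your release. The interface names and MTU values are invented.

```python
# Assumption: commonly quoted OTV encapsulation overhead, in bytes.
OTV_OVERHEAD_BYTES = 42


def undersized_interfaces(interface_mtus, host_mtu=1500,
                          overhead=OTV_OVERHEAD_BYTES):
    """Return the interfaces whose MTU cannot carry a full-size host
    packet plus the encapsulation overhead.

    Because the DF bit is set, such packets are dropped rather than
    fragmented.
    """
    required = host_mtu + overhead
    return {name: mtu for name, mtu in interface_mtus.items()
            if mtu < required}


# Invented interfaces along the OTV path.
path = {"otv-join": 1542, "core-uplink": 1500, "dci": 9216}
print(undersized_interfaces(path))
```

Small packets such as pings will pass through an undersized interface just fine, which is why MTU problems tend to show up first in things like vCenter host reconnects and storage replication.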

The other day while logged into my vCenter Orchestrator 5.5 client, I noticed that there were several packages and workflows missing from the library.  The images below show how it looked with the missing packages/workflows.  For example, you can see most of the subfolders are missing from the vCenter folder, and the vCenter package is completely gone from the library.

[Images: missing workflows, missing packages]

I logged into the Orchestrator configuration web interface and verified that everything looked correct and “green”, and then I tried rebooting the appliance to see if that would make the missing contents appear in the GUI.  After searching VMware communities and talking with GSS, I found out that this is apparently a common issue with vCO 5.5.

The following steps resolved the issue for me:

  • Log into the vCO configuration web interface.
  • Go to the ‘Troubleshooting’ tab and select ‘reset to current version’ to reinstall the plugins.
  • Go into ‘Startup Options’ and select ‘Restart the Configuration server’.
  • Log back into the vCO configuration web interface, go to ‘Startup Options’ again, and select ‘Restart Service’.

Link to communities thread related to this issue: https://communities.vmware.com/thread/468000

The images below show how it all looked after completing these steps.  Back to normal!

[Images: workflows present, packages present]

Oh, and by the way, if you’re impatient like me and you try to log in to vCO immediately after completing the steps above, you may get the following error.  If you do, wait a few minutes and try again.

[Image: node error]

This post is a bit late since vCAC 6.0.1 (Service Pack 1) was just released.  However, I wanted to share some of the issues I came across during the installation and setup of vCloud Automation Center (vCAC) 6.0.  I have not yet had the opportunity to upgrade to 6.0.1, but I’m hoping one or more of the issues below has been fixed or at least identified.

      • After setting up two identity stores on the vCAC Identity/SSO appliance, one for a parent domain and one for a child domain, I had an issue authenticating to the parent domain when identity stores used LDAP ports 389 or 636.  The issue only occurred when the user had an account in both domains and the username was the same for both.  No longer had this issue when switching to LDAP Global Catalog ports 3268 or 3269.  (Verified that there was no issue authenticating and binding to the same domain controller using the same service account via ports 389 and 636 when testing with ldp.exe.)
      • Have not found documentation for changing the vCAC service account password.  This assumes the same service account is being used for four vCAC IaaS services, one or more vCAC identity stores, and vCAC endpoint credentials.  When I attempted to change the password for all of these, it broke vCAC, forcing me to revert the IaaS server back to its original state and reinstall the IaaS components.  Note: This brings me to some of the best advice I can give someone performing a vCAC installation – SNAPSHOT THE IaaS SERVER!!  I usually take a snapshot once before the pre-reqs, and once before installing the actual components.
      • Service Account used for vCAC endpoint credentials cannot use a password containing ‘=’ sign at the end.
      • Cannot add Active Directory security group that contains spaces to vCAC for assigning permissions.
      • When adding Active Directory security groups to vCAC to assign permissions for Business Groups, vCAC is not able to “pull up”/discover the group  (like it does for domain user accounts).  It does, however, work, provided the group really exists and the group name does not contain spaces.
      • When using a vCloud Suite Standard license, there is no option in the GUI to add a vCO Endpoint.  This was a big one for me.
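A couple of the items above are easy to check before you ever open the vCAC console. The Python sketch below covers just those two: the trailing ‘=’ in the endpoint password and spaces in AD group names. The function name and sample values are hypothetical, and the checks reflect only what I ran into on 6.0.

```python
def vcac_60_preflight(endpoint_password, ad_group_names):
    """Return warnings for the two input-related issues listed above:
    an endpoint password ending in '=' and AD group names with spaces."""
    warnings = []
    if endpoint_password.endswith("="):
        warnings.append("endpoint password ends with '='")
    for group in ad_group_names:
        if " " in group:
            warnings.append(f"group name contains spaces: {group!r}")
    return warnings


# Hypothetical inputs.
print(vcac_60_preflight("Secret123=", ["vCAC-Admins", "Cloud Users"]))
```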

My commute to work, like many others in the DC/Maryland/Virginia area, is a long one.  After years of sticking to music or local news radio as a distraction during the drive, I finally decided to give podcasts a try (always late to the party, but better late than never).  In the past couple of months, I’ve started listening to several virtualization/VMware/tech podcasts on a regular basis.  Like Twitter, it’s another great way to keep current on the latest tech trends and hear what peers and experts have to say.  Here is a short list of the podcasts I’ve listened to so far.  I’ll probably be updating this list as time goes on or creating a new “Podcasts” tab on my site to keep track of the podcasts I enjoy listening to.

VMware Communities Roundtable Podcast, John Mark Troyer, Mike Laverick – http://www.talkshoe.com/tc/19367.  This was the first podcast I checked out, and I can see why it’s so popular!  Listen to the latest from the VMware perspective. Still trying to catch it live…

VUPaaS (Virtualization User Podcast as a Service), Gurusimran “GS” Khalsa, Chris Wahl, Josh Atwell – http://vupaas.com.  This is an awesome, newer podcast.  It’s really easy to relate to the discussions from a Sys Admin point of view.  Very technical and useful info. For example, after hearing GS mention “The Phoenix Project”, I ended up reading this IT novel about DevOps over the holidays.  Was also led to the “Packet Pushers” podcast after it was mentioned in one of the episodes.

Adapting IT, Lauren Malhoit – http://www.adaptingit.com.  Great conversations about tech!  Relatable and very easy to listen to.  It’s also great to see a podcast that is showcasing women in IT.

Geek Whisperers, John Mark Troyer, Amy Lewis, Matthew Brender – http://geek-whisperers.com/.  This is an interesting podcast and  it provides perspectives about Social Media and tech that are new to me.

Packet Pushers, Greg Ferro, Ethan Banks – http://packetpushers.net.  Listening to what those Network folks talk about.  No, I’m not a Network Engineer, so some of the conversation flies above my head.  But I’m always interested in getting the big picture when it comes to datacenter infrastructure, and it helps to get familiar with the network technologies and terms that are being discussed.  I found the two part deep dive on OTV to be very interesting.

vSoup, Christian Mohn, Chris Dearden, Ed Czerwin – http://vsoup.net/.  One more podcast I checked out after listening to VUPaaS (thanks again GS).

If there are other tech podcasts that you would like to recommend, please leave a comment.  Would also like to start listening to some non-tech podcasts, as time allows….

If you run VMware on HP Proliant servers, then you are probably familiar with http://vibsdepot.hp.com.  In addition to HP customized VMware ESXi ISOs and software bundles, this site also has what HP refers to as VMware firmware and software “recipes”.  The “recipes” list the drivers and firmware that HP recommends running along with a specified Service Pack for Proliant (SPP) and certain ESXi versions.  While applying newer firmware and drivers to HP Blade enclosures can be a pain, it’s a good idea to perform these updates 1-2 times a year since each SPP is only supported for 1 year.

Stacy’s Example:

In the following example, I used the September 2013 “recipe” to apply updates to HP C7000 Blade Enclosures that were already running ESXi 5.0 Update 2 hosts.  There is more than one way to apply these updates, but this is the method I found the easiest.

  • Each HP Blade Enclosure was updated one at a time.
  • For each enclosure, updates were applied to the Onboard Administrators, the Virtual Connect Flex-10 Ethernet modules, and the blades themselves.  (FC switches in the enclosures were handled separately.)
  • Performed the steps detailed below for each enclosure.
  • Note: If your hosts have FC HBAs, check with your storage vendor as well to see if they support the new HBA firmware/drivers.
Blade Driver Updates – VUM
  • Created new VMware Update Manager (VUM) HP Extension/driver baselines based on the September 2013 HP “recipe” (vibsdepot.hp.com).  Reviewed host hardware for each cluster (i.e., looked at network adapters, RAID controllers, latest offline bundle, etc.) to determine the appropriate drivers to include in the baselines.
  • Attached the appropriate baselines to the appropriate clusters (again based on the hardware for each cluster and the “recipe”), and scanned.
  • Placed all ESXi hosts in the enclosure to be updated in maintenance mode. (It’s great if you are able to shut down and update all blades in the enclosure at once, but not everyone will have this luxury)
  • Suspended alerting for the hosts.
  • Remediated the hosts in the blade enclosure using the VUM baselines (Host Extensions).
Blade Firmware Updates – EFM
  • Used the Enclosure Firmware Management (EFM) feature to update blade firmware.  EFM can mount an SPP ISO via URL, where it is hosted on an internal server running IIS.  Prior to updating blade firmware, updated the SPP ISO on the IIS server and re-mounted the ISO in EFM.
  • Shut down the hosts (which were still in maintenance mode) using the vSphere client.
  • Once the hosts were shut down, used the HP EFM feature to manually apply firmware updates.
  • After the firmware updates completed (could take an hour), clicked on Rack Firmware in the OA and reviewed the current version/Firmware ISO version.
Virtual Connect (VC) and Onboard Administrator (OA) Updates – HPSUM from desktop
  • Temporarily disabled the Virtual Connect Domain IP Address (optional setting) in the Virtual Connect Manager in order for HPSUM to discover the Virtual Connects when the Onboard Administrator is added as a target (yes, HP bug workaround).
  • Ran HP SUM from the appropriate HP SPP from desktop.
  • Added Active OA hostname OR IP address as a target, chose Onboard Administrator as type.
  • Blade iLO interfaces, Virtual Connect Manager, and FC Switches were all discovered as associated targets by adding the OA.  For associated targets, de-selected everything except for the Virtual Connect Manager and clicked OK (the iLO interfaces for the blades were updated along with the rest of their firmware using the EFM, and the FC Switch firmware is handled separately).
  • The Virtual Connect Manager may then show as unknown in HPSUM.  Edited that target and changed target type to Virtual Connect, and entered the appropriate credentials.
  • After applying updates to the OAs and VCs, verified they updated to the correct firmware levels.
  • Re-enabled the Virtual Connect Domain IP Address setting.
  • Re-enabled alerting.