Archives For NetApp

During the setup of a new vSphere cluster, I had to troubleshoot an issue that was causing latency for the NFS datastores. This new vSphere cluster was attached to a newly set up NetApp storage array with essentially no workloads yet hosted on it.

One of the first symptoms of the latency was noticed when browsing the NFS datastores in the vSphere client GUI. After clicking on Browse Datastore and then clicking on any VM folder, the window would display “Searching datastore…” and often take 40-50 seconds to display the files. Further testing with the NFS datastores confirmed that slowness was also seen when performing file copy operations or certain disk-intensive operations in the guest OS.

Several weeks were spent troubleshooting and working with vendors to try to determine the cause. It was found that when the configuration on the NetApp storage array side (and switches) was changed to use access ports instead of trunk ports, the issue went away. In addition, the issue did not occur when one of the hosts and the NetApp storage array were connected to the same datacenter switch. Jumbo frames were not a factor in this setup.

The cause of the issue was found to be a conflict between the default behavior of NetApp when using VLAN tagging and the QoS configuration on the Nexus core switches. By default, NetApp assigns a CoS value of 4 when using VLAN tagging (trunk ports). This caused the NFS storage traffic to be placed in a bandwidth-limited queue on the switches. A workaround was implemented on the switch interfaces facing the storage array that essentially remapped the CoS value to fit the QoS configuration in the environment.
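
For illustration only, a CoS remap of that sort on a Nexus switch could look something like the sketch below. The class-map, policy-map, and interface names are hypothetical, the target qos-group depends on the existing QoS policy, and this is not the exact change that was made in this environment.

  ! hypothetical NX-OS sketch: remap CoS 4 frames arriving from the NetApp trunk port
  class-map type qos match-all CM-NETAPP-COS4
    match cos 4
  policy-map type qos PM-NETAPP-REMAP
    class CM-NETAPP-COS4
      set qos-group 0
  ! apply inbound on the storage-facing interface (placeholder name)
  interface Ethernet1/10
    service-policy type qos input PM-NETAPP-REMAP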

Here are some links that helped to connect the dots when researching the issue:

https://kb.netapp.com/support/index?page=content&id=3013708&actp=RSS

http://keepingitclassless.net/2013/12/default-cos-value-in-netapp-cluster-mode/

https://communities.cisco.com/message/137658

http://jeffostermiller.blogspot.com/2012/03/common-nexus-qos-problem.html

http://www.beyondvm.com/2014/07/ucs-networking-adventure-a-tale-of-cos-and-the-vanishing-frames/


Hi folks – It’s time to take a quick break from the excitement of the vSphere 5.5 and VSAN announcement to read a blog post about vSphere Metro Storage Clusters (vMSCs, aka stretched clusters)!  Specifically, this post is about what I’ve learned regarding vMSC workload mobility between sites, or downtime avoidance.  Since my vMSC experience is based solely on the NetApp MetroCluster solution, the content below is NetApp-centric.

To take a step back – when you look at the scenarios that would cause all VMs running on a stretched cluster to completely vacate all hosts at one of the sites and (eventually) end up at the other site, there are two major types of events:

Unplanned Site Failover (Disaster Recovery)

  • Example:  Site A, which hosts half of a stretched cluster, goes completely down. This is an unplanned event that results in a hard shutdown of all systems in the Site A datacenter.  Once the Site A LUNs are taken over by the Site B controller and are fully available, VMs that were running at Site A need to be started at Site B.  Some would argue the DR process should be triggered manually (i.e., without MetroCluster TieBreaker).  The following doc is a great reference for testing vMSC failure or DR scenarios if you’re doing a proof of concept:  http://www.vmware.com/files/pdf/techpaper/vSPHR-CS-MTRO-STOR-CLSTR-USLET-102-HI-RES.pdf.

Planned Site Failover (Disaster/Downtime Avoidance)

  • Proactive, non-disruptive migration of VM workloads and storage from Site A to Site B.  Performing this work non-disruptively is one of the benefits of a vSphere Metro Storage Cluster.  If equipment needs to be powered down at one of the physical sites (e.g., for site maintenance or the impending power outage scenario described in Duncan Epping’s blog post), this can be done without downtime for VMs on a stretched cluster.

If you have hundreds of VMs and multiple stretched clusters, it is important to plan and document the steps for these scenarios.  Since I could not find specific VMware documentation discussing the Planned Failover scenario in detail, I wanted to share an example of how it can be performed.  These steps happen to be for a vSphere 5.0 stretched cluster environment with one or more NetApp Fabric MetroClusters on the back end.

The following is an example of the process that can be used to non-disruptively fail over storage and VMs from Site A to Site B, and then fail everything back to Site A.  This process could differ depending on your storage solution or how many VMs are hosted on your stretched cluster(s).  The steps on the VMware side could of course be scripted, but I am listing the manual steps.  If you have multiple stretched clusters, you can perform VM migrations for the clusters simultaneously, depending on available resources and bandwidth.  *Note – If it’s within the budget, 10Gb NICs can make a huge difference in how quickly you can complete the vMotions.

[Figure: vMSC diagram]

Preparation – Document the steps beforehand, including an estimated timeline.  If you are in an environment where infrastructure management responsibilities are divided between various teams, meet with other teams to discuss plans and share documentation. Review NetApp KB 1014120 titled “How to perform MetroCluster failover method for a planned site-wide maintenance not requiring CFOD”.

Failover

  1. Fail over the MetroCluster(s) from Site A to Site B using the steps in the NetApp KB, including offlining plexes and performing a cf takeover (see the storage-side sketch after this list).
  2. Once it is confirmed that storage has successfully been failed over, you can begin the VM migrations.
  3. Verify that DRS is in fact set to Fully Automated.
  4. For each stretched cluster, edit the cluster settings and modify the DRS Affinity “should” rule that keeps VMs at site A.  Change the Affinity rule so that it contains the Site B Host Affinity group instead of the Site A Host Affinity group.  Within 5 minutes, DRS should kick off the vMotions for the VMs in the associated VM Affinity group.  You can run DRS manually if short on time.
  5. Once you confirm all vMotions were successful, place the hosts in site A in maintenance mode.
  6. Document failover times and keep an eye on monitoring software for any VM network connectivity issues.
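
For reference, here is a rough sketch of the storage-side commands behind step 1 on a 7-Mode Fabric MetroCluster.  The aggregate and plex names are placeholders, and the NetApp KB remains the authority on which node runs each command and in what order.

  aggr status -r            (identify the plexes that live on the Site A shelves)
  aggr offline aggr1/plex0  (offline each Site A plex; placeholder name)
  cf takeover               (negotiated takeover of the Site A controller)
  cf status                 (confirm the takeover completed before starting vMotions)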

Failback

  1. Fail back the MetroCluster(s) from Site B to Site A using the steps in the NetApp KB, including onlining the plexes, waiting for the resync, and performing a giveback (see the sketch after this list).
  2. Once it is confirmed that storage has been successfully failed back and synced, you can begin the VM migrations.
  3. Remove the hosts in site A from maintenance mode.
  4. For each stretched cluster, edit the cluster settings and modify the same DRS Affinity “should” rule that was modified during Failover.  Change the Affinity rule so that it contains the original Site A Host Affinity group.  Within 5 minutes, DRS should kick off the vMotions.
  5. Document failover times and keep an eye on monitoring software for any VM network connectivity issues.
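
Again as a rough sketch only (placeholder names; the KB is authoritative for ordering and which node each command runs on), the storage side of step 1 looks something like:

  aggr online aggr1/plex0   (bring each Site A plex back online to start the resync)
  aggr status -r            (wait until the resync completes and the mirrors are healthy)
  cf giveback               (return resources to the Site A controller)
  cf status                 (confirm both controllers are serving their own resources again)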

For those in IT who remember life before virtualization, it is exciting to see this in action and confirm that storage and compute for hundreds of VMs can be non-disruptively failed over to a site kilometers away in just an hour.  As always, feel free to leave a comment if you have feedback.

Without a doubt, troubleshooting storage performance issues can be a challenging part of any VMware admin’s job.  The cause of a VMware storage-related issue, particularly on a SAN, can be difficult to identify when the problem could be anywhere on the storage fabric.  Take your pick: host driver/firmware bug, bad host HBA, bad cable, bad FC switch port, wrong FC switch port setting, FC switch firmware bug, controller HBA bug, storage OS bug, misalignment…and the list goes on.  Here is an example of one experience I had when working on a VMware storage issue.

Issue:

Intermittent, brief storage disconnects were occurring on all VMware clusters/hosts attached via FC to two NetApp storage arrays.  When the disconnects occurred, they were seen across the hosts at the same time.  Along with the brief disconnects, very high latency spikes and a tremendous number of SCSI resets were also seen on the hosts.  There was no obvious pattern – though the symptoms often seemed to occur more during the overnight hours, the behavior would also occur during the day.

The storage disconnects in vCenter looked like this in Tasks/Events for the hosts:

Lost access to volume xxxxxxxxxxxxxxxx (datastore name) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.

Seconds later…

Successfully restored access to volume xxxxxxxxxxxxxxxxx (datastore name) following connectivity issues.

These events were considered “informational”, so they did not trigger vCenter email notifications, and if you weren’t monitoring logs, they could easily be missed.

Host logs:

There were a few different event types in the logs, including a high number of these:

H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0  DID_RESET

Latency spikes of up to 1,600 milliseconds were visible via ESXTOP:

[ESXTOP screenshot showing the latency spikes]
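
As a generic aside (not specific to this environment), the same counters can be watched live in interactive esxtop on a host:

  # interactive esxtop from the ESXi shell or SSH
  esxtop
  # press 'u' for the disk device view ('d' for disk adapter, 'v' for per-VM disk)
  # watch DAVG/cmd, KAVG/cmd, and GAVG/cmd for device, kernel, and guest-visible latency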

Troubleshooting: 

In collaboration with my colleagues on the storage side, we went through several troubleshooting steps using the tools we had available.  Cases were opened with both VMware and the storage vendor.  Steps included:

  • Using the NetApp Virtual Storage Console (VSC) plugin, we used the online “functional” alignment feature to fix storage alignment for certain VMs in the short-term
  • Comparing VMware host and NetApp logs
  • Uploading tons of system bundles and running ESXTOP
  • Taking a closer look at the environment – was anything being triggered by VM activity at those times (e.g., antivirus scans or other resource-intensive jobs)?
  • Reviewed HBA driver/firmware versions
  • Worked with DBAs to try and stagger SQL maintenance plans
  • Verified all aggregate snapshots except for aggr0 were still disabled on the NetApp side, since this had caused similar symptoms in the past. (Yep, still disabled)

After all of the troubleshooting steps above, the issue remained.  Here are the steps that finally led to the solution:

  • Get both vendors and the customer together on a status call – make sure everyone is on the same page
  • Run perfstats on the NetApp side and ESXTOP in batch mode on the VMware side to capture data as the issue is occurring.
  • ESXTOP command: esxtop -b -a -d 10 -n 3000 | gzip -9c > /vmfs/volumes/datastorename/esxtopoutput.csv.gz (see the sketch after this list).  And of course, Duncan’s handy ESXTOP page helps with the analysis:  http://www.yellow-bricks.com/esxtop/
  • Provide storage vendor with perfstats, esxtop files, and examples times for when the disconnects occurred during the captures.
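
As a small, generic sketch (the filename matches the command above; nothing here is specific to the vendor’s tooling), the batch capture can be copied off the host, decompressed, and sanity-checked before it is sent along:

  # run on a workstation after copying the capture off the datastore
  gunzip esxtopoutput.csv.gz
  # the first row lists the counter names in perfmon format; confirm the disk
  # counters were captured, then load the CSV into Windows perfmon or esxplot
  head -1 esxtopoutput.csv | tr ',' '\n' | grep -i "physical disk" | head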

Resolution:

NetApp bug #253964 was identified as the cause: http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=253964.  The long-running “s” type CP after a “z” (sync) type CP seen in the perfstat indicated we were hitting this bug.  The fix was to upgrade Data ONTAP to a newer version – 8.0.4, 8.0.5, or 8.1.2.  I am happy to confirm that upgrading Data ONTAP did resolve the issue.