Archives For vSphere Metro Storage Cluster

During the setup of a new vSphere cluster, I had to troubleshoot an issue that was causing latency on the NFS datastores. This new vSphere cluster was attached to a newly set up NetApp storage array with essentially no workloads hosted on it yet.

One of the first symptoms of the latency was noticed while browsing the NFS datastores in the vSphere Client GUI. After clicking Browse Datastore and then opening any VM folder, the window would display “Searching datastore…” and often take 40-50 seconds to list the files. Further testing with the NFS datastores confirmed that the slowness also appeared during file copy operations and certain disk-intensive operations in the guest OS.
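To put a number on that kind of slowness, a simple timing loop is handy. The sketch below (the path in the example is hypothetical, not from the original post) times directory listings the same way the datastore browser effectively does:

```python
import os
import time

def time_listing(path, runs=3):
    """Time how long a directory listing of `path` takes, in seconds.

    A slow NFS datastore shows up here as multi-second listings of
    folders that should enumerate almost instantly.
    """
    samples = []
    for _ in range(runs):
        start = time.monotonic()
        entries = os.listdir(path)  # same metadata walk the datastore browser performs
        samples.append(time.monotonic() - start)
    return {"entries": len(entries), "avg_seconds": sum(samples) / runs}

# Example (hypothetical datastore path), run from an SSH session on the host:
# print(time_listing("/vmfs/volumes/nfs_datastore01/some_vm_folder"))
```

Running this before and after a configuration change gives you a repeatable baseline instead of eyeballing the "Searching datastore…" spinner.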

Several weeks were spent troubleshooting and working with vendors to determine the cause. It was found that when the configuration on the NetApp storage array side (and switches) was changed to use access ports instead of trunk ports, the issue went away. In addition, the issue did not occur when one of the hosts and the NetApp storage array were connected to the same datacenter switch. Jumbo frames were not a factor in this equation.

The cause of the issue was found to be a conflict between NetApp's default behavior when using VLAN tagging and the Nexus core switch QoS configuration. By default, NetApp assigns a CoS value of 4 to traffic on VLAN-tagged (trunk) interfaces. This caused the NFS storage traffic to be placed in a queue on the core switch that was limited in terms of bandwidth. A workaround was implemented on the switches for the storage array interfaces that essentially changed the CoS value to fit the network configuration in the environment.
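For illustration, the switch-side workaround could look something like the following NX-OS sketch. The interface, policy name, and CoS value here are hypothetical, not from the actual environment; verify the exact syntax against your NX-OS release and existing QoS policy before applying anything:

```
! Hypothetical sketch only - names and values are examples.
! Remark inbound traffic from the storage array uplink so it no longer
! lands in the bandwidth-limited queue keyed off CoS 4.
policy-map type qos REMARK-STORAGE
  class class-default
    set cos 0

interface Ethernet1/10
  description NetApp storage array uplink (trunk)
  service-policy type qos input REMARK-STORAGE
```

The key point is that the remark happens at ingress on the storage-facing ports, so the rest of the QoS policy in the environment stays untouched.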

Here are some links that helped to connect the dots when researching the issue:

Hi folks – It’s time to take a quick break from the excitement of the vSphere 5.5 and VSAN announcements to read a blog post about vSphere Metro Storage Clusters (vMSCs, aka stretched clusters)!  Specifically, this post is about what I’ve learned regarding vMSC workload mobility between sites, or downtime avoidance.  Since my vMSC experience is based solely on the NetApp MetroCluster solution, the content below is NetApp-centric.

To take a step back – When you look at the scenarios that would cause all VMs running on a stretched cluster to completely vacate all hosts at one of the sites and (eventually) end up at the other site, I see two major types of events:

Unplanned Site Failover (Disaster Recovery)

  • Example:  Site A, which hosts half of a stretched cluster, goes completely down. This is an unplanned event that results in a hard shutdown of all systems in the Site A datacenter.  Once the Site A LUNs are taken over by the Site B controller and fully available, VMs that were running at Site A need to be started at Site B.  Some would argue the DR process should be triggered manually (i.e., without the MetroCluster TieBreaker).  The following doc is a great reference for testing vMSC failure and DR scenarios if you’re doing a proof of concept:

Planned Site Failover (Disaster/Downtime Avoidance)

  • Proactive, non-disruptive migration of VM workloads and storage from Site A to Site B.  Performing this work non-disruptively is one of the benefits of a vSphere Metro Storage Cluster.  If equipment needs to be powered down at one of the physical sites (e.g., for site maintenance or the impending power outage scenario described in Duncan Epping’s blog post), this can be done without downtime for VMs on a stretched cluster.

If you have hundreds of VMs and multiple stretched clusters, it is important to plan and document the steps for these scenarios.  Since I could not find specific VMware documentation covering the Planned Failover scenario in detail, I wanted to share an example of how it can be performed.  These steps happen to be for a vSphere 5.0 stretched cluster environment with one or more NetApp Fabric MetroClusters on the backend.

The following is an example of the process that can be used to non-disruptively fail over storage and VMs from site A to site B, and then fail everything back to site A.  This process could differ depending on your storage solution or how many VMs are hosted on your stretched cluster(s).  The steps on the VMware side could of course be scripted, but I am listing out the manual steps.  If you have multiple stretched clusters, you can perform the VM migrations for the clusters simultaneously, depending on available resources and bandwidth.  *Note – If it’s within the budget, 10GbE NICs can make a huge difference in how quickly you can complete the vMotions.


Preparation – Document the steps beforehand, including an estimated timeline.  If you are in an environment where infrastructure management responsibilities are divided between various teams, meet with other teams to discuss plans and share documentation. Review NetApp KB 1014120 titled “How to perform MetroCluster failover method for a planned site-wide maintenance not requiring CFOD”.


Failover (Site A to Site B):

  1. Fail over the MetroCluster(s) from site A to site B using the steps in the NetApp KB, including offlining the plexes and performing a cf takeover.
  2. Once it is confirmed that storage has successfully been failed over, you can begin the VM migrations.
  3. Verify that DRS is in fact set to Fully Automated.
  4. For each stretched cluster, edit the cluster settings and modify the DRS Affinity “should” rule that keeps VMs at site A.  Change the Affinity rule so that it contains the Site B Host Affinity group instead of the Site A Host Affinity group.  Within 5 minutes, DRS should kick off the vMotions for the VMs in the associated VM Affinity group.  You can run DRS manually if short on time.
  5. Once you confirm all vMotions were successful, place the hosts in site A in maintenance mode.
  6. Document failover times and keep an eye on monitoring software for any VM network connectivity issues.


Failback (Site B to Site A):

  1. Fail the MetroCluster(s) back from site B to site A using the steps in the NetApp KB, including onlining the plexes, resyncing, and performing a giveback.
  2. Once it is confirmed that storage has been successfully failed back and synced, you can begin the VM migrations.
  3. Remove the hosts in site A from maintenance mode.
  4. For each stretched cluster, edit the cluster settings and modify the same DRS Affinity “should” rule that was modified during Failover.  Change the Affinity rule so that it contains the original Site A Host Affinity group.  Within 5 minutes, DRS should kick off the vMotions.
  5. Document failover times and keep an eye on monitoring software for any VM network connectivity issues.
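As noted earlier, the VMware-side steps above could be scripted (e.g., via PowerCLI or pyVmomi). As a minimal, hypothetical illustration of the core logic (the rule and group names are made up, and a real implementation would reconfigure the cluster through the vSphere API), the failover and failback rule edits both amount to pointing the existing “should” rule at the other site’s host group:

```python
def swap_affinity_host_group(rule, site_a_group, site_b_group):
    """Return a copy of a DRS VM-to-host 'should' rule with its host
    affinity group swapped to the other site.

    `rule` is a plain dict standing in for the vSphere rule object.
    """
    current = rule["host_group"]
    if current == site_a_group:
        target = site_b_group
    elif current == site_b_group:
        target = site_a_group
    else:
        raise ValueError("rule references unexpected host group: %s" % current)
    # Copy rather than mutate, so the original rule is untouched.
    return dict(rule, host_group=target)

# Hypothetical names: failover repoints the VM group at Site B's hosts.
rule = {"name": "vms-should-run-site-a",
        "vm_group": "SiteA-VMs",
        "host_group": "SiteA-Hosts"}
failover_rule = swap_affinity_host_group(rule, "SiteA-Hosts", "SiteB-Hosts")
# failover_rule["host_group"] is now "SiteB-Hosts"; once applied, DRS
# kicks off the vMotions within about 5 minutes.
```

Running the same function again during failback restores the original Site A host group, which mirrors steps 4 of the failover and failback lists above.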

For those in IT who remember life before virtualization, it is exciting to see this in action and confirm that storage and compute for hundreds of VMs can be non-disruptively failed over to a site kilometers away in just an hour.  As always, feel free to leave a comment if you have feedback.