vSphere Metro Storage Cluster – steps for non-disruptive site failover

September 16, 2013 — 2 Comments

Hi folks – It’s time to take a quick break from the excitement of the vSphere 5.5 and VSAN announcement to read a blog post about vSphere Metro Storage Clusters (vMSCs aka stretched clusters)!  Specifically, this post is about what I’ve learned in regards to vMSC workload mobility between sites, or downtime avoidance.  Since my vMSC experience is solely based on the NetApp MetroCluster solution, the content below is NetApp-centric.

To take a step back – When you look at the the scenarios that would cause all VMs running on a stretched cluster to completely vacate all hosts at one of the sites and (eventually) end up at the other site, I see two major types of events:

Unplanned Site Failover (Disaster Recovery)

  • Example:  Site A, which hosts half of a stretched cluster, goes completely down. This is an unplanned event, which results in a hard shutdown of all systems in the Site A datacenter.  Once the Site A LUNs are taken over by the Site B controller and fully available, VMs that were running at Site A need to be started at Site B.  Some would argue the DR process should be triggered manually (ie without MetroCluster TieBreaker).  The following doc is a great reference for testing vMSC failure or DR scenarios if you’re doing a proof of concept:  http://www.vmware.com/files/pdf/techpaper/vSPHR-CS-MTRO-STOR-CLSTR-USLET-102-HI-RES.pdf.

Planned Site Failover (Disaster/Downtime Avoidance)

  • Proactive non-disruptive migration of VM workloads and storage from Site A to Site B.  Performing this work non-disruptively is one of the benefits of a vSphere Metro Storage Cluster.  If equipment needs to be powered down at one of the physical sites (ie. for site maintenance or impending power outage scenario described in Duncan Epping’s blog post), this can be done without downtime for VMs on a stretched cluster.

If you have hundreds of VMs and multiple stretched clusters, it is important to plan and document the steps for these scenarios.  Since I could not find specific VMware documentation discussing the Planned Failover scenario in detail, I wanted to share an example of how this can be performed.  These steps happen to be for a 5.0 stretched cluster environment with one or more NetApp Fabric Metroclusters on the backend.

The following is an example of the process that can be used to non-disruptively failover storage and VMs from site A to site B, and then fail everything back to site A.  This process could be different depending on your storage solution, or how many VMs you have hosted on your stretched cluster(s).  The steps on the VMware side could of course be scripted, but I am listing out the manual steps.  If you have multiple stretched clusters, you can perform VM migrations for the clusters simultaneously, depending on available resources/bandwidth.  *Note – If it’s within the budget, 10Gb nics can make a huge difference in how quickly you can complete the vMotions.

vMSC_diagram

Preparation – Document the steps beforehand, including an estimated timeline.  If you are in an environment where infrastructure management responsibilities are divided between various teams, meet with other teams to discuss plans and share documentation. Review NetApp KB 1014120 titled “How to perform MetroCluster failover method for a planned site-wide maintenance not requiring CFOD”.

Failover

  1. Failover Metrocluster(s) from site A to site B using steps in NetApp KB, including offlining plexes and performing Cf takeover.
  2. Once it is confirmed that storage has successfully been failed over, you can begin the VM migrations.
  3. Verify that DRS is in fact set to Fully Automated.
  4. For each stretched cluster, edit the cluster settings and modify the DRS Affinity “should” rule that keeps VMs at site A.  Change the Affinity rule so that it contains the Site B Host Affinity group instead of Site A Host Affinity group.  Within 5 minutes, DRS should kick of the vMotions for the VMs in the associated VM Affinity group.  You can Run DRS manually if short on time.
  5. Once you confirm all vMotions were successful, place the hosts in site A in maintenance mode.
  6. Document failover times and keep an eye on monitoring software for any VM network connectivity issues.

Failback

  1. Failover Metrocluster(s) from site B to site A using steps in NetApp KB, including Online Plexes/Resync and Giveback.
  2. Once it is confirmed that storage has been successfully failed back and synced, you can begin the VM migrations.
  3. Remove the hosts in site A from maintenance mode.
  4. For each stretched cluster, edit the cluster settings and modify the same DRS Affinity “should” rule that was modified during Failover.  Change the Affinity rule so that it contains the original Site A Host Affinity group.  Within 5 minutes, DRS should kick off the vMotions.
  5. Document failover times and keep an eye on monitoring software for any VM network connectivity issues.

For those in IT that remember life before virtualization, it is exciting to see this in action and confirm that storage and compute for hundreds of VMs can be non-disruptively failed over to a site kilometers away in just an hour.  As always, feel free to leave a comment if you have feedback.

2 responses to vSphere Metro Storage Cluster – steps for non-disruptive site failover

  1. 

    It’s in reality a great and helpful piece of info. I’m satisfied
    that you simply shared this helpful information with us.

    Please stay us up to date like this. Thank you for sharing.

Trackbacks and Pingbacks:

  1. vSphere Metro Storage Cluster Links » Welcome to vSphere-land! - October 12, 2014

    […] (Plain Virtualization) New VMware HCL category: vSphere Metro Stretched Cluster (Virtual Geek) vSphere Metro Storage Cluster – steps for non-disruptive site failover (Virtual Stace) VMware Metro Storage Cluster Explained – Part 1: The Challenge (Virtualization […]

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s