…written from the perspective of a Virtualization Engineer. A very special thanks to Networking Guru @cjordanVA for being a key contributor on this post.
Overlay Transport Virtualization (OTV), a Cisco feature released in 2010, can be used to extend Layer 2 traffic between distributed data centers. Extending Layer 2 domains across data centers may be required to support certain high-availability solutions, such as stretched clusters or application mobility. Instead of sending traffic as raw Layer 2 across a Data Center Interconnect (DCI), OTV encapsulates the Layer 2 traffic in Layer 3 packets. Using OTV for Layer 2 extension between sites has some real benefits, such as limiting the impact of unknown-unicast flooding. OTV also allows for FHRP isolation, which lets the same default gateway exist in both data centers at the same time. This can help reduce traffic tromboning between sites.
When planning an OTV implementation in an enterprise environment with existing production systems, here are a few things to include in the testing phase when collaborating with other teams:
- Set up a conference call for the OTV implementation day and share the details with the infrastructure groups involved in the implementation and testing, e.g. network, storage, server, and virtualization engineers. This will allow the staff involved to communicate easily while performing testing following the change.
- Test pinging physical server interfaces by IP address at one data center from the other data center, and from various subnets. Can you ping an interface from the same site, but not from the other site? (Make sure to establish a baseline before implementation day.) Is your monitoring software at one site intermittently alerting that it cannot ping devices at the other site?
- If your vCenter Server manages hosts located in multiple data centers, was vCenter able to reconnect to the ESXi hosts at the other data center (across the DCI) after OTV was enabled?
- If you have systems that replicate storage/data between the data centers, test this replication after OTV is enabled and verify it completes successfully.
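The cross-site ping tests above are easier to reason about if the pre-change baseline is recorded and compared against post-change results. Below is a minimal sketch of that idea; the IP addresses are hypothetical placeholders, and `ping_host()` simply shells out to the OS `ping` tool.

```python
# Sketch: record a cross-site reachability baseline before the OTV change,
# then flag hosts that regressed afterward. All IPs below are hypothetical.
import subprocess

def ping_host(ip: str, count: int = 2, timeout_s: int = 2) -> bool:
    """Return True if the host answers at least one ICMP echo (Linux ping flags)."""
    result = subprocess.run(
        ["ping", "-c", str(count), "-W", str(timeout_s), ip],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def take_baseline(hosts):
    """Record reachability for each host before implementation day."""
    return {ip: ping_host(ip) for ip in hosts}

def find_regressions(baseline, current):
    """Hosts that were reachable before the change but are not now."""
    return sorted(ip for ip, was_up in baseline.items()
                  if was_up and not current.get(ip, False))

# Example with recorded results (hypothetical hosts at two sites):
baseline = {"10.1.10.21": True, "10.2.10.21": True, "10.2.10.22": False}
after    = {"10.1.10.21": True, "10.2.10.21": False, "10.2.10.22": False}
print(find_regressions(baseline, after))  # hosts broken by the change
```

Keeping the baseline matters: a host that was already unreachable before implementation day should not be blamed on OTV during the post-change scramble.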
Be aware of a couple of gotchas:
ARP aging timer/CAM aging timer – Make sure the ARP aging timer is set lower than the CAM aging timer to prevent traffic from being intermittently blackholed. This is an issue to watch out for if OTV is being implemented in a mixed Catalyst/Nexus environment, and is unlikely to be an issue if the environment is all Nexus. The default aging timers depend on the Cisco platform; the defaults on a Catalyst 6500 differ from the defaults on a Nexus 7000.
Symptoms of an aging timer issue: you will most likely see failures during the ping tests mentioned above, or intermittent problems establishing connectivity to certain hosts.
MTU Settings – Since OTV adds encapsulation overhead to each packet and also sets the do-not-fragment (DF) bit, a larger MTU will need to be configured on every interface along the path of an OTV-encapsulated packet. Check the MTU settings prior to implementation, and again if issues arise when OTV is rolled out. If the MTU settings appear correct but problems persist, consider rebooting the OTV edge devices as a troubleshooting step to verify that the settings actually applied and did not get stuck (it's happened).
Symptoms of an MTU-related issue: if you have a vCenter Server in one data center that manages hosts at the other data center, it may not be able to reconnect to those hosts. Storage replication may not complete successfully after OTV has been enabled.
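The MTU requirement is simple arithmetic, sketched below. The 42-byte figure is the overhead commonly cited for OTV encapsulation; treat it as an assumption and verify the exact number for your platform and transport before sizing links.

```python
# Back-of-the-envelope MTU check for the OTV transit path. Because the DF
# bit is set, any transit link with an MTU below payload + overhead will
# drop OTV traffic rather than fragment it.
OTV_OVERHEAD_BYTES = 42   # assumption: commonly cited; verify for your platform

def required_transit_mtu(edge_payload_mtu: int = 1500) -> int:
    return edge_payload_mtu + OTV_OVERHEAD_BYTES

def link_ok(link_mtu: int, edge_payload_mtu: int = 1500) -> bool:
    return link_mtu >= required_transit_mtu(edge_payload_mtu)

print(required_transit_mtu())  # 1542
print(link_ok(1500))           # False: a default-MTU transit link drops OTV frames
print(link_ok(9216))           # True: jumbo-enabled DCI link has ample headroom
```

This is also why small pings often succeed while vCenter reconnects and storage replication fail: only full-size frames exceed the transit MTU, so the path looks healthy until real traffic flows.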