Without a doubt, troubleshooting storage performance issues can be a challenging part of any VMware admin’s job. The root cause of a storage-related issue, particularly on a SAN, can be difficult to identify when the problem could be anywhere on the storage fabric. Take your pick: host driver/firmware bug, bad host HBA, bad cable, bad FC switch port, wrong FC switch port setting, FC switch firmware bug, controller HBA bug, storage OS bug, misalignment…and the list goes on. Here is one example from my own experience working on a VMware storage issue.
Intermittent, brief storage disconnects were occurring on all VMware clusters/hosts attached via FC to two NetApp storage arrays. When the disconnects occurred, they were seen across the hosts at the same time. Along with the brief disconnects, the hosts also showed very high latency spikes and a tremendous number of SCSI resets. There was no obvious pattern – though the symptoms often seemed to occur more during the overnight hours, the behavior would also occur during the day.
The storage disconnects in vCenter looked like this in Tasks/Events for the hosts:
Lost access to volume xxxxxxxxxxxxxxxx (datastore name) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
Successfully restored access to volume xxxxxxxxxxxxxxxxx (datastore name) following connectivity issues.
These events were considered “informational”, so they were not errors that triggered vCenter email notifications, and if you weren’t monitoring logs, they could easily be missed.
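Because nothing alerts on these events, one low-effort way to catch them is to watch the host logs for them. On ESXi, connectivity events of this kind are recorded by the vobd daemon in /var/log/vobd.log. A minimal sketch, run against a fabricated sample line since the exact wording and volume UUIDs vary by ESXi version:

```shell
# Stand-in for /var/log/vobd.log on an ESXi host; the line content is illustrative
log=/tmp/vobd.sample
printf '%s\n' \
  '2013-01-15T02:11:09Z: [vmfsCorrelator] Lost access to volume 4f3c0000-example (datastore name) due to connectivity issues' \
  '2013-01-15T02:11:42Z: [vmfsCorrelator] Successfully restored access to volume 4f3c0000-example (datastore name)' \
  > "$log"

# Count the disconnect events and pull their timestamps
grep -c 'Lost access to volume' "$log"
grep 'Lost access to volume' "$log" | awk '{print $1}'
```

On a real host you would point this at /var/log/vobd.log (or a syslog target collecting the hosts’ logs) and run it on a schedule.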
The logs showed a few different event types, including a high number of:
H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0 DID_RESET
Latency spikes of up to 1,600 milliseconds were visible in ESXTOP….
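The DID_RESET entries can be counted straight out of vmkernel.log – H:0x8 is the host status code that corresponds to DID_RESET. A sketch against a fabricated sample line, since the real log lives at /var/log/vmkernel.log on each host and the surrounding line format varies by build:

```shell
# Stand-in for /var/log/vmkernel.log; both lines are illustrative
log=/tmp/vmkernel.sample
printf '%s\n' \
  'cpu2:4101)NMP: nmp_CompleteCommandForPath: H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0' \
  'cpu3:4102)NMP: nmp_CompleteCommandForPath: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0' \
  > "$log"

# Count host-status 0x8 (DID_RESET) completions
grep -c 'H:0x8' "$log"
```

Trending that count per hour across hosts was a quick way to see whether the resets clustered around the same windows as the disconnects.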
In collaboration with my colleagues on the storage side, we went through several troubleshooting steps using the tools we had available. Cases were opened with both VMware and the storage vendor. Steps included:
- Using the NetApp Virtual Storage Console (VSC) plugin, we used the online “functional” alignment feature to fix storage alignment for certain VMs in the short term
- Comparing VMware host and NetApp logs
- Uploading tons of system bundles and running ESXTOP
- Taking a closer look at the environment – was anything being triggered by VM activity at those times (e.g., antivirus scans, resource-intensive jobs)?
- Reviewed HBA driver/firmware versions
- Worked with DBAs to try and stagger SQL maintenance plans
- Verified all aggregate snapshots except for aggr0 were still disabled on the NetApp side, since this had caused similar symptoms in the past. (Yep, still disabled)
After all of the troubleshooting steps above, the issue remained. Here are the steps that finally led to the solution:
- Get both vendors and customer together on status call – make sure everyone is on the same page
- Run perfstats on the NetApp side and ESXTOP in batch mode on VMware side to capture data as the issue is occurring.
- ESXTOP command: esxtop -b -a -d 10 -n 3000 | gzip -9c > /vmfs/volumes/datastorename/esxtopoutput.csv.gz. And of course, Duncan Epping’s handy ESXTOP page helped with the analysis: http://www.yellow-bricks.com/esxtop/
- Provide the storage vendor with the perfstats, the esxtop files, and example times for when the disconnects occurred during the captures.
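One practical note on that last step: the batch-mode CSV is enormous (one column per counter), so it helps to pull out just the device-latency columns before graphing or handing data to the vendor. A sketch using awk on a tiny made-up capture – the real header names (here “Average Driver MilliSec/Command”, which maps to DAVG) differ slightly by build, so check the first line of your own file:

```shell
# Tiny stand-in for a decompressed esxtop batch capture: header row + two samples
csv=/tmp/esxtop.sample.csv
printf '%s\n' \
  '"Time","\\host\Physical Disk(vmhba1)\Average Driver MilliSec/Command"' \
  '"02:11:00","3.1"' \
  '"02:11:10","1600.4"' \
  > "$csv"

# Find the column whose header mentions latency, then print timestamp + that column
awk -F'","' 'NR==1 {for (i=1; i<=NF; i++) if ($i ~ /MilliSec/) c=i; next}
             {gsub(/"/,""); print $1, $c}' "$csv"
```

Against the real capture you would first decompress it (gzip -dc esxtopoutput.csv.gz) and pipe that into the same awk; spikes like the 1,600 ms sample above stand out immediately.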
NetApp bug #253964 was identified as the cause: http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=253964. The long-running “s” type CP after a “z” (sync) type CP seen in the perfstat indicated we were hitting this bug. The fix was to upgrade Data ONTAP to a newer version – 8.0.4, 8.0.5, or 8.1.2. I am happy to confirm that upgrading Data ONTAP did resolve the issue.