We are seeing some odd behavior during VMware backups. One of our file servers (call it B2) is set to perform an incremental VMware backup every hour during business hours. This server also uses DFS-R to replicate data from multiple remote site servers, which we then back up using Commvault. We are also using FSRM to monitor specific files and locations for signs of CryptoLocker-type attacks.
It is also important to note that the VMDKs are all stored on a SAN, all backups use SAN transport mode, and the snapshots are created on the SAN using one of the Media Agents as the VM proxy.
There are three VMs in the backup set. Two are Windows 2012 R2 and the third is Windows 2008 R2. The Media Agent is Windows 2008 R2 running the latest V11SP6+ CV software.
Whenever a backup job starts, we notice a few things (some of which may be normal):
1. As the VM snapshot is being generated, several volumes become visible to Windows for about a minute or two. They appear to be the VSS volumes, but we are getting errors from FSRM of "unexpected error.....Error: FlushFileBuffers....incorrect function" that reference one of these VSS snap volumes, which normally disappear from the system view within 30 seconds. The error is EventID 8197 from source SRMSVC. This error does not occur on every backup cycle; it happens randomly throughout the day on only a few of the backup cycles, about 1 minute after the backup process starts.
2. We also see a warning in the "System" log for three different disks being "surprise removed". This occurs each time, about 1 minute after the process starts. EventID 157 from source disk.
3. We also see DFS-R reporting that it has temporarily stopped because another application is performing a backup or restore operation (EventID 1102 from source DFSR). Again, this occurs within a minute of the backup process starting, and it is normally followed a few seconds later by another event stating that it has resumed. This also does not happen on every backup cycle. The bigger issue is that DFS-R sometimes stops working: the service is still running, but a health check script we run shows it failing. Restarting the service corrects it. This is happening every couple of days.
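Since all three events seem to cluster about a minute after a backup starts, but not on every cycle, one way to pin down the pattern would be to line up the event timestamps against the backup start times. Below is a minimal sketch of that idea in Python. The function name `events_near_backups`, the 3-minute window, and the sample timestamps are all made up for illustration; the real data would have to be exported from the Windows event logs and the Commvault job history first.

```python
from datetime import datetime, timedelta


def events_near_backups(backup_starts, events, window=timedelta(minutes=3)):
    """Map each backup start time to the sorted event IDs logged within `window` after it.

    backup_starts: list of datetime objects (job start times).
    events: list of (timestamp, event_id) tuples.
    """
    result = {}
    for start in backup_starts:
        ids = {
            event_id
            for timestamp, event_id in events
            if start <= timestamp <= start + window
        }
        result[start] = sorted(ids)
    return result


# Synthetic timestamps for illustration only; these are NOT from real logs.
backups = [datetime(2000, 1, 1, 9, 0), datetime(2000, 1, 1, 10, 0)]
events = [
    (datetime(2000, 1, 1, 9, 1), 157),    # disk "surprise removed"
    (datetime(2000, 1, 1, 9, 1), 1102),   # DFS-R paused for backup
    (datetime(2000, 1, 1, 10, 1), 157),
    (datetime(2000, 1, 1, 10, 1), 8197),  # FSRM FlushFileBuffers error
]

for start, ids in events_near_backups(backups, events).items():
    print(start.isoformat(), ids)
```

Backup cycles where 1102 shows up without a matching "resumed" event, or where 8197 appears, would be the ones worth comparing against the clean cycles.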
What we have tried so far:
1. Temporarily disabled Volume Shadow Copies on the drives from within Windows.
2. Decreased the number of writers for this VM subclient from 10 to 2.
3. Recently upgraded our VMware Tools to 10.0.9 due to a known bug in the previous 10.0.6 we were using. We are also on ESXi 6 (and have been for 6 months).
4. Checked the resources of all boxes in question and found no issues. All have plenty of resources and are nowhere near using what they have.
Other things to note:
1. This does not happen on every backup cycle.
2. We have two other File Servers in the same subclient that are not having the same issues. Both have FSRM running, but not DFS-R
3. We have another server (physical) that uses the Windows Client and has DFS-R, but it is not having issues. This may not be relevant, but I thought I would include it for completeness.
4. The backups themselves run without errors.
It appears the only thing that separates the VM servers is DFS-R, but I have not yet found anything online that references this specific issue on our software versions. I found some similar reports going back to the ESXi 3.5 days, but we moved away from that version years ago.
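To make that comparison a bit more systematic, here is a rough sketch that looks for the smallest combination of features on the failing server that none of the healthy servers share. The feature labels and the function name `unique_feature_combos` are hypothetical, and the entry for the physical server assumes it does not run FSRM, which I have not stated above and would need to confirm.

```python
from itertools import combinations


def unique_feature_combos(failing, healthy_servers):
    """Return the smallest feature combinations of `failing` that no healthy server has in full."""
    for size in range(1, len(failing) + 1):
        found = [
            set(combo)
            for combo in combinations(sorted(failing), size)
            if not any(set(combo) <= healthy for healthy in healthy_servers)
        ]
        if found:
            return found
    return []


# Hypothetical feature labels for the servers described in this post.
b2 = {"vm_backup", "fsrm", "dfsr"}
healthy = [
    {"vm_backup", "fsrm"},      # other file server VM in the same subclient
    {"vm_backup", "fsrm"},      # other file server VM in the same subclient
    {"agent_backup", "dfsr"},   # physical server; ASSUMPTION: no FSRM installed
]

print(unique_feature_combos(b2, healthy))
```

Under those assumptions no single feature is unique to B2, since the physical server also runs DFS-R. What stands out is DFS-R combined with the VM-level backup, and possibly DFS-R combined with FSRM.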
I have been beating my head against this for about two weeks and have run out of ideas to pursue. I hesitate to put in a support ticket because the issue could lie with any of three vendors (Commvault, VMware, or Microsoft).
I appreciate any help you can provide. If you have further questions, just let me know.