A bit of speculation here, but I'll toss it out there.
When this issue occurred yesterday evening, a Tape AUX copy job had already been running for about 2 hours. The disk library and the tape library are both fiber-attached to the Media Agent, so all data flow for that job stays within the MA. It should be noted that there were no indications that this Tape AUX copy operation went into a pending or failed state at any time. CV support also confirmed that our CVD services continued to process data through the outage. I must assume the data being processed was this Tape AUX copy.
Conversely, any new backup operation hung with a message that the MA was unreachable, or that the Library was offline/unavailable.
Consider that there was also an AUX copy scheduled to run during this outage. This was a network-based AUX copy from our Main MA to our DR MA. It failed to start because it could not contact our Production MA.
So the MA continues to process data within itself (the Tape AUX copy), but any attempt to contact it over the network for new backups or AUX copies is denied. It is almost as if the MA services were stopped or the network connection had been disrupted.
Our monitoring software reports no network disruption (i.e. it was pingable for the duration of the outage). So we must assume our NICs were active and functional. The NIC reports it has been up the same amount of time as the server (38+ days).
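One caveat on the ping result: a ping only shows that the host answers ICMP, not that a given service is accepting TCP connections. A minimal sketch of a service-level check is below. The function name is my own, and the host and port you pass in are placeholders: substitute whatever port your CommVault services actually listen on in your environment.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    This is a generic reachability probe, not a CommVault API. A host can
    answer ping while this returns False if the service behind the port
    is down or not accepting connections.
    """
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Run from another machine during an outage, a call such as tcp_port_open("your-ma-hostname", your_cv_port) returning False while ping still succeeds would suggest the problem is at the service level rather than the NIC.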
Let's also consider that during this outage, we saw repeated messages in our Windows Event log: unable to obtain file lock on a .LOG file in the CommVault install directory.
Speculation - If CommVault is unable to obtain a lock for a .LOG file, the CV services that handle network connectivity go into a failed state. This prohibits the MA from responding to any network requests for CommVault. The result is Loss of Control and/or Library Offline error messages.
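One way to gather evidence for or against this would be to probe whether the .LOG file named in the Event log can be locked at all while the problem is occurring. The sketch below is only an illustration: the function name is hypothetical, I am assuming a plain non-blocking byte-range lock (CommVault may use a different locking mechanism internally), and the path is whatever file your Event log message names. The probe takes the lock only momentarily, but I would still use it sparingly against a live log file.

```python
import sys


def can_lock_exclusively(path: str) -> bool:
    """Try to take, then release, a non-blocking exclusive lock on path.

    Returns False if the file cannot be opened or another handle
    already holds a conflicting lock.
    """
    try:
        f = open(path, "r+b")
    except OSError:
        return False
    try:
        if sys.platform == "win32":
            import msvcrt
            # Lock one byte at the current position without blocking.
            msvcrt.locking(f.fileno(), msvcrt.LK_NBLCK, 1)
            msvcrt.locking(f.fileno(), msvcrt.LK_UNLCK, 1)
        else:
            import fcntl
            # POSIX fallback so the sketch can be tried off-Windows.
            fcntl.flock(f.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
            fcntl.flock(f.fileno(), fcntl.LOCK_UN)
        return True
    except OSError:
        return False
    finally:
        f.close()
```

If this returns False during an outage and True during normal operation, that would at least show that something else is holding the file at the time, which could then be tracked down with a handle-listing tool.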
I'm interested in some feedback regarding these assumptions.
Final note - all local hard disks in the Media Agent (OS, IndexCache, and DDB disks) are SSDs.