How to troubleshoot Always On Synchronization Issue ?
I have seen this questions in multiple forums or user group. Let me try to answer the on how to troubleshoot Always On synchronization issue. There are multiple reasons for the database status ranging for intermittent network issue, huge transaction, # of transactions, not ample space left on Secondary etc. The most common issue I have seen so far is, intermittent network issue and not ample space left on the secondary. For example, a weekly maintenance job run and crate a huge log file, some customer have configured log backup to occur every 1 hour based on there RPO and/or RTO. And, some times, at the secondary server not much space left for database file to grow and the Always On database synchronization status changes to one of above.
Most of the time, when you see an issue with Always On availability group for Synchronization it may be one of below status for the Availability Database, I will cover on Status 1 to 4, and will have separate blog on Status 5 and Status 6:
- Database is Restoring
- Database is Recovering
- Database is in Recovery pending
- Database is in Suspect
- Database is in Emergency
- Database is in Offline
In recent past, I have an interesting error where Always On Database Synchronization status is changed to recovery pending, the error message read like Msg 35220, Level 16, State 1, Line 1. Since Always On sits on Windows Server Failover Clustering (WSFC), and I do always in WSFC case, did reviewed the Cluster Logs where I have observed errors like below:
Msg 35220, Level 16, State 1, Line 1 Could not process the operation. AlwaysOn Availability Groups replica manager is waiting for the host computer to start a Windows Server Failover Clustering (WSFC) cluster and join it. Either the local computer is not a cluster node, or the local cluster node is not online. If the computer is a cluster node, wait for it to join the cluster. If the computer is not a cluster node, add the computer to a WSFC cluster. Then, retry the operation.
Cluster node 'SSCHA2' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
Let us see how did I resolve the Msg 35220, Level 16, State 1, Line 1 and How to Troubleshoot Always On Synchronization Issue.
Step 1: Restarted the Windows Cluster service on SSCHA2 node
Step 2: Restarted the SQL Server service on SSCHA2 Node
Step 3: Waited for couple of minutes and I see Always On is in Synchronized again
In any of the above situation one has to take a judgement based on what has cause the Always On Synchronization status to change and resolve it based on it. Most of the time, manually pausing and resuming Always On / HADR will work, in other case one has to review the logs and take the call. Here is how to How to Troubleshoot Always On Synchronization Issue in general, the command is to run on Replica.
Code to SUSPEND Data Movement in Always ON
ALTER DATABASE SSCDB SET HADR SUSPEND
Code to RESUME Data Movement in Always ON
ALTER DATABASE SSCDB SET HADR RESUME
You may also want to download my script to monitor AlwaysOn Health.
Let me know if I have missed anything here, or you may want to test and provide answer for any specific scenario. You can always send me an email or use tweeter to contact me, the details in the Author details section below.