SQL Server 2012 (11.0.5058.0) Enterprise Edition
We have 8 Availability Groups in a 2(HA)+1(DR) cluster and our monitoring DMVs are reporting results that confuse me. 6 Availability Groups are configured for HA and DR, 1 is configured for HA only, and 1 is configured for DR only.
Each of the 6 HA/DR Availability Groups have "SQLB" as a primary and "SQLA" as a secondary (synchronous) HA replica and "SQLC" as a secondary (async) replica.
On both secondaries:
SELECT dhags.group_id, dhags.synchronization_health_desc FROM sys.dm_hadr_availability_group_states dhags
reports that all Availability Group replication sync health are
select replica_id,synchronization_health_desc from sys.dm_hadr_availability_replica_states
reports that all replicas have a sync health of
The primary replica reports all Availability Groups and replicas with a sync health of
While I understand that one reports on replica sync health and the other reports on AG sync health, it seems logical to me that if the more granular (AG) state was not healthy, that would affect the overall health of the broader context (replica). I cannot find MSDN documentation that describes how the health is determined at each level.
Why would the secondaries report
NOT_HEALTHY for Availability Group sync health, but
HEALTHY for replica sync health, and why does this differ from the primary's report?
Sadly, sys.dm_hadr_availability_replica states is not a reliable indicator of replica health. Here's the Connect item on one of the bugs we've run into where that DMV stops refreshing - note in the comments that log_send_queue_size in the DMV sys.dm_hadr_database_replica_states shows 0 even when there's log data to be sent.
Note that the Connect item is marked as Won't Fix. Sad trombone.