Have to remove DB and join DB back to AG to fix out of sync issue

  • i1888

    SSCommitted

    Points: 1810

    Hello friends,

    I found several times, I cannot fix the AlwaysON issue then I have to delete database and join it back to fix.

    This is usually how I fix alwayson issue:

    1. Firstly, try resume the database.

    ALTER DATABASE [DBName] SET HADR RESUME;

    2. If item 1 doesn't work, set the ag from sync mode to async mode, then change it back

    USE [master]

    GO

    ALTER AVAILABILITY GROUP [AGName]

    MODIFY REPLICA ON N'A\INSTANCE3' WITH (AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT)

    GO

    USE [master]

    GO

    ALTER AVAILABILITY GROUP [AGName]

    MODIFY REPLICA ON N'A\INSTANCE3' WITH (AVAILABILITY_MODE = SYNCHRONOUS_COMMIT)

    GO

    3. If item 2 doesn't work, on the secondary replica, remove db from ag then add it back

    ALTER DATABASE [DBName] SET HADR OFF;

    GO

    alter database [DBName] set HADR AVAILABILITY GROUP = [AGName]

    go

    4. If 3 doesn't work,

    on primary, make full backup and log backup if you don't have recent backups,

    on secondary,

    ALTER DATABASE [DBName] SET HADR OFF;

    Restore database to latest log backup.

    on primary,

    ALTER AVAILABILITY GROUP [AGName]

    REMOVE DATABASE [DBName];

    Join the database in wizard.

    Usually, item 1-3 are enough to fix not syncing issue, but sometimes, there is no luck, no matter how you play the not syncing database, there is just no luck, for example:

    0. Two nodes are both on latest version.

    Microsoft SQL Server 2014 - 12.0.4416.0 (X64)

    Jun 11 2015 19:18:41

    Copyright (c) Microsoft Corporation

    Enterprise Edition (64-bit) on Windows NT 6.3 <X64> (Build 9600: )

    Two databases in AG, and both sync mode for two nodes, set to automatic failover mode.

    1. I was advised there is a maintenance work on server, the AG is automatically failedover, now found one database is not syncing on secondary.

    2. Tried step 1-3 all didn't work, at step 3, keep failing by "The connection to the primary replica is not active. The command cannot be processed."

    Server Name: A\INSTANC3

    Error Number: 35250

    Severity: 16

    State: 7

    Checked the event log, not useful information provided:

    -

    Nonqualified transactions are being rolled back in database DBNamefor an AlwaysOn Availability Groups state change. Estimated rollback completion: 100%. This is an informational message only. No user action is required.

    -

    The database 'DBName' is marked RESTORING and is in a state that does not allow recovery to be run.

    -

    Starting up database 'DBName'.

    -

    State information for database 'DBName' - Hardended Lsn: '(268:12520:1)' Commit LSN: '(0:0:0)' Commit Time: 'Jan 1 1900 12:00AM'

    GOOGLED, all pointed to the endpoint, firewall etc, which is not helpful at all, since there is another database working fine in same availability group... I have crosschecked, firewall is configured, port is open, the endpoint url status is connected on all nodes.

    A bit sad... thus have to try solution 4 --- which fixed the issue in 1 minute.

    So, does anyone knows why "The connection to the primary replica is not active. The command cannot be processed." pops up even the connection is super good?

    Regards,

    Albert

  • Perry Whittle

    SSC Guru

    Points: 233779

    find out why the data movement is being suspended in the first place, can you provide more detail as to what exact messages you receive, stop the sync issues to resolve the problem

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • i1888

    SSCommitted

    Points: 1810

    Hello Perry,

    Thanks for your reply.

    I found another database out of sync on another server, I tried to apply the log backup but there are not enough log backup since then, so I restored the full backup made last week and applied log backup, then run the command:

    alter database DBNameset hadr availability group =[AGName]

    Got error again:

    Msg 35250, Level 16, State 7, Line 1

    The connection to the primary replica is not active. The command cannot be processed.

    Because this is production database, I have to fix this in weekend.

    re the first place where the issue happens, I can only find some connection issue like below:

    A connection timeout has occurred on a previously established connection to availability replica 'C\INSTANCE2' with id [119BD57D-BDE1-454A-9A87-31BE2C99C94E]. Either a networking or a firewall issue exists or the availability replica has transitioned to the resolving role.

    A connection for availability group 'ORD20-I2G10' from availability replica 'A\INSTANCE2' with id [5F1D5F50-44D8-4529-9592-5041A9D60855] to 'C\INSTANCE2' with id [30A1B59E-EDF3-4142-ABC1-FC1D03E3EFBC] has been successfully established. This is an informational message only. No user action is required.

    Those 3 nodes are directly connected and the speed is very fast( we spend million dollars on hardware), because we have multiple instances running on each node(4 instances on this HA environment), each instance has configured cpu affinity, I suspect the overlapped cpu cores are busy on one instance and could not handle the request for another instance, which causing AG connection dropped, but I haven't approved this yet.

  • Perry Whittle

    SSC Guru

    Points: 233779

    are your backups preferred on the primary or a secondary?

    how often do the full backups run?

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • i1888

    SSCommitted

    Points: 1810

    Our application does the backup, so all through primary, and there are both primary replicas on both node A and B, so it is balanced...

  • Perry Whittle

    SSC Guru

    Points: 233779

    i1888 (8/20/2015)


    there are both primary replicas on both node A and B, so it is balanced...

    Yes, but they'll be members of different AlwaysOn groups. Keep your AG backups segregated from the remaining instance database backups, a simple check when the backup job runs to see if HADR is set and if so which group it's in then backup to a set share would be my preference.

    If no full backup has run since the last sync between the primary and the secondary you may use a differential backup on the primary to bring the secondary in sync then resume data movement

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

Viewing 6 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply