Availability Group primary/secondary issue

  • I have a failover cluster called Server1. There are 2 nodes - NodeA and NodeB

    NodeA should always be the primary. In the cluster's current state, if NodeB is the primary, there are processes that break (an issue for another day).

    This has all been working fine for a couple weeks with NodeA as the primary. I go to Server1 and run SELECT @@SERVERNAME and it comes up NodeA as expected. In the SSMS tree, NodeA has said (primary) and NodeB has said (secondary). All rolling along.

    Friday night was update night. I ran windows and SQL updates on B (the secondary). Then I failed over to B and ran the same updates on A. Then I failed back to A. Nothing has been right since.

    If I look at Always On in SSMS on A, the AGs are listed as Primary

    If I look at Always On in SSMS on B, the AGs are listed as Secondary

    If I SSMS to Server1  - SELECT @@SERVERNAME returns NodeB and the AGs are Secondary

    Back at the application, it works if we force it to connect directly to NodeA, but throws a read-access error if we point it at Server1. Explicitly forcing it to NodeA isn't feasible ongoing.

    I cannot figure out why my Server1 IP is forcing itself to the secondary instead of the primary

    HELP!!!!!

  • Check the SQLServer errorlog file !

    Try the failover to A manually in SSMS and see what error comes up.

    Implement failover monitoring so you can see which fases it passes or fails.

    Johan

    Learn to play, play to learn !

    Dont drive faster than your guardian angel can fly ...
    but keeping both feet on the ground wont get you anywhere :w00t:

    - How to post Performance Problems
    - How to post data/code to get the best help[/url]

    - How to prevent a sore throat after hours of presenting ppt

    press F1 for solution, press shift+F1 for urgent solution 😀

    Need a bit of Powershell? How about this

    Who am I ? Sometimes this is me but most of the time this is me

  • I have manually failed over back and forth several times now to no avail. The overall servername lands on the secondary node, whichever that happens to be. Shouldn't the listener ALWAYS be (Primary)?

    I'm going to go ahead and attach a pic of my Explorer window with real server names for the sake of eliminating confusion.

    NodeStatuses

  • BTW What's the advantage of an AG if you really need it to be hosted on a given server ?

    It should be consumable in either situation.

     

    Johan

    Learn to play, play to learn !

    Dont drive faster than your guardian angel can fly ...
    but keeping both feet on the ground wont get you anywhere :w00t:

    - How to post Performance Problems
    - How to post data/code to get the best help[/url]

    - How to prevent a sore throat after hours of presenting ppt

    press F1 for solution, press shift+F1 for urgent solution 😀

    Need a bit of Powershell? How about this

    Who am I ? Sometimes this is me but most of the time this is me

  • So the 1st 2 nodes are exactly what I expect to see. A is primary. B is secondary. The problem is in the last one. I need the listener to present the primary instead of the secondary. This isn't a mirror. It's a failover with a non-readable secondary. If the listener defaults there, it's not working.

  • Redundancy. Disaster recovery. There's 2 reasons. And as I said, it's not intended that B can't be functional as the primary. We'll get there, but that is not my immediate problem.

  • did you check with all prerequisites ?

    Prerequisites, Restrictions, and Recommendations for Always On availability groups

    • This reply was modified 4 years, 2 months ago by  Johan Bijnens.

    Johan

    Learn to play, play to learn !

    Dont drive faster than your guardian angel can fly ...
    but keeping both feet on the ground wont get you anywhere :w00t:

    - How to post Performance Problems
    - How to post data/code to get the best help[/url]

    - How to prevent a sore throat after hours of presenting ppt

    press F1 for solution, press shift+F1 for urgent solution 😀

    Need a bit of Powershell? How about this

    Who am I ? Sometimes this is me but most of the time this is me

  • Are both in the same subnet or are they in different subnets?  Do you have register all IP's enabled? Any DNS caching gone a miss either where your local caches think the listener IP is still pointing to the other subnet/ip address?

    Any read only routing that's gone a miss since the patching?

  • Did you enable the AlwaysOn XEvents session on all involved instances?

    Did you install a monitoring solution to get an overview of failover events,..

    I have a sqlagent job that is launched by SQLServer event 1480 and that collects this information.

    SET QUOTED_IDENTIFIER ON ;

    DECLARE @FileName NVARCHAR(4000)
    SELECT @FileName = target_data.value('(EventFileTarget/File/@name)[1]','nvarchar(4000)')
    FROM (
    SELECT
    CAST(target_data AS XML) target_data
    FROM sys.dm_xe_sessions s
    JOIN sys.dm_xe_session_targets t
    ON s.address = t.event_session_address
    WHERE s.name = N'AlwaysOn_health'
    ) ft
    ;with cteXEData
    as (
    SELECT XEData.value( '(event/@timestamp)[1]', 'datetime2(3)') AS event_timestamp
    , XEData.value( '(event/data[@name="error_number"]/value)[1]', 'int') AS error_number
    , XEData.value( '(event/data[@name="severity"]/value)[1]', 'int') AS severity
    , XEData.value( '(event/data[@name="message"]/value)[1]', 'varchar(max)') AS message
    FROM
    (
    SELECT CAST(event_data AS XML) XEData
    , *
    FROM sys.fn_xe_file_target_read_file
    ( @FileName, NULL, NULL, NULL
    )
    WHERE object_name = 'error_reported'
    ) event_data


    WHERE XEData.value('(event/data[@name="error_number"]/value)[1]', 'int') IN (1480, 35264, 35265, 41404, 41405)
    /*
    OR XEData.value('(event/data[@name="severity"]/value)[1]', 'int') >= 19
    */
    )
    insert into dbo.T_DBA_AvailabilityGroup_FailoverEvents ( [event_timestamp], [error_number], [severity], [message], [DbName], [RoleOld] , [RoleNew] )
    Select *
    -- The availability group database "Koudwals" is changing roles from "RESOLVING" to "PRIMARY" because the mirroring session or availability group failed over due to role synchronization. This is an informational message only. No user action is required.
    , case error_number when 1480 then substring( message, 34, charindex('"', message, 35) - 34 )
    else NULL
    end as DbName
    , case error_number when 1480 then substring( message, charindex('is changing roles from "', message, 35) + 24, charindex('" to "', message, 35) - (charindex('is changing roles from "', message, 35) + 24) )
    else NULL
    end as RoleOld
    , case error_number when 1480 then substring( message, charindex('" to "', message, 35) + 6, charindex('" because ', message, 35) - ( charindex('" to "', message, 35) + 6 ) )
    else NULL
    end as RoleNew
    /* calculated column in table
    , case error_number when 1480 then substring( message, charindex('" because', message, 35) + 2 , 500)
    else NULL
    end as RoleChangeDetails
    */

    from cteXEData xe
    where not exists ( select 1
    from dbo.T_DBA_AvailabilityGroup_FailoverEvents DAE
    where DAE.event_timestamp = xe.event_timestamp
    )
    ORDER BY event_timestamp DESC ;

    Print '[' + cast(@@rowcount as varchar(15)) + '] rows inserted into dbo.T_DBA_AvailabilityGroup_FailoverEvents ';

    Johan

    Learn to play, play to learn !

    Dont drive faster than your guardian angel can fly ...
    but keeping both feet on the ground wont get you anywhere :w00t:

    - How to post Performance Problems
    - How to post data/code to get the best help[/url]

    - How to prevent a sore throat after hours of presenting ppt

    press F1 for solution, press shift+F1 for urgent solution 😀

    Need a bit of Powershell? How about this

    Who am I ? Sometimes this is me but most of the time this is me

  • Johan Bijnens - as I said before, this cluster has been functioning fine for months, as long as B isn't the primary. This problem started 3 days ago.

    Again, I'm not trying to solve B's issues right now. I just want the DNS name DUB-ALAT-SQL01 to go to the primary node, whatever that node may be.

    Anthony.Greene - everything you said checks out except I don't know what "Do you have register all IP's enabled" means.

  • No readonly routing configured. B is just for DR.

  • Sorry to add to all the questions:

    Has NodeB ever worked correctly as the primary node since the Availability Group was setup?

    When you failover to hostB is the IP for the availability group showing on NodeA or NodeB? I'm trying to work out if it's a DNS issue or a failover issue.

     

  • We figured it out kind of. We rebooted both nodes and things fell back into synch. Do I know the initial cause? No I don't.

    For the sake of answering the outstanding questions,  B does work as the primary for 2 of 3 apps we have aimed at the db. So I guess to say it didn't function as the primary was inaccurate. From an AO standpoint, B works fine as the primary. The transaction tokenizer just doesn't like it (we're on to a potential solution already), so we can't perform all necessary business functions on it.

  • I would suspect the AD SPN.

    I recall our cluster losing full control over that object during our tests.

    Broke it down and rebuilt it. Never happened again. We built 10+ 2node clusters for AGs.

    ( finally had to reverse to dbmirroring because one of the specific communications softwares couldn't handle being hosted on a clustered OS.)

    Johan

    Learn to play, play to learn !

    Dont drive faster than your guardian angel can fly ...
    but keeping both feet on the ground wont get you anywhere :w00t:

    - How to post Performance Problems
    - How to post data/code to get the best help[/url]

    - How to prevent a sore throat after hours of presenting ppt

    press F1 for solution, press shift+F1 for urgent solution 😀

    Need a bit of Powershell? How about this

    Who am I ? Sometimes this is me but most of the time this is me

Viewing 14 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply