AG Wizard - Cannot bring resource online, error codes 5942 and 41066

  • Hi,

    I know this error and why it happens - very well in fact.

    Usually this is my config (simplified):

    2 SQL hosts, SQL01 and SQL02

    WSFC between them, SQLWSFC01

    WSFC Security Group, WSFCSG01, contains SQLWSFC01

    WSFCSG01 security group is given permission in DNS to be able to manage the Listener DNS record

    WSFCSG01 security group object is given permissions to the OU where SQLWSFC01 resides:

    ...to allow it to 'Create/Delete' computer objects on 'This and descendant objects'

    ...to allow it to have Full Control to computer objects

    This ALWAYS works, I've even checked current deployment against a previous one I did and it's the same.

    What am I missing... where can I look?

    Thanks

  • You say you know this error very well, but sometimes it helps to go back to the basics and just double check the VERY basic configuration.

    Have you come across this article before:

    https://www.eugenechiang.com/2020/07/01/create-failed-for-availability-group-listener-error-41066/#:~:text=Cannot%20bring%20the%20Windows%20server%20fail%20over%20cluster,in%20a%20state%20that%20could%20accept%20the%20request.

    It has steps to check and things to do to try to resolve the error code you mentioned.  Not trying to second guess your work, but usually when I hit odd snags like this, going back and double checking the "simple" stuff can lead me in the right direction.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there are any negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • Yes I've read that article before. By "very well", I mean the "flow" of things in terms of what happens, i.e. the WSFC computer object is what needs access to create the AG objects, not the current user creating the AG etc. etc.

    I've asked the AD team to pre-stage everything anyway to test - even though this isn't needed since the right objects have the right permissions to create things in AD/OU/DNS etc.

  • Weirdly, last week the AG Wizard was successful in creating the AG, but failed as above with the Listener. Yesterday I manually created the Listener and it worked (nothing was pre-staged ever).

    Today, a new AG was created, but again the Listener failed as part of the Wizard, and it also failed when I created it manually after the Wizard. Towards the end of March (during testing), the AG Wizard was successful for both the AG and the Listener - not even a blink of an issue.

    It's not my AD environment and I have no control over AD so am feeding back to the AD team. They have AD auditing tools and comparing end of March to now, nothing has changed as far as the WSFC object is concerned, for permissions to DNS and AD/OU.

    So it seems hit and miss.

  • PS: The permissions to AD/OU/DNS are exactly the same as those I have set in previous environments, and there it worked 100% of the time. I had to build about 20 AGs.

  • "Not trying to second guess your work, but usually when I hit odd snags like this, going back and double checking the "simple" stuff can lead me in the right direction."

    Didn't think you were, and I know what you mean about the "basic" stuff 🙂

  • I personally find that the wizards are hit and miss and I prefer to set things up myself where possible.  I've had the wizards say "everything is great" only to discover that one step failed, but it kept going.  Or it'll say it failed and roll the whole thing back when it is a simple thing to correct.

    With intermittent issues like that, my first thought is usually the network.  I've seen faulty ethernet cables cause all sorts of strange issues, for example.  I've seen faulty patch cables pass along enough good packets for most simple tests to pass successfully, but as soon as a load is put on the cable, it fails.  Not 100% sure what was wrong with the cable as it would pass the gigabit tests with our cable tester, but when trying to stream video across that cable, it would have TONS of packet loss.

    Might not be the network, but since nothing changed on your end and it just suddenly worked at one point, I would be thinking it is either some intermittently faulty hardware OR someone is changing configurations on something (firewall for example) and not letting you know.

    What may be interesting to do is a constant ping between all boxes participating in the AG and let that run for a few hours and then cancel it and look how much packet loss you had.  If you see a lot of packet loss, I'd bring in the network team.
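    A rough sketch of that long-running ping test, in case it helps. The function names and defaults here are made up for illustration, the `-W` flag is assumed to take seconds (as on Linux), and the ping exit code is only an approximation of "reply received" (Windows ping can return 0 for some "unreachable" replies), so treat the numbers as a hint rather than proof.

```python
import platform
import subprocess
import time


def ping_once(host: str, timeout_s: int = 2) -> bool:
    """Send one echo request; True if the ping command exits with code 0."""
    if platform.system() == "Windows":
        # -n = number of echo requests, -w = timeout in milliseconds
        cmd = ["ping", "-n", "1", "-w", str(timeout_s * 1000), host]
    else:
        # -c = number of echo requests, -W = timeout (seconds on Linux)
        cmd = ["ping", "-c", "1", "-W", str(timeout_s), host]
    result = subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return result.returncode == 0


def loss_percent(results: list[bool]) -> float:
    """Percentage of failed probes in a list of ping outcomes."""
    if not results:
        return 0.0
    failed = sum(1 for ok in results if not ok)
    return 100.0 * failed / len(results)


def probe(host: str, count: int, interval_s: float = 1.0) -> float:
    """Ping `host` `count` times, pausing between probes; return loss in %."""
    results = []
    for _ in range(count):
        results.append(ping_once(host))
        time.sleep(interval_s)
    return loss_percent(results)
```

    Calling something like `probe("SQL02", 3600)` from SQL01 (and the reverse from SQL02) would cover roughly an hour; anything much above 0% loss on a local network is worth showing to the network team.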

    The main reason I don't think you are doing anything wrong is that you changed nothing, just tried it again, and it sometimes succeeds.  To me this indicates the problem is NOT with the setup or configuration you have in place, but is something likely related to the network (in this case).


  • Hi!

    Thanks for the replies.

    Have an update on this...

    created 2 new servers (VMs)

    added Failover Clustering

    created Cluster

    installed SQL

    enabled AlwaysOn in Configuration Manager

    boom - creating an AG with an AGL in the wizard worked with no issues at all. The new setup has exactly the same configuration/permissions in AD and DNS to create the Listener computer object and DNS record.

    Is there anything I can check on the original Windows hosts themselves? And is there a way to "undo" all the AG config, including things like removing Endpoint URLs, so the SQL instances are as if they never had an AG attempted on them?

    Thanks again

  • The hosts the VMs are running on are connected to the same network switches as the current physical SQL servers. Cluster.log is giving false information as well since it's complaining that the chosen Listener IP is a duplicate when it's 100% not!

    I'm really stumped with this one!

  • I'm planning to trash all the Windows Failover Cluster config. (none of the SQL is using it yet) and start again, but would ideally like to know the root cause.

  • Another update:

    We have stacked instances on the Windows hosts. I've just tried with what was the first named instance to be installed and it worked perfectly first time.

    So I then tried on a different instance - a different one to what I've been trying so far - and it has the same error as the instance we've previously tried.

    So it seems like any instances installed after the first instance have this issue.

    Each instance service port is 100% unique (set statically and manually) - and is the same port across both hosts

    Each listener port we're trying to use is 100% unique for that instance

    Each instance's endpoint port is unique to that instance

    Each other instance fails with the same error in cluster.log ("duplicate IP address for listener")

    There is definitely not duplicate IPs though!
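    The uniqueness checks listed above can be double-checked mechanically rather than by eye. A minimal sketch, assuming the per-instance settings are collected by hand into a dictionary; the instance names, port numbers, and the 192.0.2.x addresses below are placeholders, not the real values from this environment.

```python
from collections import defaultdict


def find_collisions(inventory):
    """Return every port or IP value that is used by more than one setting.

    Keys are (category, value); values list the "INSTANCE.setting" labels
    that share that value. All ports are compared against each other,
    regardless of whether they are service, listener, or endpoint ports.
    """
    users = defaultdict(list)
    for instance, settings in inventory.items():
        for name, value in settings.items():
            category = "port" if name.endswith("_port") else "ip"
            users[(category, value)].append(f"{instance}.{name}")
    return {key: labels for key, labels in users.items() if len(labels) > 1}


# Hypothetical inventory; replace with the values actually configured.
example_inventory = {
    "INSTANCE1": {
        "service_port": 50001,
        "listener_port": 51001,
        "endpoint_port": 5022,
        "listener_ip": "192.0.2.100",
    },
    "INSTANCE2": {
        "service_port": 50002,
        "listener_port": 51002,
        "endpoint_port": 5023,
        "listener_ip": "192.0.2.101",
    },
}

print(find_collisions(example_inventory))
```

    An empty result only shows that the settings written down don't clash with each other; it says nothing about whether another device on the network is already using one of the addresses.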

  • Actually - I've just done this last test:

    The AGL IP that worked in INSTANCE1 was x.x.x.100

    The AGL IP that didn't work in INSTANCE2 was x.x.x.101

    x.x.x.101 does not respond to a ping.

    If I use .100 in INSTANCE2, it works. If I use .101 in INSTANCE1, it fails. So it's actually looking like a duplicate IP after all. How does SQL/Windows Clustering determine that an IP is a duplicate?
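    One caveat on the ping test: an address that doesn't answer a ping isn't necessarily free, since ICMP can be blocked by a firewall while the address is still in use. The ARP cache is another place to look: ping the address from a host on the same subnet, then check whether `arp -a` shows a MAC address for it. Below is a small sketch that searches `arp -a` style text for an IP. The function name is made up, the parsing assumes the usual Windows or Linux output layout, and this is not a claim about how the cluster itself performs its duplicate check.

```python
import re

# Six two-digit hex octets separated by "-" (Windows) or ":" (Linux).
MAC_PATTERN = re.compile(r"(?:[0-9a-fA-F]{2}[-:]){5}[0-9a-fA-F]{2}")


def find_mac(arp_output: str, ip: str):
    """Return the MAC listed for `ip` in `arp -a` style text, or None."""
    for line in arp_output.splitlines():
        # Linux prints the address as "(192.0.2.1)"; strip the parentheses
        # so the IP can be matched as a whole token, not as a substring.
        tokens = line.replace("(", " ").replace(")", " ").split()
        if ip in tokens:
            match = MAC_PATTERN.search(line)
            if match:
                return match.group(0).lower()
    return None
```

    To use it, capture the output with something like `subprocess.run(["arp", "-a"], capture_output=True, text=True).stdout` and pass it in along with the listener IP. A MAC address coming back for x.x.x.101 would suggest some device already holds that address even though it ignores pings.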
