Problems with SQL Server in Failover Cluster

  • English Version (sorry for translation errors 😉 ):

    Hi everyone,

    i´m facing an interesting but also really stressy situation with a SQL-Failover-Cluster. Maybe someone can help me out or give some useful tip.

    Setup:

    two Windows Server 2008 R2 with Cluster Services, two nodes, Active / Passive Cluster

    SQL Server 2008 R2, Version 10.50.4000

    two Instances, MSSQL-extern and MSSQL-intern\FA

    in addition NSClient++ and MSDTC as Clusterservices

    Network:

    10.58.68.17/24 (nsc-service.fa.de) - NSClient++ Monitoring Service

    10.58.68.19/24 (mssql-extern.fa.de) - SQL Server Failover IP Default Instanz: mssql-extern

    10.58.68.20/24 (mssql-intern.fa.de) - SQL Server Failover IP FA Instanz: mssql-intern\fa

    10.58.68.21/24 (mssql-cluster.fa.de) - Microsoft Cluster Services

    10.58.68.26/24 (mssql-DTC Service) - DTC Interconnect

    10.58.68.23/24 (mssql01-1.fa.de) - Host RZ

    172.16.1.23/24 (mssql01-1.fa.de) - Host Storage

    10.58.68.24/24 (mssql01-2.fa.de) - Host RZ

    172.16.1.24/24 (mssql01-2.fa.de) - Host Storage

    Problem:

    If i move both instances from node 1 to node 2, then instance mssql-extern is running fine, instance mssql-intern\fa doesn't

    If i move only the instance mssql-intern\fa to node 2, then everything runs ok

    if i also move the instance mssql-extern to node 2, then mssql-extern runs, mssql-intern\fa stops working

    If i leave mssql-extern on node 1 and mssql-intern\fa on node 2, then everything runs fine for maybe one hour. After that, the network from the management on node 1 turns unaviable, the virtual ip´s from the instances stays up and running. Management isn´t reachable through nagios oder icmp. the Problem only solves, when node 1 is restarted. after the reboot from node 1 i have to migrate both instances back to node 1

    Solutions:

    restart all necessary services after some migrations

    deactivate the NIC on node 1 after the failover

    monitor the network and traffic, nothing useful found

    Thoughts:

    It seem, that the instance mssql-intern\fa is blocked after the migration from the instance mssql-extern to node 2. MAC-Binding or cache problems don´t seem to be the cause.

    The theory, that node 1 is some sort of "master" in this setup has been dropped because there is no change in behavior after a restart from node 1 when the problem occurs.

    Next steps:

    Either fix the problem itself or put all database from instance mssql-intern\fa to mssql-extern and test again. Otherwise i would have to build a two standalone system setup, but that would then be the last option (and, for sure, not my preferred one)

    Every hint is welcome

    German Version:

    Hallo allerseits,

    ich habe hier ein interessantes und zugleich ziemlich nerviges Problem mit einem SQL-Failover-Cluster. Ich hoffe, hier vielleicht noch den einen oder anderen Tipp zu bekommen, denn ich bin mittlerweile wirklich ratlos.

    Zum Setup:

    zwei Windows Server 2008 R2 mit Cluster Services, somit zwei Knoten, Active / Passive Cluster

    SQL Server 2008 R2, Version 10.50.4000

    zwei Instanzen, MSSQL-extern und MSSQL-intern\FA

    zusätzlich NSClient++ und MSDTC als Clusterdienste

    Netzwerk:

    10.58.68.17/24 (nsc-service.fa.de) - NSClient++ Monitoring Service für diesen Cluster

    10.58.68.19/24 (mssql-extern.fa.de) - SQL Server Failover IP für Default Instanz: mssql-extern

    10.58.68.20/24 (mssql-intern.fa.de) - SQL Server Failover IP für FA Instanz: mssql-intern\fa

    10.58.68.21/24 (mssql-cluster.fa.de) - Microsoft Cluster Dienste

    10.58.68.26/24 (mssql-DTC Service) - DTC Interconnect

    10.58.68.23/24 (mssql01-1.fa.de) - Host RZ

    172.16.1.23/24 (mssql01-1.fa.de) - Host Storage

    10.58.68.24/24 (mssql01-2.fa.de) - Host RZ

    172.16.1.24/24 (mssql01-2.fa.de) - Host Storage

    Problem:

    Verschiebe ich beide Instanzen von Knoten 1 auf Knoten 2, dann sind die Datenbanken der Instanz mssql-extern sofort wieder erreichbar, die Instanzen von mssql-intern\fa nicht

    Verschiebe ich nur die Instanz mssql-intern\fa auf Knoten 2, funktioniert diese auf Anhieb

    Verschiebe ich die Instanz mssql-extern dann auch auf Knoten 2, dann funktioniert die Instanz mssql-extern, aber mssql-intern\fa nicht mehr

    Lasse ich mssql-extern auf Knoten 1 und mssql-intern\fa auf Knoten 2, dann läuft das für ca 1 Stunde sauber, danach stirbt die Netzwerkverbindung vom Management Interface auf Knoten 1. Die virtuellen Adressen funktionieren weiter, aber die Management IP 10.58.68.23 ist nicht mehr erreichbar durch Nagios oder ICMP. Das Problem lässt sich nur durch einen Neustart beheben. Nach dem Neustart ist aber die Instanz mssql-intern\fa auf Knoten 2 nicht mehr erreichbar, also muss alles wieder auf den ersten Knoten.

    Lösungsansätze:

    alle notwendigen Dienste nach diversen Verschiebungen neu gestartet

    NIC auf Knoten 1 nach Ausfall deaktivert, wieder aktiviert

    Netzwerk Monitoring laufen lassen, keine Fehler zu finden

    Gedanken:

    Scheinbar verklemmt sich die Instanz mssql-intern\fa, sobald die andere Instanz auf Knoten 2 mitläuft, auf dem ersten Knoten stellt dies jedoch kein Problem dar. MAC-Binding oder Cache Probleme schliesse ich aus, da es sich ja bei den Instanznamen und virtuelle Interfaces bzw DNS handelt

    Der Ansatz, das Knoten 1 so eine Art Master-Rolle spielt konnte ich durch Tests auch "eigentlich" ausschliessen

    Nächste Schritte:

    Entweder das Problem lösen oder zum Test alle Datenbanken in eine Instanz stecken, sinnvollerweise in die mssql-extern, denn diese lässt sich ja beliebig verschieben. Wenn das nichts hilft, dann müsste ich wohl oder übel den Cluster auflösen und zwei Standalone System bauen, das möchte ich aber eigentlich verhindern.

    Jede Hilfe ist willkommen…

  • are you using any trace flags on the sql server service startup?

    Can you provide more detail on the host spec and storage setup too?

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • currently no trace flags, but i will handle this tonight and post errors here. the setup is:

    two ibm x3650 M3 servers, one cpu (4 core) each, 72GB ram, 4 onboard nic, 2 10GE nics for the storage, switched via an arista 10GE fan out. storage is a netapp 3020, servers got 9 luns on the storage, one for each sysdb, db, log (x2, for both instances) plus backup, monitoring and application transfers. the crazy thing is, that the services are all up and running in each scenario but can´t get reached from nowhere. if there was a problem with storage than the sql server services wouldnt start. i think that there is a problem with either the cluster services or the sql services, maybe could be dns, also

  • Firstly run a cluster validation and review the output.

    post details of any errors.

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • cluster validation runs fine, only warnings about unsigned drives for the ramdisk we are using for the temp db. the other warning is about ressource monitor for sql server and agent, the validiation tool complains about sql server and agent are not configured to run in a shared monitor. i guess, this is a sql server specific config?

  • so, what trace number should i use, there are so many

  • this could be a solution: http://www.mssqltips.com/sqlservertip/2928/updating-the-tcp-port-on-a-sql-server-cluster-node/

    the named instance uses dynamic ports. what do you think? it seems, that the native client got´s a problem with dynamic mappings

  • alexander.karls (10/14/2013)


    so, what trace number should i use, there are so many

    none, just want to check if you have any enabled.

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • check the information in the application log when the service fails to start

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • Here comes a small update on the problem case after some testing yesterday:

    -DNS, ICMP, Portstatus during Failover on both nodes > no problems, only 2 pakets lost

    -Eventlogs during Failover > no errors

    -Cluster Validation with Disks > no errors

    -Connect via TCP and / or NP > no changes, Webapp spit our error 26 (ASP)

    -change the dynamic TCP Ports for Named Instance from 60485 to 1433 > no change

    -Test open UDP Port 1434 with portqry > check, ok

    -check User Access rights for services and wepapp > ok

    -SQL Browser restarts > no change

    new in the situation is, that the databases in the default instance are all reachable, but with poor performance (on node 2). this only changes if i stop the sql brwoser on the passive node. also new: nearly all databases from the named instance are reachable now (including poor performance, sometimes), only one older web app cant reach their databases at all, no matter if tcp or np. this web app uses the native client, that can not be changed. error stays on 26 - SQL Network Interfaces, error: 26 - Error Locating Server/Instance Specified

    The "split brain" scenario can be dropped, for that wasn´t in our mind. we want to run the instances either on node 1 OR node 2, besides with any other constellation there would be a problem with the quorum.

    the question is: whats the difference between node 1 and node 2? they´re the modell, same build, same everything. the services are starting fine, there are no errors, theres no network performance issue to measure...

    i´m really stuck with this problem

    Die Frage ist also nun: Was ist an dem zweiten Knoten so anders als an dem ersten?

  • Have you checked the cluster events to see if you have any issues reported there?

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • jep, no issues at all

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply