
Problems with SQL Server in Failover Cluster
Posted Friday, October 11, 2013 4:50 AM
Hi everyone,
I'm facing an interesting but also quite stressful situation with a SQL Server failover cluster (apologies for any translation errors). Maybe someone can help me out or offer a useful tip.
Setup:

two Windows Server 2008 R2 hosts with Cluster Services, two nodes, active/passive cluster
SQL Server 2008 R2, version 10.50.4000
two instances, MSSQL-extern and MSSQL-intern\FA
in addition, NSClient++ and MSDTC as cluster services

Network:
10.58.68.17/24 (nsc-service.fa.de) - NSClient++ Monitoring Service
10.58.68.19/24 (mssql-extern.fa.de) - SQL Server failover IP, default instance: mssql-extern
10.58.68.20/24 (mssql-intern.fa.de) - SQL Server failover IP, FA instance: mssql-intern\fa
10.58.68.21/24 (mssql-cluster.fa.de) - Microsoft Cluster Services
10.58.68.26/24 (mssql-DTC Service) - DTC Interconnect
10.58.68.23/24 (mssql01-1.fa.de) - host 1, data-center network (RZ)
172.16.1.23/24 (mssql01-1.fa.de) - host 1, storage network
10.58.68.24/24 (mssql01-2.fa.de) - host 2, data-center network (RZ)
172.16.1.24/24 (mssql01-2.fa.de) - host 2, storage network
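Since several of the symptoms below could be name-resolution related, a quick sanity check of the address plan above is a small script that resolves each virtual name and compares it with the expected IP. This is only a sketch in Python; the names and addresses are the ones from my list, so adjust them to your own zone:

```python
import socket

# Names/addresses from the list above; adjust to your environment.
EXPECTED = {
    "mssql-extern.fa.de": "10.58.68.19",
    "mssql-intern.fa.de": "10.58.68.20",
    "mssql-cluster.fa.de": "10.58.68.21",
}

def check_dns(expected):
    """Resolve each name and return a dict of mismatches:
    name -> actually resolved address (or None if resolution failed)."""
    mismatches = {}
    for name, ip in expected.items():
        try:
            resolved = socket.gethostbyname(name)
        except socket.gaierror:
            resolved = None
        if resolved != ip:
            mismatches[name] = resolved
    return mismatches
```

Run `check_dns(EXPECTED)` from a client machine and from both nodes; an empty dict means every virtual name resolves as expected.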

Problem:
If I move both instances from node 1 to node 2, instance mssql-extern runs fine, but instance mssql-intern\fa does not.
If I move only the instance mssql-intern\fa to node 2, everything runs fine.
If I then also move the instance mssql-extern to node 2, mssql-extern keeps running, but mssql-intern\fa stops working.
If I leave mssql-extern on node 1 and mssql-intern\fa on node 2, everything runs fine for about an hour. After that, the management network on node 1 (10.58.68.23) becomes unavailable, while the virtual IPs of the instances stay up and running. The management address is no longer reachable via Nagios or ICMP. The problem is only resolved by restarting node 1, and after the reboot I have to migrate both instances back to node 1.
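Because the management address stops answering roughly an hour in, a small watchdog that timestamps the exact moment it drops could be correlated with the cluster and system event logs. A minimal Python sketch, using a TCP probe rather than ICMP so it needs no raw sockets; the probe port is an assumption (e.g. 3389 for RDP on the management interface):

```python
import socket
import time

def port_alive(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def watch(host, port, interval=30.0, probes=None):
    """Print a timestamped line whenever host:port changes state.
    probes limits the number of iterations (None = run forever)."""
    last = None
    n = 0
    while probes is None or n < probes:
        state = port_alive(host, port)
        if state != last:
            word = "reachable" if state else "UNREACHABLE"
            print(f"{time.strftime('%Y-%m-%d %H:%M:%S')} {host}:{port} {word}")
            last = state
        n += 1
        if probes is None or n < probes:
            time.sleep(interval)
```

For example, `watch("10.58.68.23", 3389)` from a monitoring host would log the first moment node 1's management interface goes dark.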

Attempted fixes:
restarted all relevant services after the various migrations
deactivated and reactivated the NIC on node 1 after the failure
monitored the network and traffic; nothing useful found

Thoughts:
It seems that the instance mssql-intern\fa gets blocked once the instance mssql-extern is migrated to node 2. MAC-binding or cache problems do not seem to be the cause, since the instances are addressed via virtual interfaces and DNS.
The theory that node 1 plays some sort of "master" role in this setup has been dropped, because the behavior does not change after node 1 is restarted while the problem is occurring.

Next steps:
Either fix the problem itself, or move all databases from instance mssql-intern\fa into mssql-extern and test again. Otherwise I would have to break up the cluster and build two standalone systems, but that would be the last resort (and certainly not my preferred one).

Every hint is welcome.




Post #1503946
Posted Saturday, October 12, 2013 10:36 AM


Are you using any trace flags on the SQL Server service startup?
Can you provide more detail on the host spec and storage setup too?


-----------------------------------------------------------------------------------------------------------

"Ya can't make an omelette without breaking just a few eggs"
Post #1504213
Posted Monday, October 14, 2013 12:44 AM
Currently no trace flags, but I will look into this tonight and post any errors here. The setup is:
two IBM x3650 M3 servers, one CPU (4 cores) each, 72 GB RAM, 4 onboard NICs, plus 2 10GbE NICs for storage, switched via an Arista 10GbE fan-out. Storage is a NetApp 3020; the servers have 9 LUNs on the storage, one each for system DBs, DBs, and logs (x2, for both instances), plus backup, monitoring, and application transfers. The crazy thing is that the services are all up and running in each scenario, yet cannot be reached from anywhere. If there were a problem with the storage, the SQL Server services would not start at all. I think the problem lies with either the cluster services or the SQL services; it could also be DNS.
Post #1504327
Posted Monday, October 14, 2013 12:56 AM


First, run a cluster validation and review the output.
Post details of any errors.


Post #1504331
Posted Monday, October 14, 2013 1:33 AM
Cluster validation runs fine; the only warnings are about unsigned drivers for the RAM disk we are using for tempdb. The other warning concerns the resource monitor for SQL Server and the Agent: the validation tool complains that SQL Server and the Agent are not configured to run in a shared resource monitor. I guess this is a SQL Server specific setting?
Post #1504337
Posted Monday, October 14, 2013 1:46 AM
So, which trace flag number should I use? There are so many.
Post #1504338
Posted Monday, October 14, 2013 2:54 AM
this could be a solution: http://www.mssqltips.com/sqlservertip/2928/updating-the-tcp-port-on-a-sql-server-cluster-node/
The named instance uses dynamic ports. What do you think? It seems that the Native Client has a problem with dynamic port mappings.
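One thing worth noting once the port is fixed: clients that specify `host,port` as the server name skip the SQL Browser lookup entirely, while `host\instance` needs UDP 1434 to resolve the port first. A purely illustrative Python helper for the two server-name forms (a sketch, not any official API):

```python
def mssql_target(host, instance=None, port=None):
    """Build the server part of a SQL Server connection string.
    'host,port' connects directly and bypasses the SQL Browser;
    'host\\instance' requires the Browser (UDP 1434) to resolve the port."""
    if port is not None:
        return f"{host},{port}"       # e.g. Server=mssql-intern.fa.de,1433
    if instance is not None:
        return f"{host}\\{instance}"  # e.g. Server=mssql-intern.fa.de\fa
    return host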
Post #1504364
Posted Monday, October 14, 2013 5:29 AM


alexander.karls (10/14/2013)
So, which trace flag number should I use? There are so many.

None, I just want to check whether you have any enabled.


Post #1504407
Posted Monday, October 14, 2013 5:34 AM


Check the information in the application log when the service fails to start.

Post #1504410
Posted Tuesday, October 15, 2013 2:25 AM
Here is a small update on the problem after some testing yesterday:
-DNS, ICMP, port status during failover on both nodes > no problems, only 2 packets lost
-event logs during failover > no errors
-cluster validation including disks > no errors
-connecting via TCP and/or named pipes > no change, the web app spits out error 26 (ASP)
-changed the dynamic TCP port of the named instance from 60485 to a static 1433 > no change
-tested open UDP port 1434 with portqry > check, OK
-checked user access rights for services and web app > OK
-restarted the SQL Browser > no change

New in the situation: the databases in the default instance are all reachable, but with poor performance (on node 2). This only changes if I stop the SQL Browser on the passive node. Also new: nearly all databases of the named instance are reachable now (sometimes also with poor performance); only one older web app cannot reach its databases at all, no matter whether TCP or named pipes. This web app uses the Native Client, which cannot be changed. The error stays at: SQL Network Interfaces, error: 26 - Error Locating Server/Instance Specified.
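Error 26 means exactly that the client could not resolve the named instance via the SQL Browser (the SSRP exchange on UDP 1434). To see what the Browser on each node actually answers, the SSRP instance query can be hand-rolled; a Python sketch following the [MS-SQLR] framing, with the port made a parameter so it can be tested locally:

```python
import socket

def ssrp_query(host, instance, port=1434, timeout=3.0):
    """Ask the SQL Browser service for a named instance's connection info
    (SSRP CLNT_UCAST_INST request). Returns the response string or None."""
    payload = b"\x04" + instance.encode("ascii") + b"\x00"  # 0x04 = CLNT_UCAST_INST
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        s.sendto(payload, (host, port))
        try:
            data, _ = s.recvfrom(4096)
        except socket.timeout:
            return None
    # SVR_RESP: 0x05, 2-byte little-endian length, then a string like
    # 'ServerName;MSSQL01;InstanceName;FA;IsClustered;Yes;...;tcp;60485;;'
    if data[:1] != b"\x05":
        return None
    return data[3:].decode("ascii", errors="ignore")
```

Running `ssrp_query("mssql-intern.fa.de", "FA")` against both nodes' host addresses while the instance sits on node 2 would show whether the Browser answers differently (or not at all) from each node; that would fit the observation that stopping the Browser on the passive node changes the behavior.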

The "split brain" scenario can be dropped; that was never the intention. We want to run the instances either on node 1 OR on node 2; besides, any other constellation would cause a problem with the quorum.

So the question is: what is the difference between node 1 and node 2? They are the same model, same build, same everything. The services start fine, there are no errors, and there is no measurable network performance issue...

I'm really stuck with this problem.
Post #1504645