
Problems with SQL Server in Failover Cluster
Posted Friday, October 11, 2013 4:50 AM
Hi everyone,
I'm facing an interesting but also quite stressful situation with a SQL Server failover cluster (apologies for any translation errors). Maybe someone can help me out or offer a useful tip.
Setup:

two Windows Server 2008 R2 hosts with Cluster Services, two nodes, active/passive cluster
SQL Server 2008 R2, version 10.50.4000
two instances, MSSQL-extern and MSSQL-intern\FA
in addition, NSClient++ and MSDTC as cluster services

Network:
10.58.68.17/24 (nsc-service.fa.de) - NSClient++ Monitoring Service
10.58.68.19/24 (mssql-extern.fa.de) - SQL Server failover IP, default instance: mssql-extern
10.58.68.20/24 (mssql-intern.fa.de) - SQL Server failover IP, FA instance: mssql-intern\fa
10.58.68.21/24 (mssql-cluster.fa.de) - Microsoft Cluster Services
10.58.68.26/24 (mssql-DTC Service) - DTC Interconnect
10.58.68.23/24 (mssql01-1.fa.de) - host 1, data-center network (RZ)
172.16.1.23/24 (mssql01-1.fa.de) - host 1, storage network
10.58.68.24/24 (mssql01-2.fa.de) - host 2, data-center network (RZ)
172.16.1.24/24 (mssql01-2.fa.de) - host 2, storage network
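Since several of the symptoms below could be name-resolution related, a quick sanity check of the address plan above is a small script that resolves each virtual name and compares it with the expected IP. This is only a sketch in Python; the names and addresses are the ones from my list, so adjust them to your own zone:

```python
import socket

# Names/addresses from the list above; adjust to your environment.
EXPECTED = {
    "mssql-extern.fa.de": "10.58.68.19",
    "mssql-intern.fa.de": "10.58.68.20",
    "mssql-cluster.fa.de": "10.58.68.21",
}

def check_dns(expected):
    """Resolve each name and return a dict of mismatches:
    name -> actually resolved address (or None if resolution failed)."""
    mismatches = {}
    for name, ip in expected.items():
        try:
            resolved = socket.gethostbyname(name)
        except socket.gaierror:
            resolved = None
        if resolved != ip:
            mismatches[name] = resolved
    return mismatches
```

Run `check_dns(EXPECTED)` from a client machine and from both nodes; an empty dict means every virtual name resolves as expected.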

Problem:
If I move both instances from node 1 to node 2, instance mssql-extern runs fine, but instance mssql-intern\fa does not.
If I move only the instance mssql-intern\fa to node 2, everything runs fine.
If I then also move the instance mssql-extern to node 2, mssql-extern keeps running, but mssql-intern\fa stops working.
If I leave mssql-extern on node 1 and mssql-intern\fa on node 2, everything runs fine for about an hour. After that, the management network on node 1 (10.58.68.23) becomes unavailable, while the virtual IPs of the instances stay up and running. The management address is no longer reachable via Nagios or ICMP. The problem is only resolved by restarting node 1, and after the reboot I have to migrate both instances back to node 1.
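Because the management address stops answering roughly an hour in, a small watchdog that timestamps the exact moment it drops could be correlated with the cluster and system event logs. A minimal Python sketch, using a TCP probe rather than ICMP so it needs no raw sockets; the probe port is an assumption (e.g. 3389 for RDP on the management interface):

```python
import socket
import time

def port_alive(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def watch(host, port, interval=30.0, probes=None):
    """Print a timestamped line whenever host:port changes state.
    probes limits the number of iterations (None = run forever)."""
    last = None
    n = 0
    while probes is None or n < probes:
        state = port_alive(host, port)
        if state != last:
            word = "reachable" if state else "UNREACHABLE"
            print(f"{time.strftime('%Y-%m-%d %H:%M:%S')} {host}:{port} {word}")
            last = state
        n += 1
        if probes is None or n < probes:
            time.sleep(interval)
```

For example, `watch("10.58.68.23", 3389)` from a monitoring host would log the first moment node 1's management interface goes dark.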

Attempted fixes:
restarted all relevant services after the various migrations
deactivated and reactivated the NIC on node 1 after the failure
monitored the network and traffic; nothing useful found

Thoughts:
It seems that the instance mssql-intern\fa gets blocked once the instance mssql-extern is migrated to node 2. MAC-binding or cache problems do not seem to be the cause, since the instances are addressed via virtual interfaces and DNS.
The theory that node 1 plays some sort of "master" role in this setup has been dropped, because the behavior does not change after node 1 is restarted while the problem is occurring.

Next steps:
Either fix the problem itself, or move all databases from instance mssql-intern\fa into mssql-extern and test again. Otherwise I would have to break up the cluster and build two standalone systems, but that would be the last resort (and certainly not my preferred one).

Every hint is welcome.




Post #1503946
Posted Saturday, October 12, 2013 10:36 AM


Are you using any trace flags on the SQL Server service startup?
Can you provide more detail on the host spec and storage setup too?


-----------------------------------------------------------------------------------------------------------

"Ya can't make an omelette without breaking just a few eggs"
Post #1504213
Posted Monday, October 14, 2013 12:44 AM
Currently no trace flags, but I will look into this tonight and post any errors here. The setup is:
two IBM x3650 M3 servers, one CPU (4 cores) each, 72 GB RAM, 4 onboard NICs, plus 2 10GbE NICs for storage, switched via an Arista 10GbE fan-out. Storage is a NetApp 3020; the servers have 9 LUNs on the storage, one each for system DBs, DBs, and logs (x2, for both instances), plus backup, monitoring, and application transfers. The crazy thing is that the services are all up and running in each scenario, yet cannot be reached from anywhere. If there were a problem with the storage, the SQL Server services would not start at all. I think the problem lies with either the cluster services or the SQL services; it could also be DNS.
Post #1504327
Posted Monday, October 14, 2013 12:56 AM


First, run a cluster validation and review the output.
Post details of any errors.


Post #1504331
Posted Monday, October 14, 2013 1:33 AM
Cluster validation runs fine; the only warnings are about unsigned drivers for the RAM disk we are using for tempdb. The other warning concerns the resource monitor for SQL Server and the Agent: the validation tool complains that SQL Server and the Agent are not configured to run in a shared resource monitor. I guess this is a SQL Server specific setting?
Post #1504337
Posted Monday, October 14, 2013 1:46 AM
So, which trace flag number should I use? There are so many.
Post #1504338
Posted Monday, October 14, 2013 2:54 AM
this could be a solution: http://www.mssqltips.com/sqlservertip/2928/updating-the-tcp-port-on-a-sql-server-cluster-node/
The named instance uses dynamic ports. What do you think? It seems that the Native Client has a problem with dynamic port mappings.
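One thing worth noting once the port is fixed: clients that specify `host,port` as the server name skip the SQL Browser lookup entirely, while `host\instance` needs UDP 1434 to resolve the port first. A purely illustrative Python helper for the two server-name forms (a sketch, not any official API):

```python
def mssql_target(host, instance=None, port=None):
    """Build the server part of a SQL Server connection string.
    'host,port' connects directly and bypasses the SQL Browser;
    'host\\instance' requires the Browser (UDP 1434) to resolve the port."""
    if port is not None:
        return f"{host},{port}"       # e.g. Server=mssql-intern.fa.de,1433
    if instance is not None:
        return f"{host}\\{instance}"  # e.g. Server=mssql-intern.fa.de\fa
    return host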
Post #1504364
Posted Monday, October 14, 2013 5:29 AM


alexander.karls (10/14/2013)
So, which trace flag number should I use? There are so many.

None, I just want to check whether you have any enabled.


Post #1504407
Posted Monday, October 14, 2013 5:34 AM


Check the information in the application log when the service fails to start.

Post #1504410
Posted Tuesday, October 15, 2013 2:25 AM
Here is a small update on the problem after some testing yesterday:
-DNS, ICMP, port status during failover on both nodes > no problems, only 2 packets lost
-event logs during failover > no errors
-cluster validation including disks > no errors
-connecting via TCP and/or named pipes > no change, the web app spits out error 26 (ASP)
-changed the dynamic TCP port of the named instance from 60485 to a static 1433 > no change
-tested open UDP port 1434 with portqry > check, OK
-checked user access rights for services and web app > OK
-restarted the SQL Browser > no change

New in the situation: the databases in the default instance are all reachable, but with poor performance (on node 2). This only changes if I stop the SQL Browser on the passive node. Also new: nearly all databases of the named instance are reachable now (sometimes also with poor performance); only one older web app cannot reach its databases at all, no matter whether TCP or named pipes. This web app uses the Native Client, which cannot be changed. The error stays at: SQL Network Interfaces, error: 26 - Error Locating Server/Instance Specified.
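Error 26 means exactly that the client could not resolve the named instance via the SQL Browser (the SSRP exchange on UDP 1434). To see what the Browser on each node actually answers, the SSRP instance query can be hand-rolled; a Python sketch following the [MS-SQLR] framing, with the port made a parameter so it can be tested locally:

```python
import socket

def ssrp_query(host, instance, port=1434, timeout=3.0):
    """Ask the SQL Browser service for a named instance's connection info
    (SSRP CLNT_UCAST_INST request). Returns the response string or None."""
    payload = b"\x04" + instance.encode("ascii") + b"\x00"  # 0x04 = CLNT_UCAST_INST
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        s.sendto(payload, (host, port))
        try:
            data, _ = s.recvfrom(4096)
        except socket.timeout:
            return None
    # SVR_RESP: 0x05, 2-byte little-endian length, then a string like
    # 'ServerName;MSSQL01;InstanceName;FA;IsClustered;Yes;...;tcp;60485;;'
    if data[:1] != b"\x05":
        return None
    return data[3:].decode("ascii", errors="ignore")
```

Running `ssrp_query("mssql-intern.fa.de", "FA")` against both nodes' host addresses while the instance sits on node 2 would show whether the Browser answers differently (or not at all) from each node; that would fit the observation that stopping the Browser on the passive node changes the behavior.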

The "split brain" scenario can be dropped; that was never the intention. We want to run the instances either on node 1 OR on node 2; besides, any other constellation would cause a problem with the quorum.

So the question is: what is the difference between node 1 and node 2? They are the same model, same build, same everything. The services start fine, there are no errors, and there is no measurable network performance issue...

I'm really stuck with this problem.
Post #1504645