I've got a set of servers which are providing data to a web service for display on a web page.
They're all running SQL Server 2012 Express. They are Windows Server 2012 VMs running on 2 VMware hosts (2 Express servers per host).
I then have a server running SQL Server 2012 Standard, which is used as the data processor. This is running as a Windows Server 2012 VM on the same VMware host as Express servers 1 & 2.
To get the data from the main processing server to the Express instances I'm using linked servers (using TCP/IP).
The process is:
The SQL Standard machine calls a procedure on each SQL Express machine in turn, which copies the latest copy of the data into a parallel database (i.e. not the one currently being accessed by the web service). Once all are complete, the views that access the data are altered automatically to point at the new location.
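To make the ordering concrete, here is a rough sketch of that flow in Python rather than the actual SSIS/T-SQL. The names `refresh_all`, `copy_to_standby` and `switch_views` are hypothetical stand-ins for the real procedure calls, and it assumes a failed copy raises an error so that no views are switched until every server has its new copy.

```python
from typing import Callable, Iterable, List


def refresh_all(
    servers: Iterable[str],
    copy_to_standby: Callable[[str], None],
    switch_views: Callable[[str], None],
) -> List[str]:
    """Copy to each server in turn, then repoint the views on all of them.

    copy_to_standby and switch_views are hypothetical callables standing in
    for the linked-server procedure call and the view alteration.
    """
    copied: List[str] = []
    for server in servers:
        # Assumed to raise on failure, which aborts the run before any
        # view has been switched.
        copy_to_standby(server)
        copied.append(server)

    # Only reached once every copy has completed.
    for server in copied:
        switch_views(server)
    return copied
```

With this shape, a failure on the 4th server leaves all four servers still pointing at the old data, which is the behaviour described above.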
Originally, the SQL Standard machine was just running a set of INSERTs, but the performance was so bad that I swapped it over to make the Express machines control the INSERTs.
This part of the process is part of a larger group of processes which are managed from a series of SSIS packages.
Clearly, as I'm writing this, there is a problem...
Every now and again, the copy over fails on the 4th server (it's always the 4th server, and never any of the other 3).
The messages I get are:
SMux Provider: The data is invalid.
OLE DB provider "SQLNCLI11" for linked server "<MyServer>" returned message "Communication link failure".
Session Provider: Physical connection is not usable [xFFFFFFFF].
To me, this smacks of a network failure. My concern is that it always happens on this server.
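If the cause really is a dropped connection, a common stopgap is to retry the failed call a few times before giving up. The following is only a minimal sketch under that assumption: `call_with_retry` and `is_transient` are hypothetical helpers, the marker strings are taken from the messages quoted above, and it assumes the copy procedure is safe to run again from scratch (it writes to the database that is not currently in use).

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Substrings taken from the error messages quoted above.
TRANSIENT_MARKERS = (
    "The data is invalid",
    "Communication link failure",
    "Physical connection is not usable",
)


def is_transient(message: str) -> bool:
    """Return True if the message looks like one of the connection errors."""
    return any(marker in message for marker in TRANSIENT_MARKERS)


def call_with_retry(
    action: Callable[[], T],
    attempts: int = 3,
    delay_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run action, retrying only on errors that look like connection drops."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as exc:
            # Give up on the last attempt, or on anything that is not a
            # recognised connection error (e.g. a constraint violation).
            if attempt == attempts or not is_transient(str(exc)):
                raise
            # Wait a little longer after each failed attempt.
            sleep(delay_seconds * attempt)
    raise ValueError("attempts must be at least 1")
```

This hides the symptom rather than explaining why only the 4th server is affected, so it would be worth logging each retry to see how often it fires.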
The other complication is that servers 1 & 2 (on VMware host 1) are in one NLB pool, and servers 3 & 4 (on VMware host 2) are in a second NLB pool. The system that runs the updates doesn't go through NLB, though; it only uses the machines' own network connections.
Thank you for reading this far. It's a pain that it's an intermittent problem, and a pain that I have to explain a workaround. I was hoping that there might be a quick fix!
Any thoughts? Other than - "wow, that's a ridiculously over-complicated system" :-D