September 15, 2009 at 4:58 am
Hi all,
Has anybody dealt with MASSIVE volumes of transactions going through replication over a VPN before?
Stats -
2 servers. 1 is pushing data to the other.
A 7 Mb line hosting a VPN between the two.
There are 50 tables ranging in size from 500 KB to 2 GB and from 5 records to 15,000,000 records. These tables are controlled by a third-party product that extracts the data from an old mainframe system and keeps it in sync. However, due to the nature of this product, it sometimes needs to "resync", which involves deleting ALL records and repopulating them. That procedure therefore results in millions and millions of commands being sent up to the external server over our VPN.
Additionally, we have monthly jobs that release a few million other records up to the external server via replication.
So in short, as you can imagine, this is tough on the servers. The product takes hours to "repopulate" the tables. It then takes hours for the Log Reader to process all the work (MaxCmdsInTran set to 10000). The Distribution Agent then takes forever to send the data (and often times out unless I set the QueryTimeout to something silly like 2000).
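For reference, the Log Reader Agent job step currently looks roughly like this (server and database names are placeholders I've made up; MaxCmdsInTran 10000 is the value mentioned above):

    logread -Publisher PUBSRV -PublisherDB MyDB -Distributor PUBSRV -DistributorSecurityMode 1 -MaxCmdsInTran 10000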
Has anyone had to deal with this kind of movement before? How did you handle it? How did you improve performance? Would snapshot replication be better maybe?!? I don't know.
Cheers,
Shark
September 15, 2009 at 6:33 am
More information...
This was the finishing time on the Log Reader for a test I did with one of the tables... (this was the insert stage, so it doesn't include the delete):
Transactions: 869
Commands: 8,679,188
Delivery rate: 1,067.52 cmds/sec
Latency: 0 ms (?????)
Duration: 02:40:53
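(For scale: 8,679,188 commands over 02:40:53 - about 9,650 seconds - works out to roughly 900 cmds/sec overall, so presumably the 1,067.52 cmds/sec figure is the rate while the reader was actively delivering.)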
Is that bad?
September 15, 2009 at 7:41 am
Distribution times out on
{call sp_MSadd_distribution_history(9, 3, ?, ?, 0, 0, 0.00, 0x00, 1, ?, 0, 0x00, 0x01)}
With QueryTimeout set to 3000! Madness.
September 15, 2009 at 9:58 am
Now trying to get my test distribution job not to time out, using a QueryTimeout of 3600, a CommitBatchSize of 100 and a CommitBatchThreshold of 1000. Will post results.
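For reference, the test Distribution Agent job step looks roughly like this (server, database and publication names are placeholders; the numbers are the ones above):

    distrib -Publisher PUBSRV -PublisherDB MyDB -Publication MyPublication -Distributor PUBSRV -Subscriber SUBSRV -SubscriberDB MyDB -SubscriptionType 0 -QueryTimeout 3600 -CommitBatchSize 100 -CommitBatchThreshold 1000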
At the moment it is ridiculous. I know there's the overhead of the transaction logging and so forth, but my network team tells me that transferring the equivalent of 1.1 GB of data should take two hours at most on the line speed we have.
For DR and added security I keep replication in sync with backups - maybe during these big runs I should go with simple recovery mode?!? It's set to full at the moment.
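(If it helps, this is roughly what I'm checking before I change anything - the recovery model and whether the 'sync with backup' option is on. The database name is a placeholder, and it assumes SQL 2005 or later.)

    SELECT name, recovery_model_desc,
           DATABASEPROPERTYEX(name, 'IsSyncWithBackup') AS is_sync_with_backup
    FROM sys.databases
    WHERE name = 'MyDB';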
September 15, 2009 at 11:19 am
Another question to go with the above - in what order does transactional replication run things? Because if it is timing out running sp_MSadd_distribution_history against the distribution db at the publisher....
What does that sp do? I feel it hasn't even got onto the actual distribution of the data to the subscriber yet and already it times out.
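Is there a decent way to see how much is still queued? This is a rough sketch of what I'm running in the distribution database to check (publisher/subscriber/publication names are placeholders, and it assumes SQL 2005 or later):

    -- Run in the distribution database: commands still waiting to go to the subscriber
    EXEC sp_replmonitorsubscriptionpendingcmds
        @publisher         = 'PUBSRV',
        @publisher_db      = 'MyDB',
        @publication       = 'MyPublication',
        @subscriber        = 'SUBSRV',
        @subscriber_db     = 'MyDB',
        @subscription_type = 0;  -- 0 = push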
September 16, 2009 at 1:20 am
I had a similar problem on a system years ago that I eventually traced to blocking at the subscriber.
With the distribution profiles left at the default, SQL Server groups a number of transactions at the publisher into one large transaction at the subscriber. This caused a huge amount of blocking to users at the subscriber and to replication itself, resulting in timeouts. This just compounded the problem because the timeouts rolled back the uncommitted transactions at the subscriber, forcing replication to start the same transaction all over again.
I tackled this with a combination of two things: asking the developers to reduce the size of unnecessarily large transactions at the publisher (they were just wrapping BEGIN/COMMIT TRAN around a whole process that didn't need to be done as a single transaction), and tweaking the replication profiles (lowering the settings for CommitBatchSize and CommitBatchThreshold) until I found a balance between blocking and the extra network traffic generated by having a greater number of transactions (we only had 64K lines at the time).
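If you want to check whether the same thing is happening in your case, something along these lines at the subscriber should show any blocked sessions (just a sketch, and it assumes SQL 2005 or later for the DMV):

    -- Sessions at the subscriber currently blocked by another session
    SELECT r.session_id, r.blocking_session_id, r.wait_type, r.wait_time, r.command
    FROM sys.dm_exec_requests AS r
    WHERE r.blocking_session_id <> 0;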
September 16, 2009 at 2:41 am
Well, this is ridiculous. I am testing over a 3 Mb line at present and I left the job running overnight. The QueryTimeout set at 7200 let it through - so it took two hours to do what, exactly? Is that passing all the queries up to the subscriber?
Then it has taken 13 hours so far to run 50 of the 869 transactions. That's 600,000 out of 8.9 million records...
And I still have another 10 big tables. So in reality this will take WEEKS to replicate.
I can't prevent this kind of delete/insert process as it is the way this product works and we have no other alternative products.
I have run sp_who2 on both servers involved and there is no blocking.
I am at a loss. I really am! It'd be quicker to BCP out the tables, copy the files over, and BCP them in! Surely that's not right.
Would it be better if I pulled from Server B? That way the transactions would be pulled and the Distribution Agent would run locally at the subscriber rather than across the 3 Mb line??
September 16, 2009 at 8:29 pm
Shark Energy (9/16/2009)
It'd be quicker to BCP out the tables, copy the files over, and BCP them in! Surely that's not right.
Heh... Ironic isn't it? Replication uses BCP so I'm thinking something else is wrong. Try doing the BCP thing once and see what happens. It should be very fast.
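Something like this for one of the tables would do for a quick test (server, database and table names are placeholders; native format keeps it simple):

    bcp MyDB.dbo.MyBigTable out MyBigTable.dat -n -S PUBSRV -T
    rem copy MyBigTable.dat across the VPN, then load it at the subscriber:
    bcp MyDB.dbo.MyBigTable in MyBigTable.dat -n -S SUBSRV -T -b 10000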
I've seen things like this go wrong with in-house networks. "Auto-negotiate" should be turned off and the duplex should be set to "Full", for example. If either of those is not true, in-house speeds tank and collisions go off the chart. Similar bad things can happen with VPNs. Have someone check it.
--Jeff Moden
Change is inevitable... Change for the better is not.
September 17, 2009 at 2:46 am
I can transfer a large file normally over the same line and it takes about as long as it should. Although I guess settings that cause collisions would show up much more with lots of small batches of data than with one big file!
I'll find out. Thanks for the help.