September 15, 2009 at 4:58 am
Hi all,
Has anybody dealt with MASSIVE volumes of transactions going through replication over a VPN before?
Stats -
2 servers. 1 is pushing data to the other.
A 7 Mb line hosting a VPN between the two.
There are 50 tables ranging in size from 500 KB to 2 GB and from 5 records to 15,000,000 records. These tables are controlled by a third-party product that extracts the data from an old mainframe system and keeps it in sync. However, due to the nature of this product, it sometimes needs to "resync", which involves deleting ALL records and repopulating them. That procedure therefore results in millions and millions of commands being sent up to the external server over our VPN.
Additionally, we have monthly jobs that release a few million other records up to the external server via replication.
So in short, as you can imagine, this is tough on the servers. The product takes hours to "repopulate" the tables. It then takes hours for the Log Reader to process all the work (MaxCmdsInTran set to 10000). The Distribution Agent then takes forever to send the data (and often times out unless I set the QueryTimeout to something silly like 2000).
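For reference, the Log Reader Agent job step currently looks roughly like this (server and database names are placeholders I've made up; MaxCmdsInTran 10000 is the value mentioned above):

    logread -Publisher PUBSRV -PublisherDB MyDB -Distributor PUBSRV -DistributorSecurityMode 1 -MaxCmdsInTran 10000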
Has anyone had to deal with this kind of movement before? How did you handle it? How did you improve performance? Would snapshot replication be better maybe?!? I don't know.
Cheers,
Shark
September 15, 2009 at 6:33 am
More information...
This was the finishing time on the Log Reader for a test I did with one of the tables... (this was the insert stage, so it doesn't include the delete):
Transactions: 869
Commands: 8,679,188
Delivery rate: 1,067.52 cmds/sec
Latency: 0 ms (?????)
Duration: 02:40:53
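(For scale: 8,679,188 commands over 02:40:53 - about 9,650 seconds - works out to roughly 900 cmds/sec overall, so presumably the 1,067.52 cmds/sec figure is the rate while the reader was actively delivering.)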
Is that bad?
September 15, 2009 at 7:41 am
Distribution times out on
{call sp_MSadd_distribution_history(9, 3, ?, ?, 0, 0, 0.00, 0x00, 1, ?, 0, 0x00, 0x01)}
With QueryTimeout set to 3000! Madness.
September 15, 2009 at 9:58 am
Now trying to get my test distribution job not to time out, using a QueryTimeout of 3600, a CommitBatchSize of 100 and a CommitBatchThreshold of 1000. Will post results.
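For reference, the test Distribution Agent job step looks roughly like this (server, database and publication names are placeholders; the numbers are the ones above):

    distrib -Publisher PUBSRV -PublisherDB MyDB -Publication MyPublication -Distributor PUBSRV -Subscriber SUBSRV -SubscriberDB MyDB -SubscriptionType 0 -QueryTimeout 3600 -CommitBatchSize 100 -CommitBatchThreshold 1000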
At the moment it is ridiculous. I know there's the overhead of the transaction logging and so forth, but my network team tells me that transferring the equivalent of 1.1 GB of data should take two hours at most on the line speed we have.
For DR and added security I keep replication in sync with backups - maybe during these big runs I should go with simple recovery mode?!? It's set to full at the moment.
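(If it helps, this is roughly what I'm checking before I change anything - the recovery model and whether the 'sync with backup' option is on. The database name is a placeholder, and it assumes SQL 2005 or later.)

    SELECT name, recovery_model_desc,
           DATABASEPROPERTYEX(name, 'IsSyncWithBackup') AS is_sync_with_backup
    FROM sys.databases
    WHERE name = 'MyDB';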
September 15, 2009 at 11:19 am
Another question to go with the above - in what order does transactional replication run things? Because if it is timing out running sp_MSadd_distribution_history against the distribution db at the publisher....
What does that sp do? I feel it hasn't even got onto the actual distribution of the data to the subscriber yet and already it times out.
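Is there a decent way to see how much is still queued? This is a rough sketch of what I'm running in the distribution database to check (publisher/subscriber/publication names are placeholders, and it assumes SQL 2005 or later):

    -- Run in the distribution database: commands still waiting to go to the subscriber
    EXEC sp_replmonitorsubscriptionpendingcmds
        @publisher         = 'PUBSRV',
        @publisher_db      = 'MyDB',
        @publication       = 'MyPublication',
        @subscriber        = 'SUBSRV',
        @subscriber_db     = 'MyDB',
        @subscription_type = 0;  -- 0 = push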
September 16, 2009 at 1:20 am
I had a similar problem on a system years ago that I eventually traced to blocking at the subscriber.
With the distribution profiles left at the default, SQL Server groups a number of transactions at the publisher into one large transaction at the subscriber. This caused a huge amount of blocking to users at the subscriber and to replication itself, resulting in timeouts. This just compounded the problem because the timeouts rolled back the uncommitted transactions at the subscriber, forcing replication to start the same transaction all over again.
I tackled this with a combination of two things: asking the developers to reduce the size of unnecessarily large transactions at the publisher (they were just wrapping BEGIN/COMMIT TRAN around a whole process that didn't need to be done as a single transaction), and tweaking the replication profiles (lowering the settings for CommitBatchSize and CommitBatchThreshold) until I found a balance between blocking and the extra network traffic generated by having a greater number of transactions (we only had 64K lines at the time).
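If you want to check whether the same thing is happening in your case, something along these lines at the subscriber should show any blocked sessions (just a sketch, and it assumes SQL 2005 or later for the DMV):

    -- Sessions at the subscriber currently blocked by another session
    SELECT r.session_id, r.blocking_session_id, r.wait_type, r.wait_time, r.command
    FROM sys.dm_exec_requests AS r
    WHERE r.blocking_session_id <> 0;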
September 16, 2009 at 2:41 am
Well, this is ridiculous. I am testing over a 3 Mb line at present and I left the job running overnight. The QueryTimeout set at 7200 let it through - so it took two hours to do what, exactly? Is that passing all the queries up to the subscriber?
Then it has taken 13 hours so far to run 50 of the 869 transactions. That's 600,000 out of 8.9 million records...
And I still have another 10 big tables. So in reality this will take WEEKS to replicate.
I can't prevent this kind of delete/insert process as it is the way this product works and we have no other alternative products.
I have run sp_who2 on both servers involved and there is no blocking.
I am at a loss. I really am! It'd be quicker to BCP out the tables, copy the files over, and BCP them in! Surely that's not right.
Would it be better if I pulled from Server B? That way the transactions would be pulled and the Distribution Agent would run locally at the subscriber rather than across the 3 Mb line??
September 16, 2009 at 8:29 pm
Shark Energy (9/16/2009)
It'd be quicker to BCP out the tables, copy the files over, and BCP them in! Surely that's not right.
Heh... Ironic isn't it? Replication uses BCP so I'm thinking something else is wrong. Try doing the BCP thing once and see what happens. It should be very fast.
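Something like this for one of the tables would do for a quick test (server, database and table names are placeholders; native format keeps it simple):

    bcp MyDB.dbo.MyBigTable out MyBigTable.dat -n -S PUBSRV -T
    rem copy MyBigTable.dat across the VPN, then load it at the subscriber:
    bcp MyDB.dbo.MyBigTable in MyBigTable.dat -n -S SUBSRV -T -b 10000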
I've seen things like this go wrong with in-house networks. "Auto-negotiate" should be turned off and the duplex should be set to "Full", for example. If either of those is not true, in-house speeds tank and collisions go off the chart. Similar bad things can happen with VPNs. Have someone check it.
--Jeff Moden
Change is inevitable... Change for the better is not.
September 17, 2009 at 2:46 am
I can transfer a large file normally over the same line and it takes about as long as it should. Although I guess settings that cause collisions would show up much more with lots of small batches of data than with one big file!
I'll find out. Thanks for the help.