Slow merge replication over a WAN link - only downloads

  • Hi all,

    We've been using SQL Server merge replication for a few years to synchronise data between our data centres, but we are now suffering with a big performance issue. This may be because the amount of data we are synchronising has increased a lot this year.

    Our publisher is an always-on data centre in the UK. Our subscriber is a mobile data centre that travels around the world and is on for periods of up to a week at a time, approx. 25 times a year. However, it also spends the same amount of time (if not more) switched off whilst on its travels - it is a well travelled data centre!

    We have 5 database that we synchronise on these servers. However, one of our databases has high numbers of data changes between periods of subscriber downtime and our issue is that it take days to catch up when the server is powered up - the other databases are fine.

    Downloads from publisher to subscriber run at about 1.5 rows a second (which is annoying when we have hundreds of thousands of rows) but strangely uploads from subscriber to publisher run about ten times faster.

    Things I have checked / tried:

    • all tables have non-clustered primary keys on guid columns that have the rowguid property set

    • changing the generation levelling threshold doesn't help

    • setting the agent profile to high volume doesn't help

    • running a trace at the publisher and subscriber shows the queries are all running very fast (less than 20 m/s generally, but there are gaps of 200 m/s or so between some batches of queries)

    • analysis on our WAN link shows we have huge amounts of bandwidth spare

    • analysis on our servers show we have huge amounts of Ram and CPU spare

    Some of the places the subscriber is at do suffer from high latency but this doesn't seem to have an impact - 300 m/s or 100m/s and we still get the same poor performance.

    One things I did wonder about - does the replication confirm to the publisher every time it has successfully processed a row at the subscriber? If we have thousands of rows and there is a latency on the line will this compound the issue if it confirms each item? If this does happen, is there a way to batch up messages between publisher and subscriber?

    Any help that you can offer will be gladly received!

    Thanks

    Mark

  • Let me guess: Formula1 team?

    In that case, sorry, but I can't help you: I work for your competitors 🙂

    -- Gianluca Sartori

  • Might be 😉

    No matter what - us IT folk should stick together 😀

  • Heh... My employer wouldn't be very happy to know I'm helping their competitors.

    Anyway, the link speed in Australia is showing higher latency than expected (over 500 ms). We had the same kind of issues last year. Malaysia was even worse, so be prepared to suffer there as well.

    I assume you have the same connectivity provider we have (Riedel). We contacted them today and they confirmed that they are having instability issues with their international network. Maybe in Lotus you have different providers: make sure you check your latency.

    -- Gianluca Sartori

  • I might add that setting up QoS helped us a lot.

    -- Gianluca Sartori

  • I'm sure you helping with our replication issues is not going to affect track performance too much so I think they'll be ok 🙂

    Our WAN guys have looked into that but I'll speak to them again.

    Thanks

  • Hi all,

    We got to the bottom of this in the end.

    It was our use of nvarchar(max) columns that was stopping the replication from using batches. What used to take 3 hours now takes 50 seconds just by changing the data type.

    Here is the lesson learnt: "nvarchar(max) is a replication killer"

    Thanks

  • Thanks for the feedback!

    -- Gianluca Sartori

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply