• jgenovese (1/22/2015)


    We have been experiencing issues with replication

    Our system has batch processes which process large amounts of data at various points during the day/night. The data is replicated to a set of subscribers.

    From time to time, this error shows up:

    "Replication-Replication Distribution Subsystem: agent … scheduled for retry. Query timeout expired"

    When this happens, we will get one such email per publication/subscriber database combo (approx. 65 in number)

    This never used to happen until about a month ago, which was when we added another subscriber to the set of subscribers.

    As an overview of the replication topology:

    • All sql2008r2
    • Two publisher databases, on one server
    • 26 subscriber databases, spread over 5 servers (none of which are on the publisher server) – 13 databases pertain to one publisher database, 13 databases to the other publisher database
    • For one publisher database:
    • - 3 publications, unfiltered, each pushing to its 13 subscriber databases – one publication has 194 articles, the other two have 3 articles apiece.
    • - 13 filtered publications, filtered, each pushing to one subscriber database apiece. Each of these 13 publications has the same 50 articles, with different filters specific to the subscriber database.
    • For the other publisher database, one publication, unfiltered, pushing to its 13 subscriber databases, with 16 articles
    • The distributor is on a server by itself, with one distribution database
    • Default settings all around

    Does anything stand out as a possible cause for the timeouts (apart from the large number of publications/subscribers working simultaneously)? Any suggestions for changing the topology to minimize this issue?

    Given that everything was working OK until the last subscriber was added suggests to me that you were probably close to the limit (of performance) until the last subscriber was added.

    For these large transactions (and I am assuming that the batch processes are doing their work in transactions), do they use stored procs for their work ? If you, it is work considering whether you replicate the call to the stored proc instead of the resultant insert/update/delete commands from the sproc. This would reduce the network traffic. Depending on the logic in the sproc, it may help (and it might make the situation worse...so test this first).

    You can change the query timeout values for the distribution agents to deal with this.

    And finally - check what the load is on your servers - are they under heavy load (because of replication and other activities) ? Perhaps you need to increase the capacity of your servers.