Replication - sp_msuplineageversion stuck

  • We have a publication on SQL 2012 with about 135 subscribers, and about once per week one subscriber gets stuck on SP_MSUPLINEAGEVERSION. We are forced to delete and recreate their subscription to get them synchronizing again; we haven't been able to figure out how to fix it otherwise. It seems random: sometimes it happens to subscribers that are 3+ years past their last reinitialization, other times to subscriptions that are only a few weeks old. Some subscribers are on SQL2K8, some on SQL2012, and all subscribers run SQL Express.

    We've done quite a bit of database maintenance to try to solve the problem, and we've made sure all service packs and cumulative updates are applied. Our subscribers are spread across the country and operate retail businesses, so recreating the whole publication isn't really an option and we're limited to overnight maintenance.

    It seems to get stuck while enumerating changes and updates to one of two tables, and both of those tables have a [maxversion_at_cleanup] value of over 2,000,000,000. The tables have millions of records (one has 31 million, one has 2 million), and I think the maxversion has climbed over the years.
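
    For what it's worth, this is roughly how we check that value - a quick sketch run in the publication database against the standard dbo.sysmergearticles merge metadata table (adjust names if your build differs):

    -- Per-article lineage high-water mark that merge cleanup has reached.
    SELECT name,
           pubid,
           maxversion_at_cleanup
    FROM dbo.sysmergearticles
    ORDER BY maxversion_at_cleanup DESC;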

    Is it possible that the [maxversion_at_cleanup] is causing the problem? I've seen other mentions of a similar problem around the web, but never with any resolution. How does this value get maintained? Is there a way to reset it without dropping and recreating articles?

    We don't have any subscribers with this problem currently; we have to repair them quickly once they start blocking the database, which leaves limited time to troubleshoot. I'd welcome any troubleshooting suggestions that we could try the next time one jams up.

  • I wouldn't be overly concerned about the fact that you are dealing with merge replication. It's a client to SQL Server, just like any other client, and it has no more "power" over your SQL Server than any other client process. While there is overview documentation about merge's lineages on MSDN, you may need to take a deeper dive by inspecting merge's sprocs. For example, you can USE master and EXEC sp_helptext 'SP_MSUPLINEAGEVERSION' (sketched below).
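
    Something like this (a quick sketch; the proc is a system sproc, so sp_helptext should find it from master, but run it in the publication database if your build keeps it there):

    USE master;
    GO
    -- Print the T-SQL body of the merge system procedure so you can see its loop logic.
    EXEC sp_helptext 'sp_MSuplineageversion';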

    But before diving in like that (merge's sea is deep), you should consider what you mean by "blocking the database" and "stuck". I think you are talking about blocking in SQL Server, and there is plenty of help about troubleshooting that here, on MSDN, and on many other sites. You may end up discovering your application is blocking merge, and (sometimes) the blocking happens the other way around :). You may end up discovering that the blocking (no matter who is blocking whom) is being caused by an "unexpected" physical resource constraint (RAM, IO, or CPU). For example, you may be unable to control what else these stores run on their systems. Or you may discover a logical constraint (an occasional need for a new index). For such concerns, merge and its SP_MSUPLINEAGEVERSION might only be the messenger, not the message :).

    Because time is of the essence, you may need to be patient. You may end up acting like blind men feeling different parts of an elephant, each describing a different beast. You may end up discovering you are donning a pair of peril-sensitive sunglasses, when instead you should have told your patient "This may cause a little pain." You may end up trying to determine the cause of an accident by only seeing the result. With all that in mind :), you might want to start with sp_lock (or the newer locking and blocking DMVs; a sketch follows).
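
    For example, a rough starting point the next time one jams up (sp_lock still works, but the DMVs give you the blocking chain, the statement text, and the lock types in one pass):

    -- Who is blocked, who is the head blocker, and what statement each is running.
    SELECT r.session_id,
           r.blocking_session_id,
           r.wait_type,
           r.wait_time,
           r.command,
           t.text AS current_statement
    FROM sys.dm_exec_requests AS r
    CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
    WHERE r.blocking_session_id <> 0
       OR r.session_id IN (SELECT blocking_session_id
                           FROM sys.dm_exec_requests
                           WHERE blocking_session_id <> 0);

    -- Lock requests that are currently waiting, with the lock type and the resource.
    SELECT request_session_id,
           resource_type,
           resource_associated_entity_id,
           request_mode,
           request_status
    FROM sys.dm_tran_locks
    WHERE request_status = 'WAIT';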

  • Hi,

    Thanks for the response. I've confirmed previously that the merge agent is locking tables that have a high maxversion_at_cleanup - one of the two tables that I mentioned - and I've tried letting the blocking process run for 24+ hours before killing it to free the lock. It's always the merge agent holding the lock that is blocking other processes, and at that point the blocking merge agent's last message is that it's waiting for a response to the query SP_MSUPLINEAGEVERSION(?,?,?). Other resources I've found point to a loop in that sproc that is somehow tied to maxversion_at_cleanup (such as this old thread on MSDN - https://social.msdn.microsoft.com/Forums/sqlserver/en-US/86c50afd-48db-46a0-b94c-b804453c992a/maxversionatcleanup?forum=sqlreplication). Hence my more specific question.

    Any ideas about the maxversion_at_cleanup column specifically and how it's connected to SP_MSUPLINEAGEVERSION? Why might an article have a value over 2 billion? How does this column behave, and how is it maintained in merge replication? It's an int that is almost at its maximum value; does it ever reset on its own, or is there maintenance that can cause a reset without dropping and re-adding the article?

    One thing I didn't mention in my OP: these two articles are joined (join filters) to parent articles on unique join keys. The parent and child articles both have partition_options = 1, i.e. overlapping partitions with out-of-partition changes disallowed. A query to read those settings back is sketched below.
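
    For reference, this is roughly how those settings can be read from the merge metadata - a sketch run in the publication database, assuming the standard dbo.sysmergearticles and dbo.sysmergesubsetfilters tables (double-check the column list against your build):

    -- partition_options per article (values as documented for sp_addmergearticle:
    -- 0 = overlapping, 1 = overlapping with out-of-partition changes disallowed).
    SELECT name, partition_options, maxversion_at_cleanup
    FROM dbo.sysmergearticles;

    -- Join filters between the parent and child articles, including the unique-key flag.
    SELECT filtername, join_articlename, join_filterclause, join_unique_key
    FROM dbo.sysmergesubsetfilters;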

  • maxversion_at_cleanup refers to the maximum lineage that has been cleaned up by merge. The higher its value, the more that has been changed. I think a high maxversion_at_cleanup value is a symptom of a problem. What are the maxversion_at_cleanup values, both before and after a sync?

    What user DML (the blockee) is sp_MSuplineageversion found to be blocking the most, and with what lock type? Keep in mind that merge's triggers are fired by user DML on user tables. Why does the blockee need that lock (what is the blockee doing)? It depends on the identified lock-type clash, but the blockee's DML more than likely involves the @tablenick being passed to the head blocker's sp_MSuplineageversion.

    More than likely, the blockee is incrementing merge's lineages (for that @tablenick) with an extremely high frequency. Why does the blockee need to create new lineages so frequently? If you can reduce that frequency, you will reduce merge's need to update maxversion_at_cleanup. For example, the blockee could be a user trigger that is not marked NOT FOR REPLICATION (a check for that is sketched below), or it could be user DML code that is stuck in a loop (with merge's DML, or just with other user DML). Look at execution rates. If the incompatible lock instead involves a table lock, I would consider when the statistics on sysmergearticles were last updated.
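
    If it helps, here is a small sketch of that trigger check - run in the published database, it simply lists user triggers that will also fire for the merge agent's DML because they are not marked NOT FOR REPLICATION:

    -- User table triggers that are NOT marked NOT FOR REPLICATION.
    -- These fire for the merge agent's own DML as well as for user DML.
    SELECT OBJECT_NAME(t.parent_id) AS table_name,
           t.name                   AS trigger_name,
           t.is_not_for_replication
    FROM sys.triggers AS t
    WHERE t.parent_class = 1                 -- table (DML) triggers only
      AND t.name NOT LIKE 'MSmerge%'         -- skip merge's own tracking triggers
      AND t.is_not_for_replication = 0;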

  • I think you're right about maxversion_at_cleanup being a symptom of a bigger problem. We dropped the smaller article and re-published it over the weekend, and its maxversion_at_cleanup reset back to "1". Now when subscribers synchronize they don't seem to increment that value, even when changes to that article are being replicated. If it does change, it seems to reset back to "1", since that's the only value we've seen since yesterday.

    I've also changed the partition_options to "0" instead of "1" - it used to disallow out-of-partition changes. I don't know if that affects how this value is handled.

    I still don't understand what causes this value to change or in what scenarios it may climb exponentially. I don't believe our problem is hardware related at this point; CPU, disk IO, etc. would all be symptoms of what I think is really wrong - a configuration problem with replication.

    Our replication system has been running for years without this problem occurring, so I suppose it's possible we've encountered some state with some of our articles that requires them to be dropped and re-added in order to clean it up.

    Thank you for your thoughts. If anyone has any insights into how we got into the situation of a really high maxversion_at_cleanup, or how it can be cleaned up without dropping and re-adding the article, please let me know. Otherwise we may look to schedule a complete reset of the publication on a closed holiday.

  • We've experienced blocking in merge replication too. You need to be careful not to set the retention period too high. Depending on the amount of changes coming through, the size of the metadata tables can grow very fast, which in itself may cause your agents to slow down and even time out.

    Apart from that, we had a situation where we experienced blocking between the agents that we could not fix even by reducing the retention period to only one or two days. Ultimately the cause was found to be in our application's DDL. But before we knew that, we found a workaround that alleviated the problem a lot, and I think it could be a solution for you as well: with the default merge agent settings, cleaning up merge history metadata is a task that is semi-randomly assigned to one of the merge agents (a semaphore ensures that only one agent at a time runs the cleanup in each database). The agent that gets selected will block many other processes, including other agents, while it cleans.

    The solution we found was to create a new agent profile with MetadataRetentionCleanup set to 0 instead of the default 1, and to assign this new profile to all agents (a sketch of creating such a profile follows below). This means that none of the agents will be assigned the task of cleaning up MSmerge_contents, MSmerge_genhistory, MSmerge_tombstone and the like, so none of the agents will block on that anymore.

    You still need the data outside the retention period removed from your databases, though, to avoid your agents timing out because too much history data is sitting in those tables. So we created a SQL Agent job on the publisher and on all subscribers that periodically calls sys.sp_mergemetadataretentioncleanup. If I understand your situation correctly, you have a period in which most agents have to replicate many changes, and other time frames in which hardly any data is coming through. Now that the cleanup is a job under your control, you can schedule it to run in those low-business hours: your merge agents will be less likely to get into conflict with each other and the cleanup process, and they will run more smoothly.
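
    In case it saves someone some digging, creating the profile is roughly this - a sketch run in msdb on the distributor; verify the profile and parameter name spelling against your own msdb.dbo.MSagent_profiles / MSagent_parameters rows before relying on it:

    -- Create a custom merge agent profile (@agent_type = 4) and add the
    -- MetadataRetentionCleanup parameter with a value of 0.
    DECLARE @profile_id int;

    EXEC sp_add_agent_profile
         @profile_id   = @profile_id OUTPUT,
         @profile_name = N'Merge - no metadata cleanup',
         @agent_type   = 4,   -- 4 = Merge Agent
         @description  = N'Default settings, but the agent never runs metadata cleanup';

    EXEC sp_add_agent_parameter
         @profile_id      = @profile_id,
         @parameter_name  = N'-MetadataRetentionCleanup',  -- check the exact format in MSagent_parameters
         @parameter_value = N'0';

    -- Afterwards, assign the new profile to the merge agents (via Replication Monitor
    -- or sp_update_agent_profile per agent).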

    Here's the code I put into the scheduled jobs on all servers:

    set nocount on;

    declare @num_genhistory_rows int;
    declare @num_contents_rows int;
    declare @num_tombstone_rows int;
    declare @result int;

    again:

    begin try
        -- Purge merge metadata that has aged past the retention period and
        -- capture how many rows were removed from each metadata table.
        exec @result = sys.sp_mergemetadataretentioncleanup
             @num_genhistory_rows = @num_genhistory_rows OUTPUT,
             @num_contents_rows   = @num_contents_rows   OUTPUT,
             @num_tombstone_rows  = @num_tombstone_rows  OUTPUT;

        -- Report the outcome into the job history.
        select
            @result              as result,
            @@trancount          as trancount,
            @num_genhistory_rows as num_genhistory_rows,
            @num_contents_rows   as num_contents_rows,
            @num_tombstone_rows  as num_tombstone_rows;
    end try
    begin catch
        -- Roll back any transaction left open by the cleanup proc.
        while xact_state() <> 0
            rollback tran;

        -- On a deadlock (error 1205) simply retry; anything else is re-raised.
        if error_number() = 1205
            goto again;

        throw;
    end catch




  • Hi R.P. - that's a really good suggestion; it was one of the items we originally thought might be causing our problem. A couple of years ago we had to put similar processes in place, and I'd also recommend them to anyone maintaining a long-term merge replication system. We created a process that checks the oldest successfully synced subscriber and automatically sets the retention period to 48 hours before that point, so we always have a 48-hour window of retained metadata. Usually we keep 48 hours, but if a subscriber stops synchronizing or fails for any reason, the system auto-adjusts retention back one day at a time, up to a maximum of 14 days. Then every night we run a tune-up on the publisher: we run sp_mergemetadataretentioncleanup manually and rebuild the indexes on a half dozen of the most used merge system tables (like contents, tombstone, current_partition_mappings, etc.) - roughly along the lines of the sketch below. Our publisher's MSmerge_contents only has about 100,000-200,000 records in it, and at most we've had about 2 million during periods of high retention, which hasn't caused any problems. Once a week we do the same cleanup and index rebuilds on all subscribers.
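
    For anyone who wants to copy the idea, the nightly publisher tune-up is essentially this (a simplified sketch; the index rebuilds are plain ALTER INDEX statements on the merge system tables we see the most churn in - extend the list to whatever tables are hot in your system):

    -- Nightly tune-up on the publisher (the same thing runs weekly on subscribers).
    DECLARE @genhistory int, @contents int, @tombstone int;

    -- 1) Purge merge metadata that is older than the publication's retention period.
    EXEC sys.sp_mergemetadataretentioncleanup
         @num_genhistory_rows = @genhistory OUTPUT,
         @num_contents_rows   = @contents   OUTPUT,
         @num_tombstone_rows  = @tombstone  OUTPUT;

    -- 2) Rebuild indexes on the busiest merge system tables.
    ALTER INDEX ALL ON dbo.MSmerge_contents REBUILD;
    ALTER INDEX ALL ON dbo.MSmerge_tombstone REBUILD;
    ALTER INDEX ALL ON dbo.MSmerge_genhistory REBUILD;
    ALTER INDEX ALL ON dbo.MSmerge_current_partition_mappings REBUILD;
    ALTER INDEX ALL ON dbo.MSmerge_past_partition_mappings REBUILD;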

    It keeps our replication lean and mean, which is why this is so troubling for us: most synchronizations deal with about 100 records every 10-15 minutes per subscriber, and the agent only takes about 20 seconds to complete start-to-finish.

    Under normal circumstances everything runs great, until we get the odd subscriber that gets stuck on that uplineageversion sproc.

  • We faced the same situation you did and we opened a Microsoft case. After weeks of investigation they said that it is a known issue (known only to Microsoft, not public) where you can encounter fake conflicts if the same article is replicated in both filtered and unfiltered publications.

    So if you have multiple publications, check whether you have articles which are filtered in one publication and are not filtered, or are filtered differently, in another publication (a query for that is sketched below). If there is a huge number of conflicts, it can impact the lineage. The sp_msuplineageversion stored procedure has a specific purpose: when a conflict occurs between publisher and subscriber, if the loser of the conflict has a smaller row version, the procedure enters a WHILE loop and increments the smaller row version of the winner to match the row version of the loser.
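
    A rough way to spot that situation, assuming the standard merge metadata tables in the publication database (sysmergearticles and sysmergepublications) - treat it as a starting point, not gospel:

    -- Articles that appear in more than one publication, with each publication's
    -- row filter, so filtered vs. unfiltered (or differently filtered) copies stand out.
    SELECT a.name AS article,
           p.name AS publication,
           a.subset_filterclause
    FROM dbo.sysmergearticles     AS a
    JOIN dbo.sysmergepublications AS p ON p.pubid = a.pubid
    WHERE a.name IN (SELECT name
                     FROM dbo.sysmergearticles
                     GROUP BY name
                     HAVING COUNT(DISTINCT pubid) > 1)
    ORDER BY a.name, p.name;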

    Hopefully this will help you.
