CPU Intensive Enumerating Changes

  • We've recently moved our replication system to a new server, and spent some time creating new indexes to help performance. At some point, we've done the opposite. Each subscriber is taking hours to synchronize after changes are made at the publisher (130 subscribers). Hardware on the new server benchmarks very well (2x8core CPU, 88GB RAM, 6xSSD RAID10, 2xSSD RAID for replication index data files), and significantly outperforms the old server. So I'm not thinking the new hardware is the issue, and am leaning toward an indexing problem.

    We analyzed every replication filter and join, and created covering indexes for each table article filter, and stored the replication indexes in a separate ndf on a separate 2xSSD RAID1. When I execute the replication filters manually, the query plans show significant improvement. But recently we've made small amounts of over night changes at the publisher (something we've always done regularly), and the next subscriber sync takes an extremely long time. It seems to take forever when enumerating and downloading the changes (100 changes per batch). Each subscriber only has about 1000 changes to download, and this used to happen in a minute or two instead of an hour.

    The biggest difference I've noticed is that it's extremely CPU intensive, each connection to the publisher seems to pin the CPU for the entire hour. I'm not sure what this might indicate, is it possible it's doing table scans? I'm not sure exactly what replication does at the publisher when it's enumerating the changes, and at the stage of downloading the changes to the subscriber? I've tried using profiler, but I don't see any one task that takes this long to run, and the disk I/O is minimal the entire time. The longest running entries in profiler are calls to "exec sys.sp_MSenumchanges_belongtopartition..." for our largest table article join, but I'm not sure how to dig into that further?

    It reminds me of a poorly indexed query that the server is scanning, but is there some way for me to confirm this? Is there any other performance measurements I can do to see what exactly is taking so long and using so much CPU? Any other info I can post to help?

    EDIT: I should mention that the merge agents seem to be taking forever on my join filters. I have a central 'Client' article that is filtered to send certain clients to certain subscribers. There are about a dozen table articles joined to each 'Client' record, and it seems to jam up when calculating changes to the joined tables. The joins use 2 columns each (a FK reference and a 'Replicate' bit). Each join predicate is indexed individually on each of the child tables, and there is a covering index on both columns together with all other columns included.

    EDIT: Another thought, statistics are set to update automatically, and are also updated manually every night with an index rebuild maintenance plan.

    EDIT: Another important point I skipped over is that in moving servers we upgraded from SQL2K8 R2 to SQL2K12.

  • I think I may be getting somewhere, anyone else have any suggestions on what to look at next? If I execute:

    exec sys.sp_MSenumchanges_belongtopartition

    38,

    100,

    '1510638,1510652,1510653,1510654,1510655,1510656',

    89259003,

    '00000000-0000-0000-0000-000000000000',

    '8236EE55-51EF-4E2E-A0A6-B2D26781C214',

    1510638,

    1510656,

    1,

    0,

    'F6DE0C2C-3305-4F24-B272-7C9B84705A03'

    It takes 10 minutes to run.

    But, as far as I can tell, that SP pretty much just builds and executes a dynamic SQL statement. I grabbed the SQL statement out of the profiler, and if I execute it manually with the same values, it executes in <1s.

    declare @mingen int, @maxgen int;

    declare @tablenick int, @publication_number int;

    declare @partition_id int, @rowguid uniqueidentifier, @pubid uniqueidentifier;

    select @partition_id = 38,

    @tablenick = 89259003,

    @rowguid = '00000000-0000-0000-0000-000000000000',

    @pubid = '8236EE55-51EF-4E2E-A0A6-B2D26781C214',

    @mingen = 1510638,

    @maxgen = 1510656,

    @publication_number = 1;

    select rows.tablenick, rows.rowguid, rows.generation, rows.lineage, rows.colv, t.[TanSessionGUID], t.[StudioGUID], t.[ClientGUID], t.[BedGUID], t.[EmployeeGUID], t.[CreatedDate], t.[WaitingDate], t.[WarmupDate], t.[StartedDate], t.[FinishedDate], t.[ProcessedDate], t.[CleanedDate], t.[StartDelay], t.[IsStoppedEarly], t.[IsRestored], t.[IsAutoCompleted], t.[IsVirtual], t.[IsMinutes], t.[IsVisits], t.[PrimaryPackageType], t.[VisitsUsed], t.[Replicate], t.[IsGuardianPresent], t.[IsWarmupCycle], t.[IsCleanCycle], t.[IsCancelled], t.[IsGlobalClient], t.[ParentalAccompanimentName], t.[IsOpportunity], t.[IsNewProspect], t.[IsNewSprayProspect]

    from (

    select distinct top 100 mc.tablenick as tablenick, mc.rowguid as rowguid, mc.generation as generation, mc2.lineage as lineage, mc2.colv1 as colv

    from [FabWareHouse].[dbo].[MSmerge_contents] mc

    inner join [FabWareHouse].[dbo].[MSmerge_current_partition_mappings] cpm

    on cpm.tablenick = mc.tablenick

    and cpm.rowguid = mc.rowguid

    and cpm.tablenick = @tablenick

    and mc.tablenick = @tablenick

    inner join [FabWareHouse].[dbo].[MSmerge_contents] mc2

    on mc2.tablenick = mc.tablenick

    and mc2.rowguid = mc.rowguid

    inner join (

    select 1510638 as gen union all

    select 1510652 as gen union all

    select 1510653 as gen union all

    select 1510654 as gen union all

    select 1510655 as gen union all

    select 1510656 as gen

    ) as genlist

    on mc.generation = genlist.gen

    and genlist.gen is not NULL

    where mc.generation >= @mingen and mc.generation <= @maxgen and

    mc.tablenick = @tablenick

    and cpm.tablenick = @tablenick

    and

    (

    (cpm.partition_id = @partition_id and cpm.publication_number = @publication_number) or

    (cpm.partition_id = @partition_id and cpm.publication_number = 0) or

    (cpm.partition_id = -1 and cpm.publication_number = @publication_number) or

    (cpm.partition_id = -1 and cpm.publication_number = 0)

    )

    order by mc.tablenick, mc.rowguid

    ) as rows

    inner join [FabWareHouse].[dbo].[MSmerge_repl_view_8236EE5551EF4E2EA0A6B2D26781C214_359376F00519403CB8B48E9E47C3EC8D] t on t.rowguidcol = rows.rowguid

    order by t.rowguidcol

    Any ideas why the SP may take so long to run the same query?

  • Attached is the profiler trace for when I executed sp_MSenumchanges_belongtopartition manually, showing it taking 8 minutes. Wait time on the process is zero, task state is running the whole time.

    I'm not sure if it's related, but I've noticed another case of SQL using a bad query plan. I have a LogEvent table that's indexed on the [Replicate] bit (no included columns on the index). And if I type:

    SELECT MIN([CreatedDate]) FROM [LogEvent] WHERE [Replicate]=1

    then SQL takes several minutes to do a full table scan instead of using IX_LogEvent_Replicate.

    And if I type

    SELECT MIN([CreatedDate]) FROM [LogEvent] WITH(FORCESEEK) WHERE [Replicate]=1

    then SQL takes a few ms and returns the min value immediately using IX_LogEvent_Replicate. All of the statistics on the table have been updated, no effect, but I've never tried creating manual statistics.

    I did a report on index usage, and most of the tables seem to be full of seeks except the biggest join filter table that seems to be hanging up replication. It's showing 3473 "User Seeks" in the last 3 hours, and 918 "User Scans" on the clustered rowguid index. It's a 20 million record table, so I think the user scans are my problem. It seems to be the only table that's full of user scans, the rest of zero for the most part, or less than 10.

    I think I'm getting close to the problem, but I honestly can't see why it would be table scanning when it's a heavily indexed table. None of my other table articles, even ones with the exact same join filter logic and index structures, are doing table scans. Index usage is normal, almost entirely seeks.

  • I was going to look into a 'Plan Guide' to see if I could force the replication stored procedure to use the same clustered index seek that the query uses when I manually execute it, but I think I found another solution.

    I created a second nonclustered index on the primary key column for the table article. Normally I think this would be redundant and a little silly, but now the call to sp_MSenumchanges_belongtopartition returns a batch of 100 rows in a few ms instead of 10 minutes. No idea why this fixed it, but it seems like SQL refused to use the PK index to seek and was using table scans to locate a PK. Now when I execute the SP manually, it executes super fast and the query plan shows it joining the replication metadata contents to my new nonclustered index [IX_Session_SessionGUID] on the rowguid=SessionGUID seek and is joining the results to the clustered index [PK_Session] seek on SessionGUID=SessionGUID. Not optimal, no idea why I had to do this, but has solved our problem for now.

    If anyone reads through this thread, do you have any ideas on why this might occur? Could it be index corruption, something I could look up somewhere?

  • Hi Kevin,

    I can't help you with why the query is slow when run by the merge agent, but I think I do know what makes it run fast when run manually: the view MSmerge_repl_view_63666C4AA85D46329603479A5E32376A_D633BCAFDDAC4F348FE5BA277E7BAA48 (yours has a different string of characters) internally checks for permissions of the current user. Or in other words, though you're executing the same T-SQL, you're not executing the same query. Here's the implementation for my replicated table:

    create view dbo.MSmerge_repl_view_63666C4AA85D46329603479A5E32376A_D633BCAFDDAC4F348FE5BA277E7BAA48

    as

    select *

    from [dbo].[FileEventLogs]

    where ({fn ISPALUSER('63666C4A-A85D-4632-9603-479A5E32376A')} = 1

    or (HAS_PERMS_BY_NAME('[dbo].[FileEventLogs]', 'OBJECT', 'SELECT')

    | HAS_PERMS_BY_NAME('[dbo].[FileEventLogs]', 'OBJECT', 'UPDATE')

    | HAS_PERMS_BY_NAME('[dbo].[FileEventLogs]', 'OBJECT', 'INSERT')

    | HAS_PERMS_BY_NAME('[dbo].[FileEventLogs]', 'OBJECT', 'DELETE'))=1)

    So you need to be sure you've got exactly the same user that the agent is using when trying to reproduce the issue. The ISPALUSER is another unpredictable factor in here. I don't know what exactly it does. I've not found many references to it on the net, but here's one that does give a little bit of info: http://stackoverflow.com/questions/1232570/what-does-the-ispaluser-function-call-in-the-msmerge-view-views-do. What I do know is that when a large number of entries exists in msMerge_contents this function can become incredibly slow. Up to a point where it slows down a merge agent to a point where it times out constantly. This is how I encountered it.



    Posting Data Etiquette - Jeff Moden[/url]
    Posting Performance Based Questions - Gail Shaw[/url]
    Hidden RBAR - Jeff Moden[/url]
    Cross Tabs and Pivots - Jeff Moden[/url]
    Catch-all queries - Gail Shaw[/url]


    If you don't have time to do it right, when will you have time to do it over?

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply