SP_MSIns_ "stuck" on inserting into a table.

  • Hello All,

    I'm having trouble with transactional replication I'm hoping someone can help with.

    I have Server A (publisher), Distro (Distribution server), and Server B (Subscriber). All servers are 2016. B is our report server. We have a particular table I'll just call 'Sales' that might have 300 records replicated to it in a day. We replicate roughly a 100 tables including Sales.

    About 2-4 days a week, we see the SP_MSIns_dboSales procedure get 'stuck' and block everything else. I put 'stuck' in quotes because according to sp_Whoisactive, it has no wait type and isn't being blocked by anything else. It will stay in this state for 30-120 minutes and then periodically show a IO_Completion waittype for another 30 minutes. Replication then moves on to updating the rest of the tables as normal.

    Server B is a VM with Server A on the same datastores. A has no issues during this time and doesn't appear to be causing the issue. IO on our subsystems seem to be fine, but see the next line.

    When this finally clears up, SP_MSIns_dboSales shows in QueryStore as being the highest eater of physical reads. Like 100+GB in roughly 2 hours with maybe a total of 150 row inserts. The Sales table has about 30 rows, with nothing bigger than a varchar(60).

    The Sales table has 14 nonclustered indexes and 1 columnstore index. I know thats a lot but none have been added in the last 6 months, but this issue started roughly 3 weeks ago. When this isn't happening, rows are still being inserted into Sales without issue which is also baffling.

    Does anyone know why replication on only one table would hang like this? Why would a relatively small table with just a few hundred inserts in 2 hours cause 100+GB of reads? Is there a way to see what its reading? The query plan for this procedure is simply inserting into the clustered index and columnstore index. Why does it report no wait types or blocking for hours?

    Thank you all for any help.

     

  • Thanks for posting your issue and hopefully someone will answer soon.

    This is an automated bump to increase visibility of your question.

  • I wanted to follow up with this in case someone else runs into this issue.

    The 'fix' if you can call it that ended up being dropping the columnstore index on the Sales table. Immediately it cleared up the replication hangup and it hasn't happened again in over a week.

    We're also seeing issues with BPSORT waits now with our other columnstore indexes and ended up turning on Trace9358 to avoid that operator completely. While some of our reports now suffer, some of the extracts that were stuck on BPSORT are completing within an adequate amount of time.

    I plan on opening a ticket with Microsoft. It seems like this started when we installed CU10 from CU6 in early December 2019. I don't have any concrete evidence, but thats the only thing thats changed in our environment. We've had these indexes for years with no issues.

Viewing 3 posts - 1 through 2 (of 2 total)

You must be logged in to reply to this topic. Login to reply