Please help: Source of massive new tempdb IO (bytes & stall)

  • Hello,

    On 7/18 my company enabled "pull" transactional replication for a large production database, in part because we had a reporting database on the same server - using cross-db queries and huge temp tables to pull data from large production tables - which was just killing us on tempdb IO.

    The production/publisher environment is a SQL 2008 cluster on Windows Server 2008 R2 using a single SAN device with separate LUNs for data and logs. Tempdb is not isolated. The subscriber is a standalone SQL 2008 R2 instance on Windows Server 2008 R2 using local drives.

    On 7/23 we moved the reporting database off of the cluster where the production/publisher DB resides, onto the subscriber machine, so it can do all of its insane queries against the replica on that same hardware.

    I've been monitoring IO stall on tempdb for a while. I have a process that collects and stores statistics (reads, writes in bytes/ops, io stall, etc) for all databases on the cluster every ten minutes, and aggregates them each day. I use an Excel pivot chart to analyze this info. On 7/18, after replication was enabled and the snapshots processed, I expected to see a big hit, but it wasn't so bad. But then, after 7/23 (Monday morning 7/25 to be exact) everything hit the fan. I've attached a graph that explains it better than I can.
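
    For reference, the collector is basically a ten-minute snapshot of sys.dm_io_virtual_file_stats, along the lines of the sketch below (simplified - it's not the exact script I run, and the insert into my history table is omitted):

    -- Snapshot cumulative file IO stats for every database and file.
    -- The counters are cumulative since the last SQL Server restart, so the
    -- deltas between two snapshots give the load for that ten-minute window.
    SELECT  GETDATE()                    AS capture_time,
            DB_NAME(vfs.database_id)     AS database_name,
            mf.type_desc                 AS file_type,
            vfs.num_of_reads,
            vfs.num_of_bytes_read,
            vfs.io_stall_read_ms,
            vfs.num_of_writes,
            vfs.num_of_bytes_written,
            vfs.io_stall_write_ms,
            vfs.io_stall
    FROM    sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
    JOIN    sys.master_files AS mf
            ON  mf.database_id = vfs.database_id
            AND mf.file_id     = vfs.file_id;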

    I need to figure out what's causing all this tempdb thrashing. I would assume replication, but it's been running unchanged since 7/18. Nothing else has changed that I know of, including the users/business operations, which are normal compared to the several weeks prior to the spike. There's lots of documentation out there on getting IO stats (i.e. stall) for tempdb overall, but none that I can find for getting the queries, connections or processes that account for that IO. Any ideas on that? If it's replication, I'd like to understand why the problem didn't start when replication was enabled, but only when the replica started to be used. My butt's on the line because I did a week of monitoring w/replication enabled, then reported to my boss that it wasn't a significant hit and that the benefit of moving the reporting DB off the cluster outweighed any new load from replication...
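
    The closest thing I've found for tying tempdb use back to sessions is the allocation DMVs, something like the sketch below. It reports current allocations in 8 KB pages rather than stall directly, so I'm assuming the heavy allocators are also the ones driving the IO:

    -- Who is allocating in tempdb right now, and what are they running?
    -- Allocation counts are in 8 KB pages; parallel queries show one row per task.
    SELECT  s.session_id,
            s.login_name,
            s.host_name,
            s.program_name,
            tsu.user_objects_alloc_page_count     AS user_object_pages,
            tsu.internal_objects_alloc_page_count AS internal_object_pages,
            t.text                                AS current_sql
    FROM    sys.dm_db_task_space_usage AS tsu
    JOIN    sys.dm_exec_sessions AS s
            ON s.session_id = tsu.session_id
    LEFT JOIN sys.dm_exec_requests AS r
            ON  r.session_id = tsu.session_id
            AND r.request_id = tsu.request_id
    OUTER APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
    WHERE   tsu.database_id = 2   -- tempdb
    AND     tsu.user_objects_alloc_page_count
          + tsu.internal_objects_alloc_page_count > 0
    ORDER BY tsu.user_objects_alloc_page_count
           + tsu.internal_objects_alloc_page_count DESC;

    sys.dm_db_session_space_usage gives roughly the same numbers rolled up per session, if that's easier to eyeball. Is that the right direction, or is there something better for attributing the actual stall?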

    Any insight/advice is appreciated. Please just let me know what other information would be useful.

  • Seriously, _any_ advice appreciated.

  • ... none to give but I'm listening in.

1. Check out the default trace for that particular peak time to find out exactly what is running on the server (see the sketch after this list).

    2. Check if there is an open transaction that has been running a long time: DBCC OPENTRAN

    3. Check the Buffer cache hit ratio counter and related PerfMon counters.
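
    Something like this for #1 and #3 (a rough sketch - the dates in the WHERE clause are only placeholders, adjust the window to whenever your spike happens):

    -- #1: read the default trace around the suspect window.
    DECLARE @tracefile nvarchar(260);
    SELECT  @tracefile = path
    FROM    sys.traces
    WHERE   is_default = 1;

    SELECT  StartTime, EndTime, DatabaseName, LoginName, HostName,
            ApplicationName, TextData, Duration, Reads, Writes
    FROM    sys.fn_trace_gettable(@tracefile, DEFAULT)
    WHERE   StartTime >= '20110725 08:00'   -- example window only
    AND     StartTime <  '20110725 09:00'
    ORDER BY StartTime;

    -- #3: buffer cache hit ratio (the raw counter needs its ratio base to mean anything).
    SELECT  100.0 * a.cntr_value / NULLIF(b.cntr_value, 0) AS buffer_cache_hit_ratio_pct
    FROM    sys.dm_os_performance_counters AS a
    JOIN    sys.dm_os_performance_counters AS b
            ON  b.object_name  = a.object_name
            AND b.counter_name = 'Buffer cache hit ratio base'
    WHERE   a.counter_name = 'Buffer cache hit ratio'
    AND     a.object_name LIKE '%Buffer Manager%';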

    [font="Tahoma"]
    --SQLFRNDZ[/url]
    [/font]

  • Thanks...

    1.) So, there is no peak time; this seems to be happening all day. Here's today:

    After looking at that, I'm going to start collecting stats at 7 AM, because it does seem like around 8:10, when the first stats come in, everything is a lot lower... If I see the same pattern tomorrow I'm going to wonder whether gathering the stats is somehow causing the problem. :blink:

    Here's a day last month, chosen at random - with axes scaled the same as above:

    Notice that IO stall/op spikes a few times but generally hovers in the 10-15 ms range. Compare that to the above...

    Anyway, if I can identify some threshold tomorrow morning when things seem to pick up, I'll start a trace the next day to cover that timeframe.

    2.) DBCC OPENTRAN:

    tempdb: No active open transactions.

    production database:

    Replicated Transaction Information:

    Oldest distributed LSN : (91145:37015:6)

    Oldest non-distributed LSN : (91145:37017:1)

    3.) Same as #1: I'll run this as soon as I figure out a timeframe...

  • Oh, BTW, on all of these, in case it's not clear: the left axis is bytes, the right is milliseconds (for stall/op) or reads.
