Question about truncating tables and running statistics

  • This is going to be a book, so I'm hoping there are some bored DBAs out there who want to offer advice. For the sake of accuracy, I'm including as many details as I can.

    I have a system that loads data into a BI tool using SSIS packages. There is a full refresh that runs nightly and reloads all data into the BI tool. There are also incremental refreshes that run hourly and pull over any new or changed data throughout the day. Recently, the incremental refreshes have been intermittently hanging; at random, a job will take 3-4 hours when the incrementals normally take only 20 minutes. I have done a bit of digging, but I don't have enough experience as a DBA to really get a handle on what's going on. Here's what I know:

    When a refresh job is running long, I can see the query that is executing by running a script that lists the currently running SQL statements. What I see is ~25 records for the same query, all with a wait type of CXPACKET. I have learned that this means the query has been parallelized and spread across multiple processors. Over the course of a couple of hours, the number of records slowly drops from 25 until none are still running. At that point the job completes, albeit way later than usual.
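
    In case it helps anyone reproduce this, here's a minimal sketch of the kind of monitoring script I'm running (it assumes VIEW SERVER STATE permission; my real script pulls more columns):

        -- Show currently executing requests with their wait types
        SELECT r.session_id,
               r.status,
               r.wait_type,
               r.wait_time,
               t.text AS sql_text
        FROM sys.dm_exec_requests AS r
        CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
        WHERE r.session_id <> @@SPID
        ORDER BY r.session_id;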

    During one of these long-running sessions, I backed up the database, restored it onto a test machine, and ran the query in question. It took a couple of hours to complete in my test environment. I then ran DBCC DROPCLEANBUFFERS, ran EXEC sp_updatestats, and ran the query again. It returned in 22 seconds. At this point, I feel like I may have narrowed the issue down to out-of-date statistics.
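
    For anyone following along, the test-box steps were roughly this (the query itself is omitted since it's specific to our load):

        -- Rule out a warm cache, refresh stats, then re-run the query
        DBCC DROPCLEANBUFFERS;   -- clear the buffer pool so the re-run isn't just a caching effect
        EXEC sp_updatestats;     -- update statistics database-wide
        -- ...re-run the query here: it went from ~2 hours down to 22 seconds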

    My going theory: there is a certain table that gets loaded during the full refresh with ~32 million records. The refresh first truncates the table, then inserts the 32 million records. This same table is used for the incremental refreshes, and each time it is first truncated and then inserted with any new data (typically only a couple thousand records for an incremental). This database has auto-update statistics turned on (and the asynchronous option turned off). My theory is that during the initial load of 32 million, SQL Server recognizes that the stats are now out of date, so before the next query on this table it updates the stats, and thus the full refresh runs fine. But then the incremental refresh runs and again first truncates the table, but it only inserts a couple thousand records. Am I correct in saying that the statistics will NOT be updated, because A) SQL Server does not treat truncates the same as inserts/deletes/updates, and B) ~2,000 inserts is nowhere near 20% of 32 million, so SQL Server will still think the stats are fresh even though they are way out of whack? That's my question #1.
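
    If it helps, here's the kind of check I'm planning to run to test this theory (sys.dm_db_stats_properties requires 2008 R2 SP2 / 2012 SP1 or later, and dbo.StagingTable is a placeholder for the real table name):

        -- Inspect row count and modification counter for each statistic on the table
        SELECT s.name AS stats_name,
               sp.last_updated,
               sp.rows,
               sp.modification_counter
        FROM sys.stats AS s
        CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
        WHERE s.object_id = OBJECT_ID('dbo.StagingTable');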

    Question #2 is why I am only seeing this happen sporadically. If it is stats related (which it seems to be, going off my test in the test environment), then wouldn't every incremental refresh run long using the out-of-date stats? Why only one every few days? Am I misunderstanding how SQL Server determines when to automatically update the stats?

    I set up a job to monitor the row counts for this table every hour, so the next time a job runs long I can see how many records were in the table. For now, though, I am just trying to make some sense of it and learn how statistics work in SQL Server. I would also like to update stats on this table during each refresh (after the inserts, but before the query; see the sketch below), but I'd like a more solid understanding of why before I do. Any explanations or advice are appreciated, as are any questions.
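
    For the record, the statement I'm thinking of adding to the package right after the incremental insert would be something like this (dbo.StagingTable is again a placeholder; with only a few thousand rows in the table, a full scan is cheap):

        -- Rebuild statistics on the staging table before the downstream query runs
        UPDATE STATISTICS dbo.StagingTable WITH FULLSCAN;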

    tl;dr: how does truncating a table interact with statistics, and what happens to the statistics on a 32-million-record table when you truncate all 32 million records and then load 3,000 records?

  • No time to deal with all of your questions, but:

    1) Get the query plan that is running and see what it looks like. Compiled vs. actual parameters? Estimated vs. actual row counts? Does the plan make sense, etc.? My guess is possibly some parameter sniffing going on ...

    2) ... which could likely be cured with OPTION (RECOMPILE) placed on every query in the data load process. That should ALMOST ALWAYS be done in such scenarios as yours. Do you REALLY want to use the same plan for 32M rows as you would use for a few thousand? :w00t:
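
    A sketch of the idea (the query and names here are placeholders, not your actual load query):

        -- Force a fresh plan compiled for the current parameter values and row counts
        DECLARE @LastRefresh datetime = '20160520';

        SELECT st.BusinessKey, st.SomeMeasure
        FROM dbo.StagingTable AS st
        WHERE st.ModifiedDate >= @LastRefresh
        OPTION (RECOMPILE);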

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • OhioGuy (5/20/2016)

        Am I correct in saying that the statistics will NOT be updated, because A) SQL Server does not treat truncates the same as inserts/deletes/updates, and B) ~2,000 inserts is nowhere near 20% of 32 million, so SQL Server will still think the stats are fresh even though they are way out of whack? That's my question #1.

    They'll get updated. Truncate sets the row count for the table back to 0 (but doesn't change the rowmodctr, as far as I'm aware); the next set of modifications adds to the rowmodctr again, and then, when something goes to use the table, the rowmodctr is compared with the number of rows in the table to see if the stats are stale, which will almost certainly trigger an update.

    At least that's how 2008 (I think it was) worked. Can't remember the last time I played with this.
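
    If you want to watch that counter directly, the deprecated sys.sysindexes compatibility view still exposes it (table name is a placeholder):

        -- rowcnt and rowmodctr for the table's heap/indexes
        SELECT id, indid, rowcnt, rowmodctr
        FROM sys.sysindexes
        WHERE id = OBJECT_ID('dbo.StagingTable');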

    Pull the query plans; they'll tell you what's going on with the row estimations. You can also easily check stats update dates with the STATS_DATE() function.
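
    Something along these lines (table name is a placeholder):

        -- Last statistics update date for each statistic on the table
        SELECT s.name AS stats_name,
               STATS_DATE(s.object_id, s.stats_id) AS last_updated
        FROM sys.stats AS s
        WHERE s.object_id = OBJECT_ID('dbo.StagingTable');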

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
