Fragmentation & Statistics

  • Hi All

    Was approached with a slow running query.

    First thing I did was update statistics on the entire DB which improved the performance drastically.

    The problem is that the performance starts to degrade after about an hour or so at which point I update stats again - I've set up a job to update stats every hour to remedy this.

    I cannot help but think that this is not normal.

    What I did notice is that the indexes are heavily fragmented - some of them at 99%.

    My question is: does fragmentation play a role in statistics? Could the fragmentation be the reason for having to update stats so frequently?

    I get that rebuilding the indexes will answer this for me, but getting a window to do this is a bit tricky at the moment.

    Thanks

  • Fragmentation and statistics are two completely different, though somewhat related, things.

    Fragmentation is simply about storage on the system. A page gets allocated to an index. Data gets put into it. At some point, an INSERT or UPDATE will cause the page to have to be split. It can no longer hold the data. Half the rows get moved to a new page. This move means the data is written to a different part of the disk, so, as you scan the data in the index, you move around on the disk following the links.
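
    If you want to see that fragmentation (and the page counts behind it), something like the query below works. It's just a sketch - 'LIMITED' is the cheapest scan mode, and the page_count filter is an arbitrary cutoff to skip tiny indexes:

    SELECT OBJECT_NAME(ips.object_id) AS table_name,
           i.name AS index_name,
           ips.avg_fragmentation_in_percent,
           ips.page_count
    FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
    JOIN sys.indexes AS i
        ON i.object_id = ips.object_id
       AND i.index_id = ips.index_id
    WHERE ips.page_count > 1000   -- ignore tiny indexes; their fragmentation doesn't matter much
    ORDER BY ips.avg_fragmentation_in_percent DESC;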

    Statistics are metadata about your data. A statistic consists of the header (general information about the stats), the density vector (how unique the data in question is), and the histogram (a distribution of the data from the first column of the statistic). The query optimizer uses this information for cardinality estimation - the estimated row counts. That helps the optimizer decide whether a scan or a seek should be used, whether to use a hash join, a nested loops join, or a merge join, and lots and lots more.
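
    You can see all three pieces (header, density vector, histogram) for any given statistic with DBCC SHOW_STATISTICS. The table and statistic names below are just placeholders:

    -- header: rows, rows sampled, last updated
    -- density vector: how unique the leading column(s) are
    -- histogram: value distribution of the leading column
    DBCC SHOW_STATISTICS ('dbo.YourTable', IX_YourTable_YourColumn);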

    These two things are related because, as you change the data in your system, both go through various processes to deal with those changes. That's all. One has nothing to do with the other, but they're both affected by incoming changes. I hope that helps a little.

    As to your immediate issue, the key is going to be understanding why the changes in data are causing your statistics to age so badly. Are you updating them with a full scan or sampled? That affects their accuracy. The way to start to understand is to look at the execution plans. If you're new to all this, the book below (search for it, you can get a free digital copy from here on SSC) on execution plans can help. Otherwise, we're talking query tuning and that's what the second book is all about.
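
    One way to see how badly the stats are aging - when they were last updated, how much of the table got sampled, and how many rows have changed since - is sys.dm_db_stats_properties. A sketch, with a placeholder table name (this DMF needs a reasonably recent build, 2008 R2 SP2 / 2012 SP1 or later if memory serves):

    SELECT s.name AS stats_name,
           sp.last_updated,
           sp.rows,
           sp.rows_sampled,
           sp.modification_counter
    FROM sys.stats AS s
    CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
    WHERE s.object_id = OBJECT_ID('dbo.YourTable');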

    As to stats updates, it's probably better to focus down, determine which queries need more frequent stats updates, and only update the necessary tables, indexes, or statistics in support of those queries, rather than the entire database. I once had a really bad database design where, until we fixed it, we were updating the stats on one table every 15 minutes. We do get in these situations occasionally. However, it's best to identify, understand, and specifically target the issue rather than drop nukes like rebuilding the entire set of stats. BTW, that causes all your queries to get recompiled, so: increased CPU load, additional waits, bad things.
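
    For example, rather than sp_updatestats against the whole database, a targeted update might look like this - the table and statistic names are placeholders:

    -- every statistic on the one table the slow query hits, with a full scan
    UPDATE STATISTICS dbo.YourTable WITH FULLSCAN;

    -- or just the single statistic that query depends on
    UPDATE STATISTICS dbo.YourTable IX_YourTable_YourColumn WITH FULLSCAN;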

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Grant Fritchey wrote:

    Fragmentation and statistics are two completely different, though somewhat related, things. [...]

    Thanks a lot for taking the time to write this up - Appreciated.

    As far as the sampling size goes, I'm running sp_updatestats against the entire database, so I'm not sure what sampling that uses.

    Perhaps I should look at running with FULLSCAN against the tables used in the query.

    Thanks

  • You got it. sp_updatestats uses a sample method instead of full scan. It means it's faster, but it's less accurate. There's always a tradeoff deciding which to use, so you'll have to think it through and experiment a little. Probably, full scan will work out better, but as to whether or not it impacts how long between stats updates depends on everything else going on.
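
    To put the tradeoff in concrete terms, these are the options on UPDATE STATISTICS itself (placeholder table name):

    -- let SQL Server pick a sample size based on table size (the default)
    UPDATE STATISTICS dbo.YourTable;

    -- explicit sample: faster, less accurate
    UPDATE STATISTICS dbo.YourTable WITH SAMPLE 25 PERCENT;

    -- read every row: slowest, most accurate
    UPDATE STATISTICS dbo.YourTable WITH FULLSCAN;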

    I'd capture the execution plan when things are running well and again when things are running poorly. Comparing the two should give you a lot of information as to what's going on.
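
    If you want to capture the actual plan from a query window, one way is SET STATISTICS XML - just a sketch, swap the placeholder SELECT for your slow query:

    -- returns the actual execution plan as XML alongside the results
    SET STATISTICS XML ON;

    SELECT COUNT(*) FROM dbo.YourTable;   -- the slow query goes here

    SET STATISTICS XML OFF;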

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Grant Fritchey wrote:

    You got it. sp_updatestats uses a sample method instead of full scan. [...]

    Thanks very much

    Will proceed with running FULLSCAN.

    Question: when using sp_updatestats, what determines the sampling % that SQL uses?
