Odd column statistics

  • I’ve recently found a very odd result from a statistics rebuild. The statistics are on a single column index that is mostly unique. There are 1.5 million rows and 1.1 million distinct values, so on average each value that exists occurs 1.3 times. The statistics on the index were built with a 5% sampling rate and the density vector shows the all-density to be 8.8E-06 for the indexed column. That density value estimates that each value exists 13.2 times (1.5 million * 8.8E-06), nowhere near the correct value of 1.3 times.

    Now, I know that 5% is a pretty low sampling rate. But given that it would sample 75,000 rows and there are only two values in the table that have more than 10 rows, how did it estimate that each value would exist 13.2 times? Even if it had perfectly sampled all the values that had the most number of rows (the BIG ones), that average is 3.1 rows per value – nowhere near 13.2.

    This is my question: Do you have any ideas why the statistics would be so incorrect?

    My only thought is that it must be extrapolating, but why would it take a sample where the wost-case average number of rows was 3.1 and assume that the average for the rest of the table is 13.2? I have noticed that if I increase the sampling percentage the statistics become more accurate (100% sampling is perfectly accurate of course), but it doesn’t explain why a low sampling rate comes up with a value that is way beyond what is contained in the sampled data.

    To allow you to enjoy my confusion, I’ve written a repo script with faked data that you can play with. I tried to reduce the size of the table and was able to trim the dataset to 500K rows and 10 columns and still see the issue, if I removed more columns or rows the problem became less obvious. I apologize, I know that is still a big script and I had to run the SQL using SQLCMD - it was too big for SSMS on my box, so you might need to do the same. In this dataset, there are no values that exist more than 4 times (4 is the max), but the density values indicate that each row should exist 4.2 times on average. The SmallSetInserts.sql file creates a table and indexes, then fills it with data 1000 rows at a time. CheckStatistics.sql updates the stats and reports back on what it finds. Run that script and check out the difference between the EstRowsPerValue column and the RealRowsPerValue column - the RealRowsPerValue will be 1.36, but the EstRowsPerValue will probably be between 4.1 and 4.5 (depending on variations between your hardware and mine).

    Machine:

    4 core, 16 GB

    Windows 2012 R2

    SQL Server 2014 Enterprise SP1

    [Also tested on 3 other 2014 boxes and 1 SQL Server 2012 Developer SP2]

    Thanks,

    Chad

  • hi,

    But if you update the statistics with fullscan you will get the correct estimation , correct ?

    -- R

  • Yes, with a fullscan the statistics are accurate.

  • If you were to run the sampled update of the statistics several times over, do the results stay substantially the same, or does the distribution estimate change from run to run?

  • If I update the statistics repeatedly, the results stay exactly the same down to the 6th decimal place (which is the last one). Interestingly enough though, the results vary a small bit from instance to instance (I've tested 6 different instances on 5 different boxes). I assume this is due to variations in how the pages split when the data is inserted or perhaps each instance has a different static seed for the random number generator that picks the sample.

    Thanks,

    Chad

  • I got a little bit of help off-forum and wanted to post the results here. Apparently sampled statistics over estimate uniqueness, so the algorithm fudges the numbers slightly to show higher duplicates in an attempt to compensate (there are exceptions for columns that have constraints that enforce uniqueness of course). This would explain why the statistics I saw were outside the bounds of anything stored in the table. Unfortunately, I don't have a blog or white paper to reference - if anyone has a source I can link to, please let me know.

    Thanks!

    Chad

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply