Statistics Explained

  • Without going out and buying yet another book on SQL Server 2005 by Kalen Delaney, can any one provide me with a good explaination of statistics and how the work and are used by the query engine without going into a lot of detail?

    Thanks

  • Here's my brief answer... 

    Statistics are simply data SQL Server collects about a column's data selectivity,  uniqueness and counts of particular values in that column.  Stats exist on every index, however stats can also exist on columns that are not indexed.



    A.J.
    DBA with an attitude

  • Adam has a good explanation, but I'll add mine:

    Statistics are a way of guessing what the distribution of data is in a column. So if you have age, SQL "samples" the data and gets some idea of how many people are < 10, 11-20, 21-30, etc. Note these are my numbers, not really what SQL would use. It would just have an "idea" of the distribution. That way it can "guess" whether an index will help the query run quicker or if it's better to just scan the table.

  • Along these lines, how much effort should I put in to understand statistics?  Would getting Inside Microsoft SQL Server 2005: Query Tuning and Optimization be a worthwhile investment?  Just curious as I have invested a lot in a variety of SQL Server 2005 books over the last few months.

    Thanks.

  • If you plan to be a DBA for a significant portion of time, invest the time in learning about query tuning.  I would recommend you spare the week to take a class if you can get your company to pay for it.

    It will do two things that are really important.  First, it will teach you how to performance tune and troubleshoot slow queries in your databases.  Second, it will give you a much better understanding of what SQL Server is doing with the query you have passed it so you can more intelligently make decisions on what to do and what not to do.

    You may never specifically use performance tuning and query optimization, but understanding how SQL creates execution plans will be a great help in database design if you have to build anything new.

  • I think I have a pretty good understanding about building and tuning queries, and the use and creating of indexes.  What I am not sure about is statistics.  In working on slow performance of our PeopleSoft Finance system after upgrading to SQL Server 2005, I ran an hour long trace and used that with DTA.  It recommended we build 35 indexes and a lot more multi-column statistics.  We have implemented those recommendations and have seen a significant improvement in performance.  I can understand how the indexes help, but how do the statistics?  If you are doing this manually, how do you decide if a statistic is more appropriate than index?  Without going into a 3 to 30 page explaination, how are statistics used to improve query performance?

    Or should I invest in the book I mentioned earlier by Karen Delaney?

    Also, I work for a school district, training funds are very sparse.

     

  • In my limited understanding : statistics are used by the optimiser to determine the most efficient query plan, including which indexes to use and how.  Using an index to identify rows might be extremely efficient if it's a unique index on a large table, but extremely inefficient if there is only a small number of distinct values in that index.  On non-indexed fields the stats presumably help the optimiser determine what kinds of join are appropriate and which order to process each element in the plan.

    If you haven't done so already, take a look at the properties for some of the stats in your tables - it helps to visualise the kind of information they represent.

    I think statistics are definitely one of the least-understood aspects of performance tuning.  Not sure about the book as I haven't read it, but I would expect there to be some blogs about stats somewhere. 

     

  • Based on that, would you say it is worth learning more about statistics?  I am just curious how much time and effort I should put into some of this or if I should just "put my faith in Microsoft" and that DTA knows what's best.

    Thanks

  • I think it's handy to understand if you have the time and inclination, but for the most part you don't need to know it in detail - just knowing that they are beneficial is probably sufficient.  Having said that, it's another tool to use in performance tuning and I have occasionally made queries run faster due to that knowledge.  If you just want to get your toes wet, knowing when statistics become out of date is useful. 

    Here's a document I found:

    http://www.microsoft.com/technet/prodtechnol/sql/2005/qrystats.mspx

     

  •  I am just curious how much time and effort I should put into some of this or if I should just "put my faith in Microsoft" and that DTA knows what's best.

    The DTA does not always know best. I've found that it tends to go overboard.

    When I use it (which isn't often) what I'll generally do is implement its suggestions 1 by 1 and test the performance improvement, then I implement the ones that have the most impact.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Does Microsoft ever know what's really best?  What I implemented has helped.  I understand how the indexes can help, but what I still don't understand are statistics.  What I have read so far hasn't shed a whole lot of little on the subject, yet.

    Starting to look like I'll need to find yab (yet another book) to help shed that light.

    Thanks

  • At their simplest, statistics tell the optimiser the distribution of data in a column (or a set of columns). It can help the optimiser know how many rows to expect and hence help it to determine a more appropriate plan.

    The SQL engine will create single column statistics as it needs (you can see these in sysindexes. Their names always start _WA_Sys). It will never, however, create multi column stats automatically.

    I take the stance that statistics often (but not always) show where indexes might be useful.

    If you want a good book, Ken Henderson's Guru's guide to SQL Server internals has a god section on statistics (although it is for SQL 2000), as does Sajal Dam's SQL Server Performance Tuning Distilled (SQL 2000). I highly recomend Ken's bok. It is a treasure trove of information.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • I'll have to see if I can find a copy of Ken's book.  Your explaination helps, but to get a better understanding, perhaps additional books are worth the investment.

    Thanks.

  • interesting, while looking through a bunch of our statistics i found that most of the wa ones are for more than one column. will have to add this to my to read up on list

  • Interesting. I recall that SQL 2000 would never create multi-column stats. I looked in the 2005 BoL and couldn't find anything that outright stated that it will or won't. I did find the following comments.

    SQL Server 2005 allows for statistical information to be created regarding the distribution of values in a column. The query optimizer uses this statistical information to determine the optimal query plan by estimating the cost of using an index to evaluate the query.

    Creating statistics manually lets you create statistics that contain multiple column densities. 

    Edit: Hmm. Strange. I checked my server and I also have multi column stats. However I found the following quote on technet. Seems to be a contradiction here between what the docs say and what is happening.

    (http://www.microsoft.com/technet/prodtechnol/sql/2005/qrystats.mspx)

    Use multi-column statistics when queries have multi-column conditions

    When a query has a multi-column condition, consider using multi-column statistics if you suspect that the optimizer is not producing the best plan for the query. You get multi-column statistics as a by-product of creating a multi-column index, so if there is already a multi-column index that supports the multi-column condition, there is no need to create statistics explicitly. Auto create statistics only creates single-column statistics, never multi-column statistics. So if you need multi-column statistics, create them manually, or create a multi-column index

    Edit 2: I went through and checked some of the multi-column stats I have. In all cases that I saw, it's the clustering key(s) that are listed as the other columns in the stats. When I manually create a multi-colum statistic it doesn't get the clustering key included in it.

    Maybe what the docs actually mean is that SQL automatically creates statistics with a single column and the clustering key in it.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass

Viewing 15 posts - 1 through 14 (of 14 total)

You must be logged in to reply to this topic. Login to reply