Stairway to SQL Server Indexes: Level 14, Index Statistics

  • Comments posted to this topic are about the item Stairway to SQL Server Indexes: Level 14, Index Statistics

  • two headers

  • Thanks for spotting it. Note sure how it happened, but I'll try to get it fixed.

    David Durant

    Author

  • why does the date at the bottom of these articles show a future date? For example, 2012/3/26 for this article. Or is that not a date.

  • sjsubscribe (1/16/2012)


    why does the date at the bottom of these articles show a future date? For example, 2012/3/26 for this article. Or is that not a date.

    Because it'll be published at that date in the newsletter. The are preuploaded on the site and scheduled.

  • Awesome description. Thanks for taking the time to explain index statistics with such clarity.

  • This article has been a great learning experience for me, as have all the others. But I do have a question about the Histogram - why does it only take the first column of the index into account? I ask because I have a table with approx 62 million rows with 2 main, multi-column, non-clustered, indexes on them. But both have the same first column. From what I've read in this article, the histograms for both would identical, right?

    To give a bit of background the current first column of both indexes (campaignid) has about 1000 distinct values which aren't exactly evenly distributed, I'd say 40% of the distinct values take up about 80% of the rows. The next columns in index A is a datetime field (that has & needs time defined - so it would be almost unique) and the other is a varchar(50) that on it's own has 2000+ distinct values, but is also uneven in its distribution.

    I'm trying to work out if switching the order of the columns in both indexes would be of benefit to SQL Server. Most queries on this table have a clear preference of using one index or the other, because the where clause would only have those 2 indexed fields in it, but I have seen on some occasions the query planner decide that a table scan would be of more benefit.

    Any insight you could offer would be much appreciated,

    Drew

  • Hi Drew

    You are correct, the histograms will be the same for each index as only the first column of the index is used. However remember that statistics also work out the densities for index columns and it calculates density for each combination of columns in a compound index eg for 1st column, 1st + 2nd, 1st + 2nd + 3rd, etc. Therefore although the 2 indexes will have the same histogram, they will have differing densities. The optimiser uses the densities to help it make more accurate estimates after evaluating the histogram, effectively refining it's estimates. Therefore although the first column is clearly required in the query before the index will even be considered, the additional column densities help the optimiser to choose which one to use, if any. As you rightly observe though, the columns in the "where" clause basically determine which index will be used.

    The optimiser will choose a table scan if it estimates that using the index will in fact be more expensive. Consider a table with 500 data pages, all of which are read by a table scan. Compare to the case where using an index may perform a nested loop doing an index seek, but for many thousands of rows plus potentially a key lookup on the clustered index as well. It becomes far cheaper to just scan the 500 pages than jump about seeking thousands of individual rows. This is a obviously a very vague analogy, I'm just trying to highlight a potential scenario for you to show why a table scan may be used, even though an index exists.

    As a general rule of thumb, make an index as selective as possible, putting the most unique columns first. Bear in mind the difference between a column having many distinct values and those values being unique. You mention 62 million rows but with column 1 having only 1000 distinct values. Any distinct value can obviously return many rows, possibly millions. Clearly it's an "it depends" situation, based on which columns are actually in each "where" clause though. However I would personally try adding an index with the date column first as it will most likely be used for those queries which contain the date, but which may currently be table scanning because of the value of the current first column not being unique enough in that particular instance, as the date will be more selective (unique) than your current column 1 is.

    With indexing there is often no absolute right or wrong, and you will normally need multiple indexes to accommodate different queries, or are restricted in using a less efficient index simply because the where clause does not provide unique enough values. On a table with 62 million rows though, it's far better to have all the indexes necessary to support your full range of queries than risk a table scan.

    Cheers

    Roddy

  • Roddy, you confirmed what I was suspecting, especially that general rule about "make an index as selective as possible, putting the most unique columns first". Convincing my boss to change index structures on a 62 million (and ever growing) table will be separate exercise in itself. 😉

    Thanks for such a quick reply to my question,

    Drew

  • No problem Drew. Personally I would leave your current indexes in place, but add additional ones with the different column order and see if they improve performance for specific queries. That way you don't risk anything by changing an existing index, you supplement it by adding extra ones, certainly until you determine which ones actually work best, maybe all of them though. You only have 2 at present anyway so adding a third or fourth is not an issue.

    You can then monitor how often each index is being used. You may find they are all used, each supporting different queries. If a particular one is not being used you can then remove it. On a large table the overhead of maintaining an extra couple of indexes is insignificant compared to the overhead of table scanning because they don't exist. Just ask a user waiting for the query to return when it's table scanning 62 million rows :w00t:

    Cheers

    Roddy

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply