• Roger Sabin (2/8/2013)


    I am having a hard time understanding why the first index would be used when it requires tremendously more IO and time than the second index to return the same records. Why isn't it obvious to the query optimizer that the first index costs more than the second index?

    There are a number of reasons.

    First up, where there is an obvious covering nonclustered index for the query, the optimizer picks it straight away at the trivial plan stage. The plan is chosen before the optimizer starts exploring the space of possible plans and choosing one based on lowest estimated cost. You can see this in the SSMS execution plan view by clicking on the SELECT icon and looking at the Optimization Level property in the Properties window; it shows TRIVIAL for the CorrectIndex.sqlplan file.
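
    Outside SSMS, the same information appears as the StatementOptmLevel attribute of the StmtSimple element in the showplan XML. A minimal sketch (substitute your own query text; the table and column names here are just the ones used in this thread):

    SET SHOWPLAN_XML ON;
    GO
    -- The XML returned instead of a result set carries StatementOptmLevel="TRIVIAL" or "FULL"
    SELECT CustomerName
    FROM dbo.UNMonthly_CMS
    WHERE ProcessYear = 2013
    AND ProcessMonth = 1
    AND VisionAccountNumber = '1234';
    GO
    SET SHOWPLAN_XML OFF;
    GO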

    For the second query, there is no longer an obvious best choice, and although the trivial plan logic still runs, it does not produce a plan. So, the optimizer starts generating plan alternatives and costing them. One alternative is to seek the clustered index on:

    ProcessYear = 2013

    ProcessMonth = 1

    ...and then apply a residual filter that keeps only the rows where:

    VisionAccountNumber = '1234'

    The other alternative is to seek the nonclustered index on:

    ProcessYear = 2013

    ProcessMonth = 1

    VisionAccountNumber = '1234'

    ...and then use the clustered index keys stored at the leaf level of the nonclustered index to look up the base row in the clustered index and return its CustomerName value. This lookup is repeated for each row found by the nonclustered index seek.
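
    As an aside, this lookup step is exactly what a covering index avoids: if the nonclustered index also carried CustomerName at its leaf level, the second query would be covered just like the first one. A hedged sketch only, using the column names from this thread and an invented index name (your real index definitions may differ, and widening an index is not free):

    -- Hypothetical covering index; the name and key order are assumptions, not your actual definition
    CREATE NONCLUSTERED INDEX IX_UNMonthly_CMS_Covering
    ON dbo.UNMonthly_CMS (ProcessYear, ProcessMonth, VisionAccountNumber)
    INCLUDE (CustomerName);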

    These two alternatives are assigned an estimated cost using the query optimizer's cost model, and this is where the problem occurs. The model is just a model, of course; it happens to work pretty well in practice over the broad range of queries, database designs, and hardware in use across the world. It does not attempt to estimate the cost of the query precisely in your particular circumstances, and the cost it produces is also dependent on the quality of the statistical information it has to work with.

    Now, without scripted statistics for the table in question, I cannot work through the numbers and show you precisely why the clustered index seek + residual predicate is chosen over the nonclustered seek + base table lookups. If you compare the estimated plan costs for both access paths yourself (using an index hint as necessary), though, you will see that the clustered path appears to be cheaper.
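
    Something along these lines lets you see the two estimates side by side. This is a sketch only: the nonclustered index name is a placeholder, and the literal values are taken from the example predicates above.

    -- Force the clustered index seek + residual filter:
    SELECT CustomerName
    FROM dbo.UNMonthly_CMS WITH (FORCESEEK, INDEX(1))
    WHERE ProcessYear = 2013
    AND ProcessMonth = 1
    AND VisionAccountNumber = '1234';

    -- Force the nonclustered index seek + lookups into the clustered index
    -- (substitute the name of your existing nonclustered index):
    SELECT CustomerName
    FROM dbo.UNMonthly_CMS WITH (INDEX(IX_YourNonclusteredIndex))
    WHERE ProcessYear = 2013
    AND ProcessMonth = 1
    AND VisionAccountNumber = '1234';

    Compare the Estimated Subtree Cost on the root SELECT operator of each estimated plan; with the hints removed, the optimizer simply picks whichever shape it costed lower.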

    One reason for this is that the optimizer assumes each query starts with a cold data cache, so all I/Os will at least initially be physical. The model also assumes sequential I/O (scanning a range of the clustered index) will be much faster than the random I/O expected to result from a given number of lookups from the nonclustered index to the base table.
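
    If you want to see how much the cold-cache assumption matters in practice, you can flush the buffer pool on a test system (never on production) and compare the actual I/O of the two hinted queries from the sketch above:

    -- Test systems only: write dirty pages, then drop clean pages so the next run starts with a cold data cache
    CHECKPOINT;
    DBCC DROPCLEANBUFFERS;

    SET STATISTICS IO ON;
    SET STATISTICS TIME ON;
    -- ...re-run the two hinted queries from the earlier sketch here...
    SET STATISTICS IO OFF;
    SET STATISTICS TIME OFF;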

    The main reason, though, is that the optimizer is grossly underestimating the number of rows returned by the range seek of the clustered index, which will then need to have the residual filter applied. This may be due to a lack of good statistics on the column combination concerned, or it may be a consequence of the way the predicates are modelled. You can check this by comparing the estimated and actual row counts for a query like:

    SELECT COUNT_BIG(*) FROM UNMonthly_CMS WHERE ProcessYear = 2013 AND ProcessMonth = 1 OPTION (RECOMPILE);

    The estimated number of rows ought to be quite close to the actual count; if it is a long way off, poor statistics on that column combination are the likely cause.
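
    If that estimate does turn out to be badly off, one possible next step (again using names assumed from this thread) is to look at the statistics the optimizer is working from, and to give it explicit multi-column statistics on the predicate columns:

    -- Inspect an existing statistics object; replace PK_UNMonthly_CMS with your actual index or statistics name
    DBCC SHOW_STATISTICS ('dbo.UNMonthly_CMS', PK_UNMonthly_CMS);

    -- Optionally, create multi-column statistics on the two range-predicate columns
    CREATE STATISTICS st_ProcessYear_ProcessMonth
    ON dbo.UNMonthly_CMS (ProcessYear, ProcessMonth)
    WITH FULLSCAN;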

    If you fancy scripting out the indexes and statistics using SSMS, I can go into more detail.