Never Judge A Query By Its Cost

Signs and Numbers

When tuning queries, or even finding queries to tune, there’s a rather misguided desire to look for queries with a high cost, or to judge improvement by a lower cost. The problem is that no matter where you’re looking, costs are estimates, and they often don’t reflect how long a query runs or how much work actually goes into processing it.

You typically need to observe the query running and gather more precise metrics on what it’s really doing to judge things.
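For a single query, the easy way to get those actuals is to flip on STATISTICS TIME and IO for your session (that’s where the stats time/io numbers below come from):

    /* Report actual CPU time, elapsed time, and logical reads
       for everything run in this session */
    SET STATISTICS TIME, IO ON;

    /* ...run the query you're tuning here... */

    SET STATISTICS TIME, IO OFF;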

Good Example

I have these two indexes:
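Something along these lines, using the StackOverflow database. Treat the exact index names and key columns here as an assumption, not the real definitions:

    /* Hypothetical index shapes, for illustration */
    CREATE NONCLUSTERED INDEX ix_Users_CreationDate_Reputation
        ON dbo.Users (CreationDate, Reputation);

    CREATE NONCLUSTERED INDEX ix_Posts_OwnerUserId
        ON dbo.Posts (OwnerUserId, Score);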

And when I run this query, I would expect it to use both of them to help the join and where clause, and then do a Key Lookup against the clustered index to get the columns my nonclustered index on Users doesn’t cover.
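A hypothetical version of the query looks like this; the specific predicates and column list are assumptions, but the shape matches the description above:

    SELECT u.Id,
           u.DisplayName,   /* not in the nonclustered index, so it needs a Key Lookup */
           p.Score
    FROM dbo.Users AS u
    JOIN dbo.Posts AS p
        ON p.OwnerUserId = u.Id
    WHERE u.CreationDate >= '20131201'
      AND u.CreationDate <  '20131208'
      AND u.Reputation > 100;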

After all, only about 3,000 rows in the Users table qualify for my predicates, and that only turns into about 7,300 rows when joined to Posts. That seems like one of those “no-brainer” places to do a Key Lookup, but it doesn’t happen.

Grinchy.

The metrics we get from stats time/io (trimmed for brevity) come out to about 5 seconds of wall clock time, 8 seconds of CPU time, and 150k logical reads from Users.

If we hint our index to make the optimizer choose it, we get the plan we’re after.
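With the hypothetical query from above, the hint looks like this (the index name is the assumed one from earlier):

    SELECT u.Id,
           u.DisplayName,
           p.Score
    FROM dbo.Users AS u WITH (INDEX (ix_Users_CreationDate_Reputation))
    JOIN dbo.Posts AS p
        ON p.OwnerUserId = u.Id
    WHERE u.CreationDate >= '20131201'
      AND u.CreationDate <  '20131208'
      AND u.Reputation > 100;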

Fancy Feast

Interesting! The metrics from stats time/io (again trimmed for brevity) show we’re down to 780 milliseconds of wall clock time, about 1.1 seconds of CPU time, and far fewer reads from the Users table.

This is clearly the faster plan, so what gives?

Scrooged

The much slower query has a lower cost!

Fraudity

The optimizer assigns a fairly high cost to the Key Lookup. That comes from two things:

  1. The optimizer assumes data comes from a cold buffer cache, meaning no data is in memory
  2. The optimizer hates I/O, especially random I/O

I’m gonna go out on a limb and say the costing algorithm still thinks about disks like they’re creaky old spinning rust, and that it’s not hip to SSDs. I don’t know this for a fact, and I’d love to hear otherwise. Put it all together and you get a query plan that prefers one sequential scan of the clustered index, because the optimizer assumes no data is in memory, and because doing a Key Lookup via random I/O, or even sorting the data first, would be costed as more expensive.

Costs can be helpful to figure out why a certain plan was chosen, but they shouldn’t be used to judge what the best plan is.
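If you want to hunt for tuning candidates, actual metrics are a better yardstick. One way to do that is to sort the plan cache by what queries really did; the column picks and TOP value here are just a sketch:

    /* Top CPU consumers by what they actually burned, not what they were estimated to cost */
    SELECT TOP (20)
           qs.execution_count,
           qs.total_worker_time / 1000  AS total_cpu_ms,
           qs.total_elapsed_time / 1000 AS total_duration_ms,
           qs.total_logical_reads,
           SUBSTRING(st.text, 1, 200)   AS query_text
    FROM sys.dm_exec_query_stats AS qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
    ORDER BY qs.total_worker_time DESC;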

Thanks for reading!

Brent says: in SQL Server’s defense, I bet most plan caches are chock full of beautiful query plans where the costs accurately reflect the amount of work involved. Say maybe 99.9% of your queries get perfectly crafted plans. Still, that’s 1 in 1,000 plans that go south, and those are the ones you’re typically focused on tuning. You’re tuning them because the estimates are wrong – and that’s why you don’t wanna put faith in those estimated cost numbers. If they were accurate, you probably wouldn’t be tuning ’em.


7 Comments

  • Very interesting. Thanks Erik!

  • I LOVE this!!! Thanks!

  • We ran into something like this a while back with a parameter sniffing issue. Small clients would set the plan for a query that was run by clients of all sizes, so when a large client ran it, it would be really slow. We forced an index hint on it, resulting in a larger cost but a faster overall time, and “bad” query plans stopped entering the cache. (Back in the days when the only option was to clear the entire proc cache, of course.)

    Working with something like this now, where putting a more efficient query (cost-wise) in place made performance dramatically worse.

  • The idea that the query engine would optimise Scans over Key Lookups because of older hardware constraints is a great point. It makes me wonder if that is a thing, and if there are other hardware optimisations for what are now legacy purposes, and what performance gains could be achieved by a revision.

  • Bryant McClellan
    December 17, 2018 8:41 am

    Erik,
I believe your assumption about creaky old disks is close, but I don’t think SQL Server cares about disk quality.

    Kimberly Tripp teaches a great rule of thumb about predicting when NCIs will be used that relates rows and pages. It works because the optimizer assumes that I/O is performed as random I/O. It cares not if the read is from an SSD or an 8-inch 1mb floppy drive (although if the latter is part of your system you have bigger issues).

  • So… what should we judge a query by?

