Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 

A Tale Of Two Execution (Updated)

Update: Scroll to the bottom for more information added after originally posting.

Earlier this week a coworker asked me for help figuring out why an execution plan was different for a query run against two identically configured servers. I've since spent the better part of the week trying to figure out why, and after several suggestions from folks on Twitter and two MVPs I still don't have an answer.

First, A Little Background
My company has 9 SQL Servers that are used by web servers to render pages for a web application. Each SQL server is more or less identical to one another – Dell 2950, 4 GB RAM, Windows 2003 R2 x86, and SQL 2005 Standard Edition. The /3GB switch is set to let SQL use 3GB of RAM, and these are dedicated SQL servers with no other processes running on them. Every server is a subscriber to transactional replication so they've got identical data and schema. There's no SSIS, SSAS, or SS-anything else-S at play on these boxes. The only difference is that a few servers have dual-dual CPUs and a few have dual-quad CPUs, but as you'll read in a minute that doesn't seem to matter.

The Problem
Now here's the problem…At the beginning of the week, for this simple query 1 of the 9 servers resulted in execution plan "A" and the other 8 used plan "B". Things have shifted; now 7 of the 9 are using plan "A", but 2 servers are still using plan "B".

The first thing that came to mind was out of date or updated statistics. So I isolated one server using plan "A" and one using plan "B" and ran an UPDATE STATISTICS with FULLSCAN on all the indexes for the two tables included in the query. I verified the statistics from both servers matched by reviewing the results of DBCC SHOWSTATISTICS. No change in execution plans. (For what it's worth, AUTO_UPDATE_STATISTICS is enabled and we have an agent job that runs sp_updatestats throughout the day to keep statistics up to date)

I looked at SET options for connections. Identical.

I looked at collations. Identical.

I looked at the data itself. Identical.

I looked at server configuration options (sp_configure). Identical.

I looked at database compatibility levels. Identical.

I looked at memory available to SQL Server. Identical as far as I can tell.

I looked at SQL server versions. A-ha!

  • Servers with plan "A":
    • 9.00.4230.00
    • 9.00.4226.00 (6 of these)
  • Servers with plan "B":
    • 9.00.4262.00
    • 9.00.3054.00

So maybe there's something there with the version difference. Except that the server that started the week on plan "A" is the one running 9.00.4230.00. The 6 servers running 9.00.4226.00? They were using plan "B" and have since switched to plan "A". And what about the server running 9.00.4262.00? That's a later version than the one running 9.00.4230.00, yet it's using plan "B". That makes me doubt that it's something related to the version number.

Just for fun and to add to the confusion, I changed the table variable in the query to a temporary table…and the plans matched!

Think You Can Figure It Out?
I'm stumped, and I'd love to hear what you think the reason for the difference in plans might be. I've linked to a the schema for the tables, the output of DBCC SHOWSTATISTICS for the index that I think is the root cause of the problem, the query itself, and execution plans "A" and "B" below. Object names and servers have been changed to protect the innocent. If there's something I'm missing that you want to know more about, contact me on Twitter (@SQLDBA) or leave a comment.

I'd love to figure out what's going on!

Download the schema and queries here

 

UPDATE: OK, as some people were quick to point out, the XML execution plans I included showed two different builds of SQL. So I went poking around and found that one of the 9.00.4266.00 builds has gone back to using plan "B" again. That means I have two different servers on the same build with different execution plans. To be sure it wasn't obviously statistics related I ran another UPDATE STATISTICS with FULLSCAN (and verified that statistics matched across servers), cleared the buffer cache, and the plans were still different. I also realized that in the course of changing object names I ended up with a duplicate column. I have fixed the schema script and updated the download with the plan "A" and plan "B" from the 9.00.4266.00 builds.

Comments

Posted by Steve Jones on 11 February 2010

Interesting issue. Nothing jumps out, but I ran this back out on Twitter and Linked In.

Posted by Robert Davis on 11 February 2010

Parameter sniffing? Did you check fragmentation levels?

The download doesn't work for me.

We're sorry, but we are unable to display this page at this time.

Please try again in a few moments.

If you continue to get this error, please contact Google Support.

Posted by Peter Samson on 11 February 2010

What is Max Degree of Parallelism (MAXDOP) set to on each server?

Posted by Jeffrey Yao on 11 February 2010

Are the parameters (used in the query if any) the same across the servers?

Any undeterministic functions (such as getdate()) used in the query?

(I cannot download either, same problem as Robert Davis has encountered)

Posted by Dave Mc on 12 February 2010

It seems like there must be something different about the index IX_Item_24_12_5_i17 on some of the databases -- or else the optimizer is using different logic with respect to the index seek being done on it.  Plan A shows estimated number of rows as 134.959 vs 667.417 for Plan B.  Could that be causing the differences in the execution plan steps?

Can you check anything further about that index to ensure that it's identical (including stats) across all servers?  Is it possible that the index is more fragmented on some servers?

Posted by Vladimir Sotirov on 12 February 2010

If the Estimated number of rows on the two plans are different before the join to the Provider table then the statistics are different on the two servers and the optimizer chooses hash join on one and inner loop join on the other. Try adding  joint hint inner loop join and compare the plans.

Also you can get rid of the key lookup if you add the missing columns to the index.

Posted by tomasz.cwajda on 17 February 2010

Plan is fixed on first execution after compilation and obviously depends on parameters passed to procedure - most probably values in parameters ware different in each (first execution after compilation).

Leave a Comment

Please register or log in to leave a comment.