The other day I was with Andy and his Blackberry rang. Since it seems to ring about 4 times an hour on his days off, there was nothing special about this time except that he didn't get off it in his usual 10 minutes. It turned out the his SQL Server was pegged and causing issues at work. Since I had a few minutes, I went along with him to the office to help troubleshoot.
We got there and found a query that normally returns in a few seconds but was taking minutes now. This has happened before and a variety of highly advanced expert techniques are required to bring the performance back to it's previous level. These usually involve rebooting, yelling at the server, or even a slight kick to the server rack.
Actually it's a really ugly stored procedure that contains a few items, but the big one is a large left join from a 35 million row table. Despite the size of the table and the left join, 3 left joins really, the query normally runs with 7000 reads.
So we fire up query analyzer and check the server. We see various executions of the stored procedure coming back at 150,000+ reads or more. And with 50,000+ CPU milliseconds. That's caused havoc with the business users who are seeing their application take minutes to pull up a screen.
Now he has a pretty beefy server. An 8 way box that has 8GB of RAM and normally easily handles the load. But today the queries are slow, the average CPU% is 95%, and nothing is going right. We check the query plans and see that the offending portion of the stored procedure is running an Index Scan and eating up over 80% of the entire query.
Nothing really has changed and it was working first thing this particular morning, but for some reason the query plan changed. We tried a few things, but finally Andy rolled back to a previous version of the stored procedure that returned the same values, but was structured slightly differently with temp tables and the query broken into a couple parts. Back to 11,000 reads, but that's miles better than 100,000+.
At the same time we noticed another stored procedure, this one being called from the web server that required 900,000 reads. Andy pulled that one into Query Analyzer and ran it, getting a result in 50,000 reads and much less time. I suggested that there may be some security checking in the complex stored procedure, so we tried running it under the web server account, at the same time it ran from the web server, and still had much better performance from Query Analyzer. Since that one wasn't causing a huge issue, we left it alone.
Both of these are strange behaviors and ones that I've seen asked about before as well as encountered myself at times. There doesn't seem to be any rhyme or reason for it to behave that way. So we decided to go to the source and ask for help.
Luckily a few hours later we were in a Query Plan Stabilization talk at TechEd 2005. In questioning the speaker, there wasn't a good explanation given that seemed to match. Since the query plan has a blob, Andy thought that he might be able to "save off" the query plan that worked and then when he had issues, update syscacheobjects with the old plan and have users then use the good plan. The speaker didn't think that would work and certainly didn't convince Andy. I was skeptical, so we sought out more help.
At the cabanas in TechEd, we found Christian Kleinerman, one of the storage engine developers for SQL Server. He had a few interesting things to say.
On the reason for the query plan suddenly changing and causing performance to go south, he suggested that it was mostly data driven and that the statistics slightly changed, which cause a change in the choice of plan. This could lead to the optimizer deciding a scan was needed instead of a seek. As for why it occurred in the middle of the day, he said that it could be memory pressure and a slight parameter change that kicked out the old plan and forced the recompilation. He didn't have any great ideas on how to prevent this, but he did mention that some SQL Server clients actually schedule the execution of a stored procedure every minute with specific parameters to ensure that the query plan that works remains active. That's an interesting idea.
His other thought on the web server query that ran differently under query analyzer was that the connection options, SET ANSI_NULLS, SET ARITH_ABORT, etc. were different. He said that query analyzer sets a few things by default that are different than the defaults of ADO and other connection methods. His suggestion was to get the options for each place and then compare them, checking to see if a setting might be affecting the way the procedure was executed. Another good idea from an experienced SQL Server developer.
We didn't get any earthshaking information, but we did learn a few things. These little nuggets of wisdom are things that you gain from experience and working through a particular issue. I know we all wish there was some great list that we could reference, or that this stuff was written down somewhere. My guess is that some of this has been documented, but in a long list or KB type reference, it's not likely that you'd remember if, even if you had read it.
Instead, I think that reading a short article like this one, more of a story, might stick with you a little longer. Or might trigger an idea of where you might change your application to run a little better. Let me know if you think so using the "Your Opinion" link below.