
Advanced Troubleshooting Week at SQL University, Lesson 2


Welcome back to Advanced Troubleshooting Week at SQL University. I'm your guest professor for this course, and today I will continue talking about advanced troubleshooting. I'm delighted to see that everyone returned safely from their battle with those dastardly Nargles. Nargles are actually fairly innocuous once you learn their weaknesses. Let's move on to something a little fiercer ... grindylows.

Grindylows are those nasty little water demons who trick SQL Server into performing a table or clustered index scan even though there are perfectly good indexes it could use instead. We've all heard about those instances where people claim that SQL Server simply won't use their index. The advice given back to these poor souls is pretty much the same every time: update statistics, or update statistics with a full scan if that doesn't work. And if that still doesn't work, we love to tell them that SQL Server must think the clustered index scan is faster. But why would SQL Server think that a clustered index scan is faster than a nonclustered index seek? Do any of us really believe that SQL Server actually thinks a scan is faster?

In today's session, we're going to take a look at these cases where SQL Server just won't seem to use the indexes available. Also, if you have not already done so, be sure to take a look at Lesson 1: Doing Battle With Nargles.

Cost Thresholds

The truth is that yes, SQL Server does think a scan is faster in some cases. The reason is that the query cannot be satisfied by the index alone and must look up extra columns of data. This lookup was referred to as a bookmark lookup in older versions of SQL Server. The operation has since been renamed: it is a key lookup when it uses a clustered index, and a RID lookup when there is no clustered index and it uses the base table (heap) for the lookups. At some point, the cost of doing these lookups plus the index seek exceeds the cost of just doing a clustered index scan.
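To make this concrete, here is a minimal sketch of the kind of setup that produces key lookups. The table, columns, and index names are hypothetical, invented purely for illustration:

CREATE TABLE dbo.Customers
(
      CustomerID int IDENTITY(1,1) PRIMARY KEY CLUSTERED,
      CustomerName nvarchar(100),
      City nvarchar(50),
      State char(2)
);

CREATE NONCLUSTERED INDEX IX_Customers_City
      ON dbo.Customers (City);

-- The index can seek on City, but CustomerName and State are not in the
-- index, so each matching row requires a key lookup into the clustered
-- index to fetch the remaining columns.
SELECT CustomerName, State
FROM dbo.Customers
WHERE City = N'Kalamazoo';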

One misconception out there is that key lookups should be faster than index scans because lookups fetch single records whereas an index scan reads whole pages. But SQL Server will never read in a part of an index smaller than a page; the page is the smallest unit it will ever read. So yes, SQL Server is picking out individual records, but it has to read in the entire page a record sits on to get it. Since the matching records are most likely spread randomly throughout the clustered index, there will not be many cases where it can satisfy multiple lookups from a single page.
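You don't have to take my word on the page reads, either. Using the hypothetical Customers table sketched above, you can turn on I/O statistics and watch the counts yourself:

SET STATISTICS IO ON;

SELECT CustomerName, State
FROM dbo.Customers
WHERE City = N'Kalamazoo';

-- The Messages tab will report something like:
--       Table 'Customers'. Scan count 1, logical reads 12, ...
-- Logical reads are counted in whole 8 KB pages, so even single-record
-- lookups show up as full page reads.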

The engine chooses between possible plans by assigning a cost value to each plan. Exactly how it arrives at that cost value under the covers is a carefully guarded secret, though much of it is known or has been figured out. The engine uses statistics to estimate how many pages will have to be read to satisfy the query. Whichever plan has the lowest estimated number of reads is the plan chosen.
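If you want to see the raw numbers the optimizer is estimating from, you can inspect the statistics object behind the index. Again using the hypothetical index from above:

-- Shows the header, density vector, and histogram the optimizer uses
-- to estimate how many rows match a given value of City.
DBCC SHOW_STATISTICS ('dbo.Customers', IX_Customers_City);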

So let's take for example a query against a table that contains customer data, including the city and state of each customer's address. If we query for customers who live in Kalamazoo, the number of potential matches is going to be very low. Let's say there are 1 million records in the table, and the statistics indicate that 10 customers live in Kalamazoo. If there is an index on city, the index records will be fairly small and a lot of them will fit on every page. It is likely that all of the index records needed for the query will fit on a single page, though they could possibly spill onto a second page. So SQL Server will need to read 1 or 2 pages to find the matching cities in the nonclustered index and then perform 10 key lookups to find the other columns being requested, reading a single page for each lookup. If there is an average of 750 records per page in the clustered index, there would be 1,334 pages. If SQL Server estimates that it will need to read 3/4 of the clustered index to find all matches, then it estimates that a clustered index scan would require 1,001 page reads. This example is a no-brainer; 12 page reads is better than 1,001.

Now, if I query for customers who live in Los Angeles, the number is going to be much higher. If you have 1,200 customers in Los Angeles, a nonclustered index seek would still likely require only 1 or 2 page reads due to the small size of the index records, but it would also need 1,200 page reads for the key lookups. If SQL Server again estimates that it will need to scan 3/4 of the clustered index, that would be 1,001 page reads. So it's 1,202 page reads against 1,001, and SQL Server will opt to perform the clustered index scan. However, in some cases, it might simply reuse the already cached query plan created for the Kalamazoo query. Here you won't notice the difference because the page read counts are pretty close to each other.

So what happens when you query for customers in New York City, where you have 8,000 customers? It sounds like SQL Server should choose the best plan, which would be a clustered index scan. But the optimizer's goal is not to find the best plan; the optimizer's goal is to find a good plan fast. In some cases, SQL Server will once again opt to use the already cached plan and perform an index seek plus key lookups. The query would require 8,000+ page reads, roughly 8 times as many as the clustered index scan. This is a noticeable difference. The tricky part is that when you look at the query plan, your first thought will be that SQL Server is choosing the right plan because it is using the index. As I laid it out though, you can see that for this particular version of the query, the best plan would be the clustered index scan.

Parameter Sniffing

The problem in the last example above is called parameter sniffing. When a query plan is compiled based on the parameter values of the first execution and then cached and reused for subsequent variations of the parameter without re-evaluating the plan, that is parameter sniffing. The problem is that the plan being reused may be good for most parameter values but very bad for others. Think of a case where data is fairly evenly distributed except for a few values. Using the example above, if SQL Server is performing a nonclustered index seek plus key lookups and the resultset is suddenly many times larger than expected, the number of pages that must be read can skyrocket, and a normally quick query can take minutes or hours to run. This problem would not arise if the query had been recompiled and a new plan generated.
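In stored procedure form, the trap looks something like this (again, a sketch against the hypothetical Customers table from earlier):

CREATE PROCEDURE dbo.GetCustomersByCity
      @City nvarchar(50)
AS
BEGIN
      -- The plan is compiled using whatever @City value is sniffed on the
      -- first execution and is then cached and reused for all later calls.
      SELECT CustomerName, State
      FROM dbo.Customers
      WHERE City = @City;
END;
GO

EXEC dbo.GetCustomersByCity @City = N'Kalamazoo'; -- plan built for ~10 rows
EXEC dbo.GetCustomersByCity @City = N'New York';  -- reuses that same plan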

Parameter sniffing can be troublesome to figure out because the plan may work great for 99% of the possible parameters and miserably for the last 1%. We had an issue with an event processing job that would process events added to our system 1 million events at a time, resulting in 11 batches of 1 million events each run. For the first 9 batches, it worked great. For the 10th batch, performance was terrible. Unfortunately, someone got it in their head that it was an issue with statistics being out of date. Whenever performance would tank, they would update stats on the table and performance would improve. The reason it improved is that updating statistics caused the plan to recompile. This was masking the real problem, and it took quite a lot of convincing to get people to believe me that it wasn't a statistics issue. This is the trouble with throwing fixes at a problem without taking the time to diagnose it.

We were able to diagnose the issue and determine that it was parameter sniffing. We were using SQL Server 2005, so our options were limited. We decided to use the RECOMPILE option to force SQL Server to recompile the query every time the procedure ran. We incurred a small amount of overhead at the start of each run, but we no longer experienced the very bad performance. In the meantime, our dev team reworked the query so that it was no longer susceptible to parameter sniffing. If we had been using SQL Server 2008, we would have had more options. Here are some common ways to address this issue in SQL Server 2008 (sketches of the syntax follow the list):

  • RECOMPILE option - recompiles every run, extra overhead every run
  • Plan guides - assigns a plan already in cache as the plan to always use, could make other runs of the query slower
  • OPTIMIZE FOR UNKNOWN option - creates the plan based on statistical data instead of the initial parameter values
  • OPTIMIZE FOR @variable = 'Value' - uses a specific value for determining the query plan, may not always be the best plan for all values
  • Create a covering index so that a nonclustered index seek can satisfy the query without needing key lookups
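For illustration, here is roughly what those options look like against the hypothetical query from earlier (plan guides are more involved, so I'll skip them here):

-- Recompile on every execution (also the SQL Server 2005 fix we used):
SELECT CustomerName, State
FROM dbo.Customers
WHERE City = @City
OPTION (RECOMPILE);

-- Build the plan from the overall density statistics rather than the
-- first sniffed value (SQL Server 2008 and later):
SELECT CustomerName, State
FROM dbo.Customers
WHERE City = @City
OPTION (OPTIMIZE FOR UNKNOWN);

-- Build the plan as if a specific, representative value had been passed:
SELECT CustomerName, State
FROM dbo.Customers
WHERE City = @City
OPTION (OPTIMIZE FOR (@City = N'Los Angeles'));

-- Or eliminate the key lookups entirely with a covering index, so the
-- plan choice no longer depends on the sniffed value:
CREATE NONCLUSTERED INDEX IX_Customers_City_Covering
      ON dbo.Customers (City)
      INCLUDE (CustomerName, State);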

An Interesting Case of Poor Index Performance

I was helping someone troubleshoot a performance problem recently that was interesting to me. Before I give too much away, let's take a look at the query plan.

My first thought from looking at the query plan, had I not seen the query, was that it is using the available indexes but the joins are not optimized. The biggest cost is the merge join, so perhaps performance would improve if we forced a loop join. Let's take a look at the query now and see if we can force a loop join on the first join.

Select ResultVal
From dbo.MyTable with(nolock)
Where ObjectID = 8
      And GUID = '7FD2B0DB-66E2-44AE-AB5B-C11EECA9337F'
      And ResultTypeID = 'weeref'

Soooo .... where are the joins? What we are seeing is index intersection: SQL Server is using multiple indexes to satisfy the query and then joining the indexes together. Great, so how do you force a loop join for an index intersection? The simple answer is that you don't. Performance is better than if it were doing a table scan, but not by a lot. So what do we do here? My thought was to add a covering index to avoid having to join multiple indexes together. It just so happens that the query plan was suggesting the same thing. I tested the index that SQL Server recommended (which was slightly different from the way I would have written it). This is the index I tested and the resulting plan:

CREATE INDEX NCIX_MyTable_ObjectID_GUID
      ON dbo.MyTable
            (
            ObjectID ASC,
            GUID ASC
            )
      INCLUDE ( ResultTypeID,
            ResultVal )

 

And this is the improvement we can see by comparing the two plans:

 

Preview of Lesson 3

For the next lesson, I want to take a look at something that scares a lot of people but really isn't scary at all once you understand it. Yes folks, I'm talking about boggarts!! Boggarts are known for taking the shape of whatever scares a person the most. For many DBAs, the shape a boggart takes when facing them is a Kerberos error. So we'll take a look at troubleshooting Kerberos connections, and hopefully we can wave our wands, shout Riddikulus, and get rid of those boggarts for good.
