
Parameterization Part 4: Handling Non-Uniform Data Distribution


This is the fourth post in the “Parameterization” series. In the previous post I wrote about parameter sniffing, and I mentioned that this method can be a good thing in some cases, but it can also be quite a bad thing in other cases.

One case in which parameter sniffing is a bad thing is when the data in some column is distributed in a non-uniform way. For example, consider the “Status” column in the “Sales.Orders” table, which has a total of about 10M rows. This column has seven unique values with the following distribution:

Status              Number of Rows
Open                314
Pending Approval    561
Approved            28,990
Paid                17,610
Shipped             817,197
Closed              7,922,834
Cancelled           1,032,886


If parameter sniffing is used with the value “Open”, then the optimizer will most probably choose a plan with an index seek and a key lookup. This plan will then be put in cache for reuse. When another user executes the same query with the value “Closed”, the same plan with the index seek and key lookup is going to be reused, and this is probably going to be catastrophic, because now there are going to be around 8 million key lookup operations.
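To make the scenario concrete, here is a hedged sketch of what such a parameterized query might look like. The original query isn't shown in this post, so the procedure name and the columns other than "Status" are assumptions used for illustration only:

-- Hypothetical sketch of the Sales.Orders query; the procedure name and
-- the columns other than "Status" are assumptions.
CREATE PROCEDURE
Sales.usp_OrdersByStatus
(
@Status AS NVARCHAR(20)
)
AS
SELECT
Id ,
CustomerId ,
OrderDateTime
FROM
Sales.Orders
WHERE
Status = @Status;
GO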

Another case in which parameter sniffing is a bad thing is when parameters are used with inequality predicates. For example, consider the following query:

SELECT
Id ,
CustomerId ,
TransactionDateTime ,
StatusId
FROM
Billing.Transactions
WHERE
TransactionDateTime BETWEEN @FromDateTime AND @ToDateTime
ORDER BY
TransactionDateTime ASC;

If the query is compiled with parameter sniffing using the values "2014-07-01" and "2014-08-01", then the optimizer will estimate the number of rows based on statistics and come up with an estimation of around 20,000 rows. It is then going to create a plan based on this estimation and put it in the cache. Subsequent executions might use completely different parameter values. For example, a user might execute the query with the values "2012-01-01" and "2014-01-01". These values would have produced an estimation of around 610,000 rows, but the optimizer never gets the chance to estimate them. The plan based on the 20,000-row estimation is still reused, and it probably won't be a good plan for this execution.

So what can we do when parameter sniffing becomes a bad thing and affects performance?

I’m going to demonstrate a few techniques based on the example stored procedure I used in previous posts:

CREATE PROCEDURE
Marketing.usp_CustomersByCountry
(
@Country AS NCHAR(2)
)
AS
SELECT
Id ,
Name ,
LastPurchaseDate
FROM
Marketing.Customers
WHERE
Country = @Country;
GO

Here is the distribution of the values in the “Country” column:

Country    Number of Rows
BE         70
CL         55
CN         29,956
DK         74
EG         64
IL         72
MT         83
PT         75
TR         63
UK         28,888
US         40,101
VE         78


As you can see, out of 12 unique values, three of them have many rows, while the rest have very few. This is an extreme case of a non-uniform data distribution, and not necessarily one you’ll see in your production environment, but it’s going to help me demonstrate my points…

Before we talk about the possible solutions, let’s demonstrate the problem…

When the stored procedure is executed for the first time with the value “IL”, the plan generated (based on parameter sniffing) includes an index seek on the “Country” column.

This is very good for this specific execution. The optimizer estimated the number of rows as 72, which was 100% accurate in this case.

The next time the stored procedure is executed, it uses the value “US”. The number of rows for “US” is 40,101, and the optimal plan in this case would be to use a clustered index scan, in order to avoid the many key lookups. But the plan already exists in cache, and it is reused. Unfortunately, this plan includes an index seek and a key lookup rather than a clustered index scan, and it is a very poor choice for this execution.
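If you want to reproduce this behavior on a test instance, here is a minimal sketch (DBCC FREEPROCCACHE clears the entire plan cache, so don't run it in production):

-- Demonstration sketch: run on a test instance only.
DBCC FREEPROCCACHE;   -- clear the plan cache so the next execution compiles a fresh plan
GO
SET STATISTICS IO ON;   -- report logical reads per statement
GO
EXECUTE Marketing.usp_CustomersByCountry @Country = N'IL';   -- compiles a seek + key lookup plan
GO
EXECUTE Marketing.usp_CustomersByCountry @Country = N'US';   -- reuses the "IL" plan, performing many key lookups
GO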

If we look at the properties of the index seek operator, we can see that the estimated number of rows is 72, while the actual number of rows is 40,101. The estimate is so far off not because the optimizer tried to estimate and got it wrong (due to inaccurate statistics, for example). In this case, the optimizer didn't even try to estimate the number of rows for the "US" parameter value. It simply reused the cached plan that was created based on the estimation of 72 rows.


If we look at the “Parameter List” property of the SELECT operator, we can see the reason. The compiled value (the one used to create the plan) is “IL”, while the runtime value (the one used in the current execution) is “US”.


This execution resulted in 122,892 logical reads.

Now let's see how bad this plan is for "US". If the stored procedure had been executed for the first time with the value "US", the optimizer would have correctly estimated 40,101 rows, and the plan would have included a clustered index scan.

In this case, the execution resulted in only 671 logical reads.

By the way, the missing index recommendation that appears in green at the top of the execution plan is for a non-clustered index on the "Country" column with the "Id", "Name" and "LastPurchaseDate" columns as included columns. Such an index would allow the optimizer to keep using an index seek, because it eliminates the need to perform key lookups. This might be a good recommendation, but its cost might also be higher than its benefit. Either way, this discussion is beyond the scope of this post, and for our purposes we will ignore the recommendation.
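For reference, the recommended index would look roughly like the sketch below. The index name is mine, and as noted, we are not going to create it here:

-- Sketch of the recommended covering index; the index name is an assumption.
CREATE NONCLUSTERED INDEX ix_Customers_nc_nu_Country_Covering
ON Marketing.Customers (Country)
INCLUDE (Id , Name , LastPurchaseDate);
GO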

So now that we understand the problem, let’s talk about possible solutions…

Solution #1 – sys.sp_recompile

This is a very simple solution. You can use the sys.sp_recompile system stored procedure to remove a specific execution plan from the cache, or even all the plans that reference a specific table or view. This way, the next time the stored procedure is executed, it will be compiled again, and a new execution plan will be created.
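For example, to mark our stored procedure for recompilation, or to mark every plan that references the underlying table:

-- Mark the stored procedure for recompilation on its next execution.
EXECUTE sys.sp_recompile @objname = N'Marketing.usp_CustomersByCountry';
GO
-- Or mark all plans that reference the table.
EXECUTE sys.sp_recompile @objname = N'Marketing.Customers';
GO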

Remember that our problem is the distribution of values and the fact that a plan based on one set of parameter values is not necessarily good for other sets. So recompiling the stored procedure based on a new set of parameter values (the one that will be executed next) will create a good plan for that specific execution, but most probably it won’t solve anything, because the new plan is still going to be only good for some values but not for others.

It's like a casino. Getting a good or bad plan is just a matter of luck, and sys.sp_recompile just gives you another spin of the wheel (or roll of the dice, if you like). Not a very clever solution…

There is one scenario, though, where this solution might be a good choice. Let's go back to the "Sales.Orders" example from the beginning of this post, and assume that the application queries this table and filters on the "Status" column many times, but in 99% of the cases it uses the value "Pending Approval". In this case, whenever the plan is recompiled, no matter what the reason is, and parameter sniffing is used, the plan will most likely be created based on the value "Pending Approval". This is good, because that plan will be good for 99% of future executions. But what happens if we have really bad luck, and just when a recompile occurs, a user executes the query with the value "Closed"? In that case, a completely different plan is created, and it will probably be a very poor choice for 99% of executions. Using sys.sp_recompile in this case is a good bet, because the chances are high (99%) that the next execution will use the value "Pending Approval", and we'll get our best plan back. But remember that this is still just another spin of the wheel (although a much better wheel)…

Solution #2 – WITH RECOMPILE

If you don't like gambling, then this solution is for you. Instead of relying on the parameter values passed in a specific execution, you can tell the optimizer to compile a new plan on each and every execution of the stored procedure.

ALTER PROCEDURE
Marketing.usp_CustomersByCountry
(
@Country AS NCHAR(2)
)
WITH
RECOMPILE
AS
SELECT
Id ,
Name ,
LastPurchaseDate
FROM
Marketing.Customers
WHERE
Country = @Country;
GO

This way, parameter sniffing is used on each execution, which means that every execution gets the best plan that the optimizer can come up with for the actual parameter values. And since a new plan has to be created for each execution, SQL Server won't even bother to put the plan in the cache.

This is a great solution, because it produces the best plan for every execution of the stored procedure, eliminating the casino effect altogether. The downside is that every execution has to go through the expensive process of optimization and compilation, which is CPU intensive. If your system is already high on CPU utilization and the stored procedure is executed quite often, then this might not be a good solution for you. In extreme cases, it might bring your system to its knees.

On the other hand, if CPU utilization is relatively low and the stored procedure is executed only once in a while, then this is a very simple solution that can bring you the best value.
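Before you decide, it can be worth checking how frequently the procedure actually runs and how much CPU it consumes. Here is a minimal sketch that you can run while the plan is still cached (sys.dm_exec_procedure_stats only reports procedures whose plans are currently in the cache):

-- How often does the procedure run, and how much CPU does it consume?
SELECT
OBJECT_NAME (object_id) AS ProcedureName ,
cached_time ,
execution_count ,
total_worker_time / 1000 AS TotalCPUMilliseconds
FROM
sys.dm_exec_procedure_stats
WHERE
database_id = DB_ID ()
AND
object_id = OBJECT_ID (N'Marketing.usp_CustomersByCountry');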

Solution #3 – OPTION (RECOMPILE)

This solution is very similar to the previous one, but it has two important differences. The first is that you specify “RECOMPILE” as a query option only for the problematic query instead of specifying “RECOMPILE” for the whole stored procedure.

ALTER PROCEDURE
Marketing.usp_CustomersByCountry
(
@Country AS NCHAR(2)
)
AS
SELECT
Id ,
Name ,
LastPurchaseDate
FROM
Marketing.Customers
WHERE
Country = @Country
OPTION
(RECOMPILE);
GO

If the stored procedure is composed of many statements, and your problematic query is the 17th statement out of 26, then it’s going to be a real waste to compile the whole stored procedure each time, when only a single query has to be compiled. Using OPTION (RECOMPILE) for the specific query will only recompile the query and not the whole stored procedure. This can save you a lot of resources.

The second difference between "WITH RECOMPILE" at the stored procedure level and "OPTION (RECOMPILE)" at the query level is that the former occurs at compile time, while the latter occurs at run time. When the engine gets to the 17th statement (at run time), it pauses execution, recompiles the query, and continues with the newly generated plan. Once the query has been executed, it continues to execute the rest of the stored procedure using the original plan. Compiling at run time brings another advantage: the optimizer "sees" all the runtime values. It's not even parameter sniffing anymore. The optimizer knows the values of all the parameters, local variables and any other environmental settings, and it can compile the query using this valuable information. In many cases, it can come up with a better plan than a compile-time generated plan.

So you should consider using "OPTION (RECOMPILE)" instead of "WITH RECOMPILE", because it uses fewer resources and it can produce better plans. But remember that this is still a CPU-intensive process, and in some cases, it might kill performance.

Solution #4 – OPTIMIZE FOR

Another query option you can use to fight parameter sniffing problems is “OPTIMIZE FOR”. This option instructs the optimizer to compile the query using a given set of parameter values instead of the actual parameter values as passed by the user. It actually overrides parameter sniffing. Note that this option is used only when the query has to be recompiled anyway. The option by itself doesn’t cause a recompilation.

ALTER PROCEDURE
Marketing.usp_CustomersByCountry
(
@Country AS NCHAR(2)
)
AS
SELECT
Id ,
Name ,
LastPurchaseDate
FROM
Marketing.Customers
WHERE
Country = @Country
OPTION
(OPTIMIZE FOR (@Country = N'US'));
GO

Remember the scenario of the "Sales.Orders" table, where 99% of the executions use the value "Pending Approval"? Instead of using sys.sp_recompile, as suggested above, and hoping that the next execution will use this value, a better choice would be to use OPTIMIZE FOR and instruct the optimizer to use this value regardless of the actual value passed in the next execution.
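Inside the stored procedure or batch that queries the "Sales.Orders" table, that would look roughly like the sketch below. The query itself isn't shown in this post, so the columns and the @Status parameter are assumptions:

-- Hypothetical sketch; the columns and the @Status parameter are assumptions.
SELECT
Id ,
CustomerId
FROM
Sales.Orders
WHERE
Status = @Status
OPTION
(OPTIMIZE FOR (@Status = N'Pending Approval'));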

If you want to disable parameter sniffing altogether, you can do so by using the query option “OPTIMIZE FOR UNKNOWN”. This option instructs the optimizer to treat the parameter value as unknown, which essentially disables parameter sniffing for the parameter. If the stored procedure has multiple parameters, then you can choose how to handle each parameter separately.

ALTER PROCEDURE
Marketing.usp_CustomersByCountry
(
@Country AS NCHAR(2)
)
AS
SELECT
Id ,
Name ,
LastPurchaseDate
FROM
Marketing.Customers
WHERE
Country = @Country
OPTION
(OPTIMIZE FOR (@Country UNKNOWN));
GO


Solution #5 – The Best of All Worlds

As you probably noticed by now, there are two conflicting goals that we're trying to achieve. One is to produce the best plan for each execution, and the other is to minimize the number of compilations to avoid excessive resource utilization. The "WITH RECOMPILE" solution, for example, achieves the first goal, but it requires a compilation for each execution. On the other hand, the sys.sp_recompile solution recompiles the stored procedure only once, but it doesn't help much in producing the best plan for each execution.

This solution tries to balance these two conflicting goals. The idea is to split the values of the parameters into groups, where each group has a different optimal plan, and then generate the optimal plan for each group. Each plan is compiled only once, and from that point on, every execution gets the best plan: based on the parameter values, the appropriate group is determined, and the corresponding plan for that group is used.

Sounds like magic, doesn’t it? Let’s see how this magic is done…

First, we need to split the values into groups. This is the tricky part, and there are all kinds of ways to do it. In our simple example, we're going to split between the common countries and the uncommon countries, where a common country is defined as one that accounts for at least 1% of the rows in the table. We assume that SQL Server has already identified the common countries for us in the statistics of the "Country" column, because when SQL Server creates statistics, it usually uses common values as entries in the histogram.

So we’re going to store the histogram of the “Country” column in our own table called “CommonCountries”, and then we’re going to delete all the uncommon values…

CREATE TABLE
Marketing.CommonCountries
(
RANGE_HI_KEY NCHAR(2) NOT NULL ,
RANGE_ROWS INT NOT NULL ,
EQ_ROWS INT NOT NULL ,
DISTINCT_RANGE_ROWS INT NOT NULL ,
AVG_RANGE_ROWS FLOAT NOT NULL ,
CONSTRAINT
pk_CommonCountries_c_RANGEHIKEY
PRIMARY KEY CLUSTERED
(RANGE_HI_KEY ASC)
);
GO
INSERT INTO
Marketing.CommonCountries
(
RANGE_HI_KEY ,
RANGE_ROWS ,
EQ_ROWS ,
DISTINCT_RANGE_ROWS ,
AVG_RANGE_ROWS
)
EXECUTE ('DBCC SHOW_STATISTICS (N''Marketing.Customers'' , ix_Customers_nc_nu_Country) WITH HISTOGRAM');
GO
DECLARE
@RowCount AS INT;
SELECT
@RowCount = COUNT (*)
FROM
Marketing.Customers;
DELETE FROM
Marketing.CommonCountries
WHERE
EQ_ROWS < @RowCount * 0.01;
GO

Here is the content of the table:

RANGE_HI_KEY    RANGE_ROWS    EQ_ROWS    DISTINCT_RANGE_ROWS    AVG_RANGE_ROWS
CN              0             29956      0                      1
UK              0             28888      0                      1
US              0             40101      0                      1


Nice, isn’t it? These are exactly the three common countries in our example. Of course, our example is simple, and life is usually not that simple. You’ll need to come up with an algorithm that can distinguish the common values from the uncommon values in your production table. You might use statistics like I did. You might use some kind of monitoring mechanism to track used values and corresponding plans. You might need to develop your own statistics mechanism. Whatever works for you. It’s tricky and complex, but in most cases it is possible to come up with an algorithm that splits the values into the appropriate groups.

So now that we have the groups (in our case – the common countries vs. the uncommon countries), we need to generate the optimal plan for each one. The way to achieve that is to create a stored procedure for each group. The stored procedures are identical (except for the name), and they are also identical to the original stored procedure.

In our case, we are going to create the “Marketing.usp_CustomersByCountry_Common” stored procedure and the “Marketing.usp_CustomersByCountry_Uncommon” stored procedure…

CREATE PROCEDURE
Marketing.usp_CustomersByCountry_Common
(
@Country AS NCHAR(2)
)
AS
SELECT
Id ,
Name ,
LastPurchaseDate
FROM
Marketing.Customers
WHERE
Country = @Country;
GO
CREATE PROCEDURE
Marketing.usp_CustomersByCountry_Uncommon
(
@Country AS NCHAR(2)
)
AS
SELECT
Id ,
Name ,
LastPurchaseDate
FROM
Marketing.Customers
WHERE
Country = @Country;
GO

What’s the point in making duplicates of the same stored procedure, you might ask… Hold on, I’m getting there…

Now we are going to modify the original stored procedure – the one that the application calls. This stored procedure becomes a router. Its job is to examine the parameter values and to decide to which group they belong. Then it executes the stored procedure that corresponds to that group.

ALTER PROCEDURE
Marketing.usp_CustomersByCountry
(
@Country AS NCHAR(2)
)
AS
IF
EXISTS
(
SELECT
NULL
FROM
Marketing.CommonCountries
WHERE
RANGE_HI_KEY = @Country
)
BEGIN
EXECUTE Marketing.usp_CustomersByCountry_Common
@Country = @Country;
END
ELSE
BEGIN
EXECUTE Marketing.usp_CustomersByCountry_Uncommon
@Country = @Country;
END;
GO

Here is the beauty of this solution:

The first time a common country is used, the router stored procedure is going to call the common stored procedure. Since this is the first time it is executed, a plan is going to be generated and placed in cache. Thanks to parameter sniffing, the plan is going to be good for common countries. From this point forward, whenever the router stored procedure is executed with a common country, it is going to execute the common stored procedure, which is going to reuse its plan, which is good for common countries.

And the same goes for uncommon countries… The first time the router stored procedure is executed with an uncommon value, it is going to execute the uncommon stored procedure, which will produce a good plan for uncommon countries. And whenever an uncommon value is used again, the same stored procedure is going to be used, and the same good plan is going to be reused.

So we get a good plan for each set of parameter values, and each good plan is compiled only once. Usually there are only two or three groups, so there are only two or three compilations. Magic, isn’t it?

Again, the tricky part here is to implement an algorithm to split the values into groups. Another thing you need to be aware of is the maintenance of this solution. Since data is changing, your algorithm has to be maintained and adjusted over time. In our example, there should be a job that recreates the “CommonCountries” table every once in a while.
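Here is a minimal sketch of such a refresh routine, based on the same code we used above. The procedure name is mine; you could schedule it with a SQL Server Agent job, for example:

-- Refresh the list of common countries from the current statistics histogram.
CREATE PROCEDURE
Marketing.usp_RefreshCommonCountries
AS
BEGIN

TRUNCATE TABLE Marketing.CommonCountries;

INSERT INTO
Marketing.CommonCountries
(
RANGE_HI_KEY ,
RANGE_ROWS ,
EQ_ROWS ,
DISTINCT_RANGE_ROWS ,
AVG_RANGE_ROWS
)
EXECUTE ('DBCC SHOW_STATISTICS (N''Marketing.Customers'' , ix_Customers_nc_nu_Country) WITH HISTOGRAM');

DECLARE @RowCount AS INT;

SELECT
@RowCount = COUNT (*)
FROM
Marketing.Customers;

DELETE FROM
Marketing.CommonCountries
WHERE
EQ_ROWS < @RowCount * 0.01;

-- Optionally, mark the two per-group stored procedures for recompilation
-- (sys.sp_recompile), so they pick up plans based on the refreshed groups.

END;
GO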

Parameter sniffing can be good or bad. Since it is used by default in SQL Server, as long as it’s good, you’re fine. Your goal is to identify the scenarios in which parameter sniffing is bad, and then apply one of the solutions mentioned here in order to solve the problem.

In the next post in the series I’m going to deal with two common mistakes that developers make when using parameters. From my experience, there is a good chance you’ll find these use cases in your application, and there is a good chance that by fixing the code you are going to gain a big performance improvement, and you are going to be a hero. Stay tuned…

