The Query Optimizer: Date Correlation Optimisation

In SQL Server 2005, a feature was introduced that was hardly noticed, but which might make a great difference to anyone doing queries involving temporal data. For anyone doing Data Warehousing, timetabling, or time-based pricing, this could speed up your queries considerably. Who better to introduce this than Query Optimizer expert, Fabiano Amorim?

Since SQL Server 2005, there has been a database option which can be SET at database level called DATE_CORRELATION_OPTIMIZATION: This feature is OFF by default. Personally, I have never seen anyone use this feature, but I believe that this is only because there is so little material around that refers to, or explains, this feature.

It is very important that the developer or DBA knows the type of application and database query that would benefit of this feature, since it can produce striking increases in performance; but, at this point, you are probably wondering, ‘what does this feature actually do?’

Tying Dates together

When enabled, the DATE_CORRELATION_OPTIMIZATION allows the Query Optimizer to improve the performance of certain queries by gathering some statistical information about the DATETIME columns of two tables which are linked by one-column foreign key. This collected data is generally known as correlation statistics (not to be confused with Query Optimization Statistics), because they help the Query Optimizer to identify a correlation between two DATETIME columns. With this information, the Query Optimizer can make the best decision to obtain a better execution plan, and which data can be filtered to avoid unnecessary page reads.

Let’s try an example:  we have one table with some Orders (Pedidos) and a column called OrderDate(Data_Pedido), and another table called OrderItems(Items) and a column called DeliveryDate (Data_Entrega) where the delivery date is not equal to the order date but where the delivery date is within a range of dates after an order. In other words, where the two dates correlate, but are not equal. As an example of this, Imagine that  I’ve made a purchase at submarino.com.br, buying a lot of different products, from books to a plush toys. Perhaps I ought to explain that my wife likes penguins, so I buy one. Anyway, despite the fact that I only made one order, I received each product on different days, because each product came from different places.  

In this example, we have a table called orders with the date of the order, and a table called OrderItems with the delivery date of each product. We can be fairly confident in saying that the date of order is always very close to the delivery date of each item. That represents a clear correlation between the OrderDate and the DeliveryDate.

I hope that this explains what is meant by a correlation between the two DATETIME columns. So let’s get back to the optimization.

Suppose you have a query which has a join between the Order and OrderItems columns, and you are doing a filter in one of DATETIME column, the SQL Server can modify your query to make that better without you seeing any change; All internally and automatically. How?   Let’s go start with an example.

First of all we’ll create one metadata to allow us see tables close to the example I’ve described.

The names of the tables are in Portuguese, but I’m sure this will not be a problem to you.

After the tables are created, we’ll have one structure close to what has been presented above, let’s go see one sample of the data.

826-image001.png

Now, suppose one developer builds the following query.

Translating, I want all data from Orders and Items sold between 2009/03/01 and 2009/03/05. To this query the Query Optimizer created the follow execution plan:

826-image004.gif

SELECT *    FROM Pedidos   INNER JOIN Items      ON Pedidos.ID_Pedido = Items.ID_Pedido   WHERE Pedidos.Data_Pedido BETWEEN ‘20090301’ AND ‘20090305’
  |–Hash Match(Inner Join, HASH:([Performance].[dbo].[Pedidos].[ID_Pedido])=([Performance].[dbo].[Items].[ID_Pedido]))
       |–Clustered Index Scan(OBJECT:([Performance].[dbo].[Pedidos].[xpk_Pedidos]), WHERE:([Performance].[dbo].[Pedidos].[Data_Pedido]>=’2009-03-01′ AND [Performance].[dbo].[Pedidos].[Data_Pedido]<=’2009-03-05′))
       |–Clustered Index Scan(OBJECT:([Performance].[dbo].[Items].[ix_Data_Entrega]))

In the text execution plan, we can easily see that the filter specified in the WHERE clause, was applied where the data was reading from index xpk_Pedidos, and the column which was received the filter was just the Data_Pedido specified in the WHERE clause.

The first thing we could do to optimize this query, is to create one index based on Data_Pedido column. You may be thinking that this would be the first thing that anyone would do with the query. Ok, you are right;  but we will just create this index and continue with the optimization.

After create the index we have the follow plan:

826-image006.gif

SELECT *    FROM Pedidos   INNER JOIN Items ON Pedidos.ID_Pedido = Items.ID_Pedido   WHERE Pedidos.Data_Pedido BETWEEN ‘20090301’ AND ‘20090305’
  |–Hash Match(Inner Join, HASH:([Performance].[dbo].[Pedidos].[ID_Pedido])=([Performance].[dbo].[Items].[ID_Pedido]))
       |–Index Seek(OBJECT:([dbo].[Pedidos].[ix_teste]), SEEK:([dbo].[Pedidos].[Data_Pedido] >= ‘2009-03-01’ AND [dbo].[Pedidos].[Data_Pedido] <= ‘2009-03-05’) ORDERED FORWARD)
       |–Clustered Index Scan(OBJECT:([Performance].[dbo].[Items].[ix_Data_Entrega]))

Now the query is a little better, instead of making a Clustered Index Scan to read the data from Pedidos table, the SQL makes an Index Seek based on the index which we just created.  Now, what else we could do to improve this further? It is time to see the Correlation_Optimization work.

As we can see, there is a point of bottleneck in the execution plan to read the data from the table Items: According to the execution plan, this read represents 49% of the entire cost, and the execution plan was using the clustered Index Scan operator to read the data from the table. If we could do something to use the Index Seek operator, then maybe the query could be improved. Is there something that we can do to the Query Optimizer use the Index Seek operator?

Knowing the database, the metadata and the business rules, we could modify the query to use one filter in the Items table. But, to do that, we need to be sure about this, because this change can’t modify the result of the query. One way to do this would be like:

  1. Based on the Orders between 2009/03/01 and 2009/03/05, look which are the min and the max delivery date from the Orders Items.
  2. With that information, put one new filter on the query to column Items.Data_Entrega.

After that, we have the following query:

This way we could force a filter on the Items table as in the plan:

826-image008.gif

But, I now have a question. How to do to know the values specified on Item 1 and 2 above mentioned? Is there a way to know these values without searching the tables? The problem is that  these values are not fixed values, but  are variables.

What about letting the SQL Server Query Optimizer figure out that for you? All you need to do is to enable the DATE_CORRELATION_OPTIMIZATION, then the SQL Server Query Optimizer will identify that for you, and will apply the filter on the Data_Entrega column itself. Wow! pretty smart isn’t it?  let me see if that is really true….

Let’s now run the query without the Data_Entrega filter again.

826-Rescaled.jpg

SELECT *    FROM Pedidos   INNER JOIN Items      ON Pedidos.ID_Pedido = Items.ID_Pedido   WHERE Pedidos.Data_Pedido BETWEEN ‘20090301’ AND ‘20090305’
  |–Hash Match(Inner Join, HASH:([dbo].[Pedidos].[ID_Pedido])=([dbo].[Items].[ID_Pedido]))
       |–Index Seek(OBJECT:([dbo].[Pedidos].[ix_teste]), SEEK:([dbo].[Pedidos].[Data_Pedido] >= ‘2009-03-01’ AND [dbo].[Pedidos].[Data_Pedido] <= ‘2009-03-05’) ORDERED FORWARD)
       |–Clustered Index Seek(OBJECT:([dbo].[Items].[ix_Data_Entrega]), SEEK:([dbo].[Items].[Data_Entrega] >= ‘2009-02-28’ AND [dbo].[Items].[Data_Entrega] < ‘2009-04-29’) ORDERED FORWARD)

Look;  this time, even when we have not specified the Data_Entrega filter, the Query Optimizer builds this for me.

Summary

The DATE_CORRELATION_OPTIMIZATION is a feature that is very good as a way of improving the performance of queries that use joins based on DATETIME values that are related but not necessarily equal, but it is almost never used outside specialized data-warehousing applications and BI.  This may be because there needs to be a good analysis, and prior thought, about when and where this can be used.

All the following conditions must be met to this feature work:

  • The database SET options must be set in the following way. All the following database options, ANSI_NULLS, ANSI_PADDING, ANSI_WARNINGS, ARITHABORT, CONCAT_NULL_YIELDS_NULL, and QUOTED IDENTIFIER must be SET to ON. NUMERIC_ROUNDABORT must be SET to OFF.
  • There must be a single-column foreign key relationship between the tables.
  • The tables must both have DATETIME columns that are defined NOT NULL.
  • At least one of the DATETIME columns must be the key column of a clustered index (if the index key is composite, it must be the first key), or it must be the partitioning column, if it is a partitioned table.
  • Both tables must be owned by the same user.

If you can meet these conditions and you are dealing with temporal data, then you could be in for a pleasant surprise