07 September 2021

15123 views

15 0

How LAG compares to other techniques

LAG pulls a column from another row without a self-join. In this article, Kathi Kellenberger shows how LAG compares to other techniques

Microsoft introduced the first window (aka, windowing or windowed) functions with SQL Server 2005. These functions were ROW_NUMBER, RANK, DENSE_RANK, NTILE, and the window aggregates. Many folks, including myself, used these functions without realizing they were part of a special group. In 2012, Microsoft added several more: LAG and LEAD, FIRST_VALUE and LAST_VALUE, PERCENT_RANK and CUME_DIST, PERCENTILE_CONT, and PERCENTILE_DISC. They also added the ability to do running totals and moving calculations.

These functions were promoted as improving performance over older techniques, but that isn’t always the case. There were still performance problems with the aggregate functions introduced in 2005 and the four of the functions introduced in 2012. In 2019, Microsoft introduced Batch Mode on Row Store, available on Enterprise and Developer Editions, that can improve the performance of window aggregates and the four statistical functions from 2012.

I started writing this article to compare some window function solutions to traditional solutions. I found that there were so many ways to write a query that includes a column from another row that this article is dedicated to the window functions LAG and LEAD.

Include a column from another row using LAG

Including a column from another row typically means some a self-join. The code is somewhat difficult to write and doesn’t perform well. The LAG function can be used to pull in the previous row without a self-join. As long as there are adequate indexes in place, using LAG solves those problems. LEAD works the same way as LAG, except that it grabs a later row.

Here’s an example using LAG in the AdventureWorks database:

SET STATISTICS IO, TIME ON;

--LAG

SELECT PROD.ProductID, PROD.Name, SOH.OrderDate,

DATEDIFF(DAY, LAG(SOH.OrderDate)

OVER(PARTITION BY PROD.ProductID

ORDER BY SOH.OrderDate),SOH.OrderDate) AS DaysBetweenOrders

FROM Production.Product AS PROD

JOIN Sales.SalesOrderDetail AS SOD

ON SOD.ProductID = PROD.ProductID

JOIN Sales.SalesOrderHeader AS SOH

ON SOH.SalesOrderID = SOD.SalesOrderID

GROUP BY PROD.ProductID, PROD.Name, SOH.OrderDate

ORDER BY PROD.ProductID, SOH.OrderDate;

This query returns a list of products and the dates they were ordered. It uses the DATEDIFF function to compare the current order date to the prior order date determined with LAG. The OVER clause uses a PARTITION BY on ProductID to ensure that different products are not compared. The OVER clause for LAG requires an ORDER BY. It is ordered by OrderDate since that is how the rows should be lined up to find the previous date. Figure 1 shows the partial results.

Image showing the partial results

Figure 1. The partial results of the query using LAG

The query ran in under a second, about 300 ms, and required just 365 logical reads, as shown in Figure 2.

Image showing logical reads from LAG

Figure 2. The logical reads of the LAG query

I tried several methods to see if it was possible to write a query that performed well without LAG. Even though the database is small, the queries took some time to run, depending on the technique.

Self-join

The self-join technique is painfully slow. Keeping the first order row of each ProductID in the results required LEFT JOINS, but this was so slow, I killed the query after a few minutes. Instead, the following query omits the NULL row for each product.

--Self-join

SELECT PROD.ProductID, PROD.Name, SOH.OrderDate,

DATEDIFF(DAY, MAX(SOH2.OrderDate), SOH.OrderDate) AS DaysBetweenOrders

FROM Production.Product AS PROD

JOIN Sales.SalesOrderDetail AS SOD

ON SOD.ProductID = PROD.ProductID

JOIN Sales.SalesOrderHeader AS SOH

ON SOH.SalesOrderID = SOD.SalesOrderID

JOIN Sales.SalesOrderDetail AS SOD2

ON SOD2.ProductID = PROD.ProductID

JOIN Sales.SalesOrderHeader AS SOH2

ON SOH2.SalesOrderID = SOD2.SalesOrderID

WHERE SOH2.OrderDate < SOH.OrderDate

GROUP BY PROD.ProductID

, PROD.Name

, SOH.OrderDate

, SOD2.ProductID

ORDER BY PROD.ProductID, SOH.OrderDate;

The query ran in 20 seconds and had 3,103 logical reads, as shown in Figure 3.

Statistics IO from self-join

Figure 3. Logical reads of self-join query

The query uses the MAX function To find the previous OrderDate and filters to find rows in SOH2 with OrderDate less than the OrderDate in SOD.

Derived table

Is it possible to improve the performance of the self-join with a derived table? Here’s a query to find out:

--Derived tables

SELECT S1.ProductID, S1.Name, S1.OrderDate,

DATEDIFF(DAY,MAX(S2.OrderDate),S1.OrderDate) AS DaysBetweenOrders

FROM

(SELECT Prod.ProductID, Prod.Name, SOH.OrderDate

FROM Production.Product AS PROD

JOIN Sales.SalesOrderDetail AS SOD

ON SOD.ProductID = PROD.ProductID

JOIN Sales.SalesOrderHeader AS SOH

ON SOH.SalesOrderID = SOD.SalesOrderID

GROUP BY PROD.ProductID

, PROD.Name

, SOH.OrderDate

) AS S1

LEFT JOIN (

SELECT SOD.ProductID, SOH.OrderDate

FROM Sales.SalesOrderDetail AS SOD

JOIN Sales.SalesOrderHeader AS SOH

ON SOH.SalesOrderID = SOD.SalesOrderID

GROUP BY SOD.ProductID

, SOH.OrderDate

) AS S2

ON S2.ProductID = S1.ProductID AND S1.OrderDate > S2.OrderDate

GROUP BY S1.ProductID

, S1.Name

, S1.OrderDate

ORDER BY S1.ProductID, S1.OrderDate;

I was surprised to find that this query ran in 2 seconds even though the logical reads were much higher!

Statistics IO from derived table

Figure 4. Logical reads of derived table query

That’s quite good, though not as good as the query with LAG.

Common table expression

Another way to solve the query is with a common table expression (CTE):

--CTE

WITH Products AS (

SELECT PROD.ProductID, PROD.Name, SOH.OrderDate

FROM Production.Product AS PROD

JOIN Sales.SalesOrderDetail AS SOD

ON SOD.ProductID = PROD.ProductID

JOIN Sales.SalesOrderHeader AS SOH

ON SOH.SalesOrderID = SOD.SalesOrderID

GROUP BY PROD.ProductID, PROD.Name, SOH.OrderDate

)

SELECT P1.ProductID, P1.Name, P1.OrderDate,

DATEDIFF(DAY, MAX(P2.OrderDate), P1.OrderDate) AS DaysBetweenOrders

FROM Products P1

LEFT JOIN Products P2

ON P2.ProductID = P1.ProductID

WHERE P1.OrderDate > P2.OrderDate

GROUP BY P1.ProductID, P1.Name, P1.OrderDate;

This query also ran in 2 seconds with the same logical reads as the derived table. What’s going on here? The CTE is not saving the results to be reused – the tables are accessed twice – but the joins and some other operators are different. The optimizer was able to come up with better plans by using derived tables or a CTE.

OUTER APPLY

The APPLY operator is often used to improve the performance of queries. There are two flavors: CROSS APPLY and OUTER APPLY, similar to JOIN and LEFT JOIN. The APPLY operator can be used to solve many interesting queries, and in this case, it OUTER APPLY replaces the LEFT JOIN.

--OUTER APPLY

SELECT PROD.ProductID, PROD.Name, SOH.OrderDate,

DATEDIFF(DAY, S2.PrevOrderDate, SOH.OrderDate) AS DaysBetweenOrders

FROM Production.Product AS PROD

JOIN Sales.SalesOrderDetail AS SOD

ON SOD.ProductID = PROD.ProductID

JOIN Sales.SalesOrderHeader AS SOH

ON SOH.SalesOrderID = SOD.SalesOrderID

OUTER APPLY (

SELECT MAX(SOH2.OrderDate) AS PrevOrderDate

FROM Sales.SalesOrderDetail AS SOD2

JOIN Sales.SalesOrderHeader AS SOH2

ON SOH2.SalesOrderID = SOD2.SalesOrderID

WHERE SOD2.ProductID = PROD.ProductID

AND SOH2.OrderDate < SOH.OrderDate) S2

GROUP BY DATEDIFF(DAY, S2.PrevOrderDate, SOH.OrderDate)

, PROD.ProductID

, PROD.Name

, SOH.OrderDate

ORDER BY PROD.ProductID, SOH.OrderDate;

This query ran in 12 seconds and had a whopping 86,281,577 logical reads!

Statistic IO from OUTER APPLY

Figure 5. Logical reads from OUTER APPLY query

OUTER APPLY is acting like a function call in this scenario calling OUTER APPLY once for each row in the outer query, which would probably not be the case if the less than operator was not involved.

It’s possible to use TOP(1) instead of MAX, but then the query inside OUTER APPLY must also be ordered, and the results are not any better.

Temp Table

Since whatever method is used needs a distinct list of ProductID, Name, and OrderDate, these rows could be stored in a temp table.

--Create temp table

DROP TABLE IF EXISTS #ProductList;

SELECT PROD.ProductID, PROD.Name, SOH.OrderDate

INTO #ProductList

FROM Production.Product AS PROD

JOIN Sales.SalesOrderDetail AS SOD

ON SOD.ProductID = PROD.ProductID

JOIN Sales.SalesOrderHeader AS SOH

ON SOH.SalesOrderID = SOD.SalesOrderID

GROUP BY PROD.ProductID

, PROD.Name

, SOH.OrderDate;

Here’s an example where the temp table was used with OUTER APPLY:

--OUTER APPLY with Temp table

SELECT PL.ProductID, PL.Name, PL.OrderDate,

DATEDIFF(DAY, PrevOrder.OrderDate, PL.OrderDate) AS DaysBetweenOrders

FROM #ProductList PL

OUTER APPLY (

SELECT MAX(PL1.OrderDate) AS OrderDate

FROM #ProductList AS PL1

WHERE PL1.ProductID = PL.ProductID

AND PL1.OrderDate < PL.OrderDate

) AS PrevOrder

GROUP BY DATEDIFF(DAY, PrevOrder.OrderDate, PL.OrderDate)

, PL.ProductID

, PL.Name

, PL.OrderDate

ORDER BY PL.ProductID, PL.OrderDate;

The temp table drastically improved the time, 1.4 seconds. The actual tables were only touched once to create the temp table. Then the temp table was scanned twice and joined with a Nested Loop.

Image showing execution plan for temp table query

Figure 6. Partial execution plan when using a temp table

Adding an index to the temp table might improve the performance even more. This script ran in about 700 ms.

--Indexed temp table and OUTER APPLY

CREATE CLUSTERED INDEX IDX_ProductList ON #ProductList

(ProductID, OrderDate);

SELECT PL.ProductID, PL.Name, PL.OrderDate,

DATEDIFF(DAY, PrevOrder.OrderDate, PL.OrderDate) AS DaysBetweenOrders

FROM #ProductList PL

OUTER APPLY (

SELECT MAX(PL1.OrderDate) AS OrderDate

FROM #ProductList AS PL1

WHERE PL1.ProductID = PL.ProductID

AND PL1.OrderDate < PL.OrderDate

) AS PrevOrder

GROUP BY DATEDIFF(DAY, PrevOrder.OrderDate, PL.OrderDate)

, PL.ProductID

, PL.Name

, PL.OrderDate

ORDER BY PL.ProductID, PL.OrderDate;

The self-join was also improved with the temp table, returning in just 3 seconds even while using LEFT JOIN, which wasn’t possible before.

--Self-join with temp table

SELECT PL.ProductID, PL.Name, PL.OrderDate,

DATEDIFF(DAY, MAX(PL2.OrderDate), PL.OrderDate) AS DaysBetweenOrders

FROM #ProductList AS PL

LEFT JOIN #ProductList AS PL2

ON PL2.ProductID = PL.ProductID AND PL.OrderDate > PL2.OrderDate

GROUP BY PL.ProductID

, PL.Name

, PL.OrderDate

ORDER BY PL.ProductID, PL.OrderDate;

Scalar user-defined function

The performance of user-defined scalar functions was improved in 2019 with inlining. I had hoped that the performance would not be too bad since SQL Server is running version 2019. Unfortunately, the less than operator (<) kills the performance. Even adding in an index on OrderDate didn’t help much. I killed the query after 25 seconds.

--Scalar UDF

CREATE OR ALTER FUNCTION [dbo].[GetPreviousOrderDate]

(

@ProductID INT, @OrderDate DATETIME

)

RETURNS DATETIME

BEGIN

DECLARE @PrevOrderDate DATETIME;

SELECT @PrevOrderDate = MAX(OrderDate)

FROM Sales.SalesOrderHeader AS SOH

JOIN Sales.SalesOrderDetail AS SOD

ON SOD.SalesOrderID = SOH.SalesOrderID

WHERE SOD.ProductID = @ProductID AND SOH.OrderDate < @OrderDate;

-- Return the result of the function

RETURN @PrevOrderDate;

END;

CREATE INDEX test_index ON Sales.SalesOrderHeader (OrderDate);

--Function call

SELECT PL.ProductID, PL.Name, PL.OrderDate,

dbo.GetPreviousOrderDate(PL.ProductID, PL.OrderDate)

FROM #ProductList AS PL

GROUP BY PL.ProductID

, PL.Name

, PL.OrderDate

ORDER BY PL.ProductID, PL.OrderDate;

Since I killed the query before it completed, I am not sure how long it would keep running. The execution looks simple enough, but it doesn’t show that the function is called many times.

Execution plan for UDF

Figure 7. The execution plan for the scalar UDF

Table-valued functions

I’ve often heard someone say, “just turn it into a table-valued function” when UDF issues arise. However, it’s still possible to do “bad things” with table-valued functions as well. There are two types of table-valued functions, multi-statement and inline. Multi-statement table-valued functions (MSTVF) can have loops, IF blocks, and table variables, so they do not scale well.

Inline table-valued functions (ITVF) only allow a single query. Of course, the function could contain a poorly written query, but generally, you will see better performance with these. In this case, however, the performance is still not as good as using LAG. Note that the call to the ITVF also uses OUTER APPLY.

--Inline table-valued function

CREATE OR ALTER FUNCTION dbo.ITVF_GetPrevDate

(

@ProductID INT, @OrderDate DATETIME

)

RETURNS TABLE

RETURN

(

SELECT MAX(SOH.OrderDate) AS PrevOrderDate

FROM Sales.SalesOrderHeader AS SOH

JOIN Sales.SalesOrderDetail AS SOD

ON SOD.SalesOrderID = SOH.SalesOrderID

WHERE SOD.ProductID = @ProductID

AND SOH.OrderDate < @OrderDate

)

SELECT PL.ProductID, PL.Name, PL.OrderDate,

DATEDIFF(DAY,IGPD.PrevOrderDate,PL.OrderDate) AS DaysBetweenOrders

FROM #ProductList AS PL

OUTER APPLY [dbo].[ITVF_GetPrevDate] (PL.ProductID,PL.OrderDate) IGPD

ORDER BY PL.ProductID, PL.OrderDate;

The ITVF takes about 12 seconds to run with over 59 million logical reads when using the temp table.

Execution plan for ITVF and temp table

Figure 8. The logical reads for the ITVF

Cursor

I debated whether to include a section on cursors because I don’t want to encourage anyone to start with a cursor solution. However, I remembered other situations when a cursor solution performed better compared to other techniques, so I decided to include it. Cursors are another tool in your T-SQL toolbox. They may be at the bottom of the box and a bit rusty from disuse, but a tool, nonetheless.

One important thing to note about running the cursor solution is to turn off the Actual Execution Plan (or any other method you might be using to capture execution plans) and Statistics. With those turned off, the script took about 2 seconds to run!

--Cursor

--Important! Also turn off Actual Execution Plan

SET STATISTICS IO, TIME OFF

ALTER TABLE #ProductList

ADD DaysBetweenOrders INT;

DECLARE @LastProductID INT, @ProductID INT;

DECLARE @LastOrderDate DATETIME, @OrderDate DATETIME;

DECLARE @DaysBetweenOrders INT;

DECLARE Products CURSOR FAST_FORWARD FOR

SELECT p.ProductID, P.OrderDate

FROM #ProductList AS P

ORDER BY p.ProductID, P.OrderDate

;

OPEN Products;

FETCH NEXT FROM Products INTO @ProductID, @OrderDate;

WHILE @@FETCH_STATUS = 0

BEGIN

IF @LastProductID = @ProductID BEGIN

SET @DaysBetweenOrders = DATEDIFF(DAY,@LastOrderDate,@OrderDate);

UPDATE #ProductList SET DaysBetweenOrders = @DaysBetweenOrders

WHERE ProductID = @ProductID AND OrderDate =@OrderDate;

END;

SELECT @LastOrderDate = @OrderDate, @LastProductID = @ProductID;

FETCH NEXT FROM Products INTO @ProductID, @OrderDate;

END

CLOSE Products;

DEALLOCATE Products;

SELECT P.ProductID

, P.Name

, P.OrderDate

, P.DaysBetweenOrders

FROM #ProductList AS P

ORDER BY P.ProductID, P.OrderDate;