SQLServerCentral Article

Reaping the benefits of the Window functions in T-SQL

,

The enhanced OVER clause support and the new Analytic functions in SQL Server® 2012, better known as the Window functions, brought some interesting capabilities to the platform. Having used the Window functions quite extensively on various platforms I have learned to appreciate them for the performance benefits and how some previously complex solutions can be simplified. Now I want to share my thoughts on using the Window functions and hopefully this article will encourage other readers to do the same.

This article has two parts, the first part explains how the new cross row referencing functions LAG() and LEAD() can help improve the performance when splitting text in T-SQL. The second part demonstrates how simple windowed aggregation can be used to extend the functionality of the previous exercise.

Part 1, a LEAD for improved performance.

There are few performance-related challenges to overcome when splitting delimited strings in T-SQL. Fortunately some good work is already out there and quite few articles are available on the subject. To establish what can be improved by using the Window functions, one must first look at the actual work involved and then explore the alternatives.

Looking at the “work” part of Jeff Moden's DelimitedSplit8K function we see that the first part scans the string to find the position of every delimiting character. The second half uses the delimiters position as a starting point for both the SUBSTRING() function and the CHARINDEX(), the former for retrieving the substring and the latter finds the next delimiter position to calculate the length of the substring.

CREATE FUNCTION [dbo].[DelimitedSplit8K]
--===== Define I/O parameters
        (@pString VARCHAR(8000), @pDelimiter CHAR(1))
RETURNS TABLE WITH SCHEMABINDING AS
 RETURN
--===== "Inline" CTE Driven "Tally Table” produces values from 0 up to 10,000...
     -- enough to cover VARCHAR(8000)
  WITH E1(N) AS (
                 SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
                 SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
                 SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
                ),                          --10E+1 or 10 rows
       E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
       E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
 cteTally(N) AS (--==== This provides the "zero base" and limits the number of rows right up front
                     -- for both a performance gain and prevention of accidental "overruns"
                 SELECT 0 UNION ALL
                 SELECT TOP (DATALENGTH(ISNULL(@pString,1))) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
                ),
cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)
                 SELECT t.N+1
                   FROM cteTally t
                  WHERE (SUBSTRING(@pString,t.N,1) = @pDelimiter OR t.N = 0) 
                )
--===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.
 SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY s.N1),
        Item       = SUBSTRING(@pString,s.N1,ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1,8000))
   FROM cteStart s
;
GO

In a procedural pseudo code the DelimitedSplit8K algorithm with the CHARINDEX function expanded (in bold).

while not EndOfString           ; loop through the string to
  if string[pos] == delimiter   ; find a delimiter, then
    previous_pos = pos          ; record the position and
    next_pos = previous_pos + 1 ; pass it to the charindex function
    while not  EndOfString      ; scan the string from the position
      if string[next_pos] == delimiter  ; find the next delimiter
        output substring ( character_string,  ; and return the part of the
                           previous_pos, ; string between the two.
                           next_pos – previous_pos )
        exit loop
      iterate next_pos
    loop
  iterate pos
loop

which translates to a Set based pseudo-code;

select substring
  ( 
    pos
    ,(select TOP(1) pos         ; select characters where position is
     from string                ; greater than pos
     where string[next_pos] == delimiter ; find the next delimiter
     and pos < next_pos         ; shown here as a subquery
     ) - pos
  )
from string
where char[pos] == delimiter

It doesn't matter how we dress it, the algorithm is doing an extra scan of the string! But since we are working with a set and the set already has the positions of the all the delimiters, we should be able to retrieve the next delimiter position from the set instead of (re)scanning the string to find it.

For those who caught the contradiction in the previous statement, we are forcing the order of the retrieval of values from the set in the clause;

ROW_NUMBER() OVER(ORDER BY s.N1).

Four of the Window functions introduced in SQL Server® 2012 are cross-row referencing functions, FIRST_VALUE(), LAST_VALUE(), LAG() and LEAD(). In this case we are interested in the latter two. On the label it says (contracted):

LAG / LEAD (scalar_expression [ ,offset ] , [ default ] ) 
 OVER ( [ partition_by_clause ] order_by_clause)

Accesses data from a previous /subsequent row in the same result set without the use of a self-join in SQL Server 2012. LAG/LEAD provides access to a row at a given physical offset that comes before/follows the current row. Use this analytic function in a SELECT statement to compare values in the current row with values in a previous/following row.

In other words, retrieve a value from another row by a given offset and if not found pass the default. This means that the CHARINDEX() can be replaced with the LEAD() function, nearly halving the amount of work needed:

Item = SUBSTRING(@pString,s.N1,ISNULL(NULLIF((LEAD(s.N1,1,1) OVER (ORDER BY s.N1) – 1),0)-s.N1,8000))

In the same procedural pseudo-code as before the DelimitedSplit8K with the LEAD() now looks better with only a single scan of the string;

while not EndOfString          ; loop through the string to find all
  if string[pos] == delimiter  ; delimiters and retrieve the strings
    output substring ( character_string,  ; between the delimiters
                       pos, 
                       next_pos – pos )
  iterate pos
loop

and in a set based pseudo-code;

select
substring(pos,next_pos - pos)
from string
where char[pos] == delimiter

Putting the changes to a test using Jeff Moden's excellent Splitter test script produces the expected results, the LEAD() versions outperforming the CHARINDEX() ones by up to 50%.

A noticeable improvement and a good example of how performance can be improved by using the Window functions. In this particular problem the only code modification was the replacement of CHARINDEX() function with the LEAD() function.

Before:

CHARINDEX(@pDelimiter,@pString,s.N1)

After:

LEAD(s.N1,1,1) OVER (ORDER BY s.N1) – 1

How does it work?

Looking into the details of the query execution using the “Include Actual Execution Plan” we see the new Window Spool and a Stream Aggregate operator.

STATISTICS IO output shows a Worktable with zero IO:

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Itzik Ben-Gan explains this as "the Window Spool iterator uses a highly optimized, in-memory work table, without all of the usual overhead that exists with work tables in tempdb, such as I/ O, locks, latches, and so forth." (Ben-Gan,I.,2012,Kindle Locations 3499-3501).

In this case the Window Spool operator in conjunction with the Stream Aggregate operator provide the cross-row reference with almost no overhead.

Part 2 - Parsing a CSV

When it comes to real world problems, the simple delimiter splitting functions do leave few things to be desired. In many ways one could argue that so far it has been more of “coding for the optimal problem” than “optimally coding for the problem” as a large portion of the delimiter separated data one has to deal with on a daily basis is slightly more complex. The biggest shortcomings would propably be the handling of text qualifiers according to RFC-4180.

For the sake of argument we introduce a typical “Artist” record

100123,"Blues","West Point, Mississippi","Burnett,""Howlin' Wolf"" Arthur Chester",1910-06-10.

Using our previous splitter functions to parse the Artist record certainly does not produce the desired results:

Or as this visual presentation of the CSV logic reveals, there is some level of the complexity;

Here's the CSV Splitter logic translated into the procedural pseudo-code;

while not EndOfString
  if string[pos] == [text qualifier] 
    if string[previous_pos] != [text qualifier]
      set InTextQualifier = !InTextQualifier
    else
      delete previous_pos
    end if
  end if
  if string[pos] == [delimiter] [AND NOT] InTextQualifier
    output substring ( character_string, pos, next_pos – pos )
  end if
  iterate pos
loop

At first glance this looks like a contorted and complex code in T-SQL. Even the C(ursor) word comes to mind! Interestingly enough the problem is somewhat simpler when looked at as a set based rather than procedural based task. Again, using the set based pseudo-code to describe the problem:

select 
      unescape_text_qualifier(substring(string, pos, next_pos - pos))
where (pos and next_pos) not between text_qualifier

This may only look like a simplification of the problem but in fact it is very accurate. Starting with the filtering of the sets, “where pos / next_pos not between text_qualifier”, how can the range of active text qualifiers be defined? The answer is by counting the text qualifiers and when the count is an odd number, then the text qualifiers are “active”. A running total by the order of the character position will then provide the means of identifying the active and inactive parts of the character set.

/*********************************************************************
    Using Jeff Moden's DelimitedSplit8K as a base with the addition
    of a Text Qualifier parameter, @pTxtQualifier CHAR(1) = '"'
 *********************************************************************/DECLARE @pString        VARCHAR(8000) = ''
DECLARE @pDelimiter     CHAR(1)       = ',';
DECLARE @pTxtQualifier  CHAR(1)       = '"';
-- A typical Artist record
SELECT @pString = '100123,"Blues","West Point, Mississippi","Burnett,'
 + '""Howlin'' Wolf"" Arthur Chester",1910-06-10,';
/*********************************************************************
    cteTally, inline Tally table returning a number sequence equivalent 
    to the length of the input string. The Cast of the Len() to a 
    Bigint prevents an implicit cast.
 *********************************************************************/;WITH E1(N) AS 
(
  SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
  SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
  SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
),                          
E2(N) AS (SELECT 1 FROM E1 a, E1 b), 
E4(N) AS (SELECT 1 FROM E2 a, E2 b), 
cteTally(N) AS 
(
  SELECT 
  ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) 
  FROM E4
  ORDER BY 1 OFFSET 0 ROWS 
  FETCH FIRST CAST(LEN(@pString) AS BIGINT) ROWS ONLY
)
/********************************************************************
   Retrieve the position (N) and the character code (chrCode) 
   for all delimiters (@pDelimiter) and text qualifiers 
   (@pTxtQualifier)
********************************************************************/,ctePrimer(N,chrCode) AS 
(
  SELECT 
     t.N 
    ,UNICODE(SUBSTRING(@pString,t.N,1))  AS chrCode
  FROM cteTally t 
  WHERE SUBSTRING(@pString,t.N,1) COLLATE Latin1_General_BIN 
                 = @pDelimiter    COLLATE Latin1_General_BIN
  OR    SUBSTRING(@pString,t.N,1) COLLATE Latin1_General_BIN 
                 = @pTxtQualifier COLLATE Latin1_General_BIN
)
/********************************************************************
   The cteStart encloses the string in virtual delimiters using 
   Union All at the beginning and the end. The main body sets the 
   IsDelim and IsTxQf flags. 
********************************************************************/,cteStart(N,IsDelim,IsTQA) AS 
(
  SELECT 
     0                   AS N
    ,1                   AS IsDelim
    ,0                   AS IsTxQf
UNION ALL 
  SELECT 
     t.N 
    ,(1 - SIGN(ABS(t.chrCode - UNICODE(@pDelimiter))))    AS IsDelim
    ,(1 - SIGN(ABS(t.chrCode - UNICODE(@pTxtQualifier)))) AS IsTxQf
  FROM ctePrimer t 
UNION ALL
  SELECT                  
     LEN(@pString) + 1   AS N
    ,1                   AS IsDelim
    ,0                   AS IsTxQf
)
/********************************************************************
   Position (N), Delimiter flag (IsDelim), Text Qualifier flag
   (IsTQA) and the running total of the number of appearance of 
   Text Qualifiers
********************************************************************/SELECT
   cST.N
  ,cST.IsDelim
  ,cST.IsTQA
  ,SUM(cST.IsTQA) 
    OVER (ORDER BY cST.N 
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS TQRT
  ,CASE
    WHEN SUM(cST.IsTQA) 
    OVER (ORDER BY cST.N 
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) % 2 = 1 THEN 'X'
ELSE ''
   END AS Active
FROM cteStart    cST

As before, the query starts with an inline Tally table in place of an iterative structure. The ctePrimer then selects all positions (N) where the character is either a delimiter or a text qualifier. The next part, cteStart then assigns either 0 or 1 to the two flags. IsDelim is set to 1 when the character is a delimiter and IsTxQf is set to 1 when the character is a text qualifier.

The output of the query correctly identifies the active and inactive delimiters;

NIsDelimIsTQATQRTActive
0100
7100
8011X
14012
15102
16013X
27103X
40014
41104
42015X
50105X
51016
52017X
65018
66019X
820110
831010
941010
951010

Encapsulating the logic in a CTE cteWorkSet, replacing the previous select statement, now effectively cancels out all delimiters which are not between text_qualifier;

/********************************************************************
   cteWorkSet:
   Position (N), Delimiter flag (IsDelim), Text Qualifier flag
   (IsTQA) and the running total of the number of appearances of 
   Text Qualifiers. The delimiters which are inside Text Qualifiers
   are cancelled out by multiplying the IsDelim flag with the result
   of ( 1 + the running total of IsTQA ) mod 2.
********************************************************************/,cteWorkSet(N,IsDelim,IsTQA) AS 
(
  SELECT
      cST.N
     ,cST.IsDelim * ((1+ SUM(cST.IsTQA)  OVER 
      (PARTITION BY (SELECT NULL) ORDER BY cST.N 
      ROWS UNBOUNDED PRECEDING)) % 2)                   AS IsDelim
     ,((SUM(cST.IsTQA)  OVER (PARTITION BY (SELECT NULL) 
      ORDER BY cST.N ROWS UNBOUNDED PRECEDING)) % 2)    AS IsTQA
  FROM cteStart  cST
)
SELECT 
   N
  ,IsDelim
  ,IsTQA
FROM cteWorkSet
WHERE IsDelim = 1 OR IsTQA = 1
NIsDelimIsTQA
010
710
801
1510
1601
2701
4110
4201
5001
5201
6601
8310
9410
9510

The positions of the active delimiters, identified by the IsDelim flag equals 1.

Preparing the output in a CTE by adding both LAG() and LEAD() for the Start and Length calculation;

cteWSTQ(P_START,IsDelim,NEXT_IsTQA,LAG_IsTQA) AS
(
  SELECT 
    cWS.N                                      AS P_START
    ,cWS.IsDelim                               AS IsDelim
    ,LEAD(cWS.IsTQA,1,0) OVER (ORDER BY cWS.N) AS NEXT_IsTQA
    ,LAG(cWS.IsTQA,1,0) OVER (ORDER BY cWS.N)  AS LAG_IsTQA
  FROM  cteWorkSet    cWS 
  WHERE cWS.IsDelim = 1
  OR    cWS.IsTQA   = 1
)

The formula for calculating the starting point is then

P_START + NEXT_IsTQA + SIGN(P_START)

where P_START is the actual position of the delimiter, NEXT_IsTQA indicates whether the field has Text Qualifiers and finally the SIGN(P_START) shifts all but the first field by 1 position to exclude the actual delimiter character.

Calculating the length of the field is then done by subtracting the start position from the end position and adding the offset correction to the start position.

LEAD(X.P_START,1,0) OVER (ORDER BY X.P_START) - ((X.P_START + X.NEXT_IsTQA) + SIGN(X.P_START) + LEAD(X.LAG_IsTQA,1,0) OVER (ORDER BY X.P_START))

As the outputs of the Window functions are needed to filter the set in order to exclude the last record where the length is a negative value, another CTE is added to the mix;

cteWSLEN(P_START,P_LEN) AS
(
  SELECT 
     (X.P_START + X.NEXT_IsTQA + SIGN(X.P_START))       AS P_START
    ,(LEAD(X.P_START,1,0) OVER (ORDER BY X.P_START) - 
     ((X.P_START + X.NEXT_IsTQA)  + SIGN(X.P_START) + 
     LEAD(X.LAG_IsTQA,1,0) OVER (ORDER BY X.P_START)))  AS P_LEN
  FROM cteWSTQ X WHERE X.IsDelim = 1
)

All that is left to do now is to pass the output of the cteWSLEN to the SUBSTRING() function and add a ROW_NUMBER() for the ItemNumber. It is very important that the ORDER BY clause of the ROW_NUMBER() function is set to (SELECT NULL). Otherwise the optimizer will introduce a rather expensive SORT operator.

/********************************************************************
   Splitting the string using the output of the cteWSLEN, filtering
   it by the length being non-negative value. The NULLIF returns NULL
   if the field is empty.
********************************************************************/SELECT
   ROW_NUMBER() OVER (ORDER BY (SELECT NULL))           AS ItemNumber
  ,NULLIF(REPLACE(SUBSTRING(@pString,cWL.P_START,cWL.P_LEN),@pTxtQualifier+@pTxtQualifier,@pTxtQualifier),'') AS Item
FROM cteWSLEN    cWL
WHERE cWL.P_LEN > -1

Here are the results:

ItemNumber           Item
-------------------- --------------------------------------
1                    100123
2                    Blues
3                    West Point, Mississippi
4                    Burnett,"Howlin' Wolf" Arthur Chester
5                    1910-06-10
6                    NULL

Performance vs Functionality

The added functionality does come at a price. When compared to the simple split functions it is more than three times slower. But then again like someone once said, "it doesn't matter how fast your code is if it doesn't work".

The single most expensive part is the replacement of the escaped text qualifiers. Without it the function code is still faster than most splitting functions which use the CHARINDEX() function.

,NULLIF(REPLACE(SUBSTRING( @pString,cWL.P_START,cWL.P_LEN)
,@pTxtQualifier+@pTxtQualifier,@pTxtQualifier),'') AS Item

It is rather obvious that this string manipulation is slow so improving it could be a future challenge.

Moving on to more complex parsing

Recently I was asked if I knew of a way to normalize parameterized dynamic SQL strings being captured by a monitoring system. The problem was that the values passed to the parameters would also be captured, creating a number of entries for each individual query statement. The problem is best described by a “before” and “after” example;

Before:

EXEC sp_executesql N'USE AdventureWorks2012; SELECT PROD.ProductID,PROD.Name FROM Production.Product PROD WHERE PROD.ProductNumber = @PARAM_1',N'@PARAM_1 NVARCHAR(25)',@PARAM_1=N'BA-8327'

After:

EXEC sp_executesql N'USE AdventureWorks2012; SELECT PROD.ProductID,PROD.Name FROM Production.Product PROD WHERE PROD.ProductNumber = @PARAM_1',N'@PARAM_1 NVARCHAR(25)',@PARAM_1=#

This is slightly more complex than the previous CSV parsing but still could be achieved using a variation of the same technique, even if neither the name nor the number of parameters are known. Any suggestions?

References

Ben-Gan, Itzik (2012-04-19). Microsoft® SQL Server® 2012 High-Performance T-SQL Using Window Functions. OReilly Media - A. Kindle Edition.

Jeff Moden, Tally OH! An Improved SQL 8K “CSV Splitter” Function, http://www.sqlservercentral.com/articles/Tally+Table/72993/ , accessed 2013-12-29

Code for the DelimitedSplit8K_LEAD and CSVSplitter is in the resource section.

Resources

Rate

5 (50)

You rated this post out of 5. Change rating

Share

Share

Rate

5 (50)

You rated this post out of 5. Change rating