Introduction
An old trick nowadays but one which is still underused here on SQLServerCentral is the cascading CROSS APPLY – where output from one CROSS APPLY is used as input for another. The term cascading CROSS APPLY (or "cCA" for short) was coined by our very own Jeff Moden here.
The use of cCAs can provide some rather remarkable set-based "in-row" processing abilities where the use of RBAR, inline functions, or multiple separate queries or CTEs might otherwise be required. Because cCAs can eliminate so many different types of code elements, they also make for some relatively short, easy to read, and easy to maintain code.
Whilst the technique works equally well with OUTER APPLY, you’re most likely to use it as sets of inline row-valued functions, that is, operating on rows from a table on the left side of the CROSS APPLY.
A Simple "Split" Example
One of the simplest examples of cCA usage takes the form of a string splitter with a tightly defined scope.
Defining the Problem
Let’s say we’ve imported a table containing email information in the following format:
CREATE TABLE #EmailAddress
(
EmployeeID INT IDENTITY(1,1),
EmailAddress VARCHAR(128)
)
;
INSERT INTO #EmailAddress
(EmailAddress)
SELECT '<mmooney@gmail.com>;mmoney@gmail.com' UNION ALL
SELECT '<chrisgardener@yahoo.com>)''[chrisgardener@comcast.net)]' UNION ALL
SELECT '"natan@hotmail.com'':>"natan@hotmail.com?''' UNION ALL
SELECT '"jennifer@walmart.com}'';jennifer@fedex.com' UNION ALL
SELECT '''<jennifer.juniper@fedex.com>;"jennifer.juniper@fedex.com"'''
;
Let's also say that we want to compare the domain of the two email addresses in each data element and perform some action depending on whether they are the same or not. To compare the domains we first have to resolve them out of the string.
First Thoughts
If we intend to tackle this problem using only T-SQL, two tools spring to mind.
One of these tools is the built-in scalar function PARSENAME(), which returns the specified part of a 4-part object name delimited with a period(.). Well, perhaps - if the at sign (@) was replaced with a period - but it can be eliminated immediately because it returns NULL if the string contains more than three delimiters.
The other tool would be the popular DelimitedSplit8K inline table-valued function, which returns parsed strings as a derived table. Using it seems simple enough. Just pass in the email address and at sign as delimiter and truncate the returned strings at the first period:
SELECT ea.EmployeeID, ea.EmailAddress, Domain1 = MAX(CASE WHEN ca.ItemNumber = 2 THEN LEFT(ca.Item,CHARINDEX('.',ca.Item,1)-1) ELSE '' END), Domain2 = MAX(CASE WHEN ca.ItemNumber = 3 THEN LEFT(ca.Item,CHARINDEX('.',ca.Item,1)-1) ELSE '' END) FROM #EmailAddress ea CROSS APPLY dbo.DelimitedSplit8K (ea.EmailAddress, '@') ca WHERE ca.ItemNumber IN (2,3) GROUP BY ea.EmployeeID, ea.EmailAddress HAVING COUNT(*) = 2;
This does generate the correct results:
EmployeeID | EmailAddress | Domain1 | Domain2 |
1 | <mmooney@gmail.com>;mmoney@gmail.com | gmail | gmail |
2 | <chrisgardener@yahoo.com>)'[chrisgardener@comcast.net)] | yahoo | comcast |
3 | "natan@hotmail.com':>"natan@hotmail.com?' | hotmail | hotmail |
4 | "jennifer@walmart.com}';jennifer@fedex.com | walmart | fedex |
5 | <jennifer.juniper@fedex.com>;"jennifer.juniper@fedex.com"' | fedex | fedex |
…but seems a bit heavy handed because it has to split everything out, return a result set for each row, and then choose from the result set. Using our test data it runs about 2.5 times slower than the cCA code we'll end up with using.
Inline Splitting
If we could read from the first at to the first period, then jump to the second at and second period, we’d be there. The next simple query picks up the character position of the first at:
SELECT EmployeeID, EmailAddress, p1 = CHARINDEX('@',EmailAddress,1) FROM #EmailAddress;
The results:
EmployeeID | EmailAddress | p1 |
1 | <mmooney@gmail.com>;mmoney@gmail.com | 9 |
2 | <chrisgardener@yahoo.com>)'[chrisgardener@comcast.net)] | 15 |
3 | "natan@hotmail.com':>natan@hotmail.com?' | 7 |
4 | "jennifer@walmart.com}';jennifer@fedex.com | 10 |
5 | '<jennifer.juniper@fedex.com>;"jennifer.juniper@fedex.com"' | 19 |
If p1 was passed into a similar CHARINDEX expression as the 3rd parameter it would return the position of the next delimiter, and we can show this with a CTE:
SELECT EmployeeID, EmailAddress, p1 = CHARINDEX('@',EmailAddress,1) FROM #EmailAddress; WITH level1 AS ( SELECT EmployeeID, EmailAddress, p1 = CHARINDEX('@',EmailAddress,1) FROM #EmailAddress ) SELECT EmployeeID, EmailAddress, p1, p2 = CHARINDEX('.',EmailAddress,p1) FROM level1;
with these results:
EmployeeID | EmailAddress | p1 | p2 |
1 | <mmooney@gmail.com>;mmoney@gmail.com | 9 | 15 |
2 | <chrisgardener@yahoo.com>)'[chrisgardener@comcast.net)] | 15 | 21 |
3 | "natan@hotmail.com':>natan@hotmail.com?' | 7 | 15 |
4 | "jennifer@walmart.com}';jennifer@fedex.com | 10 | 18 |
5 | <jennifer.juniper@fedex.com>;"jennifer.juniper@fedex.com"' | 19 | 25 |
Continuing in the same manner eventually yields the 4 character positions required to SUBSTRING the domains out of the email address column:
WITH level1 AS ( -- find p1 = first '@' SELECT EmployeeID, Email, p1 = CHARINDEX('@',Email,1) FROM @t), level2 AS ( -- find p2 = first '.' SELECT EmployeeID, Email, p1, p2 = CHARINDEX('.',Email,p1) FROM level1), level3 AS ( -- find p3 = second '@' SELECT EmployeeID, Email, p1, p2, p3 = CHARINDEX('@',Email,p2) FROM level2), level4 AS ( -- find p4 = second '.' SELECT EmployeeID, Email, p1, p2, p3, p4 = CHARINDEX('.',Email,p3) FROM level3) SELECT EmployeeID, Email, Domain1 = SUBSTRING(Email,p1+1,p2-(p1+1)), Domain2 = SUBSTRING(Email,p3+1,p4-(p3+1)) FROM level4;
This gives us:
EmployeeID | EmailAddress | Domain1 | Domain2 |
1 | <mmooney@gmail.com>;mmoney@gmail.com | gmail | gmail |
2 | <chrisgardener@yahoo.com>)'[chrisgardener@comcast.net)] | yahoo | comcast |
3 | "natan@hotmail.com':>"natan@hotmail.com?' | hotmail | hotmail |
4 | "jennifer@walmart.com}';jennifer@fedex.com | walmart | fedex |
5 | <jennifer.juniper@fedex.com>;"jennifer.juniper@fedex.com"' | fedex | fedex |
Which, whilst it works and speaking of "heavy handed," is a whole heap of code.
Splitting with Cascading CROSS APPLYs
If you’ve used APPLY before, you will know that it’s very handy for performing inline calculations. Partial calculations can be completed in the output set – passing the results “up”, as it were, like this:
SELECT EmployeeID, EmailAddress, SUBSTRING(EmailAddress,x1.p1,20) FROM #EmailAddress CROSS APPLY(SELECT p1 = CHARINDEX('@',EmailAddress,1)) x1;
The results:
EmployeeID | EmailAddress | |
1 | <mmooney@gmail.com>;mmoney@gmail.com | @gmail.com>;mmoney@g |
2 | <chrisgardener@yahoo.com>)'[chrisgardener@comcast.net)] | @yahoo.com>)'[chrisg |
3 | "natan@hotmail.com':>natan@hotmail.com?' | @hotmail.com':>"nata |
4 | "jennifer@walmart.com}';jennifer@fedex.com | @walmart.com}';jenni |
5 | <jennifer.juniper@fedex.com>;"jennifer.juniper@fedex.com"' | @fedex.com>;"jennife |
The results can be passed “down” too, to the WHERE clause:
SELECT EmployeeID, EmailAddress, SUBSTRING(EmailAddress,x1.p1,20) FROM #EmailAddress CROSS APPLY(SELECT p1 = CHARINDEX('@',EmailAddress,1)) x1 WHERE x1.p1 = 9;
What happens if we pass x1.p1 down to another CROSS APPLY block? It would look like this:
SELECT EmployeeID, EmailAddress, SUBSTRING(EmailAddress,x1.p1,20), SUBSTRING(EmailAddress,x2.p1,20) FROM #EmailAddress CROSS APPLY(SELECT p1 = CHARINDEX('@',EmailAddress,1)) x1 CROSS APPLY(SELECT p1 = CHARINDEX('@',EmailAddress, x1.p1+1)) x2;
x1.p1 is the position of the first at in the string. If we pass this output column into another CHARINDEX as the third parameter (which is the location to start searching from) it will find the same at and hence return the same position, so we add a 1 to it to begin the search at the character immediately after. The results are very close to what we’re looking for:
EmployeeID | EmailAddress | ||
1 | <mmooney@gmail.com>;mmoney@gmail.com | @gmail.com>;mmoney@g | @gmail.com |
2 | <chrisgardener@yahoo.com>)'[chrisgardener@comcast.net)] | @yahoo.com>)'[chrisg | @comcast.net)] |
3 | "natan@hotmail.com':>natan@hotmail.com?' | @hotmail.com':>"nata | @hotmail.com?' |
4 | "jennifer@walmart.com}';jennifer@fedex.com | @walmart.com}';jenni | @fedex.com |
5 | <jennifer.juniper@fedex.com>;"jennifer.juniper@fedex.com"' | @fedex.com>;"jennife | @fedex.com"' |
And from here it’s a trivial matter to fetch the position of the period immediately after each at:
SELECT EmployeeID, EmailAddress, Domain1 = SUBSTRING(EmailAddress, x1.p1, x1.p2-x1.p1), Domain2 = SUBSTRING(EmailAddress, x2.p1, x2.p2-x2.p1) FROM #EmailAddress CROSS APPLY(SELECT p1 = 1+CHARINDEX('@',EmailAddress,1), p2 = CHARINDEX('.',EmailAddress,CHARINDEX('@',EmailAddress,1)) ) x1 CROSS APPLY(SELECT p1 = 1+CHARINDEX('@',EmailAddress, x1.p1), p2 = CHARINDEX('.',EmailAddress,CHARINDEX('@',EmailAddress, x1.p1)) ) x2;
Remember the 1 added to p1? Here I’ve put it into the expression for p1 so I can safely forget about it elsewhere in the query. Note that you could introduce two more CROSS APPLY blocks, one for each p2.
Here are the results:
EmployeeID | EmailAddress | Domain1 | Domain2 |
1 | <mmooney@gmail.com>;mmoney@gmail.com | gmail | gmail |
2 | <chrisgardener@yahoo.com>)'[chrisgardener@comcast.net)] | yahoo | comcast |
3 | "natan@hotmail.com':>"natan@hotmail.com?' | hotmail | hotmail |
4 | "jennifer@walmart.com}';jennifer@fedex.com | walmart | fedex |
5 | <jennifer.juniper@fedex.com>;"jennifer.juniper@fedex.com"' | fedex | fedex |
This is the cascaded (CROSS) APPLY: the output from one APPLY block is used as an input parameter in a subsequent block.
A note about performance
Cascaded (CROSS) APPLY used to extract elements from strings quickly becomes expensive as the number of elements increases. Give it up to four or five elements and it is likely to be faster than a tally-based string splitter; more than that and it begins to lag behind. Don’t be fooled by the execution plan. Regardless of how many APPLY blocks are in your query, it shows a single Compute Scalar operator. You have to raise the properties of the Compute Scalar Operator to get any idea of how much work is performed.
Here’s a crude comparison between cCA and SSC’s favourite string splitter, DelimitedSplit8K:
Element | cCA | DelimitedSplit8K | ||
count | CPU | Elapsed | CPU | Elapsed |
1 | 250 | 1,017 | 4,056 | 4,186 |
2 | 1,248 | 1,488 | 4,665 | 4,760 |
3 | 2,824 | 2,990 | 5,257 | 5,414 |
4 | 5,148 | 5,354 | 5,913 | 6,023 |
5 | 8,409 | 8,555 | 6,693 | 6,994 |
6 | 12,481 | 12,634 | 7,566 | 7,788 |
7 | 17,191 | 17,367 | 8,097 | 8,347 |
8 | 22,137 | 22,255 | 8,705 | 8,920 |
9 | 27,581 | 27,822 | 9,360 | 9,590 |
10 | 33,243 | 33,399 | 9,844 | 10,190 |
11 | 37,487 | 37,878 | 10,514 | 10,743 |
A few real-world applications
Developing a complex aggregate
Case:
-- whoosh up some sample data IF object_id('TempDB..#Persons') IS NOT NULL DROP TABLE #Persons; GO CREATE TABLE #Persons ( PersonID INT IDENTITY(1,1) PRIMARY KEY CLUSTERED, Forename VARCHAR(20), Lastname VARCHAR(20), Title VARCHAR(10), DOB DATE, WeightKg NUMERIC (5,2), Heightmm INT) ; INSERT INTO #Persons (Forename, Lastname, Title, DOB, WeightKg, Heightmm) SELECT Forename, Lastname, Title, DOB = CONVERT(DATE,DOB,107), WeightKg = CAST(RIGHT(DOB,2) AS INT) * 1.5, Heightmm = YEAR(CONVERT(DATE,DOB,107)) FROM ( VALUES (('Francis'),('Bacon'),('Mr'),('January 22, 1561'),(NULL)), (('Jean'),('Baptiste'),('Mr'),('March 21, 1768'),(NULL)), (('Clarence'),('Birdseye'),('Mr'),('December 9, 1886'),(NULL)), (('Charles'),('Babbage'),('Mr'),('November 24, 1872'),(NULL)), (('John'),('Logie Baird'),('Mr'),('August 13, 1888'),(NULL)), (('Nicolaus'),('Copernicus'),('Mr'),('February 19, 1473'),(NULL)), (('Christian'),('Doppler'),('Mr'),('November 29, 1803'),(NULL)), (('Paul'),('Ehrlich'),('Mr'),('March 14, 1854'),(NULL)), (('Alexander'),('Fleming'),('Mr'),('August 6, 1881'),(NULL)), (('Michael'),('Faraday'),('Mr'),('September 22, 1791'),(NULL)), (('Pierre'),('de Fermat'),('Mr'),('August 17, 1601'),(NULL)), (('Robert'),('Fulton'),('Mr'),('November 14, 1765'),(NULL)), (('Joseph'),('Fourier'),('Mr'),('March 21, 1768'),(NULL)), (('Galileo'),('Galilei'),('Mr'),('February 15, 1564'),(NULL)), (('Stephen'),('Hawking'),('Mr'),('January 8, 1942'),(NULL)), (('Otto'),('Hahn'),('Mr'),('March 8, 1879'),(NULL)), (('Jean-Baptiste'),('Lamarck'),('Mr'),('August 1, 1744'),(NULL)), (('Konrad'),('Lorenz'),('Mr'),('November 7, 1903'),(NULL)), (('Gregor'),('Mendel'),('Mr'),('July 22, 1822'),(NULL)), (('Guglielmo'),('Marconi'),('Mr'),('April 25, 1874'),(NULL)), (('Isaac'),('Newton'),('Mr'),('December 25, 1642'),(NULL)), (('J. Robert'),('Oppenheimer'),('Mr'),('April 22, 1904'),(NULL)), (('Joseph'),('Priestley'),('Mr'),('March 13, 1733'),(NULL)), (('Wolfgang'),('Pauli'),('Mr'),('April 25, 1900'),(NULL)), (('Blaise'),('Pascal'),('Mr'),('June 19, 1623'),(NULL)), (('Craig'),('Venter'),('Mr'),('October 14, 1946'),(NULL)), (('James'),('Watson'),('Mr'),('April 16, 1970'),(NULL)) ) d (Forename, Lastname, Title, DOB, WeightKg) ; -- Task: group scientists according to their weight and height SELECT Century, LightweightShorties = COUNT(z.LightweightShorty), -- more columns GihugeousTallies = COUNT(z.GihugeousTallie) FROM #Persons CROSS APPLY ( -- calculate century of birth and weight band SELECT Century = (YEAR(DOB)/100)+1, WeightBand = CASE WHEN WeightKg < 80 THEN 1 WHEN WeightKg >= 80 AND WeightKg < 100 THEN 2 WHEN WeightKg >= 100 AND WeightKg < 120 THEN 3 ELSE 4 END ) x CROSS APPLY ( -- calculate height band SELECT HeightBand = CASE WHEN Heightmm < 1700 THEN 1 WHEN Heightmm >= 1700 THEN 2 ELSE 4 END ) y CROSS APPLY ( -- count scientists in specific weight and height bands SELECT LightweightShorty = CASE WHEN WeightBand = 1 AND HeightBand = 1 THEN 1 END, -- more columns GihugeousTallie = CASE WHEN WeightBand = 4 AND HeightBand = 2 THEN 1 END ) z GROUP BY Century ORDER BY Century -- Once the results are verified, condense the code. -- note differences between COUNT and SUM -- count rows where the result is not null SELECT x.Century, LightweightShorties = SUM(CASE WHEN WeightKg < 80 AND Heightmm < 1700 THEN 1END), -- SUM(1) -- more columns GihugeousTallies = COUNT(CASE WHEN WeightKg >= 120 AND Heightmm >= 1700 THEN 1 END) FROM #Persons CROSS APPLY (SELECT Century = (YEAR(DOB)/100)+1) x GROUP BY x.Century ORDER BY x.Century
Note that the optimizer chooses the same plan for both queries.
Make a complex calculation easier to work with by breaking it down into smaller, simpler elements
Case: calculate the elapsed time in milliseconds (in data type TIME(3)) between two DATETIME2 values truncated to milliseconds. Show workings and determine if the DATETIME datatype would be a sensible choice for intermediate values in the calculation.
WITH MyCTE (aDT2, bDT2) AS ( SELECT CONVERT(datetime2,'1900-01-01 00:10:00.000'), CONVERT(datetime2,'1900-01-01 00:00:00.000') UNION ALL SELECT '1900-01-01 00:10:00.000', '1900-01-01 00:00:00.001' UNION ALL SELECT '1900-01-01 00:10:00.000', '1900-01-01 00:00:00.002' UNION ALL SELECT '1900-01-01 00:10:00.000', '1900-01-01 00:00:00.003' UNION ALL SELECT '1900-01-01 00:10:00.000', '1900-01-01 00:00:00.004' UNION ALL SELECT '1900-01-01 00:10:00.000', '1900-01-01 00:00:00.005' ) SELECT *, Result_DT2 = CAST(msTime_DT2 AS TIME(3)), Result_DT = CAST(msTime_DT AS TIME(3)), ChosenCalc = CAST(DATEADD(ms,a.msTimeDiff,CAST(CAST(0 AS DATETIME) AS DATETIME2)) AS TIME(3)) FROM myCTE CROSS APPLY (SELECT msTimeDiff = ABS(DATEDIFF(ms,aDT2, bDT2))) a CROSS APPLY (SELECT TimeZero_DT2 = CAST(CAST(0 AS DATETIME) AS DATETIME2)) b CROSS APPLY (SELECT TimeZero_DT = CAST(0 AS DATETIME)) b1 CROSS APPLY (SELECT msTime_DT2 = DATEADD(ms,a.msTimeDiff,TimeZero_DT2)) c CROSS APPLY (SELECT msTime_DT = DATEADD(ms,a.msTimeDiff,TimeZero_DT)) c1;
Results
Result_DT2 | Result_DT | ChosenCalc |
00:10:00.000 | 00:10:00.000 | 00:10:00.000 |
00:09:59.999 | 00:10:00.000 | 00:09:59.999 |
00:09:59.998 | 00:09:59.997 | 00:09:59.998 |
00:09:59.997 | 00:09:59.997 | 00:09:59.997 |
00:09:59.996 | 00:09:59.997 | 00:09:59.996 |
00:09:59.995 | 00:09:59.997 | 00:09:59.995 |
There’s our answer - precision is lost if DATETIME is used. A quick scan of the DATETIME datatype section in BOL warns Accuracy: Rounded to increments of .000, .003, or .007 seconds. I could have read BOL first of course, but it wouldn’t have been as much fun and working through an exercise like this helps retention.
Split a string with multiple heterogeneous delimiters
Case: a single data element in a string column may contain several lines of text (delimited with carriage return/linefeed pairs). There are also spaces between words, and at (@) and period (.) act as delimiters either side of domain names. The second line of each string contains two email addresses. We want the domain name of the second email address.
-- Create a small sample table DECLARE @MyString VARCHAR(8000), @CRLF CHAR(2) SET @CRLF = CHAR(13) + CHAR(10) SET @MyString = 'The quick brown fox slip@snailmail.com over the,lazy,dog¬' + @CRLF + 'The,quick,abc@hutmail.com fox,jumped,over the,frizzy@boogle.com dog¬' + @CRLF + 'The,quick,red,fox,jumped,shell@snailmail.com,the,lazy,dog¬' + @CRLF + 'Fourth line' PRINT @MyString DROP TABLE #Strings SELECT * INTO #Strings FROM ( -- make each string slightly different SELECT '1 ' + @MyString AS MyString UNION ALL SELECT '12 ' + REPLACE(@MyString,'boogle','woogle') UNION ALL SELECT '123 ' + REPLACE(@MyString,'boogle','doogle') UNION ALL SELECT '1234 ' + REPLACE(@MyString,'boogle','noogle') UNION ALL SELECT '12345 ' + REPLACE(@MyString,'boogle','foogle') UNION ALL SELECT '123456 ' + @MyString UNION ALL SELECT '1234567 ' + @MyString UNION ALL SELECT '12345678 ' + @MyString UNION ALL SELECT '123456789 ' + @MyString UNION ALL SELECT '1234567890 ' + @MyString ) d -- solution SELECT *, domain = SUBSTRING(d.MyString, x3.p3+1, x4.p4-(x3.p3+1)) FROM #Strings d CROSS APPLY (SELECT p1 = CHARINDEX(@CRLF, d.MyString, 0)) x1 -- start of line 2 (end of line 1) CROSS APPLY (SELECT p2 = CHARINDEX('@', d.MyString, x1.p1)) x2 -- first '@' in line 2 CROSS APPLY (SELECT p3 = NULLIF(CHARINDEX('@', d.MyString, x2.p2+1),0)) x3 -- second '@' in line 2 CROSS APPLY (SELECT p4 = NULLIF(CHARINDEX('.', d.MyString, x3.p3+1),0)) x4; -- first '.' after second '@' in line 2
Solving this puzzle (a real case which came up on SSC sometime last year) is trivial with cCA but would be quite tricky using a string splitter, splitting firstly on CR or LF and choosing the second item from the list, then on at and choosing the second item, and finally CHARINDEX() to capture the position of the terminal period of the domain. Something like this;
SELECT s.MyString, domain = LEFT(b.Item, c.dotpos-1) FROM #Strings s CROSS APPLY dbo.DelimitedSplit8K(MyString,CHAR(10)) a CROSS APPLY dbo.DelimitedSplit8K(a.Item,'@') b CROSS APPLY (SELECT dotpos = CHARINDEX('.',b.Item)) c WHERE a.ItemNumber = 2 AND b.ItemNumber = 3;
- which of course uses a cascaded CROSS APPLY!
Thank you for taking the time to read this article. A huge thanks to my good friend Dwain Camps for his endless encouragement and to Jeff Moden for his first-rate editing and comments.