
complicated sql syntax - looking for explanation for working query
Posted Monday, May 5, 2014 5:55 PM


SSC-Dedicated


Group: General Forum Members
Last Login: Yesterday @ 9:58 AM
Points: 36,995, Visits: 31,517
KoldCoffee (5/5/2014)
At this point, the recurrent question lurking in my mind is 'why do we have to count from zero' instead of '1' referring to 'SELECT N = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) -1


It's just simple math. Since we want to use all the values that ROW_NUMBER() produces by adding them to the start date/time to come up with a date/time incremented by minute, the first value has to be "0" so that when you add the value of ROW_NUMBER() to the start date/time, the first value (the one with the "0") will return the start date/time as expected. Since the first value (1) was shifted towards "0" by 1, all the values must be shifted towards "0" by 1 or you'll end up with a gap. Subtracting 1 effectively moves the value that ROW_NUMBER() produced towards "0" by 1.
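To see it in miniature (a minimal sketch, not the original query): number four rows with ROW_NUMBER(), subtract 1, and add the result to a start date/time as minutes.

SELECT MinuteMark = DATEADD(mi, ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1, '1900-01-01')
  FROM (VALUES(1),(1),(1),(1)) v(x) --Four rows, so we get 00:00, 00:01, 00:02, and 00:03.
;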


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #1567745
Posted Monday, May 5, 2014 7:41 PM
SSChasing Mays


Group: General Forum Members
Last Login: Monday, August 25, 2014 11:12 AM
Points: 631, Visits: 1,470
OK, what a day of goodness from remote corners.
I most of all appreciate Jeff Moden’s top-to-bottom explanation of the query, but all of the posts deepened my understanding of the many concepts and the creativity that go into clever SQL.
I get the ANSI and non-ANSI styles of the cross apply notation (which I didn’t before, particularly because I was tripped up by the ‘x’, thinking it was a multiplication symbol instead of an alias), the UNION ALL replacing the VALUES syntax (whose x(n) I didn’t and still don’t understand, but don’t have to; see the quick sketch at the end of this post), replacing derived queries with CTEs, CTE syntax/structure, the 1 as a placeholder so there is something to assign row numbers to, and harnessing ROW_NUMBER() with the ORDER BY (SELECT NULL) trick to get the rows numbered. I also appreciate the performance observations... but why the row numbers have to start at ZERO still eludes me, even though Lowell gave an explanation that smacked of something I should get, and even though ‘it is simple math’. I thought I knew simple math. I now see how it works (i.e., I tested it with and without the zero, so at least I can see the purpose), but not why.
The -1 is more about logic than math... and I don't expect you to help me there, actually.
So, you people are pretty wonderful. Thanks.
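As a quick sketch of the two row constructors mentioned above (both return the same ten rows of 1s, and the x(n) is just a table alias with a column name):

SELECT n FROM (VALUES(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) x(n) --VALUES form
;
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 --UNION ALL form
;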
Post #1567754
Posted Monday, May 5, 2014 9:13 PM


SSC-Dedicated


Group: General Forum Members
Last Login: Yesterday @ 9:58 AM
Points: 36,995, Visits: 31,517
KoldCoffee (5/5/2014)
but….why the row numbers have to start as ZERO still eludes me…..


No problem. Let's start over there.

Let's make the problem much smaller. Let's say we have a start date/time of 1900-01-01 00:00 and we want to create just 4 times, starting at the start date/time, that have an increment of just 1 minute.

In the following, the CTE takes the place of ROW_NUMBER() just for this example. We want 4 date/times and ROW_NUMBER() always starts at 1, so the CTE has the values of 1, 2, 3, and 4 in it. Let's use those numbers as if they were minutes and add them to the start date/time and see what happens. Here's the code...

DECLARE @StartDT DATETIME;
SELECT @StartDT = '1900-01-01 00:00';

WITH
cte(RowNumber) AS (SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4)
SELECT StartDate = @StartDT
,RowNumber
,[@StartDT+RowNumberAsMinutesAdded] = DATEADD(mi,RowNumber,@StartDT)
FROM cte
;


And here are the results of that code...

StartDate               RowNumber   @StartDT+RowNumberAsMinutesAdded
----------------------- ----------- --------------------------------
1900-01-01 00:00:00.000 1           1900-01-01 00:01:00.000
1900-01-01 00:00:00.000 2           1900-01-01 00:02:00.000
1900-01-01 00:00:00.000 3           1900-01-01 00:03:00.000
1900-01-01 00:00:00.000 4           1900-01-01 00:04:00.000

(4 row(s) affected)



That looks fine. All of the date/times are exactly one minute apart. The only problem is, they're all off by 1. We wanted the first date/time (the one in the first row) to be the same as the start date/time.

What do we have to do to fix that?
.
.
.
.
.
Subtract 1 from the row number. Like this...

DECLARE @StartDT DATETIME;
SELECT @StartDT = '1900-01-01 00:00';

WITH
cte(RowNumber) AS (SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4)
SELECT StartDate = @StartDT
,RowNumber
,[@StartDT+RowNumberAsMinutesAdded] = DATEADD(mi,RowNumber,@StartDT)
,[RowNumber-1] = RowNumber-1
,[@StartDT+RowNumber-1AsMinutesAdded] = DATEADD(mi,RowNumber-1,@StartDT)
FROM cte
;


And now the first time (in the first row) starts at the start date/time just like we wanted. All of the rest of the dates also fall in line because we subtracted 1 from all of the row numbers.

StartDate               RowNumber @StartDT+RowNumberAsMinutesAdded RowNumber-1 @StartDT+RowNumber-1AsMinutesAdded
----------------------- --------- -------------------------------- ----------- ----------------------------------
1900-01-01 00:00:00.000 1         1900-01-01 00:01:00.000          0           1900-01-01 00:00:00.000
1900-01-01 00:00:00.000 2         1900-01-01 00:02:00.000          1           1900-01-01 00:01:00.000
1900-01-01 00:00:00.000 3         1900-01-01 00:03:00.000          2           1900-01-01 00:02:00.000
1900-01-01 00:00:00.000 4         1900-01-01 00:04:00.000          3           1900-01-01 00:03:00.000

(4 row(s) affected)



The bottom line is that we're using ROW_NUMBER() as the number of minutes that we want to add to the start date/time. If we want the first date/time returned to be the start date/time, then we need to add 0 to it. Since ROW_NUMBER() always starts at 1, we have to subtract 1 from it to get the 0. Everything else just falls in line.
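Put another way (a minimal sketch of the same arithmetic), adding N-1 minutes to the start date/time is the same as adding N minutes to a start date/time that has been moved back by one minute:

DECLARE @StartDT DATETIME = '1900-01-01 00:00';
SELECT A = DATEADD(mi, 1-1, @StartDT)               --Add (N-1) minutes for N = 1.
      ,B = DATEADD(mi, 1, DATEADD(mi,-1,@StartDT))  --Add N minutes to (start - 1 minute).
; --Both columns return 1900-01-01 00:00:00.000.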


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #1567764
Posted Tuesday, May 6, 2014 12:32 AM
SSChasing Mays


Group: General Forum Members
Last Login: Monday, August 25, 2014 11:12 AM
Points: 631, Visits: 1,470
Yes, I see now.
The 'n' in the second-to-last CTE of the original query that I posted, called Test(n), refers to the row_number values assigned to the n column in the CTE prior to that, and it is also what the DATEADD function uses to control the number of minutes to increment by.
So, if it is important that the first datetime stamp in the first row not increment at all, then the row_number in the first row needs to = 0.

I figure that at this rate of chugging through such problems, I will, at the age of 50, be a plausible thinker. To be the one to actually come up with these solutions would be awesome.
Post #1567794
Posted Tuesday, May 6, 2014 10:38 AM


SSC-Dedicated


Group: General Forum Members
Last Login: Yesterday @ 9:58 AM
Points: 36,995, Visits: 31,517
I ran out of time yesterday and will try to get to it tonight but some of the "solutions" for the original problem have some major flaws in them even though they produce the correct answer.

--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #1568086
Posted Tuesday, May 6, 2014 10:58 AM
SSChasing Mays


Group: General Forum Members
Last Login: Monday, August 25, 2014 11:12 AM
Points: 631, Visits: 1,470
OK. I'm interested.
Post #1568095
Posted Tuesday, May 6, 2014 11:16 PM


SSC-Dedicated


Group: General Forum Members
Last Login: Yesterday @ 9:58 AM
Points: 36,995, Visits: 31,517
KoldCoffee (5/6/2014)
OK. I'm interested.


In that case, here we go… let’s test the “Original” code, Scott’s code, and my humble submittal. With any luck at all, someone will come along with a method to beat all three methods but I’ll be able to get a point across, first.

Just as a side note, we’re not testing any Scalar or mTVF (Multi-Statement Table Valued Functions) or rCTE’s, or While Loops so we can get away with using SET STATISTICS to measure performance in these tests. For more information on how badly the use of SET STATISTICS can skew performance measurements, please see the following article. It’s worth the read for more than 1 reason even if I do say so myself.
http://www.sqlservercentral.com/articles/T-SQL/91724/
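As a rough sketch of an alternative when SET STATISTICS would skew things (for example, when scalar UDFs are involved), you can bracket the statement under test with your own timers:

DECLARE @StartTime DATETIME2(7) = SYSDATETIME();
--<<< statement under test goes here >>>
SELECT ElapsedMs = DATEDIFF(ms, @StartTime, SYSDATETIME())
;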

Original Code
Pick your favorite test database and run the following code making sure that the Actual Execution plan is turned off (because it does add to duration and has some overhead of its own). It will run the original code 3 times just to make sure we don’t hit a speed bump during testing.
PRINT '================================================================================';
PRINT '===== Original Test';
SET STATISTICS TIME, IO ON;
;WITH e1(n) AS(
SELECT * FROM (VALUES(1),(1),(1),(1),(1),(1),(1),(1),(1),(1))x(n)
),
e2(n) AS(
SELECT e1.n FROM e1, e1 x
),
e4(n) AS(
SELECT e2.n FROM e2, e2 x
),
e8(n) AS(
SELECT e4.n FROM e4, e4 x
),
cteTally(n) AS(
SELECT TOP 6307204 ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) - 1
FROM e8
),
Test(n) AS(
SELECT DATEADD( minute, n, DATEADD( YEAR, -2, GETDATE()))
FROM cteTally)
select DATEADD(mi,datediff(mi,0,n),0)as cte_start_date
INTO dbo.OriginalTest
FROM Test
WHERE n <= DATEADD( YEAR, 4, GETDATE())
SET STATISTICS TIME, IO OFF;
DROP TABLE dbo.OriginalTest;
GO 3


I checked the results in the table that it created (which is dropped in the code above) and we’re OK there. Here are the displayable run results.

Beginning execution loop
================================================================================
===== Original Test

SQL Server Execution Times:
CPU time = 2668 ms, elapsed time = 2669 ms.

(3155041 row(s) affected)
================================================================================
===== Original Test

SQL Server Execution Times:
CPU time = 2652 ms, elapsed time = 2726 ms.

(3155041 row(s) affected)
================================================================================
===== Original Test

SQL Server Execution Times:
CPU time = 2637 ms, elapsed time = 2694 ms.

(3155041 row(s) affected)
Batch execution completed 3 times.


Heh… you’ve just got to love these modern 64 bit, multiple high performance processor laptops with more memory than all my previous computers combined. My old 32 bit box with a gig of memory took more than 10 seconds to run the same thing. But, as they say, I digress. Back to the problem.

Turn on the Actual Execution plan and run the code again. Look at the arrows closely… especially the one just to the left of the left-most nested loop formed by the cascading CTE. If you click on that arrow, you’ll see an actual row count of 6,307,204, which is about twice as many rows as we need, and it’s because of that same hardcoded value in the code.

I won’t take the time to rewrite the code but if we change that large number to 3,155,041, which is precisely the number of rows we need, here are the run results.

Beginning execution loop
================================================================================
===== Original Test

SQL Server Execution Times:
CPU time = 2043 ms, elapsed time = 2093 ms.

(3155041 row(s) affected)
================================================================================
===== Original Test

SQL Server Execution Times:
CPU time = 1966 ms, elapsed time = 2005 ms.

(3155041 row(s) affected)
================================================================================
===== Original Test

SQL Server Execution Times:
CPU time = 1965 ms, elapsed time = 1999 ms.

(3155041 row(s) affected)
Batch execution completed 3 times.


Not bad. We got about a 34% increase in performance from that. We can do better, though.
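As a side note, the 3,155,041 doesn’t have to stay hardcoded; it can be calculated the same way the code calculates the dates (a minimal sketch):

--Minutes from 2 years ago to 4 years from now, plus 1 for the starting row.
SELECT NeededRows = DATEDIFF(mi, DATEADD(yy,-2,GETDATE()), DATEADD(yy,4,GETDATE())) + 1
;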

Scott’s Code
Let’s take a look at Scott’s code next. And, no... I'm absolutely not picking on Scott. He's using a pattern that I've seen dozens of times before from other people, and I just want to take the opportunity to tell people to be extremely leery of such code.

Again, pick your favorite test database and run the following code making sure that the Actual Execution plan is turned off.
PRINT '================================================================================';
PRINT '===== Scott Test';
SET STATISTICS TIME, IO ON;
;WITH
cteDigits AS (
SELECT 0 AS digit UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL
SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9
),
cteTally AS (
SELECT
[1000000s].digit * 1000000 + [100000s].digit * 100000 + [10000s].digit * 10000 +
[1000s].digit * 1000 + [100s].digit * 100 + [10s].digit * 10 + [1s].digit AS minutes_increment
FROM cteDigits [1s]
CROSS JOIN cteDigits [10s]
CROSS JOIN cteDigits [100s]
CROSS JOIN cteDigits [1000s]
CROSS JOIN cteDigits [10000s]
CROSS JOIN cteDigits [100000s]
CROSS JOIN cteDigits [1000000s]
WHERE
[1000000s].digit * 1000000 + [100000s].digit * 100000 + [10000s].digit * 10000 +
[1000s].digit * 1000 + [100s].digit * 100 + [10s].digit * 10 + [1s].digit
--limit rows generated to 6 yrs' worth of minutes
<= DATEDIFF(MINUTE, '19000101', '19060101')
),
cteTime_by_minute AS (
SELECT DATEADD(MINUTE, minutes_increment, DATEADD(YEAR, -2, DATEADD(MINUTE, DATEDIFF(MINUTE, 0, GETDATE()), 0)))
AS time_by_minute
FROM cteTally t
)
SELECT time_by_minute
INTO dbo.ScottTest
FROM cteTime_by_minute
SET STATISTICS TIME, IO OFF;
DROP TABLE dbo.ScottTest;
GO 3


Here are the displayable results…

Beginning execution loop
================================================================================
===== Scott Test
Table 'Worktable'. Scan count 2, logical reads 82316, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 11529 ms, elapsed time = 4834 ms.

(3155041 row(s) affected)
================================================================================
===== Scott Test
Table 'Worktable'. Scan count 2, logical reads 82316, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 11762 ms, elapsed time = 4939 ms.

(3155041 row(s) affected)
================================================================================
===== Scott Test
Table 'Worktable'. Scan count 2, logical reads 82316, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 11731 ms, elapsed time = 4946 ms.

(3155041 row(s) affected)
Batch execution completed 3 times.


The first question to answer is: how can CPU time be longer than elapsed time? The answer is “Parallelism”. If you turn on the Actual Execution plan and run the code again, you’ll see multiple areas where parallelism came into play.

The code took almost twice as long to execute and used almost 4 times the amount of CPU to do the same job. Why did it take so long and use so much CPU? If you look at the left-most nested loop in the execution plan, you’ll see a rather large arrow coming up to it from below and that large arrow has 10,000,000 rows in it. Even though those rows aren’t exposed externally, they do take time to build even on today’s fast machines.

Another problem: where did the 82,316 reads come from? (Those are PAGES of I/O, which adds up to 674,332,672 bytes.) Someone like Grant Fritchey, Paul White, or Gail Shaw might be able to explain why it formed, but the reads come from the “Lazy Table Spool” at the bottom of that same large arrow.
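For reference, the byte figure is just the page count times the 8KB page size:

SELECT ReadBytes = 82316 * 8192 --82,316 pages x 8,192 bytes per page = 674,332,672 bytes.
;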

Why did all of this happen? Because the code has no way of limiting itself to the maximum number it actually needs to count to. It first calculates all 10 million numbers and then filters out all but the roughly 3 million that we actually need.

What’s the fix for this code? Rewrite it so it doesn’t calculate all 10 million rows and I don’t mean simply by limiting the largest number it can count to and still filtering out what we don’t want.
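One possible direction (just a sketch, not the approach I’m about to show): keep the digit CTE but let TOP stop the cross join as soon as enough rows have been produced, instead of computing and filtering all 10 million values.

DECLARE @Rows INT = DATEDIFF(MINUTE, '19000101', '19060101') + 1;
WITH cteDigits AS (
    SELECT digit FROM (VALUES(0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) d(digit)
)
SELECT TOP (@Rows)
       minutes_increment = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1
  FROM cteDigits a CROSS JOIN cteDigits b CROSS JOIN cteDigits c CROSS JOIN cteDigits d
 CROSS JOIN cteDigits e CROSS JOIN cteDigits f CROSS JOIN cteDigits g
;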

Jeff’s Code
I believe it was Scott who said that code should be mostly self-documenting and that names should be used to properly represent what is being done and what each object contains. I absolutely agree with that, but I also believe in the power of well written comments because they just make life easier all the way around. Yes, you can read the code to find out what’s going on (or not... the cascading CTE, for example, would take anyone a bit of time to figure out if they've never seen it before), but it’s so much easier to first find a comment that identifies the section of code that addresses a particular problem and then read just that code.

For functions, I document them as if I were writing them into Books Online. For procs, I believe that you should be able to remove all of the code and the comments that remain should be good enough for a functional flowchart that someone could rewrite the code from. Of course, that’s just my opinion. Some folks think that comments get in the way. I’ll let you be the judge of that… well, unless you work in my shop, and then you’ll need to get really good at commenting code.

Being an old front-end programmer from long ago (hung those shoes up in 2002), I still believe in making life easy through the use of proper encapsulation and abstraction. With that in mind, I built a Tally-Table-like function (others have done similar... not my original idea) using a version of Itzik Ben-Gan’s fine cascading CTE’s to do the job. I’m also one that doesn’t believe in “Hungarian Notation” prefixes for most things but, since I also keep a Tally Table active in the same databases, I couldn’t call the function just “Tally”. I also didn’t want to change the name much because I didn’t want it to be difficult to remember so, I broke my “Hungarian Notation” rule for this one function and called it “fnTally”. It also allows this old sailor to have a bit of fun without it being an HR violation. When someone has a performance problem with some form of sequence code, I ask them if they used a Recursive CTE, While Loop, or Cursor as the sequence generator. If they say “yes”, then I can say “If you used the fnTally function like I said, you wouldn’t have that problem”. It allows me to vent without getting into trouble with HR. If you don't get that joke, it's ok. Thankfully, HR doesn't either.

On with the show. Here’s the fnTally function that I currently use in most of my databases. We do need it for the code that follows. I’d tell you how to use it and what the inputs mean and some things to watch out for and how to get around those things but, guess what? IT’S COMMENTED!

SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE FUNCTION [dbo].[fnTally]
/**********************************************************************************************************************
Purpose:
Return a column of BIGINTs from @ZeroOrOne up to and including @MaxN with a max value of 1 Billion.

As a performance note, it takes about 00:02:10 (hh:mm:ss) to generate 1 Billion numbers to a throw-away variable.

Usage:
--===== Syntax example (Returns BIGINT)
SELECT t.N
FROM dbo.fnTally(@ZeroOrOne,@MaxN) t
;

Notes:
1. Based on Itzik Ben-Gan's cascading CTE (cCTE) method for creating a "readless" Tally Table source of BIGINTs.
Refer to the following URL for how it works. http://sqlmag.com/sql-server/virtual-auxiliary-table-numbers
2. To start a sequence at 0, @ZeroOrOne must be 0 or NULL. Any other value that's convertable to the BIT data-type
will cause the sequence to start at 1.
3. If @ZeroOrOne = 1 and @MaxN = 0, no rows will be returned.
5. If @MaxN is negative or NULL, a "TOP" error will be returned.
6. @MaxN must be a positive number from >= the value of @ZeroOrOne up to and including 1 Billion. If a larger
number is used, the function will silently truncate after 1 Billion. If you actually need a sequence with
that many values, you should consider using a different tool.
7. There will be a substantial reduction in performance if "N" is sorted in descending order. If a descending
sort is required, use code similar to the following. Performance will decrease by about 27% but it's still
very fast especially compared with just doing a simple descending sort on "N", which is about 20 times slower.
If @ZeroOrOne is a 0, in this case, remove the "+1" from the code.

DECLARE @MaxN BIGINT;
SELECT @MaxN = 1000;
SELECT DescendingN = @MaxN-N+1
FROM dbo.fnTally(1,@MaxN);

8. There is no performance penalty for sorting "N" in ascending order because the output is explicity sorted by
ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

Revision History:
Rev 00 - Unknown - Jeff Moden
- Initial creation with error handling for @MaxN.
Rev 01 - 09 Feb 2013 - Jeff Moden
- Modified to start at 0 or 1.
Rev 02 - 16 May 2013 - Jeff Moden
- Removed error handling for @MaxN because of exceptional cases.
**********************************************************************************************************************/
(@ZeroOrOne BIT, @MaxN BIGINT)
RETURNS TABLE WITH SCHEMABINDING AS
RETURN WITH
E1(N) AS (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1) --10E1 or 10 rows
, E3(N) AS (SELECT 1 FROM E1 a, E1 b, E1 c) --10E3 or 1 Thousand rows
, E9(N) AS (SELECT 1 FROM E3 a, E3 b, E3 c) --10E9 or 1 Billion rows
SELECT N = 0 WHERE ISNULL(@ZeroOrOne,0)= 0 --Conditionally start at 0.
UNION ALL
SELECT TOP(@MaxN) N = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E9 -- Values from 1 to @MaxN
;
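GO --End of the function definition (needed before running anything else in the same batch).
--===== Quick smoke test: a minimal example, assuming the function above has been created.
SELECT t.N FROM dbo.fnTally(0, 9) t --Returns N = 0 through 9 as 10 rows.
;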


And here’s my rendition of the code to solve this problem. I tend to calculate any “run time” constants up front; it allows some simplification and keeps all such things in one place. Run the code the same way you did the others, making sure that the Actual Execution plan is turned off.
PRINT '================================================================================';
PRINT '===== Jeff Test';
SET STATISTICS TIME, IO ON;
WITH
cteStartMinute AS
( --=== Precalculate the start date/time of a whole minute two years ago.
-- The "0s" are the same as '1900-01-01'.
SELECT StartDTMinute = DATEADD(yy,-2,DATEADD(mi,DATEDIFF(mi,0,GETDATE()),0))
)
,cteMinuteSpan AS
( --=== Precalculate how many minutes there are from 2 years ago to 4 years from now (6 years total).
SELECT StartDtMinute
,Minutes = DATEDIFF(mi,StartDTMinute,DATEADD(yy,6,StartDTMinute))
FROM cteStartMinute
)
--===== Use the fnTally function to provide the number of minutes to add.
-- Note that we're using the function in the "start at 0" mode.
SELECT DtByMinute = DATEADD(mi, t.N, m.StartDTMinute)
INTO dbo.JeffTest
FROM cteMinuteSpan m
CROSS APPLY (SELECT N FROM dbo.fnTally(0,m.Minutes)) t (N)
;
SET STATISTICS TIME, IO OFF;
DROP TABLE dbo.JeffTest;
GO 3


Here are the displayable results…

Beginning execution loop
================================================================================
===== Jeff Test

SQL Server Execution Times:
CPU time = 1388 ms, elapsed time = 1412 ms.

(3155041 row(s) affected)
================================================================================
===== Jeff Test

SQL Server Execution Times:
CPU time = 1389 ms, elapsed time = 1427 ms.

(3155041 row(s) affected)
================================================================================
===== Jeff Test

SQL Server Execution Times:
CPU time = 1342 ms, elapsed time = 1386 ms.

(3155041 row(s) affected)
Batch execution completed 3 times.


OK… big deal, right? On today’s machines, the differences across more than 3 million rows are measured as a couple of seconds.

We wrote code that runs 2 to almost 4 times faster, uses up to 89% less CPU, produces no reads, and is a bit shorter (and, hopefully, easier to read as well), which also means it’s easier to modify or troubleshoot in the future. And if someone looking for similar code finds this code and uses it for something that will be called tens of thousands of times per day, how much of a machine would you need to handle dozens of similar routines?

Like Granny used to say, “Mind the pennies and the dollars will take care of themselves.”


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #1568276
Posted Wednesday, May 7, 2014 8:30 AM
SSCrazy


Group: General Forum Members
Last Login: Friday, August 29, 2014 2:25 PM
Points: 2,044, Visits: 3,059
Jeff Moden (5/6/2014)
KoldCoffee (5/6/2014)
OK. I'm interested.


...



I openly stated that the intent of the code I posted was not pure efficiency:
Of course you may prefer to adjust the CROSS JOINs for efficiency and/or adjust the names, but make the names meaningful.


But with awful names like "N", "E", etc., reams of comments have to be carried along as well, and they must be re-read every time the code is analyzed by anyone. Moreover, all developers know that ultimately you can never trust comments; you must rely on the code itself.


SQL DBA,SQL Server MVP('07, '08, '09)
"In America, every man is innocent until proven broke!" Brant Parker
Post #1568505
Posted Wednesday, May 7, 2014 9:57 AM
SSChasing Mays


Group: General Forum Members
Last Login: Monday, August 25, 2014 11:12 AM
Points: 631, Visits: 1,470
Hey Scott, I get it. You were trying to show how different syntax might yield the same results... perhaps to expose the underlying logic of the original query I posted in the opening post. But I'm still working through Moden's post, and for me it's eye opening to see the impact of choosing one syntax over another. Crazy how much SQL you have to know to be thinking about performance and syntax at the same time (I never think about performance, really and truly)... very helpful for me in the long run.
Where I've worked, just getting at the data and bringing it to the light of day has been the rewarded effort. I have yet to encounter a developer who will tell me that my queries totally suck and ask why I didn't write them optimally. I can't wait... :-(
Post #1568563