Sins of SQL: The Time Table

  • Special note: I found no real appreciable difference between the following:

    select * into #Test2 from dbo.fn_GetTimestampRangeRCTE ('2006-01-01','2006-02-01', 0) OPTION (MAXRECURSION 0);

    select * into #Test2 from dbo.fn_GetTimestampRangeRCTE ('2006-01-01','2006-02-01', 0) OPTION (MAXRECURSION 60);
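
    That result is expected: MAXRECURSION is purely a safety cap on recursion depth, not a performance hint, so any limit at or above the depth the query actually reaches behaves the same way. A quick illustrative sketch (not from the article) of what the option actually does:

    -- MAXRECURSION only limits how deep the recursion may go; it doesn't change the plan.
    -- This statement fails once the 11th recursion level is attempted, whereas
    -- MAXRECURSION 0 (or any value >= 99) lets it complete.
    ;WITH Seq (n) AS (
        SELECT 1
        UNION ALL
        SELECT n + 1 FROM Seq WHERE n < 100
    )
    SELECT n FROM Seq
    OPTION (MAXRECURSION 10);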

  • Lynn Pettis (6/6/2010)


    Actually, inefficiency is not relative; it may just not be noticeable. Inefficient is still inefficient if there is another means to generate the same results that is better.

    I never said this was the most efficient method - but even your examples for processing 100 years of data show gains that would only ever be realized in an extreme scenario, and that's assuming you loaded it all at once. A few minutes per century translates to a sub-second difference per day, so you'll understand why I'm not racing to rewrite my ETL packages. I'm not trying to knock the achievement - you should be pleased with your optimisations - but they're not ones I've ever had to seek.

    Care to prove your statement?

    I was demonstrating that the recursion is not at the lower levels, merely at the day-to-day level. As you know, the MAXRECURSION option prevents any CTE from recursing more than the specified limit. By setting MAXRECURSION to 60 I was demonstrating that, rather than this being massively recursive (as has been implied), the actual recursion is split into two smaller recursive elements: the generation of the NumbersCTE, with values 0-60, and the iteration over the dates. The only recursion after the first 'day' is processed is the DATEADD step.

    If you review the query plan for the statement post-execution, you'll see that the branch of the plan that generates the table spool for the HH:MM:SS divisions is executed once ('Number of Executions'), but the spool is read many times by a Nested Loops operator. So whilst it's not as fast, it's nowhere near as inefficient as some seemed to imply (the general shape is sketched at the end of this post). At least you've been a good sport and put it into SSMS to try it.

    I'm pretty sure that my routine (with a slight modification, thanks to you allowing me to see an inefficiency in my code, which I have fixed in my blog) will consistently beat your code in head-to-head competition.

    As per above. I don't dispute that yours will be faster, but your method only gives a benefit of around 20 minutes saved for each century of data. You'd never load that much data at once in practice, and because the degradation is non-linear the benefit is lower as you scale down the range (I tend to populate incrementally each day). The difference I've found when feeding both your example and my own into SSIS for a daily population is that both take more or less a second to run.
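
    To make that concrete, the pattern being described (a small recursive NumbersCTE for the within-day divisions plus a day-by-day DATEADD step) looks roughly like the sketch below. This is illustrative only, not the article's actual fn_GetTimestampRangeRCTE; the parameters and the per-minute granularity are assumptions made for the example.

    -- Sketch only: a 60-level recursive Numbers CTE supplies the HH:MM divisions,
    -- while the Days CTE recursion only steps once per day via DATEADD.
    DECLARE @Start datetime = '2006-01-01',
            @End   datetime = '2006-02-01';

    WITH Numbers (n) AS (
        SELECT 0
        UNION ALL
        SELECT n + 1 FROM Numbers WHERE n < 59
    ),
    Days (d) AS (
        SELECT @Start
        UNION ALL
        SELECT DATEADD(day, 1, d) FROM Days WHERE DATEADD(day, 1, d) < @End
    )
    SELECT DATEADD(minute, h.n * 60 + m.n, dy.d) AS ts
    FROM Days AS dy
    CROSS JOIN Numbers AS h
    CROSS JOIN Numbers AS m
    WHERE h.n < 24                      -- hours 0-23, minutes 0-59
    OPTION (MAXRECURSION 60);           -- neither recursive element goes deeper than 60

    For January 2006 that yields 31 x 24 x 60 = 44,640 rows from no more than sixty recursive steps in either element, which is the point being made above.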

  • Steve,

    Unfortunately, you are using the same logic everyone else uses to defend the use of inefficient code: "It works for my case." Problem is, it is still inefficient. Recursion is by nature RBAR, Row By Agonizing Row. My routine eliminates both the recursion and the IO generated by the use of a worktable (a non-recursive pattern of that general kind is sketched below).

    Also, you seem to be indicating that using OPTION (MAXRECURSION 60) makes the call to your routine more efficient than OPTION (MAXRECURSION 0), which it doesn't. I ran multiple tests using both and they were comparable; in fact, the OPTION (MAXRECURSION) version was actually faster in many cases (although I only ran about 10 runs of each).

    Someone is going to look at your code and say, "Hey, that is exactly what I need for what I am working on!", when in actuality it isn't, because what they may be doing is against several hundred thousand rows or more with a high frequency of usage.
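
    The routine referred to above isn't reproduced in this thread, but a non-recursive, tally-style generator of the general kind being described might look something like this. It is a sketch only: the @Start/@End parameters, the per-second granularity and the cross-joined row source are assumptions for illustration, not the actual code from the blog.

    -- Sketch: a cascaded cross-join builds the row numbers with no recursion and
    -- no worktable; DATEADD then turns each number into a timestamp.
    DECLARE @Start datetime = '2006-01-01',
            @End   datetime = '2006-02-01';

    WITH E1(n) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) AS t(n)),
         E2(n) AS (SELECT 1 FROM E1 a CROSS JOIN E1 b),   --         100 rows
         E4(n) AS (SELECT 1 FROM E2 a CROSS JOIN E2 b),   --      10,000 rows
         E8(n) AS (SELECT 1 FROM E4 a CROSS JOIN E4 b),   -- 100,000,000 rows (max)
         Tally(n) AS (
             SELECT TOP (DATEDIFF(second, @Start, @End))
                    ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1
             FROM E8
         )
    SELECT DATEADD(second, CAST(n AS int), @Start) AS ts
    FROM Tally;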

  • Lynn Pettis (6/6/2010)


    Steve,

    Unfortunately, you are using the same logic everyone else uses to defend the use of inefficient code: "It works for my case." Problem is, it is still inefficient. Recursion is by nature RBAR, Row By Agonizing Row. My routine eliminates both the recursion and the IO generated by the use of a worktable.

    When considering optimisation, I consider how often the code is called, the trade-off of complexity for performance, and other factors.

    Also, you seem to be indicating that using OPTION (MAXRECURSION 60) makes the call to your routine more efficient than OPTION (MAXRECURSION 0), which it doesn't. I ran multiple tests using both and they were comparable; in fact, the OPTION (MAXRECURSION) version was actually faster in many cases (although I only ran about 10 runs of each).

    I never said that it made it faster; I used it to demonstrate that the recursion is nowhere near as deep as you and others implied, producing a 40,000-row result set with only sixty iterative steps in the example.

    Someone is going to look at your code and say, "Hey, that is exactly what I need for what I am working on!", when in actuality it isn't, because what they may be doing is against several hundred thousand rows or more with a high frequency of usage.

    This code can produce a century's worth of date-time values in half an hour, as demonstrated by your own tests. Unless they're constantly re-computing centuries' worth of data (which would lend itself to a persistent table rather than a dynamic rowset), they're not going to get much of a win for the effort.

  • Steven James Gray (6/6/2010)


    Lynn Pettis (6/6/2010)


    Steve,

    Unfortunately, you are using the same logic everyone else uses to defend the use of inefficient code: "It works for my case." Problem is, it is still inefficient. Recursion is by nature RBAR, Row By Agonizing Row. My routine eliminates both the recursion and the IO generated by the use of a worktable.

    When considering optimisation, I consider how often the code is called, the trade-off of complexity for performance, and other factors.

    Also, you seem to be indicating that using OPTION (MAXRECURSION 60) makes the call to your routine more efficient than OPTION (MAXRECURSION 0), which it doesn't. I ran multiple tests using both and they were comparable; in fact, the OPTION (MAXRECURSION) version was actually faster in many cases (although I only ran about 10 runs of each).

    I never said that it made it faster; I used it to demonstrate that the recursion is nowhere near as deep as you and others implied, producing a 40,000-row result set with only sixty iterative steps in the example.

    Someone is going to look at your code and say, "Hey, that is exactly what I need for what I am working on!", when in actuality it isn't, because what they may be doing is against several hundred thousand rows or more with a high frequency of usage.

    This code can produce a century's worth of date-time values in half an hour, as demonstrated by your own tests. Unless they're constantly re-computing centuries' worth of data (which would lend itself to a persistent table rather than a dynamic rowset), they're not going to get much of a win for the effort.

    Steve,

    You really are missing the entire point. I am not disputing what the function does but HOW it does it. Your use of a recursive CTE is inefficient REGARDLESS of the application of the overall code in this regard. What, may I ask, is so complex about my code in comparison to yours? Is it how I have formatted the code?

  • Lynn Pettis (6/6/2010)


    Steve,

    You really are missing the entire point. I am not disputing what the function does but HOW it does it. Your use of a recursive CTE is inefficient REGARDLESS of the application of the overall code in this regard. What, may I ask, is so complex about my code in comparison to yours? Is it how I have formatted the code?

    I'm not missing the point. The benefit you will extract from said code improvements over your lifetime has likely been exceeded by the energy you've exhausted debating and examining the matter. If this were code I was recommending for a high-volume transactional OLTP application, I'd likely be more enthused about a performance debate, and put as much energy into that aspect as I have with the other areas of discussion. It's not, so I won't.

    I look forward to our next discussion.

  • Lynn Pettis (6/1/2010)


    Its lack of scalability is its biggest problem. A recursive CTE is still RBAR (a Modenism for Row By Agonizing Row).

    Here's a little comparison between the "quirky update" and the rCTE which supports most folks' observation that the rCTE is slower. However, I'd disagree with you, Lynn, that the rCTE fails to scale: the "quirky update" is 5-7 times faster whether the dataset is 100k rows or 1 million rows. Sure, it's slower - but it does scale.

    Here's the test with 1 million rows:

    DROP TABLE #Numbers

    SELECT TOP 1000000 --000
           n = ROW_NUMBER() OVER (ORDER BY a.name),
           CalcValue = CAST(NULL AS BIGINT)
    INTO #Numbers
    FROM master.dbo.syscolumns a, master.dbo.syscolumns b

    CREATE UNIQUE CLUSTERED INDEX CIn ON #Numbers ([n] ASC)

    SET STATISTICS IO ON
    SET STATISTICS TIME ON

    -- 'Quirky' update
    DECLARE @Lastval INT = 0, @CalcValue BIGINT

    UPDATE #Numbers SET
           @CalcValue = CalcValue = (@Lastval + n),
           @Lastval = n
    -- (1,000,000 row(s) affected) / CPU time = 4218 ms, elapsed time = 5719 ms.
    -- Table #Numbers... Scan count 1, logical reads 3113, physical reads 6, read-ahead reads 3146, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    -- Recursive CTE
    ;WITH Calculator (n, CalcValue) AS (
        SELECT n.n,
               CalcValue = CAST(n.n AS BIGINT)
        FROM #Numbers n
        WHERE n.n = 1
        UNION ALL
        SELECT n.n,
               CalcValue = n.n + c.n
        FROM #Numbers n
        INNER JOIN Calculator c ON c.n + 1 = n.n -- nice
    )
    SELECT n, CalcValue
    FROM Calculator
    OPTION (MAXRECURSION 0)
    -- (1,000,000 row(s) affected) / CPU time = 32438 ms, elapsed time = 35148 ms.
    -- Table 'Worktable'. Scan count 2, logical reads 6000001, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SET STATISTICS IO OFF
    SET STATISTICS TIME OFF

    And with 100k rows:

    DROP TABLE #Numbers

    SELECT TOP 100000
           n = ROW_NUMBER() OVER (ORDER BY a.name),
           CalcValue = CAST(NULL AS BIGINT)
    INTO #Numbers
    FROM master.dbo.syscolumns a, master.dbo.syscolumns b

    CREATE UNIQUE CLUSTERED INDEX CIn ON #Numbers ([n] ASC)

    SET STATISTICS IO ON
    SET STATISTICS TIME ON

    -- 'Quirky' update
    DECLARE @Lastval INT = 0, @CalcValue BIGINT

    UPDATE #Numbers SET
           @CalcValue = CalcValue = (@Lastval + n),
           @Lastval = n
    -- (100000 row(s) affected) / CPU time = 454 ms, elapsed time = 526 ms.
    -- Table #Numbers... Scan count 1, logical reads 314, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    -- Recursive CTE
    ;WITH Calculator (n, CalcValue) AS (
        SELECT n.n,
               CalcValue = CAST(n.n AS BIGINT)
        FROM #Numbers n
        WHERE n.n = 1
        UNION ALL
        SELECT n.n,
               CalcValue = n.n + c.n
        FROM #Numbers n
        INNER JOIN Calculator c ON c.n + 1 = n.n -- nice
    )
    SELECT n, CalcValue
    FROM Calculator
    OPTION (MAXRECURSION 0)
    -- (100000 row(s) affected) / CPU time = 3203 ms, elapsed time = 3483 ms.
    -- Table 'Worktable'. Scan count 2, logical reads 600001, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SET STATISTICS IO OFF
    SET STATISTICS TIME OFF

    Whichever measure of "resource used" you choose for comparison, both methods appear to scale in a linear manner.

    If you've absolutely got to do a job row by row, such as a running total (a minimal sketch follows this post), then a quirky update is virtually guaranteed to complete before an rCTE. That's not the whole picture, though. The rCTE gives you output which you may have to write back; the quirky update does the opposite. The rCTE is often quicker to write with a little practice, too. I use both methods, sometimes even writing both for a particular job and then choosing one or the other on merit - which might be readability over performance.

    Cheers

    ChrisM

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
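
    For reference, the running-total case mentioned above uses the same "quirky update" shape as the test, just accumulating into the variable. A minimal sketch against the same #Numbers table follows; the commonly cited safety conditions for this technique (update driven by the clustered index order, TABLOCKX, MAXDOP 1, and the rest) still apply, and it remains undocumented behaviour.

    -- Minimal running-total sketch (overwrites CalcValue with a cumulative sum).
    DECLARE @RunningTotal BIGINT = 0;

    UPDATE #Numbers WITH (TABLOCKX) SET
           @RunningTotal = CalcValue = @RunningTotal + n
    OPTION (MAXDOP 1);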

  • The other point not mentioned is that, for DimDate and DimTime, the real reason, over and above the performance of joining via an int (and typically not going via DimDate either), is the additional information required.

    Almost all places have a financial calendar that is NOT the normal calendar, and having both of these in the table is essential.

    Same for time: shifts are NOT (legislatively, in most countries) 24 hours long, and they don't match up to the 24-hour period anyway (the sort of attributes involved are sketched below).
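
    For illustration, the sort of additional attributes being described might look like the sketch below. The column names are hypothetical; no particular warehouse schema is implied.

    -- Hypothetical DimDate/DimTime shells carrying the "extra information":
    -- a financial calendar that does not match the normal calendar, and shift
    -- attributes that do not line up with the 24-hour clock.
    CREATE TABLE dbo.DimDate (
        DateKey       int      NOT NULL PRIMARY KEY,  -- e.g. 20060101
        CalendarDate  date     NOT NULL,
        CalendarYear  smallint NOT NULL,
        CalendarMonth tinyint  NOT NULL,
        FiscalYear    smallint NOT NULL,
        FiscalPeriod  tinyint  NOT NULL,
        FiscalWeek    tinyint  NOT NULL
    );

    CREATE TABLE dbo.DimTime (
        TimeKey   int         NOT NULL PRIMARY KEY,   -- e.g. seconds since midnight
        TimeOfDay time(0)     NOT NULL,
        HourOfDay tinyint     NOT NULL,
        ShiftName varchar(20) NOT NULL
    );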

  • Hi Steve,

    Great article on the fn_gettimerangestamp. I've been using a date table for some time, at a few people's suggestion, for a variety of issues at work. I'm not in IT, just a manager with limited resources and a bit of a SQL aficionado. Anyway, I'm in the process of setting up a reporting database/cube and I'm definitely going to try to make use of your function. I work in transit and everything is measured in seconds (i.e. revenue hours, deadhead hours, layover hours, etc. all need to be calculated to at least the minute, but are stored in seconds in the source database). So this will be a godsend if it helps improve performance.

    I was wondering if you've ever had any ideas on setting up timetables (not in the SQL sense, but schedules). I've been working on pivoting times on routes using SQL's PIVOT, but it's a bit cumbersome (something along the lines of the sketch after this post). One of the wrinkles is that I have fixed routes with fixed patterns, but routes will sometimes start service mid-trip, so the earliest time is not actually at the normal starting point. The other issue is that there are a lot of repeated stops when a route does a loop.

    I can provide more background if you've worked on anything like this - just curious.

    Thanks for the great info!

    Craig
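
    To illustrate the kind of pivot being described, here is a toy sketch with invented route/stop data; every table and column name here is hypothetical. Pivoting on a stop sequence number rather than on the stop itself is one way to cope with repeated stops on a looping route.

    -- Toy example: one column per stop position in the trip pattern.
    CREATE TABLE #StopTimes (
        RouteId  int     NOT NULL,
        TripId   int     NOT NULL,
        StopSeq  int     NOT NULL,   -- position in the pattern, so repeated stops stay distinct
        StopTime time(0) NOT NULL
    );

    INSERT INTO #StopTimes (RouteId, TripId, StopSeq, StopTime)
    VALUES (1, 101, 1, '06:00'), (1, 101, 2, '06:12'), (1, 101, 3, '06:25'),
           (1, 102, 1, '06:30'), (1, 102, 2, '06:42'), (1, 102, 3, '06:55');

    SELECT RouteId, TripId, [1] AS Stop1, [2] AS Stop2, [3] AS Stop3
    FROM (SELECT RouteId, TripId, StopSeq, StopTime FROM #StopTimes) AS src
    PIVOT (MIN(StopTime) FOR StopSeq IN ([1], [2], [3])) AS p
    ORDER BY RouteId, TripId;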
