Sins of SQL: The Time Table

Posted Friday, June 4, 2010 12:10 AM
Valued Member

But to combine date and time just increases the sparsity of the multidimensional database along a dimension that is most likely going to be used to partition a database of any size whatsoever.

The only change this would require for partitioning is that the partitions are built off the date/time values themselves, not arbitrary key ranges in the time table.


It would just be a huge mistake to do so. As said earlier, a combined date/time dimension for 10 years' worth of data at minute resolution results in a dimension with over 5 million (where'd I get 13M from?) stored members (3,650 days x 1,440 minutes is roughly 5.3 million), while separate dimensions result in a date dimension with 3,650 members and a time dimension with 1,440 members. I don't need benchmarks to tell me the separate-dimension approach will best the consolidated-dimension approach in every respect: dimension processing, aggregation processing, ease of maintenance. Everything.

If you have separate date/time dimensions, the number of possible cells in the multidimensional space that need storing individually will be the same, since the conceptual space is the cross product of the dimensions. To give another example: say I have a table of products with 8 things in it, a table of 100 customers, and 5 stores. A cube storing the sums may keep 4,000 distinct entries. If two of the dimensions are merged to form a hierarchy, since, perhaps, each customer has a 1:1 relationship with a store, then you'd end up with a Customer (by store) dimension of cardinality 500. This gives, likewise, a dimensional space with up to 5000 cells.

The same principle holds here for date/time. Whether you have a single dimension or two, the maximum number of values at the lowest level is absolutely the same. When it comes to aggregations etc., you'd only lose out on aggregation design if you needed to group by time values independently of days - and you can keep things as simple or complex internally by using the appropriate hierarchy definitions that suit your querying needs.

The only argument I've heard that seems to be based on anything is the notion that the large number of members overall in the dimension may cause slowness as the members of the hierarchy are resolved - although I'll admit that my tests with around 1B fact rows haven't shown any demonstrable problems when linking them to a pseudo-time dimension covering a range of ten years. Each day the next few records are incrementally loaded into the date dimension and the day's data is pushed in shortly thereafter - it runs just as quickly as it did before. Part of the reason, I suspect, is that SSAS does not linearly scan a dimension looking for the appropriate key when loading data, but uses a tree. The effort of searching one tree of combined values, versus two smaller trees, is not materially higher - any basic algorithms class will tell you that (doubling the amount of data in a B+ tree/binary tree does not double search time, since lookup performance is logarithmic).
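To put a number on that: a balanced-tree lookup over the n x m combined members costs O(log(n*m)) = O(log n + log m), which is the same asymptotic work as doing one lookup in each of the two smaller trees - so combining the dimensions doesn't change the lookup cost class at all.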


But wait. If I convert a datetime type to an int [CAST(FactDate AS int)], excluding the time portion (assumes it's AM), it gives me the number of days elapsed since 1/1/1900. Hey, why not use that as the surrogate key of your date dimension! (Which is exactly what I and others do.) I don't even have to do a lookup for a date key at all in my ETL, and if I haven't already, I could create a calculated column in my DSV. Similarly, I could use the number of minutes (or seconds) that have elapsed since midnight as the surrogate key of my time dimension so that I don't have to do a lookup for the Time Key either.
(And I could still create a calculated column for that key in my DSV as well.)


That approach is valid and workable, and one I've used myself where my reporting periods cover all possible date/time combinations. If you look at how I recommended things in the article, you'll notice this is precisely the mechanism by which the milliseconds etc. are normalized out of the data.
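For illustration, a minimal T-SQL sketch of that key derivation (the fact table dbo.FactSales and the column FactDate are hypothetical names, not taken from the article):

SELECT
    FactDate,
    -- Whole days since 1900-01-01; unlike CAST(FactDate AS int), DATEDIFF truncates
    -- rather than rounding up at noon, so no "assumes it's AM" caveat is needed.
    DATEDIFF(dd, 0, FactDate) AS DateKey,
    -- Minutes elapsed since midnight of the same day.
    DATEDIFF(minute, DATEADD(dd, DATEDIFF(dd, 0, FactDate), 0), FactDate) AS TimeKey
FROM dbo.FactSales;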


(And not nearly as "evil" as the CTE Steve proposes as the replacement for the time dimension.)


I've yet to see any demonstration that the CTE itself is unsuitable for synthesising the data on-demand - it's not the fastest mechanism (and I've never claimed it would be), but it's also not an order of magnitude worse. Like procedural texture generation in 3D modelling systems, it trades some performance for the ability to switch to arbitrary levels of granularity by modifications to a single area.


So to summarize, I still would give this article zero stars. I think it subtracts from, rather than adds to, the body of knowledge.


To each their own.
Post #932596
Posted Friday, June 4, 2010 9:35 AM
Grasshopper


If you have separate date/time dimensions, the number of possible cells in the multidimensional space that need storing individually will be the same, since the conceptual space is the cross product of the dimensions. To give another example: say I have a table of products with 8 things in it, a table of 100 customers, and 5 stores. A cube storing the sums may keep 4,000 distinct entries. If two of the dimensions are merged to form a hierarchy, since, perhaps, each customer has a 1:1 relationship with a store, then you'd end up with a Customer (by store) dimension of cardinality 500. This gives, likewise, a dimensional space with up to 5000 (correction: 4,000) cells.

Hey. Why stop there! If separate date/time dimensions lead to the same number of possible cells in the resultant cube space as a single date/time dimension, that same logic should apply to everything. If the logic is sound, that principle should hold everywhere. Then why separate customer/product/store dimensions? An equally sound dimensional design would be to have a single fact table related to one dimension with a composite key of product/customer/store and the hierarchies built from that key.

I hope we both agree that that design is NOT sound. The reason that design is not sound is the same reason the composite date/time dimension is not sound (again, read chapter 2 of The Data Warehouse Toolkit for a lengthy justification of why it's not): a Time member, e.g. 12:01 AM, is not uniquely related to only one member of the date dimension. It's related to every member of the date dimension, and the justification for separating Product, Customer, and Store is the same logic that applies to separating date and time. A date is a date and a time is a time, and they are not the same.

I'm done. No mas.
Post #932898
Posted Friday, June 4, 2010 10:15 AM


SSC-Insane

Steven James Gray (6/4/2010)
...
I've yet to see any demonstration that the CTE itself is unsuitable for synthesising the data on-demand - it's not the fastest mechanism (and I've never claimed it would be), but it's also not an order of magnitude worse. Like procedural texture generation in 3D modelling systems, it trades some performance for the ability to switch to arbitrary levels of granularity by modifications to a single area.
...


Here is the problem: you are using an inefficient method of generating the data. Another developer is going to come along, see this code and say, "Great, just what I need for what I am working on!" The problem is that this developer turns out to be using it on multi-million-row data sets and can't figure out why it doesn't work well.

Also, what happens if you start using this same code on much larger datasets and start seeing performance issues?

Some very simple testing showed the inefficiencies in the recursive CTE, including the additional CPU time and IO requirements of the method.



Lynn Pettis

Post #932934
Posted Sunday, June 6, 2010 10:36 AM
Valued Member

Martin Mason (6/4/2010)

Hey. Why stop there! If separate date/time dimensions lead to the same number of possible cells in the resultant cube space as a single date/time dimension, that same logic should apply to everything. If the logic is sound, that principle should hold everywhere. Then why separate customer/product/store dimensions? An equally sound dimensional design would be to have a single fact table related to one dimension with a composite key of product/customer/store and the hierarchies built from that key.


I never said it was a good idea to not have dimensions - but I think the argument for/against separate/unified date/time dimensions depends on your view of time. When I model things I tend to view date/time as a singular axis. For reporting that needs to look at time valuations irrespective of their dates (not something that tends to be valuable in the work I do), the hierarchy support in SSAS allows those needs to be met. So whilst I'd still maintain separate customer/product/store dimensions, I wouldn't ever opt for separate year/month dimensions.


I hope we both agree that that design is NOT sound. The reason that design is not sound is the same reason the composite date/time dimension is not sound (again, read chapter 2 of The Data Warehouse Toolkit for a lengthy justification of why it's not): a Time member, e.g. 12:01 AM, is not uniquely related to only one member of the date dimension. It's related to every member of the date dimension, and the justification for separating Product, Customer, and Store is the same logic that applies to separating date and time. A date is a date and a time is a time, and they are not the same.


The example I gave was purely to demonstrate the concept that the overall cardinality of the dimensional space is the same regardless of which horse you ride when it comes to date/time tables; merging store/product/customer dimensions would never be a scenario someone would actually implement.

That particular book sits on my shelf, and whilst it has its views on things, no one book is canon. Time represents subdivisions of a date, and if it were genuinely so valuable to separate date/time, why would you not maintain separate year, month, hour, minute, and second dimensions for the same reasons? There's always 60 seconds per minute, 60 minutes per hour, 12 months per year, etc. - but you'd be hard-pressed to find a solution out there that does things that way.

Lynn Pettis (6/4/2010)

Here is the problem: you are using an inefficient method of generating the data. Another developer is going to come along, see this code and say, "Great, just what I need for what I am working on!" The problem is that this developer turns out to be using it on multi-million-row data sets and can't figure out why it doesn't work well.


Inefficiency is relative. As slow as it is, it's possible to generate enough data to cover a century in less than half an hour when dealing at minute resolution, and without any effort applied to optimise what's there. Relative to the time and effort of maintaining and keeping on top of populating a time table manually, it seems like small fry. You'd not reprocess the whole dimension every time you loaded a fact into the tables either, but rather only incrementally populate ranges as you go.
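To illustrate the incremental population (a sketch only - the dimension table dbo.DimDateTime, its TimestampKey column, and the TVF's output column name [Timestamp] are assumptions, not taken from the article):

DECLARE @From datetime, @To datetime;
SET @From = DATEADD(dd, DATEDIFF(dd, 0, GETDATE()), 0);  -- today at midnight
SET @To   = DATEADD(dd, 1, @From);                       -- tomorrow at midnight

-- Only add the members for today's range that aren't already present.
INSERT INTO dbo.DimDateTime (TimestampKey)
SELECT r.[Timestamp]
FROM dbo.fn_GetTimestampRange(@From, @To, 0) AS r
WHERE NOT EXISTS (SELECT 1 FROM dbo.DimDateTime d WHERE d.TimestampKey = r.[Timestamp])
OPTION (MAXRECURSION 0);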


Also, what happens if you start using this same code on much larger datasets and start seeing performance issues?
....
Some very simple testing showed the inefficiencies in the recursive CTE, including the additional CPU time and IO requirements of the method.


If you look at the code, the recursive steps in the example given are broken into two recursive elements - the date range and the time one - rather than a single second-level recursive process. The query plans in this case take the date/time combinations and do a table spool over them, and then as each day value is generated it is cross-joined to the time combinations. That's a total of 36,500 recursions per century! To prove this is the case, try running the following query using the TVF from the article:

SELECT * FROM dbo.[fn_GetTimestampRange]('1 January 2006','1 February 2006', 0) OPTION(MAXRECURSION 60)

This will produce a rowset of 44,641 rows, but never recurses more than 60 levels as a result of the optimisations obtained by the multiple re-use of the numbers expression within the function. The only part of it that actually grows in cost is the recursive step per day. Your MAXRECURSION value needs to be at least 60, or at least the number of distinct days in the range covered, whichever is greater. As a result, this is substantially more efficient than your article would imply.

It's possible to optimise this even further by doing the same trick when resolving all the possible year/month permutations, meaning the recursive cost would shrink further, but I thought that'd impact readability and not gain much for performance.

As it stands, the current design has the advantage of being an inline TVF (and most performance work would likely prevent that), which means it can be used directly in a DSV, and it streams the rows as fast as they're produced, rather than waiting for all rows to be computed in a temp table, as a multi-statement TVF approach would do.
Post #933250
Posted Sunday, June 6, 2010 12:55 PM


SSC-Insane

Steven James Gray (6/6/2010)

Inefficiency is relative. As slow as it is, it's possible to generate enough data to cover a century in less than half an hour when dealing at minute resolution, and without any effort applied to optimise what's there. Relative to the time and effort of maintaining and keeping on top of populating a time table manually, it seems like small fry. You'd not reprocess the whole dimension every time you loaded a fact into the tables either, but rather only incrementally populate ranges as you go.


Actually, inefficiency is not relative; it may just not be noticeable. Inefficient is still inefficient if there is another means to generate the same results that is better.


If you look at the code, the recursive steps in the example given are broken into two recursive elements - the date range and the time one - rather than a single second-level recursive process. The query plans in this case take the date/time combinations and do a table spool over them, and then as each day value is generated it is cross-joined to the time combinations. That's a total of 36,500 recursions per century! To prove this is the case, try running the following query using the TVF from the article:

SELECT * FROM dbo.[fn_GetTimestampRange]('1 January 2006','1 February 2006', 0) OPTION(MAXRECURSION 60)

This will produce a rowset of 44,641 rows, but never recurses more than 60 levels as a result of the optimisations obtained by the multiple re-use of the numbers expression within the function. The only part of it that actually grows in cost is the recursive step per day. Your MAXRECURSION value needs to be at least 60, or at least the number of distinct days in the range covered, whichever is greater. As a result, this is substantially more efficient than your article would imply.

It's possible to optimise this even further by doing the same trick when resolving all the possible year/month permutations, meaning the recursive cost would shrink further, but I thought that'd impact readability and not gain much for performance.


Care to prove your statement? I'm pretty sure that my routine (with a slight modification, thanks to you allowing me to see an inefficiency in my code, which I have fixed on my blog) will consistently beat your code in head-to-head competition.


As it stands, the current design has the advantage of being an inline TVF (and most performance work would likely prevent that), which means it can be used directly in a DSV, and it streams the rows as fast as they're produced, rather than waiting for all rows to be computed in a temp table, as a multi-statement TVF approach would do.


Have you taken a close look at my code (either version)? It is also written as an inline TVF and can be used in exactly the same fashion as yours. It has the benefit of being more efficient. After making a small modification to my code, I ran both functions to generate 100 years of data at one-minute intervals, writing the output to temporary tables:

set statistics io on;
set statistics time on;

select * into #Test1 from dbo.fn_GetTimestampRangeCTE ('2000-01-01','2100-01-01', 0)

set statistics time off;
set statistics io off;

set statistics io on;
set statistics time on;

select * into #Test2 from dbo.fn_GetTimestampRangeRCTE ('2000-01-01','2100-01-01', 0) OPTION (MAXRECURSION 0);

set statistics time off;
set statistics io off;


SQL Server Execution Times:
CPU time = 432079 ms, elapsed time = 453693 ms.

(52596001 row(s) affected)
Table 'Worktable'. Scan count 8, logical reads 2046345, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 1525828 ms, elapsed time = 1553694 ms.

(52596001 row(s) affected)

Once again, you'll see that your function, using a recursive CTE, is quite a bit slower. And even generating 52,596,001 rows of data, mine still isn't generating additional IO like the recursive CTE does.
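For anyone following along, here is a minimal sketch of the general non-recursive, cross-join ("tally") style of inline TVF being described - it is not the published fn_GetTimestampRangeCTE (see the blog for that), and the function name, the minute-level grain, and the output column name below are assumptions:

CREATE FUNCTION dbo.fn_TimestampRange_Tally (@StartDate datetime, @EndDate datetime)
RETURNS TABLE
AS
RETURN
    -- Stacked CROSS JOINs build a virtual numbers ("tally") list with no recursion and
    -- no worktable; TOP limits it to the number of minutes in the requested range.
    WITH E1(n) AS (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
                   UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1), -- 10 rows
         E2(n) AS (SELECT 1 FROM E1 a CROSS JOIN E1 b),   -- 100 rows
         E4(n) AS (SELECT 1 FROM E2 a CROSS JOIN E2 b),   -- 10,000 rows
         E8(n) AS (SELECT 1 FROM E4 a CROSS JOIN E4 b),   -- 100,000,000 rows (upper bound)
         Tally(n) AS (SELECT TOP (DATEDIFF(minute, @StartDate, @EndDate) + 1)
                             ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1
                      FROM E8)
    SELECT DATEADD(minute, n, @StartDate) AS [Timestamp]
    FROM Tally;

It is called the same way as the functions above, e.g. SELECT * INTO #Test3 FROM dbo.fn_TimestampRange_Tally('2000-01-01', '2100-01-01'), and needs no MAXRECURSION hint.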



Lynn Pettis

Post #933264
Posted Sunday, June 6, 2010 1:04 PM


SSC-Insane

Special note: I found no real appreciable difference between the following:

select * into #Test2 from dbo.fn_GetTimestampRangeRCTE ('2006-01-01','2006-02-01', 0) OPTION (MAXRECURSION 0);

select * into #Test2 from dbo.fn_GetTimestampRangeRCTE ('2006-01-01','2006-02-01', 0) OPTION (MAXRECURSION 60);




Lynn Pettis

Post #933267
Posted Sunday, June 6, 2010 2:15 PM
Forum Newbie

Lynn Pettis (6/6/2010)

Actually, inefficiency is not relative; it may just not be noticeable. Inefficient is still inefficient if there is another means to generate the same results that is better.


I'd never said this was the most efficient method - but even your examples for processing 100 years of data show gains that would only ever be realized in an extreme scenario, and that's assuming you loaded it all at once. A few minutes per century translates to a sub-second difference per day, so you'll understand why I'm not racing to rewrite my ETL packages. I'm not trying to knock the achievement - you should be pleased with your optimisations - but they're not ones I've ever had to seek.


Care to prove your statement?


I was demonstrating that the recursion is not at the lower levels, merely at the day level. As you know, the MAXRECURSION option prevents any CTE from recursing more than the specified limit. By setting MAXRECURSION to 60 I was demonstrating that, rather than this being massively recursive (as has been implied), the actual recursion is split into two smaller recursive elements: the generation of the NumbersCTE (with values 0-60) and the iteration over the dates. The only recursion after the first 'day' is processed is the DATEADD region.

If you review the query plan for the statement post-execution, you'll see the branch of the plan that generates the table spool for the HH:MM:SS divisions is executed once ('Number of Executions') while the spool is read many times by a Nested Loops operator. So whilst it's not as fast, it's nowhere near as inefficient as some seemed to imply. At least you've been a good sport and put it into SSMS to try it.
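One way to confirm this for yourself (a sketch; it assumes the article's TVF exists in the current database) is to capture the actual plan and read the per-operator counters:

SET STATISTICS XML ON;

SELECT * FROM dbo.[fn_GetTimestampRange]('1 January 2006','1 February 2006', 0) OPTION(MAXRECURSION 60);

SET STATISTICS XML OFF;

-- Per the description above, the Table Spool branch that builds the within-day divisions
-- should report ActualExecutions = 1 in the returned showplan XML, while the Nested Loops
-- operator that joins it to the generated days reads the spool once per day.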


I'm pretty sure that my routine (with a slight modification, thanks to you allowing me to see an inefficiency in my code, which I have fixed on my blog) will consistently beat your code in head-to-head competition.


As per above. I don't dispute yours will be faster, but your method only gives a benefit of ~20 minutes of time saved for each century of data. You'd never load that much data at once in practice, and because the degradation is non-linear the benefit is lower as you scale down the range (I tend to populate incrementally each day); the difference I've found when feeding both your example and my own into SSIS for a daily population is that both take more or less a second to run.
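To put figures on that using the timings above: 1,553,694 ms - 453,693 ms is roughly 1,100,000 ms, i.e. about 18 minutes per century; spread over the ~36,525 days in that century, that is around 30 ms of difference per daily incremental load.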
Post #933277
Posted Sunday, June 6, 2010 3:42 PM


SSC-Insane

Steve,

Unfortunately, you are using the same logic everyone else uses to defend the use of inefficient code: "It works for my case." The problem is, it is still inefficient. Recursion is by nature RBAR - Row By Agonizing Row. My routine eliminates both the recursion and the IO generated by the worktable.

Also, you seem to be indicating that using OPTION (MAXRECURSION 60) makes the call to your routine more efficient than OPTION (MAXRECURSION 0), which it doesn't. I ran multiple tests using both and they were comparable; in fact, the OPTION (MAXRECURSION) version was actually faster in many cases (although I only ran about 10 runs of each).

Someone is going to look at your code and say, "Hey, that is exactly what I need for what I am working on!", when in actuality it isn't, because what they may be doing runs against several hundred thousand rows or more with a high frequency of usage.



Lynn Pettis

Post #933285
Posted Sunday, June 6, 2010 4:38 PM
Valued Member

Lynn Pettis (6/6/2010)
Steve,

Unfortunately, you are using the same logic everyone else uses to defend the use of inefficient code: "It works for my case." The problem is, it is still inefficient. Recursion is by nature RBAR - Row By Agonizing Row. My routine eliminates both the recursion and the IO generated by the worktable.


When considering optimisation, I consider how often the code is called, the trade-off of complexity for performance, and other factors.


Also, you seem to be indicating that using OPTION (MAXRECURSION 60) makes the call to your routine more efficient than OPTION (MAXRECURSION 0), which it doesn't. I ran multiple tests using both and they were comparable; in fact, the OPTION (MAXRECURSION) version was actually faster in many cases (although I only ran about 10 runs of each).


I never said that it made it faster; I used it to demonstrate that the recursion is nowhere near as deep as you and others implied, producing a 40,000-row result set with only sixty iterative steps in the example.


Someone is going to look at your code and say, "Hey, that is exactly what I need for what I am working on!", when in actuality it isn't, because what they may be doing runs against several hundred thousand rows or more with a high frequency of usage.


This code can produce a century's worth of date/time values in half an hour, as demonstrated by your own tests. Unless they're constantly re-computing centuries' worth of data (which would lend itself to a persistent table rather than a dynamic rowset), they're not going to get much win for the effort.
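And if someone really were recomputing that much data repeatedly, persisting it once is trivial (a sketch; dbo.DimDateTime is a hypothetical target table):

SELECT *
INTO dbo.DimDateTime
FROM dbo.fn_GetTimestampRange('2000-01-01', '2100-01-01', 0)
OPTION (MAXRECURSION 0);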
Post #933294
Posted Sunday, June 6, 2010 4:56 PM


SSC-Insane

Steven James Gray (6/6/2010)
Lynn Pettis (6/6/2010)
Steve,

Unfortunately, you are using the same logic everyone else uses to defend the use of inefficient code: "It works for my case." The problem is, it is still inefficient. Recursion is by nature RBAR - Row By Agonizing Row. My routine eliminates both the recursion and the IO generated by the worktable.


When considering optimisation, I consider how often the code is called, the trade-off of complexity for performance, and other factors.


Also, you seem to be indicating that using OPTION (MAXRECURSION 60) makes the call to your routine more efficient than OPTION (MAXRECURSION 0), which it doesn't. I ran multiple tests using both and they were comparable; in fact, the OPTION (MAXRECURSION) version was actually faster in many cases (although I only ran about 10 runs of each).


I never said that it made it faster; I used it to demonstrate that the recursion is nowhere near as deep as you and others implied, producing a 40,000-row result set with only sixty iterative steps in the example.


Someone is going to look at your code and say, "Hey, that is exactly what I need for what I am working on!", when in actuality it isn't, because what they may be doing runs against several hundred thousand rows or more with a high frequency of usage.


This code can produce a century's worth of date/time values in half an hour, as demonstrated by your own tests. Unless they're constantly re-computing centuries' worth of data (which would lend itself to a persistent table rather than a dynamic rowset), they're not going to get much win for the effort.


Steve,

You really are missing the entire point. I am not disputing what the function does but HOW it does it. Your use of a recursive CTE is inefficient REGARDLESS of how the overall code is applied. What, may I ask, is so complex about my code in comparison to yours? Is it how I have formatted the code?



Lynn Pettis

Post #933295