Sins of SQL: The Time Table
Posted Friday, June 04, 2010 12:10 AM - Valued Member
Post #932596
Posted Friday, June 04, 2010 9:35 AM - Grasshopper
If you have separated date/time dimensions, the number of possible cells in the multidimensional space that need storing individually will be the same, since the conceptual space is the cross product of the dimensions. To give another example: say I have a table of products with 8 things in it, a table of 100 customers, and 5 stores. A cube storing the sums may keep 4,000 distinct entries. If two of the dimensions are merged to form a hierarchy, since perhaps each customer has a 1:1 relationship with a store, then you'd end up with a Customer (by store) dimension of cardinality 500. This gives, likewise, a dimensional space with up to 4,000 cells.

Hey. Why stop there! If separate date/time dimensions lead to the same number of possible cells in the resultant cube space as a single date/time dimension, that same logic should apply to everything. If the logic is sound, then that principle should hold generally. Then why separate customer/product/store dimensions? An equally sound dimensional design would be to have a single fact table related to one dimension with a composite key of product/customer/store and the hierarchies built from that key.

I hope we both agree that that design is NOT sound. The reason that design is not sound is the same reason the composite date/time dimension is not sound (again, read chapter 2 of The Data Warehouse Toolkit for a lengthy justification). A Time member, e.g. 12:01 AM, is not uniquely related to only one member of the date dimension; it's related to every member of the date dimension, and the same justification for separating Product, Customer, and Store is the same logic applied to separating date and time. A date is a date and a time is a time, and they are not the same. I'm done. No mas.
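The cell counts in the example above reduce to simple arithmetic (a Python sketch; the figures are the post's hypothetical 8 products, 100 customers, and 5 stores, and the merged dimension is treated as the 500-member customer-by-store cross, as the post does):

```python
# Hypothetical dimension sizes from the example above.
products, customers, stores = 8, 100, 5

# Separate dimensions: the cube space is the cross product of all three.
separate_cells = products * customers * stores
print(separate_cells)  # 4000

# Merged "Customer (by store)" dimension of cardinality 500
# (100 customers x 5 stores), crossed with the product dimension.
customer_by_store = customers * stores
merged_cells = customer_by_store * products
print(merged_cells)  # 4000 - the same number of possible cells
```

Either way the dimensional space tops out at the same 4,000 cells, which is the point the example is making.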
Post #932898
Posted Friday, June 04, 2010 10:15 AM - SSC-Insane
Steven James Gray (6/4/2010): ...I've yet to see any demonstration that the CTE itself is unsuitable for synthesising the data on-demand - it's not the fastest mechanism (and I've never claimed it would be), but it's also not an order of magnitude worse. Like procedural texture generation in 3D modelling systems, it trades some performance for the ability to switch to arbitrary levels of granularity by modifications to a single area...

Here is the problem: you are using an inefficient method of generating the data. Another developer is going to come around, see this code and say, "Great, just what I need for what I am working on!" The problem is, it turns out this developer is trying to use it on multi-million row data sets and can't figure out why it doesn't work well.

Also, what happens if you start using this same code on much larger datasets and start seeing performance issues? Some very simple testing showed the inefficiencies in the recursive CTE, including the additional CPU time and IO requirements of the method.
Post #932934
Posted Sunday, June 06, 2010 10:36 AM - Valued Member
Martin Mason (6/4/2010): Hey. Why stop there! If separate date/time dimensions lead to the same number of possible cells in the resultant cube space as a single date/time dimension, that same logic should apply to everything. ... Then why separate customer/product/store dimensions? An equally sound dimensional design would be to have a single fact table related to one dimension with a composite key of product/customer/store and the hierarchies built from that key.

I never said it was a good idea to not have dimensions - but I think the argument for/against separate/unified date/time dimensions depends on your view of time. When I model things I tend to view date/time as a singular axis. For reporting that needs to look at time valuations irrespective of their dates (not something that tends to be valuable in the work I do), the hierarchies support in SSAS allows those needs to be met. So whilst I'd still maintain separate customer/product/store dimensions, I wouldn't ever opt for separate year/month dimensions.

Martin Mason (6/4/2010): I hope we both agree that that design is NOT sound. ... A date is a date and a time is a time and they are not the same.

The example I gave was purely to demonstrate that the overall cardinality of the dimensional space is the same regardless of which horse you ride when it comes to date/time tables; merging store/product/customer dimensions would never be a scenario someone would actually implement. That particular book sits on my shelf, and whilst it has its views on things, no one book is canon. Time represents subdivisions of a date, and if it were genuinely so valuable to separate date and time, why would you not maintain separate year, month, hour, minute, and second dimensions for the same reasons? There are always 60 seconds per minute, 60 minutes per hour, 12 months per year, etc. - but you'd be hard pressed to find a solution out there that does things that way.

Lynn Pettis (6/4/2010): Here is the problem, you are using an inefficient method of generating the data. Another developer is going to come around, see this code and say "Great, just what I need for what I am working on!" ...

Inefficiency is relative. As slow as it is, it's possible to generate enough data to cover a century in less than half an hour at minute resolution, without any effort applied to optimise what's there. Relative to the time and effort needed to maintain and keep on top of populating a time table manually, it seems like small fry. You'd also not reprocess the whole dimension every time you loaded a fact into the tables, but rather incrementally populate ranges as you go.

Lynn Pettis (6/4/2010): Also, what happens if you start using this same code on much larger datasets and start seeing performance issues? ... Some very simple testing showed the inefficiencies in the recursive CTE, including additional cpu time and IO requirements of the method.

If you look at the code, the recursive steps in the example given are broken into two recursive elements - the date range and the time one - rather than a single second-level recursive process. The query plans in this case take the date/time combinations and do a table spool over them, and then, as each day value is generated, cross-join it to the time combinations. That's a sum total of 36,500 recursions per century, total! To prove this is the case, try running the following query using the TVF from the article:

`SELECT * FROM dbo.[fn_GetTimestampRange]('1 January 2006','1 February 2006', 0) OPTION(MAXRECURSION 60)`

This will produce a rowset of 44,641 rows, but never recurses more than 60 levels, as a result of the optimisations obtained by the multiple re-use of the numbers expression within the function. The only part of it that actually grows in cost is the recursive step per day. Your MAXRECURSION value needs to be at least 60, and at most the number of distinct days in the range covered. As a result, this is substantially more efficient than your article would imply.

It's possible to optimise this even further by doing the same trick when resolving all the possible year/month permutations, meaning the recursive cost would shrink further, but I thought that'd impact readability and not gain much for performance. As it stands, the current design has the advantage of being an inlineable TVF (most performance work would likely prevent that), which means it can be used directly in a DSV, and it streams rows as fast as they're produced, rather than waiting for all rows to be computed in a temp table, as the multi-statement TVF approach would do.
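The row and recursion counts claimed above can be sanity-checked with back-of-the-envelope arithmetic (a Python sketch of the claims, not the TVF itself):

```python
from datetime import date

# January 2006 example: '1 January 2006' to '1 February 2006', endpoints inclusive.
days = (date(2006, 2, 1) - date(2006, 1, 1)).days  # 31 days
rows = days * 24 * 60 + 1  # one row per minute, plus the closing endpoint
print(rows)  # 44641 - matching the rowset size quoted above

# MAXRECURSION must cover the deeper of the two recursive elements:
# the time anchor (60 levels) or the per-day date anchor (31 here).
print(max(60, days))  # 60

# Over a century, only the per-day recursive step grows:
century_recursions = 365 * 100
print(century_recursions)  # 36500 recursions, as claimed
```

So the 44,641 rows and the 36,500-recursions-per-century figures are internally consistent with a minute-resolution range split into separate date and time recursions.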
Post #933250
Posted Sunday, June 06, 2010 12:55 PM - SSC-Insane
Steven James Gray (6/6/2010): Inefficiency is relative. As slow as it is, it's possible to generate enough data to cover a century in less than half an hour at minute resolution, without any effort applied to optimise what's there. ...

Actually, inefficiency is not relative; it may just not be noticeable. Inefficient is still inefficient if there is another means of generating the same results that is better.

Steven James Gray (6/6/2010): If you look at the code, the recursive steps in the example given are broken into two recursive elements - the date range and the time one... As a result, this is substantially more efficient than your article would imply. It's possible to optimise this even further by doing the same trick when resolving all the possible year/month permutations...

Care to prove your statement? I'm pretty sure that my routine (with a slight modification, thanks to you allowing me to see an inefficiency in my code, which I have fixed on my blog) will consistently beat your code in head-to-head competition.

Steven James Gray (6/6/2010): As it stands the current design has the advantage of being an inlineable TVF (and most performance work would likely prevent that), which means it can be used directly in a DSV, and streams the rows as fast as they're produced...

Have you taken a close look at my code (either version)? It is also written as an inline TVF and can be used in exactly the same fashion as yours. It has the benefit of being more efficient. After making a small modification to my code, I ran both functions to generate 100 years of data at one-minute intervals, writing the output to temporary tables:

`set statistics io on;
set statistics time on;
select * into #Test1 from dbo.fn_GetTimestampRangeCTE ('2000-01-01','2100-01-01', 0);
set statistics time off;
set statistics io off;

set statistics io on;
set statistics time on;
select * into #Test2 from dbo.fn_GetTimestampRangeRCTE ('2000-01-01','2100-01-01', 0) OPTION (MAXRECURSION 0);
set statistics time off;
set statistics io off;`

SQL Server Execution Times: CPU time = 432079 ms, elapsed time = 453693 ms.
(52596001 row(s) affected)
Table 'Worktable'. Scan count 8, logical reads 2046345, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times: CPU time = 1525828 ms, elapsed time = 1553694 ms.
(52596001 row(s) affected)

Once again, you'll see that your function, using a recursive CTE, is quite a bit slower. And even generating 52,596,001 rows of data, mine still isn't generating additional IO like the recursive CTE does.
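The 52,596,001-row figure reported by both runs above is consistent with one century at minute resolution (a quick Python check of the arithmetic, not part of the benchmark itself):

```python
from datetime import date

# 2000-01-01 through 2100-01-01 at one-minute intervals, endpoints inclusive.
days = (date(2100, 1, 1) - date(2000, 1, 1)).days
print(days)  # 36525 - 36,500 days plus 25 leap days in 2000-2099

rows = days * 24 * 60 + 1
print(rows)  # 52596001 - the row count reported by both functions
```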
Post #933264
Posted Sunday, June 06, 2010 1:04 PM - SSC-Insane
Special note: I found no real appreciable difference between the following:

`select * into #Test2 from dbo.fn_GetTimestampRangeRCTE ('2006-01-01','2006-02-01', 0) OPTION (MAXRECURSION 0);`
`select * into #Test2 from dbo.fn_GetTimestampRangeRCTE ('2006-01-01','2006-02-01', 0) OPTION (MAXRECURSION 60);`
Post #933267
Posted Sunday, June 06, 2010 2:15 PM - Forum Newbie
Post #933277
Posted Sunday, June 06, 2010 3:42 PM - SSC-Insane
Steve,

Unfortunately, you are using the same logic everyone else uses to defend the use of inefficient code: "It works for my case." The problem is, it is still inefficient. Recursion is by nature RBAR - Row By Agonizing Row. My routine eliminates both the recursion and the IO generated by using a worktable.

Also, you seem to be indicating that using OPTION (MAXRECURSION 60) makes the call to your routine more efficient than OPTION (MAXRECURSION 0), which it doesn't. I ran multiple tests using both and they were comparable; in fact, the OPTION (MAXRECURSION) was actually faster in many cases (although I only ran about 10 runs of each).

Someone is going to look at your code and say, "Hey, that is exactly what I need for what I am working on!", when in actuality it isn't, because what they may be doing is against several hundred thousand rows or more with a high frequency of usage.
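The RBAR point can be illustrated outside T-SQL (a Python analogy, not either poster's code): generating N timestamps recursively costs one call per row, while a set-based approach builds the same rows with no recursion at all.

```python
from datetime import datetime, timedelta

def minutes_rbar(start, end):
    # Row By Agonizing Row: one recursive call per emitted minute.
    # Depth grows linearly with the range; even a single day (1,440 rows)
    # would blow past Python's default recursion limit of ~1000.
    if start > end:
        return []
    return [start] + minutes_rbar(start + timedelta(minutes=1), end)

def minutes_setbased(start, end):
    # Set-based: compute the row count once, then materialise directly.
    n = int((end - start).total_seconds() // 60) + 1
    return [start + timedelta(minutes=i) for i in range(n)]

start = datetime(2006, 1, 1)
short = minutes_rbar(start, start + timedelta(minutes=59))  # depth ~60: fine
full = minutes_setbased(start, start + timedelta(days=1))   # 1441 rows, no stack cost
print(len(short), len(full))  # 60 1441
```

The analogy is loose (SQL Server spools recursive CTE rows in a worktable rather than on a call stack), but the scaling argument is the same: per-row recursion carries a per-row overhead that a set-based formulation avoids.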
Post #933285
Posted Sunday, June 06, 2010 4:38 PM - Valued Member
Lynn Pettis (6/6/2010): Steve, unfortunately, you are using the same logic everyone else uses to defend the use of inefficient code: "It works for my case." Problem is, it is still inefficient. Recursion is by nature RBAR, Row By Agonizing Row. My routine eliminates both the recursion and the IO generated by using a worktable.

When considering optimisation, I consider how often the code is called and the trade-off of complexity for performance, among other factors.

Lynn Pettis (6/6/2010): Also, you seem to be indicating that using OPTION (MAXRECURSION 60) makes the call to your routine more efficient than OPTION (MAXRECURSION 0), which it doesn't. ...

I never said that it made it faster; I used it to demonstrate that the recursion is nowhere near as deep as you and others implied - producing a 40,000-row result set with only sixty iterative steps in the example.

Lynn Pettis (6/6/2010): Someone is going to look at your code and say, "Hey, that is exactly what I need for what I am working on!", when in actuality it isn't, because what they may be doing is against several hundred thousand rows or more with a high frequency of usage.

This code can produce a century's worth of date-time values in half an hour, as demonstrated by your own tests. Unless they're constantly re-computing centuries' worth of data (which would lend itself to a persistent table rather than a dynamic rowset), they're not going to get much win for the effort.
Post #933294
Posted Sunday, June 06, 2010 4:56 PM - SSC-Insane
Steven James Gray (6/6/2010): ...This code can produce a century's worth of date-time values in half an hour, as demonstrated by your own tests. Unless they're constantly re-computing centuries' worth of data (which would lend itself to a persistent table rather than a dynamic rowset), they're not going to get much win for the effort.

Steve,

You really are missing the entire point. I am not disputing what the function does but HOW it does it. Your use of a recursive CTE is inefficient REGARDLESS of the overall application of the code in this regard. And what, may I ask, is so complex about my code in comparison to yours? Is it how I have formatted the code?
Post #933295
