• Martin Mason (6/4/2010)


    Hey, why stop there! If separate date/time dimensions lead to the same number of possible cells in the resultant cube space as a single date/time dimension, that same logic should apply to everything: if the logic is sound, the principle should hold everywhere. Then why separate customer/product/store dimensions? An equally sound dimensional design would be a single fact table related to one dimension with a composite key of product/customer/store, with the hierarchies built from that key.

    I never said it was a good idea to have no dimensions at all, but I think the argument for/against separate/unified date and time dimensions depends on your view of time. When I model things I tend to view date/time as a single axis. For reporting that needs to look at time-of-day values irrespective of their dates (not something that tends to be valuable in the work I do), the hierarchy support in SSAS allows those needs to be met. So whilst I'd still maintain separate customer/product/store dimensions, I wouldn't ever opt for separate year/month dimensions.

    I hope we both agree that that design is NOT sound. The reason that design is not sound is the same reason the composite date/time dimension is not sound (again, read chapter 2 of The Data Warehouse Toolkit for a lengthy justification). A time member, e.g. 12:01 AM, is not uniquely related to only one member of the date dimension; it's related to every member of the date dimension. The justification for separating Product, Customer, and Store is the same logic applied to separating date and time. A date is a date and a time is a time, and they are not the same.

    The example I gave was purely to demonstrate that the overall cardinality of the dimensional space is the same regardless of which horse you ride when it comes to date/time tables; merging store/product/customer dimensions is not a scenario anyone would realistically implement.

    That particular book sits on my shelf, and whilst it has its views on things, no one book is canon. Time represents subdivisions of a date, and if it were genuinely so valuable to separate date and time, why not maintain separate year, month, hour, minute, and second dimensions for the same reasons? There are always 60 seconds per minute, 60 minutes per hour, 12 months per year, etc., but you'd be hard pressed to find a solution out there that does things that way.
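
    To make the two shapes concrete, here is a minimal sketch of the designs being debated. All table and column names are hypothetical, not taken from the article or the book:

    -- Hypothetical Kimball-style split: one dimension for calendar dates,
    -- one for time of day.
    CREATE TABLE dbo.DimDate (
        DateKey  int      NOT NULL PRIMARY KEY,  -- e.g. 20100604
        FullDate datetime NOT NULL,
        [Year]   smallint NOT NULL,
        [Month]  tinyint  NOT NULL,
        [Day]    tinyint  NOT NULL
    );

    CREATE TABLE dbo.DimTimeOfDay (
        TimeKey  int     NOT NULL PRIMARY KEY,   -- minutes past midnight, 0..1439
        [Hour]   tinyint NOT NULL,
        [Minute] tinyint NOT NULL
    );

    -- The combined alternative: a single dimension at date/time grain, with the
    -- year/month/day/hour/minute hierarchy hung off one surrogate key.
    CREATE TABLE dbo.DimDateTime (
        DateTimeKey  bigint   NOT NULL PRIMARY KEY,  -- e.g. 201006041201
        FullDateTime datetime NOT NULL,
        [Year]   smallint NOT NULL,
        [Month]  tinyint  NOT NULL,
        [Day]    tinyint  NOT NULL,
        [Hour]   tinyint  NOT NULL,
        [Minute] tinyint  NOT NULL
    );

    Either way the number of addressable date/time combinations is identical; the split design simply factors them into two foreign keys on the fact table.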

    Lynn Pettis (6/4/2010)


    Here is the problem: you are using an inefficient method of generating the data. Another developer is going to come along, see this code and say "Great, just what I need for what I am working on!" Problem is, it turns out this developer is trying to use it on multi-million-row data sets and can't figure out why it doesn't work well.

    Inefficiency is relative. As slow as it is, it can generate enough data to cover a century in less than half an hour at minute resolution, and that's without any effort spent optimising what's there. Relative to the time and effort of manually maintaining and populating a time table, that seems like small fry. You also wouldn't reprocess the whole dimension every time you loaded a fact into the tables; you'd only incrementally populate ranges as you go.
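
    As a rough sketch of what that incremental population might look like, assuming the hypothetical dbo.DimDateTime table sketched above, and assuming the function exposes a [Timestamp] output column (the column name is my guess, not taken from the article):

    -- Extend the dimension only past the current high-water mark.
    DECLARE @lastLoaded datetime, @loadTo datetime;

    SELECT @lastLoaded = MAX(FullDateTime) FROM dbo.DimDateTime;
    SET @loadTo = DATEADD(MONTH, 1, @lastLoaded);    -- extend by one month per load

    INSERT INTO dbo.DimDateTime
        (DateTimeKey, FullDateTime, [Year], [Month], [Day], [Hour], [Minute])
    SELECT
        CONVERT(bigint, CONVERT(varchar(8), ts.[Timestamp], 112)) * 10000
            + DATEPART(HOUR, ts.[Timestamp]) * 100
            + DATEPART(MINUTE, ts.[Timestamp]),      -- yyyymmddhhmi surrogate key
        ts.[Timestamp],
        DATEPART(YEAR,   ts.[Timestamp]),
        DATEPART(MONTH,  ts.[Timestamp]),
        DATEPART(DAY,    ts.[Timestamp]),
        DATEPART(HOUR,   ts.[Timestamp]),
        DATEPART(MINUTE, ts.[Timestamp])
    FROM dbo.fn_GetTimestampRange(DATEADD(MINUTE, 1, @lastLoaded), @loadTo, 0) AS ts
    OPTION (MAXRECURSION 60);    -- sufficient for a one-month range, per below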

    Also, what happens if you start using this same code on much larger data sets and start seeing performance issues?

    ....

    Some very simple testing showed the inefficiencies in the recursive CTE, including the additional CPU time and I/O requirements of the method.

    If you look at the code, the recursion in the example is broken into two recursive elements, the date range and the time range, rather than a single recursive process at the finest grain. The query plan in this case builds the time combinations once, table-spools them, and then cross joins each day value to that spool as it is generated. That's a sum total of roughly 36,500 recursions per century, one per day. To prove this is the case, try running the following query using the TVF from the article:

    SELECT * FROM dbo.[fn_GetTimestampRange]('1 January 2006','1 February 2006', 0) OPTION(MAXRECURSION 60)

    This will produce a rowset of 44,641 rows, yet it never recurses more than 60 levels, as a result of the optimisations obtained by the multiple re-use of the numbers expression within the function. The only part whose cost actually grows with the range is the recursive step per day. Your MAXRECURSION value needs to be at least 60, and never more than the number of distinct days in the range covered. As a result, this is substantially more efficient than your article would imply.
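
    For readers without the article to hand, here is a minimal standalone sketch of that structure. It is my own reconstruction of the shape described above, not the article's code, and the endpoint handling (hence the exact row count) differs slightly:

    DECLARE @from datetime, @to datetime;
    SELECT @from = '20060101', @to = '20060201';

    WITH N AS (                    -- 0..59: only 59 recursion levels
        SELECT 0 AS n
        UNION ALL
        SELECT n + 1 FROM N WHERE n < 59
    ),
    Minutes AS (                   -- 1,440 minutes per day, built by reusing N twice
        SELECT h.n * 60 + m.n AS minuteOfDay
        FROM N AS h CROSS JOIN N AS m
        WHERE h.n < 24
    ),
    Days AS (                      -- one recursion level per day in the range
        SELECT @from AS d
        UNION ALL
        SELECT DATEADD(DAY, 1, d) FROM Days WHERE d < @to
    )
    SELECT DATEADD(MINUTE, m.minuteOfDay, dy.d) AS ts
    FROM Days AS dy
    CROSS JOIN Minutes AS m
    OPTION (MAXRECURSION 60);      -- covers the deeper of N (59 levels) and Days (31 here)

    The recursion depth is the larger of 59 (the numbers expression) and the number of days in the range, which is why MAXRECURSION 60 suffices for a one-month range even though tens of thousands of rows come out.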

    It would be possible to optimise this even further by applying the same trick when resolving all the possible year/month permutations, which would shrink the recursive cost again, but I felt that would hurt readability without gaining much in performance.

    As it stands, the current design has the advantage of being an inline TVF (most performance work would likely prevent that), which means it can be used directly in a DSV and streams rows as fast as they're produced, rather than waiting for all rows to be computed into a table variable, as a multi-statement TVF would do.
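
    For anyone unfamiliar with that distinction, here are two trivial skeletons (hypothetical names, not from the article) showing why the shape matters. The optimiser can expand the inline form into the calling query and stream its rows; the multi-statement form must fill its table variable completely before the caller sees anything:

    -- Inline TVF: a single RETURN (SELECT ...); expanded in place, rows stream out.
    CREATE FUNCTION dbo.fn_InlineExample (@from datetime)
    RETURNS TABLE
    AS
    RETURN (SELECT @from AS ts);
    GO

    -- Multi-statement TVF: the whole result set is materialised into @result
    -- before a single row is returned to the caller.
    CREATE FUNCTION dbo.fn_MultiStatementExample (@from datetime)
    RETURNS @result TABLE (ts datetime NOT NULL)
    AS
    BEGIN
        INSERT INTO @result (ts) VALUES (@from);
        RETURN;
    END;
    GO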