• But combining date and time just increases the sparsity of the multidimensional database along a dimension that is most likely going to be used to partition any database of any real size.

    The only change this would require for partitioning is that the partitions are built off of the date/time values themselves, not arbitrary key ranges in the time table.

    It would just be a huge mistake to do so. Like I said earlier, a combined date/time dimension for 10 years' worth of data at minute resolution results in a dimension with over 5 million (where'd I get 13M from?) stored members, while separate dimensions result in a date dimension with 3,650 members and a time dimension with 1,440 members. I don't need benchmarks to tell me the separate dimension approach will best the consolidated dimension approach in every aspect: dimension processing, aggregation processing, ease of maintenance. Everything.
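    To make the arithmetic concrete, here's a quick sketch using the figures quoted above (3,650 dates and 1,440 minutes per day; the declared variables are just placeholders):

        -- Stored-member counts: one combined date/time dimension vs. two separate dimensions
        DECLARE @days int = 3650;   -- ~10 years of dates
        DECLARE @mins int = 1440;   -- minutes in a day

        SELECT
            @days * @mins AS CombinedDateTimeMembers,   -- 5,256,000 members in one dimension
            @days + @mins AS SeparateDimensionMembers;  -- 5,090 members across two dimensions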

    If you have separated date/time dimensions, the number of possible cells in the multidimensional space that need storing individually will be the same, since the conceptual space is the cross product of the dimensions. To give another example: say I have a table of 8 products, a table of 100 customers, and 5 stores. A cube storing the sums may keep up to 8 × 100 × 5 = 4,000 distinct entries. If two of the dimensions are merged to form a hierarchy (since, perhaps, each customer has a 1:1 relationship with a store), then you'd end up with a Customer (by store) dimension of cardinality 500. This gives, likewise, a dimensional space with up to 8 × 500 = 4,000 cells.
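    Spelling that out (a sketch using only the numbers from the example above):

        -- Cell counts before and after merging the customer and store dimensions
        DECLARE @products int = 8, @customers int = 100, @stores int = 5;

        SELECT
            @products * @customers * @stores   AS CellsWithThreeDimensions,  -- 4,000
            @products * (@customers * @stores) AS CellsWithMergedDimension;  -- 500-member merged dimension, still 4,000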

    The same principle holds here for date/time. Whether you have a single dimension or two, the maximum number of values at the lowest level is exactly the same. When it comes to aggregations etc., you'd only lose out on aggregation design if you needed to group by time values independently of days - and you can keep things as simple or as complex internally by using the hierarchy definitions that suit your querying needs.

    The only argument I've heard that seems to be based on anything is the notion that the large number of members overall in the dimension may cause slowness as the members of the hierarchy are resolved - although I'll admit that my tests with around 1B fact rows haven't shown any demonstrable problems when linking them to a pseudo-time dimension covering a range of ten years. Each day the next few records are incrementally loaded into the date dimension and the day's data is pushed in shortly thereafter - it runs just as quickly as it did beforehand. Part of the reason, I suspect, is that SSAS does not linearly scan a dimension looking for the appropriate key when loading data, but uses a tree. The effort of searching one tree of combined values versus two smaller trees is not materially higher - any basic algorithms class will tell you that (doubling the amount of data in a B+ tree or binary tree does not double search time, since lookup performance is logarithmic).
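    A back-of-the-envelope way to see it: log2(N × M) = log2(N) + log2(M), so one lookup against a combined tree costs about the same number of comparisons as one lookup against each of the two smaller trees (a sketch assuming balanced trees and ignoring constant factors):

        -- Approximate comparison counts per lookup for the 10-year, minute-grain case
        SELECT
            LOG(3650.0 * 1440.0) / LOG(2.0)                 AS CombinedTreeDepth,      -- ~22.3
            LOG(3650.0) / LOG(2.0) + LOG(1440.0) / LOG(2.0) AS TwoSeparateTreeDepths;  -- ~11.8 + ~10.5 = ~22.3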

    But wait. Converting a datetime to an int [CAST(FactDate AS int)], excluding the time portion (assuming it's AM), gives me the number of days elapsed since 1/1/1900. Hey, why not use that as the surrogate key of your date dimension! (Which is exactly what I and others do.) I don't even have to do a lookup for a date key at all in my ETL, and if I haven't already, I could create a calculated column in my DSV. Similarly, I could use the number of minutes (or seconds) that have elapsed since midnight as the surrogate key of my time dimension so that I don't have to do a lookup for the Time Key either.

    (And I could still create a calculated column for that key in my DSV just as fast.)
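    For illustration, both keys can be derived inline with no lookups at all; a sketch, where FactDate and dbo.FactTable are placeholder names, and DATEDIFF is used because CAST(datetime AS int) rounds PM values up to the next day:

        -- Date key: days elapsed since 1900-01-01; Time key: minutes elapsed since midnight
        SELECT
            DATEDIFF(DAY,    '19000101',             FactDate) AS DateKey,
            DATEDIFF(MINUTE, CAST(FactDate AS date), FactDate) AS TimeKey
        FROM dbo.FactTable;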

    That approach is valid and workable, and one I've used myself where my reporting periods cover all possible date/time combinations. If you look at how I recommended things in the article, you'll notice this is precisely the mechanism by which the milliseconds etc. are normalized out of the data.

    (And not nearly as "evil" as the CTE Steve proposes as the replacement for the time dimension.)

    I've yet to see any demonstration that the CTE itself is unsuitable for synthesising the data on demand - it's not the fastest mechanism (and I've never claimed it would be), but it's also not an order of magnitude worse. Like procedural texture generation in 3D modelling systems, it trades some performance for the ability to switch to arbitrary levels of granularity by modifying a single area.
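    For reference, the kind of on-demand generation being discussed looks roughly like this - a generic sketch, not the article's actual CTE - producing the 1,440 minutes of a day without a stored time table:

        -- Synthesize the minutes of a day on the fly instead of storing a time table
        WITH Minutes AS
        (
            SELECT 0 AS MinuteOfDay
            UNION ALL
            SELECT MinuteOfDay + 1 FROM Minutes WHERE MinuteOfDay < 1439
        )
        SELECT
            MinuteOfDay,
            MinuteOfDay / 60 AS HourOfDay,
            MinuteOfDay % 60 AS MinuteOfHour
        FROM Minutes
        OPTION (MAXRECURSION 1440);  -- the default recursion limit of 100 is too low here

    A set-based cross join of two small number lists would generate the same rows faster than the recursive form, which is consistent with the point above: not the quickest mechanism, but hardly unusable.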

    So to summarize, I would still give this article zero stars. I think it subtracts from, rather than adds to, the body of knowledge.

    To each their own.