Evolution of code
The thing with any bit of code that has been around for a while is that when change comes along, the tendency is to cater for it by adding new stuff, while nothing gets taken away. Some stuff has definitely been taken away from this Date Dimension, but some historical artefacts no doubt remain.
All Date Dimensions are a blend of personal preference and business requirement, frequently with a bunch of history thrown in, and this one is no exception.
No doubt all Date Dimensions share a few common properties, but thereafter may diverge significantly. I have previously developed Date Dimensions that don’t look a whole lot like this one. And indeed I have previous articles on Date Dimensions at SQLServerCentral.com.
This Date Dimension is a blend of a fair amount of history from different sources; folk are used to using it, and it does what the users want it to do. Its roots are distinctly in SQL Server, so it does things in the way that SQL Server has led, although it has been used in different environments. Although many different platforms all offer an ANSI SQL environment, they are all subtly different, and one major area of difference is the manner in which Dates are handled and the built-in functions that are available to manipulate them.
Date Formats
This Date Dimension uses both integer dates in the format YYYYMMDD and native dates – we just choose the flavour that we want in each situation, some situations being better suited to integers, whilst others are better suited to native dates.
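As a minimal sketch of moving between the two flavours in SQL Server (these conversions are illustrative, not lifted from the Date Dimension script):

DECLARE @d DATE = '2024-03-15';

-- DATE to integer YYYYMMDD (style 112 is yyyymmdd)
SELECT CONVERT(INT, CONVERT(CHAR(8), @d, 112)) AS DateKey;    -- 20240315

-- integer YYYYMMDD back to DATE
DECLARE @k INT = 20240315;
SELECT CONVERT(DATE, CONVERT(CHAR(8), @k), 112) AS TheDate;   -- 2024-03-15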
Built-in Week-ness
When the DatePart function is used in SQL Server to return a week number at the end of December or the beginning of January, SQL Server has no qualms about short weeks: the last week in December, the first week in January, or both of them will contain fewer than 7 days. I should perhaps note that SQL Server permits a lot of flexibility around its handling of weeks, but it was decided a long time ago that the Date Dimension would go with the default method.
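For example, under the default settings, the short weeks either side of a year boundary show up like this (dates chosen purely for illustration):

SELECT DATEPART(WEEK, CAST('2023-12-31' AS DATE)) AS LastWeek2023,   -- 53: a 1-day week
       DATEPART(WEEK, CAST('2024-01-01' AS DATE)) AS FirstWeek2024;  -- 1: a 6-day week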
So in this Date Dimension, the field IFDW (integer First Day Of Week) follows SQL's way of thinking and will always be in the same year as the rest of the dates with that same YYYYWW value. However, there is also CalIFDW, which may not necessarily be in the same year for the first week in January, as it always keeps to 7-day weeks. Where I have prefixed a week-oriented fieldname with Cal, that means that it works on a full 7-day week, ignoring SQL's possible truncation of the last week in December and the first week in January.
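As a sketch of the Cal behaviour – the field names are from the Date Dimension, but this particular expression is my illustration, and it assumes the default DATEFIRST of 7:

-- the week start for 2nd Jan 2024 falls in the prior year
SELECT DATEADD(DAY, 1 - DATEPART(WEEKDAY, d), d) AS CalWeekStart   -- 2023-12-31
FROM (VALUES (CAST('2024-01-02' AS DATE))) AS t(d);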
We do a lot of reporting by week, but not all of the clients we deal with stick to the same definition of a week that we use. This Date Dimension caters for weeks starting on any day of the week. I must hasten to add that we don’t have clients that use every single one of these, but the Date Dimension would handle it if such a situation were to arise.
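One DATEFIRST-independent way of computing a week start for an arbitrary start day looks like this (a sketch only; the @StartDay encoding is my own, not the Date Dimension's):

DECLARE @StartDay INT = 4;   -- 1 = Sunday, 2 = Monday, ... 0 = Saturday; 4 = Wednesday here
SELECT d,
       DATEADD(DAY,
               -(((DATEPART(WEEKDAY, d) + @@DATEFIRST) % 7 - @StartDay + 7) % 7),
               d) AS WeekStart        -- both rows resolve to Wednesday 2024-03-13
FROM (VALUES (CAST('2024-03-15' AS DATE)),
             (CAST('2024-03-17' AS DATE))) AS t(d);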
Sequence Numbers
For our purposes, it doesn’t matter if we are in week 27 or 26, or 35 or whatever. We just need to know the current week, the prior week, the week from a month ago, and the week from a year ago. In order to do this, we have Sequence Numbers, and these are simply counts of the different date intervals from the start of the Date Dimension, which in this case is 1st Jan 1950. These fields begin with Seq. All of Jan 1950 has a SeqM value of 1, Feb 1950 has a SeqM value of 2, and so on. The current month happens to be no. 907 on this somewhat meaningless scale, and who could remember that? You might also ask who would need to remember that! But we also have Relative Sequence (RelSeq) fields, and these provide the magic.
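Deriving such a sequence number is straightforward; a sketch for the month version (the expression is illustrative, though SeqM is the real field name):

DECLARE @StartDate DATE = '1950-01-01';
SELECT d,
       DATEDIFF(MONTH, @StartDate, d) + 1 AS SeqM   -- Jan 1950 = 1, Feb 1950 = 2, ...
FROM (VALUES (CAST('1950-02-10' AS DATE)),          -- SeqM = 2
             (CAST('2025-07-01' AS DATE))) AS t(d); -- SeqM = 907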
Relative Sequence Numbers
Our data is reliant on several feeds from outside the company, so we have little control over when we get that data. For some purposes we need to allow for 2 days for the data to arrive in our tables, and for others we need to allow for 3 days. So we have two sets of RelSeq fields, which have the suffix 2 or 3, indicating whether they have allowed for 2 or 3 days.
The 0 values in the RelSeq fields mark the latest full set of dates for each given date interval. So a 0 in RelSeqM2 tells us that this is the latest full month prior to 2 days ago. And having got the 0 month, getting the prior month is easy – RelSeqM2 = 1 – as is the same month in the prior year – RelSeqM2 = 12.
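As a hedged sketch of how such a field might be derived – RelSeqM2 is the real field name, but the derivation below is my assumption, and dbo.DimDate is a hypothetical stored table:

DECLARE @Anchor DATE = DATEADD(DAY, -2, CAST(GETDATE() AS DATE));  -- allow 2 days for data to arrive
SELECT TheDate,
       DATEDIFF(MONTH, TheDate, DATEADD(DAY, 1, @Anchor)) - 1 AS RelSeqM2  -- 0 = latest complete month
FROM dbo.DimDate;

With that in place, the prior month is simply WHERE RelSeqM2 = 1, and the same month a year ago is WHERE RelSeqM2 = 12.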
These RelSeq values are proving to be hugely useful in Power BI.
Ideal Architecture
The thing with these RelSeq values is that they potentially change every day. To me, the ideal way of doing this would be to have a fixed Date Dimension table, which would contain the output from the Common Table Expression CTE_ShouldIdeallyBeStoredAsDimDate, and then a View consisting of the code after this CTE.
We principally make use of the RelSeq fields in Power BI, and as our Power BI models refresh daily – with the Date Dimension refreshing from a View that calculates the RelSeq values – the RelSeq fields within Power BI stay correct from day to day.
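In sketch form, and assuming the stored table is called dbo.DimDate (an illustrative name), that split might look like this:

-- the stable attributes live in the table; the daily-moving RelSeq
-- fields are computed in a view over it
CREATE VIEW dbo.vw_DimDate
AS
SELECT d.*,
       DATEDIFF(MONTH, d.TheDate,
                DATEADD(DAY, 1, DATEADD(DAY, -2, CAST(GETDATE() AS DATE)))) - 1 AS RelSeqM2
FROM dbo.DimDate AS d;   -- dbo.DimDate would hold the output of CTE_ShouldIdeallyBeStoredAsDimDate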
Query Structure
This Date Dimension implementation makes a great deal of use of Window functions. No doubt the same results could have been achieved with a plethora of self-joins and logic, but the Window functions are more efficient than the alternatives with this size and shape of data, and they make the code more readable, which is always a laudable aim.
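As a flavour of that style – DFDW and DLDW are real field names, though this particular derivation is my illustration over a hypothetical run of generated dates:

SELECT TheDate,
       MIN(TheDate) OVER (PARTITION BY YYYYWW) AS DFDW,   -- first date in the week
       MAX(TheDate) OVER (PARTITION BY YYYYWW) AS DLDW    -- last date in the week
FROM CTE_Dates;   -- the generated run of dates, with YYYYWW already derived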
The code utilises 2 parameters, which are in CTE_Params. I have got 1st Jan 1950 as my StartDate, and I have asked for 100 years of records by asking for 36524 days, so it will end on 31st Dec 2049. Changing these values is the easy tweak. Any other changes you would like are going to involve more significant code changes.
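A sketch of those parameters (the names inside CTE_Params are assumptions on my part):

WITH CTE_Params AS (
    SELECT CAST('1950-01-01' AS DATE) AS StartDate,
           36524                      AS NoOfDays
)
SELECT StartDate,
       DATEADD(DAY, NoOfDays, StartDate) AS EndDate   -- 2049-12-31
FROM CTE_Params;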
There are three versions of this query – a SQL Server version, a Databricks version, and a BigQuery version. The outputs are not identical, but are nearly so. The differences in output come about because Databricks handles week numbers differently, and offers no flexibility over this. With Databricks, the last week in December or the first week in January may be allocated a week number from the adjacent year. For us, these differences are immaterial, as the only fields that differ are WeekNo, YYYYWW, DFDW and DLDW. Everything else is identical. I could have written code to tweak these so that they were the same, but as this would have made no material difference to us, I didn’t.
And whilst BigQuery does offer a smidgen more flexibility with weeks than Databricks, I have got it to return output that is identical to that of the Databricks query.
Generating the number of rows required
Apart from the syntactical differences caused by the different environments, there are differences in the way that the required number of rows is generated.
Within the SQL Server script, the large number of rows comes from a recursive CTE. This means that at some stage you are going to need to set MAXRECURSION, since the default limit is 100 recursions, and a Date Dimension with fewer than 100 rows does not seem very likely.
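A minimal sketch of the pattern, with the hint applied (0 removes the recursion limit entirely):

WITH CTE_Dates AS (
    SELECT CAST('1950-01-01' AS DATE) AS TheDate
    UNION ALL
    SELECT DATEADD(DAY, 1, TheDate)
    FROM CTE_Dates
    WHERE TheDate < '2049-12-31'
)
SELECT TheDate
FROM CTE_Dates
OPTION (MAXRECURSION 0);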
Databricks does things differently, using the Sequence function to generate a horizontal set of values, and the Explode function to turn that into a vertical set.
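In Databricks SQL that looks something like this (a sketch, not the exact script):

SELECT explode(sequence(DATE'1950-01-01', DATE'2049-12-31', INTERVAL 1 DAY)) AS TheDate;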
BigQuery does it in a similar manner to Databricks, using the Generate_Array function to generate the horizontal set of numbers, and the UnNest function to turn it into a vertical set.
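And the BigQuery equivalent, again as a sketch:

SELECT DATE_ADD(DATE '1950-01-01', INTERVAL n DAY) AS TheDate
FROM UNNEST(GENERATE_ARRAY(0, 36524)) AS n;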
All three of these methods work really well for the number of rows involved here. Beyond the row generation, the remaining differences come down to the way that the ANSI SQL standard has been implemented in each environment. The scripts go through the same logical steps in the same locations, albeit in a subtly distinct manner.
Efficiency
These scripts have come from an evolved source, so they will still carry artefacts from previous requirements and won’t be in their most efficient possible state. I don’t doubt that there are improvements that could be made, or that your requirements will be subtly different and you will rework the code to suit you. However, as they stand, these queries are reasonably efficient, and if the bulk of the code is to be run once in order to store values in a table, then shaving a fraction of a second off the runtime is not hugely important. If, however, the pursuit of brownie points is the goal, then there is scope to earn them by improving this script, so go for it!