Mastering Dimensions of Time

  • Comments posted to this topic are about the item Mastering Dimensions of Time

  • Just a couple of thoughts...

    First, these two tables are one of the few places where I'd use the natural key of either the date or the time instead of a surrogate key. There's not much worse than trying to troubleshoot something by looking at rows in a table only to have to lookup every single date or time ID. Using the natural key would allow you to not have to make multiple time tables for the sake of continuing to use a surrogate key while still providing more resolution in the time elements.

    Also, the code isn't actually portable. That possibility ended as soon as you defined a variable. I don't know about other databases but, last I heard, Oracle (for example) does not use @ variables.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • If you need to define your internal, company calendar i.e. "4/4/5" in time table, such a script may become a little more interesting.

  • We use such date tables as well, but structured quite differently, see the script code for details.

    What it amounts to is our own integer based structured date type, with years written as 20140000, months as 20141200 and days as 20141231. It is easy to use in all sorts of code, filters well and even allows for a hierarchy in a result-set (distinct year, month and day results all fit the same mold). Having years, months and days in their own table in turn allows for simple queries when it comes to filtering and covering date ranges, even when there is sparse data to work with.

    The storage is minimal and even less thanks to the use of computed columns where possible. We work with ISO week numbers, but any kind of week number can be stuffed in there, as well as leap years and all that. It is more of a starting framework with favorable properties, on which to build as required. Feel free to use and modify as you see fit.

    create schema [Calendar] authorization dbo;

    go

    -- Calendar Segment table (list of all acceptable IdSegement values)

    --

    create table [Calendar].[Segment]

    (

    [Id] int not null

    , constraint pk_CalendarSegment primary key clustered( [Id] ) with fillfactor = 100

    )

    go

    -- Calendar Year table

    --

    create table [Calendar].[Year]

    (

    [IdSegment] int not null

    , [Year] smallint not null

    , [DayCount] smallint not null

    , [IdSegmentStart] as ( [IdSegment] + 0101 )

    , [IdSegmentEnd] as ( [IdSegment] + 1231 )

    , [DateStart] as ( dateAdd( year, [Year] - 2000, {d '2000-01-01'} ) )

    , [DateNext] as ( dateAdd( year, [Year] - 2000, {d '2001-01-01'} ) )

    , constraint pk_CalendarYear primary key nonclustered( [IdSegment] ) with fillfactor = 100

    , constraint fk_CalendarYear_Segment foreign key ( [IdSegment] ) references [Calendar].[Segment]( Id )

    );

    go

    create unique nonclustered index ak_CalendarYear on [Calendar].[Year]( [Year] ) include ( [IdSegment], DayCount ) with fillfactor = 100; -- fully covering

    go

    -- Calendar Month table

    --

    create table [Calendar].[Month]

    (

    [IdSegment] int not null

    , [Year] smallint not null

    , [Month] tinyint not null

    , [DayCount] smallint not null

    , [IdSegmentStart] as ( [IdSegment] + 01 )

    , [IdSegmentEnd] as ( [IdSegment] + [DayCount] )

    , [DateStart] as ( dateAdd( month, [Month] - 1, dateAdd( year, [Year] - 2000, {d '2000-01-01'} ) ) )

    , [DateNext] as ( dateAdd( month, [Month] - 1, dateAdd( year, [Year] - 2000, {d '2000-02-01'} ) ) )

    , constraint pk_CalendarMonth primary key clustered( [IdSegment] ) with fillfactor = 100

    , constraint fk_CalendarMonth_Segment foreign key ( [IdSegment] ) references [Calendar].[Segment]( Id )

    , constraint ck_CalendarMonth_Month check ([Month] >= 1 and [Month] <= 12)

    , constraint ck_CalendarMonth_DayCount check ([DayCount] >= 1 and [DayCount] <= 31)

    );

    go

    create unique nonclustered index ak_CalendarMonth_year on [Calendar].[Month] ( [Year], [Month] ) include ( [IdSegment], DayCount ) with fillfactor = 100; -- fully covering

    go

    create unique nonclustered index ak_CalendarMonth_month on [Calendar].[Month] ( [Month], [year] ) include ( [IdSegment], DayCount ) with fillfactor = 100; -- fully covering

    go

    -- Calendar Day table

    --

    create table [Calendar].[Day]

    (

    [IdSegment] int not null

    , [Year] smallint not null

    , [Month] tinyint not null

    , [Day] tinyint not null

    , [DayCount] as ( cast( 1 as smallint ) )

    , [IdSegmentStart] as ( [IdSegment] )

    , [IdSegmentEnd] as ( [IdSegment] )

    , [DateStart] as ( dateAdd( day, [Day] - 1, dateAdd( month, [Month] - 1, dateAdd( year, [year] - 2000, {d '2000-01-01'} ) ) ) )

    , [DateNext] as ( dateAdd( day, [Day] , dateAdd( month, [Month] - 1, dateAdd( year, [year] - 2000, {d '2000-01-01'} ) ) ) )

    , [Week] tinyint

    , [WeekDay] tinyint -- [1..7] = monday..sunday

    , [YearDay] smallint

    , constraint pk_CalendarDay primary key clustered( [IdSegment] ) with fillfactor = 100

    , constraint fk_CalendarDay_Segment foreign key ( [IdSegment] ) references [Calendar].[Segment]( Id )

    , constraint ck_CalendarDay_Month check ([Month] >= 1 and [Month] <= 12)

    , constraint ck_CalendarDay_Day check ([Day] >= 1 and [Day] <= 31)

    );

    We fill these tables with some initialization code that is run once:

    -- Generate data to test with

    --

    declare @year as int; set @year = 2000;

    declare @month as int;

    declare @date as datetime;

    while @year <= 2150 begin

    set @date = dateAdd( year, @year - 2000, {d '2000-01-01'} );

    set @month = 1;

    insert into [Calendar].[Segment]( Id )

    select @year * 10000;

    insert into [Calendar].[Year]( IdSegment, Year, DayCount )

    select @year * 10000, @year, datediff( day, @date, dateAdd( year, 1, @date ) );

    while @month <= 12 begin

    insert into [Calendar].[Segment]( Id )

    select @year * 10000 + @month * 100;

    insert into [Calendar].[Month]( IdSegment, Year, Month, DayCount )

    select @year * 10000 + @month * 100, @year, month( @date ), datediff( day, @date, dateAdd( month, 1, @date ) );

    while month( @date ) = @month begin

    insert into [Calendar].[Segment]( Id )

    select @year * 10000 + @month * 100 + day( @date );

    insert into [Calendar].Day( IdSegment, Year, Month, Day, Week, WeekDay, YearDay )

    select @year * 10000 + @month * 100 + day( @date ), @year, month( @date ), day( @date ), datePart( iso_week, @date ), datePart( weekDay , @date ), datePart( dayOfYear, @date );

    set @date = @date + 1;

    end

    set @month = @month + 1;

    end

    set @year = @year + 1;

    end

    go

  • PHYData DBA (1/20/2014)There might be several other ways of doing this without using a Row By Row script. We could not find one that was as easy to read and understand, or that would be as modifiable and portable as this one. Also, since we are not talking about millions of rows it was faster in a big picture sort of way to do it like this.

    As you stated, there are some ways to do this avoiding the row by row processing. Here I can post one I'd use instead (for the dates, but obviously can be done for time too).

    I find it more readable and simpler than a loop (and I believe far more efficient)

    ;with cte as(

    select CONVERT(DATE, '2005-01-01') as [date]

    union all

    select DATEADD(DD, 1, [date] )

    from cte

    where [date] < CONVERT(DATE, '2016-01-01')

    )

    select convert(int, convert(varchar(10), [date], 112)),

    [date],

    convert(varchar(10), [date], 105),

    datepart(dw, [date]),

    datename(dw, [date]),

    day([date]),

    datepart(dy, [date]),

    datepart(wk, [date]),

    datename(mm, [date]),

    datepart(mm, [date]),

    datename(qq, [date]),

    year([date]),

    case when ( year([date]) % 4 = 0) and (year([date]) % 100 != 0 or year([date]) % 400 = 0) then 1 else 0 end as IsLeapYear,

    case when ( datepart(dw, [date]) = 1 or datepart(dw, [date]) = 7 ) then 1 else 0 end as IsWeekend

    from cte option (maxrecursion 0)

    Cheers

  • raulggonzalez (1/21/2014)


    PHYData DBA (1/20/2014)There might be several other ways of doing this without using a Row By Row script. We could not find one that was as easy to read and understand, or that would be as modifiable and portable as this one. Also, since we are not talking about millions of rows it was faster in a big picture sort of way to do it like this.

    As you stated, there are some ways to do this avoiding the row by row processing. Here I can post one I'd use instead (for the dates, but obviously can be done for time too).

    I find it more readable and simpler than a loop (and I believe far more efficient)

    ;with cte as(

    select CONVERT(DATE, '2005-01-01') as [date]

    union all

    select DATEADD(DD, 1, [date] )

    from cte

    where [date] < CONVERT(DATE, '2016-01-01')

    )

    select convert(int, convert(varchar(10), [date], 112)),

    [date],

    convert(varchar(10), [date], 105),

    datepart(dw, [date]),

    datename(dw, [date]),

    day([date]),

    datepart(dy, [date]),

    datepart(wk, [date]),

    datename(mm, [date]),

    datepart(mm, [date]),

    datename(qq, [date]),

    year([date]),

    case when ( year([date]) % 4 = 0) and (year([date]) % 100 != 0 or year([date]) % 400 = 0) then 1 else 0 end as IsLeapYear,

    case when ( datepart(dw, [date]) = 1 or datepart(dw, [date]) = 7 ) then 1 else 0 end as IsWeekend

    from cte option (maxrecursion 0)

    Cheers

    This is equally valid, but it's still row by row. Recursive CTE's are just as much a loop as a while loop. You are also limited to ~32000 iterations of the loop, the maximum for the OPTION (maxrecursion) you've added at the end there. If you were trying to load a dimtime table with every second of a day, you would hit that limit and the code wouldn't complete. At that point you would probably resort to the while loop.

    On Jeff's comment, I completely agree that using date & time data types are much better than a surrogate int with newer versions of SQL server. Oracle does have variables in PL/SQL but no they don't use the @ symbol, they use a colon in front of them, or alternatively the odbc style ? placeholders depending on the client application. I would add that SSAS, at least as far as I recall, is not date type friendly so you might have to use the integers if you are using the DW for a cube.

  • davoscollective (1/21/2014)

    This is equally valid, but it's still row by row. Recursive CTE's are just as much a loop as a while loop. You are also limited to ~32000 iterations of the loop, the maximum for the OPTION (maxrecursion) you've added at the end there. If you were trying to load a dimtime table with every second of a day, you would hit that limit and the code wouldn't complete. At that point you would probably resort to the while loop.

    On Jeff's comment, I completely agree that using date & time data types are much better than a surrogate int with newer versions of SQL server. Oracle does have variables in PL/SQL but no they don't use the @ symbol, they use a colon in front of them, or alternatively the odbc style ? placeholders depending on the client application. I would add that SSAS, at least as far as I recall, is not date type friendly so you might have to use the integers if you are using the DW for a cube.

    I wouldn't say so, the MAXRECURSION option can jump that limit of ~32k when is set to 0. Please check for a time dimension (24h, 1 row per second). This can be inserted into the example table.

    From BOL, http://msdn.microsoft.com/en-us/library/ms175972.aspx

    An incorrectly composed recursive CTE may cause an infinite loop. For example, if the recursive member query definition returns the same values for both the parent and child columns, an infinite loop is created. To prevent an infinite loop, you can limit the number of recursion levels allowed for a particular statement by using the MAXRECURSION hint and a value between 0 and 32,767 in the OPTION clause of the INSERT, UPDATE, DELETE, or SELECT statement. This lets you control the execution of the statement until you resolve the code problem that is creating the loop. The server-wide default is 100. When 0 is specified, no limit is applied. Only one MAXRECURSION value can be specified per statement. For more information, see Query Hints (Transact-SQL).

    --

    --time

    --

    ;with cte as(

    select 1 as interval, CONVERT(time, '00:00:00') as [time]

    union all

    select interval + 1, DATEADD(SECOND, 1, [time] )

    from cte

    where [time] < CONVERT(TIME, '23:59:59')

    )

    select interval,

    [time],

    datename(hh,[time]),

    datename(minute,[time]),

    datename(second,[time]),

    CONVERT(varchar(8), [time])

    from cte option (maxrecursion 0)

    And I run some performance checks and row by row, the dates dimension took 1643ms and the CTE 226ms ... I know 1,7s is not a performance problem, but why doing anything in a loop when you can just do it in a single statement? (considering the cte part of a single statement, of course)

  • Jeff Moden (1/21/2014)


    Just a couple of thoughts...

    First, these two tables are one of the few places where I'd use the natural key of either the date or the time instead of a surrogate key. There's not much worse than trying to troubleshoot something by looking at rows in a table only to have to lookup every single date or time ID. Using the natural key would allow you to not have to make multiple time tables for the sake of continuing to use a surrogate key while still providing more resolution in the time elements.

    Also, the code isn't actually portable. That possibility ended as soon as you defined a variable. I don't know about other databases but, last I heard, Oracle (for example) does not use @ variables.

    And see here... I found using an identity column with a start date equivalent to the beginning of data history i.e. for the last few DW's I've built 2000-01-01 to be very handy.

    To calculate the keys for the DW, I could just DATEDIFF(dd,'1999-12-31',datadate) and not have to do a lookup. For time, 00:00:00 was key 1 and 1440 = 23:59 DATEDIFF(N,cast('00:00:00' as time),cast(datadate as time))+1 generated that key. Typically when a greater depth of granularity was necessary I'd just include the datetime value in the fact table.

    Yes it prevented me from backfilling, but see... I used a Tally table to fill the date dimension... so meh? And to calculate the duration in days, I could just subtract one key from another :).



    --Mark Tassin
    MCITP - SQL Server DBA
    Proud member of the Anti-RBAR alliance.
    For help with Performance click this link[/url]
    For tips on how to post your problems[/url]

  • Ah yes you are correct, my mistake, the maxrecursion 0 removes the limit. I agree, go for the better performing code. It is one statement, but internally it is still a loop. In this example I would use this code, but recursive cte should be avoided if there is a true set based alternative that will perform and scale better.

  • davoscollective (1/21/2014)


    Ah yes you are correct, my mistake, the maxrecursion 0 removes the limit. I agree, go for the better performing code. It is one statement, but internally it is still a loop. In this example I would use this code, but recursive cte should be avoided if there is a true set based alternative that will perform and scale better.

    No worries, what I cannot find though is any documentation where it says that a recursive CTE internally becomes a loop, if you read it somewhere I'd love to have a look at it.

    And for this particular case, I cannot see any 'true set based' alternative, maybe because I'm used to recursive CTE to generate this kind of samples.

    Cheers!

  • raulggonzalez (1/21/2014)


    davoscollective (1/21/2014)


    Ah yes you are correct, my mistake, the maxrecursion 0 removes the limit. I agree, go for the better performing code. It is one statement, but internally it is still a loop. In this example I would use this code, but recursive cte should be avoided if there is a true set based alternative that will perform and scale better.

    No worries, what I cannot find though is any documentation where it says that a recursive CTE internally becomes a loop, if you read it somewhere I'd love to have a look at it.

    And for this particular case, I cannot see any 'true set based' alternative, maybe because I'm used to recursive CTE to generate this kind of samples.

    Cheers!

    Jeff Moden has a great article on exactly that: http://www.sqlservercentral.com/articles/T-SQL/74118/

    The key is that any type of recursive code is recursive because it repeatedly calls itself. There might be some optimization possible (e.g. a while loop has the overhead of multiple statements being executed) but it's still repeating and incrementing in iterations, the same way a loop works.

  • mtassin (1/21/2014)


    Jeff Moden (1/21/2014)


    Just a couple of thoughts...

    First, these two tables are one of the few places where I'd use the natural key of either the date or the time instead of a surrogate key. There's not much worse than trying to troubleshoot something by looking at rows in a table only to have to lookup every single date or time ID. Using the natural key would allow you to not have to make multiple time tables for the sake of continuing to use a surrogate key while still providing more resolution in the time elements.

    Also, the code isn't actually portable. That possibility ended as soon as you defined a variable. I don't know about other databases but, last I heard, Oracle (for example) does not use @ variables.

    And see here... I found using an identity column with a start date equivalent to the beginning of data history i.e. for the last few DW's I've built 2000-01-01 to be very handy.

    To calculate the keys for the DW, I could just DATEDIFF(dd,'1999-12-31',datadate) and not have to do a lookup. For time, 00:00:00 was key 1 and 1440 = 23:59 DATEDIFF(N,cast('00:00:00' as time),cast(datadate as time))+1 generated that key. Typically when a greater depth of granularity was necessary I'd just include the datetime value in the fact table.

    Yes it prevented me from backfilling, but see... I used a Tally table to fill the date dimension... so meh? And to calculate the duration in days, I could just subtract one key from another :).

    Thank you both for your comments. I agree with them both. For our use we wanted something simple that only needs to work with SQL and our custom DW. The Surrogate Key for the date if you look at it is a date value that is readable by every that has to use this. The Surrogate Key for the Time table is the seconds in the day. It won't work for every one but it works well for us. We wanted to keep the Primary Keys and Suragate Keys in all the Dim tables as Integers. We never even considered using Date or Time values.

  • davoscollective (1/21/2014)


    raulggonzalez (1/21/2014)


    PHYData DBA (1/20/2014)There might be several other ways of doing this without using a Row By Row script. We could not find one that was as easy to read and understand, or that would be as modifiable and portable as this one. Also, since we are not talking about millions of rows it was faster in a big picture sort of way to do it like this.

    As you stated, there are some ways to do this avoiding the row by row processing. Here I can post one I'd use instead (for the dates, but obviously can be done for time too).

    I like this solution a lot. However not everyone understands CTE's and how to set Max recursion as you do.

    Thanks for your post!:cool:

  • davoscollective (1/21/2014)


    raulggonzalez (1/21/2014)


    davoscollective (1/21/2014)


    Ah yes you are correct, my mistake, the maxrecursion 0 removes the limit. I agree, go for the better performing code. It is one statement, but internally it is still a loop. In this example I would use this code, but recursive cte should be avoided if there is a true set based alternative that will perform and scale better.

    No worries, what I cannot find though is any documentation where it says that a recursive CTE internally becomes a loop, if you read it somewhere I'd love to have a look at it.

    And for this particular case, I cannot see any 'true set based' alternative, maybe because I'm used to recursive CTE to generate this kind of samples.

    Cheers!

    Jeff Moden has a great article on exactly that: http://www.sqlservercentral.com/articles/T-SQL/74118/

    The key is that any type of recursive code is recursive because it repeatedly calls itself. There might be some optimization possible (e.g. a while loop has the overhead of multiple statements being executed) but it's still repeating and incrementing in iterations, the same way a loop works.

    I do like the CTE method that was posted better than my loop. I honestly did not spend a lot of time trying to research a better method because this is only used to add rows to the existing table and does not execute on more than a yearly basis for our solution.

  • I want to thank everyone that took the time this morning to read my article and comment on it.

    I had no idea that my very first submission would end up being the top link on the daily email newsletter!

    Thank you all for making me feel well read and received! 😀

Viewing 15 posts - 1 through 15 (of 58 total)

You must be logged in to reply to this topic. Login to reply