When is too many columns too many columns?

  • In this particular case it is still a 32-bit system and most queries against it are ultimately driven by factors like "first Monday in march 2016" etc so the date table is usually a driving force. Plus the ints in question are carefully arranged so that CASTing them to a DateTime gets you the date, if push comes to shove.

    If I were doing it from scratch I'd probably just use Date as the key these days, it's more correct semantically and avoids the possibility of a constraint getting missed and time values creeping in where they shouldn't. The performance differential would need to be massive to justify a DateTime type (though I take your point on DateTime2)

  • Kevlarmpowered - Thursday, September 6, 2018 6:16 AM

    Jeff Moden - Wednesday, September 5, 2018 7:47 PM

    Kevlarmpowered - Wednesday, September 5, 2018 6:51 PM

    Without having it right in front of me, I'm willing to say that almost all of the columns are either int, datetime, numeric, float or money.  The design pushes everything into a different dimension ... and I mean everything.  So instead of storing a Y or N, it stores a -1, 0 or 1 which you have to join with a specific dimension to convert -1, 0, 1 to NULL, N or Y.  Dates are stored at int as yyyymmdd and time is stored as seconds elapsed from midnight (if no time is required, it shows 0 indicating NULL).  Some dimensions work because they are repeatedly used (i.e., -1, 0, 1, 2 translating from NA, Black, Red, Blue) but some dimensions are 1:1 because of custom freeform notes.  As in, NoteID 178 ties to another dimension which has NoteID 178 and the actual Note.  There is no default Note because notes are free form so you have thousands of NoteIDs which have an actual Note of (blank) and/or whitespace ... not NULL.

    Almost every column has a dimension to go with it... unless it is an actual factual numeric value.  Factual text gets converted to a dimension even in the smallest of cases such as the Y or N.

    This isn't a shot at you personally.  It's a shot at the whole misbegotten concept and you're definitely not the only one to do this type of thing.

    I guess I don't understand why people do things like this.  For example, the fact that you're storing a -1 means that it's at least a SMALLINT, which is 2 bytes instead of just one for storing a CHAR(1) to hold Y, N, or Null not to mention that you're doing an otherwise unnecessary translation for display purposes.. 

    You're also storing dates as INTs which means that you've lost all temporal calculation capabilities unless you first do the calculation to convert it back to a temporal datatype.  You've done the same with the time by storing it as seconds instead of a temporal datatype.  And, both columns combined into a single temporal datatype take the same number of bytes (or less) as the two INTs and have the ability to quickly and easily be converted to multiple display types without destroying the ability to do temporal math when needed without having to do a shedload of conversions.

    IMHO, data warehouses aren't supposed to be "this is the way I want to display it" warehouses nor should they be made inflexible by morphing temporal datatypes into INTs, which is then all they can be until you do some awful conversion or some lookup in a Dimension table that has been equally crippled.

    I'll just never understand why people think this type of thing is efficient, clever, or even useful.  It's like they're compelled to throw themselves back in time to before temporal data types existed.  Worse yet, it's almost always a duplication of data that is already and actually efficient, doesn't need to be clever (it's damned clever in its own right already), and is incredibly useful as it is.

    Oh I agree with you... I didn't build this DWH. I have built a few DWH and I have worked in others that I have purchased and never have I ever seen one like this.  I came asking questions here because the team who built it promises me it was built the best way possible so much so they had me doubting my experience.  So I travel the internet trying to learn more and see and ask the opinions of others.  I am trying to understand and walk in their shoes if you will... I already learned that some people are OK with this super wide tables where I have never been a fan.  Super wide tables do serve a purpose, but I think not in this way.  I'd rather build out many tables and provides superwide views if they were really necessary.

    This current design would have you joining the date dimension and time dimensions 140 times to convert INTs to date/time.  I find it easier/faster just to try_convert(date, try_convert(varchar, nullif(int, 19000101))) because the added overhead of the joins are slower for the tests I have done.  Plus, if the thing only has one status, you essentially have 139 copies of 19000101.  You have to do the try_convert because you will find ints that don't make logical dates... how this happens I do not know.  i.e., 20180231 (February 31st?) and this is all before you can use the actual temporal calculations that are natively built-in...

    I pulled the datatypes this morning... I was right on the money in my previous guess with the exception of many more money columns than I would have thought.

    Thanks for the feedback and for taking my comments the right way.  It's always interesting that people that design these things claim that they're done "the best way" and yet require "tribal knowledge" to use the damned things and the code necessary to do some of the simplest of things turns out to be necessarily crazy.  Like many things, I think that some of the people that "design" data warehouses actually create the data warehouses because they don't actually know how to write T-SQL well and so need to abstract it to something they understand.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • andycadley - Thursday, September 6, 2018 6:19 AM

    In this particular case it is still a 32-bit system and most queries against it are ultimately driven by factors like "first Monday in march 2016" etc so the date table is usually a driving force. Plus the ints in question are carefully arranged so that CASTing them to a DateTime gets you the date, if push comes to shove.

    If I were doing it from scratch I'd probably just use Date as the key these days, it's more correct semantically and avoids the possibility of a constraint getting missed and time values creeping in where they shouldn't. The performance differential would need to be massive to justify a DateTime type (though I take your point on DateTime2)

    Thanks for the feedback, Andy. 

    For that type of thing (first Monday in March 2016), a date table certainly makes things a whole lot easier.  You'll get no argument from me there.  And, it sounds like you DID make the INTs from the underlying date serial number, so very well done there.

    I do have to admit that I'm a bit gob smacked that folks would still be stuck with a 32 bit system except if it's a micro appliance of some sort.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 3 posts - 46 through 47 (of 47 total)

You must be logged in to reply to this topic. Login to reply