Factless fact to resolve business requirements or other design approach?

  • Our client would like to have the list of candidates/employees with their corresponding position, job capacities (consultant, analyst, etc.) and work history. Employee list will be turned into a dimension table linked to other fact relationships (e.g. fact_salary, fact_proposal_investment, etc.) same with job position as a candidate dimension.

    My worry is on the Work/Job history  on how will i reflect it to the business requirements as it contains relationship with job position.

    Is Work/Job history a candidate for a fact or a dimension?

    If it is a dimension, should i create a factless fact in order to relate employee, job position dimension?

    Appreciate your kind help. Thank you

  • I suppose you could use a Factless Fact Table as a "bridge" or "join" table for this type of thing but what I recommend is paying real close attention to the actual tables themselves, whether they are Dimension or Fact tables.  I recommend that you work for what every client dreams of but usually cannot describe in any way other than a "Point-in-Time" database/warehouse/whatever you want to call it but I'll refer to simply as a database and tables (that's the lingo I most familiar with).  In other words, a database where the tables are built in such a fashion that you can determine the condition of any thing at a given "Point-in-Time" (PIT from here on) with relatively simple queries.

    To do that, you need "Pure Type 6 SCDs" (Slowly Changing Dimensions), which are also known by some as "Temporal Tables".  You can find a brief description of what they are at the following link:

    https://en.wikipedia.org/wiki/Slowly_changing_dimension#Pure_type_6_implementation

    For any give "Fact" or "Dimension" table, a "Temporal Table" will have a row for each period of time that a row is or has been active and will contain a start date/time and an end date/time column for when the contents of the row were active and valid.  They can be built as a single table (contains the active row and all the history rows, which has some great advantages as well as some great disadvantages) or as two tables, one of which only contains the active rows and the other contains only inactive historical rows (which also has some great advantages and disadvantages).

    The bottom line is that the currently active row will have a start date/time and a NON-NULL end date/time far in the future.  The non-null end date/time keeps you from having to use OR in your criteria.

    When people make such tables, they also make the awful mistake of using the very last instant in the calendar that whatever RDBMS they're using can handle.  For example, for the DATETIME datatype in T-SQL, people will make the mistake of using "9999-12-31 23:59:59.997".  The reason why that's such a horrible mistake is that there are a huge number of "tricks of the trade" in temporal calculations that require the use of an "exclusive" end date/time.  In other words, instead of trying to include (inclusive) both end points for a period, it's frequently a whole lot easier (not to mention much more "bullet proof" if someone decides to change a datatype on a column, which also changes the temporal resolution, but the code will still survive with no changes) to have a "closed" (inclusive) start date/time and an "open" (exclusive) end date/time.

    If you add even 100 nanoseconds to even a DATETIME2(7) datatype to get the first instant where some period is no longer true and you have used something (for example) "9999-12-31 23:59:59.9999999' (like MS temporal tables unfortunately do), you'll get a date over flow error.

    If, however, you use just '9999' for your max end date for the currently active row, that equates to "9999-01-01 00:00:00.0000000" which leaves you plenty of room to do temporal calculation tricks of add just about any temporal value less than a year without such an overrun.  It's also easy to remember that "4 nines" is the datetime that everyone should use for "hasn't happened yet" for an end date without using a blasted NULL, which WILL make your life much more complicated.  NULLs have great purpose and utility but this isn't one of the places you should use it.

    This also means that you don't need a bloody "Is Active" bit or other equally annoying and very low utility column.

    All of that allows for easy coding and easily standardized code.  For example, if you want to know what was active for any table (regardless of number of rows it contains) at a given point in time, the WHERE clause is always the same...

    WHERE @DesiredDT >= StartDTCol and @DesiredDT < EndDateCol

    I would seriously avoid using BETWEEN because it always includes both end points and so isn't "bullet proof".

    And, correct... the example above doesn't involved any of the temporal magic I was speaking of... you'll know what I mean the first time you have to use it (usually comes in the oversimplified form of EndDT+1) and you'll be thankful you left some "headroom" in the end date/times for that.

    Just remember, StartDT must be inclusive... EndDT must be the first instant where something is no longer true (exclusive).

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Yes sir, the PIT records would really be a challenging data to handle. I'll design and implement a SCD type 6 for this section. I got confused on the work/history data which is dependent to other dimensions. I'm also trying to avoid the snowflake schema as it would be very hard to track and to maintain and the domino effect would be very huge in case of a mess-up transformation.

    Thank you for your insights and response

  • Like Granny used to say, "Mind the pennies and the dollars will take care of themselves". 😀

    In this case, I'd concentrate on the tables that need to be PITted.  The rest will come naturally.  It's kind of like having the best corn on the cob in the world... step 1 is to plant the corn right. 😀

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Modern companies and organizations operate under the conditions of a large volume constantly changing information, which and constantly changing information that needs to be quickly analyzed and and make the right decisions. There is a rapid development of computer technology and information technologies. It's hard to find a company that is not It's hard to find a company that is not engaged in the development of information technologies. Today's business leaders are fully aware of the fact that at the present time success and profitability of the company entirely depends on the level of development IT-technologies, speed and quality of processing information processing speed and quality, reasonableness and balanced of decisions taken.

    • This reply was modified 2 years, 9 months ago by  JacobW.

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply