Late arriving dimensions

  • Hi,
    We all know it's a best practice to prepare dimensions before facts so that surrogate keys can be added to the fact table at the time the fact table is loaded.
    I have an ETL process on an hourly load schedule, and I see edge cases where a new fact arrives referencing a new dimension member that appears between the time the dimension is prepared and the time the fact table is loaded.
    This results in a NULL lookup for the surrogate key.  It's hard for me to adjust loaded rows based on time, as I don't have a LastModifiedDate column in all the dim tables.
    I can see how we could handle this in the next hour by re-running surrogate key generation for the NULL rows.
    It seems wasteful to run surrogate key generation twice, and any downstream processes that use the surrogate key (e.g. cubes) will also have to be re-run.
    Are there other ways to handle this that I'm not seeing?
    Best
    Lee

  • I'm thinking the only practical thing to do is to run the 2nd lookup with highly optimised SQL targeted at the NULLs, i.e. a filtered index on the NULL keys for the late dims.
    It will be incremental processing, but hopefully minimal.
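
    A minimal T-SQL sketch of that idea, assuming hypothetical dbo.FactSales and dbo.DimCustomer tables where the fact row also stores the natural key (CustomerBusinessKey) so late rows can be re-resolved:

        -- The filtered index keeps the second pass cheap: it only contains
        -- the (hopefully few) fact rows still waiting for a dimension member.
        CREATE NONCLUSTERED INDEX IX_FactSales_LateArrivals
            ON dbo.FactSales (CustomerBusinessKey)
            WHERE CustomerKey IS NULL;

        -- Second pass: resolve surrogate keys only for the late rows.
        UPDATE f
        SET    f.CustomerKey = d.CustomerKey
        FROM   dbo.FactSales AS f
        JOIN   dbo.DimCustomer AS d
               ON d.CustomerBusinessKey = f.CustomerBusinessKey
        WHERE  f.CustomerKey IS NULL;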

  • Hi,
    Late arriving dimensions are handled in different ways. We have a process where we capture the fact record, hold it in a separate table, and a separate job runs every 24 hours, or on demand, to load those rows into the data warehouse.  If you are following the Kimball methodology, there are a few solutions discussed here:
    https://decisionworks.com/2006/04/design-tip-78-late-arriving-dimension-rows/

    Here is another link (not Kimball) that describes alternative approaches:
    http://www.leapfrogbi.com/2015/06/08/early-arriving-facts-late-arriving-dimensions/
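
    A rough sketch of that capture-and-retry pattern, using a hypothetical etl.FactSales_Suspense holding table and the same assumed schema as the example above:

        -- During the load: route facts whose dimension member doesn't
        -- exist yet into a suspense table instead of the warehouse.
        INSERT INTO etl.FactSales_Suspense (CustomerBusinessKey, OrderDate, Amount)
        SELECT s.CustomerBusinessKey, s.OrderDate, s.Amount
        FROM   stg.Sales AS s
        LEFT JOIN dbo.DimCustomer AS d
               ON d.CustomerBusinessKey = s.CustomerBusinessKey
        WHERE  d.CustomerKey IS NULL;

        -- Retry job (every 24 hours or on demand): load the suspended rows
        -- that now match a dimension member, then clear them from suspense.
        -- In practice the two statements would run in one transaction.
        INSERT INTO dbo.FactSales (CustomerKey, CustomerBusinessKey, OrderDate, Amount)
        SELECT d.CustomerKey, x.CustomerBusinessKey, x.OrderDate, x.Amount
        FROM   etl.FactSales_Suspense AS x
        JOIN   dbo.DimCustomer AS d
               ON d.CustomerBusinessKey = x.CustomerBusinessKey;

        DELETE x
        FROM   etl.FactSales_Suspense AS x
        WHERE  EXISTS (SELECT 1
                       FROM   dbo.DimCustomer AS d
                       WHERE  d.CustomerBusinessKey = x.CustomerBusinessKey);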


  • leehbi - Tuesday, April 10, 2018 7:05 AM

    I'm thinking the only practical thing to do is to run the 2nd lookup with highly optimised SQL targeted at the NULLs, i.e. a filtered index on the NULL keys for the late dims.
    It will be incremental processing, but hopefully minimal.

    You usually want to avoid putting NULL values in the fact table's foreign key columns, as that would force you to always use an OUTER JOIN when querying the fact.  For many years I've used the second method described in the DecisionWorks (Kimball Group) article referenced by IrfanHyd, to ensure that we aren't "losing" fact records.  You would then do what you're describing as the "2nd lookup" against the fact records that currently point at this Unknown dimension member, to see if they now match a new dimension record.  This is one of the few times that it makes sense to update an existing fact.
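
    A sketch of that re-pointing step, assuming the same hypothetical tables as above and the common convention of -1 for the Unknown member:

        -- Re-point facts parked on the Unknown member (-1) once the real
        -- dimension row has arrived. This requires the fact to carry the
        -- natural key (CustomerBusinessKey) for the re-lookup.
        UPDATE f
        SET    f.CustomerKey = d.CustomerKey
        FROM   dbo.FactSales AS f
        JOIN   dbo.DimCustomer AS d
               ON d.CustomerBusinessKey = f.CustomerBusinessKey
        WHERE  f.CustomerKey = -1;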

  • Thanks for sharing the links. They should make for some interesting reading.

  • I handle it by inserting a placeholder so the business knows the data is late and incomplete, and by separating the late facts so they can be processed on their own. Because the fact data is large, constantly reprocessing it is a pain; separation keeps the process small and lets it run on its own, faster schedule. For example, if the main process runs every hour, the separate one might run every 30 minutes. It may not rebuild cubes every 30 minutes, but the point is that the next hourly process doesn't have to handle the next hour of data plus the previous hour's late data.

  • Don't use NULL.  Assign it to an "Unknown" bucket.  For the load into the fact tables, I do a LEFT JOIN to the dimension tables with a COALESCE: if the lookup comes back NULL, the row is assigned an integer of -1, which points to the Unknown member.
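
    A minimal sketch of that load, using the same hypothetical tables as the earlier examples:

        -- The LEFT JOIN keeps unmatched facts; COALESCE maps a missed
        -- lookup to the Unknown member (-1) instead of loading a NULL key.
        INSERT INTO dbo.FactSales (CustomerKey, CustomerBusinessKey, OrderDate, Amount)
        SELECT COALESCE(d.CustomerKey, -1) AS CustomerKey,
               s.CustomerBusinessKey,
               s.OrderDate,
               s.Amount
        FROM   stg.Sales AS s
        LEFT JOIN dbo.DimCustomer AS d
               ON d.CustomerBusinessKey = s.CustomerBusinessKey;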

    As for the comment, "This is one of the few times that it makes sense to update an existing fact": fact records in accumulating snapshot fact tables are updated as a matter of routine.  There's not enough information here to know whether that would be the case.
