Deleted source records in dimension tables

  • I know, I know. The source data shouldn't have deleted records. They should be marked inactive. However I can see that not happening due to third party systems, odd business practices, general bad practices, etc.

    I've already got the three cardinal fields on all my tables regardless of what type of SCD is being used - EffectiveDt, ExpirationDt, CurrentRow.

    Does anyone have a best practice for this kind of situation? Here's my thought on it and luckily I'm early into the process of building the warehouse:

    Use the ExpirationDt for the date the record was deleted and leave the CurrentRow as 'Current'.

    Pros and cons of this? I don't want to delete records from the warehouse since historically they did exist.

  • I know, I know. The source data shouldn't have deleted records. They should be marked inactive.

    It's possible to see this as being valid. I deal with deleted order details because they have in fact been deleted from the order for one reason or another. These populate a fact table, however. I can't think of a process that deletes dimension data. Assuming the tables are keyed, it would mean that they are not referenced by a related table.

    If the row is referenced by a fact record, my initial thought is to have a deleted flag. It might even be exposed as an attribute. If the row is not referenced, I would consider deleting it. There are reasons to keep it, but for most companies, knowing the list of items never used is not a useful piece of information.

  • I'm dealing with many years of various ways of bringing in data (there's still a SQL 2k server in use) and a couple of rather unfriendly third party systems that only hold data for 30 days. So there will be and are deleted dimension records.

    I really, really, really don't want to have to search all the fact tables and bridge tables that might access the deleted records. We've got plenty of space and since I don't know what tables will incur this problem a deleted flag would be iffy and needed on every ETL.

    I think using the ExpirationDt will work because if there's a current record with a date that's less than 2999-12-31 then it's no longer in the source. That also indicates when it was removed and keeps the time slices accurate.

    I'll leave the post open without an answer for a while (nothing against you Ron) to give other people a chance to chime in with what they do in their environments.

  • nothing against you Ron

    I don't take things personally. These were suggestions, but your the person on the ground.

  • any time you are bringing data into the DW using a Delta you will potentially have records that are not in the Delta. This may be because they are outside the scope of the Delta or because they have been physically removed from the source data. If this is important to you then you will need to run a cleanup script to identify records you expected to see in the delta but were not there and have to assume that they have been deleted and mark them appropriately.

    This still leaves you an issue with how to identify records which you are not expecting to be in the Delta but have been removed from source.

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply