Should a DW track data origin system?

  • Hi, we are in the beginning stages of designing our data warehouse. We have a lot of history (15+ years) we'd like to move into a warehouse. The fact we are beginning our warehouse with is ticket sales and we've got our star shaped fact and dimension tables laid out to our liking. The data which will end up in our warehouse comes from 4 different SQL databases, 2 are purely historical and 2 are production but they are from 4 different pieces of software.

    Our question is should we design a dimension table to keep track of the which source system the records came from or should we build the origin fields directly into each dimension table. Would it disrupt our table layout and the efficiency of the warehouse if we did design an "origin" table which every dimension touched?

  • Personally I would just include a field in the dimension that stores the source where the record came from.

    Is it possible though the dimension info for one record can come from multiple sources?

    How would you use that information? Are you going to filter on it in reports? Or is it just for auditing?

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • For years we struggled with being able to identify where certain transactions came from. We finally added a field in the last year, to a few of our key tables called source_id it contain a numeric value that represents a particular source. We have a table that contains this source_id with other info about that source. If you can do it know, DO IT. 🙂 It will save you some head ache down the road.

    -------------------------------------------------------------
    we travel not to escape life but for life not to escape us
    Don't fear failure, fear regret.

  • I think auditing the history of data is important. It's not something that can be added retrospectively so the sooner you implement a solution, the better.

    My personal preference is for an AuditKey column in every dimension. This is a foreign key to an Audit dimension which contains detailed information on how that piece of data was put there, e.g. the date and time it was last changed as well as the ETL involved in moving it etc. You could go to town on this and it depends very much on how complex your ETL is. Keeping the history of the changes made to data can be very useful in tracking errors or problems down the line. By adding a step to all your ETL processes that inserts to the Audit dimension you can keep this system going for future changes.

  • Cliff O'Reilly (6/6/2014)


    I think auditing the history of data is important. It's not something that can be added retrospectively so the sooner you implement a solution, the better.

    My personal preference is for an AuditKey column in every dimension. This is a foreign key to an Audit dimension which contains detailed information on how that piece of data was put there, e.g. the date and time it was last changed as well as the ETL involved in moving it etc. You could go to town on this and it depends very much on how complex your ETL is. Keeping the history of the changes made to data can be very useful in tracking errors or problems down the line. By adding a step to all your ETL processes that inserts to the Audit dimension you can keep this system going for future changes.

    + 1

  • Thank you all for your replies they have been very helpful.

    @Koen - Our intention is primarily for auditing but I can see scenarios where it would be useful for reporting. For instance the CFO may wish to see a summary of the ticket data from just System A, just System B or perhaps a summary of A and B omitting systems C & D.

    @below86 - Yes, it is certainly easier to keep track from the beginning rather than adding later. We will definitely keep track and probably with a table as you Cliff & Lempster suggested.

    @Cliff, & Lempster - I too like the idea of a separate table I just wasn't sure if it was the right way to go. I have a meeting with the vested parties this afternoon and if they agree too we'll add it as a separate table with a foreign key into each of our Ticket Fact as well as Location, Ticket Type, and Event dimension tables.

  • Oh and a little bit of extra information on the 4 systems, 2 of them are point of sale based data and 2 are from post POS settlement/adjustment software. Ideally the POS software would always be correct because POS operators would always do their job and void/refund excess tickets however reality is different and it is easier to make adjustments to ticket sales in the settlement/adjustment software. So when all of this data goes into the warehouse it will be POS Sales/Refunds + Adjustments = Bottom Line which should then match to our separate accounting software.

  • We have a sourcesystem table, each source is defined. We needed the ability to enable/disable sources and this accomplishes it, and leaves us a trail in case something strange happens (Super Storm Sandy forced a data center there to close, no problem, I disabled the source to prevent pkgs from looking for it).

    Further, all the auditing carries this sourcesystemkey. We host, so it is critical that I be able to determine very quickly where a problem is coming from.

    Strongly agree on the auditing, also. It has saved my butt many times. Every table has an audit key from which I can determine when it ran, what sources it hit, how many records returned from each, dimensions referenced with hits,misses, unknowns, etc etc.

    Lastly, Event Handlers. While easier than expected to implement, this also has saved my butt many times. When there is a problem and we need to consider killing pkgs, I can consult my logger to see what step every executing pkg is in. We had some delete before insert pkgs so would not want to kill past delete before insert but never knew quite where the pkg was. Now we know.

    Build this stuff out sooner rather than later.

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply