Any ideas for solving a data lineage requirement using SQL Server (or other database?)

  • Hi,
    We have the following requirement for a new system we're building and I'm trying to figure out the best type of schema for it - also open to alternative database technologies if it makes sense. 

    We are building a system where multiple sources of data can come in and go through an ETL-type process, in which one of the sources is chosen as the "official" source. At the end of the process, the data can be stored in a normal, traditional table - but we need to be able to trace the lineage of each data point on a column-by-column basis. In other words, different columns in the same table may have come from different sources. We need to be able to tell which source each data point came from, when it was saved, and potentially other metadata about it - e.g. which ETL rule was used to produce the column, etc.

    I normally hate key-value tables, but I was starting to think maybe one makes sense here, at least as a working table. I could then store whatever metadata I need for each value as it comes in. I could even put a persisted, pivoted table on top of that, so that it's performant when people want to query the data in a more traditional way. My experience with key-value tables is that they get horribly slow as the data grows. It's also very hacky, because we will have multiple data types, and then there's the question of what data type you use for the value column, etc.
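    Just to make that idea concrete, something like the sketch below - all names and types are purely illustrative, not our actual design, and the value column ends up as a lowest-common-denominator type, which is exactly the data-type problem I mentioned:

    -- Purely illustrative key-value working table (names/types are made up)
    CREATE TABLE dbo.StagingDataPoint
    (
        RecordID     INT            NOT NULL,
        AsOfDate     DATE           NOT NULL,
        ColumnName   SYSNAME        NOT NULL,
        ColumnValue  NVARCHAR(4000) NULL,     -- everything squeezed into one type
        SourceSystem VARCHAR(50)    NOT NULL,
        LoadedAt     DATETIME2(0)   NOT NULL DEFAULT SYSUTCDATETIME(),
        CONSTRAINT PK_StagingDataPoint PRIMARY KEY (RecordID, AsOfDate, ColumnName)
    );

    -- Pivoted read for traditional querying; a view like this generally can't be
    -- indexed (PIVOT isn't allowed in indexed views), so "persisting" it would mean
    -- materializing it as part of the ETL.
    SELECT RecordID, AsOfDate, [Column1], [Column2], [Column3]
    FROM   (SELECT RecordID, AsOfDate, ColumnName, ColumnValue
            FROM   dbo.StagingDataPoint) AS src
    PIVOT  (MAX(ColumnValue) FOR ColumnName IN ([Column1], [Column2], [Column3])) AS p;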

    Last night, I had a different idea - where each traditional data table has a "shadow" table, with the exact same schema and unique key. However, the shadow table would be all XML types, and then I can store whatever metadata I want about each column and join the two tables when needed. (Most of the time people will not need to access the metadata; it's shown more as information when researching data points or comparing sources side by side.)
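    Roughly what I have in mind (table and column names are hypothetical; the real table has more columns):

    -- Rough sketch of the shadow-table idea; names are hypothetical
    CREATE TABLE dbo.OfficialData
    (
        RecordID INT  NOT NULL,
        AsOfDate DATE NOT NULL,
        Column1  DECIMAL(18,4) NULL,
        Column2  DECIMAL(18,4) NULL,
        CONSTRAINT PK_OfficialData PRIMARY KEY (RecordID, AsOfDate)
    );

    CREATE TABLE dbo.OfficialDataLineage
    (
        RecordID     INT  NOT NULL,
        AsOfDate     DATE NOT NULL,
        Column1_Meta XML  NULL,  -- e.g. <meta source="A" loadedAt="2017-03-31T16:00:00" rule="R7"/>
        Column2_Meta XML  NULL,
        CONSTRAINT PK_OfficialDataLineage PRIMARY KEY (RecordID, AsOfDate),
        CONSTRAINT FK_OfficialDataLineage FOREIGN KEY (RecordID, AsOfDate)
            REFERENCES dbo.OfficialData (RecordID, AsOfDate)
    );

    -- Joined only when someone is researching a data point
    SELECT d.RecordID, d.AsOfDate, d.Column1,
           l.Column1_Meta.value('(/meta/@source)[1]', 'varchar(50)') AS Column1_Source
    FROM   dbo.OfficialData d
    JOIN   dbo.OfficialDataLineage l
      ON   l.RecordID = d.RecordID AND l.AsOfDate = d.AsOfDate;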

    Has anyone done anything like this before? Again, I'm open to other technologies besides SQL Server.

    Thanks!
    H

    Interesting concept. However, without a lot more detail, it's somewhat difficult to make a useful recommendation. I'm not a fan of EAV designs, as they are the very worst when it comes to database design. They go against pretty much every shred of common sense associated with databases, and performance usually stinks to high heaven as well. Just trying to develop a useful query against an EAV design is often a nightmare. Data types are tossed out the window too. Anyway, I'd want to know more about overall record size, as well as data volume per unit time. That may shed light on a better strategy.

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

    Yes - agreed about the EAV design. I'm trying to avoid it at all costs - it's tempting because on paper it gives me the functionality I need. But I'm avoiding it for all the reasons you mentioned.

    Figure the table will have about 20-30 columns, mostly nullable numeric data. It will be keyed by an ID-AsOfDate pair. It will have a growth rate of around 1 million records per year.

    I'm surprised that this isn't a requirement that commonly comes up.

    Your lineage requirements are more complicated than mine. I do have some individual columns simply to record the ETL Process ID that updated them, as they are critical one-time updates (e.g. Customer First Contact Date and Time).

    One small thing that might help you though: investigate the use of SEQUENCE objects. They are fairly new in SQL Server (2012 and later). I use a single sequence so that each different ETL process gets its own ID. That way, when all the data is in the warehouse, it is possible to know which process added or last updated either the entire record or certain individual critical pieces.
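    Something along these lines (object names here are just for illustration):

    -- Illustrative only: a single sequence shared by all ETL processes (SQL Server 2012+)
    CREATE SEQUENCE dbo.ETLProcessRunID AS INT START WITH 1 INCREMENT BY 1;

    -- Each process grabs its own ID at the start of a run...
    DECLARE @RunID INT;
    SET @RunID = NEXT VALUE FOR dbo.ETLProcessRunID;

    -- ...and stamps it on whatever it inserts or updates, e.g. a critical column like
    -- Customer First Contact Date and Time could carry a companion RunID column.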

    If your table is only going to be 20-30 columns, why not simply add another set of columns that store where the data came from, along with whatever metadata you want? That seems a lot simpler than trying to load and then join another table whenever you want to know that information.
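    In other words, something like this (column names are only illustrative):

    -- Illustrative: metadata carried inline, alongside each data column
    CREATE TABLE dbo.OfficialDataInline
    (
        RecordID         INT           NOT NULL,
        AsOfDate         DATE          NOT NULL,
        Column1          DECIMAL(18,4) NULL,
        Column1_Source   VARCHAR(50)   NULL,
        Column1_LoadedAt DATETIME2(0)  NULL,
        Column2          DECIMAL(18,4) NULL,
        Column2_Source   VARCHAR(50)   NULL,
        Column2_LoadedAt DATETIME2(0)  NULL,
        CONSTRAINT PK_OfficialDataInline PRIMARY KEY (RecordID, AsOfDate)
    );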

    Most of the reason that it doesn't "come up" is because the organizations that need it have well-established ETL processes where the right person can easily track down all the source data, because they ETL any source data into STAGING tables and then either archive that data in an organized fashion or just "version it" in some way. It's not usually done at the column level though. Most of the time, data from different sources remains separated by source, as a means of "NEVER DESTROY YOUR INPUTS". That's what makes this requirement somewhat unique. It's highly unusual to have different sources of data feeding into the same data record.

    However, that said, I think your "shadow table" concept isn't a bad idea. Simply have slightly different data types in it for the same key value from the original table, with one column per metadata element so that you can properly data-type that information.
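    For example (hypothetical names; strongly typed columns instead of XML):

    -- Illustrative strongly typed shadow table: one properly typed metadata column
    -- per data column, keyed the same as the main table
    CREATE TABLE dbo.OfficialDataShadow
    (
        RecordID          INT          NOT NULL,
        AsOfDate          DATE         NOT NULL,
        Column1_SourceID  TINYINT      NULL,
        Column1_LoadedAt  DATETIME2(0) NULL,
        Column1_ETLRuleID SMALLINT     NULL,
        Column2_SourceID  TINYINT      NULL,
        Column2_LoadedAt  DATETIME2(0) NULL,
        Column2_ETLRuleID SMALLINT     NULL,
        CONSTRAINT PK_OfficialDataShadow PRIMARY KEY (RecordID, AsOfDate)
    );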

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • Am I understanding correctly that there are multiple sources, and for each separate ETL run, the source of each column in the destination table can be different depending on a set of rules? So two rows may have a different lineage because they were derived from an ETL run using a different set of rules?

    I've actually done something similar to this before when the company I worked for acquired about a dozen different companies and they needed an ETL process to source from multiple line of business databases into one central accounting system.

    This is really just a conventional ETL process only with the introduction of a more complex sourcing and auditing capability. I see no reason to resort to an EAV table design or NoSQL database solution.

    To start, what you need is a lineage table that has one row for each ETL run and contains the following columns:

    CREATE TABLE [ETLRun]
    (
        [ETLRunID]        INT IDENTITY(1,1) PRIMARY KEY,
        [ETLRunStartTime] DATETIME NOT NULL,
        [ETLRunEndTime]   DATETIME NULL,
        [ETLRuleID]       SMALLINT NOT NULL REFERENCES <foreign key to your ETLRule table>,
        [ETLRunSQL]       VARCHAR(MAX) NOT NULL   -- the actual SQL statement derived from the rule(s)
    );

    So for each ETL run we are tracking when it ran and the ID of the ETL rule (perhaps there will be a need for another table linking multiple rules). I'm assuming that if you know the ETL rule(s) for a run, that will essentially tell you the details about what source databases and columns were used. Also, just for completeness, add a column to contain the actual SQL statement derived from the rule(s), because this will further clarify exactly how the lineage of each run was generated.

    Now, for each row inserted into the destination table(s), simply populate [ETLRunID]. That one key column will reference back to the table [ETLRun] and tell you whatever you need to know about the lineage of the data.
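    For example, assuming a destination table keyed the way you describe (the table and column names below are only placeholders):

    -- Illustrative: the destination table carries the run ID...
    ALTER TABLE dbo.OfficialData
        ADD ETLRunID INT NULL REFERENCES [ETLRun] ([ETLRunID]);

    -- ...and lineage is read back with a simple join
    SELECT d.RecordID, d.AsOfDate,
           r.[ETLRunStartTime], r.[ETLRuleID], r.[ETLRunSQL]
    FROM   dbo.OfficialData d
    JOIN   [ETLRun] r ON r.[ETLRunID] = d.ETLRunID;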

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Yes - I think what makes my situation different is that I need to track the source/audit info on a per column basis. Each column has a sort of waterfall that selects the primary source, and all columns could potentially be overridden by a user, which I treat as just another source.

    For example, the waterfall logic could look like:

    Column 1: UserOverride, Source A, Source B, Source C
    Column 2: UserOverride, Source B, Source A, Source C
    Column 3: UserOverride, Source A, Source C, Source B

    So when the data comes in, it will try to use that hierarchy in determining what the "official" value should be. User override would always take highest precedence. Then later, if someone is looking back at the data, they should be able to see that "Column 2 came from Source A on 3/31/2017 at 4pm", etc. 

    The one requirement I'm not sure about yet is if users also need the ability to compare all the different sources together for each column. Our legacy system does not allow for this, but it would be nice to have. In that case, each source would probably need to be its own table. I do keep the source data around in "master" tables, but it is raw and unprocessed (there are other mappings and rules happening in the process that I didn't mention here). If I need the ability to view all the processed data side by side, my model would have to look something like:

    SourceARawData ---> SourceAProcessedData --->  OfficialData
    SourceBRawData ---> SourceBProcessedData --->  OfficialData
    UserOverride --->  OfficialData

    I think I would still need to have a shadow table of some sort on the OfficialData table that shows the lineage of the data, per column.
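    For what it's worth, the per-column waterfall itself could probably be evaluated with something like the sketch below when building OfficialData. All table and column names here are placeholders, and the precedence just follows the example above:

    -- Placeholder sketch: pick the "official" value and its source per column,
    -- UserOverride first, then the per-column source order
    SELECT
        k.RecordID, k.AsOfDate,
        COALESCE(u.Column1, a.Column1, b.Column1, c.Column1) AS Column1,
        CASE WHEN u.Column1 IS NOT NULL THEN 'UserOverride'
             WHEN a.Column1 IS NOT NULL THEN 'Source A'
             WHEN b.Column1 IS NOT NULL THEN 'Source B'
             WHEN c.Column1 IS NOT NULL THEN 'Source C'
        END AS Column1_Source,
        COALESCE(u.Column2, b.Column2, a.Column2, c.Column2) AS Column2,
        CASE WHEN u.Column2 IS NOT NULL THEN 'UserOverride'
             WHEN b.Column2 IS NOT NULL THEN 'Source B'
             WHEN a.Column2 IS NOT NULL THEN 'Source A'
             WHEN c.Column2 IS NOT NULL THEN 'Source C'
        END AS Column2_Source
    FROM dbo.RecordKeys k   -- placeholder driver table of all ID-AsOfDate pairs
    LEFT JOIN dbo.UserOverride         u ON u.RecordID = k.RecordID AND u.AsOfDate = k.AsOfDate
    LEFT JOIN dbo.SourceAProcessedData a ON a.RecordID = k.RecordID AND a.AsOfDate = k.AsOfDate
    LEFT JOIN dbo.SourceBProcessedData b ON b.RecordID = k.RecordID AND b.AsOfDate = k.AsOfDate
    LEFT JOIN dbo.SourceCProcessedData c ON c.RecordID = k.RecordID AND c.AsOfDate = k.AsOfDate;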

    Eric - I'm not sure I 100% follow your suggestion. Would your implementation allow me to do the above? How can I know that Column 1 came from Source A at 3/31/2017 and Column 2 (on the same record) came from a user override done at 4/3/17?

  • hominamad - Friday, April 7, 2017 9:34 AM

    Eric - I'm not sure I 100% follow your suggestion. Would your implementation allow me to do the above? How can I know that Column 1 came from Source A at 3/31/2017 and Column 2 (on the same record) came from a user override done at 4/3/17?

    My proposed solution above would link each row of the table to a specific ETL run by the column [ETLRunID], and each ETL run would be linked to one or more ETL rules, which would indicate the source for each column. I'm assuming here that all column mapping and sourcing is driven by metadata rules, and that columns would be mapped using the same rule configuration within the same run.

    This thing about user overrides sounds like a manual review and edit process that re-maps columns for specific rows after the ETL process has completed? If so, then you are indeed weaving a tangled web when it comes to the lineage of your data in the destination table. However, you could still treat each individual row edit by a user (or perhaps a batch of consistently mapped edits) as an ETL run. So, for example, if User A edits the column mapping of row #662812, then that would trigger the insert of a new lineage record into the ETLRun table, and row #662812 would get updated with a new ETLRunID.

    Alternatively, if you need to preserve edit history, instead of updating the original row's mapping and ETLRunID you could leave the row in place with a logical delete flag, and then insert a new replacement row with a new ETLRunID and the updated columns. This would be sort of like a Type 2 Slowly Changing Dimension table.
    http://www.kimballgroup.com/2008/09/slowly-changing-dimensions-part-2/
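    A rough sketch of that pattern, purely as an assumption about how the destination table might be versioned (the names, the VersionNo column, and the IsCurrent flag are all hypothetical):

    -- Hypothetical Type 2-style edit: assumes the destination table has been given a
    -- VersionNo column (part of its key) and an IsCurrent flag
    DECLARE @EditedRecordID INT = 662812,          -- row number from the example above
            @UserEditRunID  INT = 90001,           -- new ETLRunID logged for the user's edit
            @NewColumn1     DECIMAL(18,4) = 42.0;  -- the user-supplied value

    BEGIN TRANSACTION;

        -- copy the current version forward, overriding the edited column and lineage
        INSERT INTO dbo.OfficialData (RecordID, AsOfDate, VersionNo, Column1, Column2, ETLRunID, IsCurrent)
        SELECT RecordID, AsOfDate, VersionNo + 1, @NewColumn1, Column2, @UserEditRunID, 1
        FROM   dbo.OfficialData
        WHERE  RecordID = @EditedRecordID AND IsCurrent = 1;

        -- then logically expire the previous version
        UPDATE dbo.OfficialData
        SET    IsCurrent = 0
        WHERE  RecordID = @EditedRecordID AND IsCurrent = 1 AND ETLRunID <> @UserEditRunID;

    COMMIT;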

    Other database solutions like blockchain would simplify the versioning and linkage of data edits, but, not being a relational database, it would be limited in terms of line-of-business usability.
    https://en.wikipedia.org/wiki/Blockchain

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Why not simply create a single "Lineage" table that has Object_id, column_id & source_system_id?

    I'm assuming that the "official source" for a given column remains consistent of course...
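    Something like this, for example (assuming Object_id and column_id map to sys.columns; the names are illustrative):

    -- Illustrative single lineage table, keyed by the destination table's metadata IDs
    CREATE TABLE dbo.ColumnLineage
    (
        [object_id]        INT NOT NULL,   -- sys.tables.object_id of the destination table
        [column_id]        INT NOT NULL,   -- sys.columns.column_id within that table
        [source_system_id] INT NOT NULL,
        CONSTRAINT PK_ColumnLineage PRIMARY KEY ([object_id], [column_id])
    );

    -- Resolving the official source for each column of a given table
    SELECT c.name AS column_name, l.[source_system_id]
    FROM   sys.columns c
    JOIN   dbo.ColumnLineage l
      ON   l.[object_id] = c.[object_id] AND l.[column_id] = c.column_id
    WHERE  c.[object_id] = OBJECT_ID('dbo.OfficialData');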
