[Generic] Person ETL

  • Environment: Many disparate systems, each containing unique person information.

    Task: Merge the person information into a single centralized transactional database on a regular basis throughout the day.

     - Currently, each subsystem gets a complex ETL process that joins duplicates together at the source. That dataset is then compared against the existing person information in the centralized database, based upon gender, DOB, last name, and a partial match on first name.
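
    Roughly, that comparison amounts to something like the following sketch (SourcePersons, dbo.Persons, and the column names are illustrative, not the actual schema):

    -- Sketch of the existing match step; table and column names are made up.
    SELECT  s.SourcePersonID,
            p.PersonID
    FROM    SourcePersons AS s
    JOIN    dbo.Persons   AS p
            ON  p.Gender   = s.Gender
            AND p.DOB      = s.DOB
            AND p.LastName = s.LastName
            AND p.FirstName LIKE LEFT(s.FirstName, 3) + '%';  -- partial first-name match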

    What I'm contemplating is more of a "hopper"-based system.

    I'm thinking: create a staging hopper where the various subsystems dump their data in a predefined format with the source appended, then have one regularly run ETL pull that information into the Persons schema.
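
    As a sketch of the hopper, every subsystem would write rows in one common shape, tagged with the originating source (all names here are invented):

    -- Hypothetical hopper table: each subsystem dumps rows in this
    -- predefined format, with the source system appended.
    CREATE TABLE dbo.PersonHopper
    (
        HopperID     bigint IDENTITY(1,1) PRIMARY KEY,
        SourceSystem varchar(50)  NOT NULL,  -- which subsystem dumped the row
        SourceKey    varchar(100) NOT NULL,  -- the person's key in that subsystem
        FirstName    varchar(100) NULL,
        LastName     varchar(100) NULL,
        Gender       char(1)      NULL,
        DOB          date         NULL,
        LoadedAtUtc  datetime2(0) NOT NULL DEFAULT SYSUTCDATETIME()
    );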

    What I'm not finding are any test cases for this scenario. Can anyone point me in the direction of a good write-up of something similar?

  • I don't have any links to anything specific, but I can tell you a bit from personal experience.  I don't know what format your external sources are in, but every time I've seen someone try to join or merge external sources directly, something eventually goes haywire.  That's not to say that one of the heavy hitters on this and other forums couldn't pull it off.  I'm just saying that I've not personally seen people have much success joining sources externally.

    What I've personally always done is to import the data from a source into a staging table dedicated to receiving the data from the source.  Partial validation is done during this step because I make sure the data types in the table are the data types expected from the source.  The rest of the validation is done after the data is loaded into the staging table.
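
    For example, the second pass after the load might look something like this (the table name and rules are invented just to illustrate the idea):

    -- Illustrative post-load checks: rows that fit the expected data
    -- types but still look wrong get flagged before any merging.
    SELECT  *
    FROM    Staging.Source1_Person
    WHERE   DOB > CAST(GETDATE() AS date)   -- birth date in the future
        OR  Gender NOT IN ('M', 'F', 'U')   -- unexpected gender code
        OR  LastName IS NULL;               -- required field missing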

    I generally write one proc per staging table so that I can change the order of imports and sometimes do intermediate processing, all controlled by a master procedure that handles every table.  Once the tables are loaded, you can do joins and merges however you see fit, and they'll usually be faster than trying to do external joins of multiple sources.
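
    The shape of it is roughly this (the proc names are invented):

    -- Illustrative master procedure: one sub-proc per staging table,
    -- run in whatever order the dependencies require.
    CREATE PROCEDURE dbo.ImportAllSources
    AS
    BEGIN
        SET NOCOUNT ON;

        EXEC dbo.Import_Source1;       -- load + validate source 1
        EXEC dbo.Import_Source2;       -- load + validate source 2
        EXEC dbo.ProcessIntermediate;  -- any cross-source cleanup
        EXEC dbo.MergePersons;         -- final joins/merges into the real tables
    END;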

    All of that is a bit more complex than what most people come up with.  It's also, IMHO, a whole lot more flexible.

    The key for me is that I don't need SSIS to do any of it and I don't need to redeploy a package every time I need to make a change. I usually just need to change one of the sub-procs called by the master proc and I'm done.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
        Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Greatly appreciate your response, Jeff; apologies for the delay in mine. I've been stupid busy and haven't had the chance to get back to this, as it was more of an "I wonder if this would be a better alternative..." kind of question.

    I do like your idea of using stored procedures instead of SSIS. Most of these packages don't leverage SSIS functionality (other than doing a quick load of the Temp table); they're just SQL wrappers. Honestly, I went with SSIS because of the existing ETLs my predecessor had created, though even those will need to be recreated at some point: unlike all of the newer ETLs, they aren't built around deduplicating information, so we can end up with multiple Person records for a single individual.

    I think I'll have to look into this possibility a little more; I like the idea of having an easily updatable master proc.  As it is right now, I have to cascade the packages, and it isn't as elegant as I would like.

    I also use Staging tables, though I actually use a Temporary table that is truncated before/after the ETL runs.  Basically, I dump the data into the Temp table and do partial validation during that step to make certain the data types and data are as expected; then I compare Temp to Stage and add new/update existing rows.  I have to do it that way because some legacy SQL/applications hit the Stage tables directly.  Eventually I will get around to fixing that.
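
    That compare step is essentially an upsert; a rough sketch of it, with invented names:

    -- Sketch of the Temp-to-Stage compare (names are illustrative):
    -- update rows that already exist, then insert the new ones.
    UPDATE  s
    SET     s.FirstName = t.FirstName,
            s.LastName  = t.LastName,
            s.Gender    = t.Gender,
            s.DOB       = t.DOB
    FROM    Staging.Person AS s
    JOIN    #TempPerson    AS t
            ON t.SourceKey = s.SourceKey;

    INSERT  Staging.Person (SourceKey, FirstName, LastName, Gender, DOB)
    SELECT  t.SourceKey, t.FirstName, t.LastName, t.Gender, t.DOB
    FROM    #TempPerson AS t
    WHERE   NOT EXISTS (SELECT 1
                        FROM   Staging.Person AS s
                        WHERE  s.SourceKey = t.SourceKey);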

    I appreciate your input on the "hopper" idea. I've been debating its pros and cons for a few weeks and haven't come up with an all-encompassing solution, but I'm not sure I want to give up on it completely yet, simply because it's an interesting thought exercise.  When/if I do come up with a good working idea, I'll add a write-up to the forums here and see if anyone is interested in reading it over.
