ETL Best Practice - Need help and possible part time gig available

  • Hi, we've been working with a consultant (whom we may terminate) within our organization to come up with a standard, or best practice, for our ETL process.  Sometimes we may want to perform an incremental load (using a last-modified timestamp) and other times a full truncate and reload.  Let me know which is better to use as the standard, along with your detailed thoughts and opinions.

    Here are two examples, ETL file 1 and ETL file 2.

    Parameters:

    1. Load type = 1: incremental, based on a given date or the last-modified date

    2. Load type = 2: full truncate and reload

    At the end of the script, log the stored procedure, the destination table, and the process time from start to finish (a rough sketch of the overall pattern follows the file links below).

    File 1 is hard-coded to log data

    File 2 calls a stored procedure to log data

    File 1: https://www.dropbox.com/s/67ntboyuqy8v0bv/ETL%20File%201.txt?dl=0

    File 2: https://www.dropbox.com/s/400m1pup4n9sq1d/ETL%20File%202.txt?dl=0
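
    For discussion purposes only, here is a minimal sketch of how those two load types and the end-of-script logging might hang together in one parameterized procedure.  Every table, column, and procedure name below is made up for illustration and is not taken from the attached files.

    CREATE OR ALTER PROCEDURE dbo.Load_SomeTable   -- hypothetical procedure name
        @LoadType tinyint,          -- 1 = incremental, 2 = full truncate and reload
        @FromDate datetime2 = NULL  -- optional override for the incremental cut-off
    AS
    BEGIN
        SET NOCOUNT ON;

        DECLARE @StartTime datetime2 = SYSDATETIME();

        IF @LoadType = 2
        BEGIN
            -- Full truncate and reload.
            TRUNCATE TABLE dbo.SomeTable;

            INSERT INTO dbo.SomeTable (Col1, Col2, LastModified)
            SELECT Col1, Col2, LastModified
              FROM staging.SomeSource;
        END
        ELSE
        BEGIN
            -- Incremental: everything modified after the given date,
            -- or after the newest row already loaded.
            DECLARE @CutOff datetime2 =
                COALESCE(@FromDate, (SELECT MAX(LastModified) FROM dbo.SomeTable));

            INSERT INTO dbo.SomeTable (Col1, Col2, LastModified)
            SELECT s.Col1, s.Col2, s.LastModified
              FROM staging.SomeSource s
             WHERE s.LastModified > @CutOff;
        END;

        -- End-of-script logging: procedure name, destination table, start/finish times
        -- (the "File 2" style, i.e. a call to a logging stored procedure).
        EXEC dbo.LogEtlRun                 -- hypothetical logging procedure
             @ProcName   = N'dbo.Load_SomeTable',
             @TargetName = N'dbo.SomeTable',
             @StartTime  = @StartTime,
             @EndTime    = SYSDATETIME();
    END;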

     

  • Without spending much time looking at the code, my humble opinion is that it's usually easier and faster to truncate and reload if the full reload is available in a file.  If you use the "Swap'n'Drop" method (have two identical tables and a pair of synonyms pointing at the "current" and "to be loaded" versions of the table, which you repoint after each successful load, as sketched below), the total "down time" for usage of the table seriously approaches zero.
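
    Purely as a sketch of the synonym repointing (table and synonym names here are made up): load the offline copy, then repoint the synonym inside a short transaction, which is the only moment readers could ever notice.

    -- dbo.SomeTable is a synonym the users query; it currently points at dbo.SomeTable_A,
    -- so the load goes into dbo.SomeTable_B.  A second "to be loaded" synonym would be
    -- repointed to dbo.SomeTable_A at the same time, ready for the next load.
    TRUNCATE TABLE dbo.SomeTable_B;

    INSERT INTO dbo.SomeTable_B (Col1, Col2)
    SELECT Col1, Col2
      FROM staging.SomeSource;

    -- The "swap": the down time is just this drop/create pair.
    BEGIN TRANSACTION;
        DROP SYNONYM dbo.SomeTable;
        CREATE SYNONYM dbo.SomeTable FOR dbo.SomeTable_B;
    COMMIT;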

    The exception to using the "Swap'n'Drop" can sometimes be if the file has hundreds of millions of rows and only a million or two change.  Then the "Pluck'n'Chuck" (merge) method can work well.

    In the latter method, it's obviously important to identify the rows that have changed as quickly as possible.  The proper use of a HashBytes key can seriously increase performance, with the understanding that if the resulting hashes say two rows are different, then they are, but if the hashes are the same, there is a small chance the rows could still be different.  That happens more often than people think when the very popular 16-byte MD5 algorithm is used, and the chance decreases, of course, if one of the larger algorithms (up to 64 bytes, which is usually serious overkill) is used.
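
    To put the "Pluck'n'Chuck" idea into rough code (table and column names are made up, and it assumes the hash is stored on the target as a RowHash column): use HASHBYTES over the payload columns to spot changed rows and MERGE to apply only those.

    WITH src AS
    (
        SELECT  s.KeyCol, s.Col1, s.Col2,
                -- SHA2_256 (32 bytes) instead of MD5 (16 bytes) to shrink the collision risk.
                HASHBYTES('SHA2_256', CONCAT(s.Col1, N'|', s.Col2)) AS RowHash
          FROM  staging.SomeSource s
    )
    MERGE dbo.SomeTable AS tgt
    USING src
       ON tgt.KeyCol = src.KeyCol
    WHEN MATCHED AND tgt.RowHash <> src.RowHash THEN
        UPDATE SET Col1 = src.Col1, Col2 = src.Col2, RowHash = src.RowHash
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (KeyCol, Col1, Col2, RowHash)
        VALUES (src.KeyCol, src.Col1, src.Col2, src.RowHash);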

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • As a bit of a sidebar and, like I said, only a quick look at the code you've attached, what is your complaint with the code?

    --Jeff Moden



  • Well, if it's just a matter of copying data, truncate and reload is easier to set up and maintain, volume permitting.  Trying to capture deltas will cut down on volume and potentially make it easier to handle anything that triggers off of those deltas.

     

    But looking at the code, the way it's using linked servers is effectively just copying all the data anyway, so it might as well be truncate and reload.  And if volume is an issue, you would need to address that.

  • ZZartin wrote:

    Well, if it's just a matter of copying data, truncate and reload is easier to set up and maintain, volume permitting.  Trying to capture deltas will cut down on volume and potentially make it easier to handle anything that triggers off of those deltas.

    But looking at the code, the way it's using linked servers is effectively just copying all the data anyway, so it might as well be truncate and reload.  And if volume is an issue, you would need to address that.

    Hadn't looked at the code that closely and still haven't.  Since it's a full copy across a linked server, it might actually be much faster to do a BCP export using the native format and then a BULK INSERT, also using the native format.

    I have had some great success with that in the past.
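
    A rough outline of that approach, with placeholder paths and names:

    -- Step 1, on the source side (command prompt or an Agent CmdExec step):
    --   bcp SourceDB.dbo.SomeSource out \\fileshare\etl\SomeSource.dat -n -S SourceServer -T
    --   (-n = native format, -T = trusted connection)

    -- Step 2, on the destination server: bulk load the native-format file into a staging table.
    BULK INSERT dbo.SomeSource_Staging
    FROM '\\fileshare\etl\SomeSource.dat'
    WITH (DATAFILETYPE = 'native', TABLOCK);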

     

    --Jeff Moden



  • Forgot to mention a few more things:

    1. Yes, we are planning to remove the linked server.  It is very inefficient and unacceptable, especially when joining to local data warehouse tables.
    2. The code is about the same, except we are looking to minimize coding errors and redundancy.
    3. In our data warehouse, we are using surrogate keys (integer based) for our facts and dimensions.  Are we able to incrementally load the data (for changes) throughout the day?  As new data comes in, our preference is to see those changes in near real time.  I may write up another message explaining our proof of concept.

      Much appreciated.

  • mrpaulytx wrote:

    Forgot to mention a few more things:

    1. Yes, we are planning to remove the linked server.  It is very inefficient and unacceptable, especially when joining to local data warehouse tables.
    2. The code is about the same, except we are looking to minimize coding errors and redundancy.
    3. In our data warehouse, we are using surrogate keys (integer based) for our facts and dimensions.  Are we able to incrementally load the data (for changes) throughout the day?  As new data comes in, our preference is to see those changes in near real time.  I may write up another message explaining our proof of concept.

      Much appreciated.

    For "near real time", you might want to look into "Merge Replication".  It does have some caveats on the source machine such as the recovery model and the log file but they're not huge caveats for most people.

    --Jeff Moden


