October 8, 2021 at 8:34 am
Hi, we've been working with a consultant (whom we may terminate) within our organization to come up with a standard or best practice for our ETL process. Sometimes we may want to perform an incremental load (using a last-modified timestamp) and other times a full truncate and reload. Let me know which is better to use as the standard, along with your detailed thoughts and opinions.
Here are two examples, ETL file 1 and ETL file 2.
Parameters:
1. Load type = 1: incremental, time-based on a given date or the last-modified date
2. Load type = 2: full truncate and reload
At the end of the script, log the stored procedure, the destination table, and the process time from start to finish.
File 1 is hard-coded to log data
File 2 calls a stored procedure to log data
File 1: https://www.dropbox.com/s/67ntboyuqy8v0bv/ETL%20File%201.txt?dl=0
File 2: https://www.dropbox.com/s/400m1pup4n9sq1d/ETL%20File%202.txt?dl=0
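To make the idea concrete, here's a minimal sketch of the kind of parameter-driven procedure we have in mind. Every object name below (dbo.usp_Load_SalesFact, dbo.ETL_Log, Staging.dbo.SalesSource, and so on) is invented for illustration and is not from the actual files.

-- Rough sketch only; all object names are hypothetical.
CREATE PROCEDURE dbo.usp_Load_SalesFact
    @LoadType  tinyint,            -- 1 = incremental, 2 = full truncate and reload
    @SinceDate datetime2(3) = NULL -- optional override for the incremental cutoff
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @StartTime datetime2(3) = SYSDATETIME();

    -- Default the cutoff to the last successful load if no date was supplied.
    IF @LoadType = 1 AND @SinceDate IS NULL
        SELECT @SinceDate = MAX(LoadEndTime)
          FROM dbo.ETL_Log
         WHERE DestinationTable = 'dbo.SalesFact'
           AND Succeeded = 1;

    IF @LoadType = 2
    BEGIN
        TRUNCATE TABLE dbo.SalesFact;

        INSERT INTO dbo.SalesFact (SaleID, SaleDate, Amount, ModifiedDate)
        SELECT SaleID, SaleDate, Amount, ModifiedDate
          FROM Staging.dbo.SalesSource;
    END
    ELSE IF @LoadType = 1
    BEGIN
        -- Upsert only the rows modified since the cutoff.
        MERGE dbo.SalesFact AS tgt
        USING (SELECT SaleID, SaleDate, Amount, ModifiedDate
                 FROM Staging.dbo.SalesSource
                WHERE ModifiedDate > @SinceDate) AS src
           ON tgt.SaleID = src.SaleID
        WHEN MATCHED THEN
            UPDATE SET SaleDate     = src.SaleDate,
                       Amount       = src.Amount,
                       ModifiedDate = src.ModifiedDate
        WHEN NOT MATCHED BY TARGET THEN
            INSERT (SaleID, SaleDate, Amount, ModifiedDate)
            VALUES (src.SaleID, src.SaleDate, src.Amount, src.ModifiedDate);
    END;

    -- Log the procedure, destination table, and run time at the end of the script.
    INSERT INTO dbo.ETL_Log (ProcedureName, DestinationTable, LoadStartTime, LoadEndTime, Succeeded)
    VALUES (OBJECT_NAME(@@PROCID), 'dbo.SalesFact', @StartTime, SYSDATETIME(), 1);
END;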
October 8, 2021 at 1:38 pm
Without spending much time looking at the code, my humble opinion is that it's usually easier and faster to truncate and reload if the data for a full reload is available in a file. If you use the "Swap'n'Drop" method (have two identical tables and a pair of synonyms pointing at the "current" and "to be loaded" copies, which you repoint after each successful load), the total "down time" for usage of the table seriously approaches zero.
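Just to illustrate the mechanics of the synonym swap, here's a rough sketch; all object names are made up for the example.

-- Sketch of the "Swap'n'Drop" idea; all object names are hypothetical.
-- Two identical tables plus two synonyms: the application reads dbo.Customer,
-- the ETL writes through dbo.Customer_Load.
--   CREATE SYNONYM dbo.Customer      FOR dbo.Customer_A;  -- current "live" copy
--   CREATE SYNONYM dbo.Customer_Load FOR dbo.Customer_B;  -- copy to be loaded

-- 1) Reload the offline copy (truncate the base table; DML can go through the synonym).
TRUNCATE TABLE dbo.Customer_B;

INSERT INTO dbo.Customer_Load (CustomerID, CustomerName)
SELECT CustomerID, CustomerName
  FROM Staging.dbo.CustomerSource;

-- 2) After a successful load, repoint both synonyms; readers see the new data
--    almost immediately and the old copy becomes the next load target.
BEGIN TRANSACTION;
    DROP SYNONYM dbo.Customer;
    DROP SYNONYM dbo.Customer_Load;
    CREATE SYNONYM dbo.Customer      FOR dbo.Customer_B;
    CREATE SYNONYM dbo.Customer_Load FOR dbo.Customer_A;
COMMIT;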
The exception to using the "Swap'n'Drop" can sometimes be if the file has hundreds of millions of rows and only a million or two change. Then the "Pluck'n'Chuck" (merge) method can work well.
In the latter method, it's obviously important to identify changed rows as quickly as possible. The proper use of a HashBytes key can seriously increase performance, with the understanding that if the resulting hash keys say two rows are different, then they are; but if the hashes are the same, there's a small chance the rows could still be different. That happens more often than people think when the very popular 16-byte MD5 algorithm is used and, of course, becomes far less likely with one of the larger algorithms (up to 64 bytes, which is usually serious overkill).
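Here's a rough sketch of the hash-compare idea; all object names are made up, and SHA2_256 is shown instead of MD5 just as an example.

-- Sketch of hash-based change detection; all object names are hypothetical.
-- Assumes the dimension carries a persisted RowHash (varbinary) column.
WITH src AS
(
    SELECT  s.CustomerID,
            s.CustomerName,
            s.Region,
            HASHBYTES('SHA2_256',
                      CONCAT(s.CustomerName, '|', s.Region)) AS RowHash
      FROM Staging.dbo.CustomerSource AS s
)
MERGE dbo.CustomerDim AS tgt
USING src
   ON tgt.CustomerID = src.CustomerID
WHEN MATCHED AND tgt.RowHash <> src.RowHash THEN  -- different hash = definitely changed
    UPDATE SET CustomerName = src.CustomerName,
               Region       = src.Region,
               RowHash      = src.RowHash
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerID, CustomerName, Region, RowHash)
    VALUES (src.CustomerID, src.CustomerName, src.Region, src.RowHash);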
--Jeff Moden
Change is inevitable... Change for the better is not.
October 8, 2021 at 1:53 pm
As a bit of a sidebar and, like I said, having taken only a quick look at the code you've attached: what is your complaint with the code?
--Jeff Moden
Change is inevitable... Change for the better is not.
October 8, 2021 at 2:33 pm
Well, if it's just a matter of copying data, truncate and reload is easier to set up and maintain, volume permitting. Trying to capture deltas will cut down on volume and potentially make it easier to handle anything that triggers off of those deltas.
But looking at the code, the way it's using linked servers is effectively just copying all the data anyway, so it might as well be truncate and reload. And if volume is an issue, you would need to address that.
October 8, 2021 at 3:49 pm
Well, if it's just a matter of copying data, truncate and reload is easier to set up and maintain, volume permitting. Trying to capture deltas will cut down on volume and potentially make it easier to handle anything that triggers off of those deltas.
But looking at the code, the way it's using linked servers is effectively just copying all the data anyway, so it might as well be truncate and reload. And if volume is an issue, you would need to address that.
Hadn't looked at the code that closely and still haven't. Since it's a full copy across a linked server, it might actually be much faster to do a BCP export using the native format and then a BULK INSERT, also using the native format.
I have had some great success with that in the past.
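Something roughly along these lines; the server, share, table, and file names are all made up.

-- Sketch only; server, database, table, and file names are all hypothetical.

-- 1) On (or against) the source server: export in native format
--    (run from a command prompt or a SQL Agent CmdExec step).
--    bcp SourceDB.dbo.SalesSource out \\fileshare\etl\SalesSource.bcp -n -S SourceServer -T

-- 2) On the destination server: load the native-format file.
TRUNCATE TABLE dbo.SalesStage;

BULK INSERT dbo.SalesStage
FROM '\\fileshare\etl\SalesSource.bcp'
WITH (DATAFILETYPE = 'native', TABLOCK, BATCHSIZE = 100000);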
--Jeff Moden
Change is inevitable... Change for the better is not.
October 8, 2021 at 4:31 pm
Forgot to mention a few more things:
- Yes, we are planning to remove the linked server. It's very inefficient and unacceptable, especially when joining to local data warehouse tables.
- The code is about the same, except we are looking to minimize coding errors and redundancy.
- In our data warehouse, we are using surrogate keys (integer-based) for our facts and dimensions. Are we able to incrementally load the data (for changes) throughout the day? As new data comes in, our preference is to see those changes in near real-time. I may write up another message explaining our proof of concept.
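In the meantime, here's a very rough sketch of the incremental pattern we're picturing; every table and column name below is invented.

-- Very rough sketch; all table and column names are invented.
-- Incremental fact load driven by a "last loaded" watermark, run throughout the day.
DECLARE @LastLoaded   datetime2(3),
        @NewWatermark datetime2(3);

SELECT @LastLoaded = WatermarkValue
  FROM dbo.ETL_Watermark
 WHERE TableName = N'dbo.FactSales';

SELECT @NewWatermark = MAX(ModifiedDate)
  FROM Staging.dbo.SalesSource;

-- (Updates to existing fact rows would need a MERGE instead of a plain INSERT.)
INSERT INTO dbo.FactSales (DateKey, CustomerKey, Amount, ModifiedDate)
SELECT  d.DateKey,                          -- surrogate keys looked up from the dimensions
        c.CustomerKey,
        s.Amount,
        s.ModifiedDate
  FROM Staging.dbo.SalesSource AS s
  JOIN dbo.DimDate     AS d ON d.FullDate       = CAST(s.SaleDate AS date)
  JOIN dbo.DimCustomer AS c ON c.CustomerAltKey = s.CustomerID
 WHERE s.ModifiedDate >  @LastLoaded
   AND s.ModifiedDate <= @NewWatermark;

-- Advance the watermark only to what was actually picked up.
UPDATE dbo.ETL_Watermark
   SET WatermarkValue = @NewWatermark
 WHERE TableName = N'dbo.FactSales';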
Much appreciated.
October 8, 2021 at 5:02 pm
Forgot to mention a few more things:
- Yes, we are planning to remove the linked server. It's very inefficient and unacceptable, especially when joining to local data warehouse tables.
- The code is about the same, except we are looking to minimize coding errors and redundancy.
- In our data warehouse, we are using surrogate keys (integer-based) for our facts and dimensions. Are we able to incrementally load the data (for changes) throughout the day? As new data comes in, our preference is to see those changes in near real-time. I may write up another message explaining our proof of concept.
Much appreciated.
For "near real time", you might want to look into "Merge Replication". It does have some caveats on the source machine such as the recovery model and the log file but they're not huge caveats for most people.
--Jeff Moden
Change is inevitable... Change for the better is not.