Back when i was in school, me and a bunch of friends used to hangout at a diner. The diner was popular for its specials, which were displayed on a board outside the entrance.
The manager would roll in every day, wipe the board clean and list all the specials for that day. There will be a few new specials introduced daily, but most of them would be the same every day. Still the manager would erase everything completely (including those that weren’t changed) and rewrite. He was fine with having to spend a lot of time every day to do this, but he wanted to clearly display all the specials.
On the other hand, a manager at a different diner would erase only those specials that were discontinued and add the new ones. He didn’t mind that the specials were not clearly displayed, and only wanted to make the daily changes as quickly as possible.
Also known as Full Load, is a process of completely destroying/deleting the existing data and reloading it from scratch. A lot of unchanged data is also deleted and reloaded in this process. But a destructive load ensures the highest data integrity easily.
“Delete destination data. Read data from source. Load into destination.”
Incremental load is a process of loading data incrementally. Only new and changed data is loaded to the destination. Data that didn’t change will be left alone. Data integrity can be ensured in this process too, but ETL can get complicated.
“Read source data. Compare with destination. Filter for new and changed data. Write to destination.”. I stole this line from Anatomy of an Incremental Load by Andy Leonard (b|t).
A full load seems to be easiest the approach. Right?. Then why do we even care about incremental load?
|Destructive Load||Incremental Load|
|How it works||Deletes all rows and reload from scratch.||Only new or updated rows are processed|
|Time||Requires more time.||Requires less time.|
|Data Integrity||Can easily be guaranteed||Difficult. ETL must check for new/updated rows.|
|History||Can be lost.||Retained.|
Full load is the easiest way to load daily. But consumes a lot of time and other server resources. Incremental load processes only the new or changed data, so when time is of essence with larger data sets, incremental load is the way to go.
In my next post, I’ll continue with scripts to do a full and incremental loads.