
HASHBYTES can help quickly load a Data Warehouse


A common problem in loading a data warehouse (hereafter abbreviated 'DW') is deciding which source system records to update, which to insert, and which to ignore because they were previously loaded. There are many reasons for this dilemma. Many legacy source systems completely lack indicators for the insert or latest update time. Some DWs are in a state of flux and need an indeterminate number of rows reloaded just to be sure all data is stored. Finally, other DWs often load only a subset of fields from a source table even when an update timestamp exists; an update flag triggered by a field of no concern to the DW can then cause many unnecessary updates to a large fact table.

This article illustrates a method for an insert, update, or ignore strategy usable when a simpler method is unavailable. We will make use of SQL Server's built in HASHBYTES function as well as a few well placed staging area tables.

Let us consider a simplified example of a source system and DW fact and dimension tables. The source does not record any update or insert indicator or timestamp. The source also does not reliably commit or record both source tables in unison. Sometimes a record destined for the fact table does not have a matching dimension table record ready at the time of extraction.

Part 1: The Setup

First, let's create some sample source tables:

create table #dim_source(nk int, dim_data varchar(10));
-- the nk abbreviation indicates Natural Key from the source system
create table #fact_source(fact_nk int,dim_nk int, fact_data money);

The #dim_source table will represent a source table for our DW dimension and #fact_source has data for our fact table. The first problem is immediately apparent. This source system has neither insert timestamps nor any other indicator of when rows are updated.

Now let's make the fact and dimension tables:

create table #dim(sk int, dim_data varchar(10));
-- the sk abbreviation indicates Synthetic Key generated by the DW
create table #fact(fact_sk int, dim_sk int, fact_data money);
-- many DWs enforce foreign keys between facts and dims
-- but SQL Server does not allow FK enforcement on temp tables
-- so please just pretend for this illustration

Finally, let's populate a few rows in the source:

insert #dim_source values (5005,'Fred');
insert #dim_source values (7007,'Joe');
insert #fact_source values (2002,5005,25.50);
insert #fact_source values (3003,5005,52.05);
insert #fact_source values (4004,7007,10.00);
insert #fact_source values (5005,2222,88.88);

A potential problem appears. The fact source contains a dimension key value (2222) that is missing from the dimension source. The exact cause of the missing data is unknown - that is a question for the source's data steward. Meanwhile, the DW load will fail if foreign keys are enforced when dim_nk 2222 is committed. One of the main points of this method of DW loading is to handle this problem gracefully.

To recap, we have two problems: the source will not tell us which records need to be loaded, and some foreign keys are missing.

Part 2: The Data Warehouse Stage

Let's make DW staging area tables for the two source tables:

create table #dim_stage(nk int, dim_data varchar(10),load_flag char(1), chksum varbinary(100));
create table #fact_stage(fact_nk int,dim_nk int, fact_data money, load_flag char(1), chksum varbinary(100));

The new feature here is the chksum columns. These columns are the secret sauce behind this method of DW loading. They will contain MD5 checksums of the source rows. We also have a load_flag to indicate whether a given row should be loaded to the DW; the possible outcomes are insert, update, or ignore because the row is already loaded. Let's populate the DW staging tables from the source and have a look at what goes into the chksum fields:

-- load the stage for dimension data
merge #dim_stage as target
using (select nk, dim_data, hashbytes('MD5',dim_data)
from #dim_source) as source (nk,dim_data,chksum)
on target.nk = source.nk
when matched and target.chksum != source.chksum
then update
set target.chksum = source.chksum
, target.dim_data = source.dim_data
, load_flag = 'Y'
when not matched then insert
(nk,dim_data,load_flag,chksum)
values (source.nk,source.dim_data,'Y',source.chksum);

-- load the stage for fact data
merge #fact_stage as target
using (select fact_nk, dim_nk, fact_data, hashbytes('MD5',convert(varchar(max),fact_data))
from #fact_source) as source (fact_nk, dim_nk,fact_data,chksum)
on target.fact_nk = source.fact_nk
when matched and target.chksum != source.chksum
then update
set target.chksum = source.chksum
, target.fact_data = source.fact_data
, load_flag = 'Y'
when not matched then insert
(fact_nk,dim_nk,fact_data,load_flag,chksum)
values (source.fact_nk,source.dim_nk,source.fact_data,'Y',source.chksum);

(Please note: The merge command is used here only for the purpose of illustrating the method. Very rarely are the source system and DW on the same server or even remotely linked for direct queries. Your favorite ETL tool, be it SSIS, Ab Initio, or Informatica, can be used to perform the update else insert else ignore logic of the merge command.)

Let's have a look at what happened in the steps above. The merge performed an update else insert into our staging tables based on the natural key values from the source system. We also calculate and insert the MD5 checksum for the data columns at the same time, thanks to SQL Server's handy built-in HASHBYTES function. (You can read all about HASHBYTES and the various algorithms it supports in BOL.) The MD5 checksum is what determines which source records need updating, inserting, or ignoring.
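To get a feel for what HASHBYTES returns, here is a quick one-off call against the literal value from our dimension source; the result should match the chksum shown for Fred in the staging output below, since dim_data is a plain varchar:

select hashbytes('MD5', 'Fred') as chksum_demo;
-- returns 0xF2CB32166F77F33E80311BE40C97466E, a 16-byte MD5 digest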

Using merge may be unnecessary in this step, depending on your requirements. Some DWs, however, read from their source(s) several times per day, and this method allows the stage tables to fill up over time before being either flushed to an archive table or simply truncated. If one wanted to track intra-day deltas, a load datetime could be added to the stage tables.
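As a minimal sketch of that idea (the stage_dttm column name is my own, not part of the article's tables):

-- hypothetical: add a staging timestamp so intra-day deltas can be told apart
alter table #dim_stage add stage_dttm datetime not null default getdate();
-- later, count how many rows were staged in each load window
select convert(date, stage_dttm) as stage_date, count(*) as rows_staged
from #dim_stage
group by convert(date, stage_dttm);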

A simple insert can be used if your source to DW load is a full process every time and your stage tables are truncated at each sweep.
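For completeness, here is a sketch of that simpler path for the dimension stage; flagging every staged row 'Y' is my assumption, since a full sweep treats every row as a load candidate:

truncate table #dim_stage;
insert #dim_stage (nk, dim_data, load_flag, chksum)
select nk, dim_data, 'Y', hashbytes('MD5', dim_data)
from #dim_source;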

Part 3: The Synthetic Key Lookup Tables

The next step is to begin determining records requiring synthetic key values (SKs). To do this, the secret sauce of the MD5 checksums needs its secret mixing bowl: the synthetic key cross reference table. Here is the DDL for the SKREF tables:

create table #skref_dim(nk int primary key clustered, sk int identity(1,1), chksum varbinary(100), load_dttm datetime default getdate());
create table #skref_fact(nk int primary key clustered, sk int identity(1,1), chksum varbinary(100), load_dttm datetime default getdate());

Important points to notice about the SKREF tables:

1: Both (all) are the same format, making them easy to create and maintain; only the name changes. Varchar values can be used for the NK fields if your source does not have a strict policy of using integers for keys. Varchars are also useful when source tables have concatenated keys: multiple integers can be converted to strings and combined into a single varchar field, keeping the structure of all the SKREF tables identical (see the sketch after this list). This is purely for ease of maintenance; a mixture of key column types can of course be used if desired.

2: The SK will be generated by the Identity sequence generator whenever a row is inserted with a new NK. This simplifies DW SK generation.

3: These demo tables keep indexing minimal, but note the clustered primary key already declared on the NK columns - in a real environment that clustered NK index is the one that matters. The SK columns do not need to be indexed because, as we will see, the NK does all the work and the SKs are only retrieved, not searched on directly.

4: The load_dttm is optional but very useful for audit tracking. It will be used in an example later.
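To illustrate point 1, here is a hypothetical way to collapse a two-part source key into a single varchar NK; the pipe delimiter and the varchar widths are my assumptions, not anything prescribed here:

-- hypothetical: a two-column source key flattened into one varchar value
select convert(varchar(20), fact_nk) + '|' + convert(varchar(20), dim_nk) as concat_nk
from #fact_source;
-- an SKREF for such a source would simply declare nk as varchar instead of int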

OK, let's fill in the SKREF table for dimension first since dimensions are typically loaded first:

merge #skref_dim as target
using (select nk, chksum
from #dim_stage) as source (nk,chksum)
on target.nk = source.nk
when matched and target.chksum != source.chksum
then update
set target.chksum = source.chksum
when not matched then insert
(nk,chksum)
values (source.nk,source.chksum);

Have a quick look inside the three dimension data tables loaded so far:

select nk, dim_data from #dim_source;
select nk, dim_data, load_flag, chksum from #dim_stage;
select nk, sk, chksum from #skref_dim;

nk   dim_data
---- --------
5005 Fred
7007 Joe

nk   dim_data load_flag chksum
---- -------- --------- ----------------------------------
5005 Fred     Y         0xF2CB32166F77F33E80311BE40C97466E
7007 Joe      Y         0x3A368818B7341D48660E8DD6C5A77DBE

nk   sk chksum
---- -- ----------------------------------
5005 1  0xF2CB32166F77F33E80311BE40C97466E
7007 2  0x3A368818B7341D48660E8DD6C5A77DBE

We can see the new DW SK values assigned to each NK, and the checksums in the SKREF table match those of the rows from the source.

Part 4: Loading the Data Warehouse Tables

Now the data is ready for loading into the DW dimension table. We fetch the SK from the SKREF, load the stage data into #dim, and have a look inside:

merge #dim as target
using (select sk.sk, ds.dim_data from #dim_stage ds
inner join #skref_dim sk on ds.nk=sk.nk where ds.load_flag='Y') as source (sk,dim_data)
on target.sk = source.sk
when matched then update
set target.dim_data = source.dim_data
when not matched then insert
(sk,dim_data)
values (source.sk,source.dim_data);

Here we see the dimension table neatly filled with our two source data rows and the new DW SKs replacing the source NKs. Most DWs carry the NK through from the source, but it is omitted here for brevity:

select * from #dim;

sk dim_data
----------- ----------
1 Fred
2 Joe
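If you do want to carry the NK through, a minimal sketch would be to widen the dimension; the extra column and its population below are my own illustration, not part of the article's load:

-- hypothetical: keep the source NK alongside the DW SK
alter table #dim add nk int;
update d set d.nk = k.nk
from #dim d
inner join #skref_dim k on d.sk = k.sk;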

Let us get the fact table's NKs converted to SKs in the SKREF table using the same method we already used for the dimension source data:

merge #skref_fact as target
using (select fact_nk, chksum from #fact_stage) as source (fact_nk,chksum)
on target.nk = source.fact_nk
when matched and target.chksum != source.chksum
then update set target.chksum = source.chksum
when not matched then
insert (nk,chksum)
values (source.fact_nk,source.chksum);

Part 5: Dealing with Missing Source Keys

But wait, we still are not ready to load the fact data into the fact table, because a source dimension record is still missing. The fact data needs to be processed through the dimension table's SKREF. This allows us to put in placeholder SKs for the missing dimension records:

merge #skref_dim as target
using (select distinct fs.dim_nk from #fact_stage fs where fs.load_flag='Y' ) as source (nk)
on target.nk = source.nk
when not matched then
insert (nk,chksum)
values (source.nk,0);

Have a look in the dimension table's SKREF:

select nk, sk, chksum from #skref_dim;

nk   sk chksum
---- -- ----------------------------------
2222 3  0x00000000
5005 1  0xF2CB32166F77F33E80311BE40C97466E
7007 2  0x3A368818B7341D48660E8DD6C5A77DBE

The important thing here is that a zero is placed as the checksum of the missing dimension row. When the actual data row finally arrives from the source, it will appear as an update, because no real data value, not even an empty string, produces an MD5 checksum of zero. A NULL in the chksum field would not work: the merge command's checksum comparison would evaluate to UNKNOWN and the row would be silently skipped. Once the real row arrives, the SKREF updates to the proper checksum and the row is flagged to load, merging into the dim table as an update (since the placeholder row will already exist there).
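Here is a quick illustration of the NULL-versus-zero behavior; this little query is mine, purely to show how the comparison evaluates:

-- with NULL, the != test evaluates to UNKNOWN and the WHEN MATCHED branch never fires;
-- with the zero placeholder, the test is true and the update happens as intended
select case when cast(null as varbinary(100)) != hashbytes('MD5','Sue')
            then 'update fires' else 'update silently skipped' end as null_placeholder,
       case when cast(0 as varbinary(100)) != hashbytes('MD5','Sue')
            then 'update fires' else 'update silently skipped' end as zero_placeholder;

With that settled, we load a placeholder record into the dim table now, as well as loading the fact table: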

merge #dim as target
using (select sk.sk from #fact_stage fs
inner join #skref_dim sk on fs.dim_nk = sk.nk
where fs.load_flag = 'Y') as source (sk)
on target.sk=source.sk
when not matched then
insert (sk)
values (source.sk);

merge #fact as target
using (select sk.sk, dk.sk, fs.fact_data from #fact_stage fs
inner join #skref_fact sk on fs.fact_nk=sk.nk
inner join #skref_dim dk on dk.nk = fs.dim_nk
where fs.load_flag='Y') as source (fact_sk,dim_sk,fact_data)
on target.fact_sk = source.fact_sk
when matched
then update set target.fact_data = source.fact_data
when not matched then insert
(fact_sk,dim_sk,fact_data)
values (source.fact_sk,source.dim_sk,source.fact_data);

Now let's have a look at our fully loaded fact and dimension tables:

select d.dim_data, SUM(f.fact_data) as grand_total 
from #fact f
inner join #dim d on f.dim_sk = d.sk
group by d.dim_data;
dim_data   grand_total
---------- ---------------------
NULL       88.88
Fred       77.55
Joe        10.00

The question immediately arises: should our DW report on the row attached to the NULL placeholder dimension record? That is a business decision, driven by why the source system was missing the record in the first place. The easiest exclusion method is a simple flag on the dimension record, set when the merge from the fact stage populates it and reset to a good value when the dimension stage load later updates or loads it. Report writers would need to honor any such flag in their queries.
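As a sketch of that exclusion flag (the placeholder_flag column name and default are my invention, and the placeholder row already in #dim would need a one-time backfill):

-- hypothetical exclusion flag on the dimension
alter table #dim add placeholder_flag char(1) not null default 'N';
-- the fact-stage-driven merge would insert placeholder rows with placeholder_flag = 'Y',
-- and the dimension stage load would reset it to 'N' when the real data arrives;
-- report writers then simply filter:
select d.dim_data, sum(f.fact_data) as grand_total
from #fact f
inner join #dim d on f.dim_sk = d.sk
where d.placeholder_flag = 'N'
group by d.dim_data;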

Part 6: Loading the Missing Source Data

Run the following then re-run all the merge statements in order from the beginning of the article:

update #dim_stage set load_flag = 'N';
update #fact_stage set load_flag = 'N';
-- The stage tables need their load_flag set to N between each run.
-- The load process should determine anew at each run whether
-- to flag them back to Y.
insert #dim_source values (2222,'Sue');
-- Our source has finally committed or otherwise found
-- the missing key value!
update #dim_source set dim_data = 'Dave' where nk = 5005;
-- Fred's account has the wrong name. It is now transferred to Dave.

Then re-run the grand-total query from Part 5 and you will get:

dim_data grand_total
Dave     77.55
Joe      10.00
Sue      88.88

Watch the rows-affected counts and you will see how the process ignores rows that are neither updated nor new on the source. Experiment with adding or modifying rows in the fact source as well; the same thing happens: only records requiring an insert or update make it past the stage tables.
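One convenient way to watch those counts, beyond the rows-affected messages, is to bolt an OUTPUT clause onto a merge; the clause below is my addition and is not part of the article's load scripts:

-- reports one row per action ('INSERT' or 'UPDATE') the merge actually performed
merge #dim as target
using (select sk.sk, ds.dim_data from #dim_stage ds
inner join #skref_dim sk on ds.nk = sk.nk where ds.load_flag = 'Y') as source (sk, dim_data)
on target.sk = source.sk
when matched then update set target.dim_data = source.dim_data
when not matched then insert (sk, dim_data) values (source.sk, source.dim_data)
output $action, inserted.sk;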

Additional tricks can be done with the SKREF tables. For example, imagine a data quality problem affects many or all of the records loaded into the fact or dimension table on New Year's Day, and they must be reloaded:

update #skref_dim set chksum = 0 where load_dttm between '1/1/2010 00:00:00' and '1/1/2010 23:59:59';

(This is for illustration only - it won't actually affect any example rows unless your system date is set accordingly)

This quick update to the SKREF tricks the dimension load merge command into updating all the records in the dim table whether they need it or not. Superfluous updates simply overwrite the same values in place, but any incorrect records are refreshed from the source.

Part 7: Final Thoughts

One might be tempted to make a combination stage table containing the SK values and chksums; however, this leads to several problems:

1: The SKREF tables are meant to be as narrow as possible to improve performance. Performance suffers if a wide stage table with millions of rows has to look up against itself to verify checksums and to update new checksum and SK values.

2: New columns added to the source pull cause all-in-one tables to drift out of alignment, and the code for using them becomes more and more customized per table. A good ETL tool can service generic SKREF tables by being passed only the table name, provided all the tables are constructed the same way (a dynamic SQL sketch of the idea follows).
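Here is a rough sketch of what "passing the table name" might look like in plain T-SQL; an ETL tool would template the same pattern, and the variable names here are purely illustrative:

-- hypothetical: one parameterized merge serving any SKREF/stage pair with the standard layout
declare @skref sysname = N'#skref_dim',
        @stage sysname = N'#dim_stage',
        @sql nvarchar(max);
set @sql = N'merge ' + @skref + N' as target
using (select nk, chksum from ' + @stage + N') as source (nk, chksum)
on target.nk = source.nk
when matched and target.chksum != source.chksum then update set target.chksum = source.chksum
when not matched then insert (nk, chksum) values (source.nk, source.chksum);';
exec sp_executesql @sql;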

SKREF tables like the ones shown here can be used to load extremely large DW tables in short order when many of the rows are actually duplicates of prior records. Either the SQL Merge command or an ETL tool's merge functionality makes quick work of winnowing out the rows that do not need to be changed thanks to looking up the checksums in the SKREF.

As a bonus, you may notice the fact table can be loaded prior to the dimension table. Some source systems output flat files one at a time or have some other dependency holding up prompt DW loads. Using this method completely frees up load interdependency at the mild cost of increased updates when child tables are loaded full of placeholder records prior to parent tables. Whether you gain time removing this dependency depends on a few factors - feel free to experiment.

I must mention again I've used the merge command only for illustration. You'd have to be nuts to code an entire DW load process in merge commands, in my opinion, when SSIS is available for free with SQL Server and there are several other excellent ETL tools on the market for more complex needs.

Sometimes people are concerned about the possibility of updates not changing the MD5 checksum. This is called an MD5 'collision'. Rest assured that MD5 collisions are not only extremely rare but, in practice, occur only under very contrived circumstances: one must try very hard, with deliberate mathematical manipulation, to construct two inputs with the same MD5 sum. That said, MD5 has become a poor choice for security purposes. We are using it here for nothing that needs security, just data record change detection.
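If collisions still worry you, HASHBYTES accepts stronger algorithms as well; the snippet below is just an illustration (SHA1 is available on older versions, SHA2_256 on SQL Server 2012 and later, and its 32-byte output still fits the varbinary(100) columns used here):

select hashbytes('SHA1', 'Fred') as sha1_chksum,
       hashbytes('SHA2_256', 'Fred') as sha256_chksum;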

There are several additional layers that can be built upon the staging and SKREF tables, depending on your needs. Have a good look and imagine the possibilities.

Credits:

I worked on a team that developed a more complex version of what I've illustrated here. The ideas are by no means all mine, but I did contribute. Everyone has since moved on from the job where we instigated this scheme, and I'm sure each of us continues to put our own stamp on it and refine it further. If you need the souped-up version, we charge reasonable rates.
