SQLServerCentral


HASHBYTES can help quickly load a Data Warehouse
Posted Sunday, October 9, 2011 2:44 PM
SSC Rookie
Group: General Forum Members
Points: 42, Visits: 508
L'Eomot Inversé (10/9/2011)
Langston Montgomery (10/8/2011)
Has anyone come across cases where SHA1 has produced output longer than 20 bytes? Because I think I have. We are using HASHBYTES to detect changes in our data warehouse, and we were getting type 2 changes even though all the column values were exactly the same between the most current and previous rows; even the hashbyte value looked the same!

But upon looking closer, the HASHBYTES function was returning this: 0xEE0816119692A1C4DCC2045EA1E37CD47EFF4900
Yet the value stored in our varbinary(20) column was this: 0xEE0816119692A1C4DCC2045EA1E37CD47EFF49
I don't think "longer than 20 bytes" is relevant, as the number ending in 00 is 20 bytes (the one without the 00 is only 19).

You're right, L'Eomot. I wrote that wrong. What I was trying to say was that HASHBYTES was returning a 20-byte value, but somehow when SQL wrote it, it got truncated to 19 bytes.

Notice the two zeros at the end of the function's output. For some reason, when SQL Server writes the function's output to the table, it truncates that final 0x00 byte (the last two hex characters)!

We have no idea why it does that, but when we increased the varbinary column in the table from 20 to 40, SQL Server stopped truncating the output, and all was well.

Anyway, if anyone has any insight into why SQL Server seems to be truncating the function's output as it writes it to the varbinary(20) column, please let me know. I'm curious!

I tried a few things to see if I could reproduce this behaviour, using HASHBYTES to generate values for local variables, for columns in a table variable, for columns in temp tables, and for columns in permanent tables, but never got any truncation when using varbinary(20). Maybe I'm on a different version; I'm using SQL 2008 R2.

We're using SQL 2008 R2 and SSIS 2008. We are also using the OLE DB Destination component with fast load, NOLOCK, and check constraints disabled to load the type 2 changes to the table. Wanna read something funny? I reduced the varbinary column from 40 back to 20, expecting SQL to truncate it again... and it doesn't! It's almost like something had to be reset. I thought it was ANSI_PADDING, but I don't think SSIS would turn that off.

Anyway, thanks for your response, L'Eomot. I don't like it when things work without my knowing what fixed them, but we're good for now!
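For reference, SHA-1 always produces exactly 20 bytes, whatever the input length, so a digest stored as 19 bytes has definitely been altered on the way into the table. A minimal check:

```sql
-- SHA-1 output from HASHBYTES is always a 20-byte digest,
-- regardless of how long the input is.
SELECT DATALENGTH(HASHBYTES('SHA1', 'abc'))                AS ShortInput,  -- 20
       DATALENGTH(HASHBYTES('SHA1', REPLICATE('x', 1000))) AS LongInput;   -- 20
```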
Post #1187675
Posted Monday, October 10, 2011 12:47 PM

!!False Alarm!!

I feel like such an idiot. I'm almost embarrassed to admit this, but out of courtesy, I thought I would give an update in case anyone else makes the same oversight I did.

You guessed it. ANSI_PADDING was turned off!! Doh!! This was a newly created database. Rookie mistake. Luckily, worst case, our data warehouse will have a few thousand phantom type 2 updates, which won't cause any harm beyond some wasted space.
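For anyone who wants to see the effect, it can be reproduced directly; the sketch below is illustrative (the table name dbo.HashDemo is made up), assuming the column is created while ANSI_PADDING is OFF:

```sql
-- When a varbinary column is created under ANSI_PADDING OFF,
-- SQL Server trims trailing 0x00 bytes from values inserted into it.
SET ANSI_PADDING OFF;
CREATE TABLE dbo.HashDemo (HashVal varbinary(20));
SET ANSI_PADDING ON;   -- the column keeps the setting it was created under

-- This value has two trailing zero bytes...
INSERT INTO dbo.HashDemo (HashVal) VALUES (0x1122330000);

-- ...but only three bytes (0x112233) are stored.
SELECT HashVal, DATALENGTH(HashVal) AS StoredBytes FROM dbo.HashDemo;

DROP TABLE dbo.HashDemo;
```

So a SHA-1 digest that happens to end in 0x00 loses its last byte, which is exactly the 20-versus-19-byte symptom described above.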

Thanks to all for taking the time to read this. I'm gonna go dig a hole and stick my head in it for a while!
Post #1188080
Posted Sunday, November 1, 2015 5:06 AM



Group: General Forum Members
Points: 122, Visits: 460
I've found the SHA-256 algorithm meets my needs so far in ETL loads.
I use a C# script task within SSIS to compare hash values in the Dimension tables, which obviously by their nature tend to be smaller than Fact tables.
Using the cryptography namespace in C# gets around some of the limitations of the T-SQL HASHBYTES function, specifically around data types and NULL handling.
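One of those NULL-handling limitations is easy to see in T-SQL itself: with ordinary `+` concatenation, a single NULL column nulls out the whole hash input. A small sketch (assumes SQL Server 2012+ for SHA2_256 and CONCAT; the variable names are illustrative):

```sql
DECLARE @FirstName  varchar(20) = 'John',
        @MiddleName varchar(20) = NULL;   -- one NULL column...

-- ...makes the '+' concatenation NULL, so the first hash is NULL,
-- while CONCAT treats NULL as an empty string and still yields a digest.
SELECT HASHBYTES('SHA2_256', @FirstName + '|' + @MiddleName)       AS PlusHash,    -- NULL
       HASHBYTES('SHA2_256', CONCAT(@FirstName, '|', @MiddleName)) AS ConcatHash;  -- 32-byte digest
```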
The trick for Fact tables is to implement Change Tracking at the source (less overhead than CDC), which can then be used for net data extraction based on the primary keys detected by the CHANGETABLE function. The incremental loads are then tiny (provided the gap between data pulls is reasonably small) compared to the initial load.

Annoyingly, in the source system I work with, we have updates, inserts, and deletes on the production transactional tables, so all three have to be handled by the ETL process. Nevertheless, with the appropriate SSIS tasks we achieve good performance, and the main thing is that the update process on the Fact table is still a batch process rather than a row-based one.

The T-SQL MERGE statement is really powerful for comparing rows as an alternative to hashing algorithms. So far I'm using HASHBYTES, and provided there is no loss of synchronisation via the Change Tracking mechanism, it can be used for row comparisons on Fact as well as Dimension tables.
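The net-extraction step described above can be sketched roughly as follows; the table dbo.FactSource, its key column FactId, and the stored version number are all illustrative, not taken from the original post:

```sql
-- @LastSyncVersion would be persisted by the ETL after each successful pull.
DECLARE @LastSyncVersion bigint = 42;

-- CHANGETABLE(CHANGES ...) returns one row per primary key changed since
-- that version, with SYS_CHANGE_OPERATION = 'I', 'U' or 'D'.
SELECT ct.FactId,
       ct.SYS_CHANGE_OPERATION,
       s.*                               -- current values; NULL for deleted rows
FROM CHANGETABLE(CHANGES dbo.FactSource, @LastSyncVersion) AS ct
LEFT JOIN dbo.FactSource AS s
       ON s.FactId = ct.FactId;
```

The LEFT JOIN matters here: deleted rows still appear in the change table but no longer exist in the source, which is how the ETL can route deletes separately from inserts and updates.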
Post #1732664

