HASHBYTES MD5 - problem

Question

HASHBYTES MD5 - problem

WilburSmith

Hall of Fame

Points: 3219
More actions
November 18, 2016 at 12:05 am

#401080

Hi,
I run into a problem while loading data into DataVault. I create my dv key with HASHBYTES('MD5', <business key>)
The load failed due 'Duplicate data' error and the duplicate key being:
0x1f9f0d70785734c03a78ec6336779f
And the data within (from the log):
Arvopaperilaji (Arvopaperilaji:UniChar.100:): "Kotimaiset osakaslainat, kiint."
Value_date (Value_date:Date:): "10/31/2016 00:00:00.000000000"
First I naturally checked whether there indeed was duplicate data coming from my STG-->DV view - NO
What is found in DV with the key in question?
select * from [VAKAVARAISUUS_PO].[RDV_Arvopaperi_Arvostusero_POSITIO_H]
where dv_key = 0x1f9f0d70785734c03a78ec6336779f
dv_key dv_load_dts dv_src Arvopaperilaji Value_date
0x1F9F0D70785734C03A78EC6336779F 2016-11-09 12:47:18.0000000 CSV Kotimaiset osakaslainat, kiint. 2016-10-31
But the data coming from STG to DV now:
SELECT CAST(HASHBYTES('MD5', '2016-10-31' + ' # ' + LTrim(Rtrim('Kotimaiset osakaslainat, kiint.'))) AS VARBINARY(40)) dv_key
dv_key
0x1F9F0D70785734C03A78EC6336779F00
Notice the extra '00' at the end of the key. This does not matter with Sql (it seems that EVEN number of 'zeroes' can be added to the end of the varbinary)
But the thing is that the ETL tool (Informatica 9.6) Lookup comparison considers them different and thus of course a duplicate error arises in the database while inserting.
I know I have to investigate the Informatica side of the inquiry (is there a workaround) as well but in here (SqlServer) I'd like to ask
1. Why do I get the '00' to the end compared to the result of '2016-11-09'?
2. The string is 100% the same but the source is CSV file - can it be that there is something not visible to my I that causes this (what computer was used to create the CSV)?
Regards, Ville
Sql Server 2014

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply

Eirikur Eiriksson SSC Guru Points: 182882 More actions · Answer 1

WilburSmith (11/18/2016)
Hi,
I run into a problem while loading data into DataVault. I create my dv key with HASHBYTES('MD5', <business key>)
The load failed due 'Duplicate data' error and the duplicate key being:
0x1f9f0d70785734c03a78ec6336779f
And the data within (from the log):
Arvopaperilaji (Arvopaperilaji:UniChar.100:): "Kotimaiset osakaslainat, kiint."
Value_date (Value_date:Date:): "10/31/2016 00:00:00.000000000"
First I naturally checked whether there indeed was duplicate data coming from my STG-->DV view - NO
What is found in DV with the key in question?
select * from [VAKAVARAISUUS_PO].[RDV_Arvopaperi_Arvostusero_POSITIO_H]
where dv_key = 0x1f9f0d70785734c03a78ec6336779f
dv_key dv_load_dts dv_src Arvopaperilaji Value_date
0x1F9F0D70785734C03A78EC6336779F 2016-11-09 12:47:18.0000000 CSV Kotimaiset osakaslainat, kiint. 2016-10-31
But the data coming from STG to DV now:
SELECT CAST(HASHBYTES('MD5', '2016-10-31' + ' # ' + LTrim(Rtrim('Kotimaiset osakaslainat, kiint.'))) AS VARBINARY(40)) dv_key
dv_key
0x1F9F0D70785734C03A78EC6336779F00
Notice the extra '00' at the end of the key. This does not matter with Sql (it seems that EVEN number of 'zeroes' can be added to the end of the varbinary)
But the thing is that the ETL tool (Informatica 9.6) Lookup comparison considers them different and thus of course a duplicate error arises in the database while inserting.
I know I have to investigate the Informatica side of the inquiry (is there a workaround) as well but in here (SqlServer) I'd like to ask
1. Why do I get the '00' to the end compared to the result of '2016-11-09'?
2. The string is 100% the same but the source is CSV file - can it be that there is something not visible to my I that causes this (what computer was used to create the CSV)?
Regards, Ville
Sql Server 2014

Quick thought, firstly the output of the HASHBYTE (MD5) is varbinary(16) so there is no truncation on the SQL Server side. Secondly, a lot of software uses 0x00 as string / field / column termination, most likely that is what is happening somewhere in the flow.

😎

WilburSmith Hall of Fame Points: 3219 More actions · Answer 2

Hi,

the columns are varbinary(40) in DB and ETL

I form the dv_key in Sql view like seen up there and it was done similarly both times.

There are 21 rows in this CSV and this particular row is the only one where this happens

v

WilburSmith Hall of Fame Points: 3219 More actions · Answer 3

Hi,

if anybody is interested then here is what I did as a workaround and some other findings as well

First of all, I finally got it that actually the HASHBYTES has produced the value exactly the same way both times. This was just remarkable coincidence that the key happened to have the trailing '00', which were stripped away in the DB and thus caused the 'NO MATCH' in ETL Lookup.

As a workaround I now updated the literal (Arvopaperilaji) to a different value (no trailing even '00' anymore) with the customer and updated the old values/dv_keys in DataVault and problem solved for now...:)

As a solution for the future, I would think that it should be as VARBINARY(16) in DB and equally CAST as BINARY(16) within the HASHBYTES-call. This way it would always be padded with trailing '00' to the full length and on the other hand it would not take any excess space from the database because it will be binary16 anyway...

I tested with a table of 7000+ rows and the current dv_key was always the same as rehashed 'CAST as BINARY(16)' ==> It seems to be OK, this BINARY(16)

Any thoughts against or for this suggestion..?

V

TheSQLGuru SSC Guru Points: 134017 More actions · Answer 4

IIRC, HASHBYTES isn't very good for your purposes. I think Thomas Kejser did a very good analysis of various mechanisms for computing unique "checksum" type values. You should be able to find it with a web search.

Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service

WilburSmith Hall of Fame Points: 3219 More actions · Answer 5

They have DataVault 2.0 and it is requested to have a HASHed dv_key...

v