HASHBYTES MD5 - problem

  • Hi,

    I run into a problem while loading data into DataVault. I create my dv key with HASHBYTES('MD5', <business key>)

    The load failed due 'Duplicate data' error and the duplicate key being:

    0x1f9f0d70785734c03a78ec6336779f

    And the data within (from the log):

    Arvopaperilaji (Arvopaperilaji:UniChar.100:): "Kotimaiset osakaslainat, kiint."

    Value_date (Value_date:Date:): "10/31/2016 00:00:00.000000000"

    First I naturally checked whether there indeed was duplicate data coming from my STG-->DV view - NO

    What is found in DV with the key in question?

    select * from [VAKAVARAISUUS_PO].[RDV_Arvopaperi_Arvostusero_POSITIO_H]

    where dv_key = 0x1f9f0d70785734c03a78ec6336779f

    dv_key dv_load_dts dv_src Arvopaperilaji Value_date

    0x1F9F0D70785734C03A78EC6336779F 2016-11-09 12:47:18.0000000 CSV Kotimaiset osakaslainat, kiint. 2016-10-31

    But the data coming from STG to DV now:

    SELECT CAST(HASHBYTES('MD5', '2016-10-31' + ' # ' + LTrim(Rtrim('Kotimaiset osakaslainat, kiint.'))) AS VARBINARY(40)) dv_key

    dv_key

    0x1F9F0D70785734C03A78EC6336779F00

    Notice the extra '00' at the end of the key. This does not matter with Sql (it seems that EVEN number of 'zeroes' can be added to the end of the varbinary)

    But the thing is that the ETL tool (Informatica 9.6) Lookup comparison considers them different and thus of course a duplicate error arises in the database while inserting.

    I know I have to investigate the Informatica side of the inquiry (is there a workaround) as well but in here (SqlServer) I'd like to ask

    1. Why do I get the '00' to the end compared to the result of '2016-11-09'?

    2. The string is 100% the same but the source is CSV file - can it be that there is something not visible to my I that causes this (what computer was used to create the CSV)?

    Regards, Ville

    Sql Server 2014

  • WilburSmith (11/18/2016)


    Hi,

    I run into a problem while loading data into DataVault. I create my dv key with HASHBYTES('MD5', <business key>)

    The load failed due 'Duplicate data' error and the duplicate key being:

    0x1f9f0d70785734c03a78ec6336779f

    And the data within (from the log):

    Arvopaperilaji (Arvopaperilaji:UniChar.100:): "Kotimaiset osakaslainat, kiint."

    Value_date (Value_date:Date:): "10/31/2016 00:00:00.000000000"

    First I naturally checked whether there indeed was duplicate data coming from my STG-->DV view - NO

    What is found in DV with the key in question?

    select * from [VAKAVARAISUUS_PO].[RDV_Arvopaperi_Arvostusero_POSITIO_H]

    where dv_key = 0x1f9f0d70785734c03a78ec6336779f

    dv_key dv_load_dts dv_src Arvopaperilaji Value_date

    0x1F9F0D70785734C03A78EC6336779F 2016-11-09 12:47:18.0000000 CSV Kotimaiset osakaslainat, kiint. 2016-10-31

    But the data coming from STG to DV now:

    SELECT CAST(HASHBYTES('MD5', '2016-10-31' + ' # ' + LTrim(Rtrim('Kotimaiset osakaslainat, kiint.'))) AS VARBINARY(40)) dv_key

    dv_key

    0x1F9F0D70785734C03A78EC6336779F00

    Notice the extra '00' at the end of the key. This does not matter with Sql (it seems that EVEN number of 'zeroes' can be added to the end of the varbinary)

    But the thing is that the ETL tool (Informatica 9.6) Lookup comparison considers them different and thus of course a duplicate error arises in the database while inserting.

    I know I have to investigate the Informatica side of the inquiry (is there a workaround) as well but in here (SqlServer) I'd like to ask

    1. Why do I get the '00' to the end compared to the result of '2016-11-09'?

    2. The string is 100% the same but the source is CSV file - can it be that there is something not visible to my I that causes this (what computer was used to create the CSV)?

    Regards, Ville

    Sql Server 2014

    Quick thought, firstly the output of the HASHBYTE (MD5) is varbinary(16) so there is no truncation on the SQL Server side. Secondly, a lot of software uses 0x00 as string / field / column termination, most likely that is what is happening somewhere in the flow.

    😎

  • Hi,

    the columns are varbinary(40) in DB and ETL

    I form the dv_key in Sql view like seen up there and it was done similarly both times.

    There are 21 rows in this CSV and this particular row is the only one where this happens

    v

  • Hi,

    if anybody is interested then here is what I did as a workaround and some other findings as well

    First of all, I finally got it that actually the HASHBYTES has produced the value exactly the same way both times. This was just remarkable coincidence that the key happened to have the trailing '00', which were stripped away in the DB and thus caused the 'NO MATCH' in ETL Lookup.

    As a workaround I now updated the literal (Arvopaperilaji) to a different value (no trailing even '00' anymore) with the customer and updated the old values/dv_keys in DataVault and problem solved for now...:)

    As a solution for the future, I would think that it should be as VARBINARY(16) in DB and equally CAST as BINARY(16) within the HASHBYTES-call. This way it would always be padded with trailing '00' to the full length and on the other hand it would not take any excess space from the database because it will be binary16 anyway...

    I tested with a table of 7000+ rows and the current dv_key was always the same as rehashed 'CAST as BINARY(16)' ==> It seems to be OK, this BINARY(16)

    Any thoughts against or for this suggestion..?

    V

  • IIRC, HASHBYTES isn't very good for your purposes. I think Thomas Kejser did a very good analysis of various mechanisms for computing unique "checksum" type values. You should be able to find it with a web search.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • They have DataVault 2.0 and it is requested to have a HASHed dv_key...

    v

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply