Trying to use the HASHBYTES of a nvarcar(Max)

  • When I try to make a calculated field [File HASH] using hashbytes('MD5',[ASCII File]) the calculation I get the resultant field [File HASH] as a nvarchar(max)

    I want the field [File HASH] to be nvarchar(50) so I tried left(hashbytes('MD5',[ASCII File]), 50) figuring if I only take the left 50 characters then the resultant field would only be 50 characters

    When I try to do a select query using including the [File HASH] field I get

    Query completed with errors

    Msg 8152, Level 16, State 10, Line 2

    String or Binary data would be truncated

    So I tried instead of using the [File HASH] calculkated field directly I tried using left(hashbytes('MD5',[ASCII File]), 50) in the select instead of the calculated field and I get the same error messages.

    How can I take the HASHBYTES of a nvarchar(max) field and put the result in a another field that is nvarchar(50)

    BTW the MD5 HASH has is way less than 50 characters as a result.

    Also I am using SQL SERVER Management Studio 2016 and I modified the table using the GUI but the Queries I am using SQL.

    I was wondering if convert could somehow be used? If so what would be the syntax for the calculated field?

  • 1) Try CASTing it as the data type you want.

    2) What do you get when you do DATALENGTH(..) on your calculation?

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Quick thoughts;

    The HASHBYTES has the 8000 and 4000 character limits for VARCHAR and NVARCHAR data types respectfully. This means that if one wants to create a hash of character strings larger than the limit then one has either to resort to external methods (CLR etc.) or chunk the string accordingly. I have done this in T-SQL by using the following method:

    1) Chop the string into 8000/4000 characters

    2) Hash each chunk

    3) Concatenate the hashes and test for the length

    4) Hash the concatenation, repeating the chopping if longer than the limit.

    5) Finally ending up with one hash.

    This will generate a unique hash for the character string but obviously it will not match other methods which process the whole string in one go.

    The length of the output has nothing to do with the input lenght, it depends on which algorithm is used, available algorithms are MD2, MD4, MD5, SHA, SHA1, SHA2_256 and SHA2_512. There is a performance difference between those and I've found that the SHA1 is normally the fastest and MD5 is close behind.

    For several reasons I'm not to keen on a calculated column for the hashes, performance being an obvious one, I rather hash the content on the insert.

    😎

  • For several reasons I'm not to keen on a calculated column for the hashes, performance being an obvious one, I rather hash the content on the insert.

    SO VERY TRUE!! There are so many things I see clients do on SQL Server that should be done elsewhere!! It is a precious, VERY EXPENSIVE, collection of limited resources (RAM, IO, CPU, LOCKS, IO, LATCHES, IO, etc, etc) that when constrained or overwhelmed cause performance issues.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • You said "The length of the output has nothing to do with the input lenght" but that seems to be what is happening. How else do you explain the error messages?

  • You said "I'm not to keen on a calculated column for the hashes, performance being an obvious one" the theoretical performance hit is extremely negligible to the point of being measurable

  • joejuska (7/30/2016)


    You said "The length of the output has nothing to do with the input lenght" but that seems to be what is happening. How else do you explain the error messages?

    The output of the hashbytes is 16 bytes for 128 bits encryption, 20 bytes for 10 bits, 32 bytes for 256 bits and 64 bytes for 512 bits encryption, regardless of the input length.

    😎

    If you want to hash only the first 50 characters then you have to chop it off first before passing it to the hash function:

    HASBYTES('MD5',LEFT([ASCII File],50))

  • joejuska (7/30/2016)


    You said "I'm not to keen on a calculated column for the hashes, performance being an obvious one" the theoretical performance hit is extremely negligible to the point of being measurable

    Quick questions, on what do you base that assumption? Do you have a realistic test data and harness to proof the point? Is there a problem calculating the hash on insert?

    😎

    Further on the subject of hashing and it's role: hashing uniquely identifies the data, it allows for detection of duplicate entries and it allows for a detection of any modification of the data. Only hashing the first 50 characters of a long character string serves, in my point of view, hardly any purpose, it would match entries where maybe 99% of the data would be missing. There are circumstances where a portion of the data uniquely identifies it from a business perspective and then it is probably better to extract that data as is from the bulk and use simple matching, no need really hashing anything smaller than ~= 64 bytes.

    On the performance, if there is an occasional insert every second or in minutes then yes, calculated column could do just fine for shorter inputs. Unfortunately this does not scale too well and when working with 1000s of insert/second of several kilo or megabytes, it simply does not work unless one is willing to throw some really serious hardware at the problem.

  • You said "Quick questions, on what do you base that assumption?" I can't directly answer your question because YOUR QUESTION is wrong. IT'S NOT AN ASSUMPTION BUT AN OBSERVATION. The observation over many years of doing it both ways and any difference (I assume there must be some overhead that would make it theoretically slower) are so small as to be undetectable with tens of thousands of inserts per day. Both methods yield an instantaneous result as perceived by the human users brain.

    You said "Do you have a realistic test data and harness to proof the point?" Real world usage by thousands of users

    You said "Is there a problem calculating the hash on insert?" Yes. There are numerous (unknown number and unknown type) ways users can input the data so we have no control over what is insert so the only way to get a hash is with a calculated field or running a process after an insert.

    This is the last supersize in futility I am going to offer. Anyone else please have the common decency to directly answer my question instead of offering alternatives that make stupid assumptions.

  • No I DO NOT WANT TO HASH ONLY THE FIRST 50 CHARACTERS. I WANT TO HASH THE ENTIRE FILE BUT HAVE THE RESULTANT FIELD NVARCHAR(50) OR CHAR(50) INSTEAD OF SSMS MAKING THE RESULTANT FIELD NVARCHAR(MAX)

  • In this case there is no difference between an assumption and a casual observation, nothing solid to proof anything.

    😎

    Here are two test harnesses, a) measures the difference between calculating the md5 on insert vs. calculated column, b) the difference between inserting an pre-calculated md5 vs. calculated column.

    The caveat here is that the input is limited to the maximum of 4000 wide characters, in real live situation one would expect larger inputs which means more complicated calculation of the md5, especially with the calculated column as it is limited to a scalar function because the apply operator cannot be used there. That will further add to the overhead of the calculated column method.

    Test Harness A

    USE TEEST;

    GO

    SET NOCOUNT ON;

    -- Test the difference between calculating an md5 hash on the insert and a calculated column for the md5

    IF OBJECT_ID(N'dbo.TBL_TEST_HASH_SAMPLE') IS NOT NULL DROP TABLE dbo.TBL_TEST_HASH_SAMPLE;

    CREATE TABLE dbo.TBL_TEST_HASH_SAMPLE

    (

    TH_ID INT IDENTITY(1,1) NOT NULL CONSTRAINT PK_DBO_TBL_TEST_HASH_SAMPLE_TH_ID PRIMARY KEY CLUSTERED

    ,TH_NSTR NVARCHAR(MAX) NOT NULL

    ,TH_MD5 BINARY(20) NOT NULL

    );

    IF OBJECT_ID(N'dbo.TBL_TEST_HASH') IS NOT NULL DROP TABLE dbo.TBL_TEST_HASH;

    CREATE TABLE dbo.TBL_TEST_HASH

    (

    TH_ID INT IDENTITY(1,1) NOT NULL CONSTRAINT PK_DBO_TBL_TEST_HASH_TH_ID PRIMARY KEY CLUSTERED

    ,TH_NSTR NVARCHAR(MAX) NOT NULL

    ,TH_MD5 BINARY(20) NOT NULL

    );

    IF OBJECT_ID(N'dbo.TBL_TEST_HASH_CALC_COL') IS NOT NULL DROP TABLE dbo.TBL_TEST_HASH_CALC_COL;

    CREATE TABLE dbo.TBL_TEST_HASH_CALC_COL

    (

    TH_ID INT IDENTITY(1,1) NOT NULL CONSTRAINT PK_DBO_TBL_TEST_HASH_CALC_COL_TH_ID PRIMARY KEY CLUSTERED

    ,TH_NSTR NVARCHAR(MAX) NOT NULL

    ,TH_MD5 AS (HASHBYTES('MD5',TH_NSTR)) PERSISTED

    );

    DECLARE @SAMPLE_SIZE INT = 100000;

    DECLARE @MAX_SIZE INT = 4000;

    -- APPROX 7.5m characters

    DECLARE @LONG_STR NVARCHAR(MAX) =

    (

    SELECT

    NCHAR(32) + SMSG.text

    FROM sys.messages SMSG

    WHERE SMSG.language_id IN (1031, 1033,1036,3082,1040,2068)

    FOR XML PATH('')

    );

    DECLARE @MAX_START INT = LEN(@LONG_STR) - (@MAX_SIZE + 1);

    ;WITH T(N) AS (SELECT N FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) AS X(N))

    , NUMS(N) AS (SELECT TOP(@SAMPLE_SIZE) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N FROM T T1,T T2,T T3,T T4,T T5,T T6,T T7,T T8,T T9)

    ,STRDATA(TEST_STR) AS

    (

    SELECT

    SUBSTRING(@LONG_STR ,(ABS(CHECKSUM(NEWID())) % @MAX_START),1 + (ABS(CHECKSUM(NEWID())) % @MAX_SIZE)) AS TEST_STR

    FROM NUMS NM

    )

    INSERT INTO dbo.TBL_TEST_HASH_SAMPLE(TH_NSTR,TH_MD5)

    SELECT

    SD.TEST_STR

    ,HASHBYTES('MD5',SD.TEST_STR)

    FROM STRDATA SD;

    DECLARE @timer TABLE (T_TXT VARCHAR(50) NOT NULL, T_TS DATETIME2(7) NOT NULL DEFAULT (SYSDATETIME()));

    DECLARE @NVARCHAR_BUCKET NVARCHAR(MAX) = N'';

    DECLARE @BINARY_BUCKET BINARY(20) = 0x00;

    -- ROUND ONE

    INSERT INTO @timer(T_TXT) VALUES ('DRY RUN');

    SELECT

    @NVARCHAR_BUCKET = TS.TH_NSTR

    ,@BINARY_BUCKET = TS.TH_MD5

    FROM dbo.TBL_TEST_HASH_SAMPLE TS

    INSERT INTO @timer(T_TXT) VALUES ('DRY RUN');

    INSERT INTO @timer(T_TXT) VALUES ('CALC COL');

    INSERT INTO dbo.TBL_TEST_HASH_CALC_COL(TH_NSTR)

    SELECT

    TS.TH_NSTR

    FROM dbo.TBL_TEST_HASH_SAMPLE TS;

    INSERT INTO @timer(T_TXT) VALUES ('CALC COL');

    INSERT INTO @timer(T_TXT) VALUES ('INSERT');

    INSERT INTO dbo.TBL_TEST_HASH(TH_NSTR,TH_MD5)

    SELECT

    TS.TH_NSTR

    ,HASHBYTES('MD5',TS.TH_NSTR)

    FROM dbo.TBL_TEST_HASH_SAMPLE TS;

    INSERT INTO @timer(T_TXT) VALUES ('INSERT');

    -- ROUND TWO

    INSERT INTO @timer(T_TXT) VALUES ('DRY RUN');

    SELECT

    @NVARCHAR_BUCKET = TS.TH_NSTR

    ,@BINARY_BUCKET = TS.TH_MD5

    FROM dbo.TBL_TEST_HASH_SAMPLE TS

    INSERT INTO @timer(T_TXT) VALUES ('DRY RUN');

    INSERT INTO @timer(T_TXT) VALUES ('CALC COL');

    INSERT INTO dbo.TBL_TEST_HASH_CALC_COL(TH_NSTR)

    SELECT

    TS.TH_NSTR

    FROM dbo.TBL_TEST_HASH_SAMPLE TS;

    INSERT INTO @timer(T_TXT) VALUES ('CALC COL');

    INSERT INTO @timer(T_TXT) VALUES ('INSERT');

    INSERT INTO dbo.TBL_TEST_HASH(TH_NSTR,TH_MD5)

    SELECT

    TS.TH_NSTR

    ,HASHBYTES('MD5',TS.TH_NSTR)

    FROM dbo.TBL_TEST_HASH_SAMPLE TS;

    INSERT INTO @timer(T_TXT) VALUES ('INSERT');

    -- ROUND THREE

    TRUNCATE TABLE dbo.TBL_TEST_HASH_CALC_COL;

    TRUNCATE TABLE dbo.TBL_TEST_HASH;

    INSERT INTO @timer(T_TXT) VALUES ('DRY RUN');

    SELECT

    @NVARCHAR_BUCKET = TS.TH_NSTR

    ,@BINARY_BUCKET = TS.TH_MD5

    FROM dbo.TBL_TEST_HASH_SAMPLE TS

    INSERT INTO @timer(T_TXT) VALUES ('DRY RUN');

    INSERT INTO @timer(T_TXT) VALUES ('INSERT');

    INSERT INTO dbo.TBL_TEST_HASH(TH_NSTR,TH_MD5)

    SELECT

    TS.TH_NSTR

    ,HASHBYTES('MD5',TS.TH_NSTR)

    FROM dbo.TBL_TEST_HASH_SAMPLE TS;

    INSERT INTO @timer(T_TXT) VALUES ('INSERT');

    INSERT INTO @timer(T_TXT) VALUES ('CALC COL');

    INSERT INTO dbo.TBL_TEST_HASH_CALC_COL(TH_NSTR)

    SELECT

    TS.TH_NSTR

    FROM dbo.TBL_TEST_HASH_SAMPLE TS;

    INSERT INTO @timer(T_TXT) VALUES ('CALC COL');

    -- ROUND FOUR

    INSERT INTO @timer(T_TXT) VALUES ('DRY RUN');

    SELECT

    @NVARCHAR_BUCKET = TS.TH_NSTR

    ,@BINARY_BUCKET = TS.TH_MD5

    FROM dbo.TBL_TEST_HASH_SAMPLE TS

    INSERT INTO @timer(T_TXT) VALUES ('DRY RUN');

    INSERT INTO @timer(T_TXT) VALUES ('INSERT');

    INSERT INTO dbo.TBL_TEST_HASH(TH_NSTR,TH_MD5)

    SELECT

    TS.TH_NSTR

    ,HASHBYTES('MD5',TS.TH_NSTR)

    FROM dbo.TBL_TEST_HASH_SAMPLE TS;

    INSERT INTO @timer(T_TXT) VALUES ('INSERT');

    INSERT INTO @timer(T_TXT) VALUES ('CALC COL');

    INSERT INTO dbo.TBL_TEST_HASH_CALC_COL(TH_NSTR)

    SELECT

    TS.TH_NSTR

    FROM dbo.TBL_TEST_HASH_SAMPLE TS;

    INSERT INTO @timer(T_TXT) VALUES ('CALC COL');

    -- ROUND FIVE

    INSERT INTO @timer(T_TXT) VALUES ('DRY RUN');

    SELECT

    @NVARCHAR_BUCKET = TS.TH_NSTR

    ,@BINARY_BUCKET = TS.TH_MD5

    FROM dbo.TBL_TEST_HASH_SAMPLE TS

    INSERT INTO @timer(T_TXT) VALUES ('DRY RUN');

    INSERT INTO @timer(T_TXT) VALUES ('CALC COL');

    INSERT INTO dbo.TBL_TEST_HASH_CALC_COL(TH_NSTR)

    SELECT

    TS.TH_NSTR

    FROM dbo.TBL_TEST_HASH_SAMPLE TS;

    INSERT INTO @timer(T_TXT) VALUES ('CALC COL');

    INSERT INTO @timer(T_TXT) VALUES ('INSERT');

    INSERT INTO dbo.TBL_TEST_HASH(TH_NSTR,TH_MD5)

    SELECT

    TS.TH_NSTR

    ,HASHBYTES('MD5',TS.TH_NSTR)

    FROM dbo.TBL_TEST_HASH_SAMPLE TS;

    INSERT INTO @timer(T_TXT) VALUES ('INSERT');

    -- ROUND SIX

    INSERT INTO @timer(T_TXT) VALUES ('DRY RUN');

    SELECT

    @NVARCHAR_BUCKET = TS.TH_NSTR

    ,@BINARY_BUCKET = TS.TH_MD5

    FROM dbo.TBL_TEST_HASH_SAMPLE TS

    INSERT INTO @timer(T_TXT) VALUES ('DRY RUN');

    INSERT INTO @timer(T_TXT) VALUES ('CALC COL');

    INSERT INTO dbo.TBL_TEST_HASH_CALC_COL(TH_NSTR)

    SELECT

    TS.TH_NSTR

    FROM dbo.TBL_TEST_HASH_SAMPLE TS;

    INSERT INTO @timer(T_TXT) VALUES ('CALC COL');

    INSERT INTO @timer(T_TXT) VALUES ('INSERT');

    INSERT INTO dbo.TBL_TEST_HASH(TH_NSTR,TH_MD5)

    SELECT

    TS.TH_NSTR

    ,HASHBYTES('MD5',TS.TH_NSTR)

    FROM dbo.TBL_TEST_HASH_SAMPLE TS;

    INSERT INTO @timer(T_TXT) VALUES ('INSERT');

    -- ROUND SEVEN

    TRUNCATE TABLE dbo.TBL_TEST_HASH_CALC_COL;

    TRUNCATE TABLE dbo.TBL_TEST_HASH;

    INSERT INTO @timer(T_TXT) VALUES ('DRY RUN');

    SELECT

    @NVARCHAR_BUCKET = TS.TH_NSTR

    ,@BINARY_BUCKET = TS.TH_MD5

    FROM dbo.TBL_TEST_HASH_SAMPLE TS

    INSERT INTO @timer(T_TXT) VALUES ('DRY RUN');

    INSERT INTO @timer(T_TXT) VALUES ('INSERT');

    INSERT INTO dbo.TBL_TEST_HASH(TH_NSTR,TH_MD5)

    SELECT

    TS.TH_NSTR

    ,HASHBYTES('MD5',TS.TH_NSTR)

    FROM dbo.TBL_TEST_HASH_SAMPLE TS;

    INSERT INTO @timer(T_TXT) VALUES ('INSERT');

    INSERT INTO @timer(T_TXT) VALUES ('CALC COL');

    INSERT INTO dbo.TBL_TEST_HASH_CALC_COL(TH_NSTR)

    SELECT

    TS.TH_NSTR

    FROM dbo.TBL_TEST_HASH_SAMPLE TS;

    INSERT INTO @timer(T_TXT) VALUES ('CALC COL');

    -- ROUND EIGHT

    INSERT INTO @timer(T_TXT) VALUES ('DRY RUN');

    SELECT

    @NVARCHAR_BUCKET = TS.TH_NSTR

    ,@BINARY_BUCKET = TS.TH_MD5

    FROM dbo.TBL_TEST_HASH_SAMPLE TS

    INSERT INTO @timer(T_TXT) VALUES ('DRY RUN');

    INSERT INTO @timer(T_TXT) VALUES ('INSERT');

    INSERT INTO dbo.TBL_TEST_HASH(TH_NSTR,TH_MD5)

    SELECT

    TS.TH_NSTR

    ,HASHBYTES('MD5',TS.TH_NSTR)

    FROM dbo.TBL_TEST_HASH_SAMPLE TS;

    INSERT INTO @timer(T_TXT) VALUES ('INSERT');

    INSERT INTO @timer(T_TXT) VALUES ('CALC COL');

    INSERT INTO dbo.TBL_TEST_HASH_CALC_COL(TH_NSTR)

    SELECT

    TS.TH_NSTR

    FROM dbo.TBL_TEST_HASH_SAMPLE TS;

    INSERT INTO @timer(T_TXT) VALUES ('CALC COL');

    SELECT

    T.T_TXT

    ,DATEDIFF(MICROSECOND,MIN(T.T_TS),MAX(T.T_TS)) AS DURATION

    FROM @timer T

    GROUP BY T.T_TXT

    ORDER BY DURATION;

    Results from A on 2nd Gen i5, 8Gb ram, 3rd Gen SSD, SQL Server 2016 Dev Ed.

    T_TXT DURATION

    ----------- -----------

    INSERT 59156070

    DRY RUN 59394936

    CALC COL 67620986

    Test Harness B

    USE TEEST;

    GO

    SET NOCOUNT ON;

    -- Test the difference between inserting an md5 hash and a calculated column for the md5

    IF OBJECT_ID(N'dbo.TBL_TEST_HASH_SAMPLE') IS NOT NULL DROP TABLE dbo.TBL_TEST_HASH_SAMPLE;

    CREATE TABLE dbo.TBL_TEST_HASH_SAMPLE

    (

    TH_ID INT IDENTITY(1,1) NOT NULL CONSTRAINT PK_DBO_TBL_TEST_HASH_SAMPLE_TH_ID PRIMARY KEY CLUSTERED

    ,TH_NSTR NVARCHAR(MAX) NOT NULL

    ,TH_MD5 BINARY(20) NOT NULL

    );

    IF OBJECT_ID(N'dbo.TBL_TEST_HASH') IS NOT NULL DROP TABLE dbo.TBL_TEST_HASH;

    CREATE TABLE dbo.TBL_TEST_HASH

    (

    TH_ID INT IDENTITY(1,1) NOT NULL CONSTRAINT PK_DBO_TBL_TEST_HASH_TH_ID PRIMARY KEY CLUSTERED

    ,TH_NSTR NVARCHAR(MAX) NOT NULL

    ,TH_MD5 BINARY(20) NOT NULL

    );

    IF OBJECT_ID(N'dbo.TBL_TEST_HASH_CALC_COL') IS NOT NULL DROP TABLE dbo.TBL_TEST_HASH_CALC_COL;

    CREATE TABLE dbo.TBL_TEST_HASH_CALC_COL

    (

    TH_ID INT IDENTITY(1,1) NOT NULL CONSTRAINT PK_DBO_TBL_TEST_HASH_CALC_COL_TH_ID PRIMARY KEY CLUSTERED

    ,TH_NSTR NVARCHAR(MAX) NOT NULL

    ,TH_MD5 AS (HASHBYTES('MD5',TH_NSTR)) PERSISTED

    );

    DECLARE @SAMPLE_SIZE INT = 100000;

    DECLARE @MAX_SIZE INT = 4000;

    -- APPROX 7.5m characters

    DECLARE @LONG_STR NVARCHAR(MAX) =

    (

    SELECT

    NCHAR(32) + SMSG.text

    FROM sys.messages SMSG

    WHERE SMSG.language_id IN (1031, 1033,1036,3082,1040,2068)

    FOR XML PATH('')

    );

    DECLARE @MAX_START INT = LEN(@LONG_STR) - (@MAX_SIZE + 1);

    ;WITH T(N) AS (SELECT N FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) AS X(N))

    , NUMS(N) AS (SELECT TOP(@SAMPLE_SIZE) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N FROM T T1,T T2,T T3,T T4,T T5,T T6,T T7,T T8,T T9)

    ,STRDATA(TEST_STR) AS

    (

    SELECT

    SUBSTRING(@LONG_STR ,(ABS(CHECKSUM(NEWID())) % @MAX_START),1 + (ABS(CHECKSUM(NEWID())) % @MAX_SIZE)) AS TEST_STR

    FROM NUMS NM

    )

    INSERT INTO dbo.TBL_TEST_HASH_SAMPLE(TH_NSTR,TH_MD5)

    SELECT

    SD.TEST_STR

    ,HASHBYTES('MD5',SD.TEST_STR)

    FROM STRDATA SD;

    DECLARE @timer TABLE (T_TXT VARCHAR(50) NOT NULL, T_TS DATETIME2(7) NOT NULL DEFAULT (SYSDATETIME()));

    DECLARE @NVARCHAR_BUCKET NVARCHAR(MAX) = N'';

    DECLARE @BINARY_BUCKET BINARY(20) = 0x00;

    -- ROUND ONE

    INSERT INTO @timer(T_TXT) VALUES ('DRY RUN');

    SELECT

    @NVARCHAR_BUCKET = TS.TH_NSTR

    ,@BINARY_BUCKET = TS.TH_MD5

    FROM dbo.TBL_TEST_HASH_SAMPLE TS

    INSERT INTO @timer(T_TXT) VALUES ('DRY RUN');

    INSERT INTO @timer(T_TXT) VALUES ('CALC COL');

    INSERT INTO dbo.TBL_TEST_HASH_CALC_COL(TH_NSTR)

    SELECT

    TS.TH_NSTR

    FROM dbo.TBL_TEST_HASH_SAMPLE TS;

    INSERT INTO @timer(T_TXT) VALUES ('CALC COL');

    INSERT INTO @timer(T_TXT) VALUES ('INSERT');

    INSERT INTO dbo.TBL_TEST_HASH(TH_NSTR,TH_MD5)

    SELECT

    TS.TH_NSTR

    ,TS.TH_MD5

    FROM dbo.TBL_TEST_HASH_SAMPLE TS;

    INSERT INTO @timer(T_TXT) VALUES ('INSERT');

    -- ROUND TWO

    INSERT INTO @timer(T_TXT) VALUES ('DRY RUN');

    SELECT

    @NVARCHAR_BUCKET = TS.TH_NSTR

    ,@BINARY_BUCKET = TS.TH_MD5

    FROM dbo.TBL_TEST_HASH_SAMPLE TS

    INSERT INTO @timer(T_TXT) VALUES ('DRY RUN');

    INSERT INTO @timer(T_TXT) VALUES ('CALC COL');

    INSERT INTO dbo.TBL_TEST_HASH_CALC_COL(TH_NSTR)

    SELECT

    TS.TH_NSTR

    FROM dbo.TBL_TEST_HASH_SAMPLE TS;

    INSERT INTO @timer(T_TXT) VALUES ('CALC COL');

    INSERT INTO @timer(T_TXT) VALUES ('INSERT');

    INSERT INTO dbo.TBL_TEST_HASH(TH_NSTR,TH_MD5)

    SELECT

    TS.TH_NSTR

    ,TS.TH_MD5

    FROM dbo.TBL_TEST_HASH_SAMPLE TS;

    INSERT INTO @timer(T_TXT) VALUES ('INSERT');

    -- ROUND THREE

    TRUNCATE TABLE dbo.TBL_TEST_HASH_CALC_COL;

    TRUNCATE TABLE dbo.TBL_TEST_HASH;

    INSERT INTO @timer(T_TXT) VALUES ('DRY RUN');

    SELECT

    @NVARCHAR_BUCKET = TS.TH_NSTR

    ,@BINARY_BUCKET = TS.TH_MD5

    FROM dbo.TBL_TEST_HASH_SAMPLE TS

    INSERT INTO @timer(T_TXT) VALUES ('DRY RUN');

    INSERT INTO @timer(T_TXT) VALUES ('INSERT');

    INSERT INTO dbo.TBL_TEST_HASH(TH_NSTR,TH_MD5)

    SELECT

    TS.TH_NSTR

    ,TS.TH_MD5

    FROM dbo.TBL_TEST_HASH_SAMPLE TS;

    INSERT INTO @timer(T_TXT) VALUES ('INSERT');

    INSERT INTO @timer(T_TXT) VALUES ('CALC COL');

    INSERT INTO dbo.TBL_TEST_HASH_CALC_COL(TH_NSTR)

    SELECT

    TS.TH_NSTR

    FROM dbo.TBL_TEST_HASH_SAMPLE TS;

    INSERT INTO @timer(T_TXT) VALUES ('CALC COL');

    -- ROUND FOUR

    INSERT INTO @timer(T_TXT) VALUES ('DRY RUN');

    SELECT

    @NVARCHAR_BUCKET = TS.TH_NSTR

    ,@BINARY_BUCKET = TS.TH_MD5

    FROM dbo.TBL_TEST_HASH_SAMPLE TS

    INSERT INTO @timer(T_TXT) VALUES ('DRY RUN');

    INSERT INTO @timer(T_TXT) VALUES ('INSERT');

    INSERT INTO dbo.TBL_TEST_HASH(TH_NSTR,TH_MD5)

    SELECT

    TS.TH_NSTR

    ,TS.TH_MD5

    FROM dbo.TBL_TEST_HASH_SAMPLE TS;

    INSERT INTO @timer(T_TXT) VALUES ('INSERT');

    INSERT INTO @timer(T_TXT) VALUES ('CALC COL');

    INSERT INTO dbo.TBL_TEST_HASH_CALC_COL(TH_NSTR)

    SELECT

    TS.TH_NSTR

    FROM dbo.TBL_TEST_HASH_SAMPLE TS;

    INSERT INTO @timer(T_TXT) VALUES ('CALC COL');

    -- ROUND FIVE

    INSERT INTO @timer(T_TXT) VALUES ('DRY RUN');

    SELECT

    @NVARCHAR_BUCKET = TS.TH_NSTR

    ,@BINARY_BUCKET = TS.TH_MD5

    FROM dbo.TBL_TEST_HASH_SAMPLE TS

    INSERT INTO @timer(T_TXT) VALUES ('DRY RUN');

    INSERT INTO @timer(T_TXT) VALUES ('CALC COL');

    INSERT INTO dbo.TBL_TEST_HASH_CALC_COL(TH_NSTR)

    SELECT

    TS.TH_NSTR

    FROM dbo.TBL_TEST_HASH_SAMPLE TS;

    INSERT INTO @timer(T_TXT) VALUES ('CALC COL');

    INSERT INTO @timer(T_TXT) VALUES ('INSERT');

    INSERT INTO dbo.TBL_TEST_HASH(TH_NSTR,TH_MD5)

    SELECT

    TS.TH_NSTR

    ,TS.TH_MD5

    FROM dbo.TBL_TEST_HASH_SAMPLE TS;

    INSERT INTO @timer(T_TXT) VALUES ('INSERT');

    -- ROUND SIX

    INSERT INTO @timer(T_TXT) VALUES ('DRY RUN');

    SELECT

    @NVARCHAR_BUCKET = TS.TH_NSTR

    ,@BINARY_BUCKET = TS.TH_MD5

    FROM dbo.TBL_TEST_HASH_SAMPLE TS

    INSERT INTO @timer(T_TXT) VALUES ('DRY RUN');

    INSERT INTO @timer(T_TXT) VALUES ('CALC COL');

    INSERT INTO dbo.TBL_TEST_HASH_CALC_COL(TH_NSTR)

    SELECT

    TS.TH_NSTR

    FROM dbo.TBL_TEST_HASH_SAMPLE TS;

    INSERT INTO @timer(T_TXT) VALUES ('CALC COL');

    INSERT INTO @timer(T_TXT) VALUES ('INSERT');

    INSERT INTO dbo.TBL_TEST_HASH(TH_NSTR,TH_MD5)

    SELECT

    TS.TH_NSTR

    ,TS.TH_MD5

    FROM dbo.TBL_TEST_HASH_SAMPLE TS;

    INSERT INTO @timer(T_TXT) VALUES ('INSERT');

    -- ROUND SEVEN

    TRUNCATE TABLE dbo.TBL_TEST_HASH_CALC_COL;

    TRUNCATE TABLE dbo.TBL_TEST_HASH;

    INSERT INTO @timer(T_TXT) VALUES ('DRY RUN');

    SELECT

    @NVARCHAR_BUCKET = TS.TH_NSTR

    ,@BINARY_BUCKET = TS.TH_MD5

    FROM dbo.TBL_TEST_HASH_SAMPLE TS

    INSERT INTO @timer(T_TXT) VALUES ('DRY RUN');

    INSERT INTO @timer(T_TXT) VALUES ('INSERT');

    INSERT INTO dbo.TBL_TEST_HASH(TH_NSTR,TH_MD5)

    SELECT

    TS.TH_NSTR

    ,TS.TH_MD5

    FROM dbo.TBL_TEST_HASH_SAMPLE TS;

    INSERT INTO @timer(T_TXT) VALUES ('INSERT');

    INSERT INTO @timer(T_TXT) VALUES ('CALC COL');

    INSERT INTO dbo.TBL_TEST_HASH_CALC_COL(TH_NSTR)

    SELECT

    TS.TH_NSTR

    FROM dbo.TBL_TEST_HASH_SAMPLE TS;

    INSERT INTO @timer(T_TXT) VALUES ('CALC COL');

    -- ROUND EIGHT

    INSERT INTO @timer(T_TXT) VALUES ('DRY RUN');

    SELECT

    @NVARCHAR_BUCKET = TS.TH_NSTR

    ,@BINARY_BUCKET = TS.TH_MD5

    FROM dbo.TBL_TEST_HASH_SAMPLE TS

    INSERT INTO @timer(T_TXT) VALUES ('DRY RUN');

    INSERT INTO @timer(T_TXT) VALUES ('INSERT');

    INSERT INTO dbo.TBL_TEST_HASH(TH_NSTR,TH_MD5)

    SELECT

    TS.TH_NSTR

    ,TS.TH_MD5

    FROM dbo.TBL_TEST_HASH_SAMPLE TS;

    INSERT INTO @timer(T_TXT) VALUES ('INSERT');

    INSERT INTO @timer(T_TXT) VALUES ('CALC COL');

    INSERT INTO dbo.TBL_TEST_HASH_CALC_COL(TH_NSTR)

    SELECT

    TS.TH_NSTR

    FROM dbo.TBL_TEST_HASH_SAMPLE TS;

    INSERT INTO @timer(T_TXT) VALUES ('CALC COL');

    SELECT

    T.T_TXT

    ,DATEDIFF(MICROSECOND,MIN(T.T_TS),MAX(T.T_TS)) AS DURATION

    FROM @timer T

    GROUP BY T.T_TXT

    ORDER BY DURATION;

    Results from B on 2nd Gen i5, 8Gb ram, 3rd Gen SSD, SQL Server 2016 Dev Ed.

    T_TXT DURATION

    ---------- -----------

    INSERT 41015984

    DRY RUN 43208396

    CALC COL 49214666

  • joejuska (7/30/2016)


    No I DO NOT WANT TO HASH ONLY THE FIRST 50 CHARACTERS. I WANT TO HASH THE ENTIRE FILE BUT HAVE THE RESULTANT FIELD NVARCHAR(50) OR CHAR(50) INSTEAD OF SSMS MAKING THE RESULTANT FIELD NVARCHAR(MAX)

    How large is the input (in characters)? Do you understand fully the hashbyte function? The code you posted

    left(hashbytes('MD5',[ASCII File]), 50) does not reflect that.

    😎

  • joejuska (7/30/2016)


    No I DO NOT WANT TO HASH ONLY THE FIRST 50 CHARACTERS. I WANT TO HASH THE ENTIRE FILE BUT HAVE THE RESULTANT FIELD NVARCHAR(50) OR CHAR(50) INSTEAD OF SSMS MAKING THE RESULTANT FIELD NVARCHAR(MAX)

    Instead of shouting at people who are trying to help, you might want to try actually paying attention to what they say.

    The error you see is not caused by the insert you are doing (easy to see if you just run the SELECT without the INSERT). It is caused by the call to the HASHBYTES function itself. Eirikur already told you that in his first contribution to this thread. The input for the HASHBYTES function cannot exceed 4000 characters for unicode or 8000 character for single-byte string data. Easy to veryfy:

    DECLARE @x nvarchar(max);

    SET @x = REPLICATE (CAST('x' AS varchar(max)), 4001);

    SELECT HASHBYTES('MD5', LEFT(@x, 4000)); -- Returns a 16-byte binary

    SELECT HASHBYTES('MD5', @x); -- Error

    The fix? Either follow one of the options Erikur posted (1: Use CLR; 2: Build T-SQL logic to chop the string, concatenate the hases, rinse and repeat) or accept that it cannot be done and hash only the first 4000 characters.


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • joejuska (7/30/2016)


    You said "Quick questions, on what do you base that assumption?" I can't directly answer your question because YOUR QUESTION is wrong. IT'S NOT AN ASSUMPTION BUT AN OBSERVATION. The observation over many years of doing it both ways and any difference (I assume there must be some overhead that would make it theoretically slower) are so small as to be undetectable with tens of thousands of inserts per day. Both methods yield an instantaneous result as perceived by the human users brain.

    Of course. That's because you are testing the wrong thing. The performance hit of a computed column does not occur on insert, it occurs when the data is queried.

    When this computed column is never or hardly ever queried, then a computed column will not hurt you. But when it's queried very often, you will take a hit. And maybe even that hit will not be perceived by the human brain - but it will increase the total server load.

    If you really want to benchmark this, then create a test table with three columns: data, computed column (using hashbytes), and persisted column (storing that same hashbytes result). Fill it with at least ten million or so rows. Now run two queries: one for MAX(computed column) and the other one for MAX(persisted column). Run each twice and time only the second (to make sure both queries are timed from cache). Measure the duraction for each, but also keep an eye on the CPU usage wihile each is running.

    You said "Is there a problem calculating the hash on insert?" Yes. There are numerous (unknown number and unknown type) ways users can input the data so we have no control over what is insert so the only way to get a hash is with a calculated field or running a process after an insert.

    In that case I would suggest using a persisted computed column. Defined the same way as a normal computed column with the additional keyword PERSISTED. SQL Server will compute the value when the row is inserted and updated and the store it; queries will use the stored value.


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • joejuska (7/30/2016)


    No I DO NOT WANT TO HASH ONLY THE FIRST 50 CHARACTERS. I WANT TO HASH THE ENTIRE FILE BUT HAVE THE RESULTANT FIELD NVARCHAR(50) OR CHAR(50) INSTEAD OF SSMS MAKING THE RESULTANT FIELD NVARCHAR(MAX)

    If you call HASHBYTES with MD5 as the first argument and a valid string as the second the result returned will be exactly 128 bits long, so your problem is obviously nothing to do with the output length, you must be providing an invalid second argument - for example a VARCHAR string whose length has to be given as MAX, because it won't fit in anything shorter.

    Instead of making offensive attacks on people trying to help you (which seems to be what you do most often, judging by this thread) you might find it useful to read the documentation provided by Microsoft.

    Tom

Viewing 15 posts - 1 through 14 (of 14 total)

You must be logged in to reply to this topic. Login to reply