Suggestion/solution for using Hashbytes across entire table

  • Kim Crosser

    SSCommitted

    Points: 1763

    One of the nice features of the CHECKSUM function is that it can trivially be applied against an entire table row, such as:

    insert into MyCheckTable (RefID, Cksum)

    select RefID, CHECKSUM(*)

    from MySourceTable;

    This will quickly give you a table containing individual checksums across the entire row data in the source table.

    However, CHECKSUM isn't all that great for detecting changes to the source table. It will completely miss changes in capitalization (although BINARY_CHECKSUM will fix that), and it also has problems with repetitive strings of more than 8 characters.
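For example, under a typical case-insensitive collation the two CHECKSUM calls below return the same value, while BINARY_CHECKSUM distinguishes them (results are collation-dependent, so treat this as a sketch):

    select CHECKSUM('hello'), CHECKSUM('HELLO');               -- same value under a CI collation

    select BINARY_CHECKSUM('hello'), BINARY_CHECKSUM('HELLO'); -- different values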

    I would prefer to use HASHBYTES instead, but does anyone have a good way to apply HASHBYTES across an entire table row without explicitly enumerating, converting, and concatenating the individual table columns? I.e., is there a good (efficient) way to do this as a generic function without resorting to dynamic SQL? I know how to code this to generate dynamic SQL or to hard-code this for a specific table, but does anyone have a better way to create what would effectively be a generic solution for the following?

    insert into MyCheckTable (RefID, HashValue)

    select RefID, HASHBYTES('sha1', <concatenated varbinary(8000) row data here>)

    from MySourceTable;

    BTW - none of my tables of interest have any [N]Varchar(max) or varbinary columns, nor do any of them contain anywhere close to 8000 characters in a single row, so a solution for me does not have to worry about those conditions.

    Note - the solution should include all the columns in the resulting value passed to the HASHBYTES function, including the primary key columns (the "RefID" in the example above).

  • Eric M Russell

    SSC Guru

    Points: 124938

The HASHBYTES() function doesn't accept a multiple-column list the way CHECKSUM() does. However, a while back I found a workaround: concatenate the columns after casting each one to the varbinary datatype.

    For example:

    print hashbytes('MD5',cast('John' as varbinary(99)) + cast('Doe' as varbinary(99)));

    0x9FD9F63E0D6487537569075DA85A0C7F

    You can also add a computed column that consists of a hash of multiple columns like so:

    create table Employee

    (

    EmployeeID int not null,

    EmployeeFName varchar(30) not null,

    EmployeeLName varchar(30) not null,

    EmployeeHash as hashbytes('MD5',cast(EmployeeFName as varbinary(99)) + cast(EmployeeLName as varbinary(99))) persisted

    );

    insert into Employee

    values (1,'John','Wilkerson'), (2,'Susan','Phelps-Jamseson');

    select * from Employee;

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Kim Crosser

    SSCommitted

    Points: 1763

    As a solution for a single table (where the column list will never change), I can easily write a static expression to convert and concatenate the fields. I have already done that and tested that expression. BTW - to handle Null values, I use:

    declare @varbinnull varbinary(1) = convert(varbinary(1), '');

    ...

    + IsNull(Convert(varbinary(8000),<column_name>),@varbinnull)

    This allows concatenating all the fields with or without null values.
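Putting those pieces together for a concrete (hypothetical) table, the hard-coded version looks something like this - the column names here are placeholders, not real columns:

    declare @varbinnull varbinary(1) = convert(varbinary(1), '');

    insert into MyCheckTable (RefID, HashValue)
    select RefID,
           HASHBYTES('sha1',
               IsNull(Convert(varbinary(8000), RefID), @varbinnull)
               + IsNull(Convert(varbinary(8000), SomeColumn1), @varbinnull)
               + IsNull(Convert(varbinary(8000), SomeColumn2), @varbinnull))
    from MySourceTable;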

    For a single-table approach I can easily generate the concatenated list as:

    select '+ IsNull(Convert(varbinary(8000),[' + isc.Column_name + ']),@varbinnull)'

    from INFORMATION_SCHEMA.COLUMNS isc

    where isc.Table_Name = '<mytable>'

    order by isc.Ordinal_Position

    However, I am looking for a solution that doesn't need to be re-coded if a new column is added to the table (for one example).

  • Eric M Russell

    SSC Guru

    Points: 124938

    Kim Crosser (10/6/2015)


    ...

    However, I am looking for a solution that doesn't need to be re-coded if a new column is added to the table (for one example).

    If you make the hash a computed column, then the hash expression (column list) is declared in one location, and the hash expression can be altered within the same DDL script that adds the new column. However, adding a new column will invalidate the hash for all rows on the table.
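For example, adding a new column and updating the hash definition in one script might look like the following - a sketch based on the Employee table above (a computed column must be dropped and re-added to change its expression; the isnull wrapper is needed because the new column is nullable):

    alter table Employee drop column EmployeeHash;

    alter table Employee add EmployeeMI varchar(30) null;

    alter table Employee add EmployeeHash as
        hashbytes('MD5', cast(EmployeeFName as varbinary(99))
            + cast(EmployeeLName as varbinary(99))
            + isnull(cast(EmployeeMI as varbinary(99)), 0x)) persisted;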

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Kim Crosser

    SSCommitted

    Points: 1763

    To make the situation a little more concrete...

    I have developed a lot of data conversion T-SQL to convert data from a third-party database, where in their tables there are NO reliable fields to detect changes. I cannot alter those tables.

    The data conversion requires about 3 days to run a full conversion (on over 30 years of historical data including transactional histories, all of which must be converted), but I have designed it so that as long as I can detect "changes" to the source data tables, an incremental update a month later can run in 10-15 minutes.

    With this approach, the customer can take one snapshot of their current live system, run a full conversion into their new system (which is also live for a lot of other functional components), validate all the converted data (over a few weeks or so), then - when they are confident they are ready - we shut down their old system, copy the tables in their current states, and run the incremental conversion. That means they are only down for an hour or two, instead of shutting down a 24/7 operation for 3+ days. (Plus - should anything be found in the testing after the initial conversion, we can correct that and re-run if necessary.)

    Also - if needed, we can do this more than once. We can do the full convert, test, an incremental convert, test, another incremental convert, etc.

    So - I am using parallel tables that contain the primary keys of the source tables and a computed checksum, used to detect changes to the source tables since the last conversion run.

    The CHECKSUM function is *probably* sufficient to detect virtually all these changes, and has worked so far (that we can detect), but I would like to eliminate the "*probably*" part. 🙂 Plus, in future projects I would like to have another arrow in my quiver of useful techniques...

  • Eric M Russell

    SSC Guru

    Points: 124938

    Kim Crosser (10/6/2015)


    ...

    So - I am using parallel tables that contain the primary keys of the source tables and a computed checksum, used to detect changes to the source tables since the last conversion run.

    The CHECKSUM function is *probably* sufficient to detect virtually all these changes, and has worked so far (that we can detect), but I would like to eliminate the "*probably*" part. 🙂 Plus, in future projects I would like to have another arrow in my quiver of useful techniques...

    The datatype of CHECKSUM is a 4 byte Integer, which isn't selective enough to be used as a reliable hash across all columns in a 200+ byte wide record, especially if there are unique integer columns within the record. Years ago, I attempted to leverage CHECKSUM(*) as a quick and dirty means of identifying duplicate records within large tables and found it returning too many false positives.

For what you're doing, you'll want to use the HASHBYTES() function with at least the MD5 (16 byte hash) algorithm. SHA2 can go up to a 64 byte hash, but of course the point of using a hash is to conserve space in the reference tables and provide fast lookups. However, there is no option for doing something like HASHBYTES(*) across all columns, so you'll have to include the full list of columns to hash across.

    But that shouldn't be a problem, because I would think that the column structure for tables in the legacy database should be static at this point. So you're only coding this ETL step once for each table, right?

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • mister.magoo

    SSC-Forever

    Points: 47068

I don't know whether your setup would suit this, but how about converting to XML, then using HASHBYTES?

    select MBT.refID,

    hashbytes('MD5',(select MBT.* from (values(null))foo(bar) for xml auto)) as [Hash]

    from MyBaseTable as MBT;

    You need the "from (values(null))foo(bar)" clause to use xml auto, it serves no other purpose.
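To see what is actually being hashed, you can run the inner subquery by itself. FOR XML AUTO emits each row as a single element named after the table alias, with the columns as attributes, and NULL columns are simply omitted - something like this for a row (refID = 1, Name = 'John'), though the exact shape depends on your columns:

    select MBT.refID,
           (select MBT.* from (values(null))foo(bar) for xml auto) as [RowXml]
    from MyBaseTable as MBT;

    -- RowXml for the example row: <MBT refID="1" Name="John"/>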

    MM



    select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

  • Forum Etiquette: How to post Reporting Services problems
  • Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
  • How to Post Performance Problems - by Gail Shaw

  • Eric M Russell

    SSC Guru

    Points: 124938

    mister.magoo (10/6/2015)


I don't know whether your setup would suit this, but how about converting to XML, then using HASHBYTES?

    select MBT.refID,

    hashbytes('MD5',(select MBT.* from (values(null))foo(bar) for xml auto)) as [Hash]

    from MyBaseTable as MBT;

    You need the "from (values(null))foo(bar)" clause to use xml auto, it serves no other purpose.

    Bravo, Mr. Magoo!

Leveraging FOR XML to serialize all the columns in each row is brilliant. When I test this against a large table, the execution plan shows that 90% of the cost is the table scan, and the UDX operator (the part that serializes the XML) is only 2% of the cost. So it appears we can compute an MD5 or SHA hash dynamically across all columns in a row.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • mister.magoo

    SSC-Forever

    Points: 47068

    Eric M Russell (10/7/2015)


...

    Bravo, Mr. Magoo!

Leveraging FOR XML to serialize all the columns in each row is brilliant. When I test this against a large table, the execution plan shows that 90% of the cost is the table scan, and the UDX operator (the part that serializes the XML) is only 2% of the cost. So it appears we can compute an MD5 or SHA hash dynamically across all columns in a row.

    Yes, I was surprised how well it worked, and just think of all the things we couldn't do without XML support in SQL.

    MM




  • Eric M Russell

    SSC Guru

    Points: 124938

    mister.magoo (10/7/2015)


...

    Yes, I was surprised how well it worked, and just think of all the things we couldn't do without XML support in SQL.

    I just tested this against a 15 million row table having 115 columns and found no hash collisions, so it appears solid.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • mister.magoo

    SSC-Forever

    Points: 47068

    Eric M Russell (10/7/2015)


...

    I just tested this against a 15 million row table having 115 columns and found no hash collisions, so it appears solid.

    Good to know, thanks for testing that.

    MM




  • Kim Crosser

    SSCommitted

    Points: 1763

    mister.magoo (10/6/2015)


I don't know whether your setup would suit this, but how about converting to XML, then using HASHBYTES?

    select MBT.refID,

    hashbytes('MD5',(select MBT.* from (values(null))foo(bar) for xml auto)) as [Hash]

    from MyBaseTable as MBT;

    You need the "from (values(null))foo(bar)" clause to use xml auto, it serves no other purpose.

    Brilliant! This is exactly why I posted here. I was hoping someone on the forum would come up with a new approach.

    I would never have thought of converting to XML as part of the method.

    Thank you! I have marked your post as the solution.

  • mister.magoo

    SSC-Forever

    Points: 47068

    Kim Crosser (10/7/2015)


    Brilliant! This is exactly why I posted here. I was hoping someone on the forum would come up with a new approach.

    I would never have thought of converting to XML as part of the method.

    Thank you! I have marked your post as the solution.

    Most welcome, I enjoyed the challenge.

    MM




  • Kim Crosser

    SSCommitted

    Points: 1763

    For anyone interested, I ran several tests on one of the actual data tables involved, containing 1,526,652 rows of data with 49 columns, containing an average of approximately 470 bytes of data per row. (Definitely NOT my design - this is a legacy third-party database from which I am converting data.)

The hard-coded expression (convert, IsNull, and concatenation) ran in ~44 seconds, while Mister Magoo's generic solution completed in ~66 seconds (I did several runs; all were within 1-2 seconds of these numbers).

Given the amount of additional text and overhead generated by the conversion to XML (all the column name tags), that is very good performance. 🙂

    I am really impressed with this approach for a few reasons:

    1. It is generic and can be applied to ANY table without having to generate hard-coded expressions (tables with [n]varchar(max) or varbinary(max) columns may need to be truncated at 8000 characters)

    2. It avoids the possibility of accidentally "missing" one of the columns in the hard-coded expressions

3. If a column is added to a table later, this will automatically produce a different hash, whereas the hard-coded expressions would silently miss new columns

4. It handles a subtle (if unlikely) data pattern that the hard-coded approach could miss - Null columns in the hard-coded version contribute zero bytes, so the column sequences ('a'), (null,'a'), and ('a',null) would all yield the same hash value (easily solvable by adding column delimiters to the hard-coded expression, but now not needed)

    This is a very useful code pattern I expect to use in more than just this project. 😎
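The pattern in item 4 is easy to reproduce with the concatenation approach (illustrative only):

    declare @varbinnull varbinary(1) = convert(varbinary(1), '');

    -- ('a', null) and (null, 'a') concatenate to the same bytes (0x61),
    -- so HASHBYTES over the concatenation cannot tell them apart:
    select IsNull(Convert(varbinary(8000), 'a'), @varbinnull)
           + IsNull(Convert(varbinary(8000), null), @varbinnull) as [a_then_null],
           IsNull(Convert(varbinary(8000), null), @varbinnull)
           + IsNull(Convert(varbinary(8000), 'a'), @varbinnull) as [null_then_a];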

  • mister.magoo

    SSC-Forever

    Points: 47068

    Further "interest" comes from comparing XML to JSON in SQL 2016.

context   avg_duration (ms)   avg_cpu_time (ms)

xml       4147.091600         3987.400000

json      4406.513600         4334.400000

    This was a test of 1 million rows from a narrow table (3 small columns), run 5 times for each query, with

    checkpoint;

    dbcc dropcleanbuffers;

    dbcc freeproccache;

    ..between each test (i.e. twice)

That appears to show roughly a 6% win for XML over JSON in duration and an 8% win in CPU time for this test.
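For anyone wanting to reproduce the comparison, the JSON side would look something like the query below - a sketch only, assuming SQL Server 2016's FOR JSON PATH with WITHOUT_ARRAY_WRAPPER (this is not necessarily the exact query used in the test above):

    select MBT.refID,
           hashbytes('MD5', (select MBT.* for json path, without_array_wrapper)) as [Hash]
    from MyBaseTable as MBT;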

    MM




    Viewing 15 posts - 1 through 15 (of 16 total)
