Suggestion/solution for using Hashbytes across entire table
Kim Crosser (SSCommitted)
One of the nice features of the CHECKSUM function is that it can trivially be applied against an entire table row, such as:
insert into MyCheckTable (RefID, Cksum)
select RefID, CHECKSUM(*)
from MySourceTable;


This will quickly give you a table containing individual checksums across the entire row data in the source table.
However, CHECKSUM isn't all that great for detecting changes to the source table. It will completely miss changes in capitalization (although BINARY_CHECKSUM will fix that), and it also has problems with repetitive strings of more than 8 characters.
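For example (assuming a case-insensitive database collation, which is the common default), CHECKSUM treats a change of case as "no change" while BINARY_CHECKSUM does not:

select CHECKSUM('Hello World')        as cs_mixed,
       CHECKSUM('hello world')        as cs_lower,        -- same value as cs_mixed
       BINARY_CHECKSUM('Hello World') as bcs_mixed,
       BINARY_CHECKSUM('hello world') as bcs_lower;       -- differs from bcs_mixed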

I would prefer to use HASHBYTES instead, but does anyone have a good way to apply HASHBYTES across an entire table row without explicitly enumerating, converting, and concatenating the individual columns? In other words, is there an efficient, generic way to do this without resorting to dynamic SQL? I know how to generate dynamic SQL for it, or to hard-code it for a specific table, but is there a better way to build what would effectively be a generic solution for the following?

insert into MyCheckTable (RefID, HashValue)
select RefID, HASHBYTES('sha1', <concatenated varbinary(8000) row data here>)
from MySourceTable;



BTW - none of my tables of interest have any [N]Varchar(max) or varbinary columns, nor do any of them contain anywhere close to 8000 characters in a single row, so a solution for me does not have to worry about those conditions.

Note - the solution should include all the columns in the resulting value passed to the HASHBYTES function, including the primary key columns (the "RefID" in the example above).
Eric M Russell (SSC Guru)
The HASHBYTES() function doesn't support a multiple-column list the way CHECKSUM() does; however, a while back I found a workaround: append the columns together after casting each one to the VarBinary datatype.

For example:

print hashbytes('MD5',cast('John' as varbinary(99)) + cast('Doe' as varbinary(99)));

0x9FD9F63E0D6487537569075DA85A0C7F



You can also add a computed column that consists of a hash of multiple columns like so:


create table Employee
(
EmployeeID int not null,
EmployeeFName varchar(30) not null,
EmployeeLName varchar(30) not null,
EmployeeHash as hashbytes('MD5',cast(EmployeeFName as varbinary(99)) + cast(EmployeeLName as varbinary(99))) persisted
);

insert into Employee
values (1,'John','Wilkerson'), (2,'Susan','Phelps-Jamseson');

select * from Employee;




"The universe is complicated and for the most part beyond your control, but your life is only as complicated as you choose it to be."
Kim Crosser (SSCommitted)
As a solution for a single table (where the column list will never change), I can easily write a static expression to convert and concatenate the fields. I have already done that and tested that expression. BTW - to handle Null values, I use:
declare @varbinnull varbinary(1) = convert(varbinary(1), '');
...
+ IsNull(Convert(varbinary(8000),<column_name>),@varbinnull)


This allows concatenating all the fields with or without null values.
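Spelled out for a hypothetical source table with a couple of extra columns (CustomerName and LastUpdated are just illustrative names), the static version ends up looking like this:

declare @varbinnull varbinary(1) = convert(varbinary(1), '');

insert into MyCheckTable (RefID, HashValue)
select RefID,
       HASHBYTES('sha1',
             IsNull(Convert(varbinary(8000), RefID), @varbinnull)
           + IsNull(Convert(varbinary(8000), CustomerName), @varbinnull)
           + IsNull(Convert(varbinary(8000), LastUpdated), @varbinnull))
from MySourceTable;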

For a single-table approach I can easily generate the concatenated list as:
select '+ IsNull(Convert(varbinary(8000),[' + isc.Column_name + ']),@varbinnull)'
from INFORMATION_SCHEMA.COLUMNS isc
where isc.Table_Name = '<mytable>'
order by isc.Ordinal_Position


However, I am looking for a solution that doesn't need to be re-coded if a new column is added to the table (for one example).
Eric M Russell (SSC Guru)
Kim Crosser (10/6/2015)

...
However, I am looking for a solution that doesn't need to be re-coded if a new column is added to the table (for one example).


If you make the hash a computed column, then the hash expression (the column list) is declared in one place, and it can be altered within the same DDL script that adds the new column. However, adding a new column will invalidate the hash for all existing rows in the table.
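For example, extending the Employee table from the earlier post with a hypothetical EmployeeMI column, the new column and the refreshed hash definition can live in one script (a sketch, not a tested migration):

-- drop the old hash definition, add the new column, then re-add the hash including it
alter table Employee drop column EmployeeHash;

alter table Employee add EmployeeMI char(1) null;

alter table Employee add EmployeeHash as
    hashbytes('MD5',
          cast(EmployeeFName as varbinary(99))
        + cast(EmployeeLName as varbinary(99))
        + cast(isnull(EmployeeMI, '') as varbinary(99))) persisted;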


"The universe is complicated and for the most part beyond your control, but your life is only as complicated as you choose it to be."
Kim Crosser (SSCommitted)
To make the situation a little more concrete...

I have developed a lot of data conversion T-SQL to convert data from a third-party database whose tables have NO reliable fields for detecting changes. I cannot alter those tables.

The data conversion requires about 3 days to run a full conversion (on over 30 years of historical data including transactional histories, all of which must be converted), but I have designed it so that as long as I can detect "changes" to the source data tables, an incremental update a month later can run in 10-15 minutes.

With this approach, the customer can take one snapshot of their current live system, run a full conversion into the new system (which is also live for a lot of other functional components), and validate all the converted data over a few weeks. Then, when they are confident they are ready, we shut down the old system, copy the tables in their current state, and run the incremental conversion. That means they are only down for an hour or two, instead of shutting down a 24/7 operation for 3+ days. (Plus, should anything be found in testing after the initial conversion, we can correct it and re-run if necessary.)

Also - if needed, we can do this more than once. We can do the full convert, test, an incremental convert, test, another incremental convert, etc.

So - I am using parallel tables that contain the primary keys of the source tables and a computed checksum, used to detect changes to the source tables since the last conversion run.

The CHECKSUM function is *probably* sufficient to detect virtually all these changes, and has worked so far (that we can detect), but I would like to eliminate the "*probably*" part. :-) Plus, in future projects I would like to have another arrow in my quiver of useful techniques...
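To make the mechanics concrete, the detection step amounts to something like this (reusing the names from the first post; a sketch rather than the actual conversion code):

select s.RefID
from (
    select RefID, CHECKSUM(*) as Cksum
    from MySourceTable
) as s
left join MyCheckTable as c
    on c.RefID = s.RefID
where c.RefID is null           -- row added since the last conversion run
   or c.Cksum <> s.Cksum;       -- row changed since the last conversion run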
Eric M Russell (SSC Guru)
Kim Crosser (10/6/2015)

...

So - I am using parallel tables that contain the primary keys of the source tables and a computed checksum, used to detect changes to the source tables since the last conversion run.

The CHECKSUM function is *probably* sufficient to detect virtually all these changes, and has worked so far (that we can detect), but I would like to eliminate the "*probably*" part. :-) Plus, in future projects I would like to have another arrow in my quiver of useful techniques...


CHECKSUM returns a 4-byte integer, which isn't selective enough to be used as a reliable hash across all columns of a 200+ byte wide record, especially if there are unique integer columns within the record. Years ago, I attempted to leverage CHECKSUM(*) as a quick and dirty means of identifying duplicate records within large tables and found it returning too many false positives.

For what you're doing, you'll want to use the HASHBYTES() function with at least the MD5 (16-byte hash) algorithm. SHA2 can go up to a 64-byte hash, but of course the point of using a hash is to conserve space in the reference tables and provide fast lookups. However, there is no option for doing something like HASHBYTES(*) across all columns, so you'll have to include the full list of columns across which to hash.
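For reference, the hash widths are easy to confirm (the SHA2 algorithms require SQL Server 2012 or later):

select datalength(hashbytes('MD5',      'sample row data')) as md5_bytes,      -- 16
       datalength(hashbytes('SHA1',     'sample row data')) as sha1_bytes,     -- 20
       datalength(hashbytes('SHA2_512', 'sample row data')) as sha2_512_bytes; -- 64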

But that shouldn't be a problem, because I would think that the column structure for tables in the legacy database should be static at this point. So you're only coding this ETL step once for each table, right?


"The universe is complicated and for the most part beyond your control, but your life is only as complicated as you choose it to be."
mister.magoo (SSC-Forever)
I don't know whether your setup would suit this, but how about converting the row to XML, then using hashbytes?

select MBT.refID, 
hashbytes('MD5',(select MBT.* from (values(null))foo(bar) for xml auto)) as [Hash]
from MyBaseTable as MBT;



You need the "from (values(null))foo(bar)" clause to use xml auto; it serves no other purpose.
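For instance, here is a sketch of how it could slot into the check-table load from the first post (assuming HashValue is varbinary(16) for MD5, and that the serialized XML for a row stays under the 8000-byte HASHBYTES input limit on pre-2016 versions):

insert into MyCheckTable (RefID, HashValue)
select MBT.RefID,
       hashbytes('MD5', (select MBT.* from (values(null))foo(bar) for xml auto))
from MySourceTable as MBT;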

MM



Eric M Russell (SSC Guru)
    mister.magoo (10/6/2015)
    I don't know whether your setup would suit this, but how about converting the row to XML, then using hashbytes?

    select MBT.refID, 
    hashbytes('MD5',(select MBT.* from (values(null))foo(bar) for xml auto)) as [Hash]
    from MyBaseTable as MBT;



    You need the "from (values(null))foo(bar)" clause to use xml auto; it serves no other purpose.


    Bravo, Mr. Magoo!
    Leveraging XQuery to serialize all the columns in each row is brilliant. When I test this against a large table, the execution plan shows that 90% of the cost is the table scan and the UDX operator (the part that serializes the XML) is only 2% of the cost. So it appears we can compute an MD5 or SHA hash dynamically across all columns in a row.


    "The universe is complicated and for the most part beyond your control, but your life is only as complicated as you choose it to be."
mister.magoo (SSC-Forever)
    Eric M Russell (10/7/2015)

    Bravo, Mr. Magoo!
    Leveraging XQuery to serialize all the columns in each row is brilliant. When I test this against a large table, the execution plan shows that 90% of the cost is the table scan and the UDX operator (the part that serializes the XML) is only 2% of the cost. So it appears we can compute an MD5 or SHA hash dynamically across all columns in a row.


    Yes, I was surprised how well it worked, and just think of all the things we couldn't do without XML support in SQL.


Eric M Russell (SSC Guru)
    mister.magoo (10/7/2015)

    Yes, I was surprised how well it worked, and just think of all the things we couldn't do without XML support in SQL.


    I just tested this against a 15 million row table having 115 columns and found no hash collisions, so it appears solid.
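    For anyone wanting to run the same sanity check, something along these lines will surface any hash shared by more than one key (illustrative names; since RefID is part of the hashed data, any hash that repeats across distinct RefIDs is a genuine collision):

    select HashValue, count(*) as RowsWithThisHash
    from MyCheckTable
    group by HashValue
    having count(*) > 1;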


    "The universe is complicated and for the most part beyond your control, but your life is only as complicated as you choose it to be."
    Go


    Permissions

    You can't post new topics.
    You can't post topic replies.
    You can't post new polls.
    You can't post replies to polls.
    You can't edit your own topics.
    You can't delete your own topics.
    You can't edit other topics.
    You can't delete other topics.
    You can't edit your own posts.
    You can't edit other posts.
    You can't delete your own posts.
    You can't delete other posts.
    You can't post events.
    You can't edit your own events.
    You can't edit other events.
    You can't delete your own events.
    You can't delete other events.
    You can't send private messages.
    You can't send emails.
    You can read topics.
    You can't vote in polls.
    You can't upload attachments.
    You can download attachments.
    You can't post HTML code.
    You can't edit HTML code.
    You can't post IFCode.
    You can't post JavaScript.
    You can post emoticons.
    You can't post or upload images.

    Select a forum









































































































































































    SQLServerCentral


    Search