﻿<?xml version='1.0' encoding='UTF-8'?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/"><channel><title>SQLServerCentral / Article Discussions / Discuss content posted by magarity kerns / Article Discussions by Author  / HASHBYTES can help quickly load a Data Warehouse / Latest Posts</title><generator>InstantForum.NET v2.9.0</generator><description>SQLServerCentral</description><link>http://www.sqlservercentral.com/Forums/</link><webMaster>notifications@sqlservercentral.com</webMaster><lastBuildDate>Thu, 20 Jun 2013 04:39:41 GMT</lastBuildDate><ttl>20</ttl><item><title>RE: HASHBYTES can help quickly load a Data Warehouse</title><link>http://www.sqlservercentral.com/Forums/Topic902868-507-1.aspx</link><description>!!False Alarm!!I feel like such an idiot. I'm almost embarrassed to admit this, but out of courtesy, I thought I would give an update on this in case anyone else makes the oversight I did.You guessed it. The ANSI_PADDING was turned off!! Doh!! This was a newly created database. Rookie mistake. Luckily, worse case, our data warehouse will have a few thousand phantom type2 updates which won't cause any noise, save some space being used up for nothing.Thanks to all for taking the time to read this. I'm gonna go dig a whole and stick my head in it for a while!</description><pubDate>Mon, 10 Oct 2011 12:47:23 GMT</pubDate><dc:creator>Langston Montgomery</dc:creator></item><item><title>RE: HASHBYTES can help quickly load a Data Warehouse</title><link>http://www.sqlservercentral.com/Forums/Topic902868-507-1.aspx</link><description>[quote][b]L' Eomot Inversé (10/9/2011)[/b][hr][quote][b]Langston Montgomery (10/8/2011)[/b][hr]Has anyone come across cases where SHA1 has produced output longer than 20 bytes, because I think I have. We are using HASHBYTES to detect changes in our data warehouse, and we were getting type 2 changes, even though all the columns values were exactly the same between the most current and previous rows; even the hashbyte value looked the same!But, upon looking closer, the HASHBYTE function was returning this: 0xEE0816119692A1C4DCC2045EA1E37CD47EFF49[b]00[/b]But, the value in our varbinary(20) column in the table was storing this: 0xEE0816119692A1C4DCC2045EA1E37CD47EFF49[/quote]I don't think "longer that 20 bytes" is relevant, as the number ending 00 is 20 bytes (the one without 00 is only 19).[/quote]You're right, L'Eomot. I wrote that wrong. What I was trying to say was that HASHBYTE was returning a 20 byte length value, but when SQL wrote it somehow it got truncated to 19.[quote][quote]Notice the two zeros at the end of the function's output. For some reason, when SQL writes the output from the function to the table it truncates those last two "characters" at the end!We have no idea why it does that, but when we increase the varbinary column in the table from 20 to 40, SQL doesn't truncate the output anymore, and all is well.Anyway, if anyone has any insight on why SQL seems to be truncating the output from the function as it writes to the varbinary(20) column, please let me know, I'm curious![/quote]I tried a few things to see if I could reproduce this behaviour, using hashbytes to generate values for local variables, for columns in table variable, for columns in temp tables, and for columns in premanent tables, but got no truncation ever when using varchar(20).  Maybe I'm on a different version - I'm using SQL 2008 R2.[/quote]We're using SQL 2008 R2 and SSIS 2008. We are also using the OLE DST component with fast load, nolock, and don't check constraints to load the type2 change to the table. Wanna read something funny?... I reduced the varbinary column from 40 back to 20 expecting SQL to truncate it again... and it doesn't! It's almost like something had to be reset. I thought it was ANSI_PADDING or something, SSIS wouldn't turn that off... I don't think.Anyway, thanks for your response, L'Eomot. I don't like when things work without knowing what fixed it, but we're good for now!</description><pubDate>Sun, 09 Oct 2011 14:44:32 GMT</pubDate><dc:creator>Langston Montgomery</dc:creator></item><item><title>RE: HASHBYTES can help quickly load a Data Warehouse</title><link>http://www.sqlservercentral.com/Forums/Topic902868-507-1.aspx</link><description>[quote][b]Langston Montgomery (10/8/2011)[/b][hr]Has anyone come across cases where SHA1 has produced output longer than 20 bytes, because I think I have. We are using HASHBYTES to detect changes in our data warehouse, and we were getting type 2 changes, even though all the columns values were exactly the same between the most current and previous rows; even the hashbyte value looked the same!But, upon looking closer, the HASHBYTE function was returning this: 0xEE0816119692A1C4DCC2045EA1E37CD47EFF49[b]00[/b]But, the value in our varbinary(20) column in the table was storing this: 0xEE0816119692A1C4DCC2045EA1E37CD47EFF49[/quote]I don't think "longer that 20 bytes" is relevant, as the number ending 00 is 20 bytes (the one without 00 is only 19).[quote]Notice the two zeros at the end of the function's output. For some reason, when SQL writes the output from the function to the table it truncates those last two "characters" at the end!We have no idea why it does that, but when we increase the varbinary column in the table from 20 to 40, SQL doesn't truncate the output anymore, and all is well.Anyway, if anyone has any insight on why SQL seems to be truncating the output from the function as it writes to the varbinary(20) column, please let me know, I'm curious![/quote]I tried a few things to see if I could reproduce this behaviour, using hashbytes to generate values for local variables, for columns in table variable, for columns in temp tables, and for columns in premanent tables, but got no truncation ever when using varchar(20).  Maybe I'm on a different version - I'm using SQL 2008 R2.</description><pubDate>Sun, 09 Oct 2011 09:38:14 GMT</pubDate><dc:creator>L' Eomot Inversé</dc:creator></item><item><title>RE: HASHBYTES can help quickly load a Data Warehouse</title><link>http://www.sqlservercentral.com/Forums/Topic902868-507-1.aspx</link><description>Has anyone come across cases where SHA1 has produced output longer than 20 bytes, because I think I have. We are using HASHBYTES to detect changes in our data warehouse, and we were getting type 2 changes, even though all the columns values were exactly the same between the most current and previous rows; even the hashbyte value looked the same!But, upon looking closer, the HASHBYTE function was returning this: 0xEE0816119692A1C4DCC2045EA1E37CD47EFF49[b]00[/b]But, the value in our varbinary(20) column in the table was storing this: 0xEE0816119692A1C4DCC2045EA1E37CD47EFF49Notice the two zeros at the end of the function's output. For some reason, when SQL writes the output from the function to the table it truncates those last two "characters" at the end!We have no idea why it does that, but when we increase the varbinary column in the table from 20 to 40, SQL doesn't truncate the output anymore, and all is well.Anyway, if anyone has any insight on why SQL seems to be truncating the output from the function as it writes to the varbinary(20) column, please let me know, I'm curious![quote][b]L' Eomot Inversé (4/26/2011)[/b][hr][quote][b]wta306 (4/25/2011)[/b][hr]How do you determine what size to make your VARBINARY field to hold the results of the HASHBYTES function?Is it simply a matter of setting it to be equal to the largest HashByte function input value?[code="sql"]DECLARE @vHashResults AS VARBINARY(40)DECLARE @vHashInput01 AS CHAR(20)DECLARE @vHashInput02 AS CHAR(20)SET @vHashInput01 = 'thisisa20bytechar123'SET @vHashInput02 = 'thisisa21bytechar1234'SET @vHashResults = HashBytes('SHA1', @vHashInput01 + @vHashInput02)[/code][/quote]I would be somewhat horrified if HashBytes('SHA1', &amp;lt;anything&amp;gt;) delived a result with any length other than 160 bits (20 bytes)!  That's what MS thinks it delivers too, according to BoL.[/quote]</description><pubDate>Sat, 08 Oct 2011 21:03:35 GMT</pubDate><dc:creator>Langston Montgomery</dc:creator></item><item><title>RE: HASHBYTES can help quickly load a Data Warehouse</title><link>http://www.sqlservercentral.com/Forums/Topic902868-507-1.aspx</link><description>[quote][b]Cyclone (4/27/2011)[/b][hr]Hi,We use SHA1 and store the result as Varbinary(MAX).Do you see any problems with this ?Thanks[/quote]Yes, there is a problem: a varbinary(max) column can't be a key column in an index, so if you want to use your hash in an index you shouldn't use varbinary(max).  That's the only problem I can think of.</description><pubDate>Thu, 28 Apr 2011 05:21:14 GMT</pubDate><dc:creator>L' Eomot Inversé</dc:creator></item><item><title>RE: HASHBYTES can help quickly load a Data Warehouse</title><link>http://www.sqlservercentral.com/Forums/Topic902868-507-1.aspx</link><description>Hi,We use SHA1 and store the result as Varbinary(MAX).Do you see any problems with this ?Thanks</description><pubDate>Wed, 27 Apr 2011 23:04:15 GMT</pubDate><dc:creator>Cyclone</dc:creator></item><item><title>RE: HASHBYTES can help quickly load a Data Warehouse</title><link>http://www.sqlservercentral.com/Forums/Topic902868-507-1.aspx</link><description>[quote][b]Tom.Thomson (4/27/2011)[/b][hr][quote][b]wta306 (4/27/2011)[/b][hr]Unless I'm mistaken, MD5 is only 128 bits (16bytes)...and yet the author of the original article stores the resulting values in a VARBINARY(100)?  What am I missing?[/quote]Yes, MD5 is 16 bytes.  I suspect VARBINARY(100) was used because it will cater for using SHA1 (20 bytes) and, when SHA2 family of hash functions (SHA256 - 32 bytes, SHA512 - 64 bytes, SHA224 -28 bytes, and SHA384 - 48 bytes) are supported VARBINARY(100) will cater for all of them; and maybe it will even cater for SHA3.  Presumably we'll know when NIST announces the winner of the SHA3 design contest next year; I'm pretty sure I've seen something about lengths for SHA3, so maybe we could know already, but I don't recall what so maybe not; I suspect VARBINARY(100) may turn out to be too short to cover all variants of SHA3 (but it will be years before SQL Server supports SHA3). To be certain why VARBINARY(100) was used in the article we'll have to hope for a response from the author; but I'm reasonably happy with my guess that it was intended to cope with known future growth in hash sizes (to 512 bits) and have a little bit to spare.  After all, it has to be at least VARBINARY(20) not BINARY(16) to cope with all hashes supported by SQL Server 2008 unless one uses different data type depending on which hash is being used (which would make changing from one hash to another twice as hard) and there's no extra overhead in using VARBINART(64) or VARBINARY(100) so that it can cover SHA2 instead of restricting it to MDn and SHA1 by using VARBINARY(20).  If I'd been writing something about hashing and wanted to cover future possibilities I might even have chosen VARBINARY(256) in the hope of catering for SHA4 when/if that happens![/quote]wow, ok - that makes a ton of sense.  thank you!!</description><pubDate>Wed, 27 Apr 2011 08:45:02 GMT</pubDate><dc:creator>byoBI</dc:creator></item><item><title>RE: HASHBYTES can help quickly load a Data Warehouse</title><link>http://www.sqlservercentral.com/Forums/Topic902868-507-1.aspx</link><description>[quote][b]wta306 (4/27/2011)[/b][hr]Unless I'm mistaken, MD5 is only 128 bits (16bytes)...and yet the author of the original article stores the resulting values in a VARBINARY(100)?  What am I missing?[/quote]Yes, MD5 is 16 bytes.  I suspect VARBINARY(100) was used because it will cater for using SHA1 (20 bytes) and, when SHA2 family of hash functions (SHA256 - 32 bytes, SHA512 - 64 bytes, SHA224 -28 bytes, and SHA384 - 48 bytes) are supported VARBINARY(100) will cater for all of them; and maybe it will even cater for SHA3.  Presumably we'll know when NIST announces the winner of the SHA3 design contest next year; I'm pretty sure I've seen something about lengths for SHA3, so maybe we could know already, but I don't recall what so maybe not; I suspect VARBINARY(100) may turn out to be too short to cover all variants of SHA3 (but it will be years before SQL Server supports SHA3). To be certain why VARBINARY(100) was used in the article we'll have to hope for a response from the author; but I'm reasonably happy with my guess that it was intended to cope with known future growth in hash sizes (to 512 bits) and have a little bit to spare.  After all, it has to be at least VARBINARY(20) not BINARY(16) to cope with all hashes supported by SQL Server 2008 unless one uses different data type depending on which hash is being used (which would make changing from one hash to another twice as hard) and there's no extra overhead in using VARBINART(64) or VARBINARY(100) so that it can cover SHA2 instead of restricting it to MDn and SHA1 by using VARBINARY(20).  If I'd been writing something about hashing and wanted to cover future possibilities I might even have chosen VARBINARY(256) in the hope of catering for SHA4 when/if that happens!</description><pubDate>Wed, 27 Apr 2011 07:59:01 GMT</pubDate><dc:creator>L' Eomot Inversé</dc:creator></item><item><title>RE: HASHBYTES can help quickly load a Data Warehouse</title><link>http://www.sqlservercentral.com/Forums/Topic902868-507-1.aspx</link><description>[quote][b]Tom.Thomson (4/26/2011)[/b][hr][quote][b]wta306 (4/25/2011)[/b][hr]How do you determine what size to make your VARBINARY field to hold the results of the HASHBYTES function?Is it simply a matter of setting it to be equal to the largest HashByte function input value?[code="sql"]DECLARE @vHashResults AS VARBINARY(40)DECLARE @vHashInput01 AS CHAR(20)DECLARE @vHashInput02 AS CHAR(20)SET @vHashInput01 = 'thisisa20bytechar123'SET @vHashInput02 = 'thisisa21bytechar1234'SET @vHashResults = HashBytes('SHA1', @vHashInput01 + @vHashInput02)[/code][/quote]I would be somewhat horrified if HashBytes('SHA1', &amp;lt;anything&amp;gt;) delived a result with any length other than 160 bits (20 bytes)!  That's what MS thinks it delivers too, according to BoL.[/quote]Tom - thank you for the response.  Unless I'm mistaken, MD5 is only 128 bits (16bytes)...and yet the author of the original article stores the resulting values in a VARBINARY(100)?  What am I missing?</description><pubDate>Wed, 27 Apr 2011 06:20:50 GMT</pubDate><dc:creator>byoBI</dc:creator></item><item><title>RE: HASHBYTES can help quickly load a Data Warehouse</title><link>http://www.sqlservercentral.com/Forums/Topic902868-507-1.aspx</link><description>[quote][b]wta306 (4/25/2011)[/b][hr]How do you determine what size to make your VARBINARY field to hold the results of the HASHBYTES function?Is it simply a matter of setting it to be equal to the largest HashByte function input value?[code="sql"]DECLARE @vHashResults AS VARBINARY(40)DECLARE @vHashInput01 AS CHAR(20)DECLARE @vHashInput02 AS CHAR(20)SET @vHashInput01 = 'thisisa20bytechar123'SET @vHashInput02 = 'thisisa21bytechar1234'SET @vHashResults = HashBytes('SHA1', @vHashInput01 + @vHashInput02)[/code][/quote]I would be somewhat horrified if HashBytes('SHA1', &amp;lt;anything&amp;gt;) delived a result with any length other than 160 bits (20 bytes)!  That's what MS thinks it delivers too, according to BoL.</description><pubDate>Tue, 26 Apr 2011 12:08:04 GMT</pubDate><dc:creator>L' Eomot Inversé</dc:creator></item><item><title>RE: HASHBYTES can help quickly load a Data Warehouse</title><link>http://www.sqlservercentral.com/Forums/Topic902868-507-1.aspx</link><description>How do you determine what size to make your VARBINARY field to hold the results of the HASHBYTES function?Is it simply a matter of setting it to be equal to the largest HashByte function input value?[code="sql"]DECLARE @vHashResults AS VARBINARY(40)DECLARE @vHashInput01 AS CHAR(20)DECLARE @vHashInput02 AS CHAR(20)SET @vHashInput01 = 'thisisa20bytechar123'SET @vHashInput02 = 'thisisa21bytechar1234'SET @vHashResults = HashBytes('SHA1', @vHashInput01 + @vHashInput02)[/code]</description><pubDate>Mon, 25 Apr 2011 15:52:26 GMT</pubDate><dc:creator>byoBI</dc:creator></item><item><title>RE: HASHBYTES can help quickly load a Data Warehouse</title><link>http://www.sqlservercentral.com/Forums/Topic902868-507-1.aspx</link><description>Hey guys,Check out this website, will probably solve your issues with Hash Bytes within SSIS:http://ssismhash.codeplex.com/God Bless.</description><pubDate>Wed, 23 Feb 2011 06:08:47 GMT</pubDate><dc:creator>mackie</dc:creator></item><item><title>RE: HASHBYTES can help quickly load a Data Warehouse</title><link>http://www.sqlservercentral.com/Forums/Topic902868-507-1.aspx</link><description>Hi Mike,I am trying to do what you said concatenate all values to varbinary and then hashwhen you say concatenate you mean add the binary value of each column as follows:DECLARE @ColList Varchar(max), @SQLStatment VARCHAR(MAX)SET @ColList = ''select @ColList = @ColList + 'convert(varbinary(max), isnull(' +  Name + ',0) )+ ' from syscolumns where id = object_id('aggregate') SELECT @SQLStatment = 'SELECT ' + Substring(@ColList,1,len(@ColList)-1) + ' From aggregate'select @SQLStatmentexec(@SQLStatment)</description><pubDate>Fri, 03 Dec 2010 07:52:11 GMT</pubDate><dc:creator>miranwara</dc:creator></item><item><title>RE: HASHBYTES can help quickly load a Data Warehouse</title><link>http://www.sqlservercentral.com/Forums/Topic902868-507-1.aspx</link><description>[quote]  I think you've missed the key architectural point - the 3 column xref tables dramatically reduce disk IO which is much more of a bottleneck than the cpu doing a hash function.  The thing is to compare the width of the target table to the width of the xref table.Many data warehouse tables can be quite wide.  Reading a 50 column wide target table just to get the hash value is many times slower than reading the 3 column wide xref table.[/quote]I now understand your point in using xref table, thanks for the clarification.</description><pubDate>Thu, 08 Jul 2010 04:12:57 GMT</pubDate><dc:creator>nishant30000</dc:creator></item><item><title>RE: HASHBYTES can help quickly load a Data Warehouse</title><link>http://www.sqlservercentral.com/Forums/Topic902868-507-1.aspx</link><description>[quote]Like what should be a minimum number of rows or what is minimum number of columns after which time consumed in calculating Hash will outdo the time aved in comparison.[/quote]Calculating the hash is extremely lightweight as data goes through an ETL tool.  Lightweight enough that it's a foregone conclusion and you can see from other comments that a lot of people are already using one hash function or another.  I think you've missed the key architectural point - the 3 column xref tables dramatically reduce disk IO which is much more of a bottleneck than the cpu doing a hash function.  The thing is to compare the width of the target table to the width of the xref table.Many data warehouse tables can be quite wide.  Reading a 50 column wide target table just to get the hash value is many times slower than reading the 3 column wide xref table.  With SQL Server this can be mitigated with an 'included column' on an index for the target table but some database engines don't have this.  Earlier versions of SS don't have it.  Also consider development time; it is faster to develop a standardized routine rather than if one mixes and matches techniques.  Finally, data warehouses have a strong tendency to change so when a target table that's only 10 columns wide suddenly has a requirement for an additional 25 columns, you would have to rewrite its ETL if you picked the all column compare method because it was only 10 to start.</description><pubDate>Wed, 07 Jul 2010 09:50:27 GMT</pubDate><dc:creator>magarity kerns</dc:creator></item><item><title>RE: HASHBYTES can help quickly load a Data Warehouse</title><link>http://www.sqlservercentral.com/Forums/Topic902868-507-1.aspx</link><description>[quote][b]magarity kerns (4/13/2010)[/b][hr]Comments posted to this topic are about the item [B]&amp;lt;A HREF="/articles/Data+Warehouse/69679/"&amp;gt;HASHBYTES can help quickly load a Data Warehouse&amp;lt;/A&amp;gt;[/B][/quote]Hi great Article thanks! Do you have any time comaprision between the 'method mentioned in this article' Vs 'An all column comparison' method? Like what should be a minimum number of rows or what is minimum number of columns after which time consumed in calculating Hash will outdo the time aved in comparison.</description><pubDate>Wed, 07 Jul 2010 08:11:42 GMT</pubDate><dc:creator>nishant30000</dc:creator></item><item><title>RE: HASHBYTES can help quickly load a Data Warehouse</title><link>http://www.sqlservercentral.com/Forums/Topic902868-507-1.aspx</link><description>Hi ...The reason for converting to upper case is that often users make changes to the data by simply adding the same details in all capital letters (for various reasons). We do not want to pick this up as change in the data, if all that changes is the case. Cheers</description><pubDate>Mon, 17 May 2010 22:43:55 GMT</pubDate><dc:creator>Cyclone</dc:creator></item><item><title>RE: HASHBYTES can help quickly load a Data Warehouse</title><link>http://www.sqlservercentral.com/Forums/Topic902868-507-1.aspx</link><description>[quote][b]Cyclone (5/17/2010)[/b][hr]Do you see any problems with this approach?Also, I checked the length (using Len()) of the Hash Value and the max length I have so far is 20. I have the column that holds the Hash Value as datatype varbinary(max). Should I change the size of this column?[/quote]Two problems: Why are you converting to upper case and you don't have a delimiter.  'AB' and 'A' will give the same as 'A' and 'BA' with your method.  This has been hashed out (pun intended!) in prior conversations; please read the previous comments.  It doesn't matter if the delimiter is in your data or not, just that there is something. You can just use a space.  As long as something is there to break it up.Converting to upper case is a little strange - why do it?  Check with your SME or data steward to confirm your target system should ignore changes in case.  No one here can give the definite answer, although I advise against it unless you have a requirement to do so.The output is always 20 bytes for sha1 (and 16 for md5) but I don't know if the system needs extra processing or space to handle (max) instead of just (20).  It probably doesn't matter.The point of my article was the xref tables to hold checksums and keys; did anyone notice that at all in the debate over whose hash to use???</description><pubDate>Mon, 17 May 2010 11:18:37 GMT</pubDate><dc:creator>magarity kerns</dc:creator></item><item><title>RE: HASHBYTES can help quickly load a Data Warehouse</title><link>http://www.sqlservercentral.com/Forums/Topic902868-507-1.aspx</link><description>Assume you have vendorid and vendorname:AB   AbnerA     BabnerBecause you're converting to varchar, these both collapse toABABNER, so they look the same, while they clearly are not.You have to convert to a fixed width, or put some delimiter in there (and be very certain that delimiter doesn't actually show up in your data).</description><pubDate>Mon, 17 May 2010 08:25:22 GMT</pubDate><dc:creator>Nadrek</dc:creator></item><item><title>RE: HASHBYTES can help quickly load a Data Warehouse</title><link>http://www.sqlservercentral.com/Forums/Topic902868-507-1.aspx</link><description>Hi ...I am currently using a similiar method in order to compare data. The Hash value is calculated as follows:	HashBytes('SHA1',					UPPER(Isnull(convert(varchar(max), Branch), ''))			+ UPPER(Isnull(convert(varchar(max), VendorID), ''))			+ UPPER(Isnull(convert(varchar(max), VendorName), ''))			+ UPPER(Isnull(convert(varchar(max), VendorTradingName), ''))			+ UPPER(Isnull(convert(varchar(max), RegisteredCompany), ''))			+ UPPER(Isnull(convert(varchar(max), TypeOfEntity), ''))			+ UPPER(Isnull(convert(varchar(max), RegistrationNumber), ''))			+ ........ &amp;lt;other columns&amp;gt;			)I convert each field to varchar datatype and concatenate them together in order generate a value.Do you see any problems with this approach?Also, I checked the length (using Len()) of the Hash Value and the max length I have so far is 20. I have the column that holds the Hash Value as datatype varbinary(max). Should I change the size of this column?Thanks in advance.</description><pubDate>Mon, 17 May 2010 02:09:08 GMT</pubDate><dc:creator>Cyclone</dc:creator></item><item><title>RE: HASHBYTES can help quickly load a Data Warehouse</title><link>http://www.sqlservercentral.com/Forums/Topic902868-507-1.aspx</link><description>[quote][b]Rick Todd (5/11/2010)[/b][hr]ziangij, I would definitely recommend reading through some of the comments, but here's my summary:If you have other logic to prevent/handle collisions, checksum might be fine. If you want to rely on the checksum/hashbytes method to do all of your new/update/existing logic, you definitely need hashbytes over checksum.[/quote]apologies, i should have read all the comments which I have done now.many options have been suggested with pros and cons of each... :-)</description><pubDate>Thu, 13 May 2010 00:49:03 GMT</pubDate><dc:creator>ziangij</dc:creator></item><item><title>RE: HASHBYTES can help quickly load a Data Warehouse</title><link>http://www.sqlservercentral.com/Forums/Topic902868-507-1.aspx</link><description>ziangij, I would definitely recommend reading through some of the comments, but here's my summary:If you have other logic to prevent/handle collisions, checksum might be fine. If you want to rely on the checksum/hashbytes method to do all of your new/update/existing logic, you definitely need hashbytes over checksum.</description><pubDate>Tue, 11 May 2010 07:10:26 GMT</pubDate><dc:creator>Rick Todd</dc:creator></item><item><title>RE: HASHBYTES can help quickly load a Data Warehouse</title><link>http://www.sqlservercentral.com/Forums/Topic902868-507-1.aspx</link><description>I actually use checksum function for ETL jobs. Can anybody throw some light which one is better ? Checksum or Hashbytes...</description><pubDate>Tue, 11 May 2010 02:46:01 GMT</pubDate><dc:creator>ziangij</dc:creator></item><item><title>RE: HASHBYTES can help quickly load a Data Warehouse</title><link>http://www.sqlservercentral.com/Forums/Topic902868-507-1.aspx</link><description>[quote][b]magarity kerns (4/19/2010)[/b][hr]Look, the question of what exactly to use for an algorithm is a far off tangent.  The main idea of my article is to use the narrow xref tables to store hashes (however you get them) instead of tacking them on to the wide base table like a lot of people do which no one seems to have noticed in the great debate over what formula to use.  I personally don't care if you use SHA1 or CHECKSUM or anything else.  It has nothing to do with the model.[/quote]Actually it's not as far off tangent as you think.  Consider a simple hash function that simply grabs the first 4 characters of your inbound data.  I don't believe this hash function would achieve a core desired result of detecting changes, regardless of which table you stored your hashes in.  I may be wrong, however; some may find a simple/classic hash function like this might well be sufficient for their needs.[quote]As far as I'm concerned, whose collisions are theoretically less likely than whose is an implementation minutea that you can work out case by case.[/quote]On occasion a little attention to detail can help avoid a ton of problems down the road.BTW, the fact that people are still engaging in discussions of topics from your article for days after it was published is a good thing.  FWIW, having written an article or two myself I've discovered that people often aren't interested in discussing the minutiae that I want them to on the talk pages of my articles.Since this is the talk page for your article, however, I'll go ahead and un-follow this thread.To anyone interested in continuing the conversation about the technical minutiae of collision-free hash functions, you can reach me over at the blog: [url=http://www2.sqlblog.com/blogs/michael_coles/archive/2010/04/17/find-a-hash-collision-win-100.aspx]http://www2.sqlblog.com/blogs/michael_coles/archive/2010/04/17/find-a-hash-collision-win-100.aspx[/url].</description><pubDate>Mon, 19 Apr 2010 15:46:41 GMT</pubDate><dc:creator>Mike C</dc:creator></item><item><title>RE: HASHBYTES can help quickly load a Data Warehouse</title><link>http://www.sqlservercentral.com/Forums/Topic902868-507-1.aspx</link><description>Look, the question of what exactly to use for an algorithm is a far off tangent.  The main idea of my article is to use the narrow xref tables to store hashes (however you get them) instead of tacking them on to the wide base table like a lot of people do which no one seems to have noticed in the great debate over what formula to use.  I personally don't care if you use SHA1 or CHECKSUM or anything else.  It has nothing to do with the model.As far as I'm concerned, whose collisions are theoretically less likely than whose is an implementation minutea that you can work out case by case.</description><pubDate>Mon, 19 Apr 2010 13:14:57 GMT</pubDate><dc:creator>magarity kerns</dc:creator></item><item><title>RE: HASHBYTES can help quickly load a Data Warehouse</title><link>http://www.sqlservercentral.com/Forums/Topic902868-507-1.aspx</link><description>[quote][b]Nadrek (4/19/2010)[/b][hr][quote][b]Mike C (4/19/2010)[/b][hr][quote][b]Nadrek (4/19/2010)[/b][hr][quote][b]Mike C (4/17/2010)[/b][hr]Not sure how you calculated the probability of collision at 2^40 off the top of your head...  You can engineer an MD5 collision in 2^39 or less, but it's a pretty hard calculation to do without showing your work.  In fact I'll post a very simple MD5 hash collision example shortly.  This is why I recommend SHA-1 or better (probability of collision for 160-bit hash is 2^80 per the birthday paradox).  Even the current crop of theoretical attacks on SHA-1 have only pushed the probability of forcibly generating an SHA-1 hash collision to 2^69 which is higher than the standard probability of collision with MD5.  It's really hard to prove a negative and in most cases you're not going to save much by using a hash in a compromise solution to eliminate a few rows from full-blown multi-column comparisons.  I'm putting together a blog entry to give folks who want to use hashes in this way a chance to prove their case.[/quote]I take it you haven't seen the McDonald, Hawkes and Pieprzyk paper from Eurocrypt 2009's rump session:[url=http://eurocrypt2009rump.cr.yp.to/837a0a8086fa6ca714249409ddfae43d.pdf]SHA-1 collisions now 2^52[/url][/quote]Yes actually I did see that theoretical paper and I'm familiar with Joux and Peyrin's Amplified Boomerang attack that it's based on.  In fact I'm running a contest over on the blog at http://www.sqlblog.com (full link in previous messages).  There are no restrictions on the resources you can use.  If you can turn McDonald, Hawkes and Pierprzyk's theoretical slideshow into a practical implementation that actually generates a hash collision using SHA-1 on SQL Server, send it in and you can win $100.[/quote]I suspect I misinterpreted what you meant by the current crop of theoretical attacks increasing the probability of forcing a SHA-1 collision to 2^69, then.[/quote]Nope you didn't misinterpret.  2^69 was published by a Chinese research team, I believe about 5 years ago.  McDonald and company withdrew their published paper that accompanied this slideshow ("Differential Path for SHA-1 with complexity O(2^52)") after they figured out their estimate was incorrect!Link to withdrawal statement added:  [url=http://eprint.iacr.org/2009/259]http://eprint.iacr.org/2009/259[/url]</description><pubDate>Mon, 19 Apr 2010 12:21:20 GMT</pubDate><dc:creator>Mike C</dc:creator></item><item><title>RE: HASHBYTES can help quickly load a Data Warehouse</title><link>http://www.sqlservercentral.com/Forums/Topic902868-507-1.aspx</link><description>[quote][b]Mike C (4/19/2010)[/b][hr][quote][b]Nadrek (4/19/2010)[/b][hr][quote][b]Mike C (4/17/2010)[/b][hr]Not sure how you calculated the probability of collision at 2^40 off the top of your head...  You can engineer an MD5 collision in 2^39 or less, but it's a pretty hard calculation to do without showing your work.  In fact I'll post a very simple MD5 hash collision example shortly.  This is why I recommend SHA-1 or better (probability of collision for 160-bit hash is 2^80 per the birthday paradox).  Even the current crop of theoretical attacks on SHA-1 have only pushed the probability of forcibly generating an SHA-1 hash collision to 2^69 which is higher than the standard probability of collision with MD5.  It's really hard to prove a negative and in most cases you're not going to save much by using a hash in a compromise solution to eliminate a few rows from full-blown multi-column comparisons.  I'm putting together a blog entry to give folks who want to use hashes in this way a chance to prove their case.[/quote]I take it you haven't seen the McDonald, Hawkes and Pieprzyk paper from Eurocrypt 2009's rump session:[url=http://eurocrypt2009rump.cr.yp.to/837a0a8086fa6ca714249409ddfae43d.pdf]SHA-1 collisions now 2^52[/url][/quote]Yes actually I did see that theoretical paper and I'm familiar with Joux and Peyrin's Amplified Boomerang attack that it's based on.  In fact I'm running a contest over on the blog at http://www.sqlblog.com (full link in previous messages).  There are no restrictions on the resources you can use.  If you can turn McDonald, Hawkes and Pierprzyk's theoretical slideshow into a practical implementation that actually generates a hash collision using SHA-1 on SQL Server, send it in and you can win $100.[/quote]I suspect I misinterpreted what you meant by the current crop of theoretical attacks increasing the probability of forcing a SHA-1 collision to 2^69, then.</description><pubDate>Mon, 19 Apr 2010 11:23:22 GMT</pubDate><dc:creator>Nadrek</dc:creator></item><item><title>RE: HASHBYTES can help quickly load a Data Warehouse</title><link>http://www.sqlservercentral.com/Forums/Topic902868-507-1.aspx</link><description>[quote][b]Rick Todd (4/19/2010)[/b][hr]I think there's a significant element missing from the "which way is faster to process" debate (hash then compare, or just compare): time to market. As in, how quickly can we code the solution. We're in the probably unusual, but far from unique situation where we have lots of data sources, but not a tremendous amount of data in each one, and in our case this is combined with small development resources. Let's call it a "wide and shallow" data warehouse. The result? We implemented something akin to Magarity's solution, with a couple of meta-data based SPs and TVFs that we can use over and over again (to hashbyte all the columns in a table, given a list of "ignore" columns, for example). While this might not be THE most efficient way to process all of the source systems, for us it was a no-brainer. Our DW server is fast enough that we can use this solution over and over, and we don't have to dig very deep into each source system to see whether or not we can trust the update date columns or whatever. We just use the same approach for each data source, and reuse the same code, or at least approach. Could we be loading some of our Fact tables faster if we didn't hash the rows and did column-by-column comparisons, and some of the other ones faster by finding that we actually can trust the update date column (maintained by a vendor, mind you, not our own code)? Undoubtedly. Would we still be working on one of the data sources I completed the ETL, Cubes, and Reports for 4 months ago? Undoubtedly.So, while I love seeing the logic and discussion of the efficiencies discussed above, I think this is an efficiency that needs to be considered in at least some cases.Cheers,[/quote]I agree.  In fact, I've implemented solutions that use .NET in SSIS to generate hashes for both primary keys in the source (natural keys) and to detect changes across non-key columns at the same time, with similar capabilities to eliminate certain administrative columns as needed.  Hashing the PKs was done for the reasons you mention above -- one source system may have a one-column PK while another system may have a five-column PK, yet we needed to be able to grab data from new source systems while minimizing customization.  So this should be a big part of the conversation, and could be the determining factor in any given situation.  The other benefits of generating a narrow single-column key (hash) from a multi-column source key, for instance, may well outweigh the overhead involved in generating the hash.</description><pubDate>Mon, 19 Apr 2010 10:15:46 GMT</pubDate><dc:creator>Mike C</dc:creator></item><item><title>RE: HASHBYTES can help quickly load a Data Warehouse</title><link>http://www.sqlservercentral.com/Forums/Topic902868-507-1.aspx</link><description>[quote][b]Nadrek (4/19/2010)[/b][hr][quote][b]Mike C (4/17/2010)[/b][hr]Not sure how you calculated the probability of collision at 2^40 off the top of your head...  You can engineer an MD5 collision in 2^39 or less, but it's a pretty hard calculation to do without showing your work.  In fact I'll post a very simple MD5 hash collision example shortly.  This is why I recommend SHA-1 or better (probability of collision for 160-bit hash is 2^80 per the birthday paradox).  Even the current crop of theoretical attacks on SHA-1 have only pushed the probability of forcibly generating an SHA-1 hash collision to 2^69 which is higher than the standard probability of collision with MD5.  It's really hard to prove a negative and in most cases you're not going to save much by using a hash in a compromise solution to eliminate a few rows from full-blown multi-column comparisons.  I'm putting together a blog entry to give folks who want to use hashes in this way a chance to prove their case.[/quote]I take it you haven't seen the McDonald, Hawkes and Pieprzyk paper from Eurocrypt 2009's rump session:[url=http://eurocrypt2009rump.cr.yp.to/837a0a8086fa6ca714249409ddfae43d.pdf]SHA-1 collisions now 2^52[/url][/quote]Yes actually I did see that theoretical paper and I'm familiar with Joux and Peyrin's Amplified Boomerang attack that it's based on.  In fact I'm running a contest over on the blog at http://www.sqlblog.com (full link in previous messages).  There are no restrictions on the resources you can use.  If you can turn McDonald, Hawkes and Pierprzyk's theoretical slideshow into a practical implementation that actually generates a hash collision using SHA-1 on SQL Server, send it in and you can win $100.</description><pubDate>Mon, 19 Apr 2010 10:10:30 GMT</pubDate><dc:creator>Mike C</dc:creator></item><item><title>RE: HASHBYTES can help quickly load a Data Warehouse</title><link>http://www.sqlservercentral.com/Forums/Topic902868-507-1.aspx</link><description>I think there's a significant element missing from the "which way is faster to process" debate (hash then compare, or just compare): time to market. As in, how quickly can we code the solution. We're in the probably unusual, but far from unique situation where we have lots of data sources, but not a tremendous amount of data in each one, and in our case this is combined with small development resources. Let's call it a "wide and shallow" data warehouse. The result? We implemented something akin to Magarity's solution, with a couple of meta-data based SPs and TVFs that we can use over and over again (to hashbyte all the columns in a table, given a list of "ignore" columns, for example). While this might not be THE most efficient way to process all of the source systems, for us it was a no-brainer. Our DW server is fast enough that we can use this solution over and over, and we don't have to dig very deep into each source system to see whether or not we can trust the update date columns or whatever. We just use the same approach for each data source, and reuse the same code, or at least approach. Could we be loading some of our Fact tables faster if we didn't hash the rows and did column-by-column comparisons, and some of the other ones faster by finding that we actually can trust the update date column (maintained by a vendor, mind you, not our own code)? Undoubtedly. Would we still be working on one of the data sources I completed the ETL, Cubes, and Reports for 4 months ago? Undoubtedly.So, while I love seeing the logic and discussion of the efficiencies discussed above, I think this is an efficiency that needs to be considered in at least some cases.Cheers,</description><pubDate>Mon, 19 Apr 2010 08:41:32 GMT</pubDate><dc:creator>Rick Todd</dc:creator></item><item><title>RE: HASHBYTES can help quickly load a Data Warehouse</title><link>http://www.sqlservercentral.com/Forums/Topic902868-507-1.aspx</link><description>[quote][b]Mike C (4/17/2010)[/b][hr]Not sure how you calculated the probability of collision at 2^40 off the top of your head...  You can engineer an MD5 collision in 2^39 or less, but it's a pretty hard calculation to do without showing your work.  In fact I'll post a very simple MD5 hash collision example shortly.  This is why I recommend SHA-1 or better (probability of collision for 160-bit hash is 2^80 per the birthday paradox).  Even the current crop of theoretical attacks on SHA-1 have only pushed the probability of forcibly generating an SHA-1 hash collision to 2^69 which is higher than the standard probability of collision with MD5.  It's really hard to prove a negative and in most cases you're not going to save much by using a hash in a compromise solution to eliminate a few rows from full-blown multi-column comparisons.  I'm putting together a blog entry to give folks who want to use hashes in this way a chance to prove their case.[/quote]I take it you haven't seen the McDonald, Hawkes and Pieprzyk paper from Eurocrypt 2009's rump session:[url=http://eurocrypt2009rump.cr.yp.to/837a0a8086fa6ca714249409ddfae43d.pdf]SHA-1 collisions now 2^52[/url]</description><pubDate>Mon, 19 Apr 2010 08:35:12 GMT</pubDate><dc:creator>Nadrek</dc:creator></item><item><title>RE: HASHBYTES can help quickly load a Data Warehouse</title><link>http://www.sqlservercentral.com/Forums/Topic902868-507-1.aspx</link><description>[quote][b]tarquin (4/19/2010)[/b][hr]Here I was believing in Kimball! To prove or disprove my guru's theory I went and looked at a client database that has been in operation for about a year. I found a table with a just under 20 million records - the logical key is a date and two ints (EffectiveDate, FundId and DataTypeId). The date is the most unique, fund a bit lower and datatypeid basically useless. I have created a checksum of the three keys roughly as follows: checksum(hashbytes(date as string), fundid, datatypeid)) which is stored along with the values at load time. The checksum calculation has almost no overhead at load.The cardinality value when looking at statistics details are:1) Checksum =      0.000052282) EffectiveDate = 0.00017382) FundId =          0.000571423) DataTypeId =   1So my hashindex is more unique than any of the logical keys alone. This represents a moderate amount of data on a daily basis - more data would show a bigger difference.Am I being dumb here - is this a wasted exercise?[/quote]CHECKSUM has little overhead because it performs similar to the following:1.  Grab 32 bits of input, put in current result2.  Shift 4 bits3.  XOR 32 bits with current result and store in current result4.  Repeat starting at 2 until end of dataIf your logical key is (EffectiveDate, FundID, DataTypeID), then the cardinality of the combination of (EffectiveDate, FundID, DataTypeID) is what's important, not the cardinality of 1/3 of your key.</description><pubDate>Mon, 19 Apr 2010 06:54:45 GMT</pubDate><dc:creator>Mike C</dc:creator></item><item><title>RE: HASHBYTES can help quickly load a Data Warehouse</title><link>http://www.sqlservercentral.com/Forums/Topic902868-507-1.aspx</link><description>Here I was believing in Kimball! To prove or disprove my guru's theory I went and looked at a client database that has been in operation for about a year. I found a table with a just under 20 million records - the logical key is a date and two ints (EffectiveDate, FundId and DataTypeId). The date is the most unique, fund a bit lower and datatypeid basically useless. I have created a checksum of the three keys roughly as follows: checksum(hashbytes(date as string), fundid, datatypeid)) which is stored along with the values at load time. The checksum calculation has almost no overhead at load.The cardinality value when looking at statistics details are:1) Checksum =      0.000052282) EffectiveDate = 0.00017382) FundId =          0.000571423) DataTypeId =   1So my hashindex is more unique than any of the logical keys alone. This represents a moderate amount of data on a daily basis - more data would show a bigger difference.Am I being dumb here - is this a wasted exercise?</description><pubDate>Mon, 19 Apr 2010 02:11:15 GMT</pubDate><dc:creator>tarquin</dc:creator></item><item><title>RE: HASHBYTES can help quickly load a Data Warehouse</title><link>http://www.sqlservercentral.com/Forums/Topic902868-507-1.aspx</link><description>[quote][b]Tom.Thomson (4/17/2010)[/b][hr][quote][b]Mike C (4/17/2010)[/b][hr]The vast majority of ETL solutions I've encountered tend to discard a lot more duplicate rows than they update or insert (after the initial table load, that is).  Perhaps someone can provide some counter-examples they've encountered where the opposite is true?[/quote]It's certainly true that if the system has those characteristics (many more discards than inserts/updates) then using a hash function is a waste of resource, so hashing should be avoided in that case.  And with a good design (so that everything in the data source has a meaningful timestamp or version ID) there's no need for comparison of more than that (plus the primary key) and that's probably a lot cheaper than a hash. So now you have me wondering whether all this stuff I have read about using hashing to optimise update of DW is just hot air.[/quote]And I would argue that no it's not hot air if you believe in the collision-free properties of your hash function.  If you have a true collision-free hash function then there's no need to do column-by-column comparisons at all.  You eliminate the column-by-column comparisons completely, you'll save considerable processing as you pointed out.  Whether you use a timestamp, version ID, or hash function, the key to better efficiency is to eliminate the column-by-column comparisons completely from your process.A timestamp or version ID is still not going to tell you whether the row being imported via ETL from a flat file is different from the corresponding row in the target table--unless, of course, your source systems have *global* timestamps or version IDs that align with the target system's.  If you can put a global versioning system in place on every source system you pull data from, it would increase the efficiency of your ETL considerably and remove the need for many of these types of change detection solutions.  That's usually not feasible though...</description><pubDate>Sat, 17 Apr 2010 20:33:36 GMT</pubDate><dc:creator>Mike C</dc:creator></item><item><title>RE: HASHBYTES can help quickly load a Data Warehouse</title><link>http://www.sqlservercentral.com/Forums/Topic902868-507-1.aspx</link><description>[quote][b]Mike C (4/17/2010)[/b][hr]The vast majority of ETL solutions I've encountered tend to discard a lot more duplicate rows than they update or insert (after the initial table load, that is).  Perhaps someone can provide some counter-examples they've encountered where the opposite is true?[/quote]It's certainly true that if the system has those characteristics (many more discards than inserts/updates) then using a hash function is a waste of resource, so hashing should be avoided in that case.  And with a good design (so that everything in the data source has a meaningful timestamp or version ID) there's no need for comparison of more than that (plus the primary key) and that's probably a lot cheaper than a hash. So now you have me wondering whether all this stuff I have read about using hashing to optimise update of DW is just hot air.</description><pubDate>Sat, 17 Apr 2010 20:23:35 GMT</pubDate><dc:creator>L' Eomot Inversé</dc:creator></item><item><title>RE: HASHBYTES can help quickly load a Data Warehouse</title><link>http://www.sqlservercentral.com/Forums/Topic902868-507-1.aspx</link><description>[quote][b]magarity kerns (4/17/2010)[/b][hr][quote][b]Mike C (4/17/2010)[/b][hr]SELECT @A = 0xd131dd02c5e6eec4693d9a0698aff95c2fcab58712467eab4004583eb8fb7f8955ad340609f4b30283e488832571415a085125e8f7cdc99fd91dbdf280373c5bd8823e3156348f5bae6dacd436c919c6dd53e2b487da03fd02396306d248cda0e99f33420f577ee8ce54b67080a80d1ec69821bcb6a8839396f9652b6ff72a70,[/quote]I surrender; your business users are infintely more badass than mine if they're analyzing these kinds of statistics in your data warehouse.[/quote]Here's a version of the exact same thing that even a business user might be able to analyze :)  I wouldn't expect a business user to necessarily understand the conceptsof data type conversions and hash functions, however.  For that they'll have to read Wikipedia like everyone else.CREATE TABLE #Temp (	ID1 TINYINT, 	CODE1 CHAR(1), 	FEE1 SMALLMONEY, 	FEE2 SMALLMONEY, 	FEE3 SMALLMONEY, 	FEE4 SMALLMONEY, 	INTEREST1 SMALLMONEY, 	ID2 SMALLINT, 	CODE2 CHAR(1), 	CHARGE1 SMALLMONEY, 	ID3 TINYINT, 	PRICE1 SMALLMONEY, 	CODE3 CHAR(1), 	ID4 TINYINT, 	ID5 TINYINT, 	CHARGE2 SMALLMONEY, 	CHARGE3 SMALLMONEY, 	ID6 TINYINT, 	STATE CHAR(2), 	TAX1 SMALLMONEY, 	TAX2 SMALLMONEY, 	TAX3 SMALLMONEY, 	TAX4 SMALLMONEY, 	SURCHARGE1 SMALLMONEY, 	SURCHARGE2 SMALLMONEY, 	SURCHARGE3 SMALLMONEY, 	SURCHARGE4 SMALLMONEY, 	SURCHARGE5 SMALLMONEY, 	SURCHARGE6 SMALLMONEY, 	ID7 TINYINT, 	CODE4 CHAR(1), 	CODE5 CHAR(1), 	ID8 TINYINT, 	REFUND1 SMALLMONEY, 	REFUND2 SMALLMONEY, 	ID9 TINYINT, 	FEE5 SMALLMONEY, 	FEE6 SMALLMONEY, 	FEE7 SMALLMONEY, 	FEE8 SMALLMONEY, 	FEE9 SMALLMONEY, 	ID10 TINYINT, 	CODE6 CHAR(1), 	ID11 TINYINT, 	FEE10 SMALLMONEY);INSERT INTO #Temp VALUES(	209,	1,	-58702.0826,	-28911.7891,	-171084.3729,	-11139.8966,	-124943.9162,	32427,	'@',	7289.2088,	251,	213970.6797,	4,	6,	9,	-18959.5005,	-46081.5579,	113,	'AZ',	13953.5848,	-13750.8449,	-65236.2254,	-214386.3717,	-66255.3039,	144628.5147,	-136854.4044,	91914.8998,	-58170.5036,	-201575.5267,	2,	9,	'c',	6,	-76698.0704,	-37544.2622,	15,	146793.4926,	142124.2496,	-147553.5162,	-174261.9466,	-146777.201,	249,	'e',	43,	187846.9232);INSERT INTO #TempVALUES(	209,	1,	-58702.0826,	-28911.7891,	-171084.3729,	-11139.8966,	-125782.777,	32427,	'@',	7289.2088,	251,	213970.6797,	4,	6,	9,	-18959.5005,	-46081.5579,	241,	'AZ',	13953.5848,	-13750.8449,	-65236.2382,	-214386.3717,	-66255.3039,	144628.5147,	-136854.4044,	91914.8998,	-58170.5164,	-201575.5267,	2,	9,	'c',	6,	-76698.0704,	-37544.2622,	15,	146793.4926,	142124.2496,	67194.8486,	-174261.9466,	-146777.201,	249,	'e',	171,	187846.9232);SELECT *FROM #Temp;SELECT HASHBYTES('MD5',	CAST(ID1 AS BINARY(1)) + 	CAST(CODE1 AS BINARY(1)) + 	CAST(FEE1 AS VARBINARY(4)) + 	CAST(FEE2 AS VARBINARY(4)) + 	CAST(FEE3 AS VARBINARY(4)) + 	CAST(FEE4 AS VARBINARY(4)) + 	CAST(INTEREST1 AS VARBINARY(4)) + 	CAST(ID2 AS VARBINARY(2)) + 	CAST(CODE2 AS BINARY(1)) + 	CAST(CHARGE1 AS VARBINARY(4)) + 	CAST(ID3 AS BINARY(1)) + 	CAST(PRICE1 AS VARBINARY(4)) + 	CAST(CODE3 AS BINARY(1)) + 	CAST(ID4 AS BINARY(1)) + 	CAST(ID5 AS BINARY(1)) + 	CAST(CHARGE2 AS VARBINARY(4)) + 	CAST(CHARGE3 AS VARBINARY(4)) + 	CAST(ID6 AS BINARY(1)) + 	CAST(STATE AS VARBINARY(2)) + 	CAST(TAX1 AS VARBINARY(4)) + 	CAST(TAX2 AS VARBINARY(4)) + 	CAST(TAX3 AS VARBINARY(4)) + 	CAST(TAX4 AS VARBINARY(4)) + 	CAST(SURCHARGE1 AS VARBINARY(4)) + 	CAST(SURCHARGE2 AS VARBINARY(4)) + 	CAST(SURCHARGE3 AS VARBINARY(4)) + 	CAST(SURCHARGE4 AS VARBINARY(4)) + 	CAST(SURCHARGE5 AS VARBINARY(4)) + 	CAST(SURCHARGE6 AS VARBINARY(4)) + 	CAST(ID7 AS BINARY(1)) + 	CAST(CODE4 AS BINARY(1)) + 	CAST(CODE5 AS BINARY(1)) + 	CAST(ID8 AS BINARY(1)) + 	CAST(REFUND1 AS VARBINARY(4)) + 	CAST(REFUND2 AS VARBINARY(4)) + 	CAST(ID9 AS BINARY(1)) + 	CAST(FEE5 AS VARBINARY(4)) + 	CAST(FEE6 AS VARBINARY(4)) + 	CAST(FEE7 AS VARBINARY(4)) + 	CAST(FEE8 AS VARBINARY(4)) + 	CAST(FEE9 AS VARBINARY(4)) + 	CAST(ID10 AS BINARY(1)) + 	CAST(CODE6 AS BINARY(1)) + 	CAST(ID11 AS BINARY(1)) + 	CAST(FEE10 AS VARBINARY(4)) ) FROM #Temp;DROP TABLE #Temp;</description><pubDate>Sat, 17 Apr 2010 19:45:26 GMT</pubDate><dc:creator>Mike C</dc:creator></item><item><title>RE: HASHBYTES can help quickly load a Data Warehouse</title><link>http://www.sqlservercentral.com/Forums/Topic902868-507-1.aspx</link><description>[quote][b]magarity kerns (4/17/2010)[/b][hr][quote][b]Mike C (4/17/2010)[/b][hr]SELECT @A = 0xd131dd02c5e6eec4693d9a0698aff95c2fcab58712467eab4004583eb8fb7f8955ad340609f4b30283e488832571415a085125e8f7cdc99fd91dbdf280373c5bd8823e3156348f5bae6dacd436c919c6dd53e2b487da03fd02396306d248cda0e99f33420f577ee8ce54b67080a80d1ec69821bcb6a8839396f9652b6ff72a70,[/quote]I surrender; your business users are infintely more badass than mine if they're analyzing these kinds of statistics in your data warehouse.[/quote]LOL.  This could actually represent a concatenation of several columns converted to binary format.  There are examples of PDF files that generate collisions with MD5.  Surely your business users use PDF files? :)</description><pubDate>Sat, 17 Apr 2010 17:31:24 GMT</pubDate><dc:creator>Mike C</dc:creator></item><item><title>RE: HASHBYTES can help quickly load a Data Warehouse</title><link>http://www.sqlservercentral.com/Forums/Topic902868-507-1.aspx</link><description>[quote][b]Mike C (4/17/2010)[/b][hr]SELECT @A = 0xd131dd02c5e6eec4693d9a0698aff95c2fcab58712467eab4004583eb8fb7f8955ad340609f4b30283e488832571415a085125e8f7cdc99fd91dbdf280373c5bd8823e3156348f5bae6dacd436c919c6dd53e2b487da03fd02396306d248cda0e99f33420f577ee8ce54b67080a80d1ec69821bcb6a8839396f9652b6ff72a70,[/quote]I surrender; your business users are infintely more badass than mine if they're analyzing these kinds of statistics in your data warehouse.</description><pubDate>Sat, 17 Apr 2010 17:16:35 GMT</pubDate><dc:creator>magarity kerns</dc:creator></item><item><title>RE: HASHBYTES can help quickly load a Data Warehouse</title><link>http://www.sqlservercentral.com/Forums/Topic902868-507-1.aspx</link><description>[quote][b]Tom.Thomson (4/17/2010)[/b][hr][quote][b]Mike C (4/17/2010)[/b][hr]This is why I recommend SHA-1 or better (probability of collision for 160-bit hash is 2^80 per the birthday paradox).  Even the current crop of theoretical attacks on SHA-1 have only pushed the probability of forcibly generating an SHA-1 hash collision to 2^69 which is higher than the standard probability of collision with MD5.  It's really hard to prove a negative and in most cases you're not going to save much by using a hash in a compromise solution to eliminate a few rows from full-blown multi-column comparisons.  I'm putting together a blog entry to give folks who want to use hashes in this way a chance to prove their case.[/quote]Sure, if you want secure validation MD5 is a bit old and week.  But for the DW upload application you aren't after secure validation, you are after saving comptutation. So the questions are what fraction of the cost of a full row comparison is the hash algorithm, and what is the expected proportion of cases where the hash algorithm will deliver a negative result (ie will eliminate the need for a full row comparison).  If the cost fraction is P and the elimination proportion is Q using the hash multiplies the cost by (1+P-Q) so as long as P is smaller than Q you are winning. Certainly SHA1 will have a bigger Q than MD5, but it will also have a bigger P and I am pretty sure that the increase in P is much greater than the increase in Q, which would mean that SHA1 is less efficient for this particular application.[/quote]DECLARE @A varbinary(8000),      @B varbinary(8000),      @hA binary(16),      @hB binary(16);SELECT @A = 0xd131dd02c5e6eec4693d9a0698aff95c2fcab58712467eab4004583eb8fb7f8955ad340609f4b30283e488832571415a085125e8f7cdc99fd91dbdf280373c5bd8823e3156348f5bae6dacd436c919c6dd53e2b487da03fd02396306d248cda0e99f33420f577ee8ce54b67080a80d1ec69821bcb6a8839396f9652b6ff72a70,      @B = 0xd131dd02c5e6eec4693d9a0698aff95c2fcab50712467eab4004583eb8fb7f8955ad340609f4b30283e4888325f1415a085125e8f7cdc99fd91dbd7280373c5bd8823e3156348f5bae6dacd436c919c6dd53e23487da03fd02396306d248cda0e99f33420f577ee8ce54b67080280d1ec69821bcb6a8839396f965ab6ff72a70;SELECT @hA = HASHBYTES('MD5', @A),      @hB = HASHBYTES('MD5', @B);      SELECT CASE WHEN @A = @B                  THEN '@A Equals @B'                  ELSE '@A Is Not Equal To @B'                  END AS AB_Equal,            CASE WHEN @hA = @hB                  THEN '@hA Equals @hB'                  ELSE '@hA Is Not Equal To @hB'                  END AS Hash_Equal;Now do this with SHA1, submit it to my blog [url=http://sqlblog.com/blogs/michael_coles/archive/2010/04/17/find-a-hash-collision-win-100.aspx]http://sqlblog.com/blogs/michael_coles/archive/2010/04/17/find-a-hash-collision-win-100.aspx[/url] and we'll see if you can't win $100 and a book.</description><pubDate>Sat, 17 Apr 2010 16:14:32 GMT</pubDate><dc:creator>Mike C</dc:creator></item><item><title>RE: HASHBYTES can help quickly load a Data Warehouse</title><link>http://www.sqlservercentral.com/Forums/Topic902868-507-1.aspx</link><description>[quote][b]Tom.Thomson (4/17/2010)[/b][hr][quote][b]Mike C (4/17/2010)[/b][hr]This is why I recommend SHA-1 or better (probability of collision for 160-bit hash is 2^80 per the birthday paradox).  Even the current crop of theoretical attacks on SHA-1 have only pushed the probability of forcibly generating an SHA-1 hash collision to 2^69 which is higher than the standard probability of collision with MD5.  It's really hard to prove a negative and in most cases you're not going to save much by using a hash in a compromise solution to eliminate a few rows from full-blown multi-column comparisons.  I'm putting together a blog entry to give folks who want to use hashes in this way a chance to prove their case.[/quote]Sure, if you want secure validation MD5 is a bit old and week.  But for the DW upload application you aren't after secure validation, you are after saving comptutation. So the questions are what fraction of the cost of a full row comparison is the hash algorithm, and what is the expected proportion of cases where the hash algorithm will deliver a negative result (ie will eliminate the need for a full row comparison).  If the cost fraction is P and the elimination proportion is Q using the hash multiplies the cost by (1+P-Q) so as long as P is smaller than Q you are winning. Certainly SHA1 will have a bigger Q than MD5, but it will also have a bigger P and I am pretty sure that the increase in P is much greater than the increase in Q, which would mean that SHA1 is less efficient for this particular application.[/quote]I just posted a blog post with a contest to give folks a chance to prove a negative.  The blog entry shows (1) a very simple MD5 hash collision with two relatively short binary strings that have one byte difference.  I'm also holding a contest (discussed on the blog as well) -- the first person to submit an SHA-1 hash collision wins $100 and a book: [url=http://bit.ly/cRKByv]http://bit.ly/cRKByv[/url]On your assertion above, I agree completely, and I discuss this on the blog.  Taking the example of the "slowly" changing dimension, the question is how many inbound rows are going to be inserted or updated, and how many are going to be ignored?  You also need to break "updates" into two separate types of updates -- updates where the hash values are different and updates where the hash values are the same.The only inbound rows that don't require a full all-columns comparison are new inserts (row doesn't exist in target table) and update rows (row exists in target table, but hash is different).  As you point out, for the compromise position to add any value whatsoever you have to have a much, much higher percentage of inserts and updates (where hash is different) than non-change/ignore rows or update rows where the hash is the same.  If the percentage of inserts and updates (different hash) is not significantly higher than the number of rows to ignore and update (same hash), then you not only don't gain any value, you lose a whole lot of efficiency by unnecessarily adding the overhead of generating hashes for no useful reason.The vast majority of ETL solutions I've encountered tend to discard a lot more duplicate rows than they update or insert (after the initial table load, that is).  Perhaps someone can provide some counter-examples they've encountered where the opposite is true?</description><pubDate>Sat, 17 Apr 2010 16:11:51 GMT</pubDate><dc:creator>Mike C</dc:creator></item></channel></rss>