SQL Clone
SQLServerCentral is supported by Redgate
 
Log in  ::  Register  ::  Not logged in
 
 
 


HASHBYTES can help quickly load a Data Warehouse


HASHBYTES can help quickly load a Data Warehouse

Author
Message
Rick Todd
Rick Todd
SSC Veteran
SSC Veteran (290 reputation)SSC Veteran (290 reputation)SSC Veteran (290 reputation)SSC Veteran (290 reputation)SSC Veteran (290 reputation)SSC Veteran (290 reputation)SSC Veteran (290 reputation)SSC Veteran (290 reputation)

Group: General Forum Members
Points: 290 Visits: 441
ziangij, I would definitely recommend reading through some of the comments, but here's my summary:
If you have other logic to prevent/handle collisions, checksum might be fine. If you want to rely on the checksum/hashbytes method to do all of your new/update/existing logic, you definitely need hashbytes over checksum.


Rick Todd
ziangij
ziangij
Hall of Fame
Hall of Fame (3.1K reputation)Hall of Fame (3.1K reputation)Hall of Fame (3.1K reputation)Hall of Fame (3.1K reputation)Hall of Fame (3.1K reputation)Hall of Fame (3.1K reputation)Hall of Fame (3.1K reputation)Hall of Fame (3.1K reputation)

Group: General Forum Members
Points: 3131 Visits: 376
Rick Todd (5/11/2010)
ziangij, I would definitely recommend reading through some of the comments, but here's my summary:
If you have other logic to prevent/handle collisions, checksum might be fine. If you want to rely on the checksum/hashbytes method to do all of your new/update/existing logic, you definitely need hashbytes over checksum.


apologies, i should have read all the comments which I have done now.

many options have been suggested with pros and cons of each... :-)
Cyclone
Cyclone
Valued Member
Valued Member (70 reputation)Valued Member (70 reputation)Valued Member (70 reputation)Valued Member (70 reputation)Valued Member (70 reputation)Valued Member (70 reputation)Valued Member (70 reputation)Valued Member (70 reputation)

Group: General Forum Members
Points: 70 Visits: 244
Hi ...

I am currently using a similiar method in order to compare data. The Hash value is calculated as follows:

HashBytes('SHA1',

UPPER(Isnull(convert(varchar(max), Branch), ''))
+ UPPER(Isnull(convert(varchar(max), VendorID), ''))
+ UPPER(Isnull(convert(varchar(max), VendorName), ''))
+ UPPER(Isnull(convert(varchar(max), VendorTradingName), ''))
+ UPPER(Isnull(convert(varchar(max), RegisteredCompany), ''))
+ UPPER(Isnull(convert(varchar(max), TypeOfEntity), ''))
+ UPPER(Isnull(convert(varchar(max), RegistrationNumber), ''))
+ ........ <other columns>
)

I convert each field to varchar datatype and concatenate them together in order generate a value.

Do you see any problems with this approach?

Also, I checked the length (using Len()) of the Hash Value and the max length I have so far is 20. I have the column that holds the Hash Value as datatype varbinary(max). Should I change the size of this column?

Thanks in advance.
Nadrek
Nadrek
SSCommitted
SSCommitted (1.9K reputation)SSCommitted (1.9K reputation)SSCommitted (1.9K reputation)SSCommitted (1.9K reputation)SSCommitted (1.9K reputation)SSCommitted (1.9K reputation)SSCommitted (1.9K reputation)SSCommitted (1.9K reputation)

Group: General Forum Members
Points: 1911 Visits: 2729
Assume you have vendorid and vendorname:
AB Abner
A Babner

Because you're converting to varchar, these both collapse to
ABABNER, so they look the same, while they clearly are not.

You have to convert to a fixed width, or put some delimiter in there (and be very certain that delimiter doesn't actually show up in your data).
magarity kerns
magarity kerns
Mr or Mrs. 500
Mr or Mrs. 500 (524 reputation)Mr or Mrs. 500 (524 reputation)Mr or Mrs. 500 (524 reputation)Mr or Mrs. 500 (524 reputation)Mr or Mrs. 500 (524 reputation)Mr or Mrs. 500 (524 reputation)Mr or Mrs. 500 (524 reputation)Mr or Mrs. 500 (524 reputation)

Group: General Forum Members
Points: 524 Visits: 397
Cyclone (5/17/2010)
Do you see any problems with this approach?

Also, I checked the length (using Len()) of the Hash Value and the max length I have so far is 20. I have the column that holds the Hash Value as datatype varbinary(max). Should I change the size of this column?


Two problems: Why are you converting to upper case and you don't have a delimiter. 'AB' and 'A' will give the same as 'A' and 'BA' with your method. This has been hashed out (pun intended!) in prior conversations; please read the previous comments. It doesn't matter if the delimiter is in your data or not, just that there is something. You can just use a space. As long as something is there to break it up.

Converting to upper case is a little strange - why do it? Check with your SME or data steward to confirm your target system should ignore changes in case. No one here can give the definite answer, although I advise against it unless you have a requirement to do so.

The output is always 20 bytes for sha1 (and 16 for md5) but I don't know if the system needs extra processing or space to handle (max) instead of just (20). It probably doesn't matter.

The point of my article was the xref tables to hold checksums and keys; did anyone notice that at all in the debate over whose hash to use???
Cyclone
Cyclone
Valued Member
Valued Member (70 reputation)Valued Member (70 reputation)Valued Member (70 reputation)Valued Member (70 reputation)Valued Member (70 reputation)Valued Member (70 reputation)Valued Member (70 reputation)Valued Member (70 reputation)

Group: General Forum Members
Points: 70 Visits: 244
Hi ...

The reason for converting to upper case is that often users make changes to the data by simply adding the same details in all capital letters (for various reasons). We do not want to pick this up as change in the data, if all that changes is the case.

Cheers
nishant30000
nishant30000
Forum Newbie
Forum Newbie (2 reputation)Forum Newbie (2 reputation)Forum Newbie (2 reputation)Forum Newbie (2 reputation)Forum Newbie (2 reputation)Forum Newbie (2 reputation)Forum Newbie (2 reputation)Forum Newbie (2 reputation)

Group: General Forum Members
Points: 2 Visits: 10
magarity kerns (4/13/2010)
Comments posted to this topic are about the item <A HREF="/articles/Data+Warehouse/69679/">HASHBYTES can help quickly load a Data Warehouse</A>


Hi great Article thanks!

Do you have any time comaprision between the 'method mentioned in this article' Vs 'An all column comparison' method?

Like what should be a minimum number of rows or what is minimum number of columns after which time consumed in calculating Hash will outdo the time aved in comparison.
magarity kerns
magarity kerns
Mr or Mrs. 500
Mr or Mrs. 500 (524 reputation)Mr or Mrs. 500 (524 reputation)Mr or Mrs. 500 (524 reputation)Mr or Mrs. 500 (524 reputation)Mr or Mrs. 500 (524 reputation)Mr or Mrs. 500 (524 reputation)Mr or Mrs. 500 (524 reputation)Mr or Mrs. 500 (524 reputation)

Group: General Forum Members
Points: 524 Visits: 397
Like what should be a minimum number of rows or what is minimum number of columns after which time consumed in calculating Hash will outdo the time aved in comparison.

Calculating the hash is extremely lightweight as data goes through an ETL tool. Lightweight enough that it's a foregone conclusion and you can see from other comments that a lot of people are already using one hash function or another. I think you've missed the key architectural point - the 3 column xref tables dramatically reduce disk IO which is much more of a bottleneck than the cpu doing a hash function. The thing is to compare the width of the target table to the width of the xref table.

Many data warehouse tables can be quite wide. Reading a 50 column wide target table just to get the hash value is many times slower than reading the 3 column wide xref table. With SQL Server this can be mitigated with an 'included column' on an index for the target table but some database engines don't have this. Earlier versions of SS don't have it. Also consider development time; it is faster to develop a standardized routine rather than if one mixes and matches techniques. Finally, data warehouses have a strong tendency to change so when a target table that's only 10 columns wide suddenly has a requirement for an additional 25 columns, you would have to rewrite its ETL if you picked the all column compare method because it was only 10 to start.
nishant30000
nishant30000
Forum Newbie
Forum Newbie (2 reputation)Forum Newbie (2 reputation)Forum Newbie (2 reputation)Forum Newbie (2 reputation)Forum Newbie (2 reputation)Forum Newbie (2 reputation)Forum Newbie (2 reputation)Forum Newbie (2 reputation)

Group: General Forum Members
Points: 2 Visits: 10
I think you've missed the key architectural point - the 3 column xref tables dramatically reduce disk IO which is much more of a bottleneck than the cpu doing a hash function. The thing is to compare the width of the target table to the width of the xref table.

Many data warehouse tables can be quite wide. Reading a 50 column wide target table just to get the hash value is many times slower than reading the 3 column wide xref table.


I now understand your point in using xref table, thanks for the clarification.
miranwara
miranwara
Forum Newbie
Forum Newbie (4 reputation)Forum Newbie (4 reputation)Forum Newbie (4 reputation)Forum Newbie (4 reputation)Forum Newbie (4 reputation)Forum Newbie (4 reputation)Forum Newbie (4 reputation)Forum Newbie (4 reputation)

Group: General Forum Members
Points: 4 Visits: 158
Hi Mike,

I am trying to do what you said concatenate all values to varbinary and then hash

when you say concatenate you mean add the binary value of each column as follows:


DECLARE @ColList Varchar(max), @SQLStatment VARCHAR(MAX)
SET @ColList = ''
select @ColList = @ColList + 'convert(varbinary(max), isnull(' + Name + ',0) )+ ' from syscolumns where id = object_id('aggregate')
SELECT @SQLStatment = 'SELECT ' + Substring(@ColList,1,len(@ColList)-1) + ' From aggregate'
select @SQLStatment
exec(@SQLStatment)
Go


Permissions

You can't post new topics.
You can't post topic replies.
You can't post new polls.
You can't post replies to polls.
You can't edit your own topics.
You can't delete your own topics.
You can't edit other topics.
You can't delete other topics.
You can't edit your own posts.
You can't edit other posts.
You can't delete your own posts.
You can't delete other posts.
You can't post events.
You can't edit your own events.
You can't edit other events.
You can't delete your own events.
You can't delete other events.
You can't send private messages.
You can't send emails.
You can read topics.
You can't vote in polls.
You can't upload attachments.
You can download attachments.
You can't post HTML code.
You can't edit HTML code.
You can't post IFCode.
You can't post JavaScript.
You can post emoticons.
You can't post or upload images.

Select a forum

































































































































































SQLServerCentral


Search