HASHBYTES can help quickly load a Data Warehouse
Posted Tuesday, May 11, 2010 7:10 AM
SSC Veteran

ziangij, I would definitely recommend reading through some of the comments, but here's my summary:
If you have other logic to prevent or handle collisions, CHECKSUM might be fine. If you want the checksum/hashbytes comparison to drive all of your new/updated/unchanged-row logic, you definitely need HASHBYTES over CHECKSUM.
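For one easy-to-hit CHECKSUM weakness, try this (a quick illustration, assuming the default case-insensitive collation):

-- Under a case-insensitive collation these return the same value,
-- so a case-only change would slip through a CHECKSUM comparison:
SELECT CHECKSUM('Vendor Name') AS cs1, CHECKSUM('VENDOR NAME') AS cs2;

-- HASHBYTES hashes the raw bytes, so the two values differ:
SELECT HASHBYTES('SHA1', 'Vendor Name') AS h1,
       HASHBYTES('SHA1', 'VENDOR NAME') AS h2;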



Rick Todd
Post #919657
Posted Thursday, May 13, 2010 12:49 AM
SSCrazy

Rick Todd (5/11/2010)
ziangij, I would definitely recommend reading through some of the comments, but here's my summary:
If you have other logic to prevent or handle collisions, CHECKSUM might be fine. If you want the checksum/hashbytes comparison to drive all of your new/updated/unchanged-row logic, you definitely need HASHBYTES over CHECKSUM.


Apologies, I should have read all the comments, which I have now done.

Many options have been suggested, with pros and cons for each...
Post #921019
Posted Monday, May 17, 2010 2:09 AM
SSC Rookie

Hi ...

I am currently using a similar method to compare data. The hash value is calculated as follows:

HASHBYTES('SHA1',
      UPPER(ISNULL(CONVERT(varchar(max), Branch), ''))
    + UPPER(ISNULL(CONVERT(varchar(max), VendorID), ''))
    + UPPER(ISNULL(CONVERT(varchar(max), VendorName), ''))
    + UPPER(ISNULL(CONVERT(varchar(max), VendorTradingName), ''))
    + UPPER(ISNULL(CONVERT(varchar(max), RegisteredCompany), ''))
    + UPPER(ISNULL(CONVERT(varchar(max), TypeOfEntity), ''))
    + UPPER(ISNULL(CONVERT(varchar(max), RegistrationNumber), ''))
    -- + ... <other columns>
)

I convert each field to the varchar data type and concatenate them all together to generate a single value to hash.

Do you see any problems with this approach?

Also, I checked the length (using LEN()) of the hash value, and the maximum length I have so far is 20. The column that holds the hash value is currently varbinary(max). Should I change the size of this column?

Thanks in advance.
Post #922745
Posted Monday, May 17, 2010 8:25 AM
SSC Eights!

Assume you have vendorid and vendorname:
AB Abner
A Babner

Because you're converting to variable-length varchar and concatenating with no delimiter, both rows collapse to ABABNER (after your UPPER), so they look the same when they clearly are not.

You have to convert to a fixed width, or put a delimiter in there (and be very certain that the delimiter doesn't actually show up in your data).
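You can see it right in SSMS with the made-up values above:

-- Without a delimiter, both rows hash the same once UPPER is applied:
SELECT HASHBYTES('SHA1', UPPER('AB' + 'Abner')) AS h1,
       HASHBYTES('SHA1', UPPER('A' + 'Babner')) AS h2;     -- identical: both 'ABABNER'

-- With a delimiter between the columns, they stay distinct:
SELECT HASHBYTES('SHA1', UPPER('AB' + '|' + 'Abner')) AS h1,
       HASHBYTES('SHA1', UPPER('A' + '|' + 'Babner')) AS h2;  -- different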
Post #922930
Posted Monday, May 17, 2010 11:18 AM
Old Hand

Cyclone (5/17/2010)
Do you see any problems with this approach?

Also, I checked the length (using LEN()) of the hash value, and the maximum length I have so far is 20. The column that holds the hash value is currently varbinary(max). Should I change the size of this column?


Two problems: you're converting to upper case, and you don't have a delimiter. 'AB' plus 'A' will give the same result as 'A' plus 'BA' with your method. This has been hashed out (pun intended!) in the earlier comments; please read them. It doesn't matter whether the delimiter also appears in your data, just that something is there to break up the concatenation; you can use a space.

Converting to upper case is a little strange; why do it? Check with your SME or data steward to confirm that your target system really should ignore changes in case. No one here can give the definitive answer, although I'd advise against it unless you have a requirement to do so.

The output is always 20 bytes for SHA-1 (and 16 for MD5), but I don't know whether the engine needs extra processing or space to handle (max) instead of just (20). It probably doesn't matter.
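If you want to check for yourself, a quick sanity test:

-- DATALENGTH shows the hash size is fixed regardless of the input length:
SELECT DATALENGTH(HASHBYTES('SHA1', 'short')) AS sha1_short,              -- 20
       DATALENGTH(HASHBYTES('SHA1', REPLICATE('x', 4000))) AS sha1_long,  -- still 20
       DATALENGTH(HASHBYTES('MD5', 'short')) AS md5_short;                -- 16

so varbinary(20) is all the column ever needs for SHA-1.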

The point of my article was the xref tables that hold the checksums and keys; did anyone notice that at all in the debate over whose hash to use???
Post #923022
Posted Monday, May 17, 2010 10:43 PM
SSC Rookie

Hi ...

The reason for converting to upper case is that users often change the data simply by re-entering the same details in all capital letters (for various reasons). We do not want to pick this up as a change in the data if all that changed is the case.

Cheers
Post #923288
Posted Wednesday, July 7, 2010 8:11 AM
Forum Newbie

magarity kerns (4/13/2010)
Comments posted to this topic are about the item HASHBYTES can help quickly load a Data Warehouse (/articles/Data+Warehouse/69679/)


Hi, great article, thanks!

Do you have any timing comparison between the method described in this article and an all-column comparison?

That is, what is the minimum number of rows, or the minimum number of columns, beyond which the time spent calculating the hash outweighs the time saved in the comparison?
Post #948559
Posted Wednesday, July 7, 2010 9:50 AM
Old Hand

That is, what is the minimum number of rows, or the minimum number of columns, beyond which the time spent calculating the hash outweighs the time saved in the comparison?

Calculating the hash is extremely lightweight as data goes through an ETL tool; lightweight enough that it's a foregone conclusion, and you can see from the other comments that a lot of people are already using one hash function or another. I think you've missed the key architectural point: the 3-column xref tables dramatically reduce disk I/O, which is much more of a bottleneck than the CPU computing a hash. The thing to compare is the width of the target table against the width of the xref table.

Many data warehouse tables are quite wide. Reading a 50-column target table just to get the hash value is many times slower than reading the 3-column xref table. In SQL Server this can be mitigated with an 'included column' on an index over the target table, but some database engines don't have included columns, and earlier versions of SQL Server don't either. Also consider development time: it is faster to build one standardized routine than to mix and match techniques. Finally, data warehouses have a strong tendency to change, so when a target table that started out only 10 columns wide suddenly needs an additional 25, you would have to rewrite its ETL if you had picked the all-column compare method.
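To make the shape concrete, here is a minimal sketch of such a 3-column xref table; the names are illustrative, not from the article:

-- Narrow cross-reference table: natural key, surrogate key, row hash.
CREATE TABLE dbo.xref_Vendor (
    VendorBusinessKey varchar(50)   NOT NULL PRIMARY KEY, -- key from the source system
    VendorKey         int           NOT NULL,             -- warehouse surrogate key
    RowHash           varbinary(20) NOT NULL              -- SHA-1 over the tracked columns
);
-- During a load, join staged rows to this table on the business key:
--   no match        -> new row, insert
--   match, hash <>  -> changed row, update
--   match, hash =   -> unchanged row, skip
-- The wide target table is never scanned for the comparison.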
Post #948651
Posted Thursday, July 8, 2010 4:12 AM
Forum Newbie

I think you've missed the key architectural point: the 3-column xref tables dramatically reduce disk I/O, which is much more of a bottleneck than the CPU computing a hash. The thing to compare is the width of the target table against the width of the xref table.

Many data warehouse tables are quite wide. Reading a 50-column target table just to get the hash value is many times slower than reading the 3-column xref table.


I now understand your point about using the xref table; thanks for the clarification.
Post #949136
Posted Friday, December 3, 2010 7:52 AM
Forum Newbie

Hi Mike,

I am trying to do what you said: concatenate all the values to varbinary and then hash.

When you say concatenate, do you mean adding the binary value of each column, as follows?


DECLARE @ColList varchar(max), @SQLStatement varchar(max)
SET @ColList = ''

-- Build 'convert(varbinary(max), isnull(col, 0)) + ' for every column of the table
SELECT @ColList = @ColList + 'convert(varbinary(max), isnull(' + name + ', 0)) + '
FROM syscolumns
WHERE id = OBJECT_ID('aggregate')

-- Trim the trailing '+' and wrap the column list in a SELECT
SELECT @SQLStatement = 'SELECT ' + SUBSTRING(@ColList, 1, LEN(@ColList) - 1) + ' FROM aggregate'

SELECT @SQLStatement
EXEC (@SQLStatement)
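For what it's worth, here is one way the same idea could look with the hashing and a delimiter folded in, per the earlier comments. This is only a sketch: it keeps your 'aggregate' table name, assumes every column converts to varbinary directly, and note that HASHBYTES input is capped at 8000 bytes on SQL Server 2005/2008.

DECLARE @ColList varchar(max), @SQLStatement varchar(max)
SET @ColList = ''

-- NULL-protect each column after the binary conversion (0x is the empty
-- varbinary) and put a 0x00 delimiter byte between columns
SELECT @ColList = @ColList
    + 'ISNULL(CONVERT(varbinary(max), ' + QUOTENAME(name) + '), 0x) + 0x00 + '
FROM sys.columns
WHERE object_id = OBJECT_ID('aggregate')
ORDER BY column_id

-- Drop the trailing '+' and wrap the whole concatenation in HASHBYTES
SELECT @SQLStatement = 'SELECT HASHBYTES(''SHA1'', '
    + SUBSTRING(@ColList, 1, LEN(@ColList) - 1) + ') AS RowHash FROM aggregate'

EXEC (@SQLStatement)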
Post #1029889