HASHBYTES can help quickly load a Data Warehouse
Posted Wednesday, April 14, 2010 7:17 AM
Forum Newbie

Group: General Forum Members
Last Login: Tuesday, May 11, 2010 7:16 AM
Points: 6, Visits: 26
Wrapping HASHBYTES with CHECKSUM is not bad as long as you include all columns in your join as indicated, and create the index to include all columns required for the join. If you only use the result of the CHECKSUM in your join, then yes, it is bad!
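For illustration, here's a minimal sketch of that pattern with hypothetical table and column names (NULL handling omitted); the CHECKSUM gives a small, indexable value, but the join still compares the real columns:

-- Persisted computed column over the hash keeps the indexed value narrow (4 bytes).
ALTER TABLE dbo.CustomerTarget
    ADD RowCheck AS CHECKSUM(HASHBYTES('MD5', FirstName + '|' + LastName + '|' + City)) PERSISTED;

CREATE INDEX IX_CustomerTarget_RowCheck
    ON dbo.CustomerTarget (RowCheck)
    INCLUDE (FirstName, LastName, City);   -- cover the columns the join actually compares

SELECT s.CustomerKey
FROM   dbo.CustomerStage  AS s
JOIN   dbo.CustomerTarget AS t
  ON   t.RowCheck  = CHECKSUM(HASHBYTES('MD5', s.FirstName + '|' + s.LastName + '|' + s.City))
  AND  t.FirstName = s.FirstName           -- never trust the checksum alone
  AND  t.LastName  = s.LastName
  AND  t.City      = s.City;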
Post #903107
Posted Wednesday, April 14, 2010 7:18 AM


Ten Centuries

Group: General Forum Members
Last Login: Wednesday, September 24, 2014 1:20 PM
Points: 1,276, Visits: 1,135
I wouldn't expect many collisions with MD5 unless you're processing 2^64 or more rows. If you're getting collisions with MD5 or SHA I would immediately suspect the concatenation function you're using.
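As a rough sanity check on that figure, assuming MD5's 128-bit output behaves like a uniform random function, the birthday approximation gives

E[\text{colliding pairs}] \approx \binom{n}{2} \cdot 2^{-128} \approx \frac{n^2}{2^{129}}, \qquad n = 2^{64} \Rightarrow E \approx \frac{2^{128}}{2^{129}} = 0.5,

so even at 2^64 rows you would expect only about half a collision on average.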
Post #903108
Posted Wednesday, April 14, 2010 7:29 AM
SSCrazy

Group: General Forum Members
Last Login: Tuesday, November 4, 2014 12:48 PM
Points: 2,818, Visits: 2,566
Fascinating article. Well thought out and presented. Definitely an approach I will keep in mind, particularly when source data may be out of sync at load time, as presented here.

Thanks.
Post #903117
Posted Wednesday, April 14, 2010 7:33 AM
Forum Newbie

Group: General Forum Members
Last Login: Wednesday, September 7, 2011 7:17 AM
Points: 2, Visits: 57
Mike C (4/14/2010)
I wouldn't expect many collisions with MD5 unless you're processing 2^64 or more rows. If you're getting collisions with MD5 or SHA I would immediately suspect the concatenation function you're using.


Ha! I apologize. Not sure why, but I was thinking CRC32 this entire time. MD5 and SHA1 both work great. CRC32 will result in collisions.
Post #903122
Posted Wednesday, April 14, 2010 7:49 AM
Old Hand

Group: General Forum Members
Last Login: Thursday, May 8, 2014 10:08 PM
Points: 358, Visits: 397
hennie7863 (4/14/2010)
I have not read the article in detail, so correct me if I'm wrong, but HASHBYTES accepts only one column (of one character type), and in this article the HASHBYTES calculation is kept simple by using one column.

Currently I'm using the far-from-perfect CHECKSUM calculation (plus additional field comparisons) for comparing records. It accepts multiple columns and types, and when indexed it will be faster than the HASHBYTES functionality.

This article drew my attention because CHECKSUM is not perfect: multiple records can lead to the same checksum result. I'm interested in a solution using HASHBYTES with multiple columns and multiple types. Still looking for a good solution.

Yes, I did make the example overly simplified because I just wanted to trigger an ah-ha! moment. The article was already running long without getting into a lot of what-if situations. To address your concern, it's pretty easy to convert various types of columns into a common type and concatenate them together in a function that would also compute the MD5 sum. If the resulting length exceeds the HASHBYTES function's input limit, then this method isn't for you, or you need to come up with a way to shorten the values somehow. There's no reason why an extremely wide column couldn't be put into two or three checksums in the XREF table.
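As a rough sketch of such a function, assuming some made-up column names and types and a '|' delimiter (NULLs have to be guarded because string + NULL yields NULL):

CREATE FUNCTION dbo.fn_RowHash
(
    @CustomerID INT,
    @FirstName  NVARCHAR(50),
    @BirthDate  DATETIME,
    @Balance    DECIMAL(18,2)
)
RETURNS BINARY(16)   -- MD5 output is always 128 bits
AS
BEGIN
    RETURN HASHBYTES('MD5',
        CONVERT(NVARCHAR(12), @CustomerID)                  + N'|' +
        ISNULL(@FirstName, N'')                             + N'|' +
        ISNULL(CONVERT(NVARCHAR(30), @BirthDate, 126), N'') + N'|' +   -- ISO 8601 keeps the text deterministic
        CONVERT(NVARCHAR(30), @Balance));
END;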

Obviously this isn't the perfect method for everyone but I think some people will find it useful.
Post #903138
Posted Wednesday, April 14, 2010 7:56 AM
Old Hand

Group: General Forum Members
Last Login: Thursday, May 8, 2014 10:08 PM
Points: 358, Visits: 397
Mike C (4/13/2010)
Nice article! I implemented something very similar for loading data warehouses and datamarts a while back. One thing to keep in mind is that you can actually use a BINARY(16) column to hold MD5 hashes or BINARY(20) for SHA-1, since they're guaranteed to return a binary hash with an exact length of 128 bits or 160 bits, respectively. Also, HASHBYTES is restricted in what it can hash -- there's an 8,000-byte restriction on the input, so you can't hash LOB data or columns wider than 8,000 bytes with it effectively. You can work around that limitation by chopping up your input or by generating the hashes in your ETL tool of choice (SSIS script components allow direct access to the System.Security.Cryptography namespace, for example).


Thanks for the positive feedback!

The tricky thing about keeping the output as binary is that it's hard on the eyes when pasted into article text for illustration. No, it doesn't have to be converted in actual use. Also, not all MD5 functions return binary, depending on the ETL tool. I used HASHBYTES to stay on topic with SQL Server.
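For anyone storing the hash rather than displaying it, here's a minimal sketch of the fixed-width columns Mike describes (the table and column names are hypothetical):

CREATE TABLE dbo.CustomerXref
(
    CustomerKey INT        NOT NULL,
    RowHashMD5  BINARY(16) NOT NULL,   -- MD5 always produces 128 bits
    RowHashSHA1 BINARY(20) NULL        -- SHA-1 always produces 160 bits
);

INSERT INTO dbo.CustomerXref (CustomerKey, RowHashMD5, RowHashSHA1)
SELECT CustomerKey,
       HASHBYTES('MD5',  FirstName + '|' + LastName),
       HASHBYTES('SHA1', FirstName + '|' + LastName)
FROM   dbo.DimCustomer;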
Post #903148
Posted Wednesday, April 14, 2010 7:59 AM
Old Hand

Group: General Forum Members
Last Login: Thursday, May 8, 2014 10:08 PM
Points: 358, Visits: 397
tarquin (4/14/2010)
I need to use multiple columns in the hashindex and so use a checksum over hashbytes such as:
checksum(hashbytes('MD5', column1), hashbytes('MD5',column2)) - this works pretty well but is still not 100% unique across all records (in fact even a hashbytes on a single column is not 100% unique across all values).


I'd suggest getting rid of the checksum and using hashbytes with the columns concatenated together into a single call to the function. This should result in a (reasonably) unique entry. Different data types just need to be converted to a common type.
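A hedged sketch of that suggestion, reusing column1 and column2 from the quote above (the table name is made up; the '|' delimiter keeps adjacent values from running together):

SELECT HASHBYTES('MD5',
           CONVERT(NVARCHAR(100), column1) + N'|' +
           CONVERT(NVARCHAR(100), column2)) AS RowHash
FROM   dbo.SourceTable;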
Post #903151
Posted Wednesday, April 14, 2010 8:04 AM
Old Hand

Group: General Forum Members
Last Login: Thursday, May 8, 2014 10:08 PM
Points: 358, Visits: 397
rmd0 (4/14/2010)
One issue I have with this solution is that MD5 *will* have collisions with moderately "wide" tables, especially when bit columns are involved. In fact, I would see as many as two or three collisions in as few as 1 million rows on a table with a dozen columns. It would usually occur when the column that changed was a bit field.


This is not the case. Please see the web page linked at the end of the article concerning the likelihood of MD5 collisions.
Post #903156
Posted Wednesday, April 14, 2010 8:04 AM
SSC Rookie

Group: General Forum Members
Last Login: Friday, November 7, 2014 3:59 AM
Points: 46, Visits: 3,359

magarity (4/14/2010)
Yes, I did make the example overly simplified because I just wanted to trigger an ah-ha! moment. The article was already running long without getting into a lot of what-if situations. To address your concern, it's pretty easy to convert various types of columns into a common type and concatenate them together in a function that would also compute the MD5 sum. If the resulting length exceeds the HASHBYTES function's input limit, then this method isn't for you, or you need to come up with a way to shorten the values somehow. There's no reason why an extremely wide column couldn't be put into two or three checksums in the XREF table.

Obviously this isn't the perfect method for everyone but I think some people will find it useful.


Hi magarity,

Thanks for your reply. The concatenation option had already crossed my mind, and yes, I agree that it is not the best way. I'm afraid that for comparing based on hash values we have to choose the best solution from the limited options available.

I did a little googling and came up with the following options:
* CHECKSUM function
* HASHBYTES function
* SSIS Multiple Hash (3rd party)
* Checksum Transformation from Konesans (3rd party)

The problem I have with 3rd-party (open) components is that when you hit a bug, you're stuck. Hmmm...

Keep writing articles!

Greetz,
Hennie
Post #903157
Posted Wednesday, April 14, 2010 8:33 AM
SSC Eights!

Group: General Forum Members
Last Login: Wednesday, November 19, 2014 12:06 PM
Points: 887, Visits: 2,453
A few comments with the underlying assumption that the desired result is a (potentially) more efficient method that is guaranteed to return the same results as the "long way".

For guaranteed accuracy, use whatever hash you like as a first pass. Whenever the hashes are different, the data is different. Whenever the hashes are the same, run a field-by-field (or equivalent) "long way" comparison to see if the data is actually the same after all.
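A minimal sketch of that two-pass approach with hypothetical tables and columns (NULL handling omitted); the hash quickly settles the rows that differ, and the column comparison removes any doubt about the rest:

-- Pass 1: different hashes always mean different data.
SELECT s.CustomerKey
FROM   dbo.Stage  AS s
JOIN   dbo.Target AS t ON t.CustomerKey = s.CustomerKey
WHERE  t.RowHash <> s.RowHash

UNION ALL

-- Pass 2: equal hashes get the full "long way" comparison to rule out a collision.
SELECT s.CustomerKey
FROM   dbo.Stage  AS s
JOIN   dbo.Target AS t ON t.CustomerKey = s.CustomerKey
WHERE  t.RowHash = s.RowHash
  AND (t.FirstName <> s.FirstName OR t.LastName <> s.LastName OR t.City <> s.City);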

Note that field-concatenation comparisons do not work reliably when variable-length types are mixed in without delimiters (trivial example: "a " + "b" and "a" + " b" are different field pairs that concatenate to the same value; likewise "1" + "1/22/1988" and "" + "11/22/1988").

If you deliberately decide that some non-zero percentage of erroneous matches is acceptable, then test your hash against a "full length" compare to see what your particular data produces. In reality, there are only actual results: probabilities let you predict likelihoods, but most business data I've seen doesn't follow "random input" or normal-distribution rules.

That said, if you rely on probability, plan for what you'll do after a collision happens.
Post #903196