
HASHBYTES can help quickly load a Data Warehouse
Posted Tuesday, April 13, 2010 10:05 PM
Old Hand


Group: General Forum Members
Last Login: Thursday, May 8, 2014 10:08 PM
Points: 358, Visits: 397
Comments posted to this topic are about the item HASHBYTES can help quickly load a Data Warehouse
Post #902868
Posted Tuesday, April 13, 2010 10:40 PM


Ten Centuries


Group: General Forum Members
Last Login: Wednesday, September 24, 2014 1:20 PM
Points: 1,276, Visits: 1,135
Nice article! I implemented something very similar for loading data warehouses and data marts a while back. One thing to keep in mind is that you can use a BINARY(16) column to hold MD5 hashes, or BINARY(20) for SHA-1, since they are guaranteed to return binary hashes of exactly 128 bits and 160 bits respectively. Also, HASHBYTES is restricted in what it can hash -- there's an 8,000-byte limit on the input, so you can't effectively hash LOB data or columns wider than 8,000 bytes with it. You can work around that limitation by chopping up your input or by generating the hashes in your ETL tool of choice (SSIS script components allow direct access to the System.Security.Cryptography namespace, for example).
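
For illustration, a minimal sketch of what that looks like (table and column names are made up for the example; NULL handling is omitted):

-- BINARY(16) holds the 128-bit MD5 hash; BINARY(20) would hold a 160-bit SHA-1 hash
CREATE TABLE dbo.DimCustomer
(
    CustomerKey  INT IDENTITY(1,1) PRIMARY KEY,
    CustomerName VARCHAR(100) NOT NULL,
    City         VARCHAR(100) NOT NULL,
    RowHash      BINARY(16)   NOT NULL
);

INSERT INTO dbo.DimCustomer (CustomerName, City, RowHash)
SELECT CustomerName,
       City,
       HASHBYTES('MD5', CustomerName + '|' + City)  -- input must stay under 8,000 bytes
FROM   Stage.Customer;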

Mike C
Post #902873
Posted Wednesday, April 14, 2010 3:20 AM
SSC Rookie


Group: General Forum Members
Last Login: Today @ 1:57 AM
Points: 46, Visits: 3,360
I have not read the article in detail, so correct me if I'm wrong, but HASHBYTES accepts only one column (of a single character type), and in this article the hash calculation is kept simple by using one column.

Currently I'm using the far-from-perfect CHECKSUM calculation (plus additional field comparisons) for comparing records. It accepts multiple columns and types, and when indexed it is faster than the HASHBYTES approach.

This article drew my attention because CHECKSUM is not perfect: multiple records can produce the same checksum result. I'm interested in a solution that uses HASHBYTES with multiple columns and multiple types. Still looking for a good solution.
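
For reference, a rough sketch of the two approaches as I understand them (column names are just examples): CHECKSUM takes a list of columns of mixed types directly, while HASHBYTES takes a single character or binary expression.

-- CHECKSUM: multiple columns of mixed types, but weak uniqueness
SELECT CHECKSUM(FirstName, LastName, BirthDate) AS RowChecksum
FROM   Stage.Customer;

-- HASHBYTES: stronger hash, but only one character/binary input
SELECT HASHBYTES('MD5', FirstName) AS RowHash
FROM   Stage.Customer;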
Post #902947
Posted Wednesday, April 14, 2010 6:26 AM
Forum Newbie


Group: General Forum Members
Last Login: Tuesday, May 11, 2010 7:16 AM
Points: 6, Visits: 26
Hi

I've been developing a DWH for about three years and started using this technique about a year ago. I started off with CHECKSUM, but it was way off in terms of uniqueness. With SQL 2005, HASHBYTES improved this, but you can't apply it over multiple columns. The MD5 hash is much quicker than the other algorithms but less unique; I also tried SHA-1, but it is very slow in comparison.

I need to use multiple columns in the hash index, so I use a CHECKSUM over HASHBYTES, such as:
checksum(hashbytes('MD5', column1), hashbytes('MD5', column2)) - this works pretty well but is still not 100% unique across all records (in fact, even HASHBYTES on a single column is not 100% unique across all values).

I suggest you wrap HASHBYTES in a UDF so that a change is easy to make for all ETL loads: instead of doing hashbytes('MD5', column1), do dbo.fnHash(column1).
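
Something along these lines, assuming a varchar input (the function name follows the suggestion above; the signature is just an example):

CREATE FUNCTION dbo.fnHash (@Input VARCHAR(8000))
RETURNS BINARY(16)
AS
BEGIN
    -- Change the algorithm in one place for all ETL loads
    RETURN HASHBYTES('MD5', @Input);
END;

If you later switch to SHA-1, the return type would need to grow to BINARY(20).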

A trick to improve uniqueness when using CHECKSUM is to convert dates to strings, e.g. checksum(convert(varchar(10), dateColumn, 102)). Beware of floats: they cause problems in checksums because even equal values may give different results - convert them to decimal(38, 8). As a final point, put strings as the first column in the checksum: checksum(hashbytes('MD5', stringColumn), intColumn) is more unique than checksum(intColumn, hashbytes('MD5', stringColumn)).
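
Put together, those conversions look roughly like this (column names are invented for the example):

SELECT CHECKSUM(
           HASHBYTES('MD5', StringColumn),           -- strings first
           CONVERT(VARCHAR(10), DateColumn, 102),    -- dates as yyyy.mm.dd strings
           CONVERT(DECIMAL(38, 8), FloatColumn),     -- floats stabilised as decimal
           IntColumn
       ) AS RowChecksum
FROM   Stage.SomeTable;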

When you run your update query into the DWH, you must also include all hashed columns in your join, such as: "join Stage.Hash = DWH.Hash and Stage.Column1 = DWH.Column1" - that combination is always 100% unique. This has implications for the index on your hash column: it must have the hash as the leading column and then include all other columns that are part of the hash, such as: create index index_name on table_name (HashIndex, Column1).
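
As a sketch, with made-up table and column names:

-- Compare on the hash AND on every column that went into it
SELECT s.*
FROM   Stage.Customer AS s
JOIN   DWH.DimCustomer AS d
       ON  d.HashIndex = s.HashIndex
       AND d.Column1   = s.Column1;

-- Index with the hash as the leading column, followed by the hashed columns
CREATE INDEX IX_DimCustomer_HashIndex
    ON DWH.DimCustomer (HashIndex, Column1);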

This will save you a lot of pain...
Post #903056
Posted Wednesday, April 14, 2010 6:34 AM


Ten Centuries


Group: General Forum Members
Last Login: Wednesday, September 24, 2014 1:20 PM
Points: 1,276, Visits: 1,135
When I used HASHBYTES, my solution was to convert all columns to varbinary and concatenate them into one value to pass to HASHBYTES. It's pretty easy to auto-generate the correct statements using INFORMATION_SCHEMA. The downsides for me were that the data must reside on the server prior to hash generation, and that pesky 8,000-byte limitation. I found doing it in SSIS to be more efficient, and it got around the 8,000-byte limit. One interesting finding was that the .NET hash functions for MD5 and SHA-1 generated the same hashes as HASHBYTES for string/character data under 8,000 bytes when I tested. That could be potentially useful, although I didn't do exhaustive testing on it.
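
Roughly like this, with hypothetical column names (the concatenated input still has to fit within 8,000 bytes; the statement itself can be generated from INFORMATION_SCHEMA.COLUMNS):

SELECT HASHBYTES('MD5',
           CONVERT(VARBINARY(200), FirstName)  +
           CONVERT(VARBINARY(200), LastName)   +
           CONVERT(VARBINARY(10),  BirthDate)  +
           CONVERT(VARBINARY(4),   CustomerTypeId)
       ) AS RowHash
FROM   Stage.Customer;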

I also posted a Connect item requesting CHECKSUM-style syntax to generate HASHBYTES-style hashes a few months ago. I'll post a link when I get to a real computer.
Post #903064
Posted Wednesday, April 14, 2010 6:47 AM
SSC Rookie


Group: General Forum Members
Last Login: Today @ 1:57 AM
Points: 46, Visits: 3,360

I need to use multiple columns in the hash index, so I use a CHECKSUM over HASHBYTES, such as:
checksum(hashbytes('MD5', column1), hashbytes('MD5', column2)) - this works pretty well but is still not 100% unique across all records (in fact, even HASHBYTES on a single column is not 100% unique across all values).

As you say: this is still not unique.


I suggest you wrap HASHBYTES in a UDF so that a change is easy to make for all ETL loads: instead of doing hashbytes('MD5', column1), do dbo.fnHash(column1).

The problem here is that the suggested function takes only one column. Perhaps I have to build something like an array and pass that to the function; I know there are some solutions for this.

I want something like this in a view:

SELECT
    Field1,
    Field2,
    fnHashCalc(Field1, Field2) AS Meta_Hash
FROM view, table
....
WHERE
    Table.Meta_Hash <> fnHashCalc(Field1, Field2)

It's not correct SQL, but I hope you get my point. I can't do calculations before this query because it's a view.

A CHECKSUM-like function for hashes would be great, yes. It would help with building data warehouses further on. Checking whether you have loaded a record before, across millions and millions of records and with good performance, is one of my main concerns.
Post #903075
Posted Wednesday, April 14, 2010 6:57 AM
Forum Newbie


Group: General Forum Members
Last Login: Tuesday, May 11, 2010 7:16 AM
Points: 6, Visits: 26
In reply to: fnHashCalc(field1, Field2) AS Meta_Hash

It's not the most elegant solution, but checksum(fnHash(field1), fnHash(field2)) as Meta_Hash will do what you're looking for...
Post #903087
Posted Wednesday, April 14, 2010 7:03 AM
SSC Rookie


Group: General Forum Members
Last Login: Today @ 1:57 AM
Points: 46, Visits: 3,360
ahh i get it.. thanx
Post #903092
Posted Wednesday, April 14, 2010 7:09 AM


Ten Centuries


Group: General Forum Members
Last Login: Wednesday, September 24, 2014 1:20 PM
Points: 1,276, Visits: 1,135
Wrapping HASHBYTES with CHECKSUM is not a good idea. CHECKSUM generates a 32-bit hash using a simple bit-shift + XOR algorithm, so it provides very poor uniqueness, with the first collision expected after about 2^16 values. HASHBYTES generates 128-bit or 160-bit hashes. For more uniqueness you can use CLR or SSIS script components/custom components to generate SHA-256, SHA-384 or SHA-512 hashes.
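
A quick way to see the difference (assuming a typical case-insensitive collation, under which CHECKSUM treats the two strings as equal):

SELECT CHECKSUM('McCavity')          AS Checksum1,
       CHECKSUM('MCCAVITY')          AS Checksum2,   -- same 32-bit value
       HASHBYTES('MD5', 'McCavity')  AS Hash1,
       HASHBYTES('MD5', 'MCCAVITY')  AS Hash2;       -- different 128-bit values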
Post #903099
Posted Wednesday, April 14, 2010 7:10 AM
Forum Newbie


Group: General Forum Members
Last Login: Wednesday, September 7, 2011 7:17 AM
Points: 2, Visits: 57
I've implemented a solution similar to this one, except my MD5 was originally calculated in an SSIS package. This makes it easier to scale the solution.

One issue I have with this solution is that MD5 *will* have collisions with moderately "wide" tables, especially when bit columns are involved. In fact, I would see as many as two or three collisions in as few as 1 million rows on a table with a dozen columns, usually when the column that changed was a bit field.

In fact, I would go as far as saying that MD5 is nearly useless for any serious data warehouse scenario. Instead, SHA-1 should be used, and it should be implemented inside an SSIS pipeline component, not inside the database itself. This lets you keep duplicate data out of the database to begin with, rather than sorting through it in staging tables, which improves performance.
Post #903103