Devin Knight

Devin is a BI consultant at Pragmatic Works Consulting. Previously, he tech edited the book Professional Microsoft SQL Server 2008 Integration Services and was an author of the book Knight's 24-Hour Trainer: Microsoft SQL Server 2008 Integration Services. Devin has spoken at conferences such as PASS and at several SQL Saturday events. He is a contributing member of the Business Intelligence Special Interest Group (SIG) for PASS, serving as a leader in the SSIS Focus Group. Making his home in Jacksonville, FL, Devin is a participating member of the local users' group (JSSUG).

SSIS – CHECKSUM Transformation

CHECKSUM is a T-SQL function that computes a hash value over a list of arguments. The great thing about creating a hash value for a table is that it gives you a unique value to represent each row.
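
For instance, a minimal sketch (the table name dbo.SomeTable is hypothetical):

-- CHECKSUM accepts an explicit argument list...
SELECT CHECKSUM('a', 1, 100.0);

-- ...or * to hash every column of each row.
SELECT CHECKSUM(*) AS RowHash FROM dbo.SomeTable;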

When would you use the CHECKSUM Function?

An example of using a CHECKSUM is to store the unique value for the entire row in a column that can be used later for comparison. This is useful when all of the rows in a table need to be compared in order to perform an UPDATE.
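
A minimal sketch of that pattern, assuming hypothetical CustomerSource and CustomerDest tables and a stored RowHash column:

-- Recompute the hash from the source and update only the rows
-- whose stored hash no longer matches.
UPDATE dst
SET    dst.FirstName = src.FirstName,
       dst.LastName  = src.LastName,
       dst.RowHash   = CHECKSUM(src.FirstName, src.LastName)
FROM   dbo.CustomerDest AS dst
       JOIN dbo.CustomerSource AS src ON src.CustomerID = dst.CustomerID
WHERE  dst.RowHash <> CHECKSUM(src.FirstName, src.LastName);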

When would you use the CHECKSUM Transformation?

Now that you have a little background on what CHECKSUM is, it's time to learn how to use it in SSIS. The CHECKSUM Transformation is not included in SSIS by default; it was developed by Konesans and can be downloaded from www.sqlis.com.


Normally, in a package that is required to check for changes in data, every single column from the destination must be compared to the incoming source. With the CHECKSUM Transform creating a hash value for each row, you only have to compare one column: your CHECKSUM value. A Conditional Split expression like the following redirects the rows that changed:

Checksum != DestCheckSum

To configure the CHECKSUM Transform, open the editor, check the columns that you want the hash value created from, and give the output column a name.

This is a great tool and a big time saver when developing!

Comments

Posted by Steve Jones on 5 January 2009

Good to see you blogging, and nice summary here. I might include a slightly longer example of where this makes sense, perhaps a scenario where you've used it.

Posted by knight_devin@hotmail.com on 5 January 2009

Thanks for the advice.  Maybe this topic would be a good article if I elaborated a lot more.

Posted by Steve Jones on 6 January 2009

It definitely would be. It's a good blog length, and a specific example or two would help, but I wouldn't walk through it step by step. If you do that, make it an article and submit it to me or another publisher.

Posted by markve on 8 January 2009

It does not work for the CRC32 algorithm. Create a SQL table ChecksumTest with one or more columns. Fill it with the same values for about 5 rows. You will see that the "Original" and "FrameworkChecksum" algorithms work fine: the results in all rows are the same. For the CRC32 algorithm the value is different for each row.
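
A minimal repro along those lines (table and values are only an example):

CREATE TABLE ChecksumTest (Col1 INT, Col2 VARCHAR(50));
INSERT INTO ChecksumTest (Col1, Col2)
VALUES (1, 'same'), (1, 'same'), (1, 'same'), (1, 'same'), (1, 'same');
-- Feed these identical rows through the transform: "Original" and
-- "FrameworkChecksum" return the same hash for every row, but CRC32
-- returns a different value for each row.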

Anyone got an idea what's going wrong here?

Posted by knight_devin@hotmail.com on 8 January 2009

That's interesting; I haven't tried using that algorithm. I did find this link explaining the CRC32 algorithm, and it says not to use it for data integrity. I don't really see a good reason why, though.

www.accuhash.com/what-is-crc32.html

Posted by Hubajube on 13 January 2009

You can't use CRC32 if you must guarantee that you catch every change.  That is, there exist many inputs that have the same checksum.

The CRC32 algorithm bug over at sqlis.com is a different beast.  It seems like they are including the row number in the input somehow, meaning that identical values will produce different checksums if they exist in different rows.  That's unfortunate because it makes this algorithm unusable.

Posted by rmg66 on 13 January 2009

I discovered the same "bug" myself. Since the other two options (Original and FrameworkChecksum) have known issues, I want to avoid using them, but this issue with the CRC32 option makes it impossible to use. I have briefly reviewed the code (I'm not a .NET programmer by any means), but it seems the stream that you are pushing to the hash function is somehow grabbing the internal row number on each row that it processes.

I can't imagine this is by design, unless I'm missing something here.

Posted by raghava_2507-1128442 on 19 August 2010

I have created a checksum column (out of three columns) in a table. In the SSIS checksum transformation I have also selected the same three columns and opted for the CRC-32 algorithm. The problem is that there is no match between the hash value generated by SQL Server and the value generated by the SSIS checksum transformation.

Please help.

Posted by developermax on 30 May 2011

Hi,

Another way to use CHECKSUM for comparing rows between source and destination is to add the CHECKSUM function directly in the queries for the source and the lookup, and then use a Conditional Split to redirect the rows that changed.
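
For example, a sketch of that source query (table and column names are hypothetical):

SELECT CustomerID,
       FirstName,
       LastName,
       CHECKSUM(FirstName, LastName) AS RowHash
FROM   dbo.CustomerSource;
-- Use the same CHECKSUM expression in the Lookup query, then a
-- Conditional Split such as RowHash != LookupRowHash redirects
-- the changed rows.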

Posted by westers on 25 August 2011

Using the T-SQL CHECKSUM function is unreliable for change detection on numeric types, as the following example shows.

-- Changing 10.0 to 100.0 should produce a different checksum,
-- but it doesn't. The same applies to BINARY_CHECKSUM.
SELECT CHECKSUM('a', 10.0), CHECKSUM('a', 100.0)

Posted by knight_devin@hotmail.com on 1 November 2011

Just catching up on comments, guys. I actually don't recommend using this tool now, for the reasons documented in the comments. I use the T-SQL function HASHBYTES instead.
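
For example, a minimal sketch (algorithm choice and column names are illustrative):

-- HASHBYTES takes an algorithm name and a single string/binary input,
-- so multiple columns are concatenated with a delimiter first.
SELECT CustomerID,
       HASHBYTES('SHA1', FirstName + '|' + LastName) AS RowHash
FROM   dbo.CustomerSource;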

Posted by clausm73 on 28 March 2012

According to information from the link:

www.accuhash.com/what-is-crc32.html

mentioned earlier in the thread, CRC32 is not good for data validation. But cryptographic hash functions, such as MD5 or SHA-1, can be used to verify data integrity.

Well, just head over to the Codeplex.com site for the SSIS Multiple Hash data flow component to get support for those algorithms (amongst others):

http://ssismhash.codeplex.com/

Just install it and replace the CHECKSUM component shown in this article's data flow sample, and you have reliable data validation.

Posted by clausm73 on 28 March 2012

My previous statement about a simple replacement isn't quite true.

Depending on the algorithm used, the output column from the SSIS Multiple Hash component containing the hash value can be of type [DT_BYTES] (which is stored as a BINARY(n) value in the database).

The problem is that when comparing an existing hash value with a generated one using a Conditional Split component, you receive the error:

"DT_BYTES cannot be used with binary operator "!="."

A workaround to this problem is to convert both sides of the expression to the same string type:

(DT_WSTR, 16)[NewHashValue] != (DT_WSTR, 16)[ExistingHashValue]

Performance-wise, I have no idea if this is a recommended way of doing the comparison, or if it's better to convert the hash values to hexadecimal strings, store them in a string data type, and compare those, but it gets the job done.

Either way, it seems that to use the Conditional Split for comparison, both expressions should be converted to a string data type.

Possibly a Script component using a byte-by-byte comparison would be faster. But I'm no .NET programmer, so I stick with the Conditional Split string comparison solution.

Posted by clausm73 on 28 March 2012

Yet another correction.

Using (DT_WSTR, 16) will result in a truncation error when we convert from DT_BYTES(16). I guess this has to do with the fact that wide strings are 2-byte characters, so we should multiply our byte length by 2 to avoid the truncation.

So the Conditional Split expression becomes:

(DT_WSTR, 32)[NewHashValue] != (DT_WSTR, 32)[ExistingHashValue]

Note: If you somehow compare two binary values with T-SQL, they are returned as hexadecimal strings with a leading "0x0" - that's 3 extra characters, making any DT_BYTES(16) value stored as a string data type have a total length of 32 + 3 characters.

See more about this when using an MD5 hash in T-SQL:

www.lazerwire.com/.../ms-sql-md5-hash.html
