Checksum gotchas?

Question

Checksum gotchas?

Karma-343206

SSCrazy

Points: 2164
More actions
July 28, 2008 at 3:15 am

#99202

Hello,
I've been tasked with building a marketing db. The db will take a proportion of its data from our ERP and CRM systems as well as holding data of its own.
Since reading about the Checksum function some time ago i've been toying with the idea of using it in this project.
The idea is straight forward enough.
Incoming data is cleansed and stored in temporary tables - via DTS. A checksum for each row is calculated as the data is imported.
A query is then run comparing the imported row against the row in the marketing db - inner join on PK left outer join is null on checksum column.
The incoming CRM/ERP data is not modified in the marketing db so if the CRM/ERP row checksum is different i.e. the data has changed - then i can update the non-pk attributes in the marketing db.
New rows simply get appended as required.
I've tested it and it would appear to work, it's pretty quick as i can index the checksum column -that obviously means storing the checksum for each row in the db but the overhead seems minimal compared to the benefit of having it indexed.
The question i have is, are there any gotchas when using checksum in this way?
Thanks
K.

Viewing 8 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply

Jacob Luebbers SSCarpal Tunnel Points: 4864 More actions · Answer 1

If I follow your description correctly you're using CHECKSUM() comparisons to determine whether a newly imported row is different to the stored copy of that row, correct? If so, watch out - the algorithm used is very simplistic (http://www.issociate.de/board/goto/853273/BINARY_CHECKSUM_algorithm.html) and like any one-way hash function there is no guarantee that two rows with the same checksum value are the same, only that two different hash vaues mean the rows are also different - so you run the risk of some changed rows in the imported data not being correctly updated in the target table due to hash collisions from different rows.

How likely is a collision? Probably not very - but can your process accept this small chance of inconsistency? If so great, if not consider alternatives like using a ROWVERSION column in the source table which is compared to a static BINARY column (containing a copy of the rowversion from the previous import) in the marketing tables to detect updated rows.

Regards,

Jacob

Karma-343206 SSCrazy Points: 2164 More actions · Answer 2

Hi,

Thanks for the link 🙂

Just to clarify, i'm using the primary key to verify that the row has or has not been imported previously and the checksum to see if the row is different. I can see that there may be a chance that the Checksum will not have changed even if the row has.

As i have no control over the ERP and CRM solutions -neither of which use SQL Server - adding a ROWVERSION column is not going to be possible in the source data. That leaves 1 alternative i've used previously; first query to determine all previously imported rows using the PK and then an outer join between all columns of that dataset against the stored data to determine rows with no matches i.e. a non-key attribute has changed. Then update the stored data. The other -from reading around a bit - is to implement a more expressive checksum using another type of hashing algorithm in a function, something i think i might try just to see how fast it is.

Thanks

K.

Jacob Luebbers SSCarpal Tunnel Points: 4864 More actions · Answer 3

OK - you're a bit stuck if you can't alter the source tables. Worst case you have to do the brute-force column-by-column comparison. You could still have a win using a checksum or other hash function to detect columns that are different (eg checksums don't match), and then fall back to the brute-force check for all columns where the checksums match to avoid potential collisions. Depends on the ratio of changed vs not changed rows in each import whether it'd be much faster though.

Yes there are better (read: less chance of collision) hash functions than checksum (SQL2005 has some nice builtin ones of which SHA1 is the "best", and there are some 3rd party hashing XPs available for SQL2000), but ALL hash functions have a collision risk of some kind.

Another approach - if your import process can accept "late" updates you could use your fast-but-risky hash-based import for most of the imports (eg hourly, whatever) and then do a less frequent brute-force check (eg daily) to verify that there were no collisions and update any that were found.

Regards,

Jacob

Karma-343206 SSCrazy Points: 2164 More actions · Answer 4

Hi,

uhmm, I've done quite a bit more reading now and added to what you've said i've decided to shelve the idea.

It's one of those ideas that seems perfectly logical, so logical in fact you can't understand why all the text books don't mention it as the obviously great idea it is 😉

At least I now have a fair idea when i'm doing something stupid 😀

Thanks for your time and help

K.

Jacob Luebbers SSCarpal Tunnel Points: 4864 More actions · Answer 5

Happy to help 🙂

Your idea wasn't stupid - just not suitable to your specific task. If your process was tolerant of a collision scenario and could deal with it in other ways it'd be a nice way to improve the performance of your comparisons. Hashes are often useful in other ways, like for indexing very long string/binary columns that otherwise wouldn't be indexable, or as a key for partitioning tables, or for a preliminary comparison of data that is then compared more rigorously afterwards.

Regards,

Jacob

SwePeso SSC-Dedicated Points: 39758 More actions · Answer 6

http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=70832

N 56°04'39.16"
E 12°55'05.25"

SwePeso SSC-Dedicated Points: 39758 More actions · Answer 7

Jacob Luebbers (7/29/2008)
How likely is a collision? Probably not very - but can your process accept this small chance of inconsistency? If so great, if not consider alternatives like using a ROWVERSION column in the source table which is compared to a static BINARY column (containing a copy of the rowversion from the previous import) in the marketing tables to detect updated rows.

Actually it is.

Due to my latest findings about BINARY_CHECKSUM it is rather POWER(16, characters).

If checksum value is less than 4096, there are 256 combinations total that produces the same checksum value. If checksum is less than 65536, there are 4096 combinations of strings that produces the same checksum.

N 56°04'39.16"
E 12°55'05.25"