Roll Your Own Materialized Views

  • Comments posted to this topic are about the item Roll Your Own Materialized Views

  • I would primarily note that CHECKSUM() returns a 4 byte CRC-32 value, and therefore has a very small number of possible values; a four and a half billion row table (i.e. data warehouse) absolutely guarantees collisions.

    Unfortunately, the chance of a collision is much greater than that. For trivial collisions, try:

    SELECT CHECKSUM(1,1,1,1)

    SELECT CHECKSUM(0,1,1,1,1)

    SELECT CHECKSUM(0,0,1,1,1,1)

    I'd recommend changing this to actually do a full field by field comparison (noting which fields are NOT NULL and which allow NULL values). If you truly insist on a hash, your best, still fairly simple, bet is probably to use SHA512 out of some CLR code. HASHBYTES with SHA1 is built into SQL 2005+, though SHA1 is still not very good... but 160 bits of SHA1 hash is immensely better than 32 bits of CRC-32.

    Untested reference for SHA512:

    http://sqlblog.com/blogs/michael_coles/archive/2009/04/12/let-s-hash-a-blob.aspx

  • Point well taken, checksum is not very unique. The code should mitigate this, if there are duplicates you may end up deleting records that haven't changed, then adding them back. Little extra work for server, that's ok...

  • As another example of the issue: assume the data actually changes in the base table, but the checksum of the new data is the same as the checksum of the old data.

    From here on out, the view will continue to return invalid (old) results until the data changes to a data set that returns a different checksum.

    When you have bulk changes on large numbers of rows, large numbers of changes, or very specific collision generating changes (adding leading zeroes in particular, for CHECKSUM()), the code is more likely to fail to notice changes that actually happened.

  • Thanks for the script.

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply