• Hugo Kornelis (2/6/2016)


    sestell1 (2/5/2016)


    david.dilworth (2/5/2016)


    I had to look that function up as I hadn't seen it before.

    Is anyone able to provide a real-world example demonstrating the usage? What I'm interested to know is what business requirement was solved by using it. Thanks.

    It's new to me too, but I think you could use it to detect changes within a range of data.

    No, checksums are never a reliable method for detecting changes. Since the number of distinct outputs from a checksum computation is less than the number of distinct inputs, it is unavoidable that there are multiple inputs with the same output. In other words, two completely different sets of data can result in the same CHECKSUM_AGG value.
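
    To make that concrete, here is a tiny demo (just a sketch you can paste into SSMS; it relies on the even-count cancellation described further down, i.e. CHECKSUM_AGG behaving like an XOR-style aggregate for int input): two obviously different sets that produce the same CHECKSUM_AGG.

    -- Pairs of equal values cancel out in an XOR-style aggregate, so both of
    -- these clearly different sets should return the same checksum (zero).
    SELECT CHECKSUM_AGG(v) AS cs_a FROM (VALUES (7), (7)) AS a(v);
    SELECT CHECKSUM_AGG(v) AS cs_b FROM (VALUES (13), (13), (42), (42)) AS b(v);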

    In the case of checksum_agg there are only 32 bits of output, so collisions will be much more frequent than if a decent redundancy function (not necessarily one suitable for cryptographic purposes) were used to produce, say, 128 bits.
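
    If you do want a lot more than 32 bits from within SQL Server, one option - just a sketch, with an invented table and column, and note that HASHBYTES input was capped at 8000 bytes before SQL Server 2016 - is to hash an ordered, delimited concatenation of the values instead of aggregating them:

    -- 256 bits via SHA2_256 (MD5 would give 128); the ORDER BY makes the
    -- concatenated string, and therefore the hash, independent of row order.
    SELECT HASHBYTES('SHA2_256',
                     (SELECT CAST(Qty AS varchar(11)) + ','
                      FROM dbo.SomeTable
                      WHERE GroupID = 42
                      ORDER BY Qty
                      FOR XML PATH(''))) AS group_hash;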

    Checksum_agg is what we used to call vertical parity in the very early days of data transmission (whether between or within computers/peripherals). It has such incredibly poor behaviour (failing to detect changes) in the presence of the noise patterns typically encountered that just about every use of it was taken over by some form of CRC, based on the paper by Peterson and Brown published in the Proceedings of the IRE (called the Proceedings of the IEEE since 1963) in January 1961.

    You can use it for the reverse, though: detecting when two sets of values are identical. When the checksum values are different, you are sure that the sets are different. When the checksum values are the same, you still have to do a full test on all values, because they can still be different sets. But using the checksum will seriously reduce the number of times you have to do a full comparison, because most sets with a difference will be filtered out by the cheaper test.
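
    As a rough illustration of that pattern (hypothetical dbo.Source and dbo.Target tables with CustomerID and Qty columns; a sketch, not tested): compare per-group checksums first, and only run the full comparison for groups whose checksums match.

    WITH s AS (SELECT CustomerID, CHECKSUM_AGG(Qty) AS cs
               FROM dbo.Source GROUP BY CustomerID),
         t AS (SELECT CustomerID, CHECKSUM_AGG(Qty) AS cs
               FROM dbo.Target GROUP BY CustomerID)
    SELECT s.CustomerID,
           CASE WHEN s.cs <> t.cs
                THEN 'definitely different'        -- the cheap test settles it
                ELSE 'needs the full comparison'   -- could still be different sets
           END AS verdict
    FROM s
    JOIN t ON t.CustomerID = s.CustomerID;

    -- The full comparison for one group whose checksums matched, e.g. EXCEPT in
    -- both directions (EXCEPT ignores duplicate rows, so also compare counts if
    -- duplicates matter):
    -- SELECT Qty FROM dbo.Source WHERE CustomerID = @id
    -- EXCEPT
    -- SELECT Qty FROM dbo.Target WHERE CustomerID = @id;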

    If all bits in the set of integers are equally likely to be set, a vast proportion of changes will be detected, but in other cases the proportion detected can be pretty small. It would be silly to use checksum_agg on a set of integers which are known to come from ranking with no ties, so that the values are just 1 to N (since the only thing that can then change the checksum is a change in the list's length, so count(expression) makes more sense), or on a set where every value present is known to occur an even number of times because of the way insertions, deletions, and updates work (so that checksum_agg is always zero). But even in less bizarre cases there may be patterns in your data that mean there will be far more collisions than you expect.

    It's entertaining to look at lists of primes to get an example. Suppose L<N> is the list of the first N primes, so L1 is the list containing 2 and nothing else and L6542 is the list of all primes smaller than 65536. If we feed each of these lists to checksum_agg, we don't get 6542 different checksums, we get only 4896 - so about 25% of the time there's a change checksum_agg won't detect it. Or, for an example where the lists are all the same length, consider lists of 5000 consecutive primes smaller than 65536: there are 1543 such lists (6542 - 5000 + 1) but they have only 1219 distinct checksums, so about 20% of changes go undetected by checksum_agg.
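
    For anyone who wants to try reproducing those prefix-list figures, here is a sketch of the experiment (untested and not optimised; the tally table and temp tables are just one convenient way to do it in T-SQL):

    IF OBJECT_ID('tempdb..#nums')   IS NOT NULL DROP TABLE #nums;
    IF OBJECT_ID('tempdb..#primes') IS NOT NULL DROP TABLE #primes;

    -- Tally table holding 2 .. 65536.
    SELECT CAST(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) + 1 AS int) AS n
    INTO #nums
    FROM (SELECT TOP (65535) 0 AS z
          FROM sys.all_objects AS a CROSS JOIN sys.all_objects AS b) AS src;

    CREATE UNIQUE CLUSTERED INDEX ix_nums_n ON #nums (n);

    -- The primes below 65536, numbered by position (rn = N means "the Nth prime").
    -- 256 is the square root of 65536; d.n < p.n stops a prime dividing itself.
    SELECT p.n,
           CAST(ROW_NUMBER() OVER (ORDER BY p.n) AS int) AS rn
    INTO #primes
    FROM #nums AS p
    WHERE NOT EXISTS (SELECT 1
                      FROM #nums AS d
                      WHERE d.n <= 256
                        AND d.n <  p.n
                        AND p.n % d.n = 0);

    -- One CHECKSUM_AGG per prefix list L<N> (the first N primes), then count
    -- how many distinct checksum values those lists actually produce.
    SELECT COUNT(*)           AS prefix_lists,
           COUNT(DISTINCT cs) AS distinct_checksums
    FROM (SELECT (SELECT CHECKSUM_AGG(i.n)
                  FROM #primes AS i
                  WHERE i.rn <= o.rn) AS cs
          FROM #primes AS o) AS x;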

    Does our data show any patterns that mean we are going to have to do the expensive check in more than a quarter of cases even when the data has almost always changed, as in this example? Do we have a reasonable chance of working out whether our data has some such nasty pattern before it bites us?

    And there's still a problem even if we can detect a high proportion of changes. Of course it's desirable to eliminate some expensive checks, but if the most common situation is that there isn't a difference, then the most common situation is that we do the expensive check. Don't let the detection of 80% or even 99.99999% of changes fool you into thinking that you won't do many expensive checks: with a checksum or hash used to detect change, you gain very little if there is almost always no change, because you can't detect that case without the expensive check. You only eliminate expensive checks in the cases where there has been a change and it was detected.

    Tom