CHECKSUM_AGG

  • Comments posted to this topic are about the item CHECKSUM_AGG

  • Good question, got it wrong but learnt something new, thanks.

    ...

  • HappyGeek (2/4/2016)


    Good question, got it wrong but learnt something new, thanks.

    +1

  • I had to look that function up as I hadn't seen it before.

    Is anyone able to provide a real-world example demonstrating the usage? What I'm interested to know is what business requirement was solved by using it. Thanks.

  • david.dilworth (2/5/2016)


    I had to look that function up as I hadn't seen it before.

    Is anyone able to provide a real-world example demonstrating the usage? What I'm interested to know is what business requirement was solved by using it. Thanks.

    It's new to me too, but I think you could use it to detect changes within a range of data.

  • I don't like this part of the answer: "across all rows in a table" because the result can be confined by a where clause.

    Don Simpson



    I'm not sure about Heisenberg.

  • DonlSimpson (2/5/2016)


    I don't like this part of the answer: "across all rows in a table" because the result can be confined by a where clause.

    And you can also use grouping to segregate the values (a small example of that follows this post). The BOL entry states "group", which is more accurate.

    However, I liked the question. Not something I have ever used and will probably forget the next time I could make use of it. 😀

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Moden's splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/
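
    A small sketch of the grouping point above; the table variable and column names are invented for illustration, not taken from the discussion:

    DECLARE @OrderLines TABLE (OrderID int, LineAmountCents int);
    INSERT INTO @OrderLines (OrderID, LineAmountCents)
    VALUES (1, 500), (1, 1250), (2, 300), (2, 300), (3, 9999);

    -- One checksum per group, instead of one value across the whole (possibly filtered) rowset
    SELECT OrderID, CHECKSUM_AGG(LineAmountCents) AS line_checksum
    FROM @OrderLines
    GROUP BY OrderID;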

  • Nice question, thanks.

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Sean Lange (2/5/2016)


    DonlSimpson (2/5/2016)


    I don't like this part of the answer: "across all rows in a table" because the result can be confined by a where clause.

    And you can also use grouping to segregate the values. The BOL entry states "group", which is more accurate.

    And the FROM clause may specify a view instead of a table, or some complicated mix of inner joins and outer joins and cross joins using views as well as tables. This of course suggests that the word "table" in the question is absolutely correct because it is clearly being used in its general sense which includes "derived table" (a derived table is any rowset which can be derived from the base tables of the schema using SQL).

    However, I liked the question. Not something I have ever used and will probably forget the next time I could make use of it. 😀

    I quite like the question too. I thought about using it (checksum_agg, not the question :hehe:) once, then thought a bit harder about what the order-independent "checksum" might be, ran some trivial experiments, and decided this aggregate was much too likely to return false negatives to be any use at all for what I was trying to do (and, I believe, discovered exactly what the aggregate is: the reduction of XOR over the list of values). So I also believe, of course, that BoL is wrong about the properties of this aggregate: changing one of the values fed into the aggregation will, I believe, always change the resulting aggregate; to avoid changing the aggregate you have to change at least two of the input values, yet BoL has said, for as long as I can remember, that changing one of the input values may (rarely) not change the result. (A small sketch of that XOR behaviour follows this post.)

    Tom
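
    A minimal sketch of the XOR behaviour described above; the table variable and values are invented, and the match between the two results is an observation about the aggregate rather than something Books Online documents:

    DECLARE @t TABLE (n int);
    INSERT INTO @t (n) VALUES (17), (23), (42), (99);

    -- Manual XOR fold over the values (order is irrelevant because XOR is commutative and associative)
    DECLARE @xor int = 0;
    SELECT @xor = @xor ^ n FROM @t;

    SELECT CHECKSUM_AGG(n) AS checksum_agg_result,   -- if the XOR reading is right, these two match
           @xor            AS manual_xor
    FROM @t;

    -- Flipping the same bit in two different rows cancels out in an XOR fold,
    -- so a pair of offsetting changes like this would go undetected
    UPDATE @t SET n = n ^ 8 WHERE n IN (17, 42);
    SELECT CHECKSUM_AGG(n) AS checksum_after_two_offsetting_changes FROM @t;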

  • sestell1 (2/5/2016)


    david.dilworth (2/5/2016)


    I had to look that function up as I hadn't seen it before.

    Is anyone able to provide a real-world example demonstrating the usage? What I'm interested to know is what business requirement was solved by using it. Thanks.

    It's new to me too, but I think you could use it to detect changes within a range of data.

    No, checksums are never a reliable method for detecting changes. Since the number of distinct outputs from a checksum computation is less than the number of distinct inputs, it is unavoidable that there are multiple inputs with the same output. In other words, two completely different sets of data can result in the same CHECKSUM_AGG value.

    You can use it for the reverse, though: detecting when two sets of values are identical. When the checksum values are different, you are sure that the sets are different. When the checksum values are the same, you still have to do a full test on all values because they can still be different sets. But using the checksum will seriously reduce the number of times you have to do a full comparison, because most sets with a difference will be filtered out by the cheaper test. (A sketch of that filter-then-verify pattern follows this post.)


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/
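
    A sketch of the filter-then-verify pattern described above; dbo.SourceTable, dbo.TargetTable, and the int column KeyValue are hypothetical names, both tables are assumed non-empty (CHECKSUM_AGG returns NULL on an empty set), and EXCEPT is used as a simple set comparison that ignores duplicate rows:

    IF (SELECT CHECKSUM_AGG(KeyValue) FROM dbo.SourceTable)
     = (SELECT CHECKSUM_AGG(KeyValue) FROM dbo.TargetTable)
    BEGIN
        -- Checksums match: the sets MAY still differ, so fall back to the full comparison
        IF EXISTS (SELECT KeyValue FROM dbo.SourceTable
                   EXCEPT
                   SELECT KeyValue FROM dbo.TargetTable)
        OR EXISTS (SELECT KeyValue FROM dbo.TargetTable
                   EXCEPT
                   SELECT KeyValue FROM dbo.SourceTable)
            PRINT 'Sets differ (checksum collision).';
        ELSE
            PRINT 'Sets are identical.';
    END
    ELSE
        -- Different checksums prove the sets differ; no expensive comparison needed
        PRINT 'Sets differ.';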

  • Hugo Kornelis (2/6/2016)


    sestell1 (2/5/2016)


    david.dilworth (2/5/2016)


    I had to look that function up as I hadn't seen it before.

    Is anyone able to provide a real-world example demonstrating the usage? What I'm interested to know is what business requirement was solved by using it. Thanks.

    It's new to me too, but I think you could use it to detect changes within a range of data.

    No, checksums are never a reliable method for detecting changes. Since the number of distinct outputs from a checksum computation is less than the number of distinct inputs, it is unavoidable that there are multiple inputs with the same output. In other words, two completely different sets of data can result in the same CHECKSUM_AGG value.

    In the case of checksum_agg, there are only 32 bits of output, so collisions will be much more frequent than if a decent redundancy function (not necessarily suitable for cryptographic purposes) were used to produce, say, 128 bits.

    Checksum_agg is what we used to call vertical parity in the very early days of data transmission (whether between or within computers/peripherals). It has such incredibly poor behaviour (failing to detect changes) in the presence of the noise patterns typically encountered that just about every use of it was taken over by some form of CRC, based on the paper by Peterson and Brown published in January 1961 in the Proceedings of the IRE (which is now, since 1962, called Proceedings of the IEEE).

    You can use it for the reverse, though: detecting when two sets of values are identical. When the checksum values are different, you are sure that the sets are different. When the checksum values are the same, you still have to do a full test on all values because they can still be different sets. But using the checksum will seriously reduce the number of times you have to do a full comparison, because most sets with a difference will be filtered out by the cheaper test.

    If all the bits in the set of integers are equally likely to be set, a vast proportion of changes will be detected, but in other cases the proportion detected can be pretty small. It would be silly to use checksum_agg on a set of integers which is known to come from ranking with no matches (since the only thing that can change the checksum is a change in the list's length, so count(expression) makes more sense), or on a set where every value present is known to occur an even number of times because of the way insertions, deletions, and updates work (so that checksum_agg is always zero; a tiny demonstration of that case follows this post). But even in not so bizarre cases there may be patterns in your data that mean there will be far more collisions than you expect.

    It's entertaining to look at lists of primes to get an example. Suppose L<N> is the list of the first N primes, so L1 is the list containing 2 and nothing else and L6542 is the list of all primes smaller than 65536. If we feed each of these lists to checksum_agg, we don't get 6542 different checksums, we get only 4896: 25% of the time there's a change, checksum_agg won't detect it. Or, for an example where the lists are all the same length, consider lists of 5000 consecutive primes smaller than 65536; there are 1543 such lists (6542 - 5000 + 1), but they have only 1219 distinct checksums, so 20% of changes go undetected by checksum_agg.

    Does our data show any patterns that will mean we are going to have to do the expensive check in more than a quarter of cases even if the data is almost always changed, as suggested by this example? Do we have a reasonable chance of working out whether our data has some such nasty pattern before it bites us?

    And there's still a problem even if we can detect a high proportion of changes. Of course it's desirable to eliminate some expensive checks, but if the most common situation is that there isn't a difference, then the most common situation is that we do the expensive check. Don't let the detection of 80% or even 99.99999% of changes fool you into thinking that you won't do many expensive checks: with a checksum or hash used to detect change you gain very little if there is almost always no change, because you can't detect that case without the expensive checks; you only eliminate expensive checks in the cases where there has been a change and it was detected.

    Tom
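
    A tiny demonstration of the even-number-of-occurrences case mentioned above; the table variable and values are invented, and the zero results assume the XOR-fold behaviour discussed earlier in the thread:

    DECLARE @pairs TABLE (n int);

    -- Every value appears an even number of times, so an XOR-style aggregate cancels to 0
    INSERT INTO @pairs (n) VALUES (101), (101), (5000), (5000), (-7), (-7);
    SELECT CHECKSUM_AGG(n) AS paired_values_checksum FROM @pairs;

    -- Completely different data with the same even-occurrence pattern gives the same useless result
    DELETE FROM @pairs;
    INSERT INTO @pairs (n) VALUES (8191), (8191), (12345), (12345);
    SELECT CHECKSUM_AGG(n) AS other_paired_values_checksum FROM @pairs;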

  • I don't fully understand how this works to detect changes at the table level. BOL states that 'ALL' is the default value, yet using 'ALL' doesn't work when you try to run it. What am I missing?

    SELECT CHECKSUM_AGG(ALL) FROM dbo.MyTable

    ______________________________________________________________________________
    Never argue with an idiot; they'll drag you down to their level and beat you with experience.

  • MyDoggieJessie (2/6/2016)


    I don't fully understand how this works to detect changes at the table level. BOL states that 'ALL' is the default value, yet using 'ALL' doesn't work when you try to run it. What am I missing?

    SELECT CHECKSUM_AGG(ALL) FROM dbo.MyTable

    Check the syntax diagram in Books Online: "CHECKSUM_AGG ( [ ALL | DISTINCT ] expression )"

    Your example code has the ALL, but it fails to specify the expression (a corrected version is sketched after this post).

    Also, as already explained above, using this to detect changes is only valid if you can accept missing some changes.


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/
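
    A corrected version of the query above, sketched with placeholder names (dbo.MyTable with an int column SomeIntColumn); ALL is optional because it is the default, but an expression must always be supplied:

    SELECT CHECKSUM_AGG(ALL SomeIntColumn) FROM dbo.MyTable;
    -- equivalent, because ALL is the default
    SELECT CHECKSUM_AGG(SomeIntColumn) FROM dbo.MyTable;
    -- DISTINCT aggregates over the distinct values only
    SELECT CHECKSUM_AGG(DISTINCT SomeIntColumn) FROM dbo.MyTable;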

  • Stewart "Arturius" Campbell (2/4/2016)


    Interesting function - have never used it before

    thanks for the question, Steve - learned something new today

    + 1

    Thanks & Best Regards,
    Hany Helmy
    SQL Server Database Consultant
