The checksum

  • I agree with the others. This answer is completely wrong. It should be "No, never".

    The possibility of collisions is in fact the reason why you can never rely on it to detect changed records, even if the order of the columns were the same.

    The fact that in some unlikely cases it might correctly show that the data didn't change (e.g. when all three fields contain the same data), doesn't mean you can avoid the round trip.

  • I agree -- he wouldn't ever be able to avoid round trips, because his checksum tells him nothing about whether the row was updated or not. He has no more information after the checksum than he did before.

  • The idea behind a checksum is to reduce work. ALL hashing algorithms have the possibility of collisions. The faster/shorter/cheaper the algorithm is, the more likely collisions are. CHECKSUM() is dangerous in this case.

    However, you're approaching this as an all-or-nothing idea if you say never. If I could eliminate 60% of round trips because my checksum is "good enough", then that's fine. It works for the system and I can avoid round trips.

    In this case, it's a poor programming practice, which is what I wanted people to see. Bobby can avoid round trips with his choices, but he's made a mistake in calculating the checksum. He may or may not detect a change; if he did not, it's likely a false negative, and if he does detect one, it's more than likely a false positive. If his detection (or non-detection) happens to match whether the data actually changed, it's very, very lucky.
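    To make the order mistake concrete, here's a minimal sketch (the column names come from the question; the values are invented, and SQL Server 2008+ is assumed for the inline DECLARE):

    DECLARE @sku  VARCHAR(20)  = 'SKU-001',
            @name VARCHAR(50)  = 'Widget',
            @desc VARCHAR(100) = 'A basic widget';

    -- The same data hashed in two different column orders almost always
    -- yields two different CHECKSUM values, so the stored hash won't match.
    SELECT CHECKSUM(@name, @desc, @sku) AS stored_order,
           CHECKSUM(@sku, @name, @desc) AS bobbys_order;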

  • Steve Jones - SSC Editor (9/15/2015)


    The idea behind a checksum is to reduce work. ALL hashing algorithms have the possibility of collisions. The faster/shorter/cheaper the algorithm is, the more likely collisions are. CHECKSUM() is dangerous in this case.

    However, you're approaching this as an all-or-nothing idea if you say never. If I could eliminate 60% of round trips because my checksum is "good enough", then that's fine. It works for the system and I can avoid round trips.

    In this case, it's a poor programming practice, which is what I wanted people to see. Bobby can avoid round trips with his choices, but he's made a mistake in calculating the checksum. He may or may not detect a change; if he did not, it's likely a false negative, and if he does detect one, it's more than likely a false positive. If his detection (or non-detection) happens to match whether the data actually changed, it's very, very lucky.

    I said "No, never" because as written (with the incorrect order), the checksum Bobby calculates provides no valid information (not even a hint, as you'd get with the correct order) as to whether or not the data has changed. So he CANNOT avoid a round trip with his code as written.

    But now, I think the correct answer is "Yes, Always". Here's how it works:

    1. Bobby writes this code.

    2. Bobby gets fired for using a dangerous function and using it incorrectly.

    3. Bobby has successfully avoided all round trips because the application is no longer "his".

  • The question may not have been fully qualified, but if you make some minimal assumptions about it, the answer given is correct. The whole point of the question is that the order of the columns used in calculating the checksum value is critical in generating a usable checksum, not that every checksum is unique.

    For example, assume the question was restated as "Will the following query return 1 when the specified record has not changed?":

    select 1
    from MyProducts
    where productid = @queryid
    and producthash = checksum(sku, productname, productdescription);

    a) Always

    b) Never

    c) Sometimes, but very unlikely

    The answer here is clearly "c" - sometimes the same checksum may be computed from a different order of the columns, but it is highly unlikely.

    I use CHECKSUM or HASHBYTES for data conversions where I need to handle changes to the source data over time. On large data conversions, where the actual conversion process may run for several days, it is often necessary to run it on an initial snapshot of the data, then take another snapshot days, weeks, or months later and process only the changed records.

    I would never use it to determine if two different indexed records were the same, but to detect changes within a record it works well, and the real-world probability of failing to detect a change is so small that it can be ignored in most applications.
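    As a rough sketch of that snapshot approach (the table and column names here are hypothetical, and CONCAT assumes SQL Server 2012+):

    -- Find rows whose content changed between two snapshots. A delimiter is
    -- included so ('ab','c') and ('a','bc') don't hash to the same input.
    SELECT n.id
    FROM   Snapshot_New AS n
    JOIN   Snapshot_Old AS o
      ON   o.id = n.id
    WHERE  HASHBYTES('SHA1', CONCAT(n.col1, '|', n.col2, '|', n.col3))
        <> HASHBYTES('SHA1', CONCAT(o.col1, '|', o.col2, '|', o.col3));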

  • sknox (9/15/2015)


    But now, I think the correct answer is "Yes, Always". Here's how it works:

    1. Bobby writes this code.

    2. Bobby gets fired for using a dangerous function and using it incorrectly.

    3. Bobby has successfully avoided all round trips because the application is no longer "his".

    I get what you mean, but in that case (and more or less in line with what Steve tried to prove) the correct answer should be "Yes, but in exceptional cases not".

    And here the exceptional cases are the collisions, and the case where all three fields contain the same data.

  • sknox (9/15/2015)


    I said "No, never" because as written (with the incorrect order), the checksum Bobby calculates provides no valid information (not even a hint, as you'd get with the correct order) as to whether or not the data has changed. So he CANNOT avoid a round trip with his code as written.

    But now, I think the correct answer is "Yes, Always". Here's how it works:

    1. Bobby writes this code.

    2. Bobby gets fired for using a dangerous function and using it incorrectly.

    3. Bobby has successfully avoided all round trips because the application is no longer "his".

    Incorrect. You can absolutely avoid a round trip. You would have a bug in your code, but in your mind, or with limited testing, you might think it works.

  • Steve Jones - SSC Editor (9/15/2015)


    The idea behind a checksum is to reduce work. ALL hashing algorithms have the possibility of collisions. The faster/shorter/cheaper the algorithm is, the more likely collisions are. CHECKSUM() is dangerous in this case.

    However, you're approaching this as an all-or-nothing idea if you say never. If I could eliminate 60% of round trips because my checksum is "good enough", then that's fine. It works for the system and I can avoid round trips.

    Don't think so. CHECKSUM can return the same checksum even when the columns hold different data -- even if they're specified in the same order (because of the probability of collisions). So, just because I get CHECKSUM(old) = CHECKSUM(new), that doesn't mean that nothing changed. Something might well have changed, but due to a collision the checksums are the same.

    You have to look at the actual data anyway, unless you can tolerate false positives and negatives, that is.
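    For a concrete case of that, under a default (case-insensitive) collation two different values produce one and the same CHECKSUM, as a later post in this thread also demonstrates:

    -- Different data, identical checksums: equal checksums prove nothing.
    SELECT CHECKSUM('Usa') AS chk_old,
           CHECKSUM('USA') AS chk_new;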

    Gerald Britton, Pluralsight courses

  • Heh... remind me not to even attempt such questions in the future. I mentally added "and do it right" to the question which, of course, would have led me to the correct answer if I hadn't made the presumption that it needed to be done right.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Steve Jones - SSC Editor (9/15/2015)


    sknox (9/15/2015)


    I said "No, never" because as written (with the incorrect order), the checksum Bobby calculates provides no valid information (not even a hint, as you'd get with the correct order) as to whether or not the data has changed. So he CANNOT avoid a round trip with his code as written.

    But now, I think the correct answer is "Yes, Always". Here's how it works:

    1. Bobby writes this code.

    2. Bobby gets fired for using a dangerous function and using it incorrectly.

    3. Bobby has successfully avoided all round trips because the application is no longer "his".

    Incorrect. You can absolutely avoid a round trip. You would have a bug in your code, but in your mind, or with limited testing, you might think it works.

    You can also avoid a round trip by just not doing it. But that would make the whole question meaningless.

    The question doesn't make it clear whether you're asking if CHECKSUM is useful or if his algorithm is useful. That means "No, never" is a viable answer, and in many ways, more correct than "Sometimes, but very unlikely".

    If all you're asking about is the usefulness of CHECKSUM, then the answer should be "Most of the time, but not always", because collisions are the exception, not the rule, with CHECKSUM.

    Since that's not a valid answer, that implies that you're asking about the usefulness of his algorithm, which is best answered "No, never".

  • Just for grins... I generated a list of CHECKSUM values for strings of "a", "aa", "aaa", ... up to a string of 255 "a" characters.

    The CHECKSUM values for these patterns repeat exactly every sixteen characters.

    E.g., the CHECKSUM function returns 142 for all of the following string inputs:

    1 a

    17 a's

    33 a's

    ...

    It also returns 0 for all strings of:

    16 a's

    32 a's

    48 a's

    ...

    Also - since the default SQL Server collation is case-insensitive, you get the same values for "A" strings as for "a" strings. Ergo - if the change to a record is only a matter of capitalization, it will NOT be detected with CHECKSUM.

    Changing to NVARCHAR produces repeating patterns every 32 characters, but again the checksum of a string of "a" characters is the same as the checksum of a string of "A" characters.
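    Anyone who wants to reproduce the experiment can try something like this (assumes SQL Server 2005+ for ROW_NUMBER; widen TOP for longer strings):

    -- CHECKSUM of 'a' repeated n times, n = 1..48; watch the values cycle
    -- with a period of sixteen characters.
    SELECT t.n,
           CHECKSUM(REPLICATE('a', t.n)) AS chk
    FROM  (SELECT TOP (48) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
           FROM sys.all_objects) AS t
    ORDER BY t.n;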

  • Steve Jones - SSC Editor (9/15/2015)


    The idea behind a checksum is to reduce work. ALL hashing algorithms have the possibility of collisions. The faster/shorter/cheaper the algorithm is, the more likely collisions are. CHECKSUM() is dangerous in this case.

    However, you're approaching this as an all-or-nothing idea if you say never. If I could eliminate 60% of round trips because my checksum is "good enough", then that's fine. It works for the system and I can avoid round trips.

    In this case, it's a poor programming practice, which is what I wanted people to see. Bobby can avoid round trips with his choices, but he's made a mistake in calculating the checksum. He may or may not detect a change; if he did not, it's likely a false negative, and if he does detect one, it's more than likely a false positive. If his detection (or non-detection) happens to match whether the data actually changed, it's very, very lucky.

    Therefore, it is possible Bobby can avoid round trips, even though his columns are in the wrong order, but it is extremely unlikely.

    Steve, this logic is flawed, as he (Bobby) will have to do a round trip to check whether the data has changed regardless of the output of the checksum function; there is no way of telling whether identical checksum values represent the same data without checking the data itself. What you have correctly stated is that different column values can, and eventually will, produce the same checksum, but you have not shown any usefulness in such occurrences.

    😎

  • Jeff Moden (9/15/2015)


    Heh... remind me not to even attempt such questions in the future. I mentally added "and do it right" to the question which, of course, would have led me to the correct answer if I hadn't made the presumption that it needed to be done right.

    He he, reminds me of a testing trend from quite a few years back: give a few options which are all wrong, and then judge by the chosen level of incorrectness.

    😎


  • Even if he used the correct order, he must not use CHECKSUM, because it is not case-sensitive (if you are using the default collation for your DB):

    SELECT CHECKSUM('ABC')         AS chk_upper,
           CHECKSUM('abc')         AS chk_lower,
           BINARY_CHECKSUM('ABC')  AS bin_chk_upper,
           BINARY_CHECKSUM('abc')  AS bin_chk_lower,
           HASHBYTES('MD5', 'ABC') AS hash_upper,
           HASHBYTES('MD5', 'abc') AS hash_lower

    will return

    chk_upper   chk_lower   bin_chk_upper  bin_chk_lower  hash_upper                          hash_lower
    1132495864  1132495864  17763          26435          0x902FBDD2B1DF0C4F70B4A5D23525E932  0x900150983CD24FB0D6963F7D28E17F72

    So if he uses CHECKSUM, the users will not be able to correct casing errors (e.g. Usa -> USA).
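    One case-sensitive alternative is to hash the concatenated columns with HASHBYTES instead -- a sketch only, with an assumed delimiter, and CONCAT requires SQL Server 2012+:

    -- HASHBYTES is byte-sensitive, so 'Usa' and 'USA' hash differently.
    SELECT productid,
           HASHBYTES('SHA1',
             CONCAT(sku, '|', productname, '|', productdescription)) AS producthash
    FROM MyProducts;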
