checksums and unicode data

  • Can someone explain why

    select checksum(convert(nvarchar(3),'674'));

    select checksum(convert(nvarchar(3),'645'));

    yield the same value?

    While

    select checksum('674')

    select checksum('645')

    do not. What is is about unicode data that caused the checksum to be the same?

    TIA

  • The primary differnce is the way the data is stored single-byte versus double-byte.

    It just so happens that the checksum algorithmn generates duplicates in this particular case.

  • ...and the checksum algorithm is nothing more than a simple "Exclusive OR" at the byte level. It does NOT quarantee that two rows are different and it doesn't guarantee that a row hasn't been somehow changed. It should only be used as a basic sanity check (if that). I don't know why they even bothered except that every language seems to have one and many of them have the same problem.

    Closer to being perfect is the LUN 10 checksum (sanity check on credit card numbers)... but even that can have a failure.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Closer to being perfect is the LUN 10 checksum (sanity check on credit card numbers)... but even that can have a failure.

    Just in case anyone else is interested in finding out more regarding the checksum Jeff mentions I found it is actually:

    LUHN10 checksum

  • Adam Gojdas (8/7/2009)


    Closer to being perfect is the LUN 10 checksum (sanity check on credit card numbers)... but even that can have a failure.

    Just in case anyone else is interested in finding out more regarding the checksum Jeff mentions I found it is actually:

    LUHN10 checksum

    Heh... thanks for the correction... I ate garlic and didn't want to breath on anyone too hard so I left the "H" out. 😛

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (8/5/2009)


    I don't know why they even bothered except that every language seems to have one and many of them have the same problem. Closer to being perfect is the LUHN 10 checksum (sanity check on credit card numbers)... but even that can have a failure.

    CHECKSUM is perhaps an unfortunate name for the function. It is not intended to be used as a checksum in the CRC sense:

    Books Online: CHECKSUM (T-SQL)


    CHECKSUM computes a hash value, called the checksum, over its list of arguments. The hash value is intended for use in building hash indexes. If the arguments to CHECKSUM are columns, and an index is built over the computed CHECKSUM value, the result is a hash index. This can be used for equality searches over the columns.

    CHECKSUM satisfies the properties of a hash function: CHECKSUM applied over any two lists of expressions returns the same value if the corresponding elements of the two lists have the same type and are equal when compared using the equals (=) operator. For this definition, null values of a specified type are considered to compare as equal. If one of the values in the expression list changes, the checksum of the list also generally changes. However, there is a small chance that the checksum will not change. For this reason, we do not recommend using CHECKSUM to detect whether values have changed, unless your application can tolerate occasionally missing a change. Consider using HashBytes instead. When an MD5 hash algorithm is specified, the probability of HashBytes returning the same result for two different inputs is much lower than that of CHECKSUM.

    The order of expressions affects the resultant value of CHECKSUM. The order of columns used with CHECKSUM(*) is the order of columns specified in the table or view definition. This includes computed columns.

    HashBytes (perhaps including the PK as a salt) is a good choice if the task is to detect changes.

    CHECKSUM is quick and efficient for creating hash indexes - which are awesome if you need to search long strings (for example).

    Paul

  • Paul White (8/8/2009)


    HashBytes (perhaps including the PK as a salt) is a good choice if the task is to detect changes.

    CHECKSUM is quick and efficient for creating hash indexes - which are awesome if you need to search long strings (for example).

    Even then, it can and will give an incorrect answer if you rely only on it. I'll admit that CHECKSUM will help narrow things down on long strings, though.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (8/9/2009)


    Even then, it can and will give an incorrect answer if you rely only on it. I'll admit that CHECKSUM will help narrow things down on long strings, though.

    My admission for the day: I have never relied on HashBytes (with the key as a salt) - I always end up get nervous and comparing all the values instead 😀

    Technically I think the chances of a collision using HashBytes (with SHA1) in the way I describe is overwhelmingly unlikely, though the mathematics to demonstrate that are far beyond me. My co-workers consider CHECKSUM with a PK salt to be sufficient (please don't get me started on that).

    Summary: when hash indexing, I use CHECKSUM to narrow the search followed by an explicit comparison.

    To detect changes, I compare all column values (unless I am fortunate enough to be working in 2K8 when I try to use CDC). Many people swear by HashBytes - I am more likely to swear at it - though my views are not universally shared, it must be said.

    Paul

  • Here's a thought.

    Given that a binary representation of a value is as unambiguous a way to represent a value as you can get, explain how this occurs:

    select convert(varbinary, convert(nvarchar(3),'674')); -- result 0x36003 7 003400

    select convert(varbinary, convert(nvarchar(3),'654')); -- result 0x36003 5 003400

    Running a checksum on the two binary values gives them a different result from the original checksum and from each other this time (unlike the original where the two were the same). Is this due to the way checksum parses unicode vs. binary?

    Gaby________________________________________________________________"In theory, theory and practice are the same. In practice, they are not." - Albert Einstein

  • GabyYYZ (8/10/2009)


    Is this due to the way checksum parses unicode vs. binary?

    Seems so:

    -- 0x3600 3700 3400

    SELECT (CONVERT(VARBINARY, N'674'));

    SELECT CONVERT(BINARY(2), UNICODE(N'6')) -- 0x0036

    SELECT CONVERT(BINARY(2), UNICODE(N'7')) -- 0x0037

    SELECT CONVERT(BINARY(2), UNICODE(N'4')) -- 0x0034

    The binary representation is byte-reversed.

    Paul

  • Jeff Moden (8/9/2009)


    Paul White (8/8/2009)


    HashBytes (perhaps including the PK as a salt) is a good choice if the task is to detect changes.

    CHECKSUM is quick and efficient for creating hash indexes - which are awesome if you need to search long strings (for example).

    Even then, it can and will give an incorrect answer if you rely only on it. I'll admit that CHECKSUM will help narrow things down on long strings, though.

    I'm, going to have to disagree with this one, Jeff, at least as far as HashBytes using the SHA1 algorithm is concerned. Assuming you did a Trillion HashByte comparisons of different data a day (and no one on earth is doing that many), it would still take about 6000 to 12000 Years before you were likely to get an accidental match. Those are odds that I can live with. 🙂

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Ummmm.... does SQL Server's CHECKSUM use the SHA1 algorithm because I was talking about that CHECKSUM. Obviously that doesn't take 6000 to 12000 years to happen because I've seen two of these same types of posts in the last 3 months. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (8/10/2009)


    Ummmm.... does SQL Server's CHECKSUM use the SHA1 algorithm because I was talking about that CHECKSUM. Obviously that doesn't take 6000 to 12000 years to happen because I've seen two of these same types of posts in the last 3 months. 😉

    Oh no, sorry, CHECKSUM produces duplicates if you so much as swap the order of two of the characters. So it stinks for this purpose, and we are in agreement on that. 😀

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • To clarify: HashBytes w/ SHA1 is nigh-unbreakable.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • RBarryYoung (8/10/2009)


    To clarify: HashBytes w/ SHA1 is nigh-unbreakable.

    Hey Barry,

    On the other hand, it is relatively slow, only works on strings, and returns varbinary(8000).

    If the task is to detect changes in a row of data, would you use HashBytes (SHA1) alone?

    Paul

Viewing 15 posts - 1 through 15 (of 19 total)

You must be logged in to reply to this topic. Login to reply