Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase 12»»

checksums and unicode data Expand / Collapse
Author
Message
Posted Wednesday, August 5, 2009 3:17 PM
SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Thursday, July 17, 2014 5:34 PM
Points: 40, Visits: 201
Can someone explain why

select checksum(convert(nvarchar(3),'674'));
select checksum(convert(nvarchar(3),'645'));

yield the same value?
While

select checksum('674')
select checksum('645')

do not. What is is about unicode data that caused the checksum to be the same?

TIA



Post #765856
Posted Wednesday, August 5, 2009 4:18 PM
SSC Veteran

SSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC Veteran

Group: General Forum Members
Last Login: Friday, July 11, 2014 10:54 AM
Points: 292, Visits: 1,620
The primary differnce is the way the data is stored single-byte versus double-byte.

It just so happens that the checksum algorithmn generates duplicates in this particular case.
Post #765886
Posted Wednesday, August 5, 2009 10:46 PM


SSC-Dedicated

SSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-Dedicated

Group: General Forum Members
Last Login: Today @ 12:14 AM
Points: 36,735, Visits: 31,186
...and the checksum algorithm is nothing more than a simple "Exclusive OR" at the byte level. It does NOT quarantee that two rows are different and it doesn't guarantee that a row hasn't been somehow changed. It should only be used as a basic sanity check (if that). I don't know why they even bothered except that every language seems to have one and many of them have the same problem.

Closer to being perfect is the LUN 10 checksum (sanity check on credit card numbers)... but even that can have a failure.


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #765951
Posted Friday, August 7, 2009 2:39 PM
SSC Journeyman

SSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC Journeyman

Group: General Forum Members
Last Login: Friday, November 1, 2013 8:49 AM
Points: 79, Visits: 1,328
Closer to being perfect is the LUN 10 checksum (sanity check on credit card numbers)... but even that can have a failure.


Just in case anyone else is interested in finding out more regarding the checksum Jeff mentions I found it is actually:

LUHN10 checksum
Post #767280
Posted Friday, August 7, 2009 5:53 PM


SSC-Dedicated

SSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-Dedicated

Group: General Forum Members
Last Login: Today @ 12:14 AM
Points: 36,735, Visits: 31,186
Adam Gojdas (8/7/2009)
Closer to being perfect is the LUN 10 checksum (sanity check on credit card numbers)... but even that can have a failure.


Just in case anyone else is interested in finding out more regarding the checksum Jeff mentions I found it is actually:

LUHN10 checksum


Heh... thanks for the correction... I ate garlic and didn't want to breath on anyone too hard so I left the "H" out.


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #767323
Posted Saturday, August 8, 2009 6:07 AM


SSChampion

SSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampion

Group: General Forum Members
Last Login: Yesterday @ 1:36 AM
Points: 11,192, Visits: 11,091
Jeff Moden (8/5/2009)
I don't know why they even bothered except that every language seems to have one and many of them have the same problem. Closer to being perfect is the LUHN 10 checksum (sanity check on credit card numbers)... but even that can have a failure.

CHECKSUM is perhaps an unfortunate name for the function. It is not intended to be used as a checksum in the CRC sense:

Books Online: CHECKSUM (T-SQL)
CHECKSUM computes a hash value, called the checksum, over its list of arguments. The hash value is intended for use in building hash indexes. If the arguments to CHECKSUM are columns, and an index is built over the computed CHECKSUM value, the result is a hash index. This can be used for equality searches over the columns.

CHECKSUM satisfies the properties of a hash function: CHECKSUM applied over any two lists of expressions returns the same value if the corresponding elements of the two lists have the same type and are equal when compared using the equals (=) operator. For this definition, null values of a specified type are considered to compare as equal. If one of the values in the expression list changes, the checksum of the list also generally changes. However, there is a small chance that the checksum will not change. For this reason, we do not recommend using CHECKSUM to detect whether values have changed, unless your application can tolerate occasionally missing a change. Consider using HashBytes instead. When an MD5 hash algorithm is specified, the probability of HashBytes returning the same result for two different inputs is much lower than that of CHECKSUM.

The order of expressions affects the resultant value of CHECKSUM. The order of columns used with CHECKSUM(*) is the order of columns specified in the table or view definition. This includes computed columns.

HashBytes (perhaps including the PK as a salt) is a good choice if the task is to detect changes.
CHECKSUM is quick and efficient for creating hash indexes - which are awesome if you need to search long strings (for example).

Paul




Paul White
SQL Server MVP
SQLblog.com
@SQL_Kiwi
Post #767402
Posted Sunday, August 9, 2009 12:36 PM


SSC-Dedicated

SSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-Dedicated

Group: General Forum Members
Last Login: Today @ 12:14 AM
Points: 36,735, Visits: 31,186
Paul White (8/8/2009)
HashBytes (perhaps including the PK as a salt) is a good choice if the task is to detect changes.
CHECKSUM is quick and efficient for creating hash indexes - which are awesome if you need to search long strings (for example).


Even then, it can and will give an incorrect answer if you rely only on it. I'll admit that CHECKSUM will help narrow things down on long strings, though.


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #767534
Posted Sunday, August 9, 2009 5:18 PM


SSChampion

SSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampion

Group: General Forum Members
Last Login: Yesterday @ 1:36 AM
Points: 11,192, Visits: 11,091
Jeff Moden (8/9/2009)
Even then, it can and will give an incorrect answer if you rely only on it. I'll admit that CHECKSUM will help narrow things down on long strings, though.

My admission for the day: I have never relied on HashBytes (with the key as a salt) - I always end up get nervous and comparing all the values instead

Technically I think the chances of a collision using HashBytes (with SHA1) in the way I describe is overwhelmingly unlikely, though the mathematics to demonstrate that are far beyond me. My co-workers consider CHECKSUM with a PK salt to be sufficient (please don't get me started on that).

Summary: when hash indexing, I use CHECKSUM to narrow the search followed by an explicit comparison.
To detect changes, I compare all column values (unless I am fortunate enough to be working in 2K8 when I try to use CDC). Many people swear by HashBytes - I am more likely to swear at it - though my views are not universally shared, it must be said.

Paul




Paul White
SQL Server MVP
SQLblog.com
@SQL_Kiwi
Post #767574
Posted Monday, August 10, 2009 10:58 AM


SSC Eights!

SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!

Group: General Forum Members
Last Login: Yesterday @ 12:54 PM
Points: 809, Visits: 2,118
Here's a thought.

Given that a binary representation of a value is as unambiguous a way to represent a value as you can get, explain how this occurs:

select convert(varbinary, convert(nvarchar(3),'674'));  -- result 0x36003 7 003400
select convert(varbinary, convert(nvarchar(3),'654')); -- result 0x36003 5 003400

Running a checksum on the two binary values gives them a different result from the original checksum and from each other this time (unlike the original where the two were the same). Is this due to the way checksum parses unicode vs. binary?


Gaby
________________________________________________________________
"In theory, theory and practice are the same. In practice, they are not."
- Albert Einstein
Post #768004
Posted Monday, August 10, 2009 3:54 PM


SSChampion

SSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampion

Group: General Forum Members
Last Login: Yesterday @ 1:36 AM
Points: 11,192, Visits: 11,091
GabyYYZ (8/10/2009)
Is this due to the way checksum parses unicode vs. binary?

Seems so:

-- 0x3600 3700 3400
SELECT (CONVERT(VARBINARY, N'674'));

SELECT CONVERT(BINARY(2), UNICODE(N'6')) -- 0x0036
SELECT CONVERT(BINARY(2), UNICODE(N'7')) -- 0x0037
SELECT CONVERT(BINARY(2), UNICODE(N'4')) -- 0x0034

The binary representation is byte-reversed.

Paul




Paul White
SQL Server MVP
SQLblog.com
@SQL_Kiwi
Post #768245
« Prev Topic | Next Topic »

Add to briefcase 12»»

Permissions Expand / Collapse