Click here to monitor SSC
SQLServerCentral is supported by Redgate
 
Log in  ::  Register  ::  Not logged in
 
 
 


checksums and unicode data


checksums and unicode data

Author
Message
iBuildDW
iBuildDW
SSC Rookie
SSC Rookie (40 reputation)SSC Rookie (40 reputation)SSC Rookie (40 reputation)SSC Rookie (40 reputation)SSC Rookie (40 reputation)SSC Rookie (40 reputation)SSC Rookie (40 reputation)SSC Rookie (40 reputation)

Group: General Forum Members
Points: 40 Visits: 241
Can someone explain why

select checksum(convert(nvarchar(3),'674'));
select checksum(convert(nvarchar(3),'645'));

yield the same value?
While

select checksum('674')
select checksum('645')

do not. What is is about unicode data that caused the checksum to be the same?

TIA



Lamprey13
Lamprey13
Old Hand
Old Hand (304 reputation)Old Hand (304 reputation)Old Hand (304 reputation)Old Hand (304 reputation)Old Hand (304 reputation)Old Hand (304 reputation)Old Hand (304 reputation)Old Hand (304 reputation)

Group: General Forum Members
Points: 304 Visits: 1657
The primary differnce is the way the data is stored single-byte versus double-byte.

It just so happens that the checksum algorithmn generates duplicates in this particular case.
Jeff Moden
Jeff Moden
SSC-Forever
SSC-Forever (45K reputation)SSC-Forever (45K reputation)SSC-Forever (45K reputation)SSC-Forever (45K reputation)SSC-Forever (45K reputation)SSC-Forever (45K reputation)SSC-Forever (45K reputation)SSC-Forever (45K reputation)

Group: General Forum Members
Points: 45006 Visits: 39880
...and the checksum algorithm is nothing more than a simple "Exclusive OR" at the byte level. It does NOT quarantee that two rows are different and it doesn't guarantee that a row hasn't been somehow changed. It should only be used as a basic sanity check (if that). I don't know why they even bothered except that every language seems to have one and many of them have the same problem.

Closer to being perfect is the LUN 10 checksum (sanity check on credit card numbers)... but even that can have a failure.

--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
Although they tell us that they want it real bad, our primary goal is to ensure that we dont actually give it to them that way.
Although change is inevitable, change for the better is not.
Just because you can do something in PowerShell, doesnt mean you should. Wink

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
Adam Gojdas
Adam Gojdas
SSC Journeyman
SSC Journeyman (96 reputation)SSC Journeyman (96 reputation)SSC Journeyman (96 reputation)SSC Journeyman (96 reputation)SSC Journeyman (96 reputation)SSC Journeyman (96 reputation)SSC Journeyman (96 reputation)SSC Journeyman (96 reputation)

Group: General Forum Members
Points: 96 Visits: 1429
Closer to being perfect is the LUN 10 checksum (sanity check on credit card numbers)... but even that can have a failure.


Just in case anyone else is interested in finding out more regarding the checksum Jeff mentions I found it is actually:

LUHN10 checksum
Jeff Moden
Jeff Moden
SSC-Forever
SSC-Forever (45K reputation)SSC-Forever (45K reputation)SSC-Forever (45K reputation)SSC-Forever (45K reputation)SSC-Forever (45K reputation)SSC-Forever (45K reputation)SSC-Forever (45K reputation)SSC-Forever (45K reputation)

Group: General Forum Members
Points: 45006 Visits: 39880
Adam Gojdas (8/7/2009)
Closer to being perfect is the LUN 10 checksum (sanity check on credit card numbers)... but even that can have a failure.


Just in case anyone else is interested in finding out more regarding the checksum Jeff mentions I found it is actually:

LUHN10 checksum


Heh... thanks for the correction... I ate garlic and didn't want to breath on anyone too hard so I left the "H" out. :-P

--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
Although they tell us that they want it real bad, our primary goal is to ensure that we dont actually give it to them that way.
Although change is inevitable, change for the better is not.
Just because you can do something in PowerShell, doesnt mean you should. Wink

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
Paul White
Paul White
SSChampion
SSChampion (10K reputation)SSChampion (10K reputation)SSChampion (10K reputation)SSChampion (10K reputation)SSChampion (10K reputation)SSChampion (10K reputation)SSChampion (10K reputation)SSChampion (10K reputation)

Group: General Forum Members
Points: 10338 Visits: 11350
Jeff Moden (8/5/2009)
I don't know why they even bothered except that every language seems to have one and many of them have the same problem. Closer to being perfect is the LUHN 10 checksum (sanity check on credit card numbers)... but even that can have a failure.

CHECKSUM is perhaps an unfortunate name for the function. It is not intended to be used as a checksum in the CRC sense:

Books Online: CHECKSUM (T-SQL)
CHECKSUM computes a hash value, called the checksum, over its list of arguments. The hash value is intended for use in building hash indexes. If the arguments to CHECKSUM are columns, and an index is built over the computed CHECKSUM value, the result is a hash index. This can be used for equality searches over the columns.

CHECKSUM satisfies the properties of a hash function: CHECKSUM applied over any two lists of expressions returns the same value if the corresponding elements of the two lists have the same type and are equal when compared using the equals (=) operator. For this definition, null values of a specified type are considered to compare as equal. If one of the values in the expression list changes, the checksum of the list also generally changes. However, there is a small chance that the checksum will not change. For this reason, we do not recommend using CHECKSUM to detect whether values have changed, unless your application can tolerate occasionally missing a change. Consider using HashBytes instead. When an MD5 hash algorithm is specified, the probability of HashBytes returning the same result for two different inputs is much lower than that of CHECKSUM.

The order of expressions affects the resultant value of CHECKSUM. The order of columns used with CHECKSUM(*) is the order of columns specified in the table or view definition. This includes computed columns.

HashBytes (perhaps including the PK as a salt) is a good choice if the task is to detect changes.
CHECKSUM is quick and efficient for creating hash indexes - which are awesome if you need to search long strings (for example).

Paul



Paul White
SQLPerformance.com
SQLblog.com
@SQL_Kiwi
Jeff Moden
Jeff Moden
SSC-Forever
SSC-Forever (45K reputation)SSC-Forever (45K reputation)SSC-Forever (45K reputation)SSC-Forever (45K reputation)SSC-Forever (45K reputation)SSC-Forever (45K reputation)SSC-Forever (45K reputation)SSC-Forever (45K reputation)

Group: General Forum Members
Points: 45006 Visits: 39880
Paul White (8/8/2009)
HashBytes (perhaps including the PK as a salt) is a good choice if the task is to detect changes.
CHECKSUM is quick and efficient for creating hash indexes - which are awesome if you need to search long strings (for example).


Even then, it can and will give an incorrect answer if you rely only on it. I'll admit that CHECKSUM will help narrow things down on long strings, though.

--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
Although they tell us that they want it real bad, our primary goal is to ensure that we dont actually give it to them that way.
Although change is inevitable, change for the better is not.
Just because you can do something in PowerShell, doesnt mean you should. Wink

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
Paul White
Paul White
SSChampion
SSChampion (10K reputation)SSChampion (10K reputation)SSChampion (10K reputation)SSChampion (10K reputation)SSChampion (10K reputation)SSChampion (10K reputation)SSChampion (10K reputation)SSChampion (10K reputation)

Group: General Forum Members
Points: 10338 Visits: 11350
Jeff Moden (8/9/2009)
Even then, it can and will give an incorrect answer if you rely only on it. I'll admit that CHECKSUM will help narrow things down on long strings, though.

My admission for the day: I have never relied on HashBytes (with the key as a salt) - I always end up get nervous and comparing all the values instead :-D

Technically I think the chances of a collision using HashBytes (with SHA1) in the way I describe is overwhelmingly unlikely, though the mathematics to demonstrate that are far beyond me. My co-workers consider CHECKSUM with a PK salt to be sufficient (please don't get me started on that).

Summary: when hash indexing, I use CHECKSUM to narrow the search followed by an explicit comparison.
To detect changes, I compare all column values (unless I am fortunate enough to be working in 2K8 when I try to use CDC). Many people swear by HashBytes - I am more likely to swear at it - though my views are not universally shared, it must be said.

Paul



Paul White
SQLPerformance.com
SQLblog.com
@SQL_Kiwi
GabyYYZ
GabyYYZ
Say Hey Kid
Say Hey Kid (673 reputation)Say Hey Kid (673 reputation)Say Hey Kid (673 reputation)Say Hey Kid (673 reputation)Say Hey Kid (673 reputation)Say Hey Kid (673 reputation)Say Hey Kid (673 reputation)Say Hey Kid (673 reputation)

Group: General Forum Members
Points: 673 Visits: 2332
Here's a thought.

Given that a binary representation of a value is as unambiguous a way to represent a value as you can get, explain how this occurs:


select convert(varbinary, convert(nvarchar(3),'674')); -- result 0x36003 7 003400
select convert(varbinary, convert(nvarchar(3),'654')); -- result 0x36003 5 003400


Running a checksum on the two binary values gives them a different result from the original checksum and from each other this time (unlike the original where the two were the same). Is this due to the way checksum parses unicode vs. binary?

Gaby
________________________________________________________________
"In theory, theory and practice are the same. In practice, they are not."
- Albert Einstein

Paul White
Paul White
SSChampion
SSChampion (10K reputation)SSChampion (10K reputation)SSChampion (10K reputation)SSChampion (10K reputation)SSChampion (10K reputation)SSChampion (10K reputation)SSChampion (10K reputation)SSChampion (10K reputation)

Group: General Forum Members
Points: 10338 Visits: 11350
GabyYYZ (8/10/2009)
Is this due to the way checksum parses unicode vs. binary?

Seems so:


-- 0x3600 3700 3400
SELECT (CONVERT(VARBINARY, N'674'));

SELECT CONVERT(BINARY(2), UNICODE(N'6')) -- 0x0036
SELECT CONVERT(BINARY(2), UNICODE(N'7')) -- 0x0037
SELECT CONVERT(BINARY(2), UNICODE(N'4')) -- 0x0034



The binary representation is byte-reversed.

Paul



Paul White
SQLPerformance.com
SQLblog.com
@SQL_Kiwi
Go


Permissions

You can't post new topics.
You can't post topic replies.
You can't post new polls.
You can't post replies to polls.
You can't edit your own topics.
You can't delete your own topics.
You can't edit other topics.
You can't delete other topics.
You can't edit your own posts.
You can't edit other posts.
You can't delete your own posts.
You can't delete other posts.
You can't post events.
You can't edit your own events.
You can't edit other events.
You can't delete your own events.
You can't delete other events.
You can't send private messages.
You can't send emails.
You can read topics.
You can't vote in polls.
You can't upload attachments.
You can download attachments.
You can't post HTML code.
You can't edit HTML code.
You can't post IFCode.
You can't post JavaScript.
You can post emoticons.
You can't post or upload images.

Select a forum

































































































































































SQLServerCentral


Search