An in-depth look at change detection in SQL Server - Part 02

All enterprise class products have at least one requirement that requires the application to be able to identify changed records. These requirements may revolve around the following uses:

Auditing
Detecting checksum issues for custom replication conflict resolution
Tamper detection - detecting content changes/manipulation
Handling concurrency issues

The question therefore is: What are the various change detection and tamper protection mechanisms available within SQL Server?

Microsoft SQL Server provides about four (4) methods that may be used by teams for identification of modified records. While the requirement will dictate which of these functionalities need to be used, I have often seen teams getting confused between these options, which ultimately results in an incomplete realization of the requirement. The four functionalities are:

ROWVERSION
CHECKSUM
BINARY_CHECKSUM
HASHBYTES

We have looked at the first two change detection mechanisms in Part 1 of this two part series. In this part, we will be looking at the remaining two methods (BINARY_CHECKSUM and HASHBYTES), and finally come up with a comparative analysis/study of all the four.

BINARY_CHECKSUM

The BINARY_CHECKSUM system function is an improvement to CHECKSUM and performs a binary evaluation of the data. BINARY_CHECKSUM computes the checksum based on this binary representation of the data. Therefore, the BINARY_CHECKSUM promises to be a dramatic improvement over CHECKSUM.

We know that CHECKSUM can fail in the following scenarios:

Detecting case changes on a case-insensitive changes
CHECKSUM cannot compare data across collations because different collations result in different CHECKSUM values
CHECKSUM may fail to detect a symmetric change

Let us therefore test if BINARY_CHECKSUM can overcome the limitations of CHECKSUM:

BINARY_CHECKSUM is case-sensitive

To prove that BINARY_CHECKSUM is sensitive to case changes, let’s perform the following test. The tests are similar to those we performed for CHECKSUM in that we will be creating a temporary table with data columns and compute the BINARY_CHECKSUM over these.

USE tempdb;
GO
SET NOCOUNT ON;

--Create a test table
CREATE TABLE #BinaryCheckSumDemo 
      (Id INT IDENTITY (1,1),
       EmployeeName NVARCHAR(50),
       CityName NVARCHAR(50),
       BinaryCheckSum AS BINARY_CHECKSUM(EmployeeName, CityName)
      );
GO

--Insert some test data
INSERT INTO #BinaryCheckSumDemo 
       (EmployeeName, CityName)
VALUES ('John','Boston'),
       ('Tim','Paris'),
       ('Robert','New York'),
       ('Jim','Orlando');
GO

SELECT * FROM #BinaryCheckSumDemo;
GO

The BINARY_CHECKSUM values for the rows currently available in the table are given below:

BINARY_CHECKSUM is case-sensitive
Id	EmployeeName	CityName	BinaryCheckSum
1	John	Boston	67427454
2	Tim	Paris	5458739
3	Robert	New York	-691938033
4	Jim	Orlando	1213118463

Now, let us update the value of CityName for one of the records such that the case of one of the values changes.

--Try changing the case of the value and
--see if the BINARY_CHECKSUM value changes
UPDATE #BinaryCheckSumDemo
SET CityName = UPPER(CityName)
WHERE EmployeeName = 'Tim';
GO

SELECT * FROM #BinaryCheckSumDemo;
GO

Identifying record changes using BINARY_CHECKSUM - Case sensitivity
Id	EmployeeName	CityName	BinaryCheckSum
1	John	Boston	67427454
2	Tim	PARIS	5335315
3	Robert	New York	-691938033
4	Jim	Orlando	1213118463

As can be seen above highlighted in green, the BINARY_CHECKSUM computation for the updated record has been updated. This is because the binary representation of "PARIS" and "paris" is different, resulting in the different BINARY_CHECKSUM value.

IMPORTANT: BINARY_CHECKSUM is able to detect case changes in a case insensitive database.

BINARY_CHECKSUM is Collation In-sensitive

One of the main drawbacks in case of CHECKSUM is the fact that it is collation sensitive. If the same data is stored using two different collations, CHECKSUM would incorrectly flag them as being different. If the data is the same, the checksum values should also be the same. Therefore, let us recreate this scenario for BINARY_CHECKSUM and study the behaviour:

USE tempdb;
GO

--Create a test table to study collation sensitivity of CHECKSUM
CREATE TABLE #BinaryChecksumCollationSensivity 
      (SourceKey INT IDENTITY(1,1),
       CaseInsensitiveColumn VARCHAR(20) 
                             COLLATE SQL_Latin1_General_CP1_CI_AS,
       CaseSensitiveColumn VARCHAR(20)
                           COLLATE SQL_Latin1_General_CP1_CS_AS,
       BinaryChecksumInsensitive AS BINARY_CHECKSUM(CaseInsensitiveColumn),
       BinaryChecksumSensitive AS BINARY_CHECKSUM(CaseSensitiveColumn)
      );
GO

--Insert some test data, make sure that both values have the same data
INSERT INTO #BinaryChecksumCollationSensivity
       (CaseInsensitiveColumn, CaseSensitiveColumn)
VALUES ('Microsoft', 'Microsoft' ),
       ('Windows', 'Windows' ), 
       ('Sql Server', 'Sql Server'),
       (NULL, NULL ),
       ('', '' );
GO

--Fetch the data from the test table
SELECT * FROM #BinaryChecksumCollationSensivity;
GO

As can be seen from the results table shown below, the BINARY_CHECKSUM values for both collations (case sensitive and the case insensitive) is the same.

Identifying record changes using BINARY_CHECKSUM - Collation insensitivity
SourceKey	CaseInsensitive Column	CaseSensitive Column	BinaryChecksum Insensitive	BinaryChecksum Sensitive
1	Microsoft	Microsoft	-196585121	-196585121
2	Windows	Windows	1375217667	1375217667
3	Sql Server	Sql Server	-447589019	-447589019
4	NULL	NULL	2147483647	2147483647
5			0	0

IMPORTANT: BINARY_CHECKSUM is not dependent upon the collation case-sensitivity of the database.

BINARY_CHECKSUM may fail to detect symmetric changes

Just when BINARY_CHECKSUM was appearing to be an ideal replacement for CHECKSUM, we run into another road-block. For the same reasons as CHECKSUM, the BINARY_CHECKSUM may fail to detect symmetric changes.

To confirm, let us repeat the test we did for CHECKSUM:

USE tempdb;
GO
SET NOCOUNT ON;

--Symmetric values fail CHECKSUM test
SELECT BINARY_CHECKSUM('Nakul') AS 'Before Change',
       BINARY_CHECKSUM('NANANANANANANANANakul') AS 'After Change';
GO

The result we get is shown below and confirms that BINARY_CHECKSUM was unable to detect a symmectic change:

Identifying record changes using BINARY_CHECKSUM - fails to detect symmetric changes
BeforeChange	AfterChange
4750396	4750396

IMPORTANT: BINARY_CHECKSUM fails to detect symmetric changes.

BINARY_CHECKSUM is sensitive to the expression evaluation order

If the application has a very limited possibility of symmetrical changes and the application can afford to fail in detecting these changes once in a blue moon, then BINARY_CHECKSUM is a vast improvement over CHECKSUM.

Like any good tamper protection algorithm, BINARY_CHECKSUM also maintains sensitivity around expression evaluation order because of the very nature of the mathematics around computing a CHECKSUM.

Here’s a quick test:

USE tempdb;
GO
SET NOCOUNT ON;

--Create a test table
CREATE TABLE #BinaryCheckSumDemo 
      (Id INT IDENTITY (1,1),
       EmployeeName NVARCHAR(50),
       CityName NVARCHAR(50),
       DataBinaryCheckSum1 AS BINARY_CHECKSUM(EmployeeName, CityName),
       DataBinaryCheckSum2 AS BINARY_CHECKSUM(CityName, EmployeeName)
      );
GO

--Insert some test data
INSERT INTO #BinaryCheckSumDemo
       (EmployeeName, CityName)
VALUES ('John','Boston'),
       ('Tim','London'),
       ('Robert','New York'),
       ('Jim','Orlando');
GO

SELECT * FROM #BinaryCheckSumDemo;
GO

Identifying record changes using BINARY_CHECKSUM - sensitive to expression evaluation order
Id	EmployeeName	CityName	DataBinaryCheckSum1	DataBinaryCheckSum2
1	John	Boston	67427454	1149296654
2	Tim	London	78425422	1251506973
3	Robert	New York	-691938033	538363564
4	Jim	Orlando	1213118463	-2069377527

As can be seen in the table above, the BINARY_CHECKSUM values do not match for any evaluation sequence making it a great candidate for an ideal change detection mechanism.

IMPORTANT: In summary, BINARY_CHECKSUM improves a lot upon CHECKSUM in that it is a much advanced change-detection and tamper protection mechanism. However, there is room for improvement.

An improvement over both the CHECKSUM algorithms is achieved by using the HASHBYTES function.

HASHBYTES

The HASHBYTES system function returns a hash value based on the supplied input. This hash value does not follow a checksum algorithm, and therefore is immune from the problems faced by CHECKSUM and BINARY_CHECKSUM. The hash value generated by HASHBYTES can use one of the many supported encryption algorithms - MD2, MD4, MD5, SHA, SHA1, or SHA2.

The HASHBYTES function carries over all the desirable characteristics from BINARY_CHECKSUM, which are:

The HASHBYTES function is able to detect data changes
HASHBYTES can distinguish between the “false positives” and genuine updates
HASHBYTES is case-sensitive
HASHBYTES is collation insensitive
HASHBYTES is sensitive to expression evaluation order

The only point where BINARY_CHECKSUM falls short is:

Detecting changes due to symmetric changes

In order to establish that HASHBYTES is indeed supreme, all we need to test for is the symmetric update test.

HASHBYTES can detect symmetric changes

To check whether or not HASHBYTES can detect symmetric changes, let us repeat the same test as we did for BINARY_CHECKSUM, this time however, we will use the HASHBYTES function with the MD5 encryption.

USE tempdb;
GO
SET NOCOUNT ON;

--Symmetric values fail CHECKSUM test
SELECT HASHBYTES('MD5','Nakul') AS 'Before Change',
       HASHBYTES('MD5','NANANANANANANANANakul') AS 'After Change';
GO

As seen in the results below, the HASHBYTES value for both data values are different. This is because the HASHBYTES is an encrypted form of the data - if the data is different, the encrypted result has to be different.

Using HASHBYTES to detect symmetric changes
BeforeChange	AfterChange
0x67944B79EB69F8F7F16FA9240ABE80FC	0xADD6C6B9FC9D7BF0EDDA9DCA2DE70022

IMPORTANT: The HASHBYTES therefore is a “complete” change detection & tamper protection solution.

Storage considerations

In return for offering a complete and "tamper-proof" solution, HASHBYTES comes with a cost in terms of increased storage requirements.

Let us run the following query that creates a table with all known types of change detection mechanisms and attempts to identify the associated data-types:

USE tempdb;
GO
SET NOCOUNT ON;

--Create a test table
CREATE TABLE CheckSumDemo 
       (Id INT IDENTITY (1,1),
        EmployeeName NVARCHAR(50),
        CityName NVARCHAR(50),
        DataVersion ROWVERSION,
        DataCheckSum AS CHECKSUM(EmployeeName, CityName),
        DataBinaryCheckSum AS BINARY_CHECKSUM(EmployeeName, CityName),
        DataHashBytes AS HASHBYTES('MD5', (EmployeeName + CityName))
       );
GO

--Check for associated data-types
SELECT OBJECT_NAME(sc.object_id),
       sc.name AS ColumnName,
       st.name AS Datatype,
       sc.max_length AS ColumnLength,
       sc.collation_name AS Collation,
       sc.is_computed AS IsColumnComputed
FROM sys.columns AS sc 
LEFT OUTER JOIN sys.types AS st ON sc.user_type_id = st.user_type_id
WHERE sc.object_id = OBJECT_ID('CheckSumDemo')
ORDER BY sc.column_id;
GO

The result is shown below. As can be seen from the results table, CHECKSUMs occupy the least space, whereas HASHBYTES occupies the most.

Storage considerations for ROWVERSION, CHECKSUM, BINARY_CHECKSUM and HASHBYTES
TableName	ColumnName	DataType	Column Length	Collation	IsColumn Computed
CheckSumDemo	Id	int	4	NULL	0
CheckSumDemo	EmployeeName	nvarchar	100	SQL_Latin1_General_CP1_CI_AS	0
CheckSumDemo	CityName	nvarchar	100	SQL_Latin1_General_CP1_CI_AS	0
CheckSumDemo	DataVersion	timestamp	8	NULL	0
CheckSumDemo	DataCheckSum	int	4	NULL	1
CheckSumDemo	DataBinaryCheckSum	int	4	NULL	1
CheckSumDemo	DataHashBytes	varbinary	8000	NULL	1

The result obtained is quite interesting, and we can deduce the following facts:

The ROWVERSION is not a computed column – the values are actually persisted within the database
From a storage standpoint, both CHECKSUM and BINARY_CHECKSUM are the same
HASHBYTES consume a comparatively large disk space – 8000 bytes worth of it!

Summary

Let us take a moment to compare the 4 known mechanisms for change detection and tamper protection within Microsoft SQL Server.

A comparison of the various change detection and tamper protection mechanisms available in Microsoft SQL Server
	ROWVERSION	CHECKSUM	BINARY_CHECKSUM	HASHBYTES
Concurrency Detection	Yes	Yes	Yes	Yes
Change Detection	No	Basic	Yes	Yes
Tamper Protection	No	No	Basic	Yes
Storage/Data-type	Timestamp	INT	INT	VARBINARY
Size Requirements	8 bytes	4 bytes	4 bytes	8000 bytes
Best Option				Yes
"Value for Money"	Yes		Yes

PLEASE NOTE: Timestamp is deprecated. The fact that ROWVERSION continues to be seen as a “timestamp” in the sys.types listing has been reported under Connect cases:

This comparison shows us that if the goal is simply to implement concurrency protection, ROWVERSION is the best and only choice. However, for change detection and basic tamper protection, BINARY_CHECKSUM should be used.

While HASHBYTES offers the best of all worlds, it comes at a cost – storage. Therefore, HASHBYTES should only be used if data accuracy is of utmost importance.

If you want to read more, Part 1 is here.

Food for thought

The change detection functions can be used to detect and delete duplicate data
If the goal for any system is basic change detection, BINARY_CHECKSUM offers the best value for money
ROWVERSION can be used for concurrency detection and therefore can be used to implement optimistic concurrency protection (An implementation of ROWVERSION for optimistic concurrency detection can be found on my blog at: http://bit.ly/VpEwbH)

Disclaimer

The scripts in this article are provided "as-is", i.e. without warranties of any kind and are intended for demonstration purposes only
Request you to use these scripts for understanding and study purposes in your development environments only - they are not meant for use in production. The author is not responsible for any damage caused by misuse of these scripts
Scripts are tested on SQL 2008 R2
The opinions expressed herein are his own personal opinions and do not represent his employer’s view in anyway

About the author

Nakul Vachhrajani is a Technical Specialist and systems development professional with iGATE having a total IT experience of more than 8 years. Nakul is an active blogger with BeyondRelational.com (230+ blogs), and can also be found on forums at SQLServerCentral and BeyondRelational.com. Nakul has also been a guest columnist for SQLAuthority.com and SQLServerCentral.com. Nakul presented a webcast on the “Underappreciated Features of Microsoft SQL Server” at the Microsoft Virtual Tech Days Exclusive Webcast series (May 02-06, 2011) on May 06, 2011. He is also the author of a research paper on Database upgrade methodologies, which was published in a Computer Society of India (CSI) journal, published nationwide. In addition to his passion about SQL Server, Nakul also contributes to the academia out of personal interest. He visits various colleges and universities as an external faculty to judge project activities being carried out by the students.

Blog | LinkedIn | Twitter | Google+

An in-depth look at change detection in SQL Server - Part 02

BINARY_CHECKSUM

BINARY_CHECKSUM is case-sensitive

BINARY_CHECKSUM is Collation In-sensitive

BINARY_CHECKSUM may fail to detect symmetric changes

BINARY_CHECKSUM is sensitive to the expression evaluation order

HASHBYTES

HASHBYTES can detect symmetric changes

Storage considerations

Summary

Food for thought

Disclaimer

About the author

Rate

Share

Categories

Share

Rate

An in-depth look at change detection in SQL Server - Part 02

BINARY_CHECKSUM

BINARY_CHECKSUM is case-sensitive

BINARY_CHECKSUM is Collation In-sensitive

BINARY_CHECKSUM may fail to detect symmetric changes

BINARY_CHECKSUM is sensitive to the expression evaluation order

HASHBYTES

HASHBYTES can detect symmetric changes

Storage considerations

Summary

Food for thought

Disclaimer

About the author

Rate

Share

Categories

Share

Rate

Related content

An in-depth look at change detection in SQL Server - Part 01

HASHBYTES: Is CHECKSUM really required?

Avoiding Unexpected Results with DATEDIFF in Year Comparisons