An in-depth look at change detection in SQL Server - Part 01

All enterprise class products have at least one requirement that requires the application to be able to identify changed records. These requirements may revolve around any of the following uses:

Auditing
Detecting checksum issues for custom replication conflict resolution
Tamper detection - detecting content changes/manipulation
Handling concurrency issues

The question therefore is: What are the various change detection and tamper protection mechanisms available within SQL Server?

Microsoft SQL Server provides about four similar functionalities that may be used by teams for identification of modified records. While the requirement will dictate which of these functionalities need to be used, I have often seen teams getting confused between these options, which ultimately results in an incomplete realization of the requirement.

In this article, we will be taking an in-depth look at the first two of these four mechanisms:

ROWVERSION
CHECKSUM

The other two mechanisms will be covered in part 2.

ROWVERSION

Let us start with the most popular of all: ROWVERSION. ROWVERSION is a mechanism provided by SQL Server for version-stamping of rows. The term “version-stamping” means that ROWVERSION does not store a date/time stamp, it stores an incrementing number.

Every insert or update statement that gets executed against a SQL Server database increments an internal counter called the database ROWVERSION, provided the table being updated has a ROWVERSION column on it. Each table can have only one ROWVERSION column.

Let’s take a look at how ROWVERSION works. To begin with, let’s create a temporary table, and insert some test data into it. Note that this table has the ROWVERSION column whose values are of most interest to us:

USE tempdb
GO
SET NOCOUNT ON;

--Create a test table
CREATE TABLE #RowVersionDemo
     (Id INT IDENTITY (1,1),
      EmployeeName NVARCHAR(50),
      CityName NVARCHAR(50),
      DataVersion ROWVERSION
     );
GO

--Insert some test data
INSERT INTO #RowVersionDemo 
       (EmployeeName, CityName)
VALUES ('John','Boston'),
       ('Tim','London'),
       ('Robert','New York'),
       ('Jim','Orlando');
GO

SELECT * FROM #RowVersionDemo;
GO

The SELECT statement used to take a look at the RowVersion values yields the following result. Notice that the ROWVERSION is an incrementing counter value – the first record has a value of 80B, the next is 80B + 1 = 80C and so on.

Identifying Changed records using ROWVERSION
Id	EmployeeName	CityName	DataVersion
1	John	Boston	0x000000000000080B
2	Tim	London	0x000000000000080C
3	Robert	New York	0x000000000000080D
4	Jim	Orlando	0x000000000000080E

Next, let us update one of the records, and then take a look at the value of the DataVersion column.

--Modify some test data, 
UPDATE #RowVersionDemo
SET CityName = 'Paris'
WHERE EmployeeName = 'Tim';
GO

--Check the RowVersions
SELECT * FROM #RowVersionDemo;
GO

Identifying Changed records using ROWVERSION - Modified record
Id	EmployeeName	CityName	DataVersion
1	John	Boston	0x000000000000080B
2	Tim	Paris	0x000000000000080F
3	Robert	New York	0x000000000000080D
4	Jim	Orlando	0x000000000000080E

As seen above highlighted in green, the value of the DataVersion column has changed for the updated record. Again, this is a serial change from the last operation. The last record to be inserted/updated had a ROWVERSION of 80E, hence the update has a ROWVERSION of 80E + 1 = 80F. The other records did not undergo a change. Every time a row with a ROWVERSION column is inserted or modified, the value stored by the ROWVERSION column changes.

We now know that ROWVERSION can detect changes. But, what needs to be seen now is whether the ROWVERSION is really a “good” change-detection mechanism. To do so, let us update the column with itself.

--FALSE UPDATE SCENARIO:
--Update the record value with itself and 
--then check the RowVersion
UPDATE #RowVersionDemo
SET CityName = CityName
WHERE EmployeeName = 'Tim';
GO

SELECT * FROM #RowVersionDemo;
GO

Identifying changed records using ROWVERSION - A false update also changes the ROWVERSION value
Id	EmployeeName	CityName	DataVersion
1	John	Boston	0x000000000000080B
2	Tim	Paris	0x0000000000000810
3	Robert	New York	0x000000000000080D
4	Jim	Orlando	0x000000000000080E

Something is not right in the results shown above. As highlighted in red, the value of the DataVersion column was changed (80F + 1 = 810) for the updated record despite the fact that there is no difference in the data available within the table.

IMPORTANT:
ROWVERSION fails to identify if the data actually changed. ROWVERSION is therefore, not a good choice for detecting changes to a data row
ROWVERSION columns are not a good choice for keys, especially primary keys

Why would ROWVERSION not be a good choice for keys? This is because with each update to the data row, the old key is no longer valid, and all foreign keys need to be updated with the new value.

ROWVERSION however, is ideal to identify if someone or some process “touched” a data row when making a data modification. Therefore, ROWVERSION is the way to go if you are looking to implement an optimistic concurrency solution within the product.

CHECKSUM

As seen above, one of the biggest challenge with ROWVERSION is the detection of "actual" data changes v/s the false positives.

CHECKSUM is a system function available with Microsoft SQL Server which returns the checksum value computed over the row of a table, or a list of expressions. A checksum is essentially a hash value computed over the supplied arguments.

With this knowledge, let us run the same tests again, however, this time, we will use a CHECKSUM column instead of a ROWVERSION column. To begin with, let’s create a test temporary table that computes the CHECKSUM:

USE tempdb;
GO
SET NOCOUNT ON;

--Create a test table
CREATE TABLE #CheckSumDemo
      (Id INT IDENTITY (1,1),
       EmployeeName NVARCHAR(50),
       CityName NVARCHAR(50),
       DataCheckSum AS CHECKSUM(EmployeeName, CityName)
      );
GO

--Insert some test data
INSERT INTO #CheckSumDemo 
       (EmployeeName, CityName)
VALUES ('John','Boston'),
       ('Tim','London'),
       ('Robert','New York'),
       ('Jim','Orlando');
GO

SELECT * FROM #CheckSumDemo;
GO

The contents of the table can be seen below.

Identifying changed records using CHECKSUM
Id	EmployeeName	CityName	DataCheckSum
1	John	Boston	-1140789188
2	Tim	London	-1718893255
3	Robert	New York	1723249590
4	Jim	Orlando	1075367414

In the results shown above, please observe the values in the DataCheckSum column. Unlike RowVersion, there is no relation between these values because they are hash values.

Now, let us modify a record and look at the values of the CHECKSUM. It is essentially the same test as we did for ROWVERSION and so, the modifications are also the same.

--Modify some test data
UPDATE #CheckSumDemo
SET CityName = 'Paris'
WHERE EmployeeName = 'Tim';
GO


--Check the Checksums
SELECT * FROM #CheckSumDemo;
GO

Identifying changed records using CHECKSUM - after the 1st update
Id	EmployeeName	CityName	DataCheckSum
1	John	Boston	-1140789188
2	Tim	Paris	-1522883151
3	Robert	New York	1723249590
4	Jim	Orlando	1075367414

As highlighted in green in the results shown above, CHECKSUM was able to identify the data modifications. However, the acid test is whether CHECKSUM will be able to identify whether an update is a false update, i.e. a row has been updated with itself?

--FALSE UPDATE:
--Update the record value with itself and
--then check the Checksum value
UPDATE #CheckSumDemo
SET CityName = CityName
WHERE EmployeeName = 'Tim';
GO

SELECT * FROM #CheckSumDemo;
GO

Identify changed rows with CHECKSUM - FALSE updates are detected and ignored
Id	EmployeeName	CityName	DataCheckSum
1	John	Boston	-1140789188
2	Tim	Paris	-1522883151
3	Robert	New York	1723249590
4	Jim	Orlando	1075367414

As seen in the results above, the green highlight is used to indicate that CHECKSUM was actually able to distinguish between an actual data change and a false positive.

IMPORTANT: CHECKSUMS therefore, can be used to identify data change.

This is very useful in a scenario where one needs to compare a replicating subscriber with the data published at the publisher. CHECKSUMs can easily tell us whether a record available at the subscriber is same as that on the publisher or is the publisher having a newer copy which due to multiple other reasons out of scope for the current discussion, has not made its way to the subscriber.

CHECKSUM is Case In-sensitive

The question now comes about the nature of data changes that CHECKSUM is able to identify. To do this let us perform another update. This time we will change the case of the value being updated. Please note that on my test system, the tempdb is not case-sensitive:

--FALSE UPDATE:
--Update a record value with a different case and 
--then check the Checksum value
UPDATE #CheckSumDemo
SET CityName = UPPER(CityName)
WHERE EmployeeName = 'Tim';
GO

SELECT * FROM #CheckSumDemo;
GO

Identify changed rows with CHECKSUM - CHECKSUM is NOT case-sensitive
Id	EmployeeName	CityName	DataCheckSum
1	John	Boston	-1140789188
2	Tim	PARIS	-1522883151
3	Robert	New York	1723249590
4	Jim	Orlando	1075367414

As seen above highlighted in red, the value of DataCheckSum column did not change indicating that CHECKSUM is not case sensitive.

CHECKSUM is collation sensitive

CHECKSUM failed to detect a change in case because CHECKSUM is collation sensitive. As verification of this fact, the same data stored under a different collation should return a different CHECKSUM value.

To test this theory, we will create a temporary table with two columns - one using a case sensitive collation (CS) and the other using a case insensitive collation (CI). Columns to hold CHECKSUM values are also created against each column. To make two columns use different collations within the same table, the DDL uses the COLLATE keyword followed by the required collation name.

The test data set used in this test has been chosen to cover most common user scenarios. We have a combination of user data, followed by NULL data (i.e. undefined data) and a set of empty strings.

USE tempdb;
GO

--Create a test table to study collation sensitivity of CHECKSUM
CREATE TABLE #ChecksumCollationSensivity 
      (SourceKey INT IDENTITY(1,1),
       CaseInsensitiveColumn VARCHAR(20) 
                             COLLATE SQL_Latin1_General_CP1_CI_AS,
       CaseSensitiveColumn VARCHAR(20) 
                           COLLATE SQL_Latin1_General_CP1_CS_AS,
       ChecksumInsensitive AS CHECKSUM(CaseInsensitiveColumn),
       ChecksumSensitive AS CHECKSUM(CaseSensitiveColumn)
      );
GO

--Insert some test data, make sure that both values have the same data
INSERT INTO #ChecksumCollationSensivity
       (CaseInsensitiveColumn, CaseSensitiveColumn)
VALUES ('Microsoft', 'Microsoft' ),
       ('Windows', 'Windows' ), 
       ('Sql Server', 'Sql Server'),
       (NULL, NULL ),
       ('', '' );
GO

--Fetch the data from the test table
SELECT * FROM #ChecksumCollationSensivity;
GO

Now let us take a look at the data available in the test table:

Identifying changed rows using CHECKSUM - CHECKSUM is collation sensitive
SourceKey	CaseInsensitiveColumn	CaseSensitiveColumn	ChecksumInsensitive	ChecksumSensitive
1	Microsoft	Microsoft	-110717694	-394820589
2	Windows	Windows	-1376614688	-1125935136
3	Sql Server	Sql Server	-1164226520	-1198886856
4	NULL	NULL	2147483647	2147483647
5			0	0

In the results shown above, note that although the data in both columns (the case sensitive and insensitive collations) is the same, the CHECKSUM values for all records are different – except the empty string and the undefined NULL. Because the string is empty there is no data, and therefore CHECKSUM yields a 0. Because a NULL is an undefined value, the CHECKSUM is the same irrespective of the collation.

IMPORTANT: CHECKSUM is able to detect a data-change within the confines of the rules determined by the collation defined for the database/object.

CHECKSUM may fail to detect a change

Till now, we have established that CHECKSUM is able to distinguish between data updates v/s false positives and is case-insensitive, collation sensitive. However, CHECKSUM faces another issue – there are chances that within the same collation, CHECKSUM may fail to detect a change. We know that if the changes are symmetric, then mathematically the CHECKSUM of these values may have the cumulative effect of “cancelling” each other out. Let’s take a small example:

USE tempdb;
GO
SET NOCOUNT ON;

--Symmetric values fail CHECKSUM test
SELECT CHECKSUM('Nakul') AS 'Before Change',
       CHECKSUM('NANANANANANANANANakul') AS 'After Change';
GO

Checking the CHECKSUM values yields the following results:

Identifying changed records using CHECKSUM - CHECKSUM may fail to detect changes
Before Change	After Change
9663161	9663161

As you can see above, CHECKSUM failed to detect the data change – the data is different, but the nature of the change tricked the algorithm. With this observation, let us reword the conclusion from the previous section:

IMPORTANT: CHECKSUM is able to detect most data-changes within the confines of the rules determined by the collation defined for the database/object.

CHECKSUM is sensitive to the order of expression evaluation

The fact that CHECKSUM is sensitive to collations and prone to failure raises another question – Is CHECKSUM sensitive to the order in which it is evaluated?

To answer this question, let’s perform another simple test by creating a temporary table with two text data columns. The table also has two CHECKSUM columns, but the order in which the two data columns contribute to the checksum have been changed.

USE tempdb;
GO
SET NOCOUNT ON;
--Create a test table
CREATE TABLE #CheckSumDemo 
       (Id INT IDENTITY (1,1),
        EmployeeName NVARCHAR(50),
        CityName NVARCHAR(50),
        DataCheckSum1 AS CHECKSUM(EmployeeName, CityName),
        DataCheckSum2 AS CHECKSUM(CityName, EmployeeName)
       );
GO

--Insert some test data
INSERT INTO #CheckSumDemo 
       (EmployeeName, CityName)
VALUES ('John','Boston'),
       ('Tim','London'),
       ('Robert','New York'),
       ('Jim','Orlando');
GO

SELECT * FROM #CheckSumDemo;
GO

Identifying changed rows using CHECKSUM - CHECKSUM is sensitive to the order of expression evaluation
Id	EmployeeName	CityName	DataCheckSum1	DataCheckSum2
1	John	Boston	-1140789188	804583974
2	Tim	London	-1718893255	-1118192201
3	Robert	New York	1723249590	636918292
4	Jim	Orlando	1075367414	255996234

A quick glance at the results reveals that the two CHECKSUM values are different for every record. CHECKSUM is therefore dependent upon the order of the expressions being evaluated.

IMPORTANT: CHECKSUM can therefore be used to create a tamper protection solution.

Dependency on the order of evaluation is a desirable characteristic for a tamper protection solution. This is because a system may use columns in a non-sequential order to generate the checksum and this dependency on the evaluation sequence makes it difficult to reverse engineer the logic being used.

Summary

We have looked at two change detection mechanisms in this article: ROWVERSION and CHECKSUM. Summarizing the two methods gives us these conclusions:

ROWVERSION fails to identify if the data actually changed. ROWVERSION is therefore, not a good choice for detecting changes to a data row
ROWVERSION columns are not a good choice for keys, especially primary keys
Use CHECKSUM instead of ROWVERSION for "actual" data change detection
CHECKSUM is case in-sensitive
CHECKSUM is collation sensitive
CHECKSUM fails to detect symmetric changes
CHECKSUM is sensitive to the order of expression evaluation

In the next part, we will be looking at the remaining two methods (BINARY_CHECKSUM and HASHBYTES), and finally come up with a comparitive analysis/study of all the four.

Disclaimer

The scripts in this article are provided "as-is", i.e. without warranties of any kind and are intended for demonstration purposes only
Request you to use these scripts for understanding and study purposes in your development environments only - they are not meant for use in production. The author is not responsible for any damage caused by misuse of these scripts
Scripts are tested on SQL Server 2008 R2
The opinions expressed herein are his own personal opinions and do not represent his employer’s view in anyway

About the author

Nakul Vachhrajani is a Technical Specialist and systems development professional with iGATE having a total IT experience of more than 8 years. Nakul is an active blogger with BeyondRelational.com (230+ blogs), and can also be found on forums at SQLServerCentral and BeyondRelational.com. Nakul has also been a guest columnist for SQLAuthority.com and SQLServerCentral.com. Nakul presented a webcast on the “Underappreciated Features of Microsoft SQL Server” at the Microsoft Virtual Tech Days Exclusive Webcast series (May 02-06, 2011) on May 06, 2011. He is also the author of a research paper on Database upgrade methodologies, which was published in a Computer Society of India (CSI) journal, published nationwide. In addition to his passion about SQL Server, Nakul also contributes to the academia out of personal interest. He visits various colleges and universities as an external faculty to judge project activities being carried out by the students.

Blog | LinkedIn | Twitter | Google+

An in-depth look at change detection in SQL Server - Part 01

ROWVERSION

CHECKSUM

CHECKSUM is Case In-sensitive

CHECKSUM is collation sensitive

CHECKSUM may fail to detect a change

CHECKSUM is sensitive to the order of expression evaluation

Summary

Disclaimer

About the author

Rate

Share

Categories

Share

Rate

An in-depth look at change detection in SQL Server - Part 01

ROWVERSION

CHECKSUM

CHECKSUM is Case In-sensitive

CHECKSUM is collation sensitive

CHECKSUM may fail to detect a change

CHECKSUM is sensitive to the order of expression evaluation

Summary

Disclaimer

About the author

Rate

Share

Categories

Share

Rate

Related content

An in-depth look at change detection in SQL Server - Part 02

HASHBYTES: Is CHECKSUM really required?

Identity Values – Two things to know

Indexing HierarchyID

Using the CheckSum Function - SQL School Video

Cookies on SQLServerCentral