All enterprise class products have at least one requirement that requires the application to be able to identify changed records. These requirements may revolve around the following uses:
- Detecting checksum issues for custom replication conflict resolution
- Tamper detection - detecting content changes/manipulation
- Handling concurrency issues
The question therefore is: What are the various change detection and tamper protection mechanisms available within SQL Server?
Microsoft SQL Server provides about four (4) methods that may be used by teams for identification of modified records. While the requirement will dictate which of these functionalities need to be used, I have often seen teams getting confused between these options, which ultimately results in an incomplete realization of the requirement. The four functionalities are:
We have looked at the first two change detection mechanisms in Part 1 of this two part series. In this part, we will be looking at the remaining two methods (BINARY_CHECKSUM and HASHBYTES), and finally come up with a comparative analysis/study of all the four.
The BINARY_CHECKSUM system function is an improvement to CHECKSUM and performs a binary evaluation of the data. BINARY_CHECKSUM computes the checksum based on this binary representation of the data. Therefore, the BINARY_CHECKSUM promises to be a dramatic improvement over CHECKSUM.
We know that CHECKSUM can fail in the following scenarios:
- Detecting case changes on a case-insensitive changes
- CHECKSUM cannot compare data across collations because different collations result in different CHECKSUM values
- CHECKSUM may fail to detect a symmetric change
Let us therefore test if BINARY_CHECKSUM can overcome the limitations of CHECKSUM:
BINARY_CHECKSUM is case-sensitive
To prove that BINARY_CHECKSUM is sensitive to case changes, let’s perform the following test. The tests are similar to those we performed for CHECKSUM in that we will be creating a temporary table with data columns and compute the BINARY_CHECKSUM over these.
USE tempdb; GO SET NOCOUNT ON; --Create a test table CREATE TABLE #BinaryCheckSumDemo (Id INT IDENTITY (1,1), EmployeeName NVARCHAR(50), CityName NVARCHAR(50), BinaryCheckSum AS BINARY_CHECKSUM(EmployeeName, CityName) ); GO --Insert some test data INSERT INTO #BinaryCheckSumDemo (EmployeeName, CityName) VALUES ('John','Boston'), ('Tim','Paris'), ('Robert','New York'), ('Jim','Orlando'); GO SELECT * FROM #BinaryCheckSumDemo; GO
The BINARY_CHECKSUM values for the rows currently available in the table are given below:
Now, let us update the value of CityName for one of the records such that the case of one of the values changes.
--Try changing the case of the value and --see if the BINARY_CHECKSUM value changes UPDATE #BinaryCheckSumDemo SET CityName = UPPER(CityName) WHERE EmployeeName = 'Tim'; GO SELECT * FROM #BinaryCheckSumDemo; GO
As can be seen above highlighted in green, the BINARY_CHECKSUM computation for the updated record has been updated. This is because the binary representation of "PARIS" and "paris" is different, resulting in the different BINARY_CHECKSUM value.
IMPORTANT: BINARY_CHECKSUM is able to detect case changes in a case insensitive database.
BINARY_CHECKSUM is Collation In-sensitive
One of the main drawbacks in case of CHECKSUM is the fact that it is collation sensitive. If the same data is stored using two different collations, CHECKSUM would incorrectly flag them as being different. If the data is the same, the checksum values should also be the same. Therefore, let us recreate this scenario for BINARY_CHECKSUM and study the behaviour:
USE tempdb; GO --Create a test table to study collation sensitivity of CHECKSUM CREATE TABLE #BinaryChecksumCollationSensivity (SourceKey INT IDENTITY(1,1), CaseInsensitiveColumn VARCHAR(20) COLLATE SQL_Latin1_General_CP1_CI_AS, CaseSensitiveColumn VARCHAR(20) COLLATE SQL_Latin1_General_CP1_CS_AS, BinaryChecksumInsensitive AS BINARY_CHECKSUM(CaseInsensitiveColumn), BinaryChecksumSensitive AS BINARY_CHECKSUM(CaseSensitiveColumn) ); GO --Insert some test data, make sure that both values have the same data INSERT INTO #BinaryChecksumCollationSensivity (CaseInsensitiveColumn, CaseSensitiveColumn) VALUES ('Microsoft', 'Microsoft' ), ('Windows', 'Windows' ), ('Sql Server', 'Sql Server'), (NULL, NULL ), ('', '' ); GO --Fetch the data from the test table SELECT * FROM #BinaryChecksumCollationSensivity; GO
As can be seen from the results table shown below, the BINARY_CHECKSUM values for both collations (case sensitive and the case insensitive) is the same.
|3||Sql Server||Sql Server||-447589019||-447589019|
IMPORTANT: BINARY_CHECKSUM is not dependent upon the collation case-sensitivity of the database.
BINARY_CHECKSUM may fail to detect symmetric changes
Just when BINARY_CHECKSUM was appearing to be an ideal replacement for CHECKSUM, we run into another road-block. For the same reasons as CHECKSUM, the BINARY_CHECKSUM may fail to detect symmetric changes.
To confirm, let us repeat the test we did for CHECKSUM:
USE tempdb; GO SET NOCOUNT ON; --Symmetric values fail CHECKSUM test SELECT BINARY_CHECKSUM('Nakul') AS 'Before Change', BINARY_CHECKSUM('NANANANANANANANANakul') AS 'After Change'; GO
The result we get is shown below and confirms that BINARY_CHECKSUM was unable to detect a symmectic change:
IMPORTANT: BINARY_CHECKSUM fails to detect symmetric changes.
BINARY_CHECKSUM is sensitive to the expression evaluation order
If the application has a very limited possibility of symmetrical changes and the application can afford to fail in detecting these changes once in a blue moon, then BINARY_CHECKSUM is a vast improvement over CHECKSUM.
Like any good tamper protection algorithm, BINARY_CHECKSUM also maintains sensitivity around expression evaluation order because of the very nature of the mathematics around computing a CHECKSUM.
Here’s a quick test:
USE tempdb; GO SET NOCOUNT ON; --Create a test table CREATE TABLE #BinaryCheckSumDemo (Id INT IDENTITY (1,1), EmployeeName NVARCHAR(50), CityName NVARCHAR(50), DataBinaryCheckSum1 AS BINARY_CHECKSUM(EmployeeName, CityName), DataBinaryCheckSum2 AS BINARY_CHECKSUM(CityName, EmployeeName) ); GO --Insert some test data INSERT INTO #BinaryCheckSumDemo (EmployeeName, CityName) VALUES ('John','Boston'), ('Tim','London'), ('Robert','New York'), ('Jim','Orlando'); GO SELECT * FROM #BinaryCheckSumDemo; GO
As can be seen in the table above, the BINARY_CHECKSUM values do not match for any evaluation sequence making it a great candidate for an ideal change detection mechanism.
IMPORTANT: In summary, BINARY_CHECKSUM improves a lot upon CHECKSUM in that it is a much advanced change-detection and tamper protection mechanism. However, there is room for improvement.
An improvement over both the CHECKSUM algorithms is achieved by using the HASHBYTES function.
The HASHBYTES system function returns a hash value based on the supplied input. This hash value does not follow a checksum algorithm, and therefore is immune from the problems faced by CHECKSUM and BINARY_CHECKSUM. The hash value generated by HASHBYTES can use one of the many supported encryption algorithms - MD2, MD4, MD5, SHA, SHA1, or SHA2.
The HASHBYTES function carries over all the desirable characteristics from BINARY_CHECKSUM, which are:
- The HASHBYTES function is able to detect data changes
- HASHBYTES can distinguish between the “false positives” and genuine updates
- HASHBYTES is case-sensitive
- HASHBYTES is collation insensitive
- HASHBYTES is sensitive to expression evaluation order
The only point where BINARY_CHECKSUM falls short is:
- Detecting changes due to symmetric changes
In order to establish that HASHBYTES is indeed supreme, all we need to test for is the symmetric update test.
HASHBYTES can detect symmetric changes
To check whether or not HASHBYTES can detect symmetric changes, let us repeat the same test as we did for BINARY_CHECKSUM, this time however, we will use the HASHBYTES function with the MD5 encryption.
USE tempdb; GO SET NOCOUNT ON; --Symmetric values fail CHECKSUM test SELECT HASHBYTES('MD5','Nakul') AS 'Before Change', HASHBYTES('MD5','NANANANANANANANANakul') AS 'After Change'; GO
As seen in the results below, the HASHBYTES value for both data values are different. This is because the HASHBYTES is an encrypted form of the data - if the data is different, the encrypted result has to be different.
IMPORTANT: The HASHBYTES therefore is a “complete” change detection & tamper protection solution.
In return for offering a complete and "tamper-proof" solution, HASHBYTES comes with a cost in terms of increased storage requirements.
Let us run the following query that creates a table with all known types of change detection mechanisms and attempts to identify the associated data-types:
USE tempdb; GO SET NOCOUNT ON; --Create a test table CREATE TABLE CheckSumDemo (Id INT IDENTITY (1,1), EmployeeName NVARCHAR(50), CityName NVARCHAR(50), DataVersion ROWVERSION, DataCheckSum AS CHECKSUM(EmployeeName, CityName), DataBinaryCheckSum AS BINARY_CHECKSUM(EmployeeName, CityName), DataHashBytes AS HASHBYTES('MD5', (EmployeeName + CityName)) ); GO --Check for associated data-types SELECT OBJECT_NAME(sc.object_id), sc.name AS ColumnName, st.name AS Datatype, sc.max_length AS ColumnLength, sc.collation_name AS Collation, sc.is_computed AS IsColumnComputed FROM sys.columns AS sc LEFT OUTER JOIN sys.types AS st ON sc.user_type_id = st.user_type_id WHERE sc.object_id = OBJECT_ID('CheckSumDemo') ORDER BY sc.column_id; GO
The result is shown below. As can be seen from the results table, CHECKSUMs occupy the least space, whereas HASHBYTES occupies the most.
The result obtained is quite interesting, and we can deduce the following facts:
- The ROWVERSION is not a computed column – the values are actually persisted within the database
- From a storage standpoint, both CHECKSUM and BINARY_CHECKSUM are the same
- HASHBYTES consume a comparatively large disk space – 8000 bytes worth of it!
Let us take a moment to compare the 4 known mechanisms for change detection and tamper protection within Microsoft SQL Server.
|Size Requirements||8 bytes||4 bytes||4 bytes||8000 bytes|
|"Value for Money"||Yes||Yes|
PLEASE NOTE: Timestamp is deprecated. The fact that ROWVERSION continues to be seen as a “timestamp” in the sys.types listing has been reported under Connect cases:
This comparison shows us that if the goal is simply to implement concurrency protection, ROWVERSION is the best and only choice. However, for change detection and basic tamper protection, BINARY_CHECKSUM should be used.
While HASHBYTES offers the best of all worlds, it comes at a cost – storage. Therefore, HASHBYTES should only be used if data accuracy is of utmost importance.
If you want to read more, Part 1 is here.
Food for thought
- The change detection functions can be used to detect and delete duplicate data
- If the goal for any system is basic change detection, BINARY_CHECKSUM offers the best value for money
- ROWVERSION can be used for concurrency detection and therefore can be used to implement optimistic concurrency protection (An implementation of ROWVERSION for optimistic concurrency detection can be found on my blog at: http://bit.ly/VpEwbH)
- The scripts in this article are provided "as-is", i.e. without warranties of any kind and are intended for demonstration purposes only
- Request you to use these scripts for understanding and study purposes in your development environments only - they are not meant for use in production. The author is not responsible for any damage caused by misuse of these scripts
- Scripts are tested on SQL 2008 R2
- The opinions expressed herein are his own personal opinions and do not represent his employer’s view in anyway
About the author
Nakul Vachhrajani is a Technical Specialist and systems development professional with iGATE having a total IT experience of more than 8 years. Nakul is an active blogger with BeyondRelational.com (230+ blogs), and can also be found on forums at SQLServerCentral and BeyondRelational.com. Nakul has also been a guest columnist for SQLAuthority.com and SQLServerCentral.com. Nakul presented a webcast on the “Underappreciated Features of Microsoft SQL Server” at the Microsoft Virtual Tech Days Exclusive Webcast series (May 02-06, 2011) on May 06, 2011. He is also the author of a research paper on Database upgrade methodologies, which was published in a Computer Society of India (CSI) journal, published nationwide. In addition to his passion about SQL Server, Nakul also contributes to the academia out of personal interest. He visits various colleges and universities as an external faculty to judge project activities being carried out by the students.