SQLServerCentral Article

HASHBYTES: Is CHECKSUM really required?

,

All enterprise class products have at least one requirement that requires the application to be able to identify changed records. These requirements may revolve around the following uses:

  • Auditing
  • Detecting checksum issues for custom replication conflict resolution
  • Tamper detection - detecting content changes/manipulation
  • Handling concurrency issues

Microsoft SQL Server provides four (4) methods that may be used by teams for identification of modified records:

  • ROWVERSION
  • CHECKSUM
  • BINARY_CHECKSUM
  • HASHBYTES

My articles on "An in-depth look at change detection in SQL Server" (Part 01 and Part 02) attempt to compare each of these four methods. The comparison that I came up with is shown below: 

A comparison of the various change detection and tamper protection mechanisms available in Microsoft SQL Server
ROWVERSION CHECKSUM BINARY_CHECKSUM HASHBYTES
Concurrency Detection Yes Yes Yes Yes
Change Detection No Basic Yes Yes
Tamper Protection No No Basic Yes
Storage/Data-type Timestamp INT INT VARBINARY
Size Requirements 8 bytes 4 bytes 4 bytes 8000 bytes
Best Option Yes
"Value for Money" Yes Yes

As shown in the grid above, the best mechanism is the HASHBYTES function which works on the principle that the encrypted representation of the data is always different. However, HASHBYTES may come at a storage overhead which is why the best "Value for money" solution is to use ROWVERSION for optimistic concurrency detection and BINARY_CHECKSUM for change detection.

If you observed CHECKSUM did not feature as a recommended solution for change detection and tamper protection. That leads us to the question: Is CHECKSUM really required?

The answer is "Yes"!

The reason CHECKSUM is important is that CHECKSUM fulfills all the requirements of a hash function: When applied over any two lists of expressions, CHECKSUM returns the same value if the corresponding elements of the lists have the same type and are equal when compared using the equal to (=) operator (considering that NULL for a given data type compare as equal).

CHECKSUM therefore has a unique use in that it can be used to create hash-indexes on a table.

A practical example for hash indexes using CHECKSUM

Consider an application that allows users to search for a ZIP code based on an area address or vice versa. These are actual applications that can be found in India at the India Post (http://www.indiapost.gov.in/) or in the United States at the United States Postal Service (https://www.usps.com/) web-site. The main challenge that these applications face is that the search pattern may be very different depending upon the criteria used by the user - users can query either on the ZIP code (which can easily be indexed) or on the area address that consists at least of the City, State and Country.

To demonstrate how hash-indexes may be of use in such a situation, consider the main data source table to be similar to the following (Please NOTE: This example uses purely hypothetical data).

USE tempdb;
GO
SET NOCOUNT ON;
CREATE TABLE PostalCode 
       (PostalCode NVARCHAR(6),
        PostalArea NVARCHAR(20),
        PostalCity NVARCHAR(28),
        PostalState NVARCHAR(2),
        PRIMARY KEY CLUSTERED (PostalCode, PostalArea)
);
GO
--Hypothetical data!
INSERT INTO PostalCode 
       (PostalCode, PostalArea, PostalCity, PostalState)
VALUES ('382001','Sector 01','Gandhinagar','GJ'),
       ('382002','Sector 02','Gandhinagar','GJ'),
       ('382002','Koba','Gandhinagar','GJ'),
       ('382003','Sector 03','Gandhinagar','GJ'),
       ('382004','Sector 04','Gandhinagar','GJ'),
       ('382005','Sector 05','Gandhinagar','GJ'),
       ('382006','Sector 06','Gandhinagar','GJ'),
       ('382007','Sector 07','Gandhinagar','GJ'),
       ('382008','Sector 08','Gandhinagar','GJ'),
       ('382009','Sector 09','Gandhinagar','GJ'),
       ('382010','Sector 10','Gandhinagar','GJ'),
       ('382011','Sector 11','Gandhinagar','GJ'),
       ('382012','Sector 12','Gandhinagar','GJ'),
       ('382013','Sector 13','Gandhinagar','GJ'),
       ('382014','Sector 14','Gandhinagar','GJ'),
       ('382015','Sector 15','Gandhinagar','GJ'),
       ('382016','Sector 16','Gandhinagar','GJ'),
       ('382017','Sector 17','Gandhinagar','GJ'),
       ('382018','Sector 18','Gandhinagar','GJ'),
       ('382019','Sector 19','Gandhinagar','GJ'),
       ('382020','Sector 20','Gandhinagar','GJ'),
       ('382021','Sector 21','Gandhinagar','GJ'),
       ('382022','Sector 22','Gandhinagar','GJ'),
       ('382026','Sector 26','Gandhinagar','GJ'),
       ('382027','Sector 27','Gandhinagar','GJ'),
       ('382028','Sector 28','Gandhinagar','GJ'),
       ('382029','Sector 29','Gandhinagar','GJ'),
       ('382030','Sector 30','Gandhinagar','GJ'),
       ('360001','Race Course','Rajkot','GJ');
GO

The clustered index on the PostalCode and Area is sufficient for PostalCode based searches. However if the user needs to search for the PostalCode based on the Area, City & State information, the story is a bit different. A typical approach to address this challenge involves creation of a non-clustered index on the columns of interest (Area, City & State) and including the PostalCode as an included column is shown below.

--Typical Solution
CREATE NONCLUSTERED INDEX idx_PostalAreaSearchNC 
ON PostalCode (PostalArea, PostalCity, PostalState) 
INCLUDE (PostalCode);
GO

Both the clustered and the non-clustered index on the PostalCode table work well in the given scenario. This can demonstrated by the following query (which is to be executed by keeping the Actual Execution Plan ON).

--Search on the PostalCode
SELECT * 
FROM dbo.PostalCode 
WHERE PostalCode = '382028';
--Search on the Area, City & State
SELECT * 
FROM dbo.PostalCode 
WHERE PostalArea = 'Sector 28' 
  AND PostalCity = 'Gandhinagar'
  AND PostalState = 'GJ';

Here is the execution plan (shown in Fig 1):

Fig 1 - Clustered Index and Non-clustered indexes used depending upon the query parameters

As can be seen from the screen-shots above, the queries result in Index Seeks which is desirable. Now, for research purposes, let us create a checksum column on the PostalCode table that computes a CHECKSUM on the PostalArea, PostalCity and PostalState columns.

--Create a CHECKSUM column on the table
ALTER TABLE PostalCode
ADD PostalCheckSum AS CHECKSUM(PostalArea, PostalCity, PostalState);
GO

After creating the CHECKSUM column, let us create another non-clustered index on the checksum column with the PostalCode as an included column.

--Create another non-clustered index
--Use this checksum value, with the PostalCode as an included column
CREATE NONCLUSTERED INDEX idx_PostalChecksumSearchNC 
ON PostalCode (PostalCheckSum) 
INCLUDE (PostalCode);
GO

Using the new index, a user can query the table as shown below:

--Search on the Area, City & State
DECLARE @PostalArea NVARCHAR(20) = 'Sector 28'
DECLARE @PostalCity NVARCHAR(28) = 'Gandhinagar'
DECLARE @PostalState NVARCHAR(2) = 'GJ'
SELECT * 
FROM dbo.PostalCode 
WHERE PostalCheckSum = CHECKSUM(@PostalArea, @PostalCity, @PostalState);

A CHECKSUM column contains hash values for a given expression, and therefore, the index created on a CHECKSUM column is called a hash index.

The prime benefit of a hash index is that they reduce the amount of space required for index storage by a drastic amount.

Let’s check the relative sizes of the normal non-clustered index and the hash index used in this example:

--Check the relative size of the regualr non-clustered index and the hash index
SELECT OBJECT_NAME(ips.object_id) AS TableName,
       si.name AS IndedName,
       ips.index_id AS IndexId,
       ips.index_type_desc AS IndexType,
       ips.avg_record_size_in_bytes AS AverageRecordSize,
       ips.min_record_size_in_bytes AS MinimumRecordSize,
       ips.max_record_size_in_bytes AS MaximumRecordSize
FROM sys.dm_db_index_physical_stats(DB_ID(),OBJECT_ID('PostalCode'),NULL,NULL,'DETAILED') AS ips
INNER JOIN sys.indexes AS si ON ips.index_id = si.index_id
WHERE si.name = 'idx_PostalAreaSearchNC' OR si.name = 'idx_PostalChecksumSearchNC';
Comparison of the space utilized by a regular non-clustered index v/s that utilized by a hash index
TableName IndexName

Index

Id

Index

Type

Avg.

Record

Size

Min.

Record

Size

Max.

Record

Size

PostalCode idx_PostalAreaSearchNC 2 NONCLUSTERED INDEX 69.448 60 70
PostalCode idx_PostalChecksumSearchNC 5 NONCLUSTERED INDEX 43.793 34 48

Conclusion

We can see that simply using a hash index in such a simple scenario provides a reduction of about 36% [{(69.448-43.793)/69.448}*100 = 36.94%] in the space required by an index to fulfill the same requirement.

Please note that this does imply that all non-clustered indexes be converted to hash indexes. Those indexes which have a wider disk footprint resulting out of a large number of character columns included on the index are ideal candidates for hash indexes. Higher the record size of an index, the more likely is the system going to be benefited by the use of hash indexes.

Another use of Hash Indexes

Microsoft SQL Server has the maximum size of the index keys limited to 900 bytes. To create an index with a key greater than 900 bytes, use a hash index using CHECKSUM!

Please note that creating a hash index materializes the checksum column.

Disclaimer

  • The scripts in this article are provided "as-is", i.e. without warranties of any kind and are intended for demonstration purposes only
  • Request you to use these scripts for understanding and study purposes in your development environments only - they are not meant for use in production. The author is not responsible for any damage caused by misuse of these scripts
  • Scripts are tested on SQL 2008 R2
  • The opinions expressed herein are his own personal opinions and do not represent his employer’s view in anyway

About the author

Nakul Vachhrajani is a Technical Specialist and systems development professional with iGATE having a total IT experience of more than 8 years. Nakul is an active blogger with BeyondRelational.com (250+ blogs), and can also be found on forums at SQLServerCentral and BeyondRelational.com. Nakul has also been a guest columnist for SQLAuthority.com and SQLServerCentral.com. Nakul presented a webcast on the “Underappreciated Features of Microsoft SQL Server” at the Microsoft Virtual Tech Days Exclusive Webcast series (May 02-06, 2011) on May 06, 2011. He is also the author of a research paper on Database upgrade methodologies, which was published in a Computer Society of India (CSI) journal, published nationwide. In addition to his passion about SQL Server, Nakul also contributes to the academia out of personal interest. He visits various colleges and universities as an external faculty to judge project activities being carried out by the students.

Blog | LinkedIn | Twitter | Google+

Rate

4.87 (23)

You rated this post out of 5. Change rating

Share

Share

Rate

4.87 (23)

You rated this post out of 5. Change rating