SQLServerCentral Article

Change Data Capture - SQL Server 2008

,

Often applications need to keep track of all the changes that have been made to data. This helps in maintaining record history for auditing purposes, or for data security purposes. At times this can help with  other business decisions as well. Whenever we try to capture data from a transactional database for an auditing purpose or to save all the history of data changes for a data security purpose, we have either of the following options:

  • Compare the data between the already extracted data to the data present in database. The drawback is this takes time and impacts the performance of the transactional database.
  • Write “Triggers” so when a change is made to data by a transaction, this gets reported in an audit through the triggers. The drawback is this improves auditing on a real time basis but impacts the performance of a live application.

To overcome these issues we have wonderful feature in SQL Server 2008 – Change Data Capture (CDC). CDC allows us to maintain DML change history very efficiently and without complicated programming.

Here we will discuss how to configure CDC and will show how to make CDC work. For this we will create a new table then enable CDC on database and on this newly created table and then perform some DML operation and finally see how CDC captures this data change.

Step 1: Create table “tblEmployee”

Here we will create a table named “tblEmployee” in database “HRDatabase”. This table has 5 columns with EmployeeID as the Primary Key. The other columns hold the details of the Employees.

CREATE TABLE [dbo].[tblEmployee](
      [EmployeeID] [int] NOT NULL,
      [ManagerID] [int] NULL,
      [FirstName] [varchar](20) NOT NULL,
      [LastName] [varchar](20) NOT NULL,
      [Gender] [bit] NULL,
 CONSTRAINT [PK_tblEmployee] PRIMARY KEY CLUSTERED
(
      [EmployeeID] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, 
       ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]

Step 2: Enable Change Data Capture (CDC) on Database

Before we enable CDC on a table we need to enable it at a database level. This code does enables CDC.

USE HRDatabase
GO
EXEC sys.sp_cdc_enable_db
GO

Step 3: Enable CDC at Table Level

Now we enable CDC for the table “tblEmployee” as well. Here we can mention specific columns that we want to monitor for tracking changes. If no column is mentioned, by default all the columns within the table are tracked.

EXEC sys.sp_cdc_enable_table
 @Source_schema = N'dbo',
 @source_name = N'tblEmployee',
 @filegroup_name = N'PRIMARY',
 @supports_net_changes = 1
GO

 Once the above script is run, it creates the following jobs:

1] csc.HRDatabase_capture – This job executes the system stored procedure “sys.sp_MScdc_capture_job”, which internally calls another system stored procedure named “sys.sp_cdc_scan”. These two stored procedures enable SQL Server Agent, which enables the CDC feature in SQL Server 2008.

2] csc.HRDatabase_cleanup – This job executes the system stored procedure “sys.sp_MScdc_cleanup_job”. This is very useful as it prevents the change table from growing uncontrollably and enables the cleanup of the database change tables.

Both these jobs once created are also enabled automatically.

Step 4: Check if CDC is enabled correctly

Let’s check if the CDC has been enabled on the database and the table where we want it to be. This is basically a verification step, to make sure above queries have performed their task as expected. This code does that.

SELECT Name As DataBaseName, is_cdc_enabled FROM sys.databases Where is_cdc_enabled = 1
SELECT Name AS TableName, is_tracked_by_cdc FROM sys.tables Where is_tracked_by_cdc = 1 

 The results:

Once this is tested, we can check the system tables and find a new table named “cdc.dbo_tblEmployee_CT”. This table will henceforth contain all the DML changes performed on dbo.tblEmployee table. Once you expand the “Columns” of this table, we will find all the columns from “tblEmployee” table as well as an extra 5 more columns.

 

These columns are:

  1. __$start_lsn         : Start Log Sequence Number
  2. __$end_lsn          : End log Sequence Number
  3. __$seqval            : Sequence value used to order row changes within transaction
  4. __$operation        : Contains value corresponding to DML operations. Value list is as follows:

                                 1 = Delete

                                 2 = Insert

                                 3 = Before Update

                                 4 = After Update

  5. __$update_mask  : Provides details about the column which was/were updated in DML operation. This is bitmap field.

If we query both tables, we find they are both empty right now.

SELECT * FROM tblEmployee
SELECT * FROM cdc.dbo_tblEmployee_CT

 

Step 5: Perform DML operation and check Change Data Capture

First we insert a few records into the tblEmployee table.

INSERT INTO tblEmployee Values (1, Null, 'Mike', 'Fields', 1)
INSERT INTO tblEmployee Values (2, 1, 'John', 'Hopkins', 1)
INSERT INTO tblEmployee Values (3, 1, 'Henry', 'III', 1)

 Let’s check the impact of above Insert SQL’s on our tables (tblEmployee and cdc.dbo_tblEmployee_CT):

 

We did not enter any records into the table cdc.dbo_tblEmployee_CT table ourselves, yet due to CDC the complete insert DML operation was captured and recorded in this table. Let’s see how this behaves on Update and Delete operations.

Delete tblEmployee Where EmployeeID = 2
Update tblEmployee Set FirstName = 'Suresh', LastName = 'Sankar' Where EmployeeID = 3

Now let’s check changes in our 2 tables (tblEmployee and cdc.dbo_tblEmployee_CT):

Step 6: Retrieve Changed Information through CDC

Now since we have verified that CDC is working and tracking all DML changes on the table as expected, thus let’s prepare a strategy to retrieve this information for business use. We generally need to track changes over a particular timeframe. If we look at our CDC tracking table “cdc.dbo_tblEmployee_CT” there is no column with date/time value, however there is a field “__$start_lsn” which can help us to retrieve the required information. LSN stands for Log Sequence Number. Every record in the transaction log is uniquely identified by an LSN, and they are all incrementing numbers. Also the LSNs are associated with time and the mapping can be found using the table “cdc.lsn_time_mapping”. There are two more functions provided to help us achieve our goal.

These functions are created when table level CDC is enabled:

    1. cdc.fn_cdc_get_all_changes_dbo_tblEmployee: Capture ALL changes occurred during particular timeframe

    2. cdc.fn_cdc_get_net_changes_dbo_tblEmployee: Capture NET changes occurred during particular timeframe

 

Now let’s get all the changes that happened from yesterday till today. 

DECLARE @BEGIN_LSN BINARY(10), @END_LSN BINARY(10);
SET @BEGIN_LSN = (SELECT MIN(start_lsn) FROM cdc.lsn_time_mapping 
                 Where CAST(tran_begin_time As DATE) = CAST(GETDATE()-1 AS DATE) and tran_id != 0)
SET @END_LSN  = (SELECT MAX(start_lsn) FROM cdc.lsn_time_mapping 
                Where CAST(tran_begin_time As DATE) = CAST(GETDATE()-1 AS DATE) and tran_id != 0)
SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_tblEmployee(@BEGIN_LSN, @END_LSN, 'all')

This provides the details of all the DML operation performed on the table during the specified time. If we need to get only the net changes on data due to various DML operations then we can use following:

SELECT * FROM cdc.fn_cdc_get_net_changes_dbo_tblEmployee(@BEGIN_LSN, @END_LSN, 'all')

CDC Cleanup Process

Once we set up CDC, all the changes occuring on a table get tracked and stored in CDC's respective tables. If CDC is enabled on all tables in a database, there is very good chance that we will fall short of available space on the hard-drive, which creates maintenance issues. To resolve this CDC enables an automatic cleanup process, which runs at regular intervals and by default is configured to cleanup after a three day interval. This period is configurable. The system stored procedure "sys.sp_cdc_cleanup_change_table" gets created once CDC is enabled on a database. This comes handy when we want to change the cleanup retention period.

The retention period value specified for CDC Cleanup Process is in minutes and is should be a positive value greater than 0 and <= 52594800, i.e less than 100 years.

EXEC SYS.SP_CDC_CHANGE_JOB @JOB_TYPE = 'Cleanup', @RETENTION = 500

Summary

With CDC we have a robust way in SQL Server 2008 to record all the DML operations that are performed on a database. This feature is a big help for auditing databases and also for debugging certain scenarioa where we need to know exact sequence of changes happening on the database.

Further Reading for Performance Tuning of Change Data Capture visit http://msdn.microsoft.com/en-us/library/dd266396(v=sql.100).aspx

Rate

4.74 (34)

You rated this post out of 5. Change rating

Share

Share

Rate

4.74 (34)

You rated this post out of 5. Change rating