In the previous article in the series we talked about Change Tracking (CT). In this article we will concentrate on another technique available in SQL Server 2008 and higher. This technique is Change Data Capture (CDC).
What is CDC?
CDC is one of the techniques introduced in SQL Server 2008, and it answers the question as to what row in a table changed. It also shows the intermediate values of data change if any. One of the differences between CDC and CT is that the tracking of changes using CDC is done asynchronously whereas in CT the changes are tracked synchronously. Therefore CDC will not have any effect on a transaction. The changes are read using the log reader just like in replication.
It is very easy to enable CDC. All you need to do is execute couple of procedures with the correct parameters. The first is on the database level and the next one is on table level. Let us try to set up CDC.
How can we set up CDC? To set up CDC we need to follow some steps. The steps are shown below. Since we are trying to set up CDC for the first time, it is better to create a test database. By creating a test database, if any problems arise, it is isolated to this database.
USE master; go IF EXISTS (SELECT * FROM sys.databases WHERE name = 'TestDB') DROP DATABASE TestDB; go CREATE DATABASE TestDB; go
It is advisable to have SNAPSHOT ISOLATION On. Snapshot isolation makes sure that there is no blocking when trying to get the change details. There is a small impact on setting the SNAPSHOT ISOLATION. When SNAPSHOT ISOLATION is set, all changes made to the data will be versioned in the TempDB. If you want to read more details regarding versioning, please check the article written by me by clicking here.
To ensure that snapshot isolation is on, you need to run the command:
ALTER DATABASE TestDB SET ALLOW_SNAPSHOT_ISOLATION ON GO
Now let us run the command that enables CDC on the database level.
USE TestDB; GO -- Enable CDC for the database EXECUTE sys.sp_cdc_enable_db; GO
As soon as we run this command it will create a new schema, named cdc, inside the user database. If we expand the user database in the object explorer in SSMS, we will see that a new schema has been created, as shown in the image below.
It will also create five tables inside the System with the schema cdc. The five tables are cdc.captured_columns, cdc.change_tables, cdc.ddl_history, cdc.index_column and cdc.lsn_time_mappings. Let us look at what these tables are used for:
- cdc.captured_columns : As the name implies it returns the list of captured columns.
- cdc.change_tables : This returns the list of tables enabled for CDC.
- cdc.ddl_history : Returns the list of DDL changes for the tables enabled for CDC.
- cdc.index_column : This returns the list of columns used for indexes on the tables enabled for CDC.
- cdc.lsn_time_mapping : As the name implies, this tables has the mapping of LSN numbers.
Let us create a test table and then set it up for CDC.
USE TestDB; GO CREATE TABLE dbo.tbProducts ( ProductID INT NOT NULL CONSTRAINT PK_tbProducts PRIMARY KEY, ProductName VARCHAR(30) NOT NULL, ProductCost MONEY NOT NULL ) GO
By running the above command, we create a table in the database TestDB. To enable CDC for this table, we can use the system stored procedure sys.sp_cdc_enable_table. The stored procedure sys.sp_cdc_enable_table has nine parameters. We will just look at three for this test. If you want to get more details regarding all the nine parameters, please check the MSDN site by clicking here.
EXEC sys.sp_cdc_enable_table @source_schema = N'dbo' , @source_name = N'tbProducts' , @role_name = NULL; GO
Let us look at the three parameters that we will be using for this test. Pre-requisite for running this command is that the SQL Agent should be running.
@source_schema: This parameter is used pass the name of the schema in which the table is created. This parameter cannot be NULL.
@source_name: The parameter @source_name will specify the name of the table that we are enabling for CDC. This is akso a NOT NULL parameter.
@role_name: This parameter is the name of the database role that will be used to access the changes. When @role_name is set as NULL, it takes the default value. The default value is cdc_admin. This statement might take 15 to 30 seconds to execute. Once the execution done, it will give the message shown below:
These are two jobs created when enabling CDC for the first time. In addition to these jobs, the stored procedure sp_cdc_enable_table also creates three stored procedures, two table value functions and a system table.
The two jobs that are created are Capture Job and Clean Up Job. The capture job is intiaited by running a stored procedure sp_MScdc_capture_job. This job in turn calls the sp_cdc_scan. The sp_cdc_scan will then invoke the sp_replcmds to scan the log file. The clean up job invokes the sp_MScdc_cleanup_job. This in turn calls the sp_cdc_cleanup_change_tables with the parameters of the retention period of the change tables. based on this retention period, the entries are deleted from the table. By adding another table to CDC will not create two more jobs. The same jobs will be used for the newly added tables.
The newly created table, cdc.dbo_tbProducts_CT, will have all the changes that will be made on the tbProducts table. This table will have all the columns from the table tbProducts and five new columns as shown in the figure.
Check if it works
As far as enabling of CDC is concerned, it is all done. Did it work? Let us find out. Currently there are no rows in tbProducts table. From now on if we do any data manipulation operation on tbProducts, those will be logged in the system table cdc.dbo_tbProducts_CT. When the job cdc.TestDb_capture runs, it scans the log file for any changes to the tables added for CDC. Once the log reader finds a record that was changed to a table that has been enabled for CDC, it will insert a row into the the system table cdc.dbo_tbproducts_CT.
Let us insert a record into tbProducts.
INSERT INTO dbo.tbProducts( ProductID, Productname, Productcost) VALUES(1, 'Product 1', 1000.00); GO
If we select from the system table we should get one row in it as well.
Select __$start_lsn, __$end_lsn, __$seqval, __$operation, __$update_mask, ProductID, productname, ProductCost from cdc.dbo_tbProducts_CT
The important columns to note are __$start_lsn (we will look at how to use this value later) and __$operation. The column __$operation will tell us what kind of operation was done for that row. The __$operation column is a column with INT data type. The values for this column can be One (Indicates that it is a delete operation), Two (indicates the operation is an insert operation), Three (means that the operation is an Update. But this will contain the Pre updated values for the columns of the table) and Four (Update operation with the committed values for the columns from the table that is enabled for CDC). The values three and four are explained in detail later in this article.
In this particular case the value is "2" since the operation was an Insert. Let us try another insert into the tbProducts table.
INSERT INTO dbo.tbProducts (ProductID, ProductName, ProductCost) VALUES(2, 'Samsung TV', 1500.00); GO
If we do a Select on the cdc.dbo_tbProducts_CT table, we will see two rows. The second one also has the value of two for the __$Operation column. Now let us do an Update operation on the tbProducts table.
UPDATE dbo.tbProducts SET ProductCost = 2200.00 WHERE ProductID = 1 GO
Now how many rows should be there in the cdc.dbo_tbProducts_CT table? It should have three rows in it, right? Let us select the system table and check if we guessed right.
Select __$start_lsn, __$end_lsn, __$seqval, __$operation, __$update_mask, ProductID, productname, ProductCost from cdc.dbo_tbProducts_CT Order by __$start_lsn, __$seqval, __$operation
There are not three rows, there are four rows. Let us go through the four rows of data in that table and explain them. The first two rows of data is due to the two inserts we did. You can see that the value for the __$operation column is '2". As discussed earlier, the value "2" indicates that the operation was an insert. Now let us take a look at row number three and four from the figure shown above. If we look at the third row (Third on the grid), we will see that the values for the ProductID, ProductName and Product Cost in the table cdc.dbo_tbProducts_CT is the same as the values in the table tbProducts before the update was done. You must have noted that the value for __$Operation is three and the next row has the value four. In the last row you will see the data after it got updated. By storing the before update and after update values together, the path of the data change can be tracked down. This is the mechanism used to keep track of how the data changed.
So far we have tested Insert and Update operations and the implementation of CDC seems to be working fine. Let us try a Delete operation.
DELETE dbo.tbProducts WHERE ProductID = 1 GO
After we execute this command, Select from the system table.
Select __$start_lsn, __$end_lsn, __$seqval, __$operation, __$update_mask, ProductID, productname, ProductCost from cdc.dbo_tbProducts_CT Order by __$start_lsn, __$seqval, __$operation
As you can see from the figure, there is one more row. The important column to notice is the __$operation. You will see that the value is "1". The value "1" denotes that the operation that was done was a Delete.
Now let us take a look at how all this information can be leveraged for the application. Let us say that we have a disconnected cache. You might ask what a disconnected cache is. Disconnected cache in simple terms is caching of data that is not always connected to the data source. How it works is the cache connects to the data source and gets the data that needs to be cached and then disconnects itself from the data source. At a predefined period, the cache will connect back to the data source and get all the deltas that happened after the last time it got the data. The method of getting latest data is called a refresh. We can leverage the system table cdc.dbo_tbProducts_CT to get all the changes that was done after the last refresh.
How can we achieve this? We can leverage the values in the column __$start_lsn to achieve this. We can do it by querying the table cdc.dbo_tbProducts_CT. If we do not specify what the last lsn that we have queried was, the query will return all the data changes. If we query the table cdc.dbo_tbProducts_CT based on the last lsn value that we have, the result set will have only changes that happened after the previous execution. From the previous figure we know that the last __$start_lsn is 0x000000640000003C0005. If we execute the query based on the criteria of using the column __$start_lsn being greater than 0x000000640000003C0005, we get the changes that were made after the previous execution. One thing to keep in mind is that the LSN vales will not be the same as what is shown in the figure above in your environment. We can test this scenario out now. For this test let us do some DML statements. Execute the below statements.
INSERT INTO dbo.tbProducts( ProductID, productname, Productcost) VALUES(3, 'Blackberry Curve', 300.00); GO INSERT INTO dbo.tbProducts (ProductID, ProductName, ProductCost) VALUES(4, 'Nikon', 250.00); GO UPDATE dbo.tbProducts SET ProductCost = 220.00 WHERE ProductID = 4 GO DELETE dbo.tbProducts WHERE ProductID = 3 GO
By executing the above statements, we made two inserts, one update and one delete operation. We can get just the changes done now by executing the below provided query.
Select * from cdc.dbo_tbProducts_CT where __$start_lsn >= 0x000000640000003C0005
The result set is as shown below. Please note that the greater than or equal to sign in the query is intentional. This was just to show that the last row from the perivious output is the first row of the current result set. This was just to prove that we did not loose any changes.
As you can see from the above figure, we got all the changes that were done after the last refresh. Unfortunately this is not so helpful when we are trying to refresh data on a disconnected cache. If you look at the data returned by the previous query, you will see that for ProductID four, there are three entries. When you are trying to refresh a disconnected cache, we do not care about the path of the data change, we care only about the final values.
Microsoft kept this in mind when they introduced this technology. When CDC was enabled for a table, the system created two functions. The functions created were cdc.fn_cdc_get_all_changes_dbo_tbProducts and cdc.fn_cdc_get_net_changes_dbo_tbProducts. We can use cdc.fn_cdc_get_net_changes_dbo_tbProducts to capture just the final changes. This function has three parameters. The parameters are from_lsn, to_lsn and row_filter_option. The from_lsn is of data type binary(10). The from_lsn represents the start point to include in the result set. The to_lsn parameter is also of datatype binary(10). The to_lsn will allow you to specify till what range the resultset includes. The third parameter row_filter_option will allow us to specify which rows needs to be returned in the result set. Let us try to find the net change that happened.
We already have the start_lsn (0x000000640000003C0005). You can get the final lsn by running this query.
select sys.fn_cdc_map_time_to_lsn('largest less than or equal', getdate()) GO
Now let try to get the net changes for the same period. Please note that that if you want to use cdc.fn_cdc_get_net_changes_function, your table must be defined with a primary key before it was enabled for CDC. If the primary key is created after the table was enabled for CDC, this newly created primary key will be ignored by CDC.
To get the final changes you have to execute the below query.
SELECT * FROM cdc.fn_cdc_get_net_changes_dbo_tbProducts ( -- Starting LSN 0x000000640000003C0005, -- Ending LSN 0x000000CF000000E00001, 'all' ) GO
When this query is executed, the result set will show just the net of all changes done on that table between these periods. For the data updates we had just done, the net result is just two rows as displayed below.
This query does not show any deleted rows or the path taken by the update. This is pretty efficient.
DDL Change History
CDC can also keep track of all DDL changes for the table that has been configured. Let us see how to keep track of DDL changes. Let us alter the table tbProducts and add a column, then we will drop that column and add it once again.
alter table dbo.tbProducts add somestuff varchar(20) go alter table dbo.tbProducts drop column somestuff go alter table dbo.tbProducts add somestuff varchar(20) go
Now we have added a column, dropped it and re added it again. If we look at the table cdc.dbo_tbProducts_CT we will not see the change history. To track the DDL changes, we have to query cdc.ddl_history. When we execute the following query we will see the change history.
Select * from cdc.ddl_history
If you look at the above figure, the result set shows the ddl command that was run and when it was executed. This shows that the CDC gives us history of all data changes including the path and all changes to the table structure.
Change history cleanup
On a very busy OLTP server if we keep storing history of all the data changes for the tables, we will end up having performance issues. To mitigate that, Microsoft also created an automated clean up process. You might have noticed that when we enabled CDC, two jobs were created. One job is to capture all the changes and the other is to clean up all the old data. By default, CDC keeps only three days of history. Any change history older than three days is purged by this job. The retention period is stored in the table dbo.cdc_jobs in the MSDB. The important columns to look at are retention (retention period in minutes) and threshold (maximum number of deletes that can be done). You can get more details regarding the clean up job here.
Now let us take a look at some of the additional queries that can be used to get more info on CDC.
When we execute this command we will get all the details of the tables includign the column list that are set for CDC. If we want to find out just the name of the tables that are set for CDC, we can just execute the below statement.
select [name] , is_tracked_by_cdc from sys.tables Where is_tracked_by_cdc = 1 order by [name]; GO
Let us try to do a query where we get all changes from the last two days without having to know what is the starting LSN and the ending LSN.
To achieve this, we can use the function sys.fn_cdc_map_time_to_lsn to map out the corresponding lsn to the date range. We can assign the lsn to a local variable and pass the variable to the function cdc.fn_cdc_get_all_changes_dbo_tbProducts.
DECLARE @StartDate datetime = dateadd(dd, -2, dateadd(dd, datediff(dd, 0, GETDATE()), 0)) ; --Last two days DECLARE @EndDate datetime = GETDATE() ; DECLARE @StartLSN binary(10) DECLARE @EndLSN binary(10) ; SELECT @StartLSN = sys.fn_cdc_map_time_to_lsn('smallest greater than or equal', @StartDate) ; SELECT @EndLSN = sys.fn_cdc_map_time_to_lsn('largest less than or equal', @EndDate) ; select 'All Changes' as FilterOption , sys.fn_cdc_map_lsn_to_time(__$start_lsn) as Startlsndate , case __$operation when 1 then 'Delete' when 2 then 'Insert' when 3 then 'Update_Originalvalue'; when 4 then 'Update' else 'Big Mess' end as Operation , ProductID, ProductName, ProductCost from cdc.fn_cdc_get_all_changes_dbo_tbProducts(@StartLSN, @EndLSN, 'all') ;
The results will look something like this:
As you can see from the image above, it gives the result set with all the changes for the last two days. You might have also noticed that for the function sys.fn_cdc_map_time_to_lsn, there are some relational operators. Here is a list of possible relational operators we can provide.
- largest less than
- largest less than or equal
- smallest greater than
- smallest greater than or equal
How can we use all this information for auditing? How can we find out if any productCost was modified. By using the function sys.fn_cdc_has_column_changed we can find out if data was modified in that column. Since we are looking for only update operation, we can also use the __$operation = 4 as a criteria for this query. The final query is as shown below.
DECLARE @StartDate datetime = dateadd(dd, -2, dateadd(dd, datediff(dd, 0, GETDATE()), 0)) ; DECLARE @EndDate datetime = GETDATE() ; DECLARE @StartLSN binary(10) DECLARE @EndLSN binary(10) ; SELECT @StartLSN = sys.fn_cdc_map_time_to_lsn('smallest greater than or equal', @StartDate) ; SELECT @EndLSN = sys.fn_cdc_map_time_to_lsn('largest less than or equal', @EndDate) ; SELECT sys.fn_cdc_map_lsn_to_time(__$start_lsn) as StartLSNDate , Case sys.fn_cdc_has_column_changed ('dbo_tbProducts','ProductCost',__$update_mask) when 1 then 'YES' Else 'NO' End as 'IsProductCostUpdated' , case __$operation when 1 then 'Delete' when 2 then 'Insert' when 4 then 'Update' when 5 then 'Insert/Update' else 'Who messed with my data' end as Operation , ProductID, Productname, ProductCost FROM cdc.fn_cdc_get_all_changes_dbo_tbProducts(@StartLSN, @EndLSN, 'all') cross apply (select column_id from cdc.captured_columns where [OBJECT_ID] = object_id('cdc.dbo_tbProducts_CT') and column_name = 'ProductCost') S WHERE __$operation = 4 order by StartLSNDate;
If we execute the above query, the result set will be as shown below.
This article has shown how to enable CDC for a table. If for some reason we need to disable it, we have to run this stored procedure.
EXEC sys.sp_cdc_disable_table @source_schema = N'dbo', @source_name = N'tbProducts', @capture_instance = 'dbo_tbProducts';
This will disable CDC on the table level. If we decide to disable CDC on the database level, then we will have to run
Since we have disabled CDC completely, let us complete the clean up and drop the database that we just created for testing.
drop database TestDB;
Cons of CDC
For any new functionality that is added it will have its impact. Let us look at couple of impacts of enabling CDC. By enabling CDC the server will have higher IO load. Why? When a new row is inserted into a table that has been enabled for CDC, there will by at least three additional write. The first write is to the LOG file (happens even if CDC is not enabled), the second additional write is to the change table and the third additional write is to the LOG file again.
The database size will grow if we have a high number of tables that are enabled with CDC. This is because for every data change done on any of these tables, it is duplicated in the history table.
Even though CDC keeps track of DDL changes, it will not keep track of the data changes in the new column. For instance if we add a column called RetailPrice to tbProducts, we can see the history of it in the cdc.ddl_history but if we add a value to the column RetailPrice, that change will not be stored. For CDC to keep track of changes to the new column as well, we have to first disable CDC on that table and then enable it again.
Change Data Capture is a great technology that answers almost all basic questions we have for auditing. The maintenance for CDC is very easy when compared to methods like triggers or modifying stored procedures. CDC does not answer the question “who” changed the data. We will find the solution for that on my final article of this series ‘What, When and Who".
I would like to Thank Johan (@ALZDBA) for reviewing my article and giving me some great ideas to make this article better. I would also like to Thank Steve Jones (@way0utwest) for going through my article with a big red brush and stating some of the obvious mistakes I made. Steve is always trying to get the writer improve his skills. I am very much indebted to him.