
Easy CDC part 2: Automating a Quick Initial Load, and Some Gotchas


Introduction

A couple of weeks ago I made this post about quickly building a CDC system against a vendor ERP database. The key features of this system were the following:

  • the ERP had no built-in support for CDC or data warehousing (e.g., no inserted/updated datetime columns on tables)
  • the ERP runs on SQL Server 2005
  • change capture only needed to happen daily, with the CDC system picking up the most recent version of each row at the time processing runs
  • the entire CDC system had to be built and running by the first of January 2011 (giving about two weeks)
  • the mechanism for the CDC ETL would be a straight T-SQL solution (although we do control it all at a high level via an SSIS package)
  • the schema and code for the CDC system were generated automatically using metadata queries to spit out DDL


I also mentioned that I intended to do an initial population of the CDC system's data using bulk export and import. It would also be possible to simply run the ETL procedures to accomplish this, but I figured an export/import would be a quicker way to load the data, since the CDC logic requires a left join across a linked server. It's pretty quick, but why perform an operation if you don't have to? By definition, on an initial load every row is a new row, so there is no need to check!


And so a couple of days ago, after recovering (at least partially) from Christmas revelry, I wrote up another couple of functions to generate export and import commands. Again, I didn't want to have to manually create 200 data flow tasks, or go through 200 wizards to transfer the data, so again I went with an autogenerated-code solution. The approach uses bcp with the queryout option for export, and bulk insert for import.


The Export-Generating Function

Here is a function which takes a table name and a folder path and generates a command to export the necessary data to that path as a text file with the same name as the table. For this system, you might recall, this means all columns excluding timestamps, blobs, and columns with a data length greater than 255, plus a hash of those columns used to detect changed data, and two additional datetime columns. It uses the same metadata function, and very similar string concatenation and replacement logic, as the DDL generation functions from part 1.

Note that I'm not only generating the bcp statement, I'm also wrapping it up inside xp_cmdshell so it can be executed from within SSMS. If you don't have xp_cmdshell enabled you can either get rid of that part (and the additional quote-escape replacement immediately preceding the final concatenation) or temporarily enable xp_cmdshell using sp_configure, as shown below.
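
For reference, temporarily enabling xp_cmdshell is just a couple of sp_configure calls (this needs the ALTER SETTINGS server permission; remember to set it back to 0 afterwards if your environment requires it):

-- enable xp_cmdshell; reverse by running the same statements with 0
exec sp_configure 'show advanced options', 1
reconfigure
exec sp_configure 'xp_cmdshell', 1
reconfigure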

You will also need to change the server and database names to match your environment (and possibly modify the function to handle schemas other than dbo; our vendor's tables all live in the dbo schema):

create function CDC.bulk_export_statement(
    @source_table_name sysname,
    @path varchar(255)
)
returns varchar(max)
as begin
    -- Template for the bcp command: export the start/end dates, a SHA1 hash of
    -- the tracked columns, and the columns themselves. {hash_cols} and {cols}
    -- are placeholders replaced below.
    declare @cmd varchar(max) set @cmd =
        'bcp "select ''2011-01-01'' as start_DT, ''9999-12-31'' as end_DT, '
        + 'hashbytes(''SHA1'', {hash_cols}) as row_hash, {cols} '
        + 'from your_db_name.dbo.[' + @source_table_name + ']" '
        + 'queryout ' + @path + @source_table_name + '.txt '
        + '-S"your_instance_name" -T -c -b10000'

    declare @cols varchar(max) set @cols = ''
    declare @cols2 varchar(max) set @cols2 = ''

    -- use the concatenation trick to build the column lists: @cols is marked up
    -- with ^ and $ so it can be turned into the hash expression below, while
    -- @cols2 is a plain comma-separated column list
    select   @cols = @cols + '^' + t.column_name + '$ ',
             @cols2 = @cols2 + t.column_name + ', '
    from     CDC.source_column_definitions('dbo', @source_table_name) t
    order by t.ordinal_position asc

    -- trim the trailing delimiters (len() ignores the trailing space)
    set @cols = left(@cols, len(@cols) - 1)
    set @cols2 = left(@cols2, len(@cols2) - 1)

    -- expand the markers into isnull(cast(col as varchar(255)), '') + '|' + ...
    set @cols = replace(@cols, '^', 'isnull(cast(')
    set @cols = replace(@cols, '$', ' as varchar(255)), '''') + ''|'' + ') + ' as varchar(255)), '''')'
    set @cmd = replace(@cmd, '{hash_cols}', @cols)

    set @cmd = replace(@cmd, '{cols}', @cols2)

    -- escape embedded quotes and wrap the whole thing in xp_cmdshell so it can
    -- be run directly from SSMS
    set @cmd = replace(@cmd, '''', '''''')
    set @cmd = 'exec master..xp_cmdshell ''' + @cmd + ''''

    return @cmd
end
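
To generate (and run) the export for a single table you can simply select the function. For example, for a hypothetical ERP table named gl_trans, and assuming the extract files live in C:\cdc_extract\ (a made-up path; note the trailing backslash, since the function concatenates the path and file name directly), this returns the full xp_cmdshell-wrapped command ready to copy into a query window and execute:

select CDC.bulk_export_statement('gl_trans', 'C:\cdc_extract\')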


Now, you may have problems with the generated export statement for some tables if they have a lot of columns. The query passed to bcp via queryout cannot be of arbitrary length, and the generated query names every column it needs as well as a computed column which is a hash of the concatenation of those columns cast as varchars, so the query text can be very long indeed. If you have this problem you have a few options. The simplest option is to ignore it: we know the CDC system itself is capable of loading data from scratch, so the bulk load is just an efficiency gain. If you don't have many tables that encounter this problem, you can just let those be populated during the first normal ETL run. Since only 3 of the 200-odd tables in our system had this problem, this was in fact my choice.

A second option is to use similar code-autogeneration logic to create a view for each table, and then have the bcp utility select * from the views you have created. This is pretty easy to accomplish given that the query for the view is already right there in the function: you could just wrap it up in a "create view" with some standard naming convention and prepend that to the return value. If you do this I'd suggest creating the views in a separate schema for easy cleanup later (or append the DDL needed to drop the view to the return value; again, a simple inline change).
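
To give the flavour of it (this is a sketch rather than code from the actual system; the CDC_export schema and naming convention are made up for illustration), the autogenerated view for a given table would end up looking something like this:

-- sketch only: what the autogenerated view might look like for a two-column table
create view CDC_export.[some_table]
as
select  '2011-01-01' as start_DT,
        '9999-12-31' as end_DT,
        hashbytes('SHA1',
            isnull(cast(col_1 as varchar(255)), '') + '|' +
            isnull(cast(col_2 as varchar(255)), '')) as row_hash,
        col_1,
        col_2
from    your_db_name.dbo.[some_table]

The bcp command then only needs "select * from CDC_export.[some_table]" as its query, which stays well under any length limit no matter how many columns the table has.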

The Import-Generating Function

Here is a simple function which, similarly, takes a table name and path and generates the required bulk insert statement, expecting the table data to be in a text file in that path named the same as the table:


create function CDC.bulk_insert_statement(
    @source_table_name sysname,
    @path varchar(255)
)
returns varchar(2000)
as begin
    -- load the tab-delimited text file produced by the export into the
    -- corresponding CDC table; the terminators match bcp's -c defaults
    declare @cmd varchar(2000) set @cmd =
        'bulk insert CDC.[' + @source_table_name + '] '
        + 'from ''' + @path + @source_table_name + '.txt'' '
        + 'with (firstrow=1, rowterminator=''\n'', fieldterminator=''\t'')'

    return @cmd
end
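
For a hypothetical table named gl_trans, with the extract files in C:\cdc_extract\, the function returns a statement equivalent to the following (shown split across lines here for readability):

bulk insert CDC.[gl_trans]
from 'C:\cdc_extract\gl_trans.txt'
with (firstrow=1, rowterminator='\n', fieldterminator='\t')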


Again, you may have some problems on the import side. Clearly, if any of your data contains tabs or newlines, that's going to cause an issue. Since we don't include long varchar columns in our CDC system, this wasn't a problem for me. If you do have this issue you can again just ignore it and let the normal ETL handle anything that couldn't easily be bulk loaded, or you can try to make the export and import more robust. Import problems due to malformed data can be difficult to fix, and certainly difficult to fix quickly, so it may not be worth investing the time to get the bulk operations working in every possible case. Keep in mind that the purpose of this bulk load is purely an efficiency gain; it is not strictly necessary. Personally, if I had such issues but managed to load a significant portion of the data using the bulk methods, I would just ignore them and let the first ETL run handle anything that failed to come across via the bulk load.
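
If you did want to make the transfer more robust against embedded tabs, one option (a sketch only: the |~| terminator is an arbitrary choice, and you would need to change both generating functions so the export and import sides match) is to pass a custom field terminator to bcp via its -t switch and specify the same terminator in the bulk insert. Embedded newlines would need the same treatment for the row terminator.

-- sketch only: a custom field terminator to cope with tabs embedded in the data
-- export side: append this switch to the bcp command built by bulk_export_statement:
--     -t"|~|"
-- import side: matching terminator in the statement built by bulk_insert_statement:
bulk insert CDC.[some_table]
from 'C:\cdc_extract\some_table.txt'
with (firstrow=1, rowterminator='\n', fieldterminator='|~|')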

Finally...

You can use these functions to generate the statements in the same way as the DDL was generated in the previous post: Use the table with the names of all tables involved in CDC, and select these functions from that table.
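
Something along these lines does the trick (CDC.tables_to_track and its table_name column are placeholders here; use whatever you called that driver table in part 1, and point the path at your extract folder):

-- placeholder driver table name: substitute the table of tracked table names from part 1
select   CDC.bulk_export_statement(t.table_name, 'C:\cdc_extract\') as export_cmd,
         CDC.bulk_insert_statement(t.table_name, 'C:\cdc_extract\') as import_cmd
from     CDC.tables_to_track t
order by t.table_name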


Now, there is one more aspect of this system I still need to cover, and that is the set of gotchas you might run into with the ETL mechanism. It's nothing fatal as long as you're aware of the limitations. But I will have to leave that for Monday...
