
Streaming ETL: SQL Change Data Capture (CDC) to Azure Event Hub


[read this post on Mr. Fox SQL blog]

I had a recent requirement to capture and stream real-time data changes on several SQL database tables from an on-prem SQL Server to Azure for downstream processing.

Specifically, we needed to create a streaming ETL solution that…

  1. Captured intermediate DML operations on tables in an on-prem SQL database
  2. Transmitted the data securely and in real time into Azure
  3. Stored the delta changes as TXT files in Azure Data Lake Store (ADLS)
  4. Visualised the real-time change telemetry on a Power BI dashboard (specifically the number of inserts, updates and deletes over time).

The first part was easy; SQL Server has a feature called Change Data Capture (CDC) which does an amazing job of tracking DML changes into separate system tables.  If you don’t know about CDC then see here – https://docs.microsoft.com/en-us/sql/relational-databases/track-changes/about-change-data-capture-sql-server
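
As a quick aside, you can check whether CDC is already enabled anywhere on an instance with a couple of standard catalog queries (my own quick-check sketch, nothing CDC-specific to this solution):

-- Databases with CDC enabled on this instance
SELECT name, is_cdc_enabled FROM sys.databases WHERE is_cdc_enabled = 1
GO
-- Tables tracked by CDC in the current database
SELECT s.name + '.' + t.name AS TableName
FROM sys.tables t
JOIN sys.schemas s ON s.schema_id = t.schema_id
WHERE t.is_tracked_by_cdc = 1
GO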

The second part wasn’t easy, and after some searching I came across this blog post by Spyros Sakellariadis which gave me inspiration and starter code for my streaming ETL solution.  Excellent post.  See here – https://azure.microsoft.com/en-us/resources/samples/event-hubs-dotnet-import-from-sql/

And so, the final architecture looks something like this…

The solution picks up the SQL data changes from the CDC change tracking system tables, creates JSON messages from the change rows, and then posts the messages to an Azure Event Hub.  Once landed in the Event Hub, an Azure Stream Analytics (ASA) job distributes the changes to the multiple outputs.

What I found pretty cool was that I could transmit SQL delta changes from source to target in as little as 5 seconds end to end!

And so, let’s get into some CDC to Event Hub data streaming action!

Set Up the Source SQL Database and Tables

The solution requires a table that you want tracked in a source database, as well as a separate offset table to manage the last position/row that was sent to Azure.

This script will create the database, sample table, offset tracking table and enable CDC.

 
-- CREATE Source Database
USE [master]
GO
CREATE DATABASE [SQL2AEH]
GO
ALTER DATABASE [SQL2AEH] SET RECOVERY FULL WITH NO_WAIT
GO
USE [SQL2AEH]
GO
execute dbo.sp_changedbowner @loginame = N'sa'
GO
-- Enable CDC on Source Database
execute sys.sp_cdc_enable_db
GO
-- Create User Table to be Tracked
CREATE TABLE [dbo].[BCTran]
(
 [BCTranID] [int] IDENTITY(1,1) NOT NULL,
 [BCTranDate] [datetime] NULL,
 [PartyAID] [int] NULL,
 [PartyBID] [int] NULL,
 [RoomID] [int] NULL,
 [BCTranStatus] [char](1) NULL,
 CONSTRAINT [pk_BCTran] PRIMARY KEY CLUSTERED 
 (
 [BCTranID] ASC
 )
) 
GO
-- Enable CDC on the User Table
-- ### ENSURE SQL AGENT IS RUNNING BEFORE RUNNING THIS STEP ###
-- This will create the system tracking table [cdc].[dbo_BCTran_CT]
-- This will create and start the following SQL Agent jobs...
-- (1) cdc.SQL2AEH_capture --> Read SQL Txn Log, identify DML on [BCTran], copy change data to [cdc].[dbo_BCTran_CT]
-- (2) cdc.SQL2AEH_cleanup --> Delete data in [cdc].[dbo_BCTran_CT] older than 72 hours (default)
execute sys.sp_cdc_enable_table @source_schema = 'dbo', @source_name = 'BCTran', @role_name = 'cdc_reader'
GO
-- Create Separate Offset Table to Manage Last Position/Row Sent to Azure
CREATE TABLE [dbo].[SQL2AEH_TableOffset]
(
 [TableName] [varchar](250) NOT NULL,
 [LastMaxVal] [binary](10) NOT NULL,
 [LastUpdateDateTime] [datetime] NOT NULL DEFAULT getdate(),
 [LastCheckedDateTime] [datetime] NOT NULL DEFAULT getdate(),
 CONSTRAINT [PK_SQL2AEH_TableOffset_1] PRIMARY KEY NONCLUSTERED 
 (
 [TableName] ASC
 ) 
) 
GO
-- Insert Starting Point for the Tracked User Table
insert into [dbo].[SQL2AEH_TableOffset] select 'dbo.BCTran', 0x00000000000000000000, '1900-01-01 00:00:00', '1900-01-01 00:00:00'
GO
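
Once the script has run, it’s worth a quick sanity check of what CDC created. These two system procedures (standard SQL Server, nothing custom) show the capture instance details and the SQL Agent jobs:

-- Capture instance details: tracked columns, role name, change tracking table
EXEC sys.sp_cdc_help_change_data_capture
GO
-- The CDC capture and cleanup jobs created by sp_cdc_enable_table
EXEC sys.sp_cdc_help_jobs
GO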

 

Set Up the “SQL CDC 2 Event Hub” Solution

To extract data from the SQL CDC change tracking system tables and create Event Hub messages, you need a small C# command-line program and an Azure Event Hub to send the messages to.

Create the Azure Event Hub from the Azure Portal

You can easily create a new Azure Event Hub in the Azure Portal under your Azure Subscription.  Once created you will need the connection string of that Event Hub, which you can get from the Azure Portal.

If you need to learn about Event Hubs then see here – https://azure.microsoft.com/en-us/services/event-hubs/

Create the SQL2AEH Command Line Program

For the SQL2AEH program, you can download the source code from my GitHub Repo called SQL CDC 2 Event Hub – https://github.com/rolftesmer/SQLCDC2EventHub

Open the solution in Visual Studio (2015) and edit the “app.config” file:

  • ExecutionControl – 1 = run continuously, 0 = run once then exit (default 1)
  • ExecutionControlSleepMs – milliseconds sleep between each iteration when program is set to run continuously (default 1000)
  • DataTableName – the name of the source SQL Server table in “owner.table” format
  • SQLBatchSize – the number of SQL CDC change rows to bundle into a single JSON Event Hub message
  • sqlDatabaseConnectionString – the connection string for the source SQL Server
  • Microsoft.ServiceBus.EventHubToUse – the name of the target Azure Event Hub
  • Microsoft.ServiceBus.ServiceBusConnectionString – the connection string for the target Azure Event Hub

Build the solution (release).  In the “…\Bin\Release” folder you will see the SQL2AEH.exe program.

This EXE can be run from the command line or, as in my scenario, configured as a SQL Agent scheduled job to continually poll the CDC change table for new rows to send to Event Hubs.
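
Conceptually, each polling iteration follows the pattern below. To be clear, this is my own T-SQL sketch of the offset-tracking logic (the actual program code is in the GitHub repo), built on the functions that sys.sp_cdc_enable_table generates:

-- A sketch of one SQL2AEH polling iteration (not the actual program code)
DECLARE @from_lsn binary(10), @to_lsn binary(10)
-- Last LSN already sent to Azure (seeded as 0x0 by the setup script)
SELECT @from_lsn = [LastMaxVal]
FROM [dbo].[SQL2AEH_TableOffset]
WHERE [TableName] = 'dbo.BCTran'
-- fn_cdc_get_all_changes is inclusive of @from_lsn, so step past the last row sent;
-- on the very first run fall back to the earliest LSN held for the capture instance
SET @from_lsn = CASE WHEN @from_lsn = 0x00000000000000000000
                     THEN sys.fn_cdc_get_min_lsn('dbo_BCTran')
                     ELSE sys.fn_cdc_increment_lsn(@from_lsn)
                END
SET @to_lsn = sys.fn_cdc_get_max_lsn()
IF @from_lsn <= @to_lsn
BEGIN
 -- These change rows get bundled (SQLBatchSize at a time) into the JSON payload
 SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_BCTran(@from_lsn, @to_lsn, N'all')
 -- Record the new high-water mark once the Event Hub send succeeds
 UPDATE [dbo].[SQL2AEH_TableOffset]
 SET [LastMaxVal] = @to_lsn, [LastUpdateDateTime] = getdate()
 WHERE [TableName] = 'dbo.BCTran'
END
GO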

NOTE:  I am not a seasoned C# developer!  As such the application code is provided free without any support or warranty of any kind.  The code has not been thoroughly tested and is not considered production ready.  The code is provided free of charge and can be reused in any way you wish.

 

Schedule SQL2AEH.exe in SQL Agent

As we want data to be streamed in real time from SQL CDC, we’ll schedule the SQL2AEH.exe program in SQL Agent.   Create a new SQL Agent job as per the below…

  • Set up a new SQL Agent job with a single job step
  • Define the job step to call SQL2AEH.exe
  • Schedule the job to start on SQL Agent startup
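
If you’d rather script the job than click through the SSMS dialogs, the equivalent msdb calls look roughly like this (the EXE path and job name are my own assumptions – adjust to suit your environment):

USE [msdb]
GO
-- Create the job with a single CmdExec step that runs SQL2AEH.exe
EXEC dbo.sp_add_job @job_name = N'SQL2AEH - Stream CDC to Event Hub'
EXEC dbo.sp_add_jobstep
 @job_name = N'SQL2AEH - Stream CDC to Event Hub',
 @step_name = N'Run SQL2AEH.exe',
 @subsystem = N'CmdExec',
 @command = N'C:\SQL2AEH\Bin\Release\SQL2AEH.exe'  -- assumed path
-- freq_type = 64 --> run automatically when SQL Agent starts
EXEC dbo.sp_add_schedule @schedule_name = N'OnAgentStartup', @freq_type = 64
EXEC dbo.sp_attach_schedule @job_name = N'SQL2AEH - Stream CDC to Event Hub', @schedule_name = N'OnAgentStartup'
EXEC dbo.sp_add_jobserver @job_name = N'SQL2AEH - Stream CDC to Event Hub'
GO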

Create Event Hub Messages from CDC Tracking Tables

Once the source SQL Server, tables and SQL2AEH.exe are set up, we can test it out!

Let’s create some new rows in our source SQL table.

-- Create Some New Rows in our Source SQL Table
insert into [dbo].[BCTran] 
(
 BCTranDate, PartyAID, PartyBID, RoomID, BCTranStatus
) 
select '2017-07-10 15:00:00', 1, 2, 550, 'C'
union 
select '2017-07-10 18:00:00', 2, 3, 650, 'D'
union 
select '2017-07-10 21:00:00', 3, 1, 750, 'I'
GO

 

Let’s see what’s in the source SQL table, and also what’s been tracked by CDC in our CDC system change tracking table (about 1–2 seconds later).

select * from [dbo].[BCTran]
GO
select * from [cdc].[dbo_BCTran_CT]
GO

 

There are 3 new rows in our source table, and also 3 new rows tracked by CDC as operation “2” (INSERT).  For reference, the CDC __$operation codes are: 1 = delete, 2 = insert, 3 = update (before image), 4 = update (after image).

When you run the SQL2AEH.exe program, it will read any new rows from the [cdc].[dbo_BCTran_CT] tracking table, create a JSON message, and send it to the Azure Event Hub.

In our case it has created a single JSON message containing the 3 new rows.  This is the look/feel of a typical message as it would land in the Event Hub.

  • Columns 1-5 named “SQL2AEH_” are control fields added by the SQL2AEH program.
  • Columns 6-10 named “__$” are control fields added by SQL CDC.
  • Columns 11+ are all from the source table.
 
[
 {
 "SQL2AEH_TableName": "dbo.BCTran",
 "SQL2AEH_RowNbr": 1,
 "SQL2AEH_$start_lsn_string": "0x0000002500000D310005",
 "SQL2AEH_$seqval_string": "0x0000002500000D310002",
 "SQL2AEH_$update_mask_string": "0x3F",
 "__$start_lsn": "AAAAJQAADTEABQ==",
 "__$end_lsn": null,
 "__$seqval": "AAAAJQAADTEAAg==",
 "__$operation": 2,
 "__$update_mask": "Pw==",
 "BCTranID": 5,
 "BCTranDate": "2017-07-10T15:00:00",
 "PartyAID": 1,
 "PartyBID": 2,
 "RoomID": 550,
 "BCTranStatus": "C"
 },
 {
 "SQL2AEH_TableName": "dbo.BCTran",
 "SQL2AEH_RowNbr": 2,
 "SQL2AEH_$start_lsn_string": "0x0000002500000D310005",
 "SQL2AEH_$seqval_string": "0x0000002500000D310003",
 "SQL2AEH_$update_mask_string": "0x3F",
 "__$start_lsn": "AAAAJQAADTEABQ==",
 "__$end_lsn": null,
 "__$seqval": "AAAAJQAADTEAAw==",
 "__$operation": 2,
 "__$update_mask": "Pw==",
 "BCTranID": 6,
 "BCTranDate": "2017-07-10T18:00:00",
 "PartyAID": 2,
 "PartyBID": 3,
 "RoomID": 650,
 "BCTranStatus": "D"
 },
 {
 "SQL2AEH_TableName": "dbo.BCTran",
 "SQL2AEH_RowNbr": 3,
 "SQL2AEH_$start_lsn_string": "0x0000002500000D310005",
 "SQL2AEH_$seqval_string": "0x0000002500000D310004",
 "SQL2AEH_$update_mask_string": "0x3F",
 "__$start_lsn": "AAAAJQAADTEABQ==",
 "__$end_lsn": null,
 "__$seqval": "AAAAJQAADTEABA==",
 "__$operation": 2,
 "__$update_mask": "Pw==",
 "BCTranID": 7,
 "BCTranDate": "2017-07-10T21:00:00",
 "PartyAID": 3,
 "PartyBID": 1,
 "RoomID": 750,
 "BCTranStatus": "I"
 }
]

If you need to see what’s going on, you can use the totally awesome and totally free Azure Service Bus Explorer to peek into your Event Hub, check what the SQL CDC messages look like, and debug.   This is an essential tool for any Azure streaming solution.

For those who have not used it before – see here – https://code.msdn.microsoft.com/windowsapps/Service-Bus-Explorer-f2abca5a

 

Reading CDC Messages Downstream from Event Hub

Once data is in Azure Event Hub you have many options to process that data.  Essentially anything which can make a call to the Event Hub can consume the SQL CDC data!

I really like Azure Stream Analytics because it’s simple, fast and cheap, and it allows multiple streaming output destinations from the same single streaming input feed.

The ASA SQL query to read from the Azure Event Hub and write to an Azure SQL DB output would look like the code below.

Because each event sent into Event Hubs carries the table name in the payload (“SQL2AEH_TableName”), the SQL2AEH program can send multiple streams of CDC traffic to the same Event Hub.  A single ASA job can then host multiple ASA SQL queries, each with a WHERE clause filtering the content for a specific table.

 
SELECT
 SQL2AEH_$start_lsn_string, 
 SQL2AEH_$seqval_string, 
 SQL2AEH_$update_mask_string, 
 __$operation, 
 BCTranID, 
 BCTranDate, 
 PartyAID, 
 PartyBID, 
 RoomID, 
 BCTranStatus
INTO
 Target_AzureSQLDB
FROM
 Source_AzureEventHub
WHERE SQL2AEH_TableName = 'dbo.BCTran'
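
And to cover the Power BI requirement from the start of this post, a second query in the same ASA job could aggregate the change telemetry into a Power BI output. A sketch only – Target_PowerBI is an assumed output alias, not part of the original solution:

SELECT
 SQL2AEH_TableName,
 __$operation,
 COUNT(*) AS OperationCount,
 System.Timestamp AS WindowEnd
INTO
 Target_PowerBI
FROM
 Source_AzureEventHub
WHERE SQL2AEH_TableName = 'dbo.BCTran'
GROUP BY SQL2AEH_TableName, __$operation, TumblingWindow(minute, 1)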

 

Source to Target Performance Benchmark

To round it out, I did some basic benchmarking to see how long it took for a DML change on the SQL table above to travel from my on-prem SQL Server (VM) to an Azure SQL Database (PaaS), and how many rows per second I could push via the SQL2AEH program.

TEST Source Configuration: On-Prem SQL Server 2016;

  • SQL Server 2016 SP1, Developer Edition
  • 4x 1.73GHZ cores, 8GB RAM, 256GB SSD
  • Melbourne, Australia (20KM from Azure DC)
  • 10Mb/sec Internet path, no express route

TEST Target Configuration: Azure Event Hub + Stream Analytics + SQL Database;

  • All resources in Australia South East Region (Melbourne)
  • Azure Event Hub = Standard, TU5, 4 partitions, no compression
  • Azure Stream Analytics = SU6 (allocates entire node), using PartitionId
  • Azure SQL Database = PRS1 (125DTU), single target table (heap)

TEST Workload Configuration;

  • 100000 individual insert rows created in source SQL table
  • No delay between insert of rows into source SQL table
  • SQL CDC capture job to run continuously
  • SQL2AEH send program to run continuously
  • SQL2AEH delay between send events = 0 ms
  • SQL2AEH program batch size (rows per JSON event payload) = 500
  • SQL2AEH event size = approx 85 bytes

RESULTS Source-Target Performance Benchmark;

  • Total Time to Insert Individual Rows on Source SQL Server = 25 sec
  • Avg Row Insert Rate on Source SQL Server = 3850 rows/sec (231000 rows/min)
  • Row Receive Rate on Azure SQL DB = 166 rows/sec (9960 rows/min)
  • Min Event Speed: SQL Server to Azure SQL DB = 8 sec (fastest receive time)
  • Max Event Speed: SQL Server to Azure SQL DB = 582 sec (slowest receive time)
  • Avg Event Speed: SQL Server to Azure SQL DB = 288 sec
  • Azure Stream Analytics SU% = 5%
  • Azure SQL Database DTU% = 3%

The test didn’t push the limits of the local SQL box, nor of the Azure services.  In fact the Event Hub, Stream Analytics job and SQL DB all showed they weren’t being fully utilised, so there’s no immediate scalability concern on the cloud side.

Instead, performance was completely bound by how fast the SQL2AEH program could bundle and send the SQL CDC changes from the on-prem SQL Server.

In this case I couldn’t push past a sustained level of 166 rows/sec, which isn’t too bad, but it couldn’t keep up with my very high row creation rate.  That’s fine for occasional spikes, but not for a sustained load.  However, with a bit of application tuning, and perhaps leveraging multiple source SQL CDC table read + send threads, I would anticipate order-of-magnitude improvements!  (There’s a challenge for someone!)

 

So in Summary

So there you have it – a pretty simple streaming ETL solution to take SQL CDC changes from any SQL Server up to Azure and redirect them to many different output destinations!

There are a couple of limitations that I can think of so far:

  • As at the time of writing, the max Event Hub message size was 256KB (see ** below), so take this into account when calculating how many CDC rows to bundle per batch.  For example, if the SQL row is 356 bytes then your batch size should be less than 736 (262,144 / 356).  See the sizing sketch after this list.
  • ** EDIT 21 Oct 2017 – Microsoft now supports compressed input for Stream Analytics, which means the payload can be compressed (ie gzip) on the event producer before sending to Event Hub.  This dramatically reduces the effective event size and increases the potential send batch size.  See here – https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-define-inputs#compression
  • The max SQL row size is 8KB; however, rows can extend beyond this with LOB data types (ie varchar(max)).  I haven’t tested this scenario.
  • I haven’t tested all SQL data types, but I did notice that some won’t convert well from their native format into JSON – SQL [binary] is one such example.
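
On the batch sizing point above, rather than guessing the row size you can ask SQL Server for the average record size of the CDC change table directly. A sketch using a standard DMV – divide the 256KB limit by this figure, allowing extra headroom for the JSON field names:

-- Average record size (bytes) of the CDC change tracking table
SELECT avg_record_size_in_bytes
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('cdc.dbo_BCTran_CT'), NULL, NULL, 'SAMPLED')
GO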

So as per usual, and as I always say, please test this yourself as your mileage may vary!


Disclaimer: all content on Mr. Fox SQL blog is subject to the disclaimer found here
