Defining the Problem
Imagine I have two different systems - say, a CRM and an ERP - on different SQL Servers. Both systems store some attributes about widgets, but neither can provide the complete picture by itself.
Lots of consumers want to interact with widgets holistically - that is to say, they need to know about attributes defined in both of those sources - which often results in nasty solutions like federated (linked server) queries in real time. The problem is extensive enough that we decide to consolidate both sources into a single database which is updated in near-real-time.
This database will be a kind of operational data store (ODS), which defines a canonical data model (CDM) composed of some transformed subset of the desired attributes from all of the upstream systems. Not only does this allow consolidation of the disparate source data, it also lets us make our own enterprise data model, eliminating design defects in the source systems, removing redundancies and ambiguities, and making it easier for our developers to interact with a model that matches our enterprise instead of matching some vendor's idea of a generic enterprise.
But how will I populate the ODS in near real time? That's something I am currently still working on, and where I've gotten to so far is going to be the subject of my post today. I expect that the architecture I describe might be a little different from the normal "near real time ETL" designs out there.
Creating dummy source systems
Let's create a bit of schema to represent our source systems. I'm going to create two schemas in the one database for simplicity:
-- create schemas to act as our two source systems
create schema ERP
go
create schema CRM
go

-- source data
-- for demo simplicity we will assume that the ERP and CRM share id's,
-- ie, there is only one master system for widget creation
create table ERP.Widgets(id int primary key, a char, b char)
go
create table CRM.Widgets(id int primary key, c char, d char)
go
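For later experiments it helps to have a few rows in the sources. These are arbitrary seed values, run in the source database, not anything meaningful:

```sql
-- arbitrary seed data for the two source systems
insert ERP.Widgets (id, a, b) values (1, 'x', 'y'), (2, 'p', 'q')
insert CRM.Widgets (id, c, d) values (1, 'm', 'n')
```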
Defining the common model
Now let's move over to the ODS and create the common model for widgets.
My common model isn't just the union of all attributes from the sources, I'm going to add a little transformation. For my demo I've decided I don't need the "b" column from the ERP, and the "a" column data should be uppercase. I'm also going to add a couple of metadata columns.
I'll create my ODS Widget table in a different database so that I can incorporate replication later (which requires that publisher and subscriber be on different databases), but I'll create an ODS schema too just to make it obvious where each object "belongs".
-- (assumes the ODS database has already been created)
use ODS
go

create schema ODS
go

-- the ODS common model for Widgets
create table ODS.Widgets
(
    id int primary key,
    upper_a char,
    c char,
    d char,
    inserted_dt datetime default getdate(),
    updated_dt datetime
)
go
An interface for the T and the L of ETL, agnostic of whatever is doing the E
OK, so how should we move data from the source Widgets tables into our common model? For now let's ignore the "E" part of ETL, and focus on the T and the L.
I want my transforms and loads to be efficient, which means I want to limit IO. In a previous post I talked about the inefficiency of message-oriented architectures, so with that in mind I want to provide a loading interface that can process sets of data where available. Maybe I receive a single tuple, maybe I receive ten thousand at once. I want the interface to be the same in both cases.
Since I'm working in sets, I'd also like to do those transforms using TSQL, because that's the kind of thing TSQL is good at. Some might ask "what about SSIS?" I use SSIS a great deal in my job, but even after nearly 10 years with SSIS I still find TSQL to be much more powerful, intuitive and maintainable. This is particularly the case when your transforms aren't simple scalar operations, and often even when they are (SSIS still can't even format a date without a ridiculous amount of expression code).
Now here's the sixty-four thousand dollar question: How can I create an interface that is agnostic of what is writing to it, can receive a set of data, transform it to a different shape, and deliver the result to the ODS.Widgets table in a way that minimizes IO?
We could use stored procedures that take table-valued-parameters as arguments. But if we ever want to write data to the interface using SSIS that's going to be annoying (unless you really like creating custom data flow destinations), and it's just harder to develop against in general. Wouldn't it be nice if we could just issue regular insert, update and delete commands?
How about I create some views with instead-of triggers on them? This gives me a standard relational interface - it looks like a table as far as anyone is concerned - but it means I don't actually have to write to staging tables, then read back from those staging tables, do the transforms, and write the data to the common model. A single operation will do both the transform and the load without having to be written to any intermediary.
I'll create views to act as the "landing pad" for the data from each of the source systems. To get started, I'll create schemas inside the ODS database to contain these views. Each source system will have its own "landing pad" schema:
-- these schemas will act as the surface area for writing to the ODS database
create schema ERP
go
create schema CRM
go
And for each source system table, I'll create a landing pad view that matches the schema of the source in terms of columns.
Creating the landing-pad objects
But views have to select something, so what are we selecting?
I could select a single row of literal nulls cast to the correct datatype, but there's an arguably more useful approach. Select from the common model of course! So reading from these views to some extent "untransforms" the data in the common model back to the source model. We can't always do a full reverse transform. For example, I can't restore the original case of the source data once it's been transformed to uppercase. That original case information is lost. But I can provide meaningful output where it's sensible to do so.
I'm also going to add an extra "metadata" column to each interface view that doesn't exist as an attribute in the source entity: a "delete_flag" column that will make it easier to work with certain kinds of data - specifically, data coming from CDC-like sources. Don't worry about this column for now, it will all make sense in a moment:
-- The columns being output by the views match the columns that will come from the source,
-- that way performing writes from the source to the destination is very straightforward to develop: just line up the column names!
-- I'll add a "delete_flag" column so we can insert rows but process them as if they were deletes.
-- This will be super handy when working with SSIS or CDC as upstream components.
create or alter view ERP.Widgets as
select
    id = id,
    a = upper_a,
    delete_flag = cast(null as bit)
from ODS.Widgets
go

create or alter view CRM.Widgets as
select
    id = id,
    c = c,
    d = d,
    delete_flag = cast(null as bit)
from ODS.Widgets
go
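Because the views select from the common model, reading them performs the partial "untransform" described above. For instance, once a widget exists in ODS.Widgets, a read from the ERP landing pad would look like this - the one irreversibility being that "a" comes back already uppercased:

```sql
-- rows come back in the ERP's shape, but "a" holds the uppercased value from the ODS
select id, a, delete_flag from ERP.Widgets
```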
I could of course create staging tables with instead-of triggers on them rather than using views. Those tables would never actually contain any data; it would always be redirected by the instead-of trigger(s). But by using views these objects are a bit more useful. Also, using tables will only work with CDC-like input data, because for updates and deletes there won't be anything in the inserted or deleted tables inside the triggers. More about that in a second...
So, here's how this is going to work in principle: We're going to take changes from the source system and push those changes into the ODS using these views. For example, if a row is inserted into the Widgets table in the CRM system, we will issue an insert against the CRM.Widgets view in the ODS database. The triggers on the view will perform the transform to the common model and redirect the insert to the ODS.Widgets table.
So, why the delete_flag column? Well, there are two ways a source system might want to communicate with the ODS. One way would be to issue an instruction that says "These rows were deleted at my side, so perform the same delete at your side". That's the "obvious" way - and it's how replication will do it.
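In that replication-style case nothing special is needed: once the instead-of triggers are in place, a plain delete against the landing-pad view does the job (the id value here is arbitrary):

```sql
-- "these rows were deleted at my side, perform the same delete at yours"
delete ERP.Widgets where id = 1
```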
But if you're using CDC things work a bit differently. When a row is inserted, updated or deleted at the source system a row will be inserted into a change table, with a column to indicate which kind of operation happened.
We could read from the CDC data, split the output into three sets based on the __$operation column that CDC provides, and perform three different kinds of operations against the view. But there's a "simpler" way: We can just insert everything from the CDC function into the view, including the column indicating the operation type, and write code in the trigger(s) to deal with each case appropriately.
As it turns out the only "type" of operation we need to explicitly nominate is a delete at the source. For inserts or updates at the source, the operation we need to perform in the ODS depends on whether the row already exists in the ODS - remember, multiple systems are writing data to the ODS! As such, inserts at the source could become updates in the ODS, or updates at the source could become inserts at the ODS. If we are clever with our merge code it doesn't matter!
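To make the CDC path concrete, here's a sketch of feeding the CRM landing pad straight from a CDC change function. The capture instance name (CRM_Widgets) is an assumption for illustration; with the 'all' row filter, __$operation is 1 for deletes, 2 for inserts and 4 for update after-images, so only the delete case needs mapping to delete_flag:

```sql
-- assumed capture instance name: CRM_Widgets
declare @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('CRM_Widgets')
declare @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn()

-- everything becomes an insert against the view;
-- the trigger decides what each row really means
insert CRM.Widgets (id, c, d, delete_flag)
select
    id,
    c,
    d,
    delete_flag = case when __$operation = 1 then 1 else 0 end
from cdc.fn_cdc_get_all_changes_CRM_Widgets(@from_lsn, @to_lsn, 'all')
```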
Here's one way of writing the instead-of triggers. I'm using one trigger to handle all operations just for coolness in this article, but in reality I'd probably split off the instead-of-delete trigger to improve the efficiency of the conditionals.
-- I'm going to use one big fancy "do everything" trigger just for demo fun. It is simpler to have a separate trigger
-- for deletes - and more efficient too no doubt - but more code.
-- I will discard the "b" column, and am transforming the "a" column to uppercase.
-- Note that whether the original operation on the view was an insert or an update is actually not relevant.
-- If inserted.id is not null we are doing an insert OR an update. We will choose between these two options
-- based solely on whether we already have that row in the ODS!
create or alter trigger ERP.Widgets_transform on ERP.Widgets instead of insert, update, delete
as
begin
    set nocount on

    merge ODS.Widgets tgt
    using
    (
        select
            id = coalesce(i.id, d.id),
            iid = i.id,
            did = d.id,
            a = i.a,
            delete_flag = isnull(i.delete_flag, 0)
        from inserted i
        full join deleted d on i.id = d.id
    ) src on src.id = tgt.id
    when not matched by target and src.iid is not null and src.delete_flag = 0 then
        -- not in CDM, not a delete operation
        insert (id, upper_a) values (src.id, upper(src.a))
    when matched and src.iid is not null and src.delete_flag = 0 then
        -- in CDM, not a delete operation
        update set tgt.upper_a = upper(src.a), tgt.updated_dt = getdate()
    when matched and (src.iid is null or src.delete_flag = 1) then
        -- in CDM, delete operation
        delete;
end
go

-- no transforms here, just moving the CRM columns into the CDM Widgets
create or alter trigger CRM.Widgets_transform on CRM.Widgets instead of insert, update, delete
as
begin
    set nocount on

    merge ODS.Widgets tgt
    using
    (
        select
            id = coalesce(i.id, d.id),
            iid = i.id,
            did = d.id,
            c = i.c,
            d = i.d,
            delete_flag = isnull(i.delete_flag, 0)
        from inserted i
        full join deleted d on i.id = d.id
    ) src on src.id = tgt.id
    when not matched by target and src.iid is not null and src.delete_flag = 0 then
        -- not in CDM, not a delete operation
        insert (id, c, d) values (src.id, src.c, src.d)
    when matched and src.iid is not null and src.delete_flag = 0 then
        -- in CDM, not a delete operation
        update set tgt.c = src.c, tgt.d = src.d, tgt.updated_dt = getdate()
    when matched and (src.iid is null or src.delete_flag = 1) then
        -- in CDM, delete operation
        delete;
end
go
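A quick smoke test of the whole interface, with arbitrary values. Each statement is an ordinary DML command against the landing-pad views; the triggers decide what actually happens in ODS.Widgets:

```sql
-- ERP-side insert: becomes a new ODS row, with "a" uppercased ("b" is discarded by design)
insert ERP.Widgets (id, a) values (1, 'x')

-- CRM-side insert for the same widget: the row already exists in the ODS,
-- so the trigger turns this into an update that fills in c and d
insert CRM.Widgets (id, c, d) values (1, 'm', 'n')

select id, upper_a, c, d from ODS.Widgets  -- one row: 1, X, m, n

-- CDC-style delete: an insert with delete_flag = 1 removes the ODS row
insert ERP.Widgets (id, delete_flag) values (1, 1)
```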
What's been achieved here?
What I now have is a pattern that receives tuples from source systems, in the shape of those source systems. The interface can take single rows or whole sets of data. It can be fed from CDC-like or replication-like sources. You can even use the landing-pad views as data-flow destinations in SSIS, or send from an RBAR message-oriented source (which is basically, and unfortunately, what SQL replication is). The interface applies transforms to the incoming data, writing it to a common data model in an ODS. The source systems don't need to know anything about the data in the ODS; the transforms always handle the rows appropriately depending on the current state of the ODS. The pattern doesn't stage the incoming data anywhere before reading it back out and transforming it to the final model, which saves on IO. And the transforms are done in a set-based manner using TSQL.
But there's a problem with this design: it can't actually act as a receiver for operations coming from SQL replication! If you try to replicate a source table with the view set as the target object, replication won't work. After working through several roadblocks I have gotten to the point where replication complains that "alter table cannot be used against a view" when trying to initialize. I've been trying to figure out whether there's a combination of replication options that would allow the replication target to be a view, but so far without success.
Maybe replacing the interface views with the (somewhat less fancy) empty-table solution will work with replication. But that's going to involve some trickery because, as I wrote earlier, empty tables won't have contents in the inserted or deleted tables for update or delete operations inside the triggers. That's what I'm going to be working on next, and I'll come back with another blog entry once I know more!