I'm working on problems of enterprise data integration in the role of enterprise architect, but I am and have been a SQL person for 15 years, so it may be that I look at this topic with considerations that don't get much attention more widely.
So, enterprise data integration! Or more descriptively, "sharing data around your various enterprise applications". While we would ideally strive to avoid data redundancy, we must accept the reality that when an enterprise is using several software suites each will almost certainly have its own database requiring at least some information that other systems also need: customers, products, and so on.
In an ideal (fantasy) world, every business entity would have a single master system of entry and all other systems would magically be instantly updated when data is changed at the master. In reality magic doesn't exist and we normally have to find a way to copy data around the various systems, and unfortunately sometimes it's even the case that more than one system is the system of entry for any particular entity - or even any particular attribute!
So what's the problem? Copying data around the various systems takes time. Traditional ETL solutions are fine for data warehouses that are updated overnight, but become less useful when an enterprise wants (or sometimes needs) changes propagated rapidly to all applications.
Event and message based patterns do seem to be the best fit for real time or near real time integration. The first step here is having some way to generate messages in the first place. Some applications might provide a native mechanism of raising events to the outside world, but even where that's not the case a technology like SQL CDC can be used to detect changes and generate events from the data layer. Beyond this most fundamental step we have questions of how to guarantee message delivery, preserve message order, detect and resolve collisions in data, and so on. There are products tailored to solving some of these (eg, biztalk) and quite a lot of discussion on the various patterns one can employ.
But there is a serious lack of discussion on how a messaging based system is supposed to cope with bulk data changes.
Let us suppose one of the applications provides a mechanism to upload a reasonably large data file - say, the results of a stocktake, or periodic industry data, or validated address information. Such an upload may, even in a mid sized enterprise, result in tens or hundreds of thousands of rows being updated.
Your DBMS probably isn't going to have any problem with this. For extremely large changes we might split the operation into batches of tens-of-thousands, but SQL Server, for example, isn't going to flinch at a transaction that updates tens of thousands of rows.
The messaging system, on the other hand, is about to chuck a fit.
First of all, the message generation system (eg, CDC) will start to lag behind the state of the data because generating the message (or messages) representing this change will surely take far longer than the actual transaction took to commit.
Secondly, we're either going to have to generate a few enormous messages (usually in XML, which doesn't help matters), or many thousands of individual messages. Neither approach is ideal. Since the message processing and distribution is likely to happen outside the SQL environment people tend to go with an iterative approach in which each individual entity change is a message processed through a pipeline. But this is blind to the fact that at the end of the day these changes will almost certainly be written to some database, and if you are processing messages iteratively then you are also performing those eventual writes iteratively. Which means what started out as a set based (very fast) operation on one system turns into an extremely slow RBAR ("row by agonizing row") operation on the downstream systems. Leaving aside the problems this might create with filled queues or excessive message traffic, it also means that the lag time between system synchronicity is vastly increased, which is ironic considering that one of the points of moving to a message based integration in the first place is to achieve near-real-time integration instead of highly delayed integration.
I see this as perhaps the biggest challenge to real time, message based integration, and I'd love to know how you all approach this problem.
Below is some code to demonstrate just how costly it is, from an RDBMS point of view, to turn a set based operation into a row by row operation. The table "tgt" will be the recipient of changes queued up in table "src".Note that the gap between set based and iterative will widen more and more the bigger the batch gets.
First I'll transfer all rows across to tgt as a single, set based operation. Then I'll clear the target and do the same thing via iteration, first using a single transaction for the whole shebang, and then one transaction for each operation - this final scenario being the best representation of message based processing.
Note that the results obviously won't include the overhead of the message processing pipeline itself... I'm only simulating the effect of the final write to an integration target. Also note that in my iterative examples I am not making a new connection to the database to write each message. Doing so would of course add vastly more overhead to the iterative solution.
Results (20,000 rows):
set based: 63 milliseconds
iterative (single tran): 800 milliseconds (12x longer)
iterative (multi tran): 14420 milliseconds (228x longer)
Results (40,000 rows):
set based: 103 milliseconds
iterative (single tran): 1826 milliseconds (17x longer)
iterative (multi tran): 27636 milliseconds (268x longer)
Results (80,000 rows):
set based: 186 milliseconds
iterative (single tran): 3800 milliseconds (20x longer)
iterative (multi tran): 57506 milliseconds (309x longer)
-- set up the test create table src(i int primary key clustered) create table tgt(i int primary key clustered) go set nocount on go with ints(n) as ( select1 as n union all selectn + 1 fromints wheren < 20000 ) insert src(i) select n from ints option(maxrecursion 20000) go --------------------------------------------------------------------- -- set based processing declare @dt datetime = getdate() insert tgt(i) select i from src print ('set based: ' + str(datediff(ms, @dt, getdate())) + ' milliseconds') go truncate table tgt go --------------------------------------------------------------------- -- interative processing, single transaction declare @dt datetime = getdate(), @i int begin tran declare cur cursor fast_forward for select i from src open cur fetch next from cur into @i while (@@fetch_status = 0) begin insert tgt(i) values (@i) fetch next from cur into @i end close cur deallocate cur commit print ('iterative (single tran): ' + str(datediff(ms, @dt, getdate())) + ' milliseconds') go truncate table tgt go --------------------------------------------------------------------- -- iterative processing, tran per row declare @dt datetime = getdate(), @i int declare cur cursor fast_forward for select i from src open cur fetch next from cur into @i while (@@fetch_status = 0) begin insert tgt(i) values (@i) fetch next from cur into @i end close cur deallocate cur print ('iterative (multi tran): ' + str(datediff(ms, @dt, getdate())) + ' milliseconds') go drop table src drop table tgt