I'm looking for some input regarding how to architect a process for ongoing consolidation of data from multiple servers.
Here's the background.
We have (currently) 3 separate, but structurally identical, MS SQL Server database servers (Standard edition) which handle a moderate amount of data (about a million rows per day each). Each server runs about 12 separate databases (again, identical setup across all 3) with about 320 tables that service the various functions our company provides to our clients. There were a variety of (mostly good) reasons for splitting our customer base across 3 separate groups, mostly having to do with data isolation as well as simply not putting all of our eggs in one basket. It is imperative that each of these three servers be VERY responsive to transactional queries made by our various applications and services, so we want to keep the load on each of them very light.
The need at this time is to begin building out a data warehouse that will allow us to do analytics across all of our clients across all three of these groups. The first step we are taking is to, quite simply, aggregate the data from all three groups into a single, new set of the same 12 databases that will then serve as a set of staging databases from which to build out the warehouse. We are NOT envisioning any transformation or data cleansing at this point -- just aggregation.
Nearly all of the tables in all of the databases use GUIDs as primary keys; the exceptions are lookup tables that are identical across all three groups anyway. So we don't have to be concerned about surrogate keys at this point. The goal is to simply mix the data from all three groups into a single, large SQL instance. All of the tables have CreatedDate and UpdatedDate columns, but the values in those columns are actually set by .NET code on multiple application servers (not the database server), and there is no guarantee that the system clocks are precisely synchronized. Therefore, I'm thinking of using Change Tracking (NOT Change Data Capture, which is only available in Enterprise edition) to detect inserts, updates, and deletes.
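For anyone unfamiliar with the feature: Change Tracking has to be enabled at both the database and the table level. A minimal sketch (database and table names are hypothetical; the retention window should comfortably exceed the sync interval so a missed run doesn't force a full re-sync):

```sql
-- Enable Change Tracking at the database level (names are hypothetical).
-- CHANGE_RETENTION must be longer than the worst-case gap between syncs;
-- with a 12-hour schedule, a few days gives room for failed runs.
ALTER DATABASE CustomerDb
SET CHANGE_TRACKING = ON
    (CHANGE_RETENTION = 3 DAYS, AUTO_CLEANUP = ON);

-- Then enable it per tracked table. TRACK_COLUMNS_UPDATED is optional and
-- adds overhead; it isn't needed if whole rows are copied anyway.
ALTER TABLE dbo.Orders
ENABLE CHANGE_TRACKING
WITH (TRACK_COLUMNS_UPDATED = OFF);
```

Note Change Tracking only records *that* a row changed (keyed by the primary key) and the net operation, not the old values, which is exactly enough for a "copy current row to staging" workload.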
A side benefit of this aggregation will be to allow us to regularly "prune" data out of the production SQL instances (data over a year old, for example) without actually losing it (because it will be retained indefinitely in this larger, aggregated instance). Therefore, deletion of data in the 3 production instances (which would only happen when pruning, since our processes only perform a logical delete, not a physical delete, of data) should NOT cause data to be deleted in the aggregated instance.
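The "deletes must not propagate" requirement can be enforced in the load step itself. A sketch of the upsert into the aggregated copy, assuming a hypothetical Orders table with a GUID key and a temp table of changed rows pulled from production:

```sql
-- Upsert changed rows into the aggregated instance.
-- There is deliberately NO "WHEN NOT MATCHED BY SOURCE THEN DELETE" clause,
-- so rows pruned from production survive indefinitely in the aggregate.
MERGE agg.dbo.Orders AS tgt
USING #ChangedOrders AS src
    ON tgt.OrderId = src.OrderId          -- GUID primary key
WHEN MATCHED THEN
    UPDATE SET tgt.Status      = src.Status,
               tgt.UpdatedDate = src.UpdatedDate
WHEN NOT MATCHED BY TARGET THEN
    INSERT (OrderId, Status, CreatedDate, UpdatedDate)
    VALUES (src.OrderId, src.Status, src.CreatedDate, src.UpdatedDate);
```

Rows flagged as deletes on the production side would simply be filtered out before (or during) this step rather than merged.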
We envision keeping the aggregated staging server fairly current -- a latency of no more than 12 hours in the data. So we're looking to run some process on a 12-hour (or shorter) schedule.
If you're still reading, thank you!
Here's what I've thought of so far:
1) SSIS -- I've worked out a design for an SSIS package that could run against all three servers. However, since each table is different, I'm looking at 320 separate data flows that would have to be created using the SSIS GUI/designer, with multiple steps in each. Not sure I have the patience/guts for that. Plus, maintenance could be a nightmare.
2) Replication -- We're already using transactional replication within each of these three 'groups' to keep a copy of the production data available on a second server as a 'warm' failover as well as for reporting. I've looked into simply creating a subscription on the aggregating server to each of the three existing publications. I understand this is possible, but there appear to be a great many issues surrounding initializing the subscriptions, as well as preventing deletes (from pruning) on the production side from causing deletes on the aggregation server. In other words, a potential minefield?
3) SQL queries/stored procedures -- I have created a template for using Change Tracking in a stored proc to extract changed data. I've considered using a code generation tool to take that template and programmatically create the 320+ stored procs that would be put on each of the 3 production servers and then executed by a 'master' process on the aggregating server (via a linked server connection). This seems the most promising at this point, especially since changes to the database schema that occur down the road could be handled by regenerating the relevant stored procs and deploying them along with the schema changes.
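For option 3, the generated per-table proc might look something like the sketch below (table, key, and column names are hypothetical). The master process would record CHANGE_TRACKING_CURRENT_VERSION() at the start of each run and pass the previous run's value in:

```sql
-- One generated proc per table. The caller passes the version it last
-- synchronized to and receives all rows changed since then.
CREATE PROCEDURE dbo.GetChanges_Orders
    @last_sync_version BIGINT
AS
BEGIN
    SET NOCOUNT ON;

    -- Guard: if auto-cleanup has already purged history older than
    -- @last_sync_version, incremental sync is unsafe for this table
    -- and a full re-copy is required.
    IF @last_sync_version < CHANGE_TRACKING_MIN_VALID_VERSION(OBJECT_ID('dbo.Orders'))
        THROW 50000, 'Change Tracking history expired; full re-sync needed.', 1;

    SELECT ct.SYS_CHANGE_OPERATION,     -- 'I', 'U', or 'D'
           ct.OrderId,
           o.Status, o.CreatedDate, o.UpdatedDate
    FROM CHANGETABLE(CHANGES dbo.Orders, @last_sync_version) AS ct
    LEFT JOIN dbo.Orders AS o
        ON o.OrderId = ct.OrderId;      -- deleted rows have no base-table match
END;
```

Since the template is uniform, the code generator really only needs each table's name, key column(s), and column list from the catalog views, which also makes the regenerate-on-schema-change workflow straightforward.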
Again, thanks for reading this far.
Anyone have any feedback on any of the three architectures mentioned above, or have a completely new, better path to send me down?
Thanks for your input!