High-level architecture -- consolidating data from multiple servers

  • I'm looking for some input regarding how to architect a process for ongoing consolidation of data from multiple servers.

    Here's the background.

    We have (currently) 3 separate, but structurally identical, MS SQL database servers (Standard Edition) which handle a moderate amount of data (about a million rows per day each). Each server runs about 12 separate databases (again, identical setup across all 3) with about 320 tables that service the various functions our company provides to our clients. There were a variety of (mostly good) reasons for splitting up our customer base across 3 separate groups, mostly having to do with data isolation as well as simply not putting all of our eggs in one basket. It is imperative that each of these three servers be VERY responsive to transactional queries made by our various applications and services. We want to keep the load on each of those servers very light.

    The need at this time is to begin building out a data warehouse that will allow us to do analytics across all of our clients across all three of these groups. The first step we are taking is to, quite simply, aggregate the data from all three groups into a single, new set of the same 12 databases that will then serve as a set of staging databases from which to build out the warehouse. We are NOT envisioning any transformation or data cleansing at this point -- just aggregation.

    Nearly all of the tables in all of the databases use GUIDs as primary keys; the exceptions are lookup tables that are identical across all three groups anyway. So we don't have to be concerned about surrogate keys at this point. The goal is to simply mix the data from all three groups into a single, large SQL instance. All of the tables have CreatedDate and UpdatedDate columns, but the values in those columns are actually set by .NET code on multiple application servers (not the database server), and there is no guarantee that the system clocks are precisely synchronized. Therefore, I'm thinking of using Change Tracking (NOT Change Data Capture, which is only available in the Enterprise edition) to detect inserts, updates, and deletes.
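
    For illustration, here's roughly the shape of what I have in mind per table (the table and column names below are just placeholders, and the retention window is a guess on my part, not something we've settled on):

        -- Enable change tracking at the database level, then per table
        ALTER DATABASE MyAppDb
        SET CHANGE_TRACKING = ON
        (CHANGE_RETENTION = 3 DAYS, AUTO_CLEANUP = ON);

        ALTER TABLE dbo.MyTable
        ENABLE CHANGE_TRACKING
        WITH (TRACK_COLUMNS_UPDATED = OFF);

        -- Later, pull everything that changed since the version stored at the last sync
        DECLARE @last_sync_version BIGINT = 0;  -- would come from a control table

        SELECT ct.SYS_CHANGE_OPERATION,   -- I, U or D
               ct.MyTableId,              -- the GUID primary key
               t.SomeColumn
        FROM CHANGETABLE(CHANGES dbo.MyTable, @last_sync_version) AS ct
        LEFT JOIN dbo.MyTable AS t
               ON t.MyTableId = ct.MyTableId;   -- LEFT JOIN so deletes still show up

        -- Save this as the starting point for the next run
        SELECT CHANGE_TRACKING_CURRENT_VERSION();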

    A side benefit of this aggregation will be to allow us to regularly "prune" data out of the production SQL instances (data over a year old, for example) without actually losing it (because it will be retained indefinitely in this larger, aggregated instance). Therefore, deletion of data in the 3 production instances (which would only happen when pruning, since our processes only perform a logical delete, not a physical delete, of data) should NOT cause data to be deleted in the aggregated instance.

    We envision keeping the aggregated, staging server fairly current -- latency of not more than 12 hours in the data. So we're looking to run some process on a 12 hour (or shorter) schedule.

    If you're still reading, thank you!

    Here's what I've thought of so far:

    1) SSIS -- I've worked out a design for an SSIS package that could run against all three servers. However, since each table is different, I'm looking at 320 separate data flows that would have to be created using the SSIS GUI/designer, with multiple steps in each. Not sure I have the patience/guts for that. Plus, maintenance could be a nightmare.

    2) Replication -- We're already using transactional replication within each of these three 'groups' to keep a copy of the production data available on a second server as a 'warm' failover as well as for reporting. I've looked into simply creating a subscription on the aggregating server to each of the three existing publications. I understand this is possible, but there appear to be a great many issues surrounding initializing the subscriptions, as well as ensuring that deletes (pruning) on the production side do NOT cause deletes on the aggregation server. In other words, a potential minefield?

    3) SQL queries/stored procedures -- I have created a template for using Change Tracking in a stored proc to extract changed data. I've considered using a code generation tool to take that template and programmatically create the 320+ stored procs that would be put on each of the 3 production servers and then executed by a 'master' process on the aggregating server (via a linked server connection). This seems the most promising at this point, especially since changes to the database schema that could occur down the road could be handled by re-generating the relevant SPs and deploying them along with the schema changes. A rough sketch of the template is below.
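
    Here's a rough, simplified sketch of what the generated template and the 'master' side might look like. All object names are placeholders, the real procs would be generated with full column lists, and the linked-server call assumes RPC is enabled:

        -- On each production server/database: one generated proc per table
        CREATE PROCEDURE dbo.usp_GetChanges_MyTable
            @last_sync_version BIGINT
        AS
        BEGIN
            SET NOCOUNT ON;

            SELECT ct.SYS_CHANGE_OPERATION,
                   ct.MyTableId,
                   t.SomeColumn
            FROM CHANGETABLE(CHANGES dbo.MyTable, @last_sync_version) AS ct
            LEFT JOIN dbo.MyTable AS t
                   ON t.MyTableId = ct.MyTableId;
        END;
        GO

        -- On the aggregating server: the 'master' process, per table and per source server
        DECLARE @ver BIGINT = 0;  -- last synced version for this server/table, from a control table

        CREATE TABLE #Changes
        (
            SYS_CHANGE_OPERATION NCHAR(1),
            MyTableId            UNIQUEIDENTIFIER,
            SomeColumn           NVARCHAR(100)
        );

        INSERT INTO #Changes
        EXEC [Prod1].[MyAppDb].dbo.usp_GetChanges_MyTable @last_sync_version = @ver;

        -- Upsert only: deletes caused by pruning on production are deliberately ignored
        MERGE dbo.MyTable AS tgt
        USING (SELECT * FROM #Changes WHERE SYS_CHANGE_OPERATION <> 'D') AS src
           ON tgt.MyTableId = src.MyTableId
        WHEN MATCHED THEN
            UPDATE SET SomeColumn = src.SomeColumn
        WHEN NOT MATCHED BY TARGET THEN
            INSERT (MyTableId, SomeColumn)
            VALUES (src.MyTableId, src.SomeColumn);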

    Again, thanks for reading this far.

    Anyone have any feedback on any of the three architectures mentioned above, or have a completely new, better path to send me down?

    Thanks for your input!

    Rob Schripsema
    Propack, Inc.

  • My personal strategy would be to build an ODS (operational data store) database and use a combination of SSIS and MERGE statements to extract data from your source systems and load it into your ODS. If they prove to be reliable, use the UpdatedDate columns in your source system as your change-detection mechanism; this will drastically improve performance after the initial load. I would build a separate package for each of my source tables. A rough sketch of the pattern follows at the end of this reply.

    I would highly recommend against bringing all of the tables in. Instead, bring them in as business requirements demand that particular data. Otherwise you could waste a lot of time building a process for data that won't ever be used.
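
    As a very rough sketch of that pattern (all object names below are placeholders, not anything from your environment):

        -- Watermark for this table, stored in a small control table in the ODS
        DECLARE @watermark DATETIME2;

        SELECT @watermark = LastUpdatedDateLoaded
        FROM ods.LoadControl
        WHERE TableName = N'Customer';

        -- The SSIS data flow would pull only rows with UpdatedDate > @watermark
        -- from each source server into stg.Customer, then this merge applies them

        MERGE ods.Customer AS tgt
        USING stg.Customer AS src
           ON tgt.CustomerId = src.CustomerId
        WHEN MATCHED AND src.UpdatedDate > tgt.UpdatedDate THEN
            UPDATE SET Name = src.Name,
                       UpdatedDate = src.UpdatedDate
        WHEN NOT MATCHED BY TARGET THEN
            INSERT (CustomerId, Name, UpdatedDate)
            VALUES (src.CustomerId, src.Name, src.UpdatedDate);

        -- Only advance the watermark when something actually came over
        UPDATE ods.LoadControl
        SET LastUpdatedDateLoaded = (SELECT MAX(UpdatedDate) FROM stg.Customer)
        WHERE TableName = N'Customer'
          AND EXISTS (SELECT 1 FROM stg.Customer);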

  • We host clients using our own app, so we also have multiple servers and multiple databases, all identical.

    We built a source component that "makes the rounds" to run the same query against each database and inject the records into the data flow, from which I can do whatever transforms, cleansing, lookups, etc., en masse (the general idea is sketched at the end of this reply). Each package I make usually populates just one DW table, sometimes more. We also have hundreds of tables in the source dbs, growing all the time.

    Agree that duplicating all your live data is unnecessary, and costly on many fronts. Take what you need when you need it. I would otherwise have to constantly modify a 'go get everything' process, because we are continually growing our core product, so tables are added and decommissioned, and fields are added/modified/removed...
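
    Our component is custom code inside SSIS, so this isn't our actual implementation, but the same idea expressed in plain T-SQL over linked servers looks roughly like this (the server, database and table names are made up):

        -- One extraction query, executed against each identical source in turn,
        -- with all the rows landing in a single staging table
        DECLARE @sources TABLE (ServerName SYSNAME, DbName SYSNAME);
        INSERT INTO @sources VALUES (N'Prod1', N'MyAppDb'), (N'Prod2', N'MyAppDb'), (N'Prod3', N'MyAppDb');

        DECLARE @server SYSNAME, @db SYSNAME, @sql NVARCHAR(MAX);

        DECLARE src_cursor CURSOR FAST_FORWARD FOR
            SELECT ServerName, DbName FROM @sources;

        OPEN src_cursor;
        FETCH NEXT FROM src_cursor INTO @server, @db;

        WHILE @@FETCH_STATUS = 0
        BEGIN
            -- Same query every time; only the four-part name changes
            SET @sql = N'INSERT INTO stg.Customer (CustomerId, Name, UpdatedDate)
                         SELECT CustomerId, Name, UpdatedDate
                         FROM ' + QUOTENAME(@server) + N'.' + QUOTENAME(@db) + N'.dbo.Customer;';
            EXEC sys.sp_executesql @sql;

            FETCH NEXT FROM src_cursor INTO @server, @db;
        END

        CLOSE src_cursor;
        DEALLOCATE src_cursor;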
