In this series of articles I would like to address some typical problems that companies of all sizes face in our interconnected and data-centric era: how to make sense and profit from the mountains of data being collected in the course of doing business.
Let's start with some, however loose, definitions:
- A centralized HUB is a collection of data that is gathered from a variety of sources and merged into a coherent whole and identified with a particular time period. The data can be considered volatile or non-volatile.
- A Data Mart is a miniature data warehouse; in other words, it is just one segment of the data warehouse. Based on the data mart usage it can be in different format. If it used by BI tools it can be created with dimensional structure or as a mini classic data warehouse, or pure relational, or in any other form. For example, in my company data marts are used by various applications to display data consistently across multiple applications and multiple SQL servers including reporting and BI servers, API calls, WEB applications.
A typical solution to the problem of getting your hands around the sheer volume of available data is populate, in real time or through a batch schedule, a collection of data marts (databases) from a much larger centralized source of data.
In the Microsoft SQL universe, a solution proposed by Microsoft is to use features available in their enterprise-class offering, Microsoft SQL Server 2008 R2 Parallel Data Warehouse. This product, however, isn't for everyone due to a variety of factors, cost being one of the more prominent ones.
I am going to propose some ways to achieve the goal of data distribution in the absence of the big cohune.
Let’s focus our discussion on 3 areas:
- Improving price/performance ratio by creating our own process for the data delivery
- Utilizing the "Appliance Model" to simplify data deployment and maintenance.
- Achieving better agility and business alignment through the "Hub-and-Spoke" approach
We can implement the "Hub-and-Spoke" pattern without actually having the SQL Server 2008 R2 Parallel Data Warehouse Appliance installed. We can utilize a powerful server and use it as a centralized data hub. Data from various sources is consolidated into this hub by way of a set of ETL processes.
First we need to understand that data marts are the subject-oriented databases for new analytical, transactional, or other type of applications. In most cases, a data mart has to encapsulate a unit of data.
There are a few main parts of the SQL Server 2008 R2 Parallel Data Warehouse Appliance. Data comes to the Centralized HUB called “landing zone” from various data sources via ETL processes. Data is prepared in the “landing zone” and then stored in SQL Server 2008 R2 Parallel Data Warehouse, and then distributed into the data marts. The same ideas can be used for the smaller operations – combine data in centralized HUB, process it daily or process in real time and distribute to the local server(s).
Deploying the Data Mart Nightly
Let’s go over the process that is deploying data mart nightly.
The main part of the solution is that nightly traffic, e.g. number of calls to the database is very low – a few calls per minute, while the daily traffic is relatively high – hundreds of calls per second. Remember that in most cases all calls to the data mart are read only.
The first method is the nightly data mart complete replacement by the process. One of the issues, that despite the low user’s traffic, the calls to the data marts still exist. It means that we could still end up with some inconsistent data if we would allow some long running transactions (more than a few seconds).
The example process consists of 2 parts. The first part is implemented on the HUB and makes sure the data is loaded into mart database by truncating data and reloading it completely. HUB is the centralized storage (like landing zone for Parallel Data Warehouse Appliance) for all sources of data and it is used only by back end processes to process and distribute data. When this load is finished, a data mart is backed up by the process and the backup copy is deployed to the destination server.
Then another process restores the data mart on the destination server with a different name and replaces the old database with new one.
By using a restore procedure from the backup file you will require each time to make the user’s synchronization from the main data mart because each server may have different set of users in the same database.
You can choose different way of data deployment by keeping the exact replica of data mart on each server but with different name. As an example, you can load it from the central HUB by an ETL process like SSIS packages. With this methodology you will not need to synchronize all of your users each time the mart is restored. In addition, it may make it simpler than the copy backup. It is up to the DBA to choose how deliver data from HUB to the local temporary database.
When delivery process is done successfully or unsuccessfully, control has to be returned back to the HUB. Now the HUB knows the status of all local servers.
The final step is to release the command from the HUB to exchange data between the new and the old database. In our case it will be done by calling a job that exists on each local server and replace an existing data mart with new one on every local server. For example, one of the mechanisms to issue the call to the separate independent server is to use an SQL Server alert. You can read about it in details in the article in SQLServercentral.com http://www.sqlservercentral.com/articles/Administration/sqlserveralerts/1435/. Another mechanism can be SQL Server messaging mechanism which is available for SQL Server 2005 and SQL Server 2008.
Let’s see how each step can be implemented. For sake of clarity, I will name the main data mart as DM_Main and the load data mart as DM_Load. Remember that data mart DM_Load is already populated with data or restored from backup on each local server..
Step 1. Kill DM_Load connections
IF exists (select * from master.dbo.sysdatabases where name = 'DM_Load ')
exec dbo.KILL_DB_CONNECTIONS @db_nm = 'DM_Load '
Step 2. Kill DM_Main connections
IF exists (select * from master.dbo.sysdatabases where name = 'DM_Main ')
exec dbo.KILL_DB_CONNECTIONS db_nm = 'DM_Main '
ALTER DATABASE DM_Load SET SINGLE_USER WITH ROLLBACK IMMEDIATE
ALTER DATABASE DM_Main SET SINGLE_USER WITH ROLLBACK IMMEDIATE
Alter database DM_Main modify name = DM_Main_old
Alter database DM_Load modify name = DM_Main
Alter database DM_Main_old modify name = DM_Load
ALTER DATABASE DM_Main SET MULTI_USER WITH ROLLBACK IMMEDIATE
ALTER DATABASE DM_Load SET MULTI_USER WITH ROLLBACK IMMEDIATE
The code of stored procedure KILL_DB_CONNECTIONS can be written many ways but below you can find one of simple examples.
SET QUOTED_IDENTIFIER ON
SET ANSI_NULLS ON
CREATE procedure dbo.KILL_DB_CONNECTIONS
SET NOCOUNT ON
declare @tmpkill table (cmd varchar(255), tid int identity(1,1))
declare @cnt int, @cmd varchar(4000), @minid smallint, @maxid smallint
insert into @tmpkill(cmd)
select 'kill ' + cast(sp.spid as varchar) from master..sysprocesses sp
inner join master..sysdatabases sd on sp.dbid = sd.dbid
where sd.name = @db_nm and sp.spid <> @@SPID
select @minid = min(tid), @maxid = max(tid)
while (@minid <= @maxid)
select @cmd = cmd
where tid = @minid
set @minid = @minid + 1
SET NOCOUNT OFF
SET QUOTED_IDENTIFIER OFF
SET ANSI_NULLS ON
Now we have swapped the 2 databases. Database DM_Load has became DM_Main and vice versa.
Until the next load you can use DM_Load as a backup of DM_Main This method works fine and I've used it for 3 years on 10 production servers without any issues. It works best with application connection pools but it may have problems if used during a heavy load time with more than 5-10 calls to the database per second.
You can find some additional information about this type of load in the article in SQLServerCentral.com: http://www.sqlservercentral.com/articles/2901/
In part 2 of the article I will describe how to do the Real time data deployment and control mechanism