SQLServerCentral Article

Going From Many Schemas To One Schema

A great deal of time in SQL Server is spent on relatively simple Extract-Transform-Load (ETL) scenarios that involve taking data in one format, cleaning it, normalizing it, and bringing it into SQL Server so it can be fully utilized. In some cases, these ETL tasks simply involve transferring data from one set of SQL Server tables to another, for instance when moving from a production database to an analysis database whose schema is not quite the same. There are also cases where data must be moved from several databases with similar, but not identical, schemas into one consolidating database, for instance when merging several slightly different production databases into one unified analysis database.

One of the simplest ways that the schemas can differ is when one of them has columns that the other does not. For instance, take a company with many subdivisions, each of which maintains its own database with its own customer table. Two of those tables might look like:

create table test.dbo.customer (
    custId varchar(25),
    lastName varchar(50),
    firstName varchar(50),
    phoneNumber varchar(11)
)

create table test2.dbo.customer (
    custId varchar(25),
    lastName varchar(50),
    middleInitial varchar(1),
    firstName varchar(50),
    phoneNumber varchar(11)
)

The only difference between them is that one has a column for middleInitial and the other does not. Assume there was a requirement to merge those into one master table such as:

create table test3.dbo.customerAnalysis (
    custId varchar(25),
    lastName varchar(50),
    middleInitial varchar(1),
    firstName varchar(50),
    phoneNumber varchar(11)
)

With only those two source tables, it would be trivially easy to write two insert statements to bring them into the master analysis table, put them into a procedure, and run it as needed. But if there are many tables, and perhaps a variety of differences between them, then it may be worth looking at a more dynamic solution that addresses the problem at a higher level of abstraction. This can be done very effectively by using the information schema views available in SQL Server 2005 and 2008.
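
For reference, the trivial static version for just the two source tables above might look something like this (a minimal sketch, assuming the customerAnalysis table starts empty):

insert into test3.dbo.customerAnalysis
    (custId, lastName, firstName, phoneNumber)
select custId, lastName, firstName, phoneNumber
from test.dbo.customer

insert into test3.dbo.customerAnalysis
    (custId, lastName, middleInitial, firstName, phoneNumber)
select custId, lastName, middleInitial, firstName, phoneNumber
from test2.dbo.customer

Every new source table, and every schema difference, means another hand-written statement to maintain, which is what motivates the dynamic approach below.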

As long as all the tables share at least the primary key, a procedure can be written that takes the database, schema, and table name and then dynamically generates the insert and update statements needed. It could look like:

CREATE procedure [dbo].[importCustData]
    @database varchar(128),
    @schema varchar(128) = 'dbo',
    @tablename varchar(128) = 'customer',
    @test bit = 0 --setting to one will generate and
                  --print commands without executing
as
/*This procedure will import data from the customer table
in the target database and schema to the customerAnalysis
table for analysis*/

declare @sql nvarchar(max)      --Must be nvarchar for sp_executesql
declare @Ucolumns varchar(8000) --Columns for update statements
declare @Icolumns varchar(8000) --Columns for insert statements

--char(10) is newline, char(9) is tab, and simply makes debugging easier
--s represents source and d represents destination
select @sql = '
select
    @Ucolumns =
        isnull(@Ucolumns, '''') + char(9) + c1.column_name + '' = s.'' +
        c1.column_name + '', '' + char(10),
    @Icolumns =
        isnull(@Icolumns, '''') + char(9) + c1.column_name + '', '' + char(10)
from
    ' + quotename(@database) + '.information_schema.columns c1
    join information_schema.columns c2
        on c1.table_name = @tablename
        and c2.table_name = ''customerAnalysis''
        and c1.column_name = c2.column_name
where
    c1.table_schema = @schema
'

print @sql

exec sp_executesql @sql,
    N'@schema varchar(128), @tablename varchar(128),
    @Ucolumns varchar(8000) OUTPUT, @Icolumns varchar(8000) OUTPUT',
    @schema,
    @tablename,
    @Ucolumns OUTPUT,
    @Icolumns OUTPUT

--remove the trailing comma, space, and newline
select @Ucolumns = substring(@Ucolumns, 1, len(@Ucolumns) - 3)
select @Icolumns = substring(@Icolumns, 1, len(@Icolumns) - 3)

--In SQL Server 2008 MERGE would be a better option.
select @sql = '
update
    customerAnalysis
set
' + @Ucolumns + '
from
    customerAnalysis d,
    ' + quotename(@database) + '.' + quotename(@schema) + '.' + quotename(@tablename) + ' s
where
    s.custId = d.custId '

print @sql
if @test = 0
    exec (@sql)

select @sql = '
insert into
    customerAnalysis
(
' + @Icolumns + '
)
select
' + @Icolumns + '
from
    ' + quotename(@database) + '.' + quotename(@schema) + '.' + quotename(@tablename) + ' t
where
    custId not in (select custId from customerAnalysis)'

print @sql
if @test = 0
    exec (@sql)

This procedure allows any of the customer tables to be merged into the main customerAnalysis table, as long as they share the same primary key and the shared columns have the same names. It accommodates the fact that some servers may not use all of the columns.
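
For example, using the tables defined earlier, the procedure could first be run in test mode to inspect the generated statements, and then run for real (run from the database holding customerAnalysis, test3 in the example):

--Print the generated update and insert statements without running them
exec dbo.importCustData @database = 'test2', @test = 1

--Actually perform the import
exec dbo.importCustData @database = 'test2'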

A similar, though slightly more complicated, approach can be used if some of the tables have different column names for the same data. This may occur, for instance, when a number of related but slightly different applications are developed separately. It can be hard to change once the schema has been in production for a while and more and more code has come to rely on the original column names.

In this case, an equivalence table can be used to tell the script which columns may be substituted for each other. For instance, assume there was another table on another database like:

CREATE TABLE dbo.customerDiff(
    custId varchar(25),
    lastN varchar(50),
    mni varchar(1),
    firstN varchar(50),
    phoneNum varchar(11)
)

Then there may be a table such as:

--Shows the column equivalences
CREATE TABLE dbo.custEquiv (
    analysisCol varchar(128),
    baseCol varchar(128)
)

It can be populated as follows:

--A row constructor is easier in 2008
insert into dbo.custEquiv
(analysisCol, baseCol)
select 'lastName', 'lastN'
UNION ALL
select 'middleInitial', 'mni'
UNION ALL
select 'firstName', 'firstN'
UNION ALL
select 'phoneNumber', 'phoneNum'
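
Before wiring this into the procedure, the mapping logic can be sanity-checked with a standalone query (a hypothetical helper, written as if customerDiff, custEquiv, and customerAnalysis were all visible from the current database; the procedure below adds the cross-database prefix):

--Resolve customerDiff's columns to their customerAnalysis equivalents,
--matching either directly by name or through the equivalence table
select
    c1.column_name as sourceCol,
    c2.column_name as destCol
from
    information_schema.columns c1
    left join dbo.custEquiv e
        on c1.column_name = e.baseCol
    join information_schema.columns c2
        on c2.table_name = 'customerAnalysis'
        and (c1.column_name = c2.column_name or e.analysisCol = c2.column_name)
where
    c1.table_name = 'customerDiff'

This should map custId to custId directly and the remaining four columns through the equivalence rows inserted above.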

Then the import procedure can be modified to use this equivalence table, making it possible to import the data even from tables where the column names do not match perfectly. One possible way of doing this looks like:

CREATE procedure [dbo].[importCustData2]
    @database varchar(128),
    @schema varchar(128) = 'dbo',
    @tablename varchar(128) = 'customer',
    @test bit = 0 --setting to one will generate and
                  --print commands without executing
as
/*This procedure will import data from the customer table
in the target database and schema to the customerAnalysis
table for analysis*/

declare @sql nvarchar(max)       --Must be nvarchar for sp_executesql
declare @Ucolumns varchar(8000)  --Columns for update statements
declare @Iscolumns varchar(8000) --Source columns for the insert statement
declare @Ibcolumns varchar(8000) --Base (destination) columns for the insert statement

--char(10) is newline, char(9) is tab, and simply makes debugging easier
--s represents source and d represents destination
select @sql = '
select
    @Ucolumns =
        isnull(@Ucolumns, '''') + char(9) + c2.column_name + '' = s.'' +
        c1.column_name + '', '' + char(10),
    @Iscolumns =
        isnull(@Iscolumns, '''') + char(9) + c1.column_name + '', '' + char(10),
    @Ibcolumns =
        isnull(@Ibcolumns, '''') + char(9) + c2.column_name + '', '' + char(10)
from
    ' + quotename(@database) + '.information_schema.columns c1
    left join dbo.custEquiv e
        on c1.table_name = @tablename
        and c1.column_name = e.baseCol
    join information_schema.columns c2
        on c1.table_name = @tablename
        and c2.table_name = ''customerAnalysis''
        and (c1.column_name = c2.column_name OR e.analysisCol = c2.column_name)
where
    c1.table_schema = @schema
'

print @sql

exec sp_executesql @sql,
    N'@schema varchar(128), @tablename varchar(128),
    @Ucolumns varchar(8000) OUTPUT,
    @Iscolumns varchar(8000) OUTPUT,
    @Ibcolumns varchar(8000) OUTPUT',
    @schema,
    @tablename,
    @Ucolumns OUTPUT,
    @Iscolumns OUTPUT,
    @Ibcolumns OUTPUT

--remove the trailing comma, space, and newline
select @Ucolumns = substring(@Ucolumns, 1, len(@Ucolumns) - 3)
select @Iscolumns = substring(@Iscolumns, 1, len(@Iscolumns) - 3)
select @Ibcolumns = substring(@Ibcolumns, 1, len(@Ibcolumns) - 3)

--In SQL Server 2008 MERGE would be a better option.
select @sql = '
update
    customerAnalysis
set
' + @Ucolumns + '
from
    customerAnalysis d,
    ' + quotename(@database) + '.' + quotename(@schema) + '.' + quotename(@tablename) + ' s
where
    s.custId = d.custId '

print @sql
if @test = 0
    exec (@sql)

select @sql = '
insert into
    customerAnalysis
(
' + @Ibcolumns + '
)
select
' + @Iscolumns + '
from
    ' + quotename(@database) + '.' + quotename(@schema) + '.' + quotename(@tablename) + ' t
where
    custId not in (select custId from customerAnalysis)'

print @sql
if @test = 0
    exec (@sql)

While this is considerably more complicated than the original, it provides greater flexibility. It makes it easy to import data into one centralized analysis database from any number of production databases, even with substantial variations in their schemas.
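
As an example, importing the customerDiff table described earlier might look like this (a hypothetical invocation; the article does not name the database holding customerDiff, so test4 is assumed here):

--Import dbo.customerDiff from a database named test4, relying on
--dbo.custEquiv to map the differing column names
exec dbo.importCustData2
    @database = 'test4',
    @tablename = 'customerDiff'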

Techniques such as this can be used to help minimize repetitive coding and make code more general purpose. Of course, when dealing with dynamic SQL of any kind, care must be taken to ensure that it does not lead to openings for SQL injection attacks.
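
Finally, as the comments in both procedures note, on SQL Server 2008 the separate update and insert statements could be collapsed into a single MERGE. A minimal static sketch of the same upsert, assuming the test2 source table from earlier (the dynamic versions would build this string from the same column lists):

merge test3.dbo.customerAnalysis as d
using test2.dbo.customer as s
    on d.custId = s.custId
when matched then
    update set
        lastName = s.lastName,
        middleInitial = s.middleInitial,
        firstName = s.firstName,
        phoneNumber = s.phoneNumber
when not matched then
    insert (custId, lastName, middleInitial, firstName, phoneNumber)
    values (s.custId, s.lastName, s.middleInitial, s.firstName, s.phoneNumber);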
