SQLServerCentral Article

Generic Data Comparison

,

One of the

problems frequently faced by the DBAs or database developers is data comparison. Specifically, data residing in separate databases. The company I am working for has 2 types of databases: Access and SQL Server. Very often I need to compare data in different tables in different databases residing on different servers. After some extensive research on the WEB I have found a great article by Jonathan Gennick on http://www.oreillynet.com/pub/a/network/2002/04/08/compsets.html)

This article is the best in representing a comparison of two sets of rows, whether from the same table or from different tables, with the goal to determine if in fact the two row sets are the same. But (there is always a ”but”) the code in the article requires knowledge of the compared tables structure, and therefore needs to be written for every table individually. My company has to adhere to some very strict FDA regulations, according to which every script for comparison must be validated. That renders pretty much any static SQL solution as not acceptable.

I looked into certain of-the-shelf comparison tools like SQL Data Compare (Red Gate Software Inc.). This solution is not exactly what I was looking for. The tool allows you to compare some SQL Server table data to another SQL Server table data if tables have the same name but located in different databases or in the same database but on different servers.

Lets define the tasks and then try to solve it as generically as possible.

Task #1 – Migration from non SQL Server to SQL Server

From time to time we are making database and application migration from Access to SQL Server Our FDA regulated environment requires data comparison and comparison reports to show that data for each table is the same after migration.

Task #2

Data should be compared in structurally the same tables with the same names but between different databases on the same or different servers. For example, some lookup/code tables with company’s specific data should match (states, languages, projects, zip codes etc.).

Task #3:

Data comparison should be done between tables with different names but with the same structure and located in the same or in different databases on the same or different servers.

Task #4

Data comparison between tables with different names and different structure. For example, when data transfer process is running, it is placing data to some generic table with columns named generically (gcol1, gcol2…). And comparison of the tables is then required to make sure that all rows are pulled out properly or for some other reason. Another example: audit tables have some additional columns compare to the original structure, which requires the comparison task to identify any current rows that are missing from the audit table.

It is becoming clear now that there is a multitude of situations where data comparison between tables is required. You might have your own task that is not falling in any of the above mentioned 4 categories.

As you can appreciate by now, the challenge here is to build some scripts that would be generic enough to support differences in databases, table names, column names, primary keys and column types. In addition to that, some restrictions must be met when you need to provide comparison tools for data residing on different types of database engines, such as Access or Oracle.

Restrictions:

a)If columns in compared tables have the same name but different count, the table with lesser number of columns should be the first table in parameter list assuming that we compare all columns in table 1 with the same columns in table 2. It can happened with audit or history tables where some additional columns are customarily added like inserted date, reason for change, login, and so on.

b)Sequence of data in columns should be the same regardless of table and column names if column names are different.

There are a few solutions to the problem.

Solution 1 is to compare rows column by column. This method is good if you need to know what columns have differences when rows are not matching. This solution is very time and server resource consuming.

In my case I don’t need to find which columns have different values. I only need to answer the question: “Is the data in 2 tables identical” in the form “Yes/No”

Solution 2 is compare the whole row in table 1 to the whole row in table 2 by converting the values in all the columns to the varchar data type and concatenate them into 1 value. It means that each row will consists of 1 large string and we will compare 1 value per row. In addition, solution 2 can be extended for performance reason. Strings can be converted to numbers and comparison will be done between 2 numbers for each row.

Let’s see a simple example.

Create table customer 
 (customer_id int, customer_nm varchar(25)
  , start_dt datetime, stop_dt datetime
 )
Create table processed_customer 
 (customer_id int, customer_nm varchar(25)
  , start_dt datetime, stop_dt datetime
 )
Insert into customer 
 (customer_id, customer_nm , start_dt, stop_dt)
 Values (1, 'Test Customer 1', '1/1/2002', '8/4/2004')
Insert into customer 
 (customer_id, customer_nm , start_dt, stop_dt) 
 Values (2, 'Test Customer 2', '2/1/2004', null)
Insert into processed_customer 
 (customer_id, customer_nm , start_dt, stop_dt)
 Values (1, 'Test Customer 1', '1/1/2002', '8/4/2004')
Insert into processed_customer 
 (customer_id, customer_nm , start_dt, stop_dt)
 Values (2, 'Customer 2', '2/1/2004', null)

Next query returns the records that are in table customer and

not in table processed_customer.

Select customer_id, customer_nm , start_dt, stop_dt
 From customer
 Where IsNull(cast(customer_id as varchar), ‘‘) +
       IsNull(cast(customer_nm as varchar) , ‘‘) +
 IsNull(cast(start_dt as varchar) , ‘‘) + 
 IsNull(cast(stop_dt as varchar) , ‘‘)
   not in 
     (select IsNull(cast(customer_id as varchar), ‘‘) +  
          IsNull(cast(customer_nm as varchar) , ‘‘) +
 IsNull(cast(start_dt as varchar) , ‘‘) + 
 IsNull(cast(stop_dt as varchar) , ‘‘) 
 From processed_customer)

This query returns:

customer_id customer_nm            start_dt            stop_dt
----------- ---------------------- ------------------- --------------- 
2           Test Customer 2        2004-02-01 00:00:00 NULL
(1 row(s) affected)

If you need to identify the records that are in the table processed_customer but not in the table customer, the query would look like:

Select
    customer_id, customer_nm , start_dt, stop_dt
 From processed_customer
 Where IsNull(cast(customer_id as varchar), ‘‘) + 
 IsNull(cast(customer_nm as varchar) , ‘‘) +
 IsNull(cast(start_dt as varchar) , ‘‘) +
 IsNull(cast(stop_dt as varchar) , ‘‘)
   not in 
     (select IsNull(cast(customer_id as varchar), ‘‘) + 
 IsNull(cast(customer_nm as varchar) , ‘‘) +
 IsNull(cast(start_dt as varchar) , ‘‘) +
 IsNull(cast(stop_dt as varchar) , ‘‘) 
       From customer)

This is a simple example of solution 2. All null values in columns replaced with a space (not with an empty string) in order to avoid the side effects of. For example:

CustomerID

CustomerFNM

CustomerLNM

1

NULL

Customer1

CustomerID

CustomerFNM

CustomerLNM

1

Customer1

NULL

Both tables will show output as ‘1Customer1’ if NULL will be replaced with empty space and declared identical, which is incorrect. In contrast, the first table will show ‘1 Customer1’ and the second ‘1Customer1’, which will enable us to detect a difference.

This method can be used as static if you know the table names and the columns.

When comparison is performed for some production tables, I assume that there is always a primary key in a table and it will be no duplicate records. Otherwise it may be necessary to verify that there are no duplicate records using the next query.

Select  IsNull(cast(customer_id as varchar), ‘‘) + 
  IsNull(cast(customer_nm as varchar) , ‘‘) +
  IsNull(cast(start_dt as varchar) , ‘‘) + 
  IsNull(cast(stop_dt as varchar) , ‘‘), Count(*)
 From customer
 Group by IsNull(cast(customer_id as varchar), ‘‘) + 
 IsNull(cast(customer_nm as varchar) , ‘‘) + 
 IsNull(cast(start_dt as varchar) , ‘‘) +
 IsNull(cast(stop_dt as varchar) , ‘‘)
 Having count(*) > 1

Now lets attempt to write a more generic data comparison script.

--=========================================================================
CREATE procedure dbo.COMPARE_TABLES_DATA
    @tbl_1 varchar(55),
    @dbnm_1 varchar(50) ,
    @tbl_2 varchar(55),
    @dbnm_2 varchar(50),
    @col_order_same_flag char(1), 
    -- Y if column order the same regardless of name; 
 -- N - if order is different but names are the same
    @col_name_same_flag char(1)
 -- First table has all the column names the same as the second one, 
    -- but the second one can have different order and more columns than the
    -- first one
as
begin
declare @cmd varchar(8000), @namepart varchar(7000), @otherA varchar(400)
, @otherS varchar(255), @rowscnt int, @minid int, @maxid int
, @namepart2 varchar(7000)
create table #tmpA (textA varchar(8000) )
create table #tmpS (textS varchar(8000) )
SET NOCOUNT ON
IF (@col_name_same_flag <> 'Y' and @col_order_same_flag <> 'Y')
 begin
raiserror (' Column name or column order must be the same ', 16,1)
   return
 end
select @minid = 0, @maxid = 0, @namepart = '', @namepart2 = ''
-- compare number of rows in tables. 
create table #tmp_rows (cnt int, tid int identity(1,1) )
set @cmd = ' select count(*) from ' + @dbnm_1 + '.dbo.' + @tbl_1 
insert into #tmp_rows(cnt)
exec (@cmd)
set @cmd = ' select count(*) from ' + @dbnm_2 + '.dbo.' + @tbl_2 
 
insert into #tmp_rows(cnt)
exec (@cmd)
IF ( (select cnt from #tmp_rows where tid = 1) <> 
(select cnt from #tmp_rows where tid = 2) )
 begin
   print 'Data in table 1 and 2 are not matching '
   return 
 end
create table #tmp(tid int identity(1,1), colnm varchar(80), colnm1 varchar(50))
create table #tmp_2(tid int identity(1,1), colnm varchar(80), colnm1 varchar(50))
select @cmd =  ' select ''ISNULL(cast('' + sc.name + '' as varchar), '''''''') '',
sc.name from ' + @dbnm_1 + '..syscolumns sc
inner join ' + @dbnm_1 + '..sysobjects so on so.id = sc.id where
so.name = ''' + @tbl_1 + '''' + ' order by sc.colid '
insert into #tmp(colnm,colnm1)
exec(@cmd)
select @minid = 1, @maxid = max(tid) from #tmp
while (@minid <= @maxid)
 begin
select @namepart = @namepart + colnm + ' + ' from #tmp where tid = @minid 
set @minid = @minid + 1
 end
-- take away last character '+' or ','
select @namepart = left(rtrim(ltrim(@namepart)), len(ltrim(rtrim(@namepart))) - 1)
IF (@col_name_same_flag <> 'Y' and @col_order_same_flag = 'Y')
 BEGIN
-- start IF for table 2
select @cmd = ' select ''ISNULL(cast('' + sc.name + '' as varchar), '''''''') '',
sc.name  from ' + @dbnm_2 + '..syscolumns sc
inner join ' + @dbnm_2 + '..sysobjects so on so.id = sc.id
where so.name = ''' + @tbl_2 + '''' + ' order by sc.colid '
insert into #tmp_2(colnm,colnm1)
exec(@cmd)
set @minid = 1
while (@minid <= @maxid)
 begin
select @namepart2 = @namepart2 + colnm + ' + ' 
 from #tmp_2  where tid = @minid
select @minid = @minid + 1
 end
-- take away last character '+' or ','
select @namepart2 = left(rtrim(ltrim(@namepart2)), len(ltrim(rtrim(@namepart2))) - 1)
 END
ELSE
 begin
set @namepart2 = @namepart
 end
select @otherS = ' FROM ' + @dbnm_1 + '..' + @tbl_1
select @otherA = ' FROM ' + @dbnm_2 + '..' + @tbl_2
set @cmd = ' select ' + @namepart + @otherS
insert into #tmpS (textS)
exec (@cmd)
select @cmd = ' select ' + @namepart2 + @otherA
insert into #tmpA (textA)
exec (@cmd)
--Number of records that exists in table 1 and not exists in table 2
select @rowscnt = count(*) from #tmpS
 where textS not in (select textA from #tmpA)
-- get records if some missing
IF ( @rowscnt > 0 )
 begin
--print 'Records that exists in table 1 ' + @tbl_1 + ' 
-- and not exists intable 2 ' + @tbl_2
select textS from #tmpS where textS not in (select textA from #tmpA)
 end
--Number of records that exists in table 2 and not exists in table 1
select @rowscnt = count(*) from #tmpA where textA not in (select textS from #tmpS)
-- get records if some missing
IF ( @rowscnt > 0 )
 begin
-- print 'Records that exists in table 2 ' + @tbl_2 + ' 
-- and not exists in table 1 ' + @tbl_1
select textA from #tmpA where textA not in (select textS from #tmpS)
 end
SET NOCOUNT OFF
end
-- ============================================================================

Solution 3 is to join the tables on all columns and if the result of the join has the same number of rows as each of them, then tables are matching.

For the tables above result will look as

Select count(*)
 From processed_customer pc
Inner join customer c 
ON IsNull(cast(pc.customer_id as varchar), ‘‘) =
IsNull(cast(c.customer_id as varchar), ‘‘) 
and IsNull(cast(pc.customer_nm as varchar) , ‘‘) =
IsNull(cast(c.customer_nm as varchar) , ‘‘) 
and IsNull(cast(pc.start_dt as varchar) , ‘‘) = 
IsNull(cast(c.start_dt as varchar) , ‘‘) 
and IsNull(cast(pc.stop_dt as varchar) , ‘‘) = 
IsNull(cast(c.stop_dt as varchar) , ‘‘) 

If number of records is not matching with number of records in the tables then there are some unmatched records. Otherwise, tables are matching. Generic procedure for the solution is in file Compare_Tables_Data_Solution3.txt

Those solutions are workable but are time consuming for a tables with 1,000,000+ rows and many columns. I have ran those procedures for tables with 1,300,000 rows and about 60 columns. Comparison

has been completed in about 4 min.

Solution 4 (Performance boost)

You may notice that converting to a string and then concatenating several strings and then comparing 2 long strings may not be very efficient as far as CPU utilization goes. The reason string concatenation may get slow is that it will most likely require a memory allocation for the result and discarding memory used for operands, unless of course you are using an optimizing compiler. This though can lead us to the idea of trying to calculate a hash against data in each compared row and only compare hash values. This approach may prove valid only in situations when a) your rows have a lot of columns to be compared and b) run time is an issue, which is not always the case for processes like the ones I am describing here. Still, let me illustrate how you could go about implementing the hash idea.

It can be done by converting values to a varbinary and then to a big integer. Let’s use customer and processed_customer tables for this example:

Select count(*)
 From processed_customer pc
Inner join customer c 
ON convert(bigint,convert(varbinary(255), ISNULL( pc.customer_id , ‘ ‘) ) ) = 
convert(bigint,convert(varbinary(255), ISNULL( c.customer_id , ‘ ‘) ) ) 
and convert(bigint,convert(varbinary(255), ISNULL( pc.customer_nm , ‘ ‘) ) ) = 
convert(bigint,convert(varbinary(255), ISNULL( c.customer_nm , ‘ ‘) ) ) 
and convert(bigint,convert(varbinary(255), ISNULL( pc.start_dt , ‘ ‘) ) ) = 
convert(bigint,convert(varbinary(255), ISNULL( c.start_dt , ‘ ‘) ) ) 
and convert(bigint,convert(varbinary(255), ISNULL( pc.stop_dt , ‘ ‘) ) ) =
convert(bigint,convert(varbinary(255), ISNULL( c.stop_dt , ‘ ‘) ) ) 

Generic procedure for the solution is in file Compare_Tables_Data_Solution4.txt. I have run these procedures for the same tables as solution 3 with 1,300,000 rows and about 60 columns. Comparison has been completed in about 75 seconds! It gives us a 3-4 times performance boost. The only danger with this solution, even though the chances are very slim, is that the different original values might produce the same hash and result will be wrong. This is really the case only for the tables with 1 or 2 columns. So, this method is recommended only for tables with 4+ columns and many rows to boost performance.

For the tables with 1-3 columns it is safer to use solution 1-2-3. And generally speaking performance will be the same or very close.

Again, let me emphasize that all those solutions are good for the situations when you only need to answer the question: “Is the data in 2 tables identical” in the form “Yes/No”, not for when you need to locate the differences. For the reason of clarity all script don’t have error handlers, input variables verification, and some other advanced features.

Conclusion

These ideas can be used not only for table to table data comparison but for many different types of data comparison. For example, compare values in some table columns to values in some other table columns. The stored procedure for solution 2 can be modified to output non matching records with one recordset. You can add server names as parameters of the stored procedures and modify stored procedures to be able to work between servers (If servers are linked). Or you can modify it in the way to utilize comparison between Access (or other database engine) and SQL Server. See example COMPARE_TABLES_DATA_ACCESS.txt

Example for the Access database shows how to compare data for exactly the same table, but if table structure for the second table somehow can be obtained within stored procedure then stored procedure can be extended the same way as example for data comparison between different tables in SQL Server shows.

Rate

You rated this post out of 5. Change rating

Share

Share

Rate

You rated this post out of 5. Change rating