When our company was purchased by a private equity firm, we decided to consolidate our operations and databases with a sister company. I was responsible for merging about 300 tables between two different databases over the course of two years. Duplicate keys were the most common problem, and in most cases I could simply renumber the duplicate keys sequentially without issue.
However, in some cases renumbering sequentially would itself create new duplicates. In my case, two employee files had duplicate employee numbers, and an artificial limit required the keys to fall between 0000 and 9999, so sequential renumbering was not an option: it would have created new duplicate keys.
To solve the problem, I created a dummy employee table with entries from 0000 to 9999. I used PowerBuilder to loop through and insert the rows, but I could just as easily have done an import from Excel, or used any scripting language that supports a loop and a SQL insert.
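As a rough sketch (my original loop was in PowerBuilder, not T-SQL), a simple WHILE loop like the following could populate the dummy table with every candidate key:

```sql
-- Illustrative sketch only: fill dbo.emp_no_test with every
-- candidate key from '0000' through '9999'
DECLARE @i int = 0;
WHILE @i <= 9999
BEGIN
    -- Pad with leading zeros to match the four-character key format
    INSERT INTO dbo.emp_no_test (emp_no)
        VALUES (RIGHT('000' + CAST(@i AS varchar(4)), 4));
    SET @i = @i + 1;
END
```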
Since my new numbers could not be a duplicate in either table, I needed to remove from the dummy table any keys that were already in use. To do this, I used a table variable loaded with my dummy records, and then deleted any records that existed in either employee table. The records that were left in the dummy table were good, unused keys that I could renumber my existing records with.
The syntax to create the dummy table was:
CREATE TABLE [dbo].[emp_no_test](
[id] [numeric](18, 0) IDENTITY(1,1) NOT NULL,
[emp_no] [varchar](4) COLLATE SQL_Latin1_General_CP1_CI_AS NULL
) ON [PRIMARY]
The employee data existed in two separate databases, which I'll call database_one and database_two, in a table emp, with the key of emp_no.
-- Create the local table that will hold the good clean records
declare @empty_emp table(
emp_no varchar(4)
)
--Load the local table with the total set of records 0000 to 9999
insert into @empty_emp
select emp_no from database_one.dbo.emp_no_test
--Remove the records from database one that are duplicates
delete from @empty_emp where
emp_no in ( select emp_no from database_one.dbo.emp )
-- Remove the records from database two that are duplicates
delete from @empty_emp where
emp_no in ( select emp_no from database_two.dbo.emp )
-- Select back all the clean records
select * from @empty_emp
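For what it's worth, the same set difference can be expressed in a single query with EXCEPT (using the same table names as above); the delete-based version simply made it easy to inspect the table variable after each step:

```sql
-- All candidate keys, minus those already in use in either database.
-- Chained EXCEPTs evaluate left to right in SQL Server.
select emp_no from database_one.dbo.emp_no_test
except
select emp_no from database_one.dbo.emp
except
select emp_no from database_two.dbo.emp
```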
So for example, let's walk through this scenario with a much smaller dataset, where the employee number key must be between 1 and 10. Database_one has the following employee table:
Database_two has the following employee table:
So the duplicate keys are 1 and 7. These two records must have their keys changed to new values. But because the numbers are not sequential, blindly assigning a new key risks creating another duplicate. Using the above solution, here is what the data sets would look like.
First, I create the emp_no_test table with all possible values in the range ( in this case, 1 through 10 ).
Next, the script deletes all records from emp_no_test where the emp_no is in the Database_One employee table. The emp_no_test table now looks like this (I left blanks to indicate the records that were deleted):
Finally, the script deletes all records from emp_no_test where emp_no is in the Database_Two employee table. The emp_no_test table now looks like:
Returning the records left in the emp_no_test table gives us two keys that are not in use in either table: 2 and 10. I can safely renumber the duplicate records to 2 and 10.
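The renumbering step itself is not shown in the script above; in this small example it would amount to a pair of updates like the following (assuming, hypothetically, that the duplicates in database_two are the ones being changed):

```sql
-- Hypothetical fix-up for the example: reassign database_two's
-- duplicate keys 1 and 7 to the free keys 2 and 10
update database_two.dbo.emp set emp_no = '2'  where emp_no = '1'
update database_two.dbo.emp set emp_no = '10' where emp_no = '7'
```

In the real merge, any foreign-key references to the old employee numbers would of course need the same update.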
Resolving duplicate keys is often straightforward, but when limits are imposed, you have to get creative. In my case we ended up renumbering several hundred employee numbers, which would have taken the human resources departments days to sort out. Using this approach, I found my new keys in a matter of minutes and did not have to deal with any new duplicates created by the renumbering.