When our company was purchased by a private equity firm, we decided to consolidate our operations and databases with a sister company. I was responsible for merging about 300 tables between two different databases over the course of two years. Duplicate keys were the most common problem, and in most cases I could simply renumber the duplicate keys sequentially without issue.
However, in some cases renumbering sequentially would itself create new duplicates. In my case, two employee files had duplicate employee numbers, and an artificial limit required the keys to fall between 0000 and 9999, so sequential renumbering was not an option: it would have created new duplicate keys.
To solve the problem, I created a dummy employee table with entries from 0000 to 9999. I used PowerBuilder to loop through and insert the rows, but I could just as easily have done an import from Excel, or used any scripting language that supports a loop and a SQL insert.
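As a rough sketch (my original loop was in PowerBuilder, not T-SQL), a simple WHILE loop like the following could populate the dummy table with every candidate key:

```sql
-- Illustrative sketch only: fill dbo.emp_no_test with every
-- candidate key from '0000' through '9999'
DECLARE @i int = 0;
WHILE @i <= 9999
BEGIN
    -- Pad with leading zeros to match the four-character key format
    INSERT INTO dbo.emp_no_test (emp_no)
        VALUES (RIGHT('000' + CAST(@i AS varchar(4)), 4));
    SET @i = @i + 1;
END
```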
Since my new numbers could not be a duplicate in either table, I needed to remove from the dummy table any keys that were already in use. To do this, I used a table variable loaded with my dummy records, and then deleted any records that existed in either employee table. The records that were left in the dummy table were good, unused keys that I could renumber my existing records with.
The syntax to create the dummy table was:
CREATE TABLE [dbo].[emp_no_test](
[id] [numeric](18, 0) IDENTITY(1,1) NOT NULL,
[emp_no] [varchar](4) COLLATE SQL_Latin1_General_CP1_CI_AS NULL
) ON [PRIMARY]
The employee data existed in two separate databases, which I'll call database_one and database_two, in a table emp, with the key of emp_no.
-- Create the local table that will hold the good clean records
declare @empty_emp table(
emp_no varchar(4)
)
--Load the local table with the total set of records 0000 to 9999
insert into @empty_emp
select emp_no from database_one.dbo.emp_no_test
--Remove the records from database one that are duplicates
delete from @empty_emp where
emp_no in ( select emp_no from database_one.dbo.emp )
-- Remove the records from database two that are duplicates
delete from @empty_emp where
emp_no in ( select emp_no from database_two.dbo.emp )
-- Select back all the clean records
select * from @empty_emp
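For what it's worth, the same set difference can be expressed in a single query with EXCEPT (using the same table names as above); the delete-based version simply made it easy to inspect the table variable after each step:

```sql
-- All candidate keys, minus those already in use in either database.
-- Chained EXCEPTs evaluate left to right in SQL Server.
select emp_no from database_one.dbo.emp_no_test
except
select emp_no from database_one.dbo.emp
except
select emp_no from database_two.dbo.emp
```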
So for example, let's walk through this scenario with a much smaller dataset, where the employee number key must be between 1 and 10. Database_one has the following employee table:
Database_two has the following employee table:
So the duplicate keys are 1 and 7. These two records must have their keys changed to new values. But because the numbers are not sequential, blindly assigning a new key risks creating another duplicate. Using the above solution, here is what the data sets would look like.
First, I create the emp_no_test table with all possible values in the range ( in this case, 1 through 10 ).
Next, the script deletes all records from emp_no_test where the emp_no is in the Database_One employee table. The emp_no_test table now looks like this (I left blanks to indicate the records that were deleted):
Finally, the script deletes all records from emp_no_test where emp_no is in the Database_Two employee table. The emp_no_test table now looks like:
Returning the records left in the emp_no_test table gives us two keys that are not in use in either table: 2 and 10. I can safely renumber the duplicate records to 2 and 10.
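The renumbering step itself is not shown in the script above; in this small example it would amount to a pair of updates like the following (assuming, hypothetically, that the duplicates in database_two are the ones being changed):

```sql
-- Hypothetical fix-up for the example: reassign database_two's
-- duplicate keys 1 and 7 to the free keys 2 and 10
update database_two.dbo.emp set emp_no = '2'  where emp_no = '1'
update database_two.dbo.emp set emp_no = '10' where emp_no = '7'
```

In the real merge, any foreign-key references to the old employee numbers would of course need the same update.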
Resolving duplicate keys is often straightforward, but when limits are imposed, you have to get creative. In my case we ended up renumbering several hundred employee numbers, which would have taken the human resources departments days to sort out. Using this approach, I found my new keys in a matter of minutes and did not have to deal with any new duplicates created by the renumbering.