Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 

Get your favourite SSC scripts directly in SSMS with the free SQL Scripts addin. Search for scripts directly from SSMS, and instantly access any saved in your SSC briefcase from the favorites tab.
Download now (direct download link)

Identify Unique Columns

By Jesse McLain,

As an ETL developer, I often receive files in which the natural key is either unidentified or misidentified, and I need to determine a combination of columns that uniquely identifies each row (the natural key, which is usually also the primary or unique key of the table holding this data), in order to integrate the data from that file into the database. Many times it is simply not clear what the unique key should be, and this requires a painstaking cycle of guessing and testing combinations of columns for unique values. To address this problem, I developed a stored procedure that will automatically query the table to discover a candidate for the unique key. At first, I wanted to create a tool that would report all unique combos, but due to performance reasons, I later limited that to just the first one discovered. I added a way to retest and exclude that result if the first one is not satisfactory.

This goal requires first creating a list of all of the unique combinations of column names. If we have a table with 3 columns, A, B, and C, then we need to generate all of the combos: A, B, C, AB, AC, BC, ABC (note that for our purposes, AB and BA are the same thing). When I initially approached this problem, I developed a recursive process that would start with column A, then test all "strains" beginning with A: A, AB, ABC, AC (this is known as a depth-first search). However, there's a problem with this approach. Let's say that only the combination BC is unique in our table. When our algorithm tests ABC, the result is positive, and the algorithm then stops upon success (if BC is unique, then any strain of BC is unique also). But what we really want is the minimum set of unique columns, the BC result. Ultimately I developed a hybrid recursive/iterative solution that tests combos from smallest size to largest. Using our ABC table above, the tool will first test A, B, and C, and then recursively call each derivative (AB, AC, and BC). When AB is tested, it will also call its derivatives (the only one of which is ABC). Combos AC and BC have no further derivatives.

The biggest issue with this task is performance. With 3 columns in our sample table (columns A, B, and C), we have 7 possible combinations to check: A, B, C, AB, AC, BC, and ABC. The general formula for this is, given a table of n columns, there are 2^n - 1 possible combinations (the "-1" discounts the null combination). Those of you familiar with your powers of two will see how quickly this formula will grow - a table with just 10 columns has 1023 combinations to check; 25 columns means 33,554,431 possible unique key combinations. To expedite the querying I put a few shortcuts in place. The first is the "exclude columns" parameter that allows you to list columns that are already known to be attributes, and thus should not be considered as candidates for the UK. Also specifiable via parameter is a maximum search depth, which limits the number of columns to include in each check (how often do you deal with a unique key composed of 30 columns? I set this at 10 as a default). SQL Server has a built-in limitation of 32, which is the maximum depth of nested stored procedure calls. Lastly, I make use of a sample table for each check. Let's say that the table you wish to check has data on the order of millions of rows. Each combo check the spd does first checks a sample of 1000 rows; if the sample is not unique, then the full data set is not going to be unique, and thus it skips the full data set query.

The script includes two functions used by the stored procedure. These functions are used solely to parse the parameter for excluding columns. Remember to run the three CREATE FUNCTION/CREATE PROCEDURE statements separately. To execute the spd:

EXEC spd_Tool_Get_Unique_Column_Combos 'help' -- prints help information

EXEC spd_Tool_Get_Unique_Column_Combos 'test_table'

Feel free to email me at jesse@jessemclain.com with any questions or comments, and please visit www.jessemclain.com for more info about the author. My blog can be found at http://jessesql.blogspot.com.

Total article views: 1774 | Views in the last 30 days: 29
 
Related Articles
FORUM

unique clustered index on 2 columns

unique clustered index on 2 columns

FORUM

Unique Constraint for 2 columns in SQL 2005

Unique Constraint for 2 columns in SQL 2005

ARTICLE

TSQL Challenge 44 – Identify products that belong to each unique color

The first part of your job is to identify the unique color combinations of products and the second p...

FORUM

Unique constraints with encryption

I need encrypted columns to be unique

FORUM

Unique Constraint

Unique Constraint

 
Contribute

Join the most active online SQL Server Community

SQL knowledge, delivered daily, free:

Email address:  

You make SSC a better place

As a member of SQLServerCentral, you get free access to loads of fresh content: thousands of articles and SQL scripts, a library of free eBooks, a weekly database news roundup, a great Q & A platform… And it’s our huge, buzzing community of SQL Server Professionals that makes it such a success.

Join us!

Steve Jones
Editor, SQLServerCentral.com

Already a member? Jump in:

Email address:   Password:   Remember me: Forgotten your password?
Steve Jones