As an ETL developer, I often receive files in which the natural key is either unidentified or misidentified, and I need to determine a combination of columns that uniquely identifies each row (the natural key, which usually also serves as the primary or unique key of the table holding the data) before I can integrate the file into the database. Often it is simply not clear what the unique key should be, and finding it requires a painstaking cycle of guessing and testing column combinations for uniqueness. To address this problem, I developed a stored procedure that automatically queries the table to discover a candidate for the unique key. At first I wanted the tool to report all unique combinations, but for performance reasons I later limited it to the first one discovered, and added a way to retest while excluding that result if it proves unsatisfactory.
This goal requires first generating a list of all unique combinations of column names. For a table with three columns, A, B, and C, we need to generate all of the combos: A, B, C, AB, AC, BC, and ABC (note that for our purposes, AB and BA are the same thing). My initial approach was a recursive process that would start with column A and then test every "strain" beginning with A: A, AB, ABC, AC (this is known as a depth-first search). However, there is a problem with this approach. Say that only the combination BC is unique in our table. When the algorithm tests ABC, the result is positive, and the algorithm stops upon success (if BC is unique, then any strain containing BC is unique as well). But what we really want is the minimal set of unique columns: the BC result. Ultimately I developed a hybrid recursive/iterative solution that tests combos from smallest size to largest. Using our ABC table above, the tool first tests A, B, and C, and then recursively calls each derivative (AB, AC, and BC). When AB is tested, it also calls its derivatives (the only one of which is ABC); combos AC and BC have no further derivatives.
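The smallest-to-largest search can be sketched as follows. This is a language-agnostic Python illustration, not the actual T-SQL of the stored procedure; the sample data, `is_unique`, and `first_minimal_unique_combo` are hypothetical names, and `itertools.combinations` stands in for the recursive derivative-generation described above (it also yields each size in order, AB and BA counted once):

```python
from itertools import combinations

# Hypothetical sample data: three columns A, B, C where only the
# pair (B, C) uniquely identifies each row.
columns = ["A", "B", "C"]
rows = [
    {"A": 1, "B": 1, "C": 1},
    {"A": 1, "B": 1, "C": 2},
    {"A": 1, "B": 2, "C": 1},
]

def is_unique(combo, rows):
    """A combination is unique if no two rows share the same values for it."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in combo)
        if key in seen:
            return False
        seen.add(key)
    return True

def first_minimal_unique_combo(columns, rows, max_depth=10):
    """Test combos from smallest size to largest; return the first unique one,
    so a superset like ABC can never be reported when BC already suffices."""
    for size in range(1, min(len(columns), max_depth) + 1):
        for combo in combinations(columns, size):
            if is_unique(combo, rows):
                return combo
    return None

print(first_minimal_unique_combo(columns, rows))  # ('B', 'C')
```

Because sizes are exhausted in increasing order, the first hit is guaranteed to be a minimal unique combination, which is exactly why the depth-first version had to be abandoned.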
The biggest issue with this task is performance. With 3 columns in our sample table (columns A, B, and C), we have 7 possible combinations to check: A, B, C, AB, AC, BC, and ABC. In general, a table of n columns has 2^n - 1 possible combinations (the "-1" discounts the empty combination). Those of you familiar with your powers of two will see how quickly this grows: a table with just 10 columns has 1,023 combinations to check; 25 columns means 33,554,431 possible unique key combinations. To expedite the querying I put a few shortcuts in place. The first is an "exclude columns" parameter that lets you list columns already known to be attributes, which should therefore not be considered as candidates for the UK. Another parameter sets the maximum search depth, limiting the number of columns included in each check (how often do you deal with a unique key composed of 30 columns? I set the default to 10). SQL Server itself imposes a limit of 32, the maximum depth of nested stored procedure calls. Lastly, I make use of a sample table for each check. Say the table you wish to check holds millions of rows. For each combination, the spd first checks a sample of 1,000 rows; if the sample is not unique, the full data set cannot be unique either, so the full-table query is skipped.
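The sampling shortcut can be sketched in Python (again a language-agnostic illustration, not the actual T-SQL; `has_duplicates` and `combo_is_unique` are hypothetical names, and the 1,000-row default mirrors the sample size described above):

```python
def has_duplicates(keys):
    """True if any key value appears more than once."""
    seen = set()
    for k in keys:
        if k in seen:
            return True
        seen.add(k)
    return False

def combo_is_unique(rows, combo, sample_size=1000):
    """Check a cheap sample first: duplicates in any sample imply
    duplicates in the full data set, so the full scan can be skipped."""
    sample_keys = (tuple(r[c] for c in combo) for r in rows[:sample_size])
    if has_duplicates(sample_keys):
        return False                      # full-table check skipped
    all_keys = (tuple(r[c] for c in combo) for r in rows)
    return not has_duplicates(all_keys)

# The 2^n - 1 growth of the search space:
for n in (3, 10, 25):
    print(n, 2 ** n - 1)                  # prints 3 7, 10 1023, 25 33554431
```

Note the asymmetry that makes the shortcut sound: a duplicate in the sample proves the full set is non-unique, but a unique sample proves nothing, so the full scan still runs in that case.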
The script includes two functions used by the stored procedure; they serve solely to parse the exclude-columns parameter. Remember to run the three CREATE FUNCTION/CREATE PROCEDURE statements separately. To execute the spd:
EXEC spd_Tool_Get_Unique_Column_Combos 'help' -- prints help information
EXEC spd_Tool_Get_Unique_Column_Combos 'test_table'
Feel free to email me at email@example.com with any questions or comments, and please visit www.jessemclain.com for more info about the author. My blog can be found at http://jessesql.blogspot.com.