SQL Server Central is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
Search:  
 
 

Discover Unique Columns

By Jesse McLain, 2008/04/16

Total article views: 429 | Views in the last 30 days: 366

As an ETL developer, I often receive files in which the natural key is either unidentified or misidentified, and I need to determine a unique combination of columns in order to integrate data from that file into the database. Many times it is simply not clear what the unique key should be, and this requires a painstaking cycle of guessing and testing combinations of columns for unique values. To address this problem, I developed a stored procedure that will automatically query the table to discover a candidate for the unique key. At first, I wanted to create a tool that would report all unique combos, but due to performance reasons, I later limited that to just the first one discovered. I added a way to retest and exclude that result if the first one is not satisfactory.

This goal requires first creating a list of all of the unique combinations of column names. If we have a table with 3 columns, A, B, and C, then we need to generate all of the combos: A, B, C, AB, AC, BC, ABC (note that for our purposes, AB and BA are the same thing). When I initially approached this problem, I developed a recursive process that would start with column A, then test all "strains" beginning with A: A, AB, ABC, AC (this is known as a depth-first search). However, there's a problem with this approach. Let's say that only the combination BC is unique in our table. When our algorithm tests ABC, the result is positive, and the algorithm then stops upon success (if BC is unique, then any strain of BC is unique also). But what we really want is the minimum set of unique columns, the BC result. Ultimately I developed a hybrid recursive/iterative solution that tests combos from smallest size to largest. Using our ABC table above, the tool will first test A, B, and C, and then recursively call each derivative (AB, AC, and BC). When AB is tested, it will also call its derivatives (the only one of which is ABC). Combos AC and BC have no further derivatives.

The biggest issue with this task is performance. With 3 columns in our sample table (columns A, B, and C), we have 7 possible combinations to check: A, B, C, AB, AC, BC, and ABC. The general formula for this is, given a table of n columns, there are 2^n - 1 possible combinations (the "-1" discounts the null combination). Those of you familiar with your powers of two will see how quickly this formula will grow - a table with just 10 columns has 1023 combinations to check; 25 columns means 33,554,431 possible unique key combinations. To expedite the querying I put a few shortcuts in place. The first is the "exclude columns" parameter that allows you to list columns that are already known to be attributes, and thus should not be considered as candidates for the UK. Also specifiable via parameter is a maximum search depth, which limits the number of columns to include in each check (how often do you deal with a unique key composed of 30 columns? I set this at 10 as a default). SQL Server has a built-in limitation of 32, which is the maximum depth of nested stored procedure calls. Lastly, I make use of a sample table for each check. Let's say that the table you wish to check has data on the order of millions of rows. Each combo check the spd does first checks a sample of 1000 rows; if the sample is not unique, then the full data set is not going to be unique, and thus it skips the full data set query.

The script includes two functions used by the stored procedure. These functions are used solely to parse the parameter for excluding columns. Remember to run the three CREATE FUNCTION/CREATE PROCEDURE statements separately. To execute the spd:

EXEC spd_Tool_Get_Unique_Column_Combos 'help' -- prints help information

EXEC spd_Tool_Get_Unique_Column_Combos 'test_table'

Feel free to email me at jesse@phnx.biz with any questions or comments, and please visit www.phnx.biz for more info about the author.

 

By Jesse McLain, 2008/04/16

Total article views: 429 | Views in the last 30 days: 366
Your response
 
 
Related tags

Advanced Querying    
ETL    
T-SQL    
 
Already registered?  

Free registration required

To read the rest of this article, and access thousands of other articles, we ask you to register on the site and subscribe to our newsletters.

Register

E-mail address:
Password:
Password (confirm):

  

Subscriptions

We ask you to register on the site and subscribe to our newsletters. Subscribing to our newsletters gets you:

  • ALL of our content (thousands of articles, scripts, and forum postings)
  • A daily newsletter (example)
  • A weekly news round up (example)
  • The opportunity to ask and answer questions in our forums
  • A daily Question of the Day to test and help you increase your knowledge of SQL Server.

We ask that you give the newsletter a try for a week. Over 200,000 SQL Server Professionals a day find it entertaining and useful. If not, you are welcome to unsubscribe at anytime.

Steve Jones
Editor, SQLServerCentral.com