Identify Unique Columns

Jesse-McLain, 2012-08-21 (first published: 2008-01-28)

As an ETL developer, I often receive files in which the natural key is either unidentified or misidentified, and I need to determine a combination of columns that uniquely identifies each row (the natural key, which is usually also the primary or unique key of the table holding this data), in order to integrate the data from that file into the database. Many times it is simply not clear what the unique key should be, and this requires a painstaking cycle of guessing and testing combinations of columns for unique values. To address this problem, I developed a stored procedure that will automatically query the table to discover a candidate for the unique key. At first, I wanted to create a tool that would report all unique combos, but due to performance reasons, I later limited that to just the first one discovered. I added a way to retest and exclude that result if the first one is not satisfactory.

This goal requires first creating a list of all of the unique combinations of column names. If we have a table with 3 columns, A, B, and C, then we need to generate all of the combos: A, B, C, AB, AC, BC, ABC (note that for our purposes, AB and BA are the same thing). When I initially approached this problem, I developed a recursive process that would start with column A, then test all "strains" beginning with A: A, AB, ABC, AC (this is known as a depth-first search). However, there's a problem with this approach. Let's say that only the combination BC is unique in our table. When our algorithm tests ABC, the result is positive, and the algorithm then stops upon success (if BC is unique, then any strain of BC is unique also). But what we really want is the minimum set of unique columns, the BC result. Ultimately I developed a hybrid recursive/iterative solution that tests combos from smallest size to largest. Using our ABC table above, the tool will first test A, B, and C, and then recursively call each derivative (AB, AC, and BC). When AB is tested, it will also call its derivatives (the only one of which is ABC). Combos AC and BC have no further derivatives.

The biggest issue with this task is performance. With 3 columns in our sample table (columns A, B, and C), we have 7 possible combinations to check: A, B, C, AB, AC, BC, and ABC. The general formula for this is, given a table of n columns, there are 2^n - 1 possible combinations (the "-1" discounts the null combination). Those of you familiar with your powers of two will see how quickly this formula will grow - a table with just 10 columns has 1023 combinations to check; 25 columns means 33,554,431 possible unique key combinations. To expedite the querying I put a few shortcuts in place. The first is the "exclude columns" parameter that allows you to list columns that are already known to be attributes, and thus should not be considered as candidates for the UK. Also specifiable via parameter is a maximum search depth, which limits the number of columns to include in each check (how often do you deal with a unique key composed of 30 columns? I set this at 10 as a default). SQL Server has a built-in limitation of 32, which is the maximum depth of nested stored procedure calls. Lastly, I make use of a sample table for each check. Let's say that the table you wish to check has data on the order of millions of rows. Each combo check the spd does first checks a sample of 1000 rows; if the sample is not unique, then the full data set is not going to be unique, and thus it skips the full data set query.

The script includes two functions used by the stored procedure. These functions are used solely to parse the parameter for excluding columns. Remember to run the three CREATE FUNCTION/CREATE PROCEDURE statements separately. To execute the spd:

EXEC spd_Tool_Get_Unique_Column_Combos 'help' -- prints help information

EXEC spd_Tool_Get_Unique_Column_Combos 'test_table'

Feel free to email me at jesse@jessemclain.com with any questions or comments, and please visit www.jessemclain.com for more info about the author. My blog can be found at http://jessesql.blogspot.com.

CREATE FUNCTION [dbo].[fn_Split](@sText varchar(8000), @sDelim varchar(20) = ' ') RETURNS @retArray TABLE (idx smallint Primary Key, value varchar(8000)) AS BEGIN DECLARE @idx smallint, @value varchar(8000), @bcontinue bit, @iStrike smallint, @iDelimlength tinyint IF @sDelim = 'Space' BEGIN SET @sDelim = ' ' END SET @idx = 0 SET @sText = LTrim(RTrim(@sText)) SET @iDelimlength = DATALENGTH(@sDelim) SET @bcontinue = 1 IF NOT ((@iDelimlength = 0) or (@sDelim = 'Empty')) BEGIN WHILE @bcontinue = 1 BEGIN --If you can find the delimiter in the text, retrieve the first element and --insert it with its index into the return table. IF CHARINDEX(@sDelim, @sText)>0 BEGIN SET @value = SUBSTRING(@sText,1, CHARINDEX(@sDelim,@sText)-1) BEGIN INSERT @retArray (idx, value) VALUES (@idx, @value) END --Trim the element and its delimiter from the front of the string. --Increment the index and loop. SET @iStrike = DATALENGTH(@value) + @iDelimlength SET @idx = @idx + 1 SET @sText = LTrim(Right(@sText,DATALENGTH(@sText) - @iStrike)) END ELSE BEGIN --If you can’t find the delimiter in the text, @sText is the last value in --@retArray. SET @value = @sText BEGIN INSERT @retArray (idx, value) VALUES (@idx, @value) END --Exit the WHILE loop. SET @bcontinue = 0 END END END ELSE BEGIN WHILE @bcontinue=1 BEGIN --If the delimiter is an empty string, check for remaining text --instead of a delimiter. Insert the first character into the --retArray table. Trim the character from the front of the string. --Increment the index and loop. IF DATALENGTH(@sText)>1 BEGIN SET @value = SUBSTRING(@sText,1,1) BEGIN INSERT @retArray (idx, value) VALUES (@idx, @value) END SET @idx = @idx+1 SET @sText = SUBSTRING(@sText,2,DATALENGTH(@sText)-1) END ELSE BEGIN --One character remains. --Insert the character, and exit the WHILE loop. INSERT @retArray (idx, value) VALUES (@idx, @sText) SET @bcontinue = 0 END END END RETURN END /*************************************************************************************************** Written by: Jesse McLain Purpose: Given a delimited string, this function will crack it open, parse out the items in the list, wrap the items in strings passed into @Item_Marker param, and then concatenate the new items back into list Input Parameters: see below Output Parameters: returns varchar(max) Called By: user *************************************************************************************************** Update History Date Author Purpose 12/05/2007 Jesse McLain Created function *************************************************************************************************** ToDo List Date Added By Business Need *************************************************************************************************** Notes ***************************************************************************************************/CREATE FUNCTION [dbo].[fn_Wrap_Items_In_List] (@Original_List varchar(1000), -- the original list to crack open @List_Delimiters varchar(100) = ',', -- the string serving as item delimiters @Item_Marker varchar(100) = '') -- string to wrap the items in before putting back together RETURNS varchar(max) AS BEGIN DECLARE @NewList varchar(max) SET @NewList = '' DECLARE @ListItem varchar(200) DECLARE List_Items_Cursor CURSOR FOR SELECT Value FROM dbo.fn_Split(@Original_List, @List_Delimiters) OPEN List_Items_Cursor FETCH NEXT FROM List_Items_Cursor INTO @ListItem WHILE @@FETCH_STATUS = 0 BEGIN SET @NewList = @NewList + @Item_Marker + @ListItem + @Item_Marker + @List_Delimiters FETCH NEXT FROM List_Items_Cursor INTO @ListItem END CLOSE List_Items_Cursor DEALLOCATE List_Items_Cursor IF @NewList <> '' SET @NewList = LEFT(@NewList, LEN(@NewList) - 1) RETURN @NewList END /*************************************************************************************************** Written by: Jesse McLain Purpose: To return all combinations of size n of values in a table's column Input Parameters: see below Output Parameters: none Called By: user *************************************************************************************************** Update History Date Author Purpose 11/29/2007 Jesse McLain Created spd 11/30/2007 Jesse McLain Chgd approach from depth-first to breadth-first 12/03/2007 Jesse McLain Added use of @uniq_fnd, to short-circuit further searching once a unique combo w/in a strain is found 12/05/2007 Jesse McLain Expanded use of @uniq_fnd so that the spd returns it, and in the loop that recursively calls itself, added a short circuit in the WHILE clause to stop searching if success found in deeper strain. Added use of fn_Wrap_Items_In_List to automatically wrap column names in @Exclude_Columns_List in quotes for SQL statement. *************************************************************************************************** ToDo List Date Added By Business Need 12/03/2007 Jesse McLain Add start-at/stop-at parameters, so that user can exclude searching columns based on range of column numbers *************************************************************************************************** Notes 11/30/07 10:51:50 AM - this version is the most brute force method possible. It works, but it is terribly inefficient. It is also the most simple implementation possible. Because of its depth-first searching, it over-reports uniqueness. For example, let's say we're checking columns A, B, and C. If A is not unique, then it checks all strains beginning with A. But let's say that column B is unique. Then combo AB is unique, and it is redundant to check both AB and B for uniqueness, but the depth-first method will do exactly that. In order to improve this, I need to change this to a breadth-first approach. To do that, keep the existing "Col_Names_Cursor" cursor. Instead of putting the recursive calls inside of it, have it check the uniqueness of the current strains, report them if they are, and insert the non-unique combos to check later into a local temp table. Open a second cursor on that temp table after "Col_Names_Cursor" finishes, which will then make recursive calls on those non-unique combos. I think that this new breadth-first approach will be ideal. 11/30/07 11:35:37 AM - I don't think that using the sample table with the new breadth-first approach will provide any benefit. The reason being, whether the combo in the sample table is unique or not, we still have to check the combo in the full table. Pseudocode of breadth-first with sample table usage: LOOP thru cursor of current strain IF combo in sample table is unique IF combo in whole table is unique INSERT into results table (reporting as unique) ELSE INSERT into temp table ELSE INSERT into temp table ENDLOOP 11/30/07 11:48:48 AM - the pseudocode above shows that the sample table will be useful, as it allows us to short-circuit checking the whole table for non-uniqueness if the sample is not unique. Finishing out the pseudocode for the second loop: LOOP thru cursor of temp table CALL spd recursively on new strain ENDLOOP ***************************************************************************************************/CREATE PROCEDURE [dbo].[spd_Tool_Get_Unique_Column_Combos] @Table_Name varchar(250), -- req'd; the name of the table on which to run report @Exclude_Columns_List varchar(1000) = '', -- opt'l; list of columns to exclude from checking @Max_Combo_Size tinyint = 10, -- opt'l; the maximum recursive search depth @Curr_Combo_Size tinyint = 0, -- (USER DOES NOT SPECIFY) max # columns in current -- column combo strain ( >= 1 and <= #columns) @Start_With tinyint = 1, -- (USER DOES NOT SPECIFY) the value item number to start at @Already_Picked varchar(1000) = '' -- (USER DOES NOT SPECIFY) any values that must be included -- in the current "strain" of column combos AS SET NOCOUNT ON IF @Table_Name = 'help' BEGIN PRINT 'PROCEDURE [dbo].[spd_Tool_Get_Unique_Column_Combos] ' PRINT ' @Table_Name varchar(250), -- REQUIRED; the name of the table on which to run report' PRINT ' @Exclude_Columns_List varchar(1000) = '''', -- OPTIONAL; list of columns to exclude from checking' PRINT ' @Max_Combo_Size tinyint = 10 -- OPTIONAL; the max #columns to check within a combo' RETURN END DECLARE @Find_First_Only tinyint -- set to 1 if you want to stop searching for more results if you found one SET @Find_First_Only = 1 -- if this is the initial call to this spd by user, then do some temp table creation: IF @Curr_Combo_Size = 0 BEGIN -- create a sample table from the source table to expedite uniqueness checking: DECLARE @Sample_Size smallint SET @Sample_Size = 1000 DECLARE @Sample_Table varchar(100) SET @Sample_Table = '##Sample_Table' IF EXISTS(SELECT 1 FROM TempDb.dbo.SysObjects WHERE NAME = '##Sample_Table') DROP TABLE ##Sample_Table DECLARE @sql varchar(8000) SET @sql = 'SELECT TOP ' + LTRIM(STR(@Sample_Size)) + ' * INTO ' + @sample_table + ' FROM ' + @Table_Name EXEC(@sql) -- create a temp table ##ColumnsAvailable to hold the columns to check: IF EXISTS(SELECT 1 FROM TempDb.dbo.SysObjects WHERE NAME = '##ColumnsAvailable') DROP TABLE ##ColumnsAvailable CREATE TABLE ##ColumnsAvailable (ColName varchar(200), ColNum smallint IDENTITY(1,1)) DECLARE @Exclude_Columns_List_Clause varchar(max) SET @Exclude_Columns_List_Clause = CASE WHEN @Exclude_Columns_List IS NOT NULL AND @Exclude_Columns_List <> '' THEN ' AND C.Name NOT IN (' + dbo.fn_Wrap_Items_In_List(@Exclude_Columns_List, ',', '''') + ')' ELSE '' END SET @sql = 'INSERT INTO ##ColumnsAvailable (ColName) SELECT C.Name FROM SysColumns C JOIN SysObjects O ON O.Id = C.Id WHERE O.Name = ''' + @Table_Name + '''' + @Exclude_Columns_List_Clause EXEC(@sql) IF EXISTS(SELECT * FROM TempDb.dbo.SysObjects WHERE NAME = '##Uniq_Col_Combo_Results') DROP TABLE ##Uniq_Col_Combo_Results CREATE TABLE ##Uniq_Col_Combo_Results (Combo_Txt varchar(1000), Pick_Size tinyint) END -- SQL server 2005 can only have 32 nested spd calls, so we have to limit the @Max_Combo_Size: SET @Max_Combo_Size = CASE WHEN @Max_Combo_Size > 30 THEN 30 ELSE @Max_Combo_Size END DECLARE @Curr_ColName varchar(200) -- holds values for cursor Col_Names_Cursor DECLARE @Curr_ColNum tinyint -- holds values for cursor Col_Names_Cursor DECLARE @Curr_Combo_ToChk varchar(1000) DECLARE @Curr_Combo_Size_Plus1 tinyint SET @Curr_Combo_Size_Plus1 = @Curr_Combo_Size + 1 DECLARE @uniq_fnd tinyint SET @uniq_fnd = 0 DECLARE @Curr_ColNum_Plus1 tinyint -- this is a temp table that holds strains to test in recursive calls: CREATE TABLE #Non_Uniqs (Combo_Txt varchar(1000), Col_Num tinyint) DECLARE Col_Names_Cursor CURSOR LOCAL FAST_FORWARD FOR SELECT ColName, ColNum FROM ##ColumnsAvailable WHERE ColNum >= @Start_With ORDER BY ColNum OPEN Col_Names_Cursor -- main loop FETCH NEXT FROM Col_Names_Cursor INTO @Curr_ColName, @Curr_ColNum WHILE @@FETCH_STATUS = 0 AND @uniq_fnd = 0 BEGIN SET @Curr_ColNum_Plus1 = @Curr_ColNum + 1 SET @Curr_Combo_ToChk = CASE WHEN @Already_Picked = '' THEN '' ELSE @Already_Picked + ',' END + @Curr_ColName SET @uniq_fnd = 0 -- first we're going to check the sample table for uniqueness of the current combo IF EXISTS(SELECT 1 FROM TempDb.dbo.SysObjects WHERE NAME = '##Dupes') DROP TABLE ##Dupes SET @sql = 'SELECT ' + @Curr_Combo_ToChk + ', cnt_dupes=COUNT(*) INTO ##Dupes FROM ##Sample_Table GROUP BY ' + @Curr_Combo_ToChk + ' HAVING COUNT(*) > 1' EXEC(@sql) IF NOT EXISTS(SELECT TOP 1 1 FROM ##Dupes) -- true if combo is unique in sample table BEGIN -- if the combo is unique in the sample table, then we check it in the whole table: DROP TABLE ##Dupes SET @sql = 'SELECT ' + @Curr_Combo_ToChk + ', cnt_dupes=COUNT(*) INTO ##Dupes FROM ' + @Table_Name + ' GROUP BY ' + @Curr_Combo_ToChk + ' HAVING COUNT(*) > 1' EXEC(@sql) IF NOT EXISTS(SELECT TOP 1 1 FROM ##Dupes) -- true if combo is unique in whole table BEGIN -- if it's unique here, then we're done with this strain: --PRINT 'COMBINATION "' + @Curr_Combo_ToChk + '" IS UNIQUE.' INSERT INTO ##Uniq_Col_Combo_Results (Combo_Txt) VALUES (@Curr_Combo_ToChk) SET @uniq_fnd = 1 END ELSE -- combo is not unique in whole table BEGIN --PRINT 'Combination "' + @Curr_Combo_ToChk + '" is not unique.' INSERT INTO #Non_Uniqs (Combo_Txt, Col_Num) VALUES (@Curr_Combo_ToChk, @Curr_ColNum_Plus1) END END ELSE -- combo is not unique in sample table BEGIN --PRINT 'Combination "' + @Curr_Combo_ToChk + '" is not unique.' INSERT INTO #Non_Uniqs (Combo_Txt, Col_Num) VALUES (@Curr_Combo_ToChk, @Curr_ColNum_Plus1) END FETCH NEXT FROM Col_Names_Cursor INTO @Curr_ColName, @Curr_ColNum END CLOSE Col_Names_Cursor DEALLOCATE Col_Names_Cursor /* now we have a temp table, #Non_Uniqs, holding the combos from the querying above that are non-unique in the table of interest. We want to determine the uniqueness of their derivatives (which are combos that have them as a beginning) */ IF @Curr_Combo_Size < @Max_Combo_Size - 1 -- make sure we're not at search depth limit AND @uniq_fnd = 0 -- short-circuit further searching by success BEGIN DECLARE Non_Uniqs_Cursor CURSOR LOCAL FAST_FORWARD FOR SELECT Combo_Txt, Col_Num FROM #Non_Uniqs OPEN Non_Uniqs_Cursor FETCH NEXT FROM Non_Uniqs_Cursor INTO @Curr_Combo_ToChk, @Curr_ColNum_Plus1 WHILE @@FETCH_STATUS = 0 AND @uniq_fnd = 0 BEGIN EXEC @uniq_fnd = spd_Tool_Get_Unique_Column_Combos @Table_Name, @Exclude_Columns_List, @Max_Combo_Size, @Curr_Combo_Size_Plus1, @Curr_ColNum_Plus1, @Curr_Combo_ToChk FETCH NEXT FROM Non_Uniqs_Cursor INTO @Curr_Combo_ToChk, @Curr_ColNum_Plus1 END CLOSE Non_Uniqs_Cursor DEALLOCATE Non_Uniqs_Cursor END -- if this is the initial call to this spd by user, then do some clean up and report results: IF @Curr_Combo_Size = 0 BEGIN IF EXISTS(SELECT TOP 1 1 FROM ##Uniq_Col_Combo_Results) BEGIN IF (SELECT COUNT(*) FROM ##Uniq_Col_Combo_Results) > 1 PRINT 'Here are the combinations of columns found to be unique in "' + @Table_Name + '":' ELSE IF @Find_First_Only = 1 PRINT 'Here is the first combination of columns found to be unique in "' + @Table_Name + '":' ELSE PRINT 'Here is the only combination of columns found to be unique in "' + @Table_Name + '":' DECLARE Uniqs_Cursor CURSOR LOCAL FAST_FORWARD FOR SELECT Combo_Txt FROM ##Uniq_Col_Combo_Results OPEN Uniqs_Cursor FETCH NEXT FROM Uniqs_Cursor INTO @Curr_Combo_ToChk WHILE @@FETCH_STATUS = 0 BEGIN PRINT RTRIM(@Curr_Combo_ToChk) FETCH NEXT FROM Uniqs_Cursor INTO @Curr_Combo_ToChk END CLOSE Uniqs_Cursor DEALLOCATE Uniqs_Cursor END ELSE BEGIN PRINT 'No combinations of columns were found to be unique in "' + @Table_Name + '"' END DROP TABLE ##ColumnsAvailable DROP TABLE ##Sample_Table DROP TABLE ##Uniq_Col_Combo_Results END ELSE RETURN @uniq_fnd

Adding Years, Months and Days

by Pete Revell

SQLServerCentral.com

T-SQL

Adding Years, Months and Days to Years, Months and Days, with Bonus UK Public Holiday calculation.

★ ★ ★ ★ ★ ★ ★ ★ ★ ★

(10)

You rated this post out of 5. Change rating

2017-03-14

3,240 reads

Discuss

Hidden RBAR: Triangular Joins

by Jeff Moden

SQLServerCentral.com

T-SQL

Just because it looks set based, doesn't mean it is. T-SQL Crackerjack Jeff Moden takes a look at a mistake that many query writers make.

★ ★ ★ ★ ★ ★ ★ ★ ★ ★

(285)

You rated this post out of 5. Change rating

2009-01-16 (first published: 2007-12-06)

60,548 reads

Discuss

Gathering Random Data

by Brian Knight

SQLServerCentral.com

T-SQL

Brian Knight recently had the basic need to retrieve a record from the database at random. What seemed to be an easy task quickly became a complex one. This case showed an interesting quirk with T-SQL that was resolved in an equally quirky way. This quick article shows you a method to retrieve random data or randomize the display of data.

★ ★ ★ ★ ★ ★ ★ ★ ★ ★

(1)

You rated this post out of 5. Change rating

2003-03-26

9,423 reads

Discuss

Reusing Identities

by Dinesh Priyankara

SQLServerCentral.com

T-SQL

Ever wanted to know how to reuse identity values after they're deleted? In this example by Dinesh, he shows you two methods to reuse the wholes in identity rows.

★ ★ ★ ★ ★ ★ ★ ★ ★ ★

You rated this post out of 5. Change rating

2003-02-18

8,743 reads

Discuss

Using Exotic Joins in SQL - Part 2

by Chris Cubley

SQLServerCentral.com

T-SQL

In the previous article, you saw how the BETWEEN operator could be used in joins to solve problems dealing with range-based data. In this article, Chris Cubley will show you how to take joins even further by using multiple criteria in joins as well as using the greater than, less than, and not equals operators in joins.

★ ★ ★ ★ ★ ★ ★ ★ ★ ★

You rated this post out of 5. Change rating

2003-02-05

13,846 reads

Discuss

Identify Unique Columns

Rate

Share

Share

Rate

Identify Unique Columns

Rate

Share

Share

Rate

Related content

Adding Years, Months and Days

Hidden RBAR: Triangular Joins

Gathering Random Data

Reusing Identities

Using Exotic Joins in SQL - Part 2