## Using Ranking Functions to Deduplicate Data

 Author Message Langston Montgomery SSC-Addicted Group: General Forum Members Points: 480 Visits: 508 Very nice article. Thanks for sharing. My two cents are that I would use the dense_rank() function to return the rank in tandem order. The rank() function tends to skip a rank depending on how many members belong to a group.`declare @AlphaList table (AlphaKey char);insert into @AlphaList(AlphaKey) values ('A');insert into @AlphaList(AlphaKey) values ('A');insert into @AlphaList(AlphaKey) values ('B');insert into @AlphaList(AlphaKey) values ('B');insert into @AlphaList(AlphaKey) values ('C');insert into @AlphaList(AlphaKey) values ('D');insert into @AlphaList(AlphaKey) values ('D');insert into @AlphaList(AlphaKey) values ('E');select RANK() over (order by AlphaKey) as Rank, dense_RANK() over (order by AlphaKey) as [Rank by density], ROW_NUMBER() over (order by AlphaKey) as RowNumber, AlphaKeyfrom @AlphaList;``Rank Rank by density RowNumber AlphaKey1 1 1 A1 1 2 A3 2 3 B3 2 4 B5 3 5 C6 4 6 D6 4 7 D8 5 8 E` Andre Guerreiro SSCarpal Tunnel Group: General Forum Members Points: 4163 Visits: 1517 I'm still getting ranking functions into my head slowly. :-DNice article about a very useful way to use them. Best regards,Andre Guerreiro NetoDatabase Analysthttp://www.softplan.com.brMCITPx1/MCTSx2/MCSE/MCSA lgidon Grasshopper Group: General Forum Members Points: 22 Visits: 59 I'm disappointed. I've been doing this for a while with ROW_NUMBER and PARTION and it works gr8 in all the situations I've come across.I can't figure out what advantage I can gain using the RANK function.I'm not sure there is an advantage, but I'll be happy if someone corrects me :-) Jeff Moden SSC Guru Group: General Forum Members Points: 503970 Visits: 44223 I'm torn... it's a nicely written article but the premise is at least twice as complicated as it needs to be and 2 times as slow as conventional methods, IMHO. Don't take my word for it, though... test it yourself... `--===========================================================================-- Create a multi-million row test table. This is not a measured-- part of the test.--===========================================================================--===== Conditionally drop the test table to make reruns easier IF OBJECT_ID('TempDB..#AlphaList','U') IS NOT NULL DROP TABLE #AlphaList;--===== Create and populate a multi-million row test table on the fly SELECT TOP 5000000 AlphaListID = IDENTITY(INT,1,1), AlphaKey = CHAR(ABS(CHECKSUM(NEWID()))%26+65) --A thru Z randomly INTO #AlphaList FROM Master.sys.All_Columns t1 CROSS JOIN Master.sys.All_Columns t2;--===== Create the expected indexes ALTER TABLE #AlphaList ADD PRIMARY KEY CLUSTERED (AlphaListID); CREATE NONCLUSTERED INDEX IX_#AlphaList_AlphaKey ON #AlphaList (AlphaKey);--===========================================================================-- Test 3 different methods with CPU and Duration measurements.--===========================================================================--===== "Clear the guns" PRINT REPLICATE('=',80); PRINT '========== Traditional RowNumber Method =========='; DBCC FREEPROCCACHE; DBCC DROPCLEANBUFFERS; SET STATISTICS TIME ON;--===== Test the codeselect AlphaKeyfrom ( select row_number() over(partition by AlphaKey order by AlphaKey) rownum, AlphaKey from #AlphaList ) alwhere rownum = 1 order by AlphaKey; SET STATISTICS TIME OFF;--===== "Clear the guns" PRINT REPLICATE('=',80); PRINT '========== Simple Distinct =========='; DBCC FREEPROCCACHE; DBCC DROPCLEANBUFFERS; SET STATISTICS TIME ON;--===== Test the codeSELECT DISTINCT AlphaKey FROM #AlphaListORDER BY AlphaKey; SET STATISTICS TIME OFF;--===== "Clear the guns" PRINT REPLICATE('=',80); PRINT '========== New Rank Method from Article =========='; DBCC FREEPROCCACHE; DBCC DROPCLEANBUFFERS; SET STATISTICS TIME ON;--===== Test the codewith AlphaRank(Rank, RowNumber, AlphaKey) as (select RANK() over (order by AlphaKey) as Rank, ROW_NUMBER() over (order by AlphaKey) as RowNumber, AlphaKeyfrom #AlphaList)select AlphaKeyfrom AlphaRankwhere Rank=RowNumber; SET STATISTICS TIME OFF;`Here are the results on my 8 year old desktop single p4 CPU running at 1.8Ghz with 1GB of ram on SQL Server 2005 Developer's Edition sp3.`(5000000 row(s) affected)========================================================================================== Traditional RowNumber Method ==========DBCC execution completed. If DBCC printed error messages, contact your system administrator.DBCC execution completed. If DBCC printed error messages, contact your system administrator.(26 row(s) affected)SQL Server Execution Times: CPU time = 2875 ms, elapsed time = 3005 ms.========================================================================================== Simple Distinct ==========DBCC execution completed. If DBCC printed error messages, contact your system administrator.DBCC execution completed. If DBCC printed error messages, contact your system administrator.(26 row(s) affected)SQL Server Execution Times: CPU time = 2985 ms, elapsed time = 3043 ms.========================================================================================== New Rank Method from Article ==========DBCC execution completed. If DBCC printed error messages, contact your system administrator.DBCC execution completed. If DBCC printed error messages, contact your system administrator.(26 row(s) affected)SQL Server Execution Times: CPU time = 7953 ms, elapsed time = 8037 ms.` --Jeff ModenRBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.First step towards the paradigm shift of writing Set Based code: Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.If you think its expensive to hire a professional to do the job, wait until you hire an amateur. -- Red Adair Helpful Links:How to post code problemsHow to post performance problemsForum FAQs PravB4u Right there with Babe Group: General Forum Members Points: 763 Visits: 603 Hi,I think we can achieve the same without rank function.SELECT * FROM GetMissingItems WHERE RowNumber = 1 Praveen D'saMCITP - Database Administrator 2008http://sqlerrors.wordpress.com Jeff Moden SSC Guru Group: General Forum Members Points: 503970 Visits: 44223 PravB4u (7/29/2010)Hi,I think we can achieve the same without rank function.SELECT * FROM GetMissingItems WHERE RowNumber = 1Ummm.... Not without the "Partition By" part of the OVER clause for ROW_NUMBER in what appears to be a missing CTE from your code. --Jeff ModenRBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.First step towards the paradigm shift of writing Set Based code: Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.If you think its expensive to hire a professional to do the job, wait until you hire an amateur. -- Red Adair Helpful Links:How to post code problemsHow to post performance problemsForum FAQs