## Using Ranking Functions to Deduplicate Data

 Author Message andrew.riley Valued Member Group: General Forum Members Points: 57 Visits: 184 Comments posted to this topic are about the item Using Ranking Functions to Deduplicate Data scottm30 SSC Eights! Group: General Forum Members Points: 899 Visits: 383 Nice article, thanks for sharing. The same thing can be achieved with just the row_number function as well.select AlphaKeyfrom ( select row_number() over(partition by AlphaKey order by AlphaKey) rownum, AlphaKey from @AlphaList ) alwhere rownum = 1 order by 1; ziangij SSCarpal Tunnel Group: General Forum Members Points: 4866 Visits: 382 yep, i too use this query to remove duplicates...was wondering, which query is faster, this one or the one mentioned in the article...scottm30 (7/26/2010)Nice article, thanks for sharing. The same thing can be achieved with just the row_number function as well.select AlphaKeyfrom ( select row_number() over(partition by AlphaKey order by AlphaKey) rownum, AlphaKey from @AlphaList ) alwhere rownum = 1 order by 1; david.skov SSC-Enthusiastic Group: General Forum Members Points: 101 Visits: 100 with the A,B,C,D,E example, the select distinct syntax would be better imho. phil.wood 94423 SSC Veteran Group: General Forum Members Points: 229 Visits: 65 How about good ol' grouping to remove duplicates and supply any of the minimum, maximum, sum, count, average, etc. etc?select ItemNumber, min(UnitCost) as MinCost from MissingItemsgroup by ItemNumber;ItemNumber MinCost777 10.10888 13.13999 16.16 philip.cullingworth SSCommitted Group: General Forum Members Points: 1874 Visits: 1008 I was interested to see what the difference in performance between these queries as I would always have used Row_Number with a partition before.On the Northwind database (SQL 2005) I ran the following 2 queriesselect a.ProductID ,a.UnitPricefrom (select ProductID ,UnitPrice ,Rank() over (order by ProductID) as Ranking ,Row_Number() over (order by ProductID,UnitPrice) as RowNumber from [Order Details] ) as awhere a.Ranking=a.RowNumberorder by a.ProductIDselect a.ProductID ,a.UnitPricefrom (select ProductID ,UnitPrice ,Row_Number() over (partition by ProductID order by UnitPrice) as RowNumber from [Order Details] ) as awhere a.RowNumber=1order by a.ProductIDIn the Actual Execution Plan, the first query showed up as 64% and the second as 36%.Far from exhaustive, but I'll probably stick to Row_Number with Partition by.In (partial) answer to Phil Wood, if all I wanted was the lowest price for each product then I would use grouping, I'd probably use this method if I wanted to find the customer or order date as well, changing the code toselect a.ProductID ,a.UnitPrice ,b.CustomerID ,b.OrderDatefrom (select ProductID ,UnitPrice ,OrderID ,Row_Number() over (partition by ProductID order by UnitPrice) as RowNumber from [Order Details] ) as a inner join Orders as b on a.OrderID=b.OrderIDwhere a.RowNumber=1order by a.ProductID aszendi SSC Journeyman Group: General Forum Members Points: 88 Visits: 7 I agree. The simplest way to deduplicate is using grouping and aggregate functions as needed.regardsÁron Bhavesh-1094084 SSChasing Mays Group: General Forum Members Points: 635 Visits: 110 I had to remove unitprice from the orderby to get deduplicated results.declare @TempTable table (itemnumber int, unitprice decimal(18, 2));insert into @TempTable values(777, 10.10), (777, 11.11),(777, 12.12),(888, 13.13),(888, 14.14),(888, 15.15),(999, 16.16),(999, 17.17),(999, 18.18);with GetMissingItems(Rank, RowNumber, ItemNumber, UnitCost) as( select RANK() over (order by ItemNumber) as Rank, ROW_NUMBER() over (order by ItemNumber) as RowNumber, itemnumber, unitprice from @TempTable ) select * from GetMissingItemswhere Rank = RowNumber; Bhavesh.NET and SQL Server Blog messineo SSC Journeyman Group: General Forum Members Points: 79 Visits: 4 In the example you provided you used one column. Say I have 4 columns and I want to deduplicate based on those. In other words I want to rank against those columns -- how would the SQL statements change? philip.cullingworth SSCommitted Group: General Forum Members Points: 1874 Visits: 1008 messineo,If you have multiple columns that you want to deduplicate on you will need to include all of them in whichever method you use.Here is some sample code using each method that should work (all based on the Northwind database). Note that in all these cases the data will be de-duplicated but which row is selected for each set of duplicates can not be predicted (so if you want the first order for each set you will need different code).select a.ProductID ,a.UnitPrice ,a.Quantity ,a.Discountfrom (select ProductID ,UnitPrice ,Quantity ,Discount ,Rank() over (order by ProductID,UnitPrice,Quantity,Discount) as Ranking ,Row_Number() over (order by ProductID,UnitPrice,Quantity,Discount) as RowNumber from [Order Details] ) as awhere a.Ranking=a.RowNumberselect a.ProductID ,a.UnitPrice ,a.Quantity ,a.Discountfrom (select ProductID ,UnitPrice ,Quantity ,Discount ,Row_Number() over (partition by ProductID,UnitPrice,Quantity,Discount order by ProductID) as RowNumber from [Order Details] ) as awhere a.RowNumber=1select ProductID ,UnitPrice ,Quantity ,Discountfrom [Order Details]group by ProductID ,UnitPrice ,Quantity ,Discount