## Using Ranking Functions to Deduplicate Data

 Author Message andrew.riley Grasshopper Group: General Forum Members Points: 19 Visits: 184 Comments posted to this topic are about the item Using Ranking Functions to Deduplicate Data scottm30 Say Hey Kid Group: General Forum Members Points: 673 Visits: 383 Nice article, thanks for sharing. The same thing can be achieved with just the row_number function as well.`select AlphaKeyfrom (   select row_number() over(partition by AlphaKey order by AlphaKey) rownum,    AlphaKey    from @AlphaList ) alwhere rownum = 1   order by 1;` ziangij SSCrazy Group: General Forum Members Points: 2907 Visits: 374 yep, i too use this query to remove duplicates...was wondering, which query is faster, this one or the one mentioned in the article...scottm30 (7/26/2010)Nice article, thanks for sharing. The same thing can be achieved with just the row_number function as well.`select AlphaKeyfrom (   select row_number() over(partition by AlphaKey order by AlphaKey) rownum,    AlphaKey    from @AlphaList ) alwhere rownum = 1   order by 1;` david.skov Forum Newbie Group: General Forum Members Points: 7 Visits: 100 with the A,B,C,D,E example, the select distinct syntax would be better imho. phil.wood 94423 Valued Member Group: General Forum Members Points: 69 Visits: 65 How about good ol' grouping to remove duplicates and supply any of the minimum, maximum, sum, count, average, etc. etc?select ItemNumber, min(UnitCost) as MinCost from MissingItemsgroup by ItemNumber;ItemNumber   MinCost777   10.10888   13.13999   16.16 philip.cullingworth UDP Broadcaster Group: General Forum Members Points: 1463 Visits: 967 I was interested to see what the difference in performance between these queries as I would always have used Row_Number with a partition before.On the Northwind database (SQL 2005) I ran the following 2 queries`select   a.ProductID   ,a.UnitPricefrom   (select      ProductID      ,UnitPrice      ,Rank() over (order by ProductID) as Ranking      ,Row_Number() over (order by ProductID,UnitPrice) as RowNumber   from      [Order Details]    as awhere   a.Ranking=a.RowNumberorder by   a.ProductID``select   a.ProductID   ,a.UnitPricefrom   (select      ProductID      ,UnitPrice      ,Row_Number() over (partition by ProductID order by UnitPrice) as RowNumber   from      [Order Details]    as awhere   a.RowNumber=1order by   a.ProductID`In the Actual Execution Plan, the first query showed up as 64% and the second as 36%.Far from exhaustive, but I'll probably stick to Row_Number with Partition by.In (partial) answer to Phil Wood, if all I wanted was the lowest price for each product then I would use grouping, I'd probably use this method if I wanted to find the customer or order date as well, changing the code to`select   a.ProductID   ,a.UnitPrice   ,b.CustomerID   ,b.OrderDatefrom   (select      ProductID      ,UnitPrice      ,OrderID      ,Row_Number() over (partition by ProductID order by UnitPrice) as RowNumber   from      [Order Details]    as a      inner join   Orders as b      on         a.OrderID=b.OrderIDwhere   a.RowNumber=1order by   a.ProductID` aszendi Forum Newbie Group: General Forum Members Points: 2 Visits: 7 I agree. The simplest way to deduplicate is using grouping and aggregate functions as needed.regardsÁron Bhavesh-1094084 SSC-Enthusiastic Group: General Forum Members Points: 149 Visits: 110 I had to remove unitprice from the orderby to get deduplicated results.`declare @TempTable table (itemnumber int, unitprice decimal(18, 2));insert into @TempTable values(777, 10.10), (777, 11.11),(777, 12.12),(888, 13.13),(888, 14.14),(888, 15.15),(999, 16.16),(999, 17.17),(999, 18.18);with GetMissingItems(Rank, RowNumber, ItemNumber, UnitCost) as(   select      RANK() over (order by ItemNumber) as Rank,      ROW_NUMBER() over (order by ItemNumber) as RowNumber,      itemnumber,      unitprice   from @TempTable      ) select * from GetMissingItemswhere Rank = RowNumber;` Bhavesh.NET and SQL Server Blog messineo Forum Newbie Group: General Forum Members Points: 1 Visits: 4 In the example you provided you used one column. Say I have 4 columns and I want to deduplicate based on those. In other words I want to rank against those columns -- how would the SQL statements change? philip.cullingworth UDP Broadcaster Group: General Forum Members Points: 1463 Visits: 967 messineo,If you have multiple columns that you want to deduplicate on you will need to include all of them in whichever method you use.Here is some sample code using each method that should work (all based on the Northwind database). Note that in all these cases the data will be de-duplicated but which row is selected for each set of duplicates can not be predicted (so if you want the first order for each set you will need different code).`select   a.ProductID   ,a.UnitPrice   ,a.Quantity   ,a.Discountfrom   (select      ProductID      ,UnitPrice      ,Quantity      ,Discount      ,Rank() over (order by ProductID,UnitPrice,Quantity,Discount) as Ranking      ,Row_Number() over (order by ProductID,UnitPrice,Quantity,Discount) as RowNumber   from      [Order Details]    as awhere   a.Ranking=a.RowNumber``select   a.ProductID   ,a.UnitPrice   ,a.Quantity   ,a.Discountfrom   (select      ProductID      ,UnitPrice      ,Quantity      ,Discount      ,Row_Number() over (partition by ProductID,UnitPrice,Quantity,Discount order by ProductID) as RowNumber   from      [Order Details]    as awhere   a.RowNumber=1``select   ProductID   ,UnitPrice   ,Quantity   ,Discountfrom   [Order Details]group by   ProductID   ,UnitPrice   ,Quantity   ,Discount`