Using Ranking Functions to Deduplicate Data

  • Comments posted to this topic are about the item Using Ranking Functions to Deduplicate Data

  • Nice article, thanks for sharing.

    The same thing can be achieved with just the row_number function as well.

    select AlphaKey
    from (
        select row_number() over (partition by AlphaKey order by AlphaKey) rownum,
               AlphaKey
        from @AlphaList
    ) al
    where rownum = 1
    order by 1;

  • Yep, I too use this query to remove duplicates...

    I was wondering, which query is faster, this one or the one mentioned in the article?

    scottm30 (7/26/2010)

    Nice article, thanks for sharing.

    The same thing can be achieved with just the row_number function as well.

    select AlphaKey
    from (
        select row_number() over (partition by AlphaKey order by AlphaKey) rownum,
               AlphaKey
        from @AlphaList
    ) al
    where rownum = 1
    order by 1;

  • with the A,B,C,D,E example, the select distinct syntax would be better imho.
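
    For the single-column A,B,C,D,E case, a quick self-contained check (sketched here in Python with SQLite rather than T-SQL; the table name and values are made up):

    ```python
    import sqlite3

    # Hypothetical single-column list, standing in for the article's A,B,C,D,E data.
    con = sqlite3.connect(":memory:")
    con.execute("create table AlphaList (AlphaKey text)")
    con.executemany("insert into AlphaList values (?)",
                    [("A",), ("A",), ("B",), ("C",), ("C",), ("D",), ("E",), ("E",)])

    # With only one column in play, DISTINCT removes duplicates without any windowing.
    rows = con.execute(
        "select distinct AlphaKey from AlphaList order by AlphaKey").fetchall()
    print([r[0] for r in rows])  # ['A', 'B', 'C', 'D', 'E']
    ```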

  • How about good ol' grouping to remove duplicates and supply any of the minimum, maximum, sum, count, average, etc. etc?

    select ItemNumber, min(UnitCost) as MinCost
    from MissingItems
    group by ItemNumber;

    ItemNumber  MinCost
           777    10.10
           888    13.13
           999    16.16

  • I was interested to see what the difference in performance between these queries as I would always have used Row_Number with a partition before.

    On the Northwind database (SQL 2005) I ran the following 2 queries

    select a.ProductID, a.UnitPrice
    from (
        select ProductID,
               UnitPrice,
               Rank() over (order by ProductID) as Ranking,
               Row_Number() over (order by ProductID, UnitPrice) as RowNumber
        from [Order Details]
    ) as a
    where a.Ranking = a.RowNumber
    order by a.ProductID;

    select a.ProductID, a.UnitPrice
    from (
        select ProductID,
               UnitPrice,
               Row_Number() over (partition by ProductID order by UnitPrice) as RowNumber
        from [Order Details]
    ) as a
    where a.RowNumber = 1
    order by a.ProductID;

    In the Actual Execution Plan, the first query showed up as 64% and the second as 36%.

    Far from exhaustive, but I'll probably stick to Row_Number with Partition by.

    In (partial) answer to Phil Wood: if all I wanted was the lowest price for each product, I would use grouping. I'd probably use this method if I also wanted to find the customer or order date, changing the code to:

    select a.ProductID, a.UnitPrice, b.CustomerID, b.OrderDate
    from (
        select ProductID,
               UnitPrice,
               OrderID,
               Row_Number() over (partition by ProductID order by UnitPrice) as RowNumber
        from [Order Details]
    ) as a
    inner join Orders as b
        on a.OrderID = b.OrderID
    where a.RowNumber = 1
    order by a.ProductID;

  • I agree. The simplest way to deduplicate is using grouping and aggregate functions as needed.

    regards

    Áron

  • I had to remove unitprice from the ORDER BY to get deduplicated results.

    declare @TempTable table (itemnumber int, unitprice decimal(18, 2));

    insert into @TempTable values
        (777, 10.10), (777, 11.11), (777, 12.12),
        (888, 13.13), (888, 14.14), (888, 15.15),
        (999, 16.16), (999, 17.17), (999, 18.18);

    with GetMissingItems(Rank, RowNumber, ItemNumber, UnitCost) as
    (
        select RANK() over (order by ItemNumber) as Rank,
               ROW_NUMBER() over (order by ItemNumber) as RowNumber,
               itemnumber,
               unitprice
        from @TempTable
    )
    select * from GetMissingItems
    where Rank = RowNumber;

  • In the example you provided you used one column. Say I have 4 columns and I want to deduplicate based on those. In other words I want to rank against those columns -- how would the SQL statements change?

  • messineo,

    If you have multiple columns that you want to deduplicate on you will need to include all of them in whichever method you use.

    Here is some sample code using each method that should work (all based on the Northwind database). Note that in all these cases the data will be de-duplicated, but which row is selected for each set of duplicates cannot be predicted (so if you want the first order for each set, you will need different code).

    select a.ProductID, a.UnitPrice, a.Quantity, a.Discount
    from (
        select ProductID, UnitPrice, Quantity, Discount,
               Rank() over (order by ProductID, UnitPrice, Quantity, Discount) as Ranking,
               Row_Number() over (order by ProductID, UnitPrice, Quantity, Discount) as RowNumber
        from [Order Details]
    ) as a
    where a.Ranking = a.RowNumber;

    select a.ProductID, a.UnitPrice, a.Quantity, a.Discount
    from (
        select ProductID, UnitPrice, Quantity, Discount,
               Row_Number() over (partition by ProductID, UnitPrice, Quantity, Discount order by ProductID) as RowNumber
        from [Order Details]
    ) as a
    where a.RowNumber = 1;

    select ProductID, UnitPrice, Quantity, Discount
    from [Order Details]
    group by ProductID, UnitPrice, Quantity, Discount;

  • I also generally use a partitioned row_number(), ordering on some selection criteria (highest number of x with provider y, etc).

    Is the output of Rank defined by ANSI to always support this method in all implementations? My guess is that it's only guaranteed to be in order, but that the density of the default invocation could vary.

    Still, this is clever and hadn't occurred to me before, so thanks for expanding my box.
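
    As far as I can tell, the standard does pin this down: RANK leaves gaps after ties while DENSE_RANK does not, and the rank = row_number trick depends on the gapped version. A quick illustration (sketched in Python with SQLite rather than T-SQL; the table is invented):

    ```python
    import sqlite3

    # Two tied 'A' rows and one 'B' row make the gap visible.
    con = sqlite3.connect(":memory:")  # window functions need SQLite >= 3.25
    con.execute("create table t (k text)")
    con.executemany("insert into t values (?)", [("A",), ("A",), ("B",)])

    rows = con.execute("""
        select k,
               rank()       over (order by k) as rnk,
               dense_rank() over (order by k) as drnk,
               row_number() over (order by k) as rn
        from t
        order by rn
    """).fetchall()
    # RANK skips to 3 after the tie; DENSE_RANK stays dense at 2.
    print(rows)  # [('A', 1, 1, 1), ('A', 1, 1, 2), ('B', 3, 2, 3)]
    ```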

  • Nice article. Gives me another way to de-dupe data. πŸ™‚

    I use almost the same technique, but I have a timestamp associated with each record, and the timestamps differ by milliseconds, so I just pick the one with the latest timestamp. I should point out that when you use ranking functions on a table with a really large number of records (in my case close to 90 million at times), query performance can degrade drastically (I didn't even have an index to begin with).
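
    That latest-timestamp variant can be sketched like this (Python with SQLite rather than T-SQL; the readings table, column names, and values are all invented): partition by the key, order by the timestamp descending, and keep row number 1.

    ```python
    import sqlite3

    # Invented sample: duplicate items whose timestamps differ by milliseconds.
    con = sqlite3.connect(":memory:")  # window functions need SQLite >= 3.25
    con.execute("create table readings (item integer, ts text, val real)")
    con.executemany("insert into readings values (?,?,?)", [
        (777, "2010-07-26 10:00:00.120", 1.0),
        (777, "2010-07-26 10:00:00.457", 2.0),
        (888, "2010-07-26 10:00:01.003", 3.0),
    ])

    # Order each partition newest-first, then keep only the first row.
    rows = con.execute("""
        select item, ts, val
        from (
            select item, ts, val,
                   row_number() over (partition by item order by ts desc) as rn
            from readings
        )
        where rn = 1
        order by item
    """).fetchall()
    print(rows)  # [(777, '2010-07-26 10:00:00.457', 2.0), (888, '2010-07-26 10:00:01.003', 3.0)]
    ```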

  • I personally have used the ranking functions quite a bit lately, not just to remove duplicates but to qualify what is removed. In one case I needed to rank one field based on the number of characters in a column, and look at another column that could hold one of two values, "Reject" or "Reschedule", with Reject weighing more than Reschedule. The rankings of the two columns were added together to produce the total ranking, and then all rows that were not the highest rank (based on a single column) were removed.
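
    A loose sketch of that weighting idea (Python with SQLite rather than T-SQL; every name and value here is invented, and the scoring is simplified): fold both criteria, the Reject-over-Reschedule weight and the character count, into one ranking and keep the top row per item.

    ```python
    import sqlite3

    con = sqlite3.connect(":memory:")  # window functions need SQLite >= 3.25
    con.execute("create table actions (item integer, code text, status text)")
    con.executemany("insert into actions values (?,?,?)", [
        (1, "AB",   "Reschedule"),
        (1, "ABCD", "Reject"),
        (1, "ABC",  "Reject"),
    ])

    # 'Reject' outweighs 'Reschedule'; ties break on the longer code.
    rows = con.execute("""
        select item, code, status
        from (
            select item, code, status,
                   row_number() over (
                       partition by item
                       order by case status when 'Reject' then 2 else 1 end desc,
                                length(code) desc) as rn
            from actions
        )
        where rn = 1
    """).fetchall()
    print(rows)  # [(1, 'ABCD', 'Reject')]
    ```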

  • The approach is/was unique and creative, and for that it was an excellent article. But considering there are at least two other well-known approaches to this problem, I feel the article was missing a performance comparison, both against the other approaches and across all approaches on multi-column deduplication. Perhaps this was an excellent article precisely because it left me, the reader, wanting lots more πŸ˜€

  • I find the RANK function to be much better used in situations where I might not be able to group by all fields in the select statement. Another way I have used it before is to find not the latest record, but the SECOND latest record by choosing where RANK = 2.
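
    The second-latest idea can be sketched like so (Python with SQLite rather than T-SQL; table and data are invented): the same window, just filtering on rank 2 instead of 1.

    ```python
    import sqlite3

    con = sqlite3.connect(":memory:")  # window functions need SQLite >= 3.25
    con.execute("create table cust_orders (cust text, odate text)")
    con.executemany("insert into cust_orders values (?,?)", [
        ("alice", "2010-01-01"), ("alice", "2010-03-01"), ("alice", "2010-02-01"),
        ("bob",   "2010-05-01"), ("bob",   "2010-04-01"),
    ])

    # rank = 1 would give the latest order; rank = 2 gives the second latest.
    rows = con.execute("""
        select cust, odate
        from (
            select cust, odate,
                   rank() over (partition by cust order by odate desc) as rk
            from cust_orders
        )
        where rk = 2
        order by cust
    """).fetchall()
    print(rows)  # [('alice', '2010-02-01'), ('bob', '2010-04-01')]
    ```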

    -------------------------------------------------------------------------------------------------
    My SQL Server Blog

Viewing 15 posts - 1 through 15 (of 24 total)
