Using Ranking Functions to Deduplicate Data

  • andrew.riley

    SSC Journeyman

    Points: 81

    Comments posted to this topic are about the item Using Ranking Functions to Deduplicate Data

  • scottm30

    Ten Centuries

    Points: 1094

    Nice article, thanks for sharing.

    The same thing can be achieved with just the row_number function as well.

    select AlphaKey
    from (
        select row_number() over (partition by AlphaKey order by AlphaKey) as rownum,
               AlphaKey
        from @AlphaList
    ) al
    where rownum = 1
    order by 1;

  • ziangij

    SSCertifiable

    Points: 6566

    Yep, I too use this query to remove duplicates.

    I was wondering: which query is faster, this one or the one mentioned in the article?

    scottm30 (7/26/2010)


    Nice article, thanks for sharing.

    The same thing can be achieved with just the row_number function as well.

    select AlphaKey
    from (
        select row_number() over (partition by AlphaKey order by AlphaKey) as rownum,
               AlphaKey
        from @AlphaList
    ) al
    where rownum = 1
    order by 1;

  • david.skov

    SSC Enthusiast

    Points: 179

    with the A,B,C,D,E example, the select distinct syntax would be better imho.

  • phil.wood 94423

    Old Hand

    Points: 335

    How about good ol' grouping to remove duplicates and supply any of the minimum, maximum, sum, count, average, etc.?

    select ItemNumber, min(UnitCost) as MinCost
    from MissingItems
    group by ItemNumber;

    ItemNumber  MinCost
    777         10.10
    888         13.13
    999         16.16

  • philip.cullingworth

    SSCrazy

    Points: 2150

    I was interested to see what the difference in performance between these queries was, as I would always have used Row_Number with a partition before.

    On the Northwind database (SQL 2005) I ran the following two queries:

    select
        a.ProductID,
        a.UnitPrice
    from (
        select
            ProductID,
            UnitPrice,
            Rank() over (order by ProductID) as Ranking,
            Row_Number() over (order by ProductID, UnitPrice) as RowNumber
        from [Order Details]
    ) as a
    where a.Ranking = a.RowNumber
    order by a.ProductID;

    select
        a.ProductID,
        a.UnitPrice
    from (
        select
            ProductID,
            UnitPrice,
            Row_Number() over (partition by ProductID order by UnitPrice) as RowNumber
        from [Order Details]
    ) as a
    where a.RowNumber = 1
    order by a.ProductID;

    In the Actual Execution Plan, the first query showed up as 64% and the second as 36%.

    Far from exhaustive, but I'll probably stick to Row_Number with Partition by.

    In (partial) answer to Phil Wood: if all I wanted was the lowest price for each product, then I would use grouping. I'd probably use this method if I wanted to find the customer or order date as well, changing the code to:

    select
        a.ProductID,
        a.UnitPrice,
        b.CustomerID,
        b.OrderDate
    from (
        select
            ProductID,
            UnitPrice,
            OrderID,
            Row_Number() over (partition by ProductID order by UnitPrice) as RowNumber
        from [Order Details]
    ) as a
    inner join Orders as b
        on a.OrderID = b.OrderID
    where a.RowNumber = 1
    order by a.ProductID;

  • aszendi

    SSC Enthusiast

    Points: 114

    I agree. The simplest way to deduplicate is using grouping and aggregate functions as needed.

    regards

    Áron

  • Bhavesh-1094084

    SSC Eights!

    Points: 865

    I had to remove UnitPrice from the ORDER BY to get deduplicated results.

    declare @TempTable table (ItemNumber int, UnitPrice decimal(18, 2));

    insert into @TempTable values
        (777, 10.10), (777, 11.11), (777, 12.12),
        (888, 13.13), (888, 14.14), (888, 15.15),
        (999, 16.16), (999, 17.17), (999, 18.18);

    with GetMissingItems (Rank, RowNumber, ItemNumber, UnitCost) as
    (
        select
            RANK() over (order by ItemNumber) as Rank,
            ROW_NUMBER() over (order by ItemNumber) as RowNumber,
            ItemNumber,
            UnitPrice
        from @TempTable
    )
    select * from GetMissingItems
    where Rank = RowNumber;

  • messineo

    SSC Journeyman

    Points: 83

    In the example you provided, you used one column. Say I have four columns and I want to deduplicate based on those; in other words, I want to rank against those columns. How would the SQL statements change?

  • philip.cullingworth

    SSCrazy

    Points: 2150

    messineo,

    If you have multiple columns that you want to deduplicate on, you will need to include all of them in whichever method you use.

    Here is some sample code using each method that should work (all based on the Northwind database). Note that in all these cases the data will be de-duplicated, but which row is selected for each set of duplicates cannot be predicted (so if you want the first order for each set you will need different code).

    select
        a.ProductID,
        a.UnitPrice,
        a.Quantity,
        a.Discount
    from (
        select
            ProductID,
            UnitPrice,
            Quantity,
            Discount,
            Rank() over (order by ProductID, UnitPrice, Quantity, Discount) as Ranking,
            Row_Number() over (order by ProductID, UnitPrice, Quantity, Discount) as RowNumber
        from [Order Details]
    ) as a
    where a.Ranking = a.RowNumber;

    select
        a.ProductID,
        a.UnitPrice,
        a.Quantity,
        a.Discount
    from (
        select
            ProductID,
            UnitPrice,
            Quantity,
            Discount,
            Row_Number() over (partition by ProductID, UnitPrice, Quantity, Discount order by ProductID) as RowNumber
        from [Order Details]
    ) as a
    where a.RowNumber = 1;

    select
        ProductID,
        UnitPrice,
        Quantity,
        Discount
    from [Order Details]
    group by
        ProductID,
        UnitPrice,
        Quantity,
        Discount;

  • mross01

    SSC Journeyman

    Points: 88

    I also generally use a partitioned row_number(), ordering on some selection criteria (highest number of x with provider y, etc).

    Is the output of RANK defined by ANSI to always support this method in all implementations? My guess is that it's only guaranteed to be in order, but that the density of the default invocation could vary.

    Still, this is clever and it hadn't occurred to me before, so thanks for expanding my box.

  • rvphx

    SSCrazy

    Points: 2120

    Nice article. Gives me another way to de-dupe data. πŸ™‚

    I use almost the same technique, but I have a time stamp associated with each record, and these time stamps differ by milliseconds, so I just pick the one with the latest time stamp. I should point out that when you use ranking functions on a table with a really large number of records (in my case it was sometimes close to 90 million), query performance can degrade drastically (I didn't even have an index to begin with).
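    The latest-time-stamp variant described in this post can be sketched with a partitioned ROW_NUMBER ordered by the time stamp descending. Here is a minimal runnable example; the table and column names are made up for illustration, and SQLite (via Python) is used only because it also supports these window functions:

    ```python
    import sqlite3

    # Sketch of the latest-time-stamp approach: keep only the newest row
    # per key. Table and column names are hypothetical.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    create table Readings (ItemNumber integer, UnitCost real, StampedAt text);
    insert into Readings values
        (777, 10.10, '2010-07-26 09:00:00.003'),
        (777, 12.12, '2010-07-26 09:00:00.007'),
        (888, 13.13, '2010-07-26 09:00:00.001'),
        (888, 15.15, '2010-07-26 09:00:00.004');
    """)

    rows = conn.execute("""
        select ItemNumber, UnitCost
        from (
            select ItemNumber, UnitCost,
                   row_number() over (partition by ItemNumber
                                      order by StampedAt desc) as rn
            from Readings
        )
        where rn = 1
        order by ItemNumber
    """).fetchall()

    print(rows)  # one row per ItemNumber, carrying the latest UnitCost
    ```

    On very large tables, an index covering the partition and order-by columns generally helps the window sort, which matches the poster's observation about missing indexes.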

  • grc-80104

    Old Hand

    Points: 339

    I personally have used the ranking functions quite a bit lately, not just to remove duplicates but to be able to qualify what is removed. In one case, I needed to rank one field based on the number of characters in a column, and look at another column that could contain one of two choices, "Reject" or "Reschedule"; in this case, Reject weighed more than Reschedule. The rankings of both columns were added together to come up with the total ranking. Then all of the rows that were not the highest rank [based on a single column] were removed.
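    The weighted-ranking idea in this post can be sketched as follows: two RANK() values are added into a total ranking, and only the top-ranked row per group is kept. All table, column, and ticket names below are hypothetical; SQLite via Python is used just to make the sketch runnable:

    ```python
    import sqlite3

    # Sketch of a weighted total ranking: one RANK on the length of a code
    # column plus one RANK that makes 'Reject' outweigh 'Reschedule'; only
    # the highest-ranked row per ticket survives. All names are made up.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    create table Actions (TicketID integer, Code text, Outcome text);
    insert into Actions values
        (1, 'AB',   'Reschedule'),
        (1, 'ABCD', 'Reject'),
        (2, 'X',    'Reschedule'),
        (2, 'XY',   'Reject');
    """)

    rows = conn.execute("""
        select TicketID, Code, Outcome
        from (
            select TicketID, Code, Outcome,
                   row_number() over (partition by TicketID
                                      order by TotalRank desc) as rn
            from (
                select TicketID, Code, Outcome,
                       rank() over (order by length(Code))
                     + rank() over (order by case Outcome
                                             when 'Reject' then 2
                                             else 1 end) as TotalRank
                from Actions
            )
        )
        where rn = 1
        order by TicketID
    """).fetchall()

    print(rows)  # the 'Reject' row wins for each ticket in this data
    ```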

  • Lambchop4697

    SSC-Addicted

    Points: 457

    The approach is/was unique and creative, and for that it was an excellent article. But considering there are at least two other well-known approaches to this problem, I feel the article was missing a performance comparison, one that not only compared the other approaches but also compared all of them on a multi-column case. Perhaps this was an excellent article precisely because it left me, the reader, wanting lots more 😀

  • amenjonathan

    SSCrazy

    Points: 2422

    I find the RANK function to be much better used in situations where I might not be able to group by all fields in the select statement. Another way I have used it before is to find not the latest record, but the SECOND latest record by choosing where RANK = 2.

    -------------------------------------------------------------------------------------------------
    My SQL Server Blog
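    Picking the second-latest record with RANK = 2, as this post describes, can be sketched like this (table and column names are hypothetical; SQLite via Python is used to make it runnable). Note that with tied dates RANK can skip 2 entirely, in which case DENSE_RANK may be the better fit:

    ```python
    import sqlite3

    # Sketch of "second latest" with RANK = 2: rank each customer's
    # orders newest-first and keep the rank-2 row. Names are made up.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    create table CustomerOrders (CustomerID integer, OrderDate text);
    insert into CustomerOrders values
        (1, '2010-07-01'), (1, '2010-07-15'), (1, '2010-07-26'),
        (2, '2010-06-10'), (2, '2010-07-20');
    """)

    rows = conn.execute("""
        select CustomerID, OrderDate
        from (
            select CustomerID, OrderDate,
                   rank() over (partition by CustomerID
                                order by OrderDate desc) as rnk
            from CustomerOrders
        )
        where rnk = 2
        order by CustomerID
    """).fetchall()

    print(rows)  # second-latest order date per customer
    ```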

Viewing 15 posts - 1 through 15 (of 25 total)

You must be logged in to reply to this topic. Login to reply