Using Ranking Functions to Deduplicate Data

Question

Post reply

Using Ranking Functions to Deduplicate Data

andrew.riley

SSC Journeyman

Points: 81
More actions
July 26, 2010 at 8:57 pm

#223690

Comments posted to this topic are about the item Using Ranking Functions to Deduplicate Data

Viewing 15 posts - 1 through 15 (of 25 total)

You must be logged in to reply to this topic. Login to reply

scottm30 Ten Centuries Points: 1094 More actions · Answer 1

Nice article, thanks for sharing.

The same thing can be achieved with just the row_number function as well.

select AlphaKey

from (

select row_number() over(partition by AlphaKey order by AlphaKey) rownum,

AlphaKey

from @AlphaList

) al

where rownum = 1

order by 1;

ziangij SSCrazy Eights Points: 8066 More actions · Answer 2

yep, i too use this query to remove duplicates...

was wondering, which query is faster, this one or the one mentioned in the article...

scottm30 (7/26/2010)
Nice article, thanks for sharing.
The same thing can be achieved with just the row_number function as well.
select AlphaKey
from (
select row_number() over(partition by AlphaKey order by AlphaKey) rownum,
AlphaKey
from @AlphaList
) al
where rownum = 1
order by 1;

david.skov SSC Enthusiast Points: 179 More actions · Answer 3

with the A,B,C,D,E example, the select distinct syntax would be better imho.

phil.wood 94423 Old Hand Points: 335 More actions · Answer 4

How about good ol' grouping to remove duplicates and supply any of the minimum, maximum, sum, count, average, etc. etc?

select ItemNumber, min(UnitCost) as MinCost from MissingItems

group by ItemNumber;

ItemNumberMinCost

77710.10

88813.13

99916.16

philip.cullingworth SSCrazy Points: 2150 More actions · Answer 5

I was interested to see what the difference in performance between these queries as I would always have used Row_Number with a partition before.

On the Northwind database (SQL 2005) I ran the following 2 queries

select

a.ProductID

,a.UnitPrice

from

(select

ProductID

,UnitPrice

,Rank() over (order by ProductID) as Ranking

,Row_Number() over (order by ProductID,UnitPrice) as RowNumber

from

[Order Details]

) as a

where

a.Ranking=a.RowNumber

order by

a.ProductID

select

a.ProductID

,a.UnitPrice

from

(select

ProductID

,UnitPrice

,Row_Number() over (partition by ProductID order by UnitPrice) as RowNumber

from

[Order Details]

) as a

where

a.RowNumber=1

order by

a.ProductID

In the Actual Execution Plan, the first query showed up as 64% and the second as 36%.

Far from exhaustive, but I'll probably stick to Row_Number with Partition by.

In (partial) answer to Phil Wood, if all I wanted was the lowest price for each product then I would use grouping, I'd probably use this method if I wanted to find the customer or order date as well, changing the code to

select

a.ProductID

,a.UnitPrice

,b.CustomerID

,b.OrderDate

from

(select

ProductID

,UnitPrice

,OrderID

,Row_Number() over (partition by ProductID order by UnitPrice) as RowNumber

from

[Order Details]

) as a

inner join

Orders as b

on

a.OrderID=b.OrderID

where

a.RowNumber=1

order by

a.ProductID

aszendi SSC Enthusiast Points: 114 More actions · Answer 6

I agree. The simplest way to deduplicate is using grouping and aggregate functions as needed.

regards

Áron

Bhavesh-1094084 SSC Eights! Points: 865 More actions · Answer 7

I had to remove unitprice from the orderby to get deduplicated results.

declare @TempTable table (itemnumber int, unitprice decimal(18, 2));

insert into @TempTable values

(777, 10.10),

(777, 11.11),

(777, 12.12),

(888, 13.13),

(888, 14.14),

(888, 15.15),

(999, 16.16),

(999, 17.17),

(999, 18.18);

with GetMissingItems(Rank, RowNumber, ItemNumber, UnitCost) as

(

select

RANK() over (order by ItemNumber) as Rank,

ROW_NUMBER() over (order by ItemNumber) as RowNumber,

itemnumber,

unitprice

from @TempTable

) select * from GetMissingItems

where Rank = RowNumber;

Bhavesh
.NET and SQL Server Blog

messineo SSC Journeyman Points: 83 More actions · Answer 8

In the example you provided you used one column. Say I have 4 columns and I want to deduplicate based on those. In other words I want to rank against those columns -- how would the SQL statements change?

philip.cullingworth SSCrazy Points: 2150 More actions · Answer 9

messineo,

If you have multiple columns that you want to deduplicate on you will need to include all of them in whichever method you use.

Here is some sample code using each method that should work (all based on the Northwind database). Note that in all these cases the data will be de-duplicated but which row is selected for each set of duplicates can not be predicted (so if you want the first order for each set you will need different code).

select

a.ProductID

,a.UnitPrice

,a.Quantity

,a.Discount

from

(select

ProductID

,UnitPrice

,Quantity

,Discount

,Rank() over (order by ProductID,UnitPrice,Quantity,Discount) as Ranking

,Row_Number() over (order by ProductID,UnitPrice,Quantity,Discount) as RowNumber

from

[Order Details]

) as a

where

a.Ranking=a.RowNumber

select

a.ProductID

,a.UnitPrice

,a.Quantity

,a.Discount

from

(select

ProductID

,UnitPrice

,Quantity

,Discount

,Row_Number() over (partition by ProductID,UnitPrice,Quantity,Discount order by ProductID) as RowNumber

from

[Order Details]

) as a

where

a.RowNumber=1

select

ProductID

,UnitPrice

,Quantity

,Discount

from

[Order Details]

group by

ProductID

,UnitPrice

,Quantity

,Discount

mross01 SSC Journeyman Points: 88 More actions · Answer 10

I also generally use a partitioned row_number(), ordering on some selection criteria (highest number of x with provider y, etc).

Is the output of Rank defined by ANSI to always support this method in all implementations? My guess is that it's only guaranteed to be in order, but that the density of the default invocation could vary.

Still, this is clever and hadn't occurred to me before, so thanks for expanding my box.

rvphx SSCrazy Points: 2120 More actions · Answer 11

Nice article. Gives me another way to de-dupe data. 🙂

I use almost the same technique, but I have a time stamp associated with each record and this time stamp differs by milliseconds. So I just pick the one which has the latest time stamp. I should point out that when you use Ranking functions on a table that has really large number of records (in my case it was close to 90 millions some times), the query performance can degrade drastically (I didnt even have an index to begin with).

grc-80104 Old Hand Points: 339 More actions · Answer 12

I personally have used the ranking functions quite a bit lately not to just remove duplicates but to be able to qualify what is removed. In one case, I needed to qualify one field based on the number of characters in a column and look at another column that may have one of two choices in it - "Reject" or "Reschedule". In this case, Reject weighed more than Reschedule. The ranking of both of the columns was added together to come up with the total ranking. Then all of the rows that were not the highest rank [based on a single column] were removed.

Lambchop4697 Mr or Mrs. 500 Points: 599 More actions · Answer 13

The approach is/was unique and creative, and for that, it was an excellent article. But considering there are at least 2 other well known approaches to this problem, I feel that the article was missing the performance comparison that not only compared with the other approaches, but alsoo compared all approaches on a multi-column approach. Perhaps this was an excellent article as it left me, the reader, wanting lots more 😀

amenjonathan SSCrazy Points: 2482 More actions · Answer 14

I find the RANK function to be much better used in situations where I might not be able to group by all fields in the select statement. Another way I have used it before is to find not the latest record, but the SECOND latest record by choosing where RANK = 2.

-------------------------------------------------------------------------------------------------
My SQL Server Blog