Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase 123»»»

Using Ranking Functions to Deduplicate Data Expand / Collapse
Author
Message
Posted Monday, July 26, 2010 8:57 PM
Grasshopper

GrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopper

Group: General Forum Members
Last Login: Friday, August 29, 2014 5:06 AM
Points: 19, Visits: 147
Comments posted to this topic are about the item Using Ranking Functions to Deduplicate Data
Post #959167
Posted Monday, July 26, 2010 11:19 PM
SSChasing Mays

SSChasing MaysSSChasing MaysSSChasing MaysSSChasing MaysSSChasing MaysSSChasing MaysSSChasing MaysSSChasing Mays

Group: General Forum Members
Last Login: Tuesday, December 3, 2013 4:40 PM
Points: 654, Visits: 375
Nice article, thanks for sharing.
The same thing can be achieved with just the row_number function as well.


select AlphaKey
from (
select row_number() over(partition by AlphaKey order by AlphaKey) rownum,
AlphaKey
from @AlphaList
) al
where rownum = 1
order by 1;


Post #959186
Posted Tuesday, July 27, 2010 12:50 AM
SSCrazy

SSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazy

Group: General Forum Members
Last Login: Monday, June 30, 2014 2:51 AM
Points: 2,017, Visits: 369
yep, i too use this query to remove duplicates...

was wondering, which query is faster, this one or the one mentioned in the article...

scottm30 (7/26/2010)
Nice article, thanks for sharing.
The same thing can be achieved with just the row_number function as well.


select AlphaKey
from (
select row_number() over(partition by AlphaKey order by AlphaKey) rownum,
AlphaKey
from @AlphaList
) al
where rownum = 1
order by 1;


Post #959217
Posted Tuesday, July 27, 2010 1:35 AM
Forum Newbie

Forum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum Newbie

Group: General Forum Members
Last Login: Monday, January 13, 2014 4:52 AM
Points: 5, Visits: 100
with the A,B,C,D,E example, the select distinct syntax would be better imho.
Post #959229
Posted Tuesday, July 27, 2010 1:55 AM
Valued Member

Valued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued Member

Group: General Forum Members
Last Login: Thursday, December 5, 2013 7:19 AM
Points: 69, Visits: 65
How about good ol' grouping to remove duplicates and supply any of the minimum, maximum, sum, count, average, etc. etc?

select ItemNumber, min(UnitCost) as MinCost from MissingItems
group by ItemNumber;

ItemNumber MinCost
777 10.10
888 13.13
999 16.16
Post #959234
Posted Tuesday, July 27, 2010 2:31 AM
SSC Eights!

SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!

Group: General Forum Members
Last Login: Today @ 7:51 AM
Points: 999, Visits: 859
I was interested to see what the difference in performance between these queries as I would always have used Row_Number with a partition before.

On the Northwind database (SQL 2005) I ran the following 2 queries
select
a.ProductID
,a.UnitPrice
from
(select
ProductID
,UnitPrice
,Rank() over (order by ProductID) as Ranking
,Row_Number() over (order by ProductID,UnitPrice) as RowNumber
from
[Order Details]
) as a
where
a.Ranking=a.RowNumber
order by
a.ProductID

select
a.ProductID
,a.UnitPrice
from
(select
ProductID
,UnitPrice
,Row_Number() over (partition by ProductID order by UnitPrice) as RowNumber
from
[Order Details]
) as a
where
a.RowNumber=1
order by
a.ProductID

In the Actual Execution Plan, the first query showed up as 64% and the second as 36%.

Far from exhaustive, but I'll probably stick to Row_Number with Partition by.

In (partial) answer to Phil Wood, if all I wanted was the lowest price for each product then I would use grouping, I'd probably use this method if I wanted to find the customer or order date as well, changing the code to
select
a.ProductID
,a.UnitPrice
,b.CustomerID
,b.OrderDate
from
(select
ProductID
,UnitPrice
,OrderID
,Row_Number() over (partition by ProductID order by UnitPrice) as RowNumber
from
[Order Details]
) as a
inner join
Orders as b
on
a.OrderID=b.OrderID
where
a.RowNumber=1
order by
a.ProductID
Post #959248
Posted Tuesday, July 27, 2010 2:54 AM
Forum Newbie

Forum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum Newbie

Group: General Forum Members
Last Login: Tuesday, July 27, 2010 4:58 AM
Points: 2, Visits: 7
I agree. The simplest way to deduplicate is using grouping and aggregate functions as needed.
regards
Áron
Post #959255
Posted Tuesday, July 27, 2010 4:05 AM
SSC-Enthusiastic

SSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-Enthusiastic

Group: General Forum Members
Last Login: Tuesday, October 15, 2013 5:16 AM
Points: 147, Visits: 104
I had to remove unitprice from the orderby to get deduplicated results.

declare @TempTable table (itemnumber int, unitprice decimal(18, 2));

insert into @TempTable values
(777, 10.10),
(777, 11.11),
(777, 12.12),
(888, 13.13),
(888, 14.14),
(888, 15.15),
(999, 16.16),
(999, 17.17),
(999, 18.18);

with GetMissingItems(Rank, RowNumber, ItemNumber, UnitCost) as
(
select
RANK() over (order by ItemNumber) as Rank,
ROW_NUMBER() over (order by ItemNumber) as RowNumber,
itemnumber,
unitprice
from @TempTable
) select * from GetMissingItems
where Rank = RowNumber;



Bhavesh
.NET and SQL Server Blog
Post #959283
Posted Tuesday, July 27, 2010 4:32 AM
Forum Newbie

Forum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum Newbie

Group: General Forum Members
Last Login: Wednesday, July 28, 2010 12:19 PM
Points: 1, Visits: 4
In the example you provided you used one column. Say I have 4 columns and I want to deduplicate based on those. In other words I want to rank against those columns -- how would the SQL statements change?
Post #959294
Posted Tuesday, July 27, 2010 4:46 AM
SSC Eights!

SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!

Group: General Forum Members
Last Login: Today @ 7:51 AM
Points: 999, Visits: 859
messineo,

If you have multiple columns that you want to deduplicate on you will need to include all of them in whichever method you use.

Here is some sample code using each method that should work (all based on the Northwind database). Note that in all these cases the data will be de-duplicated but which row is selected for each set of duplicates can not be predicted (so if you want the first order for each set you will need different code).

select
a.ProductID
,a.UnitPrice
,a.Quantity
,a.Discount
from
(select
ProductID
,UnitPrice
,Quantity
,Discount
,Rank() over (order by ProductID,UnitPrice,Quantity,Discount) as Ranking
,Row_Number() over (order by ProductID,UnitPrice,Quantity,Discount) as RowNumber
from
[Order Details]
) as a
where
a.Ranking=a.RowNumber

select
a.ProductID
,a.UnitPrice
,a.Quantity
,a.Discount
from
(select
ProductID
,UnitPrice
,Quantity
,Discount
,Row_Number() over (partition by ProductID,UnitPrice,Quantity,Discount order by ProductID) as RowNumber
from
[Order Details]
) as a
where
a.RowNumber=1

select
ProductID
,UnitPrice
,Quantity
,Discount
from
[Order Details]
group by
ProductID
,UnitPrice
,Quantity
,Discount
Post #959300
« Prev Topic | Next Topic »

Add to briefcase 123»»»

Permissions Expand / Collapse