Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 

Using Ranking Functions to Deduplicate Data

By Andy Riley,

The introduction of ranking functions in SQL Server 2005 allowed the generation of listings with generated numbers based on sort orders providing keys such as row numbers and rank. These can be used for the deduplication of data in a simple way.


Consider the following simple example of a list of characters:

A
A
B
B
C
D
D
E

Let's put this into a table. This example creates a table variable.

declare @AlphaList table (AlphaKey char);

insert into @AlphaList(AlphaKey) values ('A');
insert into @AlphaList(AlphaKey) values ('A');
insert into @AlphaList(AlphaKey) values ('B');
insert into @AlphaList(AlphaKey) values ('B');
insert into @AlphaList(AlphaKey) values ('C');
insert into @AlphaList(AlphaKey) values ('D');
insert into @AlphaList(AlphaKey) values ('D');
insert into @AlphaList(AlphaKey) values ('E'); select AlphaKey from @AlphaList order by 1;

If we now use the ROW_NUMBER function we can get a listing of @AlphaList table with a sequential list of row numbers.

select 
ROW_NUMBER() over (order by AlphaKey) as RowNumber
, AlphaKey
from @AlphaList;

1 A
2 A
3 B
4 B
5 C
6 D
7 D
8 E


Now if we add RANK onto this query, we get an additional column ranking the data. For example, the values of 'A' are assigned a rank of 1 as they are equal, the values of B are assigned the next rank, C the next, etc. This is a simple way of assiging a value by class or category depending on how we order the data - the rank value of 3 simply means that all data with this rank number is the same.

select 
RANK() over (order by AlphaKey) as Rank
, ROW_NUMBER() over (order by AlphaKey) as RowNumber
, AlphaKey
from @AlphaList;


1 1 A
1 2 A
3 3 B
3 4 B
5 5 C
6 6 D
6 7 D
8 8 E


From this you can see that the first rows containing the AlphaKey column values of 'A', 'B' and 'D' have the same values generated by the RANK and ROW_NUMBER functions giving us a very simple way of deduplicating the data.

What is happening is that the data is now being ordered and classified or categorised simultaneously so that equal values are adjacent to each other. It all works because

  • the 'order by' clauses are used to bring the data back as a specific sorted list
  • the ROW_NUMBER function allocates sequential numbers to each row of data
  • the RANK function allocates a value to each block of data matching the order criteria

with AlphaRank(Rank, RowNumber, AlphaKey) as (
select
RANK() over (order by AlphaKey) as Rank
, ROW_NUMBER() over (order by AlphaKey) as RowNumber
, AlphaKey
from @AlphaList
)
select AlphaKey
from AlphaRank
where Rank=RowNumber

A
B
C
D
E

Practical Example

This process was used to extract missing data from tables to enable an Item dimension table to be back filled from invoice/credit fact tables. A query was run to extract all the Item numbers that were missing from the Item table, for example:

select ItemNumber, UnitCost into MissingItems from SalesInvoices where ItemNumber not in (select ItemNumber from dimItem)

This produces a table containing all the missing item numbers together with their unit cost. A similar query was run against the 'SalesCredits' table to obtain further rows of data. The extracted data could look something like this:

777 10.10
777 11.11
777 12.12
888 13.13
888 14.14
888 15.15
999 16.16
999 17.17
999 18.18

The following code could then be run to deduplicate the data and, in addition, obtain the lowest unit price for the item number.

with GetMissingItems(Rank, RowNumber, ItemNumber, UnitCost) as (
select
RANK() over (order by ItemNumber, UnitCost) as Rank
, ROW_NUMBER() over (order by ItemNumber, UnitCost) as RowNumber
, ItemNumber , UnitCost
from MissingItems
)
select ItemNumber , UnitCost
from GetMissingItems
where Rank=RowNumber

777 10.10
888 13.13
999 16.16

Conclusion

This is a simple process to solve one of the most complicated tasks required of a DBA. It is also fairly well performing and my own test was to extract 8314 item numbers from an Invoice table containing more than 63 million rows of data in less than 3 minutes. It's simplicity lies in the fact that you only need to add a couple of lines to a SQL select statement together with the appropriate ordering in the ROW_NUMBER and RANK function to get a workable solution. If you write the query so it returns a list of primary keys then you have a method for deleting unwanted rows.

 

 

 

Total article views: 12287 | Views in the last 30 days: 12
 
Related Articles
FORUM

Selecting a Value of the Order Within a Group

Select To Indicate Order Number Within a Group

FORUM

Order by numbers in proper sequence

Ordering by numbers in proper sequence order

BLOG

ORDER BY the numbers

Have you ever needed to order by a calculated column? You might have written it something like this...

FORUM

Paging Problem SQL 2005 using row_number and order by

Paging Problem SQL 2005 using row_number and order by

FORUM

Order by

Ordering rows by column with fullstops in between numbers

Tags
deduplicate    
ranking    
row_number()    
t-sql    
 
Contribute

Join the most active online SQL Server Community

SQL knowledge, delivered daily, free:

Email address:  

You make SSC a better place

As a member of SQLServerCentral, you get free access to loads of fresh content: thousands of articles and SQL scripts, a library of free eBooks, a weekly database news roundup, a great Q & A platform… And it’s our huge, buzzing community of SQL Server Professionals that makes it such a success.

Join us!

Steve Jones
Editor, SQLServerCentral.com

Already a member? Jump in:

Email address:   Password:   Remember me: Forgotten your password?
Steve Jones