SQLServerCentral Article

Using Ranking Functions to Deduplicate Data

,

The introduction of ranking functions in SQL Server 2005 allowed the generation of listings with generated numbers based on sort orders providing keys such as row numbers and rank. These can be used for the deduplication of data in a simple way.

Consider the following simple example of a list of characters:

A
A
B
B
C
D
D
E

Let's put this into a table. This example creates a table variable.

declare @AlphaList table (AlphaKey char);
insert into @AlphaList(AlphaKey) values ('A');
insert into @AlphaList(AlphaKey) values ('A');
insert into @AlphaList(AlphaKey) values ('B');
insert into @AlphaList(AlphaKey) values ('B');
insert into @AlphaList(AlphaKey) values ('C');
insert into @AlphaList(AlphaKey) values ('D');
insert into @AlphaList(AlphaKey) values ('D');
insert into @AlphaList(AlphaKey) values ('E'); select AlphaKey from @AlphaList order by 1;

If we now use the ROW_NUMBER function we can get a listing of @AlphaList table with a sequential list of row numbers.

select 
ROW_NUMBER() over (order by AlphaKey) as RowNumber
, AlphaKey
from @AlphaList;

1 A
2 A
3 B
4 B
5 C
6 D
7 D
8 E

Now if we add RANK onto this query, we get an additional column ranking the data. For example, the values of 'A' are assigned a rank of 1 as they are equal, the values of B are assigned the next rank, C the next, etc. This is a simple way of assiging a value by class or category depending on how we order the data - the rank value of 3 simply means that all data with this rank number is the same.

select 
RANK() over (order by AlphaKey) as Rank
, ROW_NUMBER() over (order by AlphaKey) as RowNumber
, AlphaKey
from @AlphaList;


1 1 A
1 2 A
3 3 B
3 4 B
5 5 C
6 6 D
6 7 D
8 8 E

From this you can see that the first rows containing the AlphaKey column values of 'A', 'B' and 'D' have the same values generated by the RANK and ROW_NUMBER functions giving us a very simple way of deduplicating the data.

What is happening is that the data is now being ordered and classified or categorised simultaneously so that equal values are adjacent to each other. It all works because

  • the 'order by' clauses are used to bring the data back as a specific sorted list
  • the ROW_NUMBER function allocates sequential numbers to each row of data
  • the RANK function allocates a value to each block of data matching the order criteria

with AlphaRank(Rank, RowNumber, AlphaKey) as (
select
RANK() over (order by AlphaKey) as Rank
, ROW_NUMBER() over (order by AlphaKey) as RowNumber
, AlphaKey
from @AlphaList
)
select AlphaKey
from AlphaRank
where Rank=RowNumber

A
B
C
D
E

Practical Example

This process was used to extract missing data from tables to enable an Item dimension table to be back filled from invoice/credit fact tables. A query was run to extract all the Item numbers that were missing from the Item table, for example:

select ItemNumber, UnitCost into MissingItems from SalesInvoices where ItemNumber not in (select ItemNumber from dimItem)

This produces a table containing all the missing item numbers together with their unit cost. A similar query was run against the 'SalesCredits' table to obtain further rows of data. The extracted data could look something like this:

777 10.10
777 11.11
777 12.12
888 13.13
888 14.14
888 15.15
999 16.16
999 17.17
999 18.18

The following code could then be run to deduplicate the data and, in addition, obtain the lowest unit price for the item number.

with GetMissingItems(Rank, RowNumber, ItemNumber, UnitCost) as (
select
RANK() over (order by ItemNumber, UnitCost) as Rank
, ROW_NUMBER() over (order by ItemNumber, UnitCost) as RowNumber
, ItemNumber , UnitCost
from MissingItems
)
select ItemNumber , UnitCost
from GetMissingItems
where Rank=RowNumber

777 10.10
888 13.13
999 16.16

Conclusion

This is a simple process to solve one of the most complicated tasks required of a DBA. It is also fairly well performing and my own test was to extract 8314 item numbers from an Invoice table containing more than 63 million rows of data in less than 3 minutes. It's simplicity lies in the fact that you only need to add a couple of lines to a SQL select statement together with the appropriate ordering in the ROW_NUMBER and RANK function to get a workable solution. If you write the query so it returns a list of primary keys then you have a method for deleting unwanted rows.

 

 

 

Rate

4.62 (100)

You rated this post out of 5. Change rating

Share

Share

Rate

4.62 (100)

You rated this post out of 5. Change rating