Defining unique sets of records in a dataset

Question

Post reply

Defining unique sets of records in a dataset

bart.hermans

SSC Rookie

Points: 29
More actions
December 17, 2015 at 4:12 pm

#100598

I work for a logistics company.
Customers send packages from point A to B and pay a certain amount for that.
Tariff data is stored in the following way:
AccountNumbervarchar(9)
Productvarchar(1)
Originvarchar(3)
Destinationvarchar(3)
Pointervarchar(3)
Some sample data:
AccountNumberProductOriginDestinationPointer
1234567891ABCDEFXXX
1234567891GHIJKLYYY
1234567891MNOPQRYYY
1234567892ABCDEFZZZ
1234567892GHIJKLZZZ
1234567893ABCDEFYYY
1234567893GHIJKLYYY
There are hundreds of thousands of accounts.
Each account can have up to 36 different products (A-Z, 0-9) set up.
There are thousands of possible Origins and Destinations.
A combination of AccountNumber, Product, Origin and Destination points to a Pointer, which in turn defines the tariff a customer pays (not important for my question).
A zone is defined as a set of orgin-destination combinations under a specific AccountNumber, Product and Pointer.
These are the zones from above sample data:
AccountNumberProductPointerZoneOrigin-Destination
1234567891XXX1ABC-DEF
YYY2GHI-JKL
MNO-PQR
2ZZZ3ABC-DEF
GHI-JKL
3YYY4ABC-DEF
GHI-JKL
As you can see, Zones 3 and 4 are equal (same set of origin-destination combinations).
What we’re trying to do, is create a list of distinct zones.
In other words, we’re trying to identify unique sets of records.
What we’ve tried so far:
checksum_agg(expression)
this aggregate function accepts an integer expression and calculates a checksum over a set of records. This is actually exactly what we were looking for, but the problem with this function is that there’s a very high possibility of collisions. We converted the varchar origin and destination values to an int value and fed that to the aggregate and noticed ourselves that the collision rate was extremely high, which made the aggregate useless to us.
What we did then was use some math. Again, we converted the origin and destinations to integers and calculated the min, max, sum, avg, stdev for these sets of records and used the result to identify unique sets. This works fine, but is really slow.
The last thing we tried, was build a custom aggregate function in c#. This custom aggregate works as follows:
select AccountNumber, Product, Pointer, AggregateChecksum(hashbytes(‘md5’, Origin + ‘|’ + Destination))
This works pretty well and is the fastest solution we found up until now. We’re using it in production and we’re satisfied with the results, but we’re still looking for a faster way to identify these zone sets.

So this is my question: does anyone know of a faster/better way to identify unique sets of records?

bart.hermans SSC Rookie Points: 29 More actions · Answer 1

No replies yet, perhaps I should make this a bit simpler?

Let's say, I have this source table:

AccountNumberPointer

111111111A

111111111B

222222222C

222222222D

333333333A

333333333B

444444444D

444444444E

As a result, I want the following three tables:

UniquePointerSets (because I have 3 unique sets of pointers: A,B - C,D - D,E

IdDescription

1Description 1

2Description 2

3Description 3

UniquePointerSetDetails

IdUniquePointerSetIdPointer

11A

21B

32C

42D

53D

63E

AccountPointers

AccountNumberUniquePointerSetId

1111111111

2222222222

3333333331

4444444443

Does this make sense to anyone?

J Livingston SQL SSC Guru Points: 51550 More actions · Answer 2

please read the following document and then post back with DDL/sample data and expected results.

http://spaghettidba.com/2015/04/24/how-to-post-a-t-sql-question-on-a-public-forum

if you provide the setups, then you are far more likely to get a tried and tested response.

Happy New Year

________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day

bart.hermans SSC Rookie Points: 29 More actions · Answer 3

Ah yes, makes sense. Thx, I'll do that first thing next year 😉

Happy New Year to you too!

Steve Jones - SSC Editor SSC Guru Points: 734449 More actions · Answer 4

Not quite enough time to completely debug, but if you're on SQL 2012+, you might be able to use windowing functions and an OVER() clause to help.

A couple of things. Are there always just two points? Meaning that a combination of your fields yields a single source->destination?

If your data is structured like your sample, I might try an OVER(PARTITION BY AccountNumber, Product, Pointer) and then get a SUM() or COUNT(). I'm not sure what you need.