Defining unique sets of records in a dataset

  • I work for a logistics company.

    Customers send packages from point A to B and pay a certain amount for that.

    Tariff data is stored in the following way:

    AccountNumbervarchar(9)

    Productvarchar(1)

    Originvarchar(3)

    Destinationvarchar(3)

    Pointervarchar(3)

    Some sample data:

    AccountNumberProductOriginDestinationPointer

    1234567891ABCDEFXXX

    1234567891GHIJKLYYY

    1234567891MNOPQRYYY

    1234567892ABCDEFZZZ

    1234567892GHIJKLZZZ

    1234567893ABCDEFYYY

    1234567893GHIJKLYYY

    There are hundreds of thousands of accounts.

    Each account can have up to 36 different products (A-Z, 0-9) set up.

    There are thousands of possible Origins and Destinations.

    A combination of AccountNumber, Product, Origin and Destination points to a Pointer, which in turn defines the tariff a customer pays (not important for my question).

    A zone is defined as a set of orgin-destination combinations under a specific AccountNumber, Product and Pointer.

    These are the zones from above sample data:

    AccountNumberProductPointerZoneOrigin-Destination

    1234567891XXX1ABC-DEF

    YYY2GHI-JKL

    MNO-PQR

    2ZZZ3ABC-DEF

    GHI-JKL

    3YYY4ABC-DEF

    GHI-JKL

    As you can see, Zones 3 and 4 are equal (same set of origin-destination combinations).

    What we’re trying to do, is create a list of distinct zones.

    In other words, we’re trying to identify unique sets of records.

    What we’ve tried so far:

  • checksum_agg(expression)

    this aggregate function accepts an integer expression and calculates a checksum over a set of records. This is actually exactly what we were looking for, but the problem with this function is that there’s a very high possibility of collisions. We converted the varchar origin and destination values to an int value and fed that to the aggregate and noticed ourselves that the collision rate was extremely high, which made the aggregate useless to us.

  • What we did then was use some math. Again, we converted the origin and destinations to integers and calculated the min, max, sum, avg, stdev for these sets of records and used the result to identify unique sets. This works fine, but is really slow.
  • The last thing we tried, was build a custom aggregate function in c#. This custom aggregate works as follows:

    select AccountNumber, Product, Pointer, AggregateChecksum(hashbytes(‘md5’, Origin + ‘|’ + Destination))

    This works pretty well and is the fastest solution we found up until now. We’re using it in production and we’re satisfied with the results, but we’re still looking for a faster way to identify these zone sets.

  • So this is my question: does anyone know of a faster/better way to identify unique sets of records?

  • No replies yet, perhaps I should make this a bit simpler?

    Let's say, I have this source table:

    AccountNumberPointer

    111111111A

    111111111B

    222222222C

    222222222D

    333333333A

    333333333B

    444444444D

    444444444E

    As a result, I want the following three tables:

    UniquePointerSets (because I have 3 unique sets of pointers: A,B - C,D - D,E

    IdDescription

    1Description 1

    2Description 2

    3Description 3

    UniquePointerSetDetails

    IdUniquePointerSetIdPointer

    11A

    21B

    32C

    42D

    53D

    63E

    AccountPointers

    AccountNumberUniquePointerSetId

    1111111111

    2222222222

    3333333331

    4444444443

    Does this make sense to anyone?

  • please read the following document and then post back with DDL/sample data and expected results.

    http://spaghettidba.com/2015/04/24/how-to-post-a-t-sql-question-on-a-public-forum

    if you provide the setups, then you are far more likely to get a tried and tested response.

    Happy New Year

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • Ah yes, makes sense. Thx, I'll do that first thing next year 😉

    Happy New Year to you too!

  • Not quite enough time to completely debug, but if you're on SQL 2012+, you might be able to use windowing functions and an OVER() clause to help.

    A couple of things. Are there always just two points? Meaning that a combination of your fields yields a single source->destination?

    If your data is structured like your sample, I might try an OVER(PARTITION BY AccountNumber, Product, Pointer) and then get a SUM() or COUNT(). I'm not sure what you need.

  • Viewing 5 posts - 1 through 4 (of 4 total)

    You must be logged in to reply to this topic. Login to reply