December 17, 2015 at 4:12 pm
I work for a logistics company.
Customers send packages from point A to B and pay a certain amount for that.
Tariff data is stored in the following way:
AccountNumbervarchar(9)
Productvarchar(1)
Originvarchar(3)
Destinationvarchar(3)
Pointervarchar(3)
Some sample data:
AccountNumberProductOriginDestinationPointer
1234567891ABCDEFXXX
1234567891GHIJKLYYY
1234567891MNOPQRYYY
1234567892ABCDEFZZZ
1234567892GHIJKLZZZ
1234567893ABCDEFYYY
1234567893GHIJKLYYY
There are hundreds of thousands of accounts.
Each account can have up to 36 different products (A-Z, 0-9) set up.
There are thousands of possible Origins and Destinations.
A combination of AccountNumber, Product, Origin and Destination points to a Pointer, which in turn defines the tariff a customer pays (not important for my question).
A zone is defined as a set of orgin-destination combinations under a specific AccountNumber, Product and Pointer.
These are the zones from above sample data:
AccountNumberProductPointerZoneOrigin-Destination
1234567891XXX1ABC-DEF
YYY2GHI-JKL
MNO-PQR
2ZZZ3ABC-DEF
GHI-JKL
3YYY4ABC-DEF
GHI-JKL
As you can see, Zones 3 and 4 are equal (same set of origin-destination combinations).
What we’re trying to do, is create a list of distinct zones.
In other words, we’re trying to identify unique sets of records.
What we’ve tried so far:
this aggregate function accepts an integer expression and calculates a checksum over a set of records. This is actually exactly what we were looking for, but the problem with this function is that there’s a very high possibility of collisions. We converted the varchar origin and destination values to an int value and fed that to the aggregate and noticed ourselves that the collision rate was extremely high, which made the aggregate useless to us.
select AccountNumber, Product, Pointer, AggregateChecksum(hashbytes(‘md5’, Origin + ‘|’ + Destination))
This works pretty well and is the fastest solution we found up until now. We’re using it in production and we’re satisfied with the results, but we’re still looking for a faster way to identify these zone sets.
So this is my question: does anyone know of a faster/better way to identify unique sets of records?
December 31, 2015 at 6:11 am
No replies yet, perhaps I should make this a bit simpler?
Let's say, I have this source table:
AccountNumberPointer
111111111A
111111111B
222222222C
222222222D
333333333A
333333333B
444444444D
444444444E
As a result, I want the following three tables:
UniquePointerSets (because I have 3 unique sets of pointers: A,B - C,D - D,E
IdDescription
1Description 1
2Description 2
3Description 3
UniquePointerSetDetails
IdUniquePointerSetIdPointer
11A
21B
32C
42D
53D
63E
AccountPointers
AccountNumberUniquePointerSetId
1111111111
2222222222
3333333331
4444444443
Does this make sense to anyone?
December 31, 2015 at 6:22 am
please read the following document and then post back with DDL/sample data and expected results.
http://spaghettidba.com/2015/04/24/how-to-post-a-t-sql-question-on-a-public-forum
if you provide the setups, then you are far more likely to get a tried and tested response.
Happy New Year
________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day
December 31, 2015 at 6:35 am
Ah yes, makes sense. Thx, I'll do that first thing next year 😉
Happy New Year to you too!
January 19, 2016 at 8:52 am
Not quite enough time to completely debug, but if you're on SQL 2012+, you might be able to use windowing functions and an OVER() clause to help.
A couple of things. Are there always just two points? Meaning that a combination of your fields yields a single source->destination?
If your data is structured like your sample, I might try an OVER(PARTITION BY AccountNumber, Product, Pointer) and then get a SUM() or COUNT(). I'm not sure what you need.
Viewing 5 posts - 1 through 4 (of 4 total)
You must be logged in to reply to this topic. Login to reply