Grouping by pairs, order agnostic

Question

Grouping by pairs, order agnostic

SQL_Bob

SSC Eights!

Points: 813
More actions
February 17, 2014 at 11:36 am

#304159

I'm guessing there's probably a simple and elegant way to accomplish this in T-SQL, just not sure how.
I have a large quantity of information regarding travel routes - lots of pairs involving origin and destination IDs. My end goal is to be able to group by route, regardless of direction. This would mean that a trip from location 1 to location 2, and a return trip from 2 to 1, would both fall into the same group.
My best idea for accomplishing this so far is to create a table containing origins and destinations in both orders, and then assign a route ID for each:
Origin | Destination | RouteID
-------+-------------+---------
1 | 2 | 1
2 | 1 | 1
2 | 3 | 2
3 | 2 | 2
1 | 3 | 3
3 | 1 | 3
Populating the origin and destination columns via a cartesian join is simple, but calculating the RouteID column efficiently is eluding me. Is there a better way to do this other than using loops?
_______________________________________________________________________________________________
Bob Pusateri
Microsoft Certified Master: SQL Server 2008
Blog: The Outer Join[/url]
Twitter: @SQLBob

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply

Sean Lange SSC Guru Points: 286573 More actions · Answer 1

Can you provide some ddl and sample data? It sounds like your origin and destination columns are derived? If your pattern of data is consistent maybe you can just a case expression?

case when Origin < Destination then Origin else Destination end as RouteID

I suspect it isn't really quite that straightforward but I bet there is something in the base table to indicate if it is return trip or not.

_______________________________________________________________

Need help? Help us help you.

Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

Dwain Camps SSC Guru Points: 86908 More actions · Answer 2

Sean Lange (2/17/2014)
Can you provide some ddl and sample data? It sounds like your origin and destination columns are derived? If your pattern of data is consistent maybe you can just a case expression?
case when Origin < Destination then Origin else Destination end as RouteID
I suspect it isn't really quite that straightforward but I bet there is something in the base table to indicate if it is return trip or not.

Other than the fact that "as RouteID" wouldn't be valid in a GROUP BY clause, I think Sean's proposed solution hits the mark.

You might also want to take a look at this article: Departures from Origins and Arrivals at Destinations [/url]. While it doesn't address this specific problem, you might find in it some interesting solutions to other problems you'll likely encounter.

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

Dwain Camps SSC Guru Points: 86908 More actions · Answer 3

Ooops! Silly me, I just realized why that won't work. Try this (assuming this is what Sean meant):

WITH SampleData (Origin, Destination) AS

(

SELECT 1, 2

UNION ALL SELECT 2, 1

UNION ALL SELECT 2, 3

UNION ALL SELECT 3, 2

UNION ALL SELECT 1, 3

UNION ALL SELECT 3, 1

UNION ALL SELECT 1, 4

UNION ALL SELECT 4, 1

)

SELECT d1=MIN(Origin), d2=MAX(Destination)

FROM SampleData

GROUP BY CASE WHEN Origin < Destination THEN Origin ELSE Destination END;

But I think this will work (combine origin and destination in whatever fashion makes sense for you):

WITH SampleData (Origin, Destination) AS

(

SELECT 1, 2

UNION ALL SELECT 2, 1

UNION ALL SELECT 2, 3

UNION ALL SELECT 3, 2

UNION ALL SELECT 1, 3

UNION ALL SELECT 3, 1

UNION ALL SELECT 1, 4

UNION ALL SELECT 4, 1

)

SELECT d1=MIN(Origin), d2=MAX(Destination)

FROM SampleData a

CROSS APPLY

(

SELECT d1=RIGHT(10000+Origin, 4)+RIGHT(10000+Destination, 4)

,d2=RIGHT(10000+Destination, 4)+RIGHT(10000+Origin, 4)

) b

GROUP BY CASE WHEN Origin < Destination THEN d1 ELSE d2 END;

I'd assume that your origins/destinations are either the 3 or 4 character airport codes, so a straight concatenate should do the trick.

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

SQL_Bob SSC Eights! Points: 813 More actions · Answer 4

Thanks for all the tips! I came up with a multi-step solution that so far seems to get me what I need, and I'm happy to share. I'll do a blog post with some better examples and sample data shortly.

So for a table "TripData" with columns "Origin" and "Destination", we can see all trips (directionally) with the following query:

select distinct Origin, Destination

from TripData;

To express these trips uniformly regardless of direction, some case statements are necessary to do some swapping of their order. Origin and Destination are both integers, so some simple numeric comparisons do the trick. Distinct ensures we see each route only once.

select distinct

case when Origin<=Destination then Origin

when Origin>Destination then Destination

end as PtA,

case when Origin<=Destination then Destination

when Origin>Destination then Origin

end as PtB

from TripData;

Now that we have distinct routes, we can assign each a RouteID.

SELECT a.PtA, a.PtB, ROW_NUMBER() over (order by a.PtA, a.PtB) AS RouteID

into dbo.RouteIDSTG

from (

select distinct

case when Origin<=Destination then Origin

when Origin>Destination then Destination

end as PtA,

case when Origin<=Destination then Destination

when Origin>Destination then Origin

end as PtB

from TripData

)a;

With that done, we can create a lookup table that contains each trip (directionally) along with the route ID for easy joining.

WITH AllPairs as (

SELECT Origin, Destination,

Case when Origin<=Destination then Origin

when Origin>Destination then Destination

end as PtA,

case when Origin<=Destination then Destination

when Origin>Destination then Origin

end as PtB

from TripData

)

SELECT distinct ap.Origin, ap.Destination, r.RouteID

into RouteIDMaster

from AllPairs ap

inner join dbo.RouteIDSTG r on r.PtA = ap.PtA and r.PtB = ap.PtB;

Now the TripData table can be joined to RouteIDMaster on Origin and Destination columns and grouped by RouteID.

_______________________________________________________________________________________________
Bob Pusateri
Microsoft Certified Master: SQL Server 2008

Blog: The Outer Join[/url]
Twitter: @SQLBob

Dwain Camps SSC Guru Points: 86908 More actions · Answer 5

Looks like an awful lot of work and an awful lot of DISTINCTs. Wouldn't this be simpler?

WITH SampleData (Origin, Destination) AS

(

SELECT 1, 2

UNION ALL SELECT 2, 1

UNION ALL SELECT 2, 3

UNION ALL SELECT 3, 2

UNION ALL SELECT 1, 3

UNION ALL SELECT 3, 1

UNION ALL SELECT 1, 4

UNION ALL SELECT 4, 1

)

SELECT Origin, Destination, RouteID

FROM

(

SELECT d1=MIN(Origin), d2=MAX(Destination), RouteID=ROW_Number() OVER (ORDER BY (SELECT NULL))

FROM SampleData a

CROSS APPLY

(

SELECT d1=RIGHT(10000+Origin, 4)+RIGHT(10000+Destination, 4)

,d2=RIGHT(10000+Destination, 4)+RIGHT(10000+Origin, 4)

) b

GROUP BY CASE WHEN Origin < Destination THEN d1 ELSE d2 END

) a

CROSS APPLY

(

SELECT Origin, Destination

FROM SampleData

WHERE d1 IN (Origin, Destination) AND d2 IN (Origin, Destination)

) b;

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

SQL_Bob SSC Eights! Points: 813 More actions · Answer 6

Your code is definitely simpler. But the performance (at least on my machine) is lacking. I just stopped it after letting it run for 10 minutes. My table has about 1.2M rows in it. I'll try tweaking it and see what develops!

_______________________________________________________________________________________________
Bob Pusateri
Microsoft Certified Master: SQL Server 2008

Blog: The Outer Join[/url]
Twitter: @SQLBob

Dwain Camps SSC Guru Points: 86908 More actions · Answer 7

With some DDL including indexing on your table it might offer a clue.

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

Dwain Camps SSC Guru Points: 86908 More actions · Answer 8

You're right! That was kind of doggy wasn't it? Try this instead:

CREATE TABLE dbo.OriginsDestinations

(

Origin INT

,Destination INT

,PRIMARY KEY (Origin, Destination)

);

WITH Tally (n) AS

(

SELECT TOP 500000 ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

FROM sys.all_columns a CROSS JOIN sys.all_columns b

)

INSERT INTO dbo.OriginsDestinations

SELECT n, n+1

FROM Tally

UNION ALL

SELECT n+1, n

FROM Tally;

SET STATISTICS TIME ON;

WITH OD AS

(

SELECT d1=MIN(Origin), d2=MAX(Destination), RouteID=ROW_Number() OVER (ORDER BY (SELECT NULL))

FROM dbo.OriginsDestinations a

CROSS APPLY

(

SELECT d1=RIGHT(1000000+Origin, 6)+RIGHT(1000000+Destination, 6)

,d2=RIGHT(1000000+Destination, 6)+RIGHT(1000000+Origin, 6)

) b

GROUP BY CASE WHEN Origin < Destination THEN d1 ELSE d2 END

)

SELECT d1, d2, RouteID

INTO #Temp

FROM OD

UNION ALL

SELECT d2, d1, RouteID

FROM OD

SET STATISTICS TIME OFF;

GO

DROP TABLE #Temp;

I get these timing results:

SQL Server Execution Times:

CPU time = 8454 ms, elapsed time = 12722 ms.

(1000000 row(s) affected)

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

Sean Lange SSC Guru Points: 286573 More actions · Answer 9

Thanks Dwain. You were right in what you thought I was going for. Seems that this issue is in good hands now.

_______________________________________________________________

Need help? Help us help you.

Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

Hunchback SSCrazy Points: 2649 More actions · Answer 10

I could be wrong for simplifying it that much but I think we could do this in one pull from [TripData] using the idea from Sean.

select

*, -- columns of interest

dense_rank() over(

order by

case when Origin <= Destination then Origin else Destination end,

case when Origin <= Destination then Destination else Origin end

) as rnk -- route_id

from

TripData;

The issue I see here is how to create a POC index (partition / ordering / covering) to avoid the sorting needed to support the ranking function. May be adding calculated columns (case expressions) and then adding the index.

Dwain Camps SSC Guru Points: 86908 More actions · Answer 11

Sean Lange (2/18/2014)
Thanks Dwain. You were right in what you thought I was going for. Seems that this issue is in good hands now.

Thanks for letting me know that Sean. Sometimes I feel a bit sheepish after posting my interpretation of what someone else is saying, in case I was wrong. Enthusiasm sometimes gets the best of me.

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St