Grouping by pairs, order agnostic

  • I'm guessing there's probably a simple and elegant way to accomplish this in T-SQL, just not sure how.

    I have a large quantity of information regarding travel routes - lots of pairs involving origin and destination IDs. My end goal is to be able to group by route, regardless of direction. This would mean that a trip from location 1 to location 2, and a return trip from 2 to 1, would both fall into the same group.

    My best idea for accomplishing this so far is to create a table containing origins and destinations in both orders, and then assign a route ID for each:

    Origin | Destination | RouteID

    -------+-------------+---------

    1 | 2 | 1

    2 | 1 | 1

    2 | 3 | 2

    3 | 2 | 2

    1 | 3 | 3

    3 | 1 | 3

    Populating the origin and destination columns via a cartesian join is simple, but calculating the RouteID column efficiently is eluding me. Is there a better way to do this other than using loops?

    _______________________________________________________________________________________________
    Bob Pusateri
    Microsoft Certified Master: SQL Server 2008

    Blog: The Outer Join[/url]
    Twitter: @SQLBob

  • Can you provide some ddl and sample data? It sounds like your origin and destination columns are derived? If your pattern of data is consistent maybe you can just a case expression?

    case when Origin < Destination then Origin else Destination end as RouteID

    I suspect it isn't really quite that straightforward but I bet there is something in the base table to indicate if it is return trip or not.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Sean Lange (2/17/2014)


    Can you provide some ddl and sample data? It sounds like your origin and destination columns are derived? If your pattern of data is consistent maybe you can just a case expression?

    case when Origin < Destination then Origin else Destination end as RouteID

    I suspect it isn't really quite that straightforward but I bet there is something in the base table to indicate if it is return trip or not.

    Other than the fact that "as RouteID" wouldn't be valid in a GROUP BY clause, I think Sean's proposed solution hits the mark.

    You might also want to take a look at this article: Departures from Origins and Arrivals at Destinations [/url]. While it doesn't address this specific problem, you might find in it some interesting solutions to other problems you'll likely encounter.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • Ooops! Silly me, I just realized why that won't work. Try this (assuming this is what Sean meant):

    WITH SampleData (Origin, Destination) AS

    (

    SELECT 1, 2

    UNION ALL SELECT 2, 1

    UNION ALL SELECT 2, 3

    UNION ALL SELECT 3, 2

    UNION ALL SELECT 1, 3

    UNION ALL SELECT 3, 1

    UNION ALL SELECT 1, 4

    UNION ALL SELECT 4, 1

    )

    SELECT d1=MIN(Origin), d2=MAX(Destination)

    FROM SampleData

    GROUP BY CASE WHEN Origin < Destination THEN Origin ELSE Destination END;

    But I think this will work (combine origin and destination in whatever fashion makes sense for you):

    WITH SampleData (Origin, Destination) AS

    (

    SELECT 1, 2

    UNION ALL SELECT 2, 1

    UNION ALL SELECT 2, 3

    UNION ALL SELECT 3, 2

    UNION ALL SELECT 1, 3

    UNION ALL SELECT 3, 1

    UNION ALL SELECT 1, 4

    UNION ALL SELECT 4, 1

    )

    SELECT d1=MIN(Origin), d2=MAX(Destination)

    FROM SampleData a

    CROSS APPLY

    (

    SELECT d1=RIGHT(10000+Origin, 4)+RIGHT(10000+Destination, 4)

    ,d2=RIGHT(10000+Destination, 4)+RIGHT(10000+Origin, 4)

    ) b

    GROUP BY CASE WHEN Origin < Destination THEN d1 ELSE d2 END;

    I'd assume that your origins/destinations are either the 3 or 4 character airport codes, so a straight concatenate should do the trick.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • Thanks for all the tips! I came up with a multi-step solution that so far seems to get me what I need, and I'm happy to share. I'll do a blog post with some better examples and sample data shortly.

    So for a table "TripData" with columns "Origin" and "Destination", we can see all trips (directionally) with the following query:

    select distinct Origin, Destination

    from TripData;

    To express these trips uniformly regardless of direction, some case statements are necessary to do some swapping of their order. Origin and Destination are both integers, so some simple numeric comparisons do the trick. Distinct ensures we see each route only once.

    select distinct

    case when Origin<=Destination then Origin

    when Origin>Destination then Destination

    end as PtA,

    case when Origin<=Destination then Destination

    when Origin>Destination then Origin

    end as PtB

    from TripData;

    Now that we have distinct routes, we can assign each a RouteID.

    SELECT a.PtA, a.PtB, ROW_NUMBER() over (order by a.PtA, a.PtB) AS RouteID

    into dbo.RouteIDSTG

    from (

    select distinct

    case when Origin<=Destination then Origin

    when Origin>Destination then Destination

    end as PtA,

    case when Origin<=Destination then Destination

    when Origin>Destination then Origin

    end as PtB

    from TripData

    )a;

    With that done, we can create a lookup table that contains each trip (directionally) along with the route ID for easy joining.

    WITH AllPairs as (

    SELECT Origin, Destination,

    Case when Origin<=Destination then Origin

    when Origin>Destination then Destination

    end as PtA,

    case when Origin<=Destination then Destination

    when Origin>Destination then Origin

    end as PtB

    from TripData

    )

    SELECT distinct ap.Origin, ap.Destination, r.RouteID

    into RouteIDMaster

    from AllPairs ap

    inner join dbo.RouteIDSTG r on r.PtA = ap.PtA and r.PtB = ap.PtB;

    Now the TripData table can be joined to RouteIDMaster on Origin and Destination columns and grouped by RouteID.

    _______________________________________________________________________________________________
    Bob Pusateri
    Microsoft Certified Master: SQL Server 2008

    Blog: The Outer Join[/url]
    Twitter: @SQLBob

  • Looks like an awful lot of work and an awful lot of DISTINCTs. Wouldn't this be simpler?

    WITH SampleData (Origin, Destination) AS

    (

    SELECT 1, 2

    UNION ALL SELECT 2, 1

    UNION ALL SELECT 2, 3

    UNION ALL SELECT 3, 2

    UNION ALL SELECT 1, 3

    UNION ALL SELECT 3, 1

    UNION ALL SELECT 1, 4

    UNION ALL SELECT 4, 1

    )

    SELECT Origin, Destination, RouteID

    FROM

    (

    SELECT d1=MIN(Origin), d2=MAX(Destination), RouteID=ROW_Number() OVER (ORDER BY (SELECT NULL))

    FROM SampleData a

    CROSS APPLY

    (

    SELECT d1=RIGHT(10000+Origin, 4)+RIGHT(10000+Destination, 4)

    ,d2=RIGHT(10000+Destination, 4)+RIGHT(10000+Origin, 4)

    ) b

    GROUP BY CASE WHEN Origin < Destination THEN d1 ELSE d2 END

    ) a

    CROSS APPLY

    (

    SELECT Origin, Destination

    FROM SampleData

    WHERE d1 IN (Origin, Destination) AND d2 IN (Origin, Destination)

    ) b;


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • Your code is definitely simpler. But the performance (at least on my machine) is lacking. I just stopped it after letting it run for 10 minutes. My table has about 1.2M rows in it. I'll try tweaking it and see what develops!

    _______________________________________________________________________________________________
    Bob Pusateri
    Microsoft Certified Master: SQL Server 2008

    Blog: The Outer Join[/url]
    Twitter: @SQLBob

  • With some DDL including indexing on your table it might offer a clue.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • You're right! That was kind of doggy wasn't it? Try this instead:

    CREATE TABLE dbo.OriginsDestinations

    (

    Origin INT

    ,Destination INT

    ,PRIMARY KEY (Origin, Destination)

    );

    WITH Tally (n) AS

    (

    SELECT TOP 500000 ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

    FROM sys.all_columns a CROSS JOIN sys.all_columns b

    )

    INSERT INTO dbo.OriginsDestinations

    SELECT n, n+1

    FROM Tally

    UNION ALL

    SELECT n+1, n

    FROM Tally;

    SET STATISTICS TIME ON;

    WITH OD AS

    (

    SELECT d1=MIN(Origin), d2=MAX(Destination), RouteID=ROW_Number() OVER (ORDER BY (SELECT NULL))

    FROM dbo.OriginsDestinations a

    CROSS APPLY

    (

    SELECT d1=RIGHT(1000000+Origin, 6)+RIGHT(1000000+Destination, 6)

    ,d2=RIGHT(1000000+Destination, 6)+RIGHT(1000000+Origin, 6)

    ) b

    GROUP BY CASE WHEN Origin < Destination THEN d1 ELSE d2 END

    )

    SELECT d1, d2, RouteID

    INTO #Temp

    FROM OD

    UNION ALL

    SELECT d2, d1, RouteID

    FROM OD

    SET STATISTICS TIME OFF;

    GO

    DROP TABLE #Temp;

    I get these timing results:

    SQL Server Execution Times:

    CPU time = 8454 ms, elapsed time = 12722 ms.

    (1000000 row(s) affected)


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • Thanks Dwain. You were right in what you thought I was going for. Seems that this issue is in good hands now.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • I could be wrong for simplifying it that much but I think we could do this in one pull from [TripData] using the idea from Sean.

    select

    *, -- columns of interest

    dense_rank() over(

    order by

    case when Origin <= Destination then Origin else Destination end,

    case when Origin <= Destination then Destination else Origin end

    ) as rnk -- route_id

    from

    TripData;

    The issue I see here is how to create a POC index (partition / ordering / covering) to avoid the sorting needed to support the ranking function. May be adding calculated columns (case expressions) and then adding the index.

  • Sean Lange (2/18/2014)


    Thanks Dwain. You were right in what you thought I was going for. Seems that this issue is in good hands now.

    Thanks for letting me know that Sean. Sometimes I feel a bit sheepish after posting my interpretation of what someone else is saying, in case I was wrong. Enthusiasm sometimes gets the best of me.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply