Joining a large table to itself....better ideas?

  • Hello all,
    I'm creating a view to use for future queries to make things easier.  Essentially this view needs to join a very large table to itself, and I'd like to know the best ways to maintain performance.  The table has individual records for pickups and dropoffs.  The view will join them together so that the pickup and dropoff are in the same record and we can see the entire trip in one row.  Currently I'm just doing a join

    left join

    on xyz...  However, the query takes way too long to complete for just a few columns.

    Any suggestions for how to better join a table to itself to produce something like this?
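    A minimal runnable sketch of the pattern being described, using SQLite through Python; the table and column names (trip_events, trip_id, event_type, event_time) are invented for illustration only:

```python
# Hypothetical repro of the setup: one table holds separate pickup and
# dropoff rows, and a self-join pairs them into one row per trip.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE trip_events (
    trip_id    INTEGER NOT NULL,
    event_type TEXT    NOT NULL,   -- 'pickup' or 'dropoff'
    event_time TEXT    NOT NULL
);
INSERT INTO trip_events VALUES
    (1, 'pickup',  '2018-06-18 08:00'),
    (1, 'dropoff', '2018-06-18 08:25'),
    (2, 'pickup',  '2018-06-18 09:10');   -- trip 2 still in progress
""")

# LEFT JOIN keeps pickups whose dropoff hasn't arrived yet.
rows = conn.execute("""
    SELECT p.trip_id,
           p.event_time AS pickup_time,
           d.event_time AS dropoff_time
    FROM trip_events AS p
    LEFT JOIN trip_events AS d
           ON d.trip_id = p.trip_id
          AND d.event_type = 'dropoff'
    WHERE p.event_type = 'pickup'
    ORDER BY p.trip_id
""").fetchall()
print(rows)
```

    At real scale the join predicate and supporting index are what decide whether this is fast or painful, which is what the thread below turns on.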

  • tacy.highland - Monday, June 18, 2018 10:44 AM


    A view may not be able to help you much: without selectivity, you probably don't have much of a shot at a good execution plan.  But as I can't see your query, your tables, or the execution plan of what you've already tried, I'm stuck with wild-a$$ guessing...   How about providing the CREATE TABLE statement for the table, along with its indexes, and the actual text of the view?   We need a lot more detail...

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • You also may want to consider reworking the table if possible.  If a completed trip consists of both a pickup and a drop off, they could be the same record.  Information related to the drop off would be unknown until it became available.  It's not that the other way is necessarily wrong.  But one of the factors to consider in design is how might the data be queried.  Breaking out the pickup and drop off into separate rows is apparently causing you some issues.

  • Given that a matched pickup and dropoff won't change in the future, when a dropoff gets entered, have a trigger link it to its pickup (store the pickup key(s) in the dropoff row) and the pickup to its dropoff (store the dropoff key(s) in the pickup row).  Then it's a simple join with pre-existing key links.
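    A small sketch of this pre-linking idea in SQLite (via Python), with invented names: a trigger stamps the matching pickup's key onto the dropoff row and vice versa, so later queries join on a stored key instead of re-deriving the match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE trip_events (
    event_id   INTEGER PRIMARY KEY,
    trip_id    INTEGER NOT NULL,
    event_type TEXT    NOT NULL,          -- 'pickup' or 'dropoff'
    event_time TEXT    NOT NULL,
    linked_id  INTEGER                    -- key of the paired row, once known
);

CREATE TRIGGER link_dropoff AFTER INSERT ON trip_events
WHEN NEW.event_type = 'dropoff'
BEGIN
    -- point the pickup at its dropoff
    UPDATE trip_events SET linked_id = NEW.event_id
     WHERE trip_id = NEW.trip_id AND event_type = 'pickup';
    -- point the dropoff back at its pickup
    UPDATE trip_events SET linked_id =
        (SELECT event_id FROM trip_events
          WHERE trip_id = NEW.trip_id AND event_type = 'pickup')
     WHERE event_id = NEW.event_id;
END;

INSERT INTO trip_events (trip_id, event_type, event_time)
VALUES (1, 'pickup', '08:00'), (1, 'dropoff', '08:25');
""")

# The trip is now a trivial key join.
row = conn.execute("""
    SELECT p.event_time, d.event_time
    FROM trip_events p
    JOIN trip_events d ON d.event_id = p.linked_id
    WHERE p.event_type = 'pickup'
""").fetchone()
print(row)
```

    The same idea in SQL Server would be an AFTER INSERT trigger doing the two UPDATEs; the point is that the matching work is paid once, at insert time, rather than on every query.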

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • It seems I may have just needed to join on a few more key columns and that vastly improved my performance.

    Apologies for the newbie move, and thank you for all your input.

  • Although there is some good merit in a guaranteed "only 2 values" (pickup/dropoff) type of table, a proper CROSSTAB would likely blow the doors off of any view you might come up with.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
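    One reading of the CROSSTAB suggestion, sketched small in SQLite via Python (names invented): conditional aggregation folds the pickup and dropoff rows into one trip row in a single pass over the table, with no self-join at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE trip_events (
    trip_id    INTEGER NOT NULL,
    event_type TEXT NOT NULL,     -- 'pickup' or 'dropoff'
    event_time TEXT NOT NULL
);
INSERT INTO trip_events VALUES
    (1, 'pickup',  '08:00'), (1, 'dropoff', '08:25'),
    (2, 'pickup',  '09:10');
""")

# Classic crosstab: one CASE expression per pivoted column.
rows = conn.execute("""
    SELECT trip_id,
           MAX(CASE WHEN event_type = 'pickup'  THEN event_time END) AS pickup_time,
           MAX(CASE WHEN event_type = 'dropoff' THEN event_time END) AS dropoff_time
    FROM trip_events
    GROUP BY trip_id
    ORDER BY trip_id
""").fetchall()
print(rows)
```

    An open trip simply comes back with a NULL dropoff column, mirroring what the LEFT JOIN version produces.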

  • Although there is some good merit in a guaranteed "only 2 values" (pickup/dropoff) type of table, a proper CROSSTAB would likely blow the doors off of any view you might come up with.

    Don't see how that would apply here.  Please elaborate further and provide an example of what you think the result may look like.  Thanks,

  • RonKyle - Monday, June 18, 2018 2:21 PM


    NP... right after the OP posts some readily consumable data and related DDL. Might as well solve their problem while we're at it. 😉


  • Jeff Moden - Monday, June 18, 2018 2:32 PM


    Yeah, we'll need some more details. But in general, if you're trying to build a view that joins multiple large sets of data to combine them into one large view, then the best practice, beyond enforcing proper indexing, filtering, statistics and so forth, is to pre-aggregate the data ahead of time and build the view off the aggregated data. So if you want to show the entire journey of the pickup/dropoff, aggregate/summarize the data as much as possible in a nightly batch process and build the view off that result.

    For example, I have two large tables with billions of records in each. I can join these two tables together into a single view. Querying this view would likely be slow because both tables are extremely large. If I was creating this view only to further GROUP BY on it, then I would go ahead and GROUP BY on both of the tables and copy the results into one new physical table. The new table now has far fewer records than its source tables, going from billions to maybe hundreds of millions. Then build a new view off that new table, which should be significantly faster than querying the two source tables with billions of records in them.

    That's spitballing without any actual data or queries to get an idea of what the end result here is for said view.
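    A small-scale sketch of that pre-aggregation pattern in SQLite via Python (table names invented): a batch step GROUP BYs the raw rows into a much smaller persisted table, and the view the reports hit points at the summary, not the raw data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE trip_events (
    trip_id INTEGER, depot TEXT, event_type TEXT, event_time TEXT
);
INSERT INTO trip_events VALUES
    (1, 'north', 'pickup', '08:00'), (1, 'north', 'dropoff', '08:25'),
    (2, 'north', 'pickup', '09:10'), (2, 'north', 'dropoff', '09:40'),
    (3, 'south', 'pickup', '07:30'), (3, 'south', 'dropoff', '07:55');

-- nightly batch step: summarize the raw rows into a smaller physical table
CREATE TABLE depot_daily AS
SELECT depot, event_type, COUNT(*) AS events
FROM trip_events
GROUP BY depot, event_type;

-- the view end users actually query
CREATE VIEW v_depot_daily AS
SELECT depot, event_type, events FROM depot_daily;
""")

rows = conn.execute(
    "SELECT depot, event_type, events FROM v_depot_daily "
    "ORDER BY depot, event_type"
).fetchall()
print(rows)
```

    In SQL Server the batch step would typically be an agent job or an indexed view; the trade-off is the same either way: the summary is only as fresh as the last refresh.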

  • xsevensinzx - Monday, June 18, 2018 7:41 PM


    Just to be sure, I wouldn't use a view for this on a bet.


  • Jeff Moden - Tuesday, June 19, 2018 1:06 AM


    Snicker... sometimes you have no other choice. I've had situations similar to what the OP is describing, where access to the data is needed but the data is extremely large. A view is used because that's the only way most of us expose data to the end user, as opposed to granting access to the physical tables.

    Creating a crosstab, matrix, pivot, whatever is not possible in some cases because the crosstab itself contains hundreds of columns that change dynamically across hundreds of millions of records. Thus, aggregating the data vertically as much as possible and then optimizing that final model to be pushed into a crosstab is the best approach in my experience. Essentially, going from a wide table to an extremely narrow table that can be pivoted any way you want.

  • I wouldn't use a view either.  I find no use for them in an OLTP environment.  There are a couple of situations where they are useful.  I just haven't had them.

    Still don't see how a crosstab helps, and in the absence of even a suggested example not necessarily this problem, I stand by that.  It seems like all a crosstab would show is:
                     Date1   Date2   Date3
    PICKUPS            5       8       9
    DROPOFFS           5       8       9

    As a pickup inevitably involves a dropoff, those counts will always be the same.  Nothing has changed my original view that each row is a trip, with columns for the pickup and dropoff information.  The dropoff information is null until the dropoff is made.  That eliminates any need for a view, crosstab, etc.  The rows can be broken out in a view if necessary to show pickups and dropoffs separately.  Some people don't like null columns, but some people are wrong on this point.
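    A sketch of this one-row-per-trip design in SQLite via Python (names invented): the dropoff columns are simply NULL until the dropoff happens, and no self-join is ever needed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE trips (
    trip_id      INTEGER PRIMARY KEY,
    pickup_time  TEXT NOT NULL,
    dropoff_time TEXT            -- NULL while the trip is in progress
);
INSERT INTO trips (trip_id, pickup_time) VALUES (1, '08:00');
-- the dropoff arrives later as a simple UPDATE, not a second row
UPDATE trips SET dropoff_time = '08:25' WHERE trip_id = 1;
INSERT INTO trips (trip_id, pickup_time) VALUES (2, '09:10');
""")

rows = conn.execute("SELECT * FROM trips ORDER BY trip_id").fetchall()
print(rows)   # trip 2 is still open, so its dropoff_time is None
```

    If the two-rows-per-trip presentation is ever needed, a view can UNION the pickup and dropoff halves back apart; that view is cheap, unlike the self-join going the other direction.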

  • tacy.highland - Monday, June 18, 2018 10:44 AM


    I (and ISO) think your design is fundamentally wrong. In temporal data there is an interval data type (a duration with a start and an end timestamp). You're trying to construct this ISO model from what is probably a clipboard sign-in/sign-out sheet that records the two ends of a single event on separate lines of the paper form. You might want to download the Rick Snodgrass book on temporal queries in SQL; it is available for free as a PDF from the University of Arizona. The ISO convention uses half-open intervals: we know the start time of an interval, but the termination time does not exist within that interval. If the event is still ongoing, its termination time is NULL in SQL. You'll have to sit down and redo your whole table to bring it up to ISO standards.

    Please post DDL and follow ANSI/ISO standards when asking for help. 
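    The half-open convention described above can be sketched in SQLite via Python (names invented): a trip is the interval [pickup_time, dropoff_time), and an open trip has a NULL dropoff_time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE trips (
    trip_id      INTEGER PRIMARY KEY,
    pickup_time  TEXT NOT NULL,
    dropoff_time TEXT            -- NULL = still ongoing
);
INSERT INTO trips VALUES
    (1, '08:00', '08:25'),
    (2, '08:20', NULL);
""")

# Which trips are in progress at 08:25?  Trip 1 is NOT: a half-open
# interval excludes its own end point, so a trip dropped off at exactly
# 08:25 has already ended at that instant.
rows = conn.execute("""
    SELECT trip_id FROM trips
    WHERE pickup_time <= '08:25'
      AND (dropoff_time IS NULL OR dropoff_time > '08:25')
""").fetchall()
print(rows)
```

    The strict `>` (rather than `>=`) on the end point is the whole convention: adjacent intervals can share a boundary instant without overlapping.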

  • I think we are saying the same thing, even if I didn't refer to the ISO standards.  Although I have no objections to the reinforcement either.

  • xsevensinzx - Tuesday, June 19, 2018 6:01 AM


    A view could easily be much better than everyone having to write their own joins to combine the correct pair of rows.  I would still hard-link the drop-off and pick-up for efficiency, after which the join is about as efficient as it can be anyway.  You very well may not want everyone "rolling their own" way of connecting those rows.

