Left Join performance issue

  • Hi all

    SELECT TableA.LINE_ID
    FROM TableA
    LEFT OUTER JOIN TableB
      ON TableA.RecName = TableB.RecName
      AND TableA.RecType = TableB.RecType
      AND TableA.Price = TableB.Price
      AND TableA.Status = 'ACTIVE'
      AND TableB.Status = 'ACTIVE'
    WHERE TableB.LINE_ID IS NULL

    When I run this SQL without the line "AND TableB.Status = 'ACTIVE'", the query executes in about 2 seconds. When I include it, though, it takes about 7 minutes to execute!

    Any ideas what the reason might be? I suspect it might be an indexing problem.

    J

  • Hi, sounds like a full table scan is being performed. Indexes on all columns used to join / filter will improve things.
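
    For what it's worth, a minimal sketch of that suggestion is below. The index names are made up, and you would repeat the pattern for the remaining join / filter columns on both tables.

    -- Hypothetical index names; one nonclustered index per join / filter column.
    CREATE INDEX IX_TableB_RecName ON TableB (RecName)
    CREATE INDEX IX_TableB_RecType ON TableB (RecType)
    CREATE INDEX IX_TableB_Price   ON TableB (Price)
    CREATE INDEX IX_TableB_Status  ON TableB (Status)
    -- ...and likewise for TableA.RecName, RecType, Price and Status.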

    Regards,

    Andy Jones


    Nope, that does not work. I put indexes on all the columns used in the join and the query still takes about 7 minutes.

    J

  • You should look at your execution plan to see what it is doing during that step.
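
    If it helps, one way to look at the plan without actually executing the query is a quick SHOWPLAN session. This is only a generic sketch, using the query from the first post:

    -- Shows the estimated plan as text (SQL Server) without running the statement.
    SET SHOWPLAN_TEXT ON
    GO

    SELECT TableA.LINE_ID
    FROM TableA
    LEFT OUTER JOIN TableB
      ON TableA.RecName = TableB.RecName
      AND TableA.RecType = TableB.RecType
      AND TableA.Price = TableB.Price
      AND TableA.Status = 'ACTIVE'
      AND TableB.Status = 'ACTIVE'
    WHERE TableB.LINE_ID IS NULL
    GO

    SET SHOWPLAN_TEXT OFF
    GO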

    Try this and see what happens.

    SELECT TableA.LINE_ID
    FROM TableA
    LEFT OUTER JOIN
      (SELECT RecName, RecType, Price, Line_ID
       FROM TableB
       WHERE TableB.Status = 'ACTIVE') AS TableB
      ON TableA.RecName = TableB.RecName
      AND TableA.RecType = TableB.RecType
      AND TableA.Price = TableB.Price
      AND TableA.Status = 'ACTIVE'
    WHERE TableB.LINE_ID IS NULL

    The execution plan indicates that 86% of the processing is in the "Clustered Index Scan" on both TableA and TableB, those clustered indexes being on the LINE_ID columns of both tables.

    Should I change the table design in some way to improve this?

    I tried your script and it is fast (2 seconds).

    Thanks

    J

    No, I don't think a table design change or index will fix it. The problem, I believe, comes from the fact that TableB must be ACTIVE and TableA must also be. In an inner join you could simply set them equal. With the outer join, however, the ON clause has to check both the A and B status fields, and I believe it is doing this at the same time and slowing you down. The subquery gets B handled in a specific order of operation and takes the poor decision out of the loop.

  • I suppose what Antares said might be possible, but it seems to me more likely that statistical information on your tables is suggesting to the query optimizer that a significant number of rows in both tables will meet your FROM conditions; consequently, it figures a table scan (or - in this case - a clustered index scan, which is equivalent to a table scan) is the most effective way to obtain your data. Alternatively, given the amount of data, the optimizer may feel that performing a FILTER operation (which it would have to do if it used any other index besides the clustered index) would be more expensive than doing the clustered index scan and filtering the rows during the scan. Try Antares' suggestion first, and if it doesn't improve things, you might want to look at:

    • Placing your clustered index on a different column - preferably (in this case) one of your most selective join columns. This might encourage the query optimizer to do a clustered index seek, as opposed to a clustered index scan.
    • Using the FORCE ORDER query hint to force the optimizer to generate a plan that more closely follows the structure of Antares' query. The optimizer may think too hard when optimizing Antares' query and consequently develop a plan that doesn't take advantage of what Antares is trying to do; FORCE ORDER might resolve this.
    • Using INDEX hints to force the optimizer to use a different index - perhaps one of the indexes on one of your join columns.

    I generally discourage the use of hints where possible, but sometimes they are necessary. Personally, I would first try placing the clustered index on a different column - again, one of the more selective columns in your join, such as RecName or Price. Usually I place clustered indexes on less selective columns, but in this case you have to consider the logic of the optimizer. It's choosing the clustered index because either (a) it believes that the number of rows to be returned is sufficiently large that it might as well scan the whole table, or (b) it believes the cost of performing a separate FILTER operation is sufficiently large that it would be less expensive to perform the clustered index scan and do the filtering there, even though the clustered index doesn't cover any of the columns in your join.

    By placing the clustered index on one of your more selective columns, you may (a) help the optimizer recognize that a clustered index seek will cost less than a scan (something you can't always say when comparing nonclustered index seeks vs. scans), and (b) give it a good head start if it chooses to filter the columns as part of the clustered index seek, because the seek itself will filter out a large portion of the rows. If your clustered index is supporting a primary key, this will require you to drop and rebuild the primary key, and you must consider this change in relation to all of the other queries in your system that might be affected by it.
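
    For reference, the two hints would be applied roughly as in the sketch below. This is only an illustration against the tables in this thread; the index name IX_TableB_RecName is hypothetical, so substitute whichever index actually exists.

    -- Sketch only: table-level INDEX hint plus query-level FORCE ORDER hint.
    SELECT TableA.LINE_ID
    FROM TableA
    LEFT OUTER JOIN TableB WITH (INDEX(IX_TableB_RecName))  -- hypothetical index name
      ON TableA.RecName = TableB.RecName
      AND TableA.RecType = TableB.RecType
      AND TableA.Price = TableB.Price
      AND TableA.Status = 'ACTIVE'
      AND TableB.Status = 'ACTIVE'
    WHERE TableB.LINE_ID IS NULL
    OPTION (FORCE ORDER)  -- join order is taken exactly as written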

    Matthew Burr


    I think the problem could be to do with the position of the TableA.Status = 'ACTIVE' clause.

    If you only want results from TableA that are active, move it to the WHERE clause.

    Furthermore, is there an index on RecName, RecType, Price and Status? If not, adding one will allow the use of a covering index (a sketch follows the query below).

    SELECT TableA.LINE_ID
    FROM TableA
    LEFT OUTER JOIN TableB
      ON TableA.RecName = TableB.RecName
      AND TableA.RecType = TableB.RecType
      AND TableA.Price = TableB.Price
      AND TableB.Status = 'ACTIVE'
    WHERE TableB.LINE_ID IS NULL
    AND TableA.Status = 'ACTIVE'
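
    As a rough sketch of the covering index idea (the index names are made up, and the column order may need adjusting to your data), something along these lines on each table would let the optimizer answer the join and the status filter from the index alone:

    -- Hypothetical index names; on SQL Server 2000 a "covering" index simply
    -- lists every column the query touches as key columns.
    CREATE INDEX IX_TableB_Covering
        ON TableB (RecName, RecType, Price, Status, LINE_ID)

    CREATE INDEX IX_TableA_Covering
        ON TableA (Status, RecName, RecType, Price, LINE_ID)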

    Do these tables have lots of columns?

    Simon Sabin

    Co-author of SQL Server 2000 XML Distilled

    http://www.amazon.co.uk/exec/obidos/ASIN/1904347088

  • Matthew, I cannot change the clustered index to another column as the line_id column is the only true incremental column in this table. I am not familiar with the use of hints but I found your comments quite interesting. Will have to do a bit of reading on my own though.

    Simon, thanks, you got it spot on. The previous solution actually gave incorrect results: I got records back where the Status on TableA was <> 'ACTIVE', which is wrong.

    I might have given an unclear description of my requirements.

    Anyway, moving "TableA.Status = 'ACTIVE'" out to the where clause solved the problem and the speed is quite reasonable as well.

    Both the tables have got about 20 columns.

    Thanks

    J

    The comment about the line_id column being the only true incremental column in the table is interesting. This is not a requirement for a clustered index; i.e. a clustered index does not need to be placed on an incremental column. I suspect that you have placed it there in order to avoid page splits, by ensuring that all new inserts - since they are incremental - will be placed in the last page of the table. That is one reason to consider placing the clustered index on such a column. On the other hand, if you ever do deletes from this table, you could be losing a lot of space, since the space previously used by the now deleted rows will not be reused. Also, you have to balance negative performance on inserts against negative performance on selects and determine which you can afford more. You might find it more reasonable to place the clustered index elsewhere if you can recognize significant performance gains in a larger portion of your queries. Consider also that unless you are selecting large ranges of rows based on line_id (e.g. using queries that search for "line_id BETWEEN x AND y"), you may not be recognizing all of the potential performance gains that a clustered index can truly provide.

    Ultimately, you may be losing performance by using a clustered index primarily to support ordered inserts of your data (i.e. all data being added to the end of the table), which in turn may be costing you data space, since any space left when data is deleted may not be getting reused. This is in contradistinction to using the clustered index in a possibly more effective way and saving yourself from page splits with other options, such as FILLFACTOR. If you haven't yet - and you probably have - you might want to revisit this indexing decision. Of course, indexing decisions are ultimately a very personal thing, based on the data and activity in your database, which - obviously - you know far better than anyone outside of your company; so take any advice with a "grain of salt."
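
    Purely as an illustration (the constraint and index names here are hypothetical, and you would want to try this on a copy of the table first), moving the clustered index to a join column while keeping line_id as a nonclustered primary key might look roughly like this:

    -- Hypothetical names throughout; a sketch of the idea, not a recommendation.
    ALTER TABLE TableA DROP CONSTRAINT PK_TableA            -- assumes the PK is named PK_TableA

    CREATE CLUSTERED INDEX IX_TableA_RecName
        ON TableA (RecName)
        WITH FILLFACTOR = 80                                -- leave free space to soften page splits

    ALTER TABLE TableA ADD CONSTRAINT PK_TableA
        PRIMARY KEY NONCLUSTERED (LINE_ID)                  -- line_id stays unique, just not clustered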

    Matthew Burr


  • joepin,

    I'm coming late to the party so forgive me if I'm covering old material...

    It looks like you are blurring the line between the JOIN and WHERE clauses. The ON clause is only used to determine how the outer join is supposed to select rows from TableB. By imposing the condition "AND TableB.Status = 'ACTIVE'" there, you are implicitly converting the outer join into an inner join.
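
    To make the ON-versus-WHERE distinction concrete, here is an illustrative pair of queries (simplified to a single join column; these are not from the original posts). A filter on the preserved table behaves very differently depending on where it is placed:

    -- Filter in the ON clause of a LEFT JOIN: every TableA row is still returned;
    -- the condition only controls which TableB rows are allowed to match.
    SELECT TableA.LINE_ID
    FROM TableA
    LEFT OUTER JOIN TableB
      ON TableA.RecName = TableB.RecName
      AND TableA.Status = 'ACTIVE'       -- does NOT remove inactive TableA rows

    -- Filter in the WHERE clause: rows are removed after the join,
    -- so inactive TableA rows really are excluded from the result.
    SELECT TableA.LINE_ID
    FROM TableA
    LEFT OUTER JOIN TableB
      ON TableA.RecName = TableB.RecName
    WHERE TableA.Status = 'ACTIVE'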

    HTH

    Steve Hendricks

    MCSD, MCDBA

    AFS Consulting Group

    shendricks@afsconsulting.com

    (949) 588-9800 x15
