Correlated Sub Query (RBAR WTF?)

  • Here's an oddball. My software has the following query:

    select A.fldA, A.fldB
    from tableA A
    where A.fldKey IN
    (
        select AA.fldKey
        from tableA AA
        inner join tableB B ON AA.fldKey = B.fldKey
        inner join tableC C ON B.fldChild = C.fldChild
    )
    OR A.fldKey in
    (
        select AAA.fldKey
        from tableA AAA
        inner join tableX X ON AAA.fldKey = X.fldKey
        inner join tableC C ON X.fldChild = C.fldChild
    )

    The app was hanging, and when I checked statistics I found that tableB was being read 4 million times, even though the table only contains about 100K worth of data. As a test I changed the query to read like this:

    select A.fldA, A.fldB
    from tableA A
    where A.fldKey IN
    (
        select AA.fldKey
        from tableA AA
        inner join tableB B ON AA.fldKey = B.fldKey
        inner join tableC C ON B.fldChild = C.fldChild
    )
    OR A.fldKey = 7

    I got the same results. I know how to change the query so this doesn't happen but am curious as to why the DB is reading the table so many times. It seems that the optimizer sees that the correlated sub query can be treated as an inner join. Then when you add the additional conditional (I like how that sounds) it checks the condition against every row and subsequently performs a join for every row.
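
    (For what it's worth, one common way to remove the OR entirely is to split the statement into a UNION so each branch can be optimised on its own. This is just a sketch against the same pseudo names, not necessarily the fix I'll end up using:)

    -- Sketch only: UNION removes the OR so each branch gets its own plan.
    -- Note: UNION also removes duplicate rows, which can differ slightly from the OR form.
    select A.fldA, A.fldB
    from tableA A
    where A.fldKey IN
    (
        select AA.fldKey
        from tableA AA
        inner join tableB B ON AA.fldKey = B.fldKey
        inner join tableC C ON B.fldChild = C.fldChild
    )
    UNION
    select A.fldA, A.fldB
    from tableA A
    where A.fldKey = 7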

    Any thoughts?

    Good Monday Morning

    (gets more coffee)

  • Try this:

    select A.fldA, A.fldB
    from tableA A
    where EXISTS ( select 1
                   from tableB B
                   inner join tableC C ON B.fldChild = C.fldChild
                   where A.fldKey = B.fldKey )
       OR EXISTS ( select 1
                   from tableX X
                   inner join tableC C ON X.fldChild = C.fldChild
                   where A.fldKey = X.fldKey )

    Thanks, Jack. I have lots of options for correcting the issue; I was more interested in a discussion of why the query creates the horrible performance in the first place. At first glance I didn't think it was a bad query (although I have no idea why a person would have used this methodology to get the data).

    To reiterate, the following had fine performance:

    select A.fldA, A.fldB
    from tableA A
    where A.fldKey in
    (
        -- correlated sub query here
    )

    However, simply adding the additional condition caused the performance issue:

    select A.fldA, A.fldB
    from tableA A
    where A.fldKey in
    (
        -- correlated sub query here
    ) OR A.fldKey = 7

  • You don't need tableA bulking up the two subselects:

    select A.fldA, A.fldB
    from tableA A
    where A.fldKey IN
    (
        select B.fldKey
        from tableB B
        inner join tableC C ON B.fldChild = C.fldChild
    )
    OR A.fldKey in
    (
        select X.fldKey
        from tableX X
        inner join tableC C ON X.fldChild = C.fldChild
    )


    Unfortunately, it did need tableA in the two sub queries. Since I was posting pseudo code, I didn't include the many other joins those sub queries contain. I have since modified the code as follows (pseudo):

    select myFields
    from tableA A
    inner join tableNecessary N on A.nKey = N.nKey
    left join tableB B on A.fldKey = B.fldKey
    left join tableX X on A.fldKey = X.fldKey

    Again: does anyone know why the database would have acted in this way?

  • What's the result of running the highlighted part?

    select A.fldA, A.fldB
    from tableA A
    where A.fldKey IN
    (
        select AA.fldKey
        from tableA AA
        inner join tableB B ON AA.fldKey = B.fldKey
        inner join tableC C ON B.fldChild = C.fldChild
    )
    OR A.fldKey in
    (
        select AAA.fldKey
        from tableA AAA
        inner join tableX X ON AAA.fldKey = X.fldKey
        inner join tableC C ON X.fldChild = C.fldChild
    )


    This totally confuses me. Why not post the full query, replacing the real names with fake ones wherever you want? I can't match your last piece of code with your first at all. Also, I still think you don't need tableA in your subquery, since any selection you make in the subquery can also be made in the main query.

    Also, please add some table descriptions and possibly sample data; use the link in my signature below for that. It's hard work, of course, but that way we can test our own solutions and answer your question without misunderstandings...

    Ronald Hensbergen
    Help us, help yourself... Post data so we can read and use it: http://www.sqlservercentral.com/articles/Best+Practices/61537/

  • My experience of joins in subqueries is that they always end up as nested loops. This can be very inefficient with large tables. (Check your query plans to confirm this.)

    I always avoid joins in subqueries. Joins in derived tables seem to be OK.
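
    Roughly what I mean, using the pseudo names from earlier in the thread (just a sketch, not the actual query):

    -- Join inside an IN subquery (the form I try to avoid):
    select A.fldA, A.fldB
    from tableA A
    where A.fldKey IN
    (
        select B.fldKey
        from tableB B
        inner join tableC C ON B.fldChild = C.fldChild
    )

    -- Same logic with the join moved into a derived table;
    -- distinct keeps the row count matching the IN version if B.fldKey repeats.
    select A.fldA, A.fldB
    from tableA A
    inner join
    (
        select distinct B.fldKey
        from tableB B
        inner join tableC C ON B.fldChild = C.fldChild
    ) D ON A.fldKey = D.fldKey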

    Ken is correct. I apologize that my original queries were incomplete. Here is a script which demonstrates my point:

    set statistics io OFF
    set nocount on

    CREATE TABLE #foo (
        fldID int identity(1, 1),
        fldNum int not null
    )

    -- Get a few records started
    insert into #foo(fldNum)
    select 1 as X union all
    select 1 as X union all
    select 1 as X union all
    select 1 as X union all
    select 1 as X union all
    select 1 as X union all
    select 1 as X

    -- Keep doubling the row count until we have at least 100,000 rows
    while (select count(fldID) from #foo) < 100000
        insert into #foo(fldNum) select fldID from #foo

    select count(fldID) from #foo

    set statistics io ON

    -- Get statistics on the sub query
    select top 500 fldID from #foo
    -- Results: Table #foo. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0.

    -- Get statistics on the main query and the sub query
    select A.fldID from #foo A where A.fldID IN
    (
        select top 500 B.fldID from #foo B
    )
    -- Results: Table #foo. Scan count 2, logical reads 256, physical reads 0, read-ahead reads 0.

    -- Get statistics on the second condition
    select A.fldID from #foo A where A.fldID = 1
    -- Results: Table #foo. Scan count 1, logical reads 254, physical reads 0, read-ahead reads 0.

    -- Get statistics on the whole thing
    select A.fldID from #foo A where A.fldID IN
    (
        select top 500 B.fldID from #foo B
    ) OR A.fldID = 1
    -- Results: Table #foo. Scan count 3, logical reads 256, physical reads 0, read-ahead reads 0.
    -- Table 'Worktable'. Scan count 498, logical reads 997, physical reads 0, read-ahead reads 0.
    -- Table 'Worktable'. Scan count 500, logical reads 1001, physical reads 0, read-ahead reads 0.
    -- Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0.

    You can see that adding the second conditional causes the DB to create additional work tables with more reads than the original table required. Looking at the execution plan, there are indeed many nested loops. I don't normally use this type of query, but it does exist in the application I work on.

    The point to the post is to help others see that what looks fairly straightforward can indeed cause bottlenecks.
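
    For comparison, here is a sketch of the same demo query with the sub query moved into a derived table and left joined, roughly along the lines of the fix described a few posts up (I haven't captured statistics for this version here):

    -- Sketch only: derived table + left join instead of IN ... OR;
    -- fldID is unique, so the left join cannot multiply rows.
    select A.fldID
    from #foo A
    left join ( select top 500 fldID from #foo ) B ON A.fldID = B.fldID
    where B.fldID is not null
       or A.fldID = 1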

  • I understand this is purely an academic post and that you have resolved the performance issue with the left outer joins.

    When you say "reads", do you mean index scans or physical disk reads?

    I'm quite interested to know how many records are in tableA and in the two derived datasets in the where clause. I have a sneaking suspicion that the 4 million reads you are referring to could be the record counts of tableA and the two derived datasets multiplied together, or else the number of samples in the statistics. If you update usage on those tables, does the count change?

    Along with a couple of the other posters here, I would stick my neck out and strongly suggest avoiding derived datasets in your where clause. Always try to include any such distillation in the from clause, ideally determining and distilling the recordsets as you go along. Using joins gives the query optimiser more to work with and allows far better control over performance.

    Max
