Performance issue with nested query

  • I have a large aggregation query on a large dataset that run forever and never return, but if I run the main part of the query by itself it returns in 5 minutes. I can explain it better with an example:

    SELECT

    *

    FROM

    (

    SELECT

    key,

    SUM(value)

    FROM

    some_table

    GROUP BY

    key

    ) s

    If I run just the code inside the virtual table, it runs in 5 min, but as soon as I simply nest it and select * from it, then it run indefinitely. Of course I need to do more than select *. The internal query does a bunch of aggregations, then the external query does additional calculations with the values.

    I did however try literally just changing it to * to troubleshoot and I still have the same performance problem.

    Any ideas what could be causing this?

  • You must have the derived table joined to something else.

    It may cause all sorts of issues which I don't see the reason to list.

    If you post the whole query and DDL for the objects involved the guess work would by much less.

    Alternatively, you may create a temp table, populate with subquery data and join it to the rest of the big query.

    Don't forget to index it appropriately.

    _____________
    Code for TallyGenerator

  • Look at the estimated query plans for each and see if the differences tell a story.

    Also run sp_whoisactive during the execution and see various things you can see with that.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Actually I did have it joined to a different table at first, but I removed that thinking that the join was the problem.

    When I do an estimated execution plan, with and without the additional table join, there are no additional recommended indexes.

    I've seen this type of behavior before and never understood it. I think last time I ended up using the temp table method you suggested, but never understood why it was needed. If nothing else I guess I'll try that again, but this is a result set with hundreds of columns and millions of rows, so seems silly to put it into a table just to pull it back out again, but we'll see.

    For a result set this large, would a temp table or a table variable be better?

  • rgp151 (12/7/2016)


    I have a large aggregation query on a large dataset that run forever and never return, but if I run the main part of the query by itself it returns in 5 minutes. I can explain it better with an example:

    SELECT

    *

    FROM

    (

    SELECT

    key,

    SUM(value)

    FROM

    some_table

    GROUP BY

    key

    ) s

    If I run just the code inside the virtual table, it runs in 5 min, but as soon as I simply nest it and select * from it, then it run indefinitely. Of course I need to do more than select *. The internal query does a bunch of aggregations, then the external query does additional calculations with the values.

    I did however try literally just changing it to * to troubleshoot and I still have the same performance problem.

    Any ideas what could be causing this?

    Can you post (as a .sqlplan attachment) the estimated plan for this, please -

    Cheers

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Here is a plan where its just the query encapsulated in SELECT * FROM ()

    Remember, if I run just the part inside the outer nesting it runs in under 5 minutes. I just left it running last night as described above and it ran for 23 hours without finishing.

    Thanks

    Edit: I also included an execution plan without the encapsulation. This is when the query runs in <5 minutes. I can't see a difference between the two.

  • rgp151 (12/8/2016)


    Here is a plan where its just the query encapsulated in SELECT * FROM ()

    Remember, if I run just the part inside the outer nesting it runs in under 5 minutes. I just left it running last night as described above and it ran for 23 hours without finishing.

    Thanks

    Edit: I also included an execution plan without the encapsulation. This is when the query runs in <5 minutes. I can't see a difference between the two.

    Thanks.

    First thought - comment out the ten lines which look like this:

    COUNT(DISTINCT CASE WHEN

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • I'm sorry, but what you have posted up goes WAY beyond free support on a forum, at least for me. That is a metric butt-ton of ugly operations that could take many hours of effort to decipher and improve.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Yeah, I know its a lot. My main question is just what would cause performance to degrade so much just going from:

    SELECT

    x,

    y,

    z

    FROM

    t

    to

    SELECT

    *

    FROM

    (

    SELECT

    x,

    y,

    z

    FROM

    t

    ) t

    It should essentially be the same query with the same results either way. I don't know why simply nesting the query as a virtual table degrades the performance so much.

  • rgp151 (12/8/2016)


    Yeah, I know its a lot. My main question is just what would cause performance to degrade so much just going from:

    SELECT

    x,

    y,

    z

    FROM

    t

    to

    SELECT

    *

    FROM

    (

    SELECT

    x,

    y,

    z

    FROM

    t

    ) t

    It should essentially be the same query with the same results either way. I don't know why simply nesting the query as a virtual table degrades the performance so much.

    You might be hitting a tipping point. The easiest to diagnose is optimiser timeout.

    Did you try my suggestion? Just comment out the ten lines and rerun the simple query. Harvest the actual plan to post here. I have a cunning plan <<cough>>

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • The description doesn't begin to describe the real problem.

    Without at least having the full query and actual execution plans, there's no real advice that can be given here. I agree that the COUNT DISTINCT might be the ones giving problems, but those are present on both plans.

    Have you tried using temp tables?

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Yeah, I can't figure this out. But it is the case that if I simply insert the result of the nested query into a temp table first, then select from the temp table, the whole thing completes in a few minutes.

    I don't get it. It doesn't make any sense and I wish I knew what was going on, but for now, at least putting into a temp table works.

  • rgp151 (12/8/2016)


    Yeah, I can't figure this out. But it is the case that if I simply insert the result of the nested query into a temp table first, then select from the temp table, the whole thing completes in a few minutes.

    I don't get it. It doesn't make any sense and I wish I knew what was going on, but for now, at least putting into a temp table works.

    It may be frustrating, but often it's best to materialize something that otherwise can "get lost in the plan", so to speak. I can't begin to tell you how often I've had CTEs that pull a rather small number of rows end up choking the rest of the query until I put the results into a temp table instead and join to that temp table instead. It just seems sometimes that there are some queries that the optimizer simply can not fathom a good row-count guess for, until it's "written in stone", so to speak.

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • rgp151 (12/8/2016)


    Yeah, I can't figure this out. But it is the case that if I simply insert the result of the nested query into a temp table first, then select from the temp table, the whole thing completes in a few minutes.

    I don't get it. It doesn't make any sense and I wish I knew what was going on, but for now, at least putting into a temp table works.

    That's usually because the query becomes too complex to create an optimal plan. It then chokes and cries for help. Are you sure that you can't simplify the query?

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • rgp151 (12/8/2016)


    Here is a plan where its just the query encapsulated in SELECT * FROM ()

    Remember, if I run just the part inside the outer nesting it runs in under 5 minutes. I just left it running last night as described above and it ran for 23 hours without finishing.

    Thanks

    Edit: I also included an execution plan without the encapsulation. This is when the query runs in <5 minutes. I can't see a difference between the two.

    "Showplan Comparison" indicates differences in the result operator alone (the SELECT operator). I suspect that one of these plans might not be the plan which was executed when you performed your timings. Also, does the non-encapsulated query have OPTION (RECOMPILE)?

    In any case, with so many copies of a 5.5 million row table littering the plan, it's not going to win any races.

    The reason I suggested commenting out the COUNT(DISTINCT... parts is that the remaining columns in the SELECT could probably be evaluated with a couple of simple aggregations and a single scan of the table - although you might have to perform a preaggregation to provoke the optimiser into generating such a plan. I'd tackle the COUNT(DISTINCT... parts as a separate exercise then merge the results, but make sure you really, REALLY need the DISTINCT, because it's hurting.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

Viewing 15 posts - 1 through 14 (of 14 total)

You must be logged in to reply to this topic. Login to reply