Same query, different execution plans on PROD & TEST, parallelism slows things down

  • We have a vendor application+database, and noticed that a search took 1 minute on PROD server, but only 10 seconds on TEST.

    Ran a trace to capture exact query being run, restored the database from PROD to TEST, and ran the query on both.

    Analyzed the execution plans with SQL Sentry Plan Explorer - screenshots attached.

    PROD has 8 CPUs. TEST has 2 CPUs

    When I added OPTION (MAXDOP 1) to PROD query, time reduced to 20 seconds.

    But this is a 3rd party database where we don't/can't modify SPs.

    Is there another way to speed up the query ?

  • both the cases difference in estimated vs actual is high try to updates stats and see it helps in QA only

  • I already ran sp_updatestats on both.

    Off-hours I plan to do an update with fullscan.

  • Is there any way you could get the actual execution plans for both queries as .sqlplan files and post those?

    That would give us a bit more detail to work with.

    Having said that, as mxy said, the estimates for the query are wildly off, which is likely the main problem here.

    Without being able to refactor the code, your main options are:

    1) Fixing any statistics-related problems, like the statistics being stale (you seem to have addressed that already), or being inadequate (the value in question has wildly different row counts than the average for that histogram step).

    2) Investigating whether there are any indexes that might help. Obviously this would require a lot of careful testing.

    3) Changing the parallelism behavior, either with an instance-wide MAXDOP (the shotgun approach), or with some careful adjustments of the cost threshold for parallelism, which likely should be adjusted anyway. Again, careful testing would be required, as always.

    Cheers!

  • Attached .sqlplan from TEST & PROD

  • <bump>

    Any feasible tuning possible from the .sqlplans ?

  • Ah, my apologies!

    I meant to respond to this last week and it somehow slipped my mind entirely.

    So, the problem seems to be twofold.

    First, SQL Server grossly overestimates the number of rows coming out of the Left Semi Join between Users and Jobs.

    Second, the parallel plan it generates on prod is much worse than SQL Server estimates.

    With that in mind, I don't know that we can necessarily do a lot with the poor estimates, just because of how complicated those predicates are. I imagine SQL Server will never get those quite right (I might be wrong about that, though; I just don't see a good way to attack that aspect at the moment).

    Tweaking the cost threshold of parallelism is out, as both queries have such high estimated costs that any reasonable value would still be too low to keep the plan from running in parallel.

    On that note, what is the instance level MAXDOP on the test server? I'm guessing it's been set to 1, since you said it has 2 cores, and the query's not parallelizing despite its high cost.

    Without being able to touch the code, and with the plan being as expensive as it is, I can really only see two ways to bridge the performance gap between test and prod.

    1) Lower the instance MAXDOP. Obviously with this you'd need to do quite a bit of testing to make sure it doesn't hurt other queries more than it helps this one (having said that, your test environment already has a lower MAXDOP than prod, since it's at most 2, and probably 1).

    2) You could use a plan guide to force a more efficient plan, which could be as simple as adding a MAXDOP 1 hint in this case. I really hesitate on this one, as plan guides become pretty transparent once created, so if the data distribution ever changes, and SQL Server would then pick a much more efficient plan, it'll be prevented from doing so. If you don't remember that the plan guide exists, that can be a frustrating troubleshooting experience. Still, as with most things in SQL Server, they have their place.

    Hopefully this helps!

  • Or, rewrite the query looking at ways to eliminate the multiple nested sub-queries.

    Looking at it, there has to be a way to rewrite the query so it is more optimal, I just don't have the time during the day. I may be able to look at it more closely tonight or over the weekend but I can't make any promises.

    If you could, please post the DDL for the tables including the indexes currently defined on them.

  • @Lynn: Agreed, I would love to dive into that option. Unfortunately the OP specified that they can't touch the third party app's SPs. Perhaps pushing back on that is still the best route, though. I've occasionally (VERY occasionally) had success with that 🙂

    EDIT: Actually, I can only think of one time pushing back on a third-party helped, so it's more occasional than I thought. Still, worth a shot.

  • Jacob Wilkins (5/7/2015)


    @Lynn: Agreed, I would LOVE to dive into that option. Unfortunately the OP specified that they can't touch the third party app's SPs. Perhaps pushing back on that is still the best route, though. I've occasionally (VERY occasionally) had success with that 🙂

    Didn't catch that on first read. Still, if you send them a rewrite that works better and more efficiently, the vendor may consider making the change.

  • Not to mention the table scans this probably causes:

    ...

    AND (u.FirstName LIKE '%' + 'asdf' + '%' OR

    u.LastName LIKE '%' + 'asdf' + '%' OR

    JobTitle IN ( SELECT JobID

    FROM Jobs

    WHERE

    OrgID = 4 AND

    Job LIKE '%' + 'asdf' + '%') OR

    u.Email LIKE '%' + 'asdf' + '%' OR

    u.UserName LIKE '%' + 'asdf' + '%' OR

    u.Department LIKE '%' + 'asdf' + '%' OR

    u.AlternateID LIKE '%' + 'asdf' + '%' OR

    u.AlternateID2 LIKE '%' + 'asdf' + '%' OR

    u.AlternateID3 LIKE '%' + 'asdf' + '%')

    ...

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply