A Spectacular Case of 100% CPU Caused by Online Reindexing

  • Comments posted to this topic are about the item A Spectacular Case of 100% CPU Caused by Online Reindexing

  • Thanks for sharing this excellent analysis. By far one of the most interesting reads about SQL performance.

  • The title is, IMHO, misleading; was that intentional? Otherwise a good cautionary tale for DBAs out there. Thanks

  • The link to the technet article is broken due to a stray ')' at the end. The href should be:

    http://technet.microsoft.com/en-us/library/ms181055%28v=sql.105%29.aspx

  • 1) I recommend investigating Trace Flag 2371, which lowers the row mod count trigger for auto stats updates for larger tables.

    http://blogs.msdn.com/b/saponsqlserver/archive/2011/09/07/changes-to-automatic-update-statistics-in-sql-server-traceflag-2371.aspx

    2) No matter what you do, this scenario is almost certainly exposed to significant data value skew, i.e. campaigns with a trivial number of total rows alongside others with massive numbers of rows. That makes you ALWAYS vulnerable to parameter sniffing/plan caching badness.

    Either side of the coin can easily screw your performance (first execution sees few rows, or first execution sees many rows) if one is optimal with nested loops/seeks and the other is optimal with scan/hash. There are two main ways to address this issue: OPTION (RECOMPILE) and dynamic SQL (being sure to guard against injection, obviously). Believe it or not, this is one of the most common causes of performance issues I see at clients.
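    A minimal sketch of both suggestions follows. The table, column, and parameter names (dbo.Activity, CampaignId, @CampaignId) are hypothetical stand-ins, not the article's actual schema:

        -- 1) Trace flag 2371: switch large tables to the sliding auto-update-stats
        --    threshold instead of the fixed 20% row-modification trigger.
        DBCC TRACEON (2371, -1);

        -- 2a) Per-statement recompile: each parameter value gets a plan built for
        --     its own row counts, at the cost of compiling on every execution.
        DECLARE @CampaignId int = 42;   -- hypothetical value

        SELECT ActivityId, ActivityDate
        FROM dbo.Activity
        WHERE CampaignId = @CampaignId
        OPTION (RECOMPILE);

        -- 2b) Dynamic SQL: inline the value so each distinct campaign gets its own
        --     cached plan.  Building the string from a strictly typed int variable
        --     (never from raw user text) is the guard against injection.
        DECLARE @sql nvarchar(max) = N'
            SELECT ActivityId, ActivityDate
            FROM dbo.Activity
            WHERE CampaignId = ' + CAST(@CampaignId AS nvarchar(11)) + N';';

        EXEC (@sql);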

    BTW, not bad on the investigations and VERY nicely done on the SSC article! A few process improvements in your troubleshooting could include:

    a) using sp_whoisactive, which could have shown you running processes with their plans

    b) never, ever skipping the check of actual vs. estimated rows in a plan 🙂

    c) possibly using a quick ad hoc profiler session to capture batch starting as well as completing events (which normally isn't done) to find those non-completing killer queries (although, again, sp_whoisactive should suffice; a minimal call is sketched below)
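    A minimal sp_WhoIsActive call might look like the following. It assumes Adam Machanic's sp_WhoIsActive procedure is installed (it does not ship with SQL Server), and the parameters shown are its documented options as I recall them:

        -- Show currently running requests; @get_plans = 1 attaches each request's
        -- query plan, so a runaway statement and its plan appear in one result set.
        EXEC dbo.sp_WhoIsActive
            @get_plans = 1,
            @get_additional_info = 1;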

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Excellent analysis and excellently written article.

    One solution is to add an "optimize for" hint to the particularly sensitive query and supply a known value that gives the desired plan.

    Or use "optimize for unknown", so the plan does not depend on the statistics for a single input parameter value.
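    A sketch of both hints, using hypothetical names (dbo.Activity, CampaignId, @CampaignId) for the article's table and parameter, written as the statements would appear inside a parameterized procedure:

        -- Pin the plan to a known, representative value:
        SELECT ActivityId, ActivityDate
        FROM dbo.Activity
        WHERE CampaignId = @CampaignId
        OPTION (OPTIMIZE FOR (@CampaignId = 42));

        -- Or build the plan from average density rather than any sniffed value:
        SELECT ActivityId, ActivityDate
        FROM dbo.Activity
        WHERE CampaignId = @CampaignId
        OPTION (OPTIMIZE FOR UNKNOWN);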

    It also highlights THE biggest weakness of how query plans are built and cached.

    The first compile always ASSUMES that the first execution's parameters are representative of all later uses.

    This is obviously a ridiculous strategy if not backed by subsequent observation of actual execution.

    What should at least happen during subsequent plan reuses is to test the assumptions that the plan is based on.

    And if something is radically different (estimate vs actual), try to create an alternative plan and cache this result as well.

    This removes the worst-case scenarios without killing the benefits of how the process currently works.

  • What jumps out to me in this plan is that there is an index seek followed by a key lookup. I'd be willing to bet that if you made the non-clustered index a covering index, you'd see more consistent performance as well.
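    A sketch of what that covering index could look like, with hypothetical table and column names; the key and INCLUDE columns would have to match what the query actually filters on and returns so the key lookup disappears:

        CREATE NONCLUSTERED INDEX IX_Activity_CampaignId_Covering
        ON dbo.Activity (CampaignId)
        INCLUDE (ActivityId, ActivityDate)
        WITH (ONLINE = ON);   -- online build, as in the article's reindexing scenario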

    One solution is to add an "optimize for" hint to the particularly sensitive query and supply a known value that gives the desired plan.

    Or use "optimize for unknown", so the plan does not depend on the statistics for a single input parameter value.

    Beware!! Both of those options are GUARANTEED to give you BAD PERFORMANCE some of the time in the face of significant data value skew!!

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • "Further", not "farther" ("The query is a SELECT statement after all; there is no harm aside from hurting performance, which could not be hurt any farther.") "Farther" is for literal distances, whereas "further" is for metaphorical distances. If you can measure it in inches or centimetres or miles or kilometers, then "farther" is your man, otherwise it's "further". It can be difficult because "far" can be used for both. One can move both farther and further up a chain; if it's an actual chain then it's "farther", but if it's a metaphorical chain (e.g. a chain of command) then it's "further".

    It drives me nuts when watching golf. 9 times out of 10 they want "farther", but 9 times out of 10 they use "further". One can move further up the leader board if one can hit the golf ball farther.

    Of course, it could have just been a typo. My fingers do all sorts of things that I don't find out about until later... (typically mere moments after clicking send).

    But nice article. 🙂

  • Thanks for sharing! I think this article does a good job of showing how similar DBAs and detectives are. It takes a lot of investigation, critical thinking, and problem solving to fix some of these issues, and that's what makes being a DBA such a fun and challenging job.


    Personal blog relating fishing to database administration:

    https://davegugg.wordpress.com/

  • TheSQLGuru (12/11/2014)


    One solution is to add to the particular sensitive query an "optimize for" hint and supply a known value that gives the desired plan.

    Or use "optimize for unknown", so the plan is not dependent on the statistics related to one single input parameter.

    Beware!! Both of those options are GUARANTEED to give you BAD PERFORMANCE some of the time in the face of significant data value skew!!

    Not always true.

    Any solution that does not constantly recompile and favors one plan over another will suffer the same problem.

    Recompiling CAN be costly too on queries with many tables and/or indexes.

    As long as you test with a data set that has the same skew and sufficient volume and find it working well, there is no problem with going for a middle-of-the-road approach.

    It remains a cost trade-off between the overhead of a generic (less than optimal) plan and the cost of constant recompilation.

    The current solution is also sensitive to skew; the authors just reduced the likelihood of picking the undesired plan, so it is not an airtight solution.

    To get true control, the author could look into plan guides; see BOL:

    "This topic describes plan guides and explains how they can be used to optimize the performance of queries when you cannot or do not want to change the text of the query directly."
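    For example, a plan guide can attach a hint to a statement inside a stored procedure without touching its code. This is only a sketch with hypothetical object names; for a plan guide of type OBJECT the @stmt text must match the statement inside the module exactly:

        EXEC sys.sp_create_plan_guide
            @name            = N'PG_Activity_ByCampaign',
            @stmt            = N'SELECT ActivityId, ActivityDate FROM dbo.Activity WHERE CampaignId = @CampaignId;',
            @type            = N'OBJECT',
            @module_or_batch = N'dbo.GetActivityByCampaign',
            @params          = NULL,
            @hints           = N'OPTION (OPTIMIZE FOR UNKNOWN)';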

    The only true solution is, as I suggested, for Microsoft to fix when cached plans are used and when they are not.

    The simplest action would be to detect "hey, this is going wrong" and then force a recompile of that specific query.

    And there are other changes that could adapt to skewed data and optimize plans in stages (in the background) without negative effects on execution times.

    Plan guides don't help either with data-value-skew-induced poor performance, where a good plan for one set of inputs is a (horribly) bad plan for others.

    I suppose we will have to agree to disagree. For significant data-value-skew and widely-varying-input scenarios, the two options I mentioned have routinely provided many orders of magnitude in performance gains in scores if not hundreds of situations I have come across, whereas the cases in which the compilation issues mentioned prevented their use for my clients can be counted on one hand.

    I agree that there are things that Connor and his team of rocket scientists on the optimizer team can do to help this situation. But given the available workarounds I think those potential improvements fall way down on the priority list of things that team should be working on. YMMV

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Thank you for a great post.

  • TheSQLGuru (12/11/2014)


    Plan guides don't help either with data-value-skew-induced poor performance, where a good plan for one set of inputs is a (horribly) bad plan for others.

    I suppose we will have to agree to disagree. For significant data-value-skew and widely-varying-input scenarios, the two options I mentioned have routinely provided many orders of magnitude in performance gains in scores if not hundreds of situations I have come across, whereas the cases in which the compilation issues mentioned prevented their use for my clients can be counted on one hand.

    I agree that there are things that Connor and his team of rocket scientists on the optimizer team can do to help this situation. But given the available workarounds I think those potential improvements fall way down on the priority list of things that team should be working on. YMMV

    If you can introduce "option (recompile)" with plan guides, then plan guides will obviously work in this scenario and are even in line with your own solution.
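    A sketch of attaching OPTION (RECOMPILE) through a plan guide for a parameterized statement you cannot edit; the statement text and names are hypothetical, and for a guide of type SQL the @stmt must match the submitted batch exactly:

        EXEC sys.sp_create_plan_guide
            @name            = N'PG_Activity_ByCampaign_Recompile',
            @stmt            = N'SELECT ActivityId, ActivityDate FROM dbo.Activity WHERE CampaignId = @CampaignId;',
            @type            = N'SQL',
            @module_or_batch = NULL,
            @params          = N'@CampaignId int',
            @hints           = N'OPTION (RECOMPILE)';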

    I never needed them myself as I have access to the code we host, but it might be possible to use them for this.

    As for more important things, do not get me started.

    Despite new features in every release, they are inaccessible to most users and often of very limited use in general (especially on first iteration).

    Meanwhile, there is absolutely NOTHING more important in a system that ought to optimize at run time than excluding worst-case scenarios.

    Microsoft only has to fix a problem once, but their customers have to fix each problem case individually and at high cost (it is obscenely wasteful).

    This flies right in the face of the ideals behind relational databases and SQL as a language.

  • Excellent post, you covered a lot of ground! The only thing I would add is apples-to-apples testing from the application side in QA. While SSMS is and will remain the default tool of DBAs, it does not represent what the business users are using; they don't use SSMS. Accurately reproducing the problem in QA is always a challenge. Does QA have 32 CPUs? Using SQL Trace or Extended Events can help capture the whole picture, including actual execution plans. Tease: will Query Store in SQL20XX greatly assist in finding these kinds of things? How about an alert when the actual vs. estimated row count difference exceeds a threshold? Again, great blow-by-blow account.
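    A quick Extended Events sketch of the kind of session mentioned above, capturing batch starting as well as completing events so statements that never finish still leave a trace; the session and file names are hypothetical:

        CREATE EVENT SESSION CaptureBatches ON SERVER
        ADD EVENT sqlserver.sql_batch_starting
            (ACTION (sqlserver.session_id, sqlserver.sql_text)),
        ADD EVENT sqlserver.sql_batch_completed
            (ACTION (sqlserver.session_id, sqlserver.sql_text))
        ADD TARGET package0.event_file (SET filename = N'CaptureBatches.xel')
        WITH (MAX_DISPATCH_LATENCY = 5 SECONDS);

        ALTER EVENT SESSION CaptureBatches ON SERVER STATE = START;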

    --Todd
