High CPU usage due to bad execution plan occurs unexpectedly

  • Hello all,

I have a very strange problem on my SQL Server that I can't manage to solve.

Server: SQL Server 2005 SP3

Database compatibility level: 80

I will try to describe the problem as precisely as possible (apologies in advance for the clumsy English, I am French).

Sometimes, our SQL Server CPU usage is so high that the server becomes unresponsive.

After running traces, it appears that the blocking is caused by a specific query which is launched about 100 times in 5 minutes. Nothing abnormal there; the query is a periodic reload of data for some internal components of our application, and it can occur several times a day.

Most of the time the server handles these reloads without any problem, but sometimes the CPU usage climbs unexpectedly until the server barely responds.

The query itself for the reload is pretty simple and normally runs very quickly (in less than 100 ms).

At first we suspected some external pressure that was slowing down the server.

    Despite several days of traces nothing obvious was found.

In the end, that specific query was the only culprit, and we could only observe that sometimes it ran very slowly (up to 10 minutes) with very high CPU usage.

Running the query manually, however, was always fast.

I finally managed to capture an execution plan for the query while it was misbehaving: it showed a table scan on a fairly large table (850,000 rows) where there should have been an index seek.

The CPU cost of that plan was also very high.

And when the bad execution plan is used by our 100 reloads at the same time, the server gets hammered!

The question is: why does the SQL engine sometimes decide to perform a table scan instead of an index seek?

I partially found the reason, but this is where I am stuck for a solution.

I discovered that sometimes the number of rows reported in the catalog views for the table is inaccurate, and the number reported is far lower than the real number of rows (1 or 2 rows reported against 850,000 real rows!). The optimizer therefore decides to perform a table scan, which is more efficient than an index seek for small tables.

I used several methods to get the row count for the table (sys.partitions, sys.dm_db_partition_stats, etc.) and they all gave me the same unrealistic count.
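For reference, this is the kind of check I ran (dbo.MyTable stands in for the real table name):

-- Row count as recorded in the catalog metadata
SELECT p.index_id, p.rows
FROM sys.partitions AS p
WHERE p.object_id = OBJECT_ID('dbo.MyTable')
  AND p.index_id IN (0, 1);  -- heap or clustered index

-- Same information from the DMV
SELECT ps.index_id, ps.row_count
FROM sys.dm_db_partition_stats AS ps
WHERE ps.object_id = OBJECT_ID('dbo.MyTable')
  AND ps.index_id IN (0, 1);

-- Real row count (expensive on a big, busy table)
SELECT COUNT_BIG(*) AS real_rows FROM dbo.MyTable;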

I know the row count can't be 100% accurate, especially for tables with many insert/update/delete statements (which is the case for my table), but reporting 1 or 2 rows for a table which has 850,000 rows sounds like a bug.

Once I discovered the issue, I ran the command "DBCC UPDATEUSAGE (MyDBName, MyTableName) WITH COUNT_ROWS". The output said it had corrected the row count to the real value.

Then I ran my query and the execution plan went back to normal, with the index seek it should have.

Unfortunately the problem reoccurred after I ran DBCC UPDATEUSAGE. It can come back 5 minutes, several hours, or sometimes 1 or 2 days later. It is completely unpredictable.

It reoccurs again and again; there seems to be no way to fix the bad row count for good.

Finally, in order to avoid further production issues, I created a script which runs every minute and corrects the row count if it has drifted (by running DBCC UPDATEUSAGE).
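Roughly, the watchdog does something like this (the real script differs; MyDBName, dbo.MyTable and the drift threshold are placeholders):

DECLARE @metadata_rows BIGINT, @actual_rows BIGINT;

-- Row count according to the catalog metadata
SELECT @metadata_rows = SUM(p.rows)
FROM sys.partitions AS p
WHERE p.object_id = OBJECT_ID('dbo.MyTable')
  AND p.index_id IN (0, 1);

-- Real row count
SELECT @actual_rows = COUNT_BIG(*) FROM dbo.MyTable;

-- If the metadata has drifted far below reality, fix it
IF @metadata_rows < @actual_rows / 2
    DBCC UPDATEUSAGE ('MyDBName', 'dbo.MyTable') WITH NO_INFOMSGS, COUNT_ROWS;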

This has been running for 2 months now and we have never had the CPU usage problem again.

However, I am not very happy with this solution: I have worked around the problem, not solved it.

    I tried to find other people with the same kind of issue but no luck.

I even wondered whether the statistics were being auto-updated too often.

As I said, the table is updated very frequently (update/insert). In one hour I can have 80,000 rows changed, so the statistics are auto-updated roughly every 10 minutes.

Therefore I switched off the auto-update statistics feature for this specific table, but it did not prevent the issue from happening again.
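For the record, I did it roughly like this, one of the two usual ways (dbo.MyTable stands in for the real table):

-- Turn auto-update of statistics off for one table only
EXEC sp_autostats 'dbo.MyTable', 'OFF';

-- Alternative: refresh the stats once and mark them NORECOMPUTE
UPDATE STATISTICS dbo.MyTable WITH FULLSCAN, NORECOMPUTE;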

Sorry for the long paragraphs, but I tried to give as much information as I could.

Any advice would be highly appreciated.

    Kind regards.

Well, this can certainly be a problem, and it is typically caused by improper indexing even though things may appear correct. It may also be due to a complex query that could be rewritten a bit, so you should check into both. As you stated, rebuilding stats is just not going to solve the issue and is a bad band-aid. There is a really good article on this by SQL Soldier on his blog, with some solid suggestions at the end of the post to consider. One other option, IF you have an expected row count that you are looking for, would be OPTION (FAST n). Again, a limited use case, but something that post does not cover.
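For illustration, it looks like this (table and column names are invented):

DECLARE @Id INT;
SET @Id = 42;

-- Ask the optimizer to favour a plan that returns the first
-- 100 rows as fast as possible, rather than the whole set.
SELECT i.InstrumentId, i.LastPrice
FROM dbo.Instruments AS i
WHERE i.InstrumentId >= @Id
OPTION (FAST 100);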

A couple of other posts on cardinality estimates as well:

    http://msdn.microsoft.com/en-us/library/ms181034.aspx

    http://msdn.microsoft.com/en-us/library/ms175933.aspx

They may not both apply to the query in question, but they are still good to review.

Again, I would stress looking at the indexing and the query plan as a whole. Make sure the query is simple and the indexing is as exact as possible if you need it to perform really quickly at the frequency you stated.

    Hope this helps.

    David

    @SQLTentmaker

    “He is no fool who gives what he cannot keep to gain that which he cannot lose” - Jim Elliot

Sounds like a parameter sniffing problem:

    http://technet.microsoft.com/en-us/library/cc966425.aspx

Are you using parameterized queries when it runs slow?
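By a parameterized query I mean something like this, instead of building the statement text with literals (names are placeholders):

DECLARE @Id INT;
SET @Id = 42;

-- One plan is compiled for @Id and reused across executions
EXEC sp_executesql
    N'SELECT * FROM dbo.Instruments WHERE InstrumentId = @Id',
    N'@Id INT',
    @Id = @Id;

A parameterized plan is compiled once and reused, which is exactly where sniffing can bite.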



    Clear Sky SQL
My Blog

  • Thanks to you for your quick reply and suggestions.

@david-2, I'll look at your links; maybe I will find some clues for my issue.

The indexing is pretty simple. The table contains financial instruments, and each instrument has a unique id which is indexed (clustered index). Each instrument is updated with the last trade values (open/high/low price/time), but the indexed column itself is never touched.
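Roughly, the shape is like this (all column names are illustrative, not the real ones):

-- Simplified sketch of the table described above
CREATE TABLE dbo.Instruments
(
    InstrumentId  INT            NOT NULL,
    OpenPrice     DECIMAL(18, 6) NULL,
    HighPrice     DECIMAL(18, 6) NULL,
    LowPrice      DECIMAL(18, 6) NULL,
    LastPrice     DECIMAL(18, 6) NULL,
    LastTradeTime DATETIME       NULL,
    CONSTRAINT PK_Instruments PRIMARY KEY CLUSTERED (InstrumentId)
);

The updates touch only the price/time columns, never the clustered key.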

In a normal situation, the query returns its result in less than 100 ms.

Also, how can we explain that running DBCC UPDATEUSAGE prevents the issue from happening? That command is not supposed to fix anything about statistics or indexes.

I am really stuck for a solution, as I don't understand the root cause...

@dave-3, we are not using parameterized queries here. The query text is built by a C++ program and executed over an ADO connection.

    Cheers.

A couple of things. First, why do you have the compatibility level set at 80 for this database? I don't think that is the issue, but.... Secondly, can you provide the table structure and the associated indexes, along with the query you are running? If you can attach the good and the bad plan, that would be great as well. Thanks.

    David

    @SQLTentmaker

    “He is no fool who gives what he cannot keep to gain that which he cannot lose” - Jim Elliot

Heh, I was sure someone would jump on the compatibility level 80 😀

The DB is still in SQL 2000 mode for many reasons (probably bad reasons), but unfortunately my company can't afford to spend time on the migration at the moment (at least in the short/medium term).

I agree with you, it may well not be the issue, but....

If I could reproduce the issue easily, I could move the DB to another server, change the compatibility level to 90, and prove to my company that they need to migrate the production DB. Unfortunately I can't reproduce the problem in any environment other than production, and they won't spend much time testing a problem that may already be solved. Also, since I found a workaround, they don't feel any urgency...

Regarding the execution plan and the query, I will post them as soon as possible.

I'm not sure I can capture the bad execution plan, as that means our prod server is in a bad state. Nobody wants his prod server stuck on purpose 😉

    I am nearly off for today.

    Bye

  • Oh, ok. Since you posted in a 2005 forum I assumed that you were on 2005.

    No worries about the plans. Post them when you get them and we can see what we can find.

    David

    @SQLTentmaker

    “He is no fool who gives what he cannot keep to gain that which he cannot lose” - Jim Elliot

Even though the database is in 80 compatibility, I'm pretty sure you can capture the exec plans from cache, so if the bad plan happens, snag it quick, just before you do whatever it is you're doing to fix it. Shouldn't take more than a second.
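Something along these lines should pull it from the DMVs (the LIKE filter is a placeholder for a fragment of your query text):

-- Grab the cached plan(s) for the reload query
SELECT TOP (10)
    cp.usecounts,
    st.text,
    qp.query_plan
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
WHERE st.text LIKE '%MyReloadQuery%';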

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

Gosh... if it's only running 100 times in 5 minutes, try adding OPTION (RECOMPILE) to the offending statement and you'll likely be done with the problem.
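For illustration, the hint just gets appended to the statement (query and names invented):

DECLARE @Since DATETIME;
SET @Since = DATEADD(MINUTE, -5, GETDATE());

-- Compile a fresh plan on every execution, using current statistics
SELECT i.InstrumentId, i.LastPrice
FROM dbo.Instruments AS i
WHERE i.LastTradeTime >= @Since
OPTION (RECOMPILE);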

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hi all,

I did not give up on it, and I'd like to give a final update on this thread for those who kindly tried to help and for anyone who may face the same issue.

I finally found an acceptable solution, maybe not the best one, but good enough for us.

The high CPU consumption was caused by a bad execution plan due to stale statistics.

It's a classic, but since the table is heavily updated, execution plans do not stay in cache more than a few minutes. The statistics also become inaccurate very quickly; they are auto-updated almost every 5 minutes!

So what were the problem and the solution?

Once I captured a bad execution plan from cache, I saw a Nested Loops join to a Clustered Index Scan where the estimated number of rows was 1 instead of several hundred thousand.

The other branch of the Nested Loops join also had many rows. Nested Loops is a very good operator when joining few rows (or many to few), but it's a resource killer when joining one large set of rows to another.

A Hash Match would have been far better here, but the SQL optimizer relies on statistics, and the statistics were bad, so it produced an inefficient plan...

Finally, I forced the query optimizer to always use the Hash Match operator by adding OPTION (HASH JOIN) to the query.
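In simplified form (the real query and names differ):

-- Force hash joins for every join in the statement
SELECT r.InstrumentId, i.LastPrice
FROM dbo.ReloadList AS r
INNER JOIN dbo.Instruments AS i
    ON i.InstrumentId = r.InstrumentId
OPTION (HASH JOIN);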

Since we force the join operator, the solution may not look ideal, as it could lead to a less optimal execution plan, but at least the execution time is always good enough for us, and we make sure the very bad execution plan can never happen.

    Problem solved 🙂

I thought of using a query hint after reading about someone here with a similar issue. SQLServerCentral is really a great source of knowledge; I learn so much here!

Just a PS: I work in the Montreal region if you need someone on site, and I speak French too!

  • Hi, thanks for the proposal but I work in Paris 😉

  • sdelachapelle (7/11/2011)


    Hi, thanks for the proposal but I work in Paris 😉

    Close enough! :w00t:

    I'm 10 minutes away from the airport in case of emergency!

Ninja's_RGR'us (7/11/2011)

Close enough! :w00t:

I'm 10 minutes away from the airport in case of emergency!

    HAH - I am 0 minutes away from a VPN, although I don't speak French!! 😎

    Anyway, I think there was at least one unanswered question about why dbcc updateusage 'prevents' the problem. I doubt that it does 'prevent' it. It flushes the procedure cache for that table's plans and that is probably why things seem to run better for a while.

    There are numerous potential root causes here with numerous potential solutions. You really need to get a professional on board to help you determine how to proceed.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

TheSQLGuru (7/12/2011)

HAH - I am 0 minutes away from a VPN, although I don't speak French!! 😎

Most clients I've talked to recently don't like VPN as a first contact (yes, I know what we all think).

So: I speak French very well, and I am 0 minutes away by VPN. 😀
