December 6, 2011 at 4:36 am
We have a SQL Server system which is used by BusinessObjects Data Services and exhibits an intermittent problem. As part of the ETL process a number of tables are truncated and repopulated then used in a join. We have found that randomly this join part of the process slows down considerably - what takes about 10 minutes on a normal night takes 4 hours. We have found a few ways to fix the problem - but the fix is not permanent and the issue reoccurs.
I'm a little new to SQL Server so apologies if this is a little jumbled, but this is the extent of my investigations ...
1) A similar query tested the day after the problem exhibits itself in the overnight run shows the issue still exists. So once it has occurred it is reproducible.
2) A check on the Execution Plan shows that when the problem occurs a couple of the joins, which are normally all Hash Match, are now Nested Loop preceded by a Lazy Spool.
3) The "issue" can be narrowed down to one of the two larger tables in the query, and if these are truncated and repopulated then the problem usually goes away (for a few days at least).
4) Dropping and recreating statistics for the problem table does not resolve the issue - only truncation/repopulation or complete removal and rebuild works.
5) Recreating statistics for the problem table with fullscan does not resolve the issue.
6) I've been trying to collect fragmentation and statistics output when the problem occurs and when it is temporarily resolved, but can't see any obvious differences.
7) Creating a few indexes on key join columns can also resolve the issue temporarily, but again it will eventually reoccur.
8) Possibly most confusing, we have Development and QA systems which do the same process each night with the same data and same volumes - and they never have this issue, it just affects the Production system.
9) Using optimiser hints to force Hash joins is not supported by BusinessObjects Data Services - so this isn't an option to resolve it.
10) If left to its own devices the problem may disappear again, as the process will truncate the tables and repopulate them anyway (see 3 above).
So what can I do next?
December 6, 2011 at 5:44 am
1. As the data is frequently inserted and deleted, collect a fragmentation report beforehand so that you know when to rebuild the indexes proactively.
(You can schedule a job to pull that information from sys.dm_db_index_physical_stats during off hours and email the results - see the example query after this list.)
2. Ensure the statistics are always up-to-date.
3. You can recompile the stored procedure every time it runs (WITH RECOMPILE), or add OPTION (RECOMPILE) to the statement itself, so that the optimizer creates a new execution plan based on the current data distribution - the second statement in the sketch below shows the hint.
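For example, something along these lines - YourDatabase objects dbo.YourBigTable and dbo.OtherTable are just placeholder names for your own ETL tables:

-- Fragmentation for every index on one of the ETL tables (LIMITED mode keeps the scan cheap)
SELECT OBJECT_NAME(ips.object_id) AS table_name,
       i.name AS index_name,
       ips.index_type_desc,
       ips.avg_fragmentation_in_percent,
       ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.YourBigTable'), NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
    ON i.object_id = ips.object_id
   AND i.index_id = ips.index_id;

-- For point 3, a recompile hint on the problem statement forces a fresh plan on every execution
SELECT b.key_col, o.some_col
FROM dbo.YourBigTable AS b
JOIN dbo.OtherTable AS o
    ON o.key_col = b.key_col
OPTION (RECOMPILE);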
December 6, 2011 at 6:28 am
Thanks for the response.
1. Currently there are no indexes on the tables - so presumably just the statistics are being used to determine the plan? As mentioned above I did add some indexes to the larger tables on the join columns - this temporarily resolved the issue but it would still occur at a later date. If the tables are being fully truncated and then repopulated would that not automatically mean any indexes should be up to date?
2. Statistics are on and up to date at execution - the last refresh time on the various statistics always tallies with the ETL process and is post-truncation but prior to execution of the problem query (see the example query just after this list).
3. It's not a stored procedure but part of the ETL process, so there are no real controls available to do this. However, this is actually the crux of the issue: it is choosing the plan at execution every night, but sometimes it chooses the very slow one.
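For point 2, a query along these lines shows the last update time of every statistics object on a table (dbo.ProblemTable is just a placeholder name):

SELECT s.name AS stats_name,
       STATS_DATE(s.object_id, s.stats_id) AS last_updated
FROM sys.stats AS s
WHERE s.object_id = OBJECT_ID('dbo.ProblemTable')
ORDER BY last_updated DESC;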
A couple of additional comments.
1. The actual data doesn't change that much. Mainly a little growth and some minor updates to the previous data. The tables in question are fully truncated and repopulated from source each night though.
2. Nothing seems to explain why the development and QA systems never experience the same problem too, when they use the same data and the same ETL configuration.
December 6, 2011 at 6:57 am
It sure sounds like it's statistics related. When you rebuild the stats, are you using sp_updatestats or are you running UPDATE STATISTICS? If the former, try using the latter and adding WITH FULLSCAN.
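Something like the following, run against each of the tables in the join (dbo.ProblemTable is just an example name):

-- Rebuild every statistics object on the table from a full scan of the data rather than a sample
UPDATE STATISTICS dbo.ProblemTable WITH FULLSCAN;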
No parameters right? It's ad hoc TSQL? That eliminates bad parameter sniffing from the mix, but with a recompile you can actually have bad variable sniffing too. Any variables in the TSQL statement? Or, if it is a parameter, you might be dealing with bad parameter sniffing.
Indexes, especially on large tables and especially clustered indexes, are a fundamental piece of the puzzle when it comes to SQL Server. You absolutely should get, at least, clustered indexes in place.
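For instance, something along these lines - all table and column names below are purely illustrative:

-- A clustered index on the main join column gives the optimizer ordered storage and a useful statistic
CREATE CLUSTERED INDEX CIX_BigTable_KeyCol
    ON dbo.BigTable (key_col);

-- Nonclustered indexes on the remaining join columns support the other joins
CREATE NONCLUSTERED INDEX IX_BigTable_OtherJoinCol
    ON dbo.BigTable (other_join_col);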
Fragmentation won't affect execution plans, ever.
It still sounds like a statistics issue.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
December 6, 2011 at 7:45 am
Grant Fritchey (12/6/2011)
It sure sounds like it's statistics related. When you rebuild the stats, are you using sp_updatestats or are you running UPDATE STATISTICS? If the former, try using the latter and adding WITH FULLSCAN.
I used the latter, with the WITH FULLSCAN option.
No parameters right? It's ad hoc TSQL? That eliminates bad parameter sniffing from the mix, but with a recompile you can actually have bad variable sniffing too. Any variables in the TSQL statement? Or, if it is a parameter, you might be dealing with bad parameter sniffing.
Correct - or at least the query I use to test and diagnose the issue has no parameters. It was grabbed directly from the server to check what BO Data Services was trying to execute during one of the long executions, so the same should be true for that.
Indexes, especially on large tables and especially clustered indexes, are a fundamental piece of the puzzle when it comes to SQL Server. You absolutely should get, at least, clustered indexes in place.
Fragmentation won't affect execution plans, ever.
It still sounds like a statistics issue.
OK, I'll revisit indexes more extensively. I originally thought that was the resolution to the problem, until I reproduced it whilst the indexes were defined.
Thanks for the help.
December 7, 2011 at 3:48 am
A short update. I revisited indexes more extensively as Grant suggested and defined a number of clustered and non-clustered indexes on the tables used in the process based on join conditions. Once done I couldn't get the execution plan to revert to the "bad" plan by using my previous techniques - so this is very good. The core test query execution time has benefited greatly from the indexes too - down from 50-60 seconds to 10 seconds. There is a performance hit back up the line in the initial population of the tables, but overall it should be at least a small improvement - especially if it finally resolves the random 4 hour poor performance hit when the "bad" plan gets used.
Thanks again for all the suggestions and help.