January 22, 2016 at 7:13 am
Hello guys,
I have done a little bit of SQLServer troubleshooting and I would like to get my actions peer-reviewed.
Given:
SQLServer 2008 with a 4 GB database. It gets hammered with various queries from the application. Queries often use joins and deal with tables that have the following number of rows:
Table1: 297.716
Table2: 4.089.520
Table3: 3.625.970
Table4: 1.001.810
Table5: 14.306.843
Table6: 574.189
Table7: 297.554
Table8: 115.428
Queries are all different, but some can be as follows:
set transaction isolation level read uncommitted; SELECT DISTINCT Table1.ID, Table1.CreationDate FROM Asset JOIN Table2 ON table1.ID = table2.AID WHERE 1=1 AND (Table2.On = 1) AND 1=1 AND (Table2.Type!=2 OR EXISTS (SELECT 0 FROM Field1 WHERE Table1.SID=Table2.SID AND UID IN(1542,942,950,1487,1488) ) ) AND Table1ID IN (((SELECT Table1.ID FROM Table1 WHERE CONTAINS (Field2,N'"ssad_sds_sds*"')AND VirtualKeywords LIKE N'%asda&_asd&_sad%' ESCAPE '&' OR CONTAINS (VirtualField3,N'"xxx_che_bou*"')AND VirtualField2 LIKE N'%xxx&_che&_bou%' ESCAPE '&') UNION (SELECT Table1.ID FROM Table1 WHERE (CONTAINS (Table1.Name,N'"xxx_che_bou*"')AND Table1.Name LIKE N'%xxx&_che&_bou%' ESCAPE '&') OR (Table1.VirtualLIKE N'xxx&_che&_bou%' ESCAPE '&') OR (CONTAINS (Table1.Virtual,N'"xxx_che_bou*"')AND Table1.Virtual LIKE N'%xxx&_che&_bou%' ESCAPE '&') OR (CONTAINS (Table1.Virtual,N'"xxx_che_bou*"')AND Table1.VirtuaName LIKE N'%xxx&_che&_bou%' ESCAPE '&') OR (CONTAINS (Table1.NameOnline,N'"xxx_che_bou*"')AND Table1.NameOnline LIKE N'%xxx&_che&_bou%' ESCAPE '&') OR (CONTAINS (Table1.Owner,N'"xxx_che_bou*"')AND Table1.Owner LIKE N'%xxx&_che&_bou%' ESCAPE '&') ) )) ORDER BY Table1.CreationDate DESC;
Symptoms:
High CPU usage, potential CPU pressure (all processor threads sitting at 100% for a long time).
What I have done:
Ran profiler tests around 3 PM (looking at queries with CPU consumption above 100 or Duration above 1000)
At 6 PM the DB server crashed and had had to be restarted.
Clue 1
This is what was in the log:
2016-01-16 18:02:55.53 Server * BEGIN STACK DUMP:
2016-01-16 18:02:55.53 Server * 01/16/16 18:02:55 spid 2160
2016-01-16 18:02:55.53 Server *
2016-01-16 18:02:55.53 Server * Deadlocked Schedulers
I have analysed the dump with SQL Server symbols and saw that the majority of threads had this:
00000000`1d4eece0 00000000`00b55fff sqlservr!CQScanXProducerNew::Open+0x35
00000000`1d4eed10 00000000`00b0137b sqlservr!FnProducerOpen+0x45
00000000`1d4eed50 00000000`00b02a71 sqlservr!FnProducerThread+0x4df
00000000`1d4eeed0 00000000`007bab94 sqlservr!SubprocEntrypoint+0x794
Suggested a problem with parallelism
Clue 2
I looked into this further and found that the DOP settings were:
Min 0
Max 1024
Run value 0
Config value 0
I assumed that this has allowed the SQLServer to decide for itself how many parallel threads to start for each query.
This machine has 24 schedulers, so the recommended value for MAXDOP should be 8.
Clue 3
I have also looked at the wait stats:
wait_typewaiting_tasks_countwait_time_msmax_wait_time_mssignal_wait_time_ms
CXPACKET5682620066314514119590733499132
Online sources suggested that if the queries are not written in the most efficient way, high parallelism can cause high CXPACKET wait values.
That was 3 factors pointing to the same issue.
I have initially set the MAXDOP to 8, but still had some timeouts (although the application performance was much better).
Then I have set it to 1 and had no problems whatsoever - CPU was never under pressure and the custom SQL monitoring pages have never timed out. I had some timeouts with MAXDOP set to 8.
Could you please suggest if my actions were correct and valid?
Could my profiling cause the server crash above?
Thank you very much!
P.S>
I have checked the query plans and yes, the index scan is taking place instead of the Seek - but I have no control over the application.
January 22, 2016 at 8:26 am
High CXPACKET waits are an indication of parallel query execution, but not an indication that parallel execution is causing problems on your server. You need to look to other wait statistics and queues to indicate that you have CPU issues above and beyond the CXPACKET waits. You'll need to gather additional metrics to understand if this is an issue for you or not. If it is, the first approach is not to mess with the max degree of parallelism (although, the default value of 0 is not necessarily the best choice on any system), but instead to go after the Cost Threshold for Paralelism to be sure that's good for the system.
However, the biggest problem is the query, specifically this:
Table1.VirtuaName LIKE N'%xxx&_che&_bou%' ESCAPE '&') OR (CONTAINS (Table1.NameOnline,N'"xxx_che_bou*"')
That pattern of querying pretty much guarantees that you're going to be seeing all scans on your tables/indexes with no filtering. First, the wild card search: LIKE '%anything%'. There is no possibility of using the index because of this. Next, CONTAINS after the OR clause is pretty much going to cause scans as well. Not only this, but that pattern is repeated more than once in the query. This will always run slow. The only way to speed it up is to buy a bigger, faster machine, and even then it's going to perform poorly. You need to use better filtering. If a LIKE has to be used, eliminate the opening wild card: LIKE 'somevalue%'. If you have to search nothing but strings like this, maybe a different technology than SQL Server (a relational database management system, not well oriented to document-style queries).
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
January 22, 2016 at 8:47 am
Thank you for your reply, Grant!
Could my profiling cause the crash listed above?
Also, here are the other stats:
wait_typewaiting_tasks_countwait_time_msmax_wait_time_mssignal_wait_time_ms
CXPACKET5682620066314514119590733499132
BROKER_TASK_STOP50299084649277926100332181434
FT_IFTS_SCHEDULER_IDLE_WAIT297923844088436000163477
LAZYWRITER_SLEEP23268143824421671571772179345
LCK_M_RS_U2202093094436804632853979
CLR_AUTO_EVENT440305166910707720866
ASYNC_NETWORK_IO2270203920710732630251822400
REQUEST_FOR_DEADLOCK_SEARCH396461913096095131191309609
XE_TIMER_EVENT637619130951530019191308951
XE_DISPATCHER_WAIT1165191129317151270420
FT_IFTSHC_MUTEX56081191091553377397007871
LOGMGR_QUEUE6730450191018829600247495016
SQLTRACE_INCREMENTAL_FLUSH_SLEEP474791903350936720295
DBMIRROR_EVENTS_QUEUE2596974918985005245391447246
LATCH_EX587554181885471182221009073509
SLEEP_TASK1085288818373993260016798757
CHECKPOINT_QUEUE316918303238032790984462
DBMIRRORING_CMD1308461514607044899889918718
DBMIRROR_DBM_EVENT1165667014538029013363753796
THREADPOOL11651011172919514850323
BROKER_TO_FLUSH9290995661106479112314
LCK_M_U170943837357646241838000
DBMIRROR_DBM_MUTEX61252756821468638466130286
PAGEIOLATCH_SH187778596832393979225654
If there are too many queries accessing the same tables, and all of them are running in parallel, could it cause the situation where all threads (823 of them) get stuck and portray the memory state seen in the dump?
January 22, 2016 at 8:54 am
You have a lot of wait types there that are meaningless or expected. Look for one of Glenn Berry's wait scripts and use that.
The totals are also very hard to work with, because they're since SQL started. Query the DMV on a regular basis (every 30 min, for example) and take the difference of the totals, that gets you the waits in a known period, which is a lot more useful.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
January 22, 2016 at 8:56 am
Oh, and if you used the Profiler GUI against the server, you could well have caused the crash. Wouldn't be the first time.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
January 22, 2016 at 9:04 am
Hello GilaMonster,
So if I ran the trace at 3 PM and the crash took place at 6 PM, it still would have caused it?
Thanks,
January 22, 2016 at 9:04 am
And here is the script from this article:
http://www.sqlskills.com/blogs/paul/wait-statistics-or-please-tell-me-where-it-hurts/
WaitType Wait_S Resource_S Signal_S WaitCount Percentage AvgWait_S AvgRes_S AvgSig_S
LATCH_EX 2.35 2.06 0.29 6554 62.62 0.0004 0.0003 0.0000
CXPACKET 1.10 1.06 0.04 1011 29.30 0.0011 0.0011 0.0000
WRITELOG 0.18 0.17 0.01 334 4.75 0.0005 0.0005 0.0000
January 22, 2016 at 9:13 am
Does profiler leave any 'residue' in the system if I was saving the trace to file?
January 22, 2016 at 10:07 am
Profiler always has some hit to the system when it's in use. If you use the GUI on the server, that's bad (I've nearly brought down a production server once doing that, thankfully realized what I was doing and shut it down.) If you set up a server-side trace (I think that's the correct term,) so that the trace is running without the GUI and saving its results to file(s), that's less bad (for security auditing purposes, I'm *required* to do this on my servers, hasn't crashed them yet and it runs 24x7.)
Presuming you run the trace without the GUI, once you shut down the trace and ideally drop it from SQL Server, other than the files it created, it's gone.
January 22, 2016 at 10:39 am
Thank you for answering! Could you please confirm what the 'server-side trace' is?
Thank you,
January 22, 2016 at 4:08 pm
(Bear in mind, the term could be wrong)
It's a Profiler trace, but you set it up using SP_TRACE_CREATE
If you're really lazy, like me, you set up the items you want to trace on your local PC using Profiler, then script it out.
January 22, 2016 at 4:19 pm
Here's what I found on MSDN site:
2. If most of threads are stuck while trying to write profiler events
to the destination you might find stack similar to one below
sqlservr!CTraceWriteRequest::InitForRowsetTrace
There wasn't a single message like that in the crash dump.
However, I had loads of these:
"5. If you see many stacks like the one below it should be because of
excessive parallelism
sqlservr!CQScanXProducerNew::Open
sqlservr!FnProducerOpen
sqlservr!FnProducerThread"
I would say this suggests that the server crashed due to the excessive parallelism. Is that a fair assumption, given the MAXDOP was set to 0?
January 22, 2016 at 6:13 pm
electrostep (1/22/2016)
Thank you for answering! Could you please confirm what the 'server-side trace' is?Thank you,
My recommendation would be to not even worry about SQL Profiler to start with. Instead, open SSMS and then open the Object Explorer (press the {f8} key if it's not already open). Right click on the instance (very top of the tree), click on {Reports},{Standard Reports}, and then select one of the reports for "Performance - Top Queries by Total...". Depending, of course, on which report you select, you'll get the top 10 queries by CPU or IO (logical reads).
Once you've identified those worst items (and the longest running queries aren't always the worst), then you can begin the search for them. You can get the parameters used for those queries by using SQL Profiler to search for something fairly unique either for RPC Complete or SP Complete. Although it flies in the face of what most people recommend, such a search done on the server itself instead of a local copy doesn't cause much overhead especially if you don't run it for long. You only need a page or so of examples (make sure you both display TextData and filter against it and only use RPC Complete and SP Complete). The TextData column will contain the full call including any parameters that may have been used so that you can copy/paste those into SSMS and start troubleshooting.
I can't stress enough that the other folks are right about SQL Profiler... you can cripple the server by using the GUI even if you run the GUI on the server but, with the correct events (JUST RPC Complete and SP Complete) and just running it for long enough to get what you really need (a couple of dozen rows), you can do some really good troubleshooting very quickly.
I'll also add that if your server is very memory limited and you haven't set MAX Memory for SQL Server to leave some room for the OS and things like the GUI of SQL Profiler, then you absolutely should setup a "server side trace", which is much more light weight. You can do that pretty easily by setting up a trace in the SQL Profiler GUI, running it for less than a second, and then use the FILE menu to save the trace meta data to run your server side trace without having to manually write a shedload of code to do it.
--Jeff Moden
Change is inevitable... Change for the better is not.
January 22, 2016 at 6:28 pm
Hello Jeff, thank you very much for your comprehensive response.
What do you make of the server memory dump values? Is MSDN link that I've posted above correct about issues with parallelism?
January 24, 2016 at 8:46 am
It's possible that you're experiencing CPU pressure. Those queries are horrific. However, changing your max degree of parallelism is only going to make those queries run in a serial mode. Maybe they should be, maybe they shouldn't. However, the issue isn't a system setting. It's the queries. That query pattern is not going to perform well.
Again, the first thing I would do is change the cost threshold for parallelism, and then monitor the system to gather the wait statistics and queues, as Gail has noted, ignoring the waits that are not informative.
All that said, the issue is still going to come back around to the queries following that pattern.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
Viewing 15 posts - 1 through 15 (of 22 total)
You must be logged in to reply to this topic. Login to reply