Query never completes with CXPACKET waits

Question

Query never completes with CXPACKET waits

Viewing 15 posts - 16 through 29 (of 29 total)

You must be logged in to reply to this topic. Login to reply

Sue_H SSC Guru Points: 90860 More actions · Answer 1

Paul Randal wrote a wait stats query which is probably a good one to run for this one. This is a link to the current version:
Updated sys.dm_os_waiting_tasks script

You may want to take a look at what is going on with the schedulers as well since it can potentially show a few different issues. Something like:
select scheduler_id, current_tasks_count, runnable_tasks_count, current_workers_count, active_workers_count, work_queue_count, pending_disk_io_count, load_factor, failed_to_create_worker from sys.dm_os_schedulers where status = 'VISIBLE ONLINE'
And execute it throughout as sustained high values are usually more important than a spike. You'd especially want to do that if runnable task count is high.

When you primarily see cxpackets waits and nothing else is running on the box, that's pretty much what is to be expected for a parallel plan and nothing else executing.
But when running parallel, there is usually one process in there that is not waiting on cxpackets and that would be the one to pay attention to. You probably want to focus on things other than just the cxpacket waits since those are normal.
Have you looked at the disk stats at all? I would probably take a look at those as well.

Sue

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 2

What's the compatibility level of the database?

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

Isabelle2378 SSCertifiable Points: 6174 More actions · Answer 3

Hi Gail,
The compatibility level of the test server that completes is SQL Server 2008(100). On the new 2016 and 2012 production server where it never completes, it is 110.

Thanks!
Bea Isabelle

ChrisM@home SSC-Insane Points: 24260 More actions · Answer 4

Isabelle2378 - Monday, February 13, 2017 11:14 AM
Hi,
I agree on the comments that MAXDOP is not the solution or that CXPACKETS is the problem. I'm trying to figure out why it's behaving this way on my production server but having no problems completing on the test server. This is why I restored the same exact database to another test server, has the same memory, the same MAXDOP settings (default 0) and getting 2 different results. My 2008R2 test server completes in 40 minutes (3GB memory, MAXDOP 0) and my 2016 test server never finishes. Just hangs and the only thing I see is suspended waiting on CXPACKETS. There is no load on this server, it is brand new and no one is using it but me for this testing. So I'm trying to figure out what am I missing? The indexes are all the same, I rebuilt anything that was fragmented over 30%, updated statistics on both databases. It's the same exact query so what else can I look at on this 2016 server to try and see why it won't complete? I was hoping to get some fresh eyes to maybe point out something obvious that I'm just not seeing. Both systems are running collation Latin1_General_BIN. I asked the analyst to check the language setting of the application to see if he see's anything different with the way the 2 queries are running the code. Any other thoughts?

Some smart watches have more than 3GB as standard. Have you checked the minimum and minimum recommended memory requirements of your server versions? Why so little RAM when it's do cheap? We use test servers with ã€‹200x this amount. 3GB is asking for trouble.

[font="Arial"]^{Low-hanging fruit picker and defender of the moggies}[/font]

For better assistance in answering your questions, please read this[/url].

Understanding and using APPLY, (I)[/url] and (II)[/url] Paul White[/url]

Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]

Isabelle2378 SSCertifiable Points: 6174 More actions · Answer 5

Hi,
Our production server has 48GB of memory. The test servers only have 16GB of memory and this has a named instance that has 3GB allotted to it. The test server with only 3GB of memory has no problems with the query completing. We are only using these servers to test it out so memory is not really a concern here. The production server with 48GB is having the issue with it not completing and my 2016 test box that I setup is just for testing.

I have looked at the disks and see latency on the production server:
Drive    Volume Mount Point    Read Latency    Write Latency    Overall Latency    Avg Bytes/Read    Avg Bytes/Write    Avg Bytes/Transfer
S:          S:\                           17                     3                      4                        448146                 57218                  64094
K:          K:\                           88                     28                    87                       74360                  71404                  74264
G:         G:\                           4                       400                   201                     65104                  65584                  65343

And here's on the test server that completes the query okay:

Drive    Volume Mount Point    Read Latency    Write Latency    Overall Latency    Avg Bytes/Read    Avg Bytes/Write    Avg Bytes/Transfer
W:        W:\                           224                   1                       5                        904816                 13143                  25745
K:         K:\                            80                     140                   82                       1004123               40982                  962235

Thanks!
Bea Isabelle

Sue_H SSC Guru Points: 90860 More actions · Answer 6

Not sure what you used for the IO stats but if you were using dm_io_virtual_file_stats then that just tells you the cumulative stats since the server started. If that's what you are using, you would want to capture the data for a baseline and then run the query and capture it again after it's been running for an hour or so. The second set minus the baseline tells you what the IO activity was during that hour, or however long between the two.
If there were IO issues, I'd be surprised to not see it in pending_disk_io_count when querying the schedulers. Or with the waiting tasks.

Sue

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 7

Isabelle2378 - Monday, February 13, 2017 1:12 PM
Hi Gail,
The compatibility level of the test server that completes is SQL Server 2008(100). On the new 2016 and 2012 production server where it never completes, it is 110.

Shouldn't make a difference with those settings, none of those are using the new cardinality estimator (available from compat mode 120)

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

whenriksen SSCarpal Tunnel Points: 4752 More actions · Answer 8

First, I don't recall seeing how many processors your 2008R2 server has. If it only has 3GB, any chance it only has a single core as well? When MAXDOP=0, MAXDOP = #Cores. I have experienced similar problems in the past when code was promoted to big iron only to run into concurrency issues that were not reproducible in the lower environments because of resource limits.

My assumptions:
1. You are creating a queue or worker or drain table, to grab the "next" item to work on and you are running it in multiple threads concurrently
2. #HIndex is created in a parent stored procedure or transaction and is available to all threads that are running the query that is not working.

Try this
--I don't know the length of your key
DECLARE @T TABLE (Xkey VARCHAR(100));
WITH CTE
AS (SELECT TOP (1)
        XKey
   FROM    #HIndex WITH (READPAST)
   )
    DELETE FROM CTE
    OUTPUT Deleted.XKey
    INTO @T;

That will get the XKey you deleted into a table variable without using two separate statements. The readpast will tell the query to ignore any rows that are already locked by another thread. That would replace your select & delete statement for #Hindex.

It would be helpful if you could give a bit more information, for example, why are you inserting into #Hindex index, then selecting 1 record from it, which is then deleted?

If these are worker tables, any chance you can move any of those temp tables to In-Memory? You will likely see some performance improvements if you have memory to spare.

Wes
(A solid design is always preferable to a creative workaround)

Isabelle2378 SSCertifiable Points: 6174 More actions · Answer 9

Hi,
The query is coming from an application so I have no control over how it is written and I cannot modify it. I am using the same exact query on all 3 servers. Here is details about each server...my main concern is to try and get the query to complete on the Production SQL 2012 server. Our analyst uses the 2008R2 test server and has no problems running the query. I'm only using the 2016 test server because it is duplicating the problem that I'm seeing on the 2012 production server. This way I can test different things out to try and fix the problem and not have anyone else using the server.

Test Server àquery completes OK. No issues here

SQL 2008 R2 SP3 named instance with 3GB memory

Windows Server 2008R2 Standard

Logical CPU Count: 2 Hyperthread Ratio: 1 Physical CPUCount:2

MAXDOP:0

Production Server àquery never completes…CXPACKET waits

SQL 2012 named instance with 10GB memory
Windows Server 2012 R2 Standard

SQL Server detected 2 sockets with 4 cores per socket and 4logical processors per socket, 8 total logical processors; using 8 logicalprocessors based on SQL Server licensing.

MAXDOP was set 0. Query never completed…set MAXDOP: 6 improved CPU usage but query stillnever completes.

Test Server I am using for testing à query never completes…CXPACKETwaits

SQL 2016 named instance with 3GB of memory
Windows Server 2012 R2 Standard

SQL Server detected 2 sockets with 2 cores per socket and 2logical processors per socket, 4 total logical processors; using 4 logicalprocessors based on SQL Server licensing.

MAXDOP: 0

Thanks!
Bea Isabelle

John Mitchell-245523 SSC Guru Points: 148809 More actions · Answer 10

As I suggested before, I don't think CXPACKET is the root cause. What other wait types do you get for the same process? Please run this while you're waiting (fruitlessly) for the query to complete:

WITH tasks AS ( SELECT *, COUNT(*) OVER (PARTITION BY session_id) AS threadcount FROM sys.dm_os_waiting_tasks ) SELECT * FROM tasks WHERE threadcount > 1 AND session_id IS NOT null

John

whenriksen SSCarpal Tunnel Points: 4752 More actions · Answer 11

Isabelle2378 - Wednesday, February 15, 2017 3:59 PM
The query is coming from an application so I have no control over how it is written and I cannot modify it.

Can you reduce the number of processors available to SQL Server on the Test server temporarily? That is not a permanent solution but is just something to try. If that eliminates the problem and you can't change the query, then you'll need to go back to the application developer and let them know the query needs to be changed to better support a multi-core server. CXPacket waits are just a symptom of the problem.

Wes
(A solid design is always preferable to a creative workaround)

Ethan.Hinton SSC Enthusiast Points: 122 More actions · Answer 12

For reference, here are the two query plans on PasteThePlan:

Query_that_completed_40_minutes: https://www.brentozar.com/pastetheplan/?id=SJklWSXFl

Query_not_finishing_still_running: https://www.brentozar.com/pastetheplan/?id=S1OmbS7Fl

Erick Bailey SSCommitted Points: 1862 More actions · Answer 13

Erick Bailey

SSCommitted

Points: 1862

February 16, 2017 at 1:51 pm

#1928970

Sorry wrong question

ChrisM@home SSC-Insane Points: 24260 More actions · Answer 14

Ethan.Hinton - Thursday, February 16, 2017 8:16 AM
For reference, here are the two query plans on PasteThePlan:
Query_that_completed_40_minutes: https://www.brentozar.com/pastetheplan/?id=SJklWSXFl
Query_not_finishing_still_running: https://www.brentozar.com/pastetheplan/?id=S1OmbS7Fl

Same as last time - two completely different plans. I think you should resolve this before tackling the performance issue.

[font="Arial"]^{Low-hanging fruit picker and defender of the moggies}[/font]

For better assistance in answering your questions, please read this[/url].

Understanding and using APPLY, (I)[/url] and (II)[/url] Paul White[/url]

Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]

Isabelle2378 SSCertifiable Points: 6174 More actions · Answer 15

Hi,
Based on the fact that it looks like two completely different queries are running for TST and Production, I went ahead and restored the Test database to the 2016 test server and ran that query against it and it completed in 50 minutes. So it's definitely something with the production data and stored procedure that is the problem. The analyst assured me that they were the same but after seeing the query plans that you posted, he looked further and came back that he does see that they are different. Thanks everyone who responded and helped pinpoint the issue. I truly appreciate all the help! 😀 Chris, thanks for posted the plans using PasteThePlan. I've never seen that before and it was very helpful!

Thanks!
Bea Isabelle