Query never completes with CXPACKET waits

  • Paul Randal wrote a wait stats query which is probably a good one to run for this one. This is a link to the current version: 
    Updated sys.dm_os_waiting_tasks script

    You may want to take a look at what is going on with the schedulers as well since it can potentially show a few different issues. Something like: 
    select scheduler_id, current_tasks_count, runnable_tasks_count,
    current_workers_count, active_workers_count,
    work_queue_count, pending_disk_io_count, load_factor, failed_to_create_worker
    from sys.dm_os_schedulers
    where status = 'VISIBLE ONLINE'

    And execute it throughout as sustained high values are usually more important than a spike. You'd especially want to do that if runnable task count is high.

    When you primarily see cxpackets waits and nothing else is running on the box, that's pretty much what is to be expected for a parallel plan and nothing else executing.
    But when running parallel, there is usually one process in there that is not waiting on cxpackets and that would be the one to pay attention to. You probably want to focus on things other than just the cxpacket waits since those are normal.
    Have you looked at the disk stats at all? I would probably take a look at those as well.

    Sue

  • What's the compatibility level of the database?

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Hi Gail,
    The compatibility level of the test server that completes is SQL Server 2008(100).  On the new 2016 and 2012 production server where it never completes, it is 110.

    Thanks!
    Bea Isabelle

  • Isabelle2378 - Monday, February 13, 2017 11:14 AM

    Hi,
    I agree on the comments that MAXDOP is not the solution or that CXPACKETS is the problem.  I'm trying to figure out why it's behaving this way on my production server but having no problems completing on the test server.  This is why I restored the same exact database to another test server, has the same memory, the same MAXDOP settings (default 0) and getting 2 different results.  My 2008R2 test server completes in 40 minutes (3GB memory, MAXDOP 0) and my 2016 test server never finishes.  Just hangs and the only thing I see is suspended waiting on CXPACKETS.  There is no load on this server, it is brand new and no one is using it but me for this testing.  So I'm trying to figure out what am I missing?  The indexes are all the same, I rebuilt anything that was fragmented over 30%, updated statistics on both databases.  It's the same exact query so what else can I look at on this 2016 server to try and see why it won't complete? I was hoping to get some fresh eyes to maybe point out something obvious that I'm just not seeing. Both systems are running collation Latin1_General_BIN.  I asked the analyst to check the language setting of the application to see if he see's anything different with the way the 2 queries are running the code.  Any other thoughts?

    Some smart watches have more than 3GB as standard. Have you checked the minimum and minimum recommended memory requirements of your server versions? Why so little RAM when it's do cheap? We use test servers with 》200x this amount. 3GB is asking for trouble.


    [font="Arial"]Low-hanging fruit picker and defender of the moggies[/font]

    For better assistance in answering your questions, please read this[/url].


    Understanding and using APPLY, (I)[/url] and (II)[/url] Paul White[/url]

    Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]

  • Hi,
    Our production server has 48GB of memory.  The test servers only have 16GB of memory and this has a named instance that has 3GB allotted to it.  The test server with only 3GB of memory has no problems with the query completing.  We are only using these servers to test it out so memory is not really a concern here.  The production server with 48GB is having the issue with it not completing and my 2016 test box that I setup is just for testing.

    I have looked at the disks and see latency on the production server:
    Drive    Volume Mount Point    Read Latency    Write Latency    Overall Latency    Avg Bytes/Read    Avg Bytes/Write    Avg Bytes/Transfer
    S:          S:\                           17                     3                      4                        448146                 57218                  64094
    K:          K:\                           88                     28                    87                       74360                  71404                  74264
    G:         G:\                           4                       400                   201                     65104                  65584                  65343

    And here's on the test server that completes the query okay:

    Drive    Volume Mount Point    Read Latency    Write Latency    Overall Latency    Avg Bytes/Read    Avg Bytes/Write    Avg Bytes/Transfer
    W:        W:\                           224                   1                       5                        904816                 13143                  25745
    K:         K:\                            80                     140                   82                       1004123               40982                  962235

    Thanks!
    Bea Isabelle

  • Not sure what you used for the IO stats but if you were using dm_io_virtual_file_stats then that just tells you the cumulative stats since the server started. If that's what you are using, you would want to capture the data for a baseline and then run the query and capture it again after it's been running for an hour or so. The second set minus the baseline tells you what the IO activity was during that hour, or however long between the two.
    If there were IO issues, I'd be surprised to not see it in pending_disk_io_count when querying the schedulers. Or with the waiting tasks.

    Sue

  • Isabelle2378 - Monday, February 13, 2017 1:12 PM

    Hi Gail,
    The compatibility level of the test server that completes is SQL Server 2008(100).  On the new 2016 and 2012 production server where it never completes, it is 110.

    Shouldn't make a difference with those settings, none of those are using the new cardinality estimator (available from compat mode 120)

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • First, I don't recall seeing how many processors your 2008R2 server has.  If it only has 3GB, any chance it only has a single core as well?  When MAXDOP=0, MAXDOP = #Cores.  I have experienced similar problems in the past when code was promoted to big iron only to run into concurrency issues that were not reproducible in the lower environments because of resource limits.

    My assumptions:
    1. You are creating a queue or worker or drain table, to grab the "next" item to work on and you are running it in multiple threads concurrently
    2. #HIndex is created in a parent stored procedure or transaction and is available to all threads that are running the query that is not working.

    Try this

    --I don't know the length of your key

    DECLARE @T TABLE (Xkey VARCHAR(100));
    WITH CTE
    AS (SELECT TOP (1)
            XKey
       FROM    #HIndex WITH (READPAST)
       )
        DELETE  FROM CTE
        OUTPUT Deleted.XKey
        INTO @T;

    That will get the XKey you deleted into a table variable without using two separate statements.  The readpast will tell the query to ignore any rows that are already locked by another thread.  That would replace your select & delete statement for #Hindex.

    It would be helpful if you could give a bit more information, for example, why are you inserting into #Hindex index, then selecting 1 record from it, which is then deleted?

    If these are worker tables, any chance you can move any of those temp tables to In-Memory?  You will likely see some performance improvements if you have memory to spare.

    Wes
    (A solid design is always preferable to a creative workaround)

  • Hi,
    The query is coming from an application so I have no control over how it is written and I cannot modify it.  I am using the same exact query on all 3 servers.  Here is details about each server...my main concern is to try and get the query to complete on the Production SQL 2012 server.  Our analyst uses the 2008R2 test server and has no problems running the query.  I'm only using the 2016 test server because it is duplicating the problem that I'm seeing on the 2012 production server.  This way I can test different things out to try and fix the problem and not have anyone else using the server. 

    Test Server àquery completes OK.  No issues here

    SQL 2008 R2 SP3 named instance with 3GB memory

    Windows Server 2008R2 Standard

    Logical CPU Count: 2 Hyperthread Ratio: 1 Physical CPUCount:2

    MAXDOP:0

     

    Production Server àquery never completes…CXPACKET waits

    SQL 2012 named instance with 10GB memory
    Windows Server 2012 R2 Standard

    SQL Server detected 2 sockets with 4 cores per socket and 4logical processors per socket, 8 total logical processors; using 8 logicalprocessors based on SQL Server licensing.

    MAXDOP was set 0. Query never completed…set MAXDOP: 6 improved CPU usage but query stillnever completes.

     

    Test Server I am using for testing à query never completes…CXPACKETwaits

    SQL 2016 named instance with 3GB of memory
    Windows Server 2012 R2 Standard

    SQL Server detected 2 sockets with 2 cores per socket and 2logical processors per socket, 4 total logical processors; using 4 logicalprocessors based on SQL Server licensing.

    MAXDOP: 0

    Thanks!
    Bea Isabelle

  • As I suggested before, I don't think CXPACKET is the root cause.  What other wait types do you get for the same process?  Please run this while you're waiting (fruitlessly) for the query to complete:

    WITH tasks AS (
        SELECT *, COUNT(*) OVER (PARTITION BY session_id) AS threadcount
        FROM sys.dm_os_waiting_tasks
    )
    SELECT * FROM tasks
    WHERE threadcount > 1
    AND session_id IS NOT null

    John

  • Isabelle2378 - Wednesday, February 15, 2017 3:59 PM

    The query is coming from an application so I have no control over how it is written and I cannot modify it. 

    Can you reduce the number of processors available to SQL Server on the Test server temporarily?  That is not a permanent solution but is just something to try.  If that eliminates the problem and you can't change the query, then you'll need to go back to the application developer and let them know the query needs to be changed to better support a multi-core server.  CXPacket waits are just a symptom of the problem. 

    Wes
    (A solid design is always preferable to a creative workaround)

  • For reference, here are the two query plans on PasteThePlan:  

    Query_that_completed_40_minutes: https://www.brentozar.com/pastetheplan/?id=SJklWSXFl

    Query_not_finishing_still_running: https://www.brentozar.com/pastetheplan/?id=S1OmbS7Fl

  • Sorry wrong question

  • Ethan.Hinton - Thursday, February 16, 2017 8:16 AM

    For reference, here are the two query plans on PasteThePlan:  

    Query_that_completed_40_minutes: https://www.brentozar.com/pastetheplan/?id=SJklWSXFl

    Query_not_finishing_still_running: https://www.brentozar.com/pastetheplan/?id=S1OmbS7Fl

    Same as last time - two completely different plans. I think you should resolve this before tackling the performance issue.


    [font="Arial"]Low-hanging fruit picker and defender of the moggies[/font]

    For better assistance in answering your questions, please read this[/url].


    Understanding and using APPLY, (I)[/url] and (II)[/url] Paul White[/url]

    Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]

  • Hi,
    Based on the fact that it looks like two completely different queries are running for TST and Production, I went ahead and restored the Test database to the 2016 test server and ran that query against it and it completed in 50 minutes.  So it's definitely something with the production data and stored procedure that is the problem.  The analyst assured me that they were the same but after seeing the query plans that you posted, he looked further and came back that he does see that they are different.  Thanks everyone who responded and helped pinpoint the issue.  I truly appreciate all the help! 😀  Chris, thanks for posted the plans using PasteThePlan.  I've never seen that before and it was very helpful!

    Thanks!
    Bea Isabelle

Viewing 15 posts - 16 through 29 (of 29 total)

You must be logged in to reply to this topic. Login to reply