Stuck Parallel Query

  • I've never seen this before, and I have no idea what to do (other then kill the SPID).

    Third party vendor code running on SQL 2014 SP2 generates a slowness complaint and the vendor is engaged to help understand whats going on. The workload is an ETL and they are transforming some data and the query goes parallel with a DOP of 8, but stalls and no progress happens for hours. They ask me to take a look and I do some digging and this is what I find:

    session_id    scheduler_id    worker_address    context_switch_count    task_state    wait_type    wait_duration_ms
    165    0    0x000000375BF28160    6793019    RUNNING    NULL    NULL
    165    0    0x000000375514E160    443453    SUSPENDED    CXPACKET    12692158
    165    1    0x0000002BF5AC8160    69815    SUSPENDED    CXPACKET    12692158
    165    2    0x00000028B24D6160    48514    SUSPENDED    CXPACKET    12692158
    165    3    0x0000001E97970160    799102    SUSPENDED    CXPACKET    12692158
    165    4    0x000000065A910160    74906    SUSPENDED    CXPACKET    12692158
    165    5    0x000000380650E160    342172    SUSPENDED    CXPACKET    12692358
    165    5    0x000000314495E160    5047486    SUSPENDED    CXPACKET    12692158
    165    6    0x0000002EAD0E6160    352198    SUSPENDED    CXPACKET    12692158
    165    7    0x000000199D466160    106761    SUSPENDED    CXPACKET    12692158

    The  documented wait type for the running worker is SOS_SCHEDULER_YIELD which to me means its  using up its quantum and going back to the runnable list. It really seems that its deadlocked or something but SQL isn't sensing it? I checked memory grants and everything seems ok. Nothing waiting in sys.dm_io_pending_io_requests and the SPID is just accumulating CXPACKET waits.

    Thoughts?

  • Not nearly enough information.  My recommendation is to download a copy of sp_WhoIsActive by Adam Machanic (Google for it, please) and find out what is running (it'll show the actual code) when this happens.  From the stall I see, I suspect an "illegal update" but there's no way of knowing without being able to see the code that's running.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • The query is an insert that is part of a stored procedure. They are using a TABLOCK hint on the table being written to as well.
    (table and some column names have been changed). The only I've noticed is it appears they've duplicated a condition in their inner join in the where clause as well.

    INSERT
        INTO dbo.TEMP WITH (TABLOCK)
    SELECT
        T1.GROUP,
        T1.INVOICE,
        NULL AS LAST_TRANSACTION_ID,
        NULL AS LAST_TXN_NUM,
        NULL AS LAST_CODE
    FROM
        dbo.TEMP_WORK T1
    INNER JOIN
        TEMP_WORK T2
        ON    T2.GROUP = T1.GROUP AND
            T2.INVOICE = T1.INVOICE AND
            T2.CODE <> T1.CODE AND
            T2.POST_PERIOD = T1.POST_PERIOD
    WHERE
        T1.CODE <> T2.CODE AND
        T1.CUR_AR <> 0 AND
        T2.CUR_AR <> 0
    GROUP BY
        T1.GROUP,
        T1.INVOICE

    The other detail I should have mentioned was that there were no IOs accumulating -- nothing was being read by the parallel worker threads and nothing was being written either. There were pending IOs on one of the parallel threads, however I checked the pending IO dmv there wasn't anything listed. I find that a bit weird.

  • It appears to me that this is an attempt at the "simple" deduplication of data (Temp_Work is self joined) combined with a check for valid data (the <> 0 stuff).  That leads me to believe that a simple DISTINCT or partitioned ROW_NUMBER() would do the trick quite nicely here.  I'm looking deeper.

    In the meantime, can you confirm that's what the intent of this code is?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Adding a bit of a correction, it looks like the dupe-check goes a bit further.  It also looks like the only time a particular group/invoice should be returned is if there's more than one distinct code for any given group/invoice/post_period.  Is that also what was intended by this code?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden - Wednesday, December 12, 2018 7:09 PM

    Adding a bit of a correction, it looks like the dupe-check goes a bit further.  It also looks like the only time a particular group/invoice should be returned is if there's more than one distinct code for any given group/invoice/post_period.  Is that also what was intended by this code?

    I'm not sure exactly what they are trying to do here, but while doing some testing around this (applying a maxdop of 1) they realized that there was duplicate data upstream of them that had caused an explosion in the number of records.

    Not sure if it is actually the true root cause of the query hanging and not make any progress, but it certainly could explain the initial slowness complaint. Since I wasn't familiar with the data itself I didn't pick up on it.

    Thanks for your help, and hopefully this will also lead to some clean up of some of the more inefficient queries that were found.

    Jim

  • No problem.  Thanks for the feedback.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply