DBCC Check hanging, killed but does not roll back.

  • Hi everyone.

    I have a problem with a Killed session.
    The command Issued was
    DBCC CHECKDB ([APPDB]) WITH NO_INFOMSGS, ALL_ERRORMSGS
    (Ola Halengrens scripts were used here)

    The session id 188 which was running the command got killed and is now in the Rollback State since 2 Days!!
    it stays at 45% and estimatet rollback time is growing with each run of
    KILL 188 WITH STATUSONLY
    SPID 188: transaction rollback in progress. Estimated rollback completion: 45%. Estimated time remaining: 267449 seconds.

    Our Monitoring Solution states the Session is Killed/RollBack, in status suspended and The Wait Type is EXECSYNC
    Activity Monitor say s Wait Type as CHECK_TABLES_THREAD_BARRIER.

    Please Help!!
    I fear if i fail the server over to another cluster node it will not stop normaly.
    What can i do?

    edit: this is the output of waiting Tasks
    session_id    exec_context_id    scheduler_id    wait_duration_ms    wait_type    blocking_session_id    resource_description    Node ID    text
    188    0    4    233597001    EXECSYNC    NULL    NULL    NULL    DBCC CHECKDB ([APPDB]) WITH NO_INFOMSGS, ALL_ERRORMSGS, DATA_PURITY
    188    9    6    233597000    CHECK_TABLES_THREAD_BARRIER    NULL    NULL    NULL    DBCC CHECKDB ([APPDB]) WITH NO_INFOMSGS, ALL_ERRORMSGS, DATA_PURITY
    188    10    3    233597000    CHECK_TABLES_THREAD_BARRIER    NULL    NULL    NULL    DBCC CHECKDB ([APPDB]) WITH NO_INFOMSGS, ALL_ERRORMSGS, DATA_PURITY
    188    11    1    233597000    CHECK_TABLES_THREAD_BARRIER    NULL    NULL    NULL    DBCC CHECKDB ([APPDB]) WITH NO_INFOMSGS, ALL_ERRORMSGS, DATA_PURITY
    188    12    11    233597003    CXPACKET    NULL    exchangeEvent id=Pipe87e3ccae00 WaitType=e_waitPipeNewRow nodeId=3    3    DBCC CHECKDB ([APPDB]) WITH NO_INFOMSGS, ALL_ERRORMSGS, DATA_PURITY
    188    13    8    233597000    CHECK_TABLES_THREAD_BARRIER    NULL    NULL    NULL    DBCC CHECKDB ([APPDB]) WITH NO_INFOMSGS, ALL_ERRORMSGS, DATA_PURITY
    188    14    0    233597000    CHECK_TABLES_THREAD_BARRIER    NULL    NULL    NULL    DBCC CHECKDB ([APPDB]) WITH NO_INFOMSGS, ALL_ERRORMSGS, DATA_PURITY
    188    15    2    233597000    CHECK_TABLES_THREAD_BARRIER    NULL    NULL    NULL    DBCC CHECKDB ([APPDB]) WITH NO_INFOMSGS, ALL_ERRORMSGS, DATA_PURITY
    188    16    9    233597000    CHECK_TABLES_THREAD_BARRIER    NULL    NULL    NULL    DBCC CHECKDB ([APPDB]) WITH NO_INFOMSGS, ALL_ERRORMSGS, DATA_PURITY

  • Hi Johnny,
    All transactions take as long to rollback as it did to run. How big is the database?
    I would check to see if anything is blocking it and what else may be running.  Exec sp_who2. I would look up "sp_who_3" online it gives much more detail than the sp_who2.
    Another alternative is to restart the service (not the server), however last alternative if this is production.
    Hope this helps.

    ¤ §unshine ¤

  • Nope
    Rollbacks are singlethtreaded.
    They take longer to rollback.
    Try looking at this post.
    Paul Randall says that you have to let it complete the rollback to the snapshot that was created for the checkdb process.
    https://sqlmag.com/blog/unkillable-dbcc-checkdb

    Catch-all queries done right [/url]
    Gail Shaw's Performance Blog[/url]

  • Hm. 
    the weekend is gone (d'oh!) and the transaction is still in rollback state. (d'oh!)
    No blocking sessions found.
    Database is about 70GB in Size.
    so it looks like i have to reboot the service?
    will it shut down cleanly and in an normal amount of time? or will it again wait for the rollback to complete?
    nothing else i can do?

    As this is a production server i'm a bit concerned.

  • JohnyRotten - Monday, June 12, 2017 2:37 AM

    Hm. 
    the weekend is gone (d'oh!) and the transaction is still in rollback state. (d'oh!)
    No blocking sessions found.
    Database is about 70GB in Size.
    so it looks like i have to reboot the service?
    will it shut down cleanly and in an normal amount of time? or will it again wait for the rollback to complete?
    nothing else i can do?

    As this is a production server i'm a bit concerned.

    Don't reboot the service, just leave the the rollback to continue. Rebooting will make no difference to the rollback.

    Thanks

  • But the rollback is running for 6 days now, and it still shows 45% rollback completion. 
    The linked information from @MadAdmin (thx) states, the rollback should be in the database snapshot of DBCC Checkdb, and be removed, once the service restarts.

    https://sqlmag.com/blog/unkillable-dbcc-checkdb

  • JohnyRotten - Monday, June 12, 2017 3:19 AM

    But the rollback is running for 6 days now, and it still shows 45% rollback completion. 
    The linked information from @MadAdmin (thx) states, the rollback should be in the database snapshot of DBCC Checkdb, and be removed, once the service restarts.

    https://sqlmag.com/blog/unkillable-dbcc-checkdb

    The rollback has to complete to leave the database in a consistent state. If you restart the service the rollback will continue.

    Thanks

  • Just for the record.
    The session was running until just now, when I restarted sql service.
    Now its gone.

  • Sometimes a rollback will get hung and you need to reboot to clear it.

    You want to have an idea of normal DBCC time. If this normally took 4 hours and the rollback got stuck more than 12, I'd just reboot. I'm guessing here, and I have no scientific data on how long to wait, but at 2-3x normal, I'd just restart.

  • Also, please don't stop running dbcc. You need to run these, but I'd do this on another machine. Restore a backup and run this regularly on another machine

  • Did a human kill CHECKDB or was CHECKDB unexpectedly killed by a high severity error that was raised to SQL Server's errorlog?

    If the latter, you need to inspect CHECKDB's partial results (if any)  for error messages about corruption. If messages are found, you may need to make some hard choices. If you see the errorlog handling OS error 1450 or 665, and that is temporally related to CKECKDB being rolled back/killed, you are facing a fragmented file system caused by CHECKDB's use of snapshots. While some workarounds are in Sparse File Errors: 1450 or 665 due to file fragmentation: Fixes and Workarounds, the  best workaround is to restore the backup on another server and run CHECKDB there. You can call me paranoid, but I always want to ensure I can restore what I backed up (because 'garbage backed up equals garbage restored'). I run CHECKDB on full backups that have been restored on one of my DR servers.

    CHECK_TABLES_THREAD_BARRIER is an interesting wait_type, because it is not documented. I don't have access to source code - I can only wax poetic. CHECKDB introduced its MAXDOP hint in SQL Server 2014. while CHECK_TABLES_THREAD_BARRIER was introduced in SQL Server 2016. CHECKDB's MAXDOP hint overrides the server's max degree of parallelism setting, and includes a warning "If MAXDOP is set to zero then SQL Server chooses the max degree of parallelism to use." There are blogs about setting an appropriate max degree of parallelism, so no need for a detailed discussion here, but your system's setting invariably should not be at its default value of 0 (including  the server where your backups are being restored).

    For me, CHECK_TABLES_THREAD_BARRIER implies CHECKDB ran out of available threads. Because that wait_type was introduced by the next major version after CHECKDB's MAXDOP introduction, I can imagine there having been at least a few SQL Server 2014 MS Support cases where a memory dump had to be collected, because the SQL OS (SOS) was unresponsive/stalled while CHECKDB was running (and THREADPOOL waits were either not reported or not sufficiently diagnostic)..  As a result I imagine SQL Server 2016 developers chose to throttle the number of threads that a DBCC CHECK can consume (for the good of the SOS), and the developers chose to use the CHECK_TABLES_THREAD_BARRIER wait_type as a diagnostic wait_type that avoids the need for a memory dump. 

    There is an old fable about the "tortoise and the hare". While a non-zero MAXDOP or max degree of parallelism setting might make CHECKDB slower, but they are also less likely to occasionally hang the system (or CHECKDB) while waiting for available thread(s). I set max degree of parallelism to 5 on my 24-way OLTP servers, simply because I like prime numbers. You should do some research and test before setting one or both of them. If already set, try an even lower number for CHECKDB (and if being done, strive to avoid running CHECKDB on a production server:).

  • Session was killed by human intervention. I think it was killed because another session was blocked by Checkdb. (I'm not sure what the initial intention was, but i will prevent that in future)
    I could not find errors you mentioned, so I think the Database has no problem.
    MaxDOP = 8 for the Server.
    I will test running CHECKDB with lower Maxdop, and see if the problem comes up again,
    I tested Backup Restore and CHECKDB and it ran through without any error.
    Thanks to all.

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply