CheckDB running for two days

  • Hello all,

    We carried out an in-place upgrade on our production server on Saturday - going from 2008 R2 to 2014.

    We had tested this method out in dev/test and pre-production with only minor post-upgrade issues to fix.

    However, on production we hit an issue whereby CHECKDB was pegging the CPU at 100% and causing overnight processes to hang. The CHECKDB job was killed and disabled by a colleague at 1 am.

    Since then we have restored the database to a dev server and run CHECKDB against it with NO_INFOMSGS and ALL_ERRORMSGS, but it has been running since Monday morning and still hasn't finished!
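
    For reference, the check is being run along these lines (the database name here is just a stand-in):

        -- N'OurBigDatabase' is a placeholder name
        -- suppress informational messages, report every error found
        DBCC CHECKDB (N'OurBigDatabase')
            WITH NO_INFOMSGS, ALL_ERRORMSGS;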

    The database is just over 800 GB, and whilst CHECKDB was crippling the CPU, logical reads were barely registering (less than one). However, sp_whoisactive shows it has done 56 million reads so far, and that number increases periodically, so it looks like it keeps going back to re-check parts of the database with a deep dive.
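
    For anyone wanting to watch it, percent_complete in sys.dm_exec_requests is populated for DBCC CHECKDB, so a query along these lines shows roughly how far it has got:

        -- progress of any running DBCC command
        SELECT session_id,
               command,
               percent_complete,
               total_elapsed_time / 60000 AS elapsed_minutes,
               logical_reads
        FROM sys.dm_exec_requests
        WHERE command LIKE 'DBCC%';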

    Also, in a different environment, we ran CHECKTABLE statements, and one of them took over 9 hours for a single table but came back clean (see attachment).

    Has anyone come across something like this before? I appreciate we need to wait for the output but the database is still in use in production and the mess will just get worse if it is indeed corrupted.

  • What are normal CHECKDB times for that DB? That high number of reads isn't necessarily surprising: at 8 KB per page, 800 GB is over 100 million pages, so 56 million reads hasn't even covered the database once.

    There are some things that can cause CHECKDB to run especially slowly, like non-clustered indexes on computed columns (something I've had the "pleasure" of encountering in the wild; see http://www.sqlskills.com/blogs/paul/dbcc-checkdb-performance-and-computed-column-indexes/).

    I've also seen CHECKDB take considerably more time if it's running in parallel (and that will obviously cause additional CPU pressure). So, if you're on Enterprise, have many cores, and have MAXDOP for the instance set to something fairly high, that could also be a contributing factor.

    You could test that by enabling trace flag 2528 (which disables parallel checking of objects) for the session running CHECKDB or CHECKTABLE, and seeing how that affects things. Even if parallelism isn't what's increasing the duration of the check, you might find it runs almost as quickly serially, and that will help with the CPU pressure.
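
    A sketch of what I mean (swap in your own database name):

        -- 2528 disables parallel checking of objects for this session
        DBCC TRACEON (2528);
        DBCC CHECKDB (N'YourDatabase') WITH NO_INFOMSGS, ALL_ERRORMSGS;  -- placeholder name
        DBCC TRACEOFF (2528);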

    Without having access to more information about the environment, it's hard to say for sure what might be happening, though. If you don't run CHECKDB on a regular basis, this might just be as fast as it runs on your system.

    Do you have any wait stats collected for the times CHECKDBs have been run?
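
    If not, even a snapshot of what the DBCC session is waiting on while it runs would be useful, e.g.:

        -- current waits for the session running the DBCC command
        SELECT wt.session_id,
               wt.wait_type,
               wt.wait_duration_ms,
               wt.resource_description
        FROM sys.dm_os_waiting_tasks AS wt
        JOIN sys.dm_exec_requests AS r
            ON r.session_id = wt.session_id
        WHERE r.command LIKE 'DBCC%';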

    I hope this helps!

  • A more basic question: do you have any baseline runs of the same process (CHECKDB) on SQL 2008? In general, I don't think an in-place upgrade is a good idea. Having said that, I have had issues with my own upgrade from SQL 2008 R2 to SQL 2014: I see a lot more memory pressure on SQL 2014 than on SQL 2008 R2, with the same hardware and workload.

  • We use Ola Hallengren's maintenance scripts, and previously CHECKDB would take at most about 1.5 hours for all databases on the server (about a dozen of them, some almost 1 TB).
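
    The integrity check job just calls Ola's DatabaseIntegrityCheck procedure, something like:

        -- standard Maintenance Solution call; exact parameters in our job may differ
        EXECUTE dbo.DatabaseIntegrityCheck
            @Databases = 'USER_DATABASES',
            @CheckCommands = 'CHECKDB';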

    The server is running Windows 2008 R2 and SQL 2014 (I appreciate this isn't ideal), with a four-core CPU and 128 GB of RAM.

    The wait stats are page IO (PAGEIOLATCH) as it reads in the data, and then SOS_SCHEDULER_YIELD waits as the thread exhausts its CPU quantum.

    On the plus side, we may have found the issue. We believe it is related to the number of partitions: some 7,000 per table! Someone thought it was a good idea to partition by day, and the partitions run from 2002 until 2021. One of the dev/DBAs here has done some good investigative work, and we're pretty sure this is the cause.
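
    For anyone curious, a query along these lines will show the partition counts per table:

        -- partitions per table, counted on the heap or clustered index
        SELECT t.name AS table_name,
               COUNT(*) AS partition_count
        FROM sys.tables AS t
        JOIN sys.partitions AS p
            ON p.object_id = t.object_id
        WHERE p.index_id IN (0, 1)
        GROUP BY t.name
        ORDER BY partition_count DESC;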

    Also, CHECKDB has finally finished and there's no corruption - phew! 😀

    Thanks all for replying

  • I'm glad you seem to have found the issue!

    There actually was a hint of that even in the article I cited, and I just overlooked it because of the focus on computed columns.

    As Paul points out there, in 2012 two hash tables are created for each partition of a non-clustered index. With some 7,000 partitions per table, that's around 14,000 hash tables per index, so I imagine that could cause some problems.

    That seems to be confirmed by this Connect item: https://connect.microsoft.com/SQLServer/feedback/details/833842/checkdb-takes-considerably-longer-on-sql-2012-vs-sql-2008-r1-r2-with-many-partitions

    Cheers!

  • Response received from MS. They basically acknowledged the issue and said a fix was being prepared for a new version of SQL Server (hopefully that means a hotfix and not 2016?!).

    As a workaround, he suggested using PHYSICAL_ONLY checks for the time being (which I used to run at my last place, and it worked well).
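
    For anyone else who hits this, the workaround is just:

        -- page-level checks only; skips the expensive logical checks
        DBCC CHECKDB (N'YourDatabase') WITH PHYSICAL_ONLY, NO_INFOMSGS;  -- placeholder name

    (Or @PhysicalOnly = 'Y' if you run it through Ola's DatabaseIntegrityCheck.)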

    Jacob Wilkins - that Connect item seems to be exactly what we are experiencing.

    Thanks again for replying.
