What is going on with my rebuilding index job?

  • We migrated a major server to a new cluster.  Installed SQL Server 2017 and set up all of the nodes and the AGs.

    After a couple of months, we started seeing an occasional blocking of jobs in the morning.  The indexoptimize_user_databases job was still running.

    We put in some code to kill the job at 6:00 am.  At the same, time we been troubleshooting.  I've studied hallengren's script and I can't see what is taking the index optimize job that starts at 12:00 am to still be running at 6:00 am.

    1.  I am using hallengren's index maintenance script.
    2. I know the tlogs backup jobs are still processing.
    3. We have a third party replication software that is replicating data at the same time.
    4. It is not always stopping at the same index when it is stopped.
    5. When I check the SQL Server monitoring software, I see waitypes of LCK_M_SCH_S from the replicating software and LCK_M_SCH_M from SQLCMD.
    6. We are using the ONLINE parameter.  ALTER INDEX [PK_indexname] ON [DB].[SCHEMA].[tABLE] REBUILD WITH (SORT_IN_TEMPDB = OFF, ONLINE = ON)
    7. I've added a maxdop parameter - @maxdop = 16
    8. I added up the sizes of the user databases.  The sum of the DB sizes is about 4 TB.

    I'm not sure what else to review.  I've checked the command table for the hallengren script, I see alter index statements up to a point and then I see a start time and a null for the end time.  I assume that is when the job was cancelled.

    Does anyone have any other troubleshooting ideas?  It didn't take 6 hours before.  I don't know why there are blocks when we are using the "ONLINE" parameter.  I know transactions have to finish and then the online indexing starts.  Not sure, what stops the indexing and puts it wait/blocking state.

    Any ideas from anyone would be greatly appreciated.

    Script being executed at 12:00 am.

     

    EXECUTE [dbo].[IndexOptimize] @databases = 'USER_DATABASES',

    @Indexes = 'ALL_INDEXES',

    @LogToTable = 'Y',

    @PageCountLevel=200,

    @FragmentationLow = NULL,

    @FragmentationMedium = 'INDEX_REORGANIZE,INDEX_REBUILD_ONLINE,INDEX_REBUILD_OFFLINE',

    @FragmentationHigh = 'INDEX_REBUILD_ONLINE,INDEX_REBUILD_OFFLINE',

    @FragmentationLevel1 = 5,

    @FragmentationLevel2 = 30,

    @UpdateStatistics = 'ALL',

    @OnlyModifiedStatistics = 'Y',

    @maxdop = 16

     

    Thanks.

     

     

     

     

    Things will work out.  Get back up, change some parameters and recode.

  • Having about 400 hours of testing into index maintenance and discovering many of the problems with it, my first suggestion would be to stop using REORGANIZE except when you KNOW that you need to compact LOBs.  It's single threaded and isn't the lightweight and friendly little puppy that a lot of people think it is.  It also removes critical free space from certain index patterns when the most need it and that causes massive page splits which cause excessive log file writes, extended system transactions during the bad page splits, and all of that cause blocking that you might not be aware of but is having incredible "morning after" affects.

    If you can, rebuild your smaller indexes using OFFLINE.  ONLINE takes a whole lot longer and you can fly through most of your smaller stuff in very short order without ONLINE.

    To be honest, I'll eventually go back to doing regularly schedule index maintenance after I've finished writing some special code to do a better job but... I've not scheduled index maintenance to run since Sunday evening, 17 January 2016 (more than 3 years ago now).  I had to watch some of the larger indexes for page density and occasionally rebuild them but CPU dropped from an average of 22% to 8% and stayed there since I stopped doing regular index maintenance and especially since I stopped using REORGANIZE.

    I've been giving a 2 hour introduction presentation on how to fix things.  If you're in the Columbus, OH area on June 8th, 2019, go to the SQLSaturday there.  That's the next place I'm giving the presentation.  To whet your whistle, you'll find out that there are 6 different index insert/update patterns and what you can do about them with index maintenance.

    Ed Wagner and I are also working (feverishly... there's a lot to cover) on a precon for Pittsburgh this October on this subject including additional things like how LOBs screw you up and what to do about it while still being able to use LOBs, etc, etc, etc.

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Getting away from the problems associated with index maintenance in general, you moved to a new machine.  You also have replication of some form going and AG.  IIRC, all of that relies pretty heavily on the log files.  Have you checked the comms between the server and the SAN and the latency of the log file drives?

    And, you said you went to a new machine AND 2017... what did you migrate from for SQL Server?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • The old server was SQL Server 2008 R2.

     

    No, I hadn't checked the latency of the log file drives.  I will check that as well.

    Thanks.

     

    Things will work out.  Get back up, change some parameters and recode.

  • Not sure but, since you went from 2008 R2 to 2017, you're also in the realm of the "improvement" they made to the "Cardinality Estimator" (which happened in 2014, IIRC).  That caused me some HUGE performance issues in some areas and no gains in the rest.  I used the trace flag that allows me to use the old Cardinality Estimator and things went back to "normal".  (Search for the trace flag... I can't recall off the top of my head what it was).

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I am still researching the cardinality estimator.  While researching that, I noticed that the database compatibility  has not changed from the migration.  In fact, it looks like it wasn't changed from when we went to SQL Server 2008 R2 from SQL Server 2008.  The database compatibility level is still set at 100 for SQL Server 2008.

    Still researching, so not sure if that may be part of the problem.

     

    Things will work out.  Get back up, change some parameters and recode.

  • Hello Webtechie,

    I would not look for the old cardinality trace flag.

    As it turns out, you mentioned that the databases are still in the SQL 2008 Compat level, which is a very good way to use the old cardinality algoritm.

    Since SQL2016+ you can set this option on a per database level:

    ALTER DATABASE SCOPED CONFIGURATION SET LEGACY_CARDINALITY_ESTIMATION = ON

    Before it was:

    Trace Flag 9481 to Disable New Cardinality Estimation on SQL Server 2014.

    However, you mentioned also that you went to a new machine, using HAG, and online as the index rebuild option.

    SQL 2008 did not have HAG but more a per database synchro (mirroring), where as 2017 HAG have a group synchro (async of sync) with multiple worker threads depending on the number of databases involved.

    You set a maxdop of 16 for the index rebuild which is a lot of firepower (and threads..).

    Now, some problems may arise here:

    You may run out of worker threads (check the wait stats) or possibly you run into locking.

    Even an ONLINE index rebuild take 2 (normal short) locks, 1 at the start and 1 at the end, as well as a schema stability lock.

    Maybe you can activate (or check if already activated)  the duration on a per index base, maybe the system was just waiting to get a hand on the required lock to switch over the index after the online rebuild.

    Note on using 16 cores for rebuilding..

    Watch out that the indexes involved do allow for page locks, or you may end up getting more fragmentation afterwards.

    ref: .https://blogs.msdn.microsoft.com/psssql/2012/09/05/how-it-works-online-index-rebuild-can-cause-increased-fragmentation/

    I'm more with Jeff on this, rebuild only when really needed, but spend more attention on the statistics, less overhead, more gain.

    Wkr,

    Eddy

  • Jeff / Eddy,

    I really appreciate the feedback and replies.  I've seen that most of our databases have LOB data.

    I am preparing to talk to my manager about your suggestions.  Here are my thoughts are reading your replies and doing some research.

    1.  Create statistic updating jobs to execute a few times a day.
    2.  Create one job for during the week and one job for the weekends.  The job during the week will have a higher fragmentation level.  We have a job that kills the index job at 6:00 am if it is still running.  I figure if I change that and only rebuild the indexes that are really fragmented during the week, I can completely rebuild the index on the weekends.  I think that would be better than killing the job at 6:00 am if it is still running.

    Questions

    1. What do you mean activate the duration of an index base?
    2. We are an enterprise that has public facing websites 24 hours a day.  We are processing customer data all the time.  I have a maintenance window of 12:00 am to 6:00 am, but I am not sure if I can take database offline each night.  If I use the rebuild offline option, wouldn't that make the database unusable for that time?  I would have to create a script to rebuild index offline under size x.

    Note

    Jeff, I would love to hear your presentation this weekend.  I really would, but I am in Texas.  Hopefully, the presentation will get posted somewhere after Saturday where I can listen to it.

     

    Thanks.

     

     

     

    Things will work out.  Get back up, change some parameters and recode.

  • This was removed by the editor as SPAM

  • This was removed by the editor as SPAM

  • This was removed by the editor as SPAM

  • Man... I'm sorry. I lost track of this thread and didn't answer your additional questions.

    I guess my biggest question would be, did you stop using REORGANIZE or not?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply