Online index rebuild causing application timeouts

  • Hi,

    I am at my wit's end trying to get online index rebuilds to run without causing application timeouts as our SLA is 24x7. I have been using Ola Hallengren's database maintenance script (see the commands below) for some time but as our databases have grown, the reorg/rebuilds are running longer and causing application timeouts. We are running MS SQL Server Enterprise Edition (10.0.2799.0), have 16 virtual CPU's, 32,767 Mb of memory (28,000 reserved for SQL Server). We recently moved disk storage from a 2 disk RAID array to a SAN and have seen 80% reduction in DISK wait time as the database, tempdb, and transaction log files are now spread across the SAN drives. We have asynchronous mirroring set up for the database.

    Our application programmers have their code set to trigger a timeout failure if the application waits for more than 30 seconds for a response from SQL Server and they record the timeout failures in an error log table in the database (or file if the write to the error log table fails). I understand there can be some timeouts at the beginning and end of the REBUILD, but why are we getting timeouts occurring for the entire duration of the REBUILD? I also have a run of Brent Ozar's 'AskBrent' which I executed during the REBUILD if that would help.

    Any help in determining why we get timeouts for the duration would be greatly appreciated.

    EXECUTE dbo.IndexOptimize

    @databases = 'USER_DATABASES',

    @Indexes = 'ALL_INDEXES',

    @FragmentationLow = NULL,

    @FragmentationMedium = 'INDEX_REORGANIZE,INDEX_REBUILD_ONLINE',

    @FragmentationHigh = 'INDEX_REBUILD_ONLINE',

    @FragmentationLevel1 = 5,

    @FragmentationLevel2 = 30,

    @MaxDOP = 4,

    @LogToTable = 'Y'

    ALTER Index statement in CommandLog table

    ALTER INDEX [PK_FNA_BW] ON [TAW].[dbo].[FNA_BW] REBUILD WITH (SORT_IN_TEMPDB = OFF, ONLINE = ON, MAXDOP = 4)

    StartTime & EndTime for above ALTER statement from CommandLog table

    2015-06-13 23:05:18.1172015-06-13 23:11:47.120

    ExtendedInfo from CommandLog table

    <ExtendedInfo>

    <PageCount>77268</PageCount>

    <Fragmentation>41.1192</Fragmentation>

    </ExtendedInfo>

    Count of application timeouts by minute beginning at 23:05:56 (while ALTER was executing)

    23:05:56 - 23:05:59 - 3

    23:06:00 - 23:06:59 - 7

    23:07:00 - 23:07:59 - 19

    23:08:00 - 23:08:59 - 27

    23:09:00 - 23:09:59 - 35

    23:10:00 - 23:10:59 - 34

    23:11:00 - 23:11:47 - 27

    Create Table Command generated

    USE [TAW]

    GO

    /****** Object: Table [dbo].[FNA_BW] Script Date: 06/16/2015 10:05:09 ******/

    SET ANSI_NULLS ON

    GO

    SET QUOTED_IDENTIFIER ON

    GO

    CREATE TABLE [dbo].[FNA_BW](

    [key_id] [int] NOT NULL,

    [sequence] [smallint] NOT NULL,

    [budget_id] [smallint] NOT NULL,

    [item_value] [int] NOT NULL,

    [custom_label] [nvarchar](50) NULL,

    CONSTRAINT [PK_FNA_BW] PRIMARY KEY CLUSTERED

    (

    [key_id] ASC,

    [sequence] ASC,

    [budget_id] ASC

    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

    ) ON [PRIMARY]

    GO

    ALTER TABLE [dbo].[FNA_BW] ADD CONSTRAINT [DF_FNA_BW_custom_label] DEFAULT (N'') FOR [custom_label]

    GO

    Steps to prove that X(Exclusive) locks were held for the duration of the REBUILD

    1) Created a DBASandbox database on a test Enterprise edition server

    2) Created table FNA_BW in the DBASandbox database just like the table exists in production

    3) Copied the production data in table FNA_BW (22+ million rows) from production to the new table in the DBASandbox database

    4) Created a LockLog table in the master database in test to hold the data about locks held

    5) Created job DBALockSnapshot to execute every 10 seconds in test to INSERT rows into the LockLog table if locking is occurring

    6) Ran the REBUILD command for the primary index of the FNA_BW table in test - ALTER INDEX [PK_FNA_BW] ON [DBASandbox].[dbo].[FNA_BW] REBUILD WITH (SORT_IN_TEMPDB = OFF, ONLINE = ON)

    7) The count of the results from the LockLog table by 10 second intervals and lock type are listed below.

    lockType snapshotTime Count

    IX 2015-06-17 09:00:50.000 22

    X 2015-06-17 09:00:50.000 2784

    IX 2015-06-17 09:01:00.000 22

    X 2015-06-17 09:01:00.000 2601

    IX 2015-06-17 09:01:11.000 22

    X 2015-06-17 09:01:11.000 3207

    IX 2015-06-17 09:01:20.000 22

    X 2015-06-17 09:01:20.000 2844

    IX 2015-06-17 09:01:30.000 22

    X 2015-06-17 09:01:30.000 3151

    IX 2015-06-17 09:01:40.000 22

    X 2015-06-17 09:01:40.000 2819

    IX 2015-06-17 09:01:50.000 22

    X 2015-06-17 09:01:50.000 2149

    IX 2015-06-17 09:02:00.000 22

    X 2015-06-17 09:02:00.000 2911

    IX 2015-06-17 09:02:10.000 16

    X 2015-06-17 09:02:10.000 2226

    IX 2015-06-17 09:02:20.000 6

    X 2015-06-17 09:02:20.000 708

  • A few things:

    Opinion: Your fragmentation level check is set way too low (5% - consider changing this to 10% or 15% as that really isn't a lot of fragmentation unless you've got some incredibly large tables)

    Rebuilding the index online doesn't completely rule out locking - even during an online rebuild When the process completes the table is locked for a brief period and the new index replaces the old index. Basically online index operations need to take two very short-term table locks. An S (Shared) table lock at the start of the operation to force all write plans that could touch the index to recompile, and a SCH-M (Schema-Modification – think of it as an Exclusive) table lock at the end of operation to force all read and write plans that could touch the index to recompile.

    In addition if the index contains LOB columns, ONLINE operations aren't supported anyway (in versions < 2012)

    While the online operation does not take long term locks, it may still consume considerable system resources (e.g. memory, processor, I/O). If these resources become a bottleneck, concurrent queries may experience resource-related waits. Are you noticing any PAGELATCH_UP waits?

    ______________________________________________________________________________Never argue with an idiot; Theyll drag you down to their level and beat you with experience

  • MyDoggieJessie (6/20/2015)


    A few things:

    Opinion: Your fragmentation level check is set way too low (5% - consider changing this to 10% or 15% as that really isn't a lot of fragmentation unless you've got some incredibly large tables)

    Rebuilding the index online doesn't completely rule out locking - even during an online rebuild When the process completes the table is locked for a brief period and the new index replaces the old index. Basically online index operations need to take two very short-term table locks. An S (Shared) table lock at the start of the operation to force all write plans that could touch the index to recompile, and a SCH-M (Schema-Modification – think of it as an Exclusive) table lock at the end of operation to force all read and write plans that could touch the index to recompile.

    In addition if the index contains LOB columns, ONLINE operations aren't supported anyway (in versions < 2012)

    While the online operation does not take long term locks, it may still consume considerable system resources (e.g. memory, processor, I/O). If these resources become a bottleneck, concurrent queries may experience resource-related waits. Are you noticing any PAGELATCH_UP waits?

    Thank you for the response, but;

    1) I am not concerned about the fragmentation level, just locking (but I appreciate the input)

    2) As I stated in my post 'I understand there can be some timeouts at the beginning and end of the REBUILD, but why are we getting timeouts occurring for the entire duration of the REBUILD?'. I have run some tests and am seeing 'X' locks for the duration of the REBUILD, not just the beginning and end.

    3) I have the 'CREATE' table in my post which clearly shows that no LOB columns exist in the table ergo none in the index.

    4) The resources are not the issue, the issue is the 'X' (exclusive) locks held for pages, keys, and HOBTs in the table for the duration of the REBUILD

  • In an online rebuild, the X locks you are seeing are on the new index structure being built, not the current index structure your queries are hitting.

    You can confirm that by comparing resource_associated_entity_id in sys.dm_tran_locks to partition_id in sys.partitions. While the online rebuild is occurring, there will be an additional entry (if there's only a single partition for that table, since there'll be as many extra entries as partitions for that table) in sys.partitions for the index being rebuilt. That is the target index structure that will get swapped in at the end of the rebuild, which you can confirm by noting that the row count will be ever increasing for that partition, because the target structure is being populated.

    The exclusive locks on the target structure would not be directly affecting your queries. Doing the online index rebuild this way is the whole point: the original index only needs some brief locks at the beginning and the end of the rebuild, and all the X locks are just on a target structure unused by your queries until it gets swapped in at the end of the rebuild. Aside from just noticing the existence of locks, what indicated that they were the cause of the problem?

    I'd start collecting wait stats information on a regular interval (every 5 minutes is my standard, and then I adjust it downward if I need more granular information), and then compare your baseline to how it looks during the rebuild. Seeing what waits change dramatically should help you narrow your search.

    For online index rebuilds, it's more likely resulting resource pressure (IO and maybe CPU as well) that is impacting your workload.

    I hope this helps!

  • Jacob,

    Thank you for the reply. The initial problem we experienced was application timeouts. The developers have implemented a 30 second wait for SQL Server to respond to a request before the application times out. My first attempt at fixing the issue was to run the REBUILD with MAXDOP equal to 4 as it appeared that the processors were being monopolized during the REBUILD, but that did not keep the timeouts from occurring. The CPU usage in activity monitor shows as not going above 30% during the REBUILD. As previously mentioned, we also moved from a 2-disk RAID array to a 20-disk SAN, so the disk performance has improved dramatically. I have tempdb set as 4-1GB files spread across the SAN. The odd thing is that the application INSERT's a row in an Error log table when the timeouts occur and those are successful, so I can't understand why the instance seems to not have any issues except for the code that accesses the table related to the index being rebuilt.

    I am actually using the OFFLINE index rebuild now because it completes in under 30 seconds for all of the indexes save the 2 largest. I ran the OFFLINE rebuild this past weekend for the 2 largest indexes and the index for one was locked for 5 minutes and the other for 6 minutes which is not ideal, but acceptable for the one time every 6 weeks where the fragmentation percentage exceeds 30%. Prior to the SAN being utilized, the rebuild ONLINE was executing for over 4 hours for each of the largest indexes.

  • No problem!

    Since my previous reply I've had more caffeine, and remembered that I was incorrect in one very important way, namely, you can get some blocking.

    Writes to the source index do have to be applied to the target index, and I have seen cases where the rebuild process is taking its X locks at the page level on the target index, and writes from user processes are taking X locks at the key level, and thus an IX lock at the page level.

    In that case, the X locks on the pages in the target index will block the IX locks being taken by the user writes.

    I'd bet dollars to donuts that if you looked at wait stats before and after the rebuild (with concurrent user activity), you'd see a huge spike in LCK_M_IX waits from the sort of scenario described above.

    In that specific scenario, you can mitigate the blocking by disallowing one of the lock types and then just changing the option back when the rebuild is done. So long as both processes have to take out the same sort of lock, the blocking is reduced (in my tests, at least). It's most reduced by disallowing page locks, but that does slow the rebuild fairly substantially.

    Cheers!

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply