Help resolving duplicates with reindexing logic

  • Hey all,

    Recently discovered my reindex routine is causing dupes. I was able to track it down to when/if an index has multiple allocation unit types. For example my particular index I'm playing with has both an IN_ROW_DATA and LOB_DATA.

    I've been reading some random posts here and elsewhere as well as reviewing others scripts and the common theme I'm finding is that most simply limit their query to say WHERE alloc_unit_type_desc = 'IN_ROW_DATA'. I'm just curious if this is the proper thing to do and why? This post and this one do help; however I am simply not yet sure I fully understand.

    For my particular index that I'm focusing on, as stated I have two different alloc_unit_type's; however they're both index_level 0 with two different index_depths. The LOB_DATA is at depth 1 and IN_ROW_DATA at depth 3. Is there logic to determine at what depth or level to focus on, or do you simply focus on IN_ROW_DATA only?

    What happens if you have an index (I can't find any yet) that all the fragmentation exists outside of an IN_ROW_DATA? I read here that LOB allocations will never get fragmented; thus you need not worry.

    So I know I can fix it; however I'd really like to better understand this first.

    Thanks

  • LOB_DATA is not normally re-indexed, as it it can not really be indexed (except by full text indexing). Are these maybe included columns?

  • LOB_DATA is your varchar(max), nvarchar(max), varbinary(max), XML and a few other data types. They have special storage.

    Typically you want to filter for 'in row data' when reindexing because reindexing affects only the in row data pages (reorg can compress lob data too, but that's not fragmentation in the traditional sense). If you don't, you can end up rebuilding the same index up to 3 times, which is a waste of time.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Hi Matt, thanks for the prompt response!

    As I'm looking at my environment at large, I'm finding multiple incidents where I've been doubling down on the work to do because of this logic.

    This particular index, it's a clustered index, on one nvarchar(32) column. So this one in particular, how/why does it have a LOB_DATA?

    Either way, I'm veering off topic vs. the overall inquiry of attempting to understand how to focus a reindex routine. Is there a general rule/reasoning to what to focus on and why? Based on most existing routines out there that I've reviewed, most are simply using a where clause for in_row_data.

  • GilaMonster (1/21/2014)


    LOB_DATA is your varchar(max), nvarchar(max), varbinary(max), XML and a few other data types. They have special storage.

    Typically you want to filter for 'in row data' when reindexing because reindexing affects only the in row data pages (reorg can compress lob data too, but that's not fragmentation in the traditional sense). If you don't, you can end up rebuilding the same index up to 3 times, which is a waste of time.

    Exactly which is what I was doing! Yeah, I realize LOB_DATA is all the lob data types (I exclude those based on sys.columns); however in this particular case as shown above ... it's not using a LOB data type, so now I'm more confused.

    Regardless, you're stating though to truly only focus on IN_ROW_DATA then?

  • Can you post the links you have that relate to reindexing causing duplicate records? I have never experienced this issue.

  • Lynn Pettis (1/21/2014)


    Can you post the links you have that relate to reindexing causing duplicate records? I have never experienced this issue.

    Hey Lynn, it's not the act of reindexing that's causing dupes, it's the logic to query the data from dm_db_index_physical_stats.

  • At a leaf level, a clustered index is the table. If you have a primary key defined on a column defined as nvarchar(32), and your table has a LOB column in it, you will probably see LOB storage at the leaf level of the output of sys.dm_db_index_physical_stats (I don't have an example handy, so some testing will need to be done to confirm).

  • Matt Crowley (1/21/2014)


    At a leaf level, a clustered index is the table. If you have a primary key defined on a column defined as nvarchar(32), and your table has a LOB column in it, you will probably see LOB storage at the leaf level of the output of sys.dm_db_index_physical_stats (I don't have an example handy, so some testing will need to be done to confirm).

    Ah, ok, that makes sense (duh). You are correct, this particular table does have an ntext column, so that would make sense why I have a LOB_DATA then. So again in this particular case (I have well over 50,000 more incidents globally), it would make sense to only focus on IN_ROW_DATA.

    I'm simply trying to understand what if any ramifications exist by limiting myself to IN_ROW_DATA only across the board. Keeping in mind, I have to manage over 5,000 databases ... so I try to handle all scenario's properly. I just worry that by changing my ~7 year old logic that I am going to begin missing indexes out there.

  • Adam Bean (1/21/2014)


    Lynn Pettis (1/21/2014)


    Can you post the links you have that relate to reindexing causing duplicate records? I have never experienced this issue.

    Hey Lynn, it's not the act of reindexing that's causing dupes, it's the logic to query the data from dm_db_index_physical_stats.

    That makes more sense. Your original post made it seem like your reindexing was causing duplicate records to be created.

  • I would expect no issues if you concentrate solely on IN_ROW_DATA. As I mentioned before, the only option for indexing BLOB type data is with full text indexing, so that won't show up in sys.dm_db_index_physical_stats. Re-organizing BLOB type data would be similar to re-orgnizing a heap, and involve a full export and re-import of the data, and hoping the data lands in the datafile in a contiguous manner. If BLOB fragmentation is a problem, I would move to a Filestream setup, and let the OS handle the fragmentation.

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply