Reducing Index Fragmentaion during Inserts

  • Hi Folks,

    This is my problem.

    We have a database with a table that contains around 180m records. Each day a further 70k are inserted. No records are ever deleted as this table is used for archiving only.

    Users are required to perform SELECTs on this table constantly but due to the high number of INSERTs the indexes become very fragmented very quickly. My aim is to avoid daily rebuilds of the indexes which is what our software house is telling us we have to do.

    This is the DDL for the table:

    CREATE TABLE [dbo].[Inventory](

    [EAN] [bigint] NOT NULL,

    [Day] [smalldatetime] NOT NULL,

    [State] [int] NOT NULL,

    [Quantity] [int] NULL,

    [StockValue] [float] NULL,

    CONSTRAINT [PK_Inventory] PRIMARY KEY CLUSTERED

    (

    [EAN] ASC,

    [Day] ASC,

    [State] ASC

    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

    ) ON [PRIMARY]

    There are also three clustered Indexes on this table each referencing a single column. The problem from my side is that I cannot understand why the three columns in a primary key would also be configured as non-clustered indexes.

    My solution would be one of the following:

    1. Accept the tables are going to be fragmented and require a daily rebuild (don't like this one!)

    2. Partition the table

    3. Remove the non-clustered Indexes and let the clustered index for the primary key do the work.

    Does anyone have any thoughts on the matter?

    Thanks!

    Kev

  • The problem here is that the inserts are not in an ever increasing order, possible option would be adding an Identity column for the primary key and then a unique index instead of the old pk, something like this

    😎

    CREATE TABLE dbo.Inventory

    (

    EAN bigint NOT NULL

    ,Day smalldatetime NOT NULL

    ,State int NOT NULL

    ,Quantity int NULL

    ,StockValue float NULL

    ,CONSTRAINT PK_Inventory PRIMARY KEY CLUSTERED

    (

    EAN ASC,

    Day ASC,

    State ASC

    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

    ) ON [PRIMARY];

    /* Replacing the existing PK with an Identity column and adding a unique index */

    ALTER TABLE dbo.Inventory ADD Inventory_ID BIGINT IDENTITY(1,1) NOT NULL;

    ALTER TABLE dbo.Inventory DROP CONSTRAINT PK_Inventory;

    ALTER TABLE dbo.Inventory ADD CONSTRAINT PK_DBO_INVENTORY_INVENTORY_ID PRIMARY KEY CLUSTERED (Inventory_ID ASC);

    CREATE UNIQUE NONCLUSTERED INDEX UNQNCLIDX_DBO_INVENTORY_EAN_DAY_STATE ON dbo.Inventory ( EAN ASC , Day ASC , State ASC );

    70K p. day mounts only to 25-26 million a year, shouldn't be to hard to handle;-)

  • Lowering the fill factor on your index(es) may help as well. Lowering the fill factor will create more pages but it can certainly help reduce (or prevent) fragmentation when you have a lot of non-sequential inserts.

    What is the current fill factor?

  • kevaburg (4/26/2015)


    Hi Folks,

    This is my problem.

    We have a database with a table that contains around 180m records. Each day a further 70k are inserted. No records are ever deleted as this table is used for archiving only.

    Users are required to perform SELECTs on this table constantly but due to the high number of INSERTs the indexes become very fragmented very quickly. My aim is to avoid daily rebuilds of the indexes which is what our software house is telling us we have to do.

    This is the DDL for the table:

    CREATE TABLE [dbo].[Inventory](

    [EAN] [bigint] NOT NULL,

    [Day] [smalldatetime] NOT NULL,

    [State] [int] NOT NULL,

    [Quantity] [int] NULL,

    [StockValue] [float] NULL,

    CONSTRAINT [PK_Inventory] PRIMARY KEY CLUSTERED

    (

    [EAN] ASC,

    [Day] ASC,

    [State] ASC

    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

    ) ON [PRIMARY]

    There are also three clustered Indexes on this table each referencing a single column. The problem from my side is that I cannot understand why the three columns in a primary key would also be configured as non-clustered indexes.

    My solution would be one of the following:

    1. Accept the tables are going to be fragmented and require a daily rebuild (don't like this one!)

    2. Partition the table

    3. Remove the non-clustered Indexes and let the clustered index for the primary key do the work.

    Does anyone have any thoughts on the matter?

    Thanks!

    Kev

    Yes. Partition the table on a temporal basis and then do a daily rebuild of just the current partition.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hi all!

    Many, many thanks for the tips. I have decided to use a combination of the advice given here.

    I will add an identity column in the table and assign it a Primary Key as a clustered index. That will Speed up the inserts and reduce fragmentation on the index.

    I will also introduce a temporal partitioning scheme so that we don't have all of the data in a single Partition. That should Speed up searches even further.

    I will post again with the results.

    Regards,

    Kev

  • The partitioning idea makes sense... But I don't see how adding an identity column and making it the clustered index buys you anything if none of you queries are going to use it.

    It may load faster than the current clustered index but it's not going to load as fast as a heap. And... A cluster index that isn't used in a query isn't adding any performance over what you'd get from a heap and you'll still experience the same fragmentation on the unique non-clustered index that you were getting before. It just adds the overhead of maintaining an unused index...

  • The other indexes can be more efficient with the int (or bigint) rather than an internal key for uniqueness.

    I'm not sure at 70k (or 700k) that the PK is an issue here.

  • I wouldn't mind having some sample data to determine if the original primary key actually made sense. As Jeff mentioned a temporal based partition means having some sort of date value somewhere, even if you have to create a default on insert (perhaps the date the records got inserted). I could see this column serving as a CI and helping define the partition.

    As for the NCI's , they could use a identity primary key for their lookups quite well as mentioned, especially if the IDENTITY value correlates to the DATE (both ascend through time together). So it would make sense overall to apply both the partition and identity key approach (I just question the definition of the clustered index) .

    ----------------------------------------------------

  • MMartin1 (4/30/2015)


    I wouldn't mind having some sample data to determine if the original primary key actually made sense. As Jeff mentioned a temporal based partition means having some sort of date value somewhere, even if you have to create a default on insert (perhaps the date the records got inserted). I could see this column serving as a CI and helping define the partition.

    As for the NCI's , they could use a identity primary key for their lookups quite well as mentioned, especially if the IDENTITY value correlates to the DATE (both ascend through time together). So it would make sense overall to apply both the partition and identity key approach (I just question the definition of the clustered index) .

    I can't post data unfortunately because it is live production data based on sales Information! The original script including the Primary key is posted above in my first post though.

    The Problem as I am beginning to realise is the calculation of the three columns that determines where it will be placed on disk as part of the clustered index algorithm. If the hash value of a new record (based on the three columns in the clustered index) is less than that already stored in the table, then the records will be newly sorted to accommodate the new row. This obviously slows up the insert process.

    The advice I received was to create an identity column and base the clustered index on that. The three columns in the Primary key would then use a non-clustered index.

    A temporal Partition scheme would then also be used to separate actual usable data from data that is essentially archived.

    Thanks to other priorities at the Moment I haven't been able to do much to resolve this issue but hopefully next week I can crack on.....

    Regards,

    Kev

  • kevaburg (4/30/2015)I can't post data unfortunately because it is live production data based on sales Information!

    Just a quick caveat having read your comment above, and not wishing to teach grandmother to suck eggs (!) but you do have a non-production environment in which to try the changes that people have recommended, right? Don't make any untested changes to your prod system.

    I'll be quiet now 😉

  • kevaburg (4/27/2015)


    Hi all!

    Many, many thanks for the tips. I have decided to use a combination of the advice given here.

    I will add an identity column in the table and assign it a Primary Key as a clustered index. That will Speed up the inserts and reduce fragmentation on the index.

    I will also introduce a temporal partitioning scheme so that we don't have all of the data in a single Partition. That should Speed up searches even further.

    I will post again with the results.

    Regards,

    Kev

    Then you don't actually want the identity column to be the clustered index. You'll want the date column along with the identity column (for uniqueness to prevent other problems) to be the clustered index.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Just a really simple idea.

    Try setting PAD_INDEX = ON and FILLFACTOR = 80 as an initial step.

    It will grow the index fairly significantly but you'll avoid daily index rebuilds.

    Jeff's suggestion about a temporal partitioning scheme is probably the most elegant solution (especially as this is an add-only table), but just spreading the index out a little further may buy you some time while you're building the REAL solution!

  • Just a really simple idea.

    Try setting PAD_INDEX = ON and FILLFACTOR = 80 as an initial step.

    I recently did something like this in a situation you describe and it worked very well at reducing the fragmentation. The PAD_INDEX is useful if the fragmentation is occuring at levels above the leaf, but that's also what I was experiencing.

  • RonKyle (5/5/2015)


    Just a really simple idea.

    Try setting PAD_INDEX = ON and FILLFACTOR = 80 as an initial step.

    I recently did something like this in a situation you describe and it worked very well at reducing the fragmentation. The PAD_INDEX is useful if the fragmentation is occuring at levels above the leaf, but that's also what I was experiencing.

    You should also generally expect a 20% reduction in SELECT performance and a 20% growth in the size of the index and a 20% increase in the memory usage of the index even for the parts of the index that might not actually be used much not to mention backup/restore footprints etc.

    Be careful when you decide to use lower fill factors especially on larger tables. Remember that indexes are a duplication of data to begin with. Having them take an extra 20% in all the areas they touch might not be what you expected especially when it comes to memory usage.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • You should also generally expect a 20% reduction in SELECT performance and a 20% growth in the size of the index and a 20% increase in the memory usage of the index even for the parts of the index that might not actually be used much not to mention backup/restore footprints etc.

    This might be true for otherwise well-designed tables. But in this case the tables were flexively clustered on the primary key, which was a GUID data type. All the indexes were fragmenting in rapid fashion. After looking at the primary queries used to access data from the tables, I changed many of the tables to a clustered key based on the entry date, a small date time field. Even with the uniqueifier it was smaller than the original GUID. The tables fragment at a very low pace now and all the user feedback from across the country has been positive. Indexing is really more an art than a science, and I had to continue to refine the values in response to observation. In the case of one index I lowered the fill factor to 60. I wouldn't have used GUIDS in this case, but I have to take the databases that have been handed to me.

Viewing 15 posts - 1 through 14 (of 14 total)

You must be logged in to reply to this topic. Login to reply