Should this index be clustered?

  • I have a table that is heavily used which inserts and deletes rows frequently during the day. This is just a placeholder to identify what the system should work on, so it just has a large key. It's currently a heap with a NC Primary Key, but I feel that it should be clustered to prevent duplication of data. The people that created this, thought that it would create too much contention and fragmentation, but in my opinion this would just be replicated on the index.
    Here's the current structure of the table (with the names changed). Am I right? What or how should I test to give a certain answer?

    CREATE TABLE SomeWorkTable(
      MaintTableID    int NOT NULL,
      SomeCode    char(1) NOT NULL ,
      MilestoneID    int NOT NULL DEFAULT 0,
      Loaded    bit NULL,
      RuleID    int NOT NULL DEFAULT 1,
      SomeOtherID    int NOT NULL DEFAULT 0,
    );

    GO
    ALTER TABLE SomeWorkTable ADD CONSTRAINT PK_SomeWorkTable PRIMARY KEY NONCLUSTERED(MaintTableID, SomeCode, MilestoneID, RuleID, SomeOtherID);
    GO

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Luis Cazares - Tuesday, November 14, 2017 10:51 AM

    I have a table that is heavily used which inserts and deletes rows frequently during the day. This is just a placeholder to identify what the system should work on, so it just has a large key. It's currently a heap with a NC Primary Key, but I feel that it should be clustered to prevent duplication of data. The people that created this, thought that it would create too much contention and fragmentation, but in my opinion this would just be replicated on the index.
    Here's the current structure of the table (with the names changed). Am I right? What or how should I test to give a certain answer?

    CREATE TABLE SomeWorkTable(
      MaintTableID    int NOT NULL,
      SomeCode    char(1) NOT NULL ,
      MilestoneID    int NOT NULL DEFAULT 0,
      Loaded    bit NULL,
      RuleID    int NOT NULL DEFAULT 1,
      SomeOtherID    int NOT NULL DEFAULT 0,
    );

    GO
    ALTER TABLE SomeWorkTable ADD CONSTRAINT PK_SomeWorkTable PRIMARY KEY NONCLUSTERED(MaintTableID, SomeCode, MilestoneID, RuleID, SomeOtherID);
    GO

    Not sure what you mean by duplication of data, unless you mean that the nonclustered index is duplicating all the data from the table.  My question would be how are the inserts and deletes being done?  Is this table like a queue, first in first out?  If so, I would make the index a clustered index and do it such that new data is inserted at the end and old data pulled from the front, just like a queue.

  • Lynn Pettis - Tuesday, November 14, 2017 12:00 PM

    Not sure what you mean by duplication of data, unless you mean that the nonclustered index is duplicating all the data from the table.  My question would be how are the inserts and deletes being done?  Is this table like a queue, first in first out?  If so, I would make the index a clustered index and do it such that new data is inserted at the end and old data pulled from the front, just like a queue.

    Yes, that's what I meant. Each row is written once on the table and then basically copied on the nonclustered index (except for the bit column).
    The inserts are being done either by batch loads or  whenever the system performs an action that affects the data from tables referenced by MainTableID and SomeOtherID. The deletes are being done after the row has been evaluated and assigned to the correct location.
    This is not a FIFO, as there are some way to establish precedence based mainly on the MilestoneID. Ideally, everything should be deleted before the next batch load comes which would contain thousands of new rows.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Please forgive me if I'm misunderstanding what you said but I don't get this:

    It's currently a heap with a NC Primary Key, but I feel that it should be clustered to prevent duplication of data.

    This will fail:
    create table dbo.x (id int primary key nonclustered);
    insert dbo.x values (1),(1);

    This will not:
    create table dbo.y (id int not null);
    create clustered index cl_y on dbo.y(id);
    insert dbo.y values (1),(1);

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Alan.B - Tuesday, November 14, 2017 4:36 PM

    Please forgive me if I'm misunderstanding what you said but I don't get this:

    It's currently a heap with a NC Primary Key, but I feel that it should be clustered to prevent duplication of data.

    This will fail:
    create table dbo.x (id int primary key nonclustered);
    insert dbo.x values (1),(1);

    This will not:
    create table dbo.y (id int not null);
    create clustered index cl_y on dbo.y(id);
    insert dbo.y values (1),(1);

    Pretty sure he was talking about making the primary key (all the columns of the table in this case) a clustered index instead of a nonclustered index.

  • Lynn Pettis - Tuesday, November 14, 2017 4:39 PM

    Alan.B - Tuesday, November 14, 2017 4:36 PM

    Please forgive me if I'm misunderstanding what you said but I don't get this:

    It's currently a heap with a NC Primary Key, but I feel that it should be clustered to prevent duplication of data.

    This will fail:
    create table dbo.x (id int primary key nonclustered);
    insert dbo.x values (1),(1);

    This will not:
    create table dbo.y (id int not null);
    create clustered index cl_y on dbo.y(id);
    insert dbo.y values (1),(1);

    Pretty sure he was talking about making the primary key (all the columns of the table in this case) a clustered index instead of a nonclustered index.

    Yes, I should have made myself clearer. I was talking about duplicating the data by storing each row in the table and the non clustered index instead of only once using a clustered index.
    The primary key needs to be kept in place, but I want to make sure that making it clustered instead of non-clustered is the right choice.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Here's a reminder of the issues you may experience if you're performing deletes from your heap table:

    https://www.red-gate.com/simple-talk/sql/t-sql-programming/delete-operation-sql-server-heaps/

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • ChrisM@Work - Wednesday, November 15, 2017 7:01 AM

    Here's a reminder of the issues you may experience if you're performing deletes from your heap table:

    https://www.red-gate.com/simple-talk/sql/t-sql-programming/delete-operation-sql-server-heaps/

    Thank you, Chris.
    This will be really useful when advocating the use of clustered indexes

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Luis Cazares - Wednesday, November 15, 2017 7:41 AM

    ChrisM@Work - Wednesday, November 15, 2017 7:01 AM

    Here's a reminder of the issues you may experience if you're performing deletes from your heap table:

    https://www.red-gate.com/simple-talk/sql/t-sql-programming/delete-operation-sql-server-heaps/

    Thank you, Chris.
    This will be really useful when advocating the use of clustered indexes

    You're welcome Luis. It's easily forgotten and occasionally causes runaway table bloat.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • If doing a lot of deletes, then cluster the table.
    When deleting from a heap, the empty pages dont get deallocated.
    So your table grows forever.

    Catch-all queries done right [/url]
    Gail Shaw's Performance Blog[/url]

  • Luis Cazares - Wednesday, November 15, 2017 6:22 AM

    Yes, I should have made myself clearer. I was talking about duplicating the data by storing each row in the table and the non clustered index instead of only once using a clustered index. 

    The primary key needs to be kept in place, but I want to make sure that making it clustered instead of non-clustered is the right choice.

    Is this combination of columns (MaintTableID, SomeCode, MilestoneID, RuleID, SomeOtherID) how the table is the most frequently accessed?  Or at least the first column(s)?

  • Luis Cazares - Tuesday, November 14, 2017 10:51 AM

    I have a table that is heavily used which inserts and deletes rows frequently during the day. This is just a placeholder to identify what the system should work on, so it just has a large key. It's currently a heap with a NC Primary Key, but I feel that it should be clustered to prevent duplication of data. The people that created this, thought that it would create too much contention and fragmentation, but in my opinion this would just be replicated on the index.
    Here's the current structure of the table (with the names changed). Am I right? What or how should I test to give a certain answer?

    CREATE TABLE SomeWorkTable(
      MaintTableID    int NOT NULL,
      SomeCode    char(1) NOT NULL ,
      MilestoneID    int NOT NULL DEFAULT 0,
      Loaded    bit NULL,
      RuleID    int NOT NULL DEFAULT 1,
      SomeOtherID    int NOT NULL DEFAULT 0,
    );

    GO
    ALTER TABLE SomeWorkTable ADD CONSTRAINT PK_SomeWorkTable PRIMARY KEY NONCLUSTERED(MaintTableID, SomeCode, MilestoneID, RuleID, SomeOtherID);
    GO

    Just my 2 cents....

    Because they're doing deletes, there will be physical fragmentation that results in reduced page density.  This will be true with both a Clustered Index (CI from here on), a Non-Clustered Index (NCI from here on), and a Heap.  If the "table" is a HEAP with an NCI, then not only have you seriously duplicated the data with the NCI, but you've also doubled the work for the deletes because both the HEAP and the NCI will suffer deletes.  That also means that, regardless of which Recovery Model you use, the Transaction Log File (Log File from here on) will be quite busy doing twice the activity during the DELETEs as it would if a CI were used.  The same holds true with INSERTs.  The Heap/NCI combination will suffer twice the Log File activity as that which will occur with just a CI.  Just that will cause the HEAP/NCI combination to run a fair bit slower for both the INSERTs and the DELETEs.

    If a Heap and NCI is used instead of a CI, then consider the width of the NCI.  It's nearly as wide as the CI and, unless the data is INSERTED into the HEAP with the NCI in the same order as the NCI, then roughly the same number of "bad" page splits will occur on the NCI as they would on the CI.  The trouble is that you'll also have twice the amount of Log File activity as if you were using a CI because both the Heap and the wide NCI must be INSERTED into.

    Since both methods will suffer roughly the same number of "bad" page splits and both methods will suffer massive physical fragmentation due to the DELETEs  and since the activity of the Log File will be double that of the CI, I'd have to say having the CI is the best idea.  Because there would be half the Log File activity with the CI, related INSERT and DELETE code should run approximately twice as fast if no other changes are made.

    Shifting gears a bit and thinking of additional optimizations, in comparison with the index keys, I have some questions.
    1. What order will the inserts be done in?  This will be important if we consider the possibility of "minimal logging", which would double the speed of INSERTs.
    2. Will the inserts interleave with previous rows, or will they be appended to the end of the sorted order or rows?  This is an important consideration for two reasons.  First, if the rows are interleaved for each batch of INSERTs, the we need to consider rebuilding the index (with an appropriate fill factor  ) between batches to prevent "bad" page splits which can and will take a huge toll on the log file.  If the rows of the new INSERTs are NOT interleaved with existing data, then we can take "Minimal Logging" into consideration even with the CI in place if we fire up Trace Flag 610.
    3.  What order are the DELETEs done in?  If they're in the same order as the CI, there are several tricks to make it so that we don't actually have to do DELETEs and, at the same time, rebuild the index with the correct Fill Factor to prepare for the next batch of INSERTs and we could do it all in a minimally logged fashion.

    Bottom line is, even if we can't take advantage of any of the stuff above, the "twice over" usage of the Log File and the fact that the NCI will suffer the same as a CI for page splits (physical fragmentation) and logical fragmentation, the CI is still the way to go and should pan out to be a fair bit faster than the Heap/NCI combo as well as having half the physical footprint on both disk and in memory.

    p.s.  Deletes are stupid on temporary data. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply