Changing the clustered index

  • To be sure, on large tables, it's not just an obsession.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • It's also a bit odd for me to have you tell me not to be so concerned with the speed of imports when you, yourself, have posted just such a concern at http://www.sqlservercentral.com/Forums/FindPost1664429.aspx. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (2/27/2015)


    It's also a bit odd for me to have you tell me not to be so concerned with the speed of imports when you, yourself, have posted just such a concern at http://www.sqlservercentral.com/Forums/FindPost1664429.aspx. 😉

    The concern there is the overhead adding by forcing the varchar(50) to be in a separate nonclustered index, thus doubling I/O for every INSERT.

    And, as I noted above, forcing the varchar(50) to a separate nonclus index also doubles the I/O on every lookup. That's not one I/O being doubled, btw, but one for each level of index, typically three for millions of rows.

    The INSERT speed will also be slower with an additional index to load, particularly since that index will fragment.

    The table with identity will not have 0% freespace either. And all INSERTs have to be serialized, so that identity numbers can be sequentially assigned.

    For this specific table, since SELECTs always specify specific key value(s), why force additional index(es), with subsequent maintenance, adjusting to keep the index covering, etc.? It just doesn't make sense at all from a performance perspective. The only reason seems to be to avoid a varchar(50) as a clustering key: that's just wrong since the varchar(50) is the only index needed.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • ScottPletcher (2/27/2015)


    Jeff Moden (2/27/2015)


    It's also a bit odd for me to have you tell me not to be so concerned with the speed of imports when you, yourself, have posted just such a concern at http://www.sqlservercentral.com/Forums/FindPost1664429.aspx. 😉

    The concern there is the overhead adding by forcing the varchar(50) to be in a separate nonclustered index, thus doubling I/O for every INSERT.

    And, as I noted above, forcing the varchar(50) to a separate nonclus index also doubles the I/O on every lookup. That's not one I/O being doubled, btw, but one for each level of index, typically three for millions of rows.

    The INSERT speed will also be slower with an additional index to load, particularly since that index will fragment.

    The table with identity will not have 0% freespace either. And all INSERTs have to be serialized, so that identity numbers can be sequentially assigned.

    For this specific table, since SELECTs always specify specific key value(s), why force additional index(es), with subsequent maintenance, adjusting to keep the index covering, etc.? It just doesn't make sense at all from a performance perspective. The only reason seems to be to avoid a varchar(50) as a clustering key: that's just wrong since the varchar(50) is the only index needed.

    Imagine an "out-of-insert-order" index (clustered or otherwise) that creates a page split for every insert. On large tables, it can very quickly become a serious problem especially if that index is the clustered index.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (2/27/2015)


    ScottPletcher (2/27/2015)


    Jeff Moden (2/27/2015)


    It's also a bit odd for me to have you tell me not to be so concerned with the speed of imports when you, yourself, have posted just such a concern at http://www.sqlservercentral.com/Forums/FindPost1664429.aspx. 😉

    The concern there is the overhead adding by forcing the varchar(50) to be in a separate nonclustered index, thus doubling I/O for every INSERT.

    And, as I noted above, forcing the varchar(50) to a separate nonclus index also doubles the I/O on every lookup. That's not one I/O being doubled, btw, but one for each level of index, typically three for millions of rows.

    The INSERT speed will also be slower with an additional index to load, particularly since that index will fragment.

    The table with identity will not have 0% freespace either. And all INSERTs have to be serialized, so that identity numbers can be sequentially assigned.

    For this specific table, since SELECTs always specify specific key value(s), why force additional index(es), with subsequent maintenance, adjusting to keep the index covering, etc.? It just doesn't make sense at all from a performance perspective. The only reason seems to be to avoid a varchar(50) as a clustering key: that's just wrong since the varchar(50) is the only index needed.

    Imagine an "out-of-insert-order" index (clustered or otherwise) that creates a page split for every insert. On large tables, it can very quickly become a serious problem especially if that index is the clustered index.

    Can't imagine how on earth every insert could cause a page split with any thought in advance given to setting the proper fillfactor. How bizarre. But it would do the same to your nonclustered index, which at that point is your only lookup method for that table, so the detrimental affects are roughly the same.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • Can't imagine how on earth every insert could cause a page split with any thought in advance given to setting the proper fillfactor. How bizarre. But it would do the same to your nonclustered index, which at that point is your only lookup method for that table, so the detrimental affects are roughly the same.

    I can.

    I have (actually had) a table that caused a split with nearly every insert.

    To explain further, the system read data into a "staging" table with an identity as the clustered primary key.

    Once processed, the records were inserted into the archive table. The integer from the identity was the clustered primary key on this table.

    Each row was nearly 8000 bytes in length. The rows in the staging table were not processed in order, so almost every insert into this table caused a page split.

    Changing the fill factor to 50% reduced the splits by 30%. The size of the table and the indexes got to be 220 GB, which was 25% of the total database size.

    Approximately 14k rows were inserted into this table every day.

    A new identity column was created as the clustered primary key on this table, and the page splits stopped.

    Is this a common thing? No. But fillfactor, IMHO, is one of the most misused feature in SQL Server.

    Michael L John
    If you assassinate a DBA, would you pull a trigger?
    To properly post on a forum:
    http://www.sqlservercentral.com/articles/61537/

  • A low cardinality NCI will also cause such splits.

    And that finally brings me to the point that I've been trying to make.

    As you said, you shouldn't just do the "best practice" thing. You need to actually think about things especially when it comes to large tables or, as Michael just pointed out, when you have nearly page-full wide rows to contend with. Just throwing a clustered index in the direction of performance of SELECTs is not enough. Just throwing a FILL FACTOR on a table/index so that INSERTs might run more quickly is not enough. Even if the creation of a perfect clustered index that will support all of your SELECT queries was possible (and, it's not unless you have just one form of criteria that always uses the same leading column), you have to consider more than just the performance of SELECTs.

    In some cases, "best practices" actually are (although I'll agree that many are not).

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • sql-lover (2/27/2015)


    djj (2/27/2015)


    Grant Fritchey (2/27/2015)


    I'm not a fan of GUIDs, but they are not inherently evil. Before you replace it, I'd find out why you have it. If there's no good reason, then by all means, rip it out. But be sure there's not a good purpose there.

    I will keep the GUID column but just change from primary/clustered to unique non-clustered index.

    The other option is to change the clustered index and keep the GUID the primary key.

    My initial comment was about that GUID being a Cluster Index, not dropping the GUID, if your app really uses it. A GUID as a Cluster Index is bad! The fragmentation will kill you.

    If that GUID should be a unique NCI or PK, it's up to you. They internally act pretty much the same, except that primary key column cannot be nullable, unique index column can, I think. But for some third party apps and if you are trying to adhere to the relational model, a PK should be deployed in SQL as an actual PK.

    I don't agree. You will see fragmentation. And, depending on the system, that might be very painful. But, depending on the system, and if you set your fill factor to something other than the default, it might be effectively pain free.

    Maybe.

    I'd say test it to be sure.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • The whole idea around "best practices" is not that they fit in ALL circumstances, but that they fit in MOST circumstances (and most is substantially higher than 50.00001%). Of course you should be cautious about following best practices blindly. But, you should also be very damned cautious about throwing the best practices out the door because "in this one situation over here I found the best practice to fail."

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Man, I go away for a day and what I thought was a simple question started a long discussion. :Wow:

    Now I have to digest all the information to learn what I can.

    Thank you everyone. Now to do some testing.

  • Grant Fritchey (2/28/2015)


    sql-lover (2/27/2015)


    djj (2/27/2015)


    Grant Fritchey (2/27/2015)


    I'm not a fan of GUIDs, but they are not inherently evil. Before you replace it, I'd find out why you have it. If there's no good reason, then by all means, rip it out. But be sure there's not a good purpose there.

    I will keep the GUID column but just change from primary/clustered to unique non-clustered index.

    The other option is to change the clustered index and keep the GUID the primary key.

    My initial comment was about that GUID being a Cluster Index, not dropping the GUID, if your app really uses it. A GUID as a Cluster Index is bad! The fragmentation will kill you.

    If that GUID should be a unique NCI or PK, it's up to you. They internally act pretty much the same, except that primary key column cannot be nullable, unique index column can, I think. But for some third party apps and if you are trying to adhere to the relational model, a PK should be deployed in SQL as an actual PK.

    I don't agree. You will see fragmentation. And, depending on the system, that might be very painful. But, depending on the system, and if you set your fill factor to something other than the default, it might be effectively pain free.

    Maybe.

    I'd say test it to be sure.

    Well , we will have to agree on disagree on this one.

    Even if you can minimize or eliminate the fragmentation lowering the fill factor, they use more space than a normal 4 byte Integer ID. In fact, depending of the table, it can lead to huge space consumption.

    Moreover, there is a possibility, while low, that they won't be unique.

    I've never seen in my life , as a DBA, a developer that implemented GUID and it didn't cause performance or space issues at database level.

    I think that GUIDs are one of the worse things a developer can do or implement on a SQL database.

  • sql-lover (3/1/2015)


    Grant Fritchey (2/28/2015)


    sql-lover (2/27/2015)


    djj (2/27/2015)


    Grant Fritchey (2/27/2015)


    I'm not a fan of GUIDs, but they are not inherently evil. Before you replace it, I'd find out why you have it. If there's no good reason, then by all means, rip it out. But be sure there's not a good purpose there.

    I will keep the GUID column but just change from primary/clustered to unique non-clustered index.

    The other option is to change the clustered index and keep the GUID the primary key.

    My initial comment was about that GUID being a Cluster Index, not dropping the GUID, if your app really uses it. A GUID as a Cluster Index is bad! The fragmentation will kill you.

    If that GUID should be a unique NCI or PK, it's up to you. They internally act pretty much the same, except that primary key column cannot be nullable, unique index column can, I think. But for some third party apps and if you are trying to adhere to the relational model, a PK should be deployed in SQL as an actual PK.

    I don't agree. You will see fragmentation. And, depending on the system, that might be very painful. But, depending on the system, and if you set your fill factor to something other than the default, it might be effectively pain free.

    Maybe.

    I'd say test it to be sure.

    Well , we will have to agree on disagree on this one.

    Even if you can minimize or eliminate the fragmentation lowering the fill factor, they use more space than a normal 4 byte Integer ID. In fact, depending of the table, it can lead to huge space consumption.

    Moreover, there is a possibility, while low, that they won't be unique.

    I've never seen in my life , as a DBA, a developer that implemented GUID and it didn't cause performance or space issues at database level.

    I think that GUIDs are one of the worse things a developer can do or implement on a SQL database.

    I mostly agree with the last post sql-lover above. Lowering the FILL FACTOR is in and of itself a performance problem (especially for clustered indexes) because you have to read more pages to get the same amount of information. On the flip side, it won't help at all if you have a row size that's over 4K bytes (for example). There's nothing good about a GUID except maybe for some lazy convenience owing to the very wide random values with certain techniques. Although the likelihood of duplication is insanely low, they're not even guaranteed to be unique across machines since they switched to type 4 GUIDs.

    http://en.wikipedia.org/wiki/Universally_unique_identifier#Version_4_.28random.29

    You also have to consider how many places a GUID will be used if it's a key for DRI purposes. You're not just talking about the difference between a 4 byte integer and a 16 byte GUID one time. You have to consider all of the other tables that it's being used in and the other indexes that it's being used in. And, if it's the first column of those indexes, clustered or not, there will always be a huge and guaranteed pain in dealing with them not to mention the disk space used.

    That being said about GUIDs, I also have to agree with Grant on the point that you can't just do a conversion without an absolute understanding of why a GUID may have been used as a key. If it's like the place I once worked at where they used GUIDs and PKs and, usually, CIs everywhere as a matter of rote, then great candidate for replacement. If they used it to support the likes of peer-to-peer replication or "dual mastering", then changing the GUIDs could have a major detrimental impact.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 12 posts - 16 through 26 (of 26 total)

You must be logged in to reply to this topic. Login to reply