PK Non-CLUSTERED to CLUSTERED

  • All

    I am fighting to get a PK change from Non-Clustered to Clustered, with more vigorous testing than is currently being suggested. It is the main table of the db. As far as I am aware and to my knowledge changing from PK NON-C to PK Clustered is correct to do.

    But if you are not prepared to fully test all scripts which run against the table it can cause issues, as it will now force SQL optimiser to work differently on the table and calculate different paths to get to the data. Some queries may improve some may crash the application due to poor performance.

    Am I correct. If so do you have a link to an article or link that will back me up.

    Thanks

  • There are many, many times when the PK should not be the clustered index. Why do you think it should be changed??

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • Not me 3rd Party. The table is a Heap it is badly designed with far too many columns and serving many different purposes.

    They believe it will improve performance. My view is not without extensive testing.

    As it is a major change to the table.

  • Talib123 (4/16/2015)


    Not me 3rd Party. The table is a Heap it is badly designed with far too many columns and serving many different purposes.

    They believe it will improve performance. My view is not without extensive testing.

    As it is a major change to the table.

    if the table is a heap, it should be changed to have a clustered index.

    best practice is every table should have a clustered index, unless you have a specific , compelling reason for it not to.

    there are lots of reasons why heaps can be less than optimal. changing the PK to be clustered where no clustered index existed before is a no-brainer for me.

    the statement "Some queries may improve some may crash the application due to poor performance." i think is unsubstantiated. i've never, ever seen an crash due to adding a clustered index.

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Lowell (4/16/2015)


    Talib123 (4/16/2015)


    Not me 3rd Party. The table is a Heap it is badly designed with far too many columns and serving many different purposes.

    They believe it will improve performance. My view is not without extensive testing.

    As it is a major change to the table.

    if the table is a heap, it should be changed to have a clustered index.

    best practice is every table should have a clustered index, unless you have a specific , compelling reason for it not to.

    there are lots of reasons why heaps can be less than optimal. changing the PK to be clustered where no clustered index existed before is a no-brainer for me.

    the statement "Some queries may improve some may crash the application due to poor performance." i think is unsubstantiated. i've never, ever seen an crash due to adding a clustered index.

    Yes, it needs a clustered index, but not necessarily the same as the PK. Maybe the PK is a guid. Maybe the table is always searched by a different value.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • Hi All

    I did state in my original post that I agree it should have a clustered index.

    My question is that changing a table which has multiple indexes and has many varying statements being executed against it, should have a very high degree of testing applied as creating a Clustered index can have a negative impact for some queries and a positive impact on others. This table in particular will have hundreds of different queries running against it.

    My question is of degree of testing. Yes I have seen a change of PK to clustered on a table completely bring an application down.

  • Talib123 (4/16/2015)


    As far as I am aware and to my knowledge changing from PK NON-C to PK Clustered is correct to do.

    Thanks

    The right answer is, it depends. To mention a few, it depends of:

    -Access pattern in your table (reads, writes, inserts, updates)

    -Your table's schema

    -The actual data type(s) of the keys that are part of your PK.

    -If your PK is sequential or not

    I just recently changed a PK from PK/CI to a simply UC and fixed a horrible fragmentation issue. I created a surrogate PK and made it a CI. But I performed a 4 weeks testing, talked to app. team and developers, and was logging Index access patterns for months, even before I saw the issue.

    If you post your table's schema we may be able to help or provide a more precise answer. But I would suggest to capture Index utilization for several weeks so you can understand how the table is being used. In fact, running Extended Event for a few min (or when the table is being accessed) may be a good idea too.

    But one thing is true... if the table is a heap, it certainly need a CI.

  • You will need to look at least at the stats SQL provides for indexes:

    1) missing index stats and

    2) index usage stats.

    Between those and you knowledge of the table, we can come up with a decent clustered index recommendation.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • For example:

    --USE [<your_db_name_here>] --naturally make sure you are in the right db

    SET DEADLOCK_PRIORITY LOW --probably irrelevant, but just in case

    DECLARE @list_missing_indexes bit

    DECLARE @table_name_pattern sysname

    --NOTE: showing missing indexes can take some time; set to 0 if you don't want to wait.

    SET @list_missing_indexes = 1

    SET @table_name_pattern = '??' --<<-- !table name goes here!

    --SET @table_name_pattern = '%'

    PRINT 'Started @ ' + CONVERT(varchar(30), GETDATE(), 120)

    IF @list_missing_indexes = 1

    BEGIN

    SELECT

    GETDATE() AS capture_date,

    DB_NAME(mid.database_id) AS Db_Name,

    dps.row_count,

    OBJECT_NAME(mid.object_id /*, mid.database_id*/) AS Table_Name,

    mid.equality_columns, mid.inequality_columns, mid.included_columns,

    user_seeks, user_scans, ca1.max_days_active, unique_compiles,

    last_user_seek, last_user_scan, avg_total_user_cost, avg_user_impact,

    system_seeks, system_scans, last_system_seek, last_system_scan, avg_total_system_cost, avg_system_impact,

    mid.statement, mid.object_id, mid.index_handle

    FROM sys.dm_db_missing_index_details mid WITH (NOLOCK)

    CROSS APPLY (

    SELECT DATEDIFF(DAY, create_date, GETDATE()) AS max_days_active FROM sys.databases WHERE name = 'tempdb'

    ) AS ca1

    LEFT OUTER JOIN sys.dm_db_missing_index_groups mig WITH (NOLOCK) ON

    mig.index_handle = mid.index_handle

    LEFT OUTER JOIN sys.dm_db_missing_index_group_stats migs WITH (NOLOCK) ON

    migs.group_handle = mig.index_group_handle

    LEFT OUTER JOIN sys.dm_db_partition_stats dps WITH (NOLOCK) ON

    dps.object_id = mid.object_id AND

    dps.index_id IN (0, 1)

    --order by

    --DB_NAME, Table_Name, equality_columns

    WHERE

    1 = 1

    AND mid.database_id = DB_ID() --only current db

    AND OBJECT_NAME(mid.object_id) LIKE @table_name_pattern

    --AND mid.object_id IN (OBJECT_ID('<table_name_1>'), OBJECT_ID('<table_name_2>'))

    ORDER BY

    --avg_total_user_cost * (user_seeks + user_scans) DESC,

    Db_Name, Table_Name, equality_columns, inequality_columns

    END --IF

    PRINT 'Midpoint @ ' + CONVERT(varchar(30), GETDATE(), 120)

    -- list index usage stats (seeks, scans, etc.)

    SELECT

    ius2.row_num, DB_NAME() AS db_name,

    CASE WHEN i.name LIKE ca2.table_name + '%'

    THEN '~' + SUBSTRING(i.name, LEN(ca2.table_name) + 1, 200)

    ELSE i.name END AS index_name,

    CASE WHEN i.is_unique = 0 THEN 'N' ELSE 'Y' END + '.' +

    CASE WHEN i.is_primary_key = 0 AND i.is_unique_constraint = 0 THEN 'N' ELSE 'Y' END AS [uniq?],

    ca2.table_name,

    i.index_id, --ius.user_seeks + ius.user_scans AS total_reads,

    dps.row_count,

    SUBSTRING(key_cols, 3, 8000) AS key_cols, SUBSTRING(nonkey_cols, 3, 8000) AS nonkey_cols,

    ius.user_seeks, ius.user_scans, ius.user_lookups, ius.user_updates,

    ius.last_user_seek, ius.last_user_scan, ius.last_user_lookup, ius.last_user_update,

    fk.Reference_Count AS fk_ref_count,

    DATEDIFF(DAY, CASE WHEN o.create_date > ca1.sql_startup_date THEN o.create_date

    ELSE ca1.sql_startup_date END, GETDATE()) AS max_days_active,

    FILEGROUP_NAME(i.data_space_id) AS filegroup_name,

    ius.system_seeks, ius.system_scans, ius.system_lookups, ius.system_updates,

    ius.last_system_seek, ius.last_system_scan, ius.last_system_lookup, ius.last_system_update

    FROM sys.indexes i WITH (NOLOCK)

    INNER JOIN sys.objects o WITH (NOLOCK) ON

    o.object_id = i.object_id

    CROSS JOIN (

    SELECT create_date AS sql_startup_date FROM sys.databases WHERE name = 'tempdb'

    ) AS ca1

    CROSS APPLY (

    SELECT OBJECT_NAME(i.object_id/*, DB_ID()*/) AS table_name

    ) AS ca2

    OUTER APPLY (

    SELECT

    ', ' + COL_NAME(object_id, ic.column_id)

    FROM sys.index_columns ic

    WHERE

    ic.key_ordinal > 0 AND

    ic.object_id = i.object_id AND

    ic.index_id = i.index_id

    ORDER BY

    ic.key_ordinal

    FOR XML PATH('')

    ) AS key_cols (key_cols)

    OUTER APPLY (

    SELECT

    ', ' + COL_NAME(object_id, ic.column_id)

    FROM sys.index_columns ic

    WHERE

    ic.key_ordinal = 0 AND

    ic.object_id = i.object_id AND

    ic.index_id = i.index_id

    ORDER BY

    COL_NAME(object_id, ic.column_id)

    FOR XML PATH('')

    ) AS nonkey_cols (nonkey_cols)

    LEFT OUTER JOIN sys.dm_db_partition_stats dps WITH (NOLOCK) ON

    dps.object_id = i.object_id AND

    dps.index_id = i.index_id

    LEFT OUTER JOIN sys.dm_db_index_usage_stats ius WITH (NOLOCK) ON

    ius.database_id = DB_ID() AND

    ius.object_id = i.object_id AND

    ius.index_id = i.index_id

    LEFT OUTER JOIN (

    SELECT

    database_id, object_id, MAX(user_scans) AS user_scans,

    ROW_NUMBER() OVER (ORDER BY MAX(user_scans) DESC) AS row_num --user_scans|user_seeks+user_scans

    FROM sys.dm_db_index_usage_stats WITH (NOLOCK)

    WHERE

    database_id = DB_ID()

    --AND index_id > 0

    GROUP BY

    database_id, object_id

    ) AS ius2 ON

    ius2.database_id = DB_ID() AND

    ius2.object_id = i.object_id

    LEFT OUTER JOIN (

    SELECT

    referenced_object_id, COUNT(*) AS Reference_Count

    FROM sys.foreign_keys

    WHERE

    is_disabled = 0

    GROUP BY

    referenced_object_id

    ) AS fk ON

    fk.referenced_object_id = i.object_id

    WHERE

    i.object_id > 100 AND

    i.is_hypothetical = 0 AND

    i.type IN (0, 1, 2) AND

    o.type NOT IN ( 'IF', 'IT', 'TF', 'TT' ) AND

    (

    o.name LIKE @table_name_pattern AND

    o.name NOT LIKE 'dtprop%' AND

    o.name NOT LIKE 'filestream[_]' AND

    o.name NOT LIKE 'MSpeer%' AND

    o.name NOT LIKE 'MSpub%' AND

    --o.name NOT LIKE 'queue[_]%' AND

    o.name NOT LIKE 'sys%'

    )

    --AND OBJECT_NAME(i.object_id /*, DB_ID()*/) IN ('tbl1', 'tbl2', 'tbl3')

    ORDER BY

    --row_count DESC,

    --ius.user_scans DESC,

    --ius2.row_num, --user_scans&|user_seeks

    db_name, table_name,

    -- list clustered index first, if any, then other index(es)

    CASE WHEN i.index_id IN (0, 1) THEN 1 ELSE 2 END,

    key_cols

    PRINT 'Ended @ ' + CONVERT(varchar(30), GETDATE(), 120)

    SET DEADLOCK_PRIORITY NORMAL

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • If the PK is the single most accessed path to the data, yeah, I agree, it should be clustered, testing or not. Although, I prefer testing and advocate for it. But, if there's question about which column or columns define the most common access path to the data, I would absolutely resist putting on the PK just because it's the PK.

    In terms of a link to a white paper, I'm not sure which one would tell you what you need. Of course testing something before doing it is the preferred method for just about anything you can do to your production system. You should measure before and after a change in order to know if the change worked or not. Just slapping stuff into the production system is dangerous (although, I agree, a clustered index isn't going to crash the system, but it sure will change how plans get generated, that's the idea).

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Wow thanks for all the replies. Very helpful.

    I'll chew through it all tomorrow.

  • ScottPletcher (4/16/2015)


    For example:

    --USE [<your_db_name_here>] --naturally make sure you are in the right db

    SET DEADLOCK_PRIORITY LOW --probably irrelevant, but just in case

    DECLARE @list_missing_indexes bit

    DECLARE @table_name_pattern sysname

    --NOTE: showing missing indexes can take some time; set to 0 if you don't want to wait.

    SET @list_missing_indexes = 1

    SET @table_name_pattern = '??' --<<-- !table name goes here!

    --SET @table_name_pattern = '%'

    PRINT 'Started @ ' + CONVERT(varchar(30), GETDATE(), 120)

    IF @list_missing_indexes = 1

    BEGIN

    SELECT

    GETDATE() AS capture_date,

    DB_NAME(mid.database_id) AS Db_Name,

    dps.row_count,

    OBJECT_NAME(mid.object_id /*, mid.database_id*/) AS Table_Name,

    mid.equality_columns, mid.inequality_columns, mid.included_columns,

    user_seeks, user_scans, ca1.max_days_active, unique_compiles,

    last_user_seek, last_user_scan, avg_total_user_cost, avg_user_impact,

    system_seeks, system_scans, last_system_seek, last_system_scan, avg_total_system_cost, avg_system_impact,

    mid.statement, mid.object_id, mid.index_handle

    FROM sys.dm_db_missing_index_details mid WITH (NOLOCK)

    CROSS APPLY (

    SELECT DATEDIFF(DAY, create_date, GETDATE()) AS max_days_active FROM sys.databases WHERE name = 'tempdb'

    ) AS ca1

    LEFT OUTER JOIN sys.dm_db_missing_index_groups mig WITH (NOLOCK) ON

    mig.index_handle = mid.index_handle

    LEFT OUTER JOIN sys.dm_db_missing_index_group_stats migs WITH (NOLOCK) ON

    migs.group_handle = mig.index_group_handle

    LEFT OUTER JOIN sys.dm_db_partition_stats dps WITH (NOLOCK) ON

    dps.object_id = mid.object_id AND

    dps.index_id IN (0, 1)

    --order by

    --DB_NAME, Table_Name, equality_columns

    WHERE

    1 = 1

    AND mid.database_id = DB_ID() --only current db

    AND OBJECT_NAME(mid.object_id) LIKE @table_name_pattern

    --AND mid.object_id IN (OBJECT_ID('<table_name_1>'), OBJECT_ID('<table_name_2>'))

    ORDER BY

    --avg_total_user_cost * (user_seeks + user_scans) DESC,

    Db_Name, Table_Name, equality_columns, inequality_columns

    END --IF

    PRINT 'Midpoint @ ' + CONVERT(varchar(30), GETDATE(), 120)

    -- list index usage stats (seeks, scans, etc.)

    SELECT

    ius2.row_num, DB_NAME() AS db_name,

    CASE WHEN i.name LIKE ca2.table_name + '%'

    THEN '~' + SUBSTRING(i.name, LEN(ca2.table_name) + 1, 200)

    ELSE i.name END AS index_name,

    CASE WHEN i.is_unique = 0 THEN 'N' ELSE 'Y' END + '.' +

    CASE WHEN i.is_primary_key = 0 AND i.is_unique_constraint = 0 THEN 'N' ELSE 'Y' END AS [uniq?],

    ca2.table_name,

    i.index_id, --ius.user_seeks + ius.user_scans AS total_reads,

    dps.row_count,

    SUBSTRING(key_cols, 3, 8000) AS key_cols, SUBSTRING(nonkey_cols, 3, 8000) AS nonkey_cols,

    ius.user_seeks, ius.user_scans, ius.user_lookups, ius.user_updates,

    ius.last_user_seek, ius.last_user_scan, ius.last_user_lookup, ius.last_user_update,

    fk.Reference_Count AS fk_ref_count,

    DATEDIFF(DAY, CASE WHEN o.create_date > ca1.sql_startup_date THEN o.create_date

    ELSE ca1.sql_startup_date END, GETDATE()) AS max_days_active,

    FILEGROUP_NAME(i.data_space_id) AS filegroup_name,

    ius.system_seeks, ius.system_scans, ius.system_lookups, ius.system_updates,

    ius.last_system_seek, ius.last_system_scan, ius.last_system_lookup, ius.last_system_update

    FROM sys.indexes i WITH (NOLOCK)

    INNER JOIN sys.objects o WITH (NOLOCK) ON

    o.object_id = i.object_id

    CROSS JOIN (

    SELECT create_date AS sql_startup_date FROM sys.databases WHERE name = 'tempdb'

    ) AS ca1

    CROSS APPLY (

    SELECT OBJECT_NAME(i.object_id/*, DB_ID()*/) AS table_name

    ) AS ca2

    OUTER APPLY (

    SELECT

    ', ' + COL_NAME(object_id, ic.column_id)

    FROM sys.index_columns ic

    WHERE

    ic.key_ordinal > 0 AND

    ic.object_id = i.object_id AND

    ic.index_id = i.index_id

    ORDER BY

    ic.key_ordinal

    FOR XML PATH('')

    ) AS key_cols (key_cols)

    OUTER APPLY (

    SELECT

    ', ' + COL_NAME(object_id, ic.column_id)

    FROM sys.index_columns ic

    WHERE

    ic.key_ordinal = 0 AND

    ic.object_id = i.object_id AND

    ic.index_id = i.index_id

    ORDER BY

    COL_NAME(object_id, ic.column_id)

    FOR XML PATH('')

    ) AS nonkey_cols (nonkey_cols)

    LEFT OUTER JOIN sys.dm_db_partition_stats dps WITH (NOLOCK) ON

    dps.object_id = i.object_id AND

    dps.index_id = i.index_id

    LEFT OUTER JOIN sys.dm_db_index_usage_stats ius WITH (NOLOCK) ON

    ius.database_id = DB_ID() AND

    ius.object_id = i.object_id AND

    ius.index_id = i.index_id

    LEFT OUTER JOIN (

    SELECT

    database_id, object_id, MAX(user_scans) AS user_scans,

    ROW_NUMBER() OVER (ORDER BY MAX(user_scans) DESC) AS row_num --user_scans|user_seeks+user_scans

    FROM sys.dm_db_index_usage_stats WITH (NOLOCK)

    WHERE

    database_id = DB_ID()

    --AND index_id > 0

    GROUP BY

    database_id, object_id

    ) AS ius2 ON

    ius2.database_id = DB_ID() AND

    ius2.object_id = i.object_id

    LEFT OUTER JOIN (

    SELECT

    referenced_object_id, COUNT(*) AS Reference_Count

    FROM sys.foreign_keys

    WHERE

    is_disabled = 0

    GROUP BY

    referenced_object_id

    ) AS fk ON

    fk.referenced_object_id = i.object_id

    WHERE

    i.object_id > 100 AND

    i.is_hypothetical = 0 AND

    i.type IN (0, 1, 2) AND

    o.type NOT IN ( 'IF', 'IT', 'TF', 'TT' ) AND

    (

    o.name LIKE @table_name_pattern AND

    o.name NOT LIKE 'dtprop%' AND

    o.name NOT LIKE 'filestream[_]' AND

    o.name NOT LIKE 'MSpeer%' AND

    o.name NOT LIKE 'MSpub%' AND

    --o.name NOT LIKE 'queue[_]%' AND

    o.name NOT LIKE 'sys%'

    )

    --AND OBJECT_NAME(i.object_id /*, DB_ID()*/) IN ('tbl1', 'tbl2', 'tbl3')

    ORDER BY

    --row_count DESC,

    --ius.user_scans DESC,

    --ius2.row_num, --user_scans&|user_seeks

    db_name, table_name,

    -- list clustered index first, if any, then other index(es)

    CASE WHEN i.index_id IN (0, 1) THEN 1 ELSE 2 END,

    key_cols

    PRINT 'Ended @ ' + CONVERT(varchar(30), GETDATE(), 120)

    SET DEADLOCK_PRIORITY NORMAL

    Scott, thanks for the code! Very interesting results, definitely food for thought. It will be interesting to see how the results change as I move from development through user testing to limited deployment.

    -----
    [font="Arial"]Knowledge is of two kinds. We know a subject ourselves or we know where we can find information upon it. --Samuel Johnson[/font]

  • Remember the clustered index represents the order of the table, the primary key represents what is unique about each row. They are not necessarily the same as you know by now. You have a valid point as some queries will benefit from the new order and others may not. Keep in mind that new non-clustered indexes with an INCLUDE part can help you overcome any performance queries that result.

    The simple way to go about it is to drop all indexes , then create the clustered index , then the non clustered ones. Not 100% but this method may also give you space back (if your heap had rows deleted from time to time).

    ----------------------------------------------------

  • Your biggest hint as to best candidate for clustered index is highest number of the sum of seeks grouped by the leading index key.

    Also, a symptom of poor choice of clustering or non existent clustering would be a smattering of either many space eating covering indexes, or extreme amount of lookups.

    It's a no-brainer really. (except for vendors who never seem to include a DBA in their designs).

    Probably best to find all the queries in the plan cache which hit the table and store the statistics and exec plans.

    Then with those numbers, you have a baseline.

    Then you can figure out, by looking at the queries and plans, which queries will benefit, and which will not,visually before even changing the indexing.

    I've changes nonclustered PK's to clustered for most cases found and lookups disappear, blocking was heavily reduced, fat covering indexes no longer needed and dropped, deadlocks disappeared, and these were tables which were originally clustered on the wrong column, i.e. not as bad as heaps.

    Be careful of fillfactor, else it is possible that your table ends up taking up more space, unless that is what is expected.

    Catch-all queries done right [/url]
    Gail Shaw's Performance Blog[/url]

  • The simple way to go about it is to drop all indexes , then create the clustered index , then the non clustered ones. Not 100% but this method may also give you space back (if your heap had rows deleted from time to time).

    I second this method. If you don't drop the non-clustered indexes first, you may find that the process takes a considerable amount of time adding the clustered index. It will normally go much faster the other way.

    I've changes nonclustered PK's to clustered for most cases found and lookups disappear, blocking was heavily reduced, fat covering indexes no longer needed and dropped, deadlocks disappeared, and these were tables which were originally clustered on the wrong column, i.e. not as bad as heaps. Be careful of fillfactor, else it is possible that your table ends up taking up more space, unless that is what is expected.

    If your choice of a clustered index is such that new records could be inserted anywhere, you should have a fill factor. This will reduce the speed at which the table fragments. If it's sequential (even if not an identity but something like a creation date), a fill factor is probably not necessary.

    As for whether the PK could be clustered or not, that is a table by table decision. We have recently seen a lot of improvements in an application by moving the PKs from the primary key to some other more appropriate column or combination of columns (although rarely is more than one column used). The tables had been reflexively clustered on the PK. This is often not the best decision.

Viewing 15 posts - 1 through 15 (of 18 total)

You must be logged in to reply to this topic. Login to reply