Index question for a monthly summary table

  • Our company has a monthly summary table that is populated via a stored procedure. It has around 100,000 rows and grows by about 8,000 records per month.

    There are > 200 columns, and I can't post the true table definition, but I will give a simplified example.

    CREATE TABLE [dbo].[sample_summary_table](
        [project_code] [varchar](10) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
        [period] [int] NOT NULL,
        [period_end_date] [datetime] NULL,
        [sample_summary_data1] [numeric](28, 2) NULL,
        [sample_summary_data2] [numeric](28, 2) NULL,
        [sample_summary_data3] [numeric](28, 2) NULL,
        CONSTRAINT [pk_sample_summary_table] PRIMARY KEY CLUSTERED
        (
            [project_code] ASC,
            [period] ASC
        ) WITH (PAD_INDEX = OFF, IGNORE_DUP_KEY = OFF) ON [PRIMARY]
    ) ON [PRIMARY]

    The integer "period" is actually an integer representation of the "fiscal year's" year and month in yyyydd order. So if this were the first month of fiscal 2011, this month's records would all get a "period" of 201101. There is a clustered index to match the primary key.

    We are working on the performance of the job which populates this table. That job runs for all project codes, but only for the current period, and it is currently multi-step, with an initial insert of the project records and then multiple updates. We will likely change that to be either a single insert, or at least to cut down the number of updates. Each update specifies the period within the WHERE clause, and updates every project for that period (through a join).

    The biggest performance bottlenecks are the queries used to build the table, and we are making progress in fixing them. But I am wondering if we should consider rebuilding this table's index while we are at it, by reversing the order of the fields in the clustered index.

    On the one hand, project_code is by far the more selective field of the two. On the other hand, when updating records (as well as when selecting them later in reports), a period is always provided, but the project code either is not provided (reports) or is joined in through a query bringing in all the project records. In addition, period is ascending, while project codes are repeated across periods, so having period first might help prevent page splits, etc.

    Any thoughts on this? Would it be worthwhile rebuilding the clustered index? Would it be a mistake?

    NOTE: I have considered the idea of having an identity column for primary key and clustered index, and using a non-clustered index for this, but the issue is that both the updates and the queries that access the tables pull from 100+ columns, so making it a covering index would be pretty difficult.

  • Nevyn (9/29/2010)


    ...NOTE: I have considered the idea of having an identity column for primary key and clustered index, and using a non-clustered index for this, but the issue is that both the updates and the queries that access the tables pull from 100+ columns, so making it a covering index would be pretty difficult.

    Give it a TRY! Create the identity clustered PK, and create an index for the columns used for filtering (WHERE clauses) and JOINs to other tables. You may want to play with including other columns in your indexes:

    http://msdn.microsoft.com/en-us/library/ms190806(v=SQL.90).aspx

    It would be very interesting to see the results... I would expect that populating this table will no longer be a problem.

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help

  • I guess my question is, where does the slowdown occur? Is it in the insert? If so, are you getting lots of page splits or is it just an I/O issue?

    I agree that if this is the most frequent access method, it should probably stay as the clustered index, but, just so you know, that's a pretty wide key for a cluster, so you will see some performance hits from that, few rows on a page, more page splits, etc.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Period should definitely be the first column in the clus index. For now, just swap the two columns. That should end the issue with the INSERTs; with project first, new rows likely land all over the table, whereas with period first they will go just at the end.

    Remember to sort the rows in clus key order during the INSERT.

    That is, your INSERT should not be just:

    INSERT INTO
    SELECT period, project, ...
    FROM ...
    WHERE ...

    It should be:

    INSERT INTO
    SELECT period, project, ...
    FROM ...
    WHERE ...
    ORDER BY period, project

    The optimizer will then "know" that the new rows are in order and can avoid page splits during INSERT (well, as much as it can avoid them).

    The project will contribute to some fragmentation. You might consider dropping it from the PK if you need to.
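    For illustration, swapping the key order and doing an ordered insert might look roughly like the sketch below, against the simplified table from the first post; source_query and the period value 201101 are placeholders, not the real job:

    -- Rebuild the primary key with period leading (a sketch; dropping and re-adding
    -- the clustered PK rebuilds the whole table, so do it in a maintenance window)
    ALTER TABLE dbo.sample_summary_table DROP CONSTRAINT pk_sample_summary_table;
    ALTER TABLE dbo.sample_summary_table
        ADD CONSTRAINT pk_sample_summary_table PRIMARY KEY CLUSTERED (period, project_code);

    -- Ordered insert for the current period (source_query stands in for the real joins)
    INSERT INTO dbo.sample_summary_table (period, project_code, period_end_date,
        sample_summary_data1, sample_summary_data2, sample_summary_data3)
    SELECT s.period, s.project_code, s.period_end_date,
        s.sample_summary_data1, s.sample_summary_data2, s.sample_summary_data3
    FROM source_query AS s
    WHERE s.period = 201101
    ORDER BY s.period, s.project_code;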

    Scott Pletcher, SQL Server MVP 2008-2010

  • Eugene Elutin (9/29/2010)

    Give it a TRY! Create the identity clustered PK, and create an index for the columns used for filtering (WHERE clauses) and JOINs to other tables. You may want to play with including other columns in your indexes:

    http://msdn.microsoft.com/en-us/library/ms190806(v=SQL.90).aspx

    It would be very interesting to see the results... I would expect that populating this table will no longer be a problem.

    Actually, the bigger problem with populating the table is the queries (from a 3rd party DB) needed to compile the data, which I'm working on now.

    If the indexing of this table were the only problem, the process would run at acceptable speed. I'm just looking for anywhere in here that we can improve performance. This is a small fish in this particular performance pond.

    On the subject of non-clustered indexes, the issue would be the number of columns to be included. Even though they are only at the leaf level, it seems to me that including 100-odd columns in an index would be a bit counterproductive.

    I could also use some clarification on how these indexes work when doing an update statement. If an update statement selects the rows to update based on a nonclustered index, but is updating columns that are not a part of that index, how efficiently will that run (vs finding the rows with a clustered index)?

  • Nevyn (9/29/2010)


    ...

    On the subject of non-clustered indexes, the issue would be the number of columns to be included. Even though they are only at the leaf level, it seems to me that including 100-odd columns in an index would be a bit counterproductive.

    I could also use some clarification on how these indexes work when doing an update statement. If an update statement selects the rows to update based on a nonclustered index, but is updating columns that are not a part of that index, how efficiently will that run (vs finding the rows with a clustered index)?

    Have you read http://msdn.microsoft.com/en-us/library/ms190806(v=SQL.90).aspx ?

    You can include columns in an index without making them index keys.

    You can include up to 1023 nonkey columns in your index!
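    For example, a nonclustered index with a few included columns on the simplified table might look like this (a sketch; the index name and the choice of included columns are illustrative and depend on the actual queries):

    -- Key on period/project_code, with some data columns carried at the leaf level
    -- so that a query touching only these columns can be covered
    CREATE NONCLUSTERED INDEX ix_sample_summary_period_project
        ON dbo.sample_summary_table (period, project_code)
        INCLUDE (period_end_date, sample_summary_data1, sample_summary_data2);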

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help

  • Grant Fritchey (9/29/2010)


    I guess my question is, where does the slowdown occur? Is it in the insert? If so, are you getting lots of page splits or is it just an I/O issue?

    I agree that if this is the most frequent access method, it should probably stay as the clustered index, but, just so you know, that's a pretty wide key for a cluster, so you will see some performance hits from that, few rows on a page, more page splits, etc.

    Right now, an insert is performed setting up the key and defaulting the data columns to null values, and then a series of update statements find and update all the records for the current period. I haven't run the stats on each individual step yet, because it should be possible to replace that method with a single insert (or at worst an insert and one update, or building the data in a temp table and then doing one insert), and because there is one query in the process that is the biggest bottleneck. Once we have done what we can with that query, and cleaned up the process so that it isn't updating 8,000 rows 5 different times, we will do some benchmarking of what remains and I'll have a better picture of how this table specifically is performing.

    For now it's more of a theoretical question of putting the more selective field first, or the more frequently used (and properly ascending) field.

  • The best thing I would suggest is to try inserting all 8,000 rows at once (or in batches) with all required data populated.

    If it is absolutely impossible in your case, I would recommend inserting the 8,000 rows into an intermediate table (it can be a temp table or a permanent work/staging table), performing all required updates there, and then inserting the final results into the destination table in one go.
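    A rough sketch of that staging pattern against the simplified table; #staging, some_source_query, and another_source_query are placeholders for the real 3rd-party queries:

    -- Build the month's rows in a staging table first
    SELECT project_code, period, period_end_date,
           sample_summary_data1, sample_summary_data2, sample_summary_data3
    INTO #staging
    FROM some_source_query
    WHERE period = 201101;

    -- Apply any remaining updates against the small staging table
    UPDATE s
    SET sample_summary_data2 = x.value
    FROM #staging AS s
    JOIN another_source_query AS x ON x.project_code = s.project_code;

    -- One insert into the real table at the end, sorted in clustered key order
    INSERT INTO dbo.sample_summary_table (project_code, period, period_end_date,
        sample_summary_data1, sample_summary_data2, sample_summary_data3)
    SELECT project_code, period, period_end_date,
           sample_summary_data1, sample_summary_data2, sample_summary_data3
    FROM #staging
    ORDER BY project_code, period;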

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help

  • Eugene Elutin (9/29/2010)


    Nevyn (9/29/2010)


    ...

    On the subject of non-clustered indexes, the issue would be the number of columns to be included. Even though they are only at the leaf level, it seems to me that including 100-odd columns in an index would be a bit counterproductive.

    I could also use some clarification on how these indexes work when doing an update statement. If an update statement selects the rows to update based on a nonclustered index, but is updating columns that are not a part of that index, how efficiently will that run (vs finding the rows with a clustered index)?

    Have you read http://msdn.microsoft.com/en-us/library/ms190806(v=SQL.90).aspx ?

    You can include columns in an index without making them index keys.

    You can include up to 1023 nonkey columns in your index!

    Yes, I read it, and I understand what an INCLUDE column is. But if I need most of the table included to "cover" the query, isn't my non-clustered index behaving a lot like a copy of the table with a different clustered index?

    Example:

    If my clustered index is period, project_code and I select 90% of the columns in the table in a query with period in the predicate, it ought to seek based on the clustered index and get me my data.

    If I have an identity clustered index, and a non-clustered index of period, project_code including all of those columns at the leaf level, and I do the same query, is it really performing any better? It would seem to me that this would increase the space taken by the table and put a performance hit on inserts and deletes.

    In other words, I understand I am allowed to include all those columns, I'm just not sure I get how that would perform better. The only advantage I see to doing it this way would be to have a truly selective and ascending clustered index, and thus maybe less fragmentation. Am I wrong?

  • Eugene Elutin (9/29/2010)


    The best thing I would suggest is to try inserting all 8,000 rows at once (or in batches) with all required data populated.

    If it is absolutely impossible in your case, I would recommend inserting the 8,000 rows into an intermediate table (it can be a temp table or a permanent work/staging table), performing all required updates there, and then inserting the final results into the destination table in one go.

    That is the plan. I believe it should be possible in one insert (the SQL statement to do so would be massively long, but the joins are reasonably straightforward and I believe they can be done at the same time).

  • No, that's exactly right. You should be extremely careful about INCLUDEing columns in an index. It's a very useful feature, but it should be used selectively, only to make an index a covering index, and only for a very limited number of columns.

    You always want the primary access to your table to be through the clus index, because that is the most efficient way for SQL to access your data.

    Scott Pletcher, SQL Server MVP 2008-2010

  • 1. If your clustered index is period and project_code, you had better not insert anything that will be placed between existing data. Also, as you insert, all data will be sorted by these two columns. A clustered key on an identity column does not suffer from either of these issues.

    2. If you decide to have an identity clustered index and a non-clustered index of period, project_code, don't include all the other columns straight away. Just see if the performance is acceptable as it is.

    Will the same query perform any better? I have no idea. It depends! Try and see.

    Do you have a disk space problem that makes table/index size a concern?

    Updating a non-clustered index (even one with many included columns) while inserting records may still be cheaper than the re-paging which can be caused by inserting into the middle of the clustered index.

    Actually, in your case, regardless of what your clustered index will be, if you have non-fixed-length columns in your tables (e.g. varchar), you should avoid updates of these columns, as updates of columns of non-fixed-length data types will also cause re-paging.

    Using a clustered index on an identity column will help achieve minimal fragmentation.

    Also, I do recommend updating the table statistics straight after inserting a lot of records into it.
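    As a sketch of what that alternative might look like on the simplified table (the table name, index name, and column list are illustrative, not the real 200-column schema):

    -- Identity clustered PK plus a narrow nonclustered key on the natural columns
    CREATE TABLE dbo.sample_summary_table_alt (
        summary_id            int IDENTITY(1, 1) NOT NULL,
        project_code          varchar(10) NOT NULL,
        period                int NOT NULL,
        period_end_date       datetime NULL,
        sample_summary_data1  numeric(28, 2) NULL,
        CONSTRAINT pk_sample_summary_table_alt PRIMARY KEY CLUSTERED (summary_id)
    );

    CREATE UNIQUE NONCLUSTERED INDEX ix_alt_period_project
        ON dbo.sample_summary_table_alt (period, project_code);

    -- After the monthly bulk insert, refresh the statistics
    UPDATE STATISTICS dbo.sample_summary_table_alt;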

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help

  • Nevyn (9/29/2010)


    Grant Fritchey (9/29/2010)


    I guess my question is, where does the slowdown occur? Is it in the insert? If so, are you getting lots of page splits or is it just an I/O issue?

    I agree that if this is the most frequent access method, it should probably stay as the clustered index, but, just so you know, that's a pretty wide key for a cluster, so you will see some performance hits from that, few rows on a page, more page splits, etc.

    Right now, an insert is performed setting up the key and defaulting the data columns to null values, and then a series of update statements find and update all the records for the current period. I haven't run the stats on each individual step yet, because it should be possible to replace that method with a single insert (or at worst an insert and one update, or building the data in a temp table and then doing one insert), and because there is one query in the process that is the biggest bottleneck. Once we have done what we can with that query, and cleaned up the process so that it isn't updating 8,000 rows 5 different times, we will do some benchmarking of what remains and I'll have a better picture of how this table specifically is performing.

    For now it's more of a theoretical question of putting the more selective field first, or the more frequently used (and properly ascending) field.

    If the question is which field should be first, the more selective or the less, I'd go with the more selective. It's going to be used to generate statistics, so it'll be more useful in front.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • I would stick with the more frequently used (and properly ascending) field. On an index, the most important thing is to match the first column in the index, esp. on a clus index.

    Scott Pletcher, SQL Server MVP 2008-2010

  • scott.pletcher (9/29/2010)


    I would stick with the more frequently used (and properly ascending) field. On an index, the most important thing is to match the first column in the index, esp. on a clus index.

    Ha! And the final answer? It depends...

    I guess with conflicting advice the thing to do is test it.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning
