Composite Index performance

  • Hello, thanks in advance for any help. I have a large table with 500M records. It has a primary clustered index on its key and a non-clustered unique index on the surrogate key, which consists of 3 varchar(200) fields. My ETL merges from an extract table into the final large table on the 3 large text fields. Typically this is a merge of 10M records into the 500M-record table. Performance has become terrible: it now takes almost 12 hours to complete in our SQL Server 2012 database.

    Any recommendations on how I can improve performance?
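
    For context, the merge has roughly this shape (table and column names here are illustrative only, not my actual ones):

    -- Illustrative shape only; the real table/column names differ.
    MERGE [dbo].[BigTable] AS tgt
    USING [dbo].[ExtractTable] AS src
       ON  tgt.[KeyA] = src.[KeyA]   -- the three varchar(200) columns
       AND tgt.[KeyB] = src.[KeyB]
       AND tgt.[KeyC] = src.[KeyC]
    WHEN MATCHED THEN
        UPDATE SET tgt.[SomeCol] = src.[SomeCol]
    WHEN NOT MATCHED THEN
        INSERT ([KeyA],[KeyB],[KeyC],[SomeCol])
        VALUES (src.[KeyA],src.[KeyB],src.[KeyC],src.[SomeCol]);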

  • You probably have a wide index scan, and the index(es) don't contribute to the query. Can you post the execution plan so someone can give you good advice?

    If you can use a static and narrow index key (numeric is better), it should perform better.

    Some references:

    http://www.sqlservercentral.com/articles/Indexing/68563/

    http://www.sqlservercentral.com/articles/Indexing/68636/

    http://www.sqlskills.com/blogs/kimberly/more-considerations-for-the-clustering-key-the-clustered-index-debate-continues/

    Partitioning the table might also help a lot; you could take that into consideration as well, because 500M rows is a big number.
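
    For example, a rough sketch of range partitioning (this assumes a suitable date column exists; the names and boundary values here are illustrative):

    -- Illustrative sketch only: monthly range partitioning on a date column.
    CREATE PARTITION FUNCTION pf_Monthly (datetime)
    AS RANGE RIGHT FOR VALUES ('2014-05-01','2014-06-01','2014-07-01');

    CREATE PARTITION SCHEME ps_Monthly
    AS PARTITION pf_Monthly ALL TO ([PRIMARY]);

    -- The big table's clustered index would then be rebuilt on ps_Monthly.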

    Igor Micev, My blog: www.igormicev.com

  • Thanks for the reply. I've attached the execution plan for your review. The three varchar columns are a necessary part of the join. I'll look into the links you provided.

    97% of the total query cost goes to the Table Scan and Sort in the upper-right part of the execution plan.

    Can you additionally post the table structures and the index definitions?

    Igor Micev, My blog: www.igormicev.com

  • The table scan actually doesn't take that long. I ran the "select" part of the merge statement, and that part returned 10M rows into a temp table in 4 minutes. Attached are the table structure and the merge.

  • --Extend this index to make it covering for the query:
    CREATE UNIQUE NONCLUSTERED INDEX [UIX_did_uid_sid] ON [Dim].[UserDevice]
    (
        [DeviceID] ASC,
        [UserID] ASC,
        [SubscriberID] ASC
    )
    INCLUDE ([Locale],[LastUsedDate]);

    --Create this index:
    CREATE UNIQUE NONCLUSTERED INDEX [UIX_NCSTransaction_IDs] ON [Ext].[NCSTransaction]
    (
        [SubscriberID],[UserID],[DeviceID]
    )
    INCLUDE ([Locale],[location],[TransactionStartDateTime]);

    If the "select" part of the merge statement takes 4 minutes, then you'd better replace that code with a temp table and put the second index on the temp table.

    I think there is a chance to improve it a lot. So it's your turn.

    Igor Micev, My blog: www.igormicev.com

  • I strongly agree that you should substitute numeric values in place of the (very) long varchar key columns: [SubscriberID], [UserID], and [DeviceID].

    But, until then, based on what you've posted, you should very likely cluster the tables as follows. (Yes, forget clustering on the identity column).

    UserDevice: [SubscriberID],[UserID],[DeviceID]

    NCSTransaction: [SubscriberID],[UserID],[DeviceID],TransactionStartDateTime

    For best possible performance, it's very important that both indexes have exactly the same order of columns.
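
    In other words, something like this (a sketch only; you'd have to drop or move the existing clustered primary keys first, and UNIQUE on UserDevice assumes the combination is unique, as your existing unique index implies):

    CREATE UNIQUE CLUSTERED INDEX [CIX_UserDevice]
    ON [Dim].[UserDevice] ([SubscriberID],[UserID],[DeviceID]);

    CREATE CLUSTERED INDEX [CIX_NCSTransaction]
    ON [Ext].[NCSTransaction] ([SubscriberID],[UserID],[DeviceID],[TransactionStartDateTime]);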

    SQL DBA, SQL Server MVP (07, 08, 09). A socialist is someone who will give you the shirt off *someone else's* back.

  • Thanks for the help. I will try the temp tables, indexing them, and using the covering index.

  • I'd like to suggest an alternative, since making multiple columns a clustered key increases the size of, and degrades the performance of, all your non-clustered indexes on the target table. Remember, the cluster key gets stored in all non-clustered indexes whether you ask for it or not.

    Merges perform best when both source and target tables are in the same sequence order. Presumably you have your source table ordered by the compound key. So read through it, joining to the target table using the compound key. But write your data out to a new working table that includes the single (primary) key, and is indexed on that key. This working table will be the source of your merge.

    Yes, you are reading through your input values twice, but as you pointed out, scanning a table of 10 million rows doesn't take that long at all.
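
    A rough sketch of that pattern (the names here are hypothetical, including the single-key column [UserDeviceID]):

    -- Resolve the wide compound key to the target's single numeric key once,
    -- writing the result to a working table keyed on that value.
    SELECT t.[UserDeviceID] AS TargetID,          -- hypothetical numeric PK
           s.[Locale], s.[TransactionStartDateTime]
    INTO #work
    FROM [Ext].[NCSTransaction] AS s
    LEFT JOIN [Dim].[UserDevice] AS t
           ON  t.[SubscriberID] = s.[SubscriberID]
           AND t.[UserID]       = s.[UserID]
           AND t.[DeviceID]     = s.[DeviceID];

    CREATE CLUSTERED INDEX [CIX_work] ON #work (TargetID);

    -- The MERGE then matches on TargetID; rows with NULL TargetID are inserts.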

    Something to try anyway.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Using a temp table, indexing it with a covering index, and selecting from it worked extremely well. The merge is down to 20 minutes. Thanks again.

  • It will cover that specific process as it exists now, yes, but you're doubling one table's size. And you'll have to periodically go back and re-do the "covering" index when a column is added to the query, making the index no longer covering. You'd get a much greater benefit overall, to all processes, by implementing the best clustered index first, then adding other indexes / doing other tuning as needed.

    SQL DBA, SQL Server MVP (07, 08, 09). A socialist is someone who will give you the shirt off *someone else's* back.

  • Scott, I respectfully have to ask for clarification. Whenever you add multiple columns to the clustered index, you increase the size of ALL nonclustered indexes on the same table. You also make seeks to the clustered index slightly less efficient because more pages are needed to hold the index structure. In my experience, it is better to limit the number of columns in the clustered index as much as possible and use non-clustered covering indexes when multi-column indexes are needed.

    There may be exceptions such as when you have an order_header and order_detail table, and you want to index the order_detail on order_number, line_item_number so that all items for an order are likely to be on the same page. But as a general rule, keeping the clustered index as simple as possible and creating non-clustered covering indexes has proven to be the best performer for us.
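
    To put numbers on it: three varchar(200) key columns can carry up to roughly 600 bytes into every nonclustered index row, versus 4 bytes for an int identity. Something like this shows the per-index cost:

    -- Approximate size of each index on a table (8 KB pages -> MB).
    SELECT i.name AS index_name,
           SUM(ps.used_page_count) * 8 / 1024 AS used_mb
    FROM sys.dm_db_partition_stats AS ps
    JOIN sys.indexes AS i
      ON  i.[object_id] = ps.[object_id]
      AND i.index_id    = ps.index_id
    WHERE ps.[object_id] = OBJECT_ID(N'Dim.UserDevice')
    GROUP BY i.name
    ORDER BY used_mb DESC;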

    Could you give us an example of what you would consider a proper clustered index key?

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • The Dixie Flatline (7/15/2014)


    Scott, I respectfully have to ask for clarification. Whenever you add multiple columns to the clustered index, you increase the size of ALL nonclustered indexes on the same table. You also make seeks to the clustered index slightly less efficient because more pages are needed to hold the index structure. In my experience, it is better to limit the number of columns in the clustered index as much as possible and use non-clustered covering indexes when multi-column indexes are needed.

    There may be exceptions such as when you have an order_header and order_detail table, and you want to index the order_detail on order_number, line_item_number so that all items for an order are likely to be on the same page. But as a general rule, keeping the clustered index as simple as possible and creating non-clustered covering indexes has proven to be the best performer for us.

    Could you give us an example of what you would consider a proper clustered index key?

    As a general rule, yes. But for some it becomes an absolute rule, and it should not be one. Besides, that's what marks an expert: he/she knows when to override general rules.

    Too often, instead, you have to build and maintain a gazillion nonclustered covering indexes simply because you have the wrong clustered index, and you end up with multiple indexes sharing those same leading key(s). That's a huge waste of resources, including the on-going, constant adjustment of indexes.

    I clearly stated that the very long varchar values needed to be replaced by numeric substitutes/equivalents, but until that can be done, the proper clustering key is still the best option. Given that those columns are the vast majority of the width of one of those tables anyway, just having to create a nonclustered index with that key doubles the table size.

    SQL DBA, SQL Server MVP (07, 08, 09). A socialist is someone who will give you the shirt off *someone else's* back.

  • Statement of fact: No single clustered index can ideally serve the needs of multiple queries searching against different columns of the table. Non-clustered indexes are indeed duplicated data, but they are a necessary evil to buy performance for critical queries.

    For future readers, I would recommend the following posts by Kimberly Tripp:

    http://www.sqlskills.com/blogs/kimberly/ever-increasing-clustering-key-the-clustered-index-debate-again/

    Pay attention to the reasoning supporting a focus on keys that are unique, narrow, static, and ever-increasing.

    http://www.sqlskills.com/blogs/kimberly/the-clustered-index-debate-continues/

    Look at what inserts can do to performance with an index that isn't based on ever-increasing numbers.

    http://www.sqlskills.com/blogs/kimberly/more-considerations-for-the-clustering-key-the-clustered-index-debate-continues/

    Why non-nullable, fixed-width columns are preferable.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Igor Micev (7/14/2014)


    --Extend this index to make it covering for the query:
    CREATE UNIQUE NONCLUSTERED INDEX [UIX_did_uid_sid] ON [Dim].[UserDevice]
    (
        [DeviceID] ASC,
        [UserID] ASC,
        [SubscriberID] ASC
    )
    INCLUDE ([Locale],[LastUsedDate]);

    --Create this index:
    CREATE UNIQUE NONCLUSTERED INDEX [UIX_NCSTransaction_IDs] ON [Ext].[NCSTransaction]
    (
        [SubscriberID],[UserID],[DeviceID]
    )
    INCLUDE ([Locale],[location],[TransactionStartDateTime]);

    If the "select" part of the merge statement takes 4 minutes, then you'd better replace that code with a temp table and put the second index on the temp table.

    I think there is a chance to improve it a lot. So it's your turn.

    Since the ROW_NUMBER partitioning in the query is...

    row_number() over(partition by [SubscriberID],[UserID],[DeviceID] order by transactionstartdatetime desc) as rn

    ... you might eke out a bit more performance if you made [TransactionStartDateTime] DESC a part of that second index's key rather than a part of the INCLUDE.
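
    Something like this (untested; the same definition as above, with the date column moved into the key):

    CREATE UNIQUE NONCLUSTERED INDEX [UIX_NCSTransaction_IDs] ON [Ext].[NCSTransaction]
    ([SubscriberID],[UserID],[DeviceID],[TransactionStartDateTime] DESC)
    INCLUDE ([Locale],[location])
    WITH (DROP_EXISTING = ON);   -- replaces the earlier version of the index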

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
