Composite Index performance

  • jjf112 (7/14/2014)


    Hello, thanks in advance for any help. I have a large table with 500M records. It has a primary clustered index on its key and a non-clustered unique index on the surrogate key, which consists of 3 varchar(200) fields. My ETL merges from an extract table to the final large table on the 3 large text fields. Typically this consists of a merge of 10M records into the 500M record table. Performance has become terrible; it takes almost 12 hours to complete in our SQL Server 2012 database.

    Any recommendations on how I can improve performance?
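
    A minimal sketch of the pattern being described (the table names here are taken from later posts in this thread; the UPDATE/INSERT column lists are purely illustrative):

    MERGE [Dim].[UserDevice] AS tgt
    USING [Ext].[NCSTransaction] AS src
        ON  tgt.[SubscriberID] = src.[SubscriberID]   -- the three varchar(200) join columns
        AND tgt.[UserID]       = src.[UserID]
        AND tgt.[DeviceID]     = src.[DeviceID]
    WHEN MATCHED THEN
        UPDATE SET tgt.[LastUsedDate] = src.[TransactionStartDateTime]
    WHEN NOT MATCHED BY TARGET THEN
        INSERT ([SubscriberID], [UserID], [DeviceID], [LastUsedDate])
        VALUES (src.[SubscriberID], src.[UserID], src.[DeviceID], src.[TransactionStartDateTime]);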

    As a bit of a side interest on this problem, does this table suffer from any INSERTs from a GUI?

    If so, are the INSERTs done in order by [SubscriberID],[UserID],[DeviceID] ?

    I assume that the bulk 10M row inserts are not, correct?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • The Dixie Flatline (7/15/2014)

    Statement of fact: No single clustered index can ideally serve the needs of multiple queries searching against different columns of the table.

    Another statement of fact: You can't possibly know, ahead of designing a table, whether that specific table will need nonclustered index(es) or not. Nor could you possibly accurately assess potential trade-offs between one set of clustering key(s) and another.

    I've seen Ms. Tripp's articles. Again, it's a great general rule to have a naturally ascending clustering key whenever possible. But it should NOT be taken as a commandment. Indeed, with the tens of thousands of dbs I've tuned, I've found that in only ~15-25% of tables does an identity key work best. Often changing it to the proper clustering key reduces overall I/O by over 99%! (Merge joins, when available and appropriate, can perform astoundingly well even on some very large tables.)

    Some minor amount of fragmentation is actually a far better arrangement than corrupting the clustering key. After all, even identity-keyed tables end up with some fragmentation.

    To summarize, my main point is simply this: The clustering key is the single most critical performance factor for a table, and therefore there should NEVER be a "default" clustering key for a table. Using rote rules to pre-determine the cluster key instead of analyzing that specific table's usage is a grave mistake.

    If you prefer an appeal to authority, I can cite Kalen Delaney. She's much more careful and circumspect when dealing with this issue:

    "Since you can have only one clustered index per table, you should choose it carefully based on the most critical retrieval operations. ... If a table is declared with a primary key (which is advisable), by default a clustered index will be built on the primary key columns. However, this is not always the best choice. The primary key is, by definition, unique, and nonclustered indexes are very efficient at finding a single qualifying rows and can still enforce the primary key's uniqueness. So save your clustered index for something that will benefit more from it by adding the keyword NONCLUSTERED when you declare the PRIMARY KEY constraint." Inside Microsoft SQL Server 2005: Query Tuning and Optimization, pages 232-3.

    So, yes, make your own key choice, but do it based on careful analysis, not just to follow some simplistic "rule".

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • As a bit of a side interest on this problem, does this table suffer from any INSERTs from a GUI?

    If so, are the INSERTs done in order by [SubscriberID],[UserID],[DeviceID] ?

    I assume that the bulk 10M row inserts are not, correct?

    Hi Jeff. This table does not have any inserts from a GUI. Are you asking if the source table is sorted by [SubscriberID],[UserID],[DeviceID]? It is not.

  • Scott,

    I'm about to go on a long weekend to New Orleans, so I'm in a really mellow mood right now and full of goodwill towards all mankind. To quote an ancient meme: "What we've got here is failure to communicate." But maybe we can end this on some common ground before the day is out.

    You seem to think that I am propounding an absolute rule, like an eleventh commandment. I am not. I have even pointed out at least one exception to that rule, and the reason for that exception. The Tripp articles were cited for their reasoning about why an identity column makes an excellent clustered key. I respect Kalen Delaney and her work as much as I respect Kimberly Tripp. I also respect people like Jeff Moden and Gail Shaw and Grant Fritchey and a lot of other people who are willing to share their knowledge without ever getting an MVP. But none of their statements should be followed blindly; it's the reasoning behind their positions that is important.

    You said that it was important to choose a clustered key carefully. You and I are in 100% agreement on that. But you gave no guidance as to what sort of things should be considered. Saying "create a good clustered index" is like telling someone to "have a good db design." It's telling them what to do without giving them any guidance as to how to do it. That's why I respectfully asked you for an example.

    We also agree that you never know in advance what nonclustered indexes will need to be created on a table. But this is why a small clustered index key is important, to avoid the bloat created by the inclusion of clustered key column(s) in all non-clustered indexes. The link below illustrates that bloat, but unfortunately it also comes from Kimberly Tripp. That doesn't make the math wrong, though.

    http://www.sqlskills.com/blogs/kimberly/how-much-does-that-key-cost-plus-sp_helpindex9/
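
    For anyone who wants to measure this on their own tables, a rough sketch (the table name is hypothetical); since every nonclustered index silently carries the clustering key columns, a wider clustering key shows up here as extra pages in every index:

    -- Per-index space usage for one table (hypothetical table name).
    SELECT  i.name                               AS index_name,
            i.type_desc,
            SUM(ps.used_page_count) * 8 / 1024.0 AS used_mb
    FROM    sys.dm_db_partition_stats AS ps
    JOIN    sys.indexes               AS i
            ON  i.[object_id] = ps.[object_id]
            AND i.index_id    = ps.index_id
    WHERE   ps.[object_id] = OBJECT_ID(N'dbo.BigTable')
    GROUP BY i.name, i.type_desc
    ORDER BY used_mb DESC;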


    Why do we care about bloat? Because non-clustered indexes, like tables, fight for space in memory. Paging in and out slows us down. That's why we don't create a "gazillion" indexes. We have to tune the environment, not just individual queries.

    I also agree that merges go best when both tables are sorted by the clustered key. It's a known best practice, the reasoning behind it makes sense, and I've seen the results firsthand through testing. But you have to ask whether the merge is the most critical thing. Sometimes it is, sometimes it isn't. Do we insert data once and read it 10,000 times? As in all things SQL, it depends. You have to know your data and know your applications.

    Please don't think I'm talking down to you. I'm sure you are aware of all these things. It's really meant for any future readers. At least our discussion gave them something to think about. Hope you have as enjoyable a weekend as I'm planning on having.

    Best regards,

    Bob

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

    But that's the problem that is overlooked in the all-out desire to achieve a "small clustering key". Often the "always a small clustering key" approach requires creating far more nonclustered, covering indexes than would otherwise be required.

    I would certainly never normally suggest keys on varchar(100)+ columns unless it's absolutely unavoidable. That's an extreme case obviously.

    But if I needed, say, 3 ints instead of 1 int to get a proper clustering key, that wouldn't faze me a bit.
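
    For example, something along these lines (hypothetical, reusing the column names from this thread and assuming the IDs are ints):

    -- A 3-int composite clustering key is still only 12 bytes wide.
    CREATE UNIQUE CLUSTERED INDEX CIX_UserDevice
        ON [Dim].[UserDevice] ([SubscriberID], [UserID], [DeviceID]);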

    Edit: Corrected wording.

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

    And I realize that an absolute rule is not the intent of what you're doing, but that is what very often happens now. Identity keys have become so much of a "general rule" that they are often applied immediately to every table created, by default, no questions asked. And it stems back to this obsession with a short, ever-increasing key, which then becomes the ultimate goal. But that should not be the ultimate goal, just a worthy goal when it can be applied.

    Yes, I'm only specifying a general principle. I can't possibly tell you how to properly cluster future tables now; that requires the relevant data knowledge and the skill to analyze the trade-offs. Btw, and thank goodness for that, that's MY job!

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • Jeff Moden (7/16/2014)


    Igor Micev (7/14/2014)


    --Extend this index to make it coverable for the query:
    CREATE UNIQUE NONCLUSTERED INDEX [UIX_did_uid_sid] ON [Dim].[UserDevice]
    (
        [DeviceID] ASC,
        [UserID] ASC,
        [SubscriberID] ASC
    )
    INCLUDE ([Locale], [LastUsedDate])

    --Create this index
    CREATE UNIQUE NONCLUSTERED INDEX [UIX_NCSTransaction_IDs] ON [Ext].[NCSTransaction]
    (
        [SubscriberID], [UserID], [DeviceID]
    )
    INCLUDE ([Locale], location, [TransactionStartDateTime])

    If the "select" part of the merge statement takes 4 minutes, then you'd better replace that code with a temp table and put the second index on the temp table.

    I think there is a chance to improve it a lot. So it's your turn.
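
    A minimal sketch of that temp-table approach (the column list is abbreviated and illustrative):

    -- Stage the "select" part of the MERGE into a temp table first,
    -- then build the suggested index on the temp table.
    SELECT  [SubscriberID], [UserID], [DeviceID],
            [Locale], location, [TransactionStartDateTime]
    INTO    #NCSTransactionStage
    FROM    [Ext].[NCSTransaction];

    CREATE NONCLUSTERED INDEX IX_Stage_IDs ON #NCSTransactionStage
        ([SubscriberID], [UserID], [DeviceID])
        INCLUDE ([Locale], location, [TransactionStartDateTime]);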

    Since the ROW_NUMBER partitioning in the query is...

    row_number() over(partition by [SubscriberID],[UserID],[DeviceID] order by transactionstartdatetime desc) as rn

    ... you might eke out a bit more performance if you made transactionstartdatetime desc part of that second index's key rather than part of its INCLUDE list.
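
    If I'm reading that right, something along these lines (a sketch only; whether the index can remain UNIQUE depends on whether the three IDs are already unique in the extract table):

    -- TransactionStartDateTime moved from the INCLUDE list into the key, descending,
    -- so it matches the ROW_NUMBER() ORDER BY.
    CREATE NONCLUSTERED INDEX [UIX_NCSTransaction_IDs] ON [Ext].[NCSTransaction]
    (
        [SubscriberID] ASC,
        [UserID] ASC,
        [DeviceID] ASC,
        [TransactionStartDateTime] DESC
    )
    INCLUDE ([Locale], location);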

    Hi Jeff,

    I'm not sure about your proposal.

    Example:

    CREATE NONCLUSTERED INDEX [Index1] ON [dbo].[FactProductInventory]
        ([ProductKey] ASC,
         [DateKey] ASC)

    CREATE NONCLUSTERED INDEX [Index2] ON [dbo].[FactProductInventory]
        ([ProductKey] ASC)
        INCLUDE ([DateKey])

    DBCC DROPCLEANBUFFERS
    SET STATISTICS IO ON
    SELECT t.ProductKey, t.DateKey
    FROM dbo.FactProductInventory t
    SET STATISTICS IO OFF
    /* Index2 used automatically by SQL Server
    (776286 row(s) affected)
    Table 'FactProductInventory'. Scan count 1, logical reads 1352, physical reads 3, read-ahead reads 1362, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    (1 row(s) affected)
    */

    DBCC DROPCLEANBUFFERS
    SET STATISTICS IO ON
    SELECT t.ProductKey, t.DateKey
    FROM dbo.FactProductInventory t WITH (INDEX (Index1))
    SET STATISTICS IO OFF
    /* Forcing Index1
    (776286 row(s) affected)
    Table 'FactProductInventory'. Scan count 1, logical reads 1352, physical reads 3, read-ahead reads 1362, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    (1 row(s) affected)
    */

    Both columns are INT type. The query optimizer chose to use the narrower index (Index2). I think that would hold even more firmly in the real scenario above.

    Igor Micev,My blog: www.igormicev.com
