Whether to leave a clustered index on a staging table

  • Hi

    Using dataflows to load data from one SQL table to another (seperate databases; not an exact copy)

    Historically, best practice suggests loading data to a table that has no (nonclustered) indexes on it then adding the indexes post insert

    So, disable and enable

    When following the truncate destination and load method, what's the best approach with clustered indexes

    I've been playing around with sizable test data and it doesn't seem to make much difference whether they're left there permanently or dropped and recreated. There is a minor difference when left on though.

    Reading around there's contrasting views on the drop/recreate vs leave approach specifically for clustered indexes

    What's your thoughts?

     

    Thanks

    - Damian

  • Thanks for posting your issue and hopefully someone will answer soon.

    This is an automated bump to increase visibility of your question.

  • It's a "staging" table, right?  Do you actually need a Clustered Index on it?  While it usually is a "Best Practice" to have a good Clustered Index on permanent tables, the rules can easily change when it comes to "staging" tables.

    And be really careful about adopting things that people say are "Best Practices".   There are a lot of good ones and probably more that are not actually a "Best Practice".  For example, the "Best Practice" of reorganizing indexes that are between 5 and 30% logically fragmented and rebuilding only after they're greater than 30% was never intended to be a "Best Practice".  Some rather unfortunate wording in the SQL Server documentation followed by the fact that no one pays attention to the related footnotes compounded by the fact that most people don't actually know what REORGANIZE does has totally misled people and it's actually a "Worst Practice" in a whole lot of cases.  Then there's the fellow that said it's a "Best Practice" to use rCTEs (recursive CTEs) to increment a count instead of using a Tally table or cCTEs (cascading CTEs) and another fellow that supposed "proved" that using XML to split CSVs is the "best" method.

    You have to try different things for your given requirements and conditions.  For example, are you aware of of all that is necessary to do "Minimally Logged" loads into empty tables that have no NCIs but does have a CI?  Have a look at the MS documentation on the subject and then try it and also try it without the CI and build the CI (only if you need it).  Sometimes it makes a heck of a difference and sometimes it doesn't.  The words "It Depends" ring true, especially on things like this, and you need to experiment for your given situation.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Testing is your buddy. Lots and lots of testing. There is no true one-size-fits-all approach to doing things in SQL Server.

    Since the clustered index itself redefines data storage, the choice to have or not have a clustered index is not the same as nonclustered indexes. If your staging table is just used to capture data in flight and then transfer it elsewhere, a clustered index may not be useful. However, if you join and/or filter that staging table a lot after, or during, the load process, a clustered index could be very useful indeed.

    You'll have to experiment & test to be sure for your situation. I do, generally, default to having a clustered index, but, again, testing to be sure.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • I agree with the comments made by Jeff and Grant. There are many variables here, and it all depends on what you are doing to the data once it's in that destination table.

    Personally speaking, I usually don't have any indexes on my staging tables (i.e. heaps) and have found that unless I am dealing with millions of records and/or complex logic, I don't need any.

    Given the way my ETL processes are designed, it usually benefits more from the heap inserts than the overhead associated with index updates, and scanning the staging table a few times during the rest of the process is typically not worth creating indexes at all in most of my use cases.

    That being said, this is my starting point and I adjust as I need to. There are always exceptions and unless you test for your specific scenario, you just won't know for sure.

  • Thanks all

    So, I've spent a little bit of time on this

    I am doing a little bit of manipulation on staging (also comparing to what exists) before moving it live - so I am finding staging indexes useful

    Here I do all this in an interim live table and partition switch live as a final step

    It's a full load every time as it's ever changing 3rd party data

    So, (in with the testing theme that you have all suggested above), some tables are faster and some slower

    I've asked the question as I was advised not to drop and recreate clustered indexes every time. To either leave them on or not have them at all (as pointed out in your answers). Again, downstream, this slows some of my tranformation actions down.

    Thing is, I have a couple of examples where leaving them on is a lot slower

    Is there an appraoch to ordering first that's required. Must admit I've tried alsorts here and read a lot of different articles with no real success.

    Seems to me as though dropping and recreating an index sometimes is the best way.

    Are there any negatices to that approach?

     

    Thanks

     

     

    - Damian

  • The only negative to any approach is when it's either destructive of your data, or doesn't work well. If, in your testing, dropping and recreating the indexes works better, then do that. It's really that simple. If it doesn't work better, don't do that. It all comes back around to testing in your environment, with your processes, and your data. As I said before, there are very few objects, steps, processes, methods, or anything else, within SQL Server that will always work better or will always work worse. Almost everything comes down to testing within your environment.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply