Index on table with 70+ million rows

  • Hi.

    I have a table with more than 70 million rows that will be used by many different SELECT queries. This table gets truncated once a day.

    Creating and rebuilding indexes on this table obviously takes a long time.

    If the table were only going to be accessed by a few queries, creating the necessary indexes would be fine. But this table is going to be hit by a lot of different SELECT queries, all with different predicates, and creating indexes to optimise all of them might not be worth it, given the size of the table and the time it takes to rebuild an index once the table has been populated.

    Any suggestions?

  • If you're truncating your table every day then you're starting with an empty table. I can't think of any reason, therefore, why you'd want to rebuild your indexes at all... unless I've missed something?

    John

    Well, the table is part of an ETL process, so it gets truncated and re-populated with approximately 70 million rows.

    Before I populate the table, I disable all the indexes; once it has been populated, I rebuild them.
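    In T-SQL terms, the pattern looks roughly like this (the table and index names here are just placeholders):

        -- Disable each non-clustered index before the load; disabling the
        -- clustered index would take the whole table offline, so leave it alone.
        ALTER INDEX IX_BigTable_Col1 ON dbo.BigTable DISABLE;
        ALTER INDEX IX_BigTable_Col2 ON dbo.BigTable DISABLE;

        TRUNCATE TABLE dbo.BigTable;

        -- ... bulk load the ~70 million rows here ...

        -- REBUILD re-enables the disabled non-clustered indexes in one pass.
        ALTER INDEX ALL ON dbo.BigTable REBUILD;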

    Just before the table is truncated, use the index usage DMVs to find out which indexes are the most useful. Do this on several consecutive days, until you've got a good picture of index use patterns, then use that to ditch any indexes that are infrequently accessed. You might even want to look at the missing index DMVs to see whether there are any indexes you could create that would help your queries. You then have to decide whether the performance benefit you get from the remaining indexes is worth the extra time they take to build.
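    For example, a rough sketch of the kind of queries I mean (dbo.BigTable is a placeholder for your table; run this in the database that holds it):

        -- How each index is actually being used (seeks/scans/lookups)
        -- versus what it costs to maintain (user_updates)
        SELECT  i.name, s.user_seeks, s.user_scans, s.user_lookups, s.user_updates
        FROM    sys.indexes AS i
        LEFT JOIN sys.dm_db_index_usage_stats AS s
                ON  s.object_id   = i.object_id
                AND s.index_id    = i.index_id
                AND s.database_id = DB_ID()
        WHERE   i.object_id = OBJECT_ID('dbo.BigTable');

        -- Indexes the optimizer wishes it had for your workload
        SELECT  d.equality_columns, d.inequality_columns, d.included_columns,
                gs.user_seeks, gs.avg_user_impact
        FROM    sys.dm_db_missing_index_details AS d
        JOIN    sys.dm_db_missing_index_groups AS g
                ON g.index_handle = d.index_handle
        JOIN    sys.dm_db_missing_index_group_stats AS gs
                ON gs.group_handle = g.index_group_handle
        WHERE   d.object_id = OBJECT_ID('dbo.BigTable');

    Bear in mind these DMVs are cleared on an instance restart, which is another reason to watch them over several days.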

    John

    Cool, thanks John.

    That is pretty much the line of thinking I had as well; I just needed another point of view.

  • Rowan-283474 (8/18/2016)


    Well, the table is part of an ETL process, so it gets truncated and re-populated with approximately 70 million rows.

    Before I populate the table, I disable all the indexes; once it has been populated, I rebuild them.

    My suggestion is to put it in a database that's in the SIMPLE Recovery Model and add synonyms to the other databases to point at it. Then, it's time to do the ol' "Now you see it, now you see it again" methodology.

    First, since the data in this new database is totally replaceable, you never have to back it up.

    Second, since it's in the SIMPLE Recovery Model, it's a fairly easy thing to take full advantage of "Minimal Logging". You won't have to rebuild the indexes. Just truncate and load following the guidelines for allowing minimal logging to work.
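    As a rough sketch (names are made up, and the usual minimal-logging rules about TABLOCK and an empty target still apply):

        TRUNCATE TABLE dbo.YourTable;

        -- SIMPLE recovery + TABLOCK on an empty table allows a minimally logged load
        INSERT INTO dbo.YourTable WITH (TABLOCK)
                (Col1, Col2, Col3)
        SELECT  Col1, Col2, Col3
        FROM    SourceDB.dbo.SourceTable;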

    Third, there should be two tables that are named differently than the original table was. Your synonyms should be named the same as the original table. Let's call the original table TableO and the two new tables Table1 and Table2.

    Originally, your synonyms (which are named "TableO") would all point at Table1. When the time comes to do your ETL thing, instead of doing anything to Table1, do it to Table2 (truncate and load with minimal logging). If that's successful, repoint the TableO synonyms to Table2 instead of Table1. Total "downtime" for the users is measured in milliseconds, and you don't have to worry about mid-query interruptions because the drop/repoint of the synonym won't be allowed while someone is using it; it'll simply wait until the coast is clear. If the load of Table2 errors out for any reason, you're still pointing at Table1. While the data may not be up to date, I'm thinking most of it will be, so you avoid crushing the users that may need it for legacy data. Users looking for today's data will simply be disappointed, but it won't crush them, either.

    Again, because of the minimal logging, you won't have to disable your indexes before loading stuff up if you do it all in one step. If you can't do it all in one step and still need to disable and rebuild indexes, then you can take advantage of minimal logging for the index rebuilds. Either way, because of the synonym repointing, your users will see no perceptible outage.
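    The repoint itself is just a drop and re-create of the synonym, something along these lines (hypothetical names matching the example above, repeated in each database that carries the synonym):

        BEGIN TRANSACTION;
            -- The drop waits for any in-flight queries against the synonym to clear.
            DROP SYNONYM dbo.TableO;
            CREATE SYNONYM dbo.TableO FOR LoadDB.dbo.Table2;
        COMMIT;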

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Rowan-283474 (8/18/2016)


    Creating and rebuilding indexes on this table obviously takes a long time.

    Does populating the table itself not take a long time?

    I'd start by deciding on an optimal clustered index, and then populate the table with records in the same order as the clustered index.

    That would kill about 80% of the index rebuild burden.

    The clustered index should not be dropped when the table is truncated.
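    Something like this, with made-up names:

        -- Created once; TRUNCATE TABLE leaves the (now empty) clustered index in place.
        CREATE CLUSTERED INDEX CIX_BigTable
            ON dbo.BigTable (TransactionDate, TransactionID);

        TRUNCATE TABLE dbo.BigTable;

        -- Load in clustered-key order so pages are written sequentially
        -- instead of being sorted and split during the insert.
        INSERT INTO dbo.BigTable (TransactionDate, TransactionID, Amount)
        SELECT  TransactionDate, TransactionID, Amount
        FROM    StagingDB.dbo.SourceData
        ORDER BY TransactionDate, TransactionID;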

    _____________
    Code for TallyGenerator

  • I have a few thoughts:

    1. Why truncate the table daily? Consider instead an incremental insert / update process.

    2. If your queries share even one common predicate, then a non-clustered index on that column would help prevent a full table scan.

    3. If you were using a more recent version of SQL Server, like 2012 or higher, I'd recommend a columnstore index for large, mostly static tables routinely accessed by non-sargable, ad-hoc style queries (see the sketch after this list). However, you are limited to standard clustered and non-clustered indexes in 2008.
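    For what it's worth, on 2012+ the syntax would be along these lines (names are made up; note that in 2012/2014 a nonclustered columnstore makes the table read-only until it is disabled or dropped, which happens to suit a truncate-and-reload table):

        CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_BigTable
            ON dbo.BigTable (TransactionDate, CustomerID, Amount);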

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

    Unfortunately, delta operations aren't possible: the source is line-of-business transactional records, and there is no way to link rows in my table back to the source table, thus the need for a fresh pull of the data.

    This in turn creates the need to disable or drop the indexes on my table before I pull the data, as loading into indexed tables is more resource-intensive. Once the data has been pulled, the indexes get rebuilt.

    Because this table will be heavily used for reporting, it will gain more and more indexes as time goes on, with the trade-off of having to rebuild more and more indexes to improve the speed at which data gets retrieved.

  • Rowan-283474 (8/22/2016)


    Unfortunately, delta operations aren't possible: the source is line-of-business transactional records, and there is no way to link rows in my table back to the source table, thus the need for a fresh pull of the data.

    This in turn creates the need to disable or drop the indexes on my table before I pull the data, as loading into indexed tables is more resource-intensive. Once the data has been pulled, the indexes get rebuilt.

    Because this table will be heavily used for reporting, it will gain more and more indexes as time goes on, with the trade-off of having to rebuild more and more indexes to improve the speed at which data gets retrieved.

    I'm guessing this table is reloaded daily and is read-only between loads, so there is no cost in creating as many non-clustered indexes as you need, except for disk space and initial index build time. Consider enabling compression on both the table and the indexes. In addition to saving disk space, I've seen performance improvements when querying from compressed tables and indexes, at least for bulk aggregate type queries.
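    For reference, the syntax is along these lines (names are placeholders; in 2008, data compression is an Enterprise Edition feature, and PAGE vs. ROW is a trade-off between space saved and CPU spent):

        -- Compress the base table (heap or clustered index)...
        ALTER TABLE dbo.BigTable REBUILD WITH (DATA_COMPRESSION = PAGE);

        -- ...and each non-clustered index.
        ALTER INDEX IX_BigTable_Col1 ON dbo.BigTable
            REBUILD WITH (DATA_COMPRESSION = PAGE);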

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Eric M Russell (8/22/2016)


    Rowan-283474 (8/22/2016)


    Unfortunately, delta operations aren't possible: the source is line-of-business transactional records, and there is no way to link rows in my table back to the source table, thus the need for a fresh pull of the data.

    This in turn creates the need to disable or drop the indexes on my table before I pull the data, as loading into indexed tables is more resource-intensive. Once the data has been pulled, the indexes get rebuilt.

    Because this table will be heavily used for reporting, it will gain more and more indexes as time goes on, with the trade-off of having to rebuild more and more indexes to improve the speed at which data gets retrieved.

    I'm guessing this table is reloaded daily and is read-only between loads, so there is no cost in creating as many non-clustered indexes as you need, except for disk space and initial index build time. Consider enabling compression on both the table and the indexes. In addition to saving disk space, I've seen performance improvements when querying from compressed tables and indexes, at least for bulk aggregate type queries.

    Thanks Eric.

    I have data compression enabled already.

    Please elaborate on what you mean by read-only between loads meaning there's no cost to creating the indexes?

  • Rowan-283474 (8/22/2016)


    Eric M Russell (8/22/2016)


    Rowan-283474 (8/22/2016)


    Unfortunately, delta operations aren't possible: the source is line-of-business transactional records, and there is no way to link rows in my table back to the source table, thus the need for a fresh pull of the data.

    This in turn creates the need to disable or drop the indexes on my table before I pull the data, as loading into indexed tables is more resource-intensive. Once the data has been pulled, the indexes get rebuilt.

    Because this table will be heavily used for reporting, it will gain more and more indexes as time goes on, with the trade-off of having to rebuild more and more indexes to improve the speed at which data gets retrieved.

    I'm guessing this table is reloaded daily and is read-only between loads, so there is no cost in creating as many non-clustered indexes as you need, except for disk space and initial index build time. Consider enabling compression on both the table and the indexes. In addition to saving disk space, I've seen performance improvements when querying from compressed tables and indexes, at least for bulk aggregate type queries.

    Thanks Eric.

    I have data compression enabled already.

    Please elaborate on what you mean by read-only between loads meaning there's no cost to creating the indexes?

    What I mean is that, if the table is read-only between loads, the indexes won't cost you anything in terms of performance. For example, if you over-index a transactional table, it can cause more I/O activity whenever rows are inserted, updated, or deleted; that's why the general advice is to be conservative about adding new indexes. In your case, however, it sounds like a read-only table, so there is no ongoing cost to maintain the indexes, only the initial time required to build them.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Eric M Russell (8/22/2016)


    Rowan-283474 (8/22/2016)


    Eric M Russell (8/22/2016)


    Rowan-283474 (8/22/2016)


    Unfortunately, delta operations aren't possible: the source is line-of-business transactional records, and there is no way to link rows in my table back to the source table, thus the need for a fresh pull of the data.

    This in turn creates the need to disable or drop the indexes on my table before I pull the data, as loading into indexed tables is more resource-intensive. Once the data has been pulled, the indexes get rebuilt.

    Because this table will be heavily used for reporting, it will gain more and more indexes as time goes on, with the trade-off of having to rebuild more and more indexes to improve the speed at which data gets retrieved.

    I'm guessing this table is reloaded daily and is read-only between loads, so there is no cost in creating as many non-clustered indexes as you need, except for disk space and initial index build time. Consider enabling compression on both the table and the indexes. In addition to saving disk space, I've seen performance improvements when querying from compressed tables and indexes, at least for bulk aggregate type queries.

    Thanks Eric.

    I have data compression enabled already.

    Please elaborate on what you mean by read-only between loads meaning there's no cost to creating the indexes?

    What I mean is that, if the table is read-only between loads, the indexes won't cost you anything in terms of performance. For example, if you over-index a transactional table, it can cause more I/O activity whenever rows are inserted, updated, or deleted; that's why the general advice is to be conservative about adding new indexes. In your case, however, it sounds like a read-only table, so there is no ongoing cost to maintain the indexes, only the initial time required to build them.

    Ah yes, understood. For sure.

    Thanks for the input.

    Do you really have a case where the source table does not have any kind of timestamp or auto-incrementing ID?

    _____________
    Code for TallyGenerator

  • Rowan-283474 (8/22/2016)


    ...the source is line-of-business transactional records, and there is no way to link rows in my table back to the source table, thus the need for a fresh pull of the data.

    That may be a part of the key to all of this. How are you loading all that data from the remote source into your table? Is it something like a CSV or Tab delimited file or ???

    Also, how big is the table in GB?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
