Creating indexes on a very large table takes over 5 hours

  • I have a table that is 125GB.  It only has 5 columns but has almost a billion rows.  This is a fact table that gets regenerated nightly.  I currently do this by creating a staging table, inserting the data, and then creating the indexes.  I have tried doing the opposite: creating the indexes first and then inserting the data, but it completely blows up the transaction log and I've never been able to have enough space.  There is currently no way to incrementally populate the table.

    Is there anything else I can do to speed up the indexing process?

    If exists(select 1 from sys.tables where name='Staging') DROP TABLE  Staging


    select distinct
    ul.primaryaccountcode,
    ul.USERID,
    l.loss_id_nb 'LossID',
    l.franid 'FranID',
    l.loss_nm 'LossNumber',
    l.FNOL_PROGRAM_ID as ProgramID
    INTO Staging
    From Table1 l
    INNER JOIN Table2 a on a.franid = l.franid and a.pri_acct_cd = l.pri_acct_cd
    INNER JOIN Table3 ul on ul.franid = l.franid and ul.primaryaccountcode = l.pri_acct_cd

    UNION ALL

    select distinct
    ul.primaryaccountcode,
    ul.USERID,
    l.loss_id_nb 'LossID',
    l.franid 'FranID',
    l.loss_nm 'LossNumber',
    c.programid
    From loss l
    INNER JOIN Table1 a on a.franid = l.franid and a.pri_acct_cd = l.pri_acct_cd
    INNER JOIN Table2 c on c.mxlossid = l.prj_loss_id_nb and c.active = 1
    INNER JOIN Table3 ul on ul.programid = c.programid

    GO

    CREATE CLUSTERED INDEX IX_FactLoss_LossID
    ON Staging ([LossID])
    WITH (FILLFACTOR=80, ONLINE=ON, SORT_IN_TEMPDB=OFF)
    GO

    CREATE NONCLUSTERED INDEX IX_FactLoss_LossID_PrimAcctCd_UserID
    ON Staging ([LossID], [UserID], [primaryaccountcode])
    WITH (FILLFACTOR=80, ONLINE=ON, SORT_IN_TEMPDB=OFF)
    GO

    CREATE NONCLUSTERED INDEX IX_FactLoss_UserID_PrimAcctCd_Includes
    ON [dbo].Staging ([USERID], [primaryaccountcode])
    INCLUDE ([LossID],[LossNumber])
    WITH (FILLFACTOR=80, ONLINE=ON, SORT_IN_TEMPDB=OFF)
    GO

    CREATE NONCLUSTERED INDEX IX_FactLoss_PrimAcctCd_UserID_Includes2
    ON [dbo].Staging ([primaryaccountcode],[USERID])
    INCLUDE ([LossID],[FranID],[LossNumber],[ProgramID])
    WITH (FILLFACTOR=80, ONLINE=ON, SORT_IN_TEMPDB=OFF)
    GO

    If exists(select 1 from sys.tables where name='FactLoss_Old') Drop Table FactLoss_Old


    EXEC sp_rename 'FactLoss', 'FactLoss_Old'
    EXEC sp_rename 'Staging', 'FactLoss'
  • there is currently no way to incrementally populate the table.

    ...  a [Huge] fact table that gets regenerated nightly

    This is a design/process issue more than a technical one. I'll put my 2 cents in to bump this topic up but can't offer a solution that doesn't involve incrementally building the index.

    If incrementally repopulating the table isn't an option then, for this type of thing I generally:

    1. Create a duplicate table with the indexes in place

    2. Perform an incremental batch insert routine to the new table (a sketch follows this list)

    3. Drop the old one and point data requests to the new table.
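
    A minimal sketch of the batch routine in step 2, assuming SIMPLE recovery and hypothetical dbo.FactLoss_Source / dbo.FactLoss_New tables (none of these names are the poster's actual objects):

    DECLARE @BatchSize bigint = 1000000;
    DECLARE @MinId bigint, @MaxId bigint;

    SELECT @MinId = MIN(LossID), @MaxId = MAX(LossID)
    FROM dbo.FactLoss_Source;

    WHILE @MinId <= @MaxId
    BEGIN
        INSERT INTO dbo.FactLoss_New
               (primaryaccountcode, USERID, LossID, FranID, LossNumber, ProgramID)
        SELECT primaryaccountcode, USERID, LossID, FranID, LossNumber, ProgramID
        FROM dbo.FactLoss_Source
        WHERE LossID >= @MinId
          AND LossID <  @MinId + @BatchSize;  -- one small transaction per batch

        SET @MinId = @MinId + @BatchSize;

        CHECKPOINT;  -- in SIMPLE recovery, lets log space be reused between batches
    END;

    Each batch commits on its own, which is what keeps the transaction log from blowing up the way one giant insert does.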

    This IS the solution, as far as I know. Since you're not talking about OLTP, there shouldn't be any interruption.

    The only other possible solution is to see if the index builds are happening with a parallel execution plan. That's an option in 2016 and will speed things up a ton if you're getting a serial plan for some reason.
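
    If the plan does come out serial, you can ask for parallelism explicitly. A sketch only; the MAXDOP value here is an assumption to be tuned to the server's cores:

    CREATE CLUSTERED INDEX IX_FactLoss_LossID
    ON dbo.Staging (LossID)
    WITH (SORT_IN_TEMPDB = ON, MAXDOP = 8);  -- allow up to 8 parallel workers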

    Again, I'd yell until someone allowed incremental updates. A big fact table repopulating daily is no way to live.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • I'd recommend trying the approach below.

    Specifically, the code below:

    1. Changes the fillfactor to 99.  80 is way too small (unless you do massive updates to almost the entire table later?! And even then, not really).

    2. Creates the clustered index before loading the table.  That saves a full extra load of the table.  If at all possible, make the index unique, since unique indexes perform much better overall in SQL Server.

    3. Uses page compression (if you're using SQL 2016, this is available in Standard Edition).

    4. Reduces the number of indexes.  If you truly must have an index headed by USERID -- and test that first -- add that index back in.

    5. Sets ONLINE = OFF and SORT_IN_TEMPDB = ON.

    If exists (select 1 from sys.tables where object_id = OBJECT_ID('dbo.Staging'))
    DROP TABLE dbo.Staging

    /* create table structure *only* */
    select distinct top (0)
    ul.primaryaccountcode,
    ul.USERID,
    l.loss_id_nb 'LossID',
    l.franid 'FranID',
    l.loss_nm 'LossNumber',
    l.FNOL_PROGRAM_ID as ProgramID
    INTO dbo.Staging
    From Table1 l
    INNER JOIN Table2 a on a.franid = l.franid and a.pri_acct_cd = l.pri_acct_cd
    INNER JOIN Table3 ul on ul.franid = l.franid and ul.primaryaccountcode = l.pri_acct_cd

    /* create clustered index *before* initial load of table */
    CREATE /*UNIQUE*/ CLUSTERED INDEX IX_FactLoss_LossID /* make UNIQUE if at all possible */
    ON dbo.Staging ( LossID, USERID )
    WITH ( DATA_COMPRESSION = PAGE, FILLFACTOR=99, ONLINE=OFF, SORT_IN_TEMPDB=ON )

    /* load the table */
    INSERT INTO Staging WITH (TABLOCK)
    select distinct
    ul.primaryaccountcode,
    ul.USERID,
    l.loss_id_nb 'LossID',
    l.franid 'FranID',
    l.loss_nm 'LossNumber',
    l.FNOL_PROGRAM_ID as ProgramID
    From Table1 l
    INNER JOIN Table2 a on a.franid = l.franid and a.pri_acct_cd = l.pri_acct_cd
    INNER JOIN Table3 ul on ul.franid = l.franid and ul.primaryaccountcode = l.pri_acct_cd
    UNION ALL
    select distinct
    ul.primaryaccountcode,
    ul.USERID,
    l.loss_id_nb 'LossID',
    l.franid 'FranID',
    l.loss_nm 'LossNumber',
    c.programid
    From loss l
    INNER JOIN Table1 a on a.franid = l.franid and a.pri_acct_cd = l.pri_acct_cd
    INNER JOIN Table2 c on c.mxlossid = l.prj_loss_id_nb and c.active = 1
    INNER JOIN Table3 ul on ul.programid = c.programid
    GO

    /* create nonclustered index(es) */

    CREATE NONCLUSTERED INDEX IX_FactLoss_PrimAcctCd_UserID_Includes2
    ON dbo.Staging ( primaryaccountcode, USERID ) /*whichever key first that better matches your queries*/
    INCLUDE ( FranID, LossID, LossNumber, ProgramID )
    WITH ( DATA_COMPRESSION = PAGE, FILLFACTOR=99, ONLINE=OFF, SORT_IN_TEMPDB=ON )
    GO

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • Using ONLINE=ON is most likely what is killing the index creation.

    But all the other comments are still valid.

  • pamkagel wrote:

    I have a table that is 125GB.  It only has 5 columns but has almost a billion rows.  This is a fact table that gets regenerated nightly.  I currently do this by creating a staging table, inserting the data, and then creating the indexes.  I have tried doing the opposite: creating the indexes first and then inserting the data, but it completely blows up the transaction log and I've never been able to have enough space.  There is currently no way to incrementally populate the table.

    Is there anything else I can do to speed up the indexing process?

    I have a couple of questions...

    1. How long does the code currently take just to repopulate the truncated table WITHOUT doing anything else such as rebuilding the indexes?
    2. Is the table to be repopulated in the same database as the source data?

      1. If not, what is the Recovery Model of the target database?
      2. If so, what is the Recovery Model of the current database?
      3. If the answer to question 2 is "Full Recovery", can we move the table you're repopulating to a different database that uses the Simple Recovery Model and point synonyms at it? (A small illustration follows these questions.)

    3. Is there some temporal column in the database where we could "divvy up" the target table and use a partitioned view to do a more rapid "parallel load"?
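
    To illustrate the synonym idea in question 2.3 (the database and table names here are hypothetical, not anything confirmed in this thread):

    -- StagingDB would be a separate database in SIMPLE recovery that
    -- actually holds the fact table.
    CREATE SYNONYM dbo.FactLoss FOR StagingDB.dbo.FactLoss;

    -- Existing queries against dbo.FactLoss keep working unchanged;
    -- repointing later is just a DROP SYNONYM and a new CREATE SYNONYM.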

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Are you sure you need that table at all?

    Why wouldn't you try to create a view based on the same query you use to populate the table, and then query that view instead of the table?
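
    For example (a sketch only; the view name is hypothetical, and the SELECT is the first branch of the load query above, with the UNION ALL branch appended the same way):

    CREATE VIEW dbo.vFactLoss
    AS
    SELECT DISTINCT
           ul.primaryaccountcode,
           ul.USERID,
           l.loss_id_nb      AS LossID,
           l.franid          AS FranID,
           l.loss_nm         AS LossNumber,
           l.FNOL_PROGRAM_ID AS ProgramID
    FROM Table1 l
    INNER JOIN Table2 a
            ON a.franid = l.franid
           AND a.pri_acct_cd = l.pri_acct_cd
    INNER JOIN Table3 ul
            ON ul.franid = l.franid
           AND ul.primaryaccountcode = l.pri_acct_cd;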

    _____________
    Code for TallyGenerator

  • Tried creating the view.  It was non-performant.

  • pamkagel wrote:

    Tried creating the view.  It was non-performant.

    Then you're doing it wrong.  Please answer the questions I posted, or I won't know enough to help you.

    --Jeff Moden



  • Thank you all for your responses.  I am testing out some of your suggestions.

    @jeffModen

    1. It currently takes about two hours to load the table.
    2. Same database, and the database is in Simple Recovery right now
    3. And unfortunately, no, there is no temporal column

  • I'm also trying to improve overall performance, not just the load time.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • pamkagel wrote:

    Tried creating the view.  It was non-performant.

    I had a closer look at the query and I'm not surprised it's not performant.

    First - what do you need the "distinct" thing for? Are you trying to eliminate some duplicates or is it "just in case"?

    Either way you should not use it. Ever. Just forget that keyword.

    If you know of cases of data duplication, find where they come from and eliminate them at the source.

    If it's not possible to eliminate them at the source - use GROUP BY.

    Second: no data from table "a" is used in the SELECT, therefore that table should be taken out of the FROM clause and moved into a WHERE EXISTS.

    I modified the query to be a better performer:

    select 
    ul.primaryaccountcode,
    ul.USERID,
    l.loss_id_nb 'LossID',
    l.franid 'FranID',
    l.loss_nm 'LossNumber',
    l.FNOL_PROGRAM_ID as ProgramID
    From Table1 l
    INNER JOIN Table3 ul on ul.franid = l.franid and ul.primaryaccountcode = l.pri_acct_cd
    WHERE EXISTS (select * from Table2 a where a.franid = l.franid and a.pri_acct_cd = l.pri_acct_cd)
    UNION ALL
    select
    ul.primaryaccountcode,
    ul.USERID,
    l.loss_id_nb 'LossID',
    l.franid 'FranID',
    l.loss_nm 'LossNumber',
    c.programid
    From loss l
    INNER JOIN Table2 c on c.mxlossid = l.prj_loss_id_nb and c.active = 1
    INNER JOIN Table3 ul on ul.programid = c.programid
    WHERE EXISTS (select * from Table1 a where a.franid = l.franid and a.pri_acct_cd = l.pri_acct_cd)

    Third. Try to run your reports from the view with the modified version of the query. Check the execution plans. See if some index tweaking on the underlying tables may improve performance.

    You may wish to post execution plans here - someone must be able to show you the way.

    _____________
    Code for TallyGenerator

  • Thank you, Sergiy.  I'm still doing some testing, but those query changes have made a difference so far.

  • @pamkagel ,

    Ok... you said that the entire table is regenerated nightly.  That makes me want to assume that you're not actually doing any updates or additional inserts to the table.  With that, I have to ask why on Earth your indexes are being built with an 80% Fill Factor.  That means you're wasting (100/80 - 1) * 100 = 25 percent of disk space and memory or, in the case of your 125GB table, you're wasting 31.25GB of disk space and RAM, and that's just for the Clustered Index.

    So, if you're not doing any updates or inserts (and you should NOT be because you throw the table away every day), build the indexes either with the default (which should default to 0 and that's the same as 100 unless you made the terrible mistake of changing the default) or explicitly build them at 100.

    You also mentioned that the database is in the SIMPLE recovery model.  That's actually good for this.  It will easily allow for "Minimal Logging", which will make your original build twice as fast from the get-go IF you set up for minimal logging.  Here's how to do that...

    Don't use SELECT/INTO because that won't build your clustered index.  Instead, explicitly create the table WITH the clustered index in place on the empty table.  Then, whatever SELECT you end up with, it must be in the following form to achieve "Minimal Logging", which is nasty fast and won't require you to build the clustered index separately...

    1.  Create the table with an explicit CREATE TABLE statement that also creates the Clustered Index.  This can be in the form of a Clustered PK constraint or as an explicit index.
    2. Do NOT create any non-clustered indexes yet and don't listen to people that will rail about Trace Flag 610.  It can work sometimes but about 99% of the time it won't (personal experience talking there).
    3. Do your insert in the following form...
     INSERT INTO dbo.yourtablename WITH (TABLOCK)
    (full column list here)
    SELECT full column list here
    FROM dbo.whatever
    JOIN dbo.whatever etc
    WHERE whatever relationship is necessary to get the right data but must be SARGable
    ORDER BY list of column(s) in the exact same order as the clustered index
    OPTION (RECOMPILE) --undocumented but usually required and won't be in execution plan if you're lucky with your joins
    ;

    The WITH(TABLOCK), ORDER BY in the correct column order, and the OPTION (RECOMPILE) are absolutely essential to get the minimal logging.
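
    To make that template concrete for this thread's table, here's a hedged sketch.  The column data types are guesses, and dbo.FactLoss_Source stands in for the full UNION ALL source query shown earlier; adjust both to reality:

    CREATE TABLE dbo.Staging
    (
        primaryaccountcode varchar(20) NOT NULL,  -- data types are assumptions
        USERID             int         NOT NULL,
        LossID             bigint      NOT NULL,
        FranID             int         NOT NULL,
        LossNumber         varchar(30) NOT NULL,
        ProgramID          int         NULL
    );

    CREATE CLUSTERED INDEX IX_FactLoss_LossID
    ON dbo.Staging (LossID, USERID);  -- clustered index in place *before* the load

    INSERT INTO dbo.Staging WITH (TABLOCK)
           (primaryaccountcode, USERID, LossID, FranID, LossNumber, ProgramID)
    SELECT primaryaccountcode, USERID, LossID, FranID, LossNumber, ProgramID
    FROM dbo.FactLoss_Source          -- stand-in for the real UNION ALL query
    ORDER BY LossID, USERID           -- same order as the clustered index
    OPTION (RECOMPILE);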

    Now, that code will be slower than SELECT/INTO but it will be much faster than SELECT/INTO followed by you building a Clustered Index.  It will also take half the memory, half the disk space, and a remarkably small amount of transaction log file space.  In one of my tests, I built two 120 MILLION row tables with row widths of 124 bytes each (including the unseen row headers) with the clustered indexes in place in 4 minutes flat for a total of 25.153GB of input, and it only used 600MB of log file space.  Adding a non-clustered index to each table took an additional minute and caused no additional growth.  It's a small example, but the 33GB total size of the database was all created in a little over 5 minutes and each table contained 120 MILLION rows.

    Let me know if there's a column that has a range that's different in the first query than in the second.  If so, there's another trick I can show you to pull off parallel loads into two separate tables using the high performance method of loading above and then using a "Partitioned View" to make them look like one table.
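
    For readers unfamiliar with the technique, a minimal partitioned-view illustration (the table names, the split column, and the split point are all hypothetical):

    -- Each half-table can be loaded in parallel with the minimally logged
    -- pattern above; the CHECK constraints let the optimizer treat the
    -- view as a single partitioned table.
    CREATE TABLE dbo.FactLoss_A
    (
        LossID bigint NOT NULL CHECK (LossID < 500000000),
        USERID int    NOT NULL
        -- ...remaining columns...
    );

    CREATE TABLE dbo.FactLoss_B
    (
        LossID bigint NOT NULL CHECK (LossID >= 500000000),
        USERID int    NOT NULL
        -- ...remaining columns...
    );
    GO

    CREATE VIEW dbo.FactLoss
    AS
    SELECT LossID, USERID FROM dbo.FactLoss_A
    UNION ALL
    SELECT LossID, USERID FROM dbo.FactLoss_B;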

    Last but not least and when the table is loaded with the clustered index in place, THEN build your non-clustered indexes at a 100 FILL FACTOR.

    Of course, you've actually already been told partially how to do this by Scott in his post above.  The two things he left out are the ORDER BY to guarantee the inserts will "follow the rules" for "Minimal Logging" and the undocumented OPTION (RECOMPILE), which won't hurt if it's actually not necessary and helps guarantee that you'll get the "Minimal Logging".  In that dual 120 million row insert test I told you about above, it was the difference between a 4 minute load/600MB log file and a 20 minute load/40GB log file.

    I still question the need to rebuild a table with "billions of rows" every night but don't know your data well enough to make a suggestion there.  Both Sergiy and Scott are also correct about the performance of the SELECTs being critical, but I can't make an informed suggestion because, like I said, I don't know your data and you've not included an Actual Execution Plan in your postings.  See the article at the second link in my signature line below for one of the best ways to post performance issues.

    --Jeff Moden



  • Jeff Moden wrote:

    Don't use SELECT/INTO because that won't build your clustered index.  Instead, explicitly create the table WITH the clustered index in place on the empty table... [full post quoted above]

    Those extra steps are not required for SQL 2016, at least according to MS internals folks.  For example, your statement about needing an explicit CREATE TABLE statement?  Again, I think that is for older SQL versions only.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • ScottPletcher wrote:

    Those extra steps are not required for SQL 2016, at least according to MS internals folks.  For example, your statement about needing an explicit CREATE TABLE statement?  Again, I think that is for older SQL versions only.

    Apologies... You basically have an explicit CREATE TABLE in the "rowless" SELECT/INTO statement, and that will also do.  The thing I'm trying to point out is that something needs to explicitly create the empty table and add the Clustered Index (like we've both done) prior to doing the inserts.  My CREATE TABLE isn't an "extra step" any more than your "rowless" SELECT/INTO.  Either will work.

    --Jeff Moden


