Primary Key creation - before and after bulk insert

  • I need to insert a large (less than half a million) number of records into a table at once. This happens in a query, so the table in question is a temporary table. I need to use this temporary table in other parts of the query for a lot of processing; so it is essential to add a clustered index (I create a Primary Key).

    What will yield me better performance -

    PK creation after insert is completed

    or

    PK creation at the time of table creation (before insert)

    or

    does it depend on other factors such as the number of columns in the table and other indexes and constraints on the table?

    I have believed that it is always better to create the PK after a bulk insert. I vaguely remember an instance when it yielded me better performance. But I tried to reproduce both the cases using Jeff's Tally table example and some other resource. In this case, PK creation at the time of table creation proved to be faster.

    I will list my questions below.

    1. Do I create PK before or after the insert (or if it depends, what are the factors)?

    2. The two queries attached have different execution times. From the execution plans, in case of primary key creation after insert, after insertion of data the following steps happen - Table scan-->Sort-->Index Insert. I had thought that data will only be reorganized in this case. What happens during the steps 'in bold' and the 'Index Insert' step in particular?

    3. If I use table instead of temporary table, is the behavior similar?

    Please see the queries below.

    --=== Index creation before insert

    DECLARE @StartTime DATETIME --Timer to measure total duration

    SET @StartTime = GETDATE() --Start the timer

    --=============================================================================

    -- Create and populate a Tally table

    --=============================================================================

    --===== Conditionally drop

    IF OBJECT_ID('tempdb..#Tally') IS NOT NULL

    DROP TABLE #Tally

    CREATE TABLE #Tally(N INT, PRIMARY KEY(N))

    --===== Insert (1000 x 100) number of rows

    INSERT INTO #Tally(N)

    SELECT ROW_NUMBER() OVER(ORDER BY a.number ASC)

    FROM master.dbo.spt_values a

    CROSS JOIN master.dbo.spt_values b

    WHERE a.type = 'P' AND b.type = 'P' AND

    a.number < 1000 AND b.number < 100

    --===== Display the total duration

    SELECT STR(DATEDIFF(ms,@StartTime,GETDATE())) + ' Milliseconds duration' [PK at table creation]

    --=== Index creation after insert

    DECLARE @StartTime DATETIME --Timer to measure total duration

    SET @StartTime = GETDATE() --Start the timer

    --=============================================================================

    -- Create and populate a Tally table

    --=============================================================================

    --===== Conditionally drop

    IF OBJECT_ID('tempdb..#Tally') IS NOT NULL

    DROP TABLE #Tally

    CREATE TABLE #Tally(N INT not null)

    INSERT INTO #Tally(N)

    SELECT ROW_NUMBER() OVER(ORDER BY a.number ASC)

    FROM master.dbo.spt_values a

    CROSS JOIN master.dbo.spt_values b

    WHERE a.type = 'P' AND b.type = 'P' AND

    a.number < 1000 AND b.number < 100

    --===== Add a Primary Key

    ALTER TABLE #Tally

    ADD PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100

    --===== Display the total duration

    SELECT STR(DATEDIFF(ms,@StartTime,GETDATE())) + ' Milliseconds duration' [PK after table creation]

    Thanks in advance.

    - arjun

    https://sqlroadie.com/

  • Arjun Sivadasan (8/13/2010)

    1. Do I create PK before or after the insert (or if it depends, what are the factors)?

    well it depends , sorting willbe done in either of cases, but in case of inserting data into heap then creating index give benefit.i think two operation ( insertion and index creation ) hapoen one by one ,results in less locking.but again , testing would give concrete answer for your query.

    Arjun Sivadasan (8/13/2010)

    2. The two queries attached have different execution times. From the execution plans, in case of primary key creation after insert, after insertion of data the following steps happen - Table scan-->Sort-->Index Insert. I had thought that data will only be reorganized in this case. What happens during the steps 'in bold' and the 'Index Insert' step in particular?

    index page creation

    Arjun Sivadasan (8/13/2010)

    3. If I use table instead of temporary table, is the behavior similar?

    it should.

    -------Bhuvnesh----------
    I work only to learn Sql Server...though my company pays me for getting their stuff done;-)

  • Do I create PK before or after the insert (or if it depends, what are the factors)?

    I would definitely create my indexes/PK's after the data is loaded. That way the statistics will be up to date and provide a better query plan. From what you said, your own testing has verified this. There is one issue which I have run into, and that is when the table is created and the indexes/statistics are created - If you need to delete and reload the data, then the queryplan can be way of. Also, I have run into an issue where deleting the data from a table and recompiling the stored procedures on that table, whether temp or permanent, the query plan has been way off.

    If I use table instead of temporary table, is the behavior similar?

    This one should be based upon your system. A Temp Table (#<tablename) is actually created in tempdb, and thus on the drive where your tempdb is located. If you have a very busy system, then there will definitely be an I/O issue as other processes will be competing for disk space and IO. You would have to test various configurations to determine the best. Optins could be to create a temp table, a permanent table, maybe a file group on another drive and place the table on that. These are just of the top of my head.

    Good luck

    Steve Jimmo
    Sr DBA
    “If we ever forget that we are One Nation Under God, then we will be a Nation gone under." - Ronald Reagan

  • One additional thing to consider is what is the order of your data? If the order of the data you are inserting very closly matches the clustered index then you will get much better performance than if it doesn't.

    I wrote a piece of code a number of years back to upgrade between 2 versions of a piece of software. In the end for speed purposes we had to drop and save all of the indexes then re-create them at the end. This was not always the fastest way to move the data though so we had to put in a flag, drop indexes or not. The determining factor was of course size of the table. And unfortunatly it wasn't an exact number of rows, and we ended up using testing and experience to decide which way to go.

    I think however that in your case (aprox half a mil) you are safe enough creating the index at the end. However, again you will get alot of bang for your buck if you insert the data in the order of the clustered index. Even realizing the cost of doing an ORDER BY on your query.

    Kenneth FisherI was once offered a wizards hat but it got in the way of my dunce cap.--------------------------------------------------------------------------------For better, quicker answers on T-SQL questions, click on the following... http://www.sqlservercentral.com/articles/Best+Practices/61537/[/url]For better answers on performance questions, click on the following... http://www.sqlservercentral.com/articles/SQLServerCentral/66909/[/url]Link to my Blog Post --> www.SQLStudies.com[/url]

  • If the clus index key is an IDENTITY column, you can create that beforehand; SQL "knows" the INSERTs will always be sequential.

    You may be able to create the clus index beforehand if you make sure the data is sorted in clus key sequence when loaded. That is:

    CREATE TABLE tablename ( ..., UNIQUE CLUSTERED (col1, col2) )

    INSERT INTO tablename

    SELECT col1, col2, ...

    FROM ...

    WHERE ...

    ORDER BY col1, col2

    You will have to test this to be sure.

    Scott Pletcher, SQL Server MVP 2008-2010

  • Hey Arjun,

    There are a few important things here that determine the fastest method:

    First, you must avoid an explicit sort operation if possible. In your first script (where the clustered index exists before the load) you are loading the result of a ROW_NUMBER ranking function into an INTEGER column. ROW_NUMBER returns BIGINT; the type mismatch is enough to confuse the optimiser in thinking it needs to sort the results before inserting into the table.

    Second, try to achieve a minimally-logged load. In SQL Server 2005, an INSERT...SELECT statement is always fully logged. The minimally-logged alternative is SELECT...INTO. This does mean creating the clustered index as a second operation, but the benefit of minimal logging more than makes up for this overhead. Just to confuse matters, SQL Server 2008 can do a minimally-logged INSERT...SELECT in many circumstances.

    Finally, try not to use master.dbo.spt_values - use a documented system table instead. The reason is that the internals of spt_values changed significantly in SQL Server 2008, making it a very slow way to grab rows.

    The fastest method (for me) depends on which version of SQL Server I use. The lack of a minimally-logged INSERT in SQL Server 2005 means that SELECT...INTO followed by a clustered index build is faster. The improvements in SQL Server 2008 make INSERT...SELECT (with an existing clustered index) faster.

    So, my 'improved' version of the first query (pre-existing index) is:

    --

    -- Load into an existing index

    -- (fastest on 2008)

    --

    IF OBJECT_ID(N'tempdb..#Tally', N'U')

    IS NOT NULL

    BEGIN

    DROP TABLE #Tally;

    END;

    GO

    DECLARE @StartTime DATETIME;

    SET @StartTime = GETDATE();

    CREATE TABLE #Tally

    (

    N BIGINT NOT NULL PRIMARY KEY CLUSTERED

    );

    INSERT #Tally

    (N)

    SELECT TOP (10000)

    ROW_NUMBER() OVER (ORDER BY (SELECT 1))

    FROM master.sys.allocation_units A,

    master.sys.allocation_units B;

    SELECT [INSERT...SELECT] = DATEDIFF(MILLISECOND, @StartTime, GETDATE());

    GO

    DROP TABLE #Tally;

    'Improved' version of the second query (load then index) is:

    --

    -- Load then create index

    -- (fastest on 2005)

    --

    IF OBJECT_ID(N'tempdb..#Tally', N'U')

    IS NOT NULL

    BEGIN

    DROP TABLE #Tally;

    END;

    GO

    DECLARE @StartTime DATETIME;

    SET @StartTime = GETDATE();

    SELECT TOP (10000)

    ISNULL(ROW_NUMBER() OVER(ORDER BY (SELECT 1)), -1) AS N

    INTO #Tally

    FROM master.sys.allocation_units A,

    master.sys.allocation_units B;

    ALTER TABLE #Tally

    ADD PRIMARY KEY CLUSTERED (N)

    WITH

    FILLFACTOR = 100;

    SELECT [SELECT...INTO] = DATEDIFF(MILLISECOND, @StartTime, GETDATE());

    GO

    DROP TABLE #Tally;

    Typical run times on 2008:

    Query 1: 13ms

    Query 2: 23ms

    Typical run times on 2005:

    Query 1: 96ms

    Query 2: 20ms

    Paul

  • Gentlemen, thanks for the answers. I have more questions. I will post them a little later. I am currently using my phone to browse. I would like to add that I found creating the primary key after the bulk insert slower in this case. I am sorry for the delay in my response. I will update soon.

    - arjun

    https://sqlroadie.com/

  • Thanks a lot for the response Steve. I am sorry I took some time to post. Weekend Fever!! 🙂

    sjimmo (8/13/2010)


    I would definitely create my indexes/PK's after the data is loaded. That way the statistics will be up to date and provide a better query plan. From what you said, your own testing has verified this.

    No, in the sample I had posted, index creation after insertion took more time but, I think there were other factors at play, as Paul has explained. Say, I create the index in the table definition. How will that affect statistics or query plan? If temporary tables are used, execution plan will be changed once the table has data, won't it be? (Ref: http://support.microsoft.com/kb/243586)

    There is one issue which I have run into, and that is when the table is created and the indexes/statistics are created - If you need to delete and reload the data, then the queryplan can be way of. Also, I have run into an issue where deleting the data from a table and recompiling the stored procedures on that table, whether temp or permanent, the query plan has been way off.

    By way off, do you mean the new plan yields less performance or that it is way different from the older plan?

    - arjun

    https://sqlroadie.com/

  • Thanks Kenneth and Scott. I will reexamine my queries with this in mind.

    -arjun

    https://sqlroadie.com/

  • Hi Paul, it's good to hear from you. 🙂 And a great answer, thanks a lot.

    Paul White NZ (8/13/2010)


    First, you must avoid an explicit sort operation if possible. In your first script (where the clustered index exists before the load) you are loading the result of a ROW_NUMBER ranking function into an INTEGER column. ROW_NUMBER returns BIGINT; the type mismatch is enough to confuse the optimiser in thinking it needs to sort the results before inserting into the table.

    Finally, try not to use master.dbo.spt_values - use a documented system table instead. The reason is that the internals of spt_values changed significantly in SQL Server 2008, making it a very slow way to grab rows.

    I don't usually use undocumented stuff as they are bound to change or get dropped. I picked that spt_values part from another post. I was not aware that row_number returns bigint. It is really good to know that as I tend to use row_number function a lot.

    Second, try to achieve a minimally-logged load. In SQL Server 2005, an INSERT...SELECT statement is always fully logged. The minimally-logged alternative is SELECT...INTO. This does mean creating the clustered index as a second operation, but the benefit of minimal logging more than makes up for this overhead. Just to confuse matters, SQL Server 2008 can do a minimally-logged INSERT...SELECT in many circumstances.

    The fastest method (for me) depends on which version of SQL Server I use. The lack of a minimally-logged INSERT in SQL Server 2005 means that SELECT...INTO followed by a clustered index build is faster. The improvements in SQL Server 2008 make INSERT...SELECT (with an existing clustered index) faster.

    I was not aware of these benefits of SELECT.. INTO. It's really good to know this.

    I have another question for you Paul. Say, I have an SP which uses a temporary table. The number of rows in the temporary table depends on the SP parameters. So, for one set of parameters, the table has 100 rows and for another set, the table has 100, 000 rows. Will the execution plan be changed once the 100,000 rows are inserted? I've always assumed it to. Ref: http://support.microsoft.com/kb/243586

    - arjun

    https://sqlroadie.com/

  • Arjun Sivadasan (8/16/2010)


    I have another question for you Paul. Say, I have an SP which uses a temporary table. The number of rows in the temporary table depends on the SP parameters. So, for one set of parameters, the table has 100 rows and for another set, the table has 100, 000 rows. Will the execution plan be changed once the 100,000 rows are inserted? I've always assumed it to. Ref: http://support.microsoft.com/kb/243586- arjun

    Obviously, Paul is very right person to respond on this.but i can also give a try. less records might lead to table scan and more records will get benefit from indexes.So there you can see different exec plan.

    -------Bhuvnesh----------
    I work only to learn Sql Server...though my company pays me for getting their stuff done;-)

  • Paul two question for you 🙂

    first, why optimizer will get confused by rownumber ( it return bigint value) for int type output ?

    Second, SELECT ...INTO approach is minimal logged operation. why ? or any related article ?

    -------Bhuvnesh----------
    I work only to learn Sql Server...though my company pays me for getting their stuff done;-)

  • Hey, thanks Bhuvnesh. Steve had said - "I would definitely create my indexes/PK's after the data is loaded. That way the statistics will be up to date and provide a better query plan." I am trying to figure this out; will wait for Paul's and Steve's comments on this.

    - arjun

    https://sqlroadie.com/

  • Arjun Sivadasan (8/16/2010)


    Hi Paul, it's good to hear from you. 🙂 And a great answer, thanks a lot.

    Thanks! - and you're welcome.

    I have another question for you Paul. Say, I have an SP which uses a temporary table. The number of rows in the temporary table depends on the SP parameters. So, for one set of parameters, the table has 100 rows and for another set, the table has 100, 000 rows. Will the execution plan be changed once the 100,000 rows are inserted? I've always assumed it to.

    Adding just 6 rows to an empty temporary table in a procedure, will cause later statements that depend on that table to recompile for plan-optimality reasons. See http://technet.microsoft.com/en-us/library/ee343986(SQL.100).aspx - the section Causes of Recompilations, about half-way through. Read the whole paper though, it is excellent, and full of things most people don't know.

    Paul

  • Thanks Paul. I will read it and trouble you if i have any doubt 🙂

    - arjun

    https://sqlroadie.com/

Viewing 15 posts - 1 through 15 (of 28 total)

You must be logged in to reply to this topic. Login to reply