Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase 123»»»

Primary Key creation - before and after bulk insert Expand / Collapse
Author
Message
Posted Friday, August 13, 2010 5:15 AM


SSC-Addicted

SSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-Addicted

Group: General Forum Members
Last Login: Tuesday, December 2, 2014 11:59 PM
Points: 406, Visits: 780
I need to insert a large (less than half a million) number of records into a table at once. This happens in a query, so the table in question is a temporary table. I need to use this temporary table in other parts of the query for a lot of processing; so it is essential to add a clustered index (I create a Primary Key).

What will yield me better performance -
PK creation after insert is completed
or
PK creation at the time of table creation (before insert)
or
does it depend on other factors such as the number of columns in the table and other indexes and constraints on the table?

I have believed that it is always better to create the PK after a bulk insert. I vaguely remember an instance when it yielded me better performance. But I tried to reproduce both the cases using Jeff's Tally table example and some other resource. In this case, PK creation at the time of table creation proved to be faster.

I will list my questions below.

1. Do I create PK before or after the insert (or if it depends, what are the factors)?
2. The two queries attached have different execution times. From the execution plans, in case of primary key creation after insert, after insertion of data the following steps happen - Table scan-->Sort-->Index Insert. I had thought that data will only be reorganized in this case. What happens during the steps 'in bold' and the 'Index Insert' step in particular?
3. If I use table instead of temporary table, is the behavior similar?

Please see the queries below.

--=== Index creation before insert

DECLARE @StartTime DATETIME --Timer to measure total duration
SET @StartTime = GETDATE() --Start the timer
--=============================================================================
-- Create and populate a Tally table
--=============================================================================
--===== Conditionally drop
IF OBJECT_ID('tempdb..#Tally') IS NOT NULL
DROP TABLE #Tally
CREATE TABLE #Tally(N INT, PRIMARY KEY(N))

--===== Insert (1000 x 100) number of rows
INSERT INTO #Tally(N)
SELECT ROW_NUMBER() OVER(ORDER BY a.number ASC)
FROM master.dbo.spt_values a
CROSS JOIN master.dbo.spt_values b
WHERE a.type = 'P' AND b.type = 'P' AND
a.number < 1000 AND b.number < 100

--===== Display the total duration
SELECT STR(DATEDIFF(ms,@StartTime,GETDATE())) + ' Milliseconds duration' [PK at table creation]


--=== Index creation after insert

DECLARE @StartTime DATETIME --Timer to measure total duration
SET @StartTime = GETDATE() --Start the timer

--=============================================================================
-- Create and populate a Tally table
--=============================================================================
--===== Conditionally drop
IF OBJECT_ID('tempdb..#Tally') IS NOT NULL
DROP TABLE #Tally
CREATE TABLE #Tally(N INT not null)

INSERT INTO #Tally(N)
SELECT ROW_NUMBER() OVER(ORDER BY a.number ASC)
FROM master.dbo.spt_values a
CROSS JOIN master.dbo.spt_values b
WHERE a.type = 'P' AND b.type = 'P' AND
a.number < 1000 AND b.number < 100

--===== Add a Primary Key
ALTER TABLE #Tally
ADD PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100

--===== Display the total duration
SELECT STR(DATEDIFF(ms,@StartTime,GETDATE())) + ' Milliseconds duration' [PK after table creation]

Thanks in advance.

- arjun
Post #968793
Posted Friday, August 13, 2010 6:22 AM


SSCrazy

SSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazy

Group: General Forum Members
Last Login: Wednesday, December 17, 2014 2:16 AM
Points: 2,840, Visits: 3,983
Arjun Sivadasan (8/13/2010)
1. Do I create PK before or after the insert (or if it depends, what are the factors)?
well it depends , sorting willbe done in either of cases, but in case of inserting data into heap then creating index give benefit.i think two operation ( insertion and index creation ) hapoen one by one ,results in less locking.but again , testing would give concrete answer for your query.


Arjun Sivadasan (8/13/2010)
2. The two queries attached have different execution times. From the execution plans, in case of primary key creation after insert, after insertion of data the following steps happen - Table scan-->Sort-->Index Insert. I had thought that data will only be reorganized in this case. What happens during the steps 'in bold' and the 'Index Insert' step in particular?
index page creation


Arjun Sivadasan (8/13/2010)
3. If I use table instead of temporary table, is the behavior similar?
it should.


-------Bhuvnesh----------
I work only to learn Sql Server...though my company pays me for getting their stuff done
Post #968843
Posted Friday, August 13, 2010 6:37 AM
SSCrazy

SSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazy

Group: General Forum Members
Last Login: Wednesday, December 3, 2014 8:17 AM
Points: 2,917, Visits: 2,537

Do I create PK before or after the insert (or if it depends, what are the factors)?

I would definitely create my indexes/PK's after the data is loaded. That way the statistics will be up to date and provide a better query plan. From what you said, your own testing has verified this. There is one issue which I have run into, and that is when the table is created and the indexes/statistics are created - If you need to delete and reload the data, then the queryplan can be way of. Also, I have run into an issue where deleting the data from a table and recompiling the stored procedures on that table, whether temp or permanent, the query plan has been way off.

If I use table instead of temporary table, is the behavior similar?

This one should be based upon your system. A Temp Table (#<tablename) is actually created in tempdb, and thus on the drive where your tempdb is located. If you have a very busy system, then there will definitely be an I/O issue as other processes will be competing for disk space and IO. You would have to test various configurations to determine the best. Optins could be to create a temp table, a permanent table, maybe a file group on another drive and place the table on that. These are just of the top of my head.

Good luck



Steve Jimmo
Sr DBA
“If we ever forget that we are One Nation Under God, then we will be a Nation gone under." - Ronald Reagan
Post #968855
Posted Friday, August 13, 2010 8:01 AM


Hall of Fame

Hall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of Fame

Group: General Forum Members
Last Login: Yesterday @ 1:17 PM
Points: 3,467, Visits: 1,840
One additional thing to consider is what is the order of your data? If the order of the data you are inserting very closly matches the clustered index then you will get much better performance than if it doesn't.

I wrote a piece of code a number of years back to upgrade between 2 versions of a piece of software. In the end for speed purposes we had to drop and save all of the indexes then re-create them at the end. This was not always the fastest way to move the data though so we had to put in a flag, drop indexes or not. The determining factor was of course size of the table. And unfortunatly it wasn't an exact number of rows, and we ended up using testing and experience to decide which way to go.

I think however that in your case (aprox half a mil) you are safe enough creating the index at the end. However, again you will get alot of bang for your buck if you insert the data in the order of the clustered index. Even realizing the cost of doing an ORDER BY on your query.


Kenneth Fisher
I strive to live in a world where a chicken can cross the road without being questioned about its motives.
--------------------------------------------------------------------------------
For better, quicker answers on T-SQL questions, click on the following...
http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better answers on performance questions, click on the following...
http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

Link to my Blog Post --> www.SQLStudies.com
Post #968949
Posted Friday, August 13, 2010 4:42 PM
SSC Veteran

SSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC Veteran

Group: General Forum Members
Last Login: Friday, December 31, 2010 9:46 AM
Points: 274, Visits: 473
If the clus index key is an IDENTITY column, you can create that beforehand; SQL "knows" the INSERTs will always be sequential.

You may be able to create the clus index beforehand if you make sure the data is sorted in clus key sequence when loaded. That is:

CREATE TABLE tablename ( ..., UNIQUE CLUSTERED (col1, col2) )
INSERT INTO tablename
SELECT col1, col2, ...
FROM ...
WHERE ...
ORDER BY col1, col2

You will have to test this to be sure.


Scott Pletcher, SQL Server MVP 2008-2010
Post #969312
Posted Friday, August 13, 2010 8:21 PM


SSCrazy Eights

SSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy Eights

Group: General Forum Members
Last Login: Yesterday @ 4:18 PM
Points: 9,928, Visits: 11,206
Hey Arjun,

There are a few important things here that determine the fastest method:

First, you must avoid an explicit sort operation if possible. In your first script (where the clustered index exists before the load) you are loading the result of a ROW_NUMBER ranking function into an INTEGER column. ROW_NUMBER returns BIGINT; the type mismatch is enough to confuse the optimiser in thinking it needs to sort the results before inserting into the table.

Second, try to achieve a minimally-logged load. In SQL Server 2005, an INSERT...SELECT statement is always fully logged. The minimally-logged alternative is SELECT...INTO. This does mean creating the clustered index as a second operation, but the benefit of minimal logging more than makes up for this overhead. Just to confuse matters, SQL Server 2008 can do a minimally-logged INSERT...SELECT in many circumstances.

Finally, try not to use master.dbo.spt_values - use a documented system table instead. The reason is that the internals of spt_values changed significantly in SQL Server 2008, making it a very slow way to grab rows.

The fastest method (for me) depends on which version of SQL Server I use. The lack of a minimally-logged INSERT in SQL Server 2005 means that SELECT...INTO followed by a clustered index build is faster. The improvements in SQL Server 2008 make INSERT...SELECT (with an existing clustered index) faster.

So, my 'improved' version of the first query (pre-existing index) is:

--
-- Load into an existing index
-- (fastest on 2008)
--

IF OBJECT_ID(N'tempdb..#Tally', N'U')
IS NOT NULL
BEGIN
DROP TABLE #Tally;
END;
GO
DECLARE @StartTime DATETIME;
SET @StartTime = GETDATE();

CREATE TABLE #Tally
(
N BIGINT NOT NULL PRIMARY KEY CLUSTERED
);

INSERT #Tally
(N)
SELECT TOP (10000)
ROW_NUMBER() OVER (ORDER BY (SELECT 1))
FROM master.sys.allocation_units A,
master.sys.allocation_units B;

SELECT [INSERT...SELECT] = DATEDIFF(MILLISECOND, @StartTime, GETDATE());

GO
DROP TABLE #Tally;


'Improved' version of the second query (load then index) is:
--
-- Load then create index
-- (fastest on 2005)
--

IF OBJECT_ID(N'tempdb..#Tally', N'U')
IS NOT NULL
BEGIN
DROP TABLE #Tally;
END;
GO
DECLARE @StartTime DATETIME;
SET @StartTime = GETDATE();


SELECT TOP (10000)
ISNULL(ROW_NUMBER() OVER(ORDER BY (SELECT 1)), -1) AS N
INTO #Tally
FROM master.sys.allocation_units A,
master.sys.allocation_units B;

ALTER TABLE #Tally
ADD PRIMARY KEY CLUSTERED (N)
WITH
FILLFACTOR = 100;

SELECT [SELECT...INTO] = DATEDIFF(MILLISECOND, @StartTime, GETDATE());
GO
DROP TABLE #Tally;

Typical run times on 2008:
Query 1: 13ms
Query 2: 23ms

Typical run times on 2005:
Query 1: 96ms
Query 2: 20ms

Paul




Paul White
SQL Server MVP
SQLblog.com
@SQL_Kiwi
Post #969333
Posted Saturday, August 14, 2010 12:57 AM


SSC-Addicted

SSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-Addicted

Group: General Forum Members
Last Login: Tuesday, December 2, 2014 11:59 PM
Points: 406, Visits: 780
Gentlemen, thanks for the answers. I have more questions. I will post them a little later. I am currently using my phone to browse. I would like to add that I found creating the primary key after the bulk insert slower in this case. I am sorry for the delay in my response. I will update soon.

- arjun
Post #969364
Posted Monday, August 16, 2010 12:22 AM


SSC-Addicted

SSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-Addicted

Group: General Forum Members
Last Login: Tuesday, December 2, 2014 11:59 PM
Points: 406, Visits: 780
Thanks a lot for the response Steve. I am sorry I took some time to post. Weekend Fever!!

sjimmo (8/13/2010)

I would definitely create my indexes/PK's after the data is loaded. That way the statistics will be up to date and provide a better query plan. From what you said, your own testing has verified this.


No, in the sample I had posted, index creation after insertion took more time but, I think there were other factors at play, as Paul has explained. Say, I create the index in the table definition. How will that affect statistics or query plan? If temporary tables are used, execution plan will be changed once the table has data, won't it be? (Ref: http://support.microsoft.com/kb/243586)


There is one issue which I have run into, and that is when the table is created and the indexes/statistics are created - If you need to delete and reload the data, then the queryplan can be way of. Also, I have run into an issue where deleting the data from a table and recompiling the stored procedures on that table, whether temp or permanent, the query plan has been way off.


By way off, do you mean the new plan yields less performance or that it is way different from the older plan?

- arjun
Post #969566
Posted Monday, August 16, 2010 12:27 AM


SSC-Addicted

SSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-Addicted

Group: General Forum Members
Last Login: Tuesday, December 2, 2014 11:59 PM
Points: 406, Visits: 780
Thanks Kenneth and Scott. I will reexamine my queries with this in mind.

-arjun
Post #969569
Posted Monday, August 16, 2010 12:47 AM


SSC-Addicted

SSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-Addicted

Group: General Forum Members
Last Login: Tuesday, December 2, 2014 11:59 PM
Points: 406, Visits: 780
Hi Paul, it's good to hear from you. And a great answer, thanks a lot.

Paul White NZ (8/13/2010)
First, you must avoid an explicit sort operation if possible. In your first script (where the clustered index exists before the load) you are loading the result of a ROW_NUMBER ranking function into an INTEGER column. ROW_NUMBER returns BIGINT; the type mismatch is enough to confuse the optimiser in thinking it needs to sort the results before inserting into the table.

Finally, try not to use master.dbo.spt_values - use a documented system table instead. The reason is that the internals of spt_values changed significantly in SQL Server 2008, making it a very slow way to grab rows.


I don't usually use undocumented stuff as they are bound to change or get dropped. I picked that spt_values part from another post. I was not aware that row_number returns bigint. It is really good to know that as I tend to use row_number function a lot.


Second, try to achieve a minimally-logged load. In SQL Server 2005, an INSERT...SELECT statement is always fully logged. The minimally-logged alternative is SELECT...INTO. This does mean creating the clustered index as a second operation, but the benefit of minimal logging more than makes up for this overhead. Just to confuse matters, SQL Server 2008 can do a minimally-logged INSERT...SELECT in many circumstances.

The fastest method (for me) depends on which version of SQL Server I use. The lack of a minimally-logged INSERT in SQL Server 2005 means that SELECT...INTO followed by a clustered index build is faster. The improvements in SQL Server 2008 make INSERT...SELECT (with an existing clustered index) faster.


I was not aware of these benefits of SELECT.. INTO. It's really good to know this.

I have another question for you Paul. Say, I have an SP which uses a temporary table. The number of rows in the temporary table depends on the SP parameters. So, for one set of parameters, the table has 100 rows and for another set, the table has 100, 000 rows. Will the execution plan be changed once the 100,000 rows are inserted? I've always assumed it to. Ref: http://support.microsoft.com/kb/243586

- arjun
Post #969575
« Prev Topic | Next Topic »

Add to briefcase 123»»»

Permissions Expand / Collapse